Defaulting to 16-bit sizes is what I did in my original assembler.

Always ask yourself, "Am I doing this because it is the easy path? Or because it is the courageous path?"
I value the courageous choices in life!
In addition to that, to solve the #if problem, I had a requirement that evaluations passed to directives had to be immediately resolvable or else you get an error.
Hmm. Same question to ask yourself.
This also skirted around other problems that arose, like setting DirectPage to a label that was not defined yet. But I'm worried that might be too restrictive.
I think you are touching on one of the serious questions to be asking yourself. If an "ASSUME DP" directive (whatever you name it) is set to an external symbol, you have no choice but to let the linker locate that symbol (or resolve the expression containing external symbols) before you can know the exact quantity. And without the exact quantity, you cannot resolve the DP references in instructions (like LDA) that refer to a symbol which, by definition, is to be taken as relative to DP. (Or, if you automatically choose between two different LDAs, you cannot find out which one you need.) Of course, you didn't even mention an external here, just a "label that was not defined yet," which might be one that is in the same source file but found later (so you would need to wait until pass 2). My example couldn't even be resolved in pass 2, only at link time. Which is yet another question to ask.
I want this to "just work" without the programmer having to manually select the size of every instruction and/or worry about whether or not the assembler is choosing the optimal sizes.
This whole area is something that can easily fall into arguments over "matters of style." More experienced programmers can argue this question into any corner you want and make it stick pretty well.
Should the assembler just do what you say and let the programmer make all the decisions, emitting errors to help guide the hand of the programmer?
Should the assembler do a fair job of "counting bytes" and "figuring out offsets" for the programmer, freeing the programmer from having to worry about such details?
But as for the original question I was asking, about informing the assembler of the DP setting, things may get interesting.
Suppose you support structs. Suppose the assembly programmer has a subroutine they want to write that assumes that the DP points somewhere useful before being called. But it doesn't know (or care) exactly where. Instead, as it turns out, this subroutine does something interesting and fun with a palette structure. However, there are a dozen different palettes in use. This subroutine doesn't care which one is in use. It only cares that the DP register points to the palette structure you want it to examine before you call it. (This could be a palette structure or it could be an NPC structure or it could be a saved-game structure. It doesn't matter. Make up something you consider worthy of the example here.) Now, the assembler needs to be told the type of the structure that DP points at. But the assembler doesn't need to know the exact address contained in DP. Just that whatever the DP is pointing at happens to be something of this type. Now, DP-relative LDA instructions should be able to be generated by the assembler just fine without any need for fix-ups during link-time because the assembler knows all it needs to know in order to correctly generate the DP-relative LDA instruction.
So while the assembler may need to know what kind of thing the DP points at, it doesn't actually need to know the absolute value of the DP. On the other hand, one might actually want to tell the assembler the absolute address of the DP and tell it nothing at all about the type of the data items that live there.
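To make the "type but not address" case concrete, here is a minimal sketch in Python. Everything in it is made up for illustration: the helper names `layout` and `encode_lda_dp`, and the palette fields and their sizes. The only hard fact used is that $A5 is the 65816 opcode for LDA direct page. It is not how any particular assembler does it.

```python
# Hypothetical sketch: layout() and encode_lda_dp() are invented names,
# and the palette fields below are placeholders.

LDA_DP = 0xA5  # 65816 opcode for LDA with a direct-page operand


def layout(fields):
    """Map each (name, size_in_bytes) field to its byte offset from the struct start."""
    offsets = {}
    offset = 0
    for name, size in fields:
        offsets[name] = offset
        offset += size
    return offsets


def encode_lda_dp(struct_offsets, field):
    """Encode a DP-relative LDA of a field; the absolute value of DP is never consulted."""
    offset = struct_offsets[field]
    if offset > 0xFF:
        raise ValueError(f"{field} is at ${offset:X}, outside the 256-byte direct page window")
    return bytes([LDA_DP, offset])


# The assembler is told only the type that DP points at, not where DP is.
palette = layout([("colors", 16), ("flags", 2), ("field1", 2)])
code = encode_lda_dp(palette, "field1")  # field1 lands at offset $12
```

Nothing here needs a link-time fix-up: adding a field to the structure changes the offsets the next time you assemble, and no instruction has to be edited by hand.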
Should one be able to override all this in the instruction operand itself? Should I be able to say to the assembler,
LDA ((struct X *) DP)->field1
And then have it figure out that field1 is at offset $12 relative to DP?
I don't know. You tell me?
I really don't like to waste my time sitting down with a piece of paper, working out DP-relative offsets. Worse, this in effect hard-codes these deltas. If I later decide to move the DP base somewhere else, or to modify a structure there and add some more fields, then I'm running around either modifying a lot of instructions that I should never have had to bother with, or else finding my long list of EQU/= symbols and hacking that thing into shape so that the offsets are correctly stated again. This is seriously bad. I really think the assembler needs to have some information about where DP is established, and that the programmer should let the assembler figure out the offsets. The assembler is really good at bookkeeping details like that. The programmer isn't so good.
Let the assembler do what it is good at doing.
But once you open that door and walk through it, where do you stop? Frankly, I see the need for a very good, high-quality expression/operand analyzer.
I was going to say... I'd argue that shouldn't be an assembler feature and SHOULD be done with macros.
I anticipated that.
On a side note: BRL is the dumbest and most useless instruction on the 65816. It's basically just JMP, but takes an extra cycle. I guess maybe it's useful for relocatable or self-modifying code? Whatever.
Yeah. Mostly for PIC. You might want that if you are transferring code into RAM for execution. No, I don't know why. In the MSP430 from TI, you may need to do that if you are modifying your flash because the RAM still works fine when the flash is being written. So there, you'd need something like that. For the 65816 in the SNES? I don't know. Maybe I'll think about making something up that sounds really important.

Anyway I have an idea for how this could be accomplished, but it basically amounts to the linker doing a lot of the same work the assembler would have to do. Like, to the point where they might as well be the same executable just with different commandline args.
That would be bad. You don't want to bury knowledge in two different places. This is why I let slip the idea of having the assembler pass along a list of options for the linker to consider, done up in such a way that the linker doesn't actually need to know what it is doing. Just a thought for now.
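Roughly what I have in mind, as a Python sketch. The record format, the names `LDA_OPTIONS` and `choose_encoding`, and the value ranges are all assumptions I am making up for the example. The opcodes are the real 65816 ones for LDA direct page ($A5), absolute ($AD), and absolute long ($AF). The sketch also ignores the hard part, which is that picking a different size shifts every address after it.

```python
# Hypothetical sketch: the assembler emits a table of candidate encodings and
# the linker picks one without knowing anything about the instruction set.

# Assembler side: candidates for "LDA sym", shortest first.
# Each entry is (lowest value, highest value, opcode, operand size in bytes).
LDA_OPTIONS = [
    (0x000000, 0x0000FF, 0xA5, 1),  # direct page
    (0x000000, 0x00FFFF, 0xAD, 2),  # absolute
    (0x000000, 0xFFFFFF, 0xAF, 3),  # absolute long
]


def choose_encoding(candidates, value):
    """Linker side: take the first candidate whose range holds the resolved value."""
    for lowest, highest, opcode, operand_bytes in candidates:
        if lowest <= value <= highest:
            return bytes([opcode]) + value.to_bytes(operand_bytes, "little")
    raise ValueError(f"no candidate encoding can hold ${value:X}")


short_form = choose_encoding(LDA_OPTIONS, 0x12)
long_form = choose_encoding(LDA_OPTIONS, 0x7E1234)
```

The point is that the linker only compares numbers and copies bytes. All the knowledge about which LDA forms exist stays in the assembler, in one place.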
Here's a question:
- Is it unreasonable to expect every source file to have an ORG before any binary output?
Of course it is. Relocatable code doesn't use ORG. In the Merlin32 system I modified a few weeks back, the ORG can occur either in the linker file OR in the source code OR in both. But if it is in the linker file and not in the assembly source code, then the source code is moved to the location indicated in the linker file. If the source includes an ORG (or more than one) then the linker file ORG merely sets a barrier so that all ORGs in the source file must be at or after that address. But it otherwise doesn't restrict the use.
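The ORG rule I just described can be written down in a few lines. This is a Python sketch of the behavior as I stated it, not Merlin32's actual code; the function name `effective_orgs` and the use of `None` for "no ORG in the linker file" are my own inventions.

```python
# Hypothetical sketch of the linker-file ORG versus source ORG rule described above.

def effective_orgs(linker_org, source_orgs):
    """Return the addresses where code is placed.

    linker_org  -- ORG from the linker file, or None if it has none
    source_orgs -- list of ORG addresses found in the assembly source
    """
    if linker_org is None:
        # Only the source decides; an empty list means plain relocatable code.
        return list(source_orgs)
    if not source_orgs:
        # No ORG in the source: the code is moved to the linker's address.
        return [linker_org]
    # Both present: the linker ORG is only a lower barrier.
    for org in source_orgs:
        if org < linker_org:
            raise ValueError(f"source ORG ${org:X} is below linker ORG ${linker_org:X}")
    return list(source_orgs)
```

So `effective_orgs(0x8000, [])` places the code at $8000, while `effective_orgs(0x8000, [0x7000])` is rejected because the source ORG falls below the barrier.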
I would think this would be a safe assumption, but I can see someone creating library-like files that don't care where they're put. But would those be assembled directly or would they be #included into another source file?
I don't think it is safe. Here's why.
When I'm hacking ROM code, one of the first things I do is get rid of certain subroutines, replacing them with others I place in some 0xFF region I believe isn't used for anything. Doing so leaves "holes" in the code. I mark these holes for later use by other, shorter routines I might later write.
Suppose a 4Mb ExHiROM. Suppose I know that everything at the tail end, from $F40000 to $FFFFFF, is safe to use for new code. Suppose I tear out subroutines OLDX, OLDY, and OLDZ, located at $C40000 to $C400E7, $C41320 to $C41410, and $C72010 to $C72251, respectively. My new replacement routines for these three functions will be located somewhere in the $F40000 to $FFFFFF region, but I really don't care exactly where. It doesn't matter. But I do have to keep the address of OLDX at $C40000, OLDY at $C41320, and OLDZ at $C72010, because the rest of the ROM expects them to stay there and I don't intend to waste tons of time tracking down all of the calls to these functions. So I insert a small snippet of code (perhaps just a JML) at the beginning of each. This leaves me with three useful holes in the code that I may use for something else (a moved data table, an additional data table, other short subroutines I write, etc.)
When I write my patches, I want to specify the start of the patch areas (in this case, there are four patch areas: OLDX, OLDY, OLDZ, and NEWCODE) to the linker. But the assembler shouldn't care, at all. So far as the assembler knows, I have four named code segments that are, each of them, fully relocatable. The assembler should not need to know anything about their location in memory. (Aside from the rule that the linker won't locate a named code segment so that it sprawls across a bank boundary.) Only the linker knows that.
So my oldx_seg has a JML followed by some small subroutine or two; my oldy_seg has another JML plus some additional small subroutines; my oldz_seg has yet another JML followed by still more personal subroutines; and newcode_seg has the replacement code for OLDX, OLDY, and OLDZ plus a bunch more library routines, tables, data, and other stuff I couldn't fit into the earlier, tiny holes.
Why should the assembler care about an ORG here?
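Here is the linker's side of that example as a small Python sketch. The patch-area table and the `place` function are hypothetical; the address ranges are the ones from my example above, and the segment sizes in the usage note are invented. The assembler never sees any of these addresses.

```python
# Hypothetical sketch: only the linker knows where each named segment goes.

# Patch areas as (first address, last address), taken from the example above.
PATCH_AREAS = {
    "oldx_seg": (0xC40000, 0xC400E7),
    "oldy_seg": (0xC41320, 0xC41410),
    "oldz_seg": (0xC72010, 0xC72251),
    "newcode_seg": (0xF40000, 0xFFFFFF),
}


def place(name, size):
    """Place a relocatable segment of `size` bytes at the start of its patch area."""
    start, end = PATCH_AREAS[name]
    last = start + size - 1
    if last > end:
        raise ValueError(f"{name} needs {size} bytes but its hole ends at ${end:X}")
    if (start >> 16) != (last >> 16):
        # The one rule the assembler may rely on: no segment crosses a bank boundary.
        raise ValueError(f"{name} would sprawl across a bank boundary")
    return start


oldx_base = place("oldx_seg", 0x40)        # a JML plus a small subroutine or two
newcode_base = place("newcode_seg", 0x2000)
```

If oldx_seg grows past the $E8 bytes available in its hole, the linker complains, and I move a routine into newcode_seg without touching a single ORG in the source.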