Do you do anything else of interest that requires the string in list of token form?
My primary motivation for maintaining token lists rather than the translated text/hex strings was a couple of edge cases in post-processing dump output. Even given their start and end addresses, detecting exactly where two strings overlap can only really be done at the hex level; in general, it cannot be determined from the translated text (at least not without running through the table translation/tokenization process again). Here's an extreme example: suppose pointer A points to the hex sequence "00 01 02 03 04", and we parse that according to table X as "0001 0203 04". Now suppose pointer B overlaps with pointer A's hex range, but starts one byte later, in the middle of what was a multibyte token for pointer A, and we parse it (also according to table X) as "0102 03 04". Suppose further that pointer C also overlaps with pointer A, starting at the same byte but according to table Y this time, and we parse it as "00 01 02 03" (no "04", since I've decided that "03" is an end token in my table Y).
Now, you may call that crazy, and I would agree with you, but it's still valid input to the utility, so I prefer to handle it rather than ignore it. I found that token lists allowed me to more elegantly handle cases like these, and since the translated strings are easily recoverable from a token list, using token lists involved no loss in functionality. There are a couple of other places where I take advantage of having access to the intermediate tokenization whenever I want it, but that's mostly about program design rather than functionality that couldn't be easily achieved otherwise.
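To make the overlap example concrete, here is a minimal sketch of the three parses. The table contents, the greedy longest-match rule, and the `tokenize` helper are assumptions for illustration only; the translated text of each entry is left out because only the hex boundaries matter here.

```python
# Hypothetical tables: just the hex sequences each table knows about.
TABLE_X = {bytes.fromhex(h) for h in ("0001", "0203", "0102", "03", "04")}
TABLE_Y = {bytes.fromhex(h) for h in ("00", "01", "02", "03")}

def tokenize(data, start, table, end_tokens=()):
    """Greedy longest-match tokenization; returns [(offset, token_bytes), ...]."""
    tokens, pos = [], start
    longest = max(len(entry) for entry in table)
    while pos < len(data):
        for size in range(longest, 0, -1):
            chunk = data[pos:pos + size]
            if len(chunk) == size and chunk in table:
                break
        else:
            raise ValueError(f"no table entry matches at offset {pos}")
        tokens.append((pos, chunk))
        pos += size
        if chunk in end_tokens:
            break
    return tokens

data = bytes.fromhex("0001020304")
a = tokenize(data, 0, TABLE_X)                                   # pointer A
b = tokenize(data, 1, TABLE_X)                                   # pointer B
c = tokenize(data, 0, TABLE_Y, end_tokens={bytes.fromhex("03")}) # pointer C

for name, tokens in (("A", a), ("B", b), ("C", c)):
    print(name, " ".join(tok.hex() for _, tok in tokens))

# Pointer B starts at offset 1, which is not a token boundary for pointer A.
a_boundaries = {offset for offset, _ in a}
print(1 in a_boundaries)
```

Because each token keeps its offset, the point where B cuts into one of A's multibyte tokens can be read straight off the token lists, without re-running the table translation.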
Concept is clear, but I'm a little unclear on the mechanics of this.
Tauwasser pretty much covered my response to this. I'll add that the idea of replacing parameters by their corresponding regular expressions also takes care of cases where the insertion script might appear to be ambiguous (e.g. <window x=45, y=10> in the script and $FE=<window x=%X, y=10> in the table).
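As a rough sketch of the "replace parameters by regular expressions" idea, the snippet below turns a table entry into a regex and matches it against script text. The value syntax for each placeholder (bare two-digit hex and so on) and the `entry_to_regex` helper are assumptions for illustration, not something the standard has settled.

```python
import re

# Assumed value syntax for each placeholder; the thread does not fix these.
PARAM_PATTERNS = {
    "%X": r"([0-9A-Fa-f]{2})",  # two hex digits, no prefix
    "%D": r"([0-9]{1,3})",      # one to three decimal digits
    "%B": r"([01]{8})",         # eight binary digits
    "%%": "%",                  # literal percent sign
}

def entry_to_regex(entry_text):
    """Turn a control-code entry into a regex that matches script text."""
    # re.split with a capture group keeps the placeholders in the result.
    parts = re.split(r"(%[XDB%])", entry_text)
    return re.compile("".join(PARAM_PATTERNS.get(p, re.escape(p)) for p in parts))

pattern = entry_to_regex("<window x=%X, y=10>")
match = pattern.fullmatch("<window x=45, y=10>")

# The literal "y=10" is matched as plain text; only x is a parameter.
encoded = bytes([0xFE] + [int(value, 16) for value in match.groups()])
print(encoded.hex())  # fe45
```

A script line such as `<window x=45, y=11>` would not match this entry at all, since `y=10` is literal text in the table.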
I think that treads into "undefined behavior of the text engine" territory.
Fair enough. What I'm trying to get at is where the extra power for these kinds of tokens needs to live, i.e. whether the fallback condition is determined by the opening or the closing byte. It sounds like it belongs on the closing byte, which probably makes our lives a little easier.
Don't you think it's a bad idea (and irresponsible) to release a standard without having come up with any known implementation of it?
Depends how I'm feeling. You raise many valid points here. On the one hand, this entire project is basically your one-man mission to bring order to chaos and make the world a better place, so I appreciate the personal aspect of it. On the other hand, it can also be viewed as a community project, and as such, ideally it shouldn't rely too heavily on any one person. I dunno. There are a lot of people on the internet with a lot of free time on their hands, and some of them are pretty smart. Does anyone have any connections with Japanese romhacking groups? They must have figured out a reasonable way to handle all this, right? If so, why re-invent the wheel?
I think I mentioned before, that all games that use variable length parameters (or variable number of parameters) would be unsupported due to limitations in linked values. That bit of knowledge should be put into the standard if it isn't.
Agreed. So far what I'm hearing is that nobody knows of any game that actually uses this. My suspicion is that any such game would use different hex representations for the different functions, just like C++ uses different code blocks for similar methods with different full signatures, or assembly languages use different opcodes for similar operations with different numbers or lengths of operands.
They would be required to start and end with [] or <> and in turn those characters would not be allowed in normal entries. I'd probably choose [].
Going with [] instead of <> would also make things cleaner for XML-based scripts/utilities.
I wonder if that would work well with more complex control codes.
I guess I just don't like the redundancy that is introduced when mapping the ids in arguments to lengthy commands.
It gets kind of annoying, yeah. There are a bunch of ways the standard could be extended to make this work, though... One way would be if we allowed control codes to include matches from another table. Then you could have something like this:
@table1
$F0=<portrait: %name %expression>
@name
00=Carl
01=Ricky
02=Forrest
@expression
00=Happy
01=Grumpy
02=Sleepy
and the hex "F00201" would output "<portrait: Forrest Grumpy>". If we were to do such a thing, I think we'd want to restrict these kinds of lookup tables to contain only normal entries. Thoughts?
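A minimal sketch of how a dumper might resolve such nested lookups, assuming each `%tablename` reference consumes exactly one byte and that lookup tables hold only normal single-byte entries. The `TABLES` layout and the `dump` function are hypothetical, not part of the standard.

```python
import re

# Hypothetical in-memory form of the three tables above.
TABLES = {
    "table1": {0xF0: "<portrait: %name %expression>"},
    "name": {0x00: "Carl", 0x01: "Ricky", 0x02: "Forrest"},
    "expression": {0x00: "Happy", 0x01: "Grumpy", 0x02: "Sleepy"},
}

def dump(data, start="table1"):
    """Translate bytes, filling each %table reference from the next byte."""
    out, pos = [], 0
    while pos < len(data):
        template = TABLES[start][data[pos]]
        pos += 1
        # Splitting on a capture group alternates literal text and table names.
        pieces = re.split(r"%(\w+)", template)
        for index, piece in enumerate(pieces):
            if index % 2 == 0:
                out.append(piece)                     # literal text
            else:
                out.append(TABLES[piece][data[pos]])  # lookup in named table
                pos += 1
    return "".join(out)

print(dump(bytes.fromhex("F00201")))  # <portrait: Forrest Grumpy>
```

Restricting lookup tables to normal entries keeps this loop flat: a lookup can never trigger another lookup or a table switch.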
Another thing I just realized, what about common text formatting controls? While no two game engines are alike, surely most of them have the usual formatting features, such as changing the text color and what not. Should these commonly occurring features be standardized?
Possibly, but I think that's a separate issue. From the table file standard's perspective, these would probably show up as control codes, so you could call them whatever you wanted, e.g. "$CD=<red>".
Sorry for my long absence, but real life grabbed me by the neck...
Happens to us all from time to time, though hopefully not too literally!
Table matches with hex literals
I think we should count hex literals as matches for the current table in the event the table produced (or would have produced) the literal: That is to say,
- an unlimited table would have counted an unknown byte as a fallback case and fallen back to the table below it,
- If the stack is empty, or the table is not unlimited, the hex literal counts towards it in insertion direction.
This behavior mirrors dumping. It can produce cases that are not correct for the game's display mechanism, but since this case is ambiguous, either choice can. This would extend to table matches other than infinite and one match(es).
Assuming that table entries cannot produce output that looks like a hex literal, the only way for an unassociated literal to appear in dumper output is when the stack is empty. By that logic, mirroring dumping behaviour would mean that when a hex literal is encountered during insertion, we should immediately clear the stack, insert the hex literal, and resume tokenization with the starting table.
I'm in favor of restricting the current release of the standard and extending the syntax upon finding a suitable solution
Since it's been on my mind recently, there's also the approach the W3C takes, where a standard is released as a "Working Draft" and only makes its way up to a full "Recommendation" later, generally after many years and multiple successful implementations. I'm not saying I'm encouraging this approach, but a toned-down version might merit some consideration.
Control Codes
...
For that matter, it seems we should only allow one-byte arguments to avoid endianness issues and we can simplify the arguments to %D, %X, %B for decimal, hexadecimal and binary respectively.
I agree with this, but think we should also include %% for literal %.
Notice that we do need to keep identifiers in the output, contrary to what was suggested. If we do not do that, we are open to the following exploits:
!7E=<window x=%X y=%X>
!7F=<window x=%D y=%D>
How do you parse <window x=10 y=10> without additional context information? You can't. Also, we might want to include signed decimal and signed hex.
Assuming that control codes are checked for uniqueness, and that control code parameters are handled correctly during that check, including !7E and !7F in the same table would make that table invalid. Contrary to what I said earlier, however, correctly handling control code parameters during uniqueness checks does not mean simply erasing them. If that were all we did, this would be an exploit:
!7E=<window x=%X y=%X>
!7F=<window x=%D y=10>
So in addition to erasing the parameters, we should also check for values that they can take on. Fortunately, these two checks can be combined in one expression (add/remove backslashes as appropriate for your expression engine):
- %D → \|[0-9]\{1,3\}
- %X → \|\$[0-9A-Fa-f]\{2\}
- %B → \|%[01]\{8\}
As long as we have uniqueness modulo parameter values, we should always be able to determine the correct encoding. That said, including identifiers in the output is certainly another valid way of addressing the problem. My preference for not including them is based purely on aesthetics.
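Here is a sketch of one reading of that combined check: each placeholder becomes "nothing, or a value", and the resulting expression is tested against the other entry with its placeholders erased. The helper names and the two value conventions (`BARE`, `PREFIXED`) are assumptions for illustration, `%%` is not handled, and this is a heuristic that can miss some collisions rather than a complete test.

```python
import re

PLACEHOLDER = re.compile(r"%[DXB]")

def optional_value_regex(entry, values):
    """Replace each placeholder with '(?:value)?', i.e. nothing or a value."""
    parts = re.split(r"(%[DXB])", entry)
    return re.compile("".join(
        f"(?:{values[p]})?" if p in values else re.escape(p) for p in parts))

def may_collide(a, b, values):
    """True if either entry, with parameters erased, fits the other's pattern."""
    erased_a = PLACEHOLDER.sub("", a)
    erased_b = PLACEHOLDER.sub("", b)
    return bool(optional_value_regex(a, values).fullmatch(erased_b)
                or optional_value_regex(b, values).fullmatch(erased_a))

# Two assumed conventions for how parameter values appear in the script.
BARE = {"%D": r"[0-9]{1,3}", "%X": r"[0-9A-Fa-f]{2}", "%B": r"[01]{8}"}
PREFIXED = {"%D": r"[0-9]{1,3}", "%X": r"\$[0-9A-Fa-f]{2}", "%B": r"%[01]{8}"}

e7e = "<window x=%X y=%X>"
e7f = "<window x=%D y=10>"
print(may_collide(e7e, e7f, BARE))      # True
print(may_collide(e7e, e7f, PREFIXED))  # False
```

With bare hex values the pair is flagged, because the literal `10` in !7F is also a valid value for the second %X in !7E. With `$`-prefixed hex values it is not, since `10` can no longer be read as a hex parameter.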