Well, as you can see, Klarth already does not like the idea of eliminating end tokens altogether, and I imagine it would meet additional resistance. However, he is in favor of eliminating artificial end tokens, and he also wants to keep formatting. I'm on the fence about it.
The case for eliminating artificial end tokens certainly seems stronger than that for eliminating end tokens or newlines. As an end user, I think I prefer keeping them in the table file; even though that may not be the most logical place for them, it does seem to be the most convenient place.
a) I agree with this. As a real world example, it might be interesting to consider a game like The Legend of Zelda, which uses different tokens for regular characters, characters ending the first line, characters ending the second line, and characters ending the string.
b) The utility interface for defining a line break control can be pretty much the same as for defining an end token. I definitely prefer the formatting flexibility of \n over the old * entries.
c) As far as implementation goes, stripping newlines can be done in approximately 2 lines of code. My objections are primarily philosophical and are entangled with other issues.
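For what it's worth, the two-line claim checks out. A minimal Python sketch (the function name is my own, not from any utility discussed here):

```python
def strip_newlines(script: str) -> str:
    """Discard newline formatting before tokenizing a script for insertion."""
    return script.replace("\r", "").replace("\n", "")

print(strip_newlines("HELLO\nWORLD\r\n"))  # HELLOWORLD
```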
d) I've never used \r myself. Is that a popular option? If you have some interface for defining line break controls and end tokens, the same thing should work for comment controls. Maybe something along these lines (clearly, user interface design is not my strong suit).
Or maybe you could use your GUI to layer extra effects on a per-token basis? That sounds very much like a utility-specific table file extension, though.
I'm not even going to begin to talk about the ideal process. That's a huge can of worms, since the ideal process depends on what platform and games you're working with. The table file escapes much of that, and yet look at all the disagreement it still generates.
All too true. Consider the question retracted.
The difference is, with DTE, there is direct token conversion in both directions. With the Kanji Array and most other data compression, you do not have that. An algorithm (even if simple) is required. As soon as you leave the realm of direct token conversion, I think you depart from the scope of the table file. That's a clear line to me. DTE is a static bidirectional map with no manipulation needed, the Kanji Array is not. Transformation of the hex must occur to get the map, right?
DTE is only direct for dumping, and only when we assume all tokens have the same hex length. If the hex lengths vary, you need an algorithm to decide when to stop reading bytes (technically you still need an algorithm even if all it does is say "read one byte"). When you're inserting, you need an algorithm to tokenize the input... oh, I guess you're talking about what happens after tokenization. In that case, yes, DTE is direct both ways. But if you've already tokenized, Kanji Arrays are also direct conversion. The only problem we're having is that some of the tokenization information (the location and type of table switch tokens) is missing, and deducing the exact nature of the missing bits is tricky. If we could rely on the input script having the correct table switch tokens in the correct places, we wouldn't have most of these problems. It's the uncertainty of tokenization at issue, not the mapping of tokens.
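To make the distinction concrete, here is a rough Python sketch (table contents invented for illustration): once a greedy longest-match tokenizer has done its work, the DTE mapping itself is nothing but a bidirectional dictionary lookup.

```python
# Hypothetical DTE table (hex values invented for illustration).
dte = {0x5A: "th", 0x5B: "e ", 0x41: "A", 0x42: "B"}
dte_inv = {v: k for k, v in dte.items()}  # the static bidirectional map

def tokenize(text: str, table) -> list:
    """Greedy longest-match tokenization: the algorithmic step insertion needs."""
    tokens, i = [], 0
    longest = max(map(len, table))
    while i < len(text):
        for length in range(min(longest, len(text) - i), 0, -1):
            if text[i:i + length] in table:
                tokens.append(text[i:i + length])
                i += length
                break
        else:
            raise ValueError(f"no token matches at position {i}")
    return tokens

tokens = tokenize("the B", dte_inv)    # ['th', 'e ', 'B']  <- the hard part
hexed = [dte_inv[t] for t in tokens]   # [0x5A, 0x5B, 0x42] <- direct conversion
```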
Look at what you did. You defined table2 with 82=c and 8280=~5~. I can't recall ever seeing a text engine that could operate that way. 0x82 would be reserved to indicate that the next character is a two-byte character.
Yes, that was a little sneaky of me. It is fully supported by the longest hex sequence rule for dumping, though. A quick peek at the wiki shows at least one real world example where this does in fact happen (on that note, people should submit more table files!).
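A sketch of how the longest-hex-sequence rule resolves the sneaky table2 above during dumping (the raw-literal syntax in the fallback branch is only a placeholder):

```python
table2 = {b"\x82": "c", b"\x82\x80": "~5~"}  # the entries from the example above

def dump(data: bytes, table) -> str:
    out, i = [], 0
    longest = max(map(len, table))
    while i < len(data):
        # Prefer the longest matching hex sequence, per the dumping rule.
        for length in range(min(longest, len(data) - i), 0, -1):
            if data[i:i + length] in table:
                out.append(table[data[i:i + length]])
                i += length
                break
        else:
            out.append(f"<${data[i]:02X}>")  # unmatched byte: raw hex literal
            i += 1
    return "".join(out)

print(dump(b"\x82\x80\x82", table2))  # ~5~c
```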
Offhand, I believe S-JIS and UTF-8 work in a similar manner as well.
Not sure about S-JIS, but UTF-8 was actually designed to avoid this very weakness (and a few others). The same cannot be said of character encodings used in video games.
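The design property being referred to is easy to check: UTF-8 assigns disjoint byte ranges to ASCII, lead bytes, and continuation bytes, so a valid one-byte character can never double as the start of a multibyte one. A quick Python sanity check (the helper name is mine):

```python
def utf8_role(b: int) -> str:
    if b < 0x80:
        return "ascii"         # 0xxxxxxx: a complete one-byte character
    if b < 0xC0:
        return "continuation"  # 10xxxxxx: never starts a character
    return "lead"              # 11xxxxxx: always starts a multibyte character

# An ASCII byte can never begin a multibyte sequence...
assert utf8_role("A".encode("utf-8")[0]) == "ascii"
# ...and the trailing bytes of a multibyte character are never valid on their own.
for b in "é".encode("utf-8")[1:]:
    assert utf8_role(b) == "continuation"
```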
If your game processes characters that are one OR two bytes, and the next byte is valid both as a one-byte character and as the start of a two-byte character, that's an ambiguous situation for the text engine unless it arbitrarily declares one interpretation the default.
Which is exactly what the table file standard does.
On top of all that, again, the user specifically plopped that raw value in there in the midst of a table switch to cause it. This is another case of the 'gotcha table': a theoretical condition so rare that I can't say it's worth any time to handle.
Raw hex literals are (potentially) problematic anywhere in a game using multibyte tokens, not just at table switch boundaries.
What would you propose to do about it anyway? I don't think you'll find much support on the issue from the others.
That's what I was asking! Do they trigger fallback? Do they count as a match towards NumberOfTableMatches? What happens if they combine with the preceding or following hex to mutate into a completely different token? How can you write code to handle multi-table insertion and not consider these cases?
You can try to iron out theoretical edge cases all day, spend a wealth of resources on them in your program, and continue to put it out of reach of more and more people; or you can start adding some practical limits, keep it simple, and get the product out the door. I think that's kind of where several of us are coming from now.
Being 100% standard compliant means handling the theoretical edge cases. All of them. I haven't introduced anything in any of my examples that wasn't already allowed by the standard, and I haven't even raised all of the issues I'm currently aware of. If you want practical limits, I think either you have to adjust the standard or you can't claim compliance. And since you can't advocate a standard you don't want to comply with, I guess that means adjusting the standard.
One more try: in addition to the raw hex literal, let's also ignore all the multibyte entries and all the multicharacter entries. That leaves us with:
Now instead of "ab~5~bc", your hex represents "abcabc", which is still not the same as "abcABC". In this case, the only hex that works is "C1 80 C1 81 C1 82 80 81 82". If we are going to drop support for counts other than 0 and 1, what would you think about also dropping support for one table linking to another table using tokens with both counts? In this case, that would mean the combination in @table1 of "!C0=table2,0" and "!C1=table2,1" would be an error.
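For reference, here is roughly what count-1 switch semantics look like on the dumping side. The table contents below are my own invention, not the ones from the example, and the count-0 "stay until fallback" case is omitted entirely:

```python
# Hypothetical tables (entries invented for illustration).
table1 = {0x80: "A", 0x81: "B", 0x82: "C"}
table2 = {0x80: "a", 0x81: "b", 0x82: "c"}
switches = {0xC1: (table2, 1)}  # like "!C1=table2,1": one match, then return

def dump_counted(data: bytes) -> str:
    out = []
    active, remaining = table1, None  # None: no match count in effect
    for b in data:
        if active is table1 and b in switches:
            active, remaining = switches[b]
            continue
        out.append(active[b])
        if remaining is not None:
            remaining -= 1
            if remaining == 0:  # count exhausted: return to table1
                active, remaining = table1, None
    return "".join(out)

print(dump_counted(bytes([0xC1, 0x80, 0xC1, 0x81, 0xC1, 0x82, 0x80, 0x81, 0x82])))  # abcABC
```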
On the topic of falling back... how exactly does the last Dakuten example in 3.2 work?
Good catch. I remember writing those examples. I think I originally wrote it with an additional switch entry in Table2 of '!7F=NORMAL,0'. That should take care of it (along with revising the logic paragraph). I remember having (in hindsight) a brain malfunction when I read it over and thought I didn't need it and it would just fall back. I rewrote it after that.
I thought about that too, but it does cause the dumper's table stack to become desynchronized from the game's stack. The next time the dumper hits an unrecognized entry in @NORMAL, it would fall back to @Dakuten (which might contain a hit) instead of dumping a hex literal. Worse yet would be if @NORMAL weren't the bottom of the stack, and the game actually fell back to some other table entirely while the dumper stayed in @Dakuten. But I guess that's just another theoretical edge case waiting to be ignored.