Well, it looks like I'm super late to the party, but in the fine tradition of this particular thread, I do come bearing a wall of text
(actually more than one wall, since it seems there's a 15000 character limit on post length). Recently I've been working on updating my extraction/insertion utility based on updates to the Table File Standard (TFS) between June 2011 and March 2012 (that's still the current version, right?). After going through the March 2012 document in detail, I've compiled a list of things in the TFS that I believe are incorrect or unclear. I'll mostly skip over spelling, grammatical, and formatting issues except where they affect understandability of the TFS. Much of this is going to be nit-picking details, so I apologize in advance if the "constructive" gets buried under the "criticism"; I'm just trying to help! As always, thanks for all your hard work
: Maybe say ".tbl file extension" instead of "TBL file extension"? Are we actually requiring table files to follow any particular naming conventions?2.1
: I'm assuming "text-to-text" is a typo and should actually say "text-to-hex".18.104.22.168/2.2.2
: The TFS is pretty clear about only supporting "Longest Prefix" for hex-to-text translation, but there are many different algorithms available for text-to-hex translation. My understanding is that what we're concerned about here isn't really TFS compliance (since that's covered by 22.214.171.124), but letting the end user know what behaviour they should expect from text-to-hex translation, and I think these aren't the right words to use for doing that. As the end user of an insertion utility that implements the TFS, the things I would be most concerned about in terms of text-to-hex translation are knowing whether the inserted bytes are going to be correctly interpreted by the game's text engine as the text I want to see displayed and knowing whether as few bytes will be inserted as possible to achieve that correctness.
With that in mind, how about replacing this section of the TFS with requiring some sort of documentation on the text-to-hex translation algorithm(s) that the utility implements, preferably including correctness and optimality conditions? It doesn't have to be a novel, but seeing something in the utility's documentation like this:
This utility implements a Longest Prefix insertion algorithm, which guarantees correct text-to-hex translation based on the provided table files as long as the following conditions are satisfied:
- all table entries are contained in a single table; and
- no table entry's hex sequence is a prefix of any other table entry's hex sequence; and
- for each character used in normal table entries, a table entry exists which maps some hex sequence to that single character;
and at least one of the following conditions is satisfied:
- the text to be translated does not contain raw hex bytes; or
- the hex sequence of every table entry represents a single byte.
It also guarantees the smallest possible hex length of any correct text-to-hex translation as long as the following additional conditions are satisfied:
- the hex sequence of every table entry is the same length; and
- the text sequence of every normal table entry is no more than 2 characters long.
This utility implements an A* path-finding insertion algorithm, which guarantees correct text-to-hex translation based on the provided table files and guarantees the smallest possible hex length of any correct text-to-hex translation.
would be pretty great, right?
I think it would also be useful for utility authors to note how their utility handles situations not fully defined by the standard (we'll see a few examples of those below; maybe it would be worth adding a section about theoretically possible scenarios which have never been observed "in the wild"?).126.96.36.199
: Another approach would be to let utilities list which parts are implemented and which parts are not, which would, for example, let people claim partial compliance for otherwise useful utilities that handle everything except multi-table insertion or that need to break a bit of the defined behaviour in order to deal with some weird game's crazy coding (e.g. game X doesn't actually use a Longest Prefix text engine, game Y doesn't exhibit stack behaviour for mid-string encoding changes, etc.). As long as the utility still respects the syntax of valid table files, there's probably not too much harm in an approach like that. And as long as the game an end user is working on doesn't require the missing features, the end user will probably be just as happy either way.2.5.Label.1/2.5.Label.3
: Since there's a new Star Wars movie coming out next week, I'll misquote Yoda: "must" or "must not"; there is no "should" :p.2.5.Label.3
: There appears to be some confusion here about how a Label is defined. If Labels can only contain the characters [0-9A-Za-z], then the Label itself is only the text between '[' and ']' and basically every other statement the TFS makes about Labels contradicts the examples which use Labels.2.5.1
: There's nothing here that actually specifies how to represent "$hexadecimal sequence=[label],parameter1,parameter2" in text. From the examples, it seems that the rule is something like converting commas to spaces in the text sequence, replacing placeholders with their values (while being careful with parameter text like "%%D"), moving the ] from the end of [label] to the end of the text sequence, and then moving any \n to the end of the text sequence and replacing each \n with a newline, but that rule is never stated and the examples aren't quite consistent with it:2.5.1.Example 1
: Where does the \n belong in the text sequence? 2.5.Label.5 and part of 2.6 suggest that the newlines should go between the Label and its following ']', resulting in a text sequence of "[keypress\n]" instead of "[keypress]\n".2.5.1.Example 2
: Where did that '$' come from? There's nothing in the TFS which indicates %X placeholder values can or must be prefixed with '$'.
Also, when replacing placeholders with their numeric values, the TFS should address the issue of leading zeroes. We're explicit about the hex sequence part of a table entry being 2 characters per byte, which implies displaying leading zeroes there, but I don't see anything that enforces that for %B, %D, or %X placeholder values. We should also be explicit about the expected behaviour here. How about making leading zeroes mandatory for %B and %X and optional for %D?2.5.2.Example 1
: The explanation here is not strictly correct: if 0xFF is encountered as part of another token (e.g. 0xFF13 or 0xDEFF), we don't output "[END]\n\n", since that would violate the Longest Prefix extraction specified by 2.10. This wording issue also occurs in other examples.2.5.3.Rules.2
: Have we really decided to include this restriction? I'm not sure what value it adds, especially since it can be trivially circumvented by e.g. making N copies of the same table with different Table IDs and then setting up your main table like:
If we get rid of this rule (which I think we should), we'll also need to update 2.5.3.Table Switch Format.Notes.2, which also effectively prevents a source table from switching to a destination table via multiple entries.2.5.3.Table ID
: As far as I can tell, the TFS still allows multiple @TableIDString lines in a file. Since we've killed off support for multiple logical tables in a single table file, we want there to be at most one @TableIDString line per file, right? Or are we supporting multiple IDs for the same table file?2.5.3.Table ID
: This section states that "The TableIDString can contain any characters except for ','.", but 2.5.3.Table Switch Format.Notes.2 implies that only tables whose ID is composed entirely of the characters [0-9A-Za-z] are able to be switched into, which means that tables whose ID contains any other characters can only be used as starting tables. Is that intentional?2.5.3.Table ID
: We should specify that the TableIDString only needs to be unique across all tables provided to the utility, not e.g. across all tables on Data Crystal
or something like that.2.5.3.TableID
: I don't see anything that prevents a table from "switching" to itself; should e.g.
be considered an error?2.5.3.NumberOfMatches.-1
: Does this type of entry in one table imply a corresponding entry in another table? e.g. is
an error? Would it be possible to also have e.g. !7E=[Dakuten],5 (maybe in some table other than NORMAL) and match 7F=foo in Dakuten?
Closely related to the above point, it's not clear whether the closing token counts as a match in the pre-fallback table or in the post-fallback table. E.g., for
with input 0x23 0x7F 0x02 0x7F 0x01, which of these scenarios (if any) is correct?
0x23 in NORMAL: switch to Kanji for 2 matches
0x7F in Kanji, match #1: switch to Dakuten until 0x7F
0x02 in Dakuten: "baz"
0x7F in Dakuten: fall back to Kanji
0x01 in Kanji, match #2: "bar"
0x23 in NORMAL: switch to Kanji for 2 matches2.5.3.NumberOfMatches.-1
0x7F in Kanji, match #1: switch to Dakuten until 0x7F
0x02 in Dakuten: "baz"
no match in Dakuten: fall back to Kanji
0x7F in Kanji, match #2: returning from Dakuten
made 2 matches in Kanji: fall back to NORMAL
0x01 in NORMAL: "foo"
: Further exploring the behaviour of these forced fallback entries, what happens for
with input 0x00 0x01 0x00 0x02? Under the current TFS wording, that second 0x00 triggers fallback all the way to table1 and the output is "a", but I feel like we should be expecting 0x00 to match in table3 and have "cd" as our output.2.5.3.NumberOfMatches.X
: The TFS never really defines the term "match", but in this case the precise meaning becomes more important: do bytes which are dumped as control code parameters count as separate matches towards X? I think we decided earlier in this thread that they did not (i.e. that the control code together with all of its parameter bytes count as a single match), but that decision doesn't seem to have made its way into the TFS.2.9
: It might be worth noting that this behaviour is algorithm dependent; e.g. it's possible for Longest Prefix insert to back itself into a corner and fail on input that other algorithms would succeed on.2.11
: Going back to my points about 188.8.131.52/184.108.40.206/2.2.2, it would be nice to see some kind of correctness condition here, i.e. that the hex produced by the utility when inserting text A must be extracted as text A by the Longest Prefix extraction algorithm from 220.127.116.11.Duplicate Hex Sequences
: Maybe add a note here to confirm that having the same hex sequence occur in different tables is okay.2.12.Unrecognized Line or Invalid Syntax
: I'd like to propose a slight extension to the TFS: any line which begins with the character '#' must be ignored during parsing. This would allow for comments inside table files, which would be very useful for end users, and comes at a negligible cost to utility authors.2.13.Duplicate Text Sequences
: It looks like this rule is left over from when Longest Prefix was the only insertion algorithm that had been considered. Now that the floor is open to other algorithms, this rule should be removed (or at least reduced to a suggestion for anyone wanting to implement Longest Prefix), since it can make legitimate text sequences impossible to insert correctly. As an example,
the hex sequence 0x00 0x01 0x02 is dumped as "testfoo", but enforcing this rule would result in "testfoo" being inserted as 0x01 0x02, which the Longest Prefix hex-to-text algorithm would as translate to "bar". Smarter algorithms not bound by this rule would be able to use the different options for encoding "test" to find a tokenization that would not be misinterpreted by the dumping algorithm.2.13.Blank Text Sequences
: Does anyone have a use case for this? I'm having trouble coming up with a good reason why anyone would ever want this.
I'll see if I can include my comments on sections 3+ in a separate post. Edit: nope, auto-merge killed that idea :p.