Romhacking.net

Romhacking => Programming => Topic started by: abw on May 15, 2011, 05:56:11 pm

Title: Table File Standard Discussion
Post by: abw on May 15, 2011, 05:56:11 pm
I've moved from working on paper to actually writing code. Table switching did create a few headaches, but I'm using recursion as my stack now, so it's all good (well, except for post-processing of overlapping strings with bytes interpreted by different tables, which is still a mess, but hopefully no game was insane enough to do that; same goes for pointers into the middle of multibyte tokens). I can locate, translate, and output text now, and it is fun :). However, there are a couple of things in the table standard (http://transcorp.parodius.com/scratchpad/Table%20File%20Format.txt) that I'd like to get a sanity check on.


Raw Hex Inserts
Given the following table:
00=<$01>
01==
02=this is a '<$01>' string
03=this is a '
04=' string
05=<
06=>
07=$
08=\
10=0
11=1
the hex sequence "00 01 02" will be dumped as the text "<$01>=this is a '<$01>' string" but must be inserted as "01 01 03 01 04" as per 2.2.1. This seems wrong, but the problem could be resolved by replacing the entries for 00 and 02 with e.g.
00=<$$01>
02=this is a '<$$01>' string
so perhaps text sequences containing <$[0-9A-Fa-f][0-9A-Fa-f]> should be forbidden. Also, it might be more appropriate to include the section on inserting raw hex in 2.2.4 instead of in 2.2.1. Also also, it might be worth mentioning that hex characters are equally valid in upper or lower case (e.g. "AB" == "ab" == "Ab" == "aB").
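To make the consequence concrete, here's a rough sketch (Python; the helper and table names are mine, not anything from the standard) of an inserter following the raw-hex-precedence reading of 2.2.1. Note how entries whose text contains the "<$XX>" pattern can never be chosen on insertion, which is exactly what makes the 00 and 02 entries above useless for inserting:

Code:
import re

RAW_HEX = re.compile(r'<\$([0-9A-Fa-f]{2})>')

# Text -> hex mapping taken from the example table above.
TABLE = {
    "<$01>": "00", "=": "01", "this is a '<$01>' string": "02",
    "this is a '": "03", "' string": "04",
    "<": "05", ">": "06", "$": "07", "\\": "08", "0": "10", "1": "11",
}

def insert(text):
    # Under 2.2.1, "<$XX>" in the script is always consumed as a literal byte,
    # so entries whose text contains that pattern (00 and 02 here) can never
    # be matched during insertion.
    usable = {t: h for t, h in TABLE.items() if not RAW_HEX.search(t)}
    out, pos = [], 0
    while pos < len(text):
        m = RAW_HEX.match(text, pos)
        if m:
            out.append(m.group(1).upper())
            pos = m.end()
            continue
        # otherwise take the longest matching text sequence (2.2.4)
        best = max((t for t in usable if text.startswith(t, pos)), key=len, default=None)
        if best is None:
            raise ValueError("no table entry matches at offset %d" % pos)
        out.append(usable[best])
        pos += len(best)
    return " ".join(out)

print(insert("<$01>=this is a '<$01>' string"))   # "01 01 03 01 04", not the original "00 01 02"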


Control Codes in End Tokens Requiring Hex Representation
Given the following table:
00=text
01=more text
/FF=<end>\n\n//
the hex sequence "00 FF" will be dumped as the text "text<end>

//". When inserting, control codes are ignored for end tokens requiring hex representation, so any of "text<end>//", text<end>
//", "text<end>




//", etc. will be mapped to "00 FF", but "text<end>" will be mapped to "00" since "<end>" is ignored as per 2.2.2.

Assuming Atlas-compatible output, the hex sequence "00 FF 01 FF" would be dumped as "#W16($XXXX)
text<end>

//

#W16($XXXX)
more text<end>

//", which is probably not what was intended (maybe you could try to interpolate the pointer output into the end token's newlines, but that sounds like an extremely bad idea). Output commenting should probably be controlled at the utility level rather than the table level.

When inserting, should control codes be ignored for all tokens, or just end tokens requiring hex representation?


Uniqueness of End Token Names
Quote
Note: End Tokens, regardless of type, must be uniquely named.
The standard makes no definition of what constitutes a "name". Given that duplicate hex sequences are forbidden by 2.2.5, I assume name refers to the text sequence. Presumably an error should be generated when encountering a duplicate text sequence and at least one of the tokens involved is an end token... maybe? Is this dependent on the type of token? How about dumping vs. inserting? Given the following table:
/00=<end>,2
/01=<end>,2\n
$02=<end>,2
!03=<end>,2
04=<end>,2
05=<end>,2\n
/<end>,2
/<end>,2\n
what errors (if any) should be generated when dumping? when inserting?

While on the topic of uniqueness, it might be worth including a note in the standard that restricts the definition of a "unique" entry to the current logical table. Otherwise an error (duplicate hex sequence) should be generated by the following table:
@table1
00=foo

@table2
00=bar
Conversely, TableIDString in 2.6 should be considered unique across all tables (including across multiple table files) provided to the utility.


Linked Entries
Attempting to insert a linked entry which lacks its corresponding <$XX> in the insert script should generate an error under 4.2, right?

Under the "wishing for the moon" category:
It would be nice if we could define linked entries in such a way that we could obtain output with both pre- and post-fix, like "A(Color:<$A4><$34>)a". Theoretically, there's no reason a table file has to be restricted to dealing with in-game script. I can imagine, for instance, somebody writing a table like:
...
A9=LDA #,1
AA=TAX
AC=LDY $,2
...
and wanting "B1" to map to "LDA ($<$XX>),Y". You could do that with e.g. "B1=LDA ($,1,),Y", but since we lack a general escape character, you couldn't determine which commas were field delimiters and which were data :(. It might also be nice to be able to specify how raw bytes were output in case you don't want the <$XX> format.


Table Switching
NumberOfTableMatches in 2.6 refers to tokens (each of variable length) rather than bytes, yes? How many tokens do the bytes of a linked entry count as?

Given the following table:
@table1
00=1
!01=table2,2

@table2
00=A
01=B
0100=C
$02=<Color>,2
I think the process for translating "01 02 25 25 01 00 00" starting with table1 would be this:

table1 --> table2 --> <Color><$25><$25> --> C --> fallback to table1 --> 1
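
As a rough sketch of that walk (Python; the entry-kind labels and helper names are mine, and I'm assuming a linked entry consumes its parameter bytes but counts as a single match, while a switch entry hands control to the new table for NumberOfTableMatches matches before falling back):

Code:
# Entry kinds and names here are my own shorthand, not the standard's syntax.
TABLES = {
    "table1": {
        b"\x00": ("normal", "1"),
        b"\x01": ("switch", "table2", 2),    # !01=table2,2
    },
    "table2": {
        b"\x00": ("normal", "A"),
        b"\x01": ("normal", "B"),
        b"\x01\x00": ("normal", "C"),
        b"\x02": ("linked", "<Color>", 2),   # $02=<Color>,2
    },
}

def dump(data, pos, table_id, max_matches=None):
    """Return (text, new_pos); stop after max_matches matches, an unmatched byte, or end of data."""
    table, out, matches = TABLES[table_id], [], 0
    hexes = sorted(table, key=len, reverse=True)         # longest hex sequence wins
    while pos < len(data) and (max_matches is None or matches < max_matches):
        entry = next((h for h in hexes if data.startswith(h, pos)), None)
        if entry is None:
            break                                        # no match: fall back to the calling table
        kind = table[entry][0]
        pos += len(entry)
        if kind == "normal":
            out.append(table[entry][1])
        elif kind == "linked":
            _, text, nparams = table[entry]
            out.append(text + "".join("<$%02X>" % b for b in data[pos:pos + nparams]))
            pos += nparams                               # parameter bytes consumed, but only one match
        elif kind == "switch":
            _, new_table, count = table[entry]
            text, pos = dump(data, pos, new_table, count)
            out.append(text)
        matches += 1                                     # a switch or linked hit counts as a single match
    return "".join(out), pos

print(dump(bytes.fromhex("01022525010000"), 0, "table1")[0])   # <Color><$25><$25>C1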


Various errata
2.2.1: "sequenes"
2.2.1: "hex byte insertion takes precedent" should read "hex byte insertion takes precedence"
2.3: "Control Codes" is a somewhat ambiguous term, since it can refer to the only defined table entry format control code ("\n") or to game-specific control codes as referenced in 2.5
3.1: "in and automated fashion" should read "in an automated fashion"
3.4, Example 1: should "E060" read "E030"?
4.2: "paramter"
Title: Re: Table File Standard Discussion
Post by: Nightcrawler on May 16, 2011, 03:06:54 pm
I've moved from working on paper to actually writing code. Table switching did create a few headaches

I handled table switching with an array of tables, an index to the active table, and a stack of table indexes to handle the table jumps and returns. It was only a few lines of code. It was more difficult to handle notification of the several conditions that would require the table to fall back or jump to another.

Quote
Raw Hex Inserts
 the hex sequence "00 01 02" will be dumped as the text "<$01>=this is a '<$01>' string" but must be inserted as "01 01 03 01 04" as per 2.2.1.

Yes, that would be the way it would currently operate. The possibility of normal text sequences containing the "<$XX>" pattern was deemed to be very low, and thus the rule that raw hex insertion takes precedence was chosen. It would make sense to disallow that usage pattern in normal text sequences altogether. I will run it by a few others and see if any other thoughts come from it.

Quote
Control Codes in End Tokens Requiring Hex Representation
//", which is probably not what was intended (maybe you could try to interpolate the pointer output into the end token's newlines, but that sounds like an extremely bad idea). Output commenting should probably be controlled at the utility level rather than the table level.

It is controlled at the utility level. The table file has no knowledge of what commenting characters are. The example simply illustrates a possibility of being able to output like that, if you desire, for situations where it's appropriate. Even then, you still would not be able to insert without the insertion utility being aware of your commenting characters, which the table file certainly does not handle. You would obviously not choose to do this at all with Atlas output like you showed in your example. I will think about making that a bit clearer, or changing the example so it has nothing to do with comments, to eliminate the confusion. I just thought that would be a common and useful application for many simple cases.

Quote
When inserting, should control codes be ignored for all tokens, or just end tokens requiring hex representation?
They should be ignored for all tokens. Linebreak control exists only for dumping output/readability. It serves no use for insertion. Cases have been made for eliminating it entirely and adding line breaks via regex or search and replace later, which would remove control codes from the table file entirely.

Quote
Uniqueness of End Token Names
The standard makes no definition of what constitutes a "name". Given that duplicate hex sequences are forbidden by 2.2.5, I assume name refers to the text sequence. Presumably an error should be generated when encountering a duplicate text sequence and at least one of the tokens involved is an end token... maybe? Is this dependent on the type of token? How about dumping vs. inserting?

Here, we're talking specifically about end tokens. The end token's name (or more precisely text sequence) should be unique regardless of end token type. End Tokens are the only token type that must have unique text sequences.

Quote
Given the following table:
/00=<end>,2
/01=<end>,2\n
$02=<end>,2
!03=<end>,2
04=<end>,2
05=<end>,2\n
/<end>,2
/<end>,2\n
what errors (if any) should be generated when dumping? when inserting?

The only error would be with the duplication of end token names. Everything else follows the rules in 2.2.6. You do make a good point with the same end token text sequences differing only by '\n'. At present that would result in a duplicate for insertion purposes, but would pass for dumping. That feeds into the case for elimination of control sequences altogether, but then there's no way to have any line breaks anywhere other than what your dumper or other third party app may be able to do. It is a possibility we considered, but the majority of people who would use the applications based on this format are going to want to have some line breaks without having to jump through hoops or additional programs/steps.


Quote
While on the topic of uniqueness, it might be worth including a note in the standard that restricts the definition of a "unique" entry to the current logical table. Conversely, TableIDString in 2.6 should be considered unique across all tables (including across multiple table files) provided to the utility.
The standard provides only for a single table per file. There can be only one Table ID line that will uniquely identify that table.

Quote
Linked Entries
Attempting to insert a linked entry which lacks its corresponding <$XX> in the insert script should generate an error under 4.2, right?
Not sure what you mean here. Linked entries aren't really inserted. For insertion, they pretty much become normal entries for all practical purposes. The parameter bytes are all raw hex at that point.

Quote
Under the "wishing for the moon" category:
It would be nice if we could define linked entries in such a way that we could obtain output with both pre- and post-fix, like "A(Color:<$A4><$34>)a". Theoretically, there's no reason a table file has to be restricted to dealing with in-game script. I can imagine, for instance, somebody writing a table like: It might also be nice to be able to specify how raw bytes were output in case you don't want the <$XX> format.

That would make it much more difficult to insert. At present, a linked entry is only used for dumping. After that, you just have a normal entry and some raw hex bytes. Insertion can then be done by token stream. Having a prefix and postfix and keeping the notion of a linked entry would be more complex. Usage outside of ROM Hacking is beyond the scope of the document. Lastly, we explicitly defined the raw hex format and do not allow for other formats for simplicity and unification of upcoming utilities based on this standard. Incompatibility was a real black eye in the past. If you don't define the raw hex format, it leads to many more overlapping and ambiguous possibilities, which is counterproductive to what we're doing.

Quote
Table Switching
NumberOfTableMatches in 2.6 refers to tokens (each of variable length) rather than bytes, yes? How many tokens do the bytes of a linked entry count as?

Your example looks correct. We use the term 'matches' for a reason. There are no characters or bytes involved. Thus a linked entry hit counts as a single match, as you have it.


Quote
Various errata
2.2.1: "sequenes"
2.2.1: "hex byte insertion takes precedent" should read "hex byte insertion takes precedence"
2.3: "Control Codes" is a somewhat ambiguous term, since it can refer to the only defined table entry format control code ("\n") or to game-specific control codes as referenced in 2.5
3.1: "in and automated fashion" should read "in an automated fashion"
3.4, Example 1: should "E060" read "E030"?
4.2: "paramter"

Thanks. Will correct.
Title: Re: Table File Standard Discussion
Post by: abw on May 16, 2011, 10:43:29 pm
It would make sense to disallow the pattern of usage in normal text sequences altogether.
As added support for this, I'll point out that the raw hex precedence rule makes affected table entries entirely useless for insertion. This behaviour also feels counter intuitive at first glance, unlike the case of duplicate text sequences.

They should be ignored for all tokens.
The current placement strongly suggests that the "ignore newlines when inserting" rule applies only to end tokens requiring hex representation, so 2.3 should probably say so instead. Better yet, I think, would be to strike the rule altogether. If somebody went to the trouble of putting a newline in their table in the first place, they must have had a reason to do so. Assuming that reason was for formatting, the odds are very good those newlines will still be exactly where they should be when it comes time to insert, in which case ignoring them or not leads to the same result.

In any case, the decision whether to ignore newlines or not should be left to the utility rather than imposed by the table standard.

That feeds into the case for elimination of control sequences altogether
I think rather that it feeds into the case for eliminating the "ignore newlines when inserting" rule :p A table should be valid period, not valid for dumping or valid for inserting.

The only error would be with the duplication of end token names. Everything else follows the rules in 2.2.6.
So the only errors are
/00=<end>,2
/<end>,2
and
/01=<end>,2\n
/<end>,2\n
for dumping and
/00=<end>,2
/01=<end>,2\n
/<end>,2
/<end>,2\n
for inserting. Ok, let's clean those up and add a (very) little variety:
/00=<end>,2
$02=<end>,2
!03=<end>,2
04=<end>,2
05=<end>,2\n
06= blah
Following 2.2.6, then, when an inserter reads " blah<end>,2 blah" it should insert "06 05 06".

What's the reasoning behind disallowing duplicate text sequences for end tokens only? The "last occurring shortest hex sequence" rule from 2.2.6 seems like it should also apply here, with artificial end tokens counting as hex sequences of length 0. If somebody decides to give their artificial end tokens the same text as some other end token and then things break, how is that different from giving any two game-specific control codes (which are likely represented as normal entries) the same text? Is it just that with end tokens we're guaranteed to be dealing with game-specific control codes?

I'm not even sure this is something the table standard should try to enforce - like commenting, artificial end tokens would appear to be something more properly dealt with at the utility level than at the table level.

The standard provides only for a single table per file.
Eh? Oh. Hmm. Somehow I was under the impression that the standard supported multiple logical tables within the same physical file. At the very least, it isn't explicitly forbidden, and I see no difficulty in supporting it (having already done so myself) if we impose two more constraints:
1) a table's ID line must precede any entries for that table
2) if a file contains multiple tables, every table in that file after the first table must have an ID line
(in fact I require an ID for every table used, but I don't feel that condition should be imposed as part of the standard)

For insertion, they pretty much become normal entries for all practical purposes.
So if instead of "A<Color><$A4><$34>a" we only have "A<Color><$A4>a", you'd want to insert "A0 E0 A4 21"? How does this interact with table switching? Continuing my table switching example, what does the insertion process look like when instead of "<Color><$25><$25>C1", we're trying to insert "<Color><$25>C1" starting with table1?

Thanks. Will correct.
You're welcome :). Like I said before, it's a pretty good document. I'm just trying to flesh out some of the edge cases a little more >:D.
Title: Re: Table File Standard Discussion
Post by: Nightcrawler on May 17, 2011, 12:48:23 pm
As added support for this, I'll point out that the raw hex precedence rule makes affected table entries entirely useless for insertion. This behaviour also feels counter intuitive at first glance, unlike the case of duplicate text sequences.

Counter intuitive as opposed to what? You need to have the ability for raw hex handling. You either need to have a proper escape mechanism, which we (those that discussed it) agreed early on to omit for simplicity, or you need to define a rule to cover the overlap case. The rule can only really lean one way or the other, raw hex or table entry. As you exemplified, the rule isn't even needed if you simply disallow using the raw hex pattern in your table entry. So, I'm not sure what other behavior would be more intuitive here.

Quote
The current placement strongly suggests that the "ignore newlines when inserting" rule applies only to end tokens requiring hex representation, so 2.3 should probably say so instead. Better yet, I think, would be to strike the rule altogether. If somebody went to the trouble of putting a newline in their table in the first place, they must have had a reason to do so. Assuming that reason was for formatting, the odds are very good those newlines will still be exactly where they should be when it comes time to insert, in which case ignoring them or not leads to the same result. In any case, the decision whether to ignore newlines or not should be left to the utility rather than imposed by the table standard. I think rather that it feeds into the case for eliminating the "ignore newlines when inserting" rule :p A table should be valid period, not valid for dumping or valid for inserting.

Processing newlines for insertion as part of your tokens is a bad idea. No thank you to that. No insertion utilities currently do that to my knowledge, and for good reason. I made a private utility once that did do it, and it was hard enough to make it work perfectly even for my private, controlled use. The newlines are there only to aid in the output spacing of the dump. You're guaranteed to change them. One extra line break, a carriage return entered silently by your word processor, a rogue space inserted in your game script, a translator hitting enter one too many times, etc. breaks it right away. Let me tell you from direct experience, you will not be able to maintain XXK of game script and have all your line breaks be exactly as they were in the dump without many hard to find/fix mistakes!

In my opinion, the alternative, if the newlines cause such a problem, is simply to eliminate them altogether. The negative to that is it makes the dump utility responsible for all line breaks. To get it as flexible as it is in the table standard, allowing for newline to be used in any table entry, would be a bit tough and burdensome. I'd imagine most utilities would go no further than line breaks after game line breaks, and line breaks after end tokens. That's why I hang on to it, even though a good case is made for its elimination. It's also inheriting what Cartographer, Romjuice, and Atlas already do.

What's the reasoning behind disallowing duplicate text sequences for end tokens only? The "last occurring shortest hex sequence" rule from 2.2.6 seems like it should also apply here, with artificial end tokens counting as hex sequences of length 0. If somebody decides to give their artificial end tokens the same text as some other end token and then things break, how is that different from giving any two game-specific control codes (which are likely represented as normal entries) the same text? Is it just that with end tokens we're guaranteed to be dealing with game-specific control codes?

I didn't find any backing for a reason when I looked back at past conversation. It may have been a first-come solution to the ambiguity. I agree it could be allowed and follow the 2.2.6 rule. I'm not seeing the reason for the exception right now.

Quote
I'm not even sure this is something the table standard should try to enforce - like commenting, artificial end tokens would appear to be something more properly dealt with at the utility level than at the table level.

This is another situation where it is and it isn't. It should be handled on the utility level. It has to be for dumping because it takes logic to define when an artificial control code is to be output. However, after it's dumped, it's no longer artificial. It becomes a real end token and it should be defined in your table then to be used for insertion as normal.

Quote
Eh? Oh. Hmm. Somehow I was under the impression that the standard supported multiple logical tables within the same physical file. At the very least, it isn't explicitly forbidden, and I see no difficulty in supporting it (having already done so myself) if we impose two more constraints:

I don't really want to do this. It takes away some of the simplicity of table file parsing and loading. It is a table file, singular, after all. It also increases the complexity of defining the starting table in the utility.  Instead of simply picking the starting file with no processing required, you need to process all table files and generate a list of tables by ID. Just having table switching to begin with probably puts this out of reach for many to implement (even you indicated difficulty). Many in our community are amateur programmers at best (not that we don't also have pros) and the more burdensome it is, the less it will be used. It adds a layer of complication I'm not too interested in. Even if it is a good idea, at the end of the day, I can't advocate a standard I don't want to program a utility for, so this is probably something I won't go with.

Quote
So if instead of "A<Color><$A4><$34>a" we only have "A<Color><$A4>a", you'd want to insert "A0 E0 A4 21"? How does this interact with table switching? Continuing my table switching example, what does the insertion process look like when instead of "<Color><$25><$25>C1", we're trying to insert "<Color><$25>C1" starting with table1?

Yes, that's correct (insert A0 E0 A4 21). For table switching, when dumping, a linked entry would count as a single match. On the insertion side, you don't really have this nested table switching setup anymore. You're either going to have the table switching raw hex bytes there or you're going to have a command. When I last spoke to Klarth (author of Atlas), there were still some implementation details to work out, but the idea would be that the dumping utility outputs a table command the inserter would use. This is a utility issue. The scope of the table is only to properly decode the character with the right table. How to indicate to the inserter what table to use is a different story. In my own utility I will probably just end up outputting the table switch bytes in the dump and/or provide an option to omit them.
Title: Re: Table File Standard Discussion
Post by: abw on May 18, 2011, 10:26:45 pm
Counter intuitive as opposed to what?
Ideally, if I dump a bunch of hex to text and then reinsert that text unaltered, my basic expectation is to get the original hex again. In the case of duplicate text sequences or when compression is involved, I would expect to get some optimally compressed variant of the text. Giving precedence to raw hex bytes breaks that expectation. Except in the case of linked entries, raw hex bytes represent a failure of the hex-to-text translation process. I understand why that failure needs to be addressed with higher priority. I just don't like it  :P. Disallowing situations in which it can arise seems like the simplest solution, and besides, who uses <$XX> in their table entries anyway?

Processing newlines for insertion as part of your tokens is a bad idea. No thank you to that.
I don't feel it's that big of a problem, but it is entirely possible that I will change my tune later on :P.

This is another situation where it is and it isn't.
The only thing differentiating "/text" from the disallowed "=text" is the end token flag, and the end token flag is primarily a utility hint rather than hex <-> text translation information. As such, it just feels like including artificial end tokens in the table standard wanders a little further than necessary into defining the content of script files and utility interfaces, leaning towards the Atlas/Cartographer style status quo in the interests of backwards compatibility. I'm not saying any of this is necessarily bad, but it does discourage growth in other directions. After all, there are other ways for a utility to keep track of where strings end without using artificial end tokens.

I don't really want to do this.
Fair enough. Table switching complicates hex <-> text translation, but doesn't have much impact on table file parsing. I agree it gets a bit silly if the user provides you with large quantities of unused table data (like if somebody stored all the tables for every game they'd ever worked on in the same file [wait, that sounds kind of awesome >:D]), but assuming the utility only receives tables it needs, you still have to go through all the same parsing/selection steps anyway, so the overhead is pretty low. As an added convenience, I'm defaulting table ID to the table's file name, so there's very little work required for the end user, and single table files with no ID still work. In any case, being burdensome to implement is not the same as being burdensome to use, and this seems like something people might use. I kind of want to keep it, but at the same time, people going off and doing their own thing is what led to all the incompatibilities this standard tries to correct, and I definitely agree in principle to having inter-operable tools. Thoughts?

Yes, that's correct (insert A0 E0 A4 21).
Right, no safeguarding the user from their own mistakes, just like the policy on newlines. Oh wait  :P (no, I'm not seriously suggesting that ignoring newlines for insertion is a panacea; there are still lots of ways to get into trouble). I guess I don't much care either way, but I think maybe I'll print a warning, just in case.

Table switching does appear to make insertion more complex. I don't think I can just tell perl to do it for me anymore (or maybe I just need to ask nicer)  :P. Outputting table switch tokens/inserter commands might work well enough if all you want to do is re-insert the original text, but if that text has been modified, I think you might need to keep the nested setup in order to deal with table switch tokens that come with a non-zero token count, under the assumption that the game is going to expect to find that many tokens in the new table. That's why I'm concerned about linked tokens - it sounds as though a linked token with its parameter bytes counts as one match for dumping, but as multiple matches for inserting?

Since I missed this last time:

I handled table switching with an array of tables, an index to the active table, and a stack of table indexes to handle the table jumps and returns. It was only a few lines of code. It was more difficult to handle notification of the several conditions that would require the table to fall back or jump to another.
I've got a table object parsing tokens and then calling the parsing method on another table when it reads a switch token. End conditions were also interesting - I've got whichever comes first of "read X bytes", "read Y tokens", and "read an end token", with "unable to read a token" for sub tables only, with each condition propagating to parent table(s) as appropriate. I am unconvinced this is sufficient in general, but have no real-world example to support that belief. It would be nice if we had a repository of known unsupported formats and things in general that still require a custom dumper/inserter. I'm trying to be as flexible as I can, but specific goals are nice sometimes too!

As a point of possible interest, I found it more useful (especially for post-processing) to return a list of tokens rather than the translated text.

New business:

I disagree with the "longest prefix" insertion rule from 2.2.4. With a slightly altered example:
12=Five
13=SixSeven
00=FiveSix
the "longest prefix" rule makes it impossible to insert the string "FiveSixSeven", despite that being valid output from a dumper. A less greedy algorithm would be able to insert "12 13" instead, which seems superior to inserting "00" but losing the "Seven".

Quote
Note: A table switch hex sequence match counts toward NumberOfTableMatches.
It's not clear from the wording (2.6), but presumably NumberOfTableMatches here refers to the match count in the old table rather than the new table, i.e. "!hex=tableID,1" is not a no-op.

Quote
NumberOfTableMatches is the non-negative number of table matches to match before falling back to the previous table.
It's not clear from the wording (2.6), but when counting matches, I think this should be restricted to matches in the new table itself, rather than including matches from any other table(s) the new table might subsequently switch to. Here's an example to illustrate:
@table1
00=A
!01=table2,2
02=X
05=Y

@table2
02=B
!03=table3,2

@table3
$04=<Color>,2
05=C
When dumping "01 03 04 AB BA 05 02 00" starting with table1, which process is correct?

table1 matches "01" --> table2 matches "03" (table2 count  =  1) --> table3 matches "04 AB BA" (table2 count = 1, table3 count = 1) --> table3 matches "05" (table2 count = 1, table3 count = 2) --> fallback to table2 (table3 count completed) --> table2 matches "02" (table2 count = 2) --> fallback to table1 (table2 count completed) --> table1 matches "00" (output is "<Color><$AB><$BA>CBA")

or

table1 matches "01" --> table2 matches "03" (table2 count = 1) --> table3 matches "04 AB BA" (table2 count = 2, table3 count = 1) --> fallback to table1 (table2 count completed, but table3 count still = 1) --> table1 matches "05" --> table1 matches "02" --> table1 matches "00" (output is "<Color><$AB><$BA>YXA")
Title: Re: Table File Standard Discussion
Post by: Nightcrawler on May 19, 2011, 11:41:32 am
The only thing differentiating "/text" from the disallowed "=text" is the end token flag, and the end token flag is primarily a utility hint rather than hex <-> text translation information. As such, it just feels like including artificial end tokens in the table standard wanders a little further than necessary into defining the content of script files and utility interfaces, leaning towards the Atlas/Cartographer style status quo in the interests of backwards compatibility. I'm not saying any of this is necessarily bad, but it does discourage growth in other directions. After all, there are other ways for a utility to keep track of where strings end without using artificial end tokens.

Absolutely, but that's exactly what we're going for. We're not trying to make something radically new. We're refining and defining that status quo. The second paragraph of the overview sums it up. The end result should be able to be adapted into Atlas and Cartographer without all that much modification. The idea here is to get everybody on the same page for the first time ever, and hold us over. Baby steps. Otherwise, I can guarantee you it will go nowhere and nobody else will ever use it. Just look at the whole patching format fiasco and why many people still use IPS. I can't guarantee this will fare any better, but what I can guarantee is, if at the very least it is supported in Atlas, Cartographer, and TextAngel, it will be used by 85% of everybody out there and provide a dumping and inserting standard for the near future. This will hold until we eventually move on to XML or whatever the next evolution will be, where radical new and improved ideas can run wild and new tools can be developed. In the ROM Hacking community, such a move can often take a decade or more. We need something in between. :laugh:


Quote
Fair enough. Table switching complicates hex <-> text translation, but doesn't have much impact on table file parsing. I agree it gets a bit silly if the user provides you with large quantities of unused table data (like if somebody stored all the tables for every game they'd ever worked on in the same file [wait, that sounds kind of awesome >:D]), but assuming the utility only receives tables it needs, you still have to go through all the same parsing/selection steps anyway, so the overhead is pretty low. As an added convenience, I'm defaulting table ID to the table's file name, so there's very little work required for the end user, and single table files with no ID still work. In any case, being burdensome to implement is not the same as being burdensome to use, and this seems like something people might use. I kind of want to keep it, but at the same time, people going off and doing their own thing is what led to all the incompatibilities this standard tries to correct, and I definitely agree in principle to having inter-operable tools. Thoughts?

It makes the difference between having to process the tables just to pick a starting table, and not having to process at all until operation time. You're requiring that I process and parse all the tables just to be able to provide the options necessary for the user to select the starting table. I don't need to do any of that now. You just pick the file. Table parsing and processing is only needed when the operation commences. That's my objection. It's a fine idea, it just requires utility changes I don't want to make.

It's every bit as much about implementation as it is about end user use. That's especially true for me. I'm setting forth a standard. I'm the only one developing a dumper that will use it (Cartographer will hopefully pick it up when/if Klarth updates TableLib). If I were to not develop TextAngel because I find the standard to be too much of a pain in the ass, then why am I involved in pushing this standard to begin with? And if there is then no dumper that uses the standard, the standard really has no point in existing. You see where I'm going with this? It's got to be something I'm comfortable and motivated to develop a utility for, or it's pointless for me to invest any time on it. I'm trying to make something for everybody, but since I'm the only one doing development of a dumper that will use it, I have to be a little selfishly biased in the standard in order for it to ever see the light of day. ;D

Quote
Right, no safeguarding the user from their own mistakes, just like the policy on newlines. Oh wait  :P (no, I'm not seriously suggesting that ignoring newlines for insertion is a panacea; there are still lots of ways to get into trouble). I guess I don't much care either way, but I think maybe I'll print a warning, just in case.

There's no reason to prevent the user from inserting more or less raw hex data. I wouldn't call that a user mistake. I've done it intentionally several times. As for newlines, they are nothing but whitespace in your script file for insertion. They have no effect on insertion. Look, I used to think the same thing. I told you I had done a project with newlines being processed as part of the tokens for insertion. It's certainly possible, but totally undesirable. It's just prone to too many problems. When the script was ready for insertion, many files were broken. Remember, it passes through the hands of several people. As mentioned, different text processors silently made changes (0x0d and 0x0a mangling), extra line breaks were put in by human mistake, extra spaces nobody sees crept in through copy-paste or human error. All sorts of things in real world practice. I will never actually insert newlines again. Those are probably similar reasons Atlas doesn't do it either. I don't know of any utility that does, come to think of it.

Quote
Table switching does appear to make insertion more complex. I don't think I can just tell perl to do it for me anymore (or maybe I just need to ask nicer)  :P. Outputting table switch tokens/inserter commands might work well enough if all you want to do is re-insert the original text, but if that text has been modified, I think you might need to keep the nested setup in order to deal with table switch tokens that come with a non-zero token count, under the assumption that the game is going to expect to find that many tokens in the new table. That's why I'm concerned about linked tokens - it sounds as though a linked token with its parameter bytes counts as one match for dumping, but as multiple matches for inserting?

Klarth previously said "As far as the switch after X characters, I haven't seen (nor can formulate) a case where it would've helped with an English translation.  I'd probably have the user create a table entry specifically for switching the table."

I think there's going to be a small number of situations where on a theoretical level, there may be a concern or issue, but real world dictates the case never occurs. We've got to appease Klarth because Atlas support holds major weight in any success this standard may have. Atlas IS probably the closest thing to a standard we have currently.

Quote
New business:

I disagree with the "longest prefix" insertion rule from 2.2.4. With a slightly altered example:
12=Five
13=SixSeven
00=FiveSix
the "longest prefix" rule makes it impossible to insert the string "FiveSixSeven", despite that being valid output from a dumper. A less greedy algorithm would be able to insert "12 13" instead, which seems superior to inserting "00" but losing the "Seven".

You do? I'd love to hear your algorithm, especially if we happen to add a few more table values to the mix:

02=F
03=i
04=v
05=e
06=Fi
07=ve
08=ive

Now, what are you going to insert and how are you going to determine it? ;)

Quote
It's not clear from the wording (2.6), but presumably NumberOfTableMatches here refers to the match count in the old table rather than the new table, i.e. "!hex=tableID,1" is not a no-op.
Correct.

Quote
It's not clear from the wording (2.6), but when counting matches, I think this should be restricted to matches in the new table itself, rather than including matches from any other table(s) the new table might subsequently switch to. Here's an example to illustrate:
Yes, it is. Example 1 is correct processing. I can try to clarify.
Title: Re: Table File Standard Discussion
Post by: abw on May 19, 2011, 11:31:25 pm
Quote
New business:

I disagree with the "longest prefix" insertion rule from 2.2.4. With a slightly altered example:
12=Five
13=SixSeven
00=FiveSix
the "longest prefix" rule makes it impossible to insert the string "FiveSixSeven", despite that being valid output from a dumper. A less greedy algorithm would be able to insert "12 13" instead, which seems superior to inserting "00" but losing the "Seven".

You do? I'd love to hear your algorithm, especially if we happen to add a few more table values to the mix:

02=F
03=i
04=v
05=e
06=Fi
07=ve
08=ive

Now, what are you going to insert and how are you going to determine it? ;)

The rest I'll have to think about, but this I can answer off the top of my head:

In the absence of table switching, I would model my insertion algorithm (which is currently hypothetical, unlike my dumping algorithm) on perl's regular expression engine. Ignoring all of its optimizations, the engine is generally longest prefix, but adds the concept of backtracking - if it's not possible to complete tokenizing the string based on your current tokenization, go back a little bit and try a different tokenization. So tokenizing "FiveSixSeven" would start by finding 4 possible initial tokens ("00=FiveSix", "02=F", "06=Fi", and "12=Five"). It would then tentatively accept the longest of those, "FiveSix", and try to continue matching the remainder of the string ("Seven"). Since that can't be done, it would backtrack, discarding the previous token ("FiveSix") and trying again with the next longest, "Five". The remainder of the string ("SixSeven") is able to be matched ("13=SixSeven"), so hurray, we're done. If it hadn't worked out so nicely, "Fi" and then "F" would have been tried in turn, until eventually it becomes obvious that the string just can't be matched. In that case, 2.2.2 kicks in and it would start all over again with "iveSixSeven". It's not as fast as a straight up "longest prefix", but it is more accurate. For our purposes you might want to modify the token preference to be based on hex length rather than text length, but in order to guarantee an optimal encoding, you'll probably have to run every possible tokenization, remember the ones that worked, and then compute each one's hex length.
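
Something like this rough sketch (Python; the table and helper names are just for illustration, and the 2.2.2 restart for text that can't be matched at all is left out):

Code:
TABLE = {"Five": "12", "SixSeven": "13", "FiveSix": "00",
         "F": "02", "i": "03", "v": "04", "e": "05",
         "Fi": "06", "ve": "07", "ive": "08"}

def tokenize(text):
    if not text:
        return []                                        # everything matched
    # candidate tokens at this position, longest first (the greedy preference)
    for tok in sorted((t for t in TABLE if text.startswith(t)), key=len, reverse=True):
        rest = tokenize(text[len(tok):])
        if rest is not None:                             # remainder could be matched: accept
            return [TABLE[tok]] + rest
        # otherwise backtrack and try the next-longest candidate
    return None                                          # no tokenization possible from here

print(tokenize("FiveSixSeven"))   # ['12', '13']; plain longest-prefix grabs "00" and strands "Seven"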
Title: Re: Table File Standard Discussion
Post by: Nightcrawler on May 20, 2011, 08:53:33 am
You've illustrated my point. Now take a poll and see how many people can a.) understand how to implement that b.) have the ability to implement that and c.) have the desire to implement that. You've just increased complexity by 10x for an otherwise trivial task. Has anybody ever actually written an inserter that behaves like this? Of all the ones I've seen source code for, none have. From all the scripts I've ever inserted with my own utilities, and all the scripts that have ever been inserted with Atlas, it has never been an issue.

If you were writing this standard, I'm pretty sure I'd want no part of it. You like to take the mountain climbing approach for the molehill. :P

Anybody else, don't be afraid to comment here. If I'm off my rocker, I'd like to know, but I can't see anybody really doing this as evidenced by our past history. I think it'll be out of reach for most and unnecessary/undesirable for those it isn't out of reach for.

Title: Re: Table File Standard Discussion
Post by: abw on May 20, 2011, 09:48:41 pm
Sure, as long as your table contains separate entries for all the individual characters of your string, longest prefix will work. I ran this example through Atlas, and it does indeed insert three times as many bytes as required - "FiveSix" "S" "e" "v" "e" "n" instead of "Five" "SixSeven". My point here is that the longest prefix insertion algorithm is provably non-optimal, and I do object to the standard imposing a non-optimal algorithm on any utility author wishing to implement the standard, regardless of whether any such author is ready, willing, or able to step forward with a better implementation.

Speaking of which... you said you were the only one developing for this standard. Is it worth mentioning at this point that I believe I've already written a 100% standard compliant dumper? (I've yet to write/run a full test suite, and I'm assuming the standard will be updated based on recent discussion, so no promises.) In addition to almost everything Cartographer can do (it's difficult to determine the full range of Cartographer's capabilities without source code), I also support multi-table files (my own happy misinterpretation of the standard, subject to the additional constraints listed earlier (http://www.romhacking.net/forum/index.php/topic,12644.msg183844.html#msg183844)), discontinuous pointer bytes (as suggested by Geiger in the previous thread (http://www.romhacking.net/forum/index.php/topic,12462.0.html)), arbitrary pointer tree structures (your own pointer table to pointer table to pointer table to string example), optional overlapping string output, and optional string fragment output (my own feature requests from a couple of years ago (http://www.romhacking.net/forum/index.php/topic,8945.0.html)).

You like to take the mountain climbing approach for the molehill. :P
Aye, perhaps I've seen enough molehills turn into mountains that I've gotten used to exploring unlikely possibilities first. There's nothing wrong with being prepared :P.

I'd say we're in agreement on 99% of the material. Most of the issues I've raised have been more about presentation ("X would make more sense here instead of there", "Y could be stated more clearly", etc.) than content, but where content is at issue, yes, we do appear to have some philosophical differences. You argue for strictness in places I would prefer freedom, I argue for strictness in places you would prefer freedom. It's almost like we're writing programs in two different styles :P.

Anybody else, don't be afraid to comment here. If I'm off my rocker, I'd like to know
Comments from others would definitely be appreciated. I for one don't think you're off your rocker - your position is based on years of hard-won experience, and I respect it even if I don't always agree with it. I do think the table file standard oversteps its bounds in a few places - if all it does is codify and enforce the behaviour of the currently popular utilities, I don't see it providing much room for improvement. Cartographer and Atlas are extremely useful, but it's not hard to imagine utilities that would be even more useful than either or both of them, and I don't think the table file standard should hinder the emergence of such a utility in any way not directly related to table files.
Title: Re: Table File Standard Discussion
Post by: Nightcrawler on May 24, 2011, 03:06:38 pm
Sure, as long as your table contains separate entries for all the individual characters of your string, longest prefix will work. I ran this example through Atlas, and it does indeed insert three times as many bytes as required - "FiveSix" "S" "e" "v" "e" "n" instead of "Five" "SixSeven". My point here is that the longest prefix insertion algorithm is provably non-optimal, and I do object to the standard imposing a non-optimal algorithm on any utility author wishing to implement the standard, regardless of whether any such author is ready, willing, or able to step forward with a better implementation.

Ran this one by Klarth: "Bad token selection can occur sometimes, but I'd estimate it very rare for it to be detrimental... unless it's a 'gotcha table'." The optimal algorithm is simply out of reach for most, and undesirable for the rest of us. Just because it may be more optimal doesn't mean it's desirable or the best choice for the job.

Quote
Speaking of which... you said you were the only one developing for this standard. Is it worth mentioning at this point that I believe I've already written a 100% standard compliant dumper?

It's only worth mentioning if a.) it will end up being standard compliant, b.) it's released to the public, and c.) it's not in Perl. I'm joking about that last part, as Perl syntax makes me ill. However, I do want to raise the point that hardly any Windows users (the statistical majority of the end users) have Perl installed.

Quote
I'd say we're in agreement on 99% of the material. Most of the issues I've raised have been more about presentation ("X would make more sense here instead of there", "Y could be stated more clearly", etc.) than content, but where content is at issue, yes, we do appear to have some philosophical differences. You argue for strictness in places I would prefer freedom, I argue for strictness in places you would prefer freedom. It's almost like we're writing programs in two different styles :P.

Too much freedom for utilities caused the problem to begin with. It's the freedom of utilities that led to not being able to use the same table between various utilities. Not only do we want to be able to use the same table amongst all compliant utilities, we also ideally want to be able to dump and insert interchangeably. My utility and your utility shouldn't give something different (as far as the basic text goes) when dumping or inserting the same script. And whether you insert with my utility or Atlas, you'll get the same hex inserted. They will differ in features, formatting, pointers, and abilities beyond the basic text/hex.

With that said, I'm not saying that can't necessarily still be accomplished with some of your proposed changes, but it is the reason why the standard appears to overstep boundaries in those areas. To steal the words of a colleague, "While the aim is unifying table files, what is being unified is the textual representation of certain dumping and insertion processes and their matching behavior between textual representation and hexadecimal representation of the game script. As such, these definitions are naturally a part of the spec."


Quote
Comments from others would definitely be appreciated. I for one don't think you're off your rocker - your position is based on years of hard-won experience, and I respect it even if I don't always agree with it.

Thanks. I appreciate that. I hope I do not come off as condescending as a result. I do not discredit any of your ideas, I merely argue my position on them. I would like to try and incorporate some more of what's been discussed here. I am in the process of making several of the changes mentioned and running some of the hot button items by some of the other guys. It seemed they were a bit scared off by the walls of text here. I will likely get someone to stop by and comment yet. ;)

I will summarize later with the changes I made and the remaining open business to decide on. I think we're going to try to wrap this whole thing up. It would be nice to reach something you also can agree with, but it looks like there will be a few items of business, such as the optimal algorithm item above, that I (with backing from others) may be steadfast on.
Title: Re: Table File Standard Discussion
Post by: abw on May 25, 2011, 12:16:36 am
The longest prefix algorithm can fail to be optimal in a variety of ways. It can fail by being too greedy, when choosing a long text sequence at one point forces it to choose many short text sequences at a later point. This can only happen when your table compresses at least 3 characters into one token (i.e. games that use at most DTE are immune to this weakness). It can also fail by assuming that all tokens have equal hex length, an assumption which the table standard explicitly invalidates. What would happen in the "FiveSix" example of 2.2.4 if the "FiveSix" entry were 3 bytes long instead of 1?

Sometimes all a project really wants is just 4 more bytes in order to fit its masterpiece script in without having to make ASM changes. In cases like those an optimal insertion algorithm might make a significant difference. I say "an" optimal algorithm, since there are many different algorithms that produce optimally encoded output. Just to be clear, I'm not arguing in favour of any particular insertion algorithm. I'm arguing for the freedom for utility authors to choose their own. The longest prefix + backtracking algorithm I rattled off earlier produces optimal output, but its runtime and memory requirements tend to grow exponentially with the input length, making it prohibitively expensive in practice. After getting bored waiting for a single medium-length string to encode, I ended up abandoning the longest prefix idea altogether and created a different optimal insertion algorithm that runs in roughly linear time and memory instead. It took a couple hours to get everything working right, but the end result is only about 50 lines of code and it chews through a 200KB script in under a second.
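
To give a flavour of how that can work: a dynamic program over string positions that minimizes total hex length gets you optimal output in roughly linear time for a fixed table (a real implementation would use a trie to keep the per-position work small). A rough Python sketch of that general idea, with the table and names invented for the example:

Code:
TABLE = {"Five": "12", "SixSeven": "13", "FiveSix": "00"}   # text -> hex

def encode_optimal(text):
    n, INF = len(text), float("inf")
    best = [INF] * (n + 1)        # best[i] = fewest bytes needed to encode text[i:]
    choice = [None] * (n + 1)     # token chosen at position i in an optimal encoding
    best[n] = 0
    for i in range(n - 1, -1, -1):
        for tok, hexseq in TABLE.items():
            if text.startswith(tok, i) and best[i + len(tok)] != INF:
                cost = len(hexseq) // 2 + best[i + len(tok)]
                if cost < best[i]:
                    best[i], choice[i] = cost, tok
    if best[0] == INF:
        return None               # the string cannot be fully encoded
    out, i = [], 0
    while i < n:
        out.append(TABLE[choice[i]])
        i += len(choice[i])
    return " ".join(out)

print(encode_optimal("FiveSixSeven"))   # "12 13" (2 bytes); greedy "00" would strand "Seven"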

While on the topic of insertion algorithms... how are you handling table switching? I've been thinking more about this, and in the general case it presents an even larger can of worms than I was anticipating. I have yet to come up with a reliable method that doesn't involve breaking out old CS textbooks  :-\.

c.) It's not in Perl
Oh, I am wounded :P. Maybe it's different with .NET, but grubbing through the Win32 API always left me feeling dirty, so I can understand your reaction to perl. I also agree that Windows and perl are not a frequent combination. In theory, installing perl on Windows is analogous to installing the .NET framework, so if people are willing to do one, they might be willing to do the other. I do see threads every now and then where people ask about utilities for Linux or Mac, and many flavours of those OSes come with perl already installed. It is entirely possible that my utilities may end up being about as popular as my translations ;). I do plan on releasing it publicly (this includes source code, of course), and will at least note any intentional deviations from the standard, assuming I make any. Dumping will likely be 100% compliant, but insertion will likely not be, since I feel guaranteeing encoding optimality is superior to just hoping for it.

I hope I do not come off as condescending as a result.
Not usually, no. I was tempted to suggest spending a few weeks working on approximation algorithms for NP-complete problems before cracking wise about mountains and molehills, but it's all good fun  :P. One of the nice things about this place (due in large part to your own influence, I think) is that people with divergent viewpoints can have discussions like these in a reasonably mature manner. I think it's been mutually beneficial even if the television audience did fall asleep  :huh:.
Title: Some Basic Points
Post by: Tauwasser on May 25, 2011, 06:27:51 pm
First of all, this post will be pretty long and I would like to apologize to abw if this seems like I jump on his posts only. I originally discussed the table file standard with NC back on his own board and we pretty much figured out a way to do it. I'm also quite late to the party, which is why I cover almost every second point you make here.

except for post-processing [for table switching] of overlapping strings with bytes interpreted by different tables, which is still a mess

Can you elaborate on this one?

perhaps text sequences containing <$[0-9A-Fa-f][0-9A-Fa-f]> should be forbidden.

While developing the standard, we opted for simpler text checking algorithms, so we decided to give the user the power to do this at the cost of possible mix-ups. NC tried to tone complexity down as much as possible. Even regular expressions are considered a hindrance here, which I will elaborate on further down below. However, I would support and have already offered to design regular expressions for identifying entry types, so we might as well disallow it.

Also, it might be more appropriate to include the section on inserting raw hex in 2.2.4 instead of in 2.2.1.

It seemed logical to include it there, because that way, the dumping and insertion process for hexadecimal literals is completely defined, instead of breaking these two up. 2.2.4 doesn't deal with literals at all right now.

Also also, it might be worth mentioning that hex characters are equally valid in upper or lower case (e.g. "AB" == "ab" =="Ab" == "aB").

This is addressed in 2.2.1 in the following manner:

Code:
"XX" will represent the raw hex byte consisting of digits [0-9a-fA-F].
Control Codes in End Tokens Requiring Hex Representation

I admit that having these commenting operations and the dumped, to-be-inserted text in the same file has always irked me, and I personally handle it differently, i.e. no line breaks and no content mixing.
I felt like this was an Atlas-specific hack, easily remedied by the user taking action after dumping. However, I also felt that, done properly, one could easily address this with regular expression grouping and get user-customizable behavior. Regular expressions, though, are considered "one step too far" right now.

When inserting, should control codes be ignored for all tokens, or just end tokens requiring hex representation?

The only control codes that are currently implemented are line breaks ― again, for simplicity, in a fashion that makes it impossible to express a literal "\n" as opposed to the control code '\n' ― and per 2.3 they are to be ignored by insertion tools:

Code:
These codes are used by dumpers only and will be ignored by inserters.
An additional burden here is the different line end control codes used by different OSes. Basically, we might have 0x0D 0x0A, or 0x0A, or 0x0D. This also favors completely ignoring line ends, because it cannot be assured that some text editing program doesn't silently convert from the dumping standard to the OS standard, in which case the insertion tool would not find the proper bytes in the text file.
On the other hand, "OS-independent" ReadLine functions do exist and will, worst case, read two lines instead of one for 0x0D 0x0A. Therefore, by ignoring the number of line breaks and empty lines, we actually gain a little bit of stability here.
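
To illustrate, a minimal sketch (Python; the function name is just for illustration) of the "ignore line ends" approach on the insertion side: strip all three conventions before tokenizing, so silent conversions between 0x0D 0x0A, 0x0A, and 0x0D cannot change what gets inserted.

Code:
def strip_line_breaks(script):
    # Drop 0x0D 0x0A, bare 0x0D and bare 0x0A alike before tokenizing.
    return script.replace("\r\n", "").replace("\r", "").replace("\n", "")

assert strip_line_breaks("text<end>\r\n\r\n//") == strip_line_breaks("text<end>\n//") == "text<end>//"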

The standard makes no definition of what constitutes a "name".

This should currently match [^,]*, i.e. any string that does not contain a comma. I would be willing to settle for [0-9A-Za-z]* in the light of not wanting to deal with Unicode confusables or different canonical decompositions of accented letters etc.

As for uniqueness of labels, I have to admit I was silently going for uniqueness in each type, but this might have to be discussed again.

While on the topic of uniqueness, it might be worth including a note in the standard that restricts the definition of a "unique" entry to the current logical table.

Good point :)

Eh? Oh. Hmm. Somehow I was under the impression that the standard supported multiple logical tables within the same physical file. At the very least, it isn't explicitly forbidden, and I see no difficulty in supporting it (having already done so myself) if we impose two more constraints:
1) a table's ID line must precede any entries for that table
2) if a file contains multiple tables, every table in that file after the first table must have an ID line
(in fact I require an ID for every table used, but I don't feel that condition should be imposed as part of the standard)

I think this did not occur to anybody, simply because one file per table and one table per file is the way it has always been. I feel we should leave it that way and be specific about it.

Linked Entries
Attempting to insert a linked entry which lacks its corresponding <$XX> in the insert script should generate an error under 4.2, right?

Exactly. It would basically be a sequence that cannot be inserted according to general insertion rules. Like for instance "福" cannot be inserted when it is not defined in any table.

Under the "wishing for the moon" category:
It would be nice if we could define linked entries in such a way that we could obtain output with both pre- and post-fix
[...]

[W]e lack a general escape character, you couldn't determine which commas were field delimiters and which were data :(. It might also be nice to be able to specify how raw bytes were output in case you don't want the <$XX> format.

Both of these ideas were sacrificed for the sake of simplicity :(

So if instead of "A<Color><$A4><$34>a" we only have "A<Color><$A4>a", you'd want to insert "A0 E0 A4 21"? How does this interact with table switching? Continuing my table switching example, what does the insertion process look like when instead of "<Color><$25><$25>C1", we're trying to insert "<Color><$25>C1" starting with table1?

I couldn't find your Color example. Nevertheless, parsing of this would become a search for a normal entry, because a linked entry cannot be found. If a normal entry exists, the whole thing is obviously not ambiguous and insertion will progress as one would expect. On the other hand, when no normal entry can be found for the label that was inadvertently misplaced, the insertion tool will not know what to do and should produce an error.

Ideally, if I dump a bunch of hex to text and then reinsert that text unaltered, my basic expectation is to get the original hex again.

This expectation is a false premise, really. You open an image file saved with one program and save a copy in another program. You will likely find that the compression changed, or that the picture was silently up-converted from 16-bit RGB to 32-bit RGB, etc. What matters in the end is that the final result is equivalent. Now in the case of text, this means the following:

You dump text and insert it back. It should display as the same thing when the game is running. When it does not ― barring edge cases we explicitly forbade, such as <$xx> ― it just means your table file maps two entries to the same text, which is really not the same text. A common example would be having two different text engines that happen to share parts of their encoding (or were even built from one encoding!). In such cases, entry 0x00 might be displayed as "string" in engine A and entry 0x80 might be displayed as "test" in engine B. However, as long as one engine's code fails to render in the other engine, the strings must not be considered the same to begin with, because they are not. In such cases, dumping text per engine might be an option, or marking entries with custom tags, etc.
Title: Algorithm Selection
Post by: Tauwasser on May 25, 2011, 06:40:33 pm
New business:

I disagree with the "longest prefix" insertion rule from 2.2.4. With a slightly altered example:
12=Five
13=SixSeven
00=FiveSix
the "longest prefix" rule makes it impossible to insert the string "FiveSixSeven", despite that being valid output from a dumper. A less greedy algorithm would be able to insert "12 13" instead, which seems superior to inserting "00" but losing the "Seven".

You do? I'd love to hear your algorithm, especially if we happen to add a few more table values to the mix:

02=F
03=i
04=v
05=e
06=Fi
07=ve
08=ive

Now, what are you going to insert and how are you going to determine it? ;)

That is a basic depth-first tree search, so it's not overly complicated to implement. However, as we all know, complexity is O(b^n), where n is the maximum number of levels and b is the average number of branches. Another way to think of this is as a transducer, which will naturally only finish on valid paths. The only criterion is that we need to find a shortest path, not enumerate all shortest paths.
Basically, an A* search with cost in bytes and a basic bytes-per-letter heuristic will do. This could also be expanded for the target language and occurrence frequency inside the script, to accommodate the simple fact that having lots of table entries doesn't mean all of them get used with the same probability. However, basic A* with cost in bytes and even a heuristic of zero will work out the shortest path directly. Since the heuristic must not overestimate the byte count, and a single table entry mapping 1 byte to more than 1 letter already means bytes per letter < 1, we are really dealing with a (0,1) range of possible bytes-per-letter values here, so even an ideal heuristic for the source file will have little impact on finding the right way. The only nitpick here is that normal entries like 3 bytes = 1 letter may also exist, in which case the average bytes per letter would be > 1 (when entry probability is not calculated). However, since a heuristic of 0.1 bytes per letter will still be admissible, because it doesn't overestimate, the (first) shortest path will still be found.

Code: [Select]
01=F
02=i
03=v
04=e
05=Fi
06=ve
07=ive
08=iveS
09=ixSeven
10=Six
11=Seven

See PDF (https://sites.google.com/site/tauwasser2/A_star.pdf?attredirects=0&d=1).

Cost is 1.2 at the start with h(x) = 0.1 * x where x is letters left in buffer. Simple mean for bytes per letter would be 11/30 (30 letters per 11 bytes), so we're somewhat far away from even close to optimal and we will see some jumping towards the end because of this.

Before you ask, NC, this can be programmed with a simple stack that pushes node IDs back with reordering or a list that is sorted after every insertion of node(s).

I hope to have demonstrated that this is neither a laughable nor an impossible claim or problem. However, having an admissible heuristic here is key, and a simple (unweighted) mean will most likely not do because of outliers; one would need to use the median or mode or some analysis of the input first. The degenerate cases are h(x) = 0 for all x, which turns A* into plain uniform-cost (essentially breadth-first) search, and an overestimating h(x) = x * [length of longest hex sequence in table], which pushes it toward greedy depth-first behavior and forfeits the optimality guarantee.
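To make this concrete, here is a minimal sketch of such an A* tokenizer. The h(x) = 0.1 * x heuristic and the single-byte entries mirror the example above; the function name and data layout are my own illustration, not anything mandated by the standard.

Code: [Select]
import heapq

def a_star_insert(text, table, bytes_per_letter=0.1):
    """Tokenize `text` into hex values from `table` using A* with cost in bytes.
    `table` maps hex strings (single-byte entries here) to text sequences."""
    # Frontier entries: (estimated total cost, cost so far, position, tokens).
    frontier = [(bytes_per_letter * len(text), 0, 0, [])]
    best = {}
    while frontier:
        _, cost, pos, tokens = heapq.heappop(frontier)
        if pos == len(text):
            return tokens                      # cheapest complete tokenization
        if pos in best and best[pos] <= cost:
            continue                           # already reached this position more cheaply
        best[pos] = cost
        for hex_value, seq in table.items():
            if text.startswith(seq, pos):
                new_cost = cost + len(hex_value) // 2          # cost in bytes
                h = bytes_per_letter * (len(text) - pos - len(seq))
                heapq.heappush(frontier,
                               (new_cost + h, new_cost, pos + len(seq), tokens + [hex_value]))
    return None                                # string cannot be tokenized at all

table = {"01": "F", "02": "i", "03": "v", "04": "e", "05": "Fi", "06": "ve",
         "07": "ive", "08": "iveS", "09": "ixSeven", "10": "Six", "11": "Seven"}
print(a_star_insert("FiveSixSeven", table))    # ['01', '08', '09'] -> 3 bytes

The priority queue here plays the role of the "list that is sorted after every insertion of node(s)" mentioned above.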

In the absence of table switching, I would model my insertion algorithm [...] on perl's regular expression engine. [...] the engine is generally longest prefix, but adds the concept of backtracking - if it's not possible to complete tokenizing the string based on your current tokenization, go back a little bit and try a different tokenization.

This is basically A* over tokens with just text length as the cost, which is what all regex engines do for a greedy star. However, I'm currently not sure how this would work with the added complication that some tokens take multiple bytes, which changes the cost function from the same cost for every token to a cost per tokenized token.
Having said that, a way to implement this via general purpose regex engines would probably be more accessible in more programming languages.

You've just increased complexity by 10x for an otherwise trivial task.

Neither tokenization nor optimal search is a trivial task. Indeed, defining the search problem itself mathematically is not a trivial task. That's why so much brain power (and money) went into things like SQL queries and the like.

Has anybody ever actually written an inserter that behaves like this?

I'm not sure what you're thinking of as an inserter in this case, but pretty much every compression scheme that does not use sliding-window techniques will have to use a backtracking algorithm alongside a good heuristic to be optimal. So yes, people have written inserters that insert input data into an output file while contemplating the size and combination of mappings. It might just have been for binary data or the like.

Ran this one by Klarth. "Bad token selection can occur sometimes, but I'd estimate it very rare for it to be detrimental...unless it's a "gotcha table". The optimal algorithm is simply out of reach for most, and non desirable for the rest of us. Just because it may be more optimal doesn't mean it's desirable or the best choice for the job.

Not sure where the quote ends, so I assume it's all Klarth's. If so, I'd like to see him defend the POV that an optimal algorithm is non-desirable, because I cannot think of a single argument except burden of implementation, which he ruled out. Once a provably optimal algorithm is used, why is it not desirable? Speed-wise we can brute-force some 100 kB in a few seconds, so speed doesn't seem to be the issue here, does it?



After getting bored waiting for a single medium-length string to encode, I ended up abandoning the longest prefix idea altogether and created a different optimal insertion algorithm that runs in roughly linear time and memory instead. It took a couple hours to get everything working right, but the end result is only about 50 lines of code and it chews through a 200KB script in under a second.

I'm quite interested how you achieved and proved linear time/memory. Please don't hesitate to elaborate.

While on the topic of insertion algorithms... how are you handling table switching? I've been thinking more about this, and in the general case it presents an even larger can of worms than I was anticipating.

Since I'm the guy that introduced this, I can only say that ― yes ― it does pose a problem. However, my naïve solution was to use ordered data sets and do a longest-prefix over all tables. You can still do this with A*, where path cost is adapted to table switching as well, i.e. the path to an entry with switched table from another table is the cost in bytes in the new table plus the cost in bytes from the switching code in the old table. This of course, needs just some bookkeeping to know for each explored node in which table it belongs and how many matches the table can have before switching back.

Since I have thought about it for some time now, I would actually like to introduce the notion of leveled compliance. Basically, one could incorporate into the standard more than one notion of insertion, whereby tools can then specify which version(s) they comply with. This way, tool x can comply with a) longest prefix, b) the proposed A* method (or maybe just a depth-first search etc.) and c) its own method. IMHO, this would also preclude utility authors inventing their own insertion technique but not indicating it anywhere. Table files themselves would be required to stay the way they are for compatibility with other tools once the final draft is done. Versioning will likely happen anyway once some <glaring oversight that really should have been caught and everybody feels miserable about once the standard is out> has been identified and rectified in a subsequent version of the standard.

cYa,

Tauwasser
Title: Re: Table File Standard Discussion
Post by: Klarth on May 26, 2011, 12:49:58 am
When it comes to longest match vs hex-saving, there are a few things to consider.  The speed decrease for hex-saving is negligible.  The space saved is equally negligible.  The implementation time (and ease) is strongly in favor of longest match.  The ease of understanding to the end user is probably in favor of longest match.

You can formulate scenarios where erroneous or non-optimal insertions could happen with both algorithms.  This is where best practices/experience can come in.  Such as representing a no-default-name renameable character's name with [Hero] rather than just plain Hero.  So in case a townsperson talks about a "Legendary Hero", the game won't print the character's name.

On the space savings part.  The FiveSix example is rooted in substrings and how to insert them.  There isn't an optimal substring analysis tool that I'm aware of geared towards romhacking.  ScriptCrunch computes a fresh longest substring analysis for each entry (which is also pretty slow thanks to my poor implementation).  So the savings will be pretty negligible in a large script unless you guys write both "optimal" inserter and "optimal" substring analysis tools.  If you implement those, you might bump a greedy 30% substring compression up to 31%-33%.  Maybe.

Longest match has been a pretty solid algorithm over the years and it'll take a bigger argument than the small potential space saving (think reducing real world erroneous insertion) for me to implement it instead of features that can help insert scripts that were previously impossible.  But I'd be fine with the standard saying that either algorithm is acceptable.

On a more ideological side: it's real world vs text book.  If you're writing your thesis, you can spend days designing and implementing an insertion algorithm if you want.  In the real world, at some point it becomes time to move on and implement real features so a script can actually get inserted.  Having that kind of practicality is central to Atlas: wider variety of scripts inserted, lowering the learning curve, and ways to limit Atlas command clutter.  As Patton said, "A good plan violently executed now is better than a perfect plan executed next week."
Title: Re: Some Basic Points
Post by: abw on May 26, 2011, 11:04:32 pm
First of all, this post will be pretty long and I would like to apologize to abw if this seems like I jump on his posts only. I originally discussed the table file standard with NC back on his own board and we pretty much figured out a way to do it. I'm also quite late to the party, which is why I cover almost every second point you make here.
No problem. This thread must be in the running for the prestigious "Most Words Per Post" award anyway, so why stop now? Anyway, you end up agreeing with me on most of the points I care about :P.

@Nightcrawler
As an aside, I didn't receive any email notifications about Tauwasser's posts. Does this have something to do with the thread title changing? For reference, I did get email when this topic was split off from Program Design and Data Structure (TextAngel) (http://www.romhacking.net/forum/index.php/topic,12462.0.html), and I also received notification of Klarth's post under the original thread title.


except for post-processing [for table switching] of overlapping strings with bytes interpreted by different tables, which is still a mess

Can you elaborate on this one?
Certainly. The desired post-processing I referred to is described in more detail here (http://www.romhacking.net/forum/index.php/topic,8945.0.html). As for the bit about different tables: given a byte range x, it is conceivable that a game could read x once using table A and then read x (or a subset of x) again using table B, producing different text in each case. I'm not sure whether that would qualify as genius or insanity. In either case, adding a check to ensure the tokenizations of overlapping strings agree solves this issue and the multibyte token overlap alignment issue, both of which I believe to be unlikely to occur in a commercially released product.


Also, it might be more appropriate to include the section on inserting raw hex in 2.2.4 instead of in 2.2.1.

It seemed logical to include it there, because that way, the dumping and insertion process for hexadecimal literals is completely defined, instead of breaking these two up. 2.2.4 doesn't deal with literals at all right now.
On the other hand, 2.2.1 specifically refers to dumping ("No-Entry-Found Behavior for Dumping"). The treatment of hexadecimal literals applies to multiple sections, so it might be cleaner to split it off into a separate section.


Code: [Select]
"XX" will represent the raw hex byte consisting of digits [0-9a-fA-F]....
Code: [Select]
These codes are used by dumpers only and will be ignored by inserters.
Oh good. I'm glad I just overlooked those rather than them not being in the standard at all. Thanks!


Control Codes in End Tokens Requiring Hex Representation

I admit having these commenting operations and dumped and to-be-inserted text in the same file always irked me and I personally handle it differently, i.e. no line breaks and no content mixing.
I'm interested in hearing more about your approach.

I felt like this was an atlas-specific hack and easily remedied by the user taking action after dumping.
It does add an extra step to the process, but s/<([^$>]*)>/<$1>\n/g takes care of any use case I foresee myself having.


The standard makes no definition of what constitutes a "name".

This should currently match [^,]*, i.e. any string that does not contain a comma. I would be willing to settle for [0-9A-Za-z]* in the light of not wanting to deal with Unicode confusables or different canonical decompositions of accented letters etc.
I'm pretty sure we're not talking about the same thing here. That would group end tokens' "endtoken" in with linked entry's "label", table id's "TableIDString", and table switch entry's "TableID", all of which may contain any character except comma. I understand the comma restriction for entries using comma as their internal delimiter (i.e. linked and table switch entries) and how that restriction follows through to other entities referenced by those entries (i.e. @TableIDString lines), but I'm not sure what we gain by applying the same restriction to end tokens. If that actually is the intent, then at the very least
Code: [Select]
Any combination of formatting control codes and text representation may be used. This allows for nearly all variation of string ends.
in 2.4 should be amended. Restricting "endtoken" to [0-9A-Za-z]* would disallow end token text sequences like "<end>", which seems undesirable. That does bring up another point I keep forgetting to mention: linked entry labels and table ID strings (whether in @ or ! form) should probably not be allowed to be the empty string.


As for uniqueness of labels, I have to admit I was silently going for uniqueness in each type, but this might have to be discussed again.
This is where best practices/experience can come in.  Such as representing a no-default-name renameable character's name with [Hero] rather than just plain Hero.  So in case a townsperson talks about a "Legendary Hero", the game won't print the character's name.
A section in the standard about best practices would probably be useful. I remember making that very mistake when I first started :P.

I think the intent of the uniqueness condition is to prevent the creation of tables which encourage insertion errors. What do people think about taking this a step further by providing some mechanism for denoting table entries as generic in-game control codes and enforcing uniqueness of text sequences based on that?

As one possible approach, we could introduce another entry prefix character such as # (or even hijack the currently pointless $hex=label,0 construct) to identify in-game control codes, require that the first and last characters of the text sequence (or label) do not occur in the text sequences of any normal entry or as anything other than the first or last character of any non-normal entry, then check for uniqueness of all non-normal entries (probably across table boundaries?). This would draw a clear distinction between text which is a candidate for table compression and text which is not. Under such a scheme, an entry like #hex=[Hero] would guarantee that the string [Hero] would never be parsed as any other token during insertion while still allowing personal preference for other styles ("<Hero>", "~Hero~", etc.) and flexibility in cases where [ or ] appear as normal entries.


I think this did not occur to anybody, simple because one file per table and one table per file is the way it has always been. I feel we should leave it that way and be specific about it.
In the interests of cross-utility compatibility, I definitely agree we should be specific about it. Allowing multiple tables per file has some organizational advantages for the end user, but if it's going to cause problems for other utilities, the costs may outweigh the benefits.


Linked Entries
Attempting to insert a linked entry which lacks its corresponding <$XX> in the insert script should generate an error under 4.2, right?

Exactly. It would basically be a sequence that cannot be inserted according to general insertion rules. Like for instance "福" cannot be inserted when it is not defined in any table.
[...]
I couldn't find your Color example. [...]
The <Color> example comes from 2.5, and your description of the parsing process is also how I thought it should occur. The alternative seems like an excellent recipe for mangling the rest of the insertion script, particularly so in the presence of multibyte entries. Nightcrawler disagrees with us, alas :'(.


Ideally, if I dump a bunch of hex to text and then reinsert that text unaltered, my basic expectation is to get the original hex again.

This expectation is a false premise, really.
As a hard and fast rule, yes, it is. I did make allowances for differences in compression and equivalent text sequences. However, the example as I gave it results in a loss of textual integrity - the original text "<$01>=this is a '<$01>' string" becomes "==this is a '=' string" after being dumped and re-inserted without modification, which is why I then argued for disallowing <$XX> sequences in table entries. Once that change makes its way into the standard, my example will become invalid and order will be restored to the universe :P.
Title: Re: Some Basic Points
Post by: Klarth on May 26, 2011, 11:58:06 pm
I'll quickly address why linked entries are necessary.  Say you have an entry, $80, which has one parameter byte and represents how many options there are in a dialogue user choice box.  You can do $80=<selectoption>,1 or you can do the following:
Code: [Select]
8000=<selectoption><$00>
8001=<selectoption><$01>
...
80FF=<selectoption><$FF>

Manual entry allows you to clean up the format, but linked entries are a way to expedite this tedious work when meaning is not necessary.  Speaker portraits are sometimes necessary to define.

The next point is on verification.  I don't verify linked entry parameter bytes during insertion in my TableLib for simplicity and that was a minor oversight.  Most linked entries (usually outside of text) will be transitioned well.  But there's a large margin for error when it comes to linked entries inside of dialogue that will be translated/edited and that needs to be mitigated.

So there are two ways to do so: one is to keep the current $XX=<token>,Y and read Y hex-formatted bytes afterwards.  The second, a bit more complex, is to implement a format string like $XX=<token %2X>, which can clean things up a bit (a single control code and its parameters will be clumped together inside of angle brackets in this case).  I won't advocate for either way yet, and that string format is just an example.  You could do $XX=<token>,%2X,%2X to print <token $01 $E0>, for example.
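Purely as an illustration of how a dumper might expand such a format string (the %2X notation is only the example above, and the helper below is hypothetical):

Code: [Select]
import re

def expand_linked_entry(format_string, param_bytes):
    """Replace each %2X with the next parameter byte, printed as two
    uppercase hex digits, e.g. "<token %2X %2X>" -> "<token $01 $E0>".
    Assumes there are at least as many parameter bytes as %2X fields."""
    params = iter(param_bytes)
    return re.sub(r"%2X", lambda _: "$%02X" % next(params), format_string)

print(expand_linked_entry("<token %2X %2X>", [0x01, 0xE0]))   # <token $01 $E0>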

The last point is that linked entry is a terrible term.  I should've made up something sensible instead of continuing the name from the Thingy readme.
Title: Re: Algorithm Selection
Post by: abw on May 27, 2011, 12:06:18 am
I'm quite interested how you achieved and proved linear time/memory. Please don't hesitate to elaborate.
Gladly.

The actual calculation is specific to the input table and string, so I said "roughly" linear time/memory to gloss over the details. The idea of the algorithm is to find every valid token for each position in the string (starting from the end of the string and working backwards), and for each valid token, determine the minimum forward cost that must be incurred by any complete tokenization with a token starting at the current position. After we've done all that, starting at the beginning of the string, we simply take the minimum cost token for the current position, move to the string position following the chosen token, and repeat until the end of the string is reached.

For each position, then, we'll need to remember the minimum cost and the token which produces the minimum cost. Memory requirements are thus limited to the size of the table and 3 * n, where n is the (character) length of the string, plus a small, constant amount of storage space (for e.g. loop index variables).

Run time is dominated by the sum (for x = 0 to n) of v(x), where v(x) is the cost of finding all valid tokens at position (x). In the worst case (e.g. every token matches at every position [except maybe near the end of the string]), v(x) is bounded by m, where m is the length of the longest text sequence in the table, giving us a runtime of O(m*n) (I'm using a pre-constructed trie (http://en.wikipedia.org/wiki/Trie) for O(m) valid token lookup). In practice, however, large values for v(x) are rare (in the particular case of DTE, m and thus v(x) are at most 2), so the cost of finding valid tokens becomes negligible in comparison to the string length.
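(For the curious, the trie is nothing exotic; a dict-of-dicts sketch like the following is roughly what I mean, though the representation itself is just one possible choice:)

Code: [Select]
def build_trie(table):
    """Build a dict-of-dicts trie over the table's text sequences; a leaf
    node stores the entry's hex value under a sentinel key."""
    root = {}
    for hex_value, seq in table.items():
        node = root
        for ch in seq:
            node = node.setdefault(ch, {})
        node[None] = hex_value          # sentinel marks a complete token
    return root

def tokens_at(trie, text, pos):
    """Yield (hex_value, length) for every table token matching text at pos."""
    node, length = trie, 0
    for ch in text[pos:]:
        if ch not in node:
            break
        node, length = node[ch], length + 1
        if None in node:
            yield node[None], length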

Running your example under my algorithm (elided steps are left as an exercise for the reader :P):

Code: [Select]
01=F
02=i
03=v
04=e
05=Fi
06=ve
07=ive
08=iveS
09=ixSeven
10=Six
11=Seven

Text:                    F           i           v           e           S         i           x           S           e           v           e           n           (end of string)
Position:                0           1           2           3           4         5           6           7           8           9           10          11          12

Possible Tokens,         F(3)*       i(4)        v(4)        e(3)*       Six(2)*   i(-)        -(-)        Seven(1)*   e(-)        v(-)        e(-)        -(-)        -(0)
(Costs), and             Fi(4)       ive(3)      ve(3)*                            ixSeven(1)*
Selectability*:                      iveS(2)*

(12) end of string shown for illustrative purposes only [cost is 0 by definition]
(11) no valid tokens [cost is undefined]; no token to remember
(10) we find 04=e [cost is 1 + cost of (11): 1 + undefined = undefined]; no token to remember
...
(7)  we find 11=Seven [cost is 1 + cost of (12): 1 + 0 = 1]; remember 11=Seven
(6)  no valid tokens [cost is undefined]; no token to remember
(5)  we find 02=i [cost is 1 + cost of (6): 1 + undefined = undefined] and 09=ixSeven [cost is 1 + cost of (12): 1 + 0 = 1]; remember 09=ixSeven
...
(2)  we find 03=v [cost is 1 + cost of (3): 1 + 3 = 4] and 06=ve [cost is 1 + cost of (4): 1 + 2 = 3]; remember 06=ve
(1)  we find 02=i [cost is 1 + cost of (2): 1 + 3 = 4], 07=ive [cost is 1 + cost of (4): 1 + 2 = 3], and 08=iveS [cost is 1 + cost of (5): 1 + 1 = 2]; remember 08=iveS
(0)  we find 01=F [cost is 1 + cost of (1): 1 + 2 = 3] and 05=Fi [cost is 1 + cost of (2): 1 + 3 = 4]; remember 01=F

Then comes the fun part: starting with (0), we select 01=F, move to (1), select 08=iveS, move to (5), select 09=ixSeven, move to (12), and we're done! Since we disqualify impossible tokenizations while calculating each position's token/cost, we know the string can not be tokenized if we move to a position with no remembered token.
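To make the cost/token bookkeeping concrete, here is a minimal sketch of that same backwards pass (every entry is assumed to cost 1 byte, and a plain scan of the table stands in for the trie lookup sketched earlier; both are simplifications for illustration only):

Code: [Select]
def min_cost_tokenize(text, table):
    """`table` maps hex values to text sequences; every entry costs 1 byte here.
    Returns the cheapest tokenization of `text`, or None if none exists."""
    n = len(text)
    INF = float("inf")
    cost = [INF] * (n + 1)        # minimum bytes needed to encode text[pos:]
    token = [None] * (n + 1)      # token remembered at each position
    cost[n] = 0                   # end of string costs 0 by definition
    for pos in range(n - 1, -1, -1):
        for hex_value, seq in table.items():       # a trie lookup in the real thing
            if text.startswith(seq, pos) and 1 + cost[pos + len(seq)] < cost[pos]:
                cost[pos] = 1 + cost[pos + len(seq)]
                token[pos] = hex_value
    if cost[0] == INF:
        return None               # some position along every path had no remembered token
    result, pos = [], 0
    while pos < n:                # walk forward, always taking the remembered token
        result.append(token[pos])
        pos += len(table[token[pos]])
    return result

table = {"01": "F", "02": "i", "03": "v", "04": "e", "05": "Fi", "06": "ve",
         "07": "ive", "08": "iveS", "09": "ixSeven", "10": "Six", "11": "Seven"}
print(min_cost_tokenize("FiveSixSeven", table))   # ['01', '08', '09']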

Other benefits:
This algorithm can be easily adapted to give precedence to raw byte sequences <$XX> by undefining the token/cost for the positions covered by <$XX>.
The trie can easily include tokens from all tables accessible from the starting table, which in a multiple table context gives us access to every possible tokenization of a string (across table boundaries) without having to list them all (listing is of course O(2^n)).
The algorithm can be split into multiple passes (finding tokens on the first pass, calculating costs on the second pass), which in a multiple table context gives us a chance to do any extra validity checks that may be required before cost analysis is begun using full knowledge of all possible tokenizations.


While on the topic of insertion algorithms... how are you handling table switching? I've been thinking more about this, and in the general case it presents an even larger can of worms than I was anticipating.

Since I'm the guy that introduced this, I can only say that ― yes ― it does pose a problem. However, my naïve solution was to use ordered data sets and do a longest-prefix over all tables. You can still do this with A*, where path cost is adapted to table switching as well, i.e. the path to an entry with switched table from another table is the cost in bytes in the new table plus the cost in bytes from the switching code in the old table. This of course, needs just some bookkeeping to know for each explored node in which table it belongs and how many matches the table can have before switching back.

Since I have thought about it for some time now, I would actually like to introduce the notion of leveled compliance. Basically, one could incorporate into the standard more than one notion of insertion, whereby tools can then specify which version(s) they comply with. This way, tool x can comply with a) longest prefix, b) the proposed A* method (or maybe just a depth-first search etc.) and c) its own method. IMHO, this would also preclude utility authors inventing their own insertion technique but not indicating it anywhere. Table files themselves would be required to stay the way they are for compatibility with other tools once the final draft is done. Versioning will likely happen anyway once some <glaring oversight that really should have been caught and everybody feels miserable about once the standard is out> has been identified and rectified in a subsequent version of the standard.

cYa,

Tauwasser

I have more to say about this (and Klarth's post), but it is now past my bedtime :(.
Title: Re: Algorithm Selection
Post by: Nightcrawler on May 27, 2011, 03:20:23 pm
I hope to have demonstrated that this is neither a laughable nor impossible claim or problem.

I cannot think of a single argument except burden of implementation, which he ruled out. Once a provably optimal algorithm is used, why is it not desirable? Speed-wise we can do brute force of some 100 kB in a few seconds, so speed doesn't seem to be the issue here, does it?

I think you got the wrong inference. I wasn't implying it was laughable or impossible. I was implying the resulting solution is magnitudes more complicated than a longest match algorithm. Burden of implementation is a very big part of this standard and is argument enough. This is especially true now with the detailed algorithms described in the past few posts. Before, I was arguing on behalf of others that needed simplicity, but now I can simply state that *I* would have difficulty successfully implementing these algorithms. You guys are computer science guys; many of the rest of us are not. I can't support a standard I can't implement, thus it's undesirable. I had to read several sources on A* path finding algorithms to even understand what was posted, let alone implement it in my program.

With that said, I do entertain the idea of freedom of implementation as an alternative. Just because I may be inadequate, I see the merit in not explicitly disallowing you from doing it better, especially for a more optimal output. I have two hesitations.

1.) Ideally, I wanted to see Utility X and Utility Y translate the basic text to the same hex output. I understand that we're less concerned with that and more concerned with it resulting in the same text output in the video game. This is along the lines of the examples you gave with data compression or images.  It just makes life easier for testing, comparisons, and interchangeability if they do it the same way. It would keep things simpler and I like simple.

2.) If freedom of insertion algorithm is given, I'm not sure how to give a satisfying answer to '2.2.4 Text Collisions'. It seems desirable and logical that a simple, straightforward answer should be given on what to do with text collision situations. We do that now, but if it then becomes an algorithm free-for-all, it's not clearly defined. It seems counter-intuitive to the standard to not standardize what to do in that case.

Quote
I'm not sure what you're thinking of as an inserter in this case, but pretty much every compression that does not use sliding window techniques will have to use a backtracking algorithm to be optimal alongside a good heuristic. So yes, people have written inserters that insert input data into an output file while contemplating size and combination of mappings. It might just have been for binary data or the like.

I'm talking about ROM Hacking utilities that can be used for script insertion. To my knowledge there are no public utilities available that use such an algorithm. I'd take an educated guess and say that's due to the unnecessary complexity. That illustrates my point of putting the standard in the realm of nobody using it because it's just too burdensome and/or complicated to implement.

Quote
As for uniqueness of labels, I have to admit I was silently going for uniqueness in each type, but this might have to be discussed again.

Is there any reason why all types can't simply follow the rules of 2.2.6?

The next point is on verification.  I don't verify linked entry parameter bytes during insertion in my TableLib for simplicity and that was a minor oversight.  Most linked entries (usually outside of text) will be transitioned well.  But there's a large margin for error when it comes to linked entries inside of dialogue that will be translated/edited and that needs to be mitigated.

You just exemplified that verifying is not necessary for insertion. In text form, it's already in the format of a normal token followed by raw hex bytes. Doing any extra verifying is a utility issue in my opinion, and is not required for translating text-to-hex. It *IS* required for proper hex-to-text conversion, which is why it is included there for dumping.

Additionally, as I mentioned previously, there have been times where I edited the number of raw hex parameters or the linked entry itself in game, and wouldn't want the original behavior any more. Although I suppose I could always edit my table to reflect this, it was previously unnecessary for insertion. We also seem to blur the line here in our discussions between being able to dump and insert with the same table and expecting changes to be made for the insertion table. Originally, I had always intended a different table to be used for insertion, and many issues would be alleviated, simplified, or not have to be dealt with in the standard. Historically, I have always used a different table for insertion. Others pushed to ensure you can dump and insert with the exact same table, which is fine, but adds a number of nuances to deal with that we otherwise probably wouldn't have.

Title: Re: Algorithm Selection
Post by: Klarth on May 27, 2011, 10:24:12 pm
Additionally, as I mentioned previously, there have been times where I edited the number of raw hex parameters or the linked entry itself in game, and wouldn't want the original behavior any more. Although I suppose I could always edit my table to reflect this, it was previously unnecessary for insertion. We also seem to blur the line here in our discussions between being able to dump and insert with the same table and expecting changes to be made for the insertion table. Originally, I had always intended a different table to be used for insertion, and many issues would be alleviated, simplified, or not have to be dealt with in the standard. Historically, I have always used a different table for insertion. Others pushed to ensure you can dump and insert with the exact same table, which is fine, but adds a number of nuances to deal with that we otherwise probably wouldn't have.
Well, there are two cases that I can think of when modifying linked values in a script.  If you can edit the number of raw hex parameters, it's either a null-terminated parameter list or you've modified the game code to account for this.  In the case of control codes that use null-terminated lists of parameters (I don't know if these exist, but it's plausible), linked values can't tackle this problem because of their defined bytes-after value.  And in the second case of modifying game code, you've changed the underlying representation of the text engine.  Which means a new table for insertion, just as you would do as you transition a game through DTE changes, overwriting the Japanese font, etc.  In a global control code modification like this, it might be nice to have validation to ensure a correct transition.  And if you want to add real validation, the hex values need to be within the linked value tag: i.e. <color $FF>.  Otherwise you lose the context for validation when such tags are in clumps of poorly defined hex output.

You're correct in saying that this validation is not necessary for insertion.  But it still has some value to consider.

Lastly on A*.  I read about the algorithm almost 10 years ago for pathfinding in videogames.  I honestly have no desire to implement an A* insertion algorithm for reasons I discussed two posts ago in longest match vs hex-saving.  I don't yet see a clearly superior algorithm in A* that makes me want to jump to transition.
Title: Re: Table File Standard Discussion
Post by: abw on May 28, 2011, 07:41:07 pm
When it comes to longest match vs hex-saving, there are a few things to consider.
...
I agree with almost everything here. In particular, if a utility author wants to spend more time making the previously impossible possible and less time squeezing a little bit extra out of things that already have reasonable solutions, then that's what they should do and I wish them well.

I had assumed ScriptCrunch produced optimal results (the Algorithm Overview section makes it sound like it should), but upon closer examination, something odd is definitely going on - Dict with DictEntrySize=1, DictMinString=2, DictMaxString=2  produces different results than DTE, and some of its numbers look off (e.g. I feed it files totalling 82K of script [including comments and Atlas commands, which my .ini file excludes] and it says there are 211533 ScriptBytes). Interestingly, the resulting Dict table produced an encoding 4% smaller than that of the DTE table, according to Atlas.

The last point is that linked entry is a terrible term.  I should've made up something sensible instead of continuing the name from the Thingy readme.
I wasn't going to say it, but yeah :P. I've been thinking of them as function calls and parameter bytes. Maybe that's slightly better?

With that said, I do entertain the idea of freedom of implementation as an alternative. Just because I may be inadequate, I see the merit in not explicitly disallowing you to do it better, especially for a more optimal output. I have two hesitations.
1) This would have to be sacrificed, and the resulting complications for testing are lamentable. All other things being equal, I agree that simplicity is to be preferred, but in this case all other things are not equal.
2) The longest text sequence rule for insertion nicely mirrors the longest hex sequence rule for dumping, so you also lose some parallelism :P. In the interests of accessibility, I'd definitely keep longest prefix as a suggested algorithm, since it is easy to explain and understand, but make a note that other algorithms are possible, and that the same text should be output by the video game no matter which algorithm is used.
... except that that doesn't cover cases where longest prefix fails to insert otherwise insertable text. On that note, the "ignore and continue" rule of 2.2.2 further complicates the description of 2.2.4 when different algorithms are allowed. What's the reasoning behind "ignore and continue"? I think I prefer Atlas' "give up and die" approach :P.

> Doing any extra verifying is a utility issue in my opinion
I have a preference for validation, but it's not a strong enough preference for me to be willing to say that validation should be enforced. At least, not yet. Until somebody comes up with a workable algorithm for multi-table insertion (more on that later), I'm not entirely comfortable with finalizing some of these issues, since it might turn out that a different decision is necessary or expedient.

And if you want to add real validation, the hex values need to be within the linked value tag: ie. <color $FF>.  Otherwise you lose the context for validation when such tags are in clumps of poorly defined hex output.
We're already using < and > for representing raw hex bytes, and even if we disallow <$XX> in text sequences, there could still be issues with combining entries like <$X and X>. Allowing < and > only as the first and last character respectively of non-normal entries would almost solve that edge case (and requiring non-normal entries to use < and > as the first and last character respectively, coupled with checking for uniqueness across <$XX> and all non-normal entries, would put that issue [and a few more] to rest for good). I was thinking about just pre- and post-fix strings for linked entries, but a format string is definitely more useful. You could then say things like $hex=<window x=%2X, y=%2X>. What kind of format strings would you allow? The full range of printf?

Putting all of that together would make $ the general in-game control code identifier I was shooting for earlier with #. Salient changes would be:
  • adding a note in 2.2 forbidding < and > in normal entry text sequences;
  • changing "endtoken" (2.4), "label" (2.5), and "TableID" (2.6) to "<label>" where label can contain any character not in [,<>];
  • adding a note about uniqueness of all non-normal entries;
  • explaining how format strings work in 2.5 (each gets replaced with value of next byte) and 2.6 (translate next characters in new table).
There's probably a cleaner way to handle 2.6, since "table2,6" looks nicer than "<table2 %2X %2X %2X %2X %2X %2X>", "table2,0" is perhaps clearer than "<table2>", and it doesn't matter much anyway if table switch tokens aren't output.
Title: Re: Table File Standard Discussion
Post by: Klarth on May 28, 2011, 09:22:36 pm
Normally I wouldn't check out ScriptCrunch because you aren't supposed to use the dictionary approach for DTE, but with that strange savings boost, it makes it worthwhile.  I'll have to try to figure out that size bug too.  Thanks.  I've been working on and mostly off to make a frontend for ScriptCrunch so it's not as ridiculously cumbersome to configure.

Function calls / parameter bytes isn't much better terminology but at least has some meaning.  On the string format, I'd consider formatting support for only 1, 2, 3, and 4 (...maybe 8 too) byte argument printing as hex, decimal, and maybe binary.  Perhaps endian swap support.  No strings/characters because the purpose of linked values is to output hex so that non-text data doesn't get put through the table.
Title: Re: Table File Standard Discussion
Post by: abw on May 30, 2011, 11:14:27 pm
First off, @Klarth:
Interestingly, the resulting Dict table produced an encoding 4% smaller than that of the DTE table, according to Atlas.
This part turns out to be directly attributable to user error (I forgot to strip out the existing DTE from the table I gave to ScriptCrunch, which resulted in the table it gave back to me having two different DTE sections for the same byte range. Atlas was then quite happy to make use of both DTE sections when inserting, which is where the compression gain came from). Apologies for bringing up a red herring :-[.

While on the topic of insertion algorithms... how are you handling table switching? I've been thinking more about this, and in the general case it presents an even larger can of worms than I was anticipating.

Since I'm the guy that introduced this, I can only say that ― yes ― it does pose a problem. However, my naïve solution was to use ordered data sets and do a longest-prefix over all tables. You can still do this with A*, where path cost is adapted to table switching as well, i.e. the path to an entry with switched table from another table is the cost in bytes in the new table plus the cost in bytes from the switching code in the old table. This of course, needs just some bookkeeping to know for each explored node in which table it belongs and how many matches the table can have before switching back.
Tokenizing over all tables (accessible, directly or indirectly, from the starting table) and making tokens remember which table they came from are both pretty easy. I think there are more subtle problems with the cost function for A*, however, and with keeping track of table switch conditions in general.

NB: I'm talking here about operations permitted by the standard (and which thus must be handled by any insertion utility claiming to implement the standard), regardless of the existence of any real-world examples.

The text hacking process (as we all know) is essentially hex_1 -> text_1 -> text_2 -> hex_2 -> text_3, where in-game hex_1 is dumped to text_1 in one (or more) external file(s), text_1 becomes text_2 after some (possibly empty) set of modifications, text_2 is inserted as hex_2, and the new hex_2 is displayed in-game as text_3. We've mentioned a couple of ways that the hex_1 => hex_2 conversion process can result in a different hex sequence being inserted than was originally dumped, and how the display of in-game text is the guiding principle for determining whether such variation is acceptable, i.e. text_2 and text_3 must be the same text. In order to achieve this goal, our text -> hex insertion process must mirror the game's hex -> text display process, which by assumption is identical to our hex -> text dumping process.

This problem would be much easier if every group of tables had the ability to switch amongst themselves with a 0 count. Instead, the only path from table A to table B may necessitate switching through table C, and each switch brings with it certain conditions that must be fulfilled by subsequent tokens. These conditions come in three types:
- to match a specified number of tokens in the new table and then return to matching tokens in the current table;
- to match an unspecified number of tokens in the new table and only return to matching tokens in the current table when it is no longer possible to match another token in the new table; or
- the switch conditions for all tables may be prematurely satisfied by reading an end token.

When dumping hex -> text, an unassociated hexadecimal literal can only be output when it is no longer possible to match the next token in any of the tables contained in the table stack at that point in the dumping process. For insertion, then, an unassociated hexadecimal literal should be taken as a signal to empty the table stack and continue matching tokens with the starting table. Unfortunately, hexadecimal literals are essentially wildcards. As is the case with linked tokens lacking their literals, unassociated literals can combine with subsequently inserted hex in unintended ways, some of which may produce valid tokens in some table on the stack. Even without that danger, they cannot reliably be assigned to any specific table, which means they cannot reliably count as a match in any specific table, which in turn makes it much more difficult to determine whether all switch conditions have been satisfied.

Maybe there's a way I'm just not seeing, but put together, I think these factors break any A* heuristic. My single-table insertion algorithm doesn't appear to fare any better. My backup plan, alas, is much more complicated (build a CFG (http://en.wikipedia.org/wiki/Context-free_grammar) out of all the table switch entries, build a DPDA (http://en.wikipedia.org/wiki/Deterministic_pushdown_automaton) from the CFG to parse the text, and remember where in that parsing the switch tokens occurred so we can output them in hex) and I'm not sure it will even work (the language of table switch tokens appears at first glance to be non-deterministic). Even if it does, I don't see optimality in less than O(2^n).

Somebody mentioned making the end user do all the hard work?
Title: Re: Table File Standard Discussion
Post by: Nightcrawler on May 31, 2011, 02:45:51 pm
1) This would have to be sacrificed, and the resulting complications for testing are lamentable. All other things being equal, I agree that simplicity is to be preferred, but in this case all other things are not equal.
2) The longest text sequence rule for insertion nicely mirrors the longest hex sequence rule for dumping, so you also lose some parallelism :P. In the interests of accessibility, I'd definitely keep longest prefix as a suggested algorithm, since it is easy to explain and understand, but make a note that other algorithms are possible, and that the same text should be output by the video game no matter which algorithm is used.
... except that that doesn't cover cases where longest prefix fails to insert otherwise insertable text. On that note, the "ignore and continue" rule of 2.2.2 further complicates the description of 2.2.4 when different algorithms are allowed. What's the reasoning behind "ignore and continue"? I think I prefer Atlas' "give up and die" approach :P.

Ok, I think I can agree to resolve this by doing the following:

1. Amend 2.2.4 to state text collisions should be resolved by a suitable algorithm. The example and suggested algorithm will show longest text sequence as the simplest, suggested approach. A note will mention that other, more intelligent algorithms (such as A*) may be used for optimal collision resolution.

2. Amend 2.2.2 to state an error should be generated if no match is found for a text sequence. I don't recall anymore why ignoring was chosen. I don't have an argument at this time for not throwing an error.

Quote
Putting all of that together would make $ the general in-game control code identifier I was shooting for earlier with #. Salient changes would be:
  • adding a note in 2.2 forbidding < and > in normal entry text sequences;
  • changing "endtoken" (2.4), "label" (2.5), and "TableID" (2.6) to "<label>" where label can contain any character not in [,<>];
  • adding a note about uniqueness of all non-normal entries;
  • explaining how format strings work in 2.5 (each gets replaced with value of next byte) and 2.6 (translate next characters in new table).
There's probably a cleaner way to handle 2.6, since "table2,6" looks nicer than "<table2 %2X %2X %2X %2X %2X %2X>", "table2,0" is perhaps clearer than "<table2>", and it doesn't matter much anyway if table switch tokens aren't output.

I see the possible benefits of parameter inclusion for Linked Entries. A formatting string would probably be some nice icing on that cake. I do not understand what exactly you are proposing for the other items, nor what the advantages would be. Your previous post on in-game control identifiers with # made a little more sense, but now I am further confused.

Table Switching Insertion

First, I believe this is in the utility realm and need not be mentioned in the standard other than as a passing application/implementation suggestion or note.

I see two simple solutions here for myself.

1. The dumper outputs a table switch marker. This is simplest, but probably undesirable. It will really hurt readability of the dump output if you use it for Kanji and Kana.

2. We pretty much do the same process as we do for dumping, but with the table jump inferred by token lookup instead of taken explicitly from the hex.  We then determine a suitable switch path when a switch is detected. Let's look at the example from the standard (2.6) for insertion.

We're in table 1, our starting table.
We see ア. Not in current table.
We find that in table 2, can we get to table 2 from table 1?
Yes, we find a switch token (infinite matches) in table 1 that goes to table 2. Insert 0xF8 followed by 0x00.
Now we're in table 2 (table 1 is on the stack).
We see イ. Found in current table, output 0x01.
We see ウ. Found in current table, output 0x02.
We see 意. Not in current table.
We find that in table 3. Can we get from table 2 to table 3?
Yes, we found a switch token (infinite matches) to table 3 in the current table. Output 0xF9 and 0x01.
We're currently in table 3 (table 2 and 1 are on the stack).
We see <PlayerName>.
Not in current table. We find that in table 1. Can we switch table 3 to table 1?
No, there is no switch available. No further match is possible, this table is expired. Fall back to table 2.
Can we get from table 2 to table 1? Yes, there is a switch available in table 2 to table 1. Output 0xF8 0x03. Finished.

Alternatively, logic could be added to check and see that table 1 was already on the stack and we can fall back there for optimal result without the extra 0xF8 in there. Both would give equivalent game text output.
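Roughly speaking (and glossing over match counts, fallback-on-no-match, and error handling entirely; the table/switch layout below is made up just for illustration), that process looks something like:

Code: [Select]
def insert_with_switching(tokens, tables, switches, start_table):
    """tokens:   text tokens to insert, in order
    tables:   {table_id: {text: hex}} normal entries per table
    switches: {(from_id, to_id): hex} infinite-match switch entries
    Emits hex values, inferring table switches by token lookup."""
    output, stack, current = [], [], start_table
    for tok in tokens:
        # Find which table defines this token (assumed unique for simplicity).
        target = next((t for t in tables if tok in tables[t]), None)
        if target is None:
            raise ValueError("token %r not found in any table" % tok)
        # Fall back through the stack until the token's table is reachable.
        while target != current and (current, target) not in switches:
            if not stack:
                raise ValueError("no switch path to the table holding %r" % tok)
            current = stack.pop()
        if target != current:
            output.append(switches[(current, target)])    # emit the switch token's hex
            stack.append(current)
            current = target
        output.append(tables[current][tok])
    return output

The real thing would of course also have to respect per-switch match counts and fall back when no further match is possible, exactly as in the walkthrough above.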

As far as possible issues with linked entries and raw hex go, if we move to include the parameter bytes with linked entries, that leaves us only with the raw hex problem. For raw hex to be dumped, it must not have been found in any table on the current stack at the time of dump (no entry found causes the table to fall back under all conditions). If so, you shouldn't have a case where raw hex is involved in any table switching operations, because it will only occur in the starting table. The only way raw hex can get involved in a table switching operation is when the user explicitly stuck it in the script post-dump. If that occurs, it's probably best to just insert it and carry on with normal insertion. It should not count toward matches. I'm unsure whether or not it should cause the table to fall back. It'll probably screw things up and give undesired output if it does fall back. However, it seems more logical from an operational point of view to have it fall back.
Title: Re: Table File Standard Discussion
Post by: abw on June 01, 2011, 11:44:08 pm
Ok, I think I can agree to resolve this by doing the following:
Works for me :).

Your previous post on in-game control identifiers with # made a little more sense, but now I am further confused.
Basically, I'm proposing reserving "<" and ">" for exclusive and mandatory use as the opening and closing delimiters of every entry that isn't a normal entry (i.e. any entry starting with "/", "$", or "!"). I am further proposing that it should be an error for a table to contain multiple entries (regardless of type) containing "<foo>" for any value of "foo", and that "foo" may not have the form "$XX" for any two-character hexadecimal sequence "XX". While I was at it, I figured I might as well include format strings, since that idea seems good and dovetails nicely with having reserved delimiters. The rest was just bookkeeping highlights.

We already reserve "<" and ">" as delimiters for hexadecimal literals, so really, this is just carrying the idea through to the next step. It's like how you're not allowed to have a literal "<" as data inside an xml tag.

Benefits:
- integrity of the standard;
- raw hex insertion priority is no longer an issue, conceptually or programmatically, since no sequence of table entries will be able to generate "<$XX>" when dumping;
- similarly, parsing of all in-game control codes becomes absolutely unambiguous during insertion (at least in single-table insertion scenarios);
- anyone working with dumped text will immediately be able to identify in-game control codes (i.e. we're enforcing a best practice);
- format strings further disambiguate parsing while increasing output flexibility and restoring "," as a usable character in linked and table switch entries.

Costs:
- "<" and ">" are no longer able to be used in normal entries, or as anything other than the first and last character, respectively, of non-normal entries;
- we lose the ability to have entries terminated with newlines. So maybe we make an exception and allow newlines after ">";
- checking for uniqueness marginally increases complexity;
- in order to be most effective, uniqueness should be applied across all tables used during dumping/inserting. This would also marginally increase complexity, and may be undesirable if people really want to use the exact same text sequence in multiple tables (e.g. they want "<end>" to be output regardless of the table in which the end token appeared; this may perhaps have some benefit in multi-table insertion scenarios);
- format strings also increase complexity;
- somebody would have to write all these changes into the standard (I listed some of the most important changes).

I think the benefits outweigh the costs, but that's just my opinion. At the very least, it's worth presenting for consideration.

Table Switching Insertion
First, I believe this is in the utility realm and need not be mentioned in the standard other than passing application/implementation suggestion or note.
Ah, the easy way out :P.

1. The dumper outputs a table switch marker. This is simplest, but probably undesirable. It will really hurt readability of the dump output if you use it for Kanji and Kana.
It also doesn't work in general. If I dump a sequence of Kanji and then stick some Kana in the middle, the old table switch markers won't be appropriate for insertion anymore.

2. We pretty much do the same process as we do for dumping, with the table jump inferred by token lookup instead of explicitly from hex. We then determine a suitable switch path when a switch is detected. Let's look at the example from the standard (2.6) for insertion.
As multi-table insertion scenarios go, 2.6 is fairly simple, since you can move from any table to any other table (either explicitly for tables 1 and 2 or through fallback for table 3) whenever you need to and stay in the new table for as long as you need to. The only penalty for making a "wrong" table switch is inserting more hex than required. Here's a (very) slightly more convoluted scenario:
@tableA
00=A
01=D
!C1=tableB,1
!C3=tableB,3

@tableB
C0=AB
C1=AC
Insertion string: ABACD

In this case, you're faced with a choice right away, and this time, guessing wrong means you can't insert the string :(.
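
For what it's worth, here's a minimal backtracking sketch (Python) that handles this example; the two-level table model, names, and hard-coded data are illustrative only, not a standard-compliant inserter. It tries the longest text match in the current table first, then each switch, and backs out of any path that dead-ends:

TABLES = {
    'tableA': {'A': '00', 'D': '01'},
    'tableB': {'AB': 'C0', 'AC': 'C1'},
}
SWITCHES = {  # (switch hex, target table, number of matches)
    'tableA': [('C3', 'tableB', 3), ('C1', 'tableB', 1)],
    'tableB': [],
}

def insert(text, table='tableA', matches_left=None):
    # matches_left is None outside a counted switch; inside one it counts down
    # and we fall back to tableA (the only possibility here) when it hits zero
    if matches_left == 0:
        return insert(text, 'tableA', None)
    if not text:
        return [] if matches_left is None else None   # counted switch left unfinished
    for seq, hexval in sorted(TABLES[table].items(), key=lambda kv: -len(kv[0])):
        if text.startswith(seq):
            left = None if matches_left is None else matches_left - 1
            rest = insert(text[len(seq):], table, left)
            if rest is not None:
                return [hexval] + rest
    for switch_hex, target, count in SWITCHES[table]:
        rest = insert(text, target, count)
        if rest is not None:
            return [switch_hex] + rest
    return None

print(insert('ABACD'))   # ['C1', 'C0', 'C1', 'C1', '01']

The C3 (3-match) path is tried first here, fails on the trailing "D", and gets backed out in favour of using the single-match switch twice; a simple greedy guess would have been stuck.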

If [manual raw hex] occurs, it's probably best to just insert it and carry on with normal insertion. It should not count toward matches.
We can define whatever behaviour we want, but the true test is how the game handles situations like this. Aside from the possibility of being interpreted as (part of) a valid token when a modified table switch path does not follow the original path, there's still some danger in cases where changes have been made to the text engine. My feeling is that if somebody's making changes to the text engine, they should know enough to be careful with raw hex, but if the utility is going to be responsible for choosing the table switch path (which I think it has to be), it should also be responsible for checking for raw hex suddenly becoming (part of) a valid token, which in turn would affect the game's match count.
Title: Re: Table File Standard Discussion
Post by: Nightcrawler on June 02, 2011, 01:59:00 pm
Basically, I'm proposing reserving "<" and ">" for exclusive and mandatory use as the opening and closing delimiters of every entry that isn't a normal entry (i.e. any entry starting with "/", "$", or "!"). I am further proposing that it should be an error for a table to contain multiple entries (regardless of type) containing "<foo>" for any value of "foo", and that "foo" may not have the form "$XX" for any two-character hexadecimal sequence "XX". While I was at it, I figured I might as well include format strings, since that idea seems good and dovetails nicely with having reserved delimiters. The rest was just bookkeeping highlights.

Couple of issues with this:

1. There are many in-game control codes that currently I use just a normal entry for. Let's take keypress as an example. $FD=<kp>. You want to disallow that? Or is that allowed as it's the first and last character in a normal entry? What if I then define a linked entry or end token as <kp> for some reason? Are you saying it should be unique across normal entries, linked entries, and end tokens? Certainly different hex sequences should be able to yield the same text sequence though.

One other logical line of thinking might be that it should instead be a linked entry with zero parameters and not a normal entry to begin with. This essentially forces you (if you want to use < or > as is typical for your controls) to make all in-game controls be linked entries regardless of parameters.

2. I'm not sure whether users might object to being locked into using '<' and '>' for all their controls, although I understand we've already done that to them with raw hex. I guess I would hesitate to push further in that direction without feedback or support from others. It would simplify several matters though, as pointed out.

3. I don't think this works well when applied to table switching. You want the @TableIDString line and then the TableID in the switch entry in the format <XXXX>? That seems a bit uglier, and as you already pointed out probably wouldn't lend itself to formatting strings.

I might consider this if it did not apply to table switches and it is agreeable to others. I never envisioned the TableID actually being used as a text token, but rather simply an internal identifier. I don't think we will need to output the table switch at all.

Note that along these lines I am considering pushing newlines and end tokens (the non-hex variant goes out entirely; the hex variant becomes a normal entry identified as an end token at the utility level) entirely out to the utility realm. I have been having some PM discussions with Klarth about this. If we did that, that would leave us with only linked entries and table switches as non-normal entries.

Quote
As multi-table insertion scenarios go, 2.6 is fairly simple, since you can move from any table to any other table (either explicitly for tables 1 and 2 or through fallback for table 3) whenever you need to and stay in the new table for as long as you need to. The only penalty for making a "wrong" table switch is inserting more hex than required. Here's a (very) slightly more convoluted scenario:

In this case, you're faced with a choice right away, and this time, guessing wrong means you can't insert the string :(.

I see some of the pitfalls with the available choices. What really is the problem here? It seems to be the variable number of matches that throws in the monkey wrench. To that I say, do we need it? Hear me out. Personally, for every table switching operation I have ever seen, it is either a single match, an indefinite match until end of string/not-found value, or a match until another switch is encountered. So, I ask: do we need the ability to define a number of matches other than one or infinite? It was originally born into the standard due to the Kanji Array feature from ROMjuice's (http://www.romhacking.net/utils/234/) readme that is in a few games. A second review of this makes me want to view it more as data compression rather than token conversion. That's the only thing I know of that needs this ability.

If we can omit that, doesn't our job become much more manageable? The number of matches becomes 0 (infinite) or 1. That makes choosing the path much easier and more manageable. My simple logic with longest hex should then work without much issue. One could even add a rule saying infinite matches take precedence over a single match in the event two switches exist in the same table to the same destination table. Then your example is much easier to handle appropriately.

So, let's assume now your table switches are changed to reflect this:

!C1=tableB,1
!C3=tableB,0

Assume tableA start.
We see 'ABACD'
Longest match we can find out of that is 'AB'. It's in tableB. Can we get to TableB?
Yes. Infinite is preferred over a match of 1, so output 0xC3 0xC0 (if not, you would look for the next longest and try again until a valid match/path is found or the options are exhausted).
Longest match we can find is 'AC'. We're in tableB so just output 0xC1.
We see 'D'. Not in TableB, fall back to Table A. Found in Table A, output 0x01.
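
Roughly, the logic looks like this (Python; a rough sketch only, with the tables hard-coded and without modeling the return from a single-match switch, since that branch never gets chosen in this example):

TABLES = {
    'tableA': {'A': '00', 'D': '01'},
    'tableB': {'AB': 'C0', 'AC': 'C1'},
}
SWITCHES = {'tableA': [('C3', 'tableB', 0), ('C1', 'tableB', 1)], 'tableB': []}
START = 'tableA'

def longest(table, text):
    hits = [s for s in TABLES[table] if text.startswith(s)]
    return max(hits, key=len) if hits else None

def greedy_insert(text):
    out, table = [], START
    while text:
        cands = []
        seq = longest(table, text)
        if seq:
            cands.append((len(seq), 0, 0, None, table, seq))
        for sw_hex, target, matches in SWITCHES[table]:
            s = longest(target, text)
            if s:
                cands.append((len(s), 1, 0 if matches == 0 else 1, sw_hex, target, s))
        if not cands:
            if table != START:
                table = START              # nothing found anywhere: fall back and retry
                continue
            raise ValueError('cannot insert: ' + text)
        # longest match wins, then staying in the current table,
        # then an infinite-match switch over a single-match one
        _, _, _, sw_hex, table, seq = max(cands, key=lambda c: (c[0], -c[1], -c[2]))
        if sw_hex is not None:
            out.append(sw_hex)
        out.append(TABLES[table][seq])
        text = text[len(seq):]
    return out

print(greedy_insert('ABACD'))   # ['C3', 'C0', 'C1', '01']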

Thoughts?

Can you think of any real world scenarios where you would need to define a number of matches other than 1 or infinite? If not, we might have a winning change here.

Quote
We can define whatever behaviour we want, but the true test is how the game handles situations like this. Aside from the possibility of being interpreted as (part of) a valid token when a modified table switch path does not follow the original path, there's still some danger in cases where changes have been made to the text engine. My feeling is that if somebody's making changes to the text engine, they should know enough to be careful with raw hex, but if the utility is going to be responsible for choosing the table switch path (which I think it has to be), it should also be responsible for checking for raw hex suddenly becoming (part of) a valid token, which in turn would affect the game's match count.

Not following. How would the raw hex suddenly become part of a valid token?
Title: Re: Table File Standard Discussion
Post by: henke37 on June 03, 2011, 11:43:23 am
What if I simply want to have a literal less than sign in my text?
Title: Re: Table File Standard Discussion
Post by: abw on June 03, 2011, 09:59:07 pm
One other logical line of thinking might be that it should instead be a linked entry with zero parameters and not a normal entry to begin with. This essentially forces you (if you want to use < or > as is typical for your controls) to make all in-game controls be linked entries regardless of parameters.
Basically this, but I was thinking of it in reverse: a linked entry is just an in-game control code with > 0 parameters.



People are still free to define in-game control codes as normal entries if they want (FD=kp), it's just a bad idea since they could be parsed in unexpected ways during insertion.


2. I'm not sure whether users might object to being locked into using '<' and '>' for all their controls, although I understand we've already done that to them with raw hex. I guess I would hesitate to push further in that direction without feedback or support from others. It would simplify several matters though, as pointed out.
What if I simply want to have a literal less than sign in my text?
Yeah, this is definitely a compromise. Given that we're already doing it with raw hex, this is better than reserving more characters. We could go for more flexibility and allow e.g. $FD=[kp] on the condition that [ and ] do not appear in any text sequence as anything other than start and end tokens of a non-normal entry, but that sounds like more cost than benefit. Since table files are UTF-8 documents, there are lots of other characters to use as substitutes. I might go with « and » as visually similar characters, { and } if my game lacked significant set theoretic discussion (for shame! :P), or maybe &lt; and &gt; if I were using an XML-based utility. All of these options assume that my substitute characters (or at least enough of them to force unambiguous parsing) weren't already in use. So, not great, but not insurmountable either.


3. I don't think this works well when applied to table switching. You want the @TableIDString line and then the TableID in the switch entry in the format <XXXX>? That seems a bit uglier, and as you already pointed out probably wouldn't lend itself to formatting strings.
@table2 would still be @table2, but !FD=table2,0 would become !FD=<table2> and !FD=table2,2 would become some variant on !FD=<table2,%2X%2X>. This does seem unnecessary for table switching, especially if table switch tokens are only used for dumping, but it does make ! entries consistent with $ entries (my main goal in suggesting it), and potentially there are situations where having table switch tokens in the script might be beneficial, maybe. Contrary to what I said earlier, I think we'd still want to keep "," as an internal delimiter between the tableID and the format string. All in all, this part is probably not worth the cost.


Note that along these lines I am considering pushing newlines and end tokens (the non-hex variant goes out entirely; the hex variant becomes a normal entry identified as an end token at the utility level) entirely out to the utility realm. I have been having some PM discussions with Klarth about this. If we did that, that would leave us with only linked entries and table switches as non-normal entries.
Interesting. Artificial end tokens are already necessarily handled at the utility level during dumping, so no loss there. Insertion would lose some intercompatibility and gain some complexity. As far as utilities are concerned, end tokens are the single most important type of token. We could combine ideas and make end tokens (/) in-game control codes ($), which would ensure their unambiguous parsability. The utility would have to have some way to find out which tokens were end tokens, either by hex or by text. Hex would have to be table-specific and couldn't handle artificial end tokens, so text is preferable. I guess it all depends on how smart the insertion utility is.

Would it no longer be possible to dump a newline based on table entry alone? Or is this a weakening of the "ignore newlines" insertion rule?


Can you think of any real world scenarios where you would need to define a number of matches other than 1 or infinite? If not, we might have a winning change here.
This change would significantly reduce the complexity in finding a valid table switch path. One of the ideas I've considered is recombining table entries (e.g. taking "!00=table2,1" in table1 and "80=foo" in table2 and replacing them with "0080=foo" in table1), and these two ideas would work well together. However, I'm not sure we actually can eliminate other match count values: I have read about cases where more than one match is required (http://www.romhacking.net/forum/index.php/topic,10086.msg152625.html#msg152625). Eliminating other match counts won't solve all our problems (see next point), but it is definitely worth pursuing if we can get away with it.

How would the raw hex suddenly become part of a valid token?
Consider the following example:
00=ABC
0000=DEF
80=A
...
9A=Z
Insufficiently careful insertion of the text "ABCABC" will result in the hex "00 00", which the game will read as the text "DEF". The same thing can happen with raw hex literals: "<$00>ABC" and "ABC<$00>" both end up being read as "DEF". It's worth noting that none of the algorithms discussed here (including longest prefix) address this possibility. In a single-table scenario, I would consider this example unlikely and indicative of poor text engine design besides (which is not the same as saying it won't work, given the right input), but the same idea applies in a multi-table scenario and seems more likely to occur in practice. Here's another made-up example:
@table1
00=~1~
80=A
...
9A=Z
!C0=table2,0
!C1=table2,1

@table2
00=~2~
0080=~3~
8200=~4~
8280=~5~
80=a
...
9A=z
Starting from table1, when we see the text "abc<$00>ABC", what do we insert? What does the game read? Even ignoring that troublesome hex literal, what happens with "abcABC"?
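
To make the first example concrete, here's a tiny demonstration (Python; letter codes and helper names are illustrative only) of a longest-prefix inserter and a longest-hex dumper disagreeing about "ABCABC":

TABLE = {'ABC': '00', 'DEF': '0000'}
TABLE.update({chr(ord('A') + i): '%02X' % (0x80 + i) for i in range(26)})   # single letters from 0x80 up

def insert_longest_prefix(text):
    out = []
    while text:
        seq = max((s for s in TABLE if text.startswith(s)), key=len)
        out.append(TABLE[seq])
        text = text[len(seq):]
    return out

def dump_longest_hex(hexstring):
    by_hex = sorted(TABLE.items(), key=lambda kv: -len(kv[1]))   # longest hex sequence first
    out, i = [], 0
    while i < len(hexstring):
        for seq, hexval in by_hex:
            if hexstring.startswith(hexval, i):
                out.append(seq)
                i += len(hexval)
                break
        else:
            raise ValueError('no entry at offset %d' % i)
    return ''.join(out)

inserted = insert_longest_prefix('ABCABC')
print(inserted, '->', dump_longest_hex(''.join(inserted)))   # ['00', '00'] -> DEF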
Title: Re: Table File Standard Discussion
Post by: Nightcrawler on June 07, 2011, 03:18:44 pm
All of these options assume that my substitute characters (or at least enough of them to force unambiguous parsing) weren't already in use. So, not great, but not insurmountable either.

The way we have it currently disallows only a raw hex pattern in normal entries, but otherwise allows everything. If we were to pursue this direction of heavier restriction, I'd need support from others.

Quote
@table2 would still be @table2, but !FD=table2,0 would become !FD=<table2> and !FD=table2,2 would become some variant on !FD=<table2,%2X%2X>.
I don't think it makes sense for table switch entries to be part of it like that. Remember, TableID refers to an identifier that corresponds to a TableIDString line in one of your tables. Why are you wrapping it in '<>'? Why are you then trying to wrap it in a token format? If you want to do that, I think it makes more sense to do it like this:

!hex=TableIDString,NumberOfTableMatches,FormatString
!FD=table2,2,<table2 %2X%2X>

That way you're now defining a specific formatted token there as an extra parameter. You're no longer trying to mesh the text token with the TableIDString identifier.

Quote
Interesting. Artificial end tokens are already necessarily handled at the utility level during dumping, so no loss there. Insertion would lose some intercompatibility and gain some complexity. As far as utilities are concerned, end tokens are the single most important type of token. We could combine ideas and make end tokens (/) in-game control codes ($), which would ensure their unambiguous parsability. The utility would have to have some way to find out which tokens were end tokens, either by hex or by text. Hex would have to be table-specific and couldn't handle artificial end tokens, so text is preferable. I guess it all depends on how smart the insertion utility is.
Yes. And if the in-game control code idea didn't fly, I would still eliminate '/' and just leave them as normal entries. As far as the table goes, "FF=<END>" is a standard hex to text conversion. The significance of this specific token is in the utility realm. That would be the logic there from the perspective of a more strict interpretation of what belongs in the table file. This would, as you pointed out, require the utility to provide a way for the user to identify which tokens are end tokens. End tokens are tricky as we currently have them. They have one foot in the utility realm and the other in the table the way it is. This is one possible idea to make things simpler and more logical.

Quote
Would it no longer be possible to dump a newline based on table entry alone? Or is this a weakening of the "ignore newlines" insertion rule?
Correct. There would be no newlines in the table file. "\n" in your table file would just be normal text. Again, the concept of newlines and when to place them is pushed to the utility level. Nobody wants newlines to be part of the token in the insertion direction (except maybe you), and people have presented problems with having different behavior between dumping and insertion for them and/or having them at all. So, consideration of removing them entirely is being made.

The drawback to pushing these items out is, while we have more strict and standardized definition of the table file, we actually standardize less in the overall process. Having no standardization in the process is probably what led to the incompatible utilities and table files we had in the past. So, does it become counterproductive to the cause?

Quote
However, I'm not sure we actually can eliminate other match count values: I have read about cases where more than one match is required (http://www.romhacking.net/forum/index.php/topic,10086.msg152625.html#msg152625). Eliminating other match counts won't solve all our problems (see next point), but it is definitely worth pursuing if we can get away with it.
That topic references the same 'Kanji Array' item from the ROMjuice readme file I previously mentioned. Look at it again. This time ask yourself 'Isn't this really a form of simple data compression rather than a hex or text encoding item?' I think an argument can be made that it is data compression (even if very simple) and thus would not belong in our table file standard.

Quote
Starting from table1, when we see the text "abc<$00>ABC", what do we insert? What does the game read? Even ignoring that troublesome hex literal, what happens with "abcABC"?
I thought we covered 'abcABC'... Longest match is one letter for all of them, infinite table switch is preferred over single. So, we get 0xC0 0x80 0x81 0x82 (fallback) 0x80 0x81 0x82.

The hex literal is also simplified if NumberOfTableMatches can only be 0 or 1. This scenario can only occur if it's 0 now. So, all you need to decide is if hex literal causes a fallback or not. I don't think there is a right answer there as it will be game specific based on the interpretation of the hex value. One behavior will have to be chosen and move on. This will only ever occur when the user specifically inserts hex during a table switch.
Title: Re: Table File Standard Discussion
Post by: abw on June 08, 2011, 08:43:06 pm
The way we have it currently disallows only a raw hex pattern in normal entries, but otherwise allows everything.
Which includes tokens that can combine to produce text representing a raw hex literal.

If we were to pursue this direction of heavier restriction, I'd need support from others.
Definitely. I think this restriction is a net benefit, but it does come at a price, albeit a small one.

That way you're now defining a specific formatted token there as an extra parameter. You're no longer trying to mesh the text token with the TableIDString identifier.
The current way of doing things is fine. I wanted the syntax for ! entries to be consistent with the syntax for $ entries, but if we aren't dealing with ! entries as text, it really doesn't matter much.

The drawback to pushing these items out is, while we have more strict and standardized definition of the table file, we actually standardize less in the overall process. Having no standardization in the process is probably what led to the incompatible utilities and table files we had in the past. So, does it become counterproductive to the cause?
Hmm. That is a bit of a pickle. I agree pushing both items out to the utility level seems logical, but I wonder how many people will complain? Maybe take it one step at a time and only worry about table files for now? What's your vision of the ideal process?

That topic references the same 'Kanji Array' item from the ROMjuice readme file I previously mentioned. Look at it again. This time ask yourself 'Isn't this really a form of simple data compression rather than a hex or text encoding item?' I think an argument can be made that it is data compression (even if very simple) and thus would not belong in our table file standard.
Yup, I know. The thread adds a little bit of info not in the readme, and is old enough that it might have been forgotten. Kanji Arrays are definitely data compression, but so is DTE. Where do we draw the line? Kanji Arrays and DTE are both representable by static bidirectional mappings, unlike, say, any of the Lempel-Ziv compression variants, which use dynamic mappings. Dumping Kanji Arrays isn't a problem, and if a bunch of programmers from the 90s came up with a way of inserting them, we should be able to as well.

I thought we covered 'abcABC'... Longest match is one letter for all of them, infinite table switch is preferred over single. So, we get 0xC0 0x80 0x81 0x82 (fallback) 0x80 0x81 0x82.
I think you missed my point, given that you just inserted "ab~5~bc" instead of "abcABC". How is the game supposed to know that it should fall back to table1 in the middle of a string of perfectly valid table2 tokens? Inserting without considering consequences for dumping (a.k.a. in-game display) can cause trouble, in many of the same ways as can dumping without considering consequences for inserting. If a utility author chooses not to care, so be it, but that doesn't make the problems go away.

On the topic of falling back... how exactly does the last Dakuten example in 3.2 work?
   Example 3:
   
      Let's say hex sequence "7F" indicates Dakuten mark that applies for all
      characters until "7F" occurs again to turn it off. This can be done with
      two tables containing table switch bytes in the following manner:
      
      Table 1:
         
         @NORMAL
         60=か
         !7F=Dakuten,0
      
      Table 2:
         
         @Dakuten
         60=が
            
      This will instruct any dumper to output 'か' normally until a "7F" byte
      is encountered. It will then switch to Table 2 and output 'が'. Because
      we specified 0 for the number of table matches, matching in the new
      table will continue until a value is not found. In this case "7F" is not
      in the Table 2, so fallback to Table 1 will occur.
After falling back to @NORMAL from @Dakuten on that "7F" byte, won't the very next thing the dumper does (as corroborated by the example in 2.6) be to read "7F" in @NORMAL, causing it to switch right back to @Dakuten and continue dumping with Dakuten marks? That sounds like the exact opposite of the intended result :P. This kind of fallback isn't really covered under any of the other cases, so maybe there should also be an explicit fallback entry. Perhaps something in @Dakuten like "!7F=,-1", or something in @NORMAL like "!7F=Dakuten,$7F"?
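
Here's a short simulation (Python; illustrative only, following the fallback-on-not-found rule) of what a dumper would do with the byte stream 60 7F 60 7F 60, which the game presumably intends to display as かがか:

TABLES = {
    'NORMAL':  {0x60: 'か'},
    'Dakuten': {0x60: 'が'},
}
SWITCHES = {'NORMAL': {0x7F: 'Dakuten'}, 'Dakuten': {}}   # 0 matches: stay until a byte is not found

def dump(data):
    out, stack, i = [], ['NORMAL'], 0
    while i < len(data):
        table, b = stack[-1], data[i]
        if b in TABLES[table]:
            out.append(TABLES[table][b]); i += 1
        elif b in SWITCHES[table]:
            stack.append(SWITCHES[table][b]); i += 1
        elif len(stack) > 1:
            stack.pop()                     # not found: fall back and re-read the byte
        else:
            out.append('<$%02X>' % b); i += 1
    return ''.join(out)

print(dump([0x60, 0x7F, 0x60, 0x7F, 0x60]))   # かがが, not the intended かがか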
Title: Re: Table File Standard Discussion
Post by: Klarth on June 08, 2011, 09:44:13 pm
I can make arguments for and against end tokens in the standard, along with formatting.  Personally, I'd rather keep end tokens and formatting in the standard since it's less code to write and less configuration for me to keep track of (on both the dev and usage sides).  We can remove null end tokens like /<END>, though.
Title: Re: Table File Standard Discussion
Post by: Nightcrawler on June 09, 2011, 04:23:10 pm
Which includes tokens that can combine to produce text representing a raw hex literal.

I spoke about some of these things with Klarth before. These are items that can occur, but are too rare to be detrimental... unless it's a "gotcha table". You either accept that or you go in the other direction previously spoken about, with the game control code unification and disallowing the literals used. At this point, nobody else seems in favor of such changes.  :-\

Quote
Hmm. That is a bit of a pickle. I agree pushing both items out to the utility level seems logical, but I wonder how many people will complain? Maybe take it one step at a time and only worry about table files for now? What's your vision of the ideal process?

Well as you can see, already Klarth does not like the idea of eliminating end tokens altogether. I imagine it would hit additional resistance. However, he is in favor of eliminating artificial end tokens. He also wants to keep formatting. I'm on the fence with it.

a.) Allowing arbitrary line breaks in any token is flexible formatting and hard to duplicate (from a user interface perspective) on the utility level.
b.) Many games have specific line break control codes. The end user needs a method to get a real line break after one of these in the text dump. The alternative would be having to define a line break control to the utility. That can seem disjointed. Some utilities had in-game line breaks specified in the table file similar to end tokens. This was deemed unnecessary with the \n formatting we currently have.
c.) Leaving them requires different behavior for interpreting tokens for dumping and inserting (the ignore-line-break-on-insertion rule). Some, such as yourself, want to use that as part of the token while others are strongly against it.
d.) It can be difficult or problematic to have the script dump come out commented without quirks. See the example in the standard. You will end up having issues with the starting and ending comments by doing it that way. Alternatively, pushing commenting off to the utility pretty much requires the line breaks to be defined in some capacity to handle things like item b above. How else would the utility be able to comment out each line within a string and then, say, leave un-commented lines between strings? This is how the whole \r and \n distinction was born for commented and un-commented line breaks in some other utilities.

I'm not even going to begin to talk about the ideal process. That's a huge can of worms as the ideal process depends on what platform and games you're working with. The table file escapes much of that and look at all the disagreement on that. :P

Quote
Yup, I know. The thread adds a little bit of info not in the readme, and is old enough that it might have been forgotten. Kanji Arrays are definitely data compression, but so is DTE. Where do we draw the line? Kanji Arrays and DTE are both representable by static bidirectional mappings, unlike, say, any of the Lempel-Ziv compression variants, which use dynamic mappings. Dumping Kanji Arrays isn't a problem, and if a bunch of programmers from the 90s came up with a way of inserting them, we should be able to as well.

The difference is, with DTE, there is direct token conversion in both directions. With the Kanji Array and most other data compression, you do not have that. An algorithm (even if simple) is required. As soon as you leave the realm of direct token conversion, I think you depart from the scope of the table file. That's a clear line to me. DTE is a static bidirectional map with no manipulation needed, the Kanji Array is not. Transformation of the hex must occur to get the map, right?

A way that makes sense to interpret this as non data compression is to set up all possible kanji array values as table switches with x number of matches, which was exactly why Tau suggested it. No optimal plan for insertion was ever arrived at in this topic with that left in, though. It seems to severely complicate matters.

Quote
I think you missed my point, given that you just inserted "ab~5~bc" instead of "abcABC". How is the game supposed to know that it should fall back to table1 in the middle of a string of perfectly valid table2 tokens? Inserting without considering consequences for dumping (a.k.a. in-game display) can cause trouble, in many of the same ways as can dumping without considering consequences for inserting. If a utility author chooses not to care, so be it, but that doesn't make the problems go away.

Look at what you did. You defined table2 with 82=c and 8280=~5~. I can't say I've ever seen a text engine that could operate in that way that I recall. 0x82 would be reserved to indicate the next character is a two byte character.  I believe S-JIS and UTF-8 work in a similar manner as well off hand. If your game is processing characters that are one OR two bytes and the next character is valid as both a one-byte and a two-byte character, that's an ambiguous situation for the text engine unless it were to arbitrarily declare one or the other the default to choose from.

On top of all that, again, the user specifically plopped that raw value in there in the midst of a table switch to cause it. This is another case of the 'gotcha-table' theoretical condition that is so rare, I can't say it's worth any time to handle. What would you propose to do about it anyway? I don't think you'll find much support on the issue from the others.

You can try and iron out theoretical edge cases all day and spend a wealth of resources on them in your program, and continue to put it out of reach of more and more people, or you can start adding some practical limits, keep it simple, and get the product out the door. I think that's kind of where several of us are coming from now.

Quote
On the topic of falling back... how exactly does the last Dakuten example in 3.2 work?

Good catch. I remember writing those examples. I think I originally wrote it with an additional switch entry in Table2 of '!7F=NORMAL,0'. That should take care of it (and revising the logic paragraph). I remember having a (in hindsight) brain logic malfunction when I read it over and thought I didn't need that and it would just fallback. I rewrote it after that.
Title: Re: Table File Standard Discussion
Post by: abw on June 09, 2011, 11:56:47 pm
Well as you can see, already Klarth does not like the idea of eliminating end tokens altogether. I imagine it would hit additional resistance. However, he is in favor of eliminating artificial end tokens. He also wants to keep formatting. I'm on the fence with it.
The case for eliminating artificial end tokens certainly seems stronger than that for eliminating end tokens or newlines. As an end user, I think I prefer keeping them in the table file; even though that may not be the most logical place for them, it does seem to be the most convenient place.
a) I agree with this. As a real world example, it might be interesting to consider a game like The Legend of Zelda, which uses different tokens for regular characters, characters ending the first line, characters ending the second line, and characters ending the string.
b) The utility interface for defining a line break control can be pretty much the same as for defining an end token. I definitely prefer the formatting flexibility of \n over the old * entries.
c) As far as implementation goes, stripping newlines can be done in approximately 2 lines of code. My objections are primarily philosophical and are entangled with other issues.
d) I've never used \r myself. Is that a popular option? If you have some interface for defining line break controls and end tokens, the same thing should work for comment controls. Maybe something along the lines of this (clearly user interface is not my strong suit :P)?
0A=A
...
23=Z

4A=A<line1>\n
...
63=Z<line1>\n

8A=A<line2>\r
...
A3=Z<line2>\r

/CA=A<end>\n\n
/E3=Z<end>\n\n
becoming
#LINE-BREAK-AFTER: 4A-63,8A-A3,CA-E3
#LINE-BREAK-AFTER: CA-E3
#COMMENT-AFTER: 4A-63,CA-E3
#END-AFTER: CA-E3
Or maybe you could use your GUI to layer extra effects on a per-token basis? That sounds very much like a utility-specific table file extension, though.

I'm not even going to begin to talk about the ideal process. That's a huge can of worms as the ideal process depends on what platform and games you're working with. The table file escapes much of that and look at all the disagreement on that. :P
All too true. Consider the question retracted :P.


The difference is, with DTE, there is direct token conversion in both directions. With the Kanji Array and most other data compression, you do not have that. An algorithm (even if simple) is required. As soon as you leave the realm of direct token conversion, I think you depart from the scope of the table file. That's a clear line to me. DTE is a static bidirectional map with no manipulation needed, the Kanji Array is not. Transformation of the hex must occur to get the map, right?
DTE is only direct for dumping, and only when we assume all tokens have the same hex length. If the hex lengths vary, you need an algorithm to decide when to stop reading bytes (technically you still need an algorithm even if all it does is say "read one byte"). When you're inserting, you need an algorithm to tokenize the input... oh, I guess you're talking about what happens after tokenization. In that case, yes, DTE is direct both ways. But if you've already tokenized, Kanji Arrays are also direct conversion. The only problem we're having is that some of the tokenization information (the location and type of table switch tokens) is missing, and deducing the exact nature of the missing bits is tricky. If we could rely on the input script having the correct table switch tokens in the correct places, we wouldn't have most of these problems. It's the uncertainty of tokenization at issue, not the mapping of tokens.


Look at what you did. You defined table2 with 82=c and 8280=~5~. I can't say I've ever seen a text engine that could operate in that way that I recall. 0x82 would be reserved to indicate the next character is a two byte character.
Yes, that was a little sneaky of me. It is fully supported by the longest hex sequence rule for dumping, though. A quick peek at the wiki (http://datacrystal.romhacking.net/wiki/Fire_Emblem:_Mystery_of_the_Emblem:TBL) shows at least one real world example where this does in fact happen (on that note, people should submit more table files!).

I believe S-JIS and UTF-8 work in a similar manner as well off hand.
Not sure about S-JIS, but UTF-8 was actually designed to avoid this very weakness (and a few others). The same cannot be said of character encodings used in video games.

If your game is processing characters that are one OR two bytes and the next character is valid as one byte and two byte character, that's an ambiguous situation for the text engine unless it were to arbitrarily declare one or the other the default to choose from.
Which is exactly what the table file standard does :P.

On top of all that, again, the user specifically plopped that raw value in there in the midst of a table switch to cause it. This is another case of the 'gotcha-table' theoretical condition that is so rare, I can't say it's worth any time to handle.
Raw hex literals are (potentially) problematic anywhere in a game using multibyte tokens, not just at table switch boundaries.

What would you propose to do about it anyway? I don't think you'll find much support on the issue from the others.
That's what I was asking! Do they trigger fallback? Do they count as a match towards NumberOfTableMatches? What happens if they combine with the preceding or following hex to mutate into a completely different token? How can you write code to handle multi-table insertion and not consider these cases?

You can try and iron out theoretical edge cases all day and spend a wealth of resources on them in your program, and continue to put it out of reach of more and more people, or you can start adding some practical limits, keep it simple, and get the product out the door. I think that's kind of where several of us are coming from now.
Being 100% standard compliant means handling the theoretical edge cases. All of them. I haven't introduced anything in any of my examples that wasn't already allowed by the standard, and I haven't even raised all of the issues I'm currently aware of. If you want practical limits, I think either you have to adjust the standard or you can't claim compliance. And since you can't advocate a standard you don't want to comply with, I guess that means adjusting the standard.

One more try: in addition to the raw hex literal, let's also ignore all the multibyte entries and all the multicharacter entries. That leaves us with
@table1
80=A
...
9A=Z
!C0=table2,0
!C1=table2,1

@table2
80=a
...
9A=z
Now instead of "ab~5~bc", your hex represents "abcabc", which is still not the same as "abcABC". In this case, the only hex that works is "C1 80 C1 81 C1 82 80 81 82". If we are going to drop support for counts other than 0 and 1, what would you think about also dropping support for one table linking to another table using tokens with both counts? In this case, that would mean the combination in @table1 of "!C0=table2,0" and "!C1=table2,1" would be an error.

On the topic of falling back... how exactly does the last Dakuten example in 3.2 work?

Good catch. I remember writing those examples. I think I originally wrote it with an additional switch entry in Table2 of '!7F=NORMAL,0'. That should take care of it (and revising the logic paragraph). I remember having a (in hindsight) brain logic malfunction when I read it over and thought I didn't need that and it would just fallback. I rewrote it after that.
I thought about that too, but it does cause the dumper's table stack to become desynchronized from the game's stack. The next time the dumper hits an unrecognized entry in @NORMAL, it would fall back to @Dakuten (which might contain a hit) instead of dumping a hex literal. Worse yet would be if @NORMAL weren't the bottom of the stack, and the game actually fell back to some other table entirely while the dumper stayed in @Dakuten. But I guess that's just another theoretical edge case waiting to be ignored.
Title: Re: Table File Standard Discussion
Post by: Nightcrawler on June 10, 2011, 03:27:52 pm
First, let me apologize for some of the tone used in my previous message. I am just getting a little frustrated over the whole thing. After a year already, and now ripping it apart it again, the enthusiasm I started out with is fading quickly. It is becoming much more of a burden than the nice little side project I started out with while developing TextAngel. I am anxious to finish up and move on in life after this much time. I'm sure it shows in the tone of my responses. ;) While I am angry I did not think of some of these pitfalls myself in the beginning, I am grateful you have brought them to my attention nonetheless. It's been a spirited and lengthy discussion. Certainly some good has come from it as I have incorporated many items discussed. It was a daunting task to go through all this discussion again... :)

New draft, summary of changes, and summary of outstanding business. (http://transcorp.parodius.com/forum/YaBB.pl?num=1273691610/57#57)


The case for eliminating artificial end tokens certainly seems stronger than that for eliminating end tokens or newlines. As an end user, I think I prefer keeping them in the table file; even though that may not be the most logical place for them, it does seem to be the most convenient place.

Ok, so the final ruling was to get rid of artificial end tokens, but keep standard end tokens and the \n formatting. To clear up the situation where two end tokens are the same, differing only by '\n', I have added a rule that newlines are ignored when checking for duplication. So /FF=<end> and /FE=<end>\n\n are considered duplicates. Along those lines, the standard duplicate text-sequence rules apply. Lastly, duplicate text sequences are checked across the logical table (as opposed to only the token type or all tables). How's that?
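
For clarity, the check described amounts to something like this (Python; just a sketch, assuming the entries are already parsed into hex/text pairs and that newlines appear in the text as a literal backslash-n):

def find_duplicates(entries):
    seen, dups = {}, []
    for hex_seq, text in entries:
        key = text.replace('\\n', '')        # ignore the \n formatting when comparing
        if key in seen:
            dups.append((seen[key], hex_seq, key))
        else:
            seen[key] = hex_seq
    return dups

print(find_duplicates([('FF', '<end>'), ('FE', '<end>\\n\\n')]))   # [('FF', 'FE', '<end>')]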

Quote
d) I've never used \r myself. Is that a popular option? If you have some interface for defining line break controls and end tokens, the same thing should work for comment controls. Maybe something along the lines of this (clearly user interface is not my strong suit :P)?

I think it's a historically popular layout (and even recently, Cartographer supports it). I know a number of people (including myself, being 'old school') have a text dump layout where the original Japanese is all commented out and the English goes below. I happen to have an old sample already up online: http://transcorp.parodius.com/scratchpad/outputsample.txt

There are certainly other potentially better ways to do it. Tau likes to do it like this and use no line breaks or comments.
http://imageshack.us/photo/my-images/218/translationworkbench.jpg/

I'm sure others do it other ways and/or even port to XML or a spreadsheet. I just want to make sure that at least the few common cases I know of are possible to achieve.

I agree that we probably want to stay away from trying to extend the table format via the utility. This has made me re-think one of Tau's suggested solutions to this issue: using regex in table entries. See this post: http://transcorp.parodius.com/forum/YaBB.pl?num=1273691610/45

Originally I did not like the added complexity. However, since then we have strayed so far: table switching, possible printf-like formatting strings, etc. If we're that far down the road, another step or two might not be as bad anymore. I feel as though it's already far out of reach of most first-time utility amateurs now. I admit it would allow for some interesting possibilities. I also have to admit to some personal bias against it. It's a fine technology of course, it's just always been so difficult for me to grasp the syntax (which seems to vary somewhat with programming language). I struggle doing anything more than the simplest task in regex.

Quote
It's the uncertainty of tokenization at issue, not the mapping of tokens.

That's the same reason why we can't do any other form of compressed data. :) So, that leads me back to thinking we can do away with x number of matches on table switching to reduce complexity. Even if we do that, we still haven't come up with an agreeable method for insertion with table switching. So, it seems leaving it in and adding to the complexity further makes it nearly infeasible.

Quote
Yes, that was a little sneaky of me. It is fully supported by the longest hex sequence rule for dumping, though. A quick peek at the wiki (http://datacrystal.romhacking.net/wiki/Fire_Emblem:_Mystery_of_the_Emblem:TBL) shows at least one real world example where this does in fact happen (on that note, people should submit more table files!).
Raw hex literals are (potentially) problematic anywhere in a game using multibyte tokens, not just at table switch boundaries.

Ok. Point taken on the previous example of how the abcABC table knows when to fall back, even without the hex literal thrown in. However, I'm starting to lose sight of the point here. How do you propose these items be taken into consideration during insertion?


Quote
That's what I was asking! Do they trigger fallback? Do they count as a match towards NumberOfTableMatches? What happens if they combine with the preceding or following hex to mutate into a completely different token? How can you write code to handle multi-table insertion and not consider these cases?

I don't have a good answer for that. That's why I don't make public utilities. I can just ignore all cases that don't apply to what I'm doing and go on my merry way. :) Speaking of which, I don't think any program has ever taken some of the things we're talking about into consideration. The possibility of mutation due to inserting raw hex is not new, and yet everybody has gotten along just fine without doing anything special about it. How have utilities up until now handled it? I don't even know, as I don't recall ever reading a shred of documentation on it, nor did I take much notice of it when I went through available source code before. How much do these things matter? Why can't we just define some behavior and move on? What's the result? Insertion may do something undesirable for the user. The user will have to deal with it. Well, OK, nobody has complained about it for the last 15 years. Maybe it's a poor attitude to have on the situation, but I have to ask if it's worth the time to do anything else with it. The fact of the matter is most people using this will be inserting English scripts from a single table with English letters, DTE, and/or dictionary. How many will care about any of this?

So, I'm going to pass it to you to come up with something suitable. :P

Quote
Now instead of "ab~5~bc", your hex represents "abcabc", which is still not the same as "abcABC". In this case, the only hex that works is "C1 80 C1 81 C1 82 80 81 82". If we are going to drop support for counts other than 0 and 1, what would you think about also dropping support for one table linking to another table using tokens with both counts? In this case, that would mean the combination in @table1 of "!C0=table2,0" and "!C1=table2,1" would be an error.

OK. Understood now. Yes, I think it would be OK to disallow multiple switches to the same table from within a single table. I have a hard time thinking of a scenario where a game text engine would have two switches from a table to the same destination table with different rules. I think they'd hit the same type of wall we are hitting. Good line of thinking here. I think rather than think of some elaborate way to facilitate what we have defined, we need to reduce what we have defined to something more reasonable to facilitate. :)

Quote
I thought about that too, but it does cause the dumper's table stack to become desynchronized from the game's stack. The next time the dumper hits an unrecognized entry in @NORMAL, it would fall back to @Dakuten (which might contain a hit) instead of dumping a hex literal. Worse yet would be if @NORMAL weren't the bottom of the stack, and the game actually fell back to some other table entirely while the dumper stayed in @Dakuten. But I guess that's just another theoretical edge case waiting to be ignored.

Right. It seems like this is a new scenario not covered by anything we have. It needs to be, as I believe that is a common scenario. I've also seen hiragana/katakana switches like that. I would agree with the possible addition of another fallback option. However, what effect will that have on our already difficult table switch insertion process? I suppose it wouldn't be too terrible if we still limited things to not allow more than one switch to the same table in a logical table. There will only ever be one possible switch path to Table X from Table Y that way.
Title: Re: Table File Standard Discussion
Post by: Klarth on June 11, 2011, 02:15:20 am
Yes, that was a little sneaky of me. It is fully supported by the longest hex sequence rule for dumping, though. A quick peek at the wiki (http://datacrystal.romhacking.net/wiki/Fire_Emblem:_Mystery_of_the_Emblem:TBL) shows at least one real world example where this does in fact happen (on that note, people should submit more table files!).
I have a strong suspicion that table is incorrect at best and impossible at worst.  Look at the hex values 00-FF and the range is entirely saturated except for 00.  Look at the kanji ranges and **00 - **FF are saturated except for the **00 entry.  There is no way to differentiate between the two because of that saturation (unless the single byte entries were meant to start with an 00 byte).  Also, most of 01-07 have duplicate entries.

The scenario you were trying to describe can conceivably come true.  It'd be a mixed single- and double-byte token text system (read two bytes at a time, put a byte back into the stream if it's a single byte token).  Past that it depends on machine endianness (somewhat) and the programmer's logic (the order in which he packs the data).  And that's why it practically doesn't happen... reading one byte at a time (and getting the rest later) is more intuitive for most.
Title: Re: Table File Standard Discussion
Post by: Nightcrawler on July 29, 2011, 03:41:07 pm
Since it appears abw flew the coop long ago, I will try to carry on and resolve the remaining outstanding business.

OUTSTANDING BUSINESS:

Comments:
As pointed out by abw's first post here, we have an issue with leaving in the '\n' formatting and trying to add comments. We're also left, like Cartographer, with odd glitches like an extra '//' at the end of every dump, and needing the utility to add the first '//'. It seems hackish, but nobody's complained so far about Cartographer. If we move it to the utility, it is very difficult/cumbersome to have a user interface that gives the same flexibility to do something like this example (http://transcorp.parodius.com/scratchpad/outputsample.txt). The line breaks are arbitrarily defined in the table file... and then we're trying to comment only some of them, depending on token, after the fact. Either way is not pretty. The only other suggestion I've seen is allowing regex in the table file. (http://transcorp.parodius.com/forum/YaBB.pl?num=1273691610/55#55)

Raw Hex
Raw hex causes several issues. First, a combination of normal entries may inadvertently output a raw hex sequence during dumping and thus not insert. Secondly, inserting raw hex can cause issues for table switching behavior. Lastly, it can cause a subsequent token to be interpreted as a different token upon insertion. One solution to several of these issues was abw's general game control code, with <> type characters disallowed in normal entries. (http://www.romhacking.net/forum/index.php/topic,12644.msg185079.html#msg185079)

Insertion Issue with combined tokens
There are some insertion issues that can arise where less intelligent insertion (such as longest prefix) could result in text being interpreted as different tokens upon insertion. See the example at the bottom of this post (http://www.romhacking.net/forum/index.php/topic,12644.msg185079.html#msg185079).

Final Ruling Needed
1. Allow multiple tables per file? It may be useful to have all kanji/hira/kana in a single file, even as different logical tables. -Current ruling is to leave it at a single table per file.
2. Linked Entry formatting strings. "$XX=<token %2X %2X>" Cleaner, easier to validate, but increases complexity. At this point, I think I am in favor of this.

Insertion for Table Switching Details:
Outside of the table file itself, but still needed. Thus far, no suitable solution has been determined by anyone after much discussion in this topic. Reducing the supported features of table switching to something more manageable seemed like the necessary course of action. I think eliminating support for counts other than 0 and 1, and also dropping support for one table linking to another table using tokens with both possible counts, should resolve most of our prior issues. That seemed to be our consensus right before discussion died off.
Title: Re: Table File Standard Discussion
Post by: Klarth on August 01, 2011, 06:46:41 pm
Opposed to multiple tables per file, unless you implement it as a requirement for table switching.
In favor of linked entry format strings.  I don't have any specifics on how the format string should look or what should be supported in it.
Title: Re: Table File Standard Discussion
Post by: abw on August 05, 2011, 10:26:58 pm
Yeah, sorry about that. Real Life has been keeping me pretty busy lately, and I haven't had the time/energy to be very productive on this. I think we all needed a bit of a break anyway :P.

Comments:
If you want low-level control over which newlines trigger comments and which newlines don't, I don't think you can beat \n and \r for ease of use. If all you want is an all-or-nothing comment of the dumped text for Atlas-style script files (this might not be a relevant concept for other styles), I think the utility should be able to handle that: any informational lines the utility generates (e.g. "String: $00 - Length: $01 - Range: $000000 to $000001") it can comment itself; any time it outputs a newline due to table file translation (or its own artificial end token), it can check whether the string containing the newline has more (non newline) text to output and add comments only if the check returns true. Alternately, it can add comments for every newline that isn't part of an end token. I think that pretty much does what we want... am I missing anything? If not, thoughts on simply sacrificing low-level control?
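
For the record, the check amounts to something like this (Python; a rough sketch only, operating on an already-assembled dumped string with a hard-coded comment marker):

def comment_string(dumped, prefix='//'):
    out = prefix                             # the utility supplies the very first marker itself
    for i, ch in enumerate(dumped):
        out += ch
        if ch == '\n' and dumped[i + 1:].strip('\n'):
            out += prefix                    # more real text follows this newline: comment the next line
    return out

print(comment_string('line one\nline two<end>\n\n'))
# //line one
# //line two<end>
# (the trailing newlines are left bare for the translation to go under)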

Final Ruling Needed:
If we're taking a vote, I am weakly in favour of multiple tables per file, and definitely in favour of linked entry format strings. Do we bother hiding the sprintf specifics? I agree we'd want to support hex, decimal, and binary output, and I'd like the ability to specify an arbitrary number of format strings embedded in text (e.g. $hex=<window x=%2X, y=%2X>). Thinking of binary output, how much do we care about restricting the amount of data formatted to byte boundaries? If somebody wanted to say e.g. $hex=<window x=%X, y=%X, bg=%4b, text=%4b>, how much of a problem is that?

Insertion for Table Switching Details:
I've been thinking about this on-and-off, and have a couple of ideas, but none of them are exactly what you'd call fully-baked at the moment.
We also need to do something about the Dakuten example. Any objections to using something in @Dakuten like !7F=,-1 to represent a forced fallback / localized end token? It should be easy to handle when dumping. Insertion could be trickier, though. In cases where you've seen this behaviour in action, is it the fallback itself that triggers the end of Dakuten mark adding, or is the 7F actually mandatory? If it's mandatory, we'll need some way to let the insertion utility know that it needs to insert another token on fallback. Or maybe we can get away with assuming there will only ever be one such token per table and that it should be inserted whenever fallback from that table occurs, regardless of the token used to switch to it. Or only when the switch hex matches the fallback hex?
Title: Re: Table File Standard Discussion
Post by: Nightcrawler on August 08, 2011, 11:03:56 am
I hope you can muster up a little bit more for this last round of discussion because we're looking to close up shop on this. It's really starting to hinder development for TextAngel, which would be ready for testing if the table standard loose ends were closed up. I imagine it is also being a hindrance for Atlas as well. This was started in May 2010! :o

Comments:
I have thought about much of the same. I think it still makes the most sense to have the utility do this and get comments out of the table altogether. All-or-nothing from the utility is simplest and most straightforward, but I found that without those un-commented line breaks after strings, the script in general becomes less readable and, more importantly, relies on the translator to start insertion of text at the proper place, which is a bad idea. If I had to close the book on this today, I'd opt to add comments for every newline that wasn't part of an end token. That will do what we want. I just don't like how we go from simply commenting each string in the string list upon output (all-or-nothing) to now having to go down to the token level and distinguish line breaks that should be exempt. For that tiny addition to the feature, we have to do much more than all-or-nothing. I think it's still necessary though. In any event, I think we can close this out if we agree it belongs at the utility level. How and what the utility does is outside of that.

Final Ruling Needed:
From our past discussion on formatting, we wanted hex, decimal, and binary. Binary was a maybe. There was the possibility of support for 1-, 2-, 4-, and 8-byte parameters with potential for endianness. I think that's getting excessive. My thoughts are to restrict it to single byte only, in hex, decimal, or binary form. Personally, I'd like to keep it to as few features as practical because I either have to roll my own printf implementation or try to convert to .NET String.Format(). Both could get tricky for me if it's as full featured as printf really is. Also, this was born as a logical extension to grouping simple straight hex output on 'linked entries'. No need to build a new Cadillac. Just put the doors the old car needed on.  ;)

One question on this, especially if we were to go with your general game control code scheme. How is one to know how many parameters are defined if the syntax were "$hex=<window x=%2X, y=%2X>" or "$hex=<window x=%X, y=%X, bg=%4b, text=%4b>"? How about if you wanted to print a literal '%'? I imagine we should restrict usage of that character within the <>, as it's not worth the extra effort to support it.

Also, how would you do the opposite for insertion by picking out the parameters if the output is something like "<window x=$45, y=10, bg=01100110, text=0110000>"? With mixed types, there needs to be something in the output to distinguish the parameters. Should they be in quotes?


Insertion for Table Switching Details:
I agree with the whole !7F=,-1 forced fallback. It's highly needed. This can apply to hiragana/katakana switching applications as well. In nearly all instances I've ever seen this, fallback occurs both when 0x7F is hit or when the string ends. I've never seen any myself where it would be absolutely necessary for the 0x7F before the string end, or that persists between strings. String ends usually nullify everything regardless. I would proceed with that assumption unless someone could provide evidence to the contrary. Otherwise, I think I am confident that I have seen enough to say most cases should work like that.

I was also thinking: do we need a separate option for this? Perhaps we should just extend the "0" numberofmatches table switching option to also fall back upon encountering the table switch byte that switched? I suppose it is conceivable that you might need to explicitly fall back with some other value than the initial switch. Perhaps a start dakuten, end dakuten setup where the start and end values are different. If there is not a need for that, we can simply combine it into our existing functionality for "0" numberofmatches. Just a consideration.

As for the switching, I think we've almost got this. When we agreed to eliminate numberofmatches other than 0 and 1, and disallow using both at the same time with the same target, that makes things pretty straightforward, at least for the examples we had previously spoken about in this topic. The only thing remaining was the issues raw hex causes, I think.
Title: Re: Table File Standard Discussion
Post by: abw on August 11, 2011, 07:11:38 pm
Comments:
Yeah, this is definitely a utility-level concern. Checking the type of the last token in the string doesn't feel any worse than checking the type of each newline character in the string, but I may be biased since I'm storing everything as tokens anyway.

Final Ruling Needed:
New doors on an old car works for me :P.

The game-specific control code wouldn't change anything here, except for disallowing any other < or > characters inside the control code's own < and >.

For including literal %, why not go with the standard %%? That way you could just interpret the entire text sequence as your sprintf and pass it the next few bytes as arguments. You would need to know how many bytes the string consumes, of course; if we make every non-literal % consume 1 byte, that makes counting easy but splitting up bytes (e.g. "bg=%4b, text=%4b") impossible. Byte-splitting also gets tricky if you want the output formatted in decimal (e.g. how many bits does %4u need?). Does anybody actually care about sub-byte output? I only brought it up as something to consider. It's not quite as pretty, but you could achieve the same general effect with e.g. "(bg 4, text 4) = (%8b)".
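As a small sketch of the byte counting this enables (assuming, as above, that every non-literal % consumes exactly one byte and %% is a literal percent sign):

Code: [Select]
import re

def bytes_consumed(fmt):
    # count the format specifiers left over once literal "%%" pairs are stripped out
    return len(re.findall(r'%[^%]', fmt.replace('%%', '')))

print(bytes_consumed('<window x=%X, y=%X>'))       # 2
print(bytes_consumed('<100%% done, speed=%X>'))    # 1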

The insertion utility should be able to figure out how to convert ambiguous parameters based on the table entry. So if we had <window x=45, y=10> in the script and $FE=<window x=%2X, y=%u> in the table, we would know that we needed to write FE 45 0A. If the table had $FF=<window x=%2X, y=%2X> instead, we would know that we needed to write FF 45 10. That's why it's important to ignore printf parameters when checking for table entry uniqueness, since if we had both $FE and $FF entries in the same table, we wouldn't be able to tell how to reverse engineer the correct hex.
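A sketch of those mechanics (Python regexes assumed, and only the %2X and %u specifiers from the example are handled):

Code: [Select]
import re

def params_to_hex(entry_fmt, script_text):
    # the table entry's specifiers tell us how to read each script value back into a byte
    specs = re.findall(r'%(2X|u)', entry_fmt)
    pattern = re.escape(entry_fmt)
    pattern = pattern.replace(re.escape('%2X'), r'([0-9A-Fa-f]{1,2})')
    pattern = pattern.replace(re.escape('%u'), r'([0-9]+)')
    values = re.match(pattern, script_text).groups()
    return bytes(int(v, 16 if s == '2X' else 10) for s, v in zip(specs, values))

print(params_to_hex('<window x=%2X, y=%u>', '<window x=45, y=10>').hex())   # 450a -> prepend the entry's own FE to get FE 45 0A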

Insertion for Table Switching Details:
Reading it over again, I see that my question was poorly worded. What I meant to ask was whether fallback on any byte would trigger the end of dakuten marking, or whether only the closing 7F had that effect. So if @Dakuten did not contain an entry for 6F, and a 6F was encountered while outputting dakuten marks, would the game try to continue adding marks?

I think we should go with a separate option. In addition to the "same start/stop token" issue you raised, extending the "0" functionality like this would also mean that tables that don't have this on/off behaviour would be unable to contain an entry for any hex sequences that switch into them (and one table can still be switched into from multiple other tables). Pretend the @HIRA/@KATA/@KANJI example in 2.6 had an entry in @KANJI for F8 or F9 to see what I mean.

On a theoretical level, raw hex is an absolute mess. With all the additions made to the table file standard (and ignoring legacy scripts), is there any good reason for having raw hex in an insert script? In a perfect world, I would also like to deal with token mutation, but that's really more of a text engine design problem than a table file standard problem. If you've got something that works for the simplified table switch structure (i.e. only 1 or infinite matches, fewer choices of how to move from table A to table B), don't let me hold you back. I'll try to come up with something that makes me happy too, but my time for the next couple of weeks is mostly already spoken for.

As a related but possibly inconsequential aside, it seems a shame to remove support for dumping the more complicated table switch structures just because we haven't come up with a way to insert them. Given that insertion algorithms are outside the strict scope of the standard, maybe we can leave the more complicated version in the standard?
Title: Re: Table File Standard Discussion
Post by: Nightcrawler on August 15, 2011, 02:23:10 pm
Comments:
I am not storing strings as token lists at present, so a global newline operation is trivial, where moving to the token level requires more from me. Do you do anything else of interest that requires the string in list of token form? So far, with all the features (http://transcorp.parodius.com/forum/YaBB.pl?num=1273690996/14#14) I support, I have not needed it besides this type of token level formatting. I may move in that direction if I find it to be useful.

Quote
For including literal %, why not go with the standard %%? That way you could just interpret the entire text sequence as your sprintf and pass it the next few bytes as arguments. You would need to know how many bytes the string consumes, of course; if we make every non-literal % consume 1 byte, that makes counting easy

sprintf does not exist in .NET languages. So as I mentioned, I either need to make my own or try and convert to something similar (but different), which would be String.Format(). So, I will need to process the text sequence. For dumping, I was hoping to get away with simply searching for the allowed values (%d,%x,%b) and substituting the appropriate converted parameters. Simple. :) I'd rather not get into having to search for patterns, convert some parameters to multi-bytes, endian, padding etc. Complexity goes way up quickly due to my lack of native sprintf for little gain.

Quote
The insertion utility should be able to figure out how to convert ambiguous parameters based on the table entry. So if we had <window x=45, y=10> in the script and $FE=<window x=%2X, y=%u> in the table, we would know that we needed to write FE 45 0A. If the table had $FF=<window x=%2X, y=%2X> instead, we would know that we needed to write FF 45 10. That's why it's important to ignore printf parameters when checking for table entry uniqueness, since if we had both $FE and $FF entries in the same table, we wouldn't be able to tell how to reverse engineer the correct hex.

Concept is clear, but I'm a little unclear on the mechanics of this. You read in the text sequence "<window x=45, y=10>". How do you go about figuring out that it corresponds to your entry of "$FF=<window x=%2X, y=%2X>"? Would you want to expunge the parameter info from both and see if they match? I can do that with the table entry via the '%2X' removals, but how, from the text sequence, do I detect and differentiate the parameters from the rest of the string? How do I know that it's the '45' and '10' that need to be removed?

Insertion for Table Switching Details:
I think that treads into "undefined behavior of the text engine" territory. Behind the scenes, upon a dakuten switch, games will typically either actually draw the marks on the last drawn tile, or simply switch to another area to load already-marked characters. In the ones that I have seen, if you were to feed the game an undefined table value after it switched, it usually will print garbage and draw the marks, or just simply print a garbage tile, depending on its implementation scheme. I don't think I have seen any personally that would end the dakuten under any condition other than the closing control or end of string.

Quote
On a theoretical level, raw hex is an absolute mess. With all the additions made to the table file standard (and ignoring legacy scripts), is there any good reason for having raw hex in an insert script? In a perfect world, I would also like to deal with token mutation, but that's really more of a text engine design problem than a table file standard problem. If you've got something that works for the simplified table switch structure (i.e. only 1 or infinite matches, fewer choices of how to move from table A to table B), don't let me hold you back. I'll try to come up with something that makes me happy too, but my time for the next couple of weeks is mostly already spoken for.

Raw hex is there for undefined data. Imagine dumping a menu block or a heavy text block intermixed with code or commands. Theoretically, perhaps you could define all of the controls, commands, and rogue data in your table, but it's highly impractical and I doubt anybody does that. Perhaps you could take the time to identify all the text and make a pointer list for every last one. However, in practice, when people dump menu areas, they don't care what all that undefined data is, only that it gets put back in. It's much quicker to just dump a block of junk, edit the text you see, and stick the whole thing back in. This is not a problem with the free raw hex output current utilities handle. I think people would say it's probably absolutely critical to have raw hex ability for a variety of applications. Anyone else want to chime in?

Quote
As a related but possibly inconsequential aside, it seems a shame to remove support for dumping the more complicated table switch structures just because we haven't come up with a way to insert them. Given that insertion algorithms are outside the strict scope of the standard, maybe we can leave the more complicated version in the standard?

First, I don't believe we lose anything of value and instead simplify considerably. Don't you think it's a bad idea (and irresponsible) to release a standard without having come up with any known implementation of it? While the algorithm itself may be outside of the standard, the problem you're asking to be solved is in the standard. Let's take the extreme, perhaps it's nearly impossible to do what was asked. That's probably not true, but we know it's definitely not trivial. The solution, if even derived, would probably be the most complicated thing in the standard and probably beyond my ability. I do have to limit it to my own ability. As I've said before, I can't back something I cannot implement. What good is that? :P It makes me sad to know how out of reach it would be for anybody worse off than myself.  :(
Title: Re: Table File Standard Discussion
Post by: Klarth on August 15, 2011, 05:01:27 pm
I'd rather not get into having search for patterns, convert some parameters to multi-bytes, endian, padding etc.
Agree with no padding.  Multibyte could be useful with a lot of games.  Endian doesn't matter unless we do multibyte.

Quote
Concept is clear, but I'm a little unclear on the mechanics of this.
I'd be in favor of these strings having a standard-defined start and end token, such as [] or <>.  Possibly also removing non-identifier text.  So instead of $FF=<window x=%2X, y=%2X> you'd have $FF=window,%2X,%2X and get [window $DEAD $F00D]...you could even remove the % signs in that simplified format.  We should probably get some user input on this issue.

When the utility hits a start token, you'd search for a linked entry first (either a list or a "special" table entry with "[window " which works with longest entry insertion), and fall back to inserting "[" later.  I think I mentioned before that all games that use variable length parameters (or a variable number of parameters) would be unsupported due to limitations in linked values.  That bit of knowledge should be put into the standard if it isn't.
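A little sketch of that lookup, using the simplified $FF=window,%2X,%2X style from above (the two-byte parameters, the '$' prefix in the script, and big-endian byte order are all assumptions for the example):

Code: [Select]
CODES = {'window': (b'\xFF', ['%2X', '%2X'])}        # identifier -> (hex, parameter specs)

def try_control_code(script, pos):
    if script[pos] != '[':
        return None
    end = script.index(']', pos)
    name, *args = script[pos + 1:end].split()
    if name not in CODES:
        return None                                  # fall back to longest-match insertion
    code, specs = CODES[name]
    out = bytearray(code)
    for spec, arg in zip(specs, args):
        out += int(arg.lstrip('$'), 16).to_bytes(int(spec[1]), 'big')
    return bytes(out), end + 1

result, next_pos = try_control_code('[window $DEAD $F00D]', 0)
print(result.hex(), next_pos)                        # ffdeadf00d 20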

Quote
I think people would say it's probably absolutely critical to have raw hex ability for a variety of applications.
Raw hex should never exist in an ideal world as it is "undefined content".  Since this world is not ideal, we regularly make deals with the binary devil so we can remove hours of text engine analysis through a small assumption.  Indeed, ignorance is bliss and the devil very rarely collects.  It also saves a lot of time.  Time that could be better spent.
Title: Re: Table File Standard Discussion
Post by: Nightcrawler on August 24, 2011, 11:35:56 am
Agree with no padding.  Multibyte could be useful with a lot of games.  Endian doesn't matter unless we do multibyte.
How would you suggest the user define endian in the event we did allow multi-byte?

Quote
I'd be in favor of these strings having a standard-defined start and end token, such as [] or <>.

Agreed. I think we are going to go the way of abw's suggested game control code (http://www.romhacking.net/forum/index.php/topic,12644.msg185079.html#msg185079) idea. 'Linked Entries' would go away and be replaced by 'Game Control Codes', which would cover all non-normal entries (except maybe table switching), both with and without parameters. They would be required to start and end with [] or <>, and in turn those characters would not be allowed in normal entries. I'd probably choose []. This also solves many issues with raw hex, which would also be output with the chosen characters. This prevents combinations of normal entries accidentally turning into controls or hex, and makes it clear to keep/check uniqueness on all non-normal entries for unambiguous handling.

Quote
Possibly also removing non-identifier text.  So instead of $FF=<window x=%2X, y=%2X> you'd have $FF=window,%2X,%2X and get [window $DEAD $F00D]...you could even remove the % signs in that simplified format.

I like this approach. This simplifies it. I would be more open to supporting multi-byte parameters if we do it this way. I would also like to keep it as a single digit and cap multi-byte at a maximum of 9 so the parameter can be parsed one character per parameter property. In this way, perhaps endian could be added like  $FF=[window],2XE,2Xe where uppercase 'E' is big endian and lowercase 'e' is little endian for multi-byte parameters. Perhaps that gets a bit convoluted for the user?
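For what it's worth, parsing one of those one-character-per-property specs is trivial; a sketch (treating a missing third character as big endian, which is just an assumption):

Code: [Select]
def parse_param_spec(spec):
    length = int(spec[0])                               # 1-9 bytes
    base = {'X': 16, 'D': 10, 'B': 2}[spec[1].upper()]  # hex, decimal, or binary
    endian = 'little' if len(spec) > 2 and spec[2] == 'e' else 'big'
    return length, base, endian

print(parse_param_spec('2XE'))   # (2, 16, 'big')
print(parse_param_spec('2Xe'))   # (2, 16, 'little')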

The only thing we lose is embedding text around the parameters, such as the previous x,y window example. Parsing out parameters on insertion appears more difficult to me in that way though.

Quote
We should probably get some user input on this issue.
It would be nice to get some user input on several items, but I don't think we will. I've seen it before. The comments don't start coming until it's out there and they start using it. Unfortunately, that's often times too late. It's been around for a year now and only a small handful had anything to say.

Quote
I think I mentioned before, that all games that use variable length parameters (or variable number of parameters) would be unsupported due to limitations in linked values.  That bit of knowledge should be put into the standard if it isn't.

How about something crazy like this? "$FE=[window][FD],0X[FF]" For variable number of parameters the hex indication [FD] after window is the indicator that it's variable and what the hex is that ends it.  For variable length parameters, the '0' represents variable number of bytes in the parameter and the enclosed [FF] is the hex that ends it. It seems a bit convoluted, but probably only if you have both variable length parameters and a variable number of parameters. Otherwise, it's a small extension for each. Just a thought. I'm not sure how else you could try and shoehorn it into things.
Title: Re: Table File Standard Discussion
Post by: henke37 on August 26, 2011, 02:26:19 pm
Talking about control codes, what about control codes that have arguments that one would like to replace with a meaningful value instead of the raw value? I am mainly thinking of things where a value maps to a string, such as, say, the character name in what I call a speaker badge. Also, there are codes that involve pointers.

I hope that I don't look like an idiot by bringing up something already resolved, I have been trying to pay attention here, but it is difficult.
Title: Re: Table File Standard Discussion
Post by: Klarth on August 27, 2011, 12:25:06 am
I hope that I don't look like an idiot by bringing up something already resolved, I have been trying to pay attention here, but it is difficult.
I don't think we discussed this specifically, but it does merit some explanation of the proposed approach.  Say if you have a portrait control code of $F0 and the portrait id afterwards...you'd have to do something like this:
$F000=<Carl>
$F001=<Ricky>
$F002=<Forrest>

i.e. You'd have to explicitly define them.  The shorthand approach (what we're covering with "game control codes") is meant more for control codes that have little to no meaning to the story and thus can be treated as hex data.
Title: Re: Table File Standard Discussion
Post by: henke37 on August 27, 2011, 04:32:00 pm
I wonder if that would work well with more complex control codes.

I guess I just don't like the redundancy that is introduced when mapping the ids in arguments to lengthy commands.

Another thing I just realized: what about common text formatting controls? While no two game engines are alike, surely most of them have the usual formatting features, such as changing the text color and whatnot. Should these commonly occurring features be standardized?
Title: Re: Table File Standard Discussion
Post by: Tauwasser on August 27, 2011, 04:51:35 pm
Hi,

Sorry for my long absence, but real life grabbed me by the neck... So anyway, I re-read the thread from start to finish again and I'll share my thoughts with you:


Table matches with hex literals

I think we should count hex literals as matches for the current table in the event the table produced (or would have produced) the literal: That is to say,
  • an unlimited table would have counted an unknown byte as a fallback case and fallen back to the table below it,
  • If the stack is empty, or the table is not unlimited, the hex literal counts towards it in insertion direction.

This behavior mirrors dumping. It can produce cases that are not correct for the game's display mechanism, but since this case is ambiguous, either choice can. This would extend to table matches other than infinite and one match(es).

Restricting table matches to infinite or one match(es)

I think this is a good restriction. Even all the computer text encodings ever produced were flat within their code space. However, in that case, we should think about the following:


The problems we are facing are inherent to implementations that knew the restrictions on their input. We don't necessarily have that luxury. Notice that the last point also indicates that I think the kanji array problem is solvable, however, possibly not in linear or quadratic time. I'm not so much concerned about memory, as we happen to have a lot in machines built after 2005.

I'm in favor of restricting the current release of the standard and extending the syntax upon finding a suitable solution to a suitable problem that uses multiple matches within one table (other than unlimited matches), or different match ranges in the same table B reachable from some table A. So I would definitely be in favor of keeping the current tablename,№Matches syntax.

Direct table fallback

I'm in favor of direct table fallback using the !7F=,-1 syntax proposed. Not having it might cause a theoretically infinite stack of tables for an infinite given string. It's a pretty big oversight and I'm happy that somebody caught it.

Algorithm selection

The current draft does not contain anything pertaining to that. So I have to ask again: We are in favor of multiple allowed insertion algorithms along with compliance levels, right? With compliance levels I mean that any implementation of the standard must indicate for instance "TFS Level 1 compliance" for implementing "longest prefix insertion" and "TFS Level 2 compliance" for implementing "longest prefix insertion and A* optimal insertion" as well as not making up new compliance levels which might be added in a revision of the TFS.

Control Codes

First of all, yes, linked entries should be renamed to make their purpose clearer.
Also, getting an "insertion test string" for grouping input arguments is simple using regular expressions (or some form of linear substitution method of your choice). For that matter, it seems we should only allow one-byte arguments to avoid endianness issues, and we can simplify the arguments to %D, %X, %B for decimal, hexadecimal, and binary respectively.

<window x=%X y=%D> can be easily turned into a match string using the following substitutions (as well as substitutions to mark user-supplied {}[] etc. as literals):
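For instance, one plausible set of substitutions (a sketch only, assuming the one-byte %X / %D / %B parameters above; the exact value patterns are up to the implementation):

Code: [Select]
import re

def to_match_string(entry_text):
    pattern = re.escape(entry_text)
    pattern = pattern.replace(re.escape('%X'), r'[0-9A-Fa-f]{1,2}')
    pattern = pattern.replace(re.escape('%D'), r'[0-9]{1,3}')
    pattern = pattern.replace(re.escape('%B'), r'[01]{1,8}')
    return '^' + pattern + '$'

print(to_match_string('<window x=%X y=%D>'))
# ^<window\ x=[0-9A-Fa-f]{1,2}\ y=[0-9]{1,3}>$   (exact escaping depends on the engine)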


Notice that we do need to keep identifiers in the output, contrary to what was suggested. If we do not do that, we are open to the following exploits:

Code: [Select]
!7E=<window x=%X y=%X>
!7F=<window x=%D y=%D>

How do you parse <window x=10 y=10> without additional context information? You can't. Also, we might want to include signed decimal and signed hex.

Variable parameter list

First of all, I must admit that I have never ever seen a game use these. Then again, it's not impossible. The old syntax would have lent itself rather well to defining these:

Code: [Select]
!7F=<window>, 256, 00
The argument length would constitute a sensible maximum argument list limit, and the optional parameter after the second comma would indicate the list end identifier. Having said this, I'm not strongly in favor of nor strongly opposed to including variable argument lists in any form or fashion.

Regular expression table entries

I still think they are useful, however a burden to implement and make safe with the current control code set. Their main purpose was commenting, and that has been pushed to top-level utilities. For normal entries nothing is gained by this anymore. For control codes and table options, it might become a safety issue to blindly use user-controlled regex content upon output...

Multiple tables per file

Oppose because of the added complexity with almost no gain.

I hope I did not miss anything that was supposed to be addressed. Also, please mark this thread as new so other regulars get an email notification about changes.

cYa,

Tauwasser
Title: Re: Table File Standard Discussion
Post by: abw on August 29, 2011, 12:54:13 am
Do you do anything else of interest that requires the string in list of token form?
My primary motivation for maintaining token lists rather than the translated text/hex strings was a couple of edge cases in post-processing dump output. Even given their start and end addresses, detecting exactly where two strings overlap can only really be done at the hex level; in general, it cannot be determined from the translated text (at least not without running through the table translation/tokenization process again). Here's an extreme example: suppose pointer A points to the hex sequence "00 01 02 03 04", and we parse that according to table X as "0001 0203 04". Now suppose pointer B overlaps with pointer A's hex range, but starts one byte later, in the middle of what was a multibyte token for pointer A, and we parse it (also according to table X) as "0102 03 04". Suppose further that pointer C also overlaps with pointer A, starting at the same byte but according to table Y this time, and we parse it as "00 01 02 03" (no "04", since I've decided that "03" is an end token in my table Y).

Now, you may call that crazy, and I would agree with you, but it's still valid input to the utility, so I prefer to handle it rather than ignore it. I found that token lists allowed me to more elegantly handle cases like these, and since the translated strings are easily recoverable from a token list, using token lists involved no loss in functionality. There are a couple of other places where I take advantage of having access to the intermediate tokenization whenever I want it, but that's mostly about program design rather than functionality that couldn't be easily achieved otherwise.


Concept is clear, but I'm a little unclear on the mechanics of this.
Tauwasser pretty much covered my response to this :P. I'll add that the idea of replacing parameters by their corresponding regular expressions also takes care of cases where the insertion script might appear to be ambiguous (e.g. <window x=45, y=10> in the script and $FE=<window x=%X, y=10> in the table).


I think that treads in undefined behavior of the text engine territory.
Fair enough. What I'm trying to get at is where the extra power for these kinds of tokens needs to live, i.e. whether the fallback condition is determined by the opening or the closing byte. It sounds like it belongs on the closing byte, which probably makes our lives a little easier.


Don't you think it's a bad idea (and irresponsible) to release a standard without having come up with any known implementation of it?
Depends how I'm feeling :P. You raise many valid points here. On the one hand, this entire project is basically your one-man mission to bring order to chaos and make the world a better place (:thumbsup:), so I appreciate the personal aspect of it. On the other hand, it can also be viewed as a community project, and as such, ideally it shouldn't rely too heavily on any one person. I dunno. There are a lot of people on the internet with a lot of free time on their hands, and some of them are pretty smart. Does anyone have any connections with Japanese romhacking groups? They must have figured out a reasonable way to handle all this, right? If so, why re-invent the wheel?


I think I mentioned before, that all games that use variable length parameters (or variable number of parameters) would be unsupported due to limitations in linked values.  That bit of knowledge should be put into the standard if it isn't.
Agreed. So far what I'm hearing is that nobody knows of any game that actually uses this. My suspicion is that any such game would use different hex representations for the different functions, just like C++ uses different code blocks for similar methods with different full signatures, or assembly languages use different opcodes for similar operations with different number or lengths of operands.


They would be required to start and end with [] or <> and in turn those characters would not be allowed in normal entries. I'd probably choose [].
Going with [] instead of <> would also make things cleaner for XML-based scripts/utilities.


I wonder if that would work well with more complex control codes.

I guess I just don't like the redundancy that is introduced when mapping the ids in arguments to lengthy commands.
It gets kind of annoying, yeah. There are a bunch of ways the standard could be extended to make this work, though... One way would be if we allowed control codes to include matches from another table. Then you could have something like this:
@table1
$F0=<portrait: %name %expression>

@name
00=Carl
01=Ricky
02=Forrest

@expression
00=Happy
01=Grumpy
02=Sleepy
and the hex "F00201" would output "<portrait: Forrest Grumpy>". If we were to do such a thing, I think we'd want to restrict these kinds of lookup tables to contain only normal entries. Thoughts?
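If we went that route, the dumping side might look something like this (a sketch with the lookup tables hard-coded; a real utility would read them from @name and @expression in the table file):

Code: [Select]
NAME = {0x00: 'Carl', 0x01: 'Ricky', 0x02: 'Forrest'}
EXPRESSION = {0x00: 'Happy', 0x01: 'Grumpy', 0x02: 'Sleepy'}

def dump_portrait(data):
    # data holds the parameter bytes following the F0 control code
    return '<portrait: {} {}>'.format(NAME[data[0]], EXPRESSION[data[1]])

print(dump_portrait(bytes([0x02, 0x01])))   # <portrait: Forrest Grumpy>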


Another thing I just realized, what about common text formating controls? While no two game engines are alike, surely most of them have the usual formating features, such as changing the text color and what not. Should these commonly occurring features be standardized?
Possibly, but I think that's a separate issue. From the table file standard's perspective, these would probably show up as control codes, so you could call them whatever you wanted, e.g. "$CD=<red>".


Sorry for my long absence, but real life grabbed me by the neck...
Happens to us all from time to time, though hopefully not too literally!


Table matches with hex literals

I think we should count hex literals as matches for the current table in the event the table produced (or would have produced) the literal: That is to say,
  • an unlimited table would have counted an unknown byte as a fallback case and fallen back to the table below it,
  • If the stack is empty, or the table is not unlimited, the hex literal counts towards it in insertion direction.

This behavior mirrors dumping. It can produce cases that are not correct for the game's display mechanism, but since this case is ambiguous, either choice can. This would extend to table matches other than infinite and one match(es).
Assuming that table entries cannot produce output that looks like a hex literal, the only way for an unassociated literal to appear in dumper output is when the stack is empty. By that logic, mirroring dumping behaviour would mean that when a hex literal is encountered during insertion, we should immediately clear the stack, insert the hex literal, and resume tokenization with the starting table.


I'm in favor or restricting the current release of the standard and extend the syntax upon finding a suitable solution
Since it's been on my mind recently, there's also the approach the W3C takes (http://www.w3.org/2005/10/Process-20051014/tr#rec-advance), where a standard is released as a "Working Draft" and only makes its way up to a full "Recommendation" later, generally after many years and multiple successful implementations. I'm not saying I'm encouraging this approach, but a toned-down version might merit some consideration.


Control Codes
...
For that matter, it seems we should only allow one-byte arguments to avoid Endianess issues and we can simplify the arguments to %D, %X, %B for decimal, hexadecimal and binary respectively.
I agree with this, but think we should also include %% for literal %.


Notice that we do need to keep identifiers in the output, contrary to what was suggested. If we do not do that, we are open to the following exploits:

Code: [Select]
!7E=<window x=%X y=%X>
!7F=<window x=%D y=%D>

How do you parse <window x=10 y=10> without additional context information? You can't. Also, we might want to include signed decimal and signed hex.
Assuming that control codes are checked for uniqueness, and that control code parameters are handled correctly during that check, including !7E and !7F in the same table would make that table invalid. Contrary to what I said earlier, however, correctly handling control code parameters during uniqueness checks does not mean simply erasing them. If that were all we did, this would be an exploit:

Code: [Select]
!7E=<window x=%X y=%X>
!7F=<window x=%D y=10>

So in addition to erasing the parameters, we should also check for values that they can take on. Fortunately, these two checks can be combined in one expression (add/remove backslashes as appropriate for your expression engine):
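Something along these lines is what I have in mind (a sketch, not the exact expression): each parameter is replaced by a pattern matching either another parameter specifier or any value that parameter could take, and two entries conflict if either one's raw text matches the other's pattern.

Code: [Select]
import re

VALUE = {'X': r'[0-9A-Fa-f]+', 'D': r'[0-9]+', 'B': r'[01]+'}

def conflict_pattern(entry_text):
    pattern = re.escape(entry_text)
    for spec, values in VALUE.items():
        pattern = pattern.replace(re.escape('%' + spec), r'(?:%[XDB]|' + values + ')')
    return '^' + pattern + '$'

def conflicts(a, b):
    return bool(re.match(conflict_pattern(a), b) or re.match(conflict_pattern(b), a))

print(conflicts('<window x=%X y=%X>', '<window x=%D y=10>'))   # True  -> invalid table
print(conflicts('<window x=%X y=%X>', '<border size=%D>'))     # False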

As long as we have uniqueness modulo parameter values, we should always be able to determine the correct encoding. That said, including identifiers in the output is certainly another valid way of addressing the problem. My preference for not including them is based purely on aesthetics.
Title: Re: Table File Standard Discussion
Post by: Klarth on August 29, 2011, 01:07:03 pm
Am I the only person who doesn't post walls of text here?  :P

Code: [Select]
!7E=<window x=%X y=%X>
!7F=<window x=%D y=%D>

How do you parse <window x=10 y=10> without additional context information? You can't. Also, we might want to include signed decimal and signed hex.
I was under the impression that each control code identifier (i.e., window in this case) would be a unique identifier.  This means when your parser hits a "[", you then search for a matching ID.  If matched, parse it.  If failure, then fall back to longest match (or A*).  But yeah, if we add polymorphic properties to control codes, then the parameters must have prefixes.

Quote from: Tauwasser
Variable parameter list
First of all, I must admit that I have never ever seen a game use these. Then again, it's not impossible.
I haven't either, but it's quite plausible.  I think I would implement this on the utility end.  For instance, users of TableLib would create a new function in C (or possibly Python), then TableLib creates an internal tag so it knows to call the callback function, e.g. TableLib.AddCallback("F0", ParseVariableListF0);.  This approach indicates a custom dumper or a scriptable general utility.
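In Python the registration side of that might look roughly like this (AddCallback, the F0 code, and the FF terminator are all hypothetical, following the example above):

Code: [Select]
callbacks = {}

def add_callback(hex_code, func):
    callbacks[hex_code] = func

def parse_variable_list_f0(data, pos):
    # consume parameter bytes up to an assumed FF terminator; return text and new position
    end = data.index(0xFF, pos)
    return '[F0 {}]'.format(' '.join('%02X' % b for b in data[pos:end])), end + 1

add_callback(0xF0, parse_variable_list_f0)
text, next_pos = callbacks[0xF0](bytes([0x01, 0x02, 0xFF]), 0)
print(text, next_pos)   # [F0 01 02] 3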
Title: Re: Table File Standard Discussion
Post by: henke37 on August 31, 2011, 04:35:49 pm
Yeah, I agree with the idea of using a reference to a different table for the arguments in a control code. I am not sure about limiting it to "simple" matches, but I do see the reasons why and no reasons not to do it.
Title: Re: Table File Standard Discussion
Post by: Nightcrawler on September 09, 2011, 01:38:51 pm
Thanks to all for the input on the outstanding items. I'll work on making up a final draft subject to editing only. New or extended features not yet fleshed out, such as variable parameter lists and controls accessing external tables etc,. won't make it in. As with any project, at some point you have to feature freeze, polish, and release.
Title: Re: Table File Standard Discussion
Post by: abw on February 03, 2012, 11:44:30 pm
Good news - I now have a working multi-table insertion algorithm. It supports all the goodies from the current draft of the standard, including all non-negative switch counts *and* multiple switch tokens into the same table (i.e. none of the table switching functionality we were contemplating abandoning needed to be sacrificed), and it also prevents token mutation (in particular, it correctly handles all the "gotcha" examples I posted earlier). But wait, there's more! If you order now, you'll also receive output that is optimal with respect to hex length! All this can be yours for the low low price of 20 seconds / 200 kb script!

The only concession I had to make (in order to guarantee termination and to keep runtime within acceptable limits) was to restrict the number of consecutive switch tokens that are allowed. I've capped it at just under the length of the longest non-looping switch path, which means that my algorithm will fail for any string that actually requires 20,000 switch tokens between some two input characters. Anybody see a problem with that?

I've been working on implementing forced fallback support, but I don't exactly have a lot of experience with games that use that feature, so I'll need some more info before I can go much further with this. Nightcrawler's idea of having a switch token act as the "off" toggle makes me wonder: are these really fallbacks, or are they just switches? If they're actually switches, then we can treat them as such and we're fine; if they're still fallbacks, then I guess the rule would be to insert a forced fallback token whenever a no-match fallback would occur? That rule creates a couple of problems (e.g. what if we switch into a table containing a forced fallback token via some other switch that doesn't impose toggle semantics?), but I don't think it introduces anything that can't be solved by using additional tables. Along the same lines, I'm also not sure how forced fallbacks should interact with expired count fallbacks. I realize the remaining open sections of the standard are getting pretty far away from common usage, but hey, gotta try to fill in those blanks, right?

Anybody else have updates to share? :D
Title: Re: Table File Standard Discussion
Post by: Normmatt on February 04, 2012, 12:02:39 am
Does the latest table standard include support for variable length end tokens like the current "!83=<Color>,3" but as an end token?
Title: Re: Table File Standard Discussion
Post by: henke37 on February 04, 2012, 11:06:30 am
I am just curious, did this standard ever get finished?
Title: Re: Table File Standard Discussion
Post by: Nightcrawler on February 08, 2012, 10:32:45 am
I wrote about this in the last update at TransCorp (http://transcorp.parodius.com). And as always, the most current draft can be found here (http://transcorp.parodius.com/forum/YaBB.pl?num=1273691610). With that said, per the events mentioned in the update, I did not get to write that final draft with the resolutions of the previously outstanding issues. Then help was offered on some long standing ROMhacking.net changes which I have been working on since (I have to take help when I get it.). Since some interest has resumed here, I will see if I can get some time to finish that draft over the next few weeks.

I started refreshing myself on some of this already. I was pretty happy with the conclusion of the discussion. I don't think I am going to open it up again to any new feature discussion.

abw:
That's good to hear. Sounds like it's quite complex taking 20 seconds. Do you have any documentation on the algorithms used?

I thought we already covered why we can't treat the fallback as a switch. You either end up with the infinite switching scenario or a desynchronized table stack. In sheer game mechanics, this might not be any type of switch or fallback whatsoever. It could be just literally a flag that tells it to add a dot to the last tile written. It could be completely independent of the table stack. Or, it could actually be a complete fallback. Or, it could be an actual switch on some games, but if that's the case, you should use switches instead, as the table stacks would be in sync then. All cases are covered by adding the fallback mechanism previously discussed and retaining table switching.

Normmatt:

No, I don't believe so. This is the first time this has come up that I can think of. What game do you have that does this? Can I see sample hex of a few strings? That might be an easy extension to consider, but I have not seen this situation before and am unfamiliar with it. I don't see how an end token can have associated parameters. What do the parameters mean? It sounds like a case where you have several multi-byte end tokens that should be defined.
Title: Re: Table File Standard Discussion
Post by: henke37 on February 08, 2012, 07:25:17 pm
I finally read the standard. It is so simple that it is nearly impossible to screw up. I see no possible issues with the text as is.

But it lacks features that I would like to see. There is the previously mentioned control-codes-with-tables feature, but also something I wonder if it should even have: it does not provide any smarts for pointers. Those are largely game specific, yet I can't help but feel that automatic label resolving is closely related to this. Where should one draw the line? What I would like to see is a specification that deals with more things for the task of writing a dumper/inserter. But is this standard the right one?
Title: Re: Table File Standard Discussion
Post by: Nightcrawler on February 09, 2012, 09:39:16 am
But is this standard the right one?

Absolutely not, in my opinion. The table file standard only encompasses features directly contributing to mapping hex-to-text and text-to-hex. It's not a dumping or inserting standard beyond the requirements necessary to achieve uniform hex-to-text and text-to-hex mapping results. Pointers and/or data interpretation are entirely outside the scope. I believe abw and Klarth addressed your previous items (common codes spanning multiple games, meaningful values), explaining how to handle them and/or why they are out of scope for the hex/text mapping process defined here.

I believe what you are looking for is not within the scope of the table file. Perhaps a dumping/inserting standard is a great new endeavor for yourself to head up? I might suggest one approach be a secondary file containing a data interpretation dictionary for interpreting groups or patterns of hex data, and associated transformations (arithmetic, logic, labeling etc.) when dumping or inserting. A dumper/inserter could then use that dictionary in addition to the table for text encoding, and that might do it. As soon as you touch pointers though, you open up a whole world of hurt with hierarchies, trees, formats, and crazy transformations. I've got an entire program (TextAngel) that is basically designed just to help define this stuff in order to dump/insert. In a way you could say it just creates that dictionary and dumps/inserts according to that dictionary and table file. Point being, it's a vast and complicated thing to begin to standardize. We haven't even been able to standardize one simple specific item so far. I don't like our chances on something larger. :P
Title: Re: Table File Standard Discussion
Post by: abw on February 10, 2012, 10:40:31 pm
I finally read the standard. It is so simple that it is nearly impossible to screw up. I see no possible issues with the text as is.

Yeah, my reaction after reading the standard for the first time was along those lines too, except for a few minor typos that have already been cleaned up. It wasn't until I began using it as a spec for development that I really started noticing things. It covers the common cases pretty well, but there are some situations it allows that don't have completely obvious solutions, and some situations that seemed like the obvious solution could be improved upon.


All cases are covered by adding the fallback mechanism previously discussed and retaining table switching.

I don't think we actually stated the rules for when to insert fallback tokens... since the current approach doesn't tie switch tokens to fallback tokens, I think the only thing that makes sense is to have any fallback from a table containing a forced fallback token insert that fallback token, regardless of whether the table was switched into via a corresponding "on" token or not. In order for the fallback token to be recognized as a fallback token, it will have to count as a match towards NumberOfTableMatches in the fallback token's table; consequently any other table using e.g. !00=TableA,1 will insert 00 7F, i.e. a two-token no-op.


That's good to hear. Sounds like it's quite complex taking 20 seconds. Do you have any documentation on the algorithms used?

Well, the timing is both better and worse than that. Better in that I haven't spent any time optimizing it after finally getting it to run in sub-exponential time, which means there's still room for improvement. Worse in that the time I gave reflects only the text -> hex translation and doesn't include reading the text or writing the hex. Also, it's in perl, so a C implementation would be substantially faster :P.

As for the algorithm, I think I'm getting slow in my old age. I ended up using Tauwasser's A* idea as a base and then adding some optimizations. A* is just a glorified breadth-first search, so it's basically guaranteed to look at a whole lot of nodes before finding a solution, which makes it more expensive on average than the single-table insertion algorithm I was using. The problem I had with A* before came from considering the nodes as (token, pos)-tuples. If we expand them to (token, pos, stack)-tuples, then we have enough information to decide what new nodes can be added at each step without invalidating switch count conditions. It's also a nice algorithm to use since you can grow the parse tree as you work; having to list all the possible nodes can take infinite time, or maybe just exponential if you limit switch loops in some way. The primary optimization I made was to start pruning equivalent nodes from the tree (I decided two nodes were equivalent if they were at the same string position, had the same stack, and tokenized the most recent mutatable hex in the same way); each pruned node meant I didn't have to examine its exponentially many children. For the particular string I was using for debug, that optimization brought the maximum tree width from ~25000 nodes to ~10 nodes. Once the search finds a candidate tokenization for the entire string, I pass the candidate's hex off to the dumping algorithm to check for mutation. If the dumping algorithm returns text different from the candidate tokenization's, then the A* search continues, otherwise we're all done and can move on to the next string.
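To make the shape of that search a bit more concrete, here is a heavily simplified sketch (not the actual perl implementation; end tokens, hex literals, forced fallbacks, and the final re-dump mutation check are all left out). States are (position, table stack) pairs, the queue is ordered by hex length, and states already reached at least as cheaply are pruned.

Code: [Select]
import heapq, itertools

# normal[table] maps a text sequence to its hex bytes
# switch[table] lists (hex bytes, target table, matches); matches 0 means unlimited
def insert_string(text, normal, switch, start_table, max_stack=8):
    tie = itertools.count()                                 # avoids comparing states on ties
    heap = [(0, next(tie), (0, ((start_table, None),)), b'')]
    best = {}
    while heap:
        cost, _, (pos, stack), out = heapq.heappop(heap)
        if pos == len(text):
            return out                                      # first goal popped = fewest bytes
        if (pos, stack) in best and best[(pos, stack)] <= cost:
            continue                                        # pruned: state already seen cheaper
        best[(pos, stack)] = cost
        table, remaining = stack[-1]
        # 1. consume text with a normal entry from the current (top-of-stack) table
        for seq, hexes in normal.get(table, {}).items():
            if text.startswith(seq, pos):
                if remaining == 1:                          # match count expires: fall back
                    new_stack = stack[:-1]
                elif remaining is None:                     # unlimited: stay put
                    new_stack = stack
                else:
                    new_stack = stack[:-1] + ((table, remaining - 1),)
                heapq.heappush(heap, (cost + len(hexes), next(tie),
                                      (pos + len(seq), new_stack), out + hexes))
        # 2. emit a switch token (consumes no text, pushes a table); cap consecutive switches
        if len(stack) < max_stack:
            for hexes, target, matches in switch.get(table, []):
                new_stack = stack + ((target, matches if matches else None),)
                heapq.heappush(heap, (cost + len(hexes), next(tie),
                                      (pos, new_stack), out + hexes))
        # 3. fall back out of an unlimited table for free
        if remaining is None and len(stack) > 1:
            heapq.heappush(heap, (cost, next(tie), (pos, stack[:-1]), out))
    return None

# toy example: switching to 'kata' costs an extra FE byte, so the plain encoding wins
normal = {'main': {'A': b'\x00'}, 'kata': {'A': b'\x80'}}
switch = {'main': [(b'\xFE', 'kata', 1)]}
print(insert_string('AA', normal, switch, 'main'))          # b'\x00\x00'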
Title: Re: Table File Standard Discussion
Post by: rveach on March 11, 2012, 10:47:46 am
Has anyone released the source to a library/program that implements this new standard in C++ or Java? (at least past '2.4 End Tokens')

Edit:
Some things that I would like to add, after reading through some of the discussion and specs.

Character Map Support

Almost all games use some form of a character map to store their characters. Some are based on standards, others not. I have seen a NES game use ASCII. I have seen plenty of PSX games use SJIS and maybe ASCII. To cut down on manually typing out every character and hex code, and on user errors, I think we should allow an implementation of a standard/custom character map. Here is the basic idea.

format: XX+=character map

The thing that makes this unique from a common hex/text line, is the '+' before the equals.
'character map' is either a standard character map name (ASCII, UTF8, Shift-JIS, etc. (not case sensitive)) or a user-defined map that looks somewhat like a regular expression (Ex: [0-9A-Z a-z]). I'm not sure how to implement a Japanese or other-language version of a custom map, so I will leave that open to discussion. Maybe it would go based on UTF8 ordering, since that is what the table file will be in anyways. The character maps will only support printable characters, so \n may be left out.
'XX' is the starting hex value that the character map will use.

Examples:

So if we had a game with a standard ASCII character map, we could write:
00+=ASCII
This is like writing: "21=!", "22=\"", "23=#", ......, "30=0", "31=1", ...., "41=A", "42=B", .... etc

Now what if the game used ASCII, but it didn't start at 00, but instead 02?
02+=ASCII
This would be like writing: "23=!", "24=\"", "25=#", ......, "32=0", "33=1", ...., "43=A", "44=B", .... etc

Now this would save us from having to write 62+ lines.
As for overflow, I suggest we just cut off the high bytes created (or wrap around), so FF+1 = 00, not 0100.

Now what if we wanted to override one character that ASCII makes with another value? "Imaginary" lines generated by character maps are ignored if we specify, manually, a hex value that would overwrite it.
So:
00+=ASCII
30=Love
0D=[NL]\n
00=[ED]

Would have the normal ASCII characters, but there would be no hex value for '0' (so only numbers 1-9 would be allowed) and 'Love' would be mapped to 30h. 0D and 00 wouldn't be mapped anyways by ASCII since they aren't printable characters, so 0D and 00 in our table file would work like normal.

So the hex bytes: 30 31 32 33 00
Would print out the text: "Love123[ED]" instead of "0123[ED]"

Now for user maps, lets say our game uses some letters and numbers in a weird order that follows no standard and not all characters are used. The first character starts at 30h.
034265789wxyzabcj

We could write a custom map like so:
30+=[0342657-9w-za-cj]

Starting Link Table

As of right now, you can give your table a name with the '@' character or leave it empty. If the file has multiple switch tables, it may be unclear as to which one we should start extracting from (for my implementation, I am going to always assume it's the first table for now).

I think we should add something that signifies to an extractor/inserter which table is the default start, which could be overridden by the tool implementation.

normal format: @TableIDName
proposed format: @TableIDName,start

So if a table name has ",start" at its end, then it is the table the utility should start with when dumping or inserting.
Title: Re: Table File Standard Discussion
Post by: henke37 on March 13, 2012, 01:51:56 pm
That character map idea is great. But I have a small addition to it. Allow for using previously defined tables (no loops here!) in addition to the standard mappings.
Title: Re: Table File Standard Discussion
Post by: Nightcrawler on March 13, 2012, 03:25:23 pm
It sounds like recreating the table generator. How is this different from what some of these guys do (http://www.romhacking.net/?page=utilities&category=8&platform=&game=&author=&os=&level=&perpage=20&title=&desc=&utilsearch=Go)? I think you can already do all of that with ASCII and most with hiragana/katakana. We also have EUC, JIS, and S-JIS tables in the database. Such a thing is great for generating a table, maybe not so great for inclusion of the generator within the table file standard.

How do you propose something like this be implemented code wise?

Is defining creation of the map part of defining the map? I lean toward no, but it is an interesting thing to consider. Anybody else have an opinion on this?

Title: Re: Table File Standard Discussion
Post by: rveach on March 13, 2012, 04:15:49 pm
How is this different from what some of these guys do?

The same reason you made (2.5) Linked Entries and (2.6) Table Switching.
You could do those things with a (2.2) normal entry (excluding infinite loops), but it requires a LOT of writing depending on how complicated things get. So it is there to save the time of writing out each entry by hand, to avoid the user errors that could possibly result from manual typing, and to reduce the size of the table file.

For a game that uses SJIS, to have a complete table of that SJIS requires ~67 KB and ~6800 lines (I will show my file if you want to see it), and that doesn't include the game's custom codes. I only used ASCII in my examples above, for simplicity.

So allowing Character Maps would reduce the file size and make the table file faster to understand (if you're reading someone else's), while memory usage would probably only increase slightly. The only extra memory would be from making sure you don't overwrite anything you defined in the table specifically. The character map entries will be in memory regardless of whether they are manually entered or generated, so they don't add any extra memory.

Is defining creation of the map part of defining the map?

I'm not sure what you are saying here.
If you mean the 'user custom map', then it is defining the ordering and letters in the map on that one line.
If you mean the normal maps, then maybe look at my code below to see when the map is defined in memory.

How do you propose something like this be implemented code wise?

Well, there must be some way to generate a list of printable characters in a specific character map programmatically, but I haven't found a way with Google yet. If it's not possible, then it's up to the implementations to build the most popular character maps into their programs, or support multiple DLLs that will contain them, thus allowing users to add their own maps and add unpopular ones.

This is my first crack at how to implement the character maps (which may look far from perfect lol), but there may also be a better way:
Code: [Select]
import re

def expand_map(spec):
    # Expand a character map spec into a list of characters indexed by offset from the
    # starting byte; None marks positions that should not generate an entry.
    spec = spec.strip()
    if spec.upper() == 'ASCII':
        # assumption: only printable characters (0x21-0x7E) generate entries
        return [chr(c) if 0x21 <= c <= 0x7E else None for c in range(0x80)]
    m = re.match(r'^\[(.*)\]$', spec)
    if m:                                    # custom map, e.g. [0342657-9w-za-cj]
        chars, body, i = [], m.group(1), 0
        while i < len(body):
            if i + 2 < len(body) and body[i + 1] == '-':
                chars.extend(chr(c) for c in range(ord(body[i]), ord(body[i + 2]) + 1))
                i += 3
            else:
                chars.append(body[i])
                i += 1
        return chars
    raise ValueError('unsupported character map: ' + spec)

def parse_table(lines):
    maps = []                                # (starting byte, expanded map) per XX+= line
    manual = {}                              # hex values defined by ordinary XX= lines
    for line in lines:
        m = re.match(r'^([0-9A-Fa-f]{2})\+=(.*)$', line)
        if m:                                # character map line: starting byte + map
            maps.append((int(m.group(1), 16), expand_map(m.group(2))))
            continue
        # ** process line like normal ** (only two-digit hex sequences handled in this sketch)
        m = re.match(r'^([0-9A-Fa-f]{2})=(.*)$', line)
        if m:                                # manual entries always win over generated ones
            manual[int(m.group(1), 16)] = m.group(2)
    entries = dict(manual)
    for start, chars in maps:                # add generated entries that were not overridden
        for offset, ch in enumerate(chars):
            value = (start + offset) & 0xFF  # cut off / wrap around above FF
            if ch is not None and value not in manual:
                entries[value] = ch
    return entries
Title: Re: Table File Standard Discussion
Post by: henke37 on March 14, 2012, 06:27:59 pm
When in doubt, just include the standard mapping as a second table file that has to be bundled with the table reader.
Title: Re: Table File Standard Discussion
Post by: Nightcrawler on March 19, 2012, 11:19:17 am
I'm afraid this standard may meet a tragic and embarrassing end.  :-[

Apparently, on Feb 6th, the draft was somehow overwritten with an incorrect version on transcorp (probably from an FTP program hasty click fest on my part). It does not include the majority of my changes from the 10 June draft despite being marked as such. My current local copies were subsequently refreshed to match. So, now I don't actually have any copies with all of the changes (for some reason some are there) mentioned in this post. (http://transcorp.parodius.com/forum/YaBB.pl?num=1273691610/57#57). I have absolutely no desire to re-do all that work.

Does anyone else happen to have a copy of this from between June 2011 and Feb 6th, 2012?


EDIT: I found a copy from August 2011 from an old site backup and a copy on a USB stick from June 23rd, 2011. They still don't include the changes listed from June. It seems it may have been mis-uploaded right from the start in June. :'(
Title: Re: Table File Standard Discussion
Post by: henke37 on March 21, 2012, 01:19:56 pm
There is always the option of trying to write it again. You could even include new features.
Title: Re: Table File Standard Discussion
Post by: Nightcrawler on March 28, 2012, 02:11:00 pm
Having nothing was unacceptable. Nine days of hell gave birth to a draft that is ready for final review and edit.

The new draft contains all features previously discussed and rules accounting for nearly all edge cases presented in this topic!

List of Changes (http://transcorp.parodius.com/forum/YaBB.pl?num=1273691610/45)

The only things that did NOT make it in:

These items either came too late, had details that were not fleshed out, or added undesirable complication. In light of taking 2 years to flesh out what we have, the standard remains frozen and no new features will be considered at this time. Sorry.

Notes:

1. [] for Non-Normal Entries
[] were chosen for this as they were the popular choice in this topic. However, consideration should be made for <>,{},«», or other pairing of characters (remember they are disallowed in normal entries). This is easily changeable if anybody has an opinion one way or the other. For some reason, I am drawn to {} more than [].

2. Token Insertion Mutation (from hex literal or in careful insertion)
The only edge case not fully addressed is token mutation. This is an issue specific to what happens to a stream of binary output after attempted insertion. I don't believe this can be fully addressed in the table standard. The standard can only reach as far as setting some rules or guidelines to eliminate these situations from occurring. It has done so for all possible cases that we properly addressed (raw hex and others). Because token mutation is possible even within a single valid table (example at bottom of this post (http://www.romhacking.net/forum/index.php/topic,12644.15.html)), it doesn't seem possible to address all possible instances. If you have suggested passages to add to the standard that might better address this, I'd be happy to consider inclusion; otherwise, I believe it to be addressed as best as it can be within the reach of the standard.
Title: Re: Table File Standard Discussion
Post by: FAST6191 on May 02, 2012, 10:02:48 pm
I did not miss it but forgot to post something

Still, I read it several times, so thanks for that; it refreshed me on a few concepts I had more or less dismissed in recent years. (Kuten/Handakuten and Yoon/Youon, or diacritics in general, are usually kicked to extra font entries, and although I am no stranger to multiple encodings per game/file/script, switching out for something as simple as the kana is not something I tend to see any more. I actually cannot recall the last time I saw 3.3 Example 3, although I guess I also cannot remember the last time I saw true word-level dictionary compression either, save for some curious situations in LZ.) I also pondered how I might apply it to some certainly-not-common situations, but situations I have encountered nonetheless. They will mostly be from the DS, Wii, and maybe parts of the 360, which I guess is my main source of hacking work and probably why I had not seen the things above in a while (memory management is still a thing, but not half as aggressive as it had to be for the 8 and 16 bit era). For the most part consider this a +1 with some pointless waffle from me. I have no real problems or need for solutions to the things below and would be quite happy to see this implemented, beyond wondering if it is worth having a right-to-left reading flag (we can probably ignore tategaki though and kick it to a control code, and fairly safely assume boustrophedon (alternating directions) is a thing of the very distant past), as Arabic, Hebrew, and similar languages are getting a few translations nowadays.

I have reservations about some of the insertion side of things, but that is always going to be the case, and I agree this is probably a good way to serve the most people at once. Some games support several encoding types and are quite fussy about replacing like with like, even in the same line. Zombie Daisuki for the DS, for example, used parsed plain text files with some line numbering but switched between ASCII and shiftJIS, often without ASCII or U16 fallback options (shiftJIS implementations on the DS more often than not miss out that part of the spec; it is better nowadays, but earlier on it was cause for a minor celebration to have a game include the Roman characters as part of the actual shiftJIS section), meaning the collision workarounds might not be ideal. Also, there might be an issue with some things having multiple end tokens (usually for line end (line breaks were different), section end, part end and maybe file end), but that can probably be worked around as well.

Starting with the crazy silly thing.
Riz Zoward/Wizard of Oz beyond the yellow brick road, where I was helping out with an English to Spanish translation (US version of the ROM as the base).
Some text in it was fixed length, other text was standard pointered text, and the third and most interesting part was damn near scripting-engine grade.
In short, each entry was a type code a few bytes long, then a length value covering the type, the length and the payload (if any), and then the payload itself if there was one (think cheat encoding methods). The payload consisted of text and control characters of various forms as well as calls to various 2d and 3d image files, and presumably a bit more than that (I did not take reverse engineering the format that far after I figured out roughly what went where). The type codes and lengths probably could have amounted to fixed length entries, which might help things, but I think I mainly mentioned this to be annoying.

Also possibly related http://blog.delroth.net/2011/06/reverse-engineering-a-wii-game-script-interpreter-part-1/

Related to the above: in that game, prior to the characters being named, other names were used, with the option of them being replaced from a name entry screen (not sure how the game eventually did it), where traditionally we might have seen placeholders. I might be able to abuse control codes, but a "read only" flag of some form might be an idea; then again, that might just be me being lazy, and half of this project seems to be about avoiding extra cruft, so ignore that.

More recently I was pulling apart Atelier Lina ~The Alchemist of Strahl, although I have seen the same thing before in other files/formats.
Text was shiftJIS, but each section was bounded by a hex number starting from 0000 and counting up from there. Short sample:
Code: [Select]
?????

錬金術の極意

踏ん張り

フラムっぽい物

みんな頑張れっ!

テラフラムっぽい物

どんぐりメテオ

It does not take too long to hit actual characters and, perhaps more troubling, control codes, which in this case were not 8 bit. I have seen several games with hard 16 bit characters and 8 bit control codes (or similarly non-multiple-of-16-bit placeholders) that troubled more basic parsers, but that is probably not where this is heading.

Games using XML-esque structures: I certainly see XML in games far more often, but that is usually something that escaped the cleanup process before building, or belongs to unrelated files. It can probably be ignored, as those games tend to be using known encodings and are not really the domain of table files.

Games using square brackets as escape/control code/placeholder value indicators: a simple workaround, I guess, but one I might have to think about (if for no other reason than that I would sub in the corner* or curly brackets).

Use of the yen symbol*: I guess I could always do Alt+0165 (must remember the 0), or otherwise define/bind something, or, if I am truly lazy, copy and paste, but it does not tend to appear on a European keyboard and I am lazy.

*it is not lost on me.

There was probably more I meant to say, but it appears to be nearing 3am (again).

Title: Re: Table File Standard Discussion
Post by: abw on December 13, 2015, 03:21:00 pm
Well, it looks like I'm super late to the party, but in the fine tradition of this particular thread, I do come bearing a wall of text :D (actually more than one wall, since it seems there's a 15000 character limit on post length). Recently I've been working on updating my extraction/insertion utility based on updates to the Table File Standard (TFS) between June 2011 and March 2012 (that's still the current version, right?). After going through the March 2012 document in detail, I've compiled a list of things in the TFS that I believe are incorrect or unclear. I'll mostly skip over spelling, grammatical, and formatting issues except where they affect understandability of the TFS. Much of this is going to be nit-picking details, so I apologize in advance if the "constructive" gets buried under the "criticism"; I'm just trying to help! As always, thanks for all your hard work :).


1.2.paragraph 1: Maybe say ".tbl file extension" instead of "TBL file extension"? Are we actually requiring table files to follow any particular naming conventions?

2.1: I'm assuming "text-to-text" is a typo and should actually say "text-to-hex".

2.2.1.2/2.2.2: The TFS is pretty clear about only supporting "Longest Prefix" for hex-to-text translation, but there are many different algorithms available for text-to-hex translation. My understanding is that what we're concerned about here isn't really TFS compliance (since that's covered by 2.2.1.4), but letting the end user know what behaviour they should expect from text-to-hex translation, and I think these aren't the right words to use for doing that. As the end user of an insertion utility that implements the TFS, the things I would be most concerned about in terms of text-to-hex translation are (a) whether the inserted bytes are going to be correctly interpreted by the game's text engine as the text I want to see displayed, and (b) whether as few bytes as possible will be inserted to achieve that correctness.

With that in mind, how about replacing this section of the TFS with requiring some sort of documentation on the text-to-hex translation algorithm(s) that the utility implements, preferably including correctness and optimality conditions? It doesn't have to be a novel, but seeing something in the utility's documentation like this:
Quote
This utility implements a Longest Prefix insertion algorithm, which guarantees correct text-to-hex translation based on the provided table files as long as the following conditions are satisfied:
   - all table entries are contained in a single table; and
   - no table entry's hex sequence is a prefix of any other table entry's hex sequence; and
   - for each character used in normal table entries, a table entry exists which maps some hex sequence to that single character;
and at least one of the following conditions is satisfied:
   - the text to be translated does not contain raw hex bytes; or
   - the hex sequence of every table entry represents a single byte.
It also guarantees the smallest possible hex length of any correct text-to-hex translation as long as the following additional conditions are satisfied:
   - the hex sequence of every table entry is the same length; and
   - the text sequence of every normal table entry is no more than 2 characters long.
or this:
Quote
This utility implements an A* path-finding insertion algorithm, which guarantees correct text-to-hex translation based on the provided table files and guarantees the smallest possible hex length of any correct text-to-hex translation.
would be pretty great, right?
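For the record, when I say "Longest Prefix insertion algorithm" here, the core loop I have in mind is roughly this (a hypothetical sketch only: one table, normal entries, no raw hex, no table switching):
Code: [Select]
def longest_prefix_insert(text, table):
    """Greedy longest-prefix text-to-hex; 'table' maps text sequences to bytes."""
    out, i = bytearray(), 0
    while i < len(text):
        # pick the longest text sequence that matches at position i
        best = None
        for seq in table:
            if text.startswith(seq, i) and (best is None or len(seq) > len(best)):
                best = seq
        if best is None:
            raise ValueError(f"no table entry matches text at offset {i}: {text[i:i+10]!r}")
        out += table[best]
        i += len(best)
    return bytes(out)

# e.g. with a toy table
table = {"a": b"\x00", "b": b"\x01", "ab": b"\x10"}
print(longest_prefix_insert("aab", table).hex())  # 0010 - 'ab' wins over 'a'+'b'
The documentation bullets above are basically just spelling out the conditions under which a greedy loop like this is guaranteed to produce hex the game will read back as the same text, and when its output is as short as possible.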

I think it would also be useful for utility authors to note how their utility handles situations not fully defined by the standard (we'll see a few examples of those below; maybe it would be worth adding a section about theoretically possible scenarios which have never been observed "in the wild"?).

2.2.1.4: Another approach would be to let utilities list which parts are implemented and which parts are not, which would, for example, let people claim partial compliance for otherwise useful utilities that handle everything except multi-table insertion or that need to break a bit of the defined behaviour in order to deal with some weird game's crazy coding (e.g. game X doesn't actually use a Longest Prefix text engine, game Y doesn't exhibit stack behaviour for mid-string encoding changes, etc.). As long as the utility still respects the syntax of valid table files, there's probably not too much harm in an approach like that. And as long as the game an end user is working on doesn't require the missing features, the end user will probably be just as happy either way.

2.5.Label.1/2.5.Label.3: Since there's a new Star Wars movie coming out next week, I'll misquote Yoda: "must" or "must not"; there is no "should" :p.

2.5.Label.3: There appears to be some confusion here about how a Label is defined. If Labels can only contain the characters [0-9A-Za-z], then the Label itself is only the text between '[' and ']' and basically every other statement the TFS makes about Labels contradicts the examples which use Labels.

2.5.1: There's nothing here that actually specifies how to represent "$hexadecimal sequence=[label],parameter1,parameter2" in text. From the examples, it seems that the rule is something like converting commas to spaces in the text sequence, replacing placeholders with their values (while being careful with parameter text like "%%D"), moving the ] from the end of [label] to the end of the text sequence, and then moving any \n to the end of the text sequence and replacing each \n with a newline, but that rule is never stated and the examples aren't quite consistent with it:
2.5.1.Example 1: Where does the \n belong in the text sequence? 2.5.Label.5 and part of 2.6 suggest that the newlines should go between the Label and its following ']', resulting in a text sequence of "[keypress\n]" instead of "[keypress]\n".
2.5.1.Example 2: Where did that '$' come from? There's nothing in the TFS which indicates %X placeholder values can or must be prefixed with '$'.

Also, when replacing placeholders with their numeric values, the TFS should address the issue of leading zeroes. We're explicit about the hex sequence part of a table entry being 2 characters per byte, which implies displaying leading zeroes there, but I don't see anything that enforces that for %B, %D, or %X placeholder values. We should also be explicit about the expected behaviour here. How about making leading zeroes mandatory for %B and %X and optional for %D?

2.5.2.Example 1: The explanation here is not strictly correct: if 0xFF is encountered as part of another token (e.g. 0xFF13 or 0xDEFF), we don't output "[END]\n\n", since that would violate the Longest Prefix extraction specified by 2.10. This wording issue also occurs in other examples.
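Toy illustration of why the wording matters (hypothetical table and code): with an FF13 entry in play, a 0xFF byte only dumps as "[END]\n\n" when nothing longer matches at that position.
Code: [Select]
table = {b"\xff": "[END]\n\n", b"\xff\x13": "[Portrait]", b"\xde\xff": "[Pause]"}

def dump(data):
    """Longest Prefix hex-to-text extraction over a hypothetical table."""
    out, i = [], 0
    while i < len(data):
        best = None
        for hex_seq in table:
            if data.startswith(hex_seq, i) and (best is None or len(hex_seq) > len(best)):
                best = hex_seq
        if best is None:
            out.append(f"<${data[i]:02X}>")  # unmapped byte -> raw hex
            i += 1
        else:
            out.append(table[best])
            i += len(best)
    return "".join(out)

print(repr(dump(b"\xff\x13\xde\xff\xff")))  # '[Portrait][Pause][END]\n\n'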

2.5.3.Rules.2: Have we really decided to include this restriction? I'm not sure what value it adds, especially since it can be trivially circumvented by e.g. making N copies of the same table with different Table IDs and then setting up your main table like:
Quote
C1=[Kanji1],1
C2=[Kanji2],2
C3=[Kanji3],3
If we get rid of this rule (which I think we should), we'll also need to update 2.5.3.Table Switch Format.Notes.2, which also effectively prevents a source table from switching to a destination table via multiple entries.

2.5.3.Table ID: As far as I can tell, the TFS still allows multiple @TableIDString lines in a file. Since we've killed off support for multiple logical tables in a single table file, we want there to be at most one @TableIDString line per file, right? Or are we supporting multiple IDs for the same table file?

2.5.3.Table ID: This section states that "The TableIDString can contain any characters except for ','.", but 2.5.3.Table Switch Format.Notes.2 implies that only tables whose ID is composed entirely of the characters [0-9A-Za-z] are able to be switched into, which means that tables whose ID contains any other characters can only be used as starting tables. Is that intentional?

2.5.3.Table ID: We should specify that the TableIDString only needs to be unique across all tables provided to the utility, not e.g. across all tables on Data Crystal (http://datacrystal.romhacking.net/wiki/Category:TBL_Files) or something like that.

2.5.3.TableID: I don't see anything that prevents a table from "switching" to itself; should e.g.
Quote
@HIRA
!F7=[HIRA],2
be considered an error?

2.5.3.NumberOfMatches.-1: Does this type of entry in one table imply a corresponding entry in another table? e.g. is
Quote
@NORMAL
!7F=[Dakuten],-1
@Dakuten
7F=foo
an error? Would it be possible to also have e.g. !7E=[Dakuten],5 (maybe in some table other than NORMAL) and match 7F=foo in Dakuten?

Closely related to the above point, it's not clear whether the closing token counts as a match in the pre-fallback table or in the post-fallback table. E.g., for
Quote
@NORMAL
01=foo
!23=[Kanji],2
@Kanji
01=bar
!7F=[Dakuten],-1
@Dakuten
02=baz
with input 0x23 0x7F 0x02 0x7F 0x01, which of these scenarios (if any) is correct?
Quote
0x23 in NORMAL: switch to Kanji for 2 matches
0x7F in Kanji, match #1: switch to Dakuten until 0x7F
0x02 in Dakuten: "baz"
0x7F in Dakuten: fall back to Kanji
0x01 in Kanji, match #2: "bar"
result: "bazbar"
Quote
0x23 in NORMAL: switch to Kanji for 2 matches
0x7F in Kanji, match #1: switch to Dakuten until 0x7F
0x02 in Dakuten: "baz"
no match in Dakuten: fall back to Kanji
0x7F in Kanji, match #2: returning from Dakuten
made 2 matches in Kanji: fall back to NORMAL
0x01 in NORMAL: "foo"
result: "bazfoo"

2.5.3.NumberOfMatches.-1: Further exploring the behaviour of these forced fallback entries, what happens for
Quote
@table1
!00=[table2],-1
02=a
@table2
!01=[table3],-1
02=b
@table3
00=c
02=d
with input 0x00 0x01 0x00 0x02? Under the current TFS wording, that second 0x00 triggers fallback all the way to table1 and the output is "a", but I feel like we should be expecting 0x00 to match in table3 and have "cd" as our output.

2.5.3.NumberOfMatches.X: The TFS never really defines the term "match", but in this case the precise meaning becomes more important: do bytes which are dumped as control code parameters count as separate matches towards X? I think we decided earlier in this thread that they did not (i.e. that the control code together with all of its parameter bytes count as a single match), but that decision doesn't seem to have made its way into the TFS.

2.9: It might be worth noting that this behaviour is algorithm dependent; e.g. it's possible for Longest Prefix insert to back itself into a corner and fail on input that other algorithms would succeed on.
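A concrete (made-up) example of that corner, with a quick comparison against an algorithm that backtracks:
Code: [Select]
# Hypothetical table where greedy Longest Prefix insertion dead-ends on "abc".
table = {"ab": b"\x00", "a": b"\x01", "bc": b"\x02"}

def greedy(text):
    out, i = bytearray(), 0
    while i < len(text):
        best = max((s for s in table if text.startswith(s, i)), key=len, default=None)
        if best is None:
            return None            # greedy painted itself into a corner
        out += table[best]
        i += len(best)
    return bytes(out)

def backtracking(text):
    if not text:
        return b""
    for s in sorted(table, key=len, reverse=True):
        if text.startswith(s):
            rest = backtracking(text[len(s):])
            if rest is not None:
                return table[s] + rest
    return None

print(greedy("abc"))              # None - 'ab' leaves an unencodable 'c'
print(backtracking("abc").hex())  # 0102 - 'a' + 'bc' succeeds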

2.11: Going back to my points about 2.2.1.2/2.2.1.4/2.2.2, it would be nice to see some kind of correctness condition here, i.e. that the hex produced by the utility when inserting text A must be extracted as text A by the Longest Prefix extraction algorithm from 2.10.

2.12.Duplicate Hex Sequences: Maybe add a note here to confirm that having the same hex sequence occur in different tables is okay.

2.12.Unrecognized Line or Invalid Syntax: I'd like to propose a slight extension to the TFS: any line which begins with the character '#' must be ignored during parsing. This would allow for comments inside table files, which would be very useful for end users, and comes at a negligible cost to utility authors.
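e.g. (hypothetical helper) the entire cost to a parser is one extra check per line:
Code: [Select]
def table_lines(lines):
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith("#"):
            continue          # proposed: comment line, ignored entirely
        yield line            # everything else goes to the normal entry parser

sample = ["# main font, bank 0\n", "00=A\n", "01=B\n"]
print(list(table_lines(sample)))   # ['00=A', '01=B']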

2.13.Duplicate Text Sequences: It looks like this rule is left over from when Longest Prefix was the only insertion algorithm that had been considered. Now that the floor is open to other algorithms, this rule should be removed (or at least reduced to a suggestion for anyone wanting to implement Longest Prefix), since it can make legitimate text sequences impossible to insert correctly. As an example,
Quote
01=test
02=foo
0001=test
0102=bar
the hex sequence 0x00 0x01 0x02 is dumped as "testfoo", but enforcing this rule would result in "testfoo" being inserted as 0x01 0x02, which the Longest Prefix hex-to-text algorithm would translate to "bar". Smarter algorithms not bound by this rule would be able to use the different options for encoding "test" to find a tokenization that would not be misinterpreted by the dumping algorithm.
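Spelling that round trip out (hypothetical code; pretend the rule forced the inserter to keep only the 01=test mapping):
Code: [Select]
dump_table = {b"\x01": "test", b"\x02": "foo", b"\x00\x01": "test", b"\x01\x02": "bar"}
# insert side with the duplicate 0001=test entry dropped, as the rule would force
insert_table = {"test": b"\x01", "foo": b"\x02", "bar": b"\x01\x02"}

def dump(data):
    out, i = [], 0
    while i < len(data):
        best = max((h for h in dump_table if data.startswith(h, i)), key=len)
        out.append(dump_table[best])
        i += len(best)
    return "".join(out)

def insert(text):
    out, i = bytearray(), 0
    while i < len(text):
        best = max((s for s in insert_table if text.startswith(s, i)), key=len)
        out += insert_table[best]
        i += len(best)
    return bytes(out)

original = bytes([0x00, 0x01, 0x02])
script = dump(original)                  # 'testfoo' (0001=test, 02=foo)
data = insert(script)                    # 0x01 0x02
print(script, data.hex(), dump(data))    # testfoo 0102 bar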

2.13.Blank Text Sequences: Does anyone have a use case for this? I'm having trouble coming up with a good reason why anyone would ever want this.

I'll see if I can include my comments on sections 3+ in a separate post. Edit: nope, auto-merge killed that idea :p.
Title: Re: Table File Standard Discussion
Post by: abw on December 21, 2015, 09:12:40 pm
(TFS Version 1.0 Draft feedback part 2, including some new things from earlier sections)

2.5.General Rules: 2.4.Rules.5 specifies that, for normal entries, whitespace is allowed in the text sequence only, but there is no such rule for non-normal entries.

2.5.Label.1: Assuming we do revert the change to restrict the number of switches from one table to another in 2.5.3.Rules.2, this uniqueness condition should only apply to end and control tokens.

2.5.Label.5/2.6/2.13.Duplicate Text Sequences: It looks like using \n to control newlines has now been restricted to non-normal entries (should this actually be just end tokens and control codes? Having newlines in switch tokens doesn't make much sense), but the example in 2.13.Duplicate Text Sequences still shows \n in a normal entry.

3.0: It might be nice to provide some additional guidance on best practices for table file setup for new users. Even simple things like making sure your game's control codes are represented as control codes (so you don't pull a Lufia 2 and encode "Maximum" as "[HeroName]um") and being careful about using the same Label for control codes in different tables (e.g. maybe you want "[END]" to be an end token no matter what table the inserter happens to be in at the time, but the game treats "[HeroName]" in one table a little differently from "[HeroName]" in another table) could save inexperienced users some headaches.

As a general note, showing some hex input would greatly improve the quality of the examples in this section.

3.1: This example's already pretty clear, but using "3001=DictEntry1" etc. instead of "3001=Entry1" etc. for the normal entries would make the correlation between entries expressed in a single table vs. split across two tables with switching even more clear.

3.2.Example 3: This example contains multiple errors. Table 1 should say "!7F=[Dakuten],-1" (misplaced comma), Table 2 won't actually output 'が' unless the next token is 0x60, and it's not quite clear that the 7F triggering fallback gets consumed and doesn't end up triggering a new switch from NORMAL to Dakuten.

3.3.Example 3: This should say "!C0=[KanjiTableID],3" (missing [ and ])

3.4.Example 2: There are a few more errors here: assuming the top part of the explanation is right and this is really a forced fallback example, Table 1 should say "!E0=[KATAKANA],-1" (missing [ and ], -1 NumberOfMatches instead of 0) and the bottom explanation should be corrected ('ア' won't actually be output until a 0x30 token is matched, falling back from KATAKANA to HIRAGANA on E0 due to no-match instead of forced fallback would definitely result in E0 being matched in HIRAGANA, switching us right back to KATAKANA again).

4.2.paragraph 4/4.3.3/4.3.4: These need to be updated based on the new 2.5.1.Parameters/2.5.1.Placeholders content. "Linked Entries" don't exist anymore, they're now called e.g. "Control Codes with Parameters".

5.0: I like that there are definitions in the TFS, but the terms used here don't match up with the terms used throughout the rest of the document; e.g. after reading this section, people will know what "Hex Value" means, but not what "hexadecimal sequence" or "hex sequence" mean.

5.0.Duplicate: Does this definition add anything to the generally accepted definition of "duplicate"? If so, "value" should be defined lest it be misinterpreted as e.g. "character".
Title: Re: Table File Standard Discussion
Post by: abw on March 21, 2018, 10:55:36 pm
For anyone who's still watching this thread, you might be interested in checking out abcde (http://www.romhacking.net/utilities/1392/), which implements something that definitely isn't the Table File Standard but does cover many of the issues raised during these discussions; if you do, I'd love to get some feedback, via the Personal Projects page (http://www.romhacking.net/forum/index.php?topic=25968.0) or otherwise.