Author Topic: Table File Standard Discussion  (Read 19450 times)

Klarth

  • Sr. Member
  • ****
  • Posts: 423
  • Location: Pittsburgh
    • View Profile
Re: Table File Standard Discussion
« Reply #20 on: May 28, 2011, 09:22:36 pm »
Normally I wouldn't check out ScriptCrunch because you aren't supposed to use the dictionary approach for DTE, but with that strange savings boost, it makes it worthwhile.  I'll have to try to figure out that size bug too.  Thanks.  I've been working on (and mostly off) a frontend for ScriptCrunch so it's not as ridiculously cumbersome to configure.

Function calls / parameter bytes isn't much better terminology, but at least it has some meaning.  On the string format, I'd consider limiting formatting support to printing 1-, 2-, 3-, and 4-byte (...maybe 8-byte too) arguments as hex, decimal, and maybe binary.  Perhaps endian swap support.  No strings/characters, because the purpose of linked values is to output hex so that non-text data doesn't get put through the table.

abw

  • Newbie
  • *
  • Posts: 42
    • View Profile
Re: Table File Standard Discussion
« Reply #21 on: May 30, 2011, 11:14:27 pm »
First off, @Klarth:
Interestingly, the resulting Dict table produced an encoding 4% smaller than that of the DTE table, according to Atlas.
This part turns out to be directly attributable to user error (I forgot to strip out the existing DTE from the table I gave to ScriptCrunch, which resulted in the table it gave back to me having two different DTE sections for the same byte range. Atlas was then quite happy to make use of both DTE sections when inserting, which is where the compression gain came from). Apologies for bringing up a red herring :-[.

While on the topic of insertion algorithms... how are you handling table switching? I've been thinking more about this, and in the general case it presents an even larger can of worms than I was anticipating.

Since I'm the guy that introduced this, I can only say that ― yes ― it does pose a problem. However, my naïve solution was to use ordered data sets and do a longest-prefix match over all tables. You can still do this with A*, where the path cost is adapted to table switching as well, i.e. the cost of reaching an entry in a switched-to table from another table is the cost in bytes of the entry in the new table plus the cost in bytes of the switching code in the old table. This, of course, just needs some bookkeeping to know, for each explored node, which table it belongs to and how many matches that table has left before switching back.
Tokenizing over all tables (accessible, directly or indirectly, from the starting table) and making tokens remember which table they came from are both pretty easy. I think there are more subtle problems with the cost function for A*, however, and with keeping track of table switch conditions in general.
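
To make the cost bookkeeping concrete, here's a rough Python sketch of the kind of search being described (really Dijkstra, i.e. A* with a zero heuristic). The structures are hypothetical: tables maps a table ID to {text token: hex bytes}, switches maps a table ID to a list of (switch hex, target table, match count), and the match-count bookkeeping is left out for brevity.

import heapq

def cheapest_encoding(text, tables, switches, start_table):
    # A search node is (position in text, current table); the path cost is the
    # number of bytes emitted so far, so the cheapest path is the shortest encoding.
    frontier = [(0, 0, start_table, b"")]
    settled = {}
    while frontier:
        cost, pos, table, emitted = heapq.heappop(frontier)
        if pos == len(text):
            return emitted
        if settled.get((pos, table), float("inf")) <= cost:
            continue
        settled[(pos, table)] = cost
        # Matching a token in the current table costs that token's hex length.
        for tok, hx in tables[table].items():
            if text.startswith(tok, pos):
                heapq.heappush(frontier, (cost + len(hx), pos + len(tok), table, emitted + hx))
        # Switching tables first pays for the switch code in the old table,
        # then for the token matched in the new table.
        for switch_hex, target, _count in switches.get(table, []):
            for tok, hx in tables[target].items():
                if text.startswith(tok, pos):
                    heapq.heappush(frontier, (cost + len(switch_hex) + len(hx),
                                              pos + len(tok), target,
                                              emitted + switch_hex + hx))
    return None  # no encoding exists

The subtle problems show up once match counts and fallback conditions have to be folded into the node state as well.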

NB: I'm talking here about operations permitted by the standard (and which thus must be handled by any insertion utility claiming to implement the standard), regardless of the existence of any real-world examples.

The text hacking process (as we all know) is essentially hex_1 -> text_1 -> text_2 -> hex_2 -> text_3, where in-game hex_1 is dumped to text_1 in one (or more) external file(s), text_1 becomes text_2 after some (possibly empty) set of modifications, text_2 is inserted as hex_2, and the new hex_2 is displayed in-game as text_3. We've mentioned a couple of ways that the hex_1 => hex_2 conversion process can result in a different hex sequence being inserted than was originally dumped, and how the display of in-game text is the guiding principle for determining whether such variation is acceptable, i.e. text_2 and text_3 must be the same text. In order to achieve this goal, our text -> hex insertion process must mirror the game's hex -> text display process, which by assumption is identical to our hex -> text dumping process.
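
Put another way, the acceptance criterion is a round-trip property. A minimal sketch, assuming hypothetical insert() and dump() functions standing in for any compliant utility pair:

def roundtrip_ok(text_2, insert, dump):
    # text_2 is the edited script; the insertion is acceptable iff re-dumping the
    # inserted hex reproduces it exactly (text_2 == text_3), even when the
    # inserted hex_2 differs from the originally dumped hex_1.
    hex_2 = insert(text_2)
    text_3 = dump(hex_2)
    return text_3 == text_2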

This problem would be much easier if every group of tables had the ability to switch amongst themselves with a 0 count. Instead, the only path from table A to table B may necessitate switching through table C, and each switch brings with it certain conditions that must be fulfilled by subsequent tokens. These conditions come in three types:
- to match a specified number of tokens in the new table and then return to matching tokens in the current table;
- to match an unspecified number of tokens in the new table and only return to matching tokens in the current table when it is no longer possible to match another token in the new table; or
- to have the switch conditions for all tables prematurely satisfied by reading an end token.

When dumping hex -> text, an unassociated hexadecimal literal can only be output when it is no longer possible to match the next token in any of the tables contained in the table stack at that point in the dumping process. For insertion, then, an unassociated hexadecimal literal should be taken as a signal to empty the table stack and continue matching tokens with the starting table. Unfortunately, hexadecimal literals are essentially wildcards. As is the case with linked tokens lacking their literals, unassociated literals can combine with subsequently inserted hex in unintended ways, some of which may produce valid tokens in some table on the stack. Even without that danger, they cannot reliably be assigned to any specific table, which means they cannot reliably count as a match in any specific table, which in turn makes it much more difficult to determine whether all switch conditions have been satisfied.
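
For what it's worth, here is a rough sketch of my reading of that dump-side rule (hypothetical structures; match counts and linked entries omitted, and the stack lists the active tables with the current table last):

def next_dump_output(data, pos, stack, tables):
    # Try the innermost table first, then fall back through the stack; a raw
    # "<$XX>" literal is only emitted once every table on the stack has failed.
    for table_id in reversed(stack):
        for hx, text in tables[table_id].items():
            if data.startswith(hx, pos):
                return text, pos + len(hx), table_id
    return "<$%02X>" % data[pos], pos + 1, None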

Maybe there's a way I'm just not seeing, but put together, I think these factors break any A* heuristic. My single-table insertion algorithm doesn't appear to fare any better. My backup plan, alas, is much more complicated (build a CFG out of all the table switch entries, build a DPDA from the CFG to parse the text, and remember where in that parsing the switch tokens occurred so we can output them in hex) and I'm not sure it will even work (the language of table switch tokens appears at first glance to be non-deterministic). Even if it does, I don't see optimality in less than O(2^n).

Somebody mentioned making the end user do all the hard work?

Nightcrawler

  • Hero Member
  • *****
  • Posts: 5788
    • View Profile
    • Nightcrawler's Translation Corporation
Re: Table File Standard Discussion
« Reply #22 on: May 31, 2011, 02:45:51 pm »
Quote
1) This would have to be sacrificed, and the resulting complications for testing are lamentable. All other things being equal, I agree that simplicity is to be preferred, but in this case all other things are not equal.
2) The longest text sequence rule for insertion nicely mirrors the longest hex sequence rule for dumping, so you also lose some parallelism :P. In the interests of accessibility, I'd definitely keep longest prefix as a suggested algorithm, since it is easy to explain and understand, but make a note that other algorithms are possible, and that the same text should be output by the video game no matter which algorithm is used.
... except that that doesn't cover cases where longest prefix fails to insert otherwise insertable text. On that note, the "ignore and continue" rule of 2.2.2 further complicates the description of 2.2.4 when different algorithms are allowed. What's the reasoning behind "ignore and continue"? I think I prefer Atlas' "give up and die" approach :P.

Ok, I think I can agree to resolve this by doing the following:

1. Amend 2.2.4 to state that text collisions should be resolved by a suitable algorithm. The example will show longest text sequence as the simplest, suggested algorithm. A note will mention that other, more intelligent algorithms (such as A*) may be used for optimal collision resolution.

2. Amend 2.2.2 to state an error should be generated if no match is found for a text sequence. I don't recall anymore why ignoring was chosen. I don't have an argument at this time for not throwing an error.

Quote
Putting all of that together would make $ the general in-game control code identifier I was shooting for earlier with #. Salient changes would be:
  • adding a note in 2.2 forbidding < and > in normal entry text sequences;
  • changing "endtoken" (2.4), "label" (2.5), and "TableID" (2.6) to "<label>" where label can contain any character not in [,<>];
  • adding a note about uniqueness of all non-normal entries;
  • explaining how format strings work in 2.5 (each gets replaced with value of next byte) and 2.6 (translate next characters in new table).
There's probably a cleaner way to handle 2.6, since "table2,6" looks nicer than "<table2 %2X %2X %2X %2X %2X %2X>", "table2,0" is perhaps clearer than "<table2>", and it doesn't matter much anyway if table switch tokens aren't output.

I see the possible benefits of parameter inclusion for Linked Entries. A formatting string would probably be some nice icing on that cake. I do not understand what exactly you are proposing for the other items, nor what the advantages would be. Your previous post on in-game control identifiers with # made a little more sense, but now I am further confused.

Table Switching Insertion

First, I believe this is in the utility realm and need not be mentioned in the standard other than as a passing application/implementation suggestion or note.

I see two simple solutions here for myself.

1. The dumper outputs a table switch marker. This is simplest, but probably undesirable. It will really hurt readability of the dump output if you use it for Kanji and Kana.

2. We pretty much do the same process as we do for dumping, with the table jump inferred by token lookup instead of explicitly from hex.  We then determine a suitable switch path when a switch is detected. Let's look at the example from the standard (2.6) for insertion.

We're in table 1, our starting table.
We see ア. Not in current table.
We find that in table 2, can we get to table 2 from table 1?
Yes, we find a switch token (infinite matches) in table 1 that goes to table 2. Insert 0xF8 followed by 0x00.
Now we're in table 2 (table 1 is on the stack).
We see イ. Found in current table, output 0x01.
We see ウ. Found in current table, output 0x02.
We see 意. Not in current table.
We find that in table 3. Can we get from table 2 to table 3?
Yes, we found a switch token (infinite matches) to table 3 in the current table. Output 0xF9 and 0x01.
We're currently in table 3 (table 2 and 1 are on the stack).
We see <PlayerName>.
Not in current table. We find that in table 1. Can we switch from table 3 to table 1?
No, there is no switch available. No further match is possible, this table is expired. Fall back to table 2.
Yes, there is a switch available in table 2 to table 1. Output 0xF8 0x03. Finished.

Alternatively, logic could be added to check and see that table 1 was already on the stack and we can fall back there for optimal result without the extra 0xF8 in there. Both would give equivalent game text output.
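
Here's a rough Python sketch of that walkthrough, for illustration only. It assumes the text has already been tokenized, treats every switch as an infinite-match switch, and ignores the match-count and raw-hex complications discussed elsewhere in the thread; tables[id] maps text tokens to hex bytes and switches[id] maps reachable table IDs to the switch control bytes (all names hypothetical).

def insert_with_switching(tokens, tables, switches, start_table):
    out = bytearray()
    current, stack = start_table, []
    for tok in tokens:
        while True:
            if tok in tables[current]:
                out += tables[current][tok]
                break
            # Not in the current table: is there a direct switch to a table that has it?
            target = next((t for t, hx in switches.get(current, {}).items()
                           if tok in tables[t]), None)
            if target is not None:
                out += switches[current][target]     # emit the switch control bytes
                stack.append(current)
                current = target
                continue
            if stack:                                 # table expired, fall back
                current = stack.pop()
                continue
            raise ValueError("no path to a table containing %r" % tok)
    return bytes(out)

On the example above it makes the same moves: emit the token if the current table has it, otherwise switch and push if a directly reachable table has it, otherwise pop and retry. The "check the stack first" refinement would slot in just before the switch search.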

As far as possible issues with linked entries and raw hex go: if we move to include the parameter bytes with linked entries, that leaves us only with the raw hex problem. For raw hex to be dumped, it must not have been found in any table on the current stack at the time of dumping (no entry found causes the table to fall back under all conditions). If so, you shouldn't have a case where raw hex is involved in any table switching operations, because it will only occur in the starting table. The only way raw hex can get involved in a table switching operation is when the user explicitly stuck it in the script post-dump. If that occurs, it's probably best to just insert it and carry on with normal insertion. It should not count toward matches. I'm unsure whether or not it should cause the table to fall back. It'll probably screw things up and produce undesired output if it does fall back. However, it seems more logical from an operational point of view to have it fall back.
TransCorp - Over 15 years of community dedication.
Dual Orb 2, Wozz, Emerald Dragon, Tenshi No Uta, Herakles IV SFC/SNES Translations

abw

  • Newbie
  • *
  • Posts: 42
    • View Profile
Re: Table File Standard Discussion
« Reply #23 on: June 01, 2011, 11:44:08 pm »
Ok, I think I can agree to resolve this by doing the following:
Works for me :).

Your previous post on in-game control identifiers with # made a little more sense, but now I am further confused.
Basically, I'm proposing reserving "<" and ">" for exclusive and mandatory use as the opening and closing delimiters of every entry that isn't a normal entry (i.e. any entry starting with "/", "$", or "!"). I am further proposing that it should be an error for a table to contain multiple entries (regardless of type) containing "<foo>" for any value of "foo", and that "foo" may not have the form "$XX" for any two-character hexadecimal sequence "XX". While I was at it, I figured I might as well include format strings, since that idea seems good and dovetails nicely with having reserved delimiters. The rest was just bookkeeping highlights.

We already reserve "<" and ">" as delimiters for hexadecimal literals, so really, this is just carrying the idea through to the next step. It's like how you're not allowed to have a literal "<" as data inside an xml tag.

Benefits:
- integrity of the standard;
- raw hex insertion priority is no longer an issue, conceptually or programmatically, since no sequence of table entries will be able to generate "<$XX>" when dumping;
- similarly, parsing of all in-game control codes becomes absolutely unambiguous during insertion (at least in single-table insertion scenarios);
- anyone working with dumped text will immediately be able to identify in-game control codes (i.e. we're enforcing a best practice);
- format strings further disambiguate parsing while increasing output flexibility and restoring "," as a usable character in linked and table switch entries.

Costs:
- "<" and ">" are no longer able to be used in normal entries, or as anything other than the first and last character, respectively, of non-normal entries;
- we lose the ability to have entries terminated with newlines. So maybe we make an exception and allow newlines after ">";
- checking for uniqueness marginally increases complexity;
- in order to be most effective, uniqueness should be applied across all tables used during dumping/inserting. This would also marginally increase complexity, and may be undesirable if people really want to use the exact same text sequence in multiple tables (e.g. they want "<end>" to be output regardless of the table in which the end token appeared; this may perhaps have some benefit in multi-table insertion scenarios);
- format strings also increase complexity;
- somebody would have to write all these changes into the standard (I listed some of the most important changes).

I think the benefits outweigh the costs, but that's just my opinion. At the very least, it's worth presenting for consideration.

Table Switching Insertion
First, I believe this is in the utility realm and need not be mentioned in the standard other than passing application/implementation suggestion or note.
Ah, the easy way out :P.

1. The dumper outputs a table switch marker. This is simplest, but probably undesirable. It will really hurt readability of the dump output if you use it for Kanji and Kana.
It also doesn't work in general. If I dump a sequence of Kanji and then stick some Kana in the middle, the old table switch markers won't be appropriate for insertion anymore.

2. We pretty much do the same process as we do for dumping, with the table jump inferred by token lookup instead of explicitly from hex.  We then determine a suitable switch path when a switch is detected. Let's look at the example from the standard (2.6) for insertion.
As multi-table insertion scenarios go, 2.6 is fairly simple, since you can move from any table to any other table (either explicitly for tables 1 and 2 or through fallback for table 3) whenever you need to and stay in the new table for as long as you need to. The only penalty for making a "wrong" table switch is inserting more hex than required. Here's a (very) slightly more convoluted scenario:
@tableA
00=A
01=D
!C1=tableB,1
!C3=tableB,3

@tableB
C0=AB
C1=AC
Insertion string: ABACD

In this case, you're faced with a choice right away, and this time, guessing wrong means you can't insert the string :(.
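
For illustration, a tiny brute-force search over this made-up example (all values hypothetical) shows why: a greedy choice of switch entry can paint the inserter into a corner, while an exhaustive search still finds the one encoding that works.

TABLE_A = {"A": b"\x00", "D": b"\x01"}
TABLE_B = {"AB": b"\xC0", "AC": b"\xC1"}
SWITCHES = {b"\xC1": 1, b"\xC3": 3}   # switch hex -> required number of tableB matches

def encode(text):
    def from_a(rest):
        if not rest:
            return b""
        for tok, hx in TABLE_A.items():                  # stay in tableA
            if rest.startswith(tok):
                tail = from_a(rest[len(tok):])
                if tail is not None:
                    return hx + tail
        for switch_hex, count in SWITCHES.items():        # or switch into tableB
            tail = in_b(rest, count)
            if tail is not None:
                return switch_hex + tail
        return None
    def in_b(rest, count):
        if count == 0:                                     # switch condition satisfied
            return from_a(rest)
        for tok, hx in TABLE_B.items():                    # must match `count` more tokens here
            if rest.startswith(tok):
                tail = in_b(rest[len(tok):], count - 1)
                if tail is not None:
                    return hx + tail
        return None
    return from_a(text)

print(encode("ABACD").hex(" "))   # c1 c0 c1 c1 01 -- the C3 (3-match) path dead-ends

A greedy inserter that happened to commit to !C3 first would fail outright, since only two tableB tokens fit before the D.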

If [manual raw hex] occurs, it's probably best to just insert it and carry on with normal insertion. It should not count toward matches.
We can define whatever behaviour we want, but the true test is how the game handles situations like this. Aside from the possibility of being interpreted as (part of) a valid token when a modified table switch path does not follow the original path, there's still some danger in cases where changes have been made to the text engine. My feeling is that if somebody's making changes to the text engine, they should know enough to be careful with raw hex, but if the utility is going to be responsible for choosing the table switch path (which I think it has to be), it should also be responsible for checking for raw hex suddenly becoming (part of) a valid token, which in turn would affect the game's match count.

Nightcrawler

  • Hero Member
  • *****
  • Posts: 5788
    • View Profile
    • Nightcrawler's Translation Corporation
Re: Table File Standard Discussion
« Reply #24 on: June 02, 2011, 01:59:00 pm »
Basically, I'm proposing reserving "<" and ">" for exclusive and mandatory use as the opening and closing delimiters of every entry that isn't a normal entry (i.e. any entry starting with "/", "$", or "!"). I am further proposing that it should be an error for a table to contain multiple entries (regardless of type) containing "<foo>" for any value of "foo", and that "foo" may not have the form "$XX" for any two-character hexadecimal sequence "XX". While I was at it, I figured I might as well include format strings, since that idea seems good and dovetails nicely with having reserved delimiters. The rest was just bookkeeping highlights.

Couple of issues with this:

1. There are many in-game control codes that I currently use just a normal entry for. Let's take keypress as an example: $FD=<kp>. You want to disallow that? Or is that allowed since '<' and '>' are the first and last characters? What if I then define a linked entry or end token as <kp> for some reason? Are you saying it should be unique across normal entries, linked entries, and end tokens? Certainly different hex sequences should be able to yield the same text sequence, though.

One other logical line of thinking might be that it should instead be a linked entry with zero parameters and not a normal entry to begin with. This essentially forces you (if you want to use < or >, as is typical for your controls) to make all in-game controls linked entries regardless of parameters.

2. I'm not sure whether users might object to being locked into using '<' and '>' for all their controls, although I understand we've already done that to them with raw hex. I guess I would hesitate to push further in that direction without feedback or support from others. It would simplify several matters, though, as pointed out.

3. I don't think this works well when applied to table switching. You want the @TableIDString line and then the TableID in the switch entry in the format <XXXX>? That seems a bit uglier, and as you already pointed out probably wouldn't lend itself to formatting strings.

I might consider this if it did not apply to table switches and it is agreeable to others. I never envisioned the TableID actually being used as a text token, but rather simply an internal identifier. I don't think we will need to output the table switch at all.

Note that along these lines I am considering pushing newlines and end tokens entirely out to the utility realm (the non-hex variant goes away entirely; the hex variant becomes a normal entry identified as an end token at the utility level). I have been having some PM discussions with Klarth about this. If we did that, that would leave us with only linked entries and table switches as non-normal entries.

Quote
As multi-table insertion scenarios go, 2.6 is fairly simple, since you can move from any table to any other table (either explicitly for tables 1 and 2 or through fallback for table 3) whenever you need to and stay in the new table for as long as you need to. The only penalty for making a "wrong" table switch is inserting more hex than required. Here's a (very) slightly more convoluted scenario:

In this case, you're faced with a choice right away, and this time, guessing wrong means you can't insert the string :(.

I see some of the pitfalls with the available choices. What really is the problem here? It seems to be the variable number of matches that throws a monkey wrench into things. To that I say: do we need it? Hear me out. Personally, for every table switching operation I have ever seen, it is either a single match, an indefinite match until end of string/not-found value, or a match until another switch is encountered. So I ask: do we need the ability to define a number of matches other than one or infinite? It was originally born into the standard due to the Kanji Array feature from ROMjuice's readme that is in a few games.  A second review of this makes me want to view it more as data compression rather than token conversion. That's the only thing I know of that needs this ability.

If we can omit that, doesn't our job become much more manageable? The number of matches becomes 0 (infinite) or 1. That makes the path choosing much easier and more manageable. My simple logic with longest hex should then work without much issue. One could even add a rule saying infinite matches take precedence over a single match in the event that two switches exist in the same table to the same destination table. Then your example is much easier to handle appropriately.

So, let's assume now your table switches are changed to reflect this:

!C1=tableB,1
!C3=tableB,0

Assume tableA start.
We see 'ABACD'
Longest match we can find out of that is 'AB'. It's in tableB. Can we get to TableB?
Yes. Infinite is preferred over a match of 1, so output 0xC3 0xC0 (if not, you would look for the next longest match and try again until a valid match/path is found or the options are exhausted).
Longest match we can find is 'AC'. We're in tableB so just output 0xC1.
We see 'D'. Not in TableB, fall back to Table A. Found in Table A, output 0x01.
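
If it helps, the tie-break could be as simple as this (hypothetical switch records of the form (switch hex, target table, match count), with 0 meaning infinite):

def pick_switch(candidates):
    # Prefer an infinite (count 0) switch; among equals, prefer the shorter switch code.
    return min(candidates, key=lambda s: (s[2] != 0, len(s[0])))

to_tableB = [(b"\xC1", "tableB", 1), (b"\xC3", "tableB", 0)]
assert pick_switch(to_tableB) == (b"\xC3", "tableB", 0)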

Thoughts?

Can you think of any real world scenarios where you would need to define a number of matches other than 1 or infinite? If not, we might have a winning change here.

Quote
We can define whatever behaviour we want, but the true test is how the game handles situations like this. Aside from the possibility of being interpreted as (part of) a valid token when a modified table switch path does not follow the original path, there's still some danger in cases where changes have been made to the text engine. My feeling is that if somebody's making changes to the text engine, they should know enough to be careful with raw hex, but if the utility is going to be responsible for choosing the table switch path (which I think it has to be), it should also be responsible for checking for raw hex suddenly becoming (part of) a valid token, which in turn would affect the game's match count.

Not following. How would the raw hex suddenly become part of a valid token?
TransCorp - Over 15 years of community dedication.
Dual Orb 2, Wozz, Emerald Dragon, Tenshi No Uta, Herakles IV SFC/SNES Translations

henke37

  • Sr. Member
  • ****
  • Posts: 370
  • Location: Sweden
    • View Profile
Re: Table File Standard Discussion
« Reply #25 on: June 03, 2011, 11:43:23 am »
What if I simply want to have a literal less than sign in my text?

abw

  • Newbie
  • *
  • Posts: 42
    • View Profile
Re: Table File Standard Discussion
« Reply #26 on: June 03, 2011, 09:59:07 pm »
One other logical line of thinking might be that it should instead be a linked entry with zero parameters and not a normal entry to begin with. This essentially forces you (if you want to use < or >, as is typical for your controls) to make all in-game controls linked entries regardless of parameters.
Basically this, but I was thinking of it in reverse: a linked entry is just an in-game control code with > 0 parameters.

  • $FD=<kp> would be fine
  • $FD=kp would be an error (non-normal entry not enclosed in <>)
  • FD=<kp> would be an error (not allowed to have < or > in a normal entry)
  • $FD=<k>p> would be an error (not allowed to have < or > on the inside of a non-normal entry)
  • $FD=<> and $FD=<%2X> would both be errors (must have some non-format text between < and >)

  • The combination of $FD=<kp> and $FE=<kp> would be an error, as would $FD=<kp> and /FE=<kp> (duplicate control codes)
  • The combination of $FD=<kp> and $FE=<kp %2X> would not be an error, nor would FD=kp and FE=kp

People are still free to define in-game control codes as normal entries if they want (FD=kp), it's just a bad idea since they could be parsed in unexpected ways during insertion.
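
To show how mechanical the checks would be, here's a rough validator for those rules (entirely hypothetical code; it assumes entries have already been parsed into a prefix of "", "$", "/", or "!", a hex sequence, and a text sequence, and it guesses at the format-string syntax):

import re

HEX_LITERAL = re.compile(r"<\$[0-9A-Fa-f]{2}>")
FORMAT_SPEC = re.compile(r"%\d*[XxDdBb]")

def check_entries(entries):
    seen = set()
    for prefix, hexseq, text in entries:
        if prefix == "":
            if "<" in text or ">" in text:
                raise ValueError("%s: '<'/'>' not allowed in a normal entry" % hexseq)
            continue
        if not (text.startswith("<") and text.endswith(">")):
            raise ValueError("%s%s: non-normal entry must be enclosed in <>" % (prefix, hexseq))
        inner = text[1:-1]
        if "<" in inner or ">" in inner:
            raise ValueError("%s%s: '<'/'>' allowed only as the outer delimiters" % (prefix, hexseq))
        if not FORMAT_SPEC.sub("", inner).strip(" ,"):
            raise ValueError("%s%s: needs some non-format text between < and >" % (prefix, hexseq))
        if HEX_LITERAL.fullmatch(text):
            raise ValueError("%s%s: text may not take the form of a raw hex literal" % (prefix, hexseq))
        if text in seen:
            raise ValueError("%s%s: duplicate control code %s" % (prefix, hexseq, text))
        seen.add(text)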


2. I'm not sure whether users might object to being locked into using '<' and '>' for all their controls, although I understand we've already done that to them with raw hex. I guess I would hesitate to push further in that direction without feedback or support from others. It would simplify several matters, though, as pointed out.
What if I simply want to have a literal less than sign in my text?
Yeah, this is definitely a compromise. Given that we're already doing it with raw hex, this is better than reserving more characters. We could go for more flexibility and allow e.g. $FD=[kp] on the condition that [ and ] do not appear in any text sequence as anything other than start and end tokens of a non-normal entry, but that sounds like more cost than benefit. Since table files are UTF-8 documents, there are lots of other characters to use as substitutes. I might go with « and » as visually similar characters, { and } if my game lacked significant set theoretic discussion (for shame! :P), or maybe &lt; and &gt; if I were using an XML-based utility. All of these options assume that my substitute characters (or at least enough of them to force unambiguous parsing) weren't already in use. So, not great, but not insurmountable either.


3. I don't think this works well when applied to table switching. You want the @TableIDString line and then the TableID in the switch entry in the format <XXXX>? That seems a bit uglier, and as you already pointed out probably wouldn't lend itself to formatting strings.
@table2 would still be @table2, but !FD=table2,0 would become !FD=<table2> and !FD=table2,2 would become some variant on !FD=<table2,%2X%2X>. This does seem unnecessary for table switching, especially if table switch tokens are only used for dumping, but it does make ! entries consistent with $ entries (my main goal in suggesting it), and potentially there are situations where having table switch tokens in the script might be beneficial, maybe. Contrary to what I said earlier, I think we'd still want to keep "," as an internal delimiter between the tableID and the format string. All in all, this part is probably not worth the cost.


Note that along these lines I am considering pushing newlines and end tokens entirely out to the utility realm (the non-hex variant goes away entirely; the hex variant becomes a normal entry identified as an end token at the utility level). I have been having some PM discussions with Klarth about this. If we did that, that would leave us with only linked entries and table switches as non-normal entries.
Interesting. Artificial end tokens are already necessarily handled at the utility level during dumping, so no loss there. Insertion would lose some intercompatibility and gain some complexity. As far as utilities are concerned, end tokens are the single most important type of token. We could combine ideas and make end tokens (/) in-game control codes ($), which would ensure their unambiguous parsability. The utility would have to have some way to find out which tokens were end tokens, either by hex or by text. Hex would have to be table-specific and couldn't handle artificial end tokens, so text is preferable. I guess it all depends on how smart the insertion utility is.

Would it no longer be possible to dump a newline based on table entry alone? Or is this a weakening of the "ignore newlines" insertion rule?


Can you think of any real world scenarios where you would need to define a number of matches other than 1 or infinite? If not, we might have a winning change here.
This change would significantly reduce the complexity in finding a valid table switch path. One of the ideas I've considered is recombining table entries (e.g. taking "!00=table2,1" in table1 and "80=foo" in table2 and replacing them with "0080=foo" in table1), and these two ideas would work well together. However, I'm not sure we actually can eliminate other match count values: I have read about cases where more than one match is required. Eliminating other match counts won't solve all our problems (see next point), but it is definitely worth pursuing if we can get away with it.

How would the raw hex suddenly become part of a valid token?
Consider the following example:
00=ABC
0000=DEF
80=A
...
9A=Z
Insufficiently careful insertion of the text "ABCABC" will result in the hex "00 00", which the game will read as the text "DEF". The same thing can happen with raw hex literals: "<$00>ABC" and "ABC<$00>" both end up being read as "DEF". It's worth noting that none of the algorithms discussed here (including longest prefix) addresses this possibility. In a single-table scenario, I would consider this example as unlikely and indicative of poor text engine design besides (which is not the same as saying it won't work, given the right input), but the same idea applies in a multi-table scenario and seems more likely to occur in practice. Here's another made-up example:
@table1
00=~1~
80=A
...
9A=Z
!C0=table2,0
!C1=table2,1

@table2
00=~2~
0080=~3~
8200=~4~
8280=~5~
80=a
...
9A=z
Starting from table1, when we see the text "abc<$00>ABC", what do we insert? What does the game read? Even ignoring that troublesome hex literal, what happens with "abcABC"?
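
To pin down the first example, here's a toy demonstration (a stand-in for the 00=ABC / 0000=DEF table; the single-letter byte values are arbitrary). A locally reasonable insertion of "ABCABC" produces hex that dumps back, and displays in-game, as something else entirely.

TABLE = {"ABC": b"\x00", "DEF": b"\x00\x00"}
TABLE.update({chr(ord("A") + i): bytes([0x80 + i]) for i in range(26)})

def insert_longest_prefix(text):
    out, i = bytearray(), 0
    while i < len(text):
        tok = max((t for t in TABLE if text.startswith(t, i)), key=len)
        out += TABLE[tok]
        i += len(tok)
    return bytes(out)

def dump_longest_hex(data):
    by_hex_length = sorted(TABLE.items(), key=lambda kv: -len(kv[1]))
    out, i = "", 0
    while i < len(data):
        for text, hx in by_hex_length:
            if data.startswith(hx, i):
                out += text
                i += len(hx)
                break
    return out

hex_2 = insert_longest_prefix("ABCABC")   # b"\x00\x00"
print(dump_longest_hex(hex_2))            # "DEF" -- not the "ABCABC" that was inserted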

Nightcrawler

  • Hero Member
  • *****
  • Posts: 5788
    • View Profile
    • Nightcrawler's Translation Corporation
Re: Table File Standard Discussion
« Reply #27 on: June 07, 2011, 03:18:44 pm »
All of these options assume that my substitute characters (or at least enough of them to force unambiguous parsing) weren't already in use. So, not great, but not insurmountable either.

The way we have it currently disallows only a raw hex pattern in normal entries, but otherwise allows everything. If we were to pursue this direction of heavier restriction, I'd need support from others.

Quote
@table2 would still be @table2, but !FD=table2,0 would become !FD=<table2> and !FD=table2,2 would become some variant on !FD=<table2,%2X%2X>.
I don't think it makes sense for table switch entries to be part of it like that. Remember, TableID refers to an identifier that corresponds to a TableIDString line in one of your tables. Why are you wrapping it in '<>'? Why are you then trying to wrap it in a token format? If you want to do that, I think it makes more sense to do it like this:

!hex=TableIDString,NumberOfTableMatches,FormatString
!FD=table2,2,<table2 %2X%2X>

That way you're now defining a specific formatted token there as an extra parameter. You're no longer trying to mesh the text token with the TableIDString identifier.

Quote
Interesting. Artificial end tokens are already necessarily handled at the utility level during dumping, so no loss there. Insertion would lose some intercompatibility and gain some complexity. As far as utilities are concerned, end tokens are the single most important type of token. We could combine ideas and make end tokens (/) in-game control codes ($), which would ensure their unambiguous parsability. The utility would have to have some way to find out which tokens were end tokens, either by hex or by text. Hex would have to be table-specific and couldn't handle artificial end tokens, so text is preferable. I guess it all depends on how smart the insertion utility is.
Yes. And if the in-game control code idea didn't fly, I would still eliminate '/' and just leave them as normal entries. As far as the table goes, "FF=<END>" is a standard hex-to-text conversion. The significance of this specific token is in the utility realm. That would be the logic there, from the perspective of a more strict interpretation of what belongs in the table file. This would, as you pointed out, require the utility to provide a way for the user to identify which tokens are end tokens. End tokens are tricky as we currently have them: they have one foot in the utility realm and the other in the table. This is one possible idea to make things simpler and more logical.

Quote
Would it no longer be possible to dump a newline based on table entry alone? Or is this a weakening of the "ignore newlines" insertion rule?
Correct. There would be no newlines in the table file. "\n" in your table file would just be normal text. Again, the concept of newlines and when to place them is pushed to the utility level. Nobody wants newlines to be part of the token in the insertion direction (except maybe you), and people have presented problems with having different behavior between dumping and insertion for them and/or having them at all. So, consideration is being given to removing them entirely.

The drawback to pushing these items out is that, while we have a more strict and standardized definition of the table file, we actually standardize less of the overall process. Having no standardization in the process is probably what led to the incompatible utilities and table files we had in the past. So, does it become counterproductive to the cause?

Quote
However, I'm not sure we actually can eliminate other match count values: I have read about cases where more than one match is required. Eliminating other match counts won't solve all our problems (see next point), but it is definitely worth pursuing if we can get away with it.
That topic references the same 'Kanji Array' item from the Romjuice readme file I previously mentioned. Look at it again. This time ask yourself: 'Isn't this really a form of simple data compression rather than a hex or text encoding item?' I think an argument can be made that it is data compression (even if very simple) and thus would not belong in our table file standard.

Quote
Starting from table1, when we see the text "abc<$00>ABC", what do we insert? What does the game read? Even ignoring that troublesome hex literal, what happens with "abcABC"?
I thought we covered 'abcABC'... Longest match is one letter for all of them, infinite table switch is preferred over single. So, we get 0xC0 0x80 0x81 0x82 (fallback) 0x80 0x81 0x82.

The hex literal is also simplified if NumberOfTableMatches can only be 0 or 1. This scenario can only occur if it's 0 now. So, all you need to decide is if hex literal causes a fallback or not. I don't think there is a right answer there as it will be game specific based on the interpretation of the hex value. One behavior will have to be chosen and move on. This will only ever occur when the user specifically inserts hex during a table switch.
TransCorp - Over 15 years of community dedication.
Dual Orb 2, Wozz, Emerald Dragon, Tenshi No Uta, Herakles IV SFC/SNES Translations

abw

  • Newbie
  • *
  • Posts: 42
    • View Profile
Re: Table File Standard Discussion
« Reply #28 on: June 08, 2011, 08:43:06 pm »
The way we have it currently disallows only a raw hex pattern in normal entries, but otherwise allows everything.
Which includes tokens that can combine to produce text representing a raw hex literal.

If we were to pursue this direction of heavier restriction, I'd need support from others.
Definitely. I think this restriction is a net benefit, but it does come at a price, albeit a small one.

That way you're now defining a specific formatted token there as an extra parameter. You're no longer trying to mesh the text token with the TableIDString identifier.
The current way of doing things is fine. I wanted the syntax for ! entries to be consistent with the syntax for $ entries, but if we aren't dealing with ! entries as text, it really doesn't matter much.

The drawback to pushing these items out is that, while we have a more strict and standardized definition of the table file, we actually standardize less of the overall process. Having no standardization in the process is probably what led to the incompatible utilities and table files we had in the past. So, does it become counterproductive to the cause?
Hmm. That is a bit of a pickle. I agree pushing both items out to the utility level seems logical, but I wonder how many people will complain? Maybe take it one step at a time and only worry about table files for now? What's your vision of the ideal process?

That topic references the same 'Kanji Array' item from the Romjuice readme file I previously mentioned. Look at it again. This time ask yourself: 'Isn't this really a form of simple data compression rather than a hex or text encoding item?' I think an argument can be made that it is data compression (even if very simple) and thus would not belong in our table file standard.
Yup, I know. The thread adds a little bit of info not in the readme, and is old enough that it might have been forgotten. Kanji Arrays are definitely data compression, but so is DTE. Where do we draw the line? Kanji Arrays and DTE are both representable by static bidirectional mappings, unlike, say, any of the Lempel-Ziv compression variants, which use dynamic mappings. Dumping Kanji Arrays isn't a problem, and if a bunch of programmers from the 90s came up with a way of inserting them, we should be able to as well.

I thought we covered 'abcABC'... Longest match is one letter for all of them, infinite table switch is preferred over single. So, we get 0xC0 0x80 0x81 0x82 (fallback) 0x80 0x81 0x82.
I think you missed my point, given that you just inserted "ab~5~bc" instead of "abcABC". How is the game supposed to know that it should fall back to table1 in the middle of a string of perfectly valid table2 tokens? Inserting without considering consequences for dumping (a.k.a. in-game display) can cause trouble, in many of the same ways as can dumping without considering consequences for inserting. If a utility author chooses not to care, so be it, but that doesn't make the problems go away.

On the topic of falling back... how exactly does the last Dakuten example in 3.2 work?
   Example 3:
   
      Let's say hex sequence "7F" indicates Dakuten mark that applies for all
      characters until "7F" occurs again to turn it off. This can be done with
      two tables containing table switch bytes in the following manner:
      
      Table 1:
         
         @NORMAL
         60=か
         !7F=Dakuten,0
      
      Table 2:
         
         @Dakuten
         60=が
            
      This will instruct any dumper to output 'か' normally until a "7F" byte
      is encountered. It will then switch to Table 2 and output 'が'. Because
      we specified 0 for the number of table matches, matching in the new
      table will continue until a value is not found. In this case "7F" is not
      in the Table 2, so fallback to Table 1 will occur.
After falling back to @NORMAL from @Dakuten on that "7F" byte, won't the very next thing the dumper does (as corroborated by the example in 2.6) be to read "7F" in @NORMAL, causing it to switch right back to @Dakuten and continue dumping with Dakuten marks? That sounds like the exact opposite of the intended result :P. This kind of fallback isn't really covered under any of the other cases, so maybe there should also be an explicit fallback entry. Perhaps something in @Dakuten like "!7F=,-1", or something in @NORMAL like "!7F=Dakuten,$7F"?

Klarth

  • Sr. Member
  • ****
  • Posts: 423
  • Location: Pittsburgh
    • View Profile
Re: Table File Standard Discussion
« Reply #29 on: June 08, 2011, 09:44:13 pm »
I can make arguments for and against end tokens in the standard, along with formatting.  Personally, I'd rather keep end tokens and formatting in the standard since it's less code to write and less configuration for me to keep track of (on both the dev and usage sides).  We can remove null end tokens like /<END>, though.

Nightcrawler

  • Hero Member
  • *****
  • Posts: 5788
    • View Profile
    • Nightcrawler's Translation Corporation
Re: Table File Standard Discussion
« Reply #30 on: June 09, 2011, 04:23:10 pm »
Which includes tokens that can combine to produce text representing a raw hex literal.

I spoke about some of these things with Klarth before. These are items that can occur, but are too rare to be detrimental... unless it's a "gotcha table". You either do that, or you go in the other direction previously spoken about, with the game control code unification and disallowing the literals used. At this point, nobody else seems in favor of such changes.  :-\

Quote
Hmm. That is a bit of a pickle. I agree pushing both items out to the utility level seems logical, but I wonder how many people will complain? Maybe take it one step at a time and only worry about table files for now? What's your vision of the ideal process?

Well as you can see, already Klarth does not like the idea of eliminating end tokens altogether. I imagine it would hit additional resistance. However, he is in favor of eliminating artificial end tokens. He also wants to keep formatting. I'm on the fence with it.

a.) Allowing arbitrary line breaks in any token is flexible formatting and hard to duplicate (from a user interface perspective) on the utility level.
b.) Many games have specific line break control codes. The end user needs a method to have a real line break after them in the text dump. The alternative would be having to define a line break control in the utility. That can seem disjointed. Some utilities had in-game line breaks specified in the table file, similar to end tokens. This was deemed unnecessary with the \n formatting we currently have.
c.) Leaving them requires different behavior for interpreting tokens for dumping and inserting (ignore line break on insertion rule). Some, such as yourself want to use that as part of the token while others are strongly against.
d.) It can be difficult or problematic to have the script dump come out commented without quirks. See the example in the standard. You will end up having issues with the starting and ending comments by doing it that way. Alternatively, pushing commenting off to the utility pretty much requires the line breaks to be defined in some capacity to handle things like item b above. How else would the utility be able to comment out each line within a string and then, say, leave un-commented lines between strings? This is how the whole \r and \n distinction was born for commented and un-commented line breaks in some other utilities.

I'm not even going to begin to talk about the ideal process. That's a huge can of worms as the ideal process depends on what platform and games you're working with. The table file escapes much of that and look at all the disagreement on that. :P

Quote
Yup, I know. The thread adds a little bit of info not in the readme, and is old enough that it might have been forgotten. Kanji Arrays are definitely data compression, but so is DTE. Where do we draw the line? Kanji Arrays and DTE are both representable by static bidirectional mappings, unlike, say, any of the Lempel-Ziv compression variants, which use dynamic mappings. Dumping Kanji Arrays isn't a problem, and if a bunch of programmers from the 90s came up with a way of inserting them, we should be able to as well.

The difference is, with DTE, there is direct token conversion in both directions. With the Kanji Array and most other data compression, you do not have that. An algorithm (even if simple) is required. As soon as you leave the realm of direct token conversion, I think you depart from the scope of the table file. That's a clear line to me. DTE is a static bidirectional map with no manipulation needed, the Kanji Array is not. Transformation of the hex must occur to get the map, right?

A way that makes sense to interpret this as non-data-compression is to set up all possible kanji array values as table switches with x number of matches, which is exactly why Tau suggested it. No optimal plan for insertion was ever arrived at in this topic with that left in, though. It seems to severely complicate matters.

Quote
I think you missed my point, given that you just inserted "ab~5~bc" instead of "abcABC". How is the game supposed to know that it should fall back to table1 in the middle of a string of perfectly valid table2 tokens? Inserting without considering consequences for dumping (a.k.a. in-game display) can cause trouble, in many of the same ways as can dumping without considering consequences for inserting. If a utility author chooses not to care, so be it, but that doesn't make the problems go away.

Look at what you did. You defined table2 with 82=c and 8280=~5~. I can't say I've ever seen a text engine that could operate in that way. 0x82 would be reserved to indicate that the next character is a two-byte character.  I believe S-JIS and UTF-8 work in a similar manner as well off hand. If your game is processing characters that are one OR two bytes and the next character is valid as both a one-byte and a two-byte character, that's an ambiguous situation for the text engine unless it were to arbitrarily declare one or the other the default to choose from.

On top of all that, again, the user specifically plopped that raw value in there in the midst of a table switch to cause it. This is another case of the 'gotcha-table' theoretical condition that is so rare, I can't say it's worth any time to handle. What would you propose to do about it anyway? I don't think you'll find much support on the issue from the others.

You can try and iron out theoretical edge cases all day and spend a wealth of resource on them in your program, and continue to put it out of reach of more and more people, or you can start adding some practical limits, keep it simple, and get the product out the door. I think that's kind of where several of us are coming from now.

Quote
On the topic of falling back... how exactly does the last Dakuten example in 3.2 work?

Good catch. I remember writing those examples. I think I originally wrote it with an additional switch entry in Table2 of '!7F=NORMAL,0'. That should take care of it (and revising the logic paragraph). I remember having a (in hindsight) brain logic malfunction when I read it over and thought I didn't need that and it would just fallback. I rewrote it after that.
TransCorp - Over 15 years of community dedication.
Dual Orb 2, Wozz, Emerald Dragon, Tenshi No Uta, Herakles IV SFC/SNES Translations

abw

  • Newbie
  • *
  • Posts: 42
    • View Profile
Re: Table File Standard Discussion
« Reply #31 on: June 09, 2011, 11:56:47 pm »
Well as you can see, already Klarth does not like the idea of eliminating end tokens altogether. I imagine it would hit additional resistance. However, he is in favor of eliminating artificial end tokens. He also wants to keep formatting. I'm on the fence with it.
The case for eliminating artificial end tokens certainly seems stronger than that for eliminating end tokens or newlines. As an end user, I think I prefer keeping them in the table file; even though that may not be the most logical place for them, it does seem to be the most convenient place.
a) I agree with this. As a real world example, it might be interesting to consider a game like The Legend of Zelda, which uses different tokens for regular characters, characters ending the first line, characters ending the second line, and characters ending the string.
b) The utility interface for defining a line break control can be pretty much the same as for defining an end token. I definitely prefer the formatting flexibility of \n over the old * entries.
c) As far as implementation goes, stripping newlines can be done in approximately 2 lines of code. My objections are primarily philosophical and are entangled with other issues.
d) I've never used \r myself. Is that a popular option? If you have some interface for defining line break controls and end tokens, the same thing should work for comment controls. Maybe something along the lines of this (clearly user interface is not my strong suit :P)?
0A=A
...
23=Z

4A=A<line1>\n
...
63=Z<line1>\n

8A=A<line2>\r
...
A3=Z<line2>\r

/CA=A<end>\n\n
/E3=Z<end>\n\n
becoming
#LINE-BREAK-AFTER: 4A-63,8A-A3,CA-E3
#LINE-BREAK-AFTER: CA-E3
#COMMENT-AFTER: 4A-63,CA-E3
#END-AFTER: CA-E3
Or maybe you could use your GUI to layer extra effects on a per-token basis? That sounds very much like a utility-specific table file extension, though.

I'm not even going to begin to talk about the ideal process. That's a huge can of worms as the ideal process depends on what platform and games you're working with. The table file escapes much of that and look at all the disagreement on that. :P
All too true. Consider the question retracted :P.


The difference is, with DTE, there is direct token conversion in both directions. With the Kanji Array and most other data compression, you do not have that. An algorithm (even if simple) is required. As soon as you leave the realm of direct token conversion, I think you depart from the scope of the table file. That's a clear line to me. DTE is a static bidirectional map with no manipulation needed, the Kanji Array is not. Transformation of the hex must occur to get the map, right?
DTE is only direct for dumping, and only when we assume all tokens have the same hex length. If the hex lengths vary, you need an algorithm to decide when to stop reading bytes (technically you still need an algorithm even if all it does is say "read one byte"). When you're inserting, you need an algorithm to tokenize the input... oh, I guess you're talking about what happens after tokenization. In that case, yes, DTE is direct both ways. But if you've already tokenized, Kanji Arrays are also direct conversion. The only problem we're having is that some of the tokenization information (the location and type of table switch tokens) is missing, and deducing the exact nature of the missing bits is tricky. If we could rely on the input script having the correct table switch tokens in the correct places, we wouldn't have most of these problems. It's the uncertainty of tokenization at issue, not the mapping of tokens.
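
For what it's worth, the mapping half really is trivial once tokens exist; here is a rough sketch of a longest-text-match tokenizer over all accessible tables (hypothetical structures), where each token just remembers every table it could have come from. The missing switch information is precisely what insertion still has to reconstruct.

def tokenize_all_tables(text, tables):
    tokens, i = [], 0
    while i < len(text):
        # Longest text match across every accessible table.
        candidates = [tok for entries in tables.values() for tok in entries
                      if text.startswith(tok, i)]
        if not candidates:
            raise ValueError("untokenizable text at offset %d" % i)
        tok = max(candidates, key=len)
        sources = [tid for tid, entries in tables.items() if tok in entries]
        tokens.append((tok, sources))   # token plus every table that could supply it
        i += len(tok)
    return tokens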


Look at what you did. You defined table2 with 82=c and 8280=~5~. I can't say I've ever seen a text engine that could operate in that way. 0x82 would be reserved to indicate that the next character is a two-byte character.
Yes, that was a little sneaky of me. It is fully supported by the longest hex sequence rule for dumping, though. A quick peek at the wiki shows at least one real world example where this does in fact happen (on that note, people should submit more table files!).

I believe S-JIS and UTF-8 work in a similar manner as well off hand.
Not sure about S-JIS, but UTF-8 was actually designed to avoid this very weakness (and a few others). The same cannot be said of character encodings used in video games.

If your game is processing characters that are one OR two bytes and the next character is valid as both a one-byte and a two-byte character, that's an ambiguous situation for the text engine unless it were to arbitrarily declare one or the other the default to choose from.
Which is exactly what the table file standard does :P.

On top of all that, again, the user specifically plopped that raw value in there in the midst of a table switch to cause it. This is another case of the 'gotcha-table' theoretical condition that is so rare, I can't say it's worth any time to handle.
Raw hex literals are (potentially) problematic anywhere in a game using multibyte tokens, not just at table switch boundaries.

What would you propose to do about it anyway? I don't think you'll find much support on the issue from the others.
That's what I was asking! Do they trigger fallback? Do they count as a match towards NumberOfTableMatches? What happens if they combine with the preceding or following hex to mutate into a completely different token? How can you write code to handle multi-table insertion and not consider these cases?

You can try and iron out theoretical edge cases all day and spend a wealth of resource on them in your program, and continue to put it out of reach of more and more people, or you can start adding some practical limits, keep it simple, and get the product out the door. I think that's kind of where several of us are coming from now.
Being 100% standard compliant means handling the theoretical edge cases. All of them. I haven't introduced anything in any of my examples that wasn't already allowed by the standard, and I haven't even raised all of the issues I'm currently aware of. If you want practical limits, I think either you have to adjust the standard or you can't claim compliance. And since you can't advocate a standard you don't want to comply with, I guess that means adjusting the standard.

One more try: in addition to the raw hex literal, let's also ignore all the multibyte entries and all the multicharacter entries. That leaves us with
@table1
80=A
...
9A=Z
!C0=table2,0
!C1=table2,1

@table2
80=a
...
9A=z
Now instead of "ab~5~bc", your hex represents "abcabc", which is still not the same as "abcABC". In this case, the only hex that works is "C1 80 C1 81 C1 82 80 81 82". If we are going to drop support for counts other than 0 and 1, what would you think about also dropping support for one table linking to another table using tokens with both counts? In this case, that would mean the combination in @table1 of "!C0=table2,0" and "!C1=table2,1" would be an error.

On the topic of falling back... how exactly does the last Dakuten example in 3.2 work?

Good catch. I remember writing those examples. I think I originally wrote it with an additional switch entry in Table2 of '!7F=NORMAL,0'. That should take care of it (and revising the logic paragraph). I remember having a (in hindsight) brain logic malfunction when I read it over and thought I didn't need that and it would just fallback. I rewrote it after that.
I thought about that too, but it does cause the dumper's table stack to become desynchronized from the game's stack. The next time the dumper hits an unrecognized entry in @NORMAL, it would fall back to @Dakuten (which might contain a hit) instead of dumping a hex literal. Worse yet would be if @NORMAL weren't the bottom of the stack, and the game actually fell back to some other table entirely while the dumper stayed in @Dakuten. But I guess that's just another theoretical edge case waiting to be ignored.
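
A toy trace of that desynchronization, assuming the '!7F=NORMAL,0' fix and a dumper that pushes the old table on every 0-count switch (which is how I read the standard's stack behaviour):

def trace_tables(data):
    table, stack = "NORMAL", []
    for byte in data:
        if byte == 0x7F:                 # a 0-count switch entry exists in both tables
            stack.append(table)
            table = "Dakuten" if table == "NORMAL" else "NORMAL"
        print("byte %02X: table=%s, stack=%s" % (byte, table, stack))

trace_tables([0x60, 0x7F, 0x60, 0x7F, 0x60, 0x7F, 0x60])
# After three 7F bytes the dumper's stack holds three entries, while the game
# has simply toggled its dakuten flag on, off, on.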

Nightcrawler

  • Hero Member
  • *****
  • Posts: 5788
    • View Profile
    • Nightcrawler's Translation Corporation
Re: Table File Standard Discussion
« Reply #32 on: June 10, 2011, 03:27:52 pm »
First, let me apologize for some of the tone used in my previous message. I am just getting a little frustrated over the whole thing. After a year already, and now ripping it apart again, the enthusiasm I started out with is fading quickly. It is becoming much more of a burden than the nice little side project I started out with while developing TextAngel. I am anxious to finish up and move on in life after this much time. I'm sure it shows in the tone of my responses. ;) While I am angry I did not think of some of these pitfalls myself in the beginning, I am grateful you have brought them to my attention nonetheless. It's been a spirited and lengthy discussion. Certainly some good has come from it, as I have incorporated many items discussed. It was a daunting task to go through all this discussion again... :)

New draft, summary of changes, and summary of outstanding business.


The case for eliminating artificial end tokens certainly seems stronger than that for eliminating end tokens or newlines. As an end user, I think I prefer keeping them in the table file; even though that may not be the most logical place for them, it does seem to be the most convenient place.

Ok, so the final ruling was to get rid of artificial end tokens, but keep standard end tokens and the \n formatting. To clear up the situation where two end tokens are the same, differing only by '\n', I have added a rule that newlines are ignored when checking for duplication. So /FF=<end> and /FE=<end>\n\n are considered duplicates. Along those lines, the standard duplicate text-sequence rules apply. Lastly, duplicate text sequences are checked across the logical table (as opposed to only within a token type or across all tables). How's that?

Quote
d) I've never used \r myself. Is that a popular option? If you have some interface for defining line break controls and end tokens, the same thing should work for comment controls. Maybe something along the lines of this (clearly user interface is not my strong suit :P)?

I think it's a historically popular layout (and even recently, Cartographer supports it). I know a number of people (including myself, being 'old school') have a text dump layout where the original Japanese is all commented out and the English goes below. I happen to have an old sample already up online: http://transcorp.parodius.com/scratchpad/outputsample.txt

There are certainly other potentially better ways to do it. Tau likes to do it like this and use no line breaks or comments.
http://imageshack.us/photo/my-images/218/translationworkbench.jpg/

I'm sure others do it other ways and/or even port to XML or a spreadsheet. I just want to make sure that at least the few common cases I know of are possible to achieve.

I agree that we probably want to stay away from trying to extend the table format via utility. This has made me re-think one of Tau's suggested solutions to this issue: using regex in table entries. See this post: http://transcorp.parodius.com/forum/YaBB.pl?num=1273691610/45

Originally I did not like the added complexity. However, since then we have strayed so far, with table switching, possible printf-like formatting strings, etc. If we're that far down the road, another step or two might not be so bad anymore. I feel as though it's already far out of reach of most first-time utility amateurs now. I admit it would allow for some interesting possibilities. I also have to admit to some personal bias against it. It's a fine technology of course; it's just always been so difficult for me to grasp the syntax (which seems to vary somewhat with programming language). I struggle to do anything more than the simplest task in regex.

Quote
It's the uncertainty of tokenization at issue, not the mapping of tokens.

That's the same reason why we can't do any other form of compressed data. :) So, that leads me back to thinking we can do away with x number of matches on table switching to reduce complexity. Even if we do that, we still haven't come up with an agreeable method for insertion with table switching. So, it seems leaving it in and adding to the complexity further makes it nearly infeasible.

Quote
Yes, that was a little sneaky of me. It is fully supported by the longest hex sequence rule for dumping, though. A quick peek at the wiki shows at least one real world example where this does in fact happen (on that note, people should submit more table files!).
Raw hex literals are (potentially) problematic anywhere in a game using multibyte tokens, not just at table switch boundaries.

Ok. Point taken on the previous example of how the abcABC table knows when to fall back, even without the hex literal thrown. However, I'm starting to lose sight of the point here. How do you propose these items be taken into consideration during insertion?


Quote
That's what I was asking! Do they trigger fallback? Do they count as a match towards NumberOfTableMatches? What happens if they combine with the preceding or following hex to mutate into a completely different token? How can you write code to handle multi-table insertion and not consider these cases?

I don't have a good answer for that. That's why I don't make public utilities. I can just ignore all cases that don't apply to what I'm doing and go on my merry way. :) Speaking of which, I don't think any program has ever taken some of the things we're talking about into consideration. The possibility of mutation due to inserting raw hex is not new, and yet everybody has gotten along just fine without doing anything special about it. How have utilities up until now handled it? I don't even know, as I don't recall ever reading a shred of documentation on it, nor did I take much notice of it when I went through available source code before. How much do these things matter? Why can't we just define some behavior and move on? What's the result? Insertion may do something undesirable for the user. The user will have to deal with it. Well, OK, nobody has complained about it for the last 15 years. Maybe it's a poor attitude to take on the situation, but I have to ask if it's worth the time to do anything else with it. The fact of the matter is most people using this will be inserting English scripts with a single table of English letters, DTE, and/or dictionary entries. How many will care about any of this?

So, I'm going to pass it to you to come up with something suitable. :P

Quote
Now instead of "ab~5~bc", your hex represents "abcabc", which is still not the same as "abcABC". In this case, the only hex that works is "C1 80 C1 81 C1 82 80 81 82". If we are going to drop support for counts other than 0 and 1, what would you think about also dropping support for one table linking to another table using tokens with both counts? In this case, that would mean the combination in @table1 of "!C0=table2,0" and "!C1=table2,1" would be an error.

OK. Understood now. Yes, I think it would be OK to disallow multiple switches to the same table from within a single table. I have a hard time thinking of a scenario where a game text engine would have two switches from a table to the same destination table with different rules. I think they'd hit the same type of wall we are hitting. Good line of thinking here. I think rather than think of some elaborate way to facilitate what we have defined, we need to reduce what we have defined to something more reasonable to facilitate. :)

Quote
I thought about that too, but it does cause the dumper's table stack to become desynchronized from the game's stack. The next time the dumper hits an unrecognized entry in @NORMAL, it would fall back to @Dakuten (which might contain a hit) instead of dumping a hex literal. Worse yet would be if @NORMAL weren't the bottom of the stack, and the game actually fell back to some other table entirely while the dumper stayed in @Dakuten. But I guess that's just another theoretical edge case waiting to be ignored.

Right. It seems like this is a new scenario not covered by anything we have. It needs to be, as I believe it is a common scenario. I've also seen hiragana/katakana switches like that. I would agree with the possible addition of another fallback option. However, what effect will that have on our already difficult table switch insertion process? I suppose it wouldn't be too terrible if we still limited things to not allow more than one switch to the same table in a logical table. There will only ever be one possible switch path to Table X from Table Y that way.
« Last Edit: June 14, 2011, 09:39:09 am by Nightcrawler »
TransCorp - Over 15 years of community dedication.
Dual Orb 2, Wozz, Emerald Dragon, Tenshi No Uta, Herakles IV SFC/SNES Translations

Klarth

  • Sr. Member
  • ****
  • Posts: 423
  • Location: Pittsburgh
    • View Profile
Re: Table File Standard Discussion
« Reply #33 on: June 11, 2011, 02:15:20 am »
Yes, that was a little sneaky of me. It is fully supported by the longest hex sequence rule for dumping, though. A quick peek at the wiki shows at least one real world example where this does in fact happen (on that note, people should submit more table files!).
I have a strong suspicion that table is incorrect at best and impossible at worst. Look at the hex values 00-FF and the range is entirely saturated except for 00. Look at the kanji ranges and **00 - **FF are saturated except for the **00 entry. There is no way to differentiate between the two because of that saturation (unless the single byte entries were meant to start with an 00 byte). Also most of 01-07 have duplicate entries.

The scenario you were trying to describe can conceivably come true. It'd be a mixed single- and double-byte token text system (read two bytes at a time, put a byte back into the stream if it's a single-byte token). Past that it depends on machine endianness (somewhat) and the programmer's logic (the order he packs the data). And that's why it practically doesn't happen... reading one byte at a time (and getting the rest later) is more intuitive for most.
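As a rough illustration of that read-two-bytes-then-push-one-back scheme, here is a small Python sketch; the token tables are invented for the example, not taken from any real game.

Code:
SINGLE = {0x41: 'A', 0x42: 'B'}                        # hypothetical single-byte tokens
DOUBLE = {(0xE0, 0x01): '漢', (0xE0, 0x02): '字'}      # hypothetical double-byte tokens

def decode(data):
    out, i = [], 0
    while i < len(data):
        pair = tuple(data[i:i + 2])
        if pair in DOUBLE:                             # try the two-byte token first
            out.append(DOUBLE[pair])
            i += 2
        else:                                          # "put the second byte back"
            out.append(SINGLE.get(data[i], '~%02X~' % data[i]))
            i += 1
    return ''.join(out)

print(decode(bytes([0x41, 0xE0, 0x01, 0x42])))         # A漢B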

Nightcrawler

  • Hero Member
  • *****
  • Posts: 5788
    • View Profile
    • Nightcrawler's Translation Corporation
Re: Table File Standard Discussion
« Reply #34 on: July 29, 2011, 03:41:07 pm »
Since it appears abw flew the coop long ago, I will try to carry on and resolve the remaining outstanding business.

OUTSTANDING BUSINESS:

Comments:
As pointed out in abw's first post here, we have an issue with leaving formatting '\n' in and trying to add comments. We're also left, like Cartographer, with odd glitches like an extra '//' at the end of every dump, and needing the utility to add the first '//'. It seems hackish, but nobody's complained so far about Cartographer. If we move it to the utility, it is very difficult/cumbersome to have a user interface that gives the same flexibility to do something like this example. The line breaks are arbitrarily defined in the table file... and then we're trying to comment only some of them, depending on the token, after the fact. Either way is not pretty. The only other suggestion I've seen is allowing regex in the table file.

Raw Hex
Raw hex causes several issues. First, a combination of normal entries may inadvertently output a raw hex sequence during dumping and thus not insert correctly. Secondly, inserting raw hex can cause issues for table switching behavior. Lastly, it can cause a subsequent token to be interpreted as a different token upon insertion. One solution abw suggested for several of these issues was a general game control code along with disallowing <> type characters in normal entries.

Insertion Issue with combined tokens
There are some insertion issues that can arise where less intelligent insertion (such as longest prefix) could result in text being interpreted as different tokens upon insertion than were originally dumped. See the example at the bottom of this post.

Final Ruling Needed
1. Allow multiple tables per file? It may be useful to have all kanji/hira/kana in a single file, even if they are different logical tables. Current ruling is to keep a single table per file.
2. Linked Entry formatting strings. "$XX=<token %2X %2X>" is cleaner and easier to validate, but increases complexity. At this point, I think I am in favor of this.

Insertion for Table Switching Details:
Outside of the table file itself, but still needed. Thus far, no suitable solution has been determined by anyone after much discussion in this topic. Reducing the supported features of table switching to something more manageable seemed like the necessary course of action. I think eliminating support for counts other than 0 and 1, and also dropping support for one table linking to another table using tokens with both possible counts, should resolve most of our prior issues. That seemed to be our consensus right before discussion died off.
TransCorp - Over 15 years of community dedication.
Dual Orb 2, Wozz, Emerald Dragon, Tenshi No Uta, Herakles IV SFC/SNES Translations

Klarth

  • Sr. Member
  • ****
  • Posts: 423
  • Location: Pittsburgh
    • View Profile
Re: Table File Standard Discussion
« Reply #35 on: August 01, 2011, 06:46:41 pm »
Opposed to multiple tables per file, unless you implement it as a requirement for table switching.
In favor of linked entry format strings.  I don't have any specifics on how the format string should look or what should be supported in it.

abw

  • Newbie
  • *
  • Posts: 42
    • View Profile
Re: Table File Standard Discussion
« Reply #36 on: August 05, 2011, 10:26:58 pm »
Yeah, sorry about that. Real Life has been keeping me pretty busy lately, and I haven't had the time/energy to be very productive on this. I think we all needed a bit of a break anyway :P.

Comments:
If you want low-level control over which newlines trigger comments and which newlines don't, I don't think you can beat \n and \r for ease of use. If all you want is an all-or-nothing comment of the dumped text for Atlas-style script files (this might not be a relevant concept for other styles), I think the utility should be able to handle that: any informational lines the utility generates (e.g. "String: $00 - Length: $01 - Range: $000000 to $000001") it can comment itself; any time it outputs a newline due to table file translation (or its own artificial end token), it can check whether the string containing the newline has more (non-newline) text to output and add comments only if the check returns true. Alternately, it can add comments for every newline that isn't part of an end token. I think that pretty much does what we want... am I missing anything? If not, thoughts on simply sacrificing low-level control?
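To make the "comment every newline that isn't part of an end token" option concrete, here is a toy Python sketch; the token layout is invented for illustration and is not how any of the utilities discussed here actually store their data.

Code:
COMMENT = '//'

def render_string(tokens):
    """tokens is a list of (dumped text, came-from-end-token) pairs for one string."""
    out = [COMMENT]                                    # utility comments the first line itself
    for text, is_end_token in tokens:
        for j, line in enumerate(text.split('\n')):
            if j > 0:
                out.append('\n')
                if not is_end_token:                   # newline from a normal token:
                    out.append(COMMENT)                # keep the continuation commented
            out.append(line)
    return ''.join(out)

print(render_string([('Hello\nworld', False), ('<end>\n\n', True)]))
# prints //Hello and //world<end>, followed by two un-commented blank lines
# left free for the translation to go in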

Final Ruling Needed:
If we're taking a vote, I am weakly in favour of multiple tables per file, and definitely in favour of linked entry format strings. Do we bother hiding the sprintf specifics? I agree we'd want to support hex, decimal, and binary output, and I'd like the ability to specify an arbitrary number of format strings embedded in text (e.g. $hex=<window x=%2X, y=%2X>). Thinking of binary output, how much do we care about restricting the amount of data formatted to byte boundaries? If somebody wanted to say e.g. $hex=<window x=%X, y=%X, bg=%4b, text=%4b>, how much of a problem is that?

Insertion for Table Switching Details:
I've been thinking about this on-and-off, and have a couple of ideas, but none of them are exactly what you'd call fully-baked at the moment.
We also need to do something about the Dakuten example. Any objections to using something in @Dakuten like !7F=,-1 to represent a forced fallback / localized end token? It should be easy to handle when dumping. Insertion could be trickier, though. In cases where you've seen this behaviour in action, is it the fallback itself that triggers the end of Dakuten mark adding, or is the 7F actually mandatory? If it's mandatory, we'll need some way to let the insertion utility know that it needs to insert another token on fallback. Or maybe we can get away with assuming there will only ever be one such token per table and that it should be inserted whenever fallback from that table occurs, regardless of the token used to switch to it. Or only when the switch hex matches the fallback hex?

Nightcrawler

  • Hero Member
  • *****
  • Posts: 5788
    • View Profile
    • Nightcrawler's Translation Corporation
Re: Table File Standard Discussion
« Reply #37 on: August 08, 2011, 11:03:56 am »
I hope you can muster up a little bit more for this last round of discussion because we're looking to close up shop on this. It's really starting to hinder development of TextAngel, which would be ready for testing if the table standard loose ends were closed up. I imagine it is a hindrance for Atlas as well. This was started in May 2010! :o

Comments:
I have thought about much of the same. I think it still makes the most sense to have the utility do this and get comments out of the table altogether. All-or-nothing from the utility is simplest and most straightforward, but I found that without those un-commented line breaks after strings, the script becomes less readable and, more importantly, relies on the translator to start inserting text at the proper place, which is a bad idea. If I had to close the book on this today, I'd opt to add comments for every newline that wasn't part of an end token. That will do what we want. I just don't like how we go from simply commenting each string in the string list upon output (all-or-nothing) to now having to go down to a token level and distinguish line breaks that should be exempt. For that tiny addition to the feature, we have to do much more than all-or-nothing requires. I think it's still necessary though. In any event, I think we can close this out if we agree it belongs at the utility level. How and what the utility does is outside of that.

Final Ruling Needed:
From our past discussion on formatting, we wanted hex, decimal, and binary. Binary was a maybe. There was also the possibility of supporting 1, 2, 4, and 8 byte parameters with potential endian swapping. I think that's getting excessive. My thoughts are to restrict it to single byte only, in hex, decimal, or binary form. Personally, I'd like to keep it to as few features as practical because I either have to roll my own printf implementation or try to convert to .NET String.Format(). Both could get tricky for me if it's as full featured as printf really is. Also, this was born as a logical extension to grouping simple straight hex output on 'linked entries'. No need to build a new Cadillac. Just put the doors the old car needed on.  ;)

One question on this, especially if we were to go with your general game control code scheme. How is one to know how many parameters are defined if the syntax were "$hex=<window x=%2X, y=%2X>" or "$hex=<window x=%X, y=%X, bg=%4b, text=%4b>"? And what if you wanted to print a literal '%'? I imagine we should restrict usage of that character within the <> as it's not worth the extra effort to support it.

Also, how would you do the opposite for insertion by picking out the parameters if the output is something like "<window x=$45, y=10, bg=01100110, text=0110000>"? With mixed types, there needs to be something in the output to distinguish the parameters. Should they be in quotes?


Insertion for Table Switching Details:
I agree with the whole !7F=,-1 forced fallback. It's highly needed. This can apply to hiragana/katakana switching applications as well. In nearly all instances I've ever seen of this, fallback occurs either when 0x7F is hit or when the string ends. I've never seen any myself where the 0x7F would be absolutely necessary before the string end, or where the state persists between strings. String ends usually nullify everything regardless. I would proceed with that assumption unless someone could provide evidence to the contrary. Otherwise, I think I am confident that I have seen enough to say most cases should work like that.

I was also thinking: do we need a separate option for this? Perhaps we should just extend the "0" numberofmatches table switching option to also fall back upon encountering the table switch byte that triggered the switch? I suppose it is conceivable that you might need to explicitly fall back with some other value than the initial switch. Perhaps a start dakuten, end dakuten setup where the start and end values are different. If there is not a need for that, we can simply combine it into our existing functionality for "0" numberofmatches. Just a consideration.

As for the switching, I think we've almost got this. When we agreed to eliminate numberofmatches other than 0 and 1, and disallow using both at the same time with the same target, that made things pretty straightforward, at least for the examples we had previously spoken about in this topic. The only thing remaining was the issues raw hex causes, I think.
TransCorp - Over 15 years of community dedication.
Dual Orb 2, Wozz, Emerald Dragon, Tenshi No Uta, Herakles IV SFC/SNES Translations

abw

  • Newbie
  • *
  • Posts: 42
    • View Profile
Re: Table File Standard Discussion
« Reply #38 on: August 11, 2011, 07:11:38 pm »
Comments:
Yeah, this is definitely a utility-level concern. Checking the type of the last token in the string doesn't feel any worse than checking the type of each newline character in the string, but I may be biased since I'm storing everything as tokens anyway.

Final Ruling Needed:
New doors on an old car works for me :P.

The game-specific control code wouldn't change anything here, except for disallowing any other < or > characters inside the control code's own < and >.

For including literal %, why not go with the standard %%? That way you could just interpret the entire text sequence as your sprintf and pass it the next few bytes as arguments. You would need to know how many bytes the string consumes, of course; if we make every non-literal % consume 1 byte, that makes counting easy but splitting up bytes (e.g. "bg=%4b, text=%4b") impossible. Byte-splitting also gets tricky if you want the output formatted in decimal (e.g. how many bits does %4u need?). Does anybody actually care about sub-byte output? I only brought it up as something to consider. It's not quite as pretty, but you could achieve the same general effect with e.g. "(bg 4, text 4) = (%8b)".

The insertion utility should be able to figure out how to convert ambiguous parameters based on the table entry. So if we had <window x=45, y=10> in the script and $FE=<window x=%2X, y=%u> in the table, we would know that we needed to write FE 45 0A. If the table had $FF=<window x=%2X, y=%2X> instead, we would know that we needed to write FF 45 10. That's why it's important to ignore printf parameters when checking for table entry uniqueness, since if we had both $FE and $FF entries in the same table, we wouldn't be able to tell how to reverse engineer the correct hex.

Insertion for Table Switching Details:
Reading it over again, I see that my question was poorly worded. What I meant to ask was whether fallback on any byte would trigger the end of dakuten marking, or whether only the closing 7F had that effect. So if @Dakuten did not contain an entry for 6F, and a 6F was encountered while outputting dakuten marks, would the game try to continue adding marks?

I think we should go with a separate option. In addition to the "same start/stop token" issue you raised, extending the "0" functionality like this would also mean that tables that don't have this on/off behaviour would be unable to contain an entry for any hex sequences that switch into them (and one table can still be switched into from multiple other tables). Pretend the @HIRA/@KATA/@KANJI example in 2.6 had an entry in @KANJI for F8 or F9 to see what I mean.

On a theoretical level, raw hex is an absolute mess. With all the additions made to the table file standard (and ignoring legacy scripts), is there any good reason for having raw hex in an insert script? In a perfect world, I would also like to deal with token mutation, but that's really more of a text engine design problem than a table file standard problem. If you've got something that works for the simplified table switch structure (i.e. only 1 or infinite matches, fewer choices of how to move from table A to table B), don't let me hold you back. I'll try to come up with something that makes me happy too, but my time for the next couple of weeks is mostly already spoken for.

As a related but possibly inconsequential aside, it seems a shame to remove support for dumping the more complicated table switch structures just because we haven't come up with a way to insert them. Given that insertion algorithms are outside the strict scope of the standard, maybe we can leave the more complicated version in the standard?

Nightcrawler

  • Hero Member
  • *****
  • Posts: 5788
    • View Profile
    • Nightcrawler's Translation Corporation
Re: Table File Standard Discussion
« Reply #39 on: August 15, 2011, 02:23:10 pm »
Comments:
I am not storing strings as token lists at present, so a global newline operation is trivial, whereas moving to the token level requires more from me. Do you do anything else of interest that requires the string in token list form? So far, with all the features I support, I have not needed it besides this type of token-level formatting. I may move in that direction if I find it to be useful.

Quote
For including literal %, why not go with the standard %%? That way you could just interpret the entire text sequence as your sprintf and pass it the next few bytes as arguments. You would need to know how many bytes the string consumes, of course; if we make every non-literal % consume 1 byte, that makes counting easy

sprintf does not exist in .NET languages. So as I mentioned, I either need to make my own or try to convert to something similar (but different), which would be String.Format(). So, I will need to process the text sequence. For dumping, I was hoping to get away with simply searching for the allowed values (%d, %x, %b) and substituting the appropriate converted parameters. Simple. :) I'd rather not get into having to search for patterns, convert some parameters to multi-byte values, endianness, padding, etc. Complexity goes way up quickly, due to my lack of a native sprintf, for little gain.
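For what it's worth, the substitution itself stays small in any language; below is a rough, language-agnostic Python sketch of that search-and-replace, assuming the simplified single-byte specifiers under discussion (hex, decimal, binary) with one parameter byte per specifier.

Code:
import re

FORMATS = {'X': lambda b: '%02X' % b,          # hex, upper case
           'x': lambda b: '%02x' % b,          # hex, lower case
           'd': lambda b: '%d' % b,            # decimal
           'b': lambda b: format(b, '08b')}    # binary

def dump_linked(entry_text, param_bytes):
    """entry_text e.g. '<window x=%X, y=%d>'; param_bytes are the bytes that
    follow the linked entry's hex in the ROM, one byte per specifier."""
    params = iter(param_bytes)
    return re.sub('%([xXdb])', lambda m: FORMATS[m.group(1)](next(params)), entry_text)

print(dump_linked('<window x=%X, y=%d>', bytes([0x45, 0x0A])))   # <window x=45, y=10>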

Quote
The insertion utility should be able to figure out how to convert ambiguous parameters based on the table entry. So if we had <window x=45, y=10> in the script and $FE=<window x=%2X, y=%u> in the table, we would know that we needed to write FE 45 0A. If the table had $FF=<window x=%2X, y=%2X> instead, we would know that we needed to write FF 45 10. That's why it's important to ignore printf parameters when checking for table entry uniqueness, since if we had both $FE and $FF entries in the same table, we wouldn't be able to tell how to reverse engineer the correct hex.

The concept is clear, but I'm a little unclear on the mechanics of this. You read in the text sequence "<window x=45, y=10>". How do you go about figuring out that it corresponds to your entry of "$FF=<window x=%2X, y=%2X>"? Would you want to expunge the parameter info from both and see if they match? I can do that with the table entry by removing the '%2X's, but how do I detect and differentiate the parameters in the text sequence from the rest of the string? How do I know that it's the '45' and '10' that need to be removed?
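One way this could work (purely a sketch, and slightly ironic given the earlier regex discussion): build a pattern from the table entry itself, escaping the literal text and turning each specifier into a capture group, so the entry tells you both where the parameters sit in the dumped text and how to convert them back to bytes. The sketch below again assumes the simplified single-byte %X/%d/%b specifiers.

Code:
import re

PATTERNS = {'X': '([0-9A-Fa-f]{1,2})', 'd': '([0-9]{1,3})', 'b': '([01]{1,8})'}
PARSERS  = {'X': lambda s: int(s, 16), 'd': lambda s: int(s, 10), 'b': lambda s: int(s, 2)}

def compile_entry(hex_value, entry_text):
    """Turn e.g. (0xFE, '<window x=%X, y=%d>') into a regex plus converters.
    Literal '%' in the entry text is not handled in this sketch."""
    parts = re.split('(%[Xdb])', entry_text)
    pattern = ''.join(PATTERNS[p[1]] if p.startswith('%') else re.escape(p) for p in parts)
    converters = [PARSERS[s] for s in re.findall('%([Xdb])', entry_text)]
    return hex_value, re.compile(pattern + '$'), converters

def insert_linked(text, compiled_entries):
    for hex_value, regex, converters in compiled_entries:
        m = regex.match(text)
        if m:
            return bytes([hex_value] + [c(g) for c, g in zip(converters, m.groups())])
    raise ValueError('no table entry matches: ' + text)

entries = [compile_entry(0xFE, '<window x=%X, y=%d>')]
print(' '.join('%02X' % b for b in insert_linked('<window x=45, y=10>', entries)))
# FE 45 0A

Note that if both a decimal-y entry and a hex-y entry were compiled for the same control code text, "<window x=45, y=10>" would match both patterns, which is exactly the ambiguity abw describes above and why such duplicates should be rejected.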

Insertion for Table Switching Details:
I think that treads into undefined text engine behavior territory. Behind the scenes, upon a dakuten switch, games will typically either actually draw the marks on the last drawn tile, or simply switch to another area to load already-marked characters. In the ones that I have seen, if you were to feed the game an undefined table value after it switched, it will usually print garbage and draw the marks, or just print a garbage tile, depending on its implementation scheme. I don't think I have personally seen any that would end the dakuten under any condition other than the closing control or the end of the string.

Quote
On a theoretical level, raw hex is an absolute mess. With all the additions made to the table file standard (and ignoring legacy scripts), is there any good reason for having raw hex in an insert script? In a perfect world, I would also like to deal with token mutation, but that's really more of a text engine design problem than a table file standard problem. If you've got something that works for the simplified table switch structure (i.e. only 1 or infinite matches, fewer choices of how to move from table A to table B), don't let me hold you back. I'll try to come up with something that makes me happy too, but my time for the next couple of weeks is mostly already spoken for.

Raw hex is there for undefined data. Imagine dumping a menu block or a heavy text block intermixed with code or commands. Theoretically, perhaps you could define all of the controls, commands, and rogue data in your table, but it's highly impractical and I doubt anybody does that. Perhaps you could take the time to identify all the text and make a pointer list for every last one. In practice, however, when people dump menu areas, they don't care what all that undefined data is, only that it gets put back in. It's much quicker to just dump a block of junk, edit the text you see, and stick the whole thing back in. This is not a problem with the free-form raw hex output that current utilities handle. I think people would say it's probably absolutely critical to have raw hex ability for a variety of applications. Anyone else want to chime in?

Quote
As a related but possibly inconsequential aside, it seems a shame to remove support for dumping the more complicated table switch structures just because we haven't come up with a way to insert them. Given that insertion algorithms are outside the strict scope of the standard, maybe we can leave the more complicated version in the standard?

First, I don't believe we lose anything of value, and instead we simplify considerably. Don't you think it's a bad idea (and irresponsible) to release a standard without having come up with any known implementation of it? While the algorithm itself may be outside of the standard, the problem you're asking to be solved is in the standard. Let's take the extreme: perhaps it's nearly impossible to do what was asked. That's probably not true, but we know it's definitely not trivial. The solution, if even derived, would probably be the most complicated thing in the standard and probably beyond my ability. I do have to limit it to my own ability. As I've said before, I can't back something I cannot implement. What good is that? :P It makes me sad to know how out of reach it would be for anybody worse off than myself.  :(
TransCorp - Over 15 years of community dedication.
Dual Orb 2, Wozz, Emerald Dragon, Tenshi No Uta, Herakles IV SFC/SNES Translations