Author Topic: Table File Standard Discussion  (Read 26105 times)

rveach

  • Newbie
  • *
  • Posts: 22
    • View Profile
Re: Table File Standard Discussion
« Reply #60 on: March 13, 2012, 04:15:49 pm »
How is this different from what some of these guys do?

For the same reason you made (2.5) Linked Entries and (2.6) Table Switching.
You could do those things with a (2.2) normal entry (excluding infinite loops), but it requires a lot of writing depending on how complicated things get. So it saves the time of writing out each entry by hand, avoids user errors that could result from manual typing, and reduces the size of the table file.

For a game that uses SJIS, a complete table of that SJIS requires ~67 KB and ~6800 lines (I will show my file if you need to see it), and that doesn't include the custom codes the game uses. I only used ASCII in my examples above, for simplicity.

So allowing Character Maps would reduce the file size and make the table file faster to understand (if you're reading someone else's), while memory usage would probably only increase slightly. The only extra memory would be from making sure you don't overwrite anything you defined in the table specifically. The character map entries will be in memory regardless of whether they are manually entered or generated, so they don't add any extra memory.

Is defining creation of the map part of defining the map?

I'm not sure what you are saying here.
If you mean the 'user custom map', then it is defining the ordering and letters in the map on that one line.
If you mean the normal maps, then maybe look at my code below to see when the map is defined in memory.

How do you propose something like this be implemented code wise?

Well, there must be some way to programmatically generate a list of printable characters in a specific character map, but I haven't found a way with Google yet. If it's not possible, then it's up to the implementations to build the most popular character maps into their programs, or to support multiple DLLs that will contain them, thus allowing users to add their own maps and add unpopular ones.
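
One possible way to generate such a list is to let an existing codec do the work. This is only a minimal sketch in Python, assuming the game uses plain two-byte Shift-JIS exactly as implemented by Python's built-in `shift_jis` codec (real games often deviate from it). The function name `generate_sjis_table` is made up for illustration; the output uses the normal entry layout "hexadecimal sequence=text sequence".

```python
def generate_sjis_table():
    """Return one normal table entry per printable two-byte Shift-JIS code."""
    # Lead and trail byte ranges of two-byte Shift-JIS (JIS X 0208 area).
    lead_bytes = list(range(0x81, 0xA0)) + list(range(0xE0, 0xF0))
    trail_bytes = list(range(0x40, 0x7F)) + list(range(0x80, 0xFD))
    lines = []
    for lead in lead_bytes:
        for trail in trail_bytes:
            raw = bytes([lead, trail])
            try:
                text = raw.decode("shift_jis")
            except UnicodeDecodeError:
                continue  # this code is unassigned in the codec
            # Keep only single printable characters (skips e.g. the wide space).
            if len(text) == 1 and text.isprintable():
                lines.append(f"{raw.hex().upper()}={text}")
    return lines


if __name__ == "__main__":
    table = generate_sjis_table()
    print(len(table), "entries; first few:", table[:3])
```

The game's custom codes would still have to be added by hand, and any character map the codec library does not know would need its own data file.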

This is my first crack at how to implement the character maps (which may look far from perfect lol), but there may also be a better way:
Code:
init list storage
init removal list

main loop:
    read line

    if (line is a character map support)
        parse starting byte
        parse length of character map //if it is custom, otherwise we will have the length builtin
        save rest of line and the parsed values into list storage

        continue
    end if

    **process line like normal**

    if (line identifies hex code)
        if (hex code is in the range of one of list storages) // requires looking through all of list storage
            add hex code to removal list
        end if
    end if
end main loop

foreach list storage
    add new printable characters from maps // like it was in the format: hexadecimal sequence=text sequence
        in list storage minus what is in removal list
end foreach
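
The pseudocode above could be fleshed out roughly as follows. This is a hedged sketch only: the function name `expand_character_maps` and the way a map is passed in (a start value plus a string of characters in order) are assumptions for illustration, not syntax from the standard. The "removal list" becomes a simple check against the explicitly defined entries, so explicit entries always win.

```python
def expand_character_maps(char_maps, explicit_entries):
    """Merge generated character map entries with explicit table entries.

    char_maps: list of (start_value, characters) pairs, e.g. (0x41, "ABCD").
    explicit_entries: dict mapping hex value -> text, as typed in the table.
    """
    table = {}
    for start, characters in char_maps:
        for offset, char in enumerate(characters):
            code = start + offset
            if code in explicit_entries:
                continue  # on the "removal list": the explicit entry wins
            table[code] = char
    # Explicit entries are added unchanged.
    table.update(explicit_entries)
    return table


if __name__ == "__main__":
    merged = expand_character_maps([(0x41, "ABCD")], {0x42: "[END]"})
    for code in sorted(merged):
        print(f"{code:02X}={merged[code]}")
```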
« Last Edit: March 13, 2012, 04:37:04 pm by rveach »

henke37

  • Hero Member
  • *****
  • Posts: 627
  • Location: Sweden
    • View Profile
Re: Table File Standard Discussion
« Reply #61 on: March 14, 2012, 06:27:59 pm »
When in doubt, just include the standard mapping as a second table file that has to be bundled with the table reader.

Nightcrawler

  • Hero Member
  • *****
  • Posts: 5989
    • View Profile
    • Nightcrawler's Translation Corporation
Re: Table File Standard Discussion
« Reply #62 on: March 19, 2012, 11:19:17 am »
I'm afraid this standard may meet a tragic and embarrassing end.  :-[

Apparently, on Feb 6th, the draft was somehow overwritten with an incorrect version on transcorp (probably from a hasty click fest in an FTP program on my part). It does not include the majority of my changes from the 10 June draft despite being marked as such. My current local copies were subsequently refreshed to match. So now I don't actually have any copies with all of the changes mentioned in this post (for some reason some of them are there). I have absolutely no desire to re-do all that work.

Does anyone else happen to have a copy of this from between June 2011 and Feb 6th, 2012?


EDIT: I found a copy from August 2011 from an old site backup and a copy on a USB Stick from June 23rd 2011. They still don't include the changes listed from June. It seems it may have been mis-uploaded right from the start in June. :'(
« Last Edit: March 19, 2012, 05:30:00 pm by Nightcrawler »
TransCorp - Over 15 years of community dedication.
Dual Orb 2, Wozz, Emerald Dragon, Tenshi No Uta, Herakles IV SFC/SNES Translations

henke37

  • Hero Member
  • *****
  • Posts: 627
  • Location: Sweden
    • View Profile
Re: Table File Standard Discussion
« Reply #63 on: March 21, 2012, 01:19:56 pm »
There is always the option of trying to write it again. You could even include new features.

Nightcrawler

  • Hero Member
  • *****
  • Posts: 5989
    • View Profile
    • Nightcrawler's Translation Corporation
Re: Table File Standard Discussion
« Reply #64 on: March 28, 2012, 02:11:00 pm »
Having nothing was unacceptable. Nine days of hell gave birth to a draft that is ready for final review and edit.

The new draft contains all features previously discussed and rules accounting for nearly all edge cases presented in this topic!

List of Changes

The only things that did NOT make it in:
  • Variable Parameter Lists
  • Control parameters accessing external tables
  • Character Map Support

These items either came too late, details were not fleshed out, or added complication was undesirable. In light of taking 2 years to flesh out what we have, the standard remains frozen and no new features will be considered at this time. Sorry.

Notes:

1. [] for Non-Normal Entries
[] were chosen for this as they were the popular choice in this topic. However, consideration should be made for <>,{},«», or other pairing of characters (remember they are disallowed in normal entries). This is easily changeable if anybody has an opinion one way or the other. For some reason, I am drawn to {} more than [].

2. Token Insertion Mutation (from hex literal or in careful insertion)
The only edge case not fully addressed is token mutation. This is an issue specific to what happens to a stream of binary output after attempted insertion. I don't believe this can be fully addressed in the table standard. The standard can only reach as far as setting some rules or guidelines to keep these situations from occurring. It has done so for all the cases we properly addressed (raw hex and others). Because token mutation is possible even within a single valid table (example at bottom of this post), it doesn't seem possible to address all possible instances. If you have suggested passages to add to the standard that might better address this, I'd be happy to consider inclusion; otherwise, I believe it to be addressed as best as it can be within the reach of the standard.

FAST6191

  • Hero Member
  • *****
  • Posts: 1863
    • View Profile
Re: Table File Standard Discussion
« Reply #65 on: May 02, 2012, 10:02:48 pm »
I did not miss it but forgot to post something

Still, I read it several times, so thanks for that, as it refreshed me on a few concepts I had more or less dismissed in recent years. Kuten/Handakuten and Yoon/Youyon, or diacritics in general, are usually kicked to extra font entries. Although I am no stranger to multiple encodings per game/file/script, switching out for something as simple as the kana is not something I tend to see any more. I actually cannot recall the last time I saw 3.3 Example 3, although I guess I also cannot remember the last time I saw true word-level dictionary compression either, save for some curious situations in LZ. I pondered how I might apply it to some situations that are certainly not common but that I have encountered nonetheless. They will mostly be from the DS, Wii and maybe parts of the 360, which I guess is my main source of hacking work and probably why I had not seen the things above in a while (memory management is still a thing but not half as aggressive as it had to be for the 8 and 16 bit era). For the most part consider this a +1 with some pointless waffle from me. I have no real problems or need for solutions to the things below and would be quite happy to see this implemented, beyond wondering if it is worth having a right-to-left reading flag, as Arabic, Hebrew and similar languages are getting a few translations nowadays (we can probably ignore tategaki though and kick it to a control code, and fairly safely assume boustrophedon (alternating directions) is a thing of the very distant past).

I have reservations about some of the insertion side of things, but that is always going to be the case, and I agree this is probably a good way to serve the most people at once. Some games support several encoding types and are quite fussy about replacing like with like even in the same line. Zombie Daisuki for the DS used parsed plain text files with some line numberings but switched between ASCII and shiftJIS, often without ASCII or U16 fallback options (shiftJIS implementations on the DS more often than not miss out that part of the spec, and although it is better nowadays, earlier on it was a cause for a minor celebration to have the game include the Roman characters as part of the actual shiftJIS section), meaning the collision workarounds might not be ideal. Also there might be an issue with some things having multiple end tokens (usually for line end (line breaks were different), section end, a part end and maybe a file end), but that can probably be worked around as well.

Starting with the crazy silly thing.
Riz Zoward/Wizard of Oz beyond the yellow brick road. Helping out with an English to Spanish translation (US version of the rom as the base).
Some text in it was fixed length, some was standard pointer text, and the third and most interesting part was damn near scripting engine grade.
In short each entry was a type code a few bytes long, a length value for the type, length and if any the resulting payload and a payload itself if there was any (think cheat encoding methods). The payload consisted of text and control characters of various forms as well as calls to various 2d and 3d image files and presumably a bit more than that (I did not take reverse engineering the format that far after I figured out roughly what went). The type codes and lengths probably could have amounted to fixed length entries which might help things but I think I mainly mentioned this to be annoying.

Also possibly related http://blog.delroth.net/2011/06/reverse-engineering-a-wii-game-script-interpreter-part-1/

Related to the above: back in the standard, prior to the characters being named there were other names being used, with the option of them being replaced from a name entry screen (not sure how the game eventually did it) where traditionally we might have seen placeholders. I might be able to abuse control codes, but a "read only" flag of some form might be an idea, although thinking further that might just be my being lazy, and half of this project seems to be about avoiding the extra cruft, so ignore that.

More recently I was pulling apart Atelier Lina ~The Alchemist of Strahl although I have seen it before in other files/formats.
Text was shiftJIS but each section was bounded by a hex number starting from 0000 and going up from there. Short sample
Code:
?????

錬金術の極意

踏ん張り

フラムっぽい物

みんな頑張れっ!

テラフラムっぽい物

どんぐりメテオ

It does not take too long to hit actual characters and perhaps more troubling control codes which in this case were not 8 bit but I have seen several with hard 16 bit characters and 8 bit control codes (or similarly non multiple of 16 bit placeholders) that troubled more basic parsers but that is probably not where this is heading.

Games using XML esque structures. I certainly see XML in games far more but that is usually something that escaped the cleanup process before building or for unrelated files. Probably can be ignored as they tend to be using known encodings and are not really the domain of table files.

Games using square brackets as escape control code/placeholder value indicators. A simple workaround I guess but one I might have to think about (for no other reason I would sub in the corner* or curly brackets).

Use of the yen symbol*, I guess I could always do alt 0165 (must remember the 0) or otherwise define/bind something or if I am truly lazy copy and paste but it does not tend to appear on a European keyboard and I am lazy.

*it is not lost on me.

There was probably more I meant to say but it appears to be nearing 3am (again).


abw

  • Jr. Member
  • **
  • Posts: 61
    • View Profile
Re: Table File Standard Discussion
« Reply #66 on: December 13, 2015, 03:21:00 pm »
Well, it looks like I'm super late to the party, but in the fine tradition of this particular thread, I do come bearing a wall of text :D (actually more than one wall, since it seems there's a 15000 character limit on post length). Recently I've been working on updating my extraction/insertion utility based on updates to the Table File Standard (TFS) between June 2011 and March 2012 (that's still the current version, right?). After going through the March 2012 document in detail, I've compiled a list of things in the TFS that I believe are incorrect or unclear. I'll mostly skip over spelling, grammatical, and formatting issues except where they affect understandability of the TFS. Much of this is going to be nit-picking details, so I apologize in advance if the "constructive" gets buried under the "criticism"; I'm just trying to help! As always, thanks for all your hard work :).


1.2.paragraph 1: Maybe say ".tbl file extension" instead of "TBL file extension"? Are we actually requiring table files to follow any particular naming conventions?

2.1: I'm assuming "text-to-text" is a typo and should actually say "text-to-hex".

2.2.1.2/2.2.2: The TFS is pretty clear about only supporting "Longest Prefix" for hex-to-text translation, but there are many different algorithms available for text-to-hex translation. My understanding is that what we're concerned about here isn't really TFS compliance (since that's covered by 2.2.1.4), but letting the end user know what behaviour they should expect from text-to-hex translation, and I think these aren't the right words to use for doing that. As the end user of an insertion utility that implements the TFS, the things I would be most concerned about in terms of text-to-hex translation are knowing whether the inserted bytes are going to be correctly interpreted by the game's text engine as the text I want to see displayed and knowing whether as few bytes will be inserted as possible to achieve that correctness.

With that in mind, how about replacing this section of the TFS with requiring some sort of documentation on the text-to-hex translation algorithm(s) that the utility implements, preferably including correctness and optimality conditions? It doesn't have to be a novel, but seeing something in the utility's documentation like this:
Quote
This utility implements a Longest Prefix insertion algorithm, which guarantees correct text-to-hex translation based on the provided table files as long as the following conditions are satisfied:
   - all table entries are contained in a single table; and
   - no table entry's hex sequence is a prefix of any other table entry's hex sequence; and
   - for each character used in normal table entries, a table entry exists which maps some hex sequence to that single character;
and at least one of the following conditions is satisfied:
   - the text to be translated does not contain raw hex bytes; or
   - the hex sequence of every table entry represents a single byte.
It also guarantees the smallest possible hex length of any correct text-to-hex translation as long as the following additional conditions are satisfied:
   - the hex sequence of every table entry is the same length; and
   - the text sequence of every normal table entry is no more than 2 characters long.
or this:
Quote
This utility implements an A* path-finding insertion algorithm, which guarantees correct text-to-hex translation based on the provided table files and guarantees the smallest possible hex length of any correct text-to-hex translation.
would be pretty great, right?

I think it would also be useful for utility authors to note how their utility handles situations not fully defined by the standard (we'll see a few examples of those below; maybe it would be worth adding a section about theoretically possible scenarios which have never been observed "in the wild"?).

2.2.1.4: Another approach would be to let utilities list which parts are implemented and which parts are not, which would, for example, let people claim partial compliance for otherwise useful utilities that handle everything except multi-table insertion or that need to break a bit of the defined behaviour in order to deal with some weird game's crazy coding (e.g. game X doesn't actually use a Longest Prefix text engine, game Y doesn't exhibit stack behaviour for mid-string encoding changes, etc.). As long as the utility still respects the syntax of valid table files, there's probably not too much harm in an approach like that. And as long as the game an end user is working on doesn't require the missing features, the end user will probably be just as happy either way.

2.5.Label.1/2.5.Label.3: Since there's a new Star Wars movie coming out next week, I'll misquote Yoda: "must" or "must not"; there is no "should" :p.

2.5.Label.3: There appears to be some confusion here about how a Label is defined. If Labels can only contain the characters [0-9A-Za-z], then the Label itself is only the text between '[' and ']' and basically every other statement the TFS makes about Labels contradicts the examples which use Labels.

2.5.1: There's nothing here that actually specifies how to represent "$hexadecimal sequence=[label],parameter1,parameter2" in text. From the examples, it seems that the rule is something like converting commas to spaces in the text sequence, replacing placeholders with their values (while being careful with parameter text like "%%D"), moving the ] from the end of [label] to the end of the text sequence, and then moving any \n to the end of the text sequence and replacing each \n with a newline, but that rule is never stated and the examples aren't quite consistent with it:
2.5.1.Example 1: Where does the \n belong in the text sequence? 2.5.Label.5 and part of 2.6 suggest that the newlines should go between the Label and its following ']', resulting in a text sequence of "[keypress\n]" instead of "[keypress]\n".
2.5.1.Example 2: Where did that '$' come from? There's nothing in the TFS which indicates %X placeholder values can or must be prefixed with '$'.

Also, when replacing placeholders with their numeric values, the TFS should address the issue of leading zeroes. We're explicit about the hex sequence part of a table entry being 2 characters per byte, which implies displaying leading zeroes there, but I don't see anything that enforces that for %B, %D, or %X placeholder values. We should also be explicit about the expected behaviour here. How about making leading zeroes mandatory for %B and %X and optional for %D?
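
To make that proposal concrete, here is a small hedged sketch in Python of how a dumper might format placeholder values under it: leading zeroes always shown for %X and %B, none for %D. The function name `format_placeholder` and the `byte_count` parameter are assumptions for illustration; the TFS does not define them.

```python
def format_placeholder(kind, value, byte_count=1):
    """Format a parameter value for a %X, %B or %D placeholder.

    kind: "X" (hex), "B" (binary) or "D" (decimal).
    byte_count: how many bytes the parameter occupies in the hex stream.
    """
    if kind == "X":
        # Two hex digits per byte, leading zeroes kept.
        return f"{value:0{2 * byte_count}X}"
    if kind == "B":
        # Eight binary digits per byte, leading zeroes kept.
        return f"{value:0{8 * byte_count}b}"
    if kind == "D":
        # Decimal without padding (leading zeroes optional under the proposal).
        return str(value)
    raise ValueError(f"unknown placeholder type: %{kind}")


if __name__ == "__main__":
    print(format_placeholder("X", 5))         # 05
    print(format_placeholder("B", 5))         # 00000101
    print(format_placeholder("D", 5))         # 5
    print(format_placeholder("X", 0x1F3, 2))  # 01F3
```

An inserter following the same proposal would then accept "5" or "05" for %D but reject "5" for a one-byte %X.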

2.5.2.Example 1: The explanation here is not strictly correct: if 0xFF is encountered as part of another token (e.g. 0xFF13 or 0xDEFF), we don't output "[END]\n\n", since that would violate the Longest Prefix extraction specified by 2.10. This wording issue also occurs in other examples.
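
A short sketch may help show why the wording matters. The Python below is a minimal Longest Prefix dumper for a single table, not the TFS reference algorithm; the function name `longest_prefix_dump`, the sample entries, and the `<$XX>` form used for unmatched bytes are all assumptions made for illustration.

```python
def longest_prefix_dump(data, table):
    """Translate bytes to text, always preferring the longest matching entry.

    table: dict mapping hex sequences (bytes) to text sequences (str).
    """
    max_len = max(len(hex_seq) for hex_seq in table)
    out = []
    pos = 0
    while pos < len(data):
        # Try the longest possible hex sequence first, then shorter ones.
        for length in range(min(max_len, len(data) - pos), 0, -1):
            chunk = data[pos:pos + length]
            if chunk in table:
                out.append(table[chunk])
                pos += length
                break
        else:
            # No entry matched: emit the byte in an assumed raw hex notation.
            out.append(f"<${data[pos]:02X}>")
            pos += 1
    return "".join(out)


# Hypothetical table: FF is an end token, but FF13 and DEFF are longer entries.
SAMPLE_TABLE = {
    b"\xFF": "[END]\n\n",
    b"\xFF\x13": "X",
    b"\xDE\xFF": "Y",
    b"\x01": "a",
}

if __name__ == "__main__":
    print(repr(longest_prefix_dump(b"\xFF\x13\x01\xDE\xFF\xFF", SAMPLE_TABLE)))
```

With this table, only the final 0xFF is dumped as the end token; the 0xFF bytes inside 0xFF13 and 0xDEFF are consumed as part of the longer entries.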

2.5.3.Rules.2: Have we really decided to include this restriction? I'm not sure what value it adds, especially since it can be trivially circumvented by e.g. making N copies of the same table with different Table IDs and then setting up your main table like:
Quote
C1=[Kanji1],1
C2=[Kanji2],2
C3=[Kanji3],3
If we get rid of this rule (which I think we should), we'll also need to update 2.5.3.Table Switch Format.Notes.2, which also effectively prevents a source table from switching to a destination table via multiple entries.

2.5.3.Table ID: As far as I can tell, the TFS still allows multiple @TableIDString lines in a file. Since we've killed off support for multiple logical tables in a single table file, we want there to be at most one @TableIDString line per file, right? Or are we supporting multiple IDs for the same table file?

2.5.3.Table ID: This section states that "The TableIDString can contain any characters except for ','.", but 2.5.3.Table Switch Format.Notes.2 implies that only tables whose ID is composed entirely of the characters [0-9A-Za-z] are able to be switched into, which means that tables whose ID contains any other characters can only be used as starting tables. Is that intentional?

2.5.3.Table ID: We should specify that the TableIDString only needs to be unique across all tables provided to the utility, not e.g. across all tables on Data Crystal or something like that.

2.5.3.TableID: I don't see anything that prevents a table from "switching" to itself; should e.g.
Quote
@HIRA
!F7=[HIRA],2
be considered an error?

2.5.3.NumberOfMatches.-1: Does this type of entry in one table imply a corresponding entry in another table? e.g. is
Quote
@NORMAL
!7F=[Dakuten],-1
@Dakuten
7F=foo
an error? Would it be possible to also have e.g. !7E=[Dakuten],5 (maybe in some table other than NORMAL) and match 7F=foo in Dakuten?

Closely related to the above point, it's not clear whether the closing token counts as a match in the pre-fallback table or in the post-fallback table. E.g., for
Quote
@NORMAL
01=foo
!23=[Kanji],2
@Kanji
01=bar
!7F=[Dakuten],-1
@Dakuten
02=baz
with input 0x23 0x7F 0x02 0x7F 0x01, which of these scenarios (if any) is correct?
Quote
0x23 in NORMAL: switch to Kanji for 2 matches
0x7F in Kanji, match #1: switch to Dakuten until 0x7F
0x02 in Dakuten: "baz"
0x7F in Dakuten: fall back to Kanji
0x01 in Kanji, match #2: "bar"
result: "bazbar"
Quote
0x23 in NORMAL: switch to Kanji for 2 matches
0x7F in Kanji, match #1: switch to Dakuten until 0x7F
0x02 in Dakuten: "baz"
no match in Dakuten: fall back to Kanji
0x7F in Kanji, match #2: returning from Dakuten
made 2 matches in Kanji: fall back to NORMAL
0x01 in NORMAL: "foo"
result: "bazfoo"
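
The two readings can be compared side by side with a toy dumper. This is a sketch under loud assumptions: tokens are single bytes, tables are plain Python dicts, and the flag `closer_counts_in_parent` is invented here purely to select between the two scenarios. Nothing in it is taken from the TFS text itself.

```python
def dump_with_switching(data, tables, start, closer_counts_in_parent):
    """Dump single-byte tokens with a table stack.

    tables: table id -> {token: text} for normal entries, or
            {token: (target table id, NumberOfMatches)} for switch entries.
    """
    # Each frame: [table id, matches left (None = unlimited), closing token].
    stack = [[start, None, None]]
    out = []
    pos = 0

    def count_match():
        if stack[-1][1] is not None:
            stack[-1][1] -= 1

    def settle():
        # Fall back out of every table whose match count is used up.
        while stack[-1][1] == 0:
            stack.pop()

    while pos < len(data):
        table_id, _, closer = stack[-1]
        token = data[pos]
        entry = tables[table_id].get(token)
        if token == closer:
            # Closing token of a -1 switch: consume it and fall back.
            stack.pop()
            pos += 1
            if closer_counts_in_parent:
                count_match()
            settle()
        elif isinstance(entry, str):
            out.append(entry)
            pos += 1
            count_match()
            settle()
        elif entry is not None:
            target, matches = entry
            count_match()  # the switch token is a match in the current table
            pos += 1
            if matches == -1:
                stack.append([target, None, token])
            else:
                stack.append([target, matches, None])
        elif len(stack) > 1:
            stack.pop()  # no match here: fall back and retry the same token
            settle()
        else:
            out.append(f"<${token:02X}>")  # assumed raw hex notation
            pos += 1
    return "".join(out)


TABLES = {
    "NORMAL": {0x01: "foo", 0x23: ("Kanji", 2)},
    "Kanji": {0x01: "bar", 0x7F: ("Dakuten", -1)},
    "Dakuten": {0x02: "baz"},
}
DATA = bytes([0x23, 0x7F, 0x02, 0x7F, 0x01])

if __name__ == "__main__":
    print(dump_with_switching(DATA, TABLES, "NORMAL", False))
    print(dump_with_switching(DATA, TABLES, "NORMAL", True))
```

With the flag off, the closing 0x7F is charged to Dakuten and the dump matches the first scenario; with the flag on, it is charged to Kanji as match #2 and the dump matches the second.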

2.5.3.NumberOfMatches.-1: Further exploring the behaviour of these forced fallback entries, what happens for
Quote
@table1
!00=[table2],-1
02=a
@table2
!01=[table3],-1
02=b
@table3
00=c
02=d
with input 0x00 0x01 0x00 0x02? Under the current TFS wording, that second 0x00 triggers fallback all the way to table1 and the output is "a", but I feel like we should be expecting 0x00 to match in table3 and have "cd" as our output.

2.5.3.NumberOfMatches.X: The TFS never really defines the term "match", but in this case the precise meaning becomes more important: do bytes which are dumped as control code parameters count as separate matches towards X? I think we decided earlier in this thread that they did not (i.e. that the control code together with all of its parameter bytes count as a single match), but that decision doesn't seem to have made its way into the TFS.

2.9: It might be worth noting that this behaviour is algorithm dependent; e.g. it's possible for Longest Prefix insert to back itself into a corner and fail on input that other algorithms would succeed on.

2.11: Going back to my points about 2.2.1.2/2.2.1.4/2.2.2, it would be nice to see some kind of correctness condition here, i.e. that the hex produced by the utility when inserting text A must be extracted as text A by the Longest Prefix extraction algorithm from 2.10.

2.12.Duplicate Hex Sequences: Maybe add a note here to confirm that having the same hex sequence occur in different tables is okay.

2.12.Unrecognized Line or Invalid Syntax: I'd like to propose a slight extension to the TFS: any line which begins with the character '#' must be ignored during parsing. This would allow for comments inside table files, which would be very useful for end users, and comes at a negligible cost to utility authors.
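
To show how little this would cost utility authors, here is a minimal Python sketch of a table reader with the proposed rule added. It only handles normal entries, and it also skips blank lines, which is an assumption of this sketch rather than something the TFS states; the function name `parse_table_lines` is made up.

```python
def parse_table_lines(lines):
    """Parse normal entries ("hexadecimal sequence=text sequence").

    Lines starting with '#' are treated as comments and ignored.
    """
    entries = {}
    for line in lines:
        line = line.rstrip("\r\n")
        if not line or line.startswith("#"):
            continue  # blank line (assumed skippable) or proposed comment line
        hex_part, separator, text = line.partition("=")
        if not separator:
            raise ValueError(f"unrecognized line: {line!r}")
        # bytes.fromhex raises ValueError on anything that is not valid hex.
        entries[bytes.fromhex(hex_part)] = text
    return entries


if __name__ == "__main__":
    sample = ["# dictionary entries", "01=test", "0102=bar"]
    print(parse_table_lines(sample))
```

Since a hex sequence can never start with '#', the comment check cannot collide with a valid normal entry.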

2.13.Duplicate Text Sequences: It looks like this rule is left over from when Longest Prefix was the only insertion algorithm that had been considered. Now that the floor is open to other algorithms, this rule should be removed (or at least reduced to a suggestion for anyone wanting to implement Longest Prefix), since it can make legitimate text sequences impossible to insert correctly. As an example,
Quote
01=test
02=foo
0001=test
0102=bar
the hex sequence 0x00 0x01 0x02 is dumped as "testfoo", but enforcing this rule would result in "testfoo" being inserted as 0x01 0x02, which the Longest Prefix hex-to-text algorithm would translate as "bar". Smarter algorithms not bound by this rule would be able to use the different options for encoding "test" to find a tokenization that would not be misinterpreted by the dumping algorithm.
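
The example can be checked with a few lines of Python. This is a brute-force sketch, not an A* implementation and not anything prescribed by the TFS: it tries every tokenization of the text, keeps only those that the Longest Prefix dumper reads back unchanged, and picks the shortest. The names `dump`, `candidate_encodings` and `insert_round_trip_safe`, and the `<$XX>` raw hex form, are assumptions for illustration.

```python
TABLE = [
    (b"\x01", "test"),
    (b"\x02", "foo"),
    (b"\x00\x01", "test"),
    (b"\x01\x02", "bar"),
]


def dump(data, table):
    """Longest Prefix hex-to-text translation for a single table."""
    lookup = dict(table)
    max_len = max(len(hex_seq) for hex_seq in lookup)
    out, pos = [], 0
    while pos < len(data):
        for length in range(min(max_len, len(data) - pos), 0, -1):
            chunk = data[pos:pos + length]
            if chunk in lookup:
                out.append(lookup[chunk])
                pos += length
                break
        else:
            out.append(f"<${data[pos]:02X}>")  # assumed raw hex notation
            pos += 1
    return "".join(out)


def candidate_encodings(text, table):
    """Yield every hex encoding obtainable by tokenizing text with the table."""
    if not text:
        yield b""
        return
    for hex_seq, entry_text in table:
        # Skip blank text sequences so the recursion always makes progress.
        if entry_text and text.startswith(entry_text):
            for rest in candidate_encodings(text[len(entry_text):], table):
                yield hex_seq + rest


def insert_round_trip_safe(text, table):
    """Return the shortest encoding that dumps back to text, or None."""
    safe = [c for c in candidate_encodings(text, table) if dump(c, table) == text]
    return min(safe, key=len) if safe else None


if __name__ == "__main__":
    naive = next(candidate_encodings("testfoo", TABLE))
    print(naive.hex(), "->", dump(naive, TABLE))
    best = insert_round_trip_safe("testfoo", TABLE)
    print(best.hex(), "->", dump(best, TABLE))
```

The search is exponential in the worst case, so a real utility would want something smarter, but it is enough to show that a correct encoding exists once both entries for "test" are allowed.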

2.13.Blank Text Sequences: Does anyone have a use case for this? I'm having trouble coming up with a good reason why anyone would ever want this.

I'll see if I can include my comments on sections 3+ in a separate post. Edit: nope, auto-merge killed that idea :p.
« Last Edit: December 21, 2015, 09:15:43 pm by abw »

abw

  • Jr. Member
  • **
  • Posts: 61
    • View Profile
Re: Table File Standard Discussion
« Reply #67 on: December 21, 2015, 09:12:40 pm »
(TFS Version 1.0 Draft feedback part 2, including some new things from earlier sections)

2.5.General Rules: 2.4.Rules.5 specifies that, for normal entries, whitespace is allowed in the text sequence only, but there is no such rule for non-normal entries.

2.5.Label.1: Assuming we do revert the change to restrict the number of switches from one table to another in 2.5.3.Rules.2, this uniqueness condition should only apply to end and control tokens.

2.5.Label.5/2.6/2.13.Duplicate Text Sequences: It looks like using \n to control newlines has now been restricted to non-normal entries (should this actually be just end tokens and control codes? Having newlines in switch tokens doesn't make much sense), but the example in 2.13.Duplicate Text Sequences still shows \n in a normal entry.

3.0: It might be nice to provide some additional guidance on best practices for table file setup for new users. Even simple things like making sure your game's control codes are represented as control codes (so you don't pull a Lufia 2 and encode "Maximum" as "[HeroName]um") and being careful about using the same Label for control codes in different tables (e.g. maybe you want "[END]" to be an end token no matter what table the inserter happens to be in at the time, but the game treats "[HeroName]" in one table a little differently from "[HeroName]" in another table) could save inexperienced users some headaches.

As a general note, showing some hex input would greatly improve the quality of the examples in this section.

3.1: This example's already pretty clear, but using "3001=DictEntry1" etc. instead of "3001=Entry1" etc. for the normal entries would make the correlation between entries expressed in a single table vs. split across two tables with switching even more clear.

3.2.Example 3: This example contains multiple errors. Table 1 should say "!7F=[Dakuten],-1" (misplaced comma), Table 2 won't actually output 'が' unless the next token is 0x60, and it's not quite clear that the 7F triggering fallback gets consumed and doesn't end up triggering a new switch from NORMAL to Dakuten.

3.3.Example 3: This should say "!C0=[KanjiTableID],3" (missing [ and ])

3.4.Example 2: There are a few more errors here: assuming the top part of the explanation is right and this is really a forced fallback example, Table 1 should say "!E0=[KATAKANA],-1" (missing [ and ], -1 NumberOfMatches instead of 0) and the bottom explanation should be corrected ('ア' won't actually be output until a 0x30 token is matched, falling back from KATAKANA to HIRAGANA on E0 due to no-match instead of forced fallback would definitely result in E0 being matched in HIRAGANA, switching us right back to KATAKANA again).

4.2.paragraph 4/4.3.3/4.3.4: These need to be updated based on the new 2.5.1.Parameters/2.5.1.Placeholders content. "Linked Entries" don't exist anymore, they're now called e.g. "Control Codes with Parameters".

5.0: I like that there are definitions in the TFS, but the terms used here don't match up with the terms used throughout the rest of the document; e.g. after reading this section, people will know what "Hex Value" means, but not what "hexadecimal sequence" or "hex sequence" mean.

5.0.Duplicate: Does this definition add anything to the generally accepted definition of "duplicate"? If so, "value" should be defined lest it be misinterpreted as e.g. "character".