
Author Topic: Table File Standard Discussion  (Read 19824 times)

Klarth

  • Sr. Member
  • ****
  • Posts: 423
  • Location: Pittsburgh
    • View Profile
Re: Table File Standard Discussion
« Reply #40 on: August 15, 2011, 05:01:27 pm »
I'd rather not get into having to search for patterns, convert some parameters to multi-byte, handle endianness, padding, etc.
Agree with no padding.  Multibyte could be useful with a lot of games.  Endian doesn't matter unless we do multibyte.

Quote
Concept is clear, but I'm a little unclear on the mechanics of this.
I'd be in favor of these strings having a standard-defined start and end token, such as [] or <>.  Possibly also removing non-identifier text.  So instead of $FF=<window x=%2X, y=%2X> you'd have $FF=window,%2X,%2X and get [window $DEAD $F00D]...you could even remove the % signs in that simplified format.  We should probably get some user input on this issue.

When the utility hits a start token, you'd search for a linked entry first (either a list or a "special" table entry with "[window ", which works with longest-entry insertion), and fall back to inserting "[" later.  I think I mentioned before that all games that use variable-length parameters (or a variable number of parameters) would be unsupported due to limitations in linked values.  That bit of knowledge should be put into the standard if it isn't.
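For what it's worth, the insertion-side lookup described above is easy to sketch. The entry names and hex values here are purely illustrative, and a real utility would build an index instead of sorting on every call:

```python
# Hypothetical sketch: on hitting a start token during insertion, try
# the longest matching "special" (linked) entry first, and fall back
# to inserting the start token itself as a normal entry.
def encode_at_start_token(text, pos, special_entries, normal_entries):
    """Both maps are text -> hex; returns (hex, next position)."""
    # Longest-entry insertion: the longest special entry wins.
    for entry in sorted(special_entries, key=len, reverse=True):
        if text.startswith(entry, pos):
            return special_entries[entry], pos + len(entry)
    # Fallback: insert "[" itself as a normal entry.
    return normal_entries[text[pos]], pos + 1

specials = {"[window ": "FF"}   # illustrative mapping only
normals = {"[": "5B"}
print(encode_at_start_token("[window $DEAD", 0, specials, normals))  # → ('FF', 8)
print(encode_at_start_token("[end]", 0, specials, normals))          # → ('5B', 1)
```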

Quote
I think people would say it's probably absolutely critical to have raw hex ability for a variety of applications.
Raw hex should never exist in an ideal world as it is "undefined content".  Since this world is not ideal, we regularly make deals with the binary devil so we can remove hours of text engine analysis through a small assumption.  Indeed, ignorance is bliss and the devil very rarely collects.  It also saves a lot of time.  Time that could be better spent.

Nightcrawler

  • Hero Member
  • *****
  • Posts: 5808
    • View Profile
    • Nightcrawler's Translation Corporation
Re: Table File Standard Discussion
« Reply #41 on: August 24, 2011, 11:35:56 am »
Agree with no padding.  Multibyte could be useful with a lot of games.  Endian doesn't matter unless we do multibyte.
How would you suggest the user define endian in the event we did allow multi-byte?

Quote
I'd be in favor of these strings having a standard-defined start and end token, such as [] or <>.

Agreed. I think we are going to go the way of abw's suggested game control code idea. 'Linked Entries' would go away and be replaced by 'Game Control Codes', which would cover all non-normal entries (except maybe table switching), both with and without parameters. They would be required to start and end with [] or <>, and in turn those characters would not be allowed in normal entries. I'd probably choose []. This also solves many issues with raw hex, which would also be output with the chosen characters. This prevents combinations of normal entries accidentally turning into control codes or hex, and makes it clear that uniqueness should be kept/checked on all non-normal entries for unambiguous handling.

Quote
Possibly also removing non-identifier text.  So instead of $FF=<window x=%2X, y=%2X> you'd have $FF=window,%2X,%2X and get [window $DEAD $F00D]...you could even remove the % signs in that simplified format.

I like this approach. It simplifies things. I would be more open to supporting multi-byte parameters if we do it this way. I would also like to keep the byte count to a single digit and cap multi-byte parameters at a maximum of 9 bytes, so the parameter spec can be parsed one character per property. In this way, perhaps endianness could be added like $FF=[window],2XE,2Xe where uppercase 'E' is big endian and lowercase 'e' is little endian for multi-byte parameters. Perhaps that gets a bit convoluted for the user?
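As a quick feasibility check, a spec in that hypothetical one-character-per-property form ("2XE" = 2 bytes, hex, big endian) is trivial to parse. This sketch assumes little endian when no 'E' suffix is given, which is an assumption, not anything decided above:

```python
# Sketch of parsing the proposed parameter spec and dumping raw bytes.
def parse_param_spec(spec):
    size = int(spec[0])                        # byte count, 1-9
    base = spec[1]                             # 'X' hex, 'D' dec, 'B' bin
    endian = "big" if spec[2:] == "E" else "little"
    return size, base, endian

def dump_param(raw, spec):
    size, base, endian = parse_param_spec(spec)
    value = int.from_bytes(raw[:size], endian)
    return {"X": "$%0*X" % (size * 2, value),
            "D": str(value),
            "B": format(value, "0%db" % (size * 8))}[base]

print(dump_param(b"\xDE\xAD", "2XE"))  # → $DEAD (big endian)
print(dump_param(b"\xAD\xDE", "2Xe"))  # → $DEAD (little endian)
```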

The only thing we lose is embedding text around the parameters, such as in the previous x,y window example. Parsing out parameters on insertion appears more difficult to me in that form, though.

Quote
We should probably get some user input on this issue.
It would be nice to get some user input on several items, but I don't think we will. I've seen it before. The comments don't start coming until it's out there and people start using it. Unfortunately, that's often too late. It's been around for a year now and only a small handful have had anything to say.

Quote
I think I mentioned before, that all games that use variable length parameters (or variable number of parameters) would be unsupported due to limitations in linked values.  That bit of knowledge should be put into the standard if it isn't.

How about something crazy like this? "$FE=[window][FD],0X[FF]" For a variable number of parameters, the hex indicator [FD] after window marks it as variable and gives the byte that ends the list.  For variable-length parameters, the '0' represents a variable number of bytes in the parameter, and the enclosed [FF] is the hex that ends it. It seems a bit convoluted, but probably only if you have both variable-length parameters and a variable number of parameters. Otherwise, it's a small extension for each. Just a thought. I'm not sure how else you could try to shoehorn it into things.
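For what it's worth, that syntax does parse mechanically. Here is a rough, entirely hypothetical sketch of a parser for it (the syntax itself is only the idea floated above, not anything in the standard):

```python
import re

# Hypothetical parser for "$FE=[window][FD],0X[FF]": a [hex] right
# after the name marks a variable number of parameters ended by that
# byte; a '0' size marks a variable-length parameter ended by the
# enclosed [hex].
def parse_entry(line):
    hex_code, rhs = line.lstrip("$").split("=", 1)
    name, rest = re.match(r"\[(\w+)\](.*)", rhs).groups()
    entry = {"hex": hex_code, "name": name, "params": []}
    m = re.match(r"\[([0-9A-Fa-f]{2})\]", rest)
    if m:                                     # variable parameter count
        entry["list_end"] = m.group(1)
        rest = rest[m.end():]
    for spec in filter(None, rest.lstrip(",").split(",")):
        size, base, end = re.match(
            r"([0-9])([DXB])(?:\[([0-9A-Fa-f]{2})\])?", spec).groups()
        entry["params"].append(
            {"size": int(size), "base": base, "end": end})
    return entry

print(parse_entry("$FE=[window][FD],0X[FF]"))
```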
TransCorp - Over 15 years of community dedication.
Dual Orb 2, Wozz, Emerald Dragon, Tenshi No Uta, Herakles IV SFC/SNES Translations

henke37

  • Sr. Member
  • ****
  • Posts: 384
  • Location: Sweden
    • View Profile
Re: Table File Standard Discussion
« Reply #42 on: August 26, 2011, 02:26:19 pm »
Talking about control codes, what about control codes that have arguments that one would like to replace with a meaningful value instead of the raw value? I am mainly thinking of cases where a value maps to a string, such as, say, the character name in what I call a speaker badge. Also, there are codes that involve pointers.

I hope that I don't look like an idiot by bringing up something already resolved, I have been trying to pay attention here, but it is difficult.

Klarth

  • Sr. Member
  • ****
  • Posts: 423
  • Location: Pittsburgh
    • View Profile
Re: Table File Standard Discussion
« Reply #43 on: August 27, 2011, 12:25:06 am »
I hope that I don't look like an idiot by bringing up something already resolved, I have been trying to pay attention here, but it is difficult.
I don't think we discussed this specifically, but it does merit some explanation of the proposed approach.  Say if you have a portrait control code of $F0 and the portrait id afterwards...you'd have to do something like this:
$F000=<Carl>
$F001=<Ricky>
$F002=<Forrest>

i.e. you'd have to explicitly define them.  The shorthand approach (what we're covering with "game control codes") is meant more for control codes that have little to no meaning to the story and thus can be treated as hex data.
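Under longest-match dumping, these explicit multi-byte entries coexist fine with shorter entries. A minimal sketch, using the example's entries plus one illustrative normal entry:

```python
# Greedy longest-prefix dump over a hex-string table; the $F000-style
# multi-byte entries win over any shorter-prefix entry.
table = {"F000": "<Carl>", "F001": "<Ricky>", "F002": "<Forrest>",
         "41": "A"}                      # "41": illustrative normal entry

def dump(hexstr, table):
    out, pos = [], 0
    while pos < len(hexstr):
        for length in range(max(map(len, table)), 0, -2):
            key = hexstr[pos:pos + length]
            if key in table:
                out.append(table[key])
                pos += length
                break
        else:
            raise ValueError("no match at byte offset %d" % (pos // 2))
    return "".join(out)

print(dump("F00141F000", table))  # → <Ricky>A<Carl>
```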

henke37

  • Sr. Member
  • ****
  • Posts: 384
  • Location: Sweden
    • View Profile
Re: Table File Standard Discussion
« Reply #44 on: August 27, 2011, 04:32:00 pm »
I wonder if that would work well with more complex control codes.

I guess I just don't like the redundancy that is introduced when mapping the ids in arguments to lengthy commands.

Another thing I just realized: what about common text formatting controls? While no two game engines are alike, surely most of them have the usual formatting features, such as changing the text color and whatnot. Should these commonly occurring features be standardized?

Tauwasser

  • Hero Member
  • *****
  • Posts: 1397
  • Fantabulous!!
    • View Profile
    • My blog
Re: Table File Standard Discussion
« Reply #45 on: August 27, 2011, 04:51:35 pm »
Hi,

Sorry for my long absence, but real life grabbed me by the neck... So anyway, I re-read the thread from start to finish again and I'll share my thoughts with you:

  • I'm okay with comments being gone.
  • I would like to see newlines go, too.
  • End tokens can move to the utility realm.

Table matches with hex literals

I think we should count hex literals as matches for the current table in the event the table produced (or would have produced) the literal: That is to say,
  • an unlimited table would have counted an unknown byte as a fallback case and fallen back to the table below it;
  • if the stack is empty, or the table is not unlimited, the hex literal counts towards it in insertion direction.

This behavior mirrors dumping. It can produce cases that are not correct for the game's display mechanism, but since this case is ambiguous, either choice can be incorrect. This would extend to table matches other than infinite and one match(es).

Restricting table matches to infinite or one match(es)

I think this is a good restriction. Virtually all computer text encodings ever produced have been flat within their code space. In any case, we should think about the following:

  • Many game text encodings are stateful. This is not the case for Unicode or any other traditional encodings including multi-byte encodings that I'm aware of. Even legacy multibyte encodings can be managed inside one flat table, i.e. Klarth's last example.
  • Multi-table options, including multiple match ranges activated from one table A to another table B, are solvable and usable. It is just that we decided them to be in a polynomial complexity class.

The problems we are facing did not exist for implementations that knew the restrictions on their input in advance. We don't necessarily have that luxury. Notice that the last point also indicates that I think the kanji array problem is solvable, however possibly not in linear or quadratic time. I'm not so much concerned about memory, as we happen to have a lot of it in machines built after 2005.

I'm in favor of restricting the current release of the standard and extending the syntax upon finding a suitable solution to a suitable problem that uses multiple matches within one table other than unlimited matches, or different match ranges in the same table B reachable from some table A. So I would definitely be in favor of keeping the current tablename,№Matches syntax.

Direct table fallback

I'm in favor of direct table fallback using the proposed !7F=,-1 syntax. Not having it might cause a theoretically infinite stack of tables for a given string. It's a pretty big oversight and I'm happy that somebody caught it.

Algorithm selection

The current draft does not contain anything pertaining to that. So I have to ask again: we are in favor of multiple allowed insertion algorithms along with compliance levels, right? By compliance levels I mean that any implementation of the standard must indicate, for instance, "TFS Level 1 compliance" for implementing "longest prefix insertion" and "TFS Level 2 compliance" for implementing "longest prefix insertion and A* optimal insertion", and must not make up new compliance levels, which might instead be added in a revision of the TFS.

Control Codes

First of all, yes, linked entries should be renamed to make their purpose clearer.
Also, getting an "insertion test string" for grouping input arguments is simple using regular expressions (or some form of linear substitution method of your choice). For that matter, it seems we should only allow one-byte arguments to avoid endianness issues, and we can simplify the arguments to %D, %X, %B for decimal, hexadecimal, and binary respectively.

<window x=%X y=%D> can be easily turned into a match string using the following substitutions (as well as substitutions to mark user-supplied {}[] etc. as literals):

  • %D → [0-9]\{1,3\}
  • %X → \$[0-9A-Fa-f]\{2\}
  • %B → %[01]\{8\}
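A sketch of those substitutions in practice, assuming a modern regex engine rather than the escaped BRE syntax above (the grouping parentheses are added so the parameter values come back out of a match):

```python
import re

# Turn a control-code definition into a matcher for insertion input,
# using the %D/%X/%B value patterns listed above.
SUBS = {"%D": r"([0-9]{1,3})",
        "%X": r"(\$[0-9A-Fa-f]{2})",
        "%B": r"(%[01]{8})"}

def to_matcher(definition):
    pattern = re.escape(definition)   # mark user-supplied <>[] etc. as literals
    for param, value_re in SUBS.items():
        pattern = pattern.replace(re.escape(param), value_re)
    return re.compile(pattern)

m = to_matcher("<window x=%X y=%D>").match("<window x=$1F y=200>")
print(m.groups())  # → ('$1F', '200')
```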

Notice that we do need to keep identifiers in the output, contrary to what was suggested. If we do not do that, we are open to the following exploits:

Code:
!7E=<window x=%X y=%X>
!7F=<window x=%D y=%D>

How do you parse <window x=10 y=10> without additional context information? You can't. Also, we might want to include signed decimal and signed hex.
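The exploit can be made concrete: if a checker compiles each definition to a regex with parameters replaced by their value patterns, the same input string matches both entries. The prefix-free value patterns here ("10", not "$10") are chosen deliberately to reproduce the ambiguity in the example:

```python
import re

# Compile each control-code definition to a regex with parameters
# replaced by their value patterns, then see which entries can claim
# a given insertion string.
SUBS = {"%D": "[0-9]{1,3}", "%X": "[0-9A-Fa-f]{2}", "%B": "[01]{8}"}

def compile_entry(definition):
    pattern = re.escape(definition)
    for param, value_re in SUBS.items():
        pattern = pattern.replace(re.escape(param), value_re)
    return re.compile(pattern + r"\Z")

codes = {"7E": "<window x=%X y=%X>", "7F": "<window x=%D y=%D>"}
matchers = {hx: compile_entry(d) for hx, d in codes.items()}

hits = [hx for hx, m in matchers.items() if m.match("<window x=10 y=10>")]
print(hits)  # → ['7E', '7F']: both entries match, so the table is ambiguous
```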

Variable parameter list

First of all, I must admit that I have never ever seen a game use these. Then again, it's not impossible. The old syntax would have lent itself rather well to defining these:

Code:
!7F=<window>, 256, 00
The argument length would serve as a sensible maximum argument list limit, and the optional parameter after the second comma would indicate the list end identifier. Having said this, I'm neither strongly in favor of nor strongly opposed to including variable argument lists in any form or fashion.

Regular expression table entries

I still think they are useful, however a burden to implement and make safe with the current control code set. Their main purpose was commenting, and that has been pushed to top-level utilities, so for normal entries nothing is gained by this anymore. For control codes and table options, it might become a safety issue to blindly use user-controlled regex content upon output...

Multiple tables per file

Opposed, because of the added complexity with almost no gain.

I hope I did not miss anything that was supposed to be addressed. Also, please mark this thread as new so other regulars get an email notification about changes.

cYa,

Tauwasser

abw

  • Newbie
  • *
  • Posts: 42
    • View Profile
Re: Table File Standard Discussion
« Reply #46 on: August 29, 2011, 12:54:13 am »
Do you do anything else of interest that requires the string in list of token form?
My primary motivation for maintaining token lists rather than the translated text/hex strings was a couple of edge cases in post-processing dump output. Even given their start and end addresses, detecting exactly where two strings overlap can only really be done at the hex level; in general, it cannot be determined from the translated text (at least not without running through the table translation/tokenization process again). Here's an extreme example: suppose pointer A points to the hex sequence "00 01 02 03 04", and we parse that according to table X as "0001 0203 04". Now suppose pointer B overlaps with pointer A's hex range, but starts one byte later, in the middle of what was a multibyte token for pointer A, and we parse it (also according to table X) as "0102 03 04". Suppose further that pointer C also overlaps with pointer A, starting at the same byte but according to table Y this time, and we parse it as "00 01 02 03" (no "04", since I've decided that "03" is an end token in my table Y).

Now, you may call that crazy, and I would agree with you, but it's still valid input to the utility, so I prefer to handle it rather than ignore it. I found that token lists allowed me to more elegantly handle cases like these, and since the translated strings are easily recoverable from a token list, using token lists involved no loss in functionality. There are a couple of other places where I take advantage of having access to the intermediate tokenization whenever I want it, but that's mostly about program design rather than functionality that couldn't be easily achieved otherwise.
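A tiny illustration of that overlap scenario, assuming a greedy longest-match tokenizer and a made-up table X (hex keys mapping to arbitrary text). The per-token offsets are exactly what make the overlap between pointer A and pointer B precise:

```python
# Tokenize from a given start offset, recording (hex, offset) pairs
# so overlapping strings stay comparable at the hex level.
def tokenize(data, start, table):
    tokens, pos = [], start
    while pos < len(data):
        for length in (2, 1):                 # longest match first
            chunk = data[pos:pos + length]
            if len(chunk) == length and chunk.hex() in table:
                tokens.append((chunk.hex(), pos))
                pos += length
                break
        else:
            break                             # unmatched byte: stop
    return tokens

# Made-up table X with both multi-byte and single-byte entries.
table_x = {"0001": "a", "0203": "b", "0102": "c", "03": "d", "04": "e"}
data = bytes.fromhex("0001020304")

print(tokenize(data, 0, table_x))  # pointer A: [('0001', 0), ('0203', 2), ('04', 4)]
print(tokenize(data, 1, table_x))  # pointer B: [('0102', 1), ('03', 3), ('04', 4)]
```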


Concept is clear, but I'm a little unclear on the mechanics of this.
Tauwasser pretty much covered my response to this :P. I'll add that the idea of replacing parameters by their corresponding regular expressions also takes care of cases where the insertion script might appear to be ambiguous (e.g. <window x=45, y=10> in the script and $FE=<window x=%X, y=10> in the table).


I think that treads in undefined behavior of the text engine territory.
Fair enough. What I'm trying to get at is where the extra power for these kinds of tokens needs to live, i.e. whether the fallback condition is determined by the opening or the closing byte. It sounds like it belongs on the closing byte, which probably makes our lives a little easier.


Don't you think it's a bad idea (and irresponsible) to release a standard without having come up with any known implementation of it?
Depends how I'm feeling :P. You raise many valid points here. On the one hand, this entire project is basically your one-man mission to bring order to chaos and make the world a better place (:thumbsup:), so I appreciate the personal aspect of it. On the other hand, it can also be viewed as a community project, and as such, ideally it shouldn't rely too heavily on any one person. I dunno. There are a lot of people on the internet with a lot of free time on their hands, and some of them are pretty smart. Does anyone have any connections with Japanese romhacking groups? They must have figured out a reasonable way to handle all this, right? If so, why re-invent the wheel?


I think I mentioned before, that all games that use variable length parameters (or variable number of parameters) would be unsupported due to limitations in linked values.  That bit of knowledge should be put into the standard if it isn't.
Agreed. So far what I'm hearing is that nobody knows of any game that actually uses this. My suspicion is that any such game would use different hex representations for the different functions, just like C++ uses different code blocks for similar methods with different full signatures, or assembly languages use different opcodes for similar operations with different numbers or lengths of operands.


They would be required to start and end with [] or <> and in turn those characters would not be allowed in normal entries. I'd probably choose [].
Going with [] instead of <> would also make things cleaner for XML-based scripts/utilities.


I wonder if that would work well with more complex control codes.

I guess I just don't like the redundancy that is introduced when mapping the ids in arguments to lengthy commands.
It gets kind of annoying, yeah. There are a bunch of ways the standard could be extended to make this work, though... One way would be if we allowed control codes to include matches from another table. Then you could have something like this:
@table1
$F0=<portrait: %name %expression>

@name
00=Carl
01=Ricky
02=Forrest

@expression
00=Happy
01=Grumpy
02=Sleepy
and the hex "F00201" would output "<portrait: Forrest Grumpy>". If we were to do such a thing, I think we'd want to restrict these kinds of lookup tables to contain only normal entries. Thoughts?
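That cross-table lookup is easy to prototype. A minimal sketch hard-coding the example's sub-tables (the mechanism itself is hypothetical, not part of the current draft):

```python
# Control-code parameters matched against separate lookup tables.
name = {"00": "Carl", "01": "Ricky", "02": "Forrest"}
expression = {"00": "Happy", "01": "Grumpy", "02": "Sleepy"}

def dump_portrait(hexstr):
    """Dump a $F0 control code whose two one-byte parameters are
    looked up in the @name and @expression tables."""
    assert hexstr[:2].upper() == "F0"
    who, mood = hexstr[2:4], hexstr[4:6]
    return "<portrait: %s %s>" % (name[who], expression[mood])

print(dump_portrait("F00201"))  # → <portrait: Forrest Grumpy>
```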


Another thing I just realized, what about common text formating controls? While no two game engines are alike, surely most of them have the usual formating features, such as changing the text color and what not. Should these commonly occurring features be standardized?
Possibly, but I think that's a separate issue. From the table file standard's perspective, these would probably show up as control codes, so you could call them whatever you wanted, e.g. "$CD=<red>".


Sorry for my long absence, but real life grabbed me by the neck...
Happens to us all from time to time, though hopefully not too literally!


Table matches with hex literals

I think we should count hex literals as matches for the current table in the event the table produced (or would have produced) the literal: That is to say,
  • an unlimited table would have counted an unknown byte as a fallback case and fallen back to the table below it;
  • if the stack is empty, or the table is not unlimited, the hex literal counts towards it in insertion direction.

This behavior mirrors dumping. It can produce cases that are not correct for the game's display mechanism, but since this case is ambiguous, either choice can. This would extend to table matches other than infinite and one match(es).
Assuming that table entries cannot produce output that looks like a hex literal, the only way for an unassociated literal to appear in dumper output is when the stack is empty. By that logic, mirroring dumping behaviour would mean that when a hex literal is encountered during insertion, we should immediately clear the stack, insert the hex literal, and resume tokenization with the starting table.


I'm in favor of restricting the current release of the standard and extending the syntax upon finding a suitable solution
Since it's been on my mind recently, there's also the approach the W3C takes, where a standard is released as a "Working Draft" and only makes its way up to a full "Recommendation" later, generally after many years and multiple successful implementations. I'm not saying I'm encouraging this approach, but a toned-down version might merit some consideration.


Control Codes
...
For that matter, it seems we should only allow one-byte arguments to avoid Endianess issues and we can simplify the arguments to %D, %X, %B for decimal, hexadecimal and binary respectively.
I agree with this, but think we should also include %% for literal %.


Notice that we do need to keep identifiers in the output, contrary to what was suggested. If we do not do that, we are open to the following exploits:

Code:
!7E=<window x=%X y=%X>
!7F=<window x=%D y=%D>

How do you parse <window x=10 y=10> without additional context information? You can't. Also, we might want to include signed decimal and signed hex.
Assuming that control codes are checked for uniqueness, and that control code parameters are handled correctly during that check, including !7E and !7F in the same table would make that table invalid. Contrary to what I said earlier, however, correctly handling control code parameters during uniqueness checks does not mean simply erasing them. If that were all we did, this would be an exploit:

Code:
!7E=<window x=%X y=%X>
!7F=<window x=%D y=10>

So in addition to erasing the parameters, we should also check for values that they can take on. Fortunately, these two checks can be combined in one expression (add/remove backslashes as appropriate for your expression engine):
  • %D → \|[0-9]\{1,3\}
  • %X → \|\$[0-9A-Fa-f]\{2\}
  • %B → \|%[01]\{8\}

As long as we have uniqueness modulo parameter values, we should always be able to determine the correct encoding. That said, including identifiers in the output is certainly another valid way of addressing the problem. My preference for not including them is based purely on aesthetics.

Klarth

  • Sr. Member
  • ****
  • Posts: 423
  • Location: Pittsburgh
    • View Profile
Re: Table File Standard Discussion
« Reply #47 on: August 29, 2011, 01:07:03 pm »
Am I the only person who doesn't post walls of text here?  :P

Code:
!7E=<window x=%X y=%X>
!7F=<window x=%D y=%D>

How do you parse <window x=10 y=10> without additional context information? You can't. Also, we might want to include signed decimal and signed hex.
I was under the impression that each control code identifier (i.e., window in this case) would be a unique identifier.  This means when your parser hits a "[", you then search for a matching ID.  If matched, parse it.  If not, fall back to longest match (or A*).  But yeah, if we add polymorphic properties to control codes, then the parameters must have prefixes.

Quote from: Tauwasser
Variable parameter list
First of all, I must admit that I have never ever seen a game use these. Then again, it's not impossible.
I haven't either, but it's quite plausible.  I think I would implement this on the utility end.  For instance, users of TableLib would create a new function in C (or possibly Python), and TableLib would create an internal tag so it knows to call the callback function, e.g. TableLib.AddCallback("F0", ParseVariableListF0);.  This approach implies a custom dumper or a scriptable general utility.
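The callback idea might look something like this on the utility side. TableLib's real API is not shown here; this is just the registration pattern, sketched in Python with hypothetical names:

```python
# Sketch: a dumper that defers unusual control codes to user callbacks.
class Dumper:
    def __init__(self):
        self.callbacks = {}

    def add_callback(self, hex_prefix, fn):
        # fn(data, pos) -> (text, next position)
        self.callbacks[bytes.fromhex(hex_prefix)] = fn

    def dump_control(self, data, pos):
        for prefix, fn in self.callbacks.items():
            if data.startswith(prefix, pos):
                return fn(data, pos + len(prefix))
        return None                      # not a registered control code

def parse_variable_list_f0(data, pos):
    # Hypothetical: F0 takes parameters up to an FF terminator.
    end = data.index(0xFF, pos)
    return "<list %s>" % data[pos:end].hex(), end + 1

d = Dumper()
d.add_callback("F0", parse_variable_list_f0)
print(d.dump_control(bytes.fromhex("F00102FF41"), 0))  # → ('<list 0102>', 4)
```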

henke37

  • Sr. Member
  • ****
  • Posts: 384
  • Location: Sweden
    • View Profile
Re: Table File Standard Discussion
« Reply #48 on: August 31, 2011, 04:35:49 pm »
Yeah, I agree with the idea of using a reference to a different table for the arguments in a control code. I am not sure about limiting it to "simple" matches, but I do see the reasons why and no reasons not to do it.

Nightcrawler

  • Hero Member
  • *****
  • Posts: 5808
    • View Profile
    • Nightcrawler's Translation Corporation
Re: Table File Standard Discussion
« Reply #49 on: September 09, 2011, 01:38:51 pm »
Thanks to all for the input on the outstanding items. I'll work on making up a final draft subject to editing only. New or extended features not yet fleshed out, such as variable parameter lists and control codes accessing external tables, won't make it in. As with any project, at some point you have to feature freeze, polish, and release.
TransCorp - Over 15 years of community dedication.
Dual Orb 2, Wozz, Emerald Dragon, Tenshi No Uta, Herakles IV SFC/SNES Translations

abw

  • Newbie
  • *
  • Posts: 42
    • View Profile
Re: Table File Standard Discussion
« Reply #50 on: February 03, 2012, 11:44:30 pm »
Good news - I now have a working multi-table insertion algorithm. It supports all the goodies from the current draft of the standard, including all non-negative switch counts *and* multiple switch tokens into the same table (i.e. none of the table switching functionality we were contemplating abandoning needed to be sacrificed), and it also prevents token mutation (in particular, it correctly handles all the "gotcha" examples I posted earlier). But wait, there's more! If you order now, you'll also receive output that is optimal with respect to hex length! All this can be yours for the low low price of 20 seconds / 200 kb script!

The only concession I had to make (in order to guarantee termination and to keep runtime within acceptable limits) was to restrict the number of consecutive switch tokens that are allowed. I've capped it at just under the length of the longest non-looping switch path, which means that my algorithm will fail for any string that actually requires 20,000 switch tokens between some two input characters. Anybody see a problem with that?

I've been working on implementing forced fallback support, but I don't exactly have a lot of experience with games that use that feature, so I'll need some more info before I can go much further with this. Nightcrawler's idea of having a switch token act as the "off" toggle makes me wonder: are these really fallbacks, or are they just switches? If they're actually switches, then we can treat them as such and we're fine; if they're still fallbacks, then I guess the rule would be to insert a forced fallback token whenever a no-match fallback would occur? That rule creates a couple of problems (e.g. what if we switch into a table containing a forced fallback token via some other switch that doesn't impose toggle semantics?), but I don't think it introduces anything that can't be solved by using additional tables. Along the same lines, I'm also not sure how forced fallbacks should interact with expired count fallbacks. I realize the remaining open sections of the standard are getting pretty far away from common usage, but hey, gotta try to fill in those blanks, right?

Anybody else have updates to share? :D

Normmatt

  • Full Member
  • ***
  • Posts: 127
    • View Profile
Re: Table File Standard Discussion
« Reply #51 on: February 04, 2012, 12:02:39 am »
Does the latest table standard include support for variable-length end tokens, like the current "!83=<Color>,3" but as an end token?

henke37

  • Sr. Member
  • ****
  • Posts: 384
  • Location: Sweden
    • View Profile
Re: Table File Standard Discussion
« Reply #52 on: February 04, 2012, 11:06:30 am »
I am just curious, did this standard ever get finished?

Nightcrawler

  • Hero Member
  • *****
  • Posts: 5808
    • View Profile
    • Nightcrawler's Translation Corporation
Re: Table File Standard Discussion
« Reply #53 on: February 08, 2012, 10:32:45 am »
I wrote about this in the last update at TransCorp. And as always, the most current draft can be found here. With that said, per the events mentioned in the update, I did not get to write that final draft with the resolutions of the previously outstanding issues. Then help was offered on some long standing ROMhacking.net changes which I have been working on since (I have to take help when I get it.). Since some interest has resumed here, I will see if I can get some time to finish that draft over the next few weeks.

I started refreshing myself on some of this already. I was pretty happy with the conclusion of the discussion. I don't think I am going to open it up again to any new feature discussion.

abw:
That's good to hear. It sounds quite complex if it takes 20 seconds. Do you have any documentation on the algorithms used?

I thought we already covered why we can't treat the fallback as a switch. You either end up with the infinite switching scenario or a de-synchronized table stack. In sheer game mechanics, this might not be any type of switch or fallback whatsoever. It could be just literally a flag that tells it to add a dot to the last tile written. It could be completely independent of the table stack. Or it could actually be a complete fallback. Or it could be an actual switch in some games, but if that's the case, you should use switches instead, as the table stacks would then be in sync. All cases are covered by adding the fallback mechanism previously discussed and retaining table switching.

Normmatt:

No, I don't believe so. This is the first time this has come up that I can think of. What game do you have that does this? Can I see sample hex of a few strings? It might be an easy extension to consider, but I have not seen this situation before and am unfamiliar with it. I don't see how an end token can have associated parameters. What do the parameters mean? It sounds like a case where you have several multi-byte end tokens that should be defined.
TransCorp - Over 15 years of community dedication.
Dual Orb 2, Wozz, Emerald Dragon, Tenshi No Uta, Herakles IV SFC/SNES Translations

henke37

  • Sr. Member
  • ****
  • Posts: 384
  • Location: Sweden
    • View Profile
Re: Table File Standard Discussion
« Reply #54 on: February 08, 2012, 07:25:17 pm »
I finally read the standard. It is so simple that it is nearly impossible to screw up. I see no possible issues with the text as is.

But it lacks features that I would like to see. There is the previously mentioned control-codes-with-tables feature, but also something I wonder if the standard should even have: it does not provide any smarts for pointers. Those are largely game specific, yet I can't help but feel that automatic label resolving is closely related to this. Where should one draw the line? What I would like to see is a specification that deals with more of the task of writing a dumper/inserter. But is this standard the right one?

Nightcrawler

  • Hero Member
  • *****
  • Posts: 5808
    • View Profile
    • Nightcrawler's Translation Corporation
Re: Table File Standard Discussion
« Reply #55 on: February 09, 2012, 09:39:16 am »
But is this standard the right one?

Absolutely not, in my opinion. The table file standard only encompasses features directly contributing to mapping hex-to-text and text-to-hex. It's not a dumping or inserting standard beyond the requirements necessary to achieve uniform hex-to-text and text-to-hex mapping results. Pointers and/or data interpretation are entirely outside the scope. I believe abw and Klarth addressed your previous items (common codes spanning multiple games, meaningful values): how to handle them and/or why they are out of scope for the hex/text mapping process defined here.

I believe what you are looking for is not within the scope of the table file. Perhaps a dumping/inserting standard is a great new endeavor for you to head up? I might suggest one approach: a secondary file containing a data interpretation dictionary for interpreting groups or patterns of hex data, with associated transformations (arithmetic, logic, labeling, etc.) applied when dumping or inserting. A dumper/inserter could then use that dictionary in addition to the table for text encoding, and that might do it. As soon as you touch pointers, though, you open up a whole world of hurt with hierarchies, trees, formats, and crazy transformations. I've got an entire program (TextAngel) that is basically designed just to help define this stuff in order to dump/insert. In a way, you could say it just creates that dictionary and dumps/inserts according to that dictionary and table file. Point being, it's a vast and complicated thing to begin to standardize. We haven't even been able to standardize one simple, specific item so far. I don't like our chances on something larger. :P
TransCorp - Over 15 years of community dedication.
Dual Orb 2, Wozz, Emerald Dragon, Tenshi No Uta, Herakles IV SFC/SNES Translations

abw

  • Newbie
  • *
  • Posts: 42
    • View Profile
Re: Table File Standard Discussion
« Reply #56 on: February 10, 2012, 10:40:31 pm »
I finally read the standard. It is so simple that it is nearly impossible to screw up. I see no possible issues with the text as is.

Yeah, my reaction after reading the standard for the first time was along those lines too, apart from a few minor typos that have already been cleaned up. It wasn't until I began using it as a spec for development that I really started noticing things. It covers the common cases pretty well, but there are some situations it allows that don't have completely obvious solutions, and some situations where the seemingly obvious solution could be improved upon.


All cases are covered by adding the fallback mechanism previously discussed and retaining table switching.

I don't think we actually stated the rules for when to insert fallback tokens. Since the current approach doesn't tie switch tokens to fallback tokens, I think the only rule that makes sense is: any fallback from a table containing a forced fallback token inserts that fallback token, regardless of whether the table was switched into via a corresponding "on" token or not. In order for the fallback token to be recognized as a fallback token, it will have to count as a match toward NumberOfTableMatches in the fallback token's table; consequently, any other table using e.g. !00=TableA,1 will insert 00 7F, i.e. a two-token no-op.
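To make that concrete, here's a toy model of the rule. All byte values are hypothetical, and "the fallback token consumes one of the allowed matches" is my reading of the proposal, not wording from the standard:

```python
def encode(switch_byte, budget, fallback_byte, matched_bytes):
    """Bytes emitted by: switch into a table via a token with
    NumberOfTableMatches = budget, match some entries, then fall
    back out via the table's forced fallback token."""
    out = [switch_byte]
    # the fallback token itself counts toward the budget, so at most
    # budget - 1 "real" entries fit before falling back
    out.extend(matched_bytes[:max(budget - 1, 0)])
    out.append(fallback_byte)
    return out

# !00=TableA,1 with nothing matched: the fallback token is the single
# allowed match, yielding the two-token no-op described above.
print(encode(0x00, 1, 0x7F, []))   # [0, 127]
```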


That's good to hear. Sounds like it's quite complex taking 20 seconds. Do you have any documentation on the algorithms used?

Well, the timing is both better and worse than that. Better in that I haven't spent any time optimizing it since finally getting it to run in sub-exponential time, so there's still room for improvement. Worse in that the time I gave covers only the text -> hex translation and doesn't include reading the text or writing the hex. Also, it's in Perl, so a C implementation would be substantially faster :P.

As for the algorithm, I think I'm getting slow in my old age. I ended up using Tauwasser's A* idea as a base and then adding some optimizations. A* is just a glorified breadth-first search, so it's basically guaranteed to look at a whole lot of nodes before finding a solution, which makes it more expensive on average than the single-table insertion algorithm I was using. The problem I had with A* before came from considering the nodes as (token, pos)-tuples. If we expand them to (token, pos, stack)-tuples, then we have enough information to decide what new nodes can be added at each step without invalidating switch count conditions. It's also a nice algorithm to use since you can grow the parse tree as you work; having to list all the possible nodes up front could take infinite time, or maybe just exponential if you limit switch loops in some way.

The primary optimization I made was to start pruning equivalent nodes from the tree (I decided two nodes were equivalent if they were at the same string position, had the same stack, and tokenized the most recent mutable hex in the same way); each pruned node meant I didn't have to examine its exponentially many children. For the particular string I was using for debugging, that optimization brought the maximum tree width from ~25000 nodes down to ~10.

Once the search finds a candidate tokenization for the entire string, I pass the candidate's hex off to the dumping algorithm to check for mutation. If the dumping algorithm returns text different from the candidate tokenization's, the A* search continues; otherwise we're all done and can move on to the next string.
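A heavily simplified sketch of that search, in Python rather than Perl and with all names invented: one table and no switch stack, so a search state collapses to just a string position, and "equivalent" nodes (same position) are pruned to the cheapest one.

```python
import heapq

def tokenize(text, table):
    """Uniform-cost search (A* with a zero heuristic) for a cheapest
    tokenization of `text` using `table`, a dict mapping text tokens
    to tuples of hex bytes. Returns the token list, or None if no
    tokenization of the full string exists."""
    frontier = [(0, 0, [])]            # (bytes emitted, position, tokens so far)
    best = {0: 0}                      # pruning: cheapest cost seen per position
    while frontier:
        cost, pos, toks = heapq.heappop(frontier)
        if pos == len(text):
            return toks                # first complete node popped is cheapest
        for tok, hexbytes in table.items():
            if text.startswith(tok, pos):
                npos, ncost = pos + len(tok), cost + len(hexbytes)
                if ncost < best.get(npos, float("inf")):
                    best[npos] = ncost
                    heapq.heappush(frontier, (ncost, npos, toks + [tok]))
    return None                        # dead end: no tokenization exists
```

With a table like {"ab": (0x01,), "a": (0x02,), "b": (0x03,)}, tokenize("ab", table) returns ["ab"], since one byte beats two. The real search would carry the table stack in the state and re-dump each candidate to check for mutation, as described above.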

rveach

  • Newbie
  • *
  • Posts: 22
    • View Profile
Re: Table File Standard Discussion
« Reply #57 on: March 11, 2012, 10:47:46 am »
Has anyone released the source to a library/program that implements this new standard in C++ or Java? (at least past '2.4 End Tokens')

Edit:
Some things that I would like to add, after reading through some of the discussion and specs.

Character Map Support

Almost all games use some variation of a standard character map to store their characters. Some are based on standards, others are not. I have seen an NES game use ASCII, and I have seen plenty of PSX games use Shift-JIS and sometimes ASCII. To cut down on manually typing out every character and hex code, and on the user errors that come with it, I think we should allow referencing a standard or custom character map. Here is the basic idea.

format: XX+=character map

The thing that makes this unique from a common hex/text line, is the '+' before the equals.
'character map' is either a standard character map name (ASCII, UTF-8, Shift-JIS, etc.; not case sensitive) or a user-defined map that looks somewhat like a regular expression character class (e.g. [0-9A-Z a-z]). I'm not sure how to handle Japanese or other-language versions of a custom map, so I will leave that open to discussion. Maybe it would go by UTF-8 ordering, since that is what the table file will be in anyway. Character maps will only cover printable characters, so \n may be left out.
'XX' is the starting hex value that the character map will use.

Examples:

So if we had a game with a standard ASCII character map, we could write:
00+=ASCII
This is like writing: "21=!", "22=\"", "23=#", ......, "30=0", "31=1", ...., "41=A", "42=B", .... etc

Now what if the game used ASCII, but it started not at 00 but at 02?
02+=ASCII
This would be like writing: "23=!", "24=\"", "25=#", ......, "32=0", "33=1", ...., "43=A", "44=B", .... etc

Now this would save us from having to write 62+ lines.
As for overflow, I suggest we just cut off the high bytes created (i.e. wrap around), so FF+1 = 00, not 0100.

Now what if we wanted to override one character that ASCII maps with another value? "Imaginary" lines generated by character maps are ignored whenever we manually specify an entry for the same hex value.
So:
00+=ASCII
30=Love
0D=[NL]\n
00=[ED]

Would have the normal ASCII characters, but there would be no hex value for '0' (so only the digits 1-9 would be usable) and 'Love' would be mapped to 30h. 0D and 00 wouldn't have been mapped by ASCII anyway, since they aren't printable characters, so the 0D and 00 lines in our table file work like normal.

So the hex bytes: 30 31 32 33 00
Would print out the text: "Love123[ED]" instead of "0123[ED]"
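If the proposal were adopted, one way a tool might expand the '+' syntax is sketched below. This is my reading of the proposal, not part of the current standard, and the function name is invented:

```python
def expand_ascii_map(start, overrides=None):
    """Generate table entries for the proposed `XX+=ASCII` line: each
    printable ASCII character is assigned the byte (code point + start)
    mod 0x100, i.e. high bytes are cut off so FF+1 wraps to 00.
    Entries written explicitly in the table file (`overrides`, a dict
    of byte -> text) replace the generated ones."""
    table = {}
    for code in range(0x20, 0x7F):            # printable ASCII only
        table[(code + start) & 0xFF] = chr(code)
    for byte, text in (overrides or {}).items():
        table[byte] = text                    # manual lines always win
    return table
```

Dumping the bytes 30 31 32 33 00 through expand_ascii_map(0x00, {0x30: "Love", 0x0D: "[NL]\n", 0x00: "[ED]"}) then yields "Love123[ED]", matching the example above, and expand_ascii_map(0x02) maps 23h to '!'.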

Now for user maps: let's say our game uses some letters and numbers in a weird order that follows no standard, and not all characters are used. The first character starts at 30h.
034265789wxyzabcj

We could write a custom map like so:
30+=[0342657-9w-za-cj]
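For what it's worth, expanding that bracket syntax is straightforward. A sketch, with backslash escapes and multi-byte scripts deliberately left open, as in the proposal:

```python
def expand_char_class(spec):
    """Expand a [..]-style custom map into its ordered character list.
    Supports literal characters and a-b ranges, in file order."""
    assert spec.startswith("[") and spec.endswith("]")
    body, out, i = spec[1:-1], [], 0
    while i < len(body):
        # a range like 7-9 needs a character on both sides of the dash
        if i + 2 < len(body) and body[i + 1] == "-":
            out.extend(chr(c) for c in range(ord(body[i]), ord(body[i + 2]) + 1))
            i += 3
        else:
            out.append(body[i])
            i += 1
    return out

# 30+=[0342657-9w-za-cj] then pairs each character with 30h, 31h, ...
chars = expand_char_class("[0342657-9w-za-cj]")
table = {0x30 + n: ch for n, ch in enumerate(chars)}
```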

Starting Link Table

As of right now, you can give your table a name with the '@' character or leave it empty. If the file has multiple switch tables, it may be unclear which one we should start extracting from (for my implementation, I am going to assume it's the first table for now).

I think we should add something that signifies to an extractor/inserter which table is the default start, which could be overridden by the tool implementation.

normal format: @TableIDName
proposed format: @TableIDName,start

So if a table name ends with ",start", it is the table we should start with when dumping or inserting.
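Parsing the proposed header would be trivial for a tool author. A sketch (function name hypothetical, and only this one proposed suffix is handled):

```python
def parse_table_header(line):
    """Parse the proposed '@TableIDName,start' header line.
    Returns (name, is_start); a plain '@Name' header is not a start table."""
    assert line.startswith("@")
    body = line[1:]
    if body.endswith(",start"):
        return body[:-len(",start")], True
    return body, False
```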
« Last Edit: March 11, 2012, 08:35:39 pm by rveach »

henke37

  • Sr. Member
  • ****
  • Posts: 384
  • Location: Sweden
    • View Profile
Re: Table File Standard Discussion
« Reply #58 on: March 13, 2012, 01:51:56 pm »
That character map idea is great, but I have a small addition to it: also allow using previously defined tables (no loops here!) in addition to the standard mappings.

Nightcrawler

  • Hero Member
  • *****
  • Posts: 5808
    • View Profile
    • Nightcrawler's Translation Corporation
Re: Table File Standard Discussion
« Reply #59 on: March 13, 2012, 03:25:23 pm »
It sounds like recreating the table generator. How is this different from what some of those utilities already do? I think you can already do all of that with ASCII and most of it with hiragana/katakana. We also have EUC, JIS, and S-JIS tables in the database. Such a thing is great for generating a table, but maybe not so great for inclusion within the table file standard itself.

How do you propose something like this be implemented code wise?

Is defining the creation of the map part of defining the map itself? I lean toward no, but it is an interesting thing to consider. Anybody else have an opinion on this?

TransCorp - Over 15 years of community dedication.
Dual Orb 2, Wozz, Emerald Dragon, Tenshi No Uta, Herakles IV SFC/SNES Translations