Started by abw, May 15, 2011, 05:56:11 PM
Quote from: abw on May 28, 2011, 07:41:07 PMInterestingly, the resulting Dict table produced an encoding 4% smaller than that of the DTE table, according to Atlas.
Quote from: Tauwasser on May 25, 2011, 06:40:33 PM
Quote from: abw on May 25, 2011, 12:16:36 AMWhile on the topic of insertion algorithms... how are you handling table switching? I've been thinking more about this, and in the general case it presents an even larger can of worms than I was anticipating.
Since I'm the guy that introduced this, I can only say that ― yes ― it does pose a problem. However, my naïve solution was to use ordered data sets and do a longest-prefix match over all tables. You can still do this with A*, where the path cost is adapted to table switching as well, i.e. the cost of a path to an entry reached from another table is the cost in bytes in the new table plus the cost in bytes of the switching code in the old table. This, of course, just needs some bookkeeping to know, for each explored node, which table it belongs to and how many matches that table can have before switching back.
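The greedy longest-prefix idea above can be sketched roughly as follows. This is a simplified greedy variant, not the full A* search Tauwasser describes; the table contents and switch codes are invented for illustration, and a real inserter would load them from table files.

```python
# Toy tables: text token -> hex bytes. SWITCH maps (from, to) -> switch code.
TABLES = {
    "table1": {"a": b"\x80", "b": b"\x81"},
    "table2": {"A": b"\x90", "B": b"\x91"},
}
SWITCH = {("table1", "table2"): b"\xC0", ("table2", "table1"): b"\xC1"}

def insert(text, start="table1"):
    """At each position, take the longest matching entry over all tables,
    paying the switch code's bytes whenever the table changes."""
    out, table, i = bytearray(), start, 0
    while i < len(text):
        best = None  # (match_len, -cost_in_bytes, table_name, encoded_bytes)
        for name, entries in TABLES.items():
            for tok, hexbytes in entries.items():
                if text.startswith(tok, i):
                    switch = b"" if name == table else SWITCH[(table, name)]
                    cand = (len(tok), -len(switch + hexbytes), name, switch + hexbytes)
                    if best is None or cand > best:
                        best = cand
        if best is None:
            raise ValueError(f"no entry matches text at position {i}")
        out += best[3]
        table, i = best[2], i + best[0]
    return bytes(out)
```

With these toy tables, insert("abAB") pays one switch and yields 80 81 C0 90 91. The A* refinement would replace the greedy per-position choice with a search over whole paths, which matters exactly in the "wrong guess" scenarios discussed later in this thread.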
Quote from: abw on May 28, 2011, 07:41:07 PM
1) This would have to be sacrificed, and the resulting complications for testing are lamentable. All other things being equal, I agree that simplicity is to be preferred, but in this case all other things are not equal.
2) The longest text sequence rule for insertion nicely mirrors the longest hex sequence rule for dumping, so you also lose some parallelism. In the interests of accessibility, I'd definitely keep longest prefix as a suggested algorithm, since it is easy to explain and understand, but make a note that other algorithms are possible, and that the same text should be output by the video game no matter which algorithm is used... except that that doesn't cover cases where longest prefix fails to insert otherwise insertable text. On that note, the "ignore and continue" rule of 2.2.2 further complicates the description of 2.2.4 when different algorithms are allowed. What's the reasoning behind "ignore and continue"? I think I prefer Atlas' "give up and die" approach.
QuotePutting all of that together would make $ the general in-game control code identifier I was shooting for earlier with #. Salient changes would be:
- adding a note in 2.2 forbidding < and > in normal entry text sequences;
- changing "endtoken" (2.4), "label" (2.5), and "TableID" (2.6) to "<label>", where label can contain any character not in [,<>];
- adding a note about uniqueness of all non-normal entries;
- explaining how format strings work in 2.5 (each gets replaced with the value of the next byte) and 2.6 (translate the next characters in the new table).
There's probably a cleaner way to handle 2.6, since "table2,6" looks nicer than "<table2 %2X %2X %2X %2X %2X %2X>", "table2,0" is perhaps clearer than "<table2>", and it doesn't matter much anyway if table switch tokens aren't output.
Quote from: Nightcrawler on May 31, 2011, 02:45:51 PMOk, I think I can agree to resolve this by doing the following:
Quote from: Nightcrawler on May 31, 2011, 02:45:51 PMYour previous post on in-game control identifiers with # made a little more sense, but now I am further confused.
Quote from: Nightcrawler on May 31, 2011, 02:45:51 PMFirst, I believe this is in the utility realm and need not be mentioned in the standard other than passing application/implementation suggestion or note.
Quote from: Nightcrawler on May 31, 2011, 02:45:51 PM1. The dumper outputs a table switch marker. This is simplest, but probably undesirable. It will really hurt readability of the dump output if you use it for Kanji and Kana.
Quote from: Nightcrawler on May 31, 2011, 02:45:51 PM2. We pretty much do the same process as we do for dumping with inferred table jump by token lookup, instead of explicit from hex. We then determine a suitable switch path when a switch is detected. Let's look at the example from the standard (2.6) for insertion.
Quote from: Nightcrawler on May 31, 2011, 02:45:51 PMIf [manual raw hex] occurs, it's probably best to just insert it and carry on with normal insertion. It should not count toward matches.
Quote from: abw on June 01, 2011, 11:44:08 PMBasically, I'm proposing reserving "<" and ">" for exclusive and mandatory use as the opening and closing delimiters of every entry that isn't a normal entry (i.e. any entry starting with "/", "$", or "!"). I am further proposing that it should be an error for a table to contain multiple entries (regardless of type) containing "<foo>" for any value of "foo", and that "foo" may not have the form "$XX" for any two-character hexadecimal sequence "XX". While I was at it, I figured I might as well include format strings, since that idea seems good and dovetails nicely with having reserved delimiters. The rest was just bookkeeping highlights.
QuoteAs multi-table insertion scenarios go, 2.6 is fairly simple, since you can move from any table to any other table (either explicitly for tables 1 and 2 or through fallback for table 3) whenever you need to and stay in the new table for as long as you need to. The only penalty for making a "wrong" table switch is inserting more hex than required. Here's a (very) slightly more convoluted scenario: In this case, you're faced with a choice right away, and this time, guessing wrong means you can't insert the string.
QuoteWe can define whatever behaviour we want, but the true test is how the game handles situations like this. Aside from the possibility of being interpreted as (part of) a valid token when a modified table switch path does not follow the original path, there's still some danger in cases where changes have been made to the text engine. My feeling is that if somebody's making changes to the text engine, they should know enough to be careful with raw hex, but if the utility is going to be responsible for choosing the table switch path (which I think it has to be), it should also be responsible for checking for raw hex suddenly becoming (part of) a valid token, which in turn would affect the game's match count.
Quote from: Nightcrawler on June 02, 2011, 01:59:00 PMOne other logical line of thinking might be that it should instead be a linked entry with zero parameters and not a normal entry to begin with. This essentially forces you (if you want to use < or >, as is typical, for your controls) to make all in-game controls linked entries regardless of parameters.
Quote from: Nightcrawler on June 02, 2011, 01:59:00 PM2. I'm not sure if users might object to being locked into having to use '<' and '>' for all their controls, although I understand we've already done that to them with raw hex. I guess I would hesitate to push further in that direction without feedback or support from others. It would simplify several matters though, as pointed out.
Quote from: henke37 on June 03, 2011, 11:43:23 AMWhat if I simply want to have a literal less than sign in my text?
Quote from: Nightcrawler on June 02, 2011, 01:59:00 PM3. I don't think this works well when applied to table switching. You want the @TableIDString line and then the TableID in the switch entry in the format <XXXX>? That seems a bit uglier, and as you already pointed out probably wouldn't lend itself to formatting strings.
Quote from: Nightcrawler on June 02, 2011, 01:59:00 PMNote that along these lines I am considering pushing new lines and end tokens (the non-hex variant goes out entirely; the hex variant becomes a normal entry identified as an end token at the utility level) entirely out to the utility realm. I have been having some PM discussions with Klarth about this. If we did that, that would leave us with only linked entries and table switches as non-normal entries.
Quote from: Nightcrawler on June 02, 2011, 01:59:00 PMCan you think of any real world scenarios where you would need to define a number of matches other than 1 or infinite? If not, we might have a winning change here.
Quote from: Nightcrawler on June 02, 2011, 01:59:00 PMHow would the raw hex suddenly become part of a valid token?
Quote from: abw on June 03, 2011, 09:59:07 PMAll of these options assume that my substitute characters (or at least enough of them to force unambiguous parsing) weren't already in use. So, not great, but not insurmountable either.
Quote@table2 would still be @table2, but !FD=table2,0 would become !FD=<table2> and !FD=table2,2 would become some variant on !FD=<table2,%2X%2X>.
QuoteInteresting. Artificial end tokens are already necessarily handled at the utility level during dumping, so no loss there. Insertion would lose some intercompatibility and gain some complexity. As far as utilities are concerned, end tokens are the single most important type of token. We could combine ideas and make end tokens (/) in-game control codes ($), which would ensure their unambiguous parsability. The utility would have to have some way to find out which tokens were end tokens, either by hex or by text. Hex would have to be table-specific and couldn't handle artificial end tokens, so text is preferable. I guess it all depends on how smart the insertion utility is.
QuoteWould it no longer be possible to dump a newline based on table entry alone? Or is this a weakening of the "ignore newlines" insertion rule?
QuoteHowever, I'm not sure we actually can eliminate other match count values: I have read about cases where more than one match is required. Eliminating other match counts won't solve all our problems (see next point), but it is definitely worth pursuing if we can get away with it.
QuoteStarting from table1, when we see the text "abc<$00>ABC", what do we insert? What does the game read? Even ignoring that troublesome hex literal, what happens with "abcABC"?
Quote from: Nightcrawler on June 07, 2011, 03:18:44 PMThe way we have it currently disallows only a raw hex pattern in normal entries, but otherwise allows everything.
Quote from: Nightcrawler on June 07, 2011, 03:18:44 PMIf we were to pursue this direction of heavier restriction, I'd need support from others.
Quote from: Nightcrawler on June 07, 2011, 03:18:44 PMThat way you're now defining a specific formatted token there as an extra parameter. You're no longer trying to mesh the text token with the TableIDString identifier.
Quote from: Nightcrawler on June 07, 2011, 03:18:44 PMThe drawback to pushing these items out is, while we have more strict and standardized definition of the table file, we actually standardize less in the overall process. Having no standardization in the process is probably what led to the incompatible utilities and table files we had in the past. So, does it become counterproductive to the cause?
Quote from: Nightcrawler on June 07, 2011, 03:18:44 PMThat topic references the same 'Kanji Array' item from the Romjuice readme file I previously mentioned. Look at it again. This time ask yourself: 'Isn't this really a form of simple data compression rather than a hex or text encoding item?' I think an argument can be made that it is data compression (even if very simple) and thus would not belong in our table file standard.
Quote from: Nightcrawler on June 07, 2011, 03:18:44 PMI thought we covered 'abcABC'... Longest match is one letter for all of them, infinite table switch is preferred over single. So, we get 0xC0 0x80 0x81 0x82 (fallback) 0x80 0x81 0x82.
Example 3: Let's say hex sequence "7F" indicates a Dakuten mark that applies to all characters until "7F" occurs again to turn it off. This can be done with two tables containing table switch entries in the following manner:
Table 1:
@NORMAL
60=か
!7F=Dakuten,0
Table 2:
@Dakuten
60=が
This will instruct any dumper to output 'か' normally until a "7F" byte is encountered. It will then switch to Table 2 and output 'が'. Because we specified 0 for the number of table matches, matching in the new table will continue until a value is not found. In this case "7F" is not in Table 2, so fallback to Table 1 will occur.
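A minimal sketch of the dump logic this example describes, with the table data hard-coded rather than parsed from table files; the fallback branch re-reads the unmatched byte in the previous table.

```python
# Toy tables from the example above; !7F=Dakuten,0 switches with infinite matches.
NORMAL  = {0x60: "か"}
DAKUTEN = {0x60: "が"}
SWITCH  = {0x7F: "Dakuten"}

def dump(data):
    tables = {"NORMAL": NORMAL, "Dakuten": DAKUTEN}
    out, stack, i = [], ["NORMAL"], 0
    while i < len(data):
        b, table = data[i], stack[-1]
        if table == "NORMAL" and b in SWITCH:
            stack.append(SWITCH[b])            # enter Dakuten on 7F
            i += 1
        elif b in tables[table]:
            out.append(tables[table][b])       # match in the current table
            i += 1
        elif len(stack) > 1:
            stack.pop()                        # fallback: re-read byte in old table
        else:
            out.append(f"[{b:02X}]")           # raw hex literal
            i += 1
    return "".join(out)
```

Note that under this sketch a second 7F falls back out of @Dakuten and then immediately re-triggers the switch in @NORMAL, rather than turning the mark off, which hints at the fallback question raised about this example later in the thread.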
Quote from: abw on June 08, 2011, 08:43:06 PMWhich includes tokens that can combine to produce text representing a raw hex literal.
QuoteHmm. That is a bit of a pickle. I agree pushing both items out to the utility level seems logical, but I wonder how many people will complain? Maybe take it one step at a time and only worry about table files for now? What's your vision of the ideal process?
QuoteYup, I know. The thread adds a little bit of info not in the readme, and is old enough that it might have been forgotten. Kanji Arrays are definitely data compression, but so is DTE. Where do we draw the line? Kanji Arrays and DTE are both representable by static bidirectional mappings, unlike, say, any of the Lempel-Ziv compression variants, which use dynamic mappings. Dumping Kanji Arrays isn't a problem, and if a bunch of programmers from the 90s came up with a way of inserting them, we should be able to as well.
QuoteI think you missed my point, given that you just inserted "ab~5~bc" instead of "abcABC". How is the game supposed to know that it should fall back to table1 in the middle of a string of perfectly valid table2 tokens? Inserting without considering consequences for dumping (a.k.a. in-game display) can cause trouble, in many of the same ways as can dumping without considering consequences for inserting. If a utility author chooses not to care, so be it, but that doesn't make the problems go away.
QuoteOn the topic of falling back... how exactly does the last Dakuten example in 3.2 work?
Quote from: Nightcrawler on June 09, 2011, 04:23:10 PMWell as you can see, already Klarth does not like the idea of eliminating end tokens altogether. I imagine it would hit additional resistance. However, he is in favor of eliminating artificial end tokens. He also wants to keep formatting. I'm on the fence with it.
#LINE-BREAK-AFTER: 4A-63,8A-A3,CA-E3
#LINE-BREAK-AFTER: CA-E3
#COMMENT-AFTER: 4A-63,CA-E3
#END-AFTER: CA-E3
Quote from: Nightcrawler on June 09, 2011, 04:23:10 PMI'm not even going to begin to talk about the ideal process. That's a huge can of worms, as the ideal process depends on what platform and games you're working with. The table file escapes much of that, and look at all the disagreement even on that.
Quote from: Nightcrawler on June 09, 2011, 04:23:10 PMThe difference is, with DTE, there is direct token conversion in both directions. With the Kanji Array and most other data compression, you do not have that. An algorithm (even if simple) is required. As soon as you leave the realm of direct token conversion, I think you depart from the scope of the table file. That's a clear line to me. DTE is a static bidirectional map with no manipulation needed, the Kanji Array is not. Transformation of the hex must occur to get the map, right?
Quote from: Nightcrawler on June 09, 2011, 04:23:10 PMLook at what you did. You defined table2 with 82=c and 8280=~5~. I can't say I've ever seen a text engine that could operate in that way that I recall. 0x82 would be reserved to indicate the next character is a two byte character.
Quote from: Nightcrawler on June 09, 2011, 04:23:10 PMOffhand, I believe S-JIS and UTF-8 work in a similar manner as well.
Quote from: Nightcrawler on June 09, 2011, 04:23:10 PMIf your game is processing characters that are one OR two bytes and the next character is valid as both a one-byte and a two-byte character, that's an ambiguous situation for the text engine unless it were to arbitrarily declare one or the other the default.
Quote from: Nightcrawler on June 09, 2011, 04:23:10 PMOn top of all that, again, the user specifically plopped that raw value in there in the midst of a table switch to cause it. This is another case of the 'gotcha-table' theoretical condition that is so rare, I can't say it's worth any time to handle.
Quote from: Nightcrawler on June 09, 2011, 04:23:10 PMWhat would you propose to do about it anyway? I don't think you'll find much support on the issue from the others.
Quote from: Nightcrawler on June 09, 2011, 04:23:10 PMYou can try and iron out theoretical edge cases all day and spend a wealth of resource on them in your program, and continue to put it out of reach of more and more people, or you can start adding some practical limits, keep it simple, and get the product out the door. I think that's kind of where several of us are coming from now.
Quote from: Nightcrawler on June 09, 2011, 04:23:10 PMQuote from: abw on June 08, 2011, 08:43:06 PMOn the topic of falling back... how exactly does the last Dakuten example in 3.2 work?Good catch. I remember writing those examples. I think I originally wrote it with an additional switch entry in Table2 of '!7F=NORMAL,0'. That should take care of it (and revising the logic paragraph). I remember having a (in hindsight) brain logic malfunction when I read it over and thought I didn't need that and it would just fallback. I rewrote it after that.
Quote from: abw on June 09, 2011, 11:56:47 PMThe case for eliminating artificial end tokens certainly seems stronger than that for eliminating end tokens or newlines. As an end user, I think I prefer keeping them in the table file; even though that may not be the most logical place for them, it does seem to be the most convenient place.
Quoted) I've never used \r myself. Is that a popular option? If you have some interface for defining line break controls and end tokens, the same thing should work for comment controls. Maybe something along the lines of this (clearly user interface is not my strong suit)?
QuoteIt's the uncertainty of tokenization at issue, not the mapping of tokens.
QuoteYes, that was a little sneaky of me. It is fully supported by the longest hex sequence rule for dumping, though. A quick peek at the wiki shows at least one real world example where this does in fact happen (on that note, people should submit more table files!). Raw hex literals are (potentially) problematic anywhere in a game using multibyte tokens, not just at table switch boundaries.
QuoteThat's what I was asking! Do they trigger fallback? Do they count as a match towards NumberOfTableMatches? What happens if they combine with the preceding or following hex to mutate into a completely different token? How can you write code to handle multi-table insertion and not consider these cases?
QuoteNow instead of "ab~5~bc", your hex represents "abcabc", which is still not the same as "abcABC". In this case, the only hex that works is "C1 80 C1 81 C1 82 80 81 82". If we are going to drop support for counts other than 0 and 1, what would you think about also dropping support for one table linking to another table using tokens with both counts? In this case, that would mean the combination in @table1 of "!C0=table2,0" and "!C1=table2,1" would be an error.
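The restriction proposed in the last sentence is easy to check mechanically. A hypothetical sketch, where switch entries are assumed to be in-memory tuples rather than table file syntax:

```python
def check_switch_entries(entries):
    """Flag targets that a table links to with both an infinite-match (",0")
    and a single-match (",1") switch entry, which the proposal above would
    treat as an error. entries: list of (hex_code, target_table, match_count).
    """
    counts = {}
    for _, target, n in entries:
        counts.setdefault(target, set()).add(n)
    return [t for t, ns in counts.items() if 0 in ns and 1 in ns]
```

For the @table1 example above, check_switch_entries([(0xC0, "table2", 0), (0xC1, "table2", 1)]) reports "table2" as a conflict.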
QuoteI thought about that too, but it does cause the dumper's table stack to become desynchronized from the game's stack. The next time the dumper hits an unrecognized entry in @NORMAL, it would fall back to @Dakuten (which might contain a hit) instead of dumping a hex literal. Worse yet would be if @NORMAL weren't the bottom of the stack, and the game actually fell back to some other table entirely while the dumper stayed in @Dakuten. But I guess that's just another theoretical edge case waiting to be ignored.
Quote from: abw on June 09, 2011, 11:56:47 PMYes, that was a little sneaky of me. It is fully supported by the longest hex sequence rule for dumping, though. A quick peek at the wiki shows at least one real world example where this does in fact happen (on that note, people should submit more table files!).
QuoteFor including literal %, why not go with the standard %%? That way you could just interpret the entire text sequence as your sprintf and pass it the next few bytes as arguments. You would need to know how many bytes the string consumes, of course; if we make every non-literal % consume 1 byte, that makes counting easy.
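Under that convention, counting the consumed bytes is a one-liner; a sketch, assuming (as proposed above) that every non-literal % specifier consumes exactly one byte:

```python
def param_bytes(fmt):
    """Count the bytes a format-string table entry consumes: strip literal
    %% pairs first, then every remaining % starts a one-byte specifier."""
    return fmt.replace("%%", "").count("%")
```

For example, param_bytes("<window x=%2X, y=%u>") is 2, while a string containing only %% literals consumes no bytes.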
QuoteThe insertion utility should be able to figure out how to convert ambiguous parameters based on the table entry. So if we had <window x=45, y=10> in the script and $FE=<window x=%2X, y=%u> in the table, we would know that we needed to write FE 45 0A. If the table had $FF=<window x=%2X, y=%2X> instead, we would know that we needed to write FF 45 10. That's why it's important to ignore printf parameters when checking for table entry uniqueness: if we had both the $FE and $FF entries in the same table, we wouldn't be able to tell how to reverse engineer the correct hex.
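A hypothetical sketch of that reverse step, matching a dumped token against its format-string entry. Only the %2X and %u specifiers from the examples in this thread are handled, and a single-byte entry id with one-byte parameters is assumed:

```python
import re

# specifier -> regex capture pattern (assumed one byte each)
SPEC = {"%2X": r"([0-9A-Fa-f]{2})", "%u": r"(\d+)"}

def encode(token, hex_id, fmt):
    """Turn script text like <window x=45, y=10> back into bytes using the
    table entry's format string."""
    pattern, kinds = "", []
    for part in re.split(r"(%2X|%u)", fmt):
        if part in SPEC:
            pattern += SPEC[part]
            kinds.append(part)
        else:
            pattern += re.escape(part)
    m = re.fullmatch(pattern, token)
    if m is None:
        raise ValueError("token does not match this entry's format string")
    args = [int(g, 16 if k == "%2X" else 10) for g, k in zip(m.groups(), kinds)]
    return bytes([hex_id] + args)
```

Here encode("<window x=45, y=10>", 0xFE, "<window x=%2X, y=%u>") yields FE 45 0A, while the $FF entry yields FF 45 10 from the same text, which is exactly why uniqueness checks must ignore the parameters.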
QuoteOn a theoretical level, raw hex is an absolute mess. With all the additions made to the table file standard (and ignoring legacy scripts), is there any good reason for having raw hex in an insert script? In a perfect world, I would also like to deal with token mutation, but that's really more of a text engine design problem than a table file standard problem. If you've got something that works for the simplified table switch structure (i.e. only 1 or infinite matches, fewer choices of how to move from table A to table B), don't let me hold you back. I'll try to come up with something that makes me happy too, but my time for the next couple of weeks is mostly already spoken for.
QuoteAs a related but possibly inconsequential aside, it seems a shame to remove support for dumping the more complicated table switch structures just because we haven't come up with a way to insert them. Given that insertion algorithms are outside the strict scope of the standard, maybe we can leave the more complicated version in the standard?