I did not miss it but forgot to post something
Still I read it several times so thanks for that as it refreshed me on a few concepts I had more or less dismissed in recent years (Kuten/Handakuten and Yoon/Youyon or diacritics in general are usually kicked to extra font entries and although I am no stranger to multiple encodings per game/file/script switching out for something as simple as the kana is not something I tend to see any more and I actually can not recall the last time I saw 3.3 Example 3 although I guess I also can not remember the last time I saw true word level dictionary compression either save for some curious situations in LZ) and pondered how I might apply it to some certainly not common situations but situations I have encountered none the less. They will mostly be from the DS, wii and maybe parts of the 360 which I guess is my main source of hacking work and probably why I had not seen the things above in a while (memory management is still a thing but not half as aggressive as it had to be for the 8 and 16 bit era). For the most part consider this a +1 with some pointless waffle from me, I have no real problems or need for solutions to the things below and would be quite happy to see this implemented beyond wondering if it is worth having a right to left reading flag (we can probably ignore tategaki though and kick it to a control code and fairly safely assume boustrophedon (alternating directions) is a thing of the very distant past) as Arabic, Hebrew and similar languages are getting a few translations nowadays.
I have reservations about some of the insertion side of things but that is always going to be the case and I agree this is probably a good way to serve the most people at once- some games support several encoding types and are quite fussy about replacing like with like even in the same line (Zombie Daisuki for the DS used parsed plain text files with some line numberings but switched between ASCII and shiftJIS (often without ASCII or U16 fallback options- shiftJIS implementations on the DS more often than not miss out that part of the spec and although it is better nowadays earlier on it was a cause for a minor celebration to have the game include the Roman characters as part of the actual shiftJIS section) meaning the collisions workarounds might not be ideal. Also there might be an issue with some things having multiple end tokens (usually for line end (line breaks were different), section end, a part end and maybe a file end but that can probably be worked around as well.
Starting with the crazy silly thing.
Riz Zoward/Wizard of Oz beyond the yellow brick road. Helping out with an English to Spanish translation (US version of the rom as the base).
Some text in it was fixed length, others was standard pointer text and the third most interesting part was damn near scripting engine grade.
In short each entry was a type code a few bytes long, a length value for the type, length and if any the resulting payload and a payload itself if there was any (think cheat encoding methods). The payload consisted of text and control characters of various forms as well as calls to various 2d and 3d image files and presumably a bit more than that (I did not take reverse engineering the format that far after I figured out roughly what went). The type codes and lengths probably could have amounted to fixed length entries which might help things but I think I mainly mentioned this to be annoying.
Also possibly related
http://blog.delroth.net/2011/06/reverse-engineering-a-wii-game-script-interpreter-part-1/Related to the above back in the standard prior to the characters being named there were other names being used with the option of it being replaced from a name entry screen (not sure how the game eventually did it) where traditionally we might have seen placeholders. I might be able to abuse control codes but a "read only" flag of some form might be an idea although thinking further that might just my being lazy and half of this project seems to be about avoiding the extra cruft so ignore that.
More recently I was pulling apart Atelier Lina ~The Alchemist of Strahl although I have seen it before in other files/formats.
Text was shiftJIS but each section was bounded by a hex number starting from 0000 and going up from there. Short sample
?????
錬金術の極意
踏ん張り
フラムっぽい物
みんな頑張れっ!
テラフラムっぽい物
どんぐりメテオ
It does not take too long to hit actual characters and perhaps more troubling control codes which in this case were not 8 bit but I have seen several with hard 16 bit characters and 8 bit control codes (or similarly non multiple of 16 bit placeholders) that troubled more basic parsers but that is probably not where this is heading.
Games using XML esque structures. I certainly see XML in games far more but that is usually something that escaped the cleanup process before building or for unrelated files. Probably can be ignored as they tend to be using known encodings and are not really the domain of table files.
Games using square brackets as escape control code/placeholder value indicators. A simple workaround I guess but one I might have to think about (for no other reason I would sub in the corner* or curly brackets).
Use of the yen symbol*, I guess I could always do alt 0165 (must remember the 0) or otherwise define/bind something or if I am truly lazy copy and paste but it does not tend to appear on a European keyboard and I am lazy.
*it is not lost on me.
There was probably more I meant to so say but is appears to be nearing 3am (again).