So in http://www.romhacking.net/forum/index.php?topic=27787.0
I played down relative search tools for Japanese games, especially as a first pass thing, but others called me out (quite rightly). For comparison, in English games I have quite often searched for " the " (sans quotes) despite not knowing it would even be in the script, or indeed having ever seen the game so much as run, and got essentially the whole table. This thread is a furthering of that discussion, as it could yield something interesting here.
Background. Relative search is a powerful tool for the ROM hacker looking to edit text in languages with a well defined alphabet order (or insert list of related terms).
I assume we all know what tables are; if not then this thread is not really for you, but for the sake of a line: a table is a list of characters in a known encoding and the hex values they encode as in a given game.
Relative search works by noting that, with a defined order, a word appears as the same pattern of differences between values whatever the actual values are -- if a game's encoding uses the typical English alphabet order then CAB is 0,-2,+1 (start, two back, one forward) whether C encodes as standard ASCII, shiftJIS's Roman characters or the classic A=1 B=2 thing kids use as a "my first secret code". Reduce both the search term and the file to the diffs between consecutive values and every such encoding looks the same; relate any match back to the source file and bam, you have the start of a table.
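For anybody that has not seen it spelled out, here is a minimal sketch of the idea in Python. The function names are mine, it assumes an 8 bit encoding and a single-case search word, and it is not how any particular tool does it.

```python
def deltas(values):
    """Differences between each value and the one before it."""
    return [b - a for a, b in zip(values, values[1:])]


def relative_search(data, word):
    """Return (offset, shift) pairs where data has the same shape as word.

    shift is how far the game's value for a letter sits from its ASCII
    value, so game_value = ord(letter) + shift for the matched run.
    """
    pattern = deltas([ord(c) for c in word])
    n = len(word)
    hits = []
    for i in range(len(data) - n + 1):
        if deltas(list(data[i:i + n])) == pattern:
            hits.append((i, data[i] - ord(word[0])))
    return hits


# "CAB" stored with the A=1, B=2 kids' code, padded with junk either side.
rom = bytes([0x00, 0x10, 3, 1, 2, 0xFF])
print(relative_search(rom, "CAB"))
```

The shift is the same for every letter in the run, so one good hit on a longer word hands you most of the alphabet.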
Japanese however has no defined/universal order. In the thread above I was said to play this up too much, and that there are orders one can make some use of with relative search. This is true, both because I know it from experience and because it is plausible -- Japanese combines some simpler characters (the hiragana and katakana, collectively the kana) with some complex symbols derived from old Chinese ones (called kanji), for a total of... nobody probably knows, but even a basic set of kanji will see you up in the thousands. Nobody wants to hand encode a thousand symbols for their game so they will often borrow things. Sometimes this is another game's table, order of first appearance in the script, or popularity, none of which is really a thing relative search can handle, but other times it will be a known encoding ( http://www.rikai.com/library/kanjitables/kanji_codes.sjis.shtml
has a handful of common Japanese encodings you see on the PC for normal programs), an order borrowed from things they might have learned in school, or an obvious derivation (in language we tend to note accents and such as punctuation but in computing we tend to call it another character; any new characters in most encodings then go with their respective characters or at the end, but more on that later).
With existing stuff you could probably do this as a list of actions (I am sure we all have things we do to fiddle with data that we should really automate but don't) but that is not very nice for newcomers.
As that was dry and boring, have a fun but somewhat related video: https://www.youtube.com/watch?v=4VNVCxi9TL8
Spin it another way -- when it became apparent that CPU resources were not increasing with the speed of imagination, devs noted that we have hard drives in the terabytes (or an easy way to get that much storage) and leaned into that. Complex tasks were precalculated and shoved into tables (everything from rainbow tables for passwords to chess). In more classical engineering fudge factors are common when computing gets hard -- see no small chunk of fluid dynamics, or indeed the thing where people will attempt to reduce a problem to something more relative (specific gravity and density, specific heat, Young's modulus...).
I reckon something like this could then be done for relative search to make it more useful for Japanese, or beginners that don't want to do something fun with a number entry in the relative search box*.
*the encoding in a game tends to follow the order in the font. If you find the font and note its order then you can possibly do a number search to determine the location of the text (and with it the encoding).
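To make the "database of orderings" idea concrete, here is a rough sketch in Python. Everything in it is an assumption for illustration: the two orderings (gojuon and iroha) are just ones I could type out, the names are made up, and it only handles 8 bit encodings. An ordering read off a game's font would slot into the same dictionary.

```python
# Two illustrative orderings; a real database would hold many more.
GOJUON = ("あいうえおかきくけこさしすせそたちつてと"
          "なにぬねのはひふへほまみむめもやゆよらりるれろわをん")
IROHA = ("いろはにほへとちりぬるをわかよたれそつねならむ"
         "うゐのおくやまけふこえてあさきゆめみしゑひもせす")
ORDERINGS = {"gojuon": GOJUON, "iroha": IROHA}


def search_with_orderings(data, query, orderings=ORDERINGS):
    """Relative search using each ordering's positions instead of ASCII.

    Returns (ordering name, offset, base) where base is the value the
    game would use for the first character of that ordering.
    """
    hits = []
    for name, order in orderings.items():
        index = {ch: i for i, ch in enumerate(order)}
        if any(ch not in index for ch in query):
            continue  # this ordering cannot express the query at all
        ranks = [index[ch] for ch in query]
        want = [b - a for a, b in zip(ranks, ranks[1:])]
        for off in range(len(data) - len(query) + 1):
            window = data[off:off + len(query)]
            got = [b - a for a, b in zip(window, window[1:])]
            if got == want:
                hits.append((name, off, window[0] - ranks[0]))
    return hits


# Pretend game that stores gojuon order starting at 0x50: さくら = 5A 57 76.
rom = bytes([0x00, 0x5A, 0x57, 0x76, 0x01])
print(search_with_orderings(rom, "さくら"))
```

The cost is one ordinary relative search per ordering in the database, which is why I don't think half a hundred of them would hurt.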
If we wanted to get really fancy then we could include some kind of probability of gibberish rating and bias/rank/order things accordingly. The base version of that would be a dictionary or language rules (I don't know if Japanese has anything like the "q is followed by u" thing, but that sort of rule), while a more advanced one would be akin to the grammar checkers in your chosen word processor, or even some kind of almost machine translator that tries to filter gibberish, but I will skip that one for now.
I dare say many people's favourite relative searcher, Monkey Moore, does something like this -- I quite often get capitals filled in when I made no effort to put such a thing in my search query, and space is also handled (whether as some kind of "it is the most common character, and seldom more than 12 characters between" affair or just as a blank in the search query might vary by program). Those who learned ASCII back in the day might also have noted "tricks" like numbers starting with 3 in hex (0x33 = 3, 0x34 = 4) and capitals being a fixed distance back from the lower case letters.
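I have no idea how Monkey Moore does it internally, so take this Python sketch as my guess at the kind of trick involved rather than a description of that program. It assumes the game keeps ASCII's spacing between lower case, capitals and digits (many don't), and the function names are invented.

```python
from collections import Counter
import string


def guess_space(data):
    """Guess the space character as the most common byte value."""
    return Counter(data).most_common(1)[0][0]


def ascii_like_table(value_of_a):
    """Build a provisional table from the game's value for lower case 'a'.

    Assumes ASCII spacing: capitals 0x20 below lower case, digits where
    ASCII would put them relative to the letters. Treat it as a first
    guess to check against the game, not as fact.
    """
    shift = value_of_a - ord("a")
    table = {}
    for ch in string.ascii_lowercase + string.ascii_uppercase + string.digits:
        value = ord(ch) + shift
        if 0 <= value <= 0xFF:  # skip anything that falls outside a byte
            table[value] = ch
    return table


table = ascii_like_table(0x81)   # game stores 'a' as 0x81
print(hex(guess_space(b"one two six ten")), table[0x61], table[0x53])
```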
My knowledge of Japanese is mostly so I can speak more easily to translators for the purposes of ROM hacking. This however means I am less familiar with Japanese on the ground, which is where we are with this thread. I want a discussion similar to what we have for tables http://www.romhacking.net/forum/index.php?topic=12644.0
but for formats to feed our would-be database. Computationally speaking it would not be so bad to chuck half a hundred known tables of games on any number of systems in the database as well, one of the comments being that despite the theoretical randomness a lot of old school games (in this case Game Gear games) used previously seen orderings.
So orderings would be the usual Shift JIS, EUC-JP, Unicode (see again http://www.rikai.com/library/kanjitables/kanji_codes.sjis.shtml
and links at the top of the page), possibly some more old school things we saw with half byte stuff, JLPT levels and their lists, other exams and their lists, Japanese schoolboy lists, chuck whatever popular versions of the Dai Kan-Wa Jiten (old school but very respected kanji dictionary for those unaware) in there if you like, and if you want to broach the moji topic then it is your mental stability at stake.
Presumably each of those will have the obvious variations: the dakuten forms coming straight after each respective character, or the dakuten forms grouped after the plain hiragana, which are then followed by the "normal" katakana and their respective variations.
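Those variations can be generated rather than typed out by hand. A sketch in Python, leaning on Unicode normalisation to build the voiced forms; the function names are mine and the two layouts are only the two described above, not a survey of what games actually do.

```python
import unicodedata

DAKUTEN = "\u3099"      # combining voiced sound mark
HANDAKUTEN = "\u309A"   # combining semi-voiced sound mark


def with_mark(ch, mark):
    """Return the precomposed marked kana, or None if there isn't one."""
    composed = unicodedata.normalize("NFC", ch + mark)
    return composed if len(composed) == 1 else None


def interleaved(base):
    """Marked forms straight after each respective character."""
    out = []
    for ch in base:
        out.append(ch)
        for mark in (DAKUTEN, HANDAKUTEN):
            marked = with_mark(ch, mark)
            if marked:
                out.append(marked)
    return "".join(out)


def appended(base):
    """All plain characters first, then the marked forms as a block."""
    out = list(base)
    for mark in (DAKUTEN, HANDAKUTEN):
        out += [m for m in (with_mark(ch, mark) for ch in base) if m]
    return "".join(out)


print(interleaved("かは"), appended("かは"))
```

Feed either result into whatever holds the orderings and each base list gets its variations for free.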
I don't know the names of all the popular Japanese native lists, much less those which would have been popular when the people making the games we like to look at were in school (arguably the 1940s and onwards).
If you can suggest things similar to "space is the most common character", or things to put in a relative expression style search (many a 16 bit game in English was all upper case; do any Japanese games omit some technically necessary but practically not so necessary punctuation?) then feel free.
Going into the weeds, there are also things like trying to handle 8 bit control codes in a 16 bit encoding (observed in several GBA and DS games for me), but that I will suggest saving for another day. If I must, though: in the same way that I note shiftJIS as mostly being 16 bit values starting with 8 or 9 and adjust accordingly when defocusing on my hex editor, some kind of probability and fun with a shift could be done (you would only really be running the scan twice).
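As a sketch of that "run the scan twice" idea in Python: score each of the two possible 16 bit alignments by how many pairs start with a byte in the 0x81 to 0x9F range, which is where the common shiftJIS lead bytes (kana and the first block of kanji) sit. Names are invented and the scoring is about as crude as it gets.

```python
def lead_byte_score(data, alignment):
    """Fraction of 16 bit pairs at this alignment that start with 8x/9x."""
    starts = range(alignment, len(data) - 1, 2)
    if len(starts) == 0:
        return 0.0
    good = sum(1 for i in starts if 0x81 <= data[i] <= 0x9F)
    return good / len(starts)


def best_alignment(data):
    """Pick whichever of the two alignments looks more like shiftJIS."""
    return max((0, 1), key=lambda a: lead_byte_score(data, a))


# One stray 8 bit control code in front of otherwise 16 bit text.
block = b"\x01" + "さくら".encode("shift_jis")
print(best_alignment(block), lead_byte_score(block, 0), lead_byte_score(block, 1))
```

A real tool would want to score in small windows rather than the whole file, as a control code in the middle flips the alignment for everything after it.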
If I have overlooked something on the computational front that will turn this into a system grinder then please note that as well. I don't think I have suggested anything that will do that, or indeed be all that much more on storage resources than if I decided to include a picture of my dog in the about box. My history however with estimating complexity is a bit... shall we say spotty.
This guy writes too much. What is the short version?
Japanese has no defined order, but as it is a large list of characters there are orders which crop up again and again. I want to expand relative search tools to encompass those, but as I don't know much Japanese I first want a discussion on the list of orderings that would need to be added to such a project. Also any related tricks that might be added to such a project, or pitfalls that might want to be avoided beforehand. For obvious reasons Japanese would be the focus, however I do have to note game development has gone worldwide these days and most of the world is getting to a point where they can support game devs, so maybe build such a project such that a language switch could be added down the line if Korea really takes off, or indeed somewhere in the Middle East really takes off.