
Author Topic: Improving relative search, Japanese language in mind.  (Read 1272 times)

FAST6191

  • Hero Member
  • *****
  • Posts: 2406
    • View Profile
Improving relative search, Japanese language in mind.
« on: January 14, 2019, 08:25:23 am »
So in http://www.romhacking.net/forum/index.php?topic=27787.0 I played down relative search tools for Japanese games, especially first pass things (in English games I have quite often searched for " the " sans quotes despite not knowing it would even be in the script, or indeed having ever seen the game so much as run, and got essentially the whole table), but others called me out (quite rightly). This would be a furthering of a discussion there as it could yield something interesting here.

Background. Relative search is a powerful tool for the ROM hacker looking to edit text in languages with a well defined alphabet order (or insert list of related terms).
I assume we all know what tables are; if not then this thread is not really for you, but for the sake of a line: a table is a list of characters in a known encoding and the hex values they encode as in a given game.
Relative search works by noting that, with a defined order, the letters of a word sit at fixed distances from one another whatever the actual values are -- CAB, if a game's encoding uses the typical English alphabet order, is 0,-2,+1 whether C encodes as standard ASCII, shiftJIS's Roman characters or the classic A=1 B=2 thing kids use as a "my first secret code". Reduce both the file and the search term to the differences between neighbouring values and every such encoding looks the same; relate the match back to the source file and bam, you have the start of a table.
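For anyone who has not seen it spelled out, here is a minimal Python sketch of that idea. It is not how any particular tool does it; the function names and the default alphabet are made up for illustration, and it assumes one byte per character.

```python
def diffs(values):
    """Differences between neighbouring values: [3, 1, 2] -> [-2, 1]."""
    return [b - a for a, b in zip(values, values[1:])]


def relative_search(data, word, alphabet="abcdefghijklmnopqrstuvwxyz"):
    """Return (offset, value_of_first_alphabet_char) for every place in
    `data` whose byte-to-byte differences match those of `word`."""
    pattern = diffs([alphabet.index(ch) for ch in word])
    hits = []
    for start in range(len(data) - len(word) + 1):
        window = data[start:start + len(word)]
        if diffs(window) == pattern:
            # Work back from the first byte to what alphabet[0] must encode as.
            base = window[0] - alphabet.index(word[0])
            hits.append((start, base))
    return hits


# "cab" stored with the A=1, B=2 kids' code, surrounded by junk bytes.
rom = bytes([0xFF, 0x00, 0x03, 0x01, 0x02, 0x10])
print(relative_search(rom, "cab"))  # [(2, 1)]
```

The second number in each hit is the guessed value for the first character of the alphabet, which is enough to fill in the rest of that run of the table.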

Japanese however has no defined/universal order. In the thread above I was said to play this up too much, and that there are orders one can make some use of with relative search. This is true, both because I know it from experience and because it is plausible -- Japanese combines some simpler characters (the hiragana and katakana, collectively the kana) with some complex symbols derived from old Chinese ones (called kanji), for a total of... nobody probably knows, but even a basic set of kanji will see you up in the thousands. Nobody wants to hand encode a thousand symbols for their game so they will often borrow things. Sometimes this is another game, or an order like first appearance in the script or popularity, which is not really a thing relative search can handle. Other times it will be a known encoding ( http://www.rikai.com/library/kanjitables/kanji_codes.sjis.shtml has a handful of common Japanese encodings you see on the PC for normal programs), an order borrowed from things they might have learned in school, or an obvious derivation (in language we tend to note accents and such as punctuation but in computing we tend to call it another character; any new characters in most encodings then go with their respective characters or at the end, but more on that later).
With existing stuff you could probably do this as a list of actions (I am sure we all have things we do to fiddle with data that we should really automate but don't) but that is not very nice for newcomers.

As that was dry and boring have a fun but somewhat related video
https://www.youtube.com/watch?v=4VNVCxi9TL8

Spin it another way -- when it became apparent that CPU resources were not increasing with the speed of imagination, devs noted that we have hard drives in the terabytes (or an easy way to get that much storage) and leaned into that. Complex tasks were precalculated and shoved into tables (everything from rainbow tables for passwords to chess). In more classical engineering, fudge factors are common when computing gets hard -- see no small chunk of fluid dynamics, or indeed the thing where people will attempt to reduce a problem to something more relative (specific gravity and density, specific heat, Young's modulus...).
I reckon something like this could then be done for relative search to make it more useful for Japanese, or beginners that don't want to do something fun with a number entry in the relative search box*.
*the encoding in a game tends to follow the order in the font. If you find the font and note its order then you can possibly do a number search to determine the location of the text (and with it the encoding).

If we wanted to get really fancy then we could include some kind of probability of gibberish rating and bias/rank/order things accordingly. The base version of that would be a dictionary or language rules (I don't know if Japanese has anything like "q is followed by u", but that sort of thing), while a more advanced one would be akin to the grammar checkers in your chosen word processor, or even some kind of almost machine translator that tries to filter gibberish, but I will skip that one for now.
I dare say many people's favourite relative searcher, Monkey Moore, does something like this -- I quite often get capitals filled in when I made no effort to put such a thing in my search query, and space is also handled (whether as some kind of "it is the most common character, and seldom more than 12 characters between" affair or just as a blank in the search query might vary by program). Those who learned ASCII back in the day might also have noted "tricks" like the digits sitting in the 3x row (hex 33 = 3, 34 = 4) and capitals being a fixed distance back from the lower case letters.
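To make the ASCII "tricks" concrete, here is a small Python sketch of filling in capitals and digits once a relative search has told you where lower case "a" lives. I have no idea whether Monkey Moore works this way internally; the function name is made up, and the whole thing only helps if the game keeps ASCII's relative layout ('a' at 0x61, 'A' at 0x41, '0' at 0x30), which is an assumption.

```python
def fill_ascii_like_table(lower_a):
    """Guess a table from the value of 'a', ASSUMING the game keeps ASCII's
    relative layout: capitals 0x20 below lower case, '0' sitting 0x31 below 'a'."""
    table = {}
    for i in range(26):
        table[lower_a + i] = chr(ord("a") + i)
        table[lower_a + i - 0x20] = chr(ord("A") + i)
    for digit in range(10):
        table[lower_a - 0x31 + digit] = str(digit)
    return table


# Hypothetical game that stores 'a' at 0x81 instead of 0x61.
guess = fill_ascii_like_table(0x81)
print(hex(0x61), guess[0x61])  # 0x61 A
print(hex(0x53), guess[0x53])  # 0x53 3
```

Every entry beyond the lower case run is only a guess to be checked against the font or the screen.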

My knowledge of Japanese is mostly so I can speak more easily to translators for the purposes of ROM hacking. This however means I am less familiar with Japanese on the ground, which is where we are with this thread. I want a discussion similar to what we have for tables http://www.romhacking.net/forum/index.php?topic=12644.0 but for formats to feed our would-be database. Computationally speaking it would not be so bad to chuck half a hundred known tables of games on any number of systems in the database as well, one of the comments being that despite the theoretical randomness a lot of old school games (in this case Game Gear games) used previously seen orderings.

So orderings would be the usual shiftJIS, eucJP, unicode (see again http://www.rikai.com/library/kanjitables/kanji_codes.sjis.shtml and links at the top of the page), possibly some more old school things we saw with half byte stuff, JLPT levels and their lists, other exams and their lists, Japanese schoolboy lists, chuck whatever popular versions of dai kan wa jiten (old school but very respected kanji dictionary for those unaware) in there if you like, and if you want to broach the moji topic then it is your mental stability at stake.
Presumably each of those will have the obvious iterations for the Dakuten being after each respective character, or those after the hiragana which are followed by the "normal" katakana and then their respective variations.
I don't know the names of all the popular Japanese native lists, much less those which would have been popular when the people making the games we like to look at were in school (arguably the 1940s and onwards).


If you can suggest similar things to "space is the most common character", or things to put in a relative expression style search (in English many a 16 bit game was all upper case; do any Japanese games omit some technically necessary but practically not so necessary punctuation?), then feel free.
Going into the weeds, there are also things like 8 bit control codes inside a 16 bit encoding (observed in several GBA and DS games for me), but that I will suggest saving for another day. If I must: when defocusing on my hex editor I spot shiftJIS as mostly being 16 bit values starting with 8 or 9 and adjust when the alignment slips, so some kind of probability and fun with a shift could be done in the same way (you would only really be running the scan twice).
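A rough Python sketch of that "run the scan twice" idea: score how shiftJIS-like the data looks when read as 16 bit units from an even start and from an odd start. The lead byte ranges (0x81-0x9F, 0xE0-0xEF) and trail byte range (0x40-0xFC except 0x7F) are the usual double-byte Shift-JIS ones; the function names and the simple fraction used as a score are my own choices, not anything an existing tool does.

```python
def looks_like_sjis_pair(lead, trail):
    """True if (lead, trail) could be a double-byte Shift-JIS character."""
    lead_ok = 0x81 <= lead <= 0x9F or 0xE0 <= lead <= 0xEF
    trail_ok = 0x40 <= trail <= 0xFC and trail != 0x7F
    return lead_ok and trail_ok


def sjis_score_by_alignment(data):
    """Fraction of 16 bit units passing the check, for shift 0 and shift 1."""
    scores = []
    for shift in (0, 1):
        pairs = [(data[i], data[i + 1])
                 for i in range(shift, len(data) - 1, 2)]
        good = sum(looks_like_sjis_pair(a, b) for a, b in pairs)
        scores.append(good / len(pairs) if pairs else 0.0)
    return scores


# One stray 8 bit control code (0x00) in front of 16 bit text.
chunk = b"\x00" + "こんにちは".encode("shift_jis")
print(sjis_score_by_alignment(chunk))  # [0.0, 1.0]
```

Running this over sliding chunks of a file and noting where the better-scoring shift flips would point at where a single-byte code has knocked the alignment out.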

If I have overlooked something on the computational front that will turn this into a system grinder then please note that as well. I don't think I have suggested anything that will do that, or indeed be all that much more on storage resources than if I decided to include a picture of my dog in the about box. My history however with estimating complexity is a bit... shall we say spotty.

This guy writes too much. What is the short version?
Japanese has no defined order, but as it is a large list of characters there are orders which crop up again and again. I want to expand relative search tools to encompass those, but as I don't know much Japanese, first comes a discussion on the list of orderings that would need to be added to such a project. Also any related tricks that might be added to such a project, or pitfalls that might want to be avoided beforehand. For obvious reasons Japanese would be the focus; however, I do have to note game development has gone worldwide these days and most of the world is getting to a point where they can support game devs, so maybe build such a project such that a language switch could be added down the line if Korea really takes off, or indeed somewhere in the Middle East really takes off.

abw

  • Full Member
  • ***
  • Posts: 186
    • View Profile
Re: Improving relative search, Japanese language in mind.
« Reply #1 on: January 14, 2019, 12:05:14 pm »
I can't help much with the Japanese-specific aspects of this, but on the topic of pitfalls, in a perfect world, something like this would also be able to handle various types of table-based compression (e.g. DTE/dictionary) and non-octet encodings (e.g. Huffman codes, 5-bit characters, etc.). You would also want to watch out for multiple encodings per game (e.g. menu text vs. main script, but also including encoding changes within a single string like kanji arrays/table switching) and games with multiple languages. That said, even if it doesn't immediately reach the end goal, any step forward is still progress!

It would be interesting to see how machine learning/artificial intelligence would perform on a problem like this and how much trouble control codes would cause.

filler

  • RHDN Patreon Supporter!
  • Hero Member
  • *****
  • Posts: 833
  • "WINNERS DON'T SELL REPROS"
    • View Profile
    • Filler's Translation Projects
Re: Improving relative search, Japanese language in mind.
« Reply #2 on: January 14, 2019, 12:50:31 pm »
I recently wrote a PHP script that I call "table helper" that I talk about a little here: https://youtu.be/ThpHBoxK3Ss. It's really just a time saver, but it relies on, and exposes, this issue of Japanese kana and kanji order that you're talking about.

I made this because I'm working on a bunch of 8-bit games right now, mostly Game Gear, but also Famicom that I just want to dump scripts for so I can translate them. Nothing fancy. I use the following character sets:

english_upper
hiragana
hiragana_diacritic
hiragana_small
katakana
katakana_diacritic
katakana_small
numerals

You can check the video to see how I use these to save some time typing out hex values, but suffice to say these character sets cover many instances of the order of these characters, but not all. In your first Japanese class you'll learn the kana alphabet, it has an order. You'll also learn a way to help remember the order a, ka, sa, ta, na, ha, ma, ya, ra, wa.

Here is hiragana in order:
あいうえお
かきくけこ
さしすせそ
たちつてと
なにぬねの
はひふへほ
まみむめも
やゆよ
らりるれろ
わをん

No text encoding in a video game has to follow this order, and many don't. The main areas where the kana syllabary tends to differ from this order are as follows.

Sometimes there is significant difference in the last 3 characters. Most common I see is をん transposed as んを. Sometimes を is somewhere else, like at the beginning, or mixed in with other characters, so it's just わん. This order can be pretty variable.

Also the voiced characters, i.e. the ones that can have diacritics, are sometimes on their own in the encoding, and sometimes they are mixed with their unvoiced counterparts. KingMike mentioned this in the other thread. It will look as follows.

がぎぐげご
ざじずぜぞ
だぢづでど
ばびぶべぼ
ぱぴぷぺぽ

or sometimes:

かがきぎくぐけげこご
さざしじすずせぜそぞ
ただちぢつづてでとど
はばぱひびぴふぶぷへべぺほぼぽ

Also there are small kana:

ぁぃぅぇぉっゃゅょ

They are also sometimes inline with their large counterparts:

あぁいぃうぅえぇおぉ
つっ
やゃゆゅよょ

But normally they are on their own. The order can vary. I've seen:
ぁぃぅぇぉっゃゅょ
ぁぃぅぇぉゃゅょっ
ゃゅょっぁぃぅぇぉ

and other variations.
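As a sketch of how the variations above could be fed to a search tool, here is some Python that builds a few candidate orderings from the same building blocks. The dictionary keys and function name are made up, and the three candidates are only the ones described in this post, not a complete list.

```python
BASE = ("あいうえおかきくけこさしすせそたちつてと"
        "なにぬねのはひふへほまみむめもやゆよらりるれろわをん")
UNVOICED = "かきくけこさしすせそたちつてとはひふへほ"
VOICED = "がぎぐげござじずぜぞだぢづでどばびぶべぼ"
SEMI_VOICED = "ぱぴぷぺぽ"
SMALL = "ぁぃぅぇぉっゃゅょ"

TO_VOICED = dict(zip(UNVOICED, VOICED))
TO_SEMI = dict(zip("はひふへほ", SEMI_VOICED))


def interleaved(base=BASE):
    """Put each voiced/semi-voiced kana right after its base kana."""
    out = []
    for ch in base:
        out.append(ch)
        if ch in TO_VOICED:
            out.append(TO_VOICED[ch])
        if ch in TO_SEMI:
            out.append(TO_SEMI[ch])
    return "".join(out)


CANDIDATE_ORDERS = {
    "separate": BASE + VOICED + SEMI_VOICED + SMALL,
    "interleaved": interleaved() + SMALL,
    "wo_n_swapped": BASE[:-2] + "んを" + VOICED + SEMI_VOICED + SMALL,
}

# The same pair of kana sits at different distances depending on the order.
for name, order in CANDIDATE_ORDERS.items():
    print(name, order.index("き") - order.index("か"))
```

A tool could simply run the relative search once per candidate and report which ordering produced hits.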

What this all means is that the "safest" kana to search for when relative searching on Japanese characters are the ones with no extra voicing, and no large/small versions.

なにぬねの
まみむめも
らりるれろ

If you can find a sequence of 3-4 of these characters in a row, your chances of a relative search hit are high. A variation in the "や" characters like やゃゆゅよょ could alter the relationship of the な and ま characters to the ら characters, so the absolute safest would be a series of 3-4 なにぬねのまみむめも characters together.

The best way to search for these that I've found is to tag their order:

00=な
01=に
02=ぬ
03=ね
04=の
05=ま
06=み
07=む
08=め
09=も

And then use TranslHextion's "value scan relative" function. So a search for なにも would be "000109".
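For those without that tool to hand, the tagging approach can be sketched in a few lines of Python. This is not TranslHextion's code, just the same idea under the assumption of one byte per character; the function names and the example bytes are made up.

```python
SAFE_ORDER = "なにぬねのまみむめも"  # tagged 00..09 in this order


def tag_values(word, order=SAFE_ORDER):
    """Turn a word into its tag numbers, e.g. position of each kana in `order`."""
    return [order.index(ch) for ch in word]


def value_scan_relative(data, values):
    """Offsets where consecutive bytes differ the same way `values` do."""
    want = [b - a for a, b in zip(values, values[1:])]
    n = len(values)
    return [i for i in range(len(data) - n + 1)
            if [b - a for a, b in zip(data[i:i + n], data[i + 1:i + n])] == want]


# Hypothetical ROM fragment in which な happens to be stored as 0x5A.
rom = bytes([0x00, 0x5A, 0x5B, 0x63, 0xFF])
print(value_scan_relative(rom, tag_values("なにも")))  # [1]
```

From a hit, the byte at the reported offset gives the value of the first kana searched for, and the other nine follow from the tags.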

All of this is relative. I've encountered encodings where the whole alphabet is backward. I've also found ones where kana characters are simply missing. It just depends on the game and how much space they had, etc... Frequently just seeing the font stored in the ROM can help you sort this out. If there is a non-standard order, you can adjust your relative search to compensate.

With kanji, it gets more complicated, and I myself have said that there isn't necessarily an order, but I've had the chance to make a few kanji table files recently, and there is an order. A lot of characters are missing, at least in 8-bit games to save space, but when IDing the kanji, just following the JIS order can be a huge help since the games I've worked with follow this order for the kanji, but only include the kanji that they use in the game script. Therefore when IDing, you can scan the JIS kanji set and pick out the characters that you see in the font tiles. I don't know if Unicode follows the same kanji order that JIS does, but I know it has more kanji in it.
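The "pick out the kanji you see and keep them in JIS order" step can be sketched in Python, since sorting by a character's EUC-JP bytes gives the same order as its JIS X 0208 row and cell. The function names and the starting value are hypothetical, and the result is only a first guess for games that really do follow JIS order.

```python
def in_jis_order(kanji):
    """Sort kanji by their EUC-JP bytes, which follow JIS X 0208 order."""
    return sorted(set(kanji), key=lambda ch: ch.encode("euc_jp"))


def guess_kanji_table(seen_kanji, first_value):
    """Assign consecutive values to the kanji seen in the font,
    ASSUMING the game keeps them in JIS order starting at `first_value`."""
    return {first_value + i: ch
            for i, ch in enumerate(in_jis_order(seen_kanji))}


# Three kanji spotted in the font tiles, in no particular order.
print(in_jis_order("日本語"))          # ['語', '日', '本']
print(sorted("日本語"))                # Unicode order: ['日', '本', '語']
print(guess_kanji_table("日本語", 0x80))
```

The second print is there only to show that plain code point sorting gives a different order for these three characters.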

EDIT: To add a bit of a conclusion, relative searching is not going to help you find the text encoding in every instance, but it can be helpful, and there is definitely an order to Japanese kana and kanji.
« Last Edit: January 14, 2019, 03:11:14 pm by filler »

Psyklax

  • Hero Member
  • *****
  • Posts: 833
    • View Profile
    • Psyklax Translations
Re: Improving relative search, Japanese language in mind.
« Reply #3 on: January 14, 2019, 03:07:18 pm »
Filler said most of what I wanted to say, but I would add that when you have good debugging tools and an understanding of assembly and the system you're working on, you don't actually need relative searching.

On the NES, for instance, I would never bother to relative search now. Instead, you reverse engineer. You start with what you can see: the text on the screen. Look in the VRAM and you can see the bytes right there, so you find out how they got there. If there's no compression involved, it likely will take you only a minute or two to locate the actual text in the ROM, and then figure out the table and what have you. This applies to virtually any system, but especially the 8- and 16-bit ones. I don't have so much experience with later consoles but if you can figure out their video systems then theoretically it's possible.

I know it may seem more complicated than relative searching, but trust me: once you get the hang of it, relative searching will seem like caveman stuff. :D

Of course, you could also get lucky and the system uses Shift-JIS. I'm working on a 3DO game which uses it (and even lets me use half-width Roman characters when I use regular ASCII), and my translation for the PC-98 also uses Shift-JIS. I did some tests on a Detective Conan board game on the PlayStation, but I haven't got the hang of the VRAM setup and therefore didn't know where to look - and there was no Shift-JIS in the game.

KingMike

  • Forum Moderator
  • Hero Member
  • *****
  • Posts: 6659
  • *sigh* A changed avatar. Big deal.
    • View Profile
Re: Improving relative search, Japanese language in mind.
« Reply #4 on: January 14, 2019, 04:22:47 pm »
Most often in NES games, the diacritic characters will be a separate character range but typically ga-za-da-ba rows (pa will often be its own set).
Sometimes it will simply be + or - 0x80 from the base characters (as a full kana set will usually fit into the first or second half of the CHR bank). But often you'd just have to find the base characters and then search for known text to fill in the diacritic characters.
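A tiny Python sketch of trying the 0x80 case: given the base kana you have already found, guess where the voiced versions would sit. The function name and the example values are made up, it only covers the ga-za-da-ba rows, and the guesses still need checking against known text.

```python
VOICED_OF = dict(zip("かきくけこさしすせそたちつてとはひふへほ",
                     "がぎぐげござじずぜぞだぢづでどばびぶべぼ"))


def guess_voiced_entries(base_table, offset=0x80):
    """base_table maps byte value -> kana. Returns guessed value -> voiced
    kana, ASSUMING each voiced kana sits `offset` away from its base
    (wrapping within one byte, so +0x80 and -0x80 give the same result)."""
    return {(value + offset) & 0xFF: VOICED_OF[kana]
            for value, kana in base_table.items() if kana in VOICED_OF}


# Hypothetical base entries found by relative search.
base = {0x01: "あ", 0x06: "か", 0x07: "き"}
print(guess_voiced_entries(base))  # {134: 'が', 135: 'ぎ'}
```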

Game Boy is often similar, but because it only has 75% of the tile space that the NES has, from what I've seen it is not as common to find full kana tilesets unless it is a full-screen menu (as there is not as much free space alongside actual background tiles). (On GB, the middle 1/3 of the tileset is tiles 80-FF and the first and last are both numbered 0-7F, as one is exclusive to BG tiles and the other to sprites, while the middle is shared.)
"My watch says 30 chickens" Google, 2018

filler

  • RHDN Patreon Supporter!
  • Hero Member
  • *****
  • Posts: 833
  • "WINNERS DON'T SELL REPROS"
    • View Profile
    • Filler's Translation Projects
Re: Improving relative search, Japanese language in mind.
« Reply #5 on: January 14, 2019, 06:53:54 pm »
Quote from: Psyklax on January 14, 2019, 03:07:18 pm
I know it may seem more complicated than relative searching, but trust me: once you get the hang of it, relative searching will seem like caveman stuff. :D

It's definitely caveman stuff. However, it is platform agnostic. I dump the script for a PC Engine game the same way as a Game Gear game, the same way as a Famicom game. The one hacking the game should definitely approach it with more sophistication. As the translator I can get away with clubbing it with a big stick. :D
« Last Edit: January 14, 2019, 07:53:14 pm by filler »