
Author Topic: General Process for Extracting Encoded/Compressed/Encrypted Scripts?  (Read 1982 times)

weirdalsuperfan

Just to clarify, this is about Japanese games, often psx/ps2/psp ones.

It seems like whenever I have a ROM (or similar file), even if I can get at the basic files inside it with a tool like VGMToolbox or QuickBMS (which I can't always do), 99% of the time those files are encrypted, compressed, custom-encoded, or in some format I've never heard of that needs a special tool to extract, which may or may not be easy to find (usually not  >:(). Basically, I open them in a text editor, try half a dozen encodings, and the text still won't display, and then I think "oh great, another game I can't get the text from". But then I look online and people have either successfully extracted the text from these games (sometimes), or at least act like they have, or are very comfortable with the whole process in general.  :-\

How do they do it? What is the step-by-step process? What knowledge do you need? What YouTube videos should you watch or what documents should you read to learn how to get around these encodings? What are the indispensable tools besides e.g. a hex editor? It seems like there are "people in the know" to whom this is second nature, and then everyone else is in the dark just flailing around googling for random tools. Do they run the game and look at pointers with CheatEngine and then somehow dump the text directly from there? That would sure be convenient. I know every file and every encoding is different, but I also have no idea what the general process is, or how to learn how to do each of those general steps  :'(

I should say that I have very limited experience with these tools (since I'm still an ultra noob), but I have done the CheatEngine tutorial, I have a CS degree, I know hex, I know there are tutorials on tables and whatnot, I'm comfortable with Japanese and with its encodings, and I have some experience with NLP as well as reverse engineering. I have also had a surprising amount of success using the Linux strings command to dump text.

Even given my base knowledge, which is enough that I'm comfortable learning where to go from here, I have no idea where to begin.  :banghead: How do you do it? Btw is the process different for a visual novel?

One more question - how long does it take to tackle these kinds of games where the files are of a special type or are in binary or something? Because I've spent half of this weekend alone just googling for the text to Steins;Gate, for example - but I want to be able to extract the scripts of A LOT of games. If my main goal is to actually use these scripts (i.e. I don't want to spend all my time just extracting them), is it even worth my time, or should I just go for low-hanging fruit?  :(
« Last Edit: May 10, 2020, 03:53:41 pm by weirdalsuperfan »

Risae

Hi weirdalsuperfan,

I don't have much knowledge of how other translation groups do it, but I can tell you what I did to start my translation of the PS2 game "Growlanser VI: Precarious World":

In the beginning I only had the ISO, so I searched for ways to unpack the ISO's contents.
I came across a tool called "Xpert". The tool has a plugin called "PS2 CdDvD5 |PSP UMD ISO Shrinker v1.05 *.ISO".
I used that to dump the contents of the game's ISO.

Now I had the .DAT files from inside the ISO, and I had to unpack those too.
With the help of the creator of quickBMS (who made a script for the .DAT files of Growlanser 5 + 6) I was able to dump all of the .DAT files, and I looked for Japanese text inside of those.
Most of the time Japanese games encode their text in Shift-JIS, so I looked for Japanese text with that encoding.

After finding the script files I needed to figure out how the scripts work.
After a little bit of trial and error I figured out enough to understand that they use so-called "pointer tables", and now I needed a proper way to dump all of the text including its text pointers.
So I found the tools called "Cartographer" (dump text) and "Atlas" (script re-insert), and after a bit more trial and error and the help of other romhackers I got them working.
Those tools are able to dump the script with each text's pointer, and update the pointer table with the modified text position.
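
To make the pointer-table idea concrete, here is a minimal Python sketch of what a dump with pointers looks like. The layout is an assumption invented for the example (a table of 32-bit little-endian offsets at the start of the file, each pointing at a NUL-terminated Shift-JIS string); the real Growlanser script format is not described in this thread and will differ. `dump_script`, `table_offset` and `count` are hypothetical names, not part of Cartographer.

```python
import struct

def dump_script(data: bytes, table_offset: int, count: int,
                encoding: str = "shift_jis"):
    """Return (pointer, text) pairs for an ASSUMED simple script layout."""
    entries = []
    for i in range(count):
        # Assumption: each table entry is a 32-bit little-endian file offset.
        (ptr,) = struct.unpack_from("<I", data, table_offset + 4 * i)
        # Assumption: strings end at the first 0x00 byte after the pointer.
        end = data.index(b"\x00", ptr)
        entries.append((ptr, data[ptr:end].decode(encoding, errors="replace")))
    return entries

# Tiny hand-built file: two pointers followed by two strings.
strings = ["はい".encode("shift_jis") + b"\x00",
           "いいえ".encode("shift_jis") + b"\x00"]
table = struct.pack("<II", 8, 8 + len(strings[0]))
sample = table + b"".join(strings)

for ptr, text in dump_script(sample, 0, 2):
    print(f"{ptr:#06x}: {text}")
```

Keeping the pointer next to each string is what lets a re-insertion tool know which table entry to rewrite when the translated text changes length.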

Now I had almost everything I needed to:

1. Dump the ISO
2. unpack the .DAT files
3. dump the script and its pointers
4. re-insert the script into the script files with the updated pointer table
5. re-insert the script files into the .DAT files
6. re-build the ISO
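
Steps 3 and 4 are where the pointer table matters: once the text changes length, every later string moves and its pointer has to be recalculated. A minimal Python sketch of that recalculation follows, using the same made-up layout as a teaching example (32-bit little-endian pointer table followed directly by NUL-terminated strings). This is not how Atlas or abcde work internally, and `rebuild_script` is a hypothetical helper.

```python
import struct

def rebuild_script(texts, encoding: str = "shift_jis") -> bytes:
    """Rebuild pointer table + string block for an ASSUMED simple layout."""
    encoded = [t.encode(encoding) + b"\x00" for t in texts]
    offset = 4 * len(encoded)          # strings start right after the table
    pointers = []
    for blob in encoded:
        pointers.append(offset)        # pointer = where this string now lands
        offset += len(blob)            # the next string starts after this one
    table = struct.pack(f"<{len(pointers)}I", *pointers)
    return table + b"".join(encoded)

# Translated strings have different lengths, so the pointers shift.
rebuilt = rebuild_script(["Yes", "No way"])
print(struct.unpack_from("<II", rebuilt, 0))
```

A real game usually has extra constraints on top of this (fixed file sizes, alignment, pointers stored relative to some base address), which is part of why the trial and error is needed.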

Over the last year I changed from using Atlas + Cartographer to the updated version called "abcde", and I also found a lot of tools that can read/modify different kinds of files, including:

- rainbow TIM2 viewer:
https://github.com/marco-calautti/Rainbow

- Apache2 to hotswap files inside an ISO:
https://www.psx-place.com/resources/apache.697/

- Look at other image files with tile viewers, for example "Tile Molester Mod" and TileShop:
https://www.romhacking.net/utilities/991/
https://www.romhacking.net/forum/index.php?topic=30404.0

- Use Ghidra (and IDA) to check out the game's executable file:
https://github.com/NationalSecurityAgency/ghidra

And a few others I probably forgot about.
It was certainly not easy to come this far, but over the course of the last year I noticed that for almost anything you want to do there is a tool that can do exactly that.
I hope this helps you a bit!

FAST6191

For the GBA and DS but applies to most things
http://www.romhacking.net/forum/index.php/topic,14708.0.html

I rarely see encrypted text outside of the PC. A ROM might be encrypted but that usually means something else, and PSP ISOs might have been as well, but that is not a problem in the modern world (we have all the keys, both public and private, thanks to Sony's major screw ups with the PS3).
Most games won't practice anything like safe handling of decrypted data either and have it all in plain text in RAM (also goes for compression). A RAM dump of a script is usually not so useful for end users (usually incomplete, hard to use as a hacker) but can be a step along the way or narrow your search down if you are facing compression (or maybe encryption).

Compression is common, especially in text.
https://ece.uwaterloo.ca/~ece611/LempelZiv.pdf, while nominally about the LZ family of compression, is my usual choice for an overview of compression in general. Maybe also grab some source code that handles any BIOS, SDK or otherwise known compression seen in games to get another angle on it. Work through at least a few steps of it by hand and it usually starts to make sense.
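
As one example of a "known" BIOS compression to step through, here is a Python sketch of a decompressor for the LZ77 variant usually documented for the GBA/DS BIOS (type byte 0x10). The format details in the comments are from community documentation as I understand it, so treat them as an assumption and check against real data; `lz10_decompress` is a name chosen for this sketch.

```python
def lz10_decompress(data: bytes) -> bytes:
    """Decompress a GBA/DS BIOS style LZ77 (type 0x10) stream (as assumed)."""
    if data[0] != 0x10:
        raise ValueError("not an LZ77 type 0x10 stream")
    size = int.from_bytes(data[1:4], "little")   # decompressed size, 24 bit
    out = bytearray()
    pos = 4
    while len(out) < size:
        flags = data[pos]
        pos += 1
        for bit in range(7, -1, -1):             # flag bits, high bit first
            if len(out) >= size:
                break
            if flags & (1 << bit):
                # Back-reference: 4 bit (length - 3), 12 bit (distance - 1).
                b1, b2 = data[pos], data[pos + 1]
                pos += 2
                length = (b1 >> 4) + 3
                distance = (((b1 & 0x0F) << 8) | b2) + 1
                for _ in range(length):
                    out.append(out[-distance])   # byte-wise, may overlap
            else:
                out.append(data[pos])            # literal byte
                pos += 1
    return bytes(out)

# Hand-made stream: literals 'A', 'B', then "copy 4 bytes from 2 back".
packed = bytes.fromhex("100600002041421001")
print(lz10_decompress(packed))
```

Doing the same walk on paper with the hex bytes above (flag byte 0x20, two literals, one back-reference) is a quick way to see why compressed text looks like partly readable garbage in a hex editor.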

I assume you have the basic idea of an encoding. English/European languages might get away with 8 bit encodings as there are usually fewer than 100 characters, but Japanese, thanks to the thousands of kanji, tends to go for 16 bit (and in some odd cases even 24).
While most PC things use known encodings, console game devs seem to have a proud tradition of not doing that. By the time of the PSP that had changed a bit, so check to see if there are a whole bunch of 16 bit values whose first hex digit is 8 or 9 (the most common characters in Shift-JIS, see http://www.rikai.com/library/kanjitables/kanji_codes.sjis.shtml ), or whatever you feel is necessary for EUC ( http://www.rikai.com/library/kanjitables/kanji_codes.euc.shtml ).
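
The "whole bunch of 16 bit values starting with 8 or 9" check can be automated, a bit like the strings command but for double-byte Shift-JIS. A minimal Python sketch, assuming the standard Shift-JIS lead byte ranges (0x81-0x9F, 0xE0-0xEF) and trail byte ranges (0x40-0x7E, 0x80-0xFC); `find_sjis_runs` and `min_chars` are hypothetical names, and single-byte ASCII or half-width kana deliberately break a run here to keep the example short.

```python
def is_lead(b: int) -> bool:
    return 0x81 <= b <= 0x9F or 0xE0 <= b <= 0xEF

def is_trail(b: int) -> bool:
    return 0x40 <= b <= 0xFC and b != 0x7F

def find_sjis_runs(data: bytes, min_chars: int = 4):
    """Return (offset, text) for runs of plausible double-byte Shift-JIS."""
    runs = []
    i = 0
    while i < len(data) - 1:
        start = i
        # Extend the run while we keep seeing valid lead/trail byte pairs.
        while i + 1 < len(data) and is_lead(data[i]) and is_trail(data[i + 1]):
            i += 2
        if (i - start) // 2 >= min_chars:
            text = data[start:i].decode("shift_jis", errors="replace")
            runs.append((start, text))
        if i == start:
            i += 1                     # no pair here, move on one byte
    return runs

blob = b"\x00\x01\x02" + "こんにちは".encode("shift_jis") + b"\x00\xff"
print(find_sjis_runs(blob))
```

Random binary data will produce some false hits, so raising `min_chars` or eyeballing the decoded text is still needed.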

English has a defined alphabet order; Japanese has no official order, or even much of an unofficial one. Depending upon how much Japanese you know you might still spot common ones: orders taught for tests or in schools (possibly as they were back when the game was made or when the devs likely came up), somewhat logical ones like having the *kuten kana next to their respective base kana or in the same order afterwards, or orders copied from elsewhere (the same as Shift-JIS but starting at ?? has been seen on multiple occasions).
On the flip side there can be game-based orders like most/least common, or first in script (though it might be first in the beta script). Commonality is also a trick I like to use for English (RSTLNE in guess-the-word games, Scrabble scores, space is probably the most common character, and the vast majority of English words have a vowel or a known alternative like y), but you are on your own for Japanese beyond seeing what the split is like between kana (some games don't use many loanwords) and kanji to bias your tests.
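
The commonality trick boils down to counting how often each value appears and looking at the top of the list. A minimal Python sketch, assuming the text is stored as fixed-width values aligned to the start of the data you pass in (a real script may not be aligned, so try both offsets for 16 bit); `value_histogram` is a hypothetical helper.

```python
from collections import Counter

def value_histogram(data: bytes, width: int = 2, top: int = 10):
    """Count fixed-width values and return the most common ones."""
    usable = len(data) - len(data) % width      # ignore a trailing partial value
    counts = Counter(data[i:i + width] for i in range(0, usable, width))
    return counts.most_common(top)

# Three 16 bit values, one of them repeated.
sample_text = b"\x82\xa0\x82\xa0\x82\xa2"
for value, n in value_histogram(sample_text, width=2, top=5):
    print(value.hex(), n)
```

In an English script the top entry is very likely the space; in a Japanese one the top entries are a reasonable first guess for common kana and punctuation.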

Said order can help in various ways, relative search ( https://www.romhacking.net/utilities/513/ ) being my tool of choice. For English most encodings will run ABCDE....XYZ, so you can search for a sentence from the game (careful with capitals, variables and new lines) as a pattern of differences between letters (CAB would be the pattern 0, -2, +1). The chances of a false match are generally slim, and a real one is usually immediately noticeable by reading a few characters either side with the same pattern. Some like to overwrite Japanese fonts with a known, user-chosen order (be it alphabet or numbers) and search for the gibberish (but still alphabet/numbers) that the game then displays, though it works less well as Japanese encodings often have gaps for various reasons.
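
For anyone who wants to see what a relative search does under the hood, here is a minimal Python sketch. It assumes an 8 bit encoding where the letters are stored in alphabetical order with no gaps, and a search word in a single case; `relative_search` is a name made up for this example and is not the linked tool.

```python
def relative_search(data: bytes, text: str):
    """Find offsets where byte-to-byte differences match those of `text`.

    Returns (offset, shift) pairs, where shift is the guessed difference
    between the game's value for a letter and its ASCII code.
    """
    if len(text) < 2:
        raise ValueError("need at least two characters to compare")
    codes = [ord(c) for c in text]
    deltas = [b - a for a, b in zip(codes, codes[1:])]   # CAB -> [-2, +1]
    hits = []
    for i in range(len(data) - len(codes) + 1):
        window = data[i:i + len(codes)]
        if all(window[j + 1] - window[j] == d for j, d in enumerate(deltas)):
            hits.append((i, window[0] - codes[0]))
    return hits

# Pretend the game stores A as 0x80, B as 0x81, C as 0x82.
rom = bytes([0x00, 0x82, 0x80, 0x81, 0xFF])
print(relative_search(rom, "CAB"))
```

A three letter word gives plenty of false hits in a real ROM; a full sentence in one case is far more selective, and the returned shift lets you decode the bytes either side to confirm.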

Having a little poke at the code to see what happens in the resulting game is also quite valid as a thing to do, though best if you narrow it down a bit first and maybe have an idea about what the encoding will look like (you can start randomly tweaking things, it is a process known as corruption, but it can take a while to get useful results). Indeed it can often be the best way to figure out random characters that the game does not use, or commonly use.

There are also things like DTE/MTE, where a single value stands for two or more tiles/characters, and the opposite case where a character is split across a few tiles and thus can be made up from a few values. Around now I should probably note old school half width stuff but that is mostly a PC affair.

Modern game scripts also often resemble a markup language, or can have variables. Sometimes these will be binary, sometimes textual, sometimes both. Both of these things can throw some methods off. Oh, and 8 bit markup in text can be found in Japanese games, which can frustrate decoders as things will pop in and out of 16 bit alignment.

You can also trace back through things from being displayed on screen to where they land in the ROM/ISO if you are so inclined. Most of the above stuff is based on lived experience and works well for a lot of things.

I am basically rewriting documents written by myself and others, usually far more clearly with nice pictures and examples, so I will tie it off there for now.

Short version. Known encodings are great, certainly check and see if the game uses them, but they are far from the whole story. You might want to grab a relative search tool, and make sure your hex editor has nice options like distribution of characters. Other than that it is mostly a combo of linguistic techniques, brute force, code analysis and maybe actually just taking it from the top and working backwards to find the origin via the code itself.