Legend of Heroes III Win / 新・英雄伝説Ⅲ 白き魔女 script decompression assistance?

Started by KobaBeach, February 23, 2021, 04:34:52 PM


KobaBeach

You're going to laugh at me for this but I'm a complete newcomer to translation patch making, and I was trying to see if someone could give me some pointers for reverse engineering the Falcom decompression method used on LoH3 PC's DAT files.

Pokanchan's tool works pretty well for extracting the PNG files (tiles, intro graphics etc) and a bunch of DAT files, however it doesn't decompress ED3_DT09 and DT10, which is where I'm *theorizing* the script is stored, purely because I browsed through a few of the extracted DATs and didn't find any text in them (the text is a mix of Shift JIS for kana and kanji and ASCII for A-Z and 0-9, from my messing around in Cheat Engine).

Esperknight's C++ code also decodes the DATs into unc files, but 09 and 10 don't work there either, they throw unhandled exceptions. I tried 00 and it decompressed into a 6 KB unc file, but I thought unc files were meant to be the raw data? Sorry I'm dumb.

I have Snowman on hand and while C++ without function and variable names is hell, it does give me x86 code which I don't mind messing about in to try and find the source of the data. If it also helps it does seem to load the relevant text blocks depending on map screen, the right side of Bolt and the top part of the path to Igunis load different text. For example, the text for the signpost saying the amount of milos until reaching Igunis and Amdera is pre-loaded as long as you're in the area.

I'm biting off more than I can chew, admittedly, but I don't mind learning as I go along and not rushing. I was thinking of trying Oni II as practice before this, too and that seems to be a case of compressed text as well. Possibly font too? I forget. I'd also have to learn Z80 and the Game Boy architecture too, which I'm also okay with.

Debugging further in Cheat Engine also isn't out of the picture. I'll also consider cracking open C++ on VS2019 and playing around with DAT related code extracted from the game, studying it and debugging it as I go along.

My Japanese is admittedly rough - I can read a bit, but I'm still learning - but someone on the Falcom Discord fan server who is fluent has volunteered to help me translate it if I can get something going.

I'm also working on and off on a SMW editor that works with disassemblies to mess around with SMW's engine further (primarily for myself but a release is planned once I get it comparable with Lunar Magic in terms of user friendliness at least), so all around I'm getting to grips with programming outside of 65c816 and it's really fun. ^^

Hope I'm not embarrassing myself doing this, translation patches are really cool and I'd like to contribute to a few. Again, sorry for sounding really stupid. ^^ Just go easy on me. m(_ _)m
replace my bones with pocky and throw me down a stairwell

FAST6191

Doubt anybody would laugh at you -- help is what we tend to do around here and compression woes are one of the main things to make life interesting when it comes to hacking.

PC compression can get fun as the range of options can be rather wider than we normally see on consoles, both because of the extra power/memory/storage and because there are 9000 vendors all doing their own thing.
My usual primer on compression is https://ece.uwaterloo.ca/~ece611/LempelZiv.pdf and it will probably be a variation on the themes noted there; doubt you will get into any of the fun probabilistic/regenerative or compression contest (see Hutter Prize) type things for a game, though I have been surprised.
Cheat Engine debugging is not bad -- if you already found the text in plaintext in RAM then you can line up all or part of it with the compressed form and try to figure out what was done. Hopefully it is just a minor tweak on the existing format (add 10 extra bytes in the header or spin the compression off to the sub sections rather than the whole file and it will fail for existing setups that don't expect that) but if not it is still likely to be some variation on the theme of dictionary or sliding window. Might even be quicker than playing with assembly, though assembly will pretty quickly give you the exact split and length formats for sliding window and clues as to the nature of any dictionaries if that is what was opted for. If you can already program well enough to attempt an editor like that then decompression (and recompression*) is nothing drastic at all -- basic loop with some basic maths really.


*assuming you don't just cheat and either change the game to not expect compression or ignore any compression and just put "this section not compressed" flags in, if it is one of those.
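As a rough illustration of lining up plaintext from RAM with the stored form, here is a minimal Python sketch. It uses the standard library's difflib to report byte runs that appear unchanged in both buffers, which is where literal (uncompressed) stretches would show up in an LZ-style stream. The function name `literal_runs` and the sample bytes are made up for the example and are not taken from the game.

```python
from difflib import SequenceMatcher

def literal_runs(plain, packed, min_len=4):
    """Return (plain_offset, packed_offset, length) for shared byte runs."""
    # autojunk=False stops difflib from ignoring very common byte values.
    matcher = SequenceMatcher(None, plain, packed, autojunk=False)
    return [(m.a, m.b, m.size)
            for m in matcher.get_matching_blocks()
            if m.size >= min_len]

# Made-up data: text dumped from RAM, and a pretend on-disk version of it.
plain = b"HELLO WORLD HELLO"
packed = b"\x00HELLO WORLD \x10\x0c"
for plain_off, packed_off, size in literal_runs(plain, packed):
    print(f"plain+{plain_off:#x} == packed+{packed_off:#x} for {size} bytes")
```

The bytes sitting between the reported runs are the interesting part, as they are likely to be the flags and length/distance values. SequenceMatcher gets slow on large inputs, so this is best run on a small slice of each buffer.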

KobaBeach

tysm <3
I guess I just let my social anxiety get the best of me ^^;

My programming skills aren't the best, been trying to learn how to add the Lunar Compress dll (which is in C, I believe) to the C# project with help from my friend, who has more experience, but I don't mind learning the ins and outs and best practices, with the occasional break in between for sanity purposes.

I'll give a look through that pdf over the coming days and try to see if I can code some basic compression and decompression methods, asking help from my friend whenever she's available. Seems like hard stuff m(_ _)m. I might still learn Z80 and the Game Boy, and study Oni II's innards later on once I get more progress on this or the editor, it looks like a neat game. Again, thanks a bunch.

FAST6191

The programming really is nothing too drastic.

For the average LZ setup

The first part will usually be an indicator and the length of the file (it might even have the length of the decompressed file in some cases, but this is not certain), which is normal for many file types. If sections are what is actually compressed, a header detailing the location of those sections will probably be here too (better to decompress small sections rather than have to unpack a whole file to get 200 bytes -- if you have ever had a 7zip file take ages to decompress a single file from a big collection of them then this is that in the real world).

LZ itself then usually amounts to
[Whatever header stuff you are doing] and get to start of compressed section

Check flag (assuming there is one, some really simplistic ones might just expect you to start and count upwards)

   if not compressed copy data to final setup, go to next section (usually a fixed number of bytes away, though can be included in the flag). This is also the cheap way of faking it -- if every flag is "not compressed, go to next" then you don't have to worry about actually recompressing and the game will not care either (Hopefully you can persuade your cartridge storage division of the company you are making this for to give you the extra space).

   if compressed then read compression value. It will have a value for the number of bytes to read (might be slightly more complex as you will have a minimum to make it worth it*) and the number of bytes to go back from where you are to find them.

Repeat until end of file/compressed data section (which is what the header might have told you).

*there is no point in saying go here and read 0 bytes, or fewer bytes than it took to have the compressed flag. To that end a read value of 0 might well mean 3 bytes and start counting up from there so you also have a higher maximum just in case that is useful.
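The loop described above can be sketched in Python as follows. The bit layout here is entirely an assumption chosen for illustration (one flag byte covering eight items, most significant bit first; a set bit means a two-byte token with a 4-bit length biased by 3 and a 12-bit distance); it is not Falcom's format, and the real split has to come from the game's code or from comparing files.

```python
MIN_MATCH = 3  # assumed bias: a stored length of 0 means "copy 3 bytes"

def lz_decompress(src, out_size):
    """Decode a hypothetical LZSS-style stream into out_size bytes."""
    out = bytearray()
    i = 0
    while len(out) < out_size and i < len(src):
        flags = src[i]
        i += 1
        for bit in range(8):
            if len(out) >= out_size or i >= len(src):
                break
            if flags & (0x80 >> bit):
                # Compressed item: 4-bit length, 12-bit distance (assumed).
                if i + 1 >= len(src):
                    raise ValueError("truncated back-reference")
                token = (src[i] << 8) | src[i + 1]
                i += 2
                length = (token >> 12) + MIN_MATCH
                distance = (token & 0x0FFF) + 1
                if distance > len(out):
                    raise ValueError("back-reference before start of output")
                # Copy byte by byte so overlapping copies repeat correctly.
                for _ in range(length):
                    out.append(out[-distance])
            else:
                # Uncompressed item: copy one literal byte.
                out.append(src[i])
                i += 1
    return bytes(out)

# Three literals, then "go back 3 and copy 6".
packed = bytes([0x10, 0x61, 0x62, 0x63, 0x30, 0x02])
print(lz_decompress(packed, 9))
```

Changing the two shift/mask lines is usually all it takes to try a different length/distance split against a file.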

Custom LZ family compressions on consoles or indeed PC tend mostly to vary the flags and split between number of bits kicked to read length and number of bits kicked to location, which will very quickly break any compression hardcoded to expect a particular split but be similarly trivially sorted when it comes down to it. Can also start from the end of the file (see BLZ) or the header/file locations bit might be stored there, and some particularly silly types might read forwards and backwards or count from an absolute position in the file (go to byte ?? and copy $$ from it sort of thing) but absolute is rare and usually makes less sense/makes for worse compression so yeah.
RLE (run length encoding, more common on older consoles without as much computing power) is just a simplistic version of this: skip along through the file, if meeting a compressed flag then repeat the next bit for however many bytes it tells you to, and carry on until you get to the end of the file.
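A minimal RLE decoder sketch in Python, for comparison. The control-byte layout is an assumption made up for this example (top bit set means "repeat the next byte", with the low 7 bits plus 3 as the count; top bit clear means "copy the next N+1 bytes as they are"), not a format taken from any particular game.

```python
def rle_decompress(src):
    """Decode a hypothetical RLE stream (layout assumed, see lead-in)."""
    out = bytearray()
    i = 0
    while i < len(src):
        ctrl = src[i]
        i += 1
        if ctrl & 0x80:
            # Run: repeat the following byte (count biased by 3).
            count = (ctrl & 0x7F) + 3
            out += bytes([src[i]]) * count
            i += 1
        else:
            # Literal block: copy ctrl + 1 bytes unchanged.
            count = ctrl + 1
            out += src[i:i + count]
            i += count
    return bytes(out)

print(rle_decompress(bytes([0x82, 0x41, 0x01, 0x42, 0x43])))
```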

Anyway is that not more or less a slight tweak on the classic "change new line from 0d to Windows style 0d0a but also be careful not to change 0d0a to 0d0a0a" thing you get to play with when first learning either loops or file manipulation? Also something you can do in any programming language that allows you to fiddle at binary level for maths and fiddle with files, so you don't even have to get your hands dirty with some low level language/assembly if you don't want, even more so if it is just for a few hundred KB of text file (might be more annoying for parsing 80 gigs of modern PC graphics textures if you do it in JavaScript).
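For reference, that newline exercise is a couple of lines in Python. This sketch uses a regular expression with a lookahead so that a 0d already followed by 0a is left alone; the function name is just for the example.

```python
import re

def cr_to_crlf(data):
    """Turn lone 0d bytes into 0d0a without touching existing 0d0a pairs."""
    return re.sub(rb"\r(?!\n)", b"\r\n", data)

print(cr_to_crlf(b"one\rtwo\r\nthree"))
```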


Dictionary compressions (like Huffman, which is rare as a general thing) are simple substitutions most of the time. Can be more long winded to do as elegant code but in the end you get to go through everything and where it has one value on the list of substitutions you take the (hopefully smaller) value it is stored as and replace it with whatever the dictionary/lookup table says to replace it with.
Recompression means you can either adopt the existing table or maybe define a new one -- scan through the file noting particular repeating patterns, make new table based on these and it might actually make it smaller.
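A very small Python sketch of the substitution idea, assuming a made-up scheme where byte values found in a lookup table expand to longer sequences and everything else passes through unchanged. Real schemes (Huffman in particular) work on bit codes rather than whole bytes, so treat this only as the general shape.

```python
def dict_decode(src, table):
    """Expand each byte through the lookup table; unknown bytes pass through."""
    out = bytearray()
    for b in src:
        out += table.get(b, bytes([b]))
    return bytes(out)

# Hypothetical table: two spare byte values stand in for common fragments.
table = {0x80: b"the ", 0x81: b"ing"}
print(dict_decode(b"\x80k\x81", table))
```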

If you want some example source code then https://www.romhacking.net/utilities/826/ is good stuff, or maybe https://github.com/barubary/dsdecmp . Both are for the GBA/DS but as above it is all fairly similar in the end.

Also if you have not checked extensions, header magic stamps or the like out in a search then might want to do that too.

KobaBeach

I've been trying to convert EsperKnight's code to C# but it still keeps overflowing or something? Don't really know how to debug this... I can show the code if necessary, most of it is hand converted C++ code. I've managed to write the original bytes to another file, so it doesn't seem like it's the original file that's causing the issue, it's the decompression. Sorry for being a pain.

EDIT: nvm gonna try to debug the game with IDA to figure out its decompression routine. Sorry for being a bad programmer ^^;

Aeana

I looked at these games a very long time ago (3-5 on PC) and while my memory is a bit fuzzy, I seem to recall 3 and 5 used an interesting scheme where they'd use Shift JIS kanji and then control codes to switch display modes, with the kana stored as half-width katakana.  So something like 助ける would be 助<hiragana switch>ケル.

Natreg

Quote from: KobaBeach on February 26, 2021, 08:11:10 AM
I've been trying to convert EsperKnight's code to C# but it still keeps overflowing or something? Don't really know how to debug this... I can show the code if necessary, most of it is hand converted C++ code. I've managed to write the original bytes to another file, so it doesn't seem like it's the original file that's causing the issue, it's the decompression. Sorry for being a pain.

EDIT: nvm gonna try to debug the game with IDA to figure out its decompression routine. Sorry for being a bad programmer ^^;


I found this post and registered in order to answer. The Legend of Heroes III for Windows stores its script in several files.

You first need to decompress the main files for the game using this tool: https://www.pokanchan.jp/dokuwiki/software/falcnvrt/start

Files ED3_DT09 and DT10 don't contain the script. Those files are not supported by the tool, but my guess is that they contain the game sprites.


A lot of files contain script, I can't tell you exactly which ones, but for instance, ED3_DT00_0001.dat contains the dialogue for the first town in the game.

The data is encoded using Shift JIS. You can open it with Notepad++ and just change the encoding to Shift JIS and you will see the text.

However, there is a catch: all hiragana is stored as halfwidth katakana. This is in order to save space, since halfwidth katakana only needs 1 byte whereas normal hiragana in Shift JIS would need 2 bytes.

The game automatically transforms the halfwidth katakana to hiragana when displaying it. In order to show normal katakana, the game uses 2 specific flags that tell it when to transform halfwidth katakana to normal katakana instead.


For instance, in that same file you will find the name given to Jurio's father:

83 57 A2 AD D8 B5 A3 C9 95 83 00
ジ「ュリオ」ノ父

83 57 ジ
A2 「
AD ュ
D8 リ
B5 オ
A3 」
C9 ノ
95 83 父
00 <end_string>

This will appear in game as ジュリオの父 (the A2 and A3 bytes themselves are not displayed).
Byte A2 tells the game that any halfwidth katakana after it must be transformed to normal katakana. This won't affect any kanji, so this flag stays in effect until the end of the string, unless the game finds...
A3, which tells the game that any halfwidth katakana afterwards will be transformed into normal hiragana.

That's how text is stored in this game.
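The A2/A3 switching described above can be sketched in Python like this. It leans on two facts outside the post: Unicode NFKC normalization maps halfwidth katakana to fullwidth katakana, and fullwidth katakana code points sit 0x60 above their hiragana counterparts. The function names are made up for the example, and starting each string in hiragana mode is an assumption based on the sample string, not something confirmed from the game's code.

```python
import unicodedata

KATAKANA_ON = 0xA2  # per the post: later halfwidth kana display as katakana
HIRAGANA_ON = 0xA3  # per the post: later halfwidth kana display as hiragana

def to_hiragana(ch):
    """Shift a fullwidth katakana character down to its hiragana twin."""
    cp = ord(ch)
    return chr(cp - 0x60) if 0x30A1 <= cp <= 0x30F6 else ch

def decode_game_text(data):
    """Decode one 00-terminated string using the A2/A3 mode switches."""
    out = []
    hiragana = True  # assumption: strings start in hiragana mode
    i = 0
    while i < len(data) and data[i] != 0x00:
        b = data[i]
        if b == KATAKANA_ON:
            hiragana = False
            i += 1
        elif b == HIRAGANA_ON:
            hiragana = True
            i += 1
        elif 0x81 <= b <= 0x9F or 0xE0 <= b <= 0xFC:
            # Shift JIS lead byte: a normal two-byte character (kanji etc).
            out.append(data[i:i + 2].decode("shift_jis"))
            i += 2
        elif 0xA1 <= b <= 0xDF:
            # Halfwidth kana or punctuation: widen, then pick the script.
            wide = unicodedata.normalize("NFKC", bytes([b]).decode("shift_jis"))
            out.append(to_hiragana(wide) if hiragana else wide)
            i += 1
        else:
            out.append(chr(b))  # ASCII and anything else, passed through
            i += 1
    # NFC merges a kana followed by a voicing mark into one character.
    return unicodedata.normalize("NFC", "".join(out))

print(decode_game_text(bytes.fromhex("83 57 A2 AD D8 B5 A3 C9 95 83 00")))
```

Going the other way for a translation needs the reverse mapping, and Latin text would sidestep the kana switching entirely as long as the game's font can display it.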

There is also some text stored in images, mostly the intro text and the ending text (ED3_DT13 contains the intro images, and I think ED3_DT14 is the ending).

There is also some stored in the main executable, mostly item names.


I hope this helps.

In order to write a program that could edit the text you will also need to find where the pointers to these strings are stored inside the file. I didn't have much luck with that. However, I think the pointer for the start of the strings is always stored at 0x1A-1B.

I also found out that each text file is identified with a code which is stored as the second string. This code tells you what kind of dialogue the file contains and which chapter/event flag it is for.

The Windows version of Eiyuu Densetsu 1 is way easier in that regard.
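As a starting point for poking at that pointer, here is a small Python sketch that reads a 16-bit value at 0x1A and then pulls the 00-terminated string it points to. Reading the value as little-endian is an assumption, and the helper names and the fake file built at the bottom are invented for the example.

```python
import struct

def read_c_string(data, offset):
    """Return the bytes from offset up to (not including) the next 00."""
    end = data.index(b"\x00", offset)
    return data[offset:end]

def first_string(data):
    """Read the assumed start-of-strings pointer at 0x1A and follow it."""
    (start,) = struct.unpack_from("<H", data, 0x1A)  # little-endian assumed
    return start, read_c_string(data, start)

# Fake file: 0x20 bytes of header, pointer at 0x1A, then one string.
fake = bytearray(0x20) + b"AB\x00"
struct.pack_into("<H", fake, 0x1A, 0x20)
print(first_string(bytes(fake)))
```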


EDIT: The dialogue in the game has the name of the person speaking attached. Instead of repeating the name of the person, it has a pointer to the string. So for instance the initial dialogue with Jurio's Father has this:


12 85 04 0F 83 57 A2 AD D8 B5 A1 01 A3 B2 D6 B2 D6 96 BE 93 FA B6 D7 8F 84 97 E7 C9 97 B7 82 BE C5 A1 02 81 45 81 45 81 45 96 BE 93 FA B6 D7 C5 C9 C6 01 C5 DD 82 C5 A4 D3 B3 8F 84 97 E7 95 9E A6 92 85 C3 D9 DD 82 BE 81 48

which represents this:

ジュリオの父
ジュリオ。
いよいよ明日から巡礼の旅だな。

ジュリオの父
・・・明日からなのに
なんで、もう巡礼服を着てるんだ?


So, a 01 byte means there is a line break, a 02 byte means there is a change in window.

But the most important ones are the first 4 bytes of the string: 12 85 04 0F
This means that the bytes between 12 and 0F identify the person speaking. The 85 04 is a pointer to a string at position 04 85 (offset 0x0485, with the two bytes stored low byte first), which is the Jurio's Father string: ジュリオの父
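A minimal Python sketch of splitting a dialogue string on the control codes described in this post (12 xx xx 0F for the speaker pointer, 01 for a line break, 02 for a new window). The token names are invented for the example, and it assumes the speaker pointer is always exactly two bytes read low byte first, which is only known from this one sample.

```python
import struct

def tokenize(data):
    """Split a dialogue string into speaker / text / newline / window tokens."""
    tokens = []
    text = bytearray()

    def flush():
        if text:
            tokens.append(("text", bytes(text)))
            text.clear()

    i = 0
    while i < len(data):
        b = data[i]
        if b == 0x12 and i + 3 < len(data) and data[i + 3] == 0x0F:
            flush()
            (offset,) = struct.unpack_from("<H", data, i + 1)
            tokens.append(("speaker", offset))
            i += 4
        elif b == 0x01:
            flush()
            tokens.append(("newline", None))
            i += 1
        elif b == 0x02:
            flush()
            tokens.append(("window", None))
            i += 1
        elif 0x81 <= b <= 0x9F or 0xE0 <= b <= 0xFC:
            # Shift JIS two-byte character: keep both bytes together.
            text += data[i:i + 2]
            i += 2
        else:
            text.append(b)
            i += 1
    flush()
    return tokens

sample = bytes.fromhex(
    "12 85 04 0F 83 57 A2 AD D8 B5 A1 01 A3 B2 D6 B2 D6 96 BE 93 FA B6 D7"
    " 8F 84 97 E7 C9 97 B7 82 BE C5 A1 02 81 45 81 45 81 45 96 BE 93 FA B6"
    " D7 C5 C9 C6 01 C5 DD 82 C5 A4 D3 B3 8F 84 97 E7 95 9E A6 92 85 C3 D9"
    " DD 82 BE 81 48"
)
for kind, value in tokenize(sample):
    print(kind, value if kind != "text" else value.hex(" "))
```

The text tokens are left as raw bytes on purpose, so they can be passed to whatever kana handling is needed afterwards.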

However, important characters like Cris and Jurio don't have their names in the file since they are in the main executable: position 0x53704 for Jurio and 0x536FC for Cris.