News: 11 March 2016 - Forum Rules
Current Moderators - DarkSol, KingMike, MathOnNapkins, Azkadellia, Danke

Author Topic: (PSX - Aconcagua) Can't figure out how compressed text works  (Read 3475 times)

730

  • Jr. Member
  • **
  • Posts: 49
    • View Profile
(PSX - Aconcagua) Can't figure out how compressed text works
« on: February 15, 2018, 09:56:34 pm »
I wanted to translate this game but to be honest I think I should come back to this in a few months or years when I've learned more about programming/ASM/computers but I figured I might as well make a post before completely giving up.

So the image of this game has a file in it called PROGRAM.BIN where most of the game besides cutscenes and maybe music/voices (some XA files I haven't looked into) seems to be in.
This PROGRAM.BIN file seems to have several headers labelled PS-X EXE (and some other stuff like Sony Computer Entertainment etc.) in it separating different chunks (should be 141 of them, I split them with QuickBMS).

I managed to find in-game lines in PROGRAM.BIN but they're all jumbled up, missing characters and being shortened. I looked up the lines in RAM for comparison. Here's two examples:


Code: [Select]
付近に客室が落ちたらしい
RAM: 95 74 8B DF 82 C9 8B 71 8E BA 82 AA 97 8E 82 BF 82 BD 82 E7 82 B5 82 A2
BIN: 95 74 8B DF 82 C9 8B A8 71 8E BA D4 01 8E 1E 05 BD 67 00 D8 B5 82 A2

<stuff>Tenemos que subir
RAM: 11 35 12 31 13 3D 34 30 39 35 3B 03 54 65 6E 65 6D 6F 73 20 71 75 65 20 73 75 62 69 72
BIN: 11 26 35 EC 7E 54 65 6E 00 65 6D 6F 73 20 71 75 65 80 20 73 75 62 69 72

I read some posts about LZ compression, but those algorithms seem to be different on every game they're used on, and I also can't find anything like an LZ header in the file, so I don't know what to make of that.

I tried figuring out how the text got decompressed by debugging with this and tracing with this but I... wasn't very successful. I realized there was an option for CD-ROM reading breakpoint, but I can't figure out how data is being read and where it's being written.

Some extra info: as you can see above, the game uses both Shift-JIS and ASCII, since it has both Japanese and Spanish/English, though it doesn't seem to support special characters like á or ñ (I edited a savestate to see if I could edit text through those and when I put characters like those in, it crashed when the dialogue box came up). I haven't played the game much to see if it had symbols like those in-game, but I doubt it does.

Discord: 730#8471

Vehek

  • Full Member
  • ***
  • Posts: 169
    • View Profile
Re: (PSX - Aconcagua) Can't figure out how compressed text works
« Reply #1 on: February 16, 2018, 02:45:38 am »
Here are some notes I made a few years ago.
Quote
Compression
[flag Bytes] are read from lowest to highest. Clear bits indicate literal bytes.
Guess: Upper 4 bits of reference is the copy-size.
References appear to be negative numbers, not absolute locations into a buffer.
---
Compressed data for the first location is at 0x527800 of PROGRAM.bin
The decompressed size is earlier, right after the start of a new PS-X EXE header.

730

  • Jr. Member
  • **
  • Posts: 49
    • View Profile
Re: (PSX - Aconcagua) Can't figure out how compressed text works
« Reply #2 on: February 16, 2018, 03:29:48 am »
Here are some notes I made a few years ago.

Wow, I didn't expect someone to reply to this thread that had actually done something with this game. Thanks, though I think at my current level of expertise it won't do much good, but I'll keep it in my own notes for later.
Discord: 730#8471

Valendian

  • Jr. Member
  • **
  • Posts: 42
    • View Profile
Re: (PSX - Aconcagua) Can't figure out how compressed text works
« Reply #3 on: February 16, 2018, 09:41:16 am »
The bottom one is not compressed, "Ten emos que subdir". pSX can break on CDROM DMA, if you do not know the DMA address then set the range to all memory 80000000-801FFFFF

weissvulf

  • Sr. Member
  • ****
  • Posts: 324
  • Good news! An anomaly solved the enigma.
    • View Profile
Re: (PSX - Aconcagua) Can't figure out how compressed text works
« Reply #4 on: February 16, 2018, 01:43:59 pm »
I almost wonder if it's encryption rather than compression. Since you know the RAM location of the 'decompressed' text, have you tried putting a 'breakpoint on memory-write' at that location? That would let you discover the tail-end of the routine generating the final text. From there you should be able to trace backwards to see how it's being produced.

730

  • Jr. Member
  • **
  • Posts: 49
    • View Profile
Re: (PSX - Aconcagua) Can't figure out how compressed text works
« Reply #5 on: February 16, 2018, 07:12:39 pm »
I almost wonder if it's encryption rather than compression. Since you know the RAM location of the 'decompressed' text, have you tried putting a 'breakpoint on memory-write' at that location? That would let you discover the tail-end of the routine generating the final text. From there you should be able to trace backwards to see how it's being produced.

I tried that (though not a breakpoint, just managed to find the moment it started to write the text at the location I knew it would), and it was a bit hard to follow, and also found that it's getting the value to write from another location in memory, already decompressed/decrypted. I guess I'll try it again, maybe setting a breakpoint on the other location to find out how it writes it there.

The bottom one is not compressed, "Ten emos que subdir". pSX can break on CDROM DMA, if you do not know the DMA address then set the range to all memory 80000000-801FFFFF

Hmm, it looks like you're right for the string itself, though the bytes that come before it are different from what is in RAM, so those must still be compressed.
Discord: 730#8471

weissvulf

  • Sr. Member
  • ****
  • Posts: 324
  • Good news! An anomaly solved the enigma.
    • View Profile
Re: (PSX - Aconcagua) Can't figure out how compressed text works
« Reply #6 on: February 17, 2018, 04:27:42 am »
Quote
it's getting the value to write from another location in memory, already decompressed/decrypted.
I've found multiple copies like that to be very common. Usually the text goes through a series of tests each time it is copied, and variables like names are filled in etc. If you keep going, you should get to the original 'compressed' text as loaded from the CD. You might get lucky and find some simple compression or encryption routine that can be bypassed all together.

730

  • Jr. Member
  • **
  • Posts: 49
    • View Profile
Re: (PSX - Aconcagua) Can't figure out how compressed text works
« Reply #7 on: February 17, 2018, 02:07:23 pm »
I've found multiple copies like that to be very common. Usually the text goes through a series of tests each time it is copied, and variables like names are filled in etc. If you keep going, you should get to the original 'compressed' text as loaded from the CD. You might get lucky and find some simple compression or encryption routine that can be bypassed all together.

I think I was wrong with what I said, but I did find the part in memory where the compressed text loaded from the CD is, and it's getting it from there. I've yet to figure out how it's decompressed, but I did see it changing the register that points to the current byte to load from the compressed section to an address that was several bytes before what the current byte should be, similar to what I read about LZ storing bytes that were previously used as to not reuse them in memory.

The concrete example was something like this:
Code: [Select]
00 01 02 03 04 05 06 07 08 09 0A

Decompressed bytes that end up being stored:
30 E6 02 80 68 DD 02 80 84 DD

Compressed bytes:
30 E6 02 80 68 DD 02 80 82 84 FC

When it's time to write to 0x09, it should load 0x0A, and somehow in the process,
the address ends up being 0x05, loading that DD then storing it in 0x09.

Don't see how this is compressing anything at all but maybe it works out in the long run?

And I don't think there's anything I can bypass, it does need to decompress/decrypt the text in the CD after all, and I need to be able to insert text compressed/encrypted like this into the image.
I'm guessing if it's encryption I could technically rewrite it to bypass the decryption and load non-encrypted text I would write into the image, though judging from what Vehek posted, it looks like it really is compression.

Edit: I've been digging some more. There's apparently something like 3 routes through which it can go. It either loads bytes directly from the "compressed" zone and stores them into the other zone, loops loading bytes from the compressed zone until it finds the appropriate byte to load and store, or it uses a loaded byte to calculate (with some ANDs and ORs and shifts and values in other registers) an address pointing to the byte that must be stored. I've seen this address end up pointing to the uncompressed zone, way before the position where it would store the new byte. It might also have ended up pointing to a posterior position in the compressed zone, but I might've confused that for the second case I described where it skips over some bytes.

I've also seen it use 2 registers to compare them, where one is a set value and the other keeps incrementing during a loop, and depending whether the incrementing one is < than the set one, it branches off to somewhere or not. I think this loop determines a certain amount of bytes to either directly load from the compressed zone, to get from a previously written part of memory, or to skip through in the compressed zone, I don't remember right now exactly.
« Last Edit: February 17, 2018, 07:40:11 pm by 730 »
Discord: 730#8471

weissvulf

  • Sr. Member
  • ****
  • Posts: 324
  • Good news! An anomaly solved the enigma.
    • View Profile
Re: (PSX - Aconcagua) Can't figure out how compressed text works
« Reply #8 on: February 18, 2018, 06:06:09 pm »
It sounds like you're certainly on the right track!
Quote
It either loads bytes directly from the "compressed" zone and stores them into the other zone, loops loading bytes from the compressed zone until it finds the appropriate byte to load and store, or it uses a loaded byte to calculate (with some ANDs and ORs and shifts and values in other registers) an address pointing to the byte that must be stored. I've seen this address end up pointing to the uncompressed zone,
I would think figuring out the trigger for these three paths would be the key. If one patch copies directly from compressed to decompressed area, perhaps you can discover how to make it the only path that is triggered.

The bitwise operations you mention may work toward the pattern Vehek mentioned earlier. You may already know, but AND operations are usually for 'bitmasks' - extracting bits out of a larger number. It's common to see an AND followed by a SRL to move the extracted bits into their proper place before testing.

If you don't already have a resource for tracking such operations, the HexDecBin converter HERE might help.

On a related note, I get the anti-mod lockup screen when I load this game in pSX 1.13, even when using a Japanese BIOS. Which is especially strange because the Redump database doesn't list is as anti-modchip. Has anyone been successful in booting this game with the pSX emulator?


730

  • Jr. Member
  • **
  • Posts: 49
    • View Profile
Re: (PSX - Aconcagua) Can't figure out how compressed text works
« Reply #9 on: February 18, 2018, 08:34:17 pm »
It sounds like you're certainly on the right track! I would think figuring out the trigger for these three paths would be the key. If one patch copies directly from compressed to decompressed area, perhaps you can discover how to make it the only path that is triggered.

So you're saying by doing that I could simply store decompressed text in the disc? That makes sense, though then I would have to figure out how it loads stuff from the disc... Well I guess I'd have to do that if I stored it compressed anyways since the size would change.

The bitwise operations you mention may work toward the pattern Vehek mentioned earlier. You may already know, but AND operations are usually for 'bitmasks' - extracting bits out of a larger number. It's common to see an AND followed by a SRL to move the extracted bits into their proper place before testing.

If you don't already have a resource for tracking such operations, the HexDecBin converter HERE might help.

On a related note, I get the anti-mod lockup screen when I load this game in pSX 1.13, even when using a Japanese BIOS. Which is especially strange because the Redump database doesn't list is as anti-modchip. Has anyone been successful in booting this game with the pSX emulator?

I've been debugging with pSX in fact, I use the scph5500 BIOS.

And today I've figured out a few more things, I think I'm getting pretty close.
In the register r19 (just naming based on what pSX says), at some point a byte from the compressed area is loaded, and it keeps getting shifted right by 1 bit every loop that a byte is stored to the decompressed area (also at some point it gets 0xFF added to it, such that it ends up being 0xFFXX, XX being the loaded byte, and I think this loaded byte is also the bytes that are "skipped" that I mentioned).
Normally bytes are taken from the compressed area and stored into the decompressed area, but when the lowest bit of r19 becomes 1, it branches off into the other path.

In this path it loads 2 bytes from the compressed area, say XY ZT for convenience, and it enters what I can't help but to call a for loop, with Z+3 being the amount of iterations. In this loop it gets Z+3 bytes from a previous location of the decompressed zone and stores them into the current one. It makes this address by adding up a base address, an offset stored in r16 (I still don't understand how it gets this offset; it appears to increment every loop, though I think depending on the path taken it doesn't increment), TXY (all those operations were for building that number), and the number of the current iteration of the "for" loop.

I've yet to figure out how r16 works, and keep track of r19. After finishing that path above it seems it can either "refresh" itself by loading a byte from the compressed area (the skipped one), or it can just keep going through the normal path until it branches off again. That behavior might have to do with a check done right before the one that checks the lowest bit. This check looks into the lowest bit of the 3rd nibble instead (ANDs with 0x0100).
Discord: 730#8471

Valendian

  • Jr. Member
  • **
  • Posts: 42
    • View Profile
Re: (PSX - Aconcagua) Can't figure out how compressed text works
« Reply #10 on: February 18, 2018, 11:43:55 pm »
it keeps getting shifted right by 1 bit every loop that a byte is stored to the decompressed area (also at some point it gets 0xFF added to it, such that it ends up being 0xFFXX, XX being the loaded byte, and I think this loaded byte is also the bytes that are "skipped" that I mentioned).

I think you found the getbits function. The typical LZ type decoder has three paths:
  1  Uncompressed raw data
  2  Compressed pointer / size pair
  3  End of file marker (-1 in the code below)
Although any attempt to compress such a small payload will result in expansion. I think this is a more of a text preprocessor than any form of depacker or decrypter.

Anyways here's what an LZ decoder looks like
Code: [Select]
// source https://oku.edu.mie-u.ac.jp/~okumura/compression/lzss.c
unsigned int   insize;
unsigned int   outsize;
unsigned char *indata;
unsigned char *outdata;
unsigned int  inptr;
unsigned int  buf;
unsigned int  mask = 0;
unsigned char buffer[N * 2];

// get n bits
int getbit(int n) {
    int i, x;
    x = 0;
    for (i = 0; i < n; i++) {
        if (mask == 0) {
            if ( inptr >= insize ) return -1;
            buf = indata[ inptr++ ];
            mask = 128;
        }
        x <<= 1;
        if (buf & mask) x++;
        mask >>= 1;
    }
    return x;
}

void decode(void) {
    int i, j, k, r, c;
    for (i = 0; i < N - F; i++) buffer[i] = B; // clear the buffer to avoid infinite loops
    r = N - F;
    while ((c = getbit(1)) != -1) {
        if (c) {
            if ((c = getbit(8)) == -1) break;
            out[outsize++] = c;
            buffer[r++] = c;
            r &= (N - 1);
        } else {
            if ((i = getbit(EI)) == -1) break;
            if ((j = getbit(EJ)) == -1) break;
            for (k = 0; k <= j + 1; k++) {
                c = buffer[(i + k) & (N - 1)];
                outdata[ outsize++ ] = c;
                buffer[r++] = c;
                r &= (N - 1);
            }
        }
    }
}
« Last Edit: February 18, 2018, 11:50:10 pm by Valendian »

weissvulf

  • Sr. Member
  • ****
  • Posts: 324
  • Good news! An anomaly solved the enigma.
    • View Profile
Re: (PSX - Aconcagua) Can't figure out how compressed text works
« Reply #11 on: February 19, 2018, 01:10:33 am »
I was searching for the reason my copy won't load (anti-mod) and ran into what looks like the same compression you've been describing. If this is compression, it's just about the crummiest scheme I've ever seen. There seems to be a pattern where 00 means the next 8 letters are to be copied as is. Have you tried adding that code to your text? When the 00 is replaced with 80 (aka  1000 0000), it causes a fluctuation 8 letters later. Just like you said, a bitwise 'countdown'.

F8 0F 20 20 20 20 20 53 00
?--?-- -- -- -- -- --S

4F 46 54 57 41 52 45 20 00
O--F--T--W--A--R--E     

54 45 52 4D 49 4E 41 54 00
T--E--R--M--I--N--A--T

45 44 0A 43 4F 4E 53 4F 00
E--D-- --C--O--N--S--O 

4C 45 20 4D 41 59 20 48 00
L--E-- --M--A--Y-- --H

41 56 45 20 42 45 45 4E 80
A--V--E-- --B--E--E--N

20 4D 4F 44 49 46 49 E1 0F 01 C8
 --M--O--D--I--F--I   

2F 43 41 4C 4C 20 31 2D 00
/--C--A--L--L-- --1- -

38 38 38 2D 37 38 30 2D 00
8--8--8- - --7--8--0- -

37 36 39 30 00 00 00 00 00 
7--6--9--0

730

  • Jr. Member
  • **
  • Posts: 49
    • View Profile
Re: (PSX - Aconcagua) Can't figure out how compressed text works
« Reply #12 on: February 19, 2018, 01:35:49 am »
There seems to be a pattern where 00 means the next 8 letters are to be copied as is. When the 00 is replaced with 80 (aka  1000 0000), it causes a fluctuation 8 letters later. Just like you said, a bitwise 'countdown'

Wow, I hadn't noticed that, I guess I'll look into it tomorrow. That countdown seems to be done with just about any value though, so it's weird.
For now, modifying the code to load text directly seems to be the easiest method to achieve this, I'm gonna have to find out where the text starts being loaded and where it ends, so I can compare the size of the compressed text with the uncompressed text, since if it really is compression, uncompressed text might not fit in the disc.

Have you tried adding that code to your text?

What code? If you mean that LZ snippet, I've yet to figure it out (going from low level to high level is difficult jesus).

On another topic, I've tried opening the game's STR files with some STR players, but none of them can open them. I don't even see a header in them, if there's supposed to be one. Part of this translation would be adding English subtitles to the few Spanish lines in the cutscenes, so I'm gonna have to figure these out too.
Discord: 730#8471

Vehek

  • Full Member
  • ***
  • Posts: 169
    • View Profile
Re: (PSX - Aconcagua) Can't figure out how compressed text works
« Reply #13 on: February 19, 2018, 02:52:49 am »
I should have just said it was an LZ variant. I posted those notes because I thought they would be recognized as the key details that need to be tweaked in an LZSS decoder.

On another topic, I've tried opening the game's STR files with some STR players, but none of them can open them. I don't even see a header in them, if there's supposed to be one. Part of this translation would be adding English subtitles to the few Spanish lines in the cutscenes, so I'm gonna have to figure these out too.

I did a lot of research into the subtitles (which are stored in the STR files), but don't know enough about CD reads to understand how it knows where to look for the parts it loads.

Gemini

  • Hero Member
  • *****
  • Posts: 2004
  • 時を越えよう、そして彼女の元に戻ろう
    • View Profile
    • Apple of Eden
Re: (PSX - Aconcagua) Can't figure out how compressed text works
« Reply #14 on: February 19, 2018, 03:32:55 am »
It's definitively LZS with 1 byte flags and 2 byte pairs, no encryption involved at all or otherwise you wouldn't be able to read a thing. What you need to know is if uses the decompression buffer as a lookup or a ring buffer of 4KB. If it writes two bytes at different locations it's definitively using a ring buffer. There are plenty de/compressors with sources out there, it shouldn't take long to recreate for this game in particular (mind you, I've been recycling the same compression code for years with minor tweaks and it ends up working for most LZS variants). Just make sure you don't make a fake compressor that actually inserts literal data with flags set to 0 or you will end up having making the game load quite slowly (i.e. compression is used to really decrease load time, not as an anti modification protection like many usually point out).
I am the lord, you all know my name, now. I got it all: cash, money, and fame.

weissvulf

  • Sr. Member
  • ****
  • Posts: 324
  • Good news! An anomaly solved the enigma.
    • View Profile
Re: (PSX - Aconcagua) Can't figure out how compressed text works
« Reply #15 on: February 19, 2018, 03:51:52 am »
Just make sure you don't make a fake compressor that actually inserts literal data with flags set to 0 or you will end up having making the game load quite slowly (i.e. compression is used to really decrease load time, not as an anti modification protection like many usually point out).
I know that's certainly true with large data chunks like images, but is it really significant for text? I mean 1x CD speed is something like 100,000 letters per second isn't it?

Gemini

  • Hero Member
  • *****
  • Posts: 2004
  • 時を越えよう、そして彼女の元に戻ろう
    • View Profile
    • Apple of Eden
Re: (PSX - Aconcagua) Can't figure out how compressed text works
« Reply #16 on: February 19, 2018, 04:14:15 am »
Considering you're adding a fake compression it would take significantly longer to load and decompress bigger data rather than load roughly half the size and perform decompression from memory. You would almost never find small data with compression on top because it would be pointless / counterproductive; given how slow seek is most games would gather as much data as possible to load in one read pass, hence why it's always better to compress as intended.
I am the lord, you all know my name, now. I got it all: cash, money, and fame.

730

  • Jr. Member
  • **
  • Posts: 49
    • View Profile
Re: (PSX - Aconcagua) Can't figure out how compressed text works
« Reply #17 on: February 19, 2018, 04:19:45 pm »
I'm trying to understand that LZS code but it's difficult. I understand that the offset in r16 that keeps getting incremented is probably the beginning of the "window", and by doing (current writing address - start address) - r16 I get what would be the size of this window/buffer (it seems it's 0x1000 or 0x0FFF, I'm not good at math, I'm going to assume it's 0x1000), but I'm having trouble understanding what all these constants and variables are for.

Code: [Select]
#define EI 12  /* typically 10..13 */
#define EJ  4  /* typically 4..5 */
#define P   1  /* If match length <= P then output one character */
#define N (1 << EI)  /* buffer size */
#define F ((1 << EJ) + 1) /* lookahead buffer size */

What are EI and EJ? Why are they used to set the values of N and F instead of setting them directly? It looks like N is the buffer size, aka the window, but then what is F used for?
Is the mask variable in getbit() the same as r19? Since it looks like it does that countdown, though the one in the game is more than just a single bit and has more stuff in it.
The EI/EJ and N/F stuff is really confusing me, I'm not entirely sure what the lookahead buffer is, but I think its size determines the amount of literal bytes to be read when it reaches the thing that tells it to just get literal bytes.

What you need to know is if uses the decompression buffer as a lookup or a ring buffer of 4KB. If it writes two bytes at different locations it's definitively using a ring buffer.

What do you mean by different locations? As in, while decompressing it can write a byte at 0x0100, then next, instead of writing to 0x0101, it writes to 0x0150 or something like that? Because I'm pretty sure it writes them all consecutively, at least it looks that way, not sure how I would check whether there happens to be a byte that is written somewhere else.

I should have just said it was an LZ variant. I posted those notes because I thought they would be recognized as the key details that need to be tweaked in an LZSS decoder.

I did a lot of research into the subtitles (which are stored in the STR files), but don't know enough about CD reads to understand how it knows where to look for the parts it loads.

Can we talk in IRC? I mentioned you in there, not sure if you saw it. I want to ask you questions but this is kinda slow.

By the way, thanks everyone for the help so far!
Discord: 730#8471

RidFin

  • Newbie
  • *
  • Posts: 4
    • View Profile
Re: (PSX - Aconcagua) Can't figure out how compressed text works
« Reply #18 on: February 19, 2018, 04:33:32 pm »
every game implements its own format based on common algorithm, once you understand the algorithm you'll see the pattern in the asm code. Look for a good book about compression, youtube video and learn the lz77/lzss/whatever algorithm, then try to write something on your own and implement some decompresser for the game. You are in the right path!

weissvulf

  • Sr. Member
  • ****
  • Posts: 324
  • Good news! An anomaly solved the enigma.
    • View Profile
Re: (PSX - Aconcagua) Can't figure out how compressed text works
« Reply #19 on: February 19, 2018, 05:29:48 pm »
It appears that this game shipped with an aggressive anti-Mod detection routine. My BIN/CUE, that was dumped from an original CD, locks up on pSX 1.13. There is, however, a pirated version floating around which loads normally. Comparing the two, the pirate copy appears to have been patched to bypass anti-mod. If you complete this translation, you'll probably want to factor in people using the original version. To that end, HERE is a set of patches to convert the pirate version into the original, in case it helps for comparisons.