Romhacking.net

Romhacking => ROM Hacking Discussion => Topic started by: 730 on February 15, 2018, 09:56:34 pm

Title: (PSX - Aconcagua) Can't figure out how compressed text works
Post by: 730 on February 15, 2018, 09:56:34 pm
Edit: link from last post is dead, reuploaded to Mediafire here (https://www.mediafire.com/file/5wgz5k9x5hfinwl/decompressor.7z/file). This program takes paths to PROGRAM.BIN chunks and decodes them. The InputFile class also includes an unfinished encode() method for encoding them back (don't think it should be that hard to reverse the algorithm in order to encode them, though I doubt it would be very useful since it would probably just make the game crash). Here (https://www.mediafire.com/file/cozm25jiz1hb9e8/split-program-bin.bms/file) is the QuickBMS script for splitting PROGRAM.BIN into said chunks.

I wanted to translate this game but to be honest I think I should come back to this in a few months or years when I've learned more about programming/ASM/computers but I figured I might as well make a post before completely giving up.
So the image of this game has a file in it called PROGRAM.BIN where most of the game besides cutscenes and maybe music/voices (some XA files I haven't looked into) seems to be in.
This PROGRAM.BIN file seems to have several headers labelled PS-X EXE (and some other stuff like Sony Computer Entertainment etc.) in it separating different chunks (should be 141 of them, I split them with QuickBMS).

I managed to find in-game lines in PROGRAM.BIN but they're all jumbled up, missing characters and being shortened. I looked up the lines in RAM for comparison. Here's two examples:

(https://my.mixtape.moe/ymutle.png)
Code: [Select]
付近に客室が落ちたらしい
RAM: 95 74 8B DF 82 C9 8B 71 8E BA 82 AA 97 8E 82 BF 82 BD 82 E7 82 B5 82 A2
BIN: 95 74 8B DF 82 C9 8B A8 71 8E BA D4 01 8E 1E 05 BD 67 00 D8 B5 82 A2

<stuff>Tenemos que subir
RAM: 11 35 12 31 13 3D 34 30 39 35 3B 03 54 65 6E 65 6D 6F 73 20 71 75 65 20 73 75 62 69 72
BIN: 11 26 35 EC 7E 54 65 6E 00 65 6D 6F 73 20 71 75 65 80 20 73 75 62 69 72

I read some posts about LZ compression, but those algorithms seem to be different on every game they're used on, and I also can't find anything like an LZ header in the file, so I don't know what to make of that.

I tried figuring out how the text got decompressed by debugging with this (http://www.romhacking.net/utilities/267/) and tracing with this (http://www.romhacking.net/utilities/236/) but I... wasn't very successful. I realized there was an option for CD-ROM reading breakpoint, but I can't figure out how data is being read and where it's being written.

Some extra info: as you can see above, the game uses both Shift-JIS and ASCII, since it has both Japanese and Spanish/English, though it doesn't seem to support special characters like á or ñ (I edited a savestate to see if I could edit text through those and when I put characters like those in, it crashed when the dialogue box came up). I haven't played the game much to see if it had symbols like those in-game, but I doubt it does.
Title: Re: (PSX - Aconcagua) Can't figure out how compressed text works
Post by: Vehek on February 16, 2018, 02:45:38 am
Here are some notes I made a few years ago.
Quote
Compression
[flag Bytes] are read from lowest to highest. Clear bits indicate literal bytes.
Guess: Upper 4 bits of reference is the copy-size.
References appear to be negative numbers, not absolute locations into a buffer.
---
Compressed data for the first location is at 0x527800 of PROGRAM.bin
The decompressed size is earlier, right after the start of a new PS-X EXE header.
Title: Re: (PSX - Aconcagua) Can't figure out how compressed text works
Post by: 730 on February 16, 2018, 03:29:48 am
Here are some notes I made a few years ago.

Wow, I didn't expect someone to reply to this thread that had actually done something with this game. Thanks, though I think at my current level of expertise it won't do much good, but I'll keep it in my own notes for later.
Title: Re: (PSX - Aconcagua) Can't figure out how compressed text works
Post by: Valendian on February 16, 2018, 09:41:16 am
The bottom one is not compressed, "Ten emos que subdir". pSX can break on CDROM DMA, if you do not know the DMA address then set the range to all memory 80000000-801FFFFF
Title: Re: (PSX - Aconcagua) Can't figure out how compressed text works
Post by: weissvulf on February 16, 2018, 01:43:59 pm
I almost wonder if it's encryption rather than compression. Since you know the RAM location of the 'decompressed' text, have you tried putting a 'breakpoint on memory-write' at that location? That would let you discover the tail-end of the routine generating the final text. From there you should be able to trace backwards to see how it's being produced.
Title: Re: (PSX - Aconcagua) Can't figure out how compressed text works
Post by: 730 on February 16, 2018, 07:12:39 pm
I almost wonder if it's encryption rather than compression. Since you know the RAM location of the 'decompressed' text, have you tried putting a 'breakpoint on memory-write' at that location? That would let you discover the tail-end of the routine generating the final text. From there you should be able to trace backwards to see how it's being produced.

I tried that (though not a breakpoint, just managed to find the moment it started to write the text at the location I knew it would), and it was a bit hard to follow, and also found that it's getting the value to write from another location in memory, already decompressed/decrypted. I guess I'll try it again, maybe setting a breakpoint on the other location to find out how it writes it there.

The bottom one is not compressed, "Ten emos que subdir". pSX can break on CDROM DMA, if you do not know the DMA address then set the range to all memory 80000000-801FFFFF

Hmm, it looks like you're right for the string itself, though the bytes that come before it are different from what is in RAM, so those must still be compressed.
Title: Re: (PSX - Aconcagua) Can't figure out how compressed text works
Post by: weissvulf on February 17, 2018, 04:27:42 am
Quote
it's getting the value to write from another location in memory, already decompressed/decrypted.
I've found multiple copies like that to be very common. Usually the text goes through a series of tests each time it is copied, and variables like names are filled in etc. If you keep going, you should get to the original 'compressed' text as loaded from the CD. You might get lucky and find some simple compression or encryption routine that can be bypassed all together.
Title: Re: (PSX - Aconcagua) Can't figure out how compressed text works
Post by: 730 on February 17, 2018, 02:07:23 pm
I've found multiple copies like that to be very common. Usually the text goes through a series of tests each time it is copied, and variables like names are filled in etc. If you keep going, you should get to the original 'compressed' text as loaded from the CD. You might get lucky and find some simple compression or encryption routine that can be bypassed all together.

I think I was wrong with what I said, but I did find the part in memory where the compressed text loaded from the CD is, and it's getting it from there. I've yet to figure out how it's decompressed, but I did see it changing the register that points to the current byte to load from the compressed section to an address that was several bytes before what the current byte should be, similar to what I read about LZ storing bytes that were previously used as to not reuse them in memory.

The concrete example was something like this:
Code: [Select]
00 01 02 03 04 05 06 07 08 09 0A

Decompressed bytes that end up being stored:
30 E6 02 80 68 DD 02 80 84 DD

Compressed bytes:
30 E6 02 80 68 DD 02 80 82 84 FC

When it's time to write to 0x09, it should load 0x0A, and somehow in the process,
the address ends up being 0x05, loading that DD then storing it in 0x09.

Don't see how this is compressing anything at all but maybe it works out in the long run?

And I don't think there's anything I can bypass, it does need to decompress/decrypt the text in the CD after all, and I need to be able to insert text compressed/encrypted like this into the image.
I'm guessing if it's encryption I could technically rewrite it to bypass the decryption and load non-encrypted text I would write into the image, though judging from what Vehek posted, it looks like it really is compression.

Edit: I've been digging some more. There's apparently something like 3 routes through which it can go. It either loads bytes directly from the "compressed" zone and stores them into the other zone, loops loading bytes from the compressed zone until it finds the appropriate byte to load and store, or it uses a loaded byte to calculate (with some ANDs and ORs and shifts and values in other registers) an address pointing to the byte that must be stored. I've seen this address end up pointing to the uncompressed zone, way before the position where it would store the new byte. It might also have ended up pointing to a posterior position in the compressed zone, but I might've confused that for the second case I described where it skips over some bytes.

I've also seen it use 2 registers to compare them, where one is a set value and the other keeps incrementing during a loop, and depending whether the incrementing one is < than the set one, it branches off to somewhere or not. I think this loop determines a certain amount of bytes to either directly load from the compressed zone, to get from a previously written part of memory, or to skip through in the compressed zone, I don't remember right now exactly.
Title: Re: (PSX - Aconcagua) Can't figure out how compressed text works
Post by: weissvulf on February 18, 2018, 06:06:09 pm
It sounds like you're certainly on the right track!
Quote
It either loads bytes directly from the "compressed" zone and stores them into the other zone, loops loading bytes from the compressed zone until it finds the appropriate byte to load and store, or it uses a loaded byte to calculate (with some ANDs and ORs and shifts and values in other registers) an address pointing to the byte that must be stored. I've seen this address end up pointing to the uncompressed zone,
I would think figuring out the trigger for these three paths would be the key. If one patch copies directly from compressed to decompressed area, perhaps you can discover how to make it the only path that is triggered.

The bitwise operations you mention may work toward the pattern Vehek mentioned earlier. You may already know, but AND operations are usually for 'bitmasks' - extracting bits out of a larger number. It's common to see an AND followed by a SRL to move the extracted bits into their proper place before testing.

If you don't already have a resource for tracking such operations, the HexDecBin converter HERE (https://www.romhacking.net/forum/index.php?topic=25409.0) might help.

On a related note, I get the anti-mod lockup screen when I load this game in pSX 1.13, even when using a Japanese BIOS. Which is especially strange because the Redump database doesn't list is as anti-modchip. (http://redump.org/disc/14630/) Has anyone been successful in booting this game with the pSX emulator?

Title: Re: (PSX - Aconcagua) Can't figure out how compressed text works
Post by: 730 on February 18, 2018, 08:34:17 pm
It sounds like you're certainly on the right track! I would think figuring out the trigger for these three paths would be the key. If one patch copies directly from compressed to decompressed area, perhaps you can discover how to make it the only path that is triggered.

So you're saying by doing that I could simply store decompressed text in the disc? That makes sense, though then I would have to figure out how it loads stuff from the disc... Well I guess I'd have to do that if I stored it compressed anyways since the size would change.

The bitwise operations you mention may work toward the pattern Vehek mentioned earlier. You may already know, but AND operations are usually for 'bitmasks' - extracting bits out of a larger number. It's common to see an AND followed by a SRL to move the extracted bits into their proper place before testing.

If you don't already have a resource for tracking such operations, the HexDecBin converter HERE (https://www.romhacking.net/forum/index.php?topic=25409.0) might help.

On a related note, I get the anti-mod lockup screen when I load this game in pSX 1.13, even when using a Japanese BIOS. Which is especially strange because the Redump database doesn't list is as anti-modchip. (http://redump.org/disc/14630/) Has anyone been successful in booting this game with the pSX emulator?

I've been debugging with pSX in fact, I use the scph5500 BIOS.

And today I've figured out a few more things, I think I'm getting pretty close.
In the register r19 (just naming based on what pSX says), at some point a byte from the compressed area is loaded, and it keeps getting shifted right by 1 bit every loop that a byte is stored to the decompressed area (also at some point it gets 0xFF added to it, such that it ends up being 0xFFXX, XX being the loaded byte, and I think this loaded byte is also the bytes that are "skipped" that I mentioned).
Normally bytes are taken from the compressed area and stored into the decompressed area, but when the lowest bit of r19 becomes 1, it branches off into the other path.

In this path it loads 2 bytes from the compressed area, say XY ZT for convenience, and it enters what I can't help but to call a for loop, with Z+3 being the amount of iterations. In this loop it gets Z+3 bytes from a previous location of the decompressed zone and stores them into the current one. It makes this address by adding up a base address, an offset stored in r16 (I still don't understand how it gets this offset; it appears to increment every loop, though I think depending on the path taken it doesn't increment), TXY (all those operations were for building that number), and the number of the current iteration of the "for" loop.

I've yet to figure out how r16 works, and keep track of r19. After finishing that path above it seems it can either "refresh" itself by loading a byte from the compressed area (the skipped one), or it can just keep going through the normal path until it branches off again. That behavior might have to do with a check done right before the one that checks the lowest bit. This check looks into the lowest bit of the 3rd nibble instead (ANDs with 0x0100).
Title: Re: (PSX - Aconcagua) Can't figure out how compressed text works
Post by: Valendian on February 18, 2018, 11:43:55 pm
it keeps getting shifted right by 1 bit every loop that a byte is stored to the decompressed area (also at some point it gets 0xFF added to it, such that it ends up being 0xFFXX, XX being the loaded byte, and I think this loaded byte is also the bytes that are "skipped" that I mentioned).

I think you found the getbits function. The typical LZ type decoder has three paths:
  1  Uncompressed raw data
  2  Compressed pointer / size pair
  3  End of file marker (-1 in the code below)
Although any attempt to compress such a small payload will result in expansion. I think this is a more of a text preprocessor than any form of depacker or decrypter.

Anyways here's what an LZ decoder looks like
Code: [Select]
// source https://oku.edu.mie-u.ac.jp/~okumura/compression/lzss.c
unsigned int   insize;
unsigned int   outsize;
unsigned char *indata;
unsigned char *outdata;
unsigned int  inptr;
unsigned int  buf;
unsigned int  mask = 0;
unsigned char buffer[N * 2];

// get n bits
int getbit(int n) {
    int i, x;
    x = 0;
    for (i = 0; i < n; i++) {
        if (mask == 0) {
            if ( inptr >= insize ) return -1;
            buf = indata[ inptr++ ];
            mask = 128;
        }
        x <<= 1;
        if (buf & mask) x++;
        mask >>= 1;
    }
    return x;
}

void decode(void) {
    int i, j, k, r, c;
    for (i = 0; i < N - F; i++) buffer[i] = B; // clear the buffer to avoid infinite loops
    r = N - F;
    while ((c = getbit(1)) != -1) {
        if (c) {
            if ((c = getbit(8)) == -1) break;
            out[outsize++] = c;
            buffer[r++] = c;
            r &= (N - 1);
        } else {
            if ((i = getbit(EI)) == -1) break;
            if ((j = getbit(EJ)) == -1) break;
            for (k = 0; k <= j + 1; k++) {
                c = buffer[(i + k) & (N - 1)];
                outdata[ outsize++ ] = c;
                buffer[r++] = c;
                r &= (N - 1);
            }
        }
    }
}
Title: Re: (PSX - Aconcagua) Can't figure out how compressed text works
Post by: weissvulf on February 19, 2018, 01:10:33 am
I was searching for the reason my copy won't load (anti-mod) and ran into what looks like the same compression you've been describing. If this is compression, it's just about the crummiest scheme I've ever seen. There seems to be a pattern where 00 means the next 8 letters are to be copied as is. Have you tried adding that code to your text? When the 00 is replaced with 80 (aka  1000 0000), it causes a fluctuation 8 letters later. Just like you said, a bitwise 'countdown'.

F8 0F 20 20 20 20 20 53 00
?--?-- -- -- -- -- --S

4F 46 54 57 41 52 45 20 00
O--F--T--W--A--R--E     

54 45 52 4D 49 4E 41 54 00
T--E--R--M--I--N--A--T

45 44 0A 43 4F 4E 53 4F 00
E--D-- --C--O--N--S--O 

4C 45 20 4D 41 59 20 48 00
L--E-- --M--A--Y-- --H

41 56 45 20 42 45 45 4E 80
A--V--E-- --B--E--E--N

20 4D 4F 44 49 46 49 E1 0F 01 C8
 --M--O--D--I--F--I   

2F 43 41 4C 4C 20 31 2D 00
/--C--A--L--L-- --1- -

38 38 38 2D 37 38 30 2D 00
8--8--8- - --7--8--0- -

37 36 39 30 00 00 00 00 00 
7--6--9--0
Title: Re: (PSX - Aconcagua) Can't figure out how compressed text works
Post by: 730 on February 19, 2018, 01:35:49 am
There seems to be a pattern where 00 means the next 8 letters are to be copied as is. When the 00 is replaced with 80 (aka  1000 0000), it causes a fluctuation 8 letters later. Just like you said, a bitwise 'countdown'

Wow, I hadn't noticed that, I guess I'll look into it tomorrow. That countdown seems to be done with just about any value though, so it's weird.
For now, modifying the code to load text directly seems to be the easiest method to achieve this, I'm gonna have to find out where the text starts being loaded and where it ends, so I can compare the size of the compressed text with the uncompressed text, since if it really is compression, uncompressed text might not fit in the disc.

Have you tried adding that code to your text?

What code? If you mean that LZ snippet, I've yet to figure it out (going from low level to high level is difficult jesus).

On another topic, I've tried opening the game's STR files with some STR players, but none of them can open them. I don't even see a header in them, if there's supposed to be one. Part of this translation would be adding English subtitles to the few Spanish lines in the cutscenes, so I'm gonna have to figure these out too.
Title: Re: (PSX - Aconcagua) Can't figure out how compressed text works
Post by: Vehek on February 19, 2018, 02:52:49 am
I should have just said it was an LZ variant. I posted those notes because I thought they would be recognized as the key details that need to be tweaked in an LZSS decoder.

On another topic, I've tried opening the game's STR files with some STR players, but none of them can open them. I don't even see a header in them, if there's supposed to be one. Part of this translation would be adding English subtitles to the few Spanish lines in the cutscenes, so I'm gonna have to figure these out too.

I did a lot of research into the subtitles (which are stored in the STR files), but don't know enough about CD reads to understand how it knows where to look for the parts it loads.
Title: Re: (PSX - Aconcagua) Can't figure out how compressed text works
Post by: Gemini on February 19, 2018, 03:32:55 am
It's definitively LZS with 1 byte flags and 2 byte pairs, no encryption involved at all or otherwise you wouldn't be able to read a thing. What you need to know is if uses the decompression buffer as a lookup or a ring buffer of 4KB. If it writes two bytes at different locations it's definitively using a ring buffer. There are plenty de/compressors with sources out there, it shouldn't take long to recreate for this game in particular (mind you, I've been recycling the same compression code for years with minor tweaks and it ends up working for most LZS variants). Just make sure you don't make a fake compressor that actually inserts literal data with flags set to 0 or you will end up having making the game load quite slowly (i.e. compression is used to really decrease load time, not as an anti modification protection like many usually point out).
Title: Re: (PSX - Aconcagua) Can't figure out how compressed text works
Post by: weissvulf on February 19, 2018, 03:51:52 am
Just make sure you don't make a fake compressor that actually inserts literal data with flags set to 0 or you will end up having making the game load quite slowly (i.e. compression is used to really decrease load time, not as an anti modification protection like many usually point out).
I know that's certainly true with large data chunks like images, but is it really significant for text? I mean 1x CD speed is something like 100,000 letters per second isn't it?
Title: Re: (PSX - Aconcagua) Can't figure out how compressed text works
Post by: Gemini on February 19, 2018, 04:14:15 am
Considering you're adding a fake compression it would take significantly longer to load and decompress bigger data rather than load roughly half the size and perform decompression from memory. You would almost never find small data with compression on top because it would be pointless / counterproductive; given how slow seek is most games would gather as much data as possible to load in one read pass, hence why it's always better to compress as intended.
Title: Re: (PSX - Aconcagua) Can't figure out how compressed text works
Post by: 730 on February 19, 2018, 04:19:45 pm
I'm trying to understand that LZS code but it's difficult. I understand that the offset in r16 that keeps getting incremented is probably the beginning of the "window", and by doing (current writing address - start address) - r16 I get what would be the size of this window/buffer (it seems it's 0x1000 or 0x0FFF, I'm not good at math, I'm going to assume it's 0x1000), but I'm having trouble understanding what all these constants and variables are for.

Code: [Select]
#define EI 12  /* typically 10..13 */
#define EJ  4  /* typically 4..5 */
#define P   1  /* If match length <= P then output one character */
#define N (1 << EI)  /* buffer size */
#define F ((1 << EJ) + 1) /* lookahead buffer size */

What are EI and EJ? Why are they used to set the values of N and F instead of setting them directly? It looks like N is the buffer size, aka the window, but then what is F used for?
Is the mask variable in getbit() the same as r19? Since it looks like it does that countdown, though the one in the game is more than just a single bit and has more stuff in it.
The EI/EJ and N/F stuff is really confusing me, I'm not entirely sure what the lookahead buffer is, but I think its size determines the amount of literal bytes to be read when it reaches the thing that tells it to just get literal bytes.

What you need to know is if uses the decompression buffer as a lookup or a ring buffer of 4KB. If it writes two bytes at different locations it's definitively using a ring buffer.

What do you mean by different locations? As in, while decompressing it can write a byte at 0x0100, then next, instead of writing to 0x0101, it writes to 0x0150 or something like that? Because I'm pretty sure it writes them all consecutively, at least it looks that way, not sure how I would check whether there happens to be a byte that is written somewhere else.

I should have just said it was an LZ variant. I posted those notes because I thought they would be recognized as the key details that need to be tweaked in an LZSS decoder.

I did a lot of research into the subtitles (which are stored in the STR files), but don't know enough about CD reads to understand how it knows where to look for the parts it loads.

Can we talk in IRC? I mentioned you in there, not sure if you saw it. I want to ask you questions but this is kinda slow.

By the way, thanks everyone for the help so far!
Title: Re: (PSX - Aconcagua) Can't figure out how compressed text works
Post by: RidFin on February 19, 2018, 04:33:32 pm
every game implements its own format based on common algorithm, once you understand the algorithm you'll see the pattern in the asm code. Look for a good book about compression, youtube video and learn the lz77/lzss/whatever algorithm, then try to write something on your own and implement some decompresser for the game. You are in the right path!
Title: Re: (PSX - Aconcagua) Can't figure out how compressed text works
Post by: weissvulf on February 19, 2018, 05:29:48 pm
It appears that this game shipped with an aggressive anti-Mod detection routine. My BIN/CUE, that was dumped from an original CD, locks up on pSX 1.13. There is, however, a pirated version floating around which loads normally. Comparing the two, the pirate copy appears to have been patched to bypass anti-mod. If you complete this translation, you'll probably want to factor in people using the original version. To that end, HERE (http://www.mediafire.com/file/lq1fc3g7n3wdc1h/Acongcagua_Restore_AntiMod%282%29.rar) is a set of patches to convert the pirate version into the original, in case it helps for comparisons.
Title: Re: (PSX - Aconcagua) Can't figure out how compressed text works
Post by: Valendian on February 19, 2018, 06:19:17 pm
EI is the bitsize of the pointer and EJ is the same for copy size. N an F convert these to buffer size and lookahead size. The lookahead is ised during compression to find longer duplicates within the sliding window. P is also only used during compression. It is the minimum string length, less than this will be stored raw.

The decoder works like this.
Read a bit.
If its 1 then the next bytes are raw. so read 8 bit count and copy that many raw bytes from the packed input to the output. Every raw byte is also appended to the sliding window (buffer).

But if the bit was clear then read a pointer + size pair and copy from the window instead.

Your version is not bit packing so replace getbit with readbyte.

Note LZSS uses sliding windows. LZ just offsets from current output pointer. Watch that slight difference. As Gemeni noted all LZ decoders have a similar form. The different versions are just tweaks.
Title: Re: (PSX - Aconcagua) Can't figure out how compressed text works
Post by: Gemini on February 19, 2018, 09:07:53 pm
What do you mean by different locations? As in, while decompressing it can write a byte at 0x0100, then next, instead of writing to 0x0101, it writes to 0x0150 or something like that? Because I'm pretty sure it writes them all consecutively, at least it looks that way, not sure how I would check whether there happens to be a byte that is written somewhere else.
In terms of code it means that it writes decompressed data to two buffers: destination and a ring buffer. The latter is also used as source for compressed pairs.
Title: Re: (PSX - Aconcagua) Can't figure out how compressed text works
Post by: 730 on February 20, 2018, 03:24:28 am
I DID IT!!! mostly...
There's a couple of bugs like some bytes way at the bottom that are different from RAM
(https://my.mixtape.moe/mmtktt.png)

though maybe they were modified after loading, I'll have to see that later.
I think I should also figure out where the size is, not only because I might have "decompressed" some garbage, but also because the loop isn't stopping at EOF (that's probably just me doing something wrong, still would be nice to know the size); I couldn't find the value Vehek says is right after the PS-X EXE header, maybe I didn't try hard enough.

Anyways, here's a Pastebin with the current code (https://pastebin.com/VfdF92Bg), I'm gonna have to do some debugging tomorrow, and verify if it actually works with all the other files (I hope they don't have different parameters or anything or I'm gonna die).
When I fix that stuff and find out the info I need I suppose I can make the compressor.

The file I used to test this was the one starting at 0x57000 of PROGRAM.BIN, with the first 0x1000 bytes removed since it really starts at 0x57800 like Vehek said. I would upload it but I'm not sure if that's legal...? (let me know if I can upload it)

(also please any suggestions regarding the code, maybe there's some abstractions I'm not aware of and it can be made easier to understand)
Title: Re: (PSX - Aconcagua) Can't figure out how compressed text works
Post by: STARWIN on February 20, 2018, 08:33:18 am
(also please any suggestions regarding the code, maybe there's some abstractions I'm not aware of and it can be made easier to understand)

you can zero init the buffer with char buf[BUFFER_SIZE]={0};

if buf is declared at global level, it will be zero init even by itself and you don't have to pass the array parameter when calling push

i would always have the code block after if well defined with { } around, even single liners, and that second line with EOF may be broken btw

there is almost never a reason to use a short int instead of int

(you can declare i temporarily "for (int i=0;" in for instead of declaring i earlier, if you want)

if you want to reduce copying in push, you can have a globally declared "int start=0;" and use the following:

pull with =buf[(start+offset)%BUFFER_SIZE];

push with buf[start%BUFFER_SIZE]=c;start++;

(not tested)

(it would be good manners to set start to 0 when it grows big enough, but i doubt it matters here)
Title: Re: (PSX - Aconcagua) Can't figure out how compressed text works
Post by: Valendian on February 20, 2018, 12:09:24 pm
you can zero init the buffer with char buf[BUFFER_SIZE]={0};

Not when the array is used more than once.

@730 if those errors only show up at the end then all you are missing is the end of file logic.
Title: Re: (PSX - Aconcagua) Can't figure out how compressed text works
Post by: 730 on February 24, 2018, 06:39:09 pm
I rewrote the program in C++ (full with classes and modularity and all that stuff), and fixed the bug (besides from the EOF thing). This bug was due to how I handled the sliding window, moving it every time a byte was written instead of leaving it fixed while it was writing a sequence of bytes from an offset and sliding it after that was done, which made it impossible to retrieve bytes from the proper position when (offset + i) > 0xFFF, which overflowed so it should've gone to the beginning of the window to get these bytes (also some arithmetics on how to get the bytes in those situations).

Anyways here's the program (https://my.mixtape.moe/nbcdwu.zip). This is basically the first time I've ever done something (so) OOP outside of class so this code is probably garbage, I'd like to know what I can improve (some notes already in TODO in main.cpp).

One thing I still don't have clear is, what algorithm is this specifically? Is it a variant with a name or is it just some made up LZ for Aconcagua? If it's something with a name it's gonna be easier to find out how to write the compressor, I've no idea how I'm gonna go about that yet.

edit: I think I figured out the compression algorithm, I'm gonna try coding it tomorrow. I forgot another thing that was really wrong in the code I made, I didn't destroy objects when I was supposed to, woops.
Title: Re: (PSX - Aconcagua) Can't figure out how compressed text works
Post by: Valendian on February 25, 2018, 07:21:37 am
You're using a std::vector<char> for the sliding window, I think this is creating a new problem rather than solving an existing one. The problem is made more simple by using a fixed sized c-style char array all you need to do is use modulus wrap around to enforce boundary constraints. Also the negative start index in the Sliding Window is dubious, I'm not sure if this is because you are using a vector but I really would consider dropping the std::vector and switching to a fixed sized c-style char array.

Code: [Select]
    // using a std::vector is creating more problems than it solves
    realOffset = ((offset + i) > 0xFFF ? (offset + i) - 0x1000 : offset + i);
    buf.push_back(buf[sldWindow->getStart() + realOffset]);

Code: [Select]
    // if you use a char[] it would look more like this
    buf[end++ % sizeof(buf)] = buf[(offset + i) % sizeof(buf)];

And this appears to be LZSS (https://en.wikipedia.org/wiki/Lempel%E2%80%93Ziv%E2%80%93Storer%E2%80%93Szymanski). The distinguishing feature of this LZ variant is that it maintains a separate sliding window. It treats the sliding window as a FIFO queue. Earlier variants simply took the position in the output stream and grabbed bytes from earlier positions, (the sliding window was a subset of the output stream and would contain duplicate entries).

Also when you get around to encoding you will need to layer some other technique on top to get satisfactory results, the basic example I pointed to earlier doesn't do anything fancy and so its compression ratio is rather poor. Huffman Coding (https://en.wikipedia.org/wiki/Huffman_coding) can be used to improve this ratio.