Recompressing a file to original size

Started by theflyingzamboni, December 05, 2018, 01:56:49 PM


theflyingzamboni

In The Legend of Dragoon, some of the files are compressed via an implementation of byte-pair encoding applied to blocks of <=2048 bytes. For each block, the byte-pair dictionary assigns value pairs to keys until no pair occurs at least 5 times, or until all valid keys are used. The only ordering that matters is that the most frequently occurring pair is always the next one encoded.
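Roughly, the per-block scheme amounts to something like this Python sketch (the names are mine, and I'm assuming a "valid key" is any byte value that never occurs in the block, so the decompressor can tell keys from literals; the real format may define keys differently):

from collections import Counter

def bpe_compress_block(block: bytes, min_count: int = 5):
    """Greedily byte-pair encode one block of <= 2048 bytes."""
    data = bytearray(block)
    # Assumption: valid keys are byte values unused in the block.
    free_keys = [b for b in range(256) if b not in set(data)]
    dictionary = {}  # key byte -> (first byte, second byte)

    while free_keys:
        # Count every adjacent byte pair in the current data.
        pairs = Counter(zip(data, data[1:]))
        if not pairs:
            break
        (a, b), count = pairs.most_common(1)[0]
        if count < min_count:
            break  # no pair occurs at least 5 times -> stop
        key = free_keys.pop(0)
        dictionary[key] = (a, b)
        # Replace each non-overlapping occurrence of the pair with the key.
        out = bytearray()
        i = 0
        while i < len(data):
            if i + 1 < len(data) and data[i] == a and data[i + 1] == b:
                out.append(key)
                i += 2
            else:
                out.append(data[i])
                i += 1
        data = out
    return dictionary, bytes(data)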

My issue is that in some cases the size of the compressed code has to remain the same, to avoid upsetting absolute pointers in the file the compressed code is embedded in. Currently, I get around this by randomizing the dictionary sort order on a block-by-block basis. When two byte pairs in a block occur with the same frequency, I use different sub-sorts: for example, I sort them in ascending order by the first byte in the pair, and further sub-sort that by the second byte in descending order. I have a total of 5 different sort types, and for a given block, each one may give a different compressed size. This way, when I'm compressing multiple blocks, the compressor can repeatedly attempt to compress the file until it hits on a combination of sorts that results in the original compressed file size.
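Concretely, the randomization looks something like this (only the ascending/descending example above is one of my actual sorts; the other tie-breaks here are placeholders):

import random

# Five illustrative tie-break rules for pairs that occur equally often.
TIE_BREAKS = [
    lambda p: (p[0], p[1]),     # first asc, second asc
    lambda p: (p[0], -p[1]),    # first asc, second desc (the example)
    lambda p: (-p[0], p[1]),    # first desc, second asc
    lambda p: (-p[0], -p[1]),   # first desc, second desc
    lambda p: (p[1], p[0]),     # sort on the second byte first
]

def order_pairs(pair_counts, sort_id):
    """Most frequent pair first; equal-frequency ties broken per sort_id."""
    tie = TIE_BREAKS[sort_id]
    return sorted(pair_counts, key=lambda p: (-pair_counts[p], tie(p)))

def random_sort_ids(num_blocks):
    """One randomly chosen sort order per block, as described above."""
    return [random.randrange(len(TIE_BREAKS)) for _ in range(num_blocks)]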

There are two questions I have:
1. Is there a way to do this that is more efficient than pure randomization? If I'm compressing a file 100 blocks long, I could attempt compression 100+ times and still not hit on a winning combo (and running through every permutation is out of the question). So, for example, if the file is consistently compressing to too small a size, weight a sort order that produces a larger compressed block on average? Or loop up to five times on each block and try to get each block to equal its original size, with a variable tracking deficits/surpluses that need to be balanced out by future blocks (see the sketch after this list)? Or something?
2. Right now, if I manage to get lucky on a tricky file and hit on a combination that works, the next time I run the program it will randomize it all again. Is there a good way to keep track between instances of running the program of a working combination of sort orders, so that the next time I run the program, it does it in one try?
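For question 1, the deficit-tracking idea might look roughly like this (compress_block(block, sort_id) is a stand-in for the real per-block compressor; a final deficit of 0 means the total size matches):

def compress_with_target(blocks, orig_sizes, compress_block, num_sorts=5):
    """One greedy pass: keep a running deficit (original bytes so far
    minus new bytes so far) and pick, for each block, the sort order
    that leaves the deficit closest to zero. A heuristic, not a
    guarantee; it could be combined with randomized retries."""
    chosen, deficit = [], 0
    for block, orig_size in zip(blocks, orig_sizes):
        best = None
        for sort_id in range(num_sorts):
            compressed = compress_block(block, sort_id)
            new_deficit = deficit + orig_size - len(compressed)
            if best is None or abs(new_deficit) < abs(best[1]):
                best = (sort_id, new_deficit)
        chosen.append(best[0])
        deficit = best[1]
    return chosen, deficit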

Sorry if my descriptions are confusing, I'll attempt to clarify points if someone has questions.
ROM wasn't hacked in a day.

Mugi

Maybe I just misunderstood something here, but it seems to me that if you can otherwise produce functional files and they're just smaller than the originals, then why not just pad the end of the files with zeroes? That would make the next block or file or whatever start where it's supposed to, and if the new file, which is smaller than the original, otherwise works, I would imagine stuffing zeroes at its end does it no harm.
In PSP we trust.

theflyingzamboni

#2
Quote from: Mugi on December 05, 2018, 02:52:00 PM
Maybe I just misunderstood something here, but it seems to me that if you can otherwise produce functional files and they're just smaller than the originals, then why not just pad the end of the files with zeroes? That would make the next block or file or whatever start where it's supposed to, and if the new file, which is smaller than the original, otherwise works, I would imagine stuffing zeroes at its end does it no harm.
That was the absolute first thing I tried, but it doesn't work. I'm not really sure why, but when I do that, the game crashes. I think something in the game expects the code following the end of the compressed portion of the file to be some sort of instruction. The game doesn't know how large the compressed file is. Each block is set up like:

[decompressed block size (4 bytes)][decompression dictionary (variable size)][compressed block (variable size)]

So when it finishes decompressing one block, the next 4 bytes tell it how many bytes are in the next decompressed block. When it reads 0x00000000 as the next block size, the decompression terminates, and if there's code in the file after the compressed portion, it starts reading and executing that normally.
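So the decompressor's outer loop presumably looks something like this (the little-endian size field is my assumption based on the platform, and skip_block stands in for stepping over one dictionary-plus-data pair, since I haven't worked out the dictionary layout here):

import struct

def walk_blocks(data: bytes, skip_block):
    """Scan the compressed stream until the 0x00000000 terminator."""
    offset = 0
    while True:
        # 4-byte decompressed block size, assumed little-endian.
        (decomp_size,) = struct.unpack_from("<I", data, offset)
        offset += 4
        if decomp_size == 0:
            return offset  # end of the compressed portion
        # Hypothetical helper: skip_block(data, offset) -> next offset
        # past one [dictionary][compressed block] pair.
        offset = skip_block(data, offset)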
ROM wasn't hacked in a day.

FAST6191

What is there in the original compressed section? It is uncommon to see an end-of-file/compression marker (I would usually expect that sort of thing in the header or decompression instruction/function), but it is not an entirely unknown concept.

I would not expect executable code to magically be there; however, if it is, then maybe rather than simple 00/FF padding, go with NOPs or something. Though I just looked at http://www.mrc.uidaho.edu/mrc/people/jff/digital/MIPSir.html and NOOP is apparently encoded as 0000 0000 0000 0000 0000 0000 0000 0000.

theflyingzamboni

Quote from: FAST6191 on December 05, 2018, 06:35:10 PM
What is there in the original compressed section? It is uncommon to see an end-of-file/compression marker (I would usually expect that sort of thing in the header or decompression instruction/function), but it is not an entirely unknown concept.

I would not expect executable code to magically be there; however, if it is, then maybe rather than simple 00/FF padding, go with NOPs or something. Though I just looked at http://www.mrc.uidaho.edu/mrc/people/jff/digital/MIPSir.html and NOOP is apparently encoded as 0000 0000 0000 0000 0000 0000 0000 0000.
Yes, 0x00000000 is NOP. I describe how the original compressed section is arranged in my previous post. I think it's not so much that the compressed code has an EOF marker, more that it ends on a header for a subsequent block that does not exist. I was just hypothesizing about why it breaks when padded; I'm not really sure why it happens. The other reason I don't want to pad is that I'm not always trying to compress the entire file. To maximize mod-compatibility potential, I don't patch the entire compressed file. I decompress only the blocks that contain the code I'm changing, patch that, then compress those blocks back into place in the original. If I padded those blocks, the game would read garbage as the next decompressed block size. So that's why I want a non-padding solution.
ROM wasn't hacked in a day.

Mugi

I just reread this, and judging by how you explained how it's built, it's actually the same sort of structure I ran into when I was haxoring Magna Carta portable back in the day.

The files were split into pieces of 0x6000 bytes, and then each piece was compressed with zlib and given a header describing the size of each block.

What I did was simply write a recompressor that does the same, but that was easy because it was a single file instead of being in the middle of something else. However, I did run into the same issue you have when I originally examined this: my first implementation simply arranged the blocks at 0x6000 intervals instead of by their compressed size, resulting in 0x0000 padding between blocks, and that obviously made it freeze :P

Now, if you have a chance to pull out all the blocks (including the ones you don't modify) instead of just the ones you're touching, you could just generate new smaller files like you did so far, then join all the blocks together, put them back into the game, and zero-pad after the last block instead of between the blocks. By all means, that should work.
In PSP we trust.

theflyingzamboni

#6
(Apologies for the tl;dr, but I'm trying to explain an entire process and chain of logic here, to answer the inevitable questions of "Why not just do this?")

So zero padding the end was the first thing I tried, but it caused the game to crash. Just tried it again and it worked though, so I guess the bug was something else that I've since fixed.

Regardless, I still have a reason to do things the way that I am, and that's mod compatibility. Obviously, the compressor I wrote can deal with entire files at once, but dealing with them in segments improves mod compatibility. If I patch the entire file, then only the one patch is viable. But if I patch individual segments only, then any future mods that use that file will be compatible so long as they don't target any overlapping blocks.

Additionally, the way I handle segments, I write the start and end offsets of the segment to a metadata file when I decompress. That way, when I compress it again, I can just copy the data before and after the segment I'm actually compressing, which is a lot faster than always decompressing and compressing every block (especially in one case where I'm targeting 6 blocks out of about 450). But of course, if I'm doing multiple segments in a batch and compressing one alters the file size, suddenly the metadata for the other segments is incorrect. I realize this could be resolved by a looping workflow of decompress/extract single segment/file -> insert script -> compress/insert segment/file, but that's a lot less convenient for working on mods than a workflow of decompress/extract all segments/files -> insert all scripts -> compress/insert all segments/files. Plus, these tools are intended for more general use beyond text insertion, where this might be even less convenient. Basically, what I'm saying is that somewhere I have to sacrifice some degree of convenience, and workflow isn't where I want to do it. So I'm trying to do something that's computationally more difficult, but simpler for the person trying to mod the game.
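The offset-based splice itself is trivial; the size check is the whole problem (names here are illustrative):

def splice_segment(container: bytes, start: int, end: int,
                   new_segment: bytes) -> bytes:
    """Copy the untouched bytes before and after [start:end] from the
    original container and drop the recompressed segment in between.
    If the new segment isn't exactly the original size, every stored
    offset after it becomes wrong."""
    if len(new_segment) != end - start:
        raise ValueError("recompressed segment changed size")
    return container[:start] + new_segment + container[end:]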

I have at least solved the issue of the end user of the mod having to wait on a randomized process, though. When the compressor hits on a successful sequence of sort orders for the blocks, it appends that sequence to the metadata file. During patch creation, this sequence is attached to the xdelta file, then stripped off during patch application and attached to the metadata again at the user's end, guaranteeing it will succeed the first time, every time.
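For anyone curious, persisting the winning sequence can be as simple as this (the JSON layout is purely illustrative; my actual metadata file stores more than this and isn't necessarily JSON):

import json

def save_sort_sequence(meta_path, sort_ids):
    """Record the winning per-block sort orders so the next run can
    replay them instead of re-randomizing."""
    with open(meta_path, "w") as f:
        json.dump({"sort_sequence": sort_ids}, f)

def load_sort_sequence(meta_path):
    try:
        with open(meta_path) as f:
            return json.load(f)["sort_sequence"]
    except (FileNotFoundError, KeyError):
        return None  # fall back to the randomized search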

That still leaves me looking for a less naïve implementation of the compressor to speed up the process of finding a successful sort order sequence, if anyone has any ideas.


EDIT: I just had a couple thoughts on a compromise solution. What if, when I do a batch compression of multiple segments within the same file, I just do it in reverse order from the highest-numbered set of blocks to the lowest? That way, if one results in a smaller file, it won't impact the location of the next segment to be compressed?
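As a sketch of that reverse-order idea (Segment and compress_segment are hypothetical, and it assumes recompressed segments only ever shrink):

def recompress_in_reverse(container, segments, compress_segment):
    """Work from the highest start offset down: a shrunken segment only
    shifts bytes after itself, so the stored offsets of segments still
    waiting to be written remain valid."""
    original_length = len(container)
    for seg in sorted(segments, key=lambda s: s.start, reverse=True):
        new_data = compress_segment(seg)  # may be smaller than before
        container = container[:seg.start] + new_data + container[seg.end:]
    # Zero-pad at the very end to restore the container's length.
    return container.ljust(original_length, b"\x00")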

Attacking from the other end, would it be worthwhile, instead of using start and end offsets from a static metadata file to copy the data on either side of the segment, to run the decompressor from within the compression process to generate the correct offsets at runtime? It would cause the compressor to take a performance hit, but the compressed size of a segment would no longer matter beyond sometimes needing to be no larger than the original.

I can see advantages and drawbacks to both. Anyone have thoughts?
ROM wasn't hacked in a day.

Mugi

I think I'm still just misunderstanding here, but I understood that you have a linear sequence of compressed blocks, and you simply took one block (or a few blocks) from the middle of the sequence, made it smaller, and then padded the end of that block with zeroes.

My thought was to take all the blocks, modify what you need, then linearly arrange them back, reinsert the entire sequence of blocks into the game, and pad only the end of the last block with zeroes. (From a programmatical standpoint, doing this should not be too hard, as you could probably fairly easily implement a way to fetch the compressed size of the block you compress and write its size in its place.)

This way the entire sequence would stay intact, and unless the compressed sequence actually contains something that makes it seek anything specific at the end of the sequence, by all means it has no reason not to work.

Either way, it's good to hear you got it working (whatever it was that caused it to not work to begin with).
In PSP we trust.

theflyingzamboni

#8
Quote from: Mugi on December 19, 2018, 01:48:56 PM
I think I'm still just misunderstanding here, but I understood that you have a linear sequence of compressed blocks, and you simply took one block (or a few blocks) from the middle of the sequence, made it smaller, and then padded the end of that block with zeroes.

My thought was to take all the blocks, modify what you need, then linearly arrange them back, reinsert the entire sequence of blocks into the game, and pad only the end of the last block with zeroes. (From a programmatical standpoint, doing this should not be too hard, as you could probably fairly easily implement a way to fetch the compressed size of the block you compress and write its size in its place.)

This way the entire sequence would stay intact, and unless the compressed sequence actually contains something that makes it seek anything specific at the end of the sequence, by all means it has no reason not to work.

Either way, it's good to hear you got it working (whatever it was that caused it to not work to begin with).

Ah, a failure of explanation on my part. What I meant was what you said: that I would have a file with, say, 30 blocks, decompress blocks 14-18, and when I compressed them back into place, if the full file was smaller, I would pad the end of the file with zeros (in this example, after block 30), not the end of the segment of blocks 14-18. By the metric of "does this create a functioning file," it works.

However, if you read further down in my last post, I try to explain the reason that's not really the solution I'm looking for, as it creates problems down the line for mod compatibility. Hopefully the explanation is helpful. It's hard to explain my reasoning, since it's tied up in a full modding process I've been working on, with a lot of parts tying together, that's difficult to describe briefly. That's why my question was focused on resolving the problem presented, rather than asking for alternative solutions that would circumvent the problem (though I mentioned a couple of sort-of in-between solutions in the EDIT of my last post).

(To address another part of what I think you might be suggesting: decompressing into 30 files of 1 block each would present a problem for the text inserter, since all the text and pointers would be broken up across multiple files. I have one file where the pointers are scattered in locations across the entire file ::) Likewise, in cases with compressed textures, the texture file would get broken into pieces by this method.)
ROM wasn't hacked in a day.