[Technical] [LZ?] Please explain the compression algorithm in the attached file.

Started by Metafalica, September 21, 2019, 10:20:30 AM



Hello. I am making tools to translate Dragon Quest 5 for PS2 into my native language.
I have already made a repacker for the main archive and extracted many smaller files. Some of them have a .lz extension, but I have never seen an LZ algorithm like this before (I am not very experienced).
The text is compressed in a weird way. Most of the time, after every 8 bytes there is an FF byte and then the text continues. Sometimes instead of FF there is a value close to it, like FC. And sometimes the weird bytes sit closer to each other, instead of being separated by 8 bytes of normal text.
I think this is specific enough to identify the algorithm; just tell me its name so I can google it and read about it.

Here is an example file (the compressed text is at the end): https://www.dropbox.com/s/oijahxhr2igq2j0/shido.chaindata.lz?dl=1


I don't think this goes here:
Quote from: Nightcrawler
Minor problems, informal help, and other regular forum inquiries should go in the ROM Hacking Discussion section or other more appropriate section.


From 0x455EA to the end (plus your description) it doesn't seem to be compressed (most of the text can be read).
It's more like text with commands, for example:
-the FF could be some sort of Pad or Pause
-the other F? could be bold font, sounds, references to names, pointers for more text or special objects.

In the end, it depends on context. I mean you should find a piece of file with some or all these commands and know what the game displays for that chunk of data while you play.

The asm hacker of the DQ Translations Team is probably one of the few people who know what you want.


Sorry for putting this in the wrong section.

And the text is compressed for sure. Look at the image. It's just as I said: sometimes an FF byte comes every 8 bytes (fake compression? making the decompressor think something is compressed?)

Link to this exact file: https://www.dropbox.com/s/bw53z2qp9qsw2n1/cdf3b.chaindata.lz?dl=1


That's a lot more info than the previous file. Here is a simple analysis:



' (20)','K(4b)','i(69)','n(6e)','g(67)',
= King

' (20)','P(50)','a(61)',
= Pa


'p(70)','a(61)','s(73)','... I(3fa2)',
=pas... I

' (20)','u(75)','n(6e)','d(64)',
= und



' (20)','y(79)','o(6f)',
= yo



' (20)','a(61)','n(6e)',
= an



' (20)','r(72)','eally(7d655071)',
= really

' (20)','I(49)',
= I

' (20)','d(64)','o(6f)','...(243f)'
= do...{end}


Indeed there's some sort of compression.

I don't think it is a dictionary compression. It could be a font compression (when you pack in a kanji slot many latin signs like 'ill')...

But this "sta(e2d0)" doesn't fit well in that theory...

And neither this "eally(7d655071)" 5 chars in 4 bytes...

Also there are no "NewLine" or "Page" special chars, so those ff, f7 and fb must be for that...

Hope these vague ideas help you.


If it's .lz, then it'll be some Lempel-Ziv derivative. Looks like this is LZSS.

                        4B 69 6E 67 20 50 61 F7
                        K  i  n  g     P  a  *

70 61 73 3F A2 20 75 6E 64 FB 65 72 E2 D0 6E 64
p  a  s  ?  ?     u  n  d  *  e  r  ?  ?  n  d

20 79 6F
   y  o

Starting from the F7 in the middle of "Papas": F7 is a control byte indicating whether the next eight values are individual characters or length-distance pairs. F7 is 11110111. Read from least significant bit to most, it indicates three literal characters, a length-distance pair, then four more characters. Then comes FB, or 11111011, indicating two characters, a length-distance pair, and five more characters.
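
The control-byte reading above can be sketched in a few lines of Python. This is only an illustration of the description in this post (bits read from least significant to most, set bit = literal, clear bit = pair); the function name `flag_kinds` is made up for the example.

```python
def flag_kinds(control: int) -> list[str]:
    """Expand one control byte into eight token kinds, least significant bit first."""
    # Per the description above: a set bit means one literal byte follows,
    # a clear bit means a two-byte length-distance pair follows.
    return ["literal" if (control >> bit) & 1 else "pair" for bit in range(8)]


for control in (0xF7, 0xFB, 0xFF):
    print(f"{control:02X}", flag_kinds(control))
```

Under this reading FF means "eight literals in a row", which would explain why an FF shows up after every 8 bytes in stretches of readable text.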

A length-distance pair indicates a string repeated from earlier in the text. The E2 D0 in "understand" represents a length of 3 and an offset to a point where "sta" appeared previously. Note that the length may be greater than the offset. For instance, the string "ha ha ha" might be "ha " followed by a repetition of length 5 at an offset of -3.
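
The "ha ha ha" case can be reproduced with a byte-by-byte copy. This is a minimal sketch of the general LZ copy step, not code taken from the game; `copy_match` is a hypothetical helper name, and the distance is assumed to be a plain count of bytes back from the end of the output.

```python
def copy_match(out: bytearray, distance: int, length: int) -> None:
    """Append `length` bytes copied from `distance` bytes back in `out`."""
    if not 0 < distance <= len(out):
        raise ValueError("distance must point inside the data produced so far")
    for _ in range(length):
        # Copy one byte at a time so that bytes appended by this same
        # match can be read again when length > distance.
        out.append(out[-distance])


out = bytearray(b"ha ")
copy_match(out, distance=3, length=5)
print(out.decode("ascii"))  # ha ha ha
```

Copying a whole slice at once would not work here, because the last two bytes of the match do not exist yet when the copy starts.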


Saffith, thank you. This is a good start for me. Now I understand how things work here. How E2 D0 encodes the size and offset (which bits, and what the maximum size and offset are) is still unknown, but I will probably find that out by experimenting.

P.S. I found that the last 4 bits of E2 D0 are the size to copy minus 3, but the offset... if it is the remaining bits... the first link in the file already points to a 3.5k offset :/ The compressed stream starts at 0x18 in every such file.
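
For experimenting, the split described in this P.S. might look like the sketch below. Only the length rule (low 4 bits of the second byte, plus 3) comes from the finding above; how the remaining 12 bits are joined is not settled in this thread, so both orderings are returned as guesses. `split_pair` and the dictionary keys are made-up names.

```python
def split_pair(b0: int, b1: int) -> dict:
    """Split a two-byte pair into a length and two candidate 12-bit offsets."""
    length = (b1 & 0x0F) + 3  # reported above: low 4 bits hold length - 3
    high_nibble = b1 >> 4
    return {
        "length": length,
        # Assumption A: high nibble of the second byte is the top of the offset.
        "offset_b1_high": (high_nibble << 8) | b0,
        # Assumption B: the first byte is the top of the offset.
        "offset_b0_high": (b0 << 4) | high_nibble,
    }


for b0, b1 in ((0xE2, 0xD0), (0x3F, 0xA2)):
    print(f"{b0:02X} {b1:02X}", split_pair(b0, b1))
```

If the split really is 4 + 12 bits, lengths can range from 3 to 18 and the offset field from 0 to 4095, whichever way the 12 bits are joined.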


Speaking from experience, offsets in LZ derivatives are usually relative and not absolute. Moreover, they're counted based on the decompressed stream, not the compressed one. I would suggest making a RAM dump with the full uncompressed file and trying to deduce the addressing based on that. A savestate is usually enough for this purpose, unless it's compressed. I'm not sure if that's the case with PCSX2.


I did what you told me and summarized everything I know about this algorithm in an image (I analyzed the first 56 compressed bytes).

Compressed file: https://www.dropbox.com/s/bw53z2qp9qsw2n1/cdf3b.chaindata.lz?dl=1
Decompressed file from RAM: https://www.dropbox.com/s/hd0hvbnaevszh76/cdf3b.chaindata.lz_decompr?dl=1
I even made a decompressor that creates a "decompressed" file, but since the addressing is wrong the result is partly junk (only the single bytes are correct).

Maybe someone knows what kind of addressing is used here...


Assuming you know what the decompressed text should be, you could try figuring out the expected distance and working backward from there.

000FA640   68 61 73 20  FF 76 61 6E  69 73 68 65  64 F7 20 77  has .vanished. w
000FA650   69 60 01 74  20 61 20 DF  74 72 61 63  65 42 16 4D  i`.t a .traceB.M
000FA660   61 7F 79 62  65 20 69 66  20 4F 15 FF  20 77 65 72  a.ybe if O.. wer
000FA670   65 20 74 6F  FF 20 61 70  70 65 61 72  20 FF 69 6E  e to. appear .in
000FA680   20 74 68 65  20 73 FF 6B  69 65 73 20  6F 6E 63 F3   the s.kies onc.
000FA690   65 20 02 E3  41 17 49 74  20 6D FF 69  67 68 74 20  e ..A.It m.ight
000FA6A0   68 65 6C F7  70 20 75 A6  10 72 20 69  6E FF 20 61  hel.p u..r in. a

That looks like "usher" at the end there, and it's probably referring back to the "she" in "vanished". In that case, A61 should indicate the distance between those positions in the decompressed text.
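
Working backward from the known decompressed text can be automated. The sketch below lists every distance at which the bytes of a match already occur earlier in the output; `candidate_distances` is a hypothetical helper, and the sample string is made up for the example (it is not the real contents of the RAM dump).

```python
def candidate_distances(decompressed: bytes, pos: int, length: int) -> list[int]:
    """Distances back from `pos` at which the `length` bytes at `pos` already occur."""
    target = decompressed[pos:pos + length]
    distances = []
    start = decompressed.find(target)
    # Walk through every occurrence that begins before `pos`.
    while 0 <= start < pos:
        distances.append(pos - start)
        start = decompressed.find(target, start + 1)
    return distances


# Made-up sample text, only to show the call; use the real RAM dump instead.
text = b"has vanished without a trace. It might help usher in a"
pos = text.index(b"usher") + 1  # where the copied "she" would start
print(candidate_distances(text, pos, 3))
```

Comparing each candidate with the 12 bits left over in the pair is one way to check whether that field is a plain backward distance or something else.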