
Author Topic: Algorithm for best optimizing string compression?

Kafke

Algorithm for best optimizing string compression?
« on: December 18, 2015, 07:45:26 am »
I have what I think is a simple question; I'm just not good enough on the math-y side of programming to figure this one out. I have a string compression function that reduces repeating strings down into a compressed form. Something like: "ABABAB" -> 3AB. The decompression is really simple and results in a perfectly expanded product. However, I'm having a bit of a hard time getting the compression to... well... compress.

My first attempt was a naive approach: just starting with single repeating characters, then 2, then 3, then 4, etc., up to the maximum number of characters allowed. However, this still tends to produce suboptimal results (given the capped nature of the compression). An example would be:

AB AB AB ABC ABC

My current code would see AB AB AB AB and encode it as 4AB, then it'd see a C with nothing good after it and do 1C, and then it'd see the ABC and do 1ABC, resulting in the compressed string 4AB1C1ABC. However, it's fairly obvious to us humans that the better compression would be 3AB2ABC.

I was wondering if there was a better way of scanning the input to get a better compression.

In one example, simply decompressing an existing string and recompressing it with my code balloons it from 2KB to 6KB.

And I kinda lied, it's not really strings, but bytes for a particular part of my translation hack. But the principle is the same.

Here's a quick pastebin paste of my current rough code: http://pastebin.com/JebzabSZ

Basically I check the single-byte compression first, since it works a bit differently than the rest. From there I check the varying numbers of bytes that will be repeated, keep counts for each, and simply grab the one that works best at the end and use that. But as I mentioned, it's a naive approach and commonly leaves bad compression in parts.
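
In rough C++, the scan ends up looking something like this (a simplified sketch of the approach rather than the actual pastebin code; emit() and the caps are just stand-ins):

Code: [Select]
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <vector>

// Placeholder for writing a control byte plus the pattern bytes themselves.
static void emit(size_t patternLen, size_t repeats)
{
    std::printf("%zu x %zu-byte pattern\n", repeats, patternLen);
}

// Simplified greedy scan: at each position try pattern sizes 1..9, count how
// many back-to-back repeats each one gets, and lock in whichever choice covers
// the most input right here before moving on. This kind of greedy choice is
// what leads to splits like 4AB1C1ABC instead of 3AB2ABC.
void greedyCompress(const std::vector<uint8_t>& data)
{
    const size_t maxPattern = 9;                      // multi-byte patterns are 2-9 bytes
    size_t pos = 0;
    while (pos < data.size())
    {
        size_t bestLen = 1, bestReps = 1;
        for (size_t len = 1; len <= maxPattern && pos + len <= data.size(); ++len)
        {
            const size_t maxReps = (len == 1) ? 50 : 8;   // stand-in caps
            size_t reps = 1;
            while (reps < maxReps &&
                   pos + (reps + 1) * len <= data.size() &&
                   std::equal(data.begin() + pos, data.begin() + pos + len,
                              data.begin() + pos + reps * len))
                ++reps;
            if (reps * len > bestReps * bestLen)
            {
                bestLen  = len;
                bestReps = reps;
            }
        }
        emit(bestLen, bestReps);
        pos += bestLen * bestReps;
    }
}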

Thoughts? Has anyone else had to deal with something like this?

It's not too important, since the game is flexible on file sizes. But I have a feeling there's a hard limit on the total amount of stuff in general, so it's best to keep everything small. There are some other optimizations I can do as well, which I haven't tried yet, but that's unrelated to the question I'm posing here.

Thanks  :)

FAST6191

Re: Algorithm for best optimizing string compression?
« Reply #1 on: December 18, 2015, 09:36:08 am »
"However, it's fairly obvious to us humans that the correct compression"
That way madness lies. Obviously some kind of almost-AI will be the future of compression, but this is not that. Take what you can get, finesse things if you want/can, but don't go pitting your millions of years of pattern detection against it too hard.

Lossless compression, even something as primitive as the RLE/run-length encoding you seem to be playing with here, is all about tradeoffs: memory reads (are you doing arbitrary alignments or every 8 bits...), the type of data you aim to be compressing (strings* are different to standard binary data), the compression level desired, the CPU time available, what you are compressing on (your multi-GHz PC with gigabytes of memory and even more storage is rather different to having to do something in close to real time on a little old 80's embedded processor with no storage and RAM measured in kilobytes; normally you would not have to compress things in a game, but if you have custom names to save...), and such like.

*In standard ASCII you will stop at 7F, so you have an entire upper bit to play with; even more if you consider that you will probably not see many characters below 20 hex for most things.

Anyway "just starting with single repeating characters" is the wrong path. Start as big as your repeating value will allow and count backwards from there until you hit the point where it is not worth it as the compressed data is as long as the uncompressed*. You will probably not get many/any hits for text strings but graphics can see this. RLE is considered the precursor to the LZ aka sliding window family of compression algorithms so you can adopt something nicer from there as your scanning method if you like, assuming you are not going to run into an unaligned read penalty.

*this does assume you have an uncompressed flag option. If it is really primitive (or wants to avoid as many if else setups as possible) then you may have no uncompressed flag and have to compress it anyway.
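
To put the high-to-low idea in code, here is a rough sketch (assuming, purely for the example, that an emitted run costs one control byte plus the pattern bytes, so a run only pays off when it replaces more than that):

Code: [Select]
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Sketch of the "start big and count down" idea: at 'pos', try the longest
// allowed pattern size first and only fall through to shorter ones.
// Assumes each emitted run costs 1 control byte + the pattern itself,
// so a run is only worth it if repeats * len > len + 1.
size_t longestWorthwhileRun(const std::vector<uint8_t>& data, size_t pos,
                            size_t maxPattern, size_t& repsOut)
{
    for (size_t len = maxPattern; len >= 1; --len)
    {
        if (pos + 2 * len > data.size())
            continue;                       // no room for even one repeat
        size_t reps = 1;
        while (pos + (reps + 1) * len <= data.size() &&
               std::equal(data.begin() + pos, data.begin() + pos + len,
                          data.begin() + pos + reps * len))
            ++reps;
        if (reps * len > len + 1)           // beats storing it uncompressed
        {
            repsOut = reps;
            return len;
        }
    }
    repsOut = 0;
    return 0;                               // nothing worth compressing here
}

The caller would emit the run when this returns non-zero, and otherwise fall back to a raw/uncompressed span (or compress anyway if the format has no uncompressed flag, as per the footnote above).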

Kafke

Re: Algorithm for best optimizing string compression?
« Reply #2 on: December 18, 2015, 04:28:21 pm »
Yea, I know humans have had a lot of time to evolve to get this right.

Thanks for all the info. And to expand a bit more: the images DO have a compressed flag, as well as an option to ignore a certain number of bytes. As the code implies, one byte can be looped many times (up to 50 or so), while multiple bytes (2-9) can only be looped up to 8 times each. I don't see the point in starting with the higher amount, as the lower amounts can often lead to further compression, even if it is possible to compress the higher ones (see: AB AB AB AB -> 2AB2AB vs 4AB).

"Start as big as your repeating value will allow and count backwards from there until you hit the point where it is not worth it as the compressed data is as long as the uncompressed*. "

Yea, I was definitely going to add a comparison to the uncompressed version. But why high-to-low instead of low-to-high? Due to the nature of the data, most of it is either going to be a single byte looped many times, or 2/4 bytes looped several times. Or do you mean just for run-time's sake? The run-time is fine and largely a non-issue at this point.

"RLE is considered the precursor to the LZ aka sliding window family of compression algorithms so you can adopt something nicer from there as your scanning method if you like, assuming you are not going to run into an unaligned read penalty."

I'm not entirely sure what you mean by 'unaligned read penalty', but I'll definitely check out how LZ does things.

"normally you would not have to compress things in a game but if you have custom names to save..) "

Yea, it's a bit weird. The image format is a proprietary container that can hold a few different types of img/rgb/color data, and there's a byte to say whether it's compressed in the manner I described. After decompressing, the image can be loaded using one of the known formats. Leaving it decompressed makes the file much larger, potentially causing errors (either from not fitting on the ISO, or perhaps some other issue I'm not aware of). If it wasn't already compressed to begin with, I wouldn't even bother compressing it.

"*in standard ASCII you will stop at 7F so you have an entire upper bit to play with, even more if you consider that you will probably not see many characters below 20 hex for most things)."

7F is the cap for ignoring bytes/leaving the marked amount uncompressed. Then 80-B0 for single byte repetition, and B0-FF for multi-byte repetition.

It's worth noting, again, this isn't really for text, but for images. But I figured the theory should be the same, since it's all just bytes.

FAST6191

Re: Algorithm for best optimizing string compression?
« Reply #3 on: December 18, 2015, 05:36:10 pm »
There are plenty of protocols and encodings that differentiate between text and straight binary: FTP, which has separate text and binary modes, and yEnc encoding (commonly seen in usenet binary posts, usenet being a text-based system; it replaced older methods like base64) are two good examples.

Statistically a long repeat gains more compression than a shorter one. It probably won't happen often, and you would probably get acceptable results starting sooner, but even then high-to-low is still the better method. Or, if you prefer, run the method yourself: if you had started looking for patterns of three rather than two, your ABC would have been caught before the AB stuff.

Aligned reads. You can obtain the contents of any bit in memory if you want, but in most systems this is not going to happen without fiddling (grab it, mask it, shift it...), and in some newer systems you have tradeoffs -- DMA might only read at 16-bit alignments while some instructions can read at 8-bit, and if "all DMA, all the time" is a good plan then there is no need to skip around every 8 bits when you are only going to be reading on 16-bit boundaries to help the DMA out. By the same token you could theoretically find that if you started at bit 47 you have a massive repeating pattern, but one that is not present in the usual multiple-of-2 read.
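
To make the "grab it, mask it, shift it" point concrete, here is a generic bit reader sketch (nothing platform specific, just an illustration of the extra work versus a plain byte-aligned read):

Code: [Select]
#include <cstddef>
#include <cstdint>

// Illustration of the "grab it, mask it, shift it" cost of reading a value
// that starts at an arbitrary bit offset instead of on a byte boundary.
// Reads 'count' bits (up to 32) starting at absolute bit position 'bitPos',
// most significant bit first.
uint32_t readBits(const uint8_t* buf, size_t bitPos, unsigned count)
{
    uint32_t result = 0;
    for (unsigned i = 0; i < count; ++i, ++bitPos)
    {
        uint8_t byte = buf[bitPos >> 3];                  // grab the byte
        uint8_t bit  = (byte >> (7 - (bitPos & 7))) & 1;  // mask + shift
        result = (result << 1) | bit;
    }
    return result;
}
// A byte-aligned read, by contrast, is just buf[pos] -- no masking or shifting.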

"I was definitely going to add a comparison to the uncompressed version"
It should be simple maths -- what is the length of the uncompressed data, and what is the length of the compression command plus the payload? If the uncompressed version is the same length or shorter, then leave it uncompressed unless you have an argument not to. There are various arguments not to for some implementations, but they do not apply here. Likewise, once you know that, you can do things to gain some more numbers if you have not already --

For this you have 0 to 9 as the compression length and A to Z as the data
0A means nothing
1A is longer than just putting A
2A is the same as putting AA

To that end you can start at 3, so 0A means you have three lots of A. If you look up most implementations of LZ you will see this, certainly for things like the GBA and DS and the LZ methods their BIOSes pack.
A bit of a poor example, as it does not handle/consider no-compression flags and is still dealing in single-character payloads, but hopefully the reasoning shines through.
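
As a sketch of that bias trick (the field width here is made up, just to show the idea):

Code: [Select]
#include <cassert>
#include <cstdint>

// The bias trick described above: since runs of 0, 1 or 2 copies never gain
// anything, store (count - 3) in the length field so the smallest stored
// value already means "3 copies". With a 4-bit field that stretches the
// range from 0-15 copies to 3-18 copies. (Field width is just an example.)
uint8_t encodeRunLength(unsigned count)   { assert(count >= 3); return uint8_t(count - 3); }
unsigned decodeRunLength(uint8_t stored)  { return stored + 3u; }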
 

Disch

Re: Algorithm for best optimizing string compression?
« Reply #4 on: January 26, 2016, 12:57:34 pm »
FWIW, someone on the nesdev forums did a comparison of different text compression algorithms and found that a recursive DTE algorithm (where DTE codes can contain other DTE codes) actually worked best for most English text.... outperforming Huffman, and even LZ.

I forget who it was exactly, and I didn't verify his numbers, but he was someone I trusted on that site and I don't think he would have lied.

Though ultimately, if you want to know for sure, you should run your own tests.

RyanfaeScotland

Re: Algorithm for best optimizing string compression?
« Reply #5 on: January 26, 2016, 06:36:30 pm »
I'm a little worried here that you've dived right into describing your own algorithm and haven't explained why you've written off any of the existing compression algorithms. You aren't accidentally trying to re-invent the wheel, are you!?

I don't have any recommendations myself, but I gave it a little Google, and since it turned up a StackOverflow result I had a read through that. You've probably done the same yourself, but I thought I'd best mention it just in case! Here is the link if you are interested: http://stackoverflow.com/questions/1138345/best-compression-algorithm-for-short-text-strings -- it has quite a few suggestions.

Of course if you are 'reinventing the wheel' and looking to give it a whirl and see what you can come up with then that's pretty cool too.

henke37

Re: Algorithm for best optimizing string compression?
« Reply #6 on: January 27, 2016, 02:31:53 am »
"Recursive DTE" is complicated for variable length tokens. That is roughly the same as huffman, with huffman being able to have non fixed size characters.

Bregalad

Re: Algorithm for best optimizing string compression?
« Reply #7 on: January 27, 2016, 03:24:48 am »
"Recursive DTE" is complicated for variable length tokens. That is roughly the same as huffman, with huffman being able to have non fixed size characters.
No, you're completely, 100% wrong. It is probably the 3rd simplest compression scheme after RLE and non-recursive DTE, and it has absolutely nothing to do with Huffman.

Quote
I forget who it was exactly, and I didn't verify his numbers, but he was someone I trusted on that site and I don't think he would have lied.
Wow, I didn't know I was that reliable! I did this test compressing two English books, but that doesn't mean it will work best on all English text. Nevertheless it's a good indication. The simplicity of the algorithm makes it even more worth considering.

Disch

Re: Algorithm for best optimizing string compression?
« Reply #8 on: January 27, 2016, 04:23:04 pm »
DTE is easy to encode/decode once you have the DTE table -- but generating that table is definitely the hardest part.  I'd even say harder than generating the Huffman table.  Huffman just has to look at character frequency -- it's a straightforward linear search and tally.  With DTE you have to look at frequency of pairs recursively.  I can't immediately think of an easy way to do that in a way which produces an optimal table.

That said... once you have the table -- actually compressing/decompressing Huffman is a nightmare with all the bitpacking and unpacking you have to do.  DTE (even recursive) is incredibly simple.
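
To show how simple the decode side is, here is a rough C++ sketch of a recursive DTE decoder (not any particular game's format; it just assumes literals below 0x80 and pair codes at 0x80 and above):

Code: [Select]
#include <cstdint>
#include <string>
#include <vector>

// Minimal recursive-DTE decoder sketch. Codes 0x00-0x7F are literal
// characters; codes 0x80-0xFF index a table of byte pairs, and either half
// of a pair may itself be another DTE code.
std::string decodeDTE(const std::vector<uint8_t>& input,
                      const uint8_t pairA[128], const uint8_t pairB[128])
{
    std::string out;
    std::vector<uint8_t> stack;
    for (uint8_t code : input)
    {
        stack.push_back(code);
        while (!stack.empty())
        {
            uint8_t c = stack.back();
            stack.pop_back();
            if (c < 0x80)
                out.push_back(char(c));           // literal
            else
            {
                // push the second half first so the first half comes out first
                stack.push_back(pairB[c - 0x80]);
                stack.push_back(pairA[c - 0x80]);
            }
        }
    }
    return out;
}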


EDIT:  Hahaha, so it WAS you Bregalad?  I thought it might be but I wasn't really sure.  And yeah I trust you enough to say that you wouldn't have just lied about testing compression algorithms.  :beer:

Bregalad

Re: Algorithm for best optimizing string compression?
« Reply #9 on: January 27, 2016, 05:04:27 pm »
Quote
With DTE you have to look at frequency of pairs recursively.
Look at CompressTools' source code. It's a 256x256-entry array which counts the frequencies. Whenever a byte pair is substituted with a single byte, the table is updated. When recursion is disabled, any pair already containing a DTE code doesn't count and keeps a frequency of 0; when recursion is allowed, the update is full. It's really that simple.

Disch

Re: Algorithm for best optimizing string compression?
« Reply #10 on: January 31, 2016, 01:26:14 pm »
Late reply:

I haven't looked at CompressTools' source yet -- I find that having the algorithm explained is easier than trying to decipher it from source.

The thing that gets me is that... a 256x257 entry array would work fine for DTE... but not for recursive DTE because there are far more than 256x257 possibilities for entries.  An entry in the DTE table could be 7 bytes long... so there's ~256^7 possibilities.

The only way I can see doing this effectively would be with multiple passes.  Here's the first thought that came to mind, but I'm not sure this will even produce an optimal table:

1st pass:
- Tally frequency of 1-byte entries  (= N)
- Tally frequency of 2-byte entries
- Make note of how many 1-byte entries are required in N, effectively giving you D=256-N possible DTE entries.
- Take the D most frequent 2-byte entries and use them, dropping the rest

2nd pass:
- Tally frequency of 3-byte entries, given the 1- and 2-byte entries.
- Sort 3-byte entries in order of decreasing frequency
- For each 3-byte entry:
--  If it is more frequent than the least frequent 2-byte entry, and that 2-byte entry is not used in any previous 3-byte entries, replace that 2-byte with this 3-byte.


Nth pass:
- Do same as 2nd pass but increasing the number of bytes each time.
- Repeat until there are no substitutions.




As you can see this is a mess.

Bregalad

Re: Algorithm for best optimizing string compression?
« Reply #11 on: January 31, 2016, 03:33:48 pm »
Quote
I haven't looked at CompressTools' source yet -- I find that having the algorithm explained is easier than trying to decipher it from source.
All of CompressTools' source code is extensively and boldly commented, and at the top of each algorithm there is a long explanation of how the algorithm works, so it's not like you'd have to reverse-engineer anything here.

Wow, what you're describing is overly complicated, much more so than how I compress using recursive DTE. It is a little like how I had to implement dictionary compression, which is surprisingly hard to do optimally in finite time. I guess I do something that is not optimal but pretty close, and it sometimes takes very long, but it is a lot faster since I rewrote CompressTools in C++. I had to cheat a bit with the code to speed things up by overwriting some hash array functions, so it would not look up the same hash multiple times in a row.

Back on topic: a 256x256 table works because the DTEs are included in the table. I do not care how long the replacement sequence is, because I do not care how many times the recursion happens: for me, all DTEs are byte pairs, no matter whether they point to another DTE or not. I do not know why you say 256x257; I do not treat the end character as a symbol or anything -- when the end of a block is reached, it just doesn't count.

I only search for byte pairs, and do not take into account strings longer than that.

There are multiple passes: one for each potential DTE code. So for instance, if you use characters 0x00-0x7f directly and reserve 0x80-0xff for byte pairs, there will be 128 passes. This still compresses fairly fast overall, because each pass is very simple. Dictionary encoding, on the other hand, has many passes *and* the passes are complicated, which is why it is so slow, and why I put it at the end so the user can cancel it before it completes without skipping any other algorithm.
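
For anyone following along, here is a generic sketch of that pass structure in C++ (this is not CompressTools' actual code, and it recounts the pair table from scratch each pass instead of updating it incrementally):

Code: [Select]
#include <cstddef>
#include <cstdint>
#include <vector>

struct DTEPair { uint8_t a, b; };

// Generic sketch of the pass structure described above. Bytes 0x00-0x7F are
// plain text, and codes 0x80-0xFF are assigned one per pass to the currently
// most frequent adjacent byte pair. Because previously assigned codes sit in
// 'data' like any other byte, pairs of pairs are picked up automatically,
// which is what makes the result "recursive" DTE.
std::vector<DTEPair> buildRecursiveDTE(std::vector<uint8_t>& data)
{
    std::vector<DTEPair> table;
    for (unsigned code = 0x80; code <= 0xFF; ++code)
    {
        std::vector<uint32_t> freq(256 * 256, 0);     // the 256x256 pair counts
        for (size_t i = 0; i + 1 < data.size(); ++i)
            ++freq[data[i] * 256u + data[i + 1]];

        unsigned best = 0;
        for (unsigned pair = 1; pair < 256 * 256; ++pair)
            if (freq[pair] > freq[best])
                best = pair;
        if (freq[best] < 2)
            break;                                    // nothing left worth a code

        const uint8_t a = uint8_t(best / 256), b = uint8_t(best % 256);
        table.push_back({a, b});

        // substitute the pair with the new code, compacting in place
        size_t dst = 0;
        for (size_t src = 0; src < data.size(); ++src, ++dst)
        {
            if (src + 1 < data.size() && data[src] == a && data[src + 1] == b)
            {
                data[dst] = uint8_t(code);
                ++src;                                // consumed two input bytes
            }
            else
                data[dst] = data[src];
        }
        data.resize(dst);
    }
    return table;                                     // entry i decodes code 0x80 + i
}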

Disch

Re: Algorithm for best optimizing string compression?
« Reply #12 on: January 31, 2016, 04:12:50 pm »
Okay I see what you're doing.  So you're doing the compression while you build the table, which makes perfect sense now that I see it.  For some reason I was thinking you would have to build the table first, then use it to compress the text only after it was built.  But yeah that's ridiculous.   :P

I was thinking you'd need 256x257 because you'd need a set of 256 to count for individual characters, but yeah that doesn't make much sense in hindsight.

Quote
I guess I do something that is not optimal, but pretty close, and it takes sometimes very long

I see you using vector::erase in your code, which is a brutal killer.  Whenever you erase from the middle of the vector, everything after that position has to be shifted down.  You could probably greatly speed that up by removing that.

Code: [Select]
// You're doing this:
for(size_t i = 0; i != data.size(); ++i)
{
    if( data[i] == something )
    {
        data[i] = whatever;
        data.erase( data.begin() + i + 1 );
    }
}

// This would be faster:
size_t dst = 0;  // dst always <= src
for(size_t src = 0; src < data.size(); ++src, ++dst)
{
    if( data[src] == something )
    {
        data[dst] = whatever;
        ++src;                  // skip next src byte
    }
    else
        data[dst] = data[src];
}
data.resize( dst );  // trim to the number of bytes actually written

Bregalad

Re: Algorithm for best optimizing string compression?
« Reply #13 on: January 31, 2016, 04:49:30 pm »
Well, there is absolutely no need for the compression algorithm to be fast or optimal -- it runs on the host PC, with all the power and memory of a modern PC at its disposal. The decompression code is the part that runs on a game console, and that's where performance is limited.

If it runs awfully slowly on your PC, then I'd consider optimizing the code, but otherwise I wouldn't care as long as it works and the code is clear to readers.

henke37

Re: Algorithm for best optimizing string compression?
« Reply #14 on: January 31, 2016, 05:52:29 pm »
I must've been unclear in my previous post. The problem, in my opinion, is the ill-fitting term "DTE", or "double tile encoding" as it is supposed to stand for (still ill-fitting). That's just a code book: find long strings and replace them with shorter strings, called marks for this discussion.

Now comes the "recursion". It's context sensitive marks for the code book. Instead of using one code book for everything, the code book is switched depending on the previous characters in the string.

Yes, it can be defined using recursion. But the same result can also be gotten by using longer marks and even longer strings.

Now comes an ambiguity. Do the early parts of such a recursive mark output strings? If they do, then a non-context-sensitive code book will need to duplicate data in the code book. A little annoying, but tolerable.

Now for the link to Huffman. Huffman is a code book compression where the length of each mark is adapted to the frequency of the corresponding string. Yes, Huffman is often used for code books with fixed-length strings, but it doesn't have to be. Huffman has marks that are not a whole number of characters long. You can get to Huffman from a normal character-based code book just by dropping the requirement that marks are characters; instead, a mark is a prefix-free bit string.

Disch

Re: Algorithm for best optimizing string compression?
« Reply #15 on: January 31, 2016, 07:47:54 pm »
henke:  Your description of Recursive DTE does not sound right -- I think you might be overthinking it.  There's only one code book, and it's no more complicated than a table with N entries, with 2 marks per entry.

Code: [Select]
; 1 char = 1 byte base-case values
40=t
41=e
42=s
43=r

; DTE without recursion (values must be 1-entry base cases)
80=[40][41]  (te)
81=[42][40]  (st)
82=[41][43]  (er)

; DTE with recursion (values can be other DTE values)
83=[80][81]  (test)
84=[83][82]  (tester)

Decompression for this is very easy and probably faster than many other text compression algorithms.  The fixed code book (or table) size and no bitpacking makes it very easy to work with.

The bitpacking and variable bit length of marks is enough to make Huffman much more difficult to work with (at least when decompressing).
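
For contrast, here is roughly what the inner loop of a tree-walking Huffman decoder looks like (a generic sketch with a made-up node layout; the point is the per-bit extraction every output byte needs):

Code: [Select]
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical node layout: left/right child indices and a leaf symbol.
struct HuffNode { int16_t left, right; uint8_t symbol; bool leaf; };

std::string decodeHuffman(const std::vector<uint8_t>& bits,
                          const std::vector<HuffNode>& tree,
                          size_t symbolCount)
{
    std::string out;
    size_t bitPos = 0;
    while (out.size() < symbolCount)
    {
        int node = 0;                                     // start at the root
        while (!tree[node].leaf)
        {
            // pull one bit out of the packed stream, MSB first
            uint8_t bit = (bits[bitPos >> 3] >> (7 - (bitPos & 7))) & 1;
            ++bitPos;
            node = bit ? tree[node].right : tree[node].left;
        }
        out.push_back(char(tree[node].symbol));
    }
    return out;
}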




EDIT:

I was bored.  Here's a 6502 recursive DTE decompression routine.  Untested.  See how easy it is, though?  I was even kind and allowed for the use of the termination char to be in the DTE table.  If you remove that ability, this code could be further simplified.

Code: [Select]
; macro for 16-bit increment
.define inc16 v
    inc v
    bne :+
        inc v+1
    :
.end define


; Read compressed text from 'src' pointer
; Write decompressed text to 'dst' pointer
; Use 2 tables, 'DTE_a' and 'DTE_b', each with $80 bytes
;  representing the first (a) and second (b) byte in the DTE pair
;
; Use temp RAM 'temp' as counter
; src/dst pointers are modified
;
; assumes $00 is termination character
; assumes $01-7F are single-byte output
; assumes $80-FF are DTE codes


Decompress:

    .define out
        sta (dst),Y
        inc16 dst
    .end define
   
    ;;;;;;;;;;;
   
    ldy #0              ; zero both Y and temp
    sty temp
   
    ; get byte from source
    @loop:
        lda (src),Y
        beq @exit           ; termination char = exit
        bmi @dte            ; DTE char = jump ahead
        out                 ; otherwise, just output
    @nextloop:
        inc16 src
        jmp @loop
               
    ; handle DTE code
    @dte:
        tax
        lda DTE_b, X        ; push the 'B' code
        pha                 ; and inc temp to keep track of how many we've pushed
        inc temp
       
        lda DTE_a, X        ; get 'A' code
        beq @popandexit     ; exit if termination char
        bmi @dte            ; loop if dte code
       
      @dteout:
        out                 ; otherwise, output it
       
        lda temp
        beq @nextloop       ; output all dte codes, return to main loop
       
        dec temp            ; otherwise, pop one and process it
        pla
        beq @popandexit
        bmi @dte
        bpl @dteout         ; (always jumps)
       
    @popandexit:
        ldx temp
        beq @exit
          : pla
            dex
            bne :-
    @exit:
        rts
« Last Edit: January 31, 2016, 08:19:44 pm by Disch »

Bregalad

Re: Algorithm for best optimizing string compression?
« Reply #16 on: February 01, 2016, 03:53:10 am »
Quote
Yes, it can be defined using recursion. But the same result can also be gotten by using longer marks and even longer strings.
Yes, "recursive DTE" is the same as dictionary compression with the following differences:
  • Dictionary entries can reference other dictionary entries
  • The length is always 2, so the length of each word does not have to be stored

Because the dictionary can re-use itself, and because no length table has to be used, the compression is typically more effective, as counter-intuitive as it may seem. (Now, there are probably cases where dictionary compression will be more efficient, but for the texts I tested my algorithms on, recursive DTE beat all the other algorithms.)

Revenant

Re: Algorithm for best optimizing string compression?
« Reply #17 on: February 03, 2016, 07:07:51 pm »
Yes, "recursive DTE" is the same as dictionary compression with the following differences:
  • Dictionary entries can reference other dictionary entries
  • The length is always 2, so the length of each word do not have to be stored

Is this functionally different from what's usually known as byte-pair encoding?

Disch

Re: Algorithm for best optimizing string compression?
« Reply #18 on: February 03, 2016, 08:17:59 pm »
Nope... it looks like exactly the same thing.

"byte pair encoding" might be the official name for it.  "DTE" comes from really, really old hacker lingo, that was probably invented by someone who was unfamiliar with existing compression schemes.  I guess that name just stick with me, since that's what I heard of first.

Revenant

Re: Algorithm for best optimizing string compression?
« Reply #19 on: February 04, 2016, 12:00:35 am »
I definitely knew it as DTE originally, when used non-recursively.

I'm surprised Wikipedia claims byte-pair/digram encoding wasn't publicly described until 1994, given how simple it is. In fact, I've personally found one obscure example that is dated 1993 (and I decompressed all of it by hand, which was rather interesting to slowly watch, given the text). I'm curious as to what the real earliest example of it in the wild is.
« Last Edit: February 04, 2016, 12:05:56 am by Revenant »