
Algorithms for optimum dictionary compression

Started by DougRPG, September 04, 2013, 12:00:28 PM


DougRPG

Hi, I want to compress a script with dictionary compression, and I'm trying to come up with a good algorithm that finds the optimum solution.

First, I want the string sizes in the dictionary to range from 1 to n, so a string in the dictionary can have a maximum of n characters.
Second, I want the dictionary to have at most 256 entries.

So I need to find the 256 strings that give me the maximum compression.

My first idea is the following:

1-) Find the 256 most common strings for each size (1 to n).
The worst algorithm here is to run through the script 256 times for each size (n). There is probably a better way!

2-) With this information in hand, we need to find the 256 global strings that give the best compression for the whole script.
My first idea is to run through the strings of each size, starting with the most common ones for each size, and sort the strings by compression ratio. To calculate this ratio we can run through the script adding the string's size for each occurrence of that string, which gives us the total number of bytes the string occupies in the whole script.
We can stop sorting the strings of some size, let's say k, once their positions after the sort are higher than 256. So once the strings of all sizes start to land at positions higher than 256, the algorithm is finished.

3-) After finding the 256 best strings, we can create the dictionary.

As we can see, this method needs to run through the whole script a LOT of times.
Does anyone know of any documents, or have ideas for a better algorithm?
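
To make step 1 concrete, here is a brute-force sketch of the counting part (Python just for illustration; script, max_len and top are placeholder names):

from collections import Counter

def most_common_strings(script, max_len, top=256):
    # Step 1, brute force: count every substring of length 1..max_len
    # and keep the `top` most frequent ones per length.
    best = {}
    for size in range(1, max_len + 1):
        counts = Counter(script[i:i + size] for i in range(len(script) - size + 1))
        best[size] = counts.most_common(top)  # list of (string, occurrences) pairs
    return best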



I was thinking about another method where I put one control byte before every eight bytes, where each bit of this control byte tells whether the corresponding byte is a plain character or a dictionary entry. But with this method we need a different algorithm to calculate the dictionary strings, and it seems much more complex to find the optimum solution.
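
To illustrate the layout I have in mind, a rough decoding sketch (assuming the compressed data is a byte array and the dictionary is a table of up to 256 strings; all names are placeholders):

def decode(data, dictionary):
    # One control byte, then up to eight tokens; bit i set means token i is an
    # index into the dictionary, bit i clear means it is a literal character.
    out = []
    pos = 0
    while pos < len(data):
        flags = data[pos]
        pos += 1
        for bit in range(8):
            if pos >= len(data):
                break
            token = data[pos]
            pos += 1
            if flags & (1 << bit):
                out.append(dictionary[token])  # dictionary string
            else:
                out.append(chr(token))         # plain literal
    return "".join(out)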


If someone has experience with this stuff, please share here.


PS: I started thinking about this stuff today, so sorry if it seems a silly question.

Bregalad

Hi. What you describe sounds close to the StaticDictionary compression I use in my CompressTools.

The algorithms are all fully documented, including this one, so feel free to take a look at the sources.
What you said won't work, for several reasons:
1) You should not replace the most common string, but the one that will grant the greatest saving in bytes. This is completely different!
2) You forgot that your string matches will not be independent. For example, if "hello" is present 15 times, then "hell" is present at the very least 15 times too.
3) I think it's silly to think you'll have 256 dictionary entries; you'll need room for literals which are not in the dictionary.

DougRPG

Thank you for the reply, I'll take a look at your CompressTools.

Quote1) You should not replace the most common string, but the one that will grant the greatest saving in bytes. This is completely different!

That is what I said in my post. I want the 256 global strings that give the best compression. So my first idea is to find the most common strings for each size, and after that sort all of these strings by how many bytes they save (the ones that save the most bytes first). So a string of size 3 can save much more space than a string of size 10, if (occurrences_size3 x 3) is higher than (occurrences_size10 x 10).
Maybe my description was not very well written.

Quote2) You forgot that your string matches will not be independent. For example, if "hello" is present 15 times, then "hell" is present at the very least 15 times too.

Yes, my approach is far from the best solution, and the worst case can have a very poor compression ratio. Even if I find the best strings as described above and start replacing them with the corresponding dictionary entries, there will be a lot of conflicts, and the result will be far from the best compression.
I could restrict the selection to strings that don't conflict with the previously chosen ones, but then I would need to check all possible combinations to see which one is the optimum set of strings, which is impractical.

What was your approach? I'll study your tool, but can you describe at a high level what you did?

Quote3) I think it's silly to think you'll have 256 dictionary entries; you'll need room for literals which are not in the dictionary.

I'm thinking of a generic algorithm, focused only on the dictionary; for practical uses I can choose some other, smaller number.
Another solution is to add the control byte like I said in my first post. That way I know whether I have a literal or a dictionary entry, so I can have 256 literals and 256 dictionary entries. The control byte adds a lot of extra information to the compressed data, but it can be better than having a smaller dictionary in some cases (I'm guessing).

Nightcrawler

I'd suggest taking a look at Script Crunch for some ideas and testing.
TransCorp - Over 20 years of community dedication.
Dual Orb 2, Wozz, Emerald Dragon, Tenshi No Uta, Glory of Heracles IV SFC/SNES Translations

Klarth

I'll save him some time and just say how ScriptCrunch works.

1. Determine a MinLength and MaxLength value for words.
2. Scan the text for every substring from MinLength to MaxLength and sum their frequencies.
3. Calculate which match would save the most room.
4. Save the optimal match.
5. Delete the optimal match from the text analysis to prevent overlap (I split the text into a bunch of smaller vectors to keep proper boundaries).
6. Repeat from step 2.

This is a non-optimal, greedy heuristic.  I'm not sure whether a proper algorithm with reasonable time complexity exists for this.  Also, rescanning the text makes the heuristic take several minutes to build the entire dictionary.  You can write code to "adjust" the frequencies instead of deleting and rescanning, which makes it run much more quickly; I wrote some pseudocode for that on paper but never implemented it.
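
In rough Python pseudocode it looks like this (not the actual ScriptCrunch source, just a sketch of the steps above; min_len, max_len and code_len are placeholder names):

from collections import Counter

def build_dictionary(text, min_len, max_len, entries=256, code_len=1):
    segments = [text]            # smaller pieces keep proper boundaries
    dictionary = []
    for _ in range(entries):
        # Step 2: rescan everything and sum substring frequencies (slow but simple).
        counts = Counter()
        for seg in segments:
            for size in range(min_len, max_len + 1):
                for i in range(len(seg) - size + 1):
                    counts[seg[i:i + size]] += 1
        if not counts:
            break
        # Steps 3-4: pick the match that saves the most room.
        best = max(counts, key=lambda s: (len(s) - code_len) * counts[s])
        if (len(best) - code_len) * counts[best] <= 0:
            break
        dictionary.append(best)
        # Step 5: delete the match by splitting every segment on it.
        segments = [piece for seg in segments for piece in seg.split(best)]
    return dictionary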

Note: I just noticed you said dictionary, not substring.  Same method applies, except your scan is much quicker and simpler to do.

henke37


// total bytes occupied by every occurrence of str in the script
function spaceSavedForString(str:String, freq:uint):uint {
    return str.length * freq;
}

Klarth

That's a bit off.  The space saved is:
(StringLength - HexLength) * Freq, where HexLength is the number of bytes required (usually 2) to encode the dictionary/substring entry.
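
Or, as a one-liner sketch of that formula (hex_len standing for the encoded entry size above):

def space_saved(s, freq, hex_len=2):
    # bytes saved by replacing every occurrence of s with a hex_len-byte code
    return (len(s) - hex_len) * freq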

Bregalad

QuoteWhat was your approach? I'll study your tool, but can you describe at a high level what you did?
Of course.
What I do is that once a word is compressed, its entry is added to the dictionary. Then there are two variants, depending on whether you want the dictionary to be recursive (a "word" can contain other dictionary references) or non-recursive (a "word" is entirely made of literals).

Let's start with the recursive variant, as it's the simplest in concept.
The string occurrence table is then cleared, and the whole process of counting string occurrences starts again. As this is a very slow process which has to be repeated N times (where N is the total number of entries), it is of course a very slow encoding.

Now let's talk about the non-recursive variant.
There are probably other ways to do it, but I wanted to do it in an optimal way, even if it's slow.
So again the string occurrence table is cleared and the whole process of counting string occurrences restarts, but this time any string that contains a dictionary reference is automatically skipped.
This is even slower than the recursive variant.
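
Very roughly, the recounting step looks like this (not the actual CompressTools code, just a Python illustration where a single placeholder byte stands in for any word that has already been compressed):

from collections import Counter

PLACEHOLDER = "\x00"  # marks text already replaced by a dictionary reference

def count_occurrences(text, min_len, max_len, recursive):
    # Recount string occurrences after each dictionary insertion.
    # Non-recursive variant: skip any candidate that would overlap an
    # existing dictionary reference (i.e. contains the placeholder).
    counts = Counter()
    for size in range(min_len, max_len + 1):
        for i in range(len(text) - size + 1):
            candidate = text[i:i + size]
            if not recursive and PLACEHOLDER in candidate:
                continue
            counts[candidate] += 1
    return counts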