« on: September 04, 2013, 12:00:28 pm »
Hi, I want to compress a script with dictionary compression, and I'm trying to implement a good algorithm for the optimum solution.
First, I want that the string sizes in the dictionary can goes from 1 to n, so a string in the dictionary can have a maximum of n characters.
Second, I want that the dictionary has only 256 entries (maximum).
So I need to find the 256 strings that gives me the maximum compression.
My fist idea is the following:
1-)Find the 256 most common strings for each size (1 to n).
The worst algorithm here is to run through the script 256 times for each size (n). Probably there is a better way!
2-)With this information in hands we need to find the 256 global strings that gives the best compression for the whole script.
My first idea is to run through the strings of each size, starting with the most common for each size, and starting to sort the strings based on the compression ratio. To calculate this ratio we can run through the script adding the string size for each occurrence of this string, which gives us the total size in bytes this string has in the whole script.
We can stop to sort the strings for some string size, let's say k, if the position after the sort is higher than 256. So after the strings for all sizes starts to have a position higher than 256 the algorithm is finished.
3-)After finding the 256 best strings, we can create the dictionary.
As we can see this method needs to run through the whole script a LOT of times.
So someone knows some documents, or have some ideas for a better algorithm?
I was thinking about another method where I put one control byte each eight bytes, where each bit of this control byte tells if we have a common char or a dictionary entry. But using this method we need a different algorithm to calculate the dictionary strings, and seems much more complex to find the optimum solution.
If someone has experience with this stuff, please share here.
PS: I started thinking about this stuff today, so sorry if it seems a silly question.