New Compression Scheme

Started by FinS, September 07, 2011, 08:43:07 PM

FinS

I have an idea for an improvement to LZSS. It would operate by pointers instead of bit flags, but it could still fall back to the bit flags when there are 2 or more compressed packets in the following 8 entries, or when there are 127 uncompressed bytes followed by 1 or more compressed packets in the next 8 entries. It could save as much as 5% of the space currently wasted on flags, with little or no loss.

Basically you start with the pointer, a value from 1 to 128 that tells you how many uncompressed bytes to write; if the pointer byte is 0x00, the next 128 bytes are uncompressed. Bit 7 of the pointer is a flag telling whether the literals are followed by an 8-bit flag byte like the one used in LZSS. If bit 7 is not set, the literals are instead followed by a single 2-byte control code that operates the same as a normal LZSS control code. Once the pointer, flag bits, and control codes have completed all their operations, the stream continues with another pointer.
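To make this concrete, here is a rough decoder sketch in Python. The parts not pinned down above are filled in as assumptions: a common LZSS control-code layout (12-bit backward distance, 4-bit length minus 3) and least-significant-bit-first order within the flag byte.

def decode_fins(data: bytes) -> bytes:
    out = bytearray()
    i = 0

    def copy_from_control():
        # Assumed control-code layout: 12-bit backward distance in the
        # high nibbles, (length - 3) in the low nibble.
        nonlocal i
        c1, c2 = data[i], data[i + 1]
        i += 2
        dist = (c1 << 4) | (c2 >> 4)
        length = (c2 & 0x0F) + 3
        for _ in range(length):            # byte-wise so overlapping copies work
            out.append(out[-dist])

    while i < len(data):
        ptr = data[i]; i += 1
        count = (ptr & 0x7F) or 128        # low 7 bits = literal count, 0 means 128
        out += data[i:i + count]; i += count
        if i >= len(data):
            break
        if ptr & 0x80:                     # bit 7 set: LZSS-style flag byte follows
            flags = data[i]; i += 1
            for bit in range(8):           # assumed LSB-first
                if i >= len(data):
                    break
                if flags & (1 << bit):     # set bit = 2-byte control code
                    copy_from_control()
                else:                      # clear bit = one literal byte
                    out.append(data[i]); i += 1
        else:                              # bit 7 clear: exactly one control code
            copy_from_control()
    return bytes(out)

# Usage, under the assumed layout: 80 literals, then "copy 16 bytes from
# 80 back" (dist = 0x050, len = 16, i.e. control bytes 05 0D here; the
# 6F B0 control code in my examples below assumes some other layout
# that I haven't pinned down).
enc = bytes([0x50]) + bytes(range(0x50)) + bytes([0x05, 0x0D])
assert decode_fins(enc) == bytes(range(0x50)) + bytes(range(0x10))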

henke37

It's too late in the day for me to think. Do you have any figures to explain this with?

FinS


I'll try to show what I mean with some example data. So take the following data:
00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F
10 11 12 13 14 15 16 17 18 19 1A 1B 1C 1D 1E 1F
20 21 22 23 24 25 26 27 28 29 2A 2B 2C 2D 2E 2F
30 31 32 33 34 35 36 37 38 39 3A 3B 3C 3D 3E 3F
40 41 42 43 44 45 46 47 48 49 4A 4B 4C 4D 4E 4F
00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F

Following is normal LZSS encoding; everybody who knows LZSS already knows how this works. There are many ways to implement it, but I'm just showing one that is easy to see. Each group of 8 entries is preceded by a flag byte whose bits tell where the control codes are. The flag bytes are mostly 00 because there are no control codes until the end, where the flag byte is 01 because there is 1 control code after it. The last 2 bytes are that control code, which just tells the decoder to repeat 16 bytes from earlier in the output (here, the first row again).
00 00 01 02 03 04 05 06 07 00 08 09 0A 0B 0C 0D 0E 0F
00 10 11 12 13 14 15 16 17 00 18 19 1A 1B 1C 1D 1E 1F
00 20 21 22 23 24 25 26 27 00 28 29 2A 2B 2C 2D 2E 2F
00 30 31 32 33 34 35 36 37 00 38 39 3A 3B 3C 3D 3E 3F
00 40 41 42 43 44 45 46 47 00 48 49 4A 4B 4C 4D 4E 4F
01 6F B0

This is my alternate way. The first byte is the pointer. It's 0x50, which means write the next 0x50 (80) bytes uncompressed, and then the control code follows. The total savings here is 10 bytes (83 bytes versus 93).
50 00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F
10 11 12 13 14 15 16 17 18 19 1A 1B 1C 1D 1E 1F
20 21 22 23 24 25 26 27 28 29 2A 2B 2C 2D 2E 2F
30 31 32 33 34 35 36 37 38 39 3A 3B 3C 3D 3E 3F
40 41 42 43 44 45 46 47 48 49 4A 4B 4C 4D 4E 4F
6F B0


Here is another example with greater compression prospects.
00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F
10 11 12 13 14 15 16 17 18 19 1A 1B 1C 1D 1E 1F
20 21 22 23 24 25 26 27 28 29 2A 2B 2C 2D 2E 2F
30 31 32 33 34 35 36 37 38 39 3A 3B 3C 3D 3E 3F
40 41 42 43 44 45 46 47 48 49 4A 4B 4C 4D 4E 4F
00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F
00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F
00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F
00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F

Same thing here, except the final flag byte is 07, which is 0000 0111 in binary, so there are 3 control codes of 2 bytes each.
00 00 01 02 03 04 05 06 07 00 08 09 0A 0B 0C 0D 0E 0F
00 10 11 12 13 14 15 16 17 00 18 19 1A 1B 1C 1D 1E 1F
00 20 21 22 23 24 25 26 27 00 28 29 2A 2B 2C 2D 2E 2F
00 30 31 32 33 34 35 36 37 00 38 39 3A 3B 3C 3D 3E 3F
00 40 41 42 43 44 45 46 47 00 48 49 4A 4B 4C 4D 4E 4F
07 6F B0 FF F0 5F 80

Here is how mine handles it. Because the literals will be followed by 2 or more control codes, bit 7 of the pointer is set to tell the decoder to use a flag byte. So subtract 0x80 to remove bit 7, and the number of uncompressed bytes is 0x50 (80). Then the 07 is the flag byte telling the decoder to use the next 3 control codes of 2 bytes each. The total savings here is 9 bytes (88 bytes versus 97).
D0 00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F
10 11 12 13 14 15 16 17 18 19 1A 1B 1C 1D 1E 1F
20 21 22 23 24 25 26 27 28 29 2A 2B 2C 2D 2E 2F
30 31 32 33 34 35 36 37 38 39 3A 3B 3C 3D 3E 3F
40 41 42 43 44 45 46 47 48 49 4A 4B 4C 4D 4E 4F
07 6F B0 FF F0 5F 80

Klarth

I have a feeling that this scheme only works better for small files with very little redundancy.

Kiyoshi Aman

Write an encoder/decoder pair which implements this scheme and compare it against the LZSS variant you mentioned, on both small files and large files. Plaintext data would work best, because then you can easily compare the unencoded input with the decoded output and spot any differences that much more easily.

FinS

It should do better on low-redundancy files, as you say, but it shouldn't matter how big they are. There probably wouldn't be savings on highly compressed files, but I don't think it should take a hit either. The only way it could lose ground is if the flag bytes consistently had at least 4 of their 8 bits set.

I think the decoder should be pretty simple, but I'm concerned about the compression side. I think the only way to do it would be to build the LZSS encoding as a temporary file and then make a second pass to convert it to the pointer-based format, but that second pass should be very fast.

The compressor could also be set up to handle control codes more efficiently, so even under high compression it could still save some space. For instance, if there are 8 consecutive control codes but they are spread across 2 flag bytes in the regularly compressed file, the compressor could take advantage of this on the second pass and combine the control codes under a single flag byte.
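Roughly, that second pass could look like the following sketch (in Python; the token format handed over by the first pass is an assumption, as is how to handle a copy that immediately follows a full flag group, since a 1-to-128 pointer can't encode zero literals):

def repack(tokens):
    # tokens: list of ('lit', byte) or ('copy', c1, c2) entries from a
    # normal LZSS matcher. Hypothetical interface, just for illustration.
    out = bytearray()
    i = 0
    while i < len(tokens):
        run = bytearray()                      # up to 128 literals per pointer
        while i < len(tokens) and tokens[i][0] == 'lit' and len(run) < 128:
            run.append(tokens[i][1]); i += 1
        if i >= len(tokens):                   # trailing literals, stream ends
            if run:
                out.append(len(run) & 0x7F); out += run
            break
        two_copies = (tokens[i][0] == 'copy' and i + 1 < len(tokens)
                      and tokens[i + 1][0] == 'copy')
        if tokens[i][0] == 'lit' or two_copies:
            # Literal run capped at 128, or 2+ copies ahead: flag-byte form.
            out.append(0x80 | (len(run) & 0x7F)); out += run
            flags, body = 0, bytearray()
            group = tokens[i:i + 8]
            for bit, tok in enumerate(group):
                if tok[0] == 'copy':
                    flags |= 1 << bit          # set bit = 2-byte control code
                    body += bytes(tok[1:])
                else:
                    body.append(tok[1])        # clear bit = literal byte
            i += len(group)
            out.append(flags); out += body
        else:                                  # exactly one copy: plain form
            out.append(len(run) & 0x7F); out += run
            out += bytes(tokens[i][1:]); i += 1
    return bytes(out)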

FinS

I have some errors to work out yet, but it looks to be getting about 11% extra efficiency on this file.

dhtext.lz - original LZ compressed file

dhtext.lz3 - enhanced LZ compression

syntax error

Ideal for 32Mbit SNES games that don't have much free ROM space.

Dwedit

Aplib is the king of pure LZ77-style compression. It often beats Zip, as long as you aren't packing pure ASCII text. Aplib still loses to 7-Zip's LZMA, but you need at least a GBA for that.
I've written Z80 and ARM decompressors for Aplib, but not a 6502 version yet.
I don't think Aplib has a mode for long stretches of incompressible data, but it does have a way of encoding a single byte that appeared within the last 15 bytes, which saves one bit versus a literal.

Tried Aplib yet?
"We are merely sprites that dance at the beck and call of our button-pressing overlord."

syntax error

nice lib for writing a Master System RPG or strategy game

FinS

I've only worked with LZSS up to now. Once I get this to a point I'm satisfied with, I could move on to other compression schemes. LZ77 would be a good choice for its compatibility with older systems; LZ78 also looks interesting.

I am also thinking about how to make a compressor that will work with any data, no matter how random or patterned it is, so you could just run it over and over and it would progressively compact more and more. Of course, it would eventually reach a point where no more compression is achievable. I don't know, though; maybe that is a little paradoxical.

Dwedit

Quote from: FinS on September 23, 2011, 06:00:03 PM
I am also thinking about how to make a compressor that will work with any data, no matter how random or patterned it is, so you could just run it over and over and it would progressively compact more and more. Of course, it would eventually reach a point where no more compression is achievable. I don't know, though; maybe that is a little paradoxical.
If you can compress something twice, and get something better, then your first compressor wasn't very good.
If you just want to be able to indicate large amounts of literals without lots of wasted bits, use gamma coding to encode bigger numbers.
Gamma coding is like "We read a bit, check the next bit (1 or 0) to see if there are any more bits left".  You can encode arbitrarily large numbers that way.
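The textbook version (Elias gamma) writes a number's bit length minus one as unary zeros, then the number itself in binary; that's one concrete way to get the arbitrarily-large property. A minimal sketch:

def gamma_encode(n: int) -> str:
    # Elias gamma for n >= 1: (bit_length - 1) zeros, then n in binary.
    bits = bin(n)[2:]
    return "0" * (len(bits) - 1) + bits

def gamma_decode(bits: str) -> int:
    # Inverse: count the leading zeros, then read that many + 1 bits.
    zeros = 0
    while bits[zeros] == "0":
        zeros += 1
    return int(bits[zeros:2 * zeros + 1], 2)

# 1 -> '1', 2 -> '010', 5 -> '00101', 128 -> 15 bits total
assert gamma_decode(gamma_encode(5)) == 5

Small numbers stay short (1 is a single bit), which is what makes this attractive for literal-run lengths.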
"We are merely sprites that dance at the beck and call of our button-pressing overlord."

FinS

Quote from: Dwedit on September 23, 2011, 06:46:44 PM
Gamma coding is like "We read a bit, check the next bit (1 or 0) to see if there are any more bits left".  You can encode arbitrarily large numbers that way.
That sounds like it might work well with frequency substitution.

henke37

In general you want to store as few bits for the length of the length as possible.

Kiyoshi Aman

A faster encoding would just use the 8th bit of each byte to indicate whether another byte needs to be read:


1 111 1111
0 000 0001


would be 128 (127 + 1), for instance, while


1 111 1111
1 111 1111
0 000 1000


...would be 262 (127 + 127 + 8).
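Read that way, the scheme is additive: each byte with the high bit set contributes its low 7 bits and signals another byte, and the final byte is added on top. A decoder sketch of that reading (note it differs from the more common positional varint, where each byte is a base-128 digit and FF 01 would decode to 255):

def read_varint(data: bytes, i: int = 0):
    # Additive continuation-byte scheme matching the examples above:
    # FF 01 -> 127 + 1 = 128. Returns (value, next_index).
    value = 0
    while data[i] & 0x80:
        value += data[i] & 0x7F
        i += 1
    value += data[i]
    return value, i + 1

assert read_varint(bytes([0xFF, 0x01])) == (128, 2)
assert read_varint(bytes([0xFF, 0xFF, 0x08])) == (262, 3)

The additive form only grows linearly with each extra byte, whereas the positional form multiplies by 128 per byte, so the positional form reaches big numbers much faster.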

KaioShin

Quote from: FinS on September 23, 2011, 06:00:03 PM
I am also thinking about how to make a compressor that will work with any data, no matter how random or patterned it is, so you could just run it over and over and it would progressively compact more and more. Of course, it would eventually reach a point where no more compression is achievable. I don't know, though; maybe that is a little paradoxical.

Burrows-Wheeler is what you are looking for. Of course, it's completely impractical for real-time stuff on consoles, as it needs n² the memory space of whatever you want to compress, but it can compress pretty much any data even when other algorithms just give up.
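For the curious, here is a naive sketch of the forward transform (strictly speaking, BWT is a reversible transform that groups similar contexts together so a simple coder can finish the job, rather than a compressor on its own). Sorting all n rotations as full strings is where the n²-ish memory cost above comes from; real implementations use suffix arrays instead:

def bwt(s: bytes) -> bytes:
    # Naive Burrows-Wheeler transform: sort every rotation and take the
    # last column. Assumes 0x00 does not occur in s; it's the sentinel.
    s = s + b"\x00"
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return bytes(rot[-1] for rot in rotations)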

Roger Pepitone

Quote from: FinS on September 23, 2011, 06:00:03 PM
I've only worked with LZSS up to now. Once I get this to a point I am satisfied I could move on to other compressions. LZ77 would be a good choice for it's compatibility with older systems, LZ78 also looks interesting.

I am also thinking about how to make a compressor that will work with any data no matter how random or patterned it is. So you could just run it over and over and it will progressively compact more and more. It eventually of course would reach a point that no more compression is achievable. I don't know though maybe that is a little paradoxical.

That's simply mathematically impossible. Even if the size of the uncompressed data is fixed, 99.6% (254/255) of all possible files can't be compressed by even one byte: a counting argument, since there are roughly 255 times as many n-byte files as there are files shorter than n bytes, and compression must map inputs to distinct outputs.

If you were to try to improve LZSS, I'd suggest allowing larger values for the backcopy distance/length: interpret a pair that would have a backcopy length of 18 (the maximum) as using a third byte for extra backcopy length. This will give a good increase for files which have long blocks of repeated data (such as bitmapped images with small palettes).
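As a sketch of that extension (the byte layout is an assumption, since none is fixed here: 12-bit distance, 4-bit length minus 3), a maxed-out length field signals an extension byte:

def read_match(data: bytes, i: int):
    # Decode one backcopy pair, extended as suggested above.
    c1, c2 = data[i], data[i + 1]
    dist = (c1 << 4) | (c2 >> 4)
    length = (c2 & 0x0F) + 3
    i += 2
    if length == 18:              # field at its max: a third byte extends it
        length += data[i]
        i += 1
    return dist, length, i

This keeps short matches at 2 bytes and only spends the third byte on matches of 18 or more.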

And the length of the file does matter. Backcopy-based compressors perform poorly early in their run, before they have processed a full window's worth of input, simply because it's less likely that a match exists within the bytes processed so far.

henke37

Yes, backcopy compressors in their purest form don't do well with runs of a repeated byte. But there is an easy trick to solve that problem: allow the copy length to be longer than the back-seek distance. Then you suddenly get RLE for free.
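Concretely, if the decoder copies byte by byte (as in the decoder sketch earlier in the thread), a copy longer than its backward distance re-reads bytes it has just written:

out = bytearray(b"\x00")      # one literal zero already decoded
dist, length = 1, 8           # "copy 8 bytes from 1 byte back"
for _ in range(length):       # each append re-reads a freshly written byte
    out.append(out[-dist])
assert out == b"\x00" * 9     # a nine-byte run from a single literal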