
Author Topic: Utilities: CompressTools upgraded to version 2.0  (Read 5667 times)

RHDNBot

Utilities: CompressTools upgraded to version 2.0
« on: July 22, 2013, 05:39:30 am »
Update By: Bregalad

CompressTools has been upgraded to version 2.0

CompressTools is currently the most advanced tool for compressing data. Its key feature is being able to "compile" the data to be compressed from a script (instead of using raw files), and being able to separate multiple data blocks (like multiple maps or lines of text within a game, for instance).

It will then try a dozen different compression algorithms, allowing the user to instantly see which one is most efficient, but it is also possible to force a single algorithm and insert CompressTools into a compilation chain.

Some major fixups have been made since the last version.

New features:
- Strings and chars can use all Unicode characters, and they can be mapped to bytes.
- Escape sequences are now implemented within strings.
- The StaticDictionary algorithm is much improved. It is still the slowest, especially for large data, where it can take a dozen minutes to compress; however, it is now "usable", unlike its previous version.
- A tool to convert from plain text to input scripts has been added.
- Single-byte blocks are no longer added for compression parameters; they are now handled as defines.

RHDN Project Page

Relevant Link: (http://www.romhacking.net/utilities/882/)

Disnesquick

Re: Utilities: CompressTools upgraded to version 2.0
« Reply #1 on: July 22, 2013, 06:34:24 am »
Quite a small point, but you seem to be using a quadratic-time algorithm for LZ-variant encoding. Rather than searching through all possible sliding-window positions, it's faster to create a map that binds two-byte values to their locations. At each position you can then look up the current two-byte value in the map, get a list of all window seeds, and extend them. You can cull the list as values fall off the end of the sliding window.
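Something like this, for example (just a rough, untested Python sketch of the idea; the window size and minimum match length are placeholders, not values from your tool):

Code: [Select]
# Map two-byte seed values to the positions where they occur, then only
# extend matches from those positions instead of re-scanning the whole
# sliding window at every step.
def lz_matches(data, window=0x1000, min_len=3):
    seeds = {}      # two-byte seed -> ascending list of positions
    matches = []    # (position, offset, length) of every usable match
    for pos in range(len(data) - 1):
        seed = data[pos:pos + 2]
        candidates = seeds.get(seed, [])
        # cull seeds that have fallen out of the sliding window
        while candidates and candidates[0] < pos - window:
            candidates.pop(0)
        best_len, best_src = 0, 0
        for src in candidates:
            length = 0
            while (pos + length < len(data)
                   and data[src + length] == data[pos + length]):
                length += 1
            if length > best_len:
                best_len, best_src = length, src
        if best_len >= min_len:
            matches.append((pos, pos - best_src, best_len))
        seeds.setdefault(seed, []).append(pos)
    return matches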

Bregalad

Re: Utilities: CompressTools upgraded to version 2.0
« Reply #2 on: July 22, 2013, 07:32:58 am »
Yeah, but the bottleneck is still (clearly) the StaticDictionary compression, so if effort should be made to improve something, it should go into that encoding.

However, unlike version 1.x, StaticDictionary is usable this time for blocks of reasonable size.

Disnesquick

Re: Utilities: CompressTools upgraded to version 2.0
« Reply #3 on: July 22, 2013, 03:19:46 pm »
As before, hashes are your friends.

Quite a rough implementation here, and (because I hate Java) it's in Python, but hopefully it will give you some pointers:

Code: [Select]
import heapq


def stripUnique(stripDict):
    newDict = {}
    for key, val in stripDict.items():
        if val[0] > 1:
            newDict[key] = val
    return newDict

def countOverlap(posList1, len1, posList2, len2):
    count = 0
    iter1 = iter(posList1)
    iter2 = iter(posList2)
    len1 -= 1
    len2 -= 1
    try:
        l1 = next(iter1)
        l2 = next(iter2)
        r1 = l1 + len1
        r2 = l2 + len2
        while True:
            while r1 < l2:
                l1 = next(iter1)
                r1 = l1 + len1
            while r2 < l1:
                l2 = next(iter2)
                r2 = l2 + len2
            if l2 > l1 and l2 <= r1:
                count += r1 - l2 + 1
                l2 = next(iter2)
                r2 = l2 + len2
            if l1 > l2 and l1 <= r2:
                count += r2 - l1 + 1
                l1 = next(iter1)
                r1 = l1 + len1
    except StopIteration:
        pass
    return count

# text blocks within text should be null terminated!
# "block1\0block2\0endofblocks\0"
def dictionaryScan(text, dictCost=1, dictionarySize=128, recursive=True):
    if text[len(text) - 1] != "\0":
        text = text + "\0"
    buildDict = [None] * (dictCost + 1)
    priority = []
    totalBenefit = 0
    textLen = len(text)
    # Initialize the dictionary with single characters
    nextDict = {}
    for char, pos in zip(text, range(textLen)):
        if not char in nextDict:
            nextDict[char] = (0, [])
        count, curList = nextDict[char]
        curList.append(pos)
        nextDict[char] = count + 1, curList

    curDict = stripUnique(nextDict)
    curLength = 1
    fullDict = {}
    while curDict != {}:
        print(curLength, len(curDict))
        curLength += 1
        nextDict = {}
        # Build strings of length N+1 from the surviving strings of length N
        for count, posList in curDict.values():
            for pos in posList:
                nextPos = pos + curLength
                if text[nextPos - 1] == "\0":
                    break
                newStr = text[pos:nextPos]
                if not newStr in nextDict:
                    nextDict[newStr] = count, curList = (0, [])
                else:
                    count, curList = nextDict[newStr]
                curList.append(pos)
                nextDict[newStr] = count + 1, curList

        curDict = stripUnique(nextDict)
        if curLength <= dictCost:
            continue
        buildDict.append(curDict)

        for testStr, val in curDict.items():
            benefit = (val[0] - 1) * (curLength - dictCost)
            heapq.heappush(priority, (-benefit, val[0], 0, testStr))
    outputList = []
    foundCount = 0
    touched = {}

    while len(outputList) < dictionarySize and priority != []:
        benefit, count, overlapCount, candidate = heapq.heappop(priority)
        if benefit >= 0:
            continue
        # Check whether a fragment overlaps with a found candidate
        candLen = len(candidate)
        candList = buildDict[candLen][candidate][1]
        benefitChange = 0
        for foundCount, found in outputList[overlapCount:]:
            foundLen = len(found)
            foundList = buildDict[foundLen][found][1]
            flag = 0
            if candidate in found:
                if recursive:
                    benefitChange += count * (len(candidate) - dictCost)
                else:
                    benefitChange += count * len(candidate)
            elif found in candidate:
                if recursive:
                    benefitChange += count * (len(found) - dictCost)
                else:
                    benefitChange += count * len(found)
            else:
                for i in range(1, candLen):
                    frag = candidate[:i]
                    fragLen = len(frag)
                    if frag == found[-fragLen:]:
                        flag = 1
                    frag = candidate[-i:]
                    fragLen = len(frag)
                    if frag == found[:i]:
                        flag = 1
                if flag == 1:
                    benefitChange += countOverlap(foundList, foundLen, candList, candLen)

        if benefitChange > 0:
            heapq.heappush(priority, (benefit + benefitChange, count, len(outputList), candidate))
        else:
            outputList.append((count, candidate))
            print(str(len(outputList)) + ": '''" + candidate + "''' saves " + str(-benefit))
            totalBenefit -= benefit
    print("Total savings of %d" % totalBenefit)
    return outputList

It builds a dictionary for "Bleak House" (2 MB of text) in three minutes on my machine. I haven't had a chance to compare it against your code yet, but there are numerous ways to speed up the above algorithm (centred around how it handles pushing candidates onto the "this candidate overlaps an accepted dictionary entry" heap).
« Last Edit: July 22, 2013, 03:39:17 pm by Disnesquick »

Bregalad

Re: Utilities: CompressTools upgraded to version 2.0
« Reply #4 on: July 22, 2013, 04:08:37 pm »
Sorry, but I don't really understand what you're doing with that piece of code. Is it supposed to improve the way I encode data with a static dictionary?

The way I currently do it is that I have a hash table for all matching strings of length 2 to N (N is changeable; I set it to something like 15 by default), and count the number of occurrences (in the non-recursive version, all strings that contain dictionary references are discarded and not stored in the hash table).
The number of bytes saved is deduced from the number of occurrences and various parameters. I then replace the string which saves the most bytes with a single byte and add this (word, length) pair to the dictionary. Then I clear the hash table and start again until there is no room left among the "non-used bytes" for dictionary word references.

The encoding will be very slow if:

- Blocks of data are very big, as there is an insane number of possible "words" in the data, and processing all of them is slow
- Few byte values are used, as there are more dictionary words to search (but that also means room for better compression!)

However, despite being slow to encode, this algorithm is extremely simple in concept, fast to decode, requires no RAM, and is efficient with many different kinds of data, which is why it's still included in the collection. Recursive StaticDictionary is currently the most efficient of the collection on English text.

If there is any way to make it faster while staying equivalent, I'm all ears. But anyway, as I said, slow encoding is usually not much of a problem, as long as it does not take ages (as was the case with v1.x).

Disnesquick

Re: Utilities: CompressTools upgraded to version 2.0
« Reply #5 on: July 22, 2013, 04:55:02 pm »
Yes, you're describing a standard dictionary compression algorithm.

My code looks for all unique strings (currently of unbounded length, but in practice there are very few above 15) in a block of text, building the strings of length N+1 from the list of strings of length N. As it goes, it counts the number of occurrences and thereby calculates the byte saving per string, were it to be replaced by a dictionary reference.

Then the dictionary is built: a heap is used to order the strings by their byte saving. First, each string is checked to see if it overlaps with an existing (accepted) string. If so, the savings must be recalculated and the string is pushed back onto the heap; otherwise it is accepted. The code makes sure that a string is never checked twice against an already-seen accepted word.

The saving of this approach is that you don't rebuild the hash table for every word.

The code doesn't handle the encoding, but once you have the dictionary, that part is completely trivial using a rolling hash or some such. Yes, once encoded it will be completely equivalent to your (and all other) dictionary implementations.
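For instance, something along these lines would do for the encoding step (just a sketch over the dictionary returned above; the token scheme with values from 0x80 up is made up purely for illustration, and a rolling hash would beat the naive startswith scan):

Code: [Select]
# Greedy longest-match encoder over a finished dictionary.
def encode(text, dictionary):
    # dictionary: list of (count, word) pairs as returned by dictionaryScan
    words = sorted((word for count, word in dictionary), key=len, reverse=True)
    out = []
    pos = 0
    while pos < len(text):
        for index, word in enumerate(words):
            if text.startswith(word, pos):
                out.append(0x80 + index)    # dictionary reference
                pos += len(word)
                break
        else:
            out.append(ord(text[pos]))      # literal character
            pos += 1
    return out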

I tightened the code up a bit more:

Code: [Select]
import heapq


def stripUnique(stripDict):
    newDict = {}
    for key, val in stripDict.items():
        if val[0] > 1:
            newDict[key] = val
    return newDict

def countOverlap(posList1, len1, posList2, len2):
    count1 = 0
    count2 = 0
    iter1 = iter(posList1)
    iter2 = iter(posList2)
    len1 -= 1
    len2 -= 1
    try:
        l1 = next(iter1)
        l2 = next(iter2)
        r1 = l1 + len1
        r2 = l2 + len2

        if l2 < l1:
            while r2 < l1:
                l2 = next(iter2)
                r2 = l2 + len2
            if l1 > l2:
                count2 += 1
                l1 = next(iter1)
                r1 = l1 + len1
        while True:
            while r1 < l2:
                l1 = next(iter1)
                r1 = l1 + len1
            if l2 > l1:
                count1 += 1
                l2 = next(iter2)
                r2 = l2 + len2
            while r2 < l1:
                l2 = next(iter2)
                r2 = l2 + len2
            if l1 > l2:
                count2 += 1
                l1 = next(iter1)
                r1 = l1 + len1
    except StopIteration:
        pass
    return count1, count2

# text blocks within text should be null terminated!
# "block1\0block2\0endofblocks\0"
def dictionaryScan(text, dictCost=1, dictionarySize=128, recursive=True):
    if text[len(text) - 1] != "\0":
        text = text + "\0"
    buildDict = [None] * (dictCost + 1)
    priority = []
    totalBenefit = 0
    textLen = len(text)
    # Initialize the dictionary with single characters
    nextDict = {}
    for char, pos in zip(text, range(textLen)):
        if not char in nextDict:
            nextDict[char] = (0, [])
        count, curList = nextDict[char]
        curList.append(pos)
        nextDict[char] = count + 1, curList

    curDict = stripUnique(nextDict)
    curLength = 1
    fullDict = {}
    while curDict != {}:
        print(curLength, len(curDict))
        curLength += 1
        nextDict = {}
        # Build strings of length N+1 from the surviving strings of length N
        for count, posList in curDict.values():
            for pos in posList:
                nextPos = pos + curLength
                if text[nextPos - 1] == "\0":
                    break
                newStr = text[pos:nextPos]
                if not newStr in nextDict:
                    nextDict[newStr] = count, curList = (0, [])
                else:
                    count, curList = nextDict[newStr]
                curList.append(pos)
                nextDict[newStr] = count + 1, curList

        curDict = stripUnique(nextDict)
        if curLength <= dictCost:
            continue
        buildDict.append(curDict)

        for testStr, val in curDict.items():
            benefit = (val[0] - 1) * (curLength - dictCost)
            heapq.heappush(priority, (-benefit, val[0], 0, testStr))
    outputList = []
    foundCount = 0
    touched = {}

    while len(outputList) < dictionarySize and priority != []:
        benefit, count, overlapCount, candidate = heapq.heappop(priority)

        # Check whether a fragment overlaps with a found candidate
        candLen = len(candidate)
        candList = buildDict[candLen][candidate][1]
        benefitChange = 0
        for foundCount, found in outputList[overlapCount:]:
            foundLen = len(found)
            foundList = buildDict[foundLen][found][1]
            flag = 0
            if candidate in found:
                if recursive:
                    benefitChange += (foundCount - 1) * (len(candidate) - dictCost)
                else:
                    benefitChange += foundCount * (len(candidate) - dictCost)
            elif found in candidate:
                if recursive:
                    benefitChange += (count - 1) * (len(found) - dictCost)
                else:
                    benefitChange += count * (len(found) - dictCost)
            else:
                for i in range(1, candLen):
                    frag = candidate[:i]
                    fragLen = len(frag)
                    if frag == found[-fragLen:]:
                        flag = 1
                        break
                for i in range(1, candLen):
                    frag = candidate[-i:]
                    fragLen = len(frag)
                    if frag == found[:i]:
                        flag = 1
                        break
                if flag == 1:
                    c1, c2 = countOverlap(foundList, foundLen, candList, candLen)
                    benefitChange += c1 * (len(candidate) - dictCost)
                    benefitChange += c2 * (len(found) - dictCost)

        if benefitChange > 0 and benefit + benefitChange < 0:
            heapq.heappush(priority, (benefit + benefitChange, count, len(outputList), candidate))
        elif benefitChange == 0:
            outputList.append((count, candidate))
            print(str(len(outputList)) + ": '''" + candidate + "''' saves " + str(-benefit))
            totalBenefit -= benefit
    print("Total savings of %d" % totalBenefit)
    return outputList

What my code (and yours too) can't handle is when a dictionary entry overlaps with itself: E.g. dictionary compression of the string "hello hello bello  bello mellow yellow fellow" would give false heuristics on the byte saving per entry.
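A quick way to see the overcounting (illustrative snippet, not part of either implementation):

Code: [Select]
# The occurrence count treats overlapping hits of the same word as separate
# savings, but only non-overlapping occurrences can actually be replaced.
def count_all(text, word):
    return sum(1 for i in range(len(text)) if text.startswith(word, i))

def count_replaceable(text, word):
    count, pos = 0, text.find(word)
    while pos != -1:
        count += 1
        pos = text.find(word, pos + len(word))
    return count

print(count_all("aaaa", "aa"), count_replaceable("aaaa", "aa"))  # prints 3 2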
« Last Edit: July 22, 2013, 05:14:45 pm by Disnesquick »

Bregalad

Re: Utilities: CompressTools upgraded to version 2.0
« Reply #6 on: July 23, 2013, 03:03:43 am »
So the main differences are that:

- You build the table for matches of length N+1 from the table of length N, while I build them from scratch (probably does not have a huge impact on speed?)
- You don't rebuild the hash table for every word (probably has a huge impact on speed).

I thought about doing that too, but the problem is that it's extremely hard to adjust the table while staying accurate. Re-counting everything from scratch seemed to be the simplest way to handle this.

Disnesquick

Re: Utilities: CompressTools upgraded to version 2.0
« Reply #7 on: July 23, 2013, 05:29:57 am »
1. Yeah, it doesn't have much impact on speed in this case. In my C implementation from years back I used a radix tree to handle the map, so there it had a much greater impact.
2. It's not particularly tricky. Any existing accepted candidate that is completely contained within the currently considered candidate will remove (len(accept) - dictCost) * freq(consider) from the considered benefit. Any existing accepted candidate that completely contains the considered candidate will remove (len(consider) - dictCost) * freq(accept) from the benefit. You then need a way to count the number of partial overlaps; this can be done in a number of ways and will basically be the single largest speed hog. There are ways to do this very fast if you're using radix trees, though. If a candidate has its benefit adjusted, it is not accepted but pushed back into the heap. You can speed this bit up too by using lazy heap structures, but it's not a huge drain. The two containment rules are sketched below.
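A rough sketch of those two containment rules (names here are just illustrative, not from either codebase):

Code: [Select]
# Benefit penalty when an already accepted entry contains, or is contained
# in, the candidate currently being considered.
def containment_penalty(accepted, candidate, freq_accepted, freq_candidate, dictCost=1):
    penalty = 0
    if accepted in candidate:
        # accepted entry sits completely inside the considered candidate
        penalty = (len(accepted) - dictCost) * freq_candidate
    elif candidate in accepted:
        # accepted entry completely contains the considered candidate
        penalty = (len(candidate) - dictCost) * freq_accepted
    return penalty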