Dictionary Based Compression: The LZW Algorithm
The LZ family of lossless compression algorithms has many, many variants, each making use of various optimizations to eke out every bit of performance for a given application. There are so many variants that they are generally broken into two categories: LZ77-based algorithms, which use an implicit dictionary via a "sliding window", and LZ78-based algorithms, which make use of an explicit dictionary data structure. Having previously covered the LZ77 algorithm, today's post will focus on LZ78. More specifically, I want to discuss one of the more... controversial members of the LZ78 family: LZW.
I'm sure you're probably thinking "controversial? a compression algorithm?" but I assure you its release stirred up a whole host of issues regarding patent law and algorithms. The TL;DR of it is that the author published the algorithm, and then patented it. "Ok, and?". Well, this was 1984. In between the algorithm being published and the author being granted a patent, it was adopted and used in a number of commercial Unix distributions under the impression that the algorithm was not encumbered by patents - why else publish it? Needless to say, this caused quite the headache for corporate lawyers for a few years.
All of that is history though, and the patent has since expired, so we're free to take a look at what is a pretty interesting variant of the LZ78 lossless compression algorithm, one which was instrumental in implementing the GIF image format, as well as the Unix "compress" command. So grab some coffee, get comfortable, and let's take a peek at implementing LZW.
From LZ Tuples to LZ Pairs to Codewords
The LZ78 family of algorithms had two key advantages over the earlier LZ77-based algorithms. Where LZ77 decoders always had to start from the beginning of the compressed stream, LZ78 decoders can pick up from any point in the stream. Of greater significance, where LZ77's canonical implementation emits 3-tuples of <offset, length, lookahead character>, LZ78 algorithms emit pairs of <index, character>, where the index refers to an entry in an explicit dictionary.
LZW takes this space crunching one step further by preloading the dictionary with every possible single-character string. Thanks to this seemingly trivial optimization, the explicit character is no longer needed: our compressed stream is reduced to a stream of integers alone. These integers are indexes into the decoding dictionary, used to build up phrases from previously seen words, and are dubbed "codewords".
Squeezing Text the LZW way
There are (at least) two general approaches to implementing compression with LZW, and they differ in the data structure used to implement the compression table. Using a data structure optimized for querying the longest common prefix of a search key, such as a ternary search trie (TST) or a prefix trie, we can implement the algorithm in a very simple way. Lacking such a specialized data structure, we turn to the second approach, in which we make do with a hash table or even a balanced binary search tree, at the cost of a slightly lower degree of compression.
Our dictionary can hold a maximum of 4096 codewords. This is so that we can represent all of the indices in the decoding table using 12-bit integers instead of the normal 16 bits.
#include <string>
#include <unordered_map>
#include <vector>
using namespace std;

const int max_code_words = 4096;

void initEncodingTable(unordered_map<string,int>& dict) {
    int i = 0;
    while (i < 256) {
        string tmp;
        tmp.push_back((char)i);
        dict.emplace(tmp, i++);
    }
}
We begin the compression process by initializing the encoding table with each of the 256 single-character ASCII values. Once the table is initialized we can begin encoding the data. This is done by building phrases up one character at a time and checking for their existence in the encoding table.
vector<int> squeeze(string input) {
    unordered_map<string,int> dict;
    initEncodingTable(dict);
    string phrase = "";
    vector<int> compressed;
    int i = 257; // 0-255 are preloaded; 256 is reserved to match the decoder
    for (size_t j = 0; j < input.size(); j++) {
        string prev = phrase;
        phrase.push_back(input[j]);
        if (dict.find(phrase) == dict.end()) {
            compressed.push_back(dict[prev]);
            if (i < max_code_words)
                dict.emplace(phrase, i++);
            phrase.clear();
            phrase.push_back(input[j]);
        }
    }
    if (!phrase.empty())
        compressed.push_back(dict[phrase]);
    return compressed;
}
Uncompressing with LZW
Just as we did with the encoding process, the decoding process begins by initializing the decoding table with all 256 single-character ASCII strings. Take note that this time we've swapped the data types of the key/value pairs in the dictionary: now we have integer keys returning string values.
void initDecodingTable(unordered_map<int,string>& dict) {
    int i;
    for (i = 0; i < 256; i++) {
        string s;
        s.push_back((char)i);
        dict[i] = s;
    }
    dict[i] = ""; // reserve codeword 256: new entries begin at 257
}
Once the table has been initialized, decoding can begin. Decoding an LZW stream is very much the reverse of the encoding process. The trick lies in building up and adding the correct strings for further decoding as we go. There is one corner case to watch for: a codeword can reference the very entry we are about to add, in which case we reconstruct it from the previous phrase plus its own first character.
string unsqueeze(vector<int> compd) {
    unordered_map<int, string> dict;
    initDecodingTable(dict);
    int i = 257; // first free codeword after the preloaded table
    int codeword = compd[0];
    string sofar = dict[codeword];
    string output;
    for (size_t j = 1; j < compd.size(); j++) {
        output += sofar;
        codeword = compd[j];
        string s = dict[codeword];
        // corner case: the codeword refers to the entry we're about to add
        if (i == codeword) { s = sofar; s.push_back(sofar[0]); }
        if (i < max_code_words) {
            sofar.push_back(s[0]);
            dict[i++] = sofar;
        }
        sofar = s;
    }
    output += sofar;
    return output;
}
And that's all there is to it.
Some Thoughts On LZW
As can be seen from the implementation, when it comes to the LZ family of compression algorithms, LZW is certainly one of the simplest. Interestingly, it also has rather favorable compression ratios. Further optimizations, primarily with regard to running time, can be made by using more specialized search structures such as prefix tries or ternary search tries, which support efficiently finding the longest common prefix of a string key.
Anyway, that's what I've got for today. Until next time, Happy Hacking.