Dictionary Based Compression: The LZW Algorithm
The LZ family of lossless compression algorithms has many, many variants, each making use of various optimizations to eke out every bit of performance for a given application. There are so many variants that they are generally broken into two categories: LZ77-based algorithms, which use an implicit dictionary via a "sliding window", and LZ78-based algorithms, which make use of an explicit dictionary data structure. Having previously covered the LZ77 algorithm, today's post will focus on LZ78. More specifically, I want to discuss one of the more... controversial members of the LZ78 family: LZW.
I'm sure you're probably thinking "controversial? a compression algorithm?" but I assure you its release stirred up a whole host of issues regarding patent law and algorithms. The TL;DR of it is that the author published the algorithm, and then patented it. "Ok, and?". Well, this was 1984. In between the algorithm being published and the author being granted a patent, it was adopted and used in a number of commercial Unix distributions under the impression that the algorithm was not encumbered by patents - why else publish it? Needless to say, this caused quite the headache for corporate lawyers for a few years.
All of that is history though, and the patent has since expired, so we're free to take a look at what is a pretty interesting variant of the LZ78 lossless compression algorithm, one which was instrumental in implementing the GIF image format, as well as the Unix "compress" command. So grab some coffee, get comfortable, and let's take a peek at implementing LZW.
From LZ Tuples to LZ Pairs to Codewords
The LZ78 family of algorithms had two key advantages over the earlier LZ77-based algorithms. Where LZ77 decoders always had to start from the beginning of the compressed stream, LZ78 decoders can pick up from any point in the stream. Of greater significance, where LZ77's canonical implementation emits 3-tuples of <offset, length, lookahead character>, LZ78 algorithms emit pairs of <index, character>, where the index refers to an entry in an explicit dictionary.
LZW takes this space crunching one step further by preloading the dictionary with every possible single-character string. Thanks to this seemingly trivial optimization, the explicit character is no longer needed: our compressed stream is reduced to a stream of integers alone. These integers are indexes into the decoding dictionary, used to build up phrases from previously seen words, and are dubbed "codewords".
Squeezing Text the LZW way
There are (at least) two general approaches to implementing compression with LZW, and they differ in the data structure used to implement the compression table. Using a data structure optimized for querying the longest common prefix of a search key, such as a ternary search trie (TST) or a prefix trie, we can implement the algorithm in a very simple way. Lacking such a specialized data structure, we turn to the second approach, in which we make do with a hash table or even a balanced binary search tree, at the cost of a slightly lower degree of compression.
Our dictionary can hold a maximum of 4096 codewords. This is so that we can represent all of the indices in the decoding table using 12-bit integers instead of the normal 16 bits.
#include <string>
#include <unordered_map>
#include <vector>
using namespace std;

const int max_code_words = 4096;

void initEncodingTable(unordered_map<string,int>& dict) {
    int i = 0;
    while (i < 256) {
        string tmp;
        tmp.push_back((char)i);
        dict.emplace(tmp, i++);
    }
}
We begin the compression process by initializing the encoding table with each of the 256 single-character ASCII values. Once the table is initialized we can begin encoding the data. This is done by building phrases up one character at a time and checking for their existence in the encoding table.
vector<int> squeeze(string input) {
    unordered_map<string,int> dict;
    initEncodingTable(dict);
    string phrase = "";
    vector<int> compressed;
    int i = 257; // 0-255 are preloaded; 256 is reserved to match the decoder
    for (size_t j = 0; j < input.size(); j++) {
        string prev = phrase;
        phrase.push_back(input[j]);
        if (dict.find(phrase) == dict.end()) {
            compressed.push_back(dict[prev]);
            if (i < max_code_words)
                dict.emplace(phrase, i++);
            phrase.clear();
            phrase.push_back(input[j]);
        }
    }
    if (!phrase.empty())
        compressed.push_back(dict[phrase]);
    return compressed;
}
Uncompressing with LZW
Just as we did with the encoding process, the decoding process begins by initializing the decoding table with all 256 single-character ASCII strings. Take note that this time we've swapped the data types of the key/value pairs in the dictionary: now we have integer keys returning string values.
void initDecodingTable(unordered_map<int,string>& dict) {
    int i;
    for (i = 0; i < 256; i++) {
        string s;
        s.push_back((char)i);
        dict[i] = s;
    }
    dict[i] = ""; // reserve codeword 256: new entries begin at 257
}
Once the table has been initialized, decoding can begin. Decoding an LZW stream is very much the reverse of the encoding process. The trick lies in building up and adding the correct strings for further decoding as we go. There is one corner case to watch for: a codeword can reference the very entry we are about to add, in which case we reconstruct it from the previous phrase plus its own first character.
string unsqueeze(vector<int> compd) {
    unordered_map<int, string> dict;
    initDecodingTable(dict);
    int i = 257; // first free codeword after the preloaded table
    int codeword = compd[0];
    string sofar = dict[codeword];
    string output;
    for (size_t j = 1; j < compd.size(); j++) {
        output += sofar;
        codeword = compd[j];
        string s = dict[codeword];
        // corner case: the codeword refers to the entry we're about to add
        if (i == codeword) { s = sofar; s.push_back(sofar[0]); }
        if (i < max_code_words) {
            sofar.push_back(s[0]);
            dict[i++] = sofar;
        }
        sofar = s;
    }
    output += sofar;
    return output;
}
And that's all there is to it.
Some Thoughts On LZW
As can be seen from the implementation, when it comes to the LZ family of compression algorithms, LZW is certainly one of the simplest. Interestingly, it also has rather favorable compression ratios. Further optimizations, primarily with regard to running time, can be made by using more specialized search structures such as prefix tries or ternary search tries, which support efficiently finding the longest common prefix of a string key.
Anyway, that's what I've got for today. Until next time, Happy Hacking.