Data compression is the art of converting data from an
original representation to a more compact, compressed
representation. Although with pictures or audio it might be
acceptable if some information were lost during compression—keeping
only an approximation to the original data—for many applications
approximation is not acceptable. A data compression technique is
lossless if there exists a corresponding decompression
technique that always recovers the original data without error. As
it turns out, it is impossible for any lossless data compression
technique to always succeed in reducing the size of the original
data. Consider all cases of original data consisting of n
bits. There are 2^n such cases, each of which must
compress to a distinct version of compressed data, or else for some
version it would be impossible to know which original case should
be recovered. But if you count up all possible versions that
contain fewer than n bits, you only get
2^n - 1 possibilities, which is not enough. In fact,
it is typical for a lossless data “compression” technique to
increase, if only slightly, the size of the data in
troublesome cases.
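The counting argument can be checked directly. The following few lines of Python (a sketch for illustration; the function name is mine) tally every bit string shorter than n bits and confirm the shortfall:

```python
# There are 2^0 + 2^1 + ... + 2^(n-1) = 2^n - 1 bit strings with
# fewer than n bits: one short of the 2^n possible n-bit originals,
# so some original cannot shrink (the pigeonhole principle).
def shorter_outputs(n):
    return sum(2 ** k for k in range(n))

for n in range(1, 9):
    assert shorter_outputs(n) == 2 ** n - 1
```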
Nonetheless, in practice there are lossless data compression
techniques that substantially reduce in size many useful kinds of
original data. For example, it takes more than five million bytes
to represent the complete works of
Shakespeare as ASCII text. The text contains about one million
words, of which about thirty thousand are distinct. As you might
expect, common English words such as the, and,
you, that, not, and with appear a lot
more frequently than most of the other words. The popular lossless
data compression program gzip exploits such repetition to
compress the five million bytes of ASCII Shakespeare down to just
over two million bytes.
Exploiting repeated words is the idea behind
dictionary-based lossless data compression. This compression
technique employs a dictionary of words likely to appear in the
original data, assigns a reference number to each word, and then
produces compressed data in which the original words are largely or
entirely replaced by reference numbers. (Of course, instead of
words, any substrings could be used.) One problem is that a
dictionary suitable for English might not work so well with, say,
Esperanto. A related problem is that some words, for example
Exeunt, appear a lot more frequently in Shakespeare than
they might in, say, a medical textbook. The solution is to create a
specialized dictionary for the original data and then include the
dictionary in the compressed data so that the decompressor can
recover the original words. Of course, this solution has its own
drawbacks, namely, the preparatory work required to create the
specialized dictionary and the space required to include it in the
compressed data.
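A toy Python sketch of the static-dictionary idea may help (the word list and text here are invented for illustration; a real scheme would operate on arbitrary substrings and pack the reference numbers tightly into bits):

```python
# Toy dictionary-based compression: replace each word that appears in
# a fixed, agreed-upon dictionary with its reference number.
dictionary = ["the", "and", "you", "that", "not", "with"]  # illustrative
index = {word: i for i, word in enumerate(dictionary)}

def compress(words):
    # Words missing from the dictionary are passed through literally.
    return [index.get(w, w) for w in words]

def decompress(tokens):
    return [dictionary[t] if isinstance(t, int) else t for t in tokens]

text = ["the", "cache", "and", "you"]
packed = compress(text)              # [0, "cache", 1, 2]
assert decompress(packed) == text
```

Both sides must hold identical copies of the dictionary, which is exactly the difficulty the next paragraphs address.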
In their groundbreaking publications in 1977 and 1978, Jacob Ziv
and Abraham Lempel developed techniques for creating the dictionary
on the fly as the compressor makes one pass reading through the
original data and writing out the compressed data. Things are
cleverly arranged so that the decompressor can also make one pass
reading through the compressed data and writing out the original
data, recreating at each step the identical dictionary that the
compressor had at that point. The actions of the compressor and the
decompressor are effectively locked together so that the
decompressor always knows the meaning of each reference number it
encounters, even though the dictionary never explicitly appears in
the compressed data. In 1984, Terry Welch added substantial
practical improvements to the 1978 technique of Ziv and Lempel. The
resulting dictionary-based lossless data compression technique is
known as LZW, after the initials of its inventors.
- Jacob Ziv and Abraham Lempel, “A Universal
Algorithm for Sequential Data Compression,” IEEE Transactions on
Information Theory, 23(3), pp. 337-343, 1977.
- Jacob Ziv and Abraham Lempel, “Compression of
Individual Sequences Via Variable-Rate Coding,” IEEE
Transactions on Information Theory, 24(5), pp. 530-536,
1978.
- Terry Welch, “A Technique for High-Performance
Data Compression,” IEEE Computer, 17(6), pp. 8-19,
1984.
LZW is clever, effective, and not really all that complicated.
LZW became quite popular, notably being used in the Unix compress
utility and in the definition of the hugely popular GIF image
format. (GIF is a service mark of CompuServe, Inc.) Perhaps LZW
became too popular, because people failed to notice that it
was covered by U.S.
Patent 4,558,302. Eventually the owner of the patent realized
that LZW was being used in numerous places and began to demand
license fees. One description of the resulting brouhaha can be
found at http://purl.org/mcb/id/giflzw.
The LZW patent expired in 2003 and its counterpart patents in
Canada, Europe, and Japan expired in 2004.
In the following exposition, we describe the basic form of the
LZW compression technique without going into various additional
practical improvements developed by Welch. The LZW compressor
maintains a dictionary DICT of unique strings. In contrast
to an ordinary dictionary, which lists entries in lexicographic
order, DICT lists entries in the order they were added to
the dictionary. For each string S in the dictionary, the
reference number of S, which we write as
DICT[S], is defined as the number of strings that
were already in the dictionary just before S was added.
Observe that reference numbers start at zero and count upward.
Initially, DICT is preloaded with all single-character
strings. Additional strings are added as compression proceeds.
The LZW compressor operates by determining the longest string
W that (a) starts at the current point in the original data
and (b) is currently in the dictionary. This string is then
represented by its reference number in the compressed output. Let
C be the character that follows W in the original
data. We call C the “stop” character, because it stopped the
compressor from matching a longer string. From the definition of
W, we know that the string defined by appending C
onto the end of W, which we write as W+C, is
not currently contained in the dictionary. So the compressor adds
W+C to the dictionary. The compressor then moves its
consideration of the original data to the location of the stop
character and repeats the process of determining the longest
string, now starting with C. Because the dictionary is
preloaded with all single-character strings, anything that appears
in the original data will match some string in the dictionary, so
in the worst case the compressor outputs a reference number for
each character. But as the dictionary builds up, repeated
occurrences of longer and longer strings result in just one
reference number for each repetition, thus compressing the data. A
precise description of the actions of the LZW compressor is as
follows:
Step 1. Set W to the empty string and
preload DICT.
Step 2. If there are no more original data characters, goto
Step 8.
Step 3. Read the next original data character and call it
C.
Step 4. If W+C is contained in DICT,
set W to W+C and goto Step 2.
Step 5. Output the reference number
DICT[W].
Step 6. Add W+C to DICT.
Step 7. Set W to the single-character string
C and goto Step 2.
Step 8. If W is not the empty string, output the
reference number DICT[W].
Step 9. Stop.
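The nine steps above translate almost line for line into Python. This is a sketch of the basic technique only (the function and variable names are my own, and the reference numbers are left as plain integers rather than packed into bit fields as a practical implementation would):

```python
# LZW compression, following Steps 1-9 above.  DICT is preloaded with
# every single-character string of the alphabet; each added string
# receives the next reference number in sequence.
def lzw_compress(data, alphabet):
    dictionary = {ch: i for i, ch in enumerate(alphabet)}
    w = ""                               # Step 1: W is the empty string
    output = []
    for c in data:                       # Steps 2-3: next character C
        if w + c in dictionary:          # Step 4: extend the match
            w = w + c
        else:
            output.append(dictionary[w])          # Step 5: output DICT[W]
            dictionary[w + c] = len(dictionary)   # Step 6: add W+C
            w = c                                 # Step 7: restart at C
    if w:                                # Step 8: flush the final match
        output.append(dictionary[w])
    return output                        # Step 9
```

For instance, with the 27-character alphabet "*ABCDEFGHIJKLMNOPQRSTUVWXYZ" used in the example that follows, lzw_compress("THE*CACHE*IS", ...) yields [20, 8, 5, 0, 3, 1, 3, 28, 0, 9, 19].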
Let us consider an example. For simplicity, we use a
27-character alphabet consisting of a spacing character
* and the upper case letters A
through Z. It is convenient to display the
dictionary as a parenthesized list of strings, the original data as
a parenthesized list of characters, and the compressed data as a
parenthesized list of reference numbers. To reduce clutter, we
separate the items in a list with spaces instead of commas.
We preload our dictionary with ( * A B C D E F G H I J K
L M N O P Q R S T U V W X Y Z ). Observe that the spacing
character * is assigned reference number 0 and the
upper case letters A through Z are
assigned reference numbers 1 through 26, respectively. Let us see
how compression proceeds supposing that the original data consists
of the twelve characters ( T H E * C A C H E * I S
).
The LZW compressor reads the input characters in order, matching
the longest string already contained in the dictionary. Starting
off, the longest string the compressor can match is the
single-character string T, which has reference
number 20. So the compressor matches this string and outputs the
corresponding reference number. The “stop” character is
H, so the compressor adds the two-character string
TH to the dictionary, giving it reference number
27. At this point, the output is ( 20 ), the unmatched input is (
H E * C A C H E * I S ), and the dictionary is
(* A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
TH ). Note that one string has been matched, one reference
number has been output, and one string has been added to the
dictionary.
The next six input characters are each matched in exactly the
same way. The compressor matches a single-character string, outputs
its reference number, and adds a two-character string to the
dictionary. After these additional six matches, the output is ( 20
8 5 0 3 1 3 ), the unmatched input is ( H E * I S
), and the dictionary is ( * A B C D E F G H I J K L M N O P
Q R S T U V W X Y Z TH HE E* *C CA AC CH ). Six additional
strings have been added to the dictionary and given the reference
numbers 28 through 33.
Now something interesting happens. Observe that the next two
characters of the original data form the string HE,
which is already in the dictionary with reference number 28. The
compressor cannot match any further, because *
stops the match. So the compressor matches HE,
outputs the corresponding reference number, and adds the
three-character string HE* to the dictionary. At
this point the output is ( 20 8 5 0 3 1 3 28 ), the unmatched input
is ( * I S ), and the dictionary is ( * A B
C D E F G H I J K L M N O P Q R S T U V W X Y Z TH HE E* *C CA AC
CH HE* ).
Thereafter, nothing novel happens during the final three
characters, except that the compressor has to output the final
reference number for the string it is matching at the time the
input data runs out. The final output is ( 20 8 5 0 3 1 3 28 0 9 19
) and the final dictionary is ( * A B C D E F G H I J K L M
N O P Q R S T U V W X Y Z TH HE E* *C CA AC CH HE* *I IS
). The final dictionary is not needed for anything, so it can
simply be discarded. All necessary information is contained in the
output reference numbers.
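The decompressor can be sketched in the same style. It rebuilds the dictionary one step behind the compressor, each time adding the previous match plus the first character of the current one, mirroring Step 6. One subtlety is handled below: a reference number can equal the current dictionary size when the compressor uses a string in the very step it was added, in which case the entry must be the previous match plus its own first character. (The names are mine; this is a sketch of the basic technique, not Welch's exact formulation.)

```python
# LZW decompression: recreate the dictionary on the fly from the
# reference numbers alone, with no explicit dictionary transmitted.
def lzw_decompress(codes, alphabet):
    dictionary = list(alphabet)      # reference number -> string
    prev = dictionary[codes[0]]
    output = [prev]
    for code in codes[1:]:
        if code < len(dictionary):
            entry = dictionary[code]
        else:
            # The string was added by the compressor in the same step
            # it was used; it must be prev plus prev's first character.
            entry = prev + prev[0]
        dictionary.append(prev + entry[0])   # compressor's Step 6, one step late
        output.append(entry)
        prev = entry
    return "".join(output)
```

Applied to the reference numbers from the example, lzw_decompress([20, 8, 5, 0, 3, 1, 3, 28, 0, 9, 19], "*ABCDEFGHIJKLMNOPQRSTUVWXYZ") recovers THE*CACHE*IS.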
In this example, LZW compressed twelve characters to get eleven
reference numbers. This is not very good compression, especially
considering that each reference number requires more bits of
storage than a character. However, this is also a very short
example. Realistic examples would be thousands of characters long.
On long English ASCII text, LZW typically reduces the size of the
data by at least half.
( 20 8 5 0 3 1 3 28 0 9 19 0 1 0 6 9 6 20 25 30 1 12 9 2 5 18 38
13 13 15 2 15 24 30 15 12 15 18 5 4 0 48 11 29 1 14 0 5 24 20 18 39
18 9 16 29 2 71 84 0 15 87 0 25 5 37 19 21 3 3 91 37 16 1 18 11 38
20 86 14 29 14 9 106 0 16 15 109 104 41 22 29 27 64 29 5 9 7 8 104
13 109 21 20 91 0 14 63 27 105 29 26 51 15 111 113 14 104 19 5 117
72 6 15 21 52 119 5 29 127 14 129 131 23 91 104 8 9 69 0 27 29 23
163 130 86 1 102 76 1 9 12 166 173 29 4 154 152 77 177 37 149 152
34 12 1 19 104 20 23 140 145 147 0 138 18 140 6 154 194 34 31 33 29
36 103 136 132 109 29 112 114 40 150 52 201 203 220 0 156 158 37
133 18 135 15 110 222 141 218 122 124 104 108 233 139 225 128 130
37 160 193 73 12 146 1 20 9 232 166 8 120 40 205 )
The cache originally contained various compression-themed
materials, including:
- Starbucks gift card (compressed alertness)
- Squeeze smiley ball (compressed happiness)
- California geocoin (compressed travelbug)
- Ray Charles CDs (compressed music)
There is no requirement to maintain the theme when trading, but
I'm sure you can think of something.
2/9/2006 — Congratulations to
PurplePeople on the FTF. Thanks for providing a pen and an
interesting bit of
personal history.
2/10/2006 — No, that is not a mistake.
84 is correct. So is the other one. As I said, LZW is
clever.