- Reed Nelson
- 30 Jun, 2022
Data compression is a process of modifying the representation of some information so that it can be stored using less data. Storing and transmitting data is expensive in large quantities, making an optimal compression (one which minimizes storage size) extremely valuable. In this post we will learn how to quantify data information-theoretically1, and how to get an optimal compression with optimal speed.
The ratio of concepts and definitions to prose is pretty high in this post, but I try to ameliorate that with examples and alternate descriptions. Assuming a bit of familiarity with binary and ASCII, nothing here should be too dificult to understand. It’s okay to skim over some of the equations, there won’t be a test at the end.
Entropy: Quantifying Information
Let be some set, and be the probability distribution over , i.e. the frequency with which each element of occurs. The entropy of is a measure of how structured (non-uniform) is. So for example, the uniform distribution (where ) is the least structured, and the lowest entropy. Conversely, the distribution where , is the most structured, and has the most entropy.
More intuitively, entropy can be thought of as a measure of how uncertain you are about a random choice from , using . With a set of minimal entropy (one with a uniform distribution), you can do no better than a random guess. But on a set where one element has a probability of occurring, and the rest have probability , you can be certain that the next element to occur will be that one with probability .
Formally, the entropy of is given by .
Remarkably, a choice from contains bits of information per element.
Take . . Then naively we can represent the alphabet using 5 bits. Perhaps . Now we can imagine using this coding on a text file we wrote (of only these 26 characters). The size of this file (ignoring metadata and whatever) is 5 bits per character the number of characters.
Already you might see that this is suboptimal. With 5 bits, you can represent as many as 32 characters (), so we could add 6 more chacters to our alphabet without using any extra data per character. This is an improvement, but still we’re restricting ourselves. What if we used a variable number of bits per character?
A coding of is a unique mapping between the elements of and a set of binary strings. The mapping from before is an example of a coding.
A prefix code is a coding where no coded character is a prefix of another character. For example, a coding with is not a prefix code, because begins with . If we aren’t using fixed-length codings, it’s important to use a prefix code so there isn’t ambiguity about where one character ends and another begins.
Whether a coding will be good or bad at compression depends on its expected length. The expected length of a code is the sum of the probabilities of each character occurring, multiplied by the length of that character’s code. That is, . Practically, this means that on average we expect a message characters long (using this coding) to take up bits.
Of course, the goal of compression is to use a coding of minimal expected length. It is essential to this project that language is not uniform (recall, greater entropy means greater compressability!). In English for example, doesn’t appear with the same frequency as . makes up about of all letters we write, but makes up a mere 2. Intuitively, we want our code to reserve the shortest bit strings for the most common letters, like and , and assign the longer codings to the rare characters, like and . To reiterate, the higher the entropy in a set of characters, the more compressible it is.
There is a theorem that states that there exists an optimal prefix code , such that the expected length of is equal to or only a tiny bit larger than the entropy of the corresponding probability distribution. That is, .
In the 50’s, MIT PhD student David Huffman had to write a paper proving some coding was optimal. He couldn’t figure it out, but extraordinarily, he thought up his own algorithm3 and proved that coding was optimal. Huffman’s professor Robert Fano had previously devised a coding algorithm with4 Claude Shannon himself, so when his student showed him a better algorithm, Fano supposedly5 canceled class for the rest of the term. In essence, this is the Huffman Coding:
Start with the system , and the probability function.
Take of lowest probability in .
Remove and from , and add to a new character , where .
Repeat (1) and (2) until only 1 element remains in .
Build a binary tree for which all the leaves are the original members of , and two nodes share a parent if they were replaced by that parent in step (2).
Huffman’s simple algorithm finds an optimal symbol-by-symbol coding. There are alternate methods of coding which perform better under certain circumstances, but even where suboptimal, Huffman is quite good.
Claude Shannon is the father of Information Theory, and an absolute legend. He wrote the book A Mathematical Theory of Communication, which a professor of mine once described as “one of the most important books in science in the last century”. ↩
David Huffman, A Method for the Construction of Minimum-Redundancy Codes (1952). ↩
I’ve only heard this part from one source. ↩