Armchair Guide to Data Compression

Data compression is used to represent information in less space than its original representation. This post explains the basics of lossless compression, and hopefully helps you think about what happens when you next compress data.

Like my post on hash tables, it’s aimed at software folks who’ve forgotten about compression, or folks just interested in how software works; if you’re a compression expert, you’re in the wrong place.

A Simple Compression Example

Suppose you have four colored lights: red, green, blue, and yellow. These lights flash in a repeating pattern: red, red, green, red, red, blue, red, red, yellow, yellow (rrgrrbrryy). You think this is neat, and you want to send a message to a friend and share the light flashing sequence.

You know the binary (base 2) numbering system, and you know that you could represent each of the lights in two digits: 00 (for red, 0 in decimal), 01 (for green, 1 in decimal), 10 (for blue, 2 in decimal), and 11 (for yellow, 3 in decimal). To send the message rrgrrbrryy to your friend, you could therefore send them this message: 00 00 01 00 00 10 00 00 11 11 (of course, you don’t send the spaces, I’ve just included them to make the message easy for you to read).
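
To make this concrete, here’s a sketch of the fixed two-bit scheme in Python (the function and variable names are my own, not anything standard):

    # The fixed-width dictionary: every light gets a two-bit code.
    FIXED_CODES = {"r": "00", "g": "01", "b": "10", "y": "11"}

    def encode_fixed(sequence):
        """Encode each flash as its fixed two-bit code."""
        return "".join(FIXED_CODES[flash] for flash in sequence)

    print(encode_fixed("rrgrrbrryy"))  # 00000100001000001111 (20 bits)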

Data compression. It’s hard to find a decent picture!

You also need to send a key or dictionary, so your friend can decode the message. You only need to send this once, even if the lights flash in a new sequence and you want to share that with your friend. The dictionary is: red green blue yellow. Your friend will be smart enough to know that this implies red is 00, green is 01, and so on.
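
And here’s how your friend might decode the message with that shared dictionary (again, just a sketch with names of my own choosing):

    # The shared dictionary; its order implies the codes 00, 01, 10, 11.
    DICTIONARY = ["red", "green", "blue", "yellow"]

    def decode_fixed(bits):
        """Read two bits at a time and look each value up in the dictionary."""
        return [DICTIONARY[int(bits[i:i + 2], 2)] for i in range(0, len(bits), 2)]

    print(decode_fixed("00000100001000001111"))
    # ['red', 'red', 'green', 'red', 'red', 'blue', 'red', 'red', 'yellow', 'yellow']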

Sending the message takes 20 bits: two bits per light flash and ten flashes in total (plus the one-off cost of sending the dictionary, which I’ll ignore for now). Let’s call that the original representation.

Alright, let’s do some simple compression. Notice that red flashes six times, yellow twice, and the other colors once each. Sending the red flashes takes twelve bits, the yellow flashes take four, and green and blue take two each, for the total of twenty bits. If we’re smart, we should send the code for red in one bit (instead of two), since it’s sent often, and use more than two bits for green and blue, since they’re each sent only once. If we could send red in one bit, it’d cost us six bits to send the six red flashes, saving six bits over the two-bit version.

How about we try something simple? Let’s represent red as 0, yellow as 10, green as 110, and blue as 1110 (a simple unary counting scheme). Why that scheme? Well, it’s a simple idea: sort the colors by decreasing flash frequency (red, yellow, green, blue), and assign increasingly longer codes in a very simple way: count 1s until you see a 0, and you have a key you can use to look in the dictionary. When we see just a 0, we look in the dictionary and find that zero 1s means red. When we see 1110, we find that three 1s means blue.

Here’s what our original twenty-bit sequence would now look like: 0 0 110 0 0 1110 0 0 10 10. That’s a total of 17 bits, a saving of 3 bits — we’ve compressed the data! Of course, we need to send our friend the dictionary too: red yellow green blue.
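
Here’s a sketch of that count-the-1s scheme in Python, decoder included, so you can verify the 17-bit figure yourself (the names are mine):

    # Codes for the colors, sorted by decreasing frequency: the code for
    # rank n is n 1s followed by a 0.
    UNARY_CODES = {"r": "0", "y": "10", "g": "110", "b": "1110"}
    BY_RANK = ["r", "y", "g", "b"]

    def encode_unary(sequence):
        return "".join(UNARY_CODES[flash] for flash in sequence)

    def decode_unary(bits):
        out, ones = [], 0
        for bit in bits:
            if bit == "1":
                ones += 1
            else:                          # a 0 ends the code; ones = rank
                out.append(BY_RANK[ones])
                ones = 0
        return "".join(out)

    message = encode_unary("rrgrrbrryy")
    print(message, len(message))           # 00110001110001010 17
    assert decode_unary(message) == "rrgrrbrryy"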

It turns out we can do better than this using Huffman coding. We could assign 0 to red, 10 to yellow, 110 to blue, and 111 to green. Our message would then be 0 0 111 0 0 110 0 0 10 10. That’s 16 bits, 1 bit better than our simple scheme (and, again, we don’t send the spaces). We’d also need to share the dictionary: red <blank> yellow <blank> blue green, to show that 0 is red, 1 on its own isn’t anything, 10 is yellow, 11 on its own isn’t anything, 110 is blue, and 111 is green. A slightly more complicated dictionary in exchange for better message compression.
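
If you’re curious where those codes come from, here’s a textbook Huffman construction in Python (my own sketch, not a library). The exact bit patterns depend on how ties are broken, but the code lengths, and therefore the 16-bit total, come out the same:

    import heapq
    from collections import Counter

    def huffman_codes(data):
        # Each heap entry is [frequency, tie-breaker, [[symbol, code], ...]];
        # the unique tie-breaker stops heapq comparing the nested lists.
        heap = [[freq, i, [[sym, ""]]]
                for i, (sym, freq) in enumerate(Counter(data).items())]
        heapq.heapify(heap)
        i = len(heap)
        while len(heap) > 1:
            lo = heapq.heappop(heap)       # merge the two rarest subtrees:
            hi = heapq.heappop(heap)
            for pair in lo[2]:
                pair[1] = "0" + pair[1]    # 0 prefixes one subtree's codes,
            for pair in hi[2]:
                pair[1] = "1" + pair[1]    # 1 prefixes the other's
            heapq.heappush(heap, [lo[0] + hi[0], i, lo[2] + hi[2]])
            i += 1
        return {sym: code for sym, code in heap[0][2]}

    codes = huffman_codes("rrgrrbrryy")
    message = "".join(codes[flash] for flash in "rrgrrbrryy")
    print(codes)          # e.g. {'y': '00', 'g': '010', 'b': '011', 'r': '1'}
    print(len(message))   # 16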

Semi-Static Compression

Our two examples are semi-static compression schemes. The dictionary is static: it doesn’t change. However, it’s built from a first pass over the data to learn the symbol frequencies, so the dictionary depends on the data. For that reason, I’ll call these semi-static schemes.

Huffman coding (or minimum-redundancy coding) is the most famous example of a semi-static scheme.

Semi-static schemes have at least three interesting properties:

  1. They require two passes over the data: one to build the dictionary, and another to emit the compressed representation
  2. They’re data-dependent, meaning that the dictionary is built based on the symbols and frequencies in the original data. A dictionary that’s derived from one data set isn’t optimal for a different data set (one with different frequencies) — for example, if you figured out the dictionary for Shakespeare’s works, it isn’t going to be optimal for compressing War and Peace
  3. You need to send the dictionary to the recipient so the message can be decoded; lots of folks forget to include this cost when they share the compression ratios or savings they’re seeing. Don’t do that

Whatever compression scheme you choose, you need to decide what the symbols are. For example, you could choose letters, words, or even phrases from English text.

Static Compression

Morse code is an example of a (fairly lame) static compression scheme. The dictionary is universally known (and doesn’t need to be communicated), but it isn’t derived from the input data. This means the compression ratios you’ll see are at best the same as those from a semi-static scheme, and usually worse.

There aren’t many static compression schemes in widespread use. Two examples are the Elias gamma and delta codes, which are used to compress inputs that consist only of integer values.
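
As a taste, here’s a sketch of an Elias gamma coder. The code for a positive integer n is its binary form preceded by one zero per extra bit, so the decoder knows how many bits to read, and small integers get short codes no matter what the data looks like:

    def elias_gamma(n):
        """Elias gamma code: floor(log2 n) zeros, then n in binary."""
        if n < 1:
            raise ValueError("gamma codes are defined for integers >= 1")
        binary = bin(n)[2:]                # e.g. 9 -> "1001"
        return "0" * (len(binary) - 1) + binary

    for n in (1, 2, 3, 4, 9):
        print(n, elias_gamma(n))           # 1, 010, 011, 00100, 0001001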

Adaptive Compression

The drawbacks of semi-static compression schemes are two-fold: you need to process the data twice, and they don’t adapt to local regions in the data where the symbol frequencies differ from the overall frequencies. Imagine, for example, that you’re compressing a very large image: you might find a region of blue sky where there are only a few shades of blue. If you built a dictionary for just that blue section, you’d get a different (and better) dictionary than the one you’d get for the whole image.

Here’s the idea behind adaptive compression. Build the dictionary as you go: process the input, and see if the current run of symbols is in the dictionary. If it isn’t, add it to the dictionary. Now emit a compressed code, and keep on processing the input. In this way, your dictionary adapts as you go, and you only have to process the data once to create the dictionary and compress the data. The most famous examples are in the Lempel-Ziv family: LZW, which is used in GIF images and the Unix compress tool, and its close relatives (such as LZ77, the basis of DEFLATE), which power tools you’re probably using, such as gzip and pkzip.
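
Here’s a sketch of the LZW encoding loop (simplified: I emit codes as a list of integers rather than packing them into bits). Notice how the repeated rr earns its own dictionary entry and later compresses to a single code:

    def lzw_encode(text):
        # LZW seeds its dictionary with every single symbol up front
        # (here, the 256 possible byte values), then learns longer strings.
        dictionary = {chr(i): i for i in range(256)}
        current, output = "", []
        for ch in text:
            if current + ch in dictionary:
                current += ch                       # keep extending the match
            else:
                output.append(dictionary[current])  # emit the longest match...
                dictionary[current + ch] = len(dictionary)  # ...learn a new string
                current = ch
        if current:
            output.append(dictionary[current])
        return output

    print(lzw_encode("rrgrrbrryy"))
    # [114, 114, 103, 256, 98, 256, 121, 121] -- code 256 is the learnt "rr"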

Adaptive schemes have at least three interesting properties:

  1. One pass over the data creates the dictionary and the compressed representation, an advantage over semi-static schemes
  2. Adaptive schemes compress better the more input they see: since they can’t approximate the global symbol frequencies until they’ve processed plenty of input, they’re usually much less effective than semi-static schemes on small inputs
  3. You can’t randomly access the compressed data, since the dictionary is derived from processing the data sequentially from the start. This is one good reason why folks use static and semi-static schemes for some applications

What about lossy compression?

The taxonomy I’ve presented is for lossless compression schemes — those that are used to compress and decompress data such that you get an exact copy of the original input. Lossy compression schemes don’t guarantee that: they’re schemes where some of the input is thrown away by approximation, and the decompressed representation isn’t guaranteed to be the same as the input. A great example is JPEG image compression: it’s an effective way to store an image, by throwing away (hopefully) unimportant data and approximating it instead. The MP3 music file format and the MP4 video format (usually) do the same thing.

In general, lossy compression schemes produce smaller outputs than lossless schemes. That’s why, for example, a photo saved as a (lossless) GIF is often much larger than the same photo saved as a (lossy) JPEG.

Hope you learnt something useful. Tell me if you did or didn’t. See you next time.
