# Armchair Guide to Data Compression

Data compression is used to represent information in less space than its original representation. This post explains the basics of lossless compression, and hopefully helps you think about what happens when you next compress data.

Similarly to my post on hash tables, it’s aimed at software folks who’ve forgotten about compression, or folks just interested in how software works; if you’re a compression expert, you’re in the wrong place.

# A Simple Compression Example

Suppose you have four colored lights: red, green, blue, and yellow. These lights flash in a repeating pattern: red, red, green, red, red, blue, red, red, yellow, yellow (rrgrrbrryy). You think this is neat, and you want to send a message to a friend and share the light flashing sequence.

You know the binary (base 2) numbering system, and you know that you could represent each of the lights in two digits: 00 (for red, 0 in decimal), 01 (for green, 1 in decimal), 10 (for blue, 2 in decimal), and 11 (for yellow, 3 in decimal). To send the message rrgrrbrryy to your friend, you could therefore send them this message: 00 00 01 00 00 10 00 00 11 11 (of course, you don’t send the spaces, I’ve just included them to make the message easy for you to read).

Data compression. It’s hard to find a decent picture!

You also need to send a key or dictionary, so your friend can decode the message. You only need to send this once, even if the lights flash in a new sequence and you want to share that with your friend. The dictionary is: red green blue yellow. Your friend will be smart enough to know that this implies red is 00, green is 01, and so on.

Sending the message takes 20 bits: two bits per light flash and ten flashes in total (plus the one-off cost of sending the dictionary, which I’ll ignore for now). Let’s call that the original representation.

Alright, let’s do some simple compression. Notice that red flashes six times, yellow twice, and the other colors once each. Sending those red flashes is taking twelve bits, four for sending yellow, and the others a total of four for the total of twenty bits. If we’re smart, what we should be doing is sending the code for red in one bit (instead of two) since it’s sent often, and using more than two bits for green and blue since they’re sent once each. If we could send red in one bit, it’d cost us six bits to send the six red flashes, saving a total of six bits over the two-bit version.

How about we try something simple? Let’s represent red as 0. Let’s represent yellow as 10, green as 110, and blue as 1110 (this is a simple unary counting scheme). Why’d I use that scheme? Well, it’s a simple idea: sort the color flashes by decreasing frequency (red, yellow, green, blue), and assign increasingly longer codes to the colors in a very simple way: you can count 1s until you see a 0, and then you have a key you can use to look in the dictionary. When we see just a 0, we can look in the dictionary to find that seeing zero 1s means red. When we see 1110, we can look in the dictionary to find that seeing three 1s means blue.

Here’s what our original twenty-bit sequence would now look like: 0 0 110 0 0 1110 0 0 10 10. That’s a total of 17 bits, a saving of 3 bits — we’ve compressed the data! Of course, we need to send our friend the dictionary too: red yellow green blue.

It turns out we can do better than this using Huffman coding. We could assign 0 to red, 10 to yellow, 110 to blue, and 111 to green. Our message would then be 0 0 111 0 0 110 0 0 10 10. That’s 16 bits, 1 bit better than our simple scheme (and, again, we don’t need to send the spaces). We’d also need to share the dictionary: red <blank> yellow <blank> blue green to show that 0 is red, 1 isn’t anything, yellow is 10, 11 isn’t anything, blue is 110, and green is 111. A slightly more complicated dictionary for better message compression.

# Semi-Static Compression

Our examples are two semi-static compression schemes. The dictionary is static, it doesn’t change. However, it’s built from a single-pass over the data to learn about the frequencies — so the dictionary is dependent on the data. For that reason, I’ll call our two schemes semi-static schemes.

Huffman coding (or minimum-redundancy coding) is the most famous example of a semi-static scheme.

Semi-static schemes have at least three interesting properties:

1. They require two passes over the data: one to build the dictionary, and another to emit the compressed representation
2. They’re data-dependent, meaning that the dictionary is built based on the symbols and frequencies in the original data. A dictionary that’s derived from one data set isn’t optimal for a different data set (one with different frequencies) — for example, if you figured out the dictionary for Shakespeare’s works, it isn’t going to be optimal for compressing War and Peace
3. You need to send the dictionary to the recipient, so the message can be decoded; lots of folks forget to include this cost when they share the compression ratios or savings they’re seeing, don’t do that

It doesn’t matter what kind of compression scheme you choose, you need to decide what the symbols are. For example, you could choose letters, words, or even phrases from English text.

# Static Compression

Morse code is an example of a (fairly lame) static compression scheme. The dictionary is universally known (and doesn’t need to be communicated), but the dictionary isn’t derived from the input data. This means that the compression ratios you’ll see are at best the same as you’ll see from a semi-static compression scheme, and usually worse.

There aren’t many static compression schemes in widespread use. Two examples are Elias gamma and delta codes, which are used to compress inputs that consist of only integer values.

The drawbacks of semi-static compression schemes are two-fold: you need to process the data twice, and they don’t adapt to local regions in the data where the frequencies of symbols might vary from the overall frequencies. Imagine, for example, that you’re compressing a very large image: you might find a region of blue sky, where there’s only a few blue colors. If you built a dictionary for only that blue section, you’d get a different (and better) dictionary than the one you’d get for the whole image.

Here’s the idea behind adaptive compression. Build the dictionary as you go: process the input, and see if you can find it in the (initially empty) dictionary. If you can’t, add the input to the dictionary. Now emit the compressed code, and keep on processing the input. In this way, your dictionary adapts as you go, and you only have to process the data once to create the dictionary and compress the data. The most famous example is LZW compression, which is used in many of the compression tools you’re probably using: gzip, pkzip, GIF images, and more.

Adaptive schemes have at least three interesting properties:

1. One pass over the data creates the dictionary and the compressed representation, an advantage over semi-static schemes
2. Adaptive schemes get better compression with the more input they see: since they don’t approximate the global symbol frequencies until lots of input has been processed, they’re usually much less optimal than semi-static schemes for small inputs
3. You can’t randomly access the compressed data, since the dictionary is derived from processing the data sequentially from the start. This is one good reason why folks use static and semi-static schemes for some applications

The taxonomy I’ve presented is for lossless compression schemes — those that are used to compress and decompress data such that you get an exact copy of the original input. Lossy compression schemes don’t guarantee that: they’re schemes where some of the input is thrown away by approximation, and the decompressed representation isn’t guaranteed to be the same as the input. A great example is JPEG image compression: it’s an effective way to store an image, by throwing away (hopefully) unimportant data and approximating it instead. The MP3 music file format and the MP4 video format (usually) do the same thing.

In general, lossy compression schemes are more compact than lossless schemes. That’s why, for example, GIF image files are often much larger than JPEG image files.

Hope you learnt something useful. Tell me if you did or didn’t. See you next time.

# Compression, Speed, and Search Engines

When you think compression, you probably think saving space. I think speed. In this post, I explain why compression is a performance tool, and how it saves tens of millions of dollars and helps speed-up the search process in most modern search engines.

Some background on how search works

Before I can explain how compression delivers speed, I’ll need to explain some of the basics of search data structures. All search engines use an inverted index to support querying by users. An inverted index is a pretty simple idea: it’s kind of like the index at the back of a book. There’s a set of terms you can search for, and a list of places those terms occur.

Let’s suppose you want to learn about the merge sorting algorithm and you pick up your copy of the third volume of Knuth’s Art of Computer Programming book to begin your research. First, you flip to the index at the back, and you do a kind of binary search thing: oops, you went to “q”, that’s too far; flip; oh, “h”, that’s too far the other way; flip; “m”, that’s it; scan, scan, scan; ah ha, “merge sorting”, found it! Now you look and see that the topic is on pages 98, and 158 through 168. You turn to page 98 to get started.

A Simple Inverted Index

An inverted index used in a search engine is similar. Take a look at the picture on the right. What you can see to the left side of it is a structure that contains the searchable terms in the index (the lexicon); in this example I’ve just shown the term cat, and I’ve shown the search structure as a chained hash table. On the right, you can see a list of term occurrences, in this case I’ve just shown the list for the term cat and it shows that cat occurs three times in our collection of documents: in documents numbers 1, 2, and 7.

If a user wants to query our search engine and learn something about a cat, here’s how it works. We look in the search structure to see if we have a matching term. In this case, yes, we find the term cat in the search structure. We then retrieve the list for the term cat, and use the information in the list to compute a ranking of documents that match the query. In this case, we’ve got information about documents 1, 2, and 7. Let’s imagine our ranking function thinks document 7 is the best, document 1 is the next best, and document 2 is the least best. We’d then show information about documents 7, 1, and 2 to the user, and get ready to process our next query. (I’ve simplified things here quite a bit, but that’s not too important in the story I’m going to tell.)

What you need to take away from this section is that inverted indexes are the backbone of search engines. And inverted indexes consist of terms and lists, and the lists are made up of numbers or, more specifically, integers.

Compressing Integers

Inverted indexes have two parts: the terms in our searchable lexicon, and the term occurrences in our lists of integers. It turns out that compression isn’t very interesting for the terms, and I’m going to ignore that topic here. Where compression gets interesting is for our lists of integers.

You might be surprised to learn that integer compression is a big deal. There are scholars best known for their work on compressing integers: Solomon Golomb, Robert F. Rice, and Peter Elias immediately come to mind. I spent a few years between 1997 and 2003 largely focusing on integer compression. Here’s a summary I wrote with Justin Zobel on the topic (you can download a PDF here.)

It turns out that one particular type of integer compression, which I refer to as variable-byte compression, works pretty well for storing numbers in inverted index lists (PDF is here). It appears that Google started off using this variable byte scheme, and has now moved to a tuned-up mix of integer compression techniques.

I could spend a whole blog post talking about the pros, cons, and details of the popular integer compression techniques. But, for now, I’ll just say a few words about variable-byte compression. It’s a very simple idea: use 7 bits in every 8 bit byte to store the integer, and use the remain 1 bit to indicate whether this is the last byte that stores the integer or whether another byte follows.

Suppose you want to store the decimal number 1,234.  In binary, it’d be represented as 10011010010. With the variable byte scheme, we take the first (lowest) seven bits (1010010), and we append a 0 to indicate this isn’t the last byte of the integer, and we get 10100100. We then take the remaining four bits of the binary value (1001), pad it out to be seven bits (wasting a little space), and append a to indicate this is the last byte of the integer (00010011). So, all up, we’ve got 000100110100100. When we’re reading this compressed version back, we can recreate the original value by throwing away the “indicator” bits (the ones shown in pink), and concatenating what’s left together: 10011010010. Voila: compression and decompression in a paragraph!

If you store decimal 1,234 using a typical computer architecture, without compression, it’d be typically stored in 4 bytes. Using variable-byte compression, we can store it in two. It isn’t perfect (we wasted 3 bits, and paid no attention to the frequency of different integers in the index), but we saved ourselves two bytes. Overall, variable-byte coding works pretty well.

Why This All Matters

Index size: compressed versus uncompressed

I’ve gone down the path of explaining inverted indexes and integer compression. Why does this all matter?

Take a look at the graph on the right. It shows  the size of the inverted index as a percentage of the collection that’s being searched. The blue bar is when variable byte compression is applied, the black bar is when there’s no compression. All up, the index is about half the size when you use a simple compression scheme. That’s pretty cool, but it’s about to get more interesting.

Take a look at the next graph below. This time, I’m showing you the speed of the search engine, that is, how long it takes to process a query. Without compression, it’s taking about .012 seconds to process a query. With compression, it’s taking about 0.008 seconds to process a query. And what’s really amazing here is that this is when the inverted index is entirely in memory — there’s no disk, or SSD, or network involved. Yes, indeed, compression is making the search engine about 25% faster.

That’s the punchline of this post: compression is a major contributor to making a search engine fast.

Search engine querying speed: compressed versus uncompressed

How’s this possible? It’s actually pretty simple. When you don’t use compression, the time to process the inverted lists really consists of two basic components: moving the list from memory into the CPU cache, and processing the list. The problem is there’s lots of data to move, and that takes time. When you add in compression, you have three costs: moving the data from memory into the CPU cache, decompressing the list, and processing the list. But there’s much less data to move when its compressed, and so it takes a lot less time. And since the decompression is very, very simple, that takes almost no time. So, overall, you win — compression makes the search engine hum.

It gets more exciting when disk gets involved. You will typically see that compression makes the system about twice as fast. Yes, twice as fast at retrieving and processing inverted indexes. How neat is that? All you’ve got to do is add a few lines of code to read and write simple variable bytes, and you’ll save an amazing amount in processing time (or cost in hardware, or both).

Rest assured that these are realistic experiments with realistic data, queries, and a reasonable search engine. All of the details are here. The only critiques I’d offer are these:

• The ranking function is very, very simple, and so the bottleneck in the system is retrieving data and not the processing of it. In my experience, that’s a reasonably accurate reflection of how search engines work — I/O is the bottleneck, not ranking
• The experiments are old. In my experience, nothing has changed — if anything, you’ll get even better results now than you did back then

Here is some C code for reading and writing variable-byte integers (scroll to the bottom to where it says vbyte.c). The “writing” code is just 27 fairly simple, not-too-compact lines. The reading code is 17. It’s pretty simple stuff. Feel free to use it.

Is Compression Useful Elsewhere in Search?

Oh yes, indeed. It’s important to use compression wherever you can. Compress data that’s moving over a network. Compress documents. Compress data you’re writing to logs. Think compression equals speed equals a better customer experience equals much less hardware.

Alright, that ends another whirlwind introduction to a search topic. Hope you enjoy it, looking forward to reading your comments. See you next time.