Compression, Speed, and Search Engines

When you think compression, you probably think saving space. I think speed. In this post, I explain why compression is a performance tool, and how it saves tens of millions of dollars and helps speed-up the search process in most modern search engines.

Some background on how search works

Before I can explain how compression delivers speed, I’ll need to explain some of the basics of search data structures. All search engines use an inverted index to support querying by users. An inverted index is a pretty simple idea: it’s kind of like the index at the back of a book. There’s a set of terms you can search for, and a list of places those terms occur.

Let’s suppose you want to learn about the merge sorting algorithm and you pick up your copy of the third volume of Knuth’s Art of Computer Programming book to begin your research. First, you flip to the index at the back, and you do a kind of binary search thing: oops, you went to “q”, that’s too far; flip; oh, “h”, that’s too far the other way; flip; “m”, that’s it; scan, scan, scan; ah ha, “merge sorting”, found it! Now you look and see that the topic is on pages 98, and 158 through 168. You turn to page 98 to get started.

A Simple Inverted Index

An inverted index used in a search engine is similar. Take a look at the picture on the right. What you can see to the left side of it is a structure that contains the searchable terms in the index (the lexicon); in this example I’ve just shown the term cat, and I’ve shown the search structure as a chained hash table. On the right, you can see a list of term occurrences, in this case I’ve just shown the list for the term cat and it shows that cat occurs three times in our collection of documents: in documents numbers 1, 2, and 7.

If a user wants to query our search engine and learn something about a cat, here’s how it works. We look in the search structure to see if we have a matching term. In this case, yes, we find the term cat in the search structure. We then retrieve the list for the term cat, and use the information in the list to compute a ranking of documents that match the query. In this case, we’ve got information about documents 1, 2, and 7. Let’s imagine our ranking function thinks document 7 is the best, document 1 is the next best, and document 2 is the least best. We’d then show information about documents 7, 1, and 2 to the user, and get ready to process our next query. (I’ve simplified things here quite a bit, but that’s not too important in the story I’m going to tell.)

What you need to take away from this section is that inverted indexes are the backbone of search engines. And inverted indexes consist of terms and lists, and the lists are made up of numbers or, more specifically, integers.

Start with the seminal Managing Gigabytes if you’re interested in learning more about inverted indexes.

Compressing Integers

Inverted indexes have two parts: the terms in our searchable lexicon, and the term occurrences in our lists of integers. It turns out that compression isn’t very interesting for the terms, and I’m going to ignore that topic here. Where compression gets interesting is for our lists of integers.

You might be surprised to learn that integer compression is a big deal. There are scholars best known for their work on compressing integers: Solomon Golomb, Robert F. Rice, and Peter Elias immediately come to mind. I spent a few years between 1997 and 2003 largely focusing on integer compression. Here’s a summary I wrote with Justin Zobel on the topic (you can download a PDF here.)

It turns out that one particular type of integer compression, which I refer to as variable-byte compression, works pretty well for storing numbers in inverted index lists (PDF is here). It appears that Google started off using this variable byte scheme, and has now moved to a tuned-up mix of integer compression techniques.

I could spend a whole blog post talking about the pros, cons, and details of the popular integer compression techniques. But, for now, I’ll just say a few words about variable-byte compression. It’s a very simple idea: use 7 bits in every 8 bit byte to store the integer, and use the remain 1 bit to indicate whether this is the last byte that stores the integer or whether another byte follows.

Suppose you want to store the decimal number 1,234.  In binary, it’d be represented as 10011010010. With the variable byte scheme, we take the first (lowest) seven bits (1010010), and we append a 0 to indicate this isn’t the last byte of the integer, and we get 10100100. We then take the remaining four bits of the binary value (1001), pad it out to be seven bits (wasting a little space), and append a to indicate this is the last byte of the integer (00010011). So, all up, we’ve got 000100110100100. When we’re reading this compressed version back, we can recreate the original value by throwing away the “indicator” bits (the ones shown in pink), and concatenating what’s left together: 10011010010. Voila: compression and decompression in a paragraph!

If you store decimal 1,234 using a typical computer architecture, without compression, it’d be typically stored in 4 bytes. Using variable-byte compression, we can store it in two. It isn’t perfect (we wasted 3 bits, and paid no attention to the frequency of different integers in the index), but we saved ourselves two bytes. Overall, variable-byte coding works pretty well.

Why This All Matters

Index size: compressed versus uncompressed

I’ve gone down the path of explaining inverted indexes and integer compression. Why does this all matter?

Take a look at the graph on the right. It shows  the size of the inverted index as a percentage of the collection that’s being searched. The blue bar is when variable byte compression is applied, the black bar is when there’s no compression. All up, the index is about half the size when you use a simple compression scheme. That’s pretty cool, but it’s about to get more interesting.

Take a look at the next graph below. This time, I’m showing you the speed of the search engine, that is, how long it takes to process a query. Without compression, it’s taking about .012 seconds to process a query. With compression, it’s taking about 0.008 seconds to process a query. And what’s really amazing here is that this is when the inverted index is entirely in memory — there’s no disk, or SSD, or network involved. Yes, indeed, compression is making the search engine about 25% faster.

That’s the punchline of this post: compression is a major contributor to making a search engine fast.

Search engine querying speed: compressed versus uncompressed

How’s this possible? It’s actually pretty simple. When you don’t use compression, the time to process the inverted lists really consists of two basic components: moving the list from memory into the CPU cache, and processing the list. The problem is there’s lots of data to move, and that takes time. When you add in compression, you have three costs: moving the data from memory into the CPU cache, decompressing the list, and processing the list. But there’s much less data to move when its compressed, and so it takes a lot less time. And since the decompression is very, very simple, that takes almost no time. So, overall, you win — compression makes the search engine hum.

It gets more exciting when disk gets involved. You will typically see that compression makes the system about twice as fast. Yes, twice as fast at retrieving and processing inverted indexes. How neat is that? All you’ve got to do is add a few lines of code to read and write simple variable bytes, and you’ll save an amazing amount in processing time (or cost in hardware, or both).

Rest assured that these are realistic experiments with realistic data, queries, and a reasonable search engine. All of the details are here. The only critiques I’d offer are these:

  • The ranking function is very, very simple, and so the bottleneck in the system is retrieving data and not the processing of it. In my experience, that’s a reasonably accurate reflection of how search engines work — I/O is the bottleneck, not ranking
  • The experiments are old. In my experience, nothing has changed — if anything, you’ll get even better results now than you did back then

Here is some C code for reading and writing variable-byte integers (scroll to the bottom to where it says vbyte.c). The “writing” code is just 27 fairly simple, not-too-compact lines. The reading code is 17. It’s pretty simple stuff. Feel free to use it.

Is Compression Useful Elsewhere in Search?

Oh yes, indeed. It’s important to use compression wherever you can. Compress data that’s moving over a network. Compress documents. Compress data you’re writing to logs. Think compression equals speed equals a better customer experience equals much less hardware.

Alright, that ends another whirlwind introduction to a search topic. Hope you enjoy it, looking forward to reading your comments. See you next time.

8 thoughts on “Compression, Speed, and Search Engines

  1. Mica

    Hi Hugh,
    Awesome simple explanation for VARINT8 (Byte-Aligned Variable-length Encodings).
    Could you please do the same and (try) to explain in one paragraph the “Group Varint Encoding” ;-)?

    /Mica

  2. Hugh E. Williams Post author

    Hi Mica, here’s the code for vbyte.c.

    int vbyteread(FILE *fp)
    {
    char tmp = 0x1;
    int val = 0;

    while((tmp & 0x1) == 0x1)
    {
    if (fread(&tmp, sizeof(char), 1, fp) == 0)
    {
    if (feof(fp))
    return(-1);
    else
    return(0);
    }
    val = (val <> 1) & 127);
    }
    return(val);
    }

    int vbytewrite(int number, FILE *fp)
    {
    char bytearray[4];
    char tmp = 0;
    int x, started = FALSE;
    int charswritten = 0;

    for(x=0;x<4;x++)
    {
    tmp = (number%128) <0;x–)
    {
    if (bytearray[x] != 0 || started == TRUE)
    {
    started = TRUE;
    bytearray[x] |= 0x1;
    fwrite(&bytearray[x], sizeof(char), 1, fp);
    charswritten++;
    }
    }
    bytearray[0] |= 0x0;
    fwrite(&bytearray[0], sizeof(char), 1, fp);
    charswritten++;
    return(charswritten);
    }

  3. Pingback: Hello World | Jeff Plumb

  4. Pingback: Variable Byte Integer Encoding | Jeff Plumb

  5. Ardent Logophile

    Found this post today with the help of the ping back on your bio page! I am sorry I missed it until today! Awesome post!

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s