Tag Archives: search engines

Compression, Speed, and Search Engines

When you think compression, you probably think saving space. I think speed. In this post, I explain why compression is a performance tool, and how it saves tens of millions of dollars and helps speed-up the search process in most modern search engines.

Some background on how search works

Before I can explain how compression delivers speed, I’ll need to explain some of the basics of search data structures. All search engines use an inverted index to support querying by users. An inverted index is a pretty simple idea: it’s kind of like the index at the back of a book. There’s a set of terms you can search for, and a list of places those terms occur.

Let’s suppose you want to learn about the merge sorting algorithm and you pick up your copy of the third volume of Knuth’s Art of Computer Programming book to begin your research. First, you flip to the index at the back, and you do a kind of binary search thing: oops, you went to “q”, that’s too far; flip; oh, “h”, that’s too far the other way; flip; “m”, that’s it; scan, scan, scan; ah ha, “merge sorting”, found it! Now you look and see that the topic is on pages 98, and 158 through 168. You turn to page 98 to get started.

A Simple Inverted Index

An inverted index used in a search engine is similar. Take a look at the picture on the right. What you can see to the left side of it is a structure that contains the searchable terms in the index (the lexicon); in this example I’ve just shown the term cat, and I’ve shown the search structure as a chained hash table. On the right, you can see a list of term occurrences, in this case I’ve just shown the list for the term cat and it shows that cat occurs three times in our collection of documents: in documents numbers 1, 2, and 7.

If a user wants to query our search engine and learn something about a cat, here’s how it works. We look in the search structure to see if we have a matching term. In this case, yes, we find the term cat in the search structure. We then retrieve the list for the term cat, and use the information in the list to compute a ranking of documents that match the query. In this case, we’ve got information about documents 1, 2, and 7. Let’s imagine our ranking function thinks document 7 is the best, document 1 is the next best, and document 2 is the least best. We’d then show information about documents 7, 1, and 2 to the user, and get ready to process our next query. (I’ve simplified things here quite a bit, but that’s not too important in the story I’m going to tell.)

What you need to take away from this section is that inverted indexes are the backbone of search engines. And inverted indexes consist of terms and lists, and the lists are made up of numbers or, more specifically, integers.

Start with the seminal Managing Gigabytes if you’re interested in learning more about inverted indexes.

Compressing Integers

Inverted indexes have two parts: the terms in our searchable lexicon, and the term occurrences in our lists of integers. It turns out that compression isn’t very interesting for the terms, and I’m going to ignore that topic here. Where compression gets interesting is for our lists of integers.

You might be surprised to learn that integer compression is a big deal. There are scholars best known for their work on compressing integers: Solomon Golomb, Robert F. Rice, and Peter Elias immediately come to mind. I spent a few years between 1997 and 2003 largely focusing on integer compression. Here’s a summary I wrote with Justin Zobel on the topic (you can download a PDF here.)

It turns out that one particular type of integer compression, which I refer to as variable-byte compression, works pretty well for storing numbers in inverted index lists (PDF is here). It appears that Google started off using this variable byte scheme, and has now moved to a tuned-up mix of integer compression techniques.

I could spend a whole blog post talking about the pros, cons, and details of the popular integer compression techniques. But, for now, I’ll just say a few words about variable-byte compression. It’s a very simple idea: use 7 bits in every 8 bit byte to store the integer, and use the remain 1 bit to indicate whether this is the last byte that stores the integer or whether another byte follows.

Suppose you want to store the decimal number 1,234.  In binary, it’d be represented as 10011010010. With the variable byte scheme, we take the first (lowest) seven bits (1010010), and we append a 0 to indicate this isn’t the last byte of the integer, and we get 10100100. We then take the remaining four bits of the binary value (1001), pad it out to be seven bits (wasting a little space), and append a to indicate this is the last byte of the integer (00010011). So, all up, we’ve got 000100110100100. When we’re reading this compressed version back, we can recreate the original value by throwing away the “indicator” bits (the ones shown in pink), and concatenating what’s left together: 10011010010. Voila: compression and decompression in a paragraph!

If you store decimal 1,234 using a typical computer architecture, without compression, it’d be typically stored in 4 bytes. Using variable-byte compression, we can store it in two. It isn’t perfect (we wasted 3 bits, and paid no attention to the frequency of different integers in the index), but we saved ourselves two bytes. Overall, variable-byte coding works pretty well.

Why This All Matters

Index size: compressed versus uncompressed

I’ve gone down the path of explaining inverted indexes and integer compression. Why does this all matter?

Take a look at the graph on the right. It shows  the size of the inverted index as a percentage of the collection that’s being searched. The blue bar is when variable byte compression is applied, the black bar is when there’s no compression. All up, the index is about half the size when you use a simple compression scheme. That’s pretty cool, but it’s about to get more interesting.

Take a look at the next graph below. This time, I’m showing you the speed of the search engine, that is, how long it takes to process a query. Without compression, it’s taking about .012 seconds to process a query. With compression, it’s taking about 0.008 seconds to process a query. And what’s really amazing here is that this is when the inverted index is entirely in memory — there’s no disk, or SSD, or network involved. Yes, indeed, compression is making the search engine about 25% faster.

That’s the punchline of this post: compression is a major contributor to making a search engine fast.

Search engine querying speed: compressed versus uncompressed

How’s this possible? It’s actually pretty simple. When you don’t use compression, the time to process the inverted lists really consists of two basic components: moving the list from memory into the CPU cache, and processing the list. The problem is there’s lots of data to move, and that takes time. When you add in compression, you have three costs: moving the data from memory into the CPU cache, decompressing the list, and processing the list. But there’s much less data to move when its compressed, and so it takes a lot less time. And since the decompression is very, very simple, that takes almost no time. So, overall, you win — compression makes the search engine hum.

It gets more exciting when disk gets involved. You will typically see that compression makes the system about twice as fast. Yes, twice as fast at retrieving and processing inverted indexes. How neat is that? All you’ve got to do is add a few lines of code to read and write simple variable bytes, and you’ll save an amazing amount in processing time (or cost in hardware, or both).

Rest assured that these are realistic experiments with realistic data, queries, and a reasonable search engine. All of the details are here. The only critiques I’d offer are these:

  • The ranking function is very, very simple, and so the bottleneck in the system is retrieving data and not the processing of it. In my experience, that’s a reasonably accurate reflection of how search engines work — I/O is the bottleneck, not ranking
  • The experiments are old. In my experience, nothing has changed — if anything, you’ll get even better results now than you did back then

Here is some C code for reading and writing variable-byte integers (scroll to the bottom to where it says vbyte.c). The “writing” code is just 27 fairly simple, not-too-compact lines. The reading code is 17. It’s pretty simple stuff. Feel free to use it.

Is Compression Useful Elsewhere in Search?

Oh yes, indeed. It’s important to use compression wherever you can. Compress data that’s moving over a network. Compress documents. Compress data you’re writing to logs. Think compression equals speed equals a better customer experience equals much less hardware.

Alright, that ends another whirlwind introduction to a search topic. Hope you enjoy it, looking forward to reading your comments. See you next time.

Query Rewriting in Search Engines

There’s countless information on search ranking – creating ranking functions, and their factors such as PageRank and text. Query rewriting is less conspicuous but equally important. Our experience at eBay is that query rewriting has the potential to deliver as much improvement to search as core ranking, and that’s what I’ve seen and heard at other companies.

What is query rewriting?

Let’s start with an example. Suppose a user queries for Gucci handbags at eBay. If we take this literally, the results will be those that have the words Gucci and handbags somewhere in the matching documents. Unfortunately, many great answers aren’t returned. Why?

Consider a document that contains Gucci and handbag, but never uses the plural handbags. It won’t match the query, and won’t be returned. Same story if the document contains Gucci and purse (rather than handbag). And again for a document that contains Gucci but doesn’t contain handbags or a synonym – instead it’s tagged in the “handbags” category on eBay; the user implicitly assumed it’d be returned when a buyer types Gucci handbags as their query.

To solve this problem, we need to do one of two things: add words to the documents so that they match other queries, or add words to the queries so that they match other documents. Query rewriting is the latter approach, and that’s the topic of this post. What I will say about expanding documents is there are tradeoffs: it’s always smart to compute something once in search and store it, rather than compute it for every query, and so there’s a certain attraction to modifying documents once. On the other hand, there are vastly more words in documents than there are words in queries, and doing too much to documents gets expensive and leads to imprecise matching (or returning too many irrelevant documents). I’ve also observed over the years that what works for queries doesn’t always work for documents.

Let’s go back to the example. If the user typed Gucci handbags as the query, we could build a query rewriting system that altered the query silently and ran a query such as Gucci and (handbag or handbags or purse or purses or category:handbags) behind the scenes. The user will see more matches, and hopefully the quality or relevance of the matches will remain as high as without the alteration.

Query rewriting isn’t just about synonyms and plurals. I’ve shared another example where it’s about adding a category constraint in place of a query word. There are many other types of rewrites. Acronyms are another example: expanding NWT to New With Tags, or contracting Stadium Giveaway to SGA. So is dealing with global language variations: organization and organisation are the same. And then there’s conflation: t-shirt, tshirt, and t shirt are equivalent. There are plenty of domain specific examples: US shoe sizes 9.5 and 9 ½ are the same, and if we’re smart we’d know the equivalent UK sizing is 9 and European sizing is 42.5 (but 9 and 9.5 aren’t the same thing when we’re talking about something besides shoes). We can also get clever with brands and products, and common abbreviations: iPad and Apple iPad are the same, and Mac and MacIntosh are the same (when it’s in the context of a computing device or a rain jacket).

Dealing with other languages is a different challenge, but many of the techniques that I’ll discuss later also work in non-English retrieval. Brian Johnson of eBay recently wrote an interesting post about the challenges of German language query rewriting.

Recall versus precision

Query rewriting sounds like a miraculous idea (as long as we have a way of creating the dictionary, which we’ll get to later). However, like everything else in search there is a recall versus precision tradeoff. Recall is the fraction of all relevant documents that are returned by the search engine. Precision is the fraction of the returned results that are relevant.

If our query rewriting system determines that handbag and handbags are the same, it seems likely we’ll get improved recall (more relevant answers) and we won’t hurt precision (the results we see will still be relevant). That’s good. But what about if our system decides handbags and bags are the same: not good news, we’ll certainly get more recall but precision will be poor (since there’ll be many other types of bags). In general, increasing recall trades with decreasing precision – the trick is to figure out the sweet spot where the user is most satisfied. (that’s a whole different topic, one I’ve written about here.)

Query rewriting is like everything else in search: it isn’t perfect, it’s an applied science. But my experience is that query rewriting is a silver bullet: it delivers as much improvement to search relevance as continuous innovation in core ranking.

How do you build query rewriting?

You mine vast quantities of data, and look for patterns.

Suppose we have a way of logging user queries and clicks in our system, and assume we’ve got these to work with in an Hadoop cluster.  We could look for patterns that we think are interesting and might indicate equivalence. Here’s one simple idea: let’s look for a pattern where a user types a query, types another query, and then clicks on a result. What we’re looking for here is cases where a user tried something, wasn’t satisfied (they didn’t click on anything), tried something else, and then showed some satisfaction (they clicked on something). Here’s an example. Suppose a user tried apple iapd 2 [sic], immediately corrected their query to apple ipad 2, and then clicked on the first result.

If we do this at a large scale, we’ll get potentially millions of “wrong query” and “right query” pairs. We could then sort the output, and count the frequency of the pairs.  We might find that “apple iapd 2” and “apple ipad 2” has a moderate frequency (say it occurs tens of times) in our data. Our top frequency pairs are likely to be things like “iphon” and “iphone”, and “ipdo” and “ipod”. (By the way, it turns out that users very rarely mess up the first letter of a word. Errors are much more common at the end of the words.)

Once we’ve got this data, we can use it naively to improve querying. Suppose a future user types apple iapd 2. We would look this query up on our servers in some fast lookup structure that manages the data we produced with our mining. If there’s a match, we might decide to:

  • Silently switch the user’s query from apple iapd 2 to apple ipad 2
  • Make a spelling suggestion to the user: “Did you mean apple ipad 2?”
  • Rewrite the query to be, say, apple AND (iapd OR ipad) AND 2

Let’s suppose in this case that we’re confident the user made a mistake. We’d probably do the former, saving the user a click and a little embarrassment, and deliver great results instead of none (or very few). Pretty neat – our past users have helped a future user be successful. Crowdsourcing at a grand scale!

Getting Clever in Mining

Mining for “query, query, click” patterns is a good start but fairly naïve. Probably a step better than saving the fact apple iapd 2 and apple ipad 2 are same is to save the fact that iapd and ipad are the same. When we see that “query, query, click” pattern, we could look for differences between the adjacent queries and save those – rather than saving the query overall. This is a nice improvement: it’ll help us find many more cases where iapd and ipad are corrected in different contexts, and give us a much higher frequency of the pair, and more confidence that our correction is right.

We could also look for other patterns. For example, we could mine for a “query1, query2, query3, click” pattern. We could use this to find what’s different between query1 and query2, and what’s different between query2 and query3. We could even look for what’s different between query1 and query3, and perhaps give that a little less weight or confidence as a correction than for the adjacent queries.

Once you get started down this path, it’s pretty easy to get excited about other possibilities (what about “query, click, query”?). Rather than mine for every conceivable pattern, a nice way to think about this is as a graph: create a graph that connects queries (represented as nodes) to other queries, and store information about how they’re connected. For example, “apple iapd 2” and “apple ipad 2” might be connected and the edge that connects them annotated with the number of times we observed that correction in the data for all users. Maybe it’s also annotated with the average amount of time it took for users to make the correction. And anything else we think is worth storing that might help us look for interesting query rewrites.

My former colleague Nick Craswell wrote a great paper on this topic of “random walking” click graphs, which has direct application to mining for query alterations.

What can you do with query rewriting?

Once you’ve created a query rewriting dictionary, you can begin to experiment with using it in your search engine. There are many different things you can consider doing:

  • Spelling correction, without messaging to the user
  • Spelling correction, where you tell the user you’ve flipped their query to something else, and offer them a way to get back to the original query
  • A spelling suggestion, but showing the results for the original query
  • Showing a few results from the original query, and then the rest from the query alteration
  • Showing a few results from the query alteration, and then the rest from the original query
  • Running a modified query, where extra terms are added to the user’s query (or perhaps some hints are given to the ranking function about the words in the query and the rewriting system’s confidence in them)
  • Running the original query, but highlighting words in the results that come from the rewriting system
  • Using the query rewriting data in autosuggest or related searches
  • Or something else… really, it is a gold mine of ideas

What are the challenges?

Everything in search comes with its problems. There are recall and precision tradeoffs. And there are just plain mistakes that you need to avoid.

Suppose a user attended Central Michigan University. To 99% of users, CMU is Carnegie Mellon University, but not this user and a good number like him. This user comes to our search engine, runs the query CMU, doesn’t get what he wants (not interested in Carnegie Mellon), types Central Michigan University instead, and clicks on the first result. Our mining will discover that Central Michigan University and CMU are equivalent, and maybe add a query rewrite to our system. This isn’t going to end happy – most users don’t want anything from the less-known CMU. So, we need to be clever when a query has more than one equivalence.

There’s also problems with context. What does HP mean to you? Horsepower? Hewlett-Packard? Homepage? The query rewriting system needs context – we can’t just expand HP to all of them. If the user type HP Printer, it’s Hewlett-Packard. 350 HP engine is horsepower. So, we will need to be a little more sophisticated in how we mine – in many cases, we’ll need to preserve context.

The good news is we can test and learn. That’s the amazing thing about search when you have lots of traffic – you can try things out, mine how users react, and make adjustments. So, the good news is that the query rewrite system is an ecosystem – our first dictionary is our first attempt, and we can watch how users react to it. If we correct HP printer to Horsepower printer, you can bet users won’t click, or they’ll try a different query. We’ll very quickly learn we’ve made a mistake, and we can correct it. If we’ve got the budget, we can even use crowdsourcing to check the accuracy of our corrections before we even try them out on our customers.

Well, there you have it, a whirlwind introduction to query rewriting in search. Looking forward to your comments.