Ranking at eBay (Part #1)

Search ranking is the science of ordering search results from most- to least-relevant in response to user queries. In the case of eBay, the dominant user need is to find a great deal on something they want to purchase. And eBay search’s goal is to do a great job of finding relevant results in response to those customer needs.

eBay is amazingly dynamic. Around 10% of the 300+ million items for sale end each day (they sell or expire unsold), and a new 10% is listed. A large fraction of items have updates: they get bids, prices change, sellers revise descriptions, buyers watch, buyers offer, buyers ask questions, and so on. We process tens of millions of change events on items in a typical day; that is, our search engine receives that many signals that something important about an item has changed and should be reflected in search ranking. And all that is happening while we process around 250 million queries on a typical day.

In this post, I explain what makes eBay’s search ranking problem unique and complex. I’m aiming here to give you a sense of why we’ve built a custom search engine, and the types of technical search ranking challenges we’re dealing with as we rebuild search at eBay. Next week, I’ll continue this post and offer a few insights into how we’re working on the problem.

What’s different about eBay

Here are a few significantly different facets of eBay’s search problem space:

  1. Under typical load, it takes around 90 seconds from an item being listed by an eBay seller to when it can be found using the search engine. The same is true for any change that affects eBay’s search ranking — for example, if the number of sales of a fixed price multi-quantity item changes, it’s about 90 seconds until that count is updated in our index and can be used in search ranking. Even to an insider, that’s pretty impressive: there’s probably no other search engine that handles inserts, updates, and deletes at the scale and speed that eBay does. (I’ll explain real-time index updates in detail in a future post, but here’s a paper on the topic if you’d like to know more now.)
  2. In web search, there are many stable signals. Most documents persist and they don’t change very much. The link graph between documents on the web is reasonably stable; for example, my home page will always link to my blog, and my blog posts have links embedded in them that persist and lead to places on the web. All of this means that a web search engine can compute information about documents and their relationships, and use that as a strong signal in ranking. The same isn’t true of auction items at eBay (which are live for between 1 and 7 days), and it’s less true of fixed price items (many of which are live for only 30 days) — the link graph isn’t very valuable and static pages aren’t common at eBay
  3. eBay is an ecosystem, and not a search-and-leave search engine. The most important problem that web search engines solve is getting you somewhere else on the web — you run a query, you click on a link and you’re gone. eBay’s different: you run a query, you click on a link, and you’re typically still at eBay and interacting with a product, item, or hub page on eBay. This means that at eBay we know much more than at a web search engine: we know what our users are doing before and after they search, and have a much richer data set to draw from to build search ranking algorithms.
  4. Web search is largely unstructured. It’s mostly about searching blobs of text that form documents, and finding the highest-precision matches. eBay certainly has plenty of text in its items and products, but there’s much more structure in the associated information. For example, items are listed in categories, and categories have a hierarchy. We also “paint” information on items as they’re listed, in the form of attribute:value pairs; for example, if you list a men’s shirt, we might paint on the item that it is color:green, size:small, and brand:american apparel. We also often know which product an item corresponds to: this is most often the case for listings that are books, DVDs, popular electronics, and motors. Net, eBay search isn’t just about matching text to blobs of text; it’s about matching text or preferences to structured information
  5. Anyone can author a web document, or create a web site. And it’ll happily be crawled by a search engine, perhaps indexed (depending on what the engine decides to put in its index), and perhaps available to be found. At eBay, sellers create listings (and sometimes products), and everything is always searchable (usually within 90 seconds under typical conditions). And we know much more about our sellers than a web search engine knows about its page authors
  6. We also know a lot about our buyers. A good fraction of the customers that search at eBay are logged in, or have cookies in their browser that identify them. Companies like Google and Microsoft also customize their search for their users when they are logged in (arguably, they do a pretty bad job of it — perhaps a post for another time too). The difference between web search and eBay is that we have information about our buyers’ purchase history, preferred categories, preferred buying formats, preferred sellers, what they’re watching, bidding on, and much more
  7. Almost every item and product has an image, and images play a key role in making purchase decisions (particularly for non-commodity products). We present images in our search results

There are more differences and challenges than these, but my goal here is to give you a taste, not an exhaustive list.

Who has similar problems?

Twitter is probably the closest analog technically to eBay:

  • They make use of changing signals in their ranking and so have to update their search indexes in near real-time too. But it’s not possible to edit a tweet and they don’t yet use clicks in ranking, so that means there’s probably much less updating going on than at eBay
  • Twitter explains that tweet rates go from around 2,000 per second to between 6,000 and 8,000 per second when there is a major event. eBay tends to have signals that change very quickly for a single item as it gets very close to ending (perhaps that’s similar to retweet characteristics). In both cases, signals about individual items are important in ranking those items, and those signals change quickly (whether they’re tweets or eBay items)
  • Twitter is largely an ecosystem like eBay (though many tweets contain links to external web sites)
  • Twitter makes everything searchable like eBay, though they typically truncate the result list and return only the top matches (with a link to see all matches). eBay shows you all the matches by default (you can argue whether or not we should)
  • Twitter doesn’t really have structured data in the sense that eBay does
  • Twitter isn’t as media rich as eBay
  • Twitter probably knows much less about their users’ buying and selling behaviors

(Thanks to Twitter engineering manager Krishna Gade for the links.)

Large commerce search engines (Amazon, Bestbuy, Walmart, and so on) bear similarity too: they are ecosystems, they have structure, they know about their buyers, they have imagery, and they probably search everything. The significant differences are that they mostly sell products rather than unique items, and they have vastly fewer sellers. They are also typically dominated by multi-quantity items (for example, a thousand copies of a book). The implication is that there is likely far less data to search, comparatively few index update issues, much less inventory that ends, much less diversity, and far fewer changing signals about the things they sell. That makes the technical search challenge vastly different; on the surface it seems simpler than eBay’s, though there are likely challenges I don’t fully appreciate.

Next week, I’ll continue this post by explaining how we think about ranking at eBay, and explain the framework we use for innovation in search.

Clicks in search

Have you heard of the Pareto principle? It’s the idea that 80% of sales come from 20% of customers, or that the richest 20% of people control 80% of the world’s wealth.

How about George K. Zipf? The author of “Human Behavior and the Principle of Least Effort” and “The Psycho-Biology of Language” is best known for “Zipf’s Law”, the observation that the frequency of a word is inversely proportional to its frequency rank. Over-simplifying a little, the word “the” is about twice as frequent as the word “of”, and then comes “and”, and so on. The same pattern applies to the populations of cities, the sizes of corporations, and many other natural phenomena.

I’ve spent time understanding, and publishing work on, how Zipf’s Law applies in search engines. And the punchline in search is that the Pareto principle and Zipf’s Law are hard at work: the first item in a list gets about twice as many clicks as the second, and so on. There are inverse power law distributions everywhere.
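
To see the idealized shape, here’s a tiny Python sketch that normalizes a pure 1/rank law over the first ten positions. It isn’t eBay data, just the textbook curve, but it shows position 1 getting roughly twice the click share of position 2.

```python
# Illustrative only: click share under an idealized Zipf (1/rank) law.
# Real click curves have their own exponent; this just shows the shape.
positions = list(range(1, 11))
weights = [1.0 / rank for rank in positions]   # click mass proportional to 1/rank
total = sum(weights)

for rank, weight in zip(positions, weights):
    # Position 1 comes out at about 34% of clicks, position 2 at about 17%.
    print(f"position {rank:2d}: {weight / total:5.1%} of clicks")
```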

The eBay Search Results Click Curve

Here’s the eBay search results click curve, averaged over a very large number of queries. The y-axis is the total number of clicks on each result position, and the x-axis is the result position. For example, you can see that the first result in search (the top result, the first item you see when you run a query) gets about eight times as many clicks on average as the fifteenth result. The x-axis is labelled from 1 to 200, which is typically four pages of eBay results since we show 50 results per page by default.

eBay Click Curve. The y-axis is number of clicks per result, and the x-axis is the result position.

As a search guy, I’m not surprised by this curve (more on that topic later). It’s a typical inverse power law distribution (a “Zipf’s Law” distribution). But there are a couple of interesting quirks.

Take a look at the little bump around result position 50 on the x-axis. Why’s that there? What’s happening is that after scrolling for a while through the results, many users scroll to the very bottom of the page. They then inspect the final few results on the page (results 46 to 50), just above the pagination control. Those final few results therefore get a few more clicks than the ones above that the user skipped. Again, this isn’t a surprise to me — you’ll often see little spikes after user scroll points (in web search, you’ll typically see a spike at result 6 or 7 on a 10-result page).

I’ve blown up the first ten positions a little more so that you can see the inverse power law distribution.

Search click curve for the first 10 results in eBay search.

You can see that result 1 gets about 4 times as many clicks as result 10. You can also see that result 2 gets about 5/9ths as many clicks as result 1. This is pretty typical — it’s what you’d expect to see when search is working properly.

Interestingly, even if you randomize the first few results, you’ll still see a click curve that has an inverse power law distribution. Result 1 will almost always get more clicks than result 2, even if it’s less relevant.

Click Curves Are Everywhere

Here are some other examples of inverse power law distributions that you’ll typically see in search:

  • The query curve. The most popular query is much more popular than the second most popular query, and so on. The top 20% of queries account for at least 80% of the searches. That’s why caching works in search: most search engines serve more than 70% of their results from a cache
  • The document access curve. Because the query distribution is skewed, and so are the clicks per result position, it’s probably not surprising that a few documents (or items or objects) are accessed much more frequently than others. As a rule of thumb, you’ll typically find that 80% of the document accesses go to 20% of the documents. Pareto at work.
  • Clicks on related searches. Most search engines show related searches, and there’s a click curve on those that’s an inverse power law distribution
  • Clicks on just about any list: left navigation, pagination controls, ads, and any other list will typically have an inverse power law distribution. That’s why there’s often such a huge price differential between what advertisers will pay in search for the top position versus the second position
  • Words in queries, documents, and ads. Just like Zipf illustrated all those years ago, word frequencies follow an inverse power law distribution. Interestingly, and I explain this in this paper, Zipf’s formal distribution doesn’t hold very well on words drawn from web documents (a thing called Heaps’ law does a better job). But the point remains: a few words account for most of the occurrences

What does this all mean? To a search guy, it means that when you see a curve that isn’t an inverse power law distribution, you should worry. There’s probably something wrong — an issue with search relevance, a user experience quirk (like the little bump I explained above), or something else. Expect to see curves that decay rapidly, and worry if you don’t.
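
If you want to check a curve you’re seeing, a quick test is to fit a straight line to log(clicks) against log(position) and look at the slope and the quality of the fit. Here’s a rough sketch of that check; the thresholds and sample numbers are invented, and a real analysis would be more careful.

```python
import math

def power_law_check(clicks_by_position, min_r2=0.9):
    """Rough sanity check: does a click curve decay like an inverse power law?

    Fits log(clicks) = a + b * log(position) by least squares and reports the
    exponent b and the R^2 of the fit. A healthy curve has a clearly negative
    exponent and a high R^2; the 0.9 threshold is illustrative, not tuned.
    """
    points = [(math.log(pos), math.log(clicks))
              for pos, clicks in enumerate(clicks_by_position, start=1) if clicks > 0]
    n = len(points)
    if n < 2:
        return 0.0, 0.0, False
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    syy = sum((y - mean_y) ** 2 for _, y in points)
    if sxx == 0 or syy == 0:
        return 0.0, 0.0, False
    slope = sxy / sxx
    r2 = (sxy * sxy) / (sxx * syy)
    return slope, r2, (slope < 0 and r2 >= min_r2)

# A made-up curve that decays roughly like 1/position: the slope comes out near -1.
clicks = [1000, 520, 340, 260, 205, 170, 150, 130, 115, 105]
print(power_law_check(clicks))
```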

See you again next Monday for a new post. If you’re enjoying the posts, please share with your friends by clicking on the little buttons below. Thanks!

Snippets: The unsung heroes of web search

Remember when Google launched into our consciousness? Below you’ll see what you probably saw back in 1999: a clean, simple results page with ten blue links.

Google search results, circa 1999

Google was a marked contrast to the other search engines of the era: AltaVista, Excite, Lycos, Looksmart, Inktomi, Yahoo!, Northern Light, and others.

What was different about Google? Most people talked about three things:

  1. PageRank, the use of the web’s link structure in search ranking. The perception was that Google found more relevant results than its competitors. Here’s the original paper
  2. Its clean page design. It wasn’t a link farm, there was plenty of white space, and an absence of advertising (banner or otherwise) for the first couple of years
  3. Its speed and its collection size. Google felt faster, reported fast search times, claimed a large index (remember when they used to say how many documents they had on the home page?), and showed large result set counts for your queries (result set estimation is a fun game – I should write a blog post about it)

What struck me most was the way Google presented result summaries on the search results page. Google’s result summaries were contextual: they were constructed of fragments of the documents that matched the query, with the query words highlighted. This was a huge step forward from what the others were doing: they were mostly just showing the first hundred or more characters of the document (which often was a mess of HTML, JavaScript, and other rubbish). With Google, you could suddenly tell if the result was relevant to you — and it made Google’s results page look incredibly relevant compared to its competitors.

And that’s what this blog post is about: contextual result summaries in search, snippets.

Snippets

Snippets are the summaries on the search results page. There are typically ten of them per results page. Take a look at the result below from the Google search engine; it’s one of the ten snippets for the query gold base bobblehead.

A Google Snippet

The snippet is designed to help you make a decision: is this relevant or not to my information need? It helps you decide whether or not to invest a click. It’s important — it’s key to our impression of whether or not a search engine works well. The snippets are the bulk of the results page, they’re displayed below the brand of the search engine, and together they represent what the search engine found in response to your query. If they’re irrelevant, your impression of the search engine is negative; and that’s regardless of whether or not the underlying pages are relevant.

The snippet has three basic components:

  1. The title of the web document, usually what’s in the HTML <title> tag. In this case, “Vintage Baseball Memorabilia Bobble Heads Gold Base”
  2. A representation of the URL of the web resource, which you can click on to visit the document on the web. In this case, “keymancollectibles.com/bobbleheads/goldbasenod.htm”
  3. (Usually) a query biased summary. One, two, or three fragments extracted from the document that are contextually relevant to the query. In this case, “The Gold base series of Bobbing Heads was the last one issued in the 60’s. They were sold from 1966 through 1971. This series saw the movement of several …”

Some words in the title, URL, and query biased summary are shown in bold. They’re the query words that the user typed (in this case, the query was “gold base bobblehead”) or variants of the query words, such as plurals, synonyms, or acronym expansions or contractions. (My recent post on query rewriting introduces how variants are discovered and used.) The idea here is that highlighting the query terms helps the user identify the likely relevant pieces of the snippet, and quickly make a decision about the relevance of the web resource to their need.

I’m going to come back to discussing query biased summaries, and how the fragments of text are found and ranked. But, before I do, let’s quickly cover what’s been happening in snippets over the past few years.

What’s happening in the world of snippets

All search engines show quicklinks or deeplinks for navigational queries. Navigational queries are those that are posed by users when they want to visit a specific location on the web, such as Google, Microsoft, or eBay. Take a look at the snippet below from Bing for the query nordstrom. What you can see below the snippet are links that lead to common pages on the nordstrom.com site including “Men”, “Baby & Kids”, and so on; these are what’s known in the industry as quicklinks or deeplinks.  When I was managing the snippets engineering team at Bing, we worked on enhancing snippets for navigational queries to show phone numbers, related sites, and other relevant data. The example for the query nordstrom shows the phone number and a search box for searching only the nordstrom.com site.

Nordstrom navigational query snippet from Bing

At Bing, we also invented customized snippets for particular sites or classes of sites — for example, below is a snippet that’s customized for a forum site. You can see that it highlights the number of posts, author, view count, and publication date separately from the query biased summary.

Custom snippet for a forum site from Bing

While Bing pioneered it, Google has certainly got in on the game — I’ve included below a few different Google examples for snippets from LinkedIn, Google Play, Google Scholar, and Fox Sports.

Custom Google snippets for LinkedIn, Google Scholar, Google Play, and Fox Sports

What’s all this innovation for? It’s to help users make better decisions about whether or not to click, and to occasionally give extra links to click on. For the Fox Sports snippet, it’s showing the top three headlines from the site — and that’s perhaps what the user wants when they query for Fox Sports (they don’t really want to go to the home page, they want to read a top story). For the Google Play and Scholar snippets, there’s information in the snippet that informs us whether the result is authoritative: has it been cited by other papers many times? Has it been voted on many times? The LinkedIn result extracts key information and presents it separately: the city the person lives in and who they work for, and it also includes some data from Google+.

In my opinion, snippets are becoming too heterogeneous; the search engines are going too far. Users find it hard to switch from processing pictures to reading text, and the two are becoming too intermingled to allow users to quickly scan results pages. The snippets have too many different formats: users have to pause, understand what template they’re looking at, digest the information, and make a decision. The text justification is getting shaky. Life was simpler and faster when snippets were more consistent. I’m sure the major search engines would argue that users are making better decisions (for example, I bet they’re seeing lower mean times from when the user types a query to when they invest their click). But I think the aesthetics are getting lost, and it’s getting a little too much for most users.

Query-Biased Summaries

Google didn’t invent query-biased summaries; Tassos Tombros and Mark Sanderson introduced them in this 1998 paper.

The basic method for producing a query-biased summary goes like this (I’ve included a rough code sketch of these steps after the list):

  1. Process the user’s query using the search index, and get a list of web resources that match (typically ten of them, and typically web [HTML] pages)
  2. For each document that matches:
    1. Find the location of the document in the collection
    2. Seek and read the document into memory
    3. Process the document sequentially and:
      1. Extract all fragments that contain one or more query words (or related terms — you get these from the query rewriting service)
      2. Assign a score to each fragment that approximates its likely relevance to the query
      3. Find the best few fragments based on the best few scores (one, two, or three)
    4. Neaten up the best fragments so they make sense (try and make them begin and end nicely on sentence boundaries or other sensible punctuation)
    5. Stitch the fragments together into one string of text, and save this somewhere
  3. Show the query-biased summaries to the user
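
Here’s what those steps might look like as a compact Python sketch. It makes strong simplifying assumptions: the document is already a plain-text string in memory, a “fragment” is just a sentence, the score is a naive count of distinct query words, and the neatening step is skipped because whole sentences are used. It’s an illustration of the shape of the algorithm, not how any production snippet generator is built.

```python
import re

def query_biased_summary(document_text, query_terms, max_fragments=2):
    """Build a query-biased summary for one matching document (a rough sketch)."""
    terms = {t.lower() for t in query_terms}
    # Extract candidate fragments: here, simply the document's sentences.
    sentences = re.split(r"(?<=[.!?])\s+", document_text)
    scored = []
    for position, sentence in enumerate(sentences):
        words = set(re.findall(r"[a-z0-9']+", sentence.lower()))
        hits = len(terms & words)           # naive score: distinct query words present
        if hits:
            scored.append((hits, -position, sentence.strip()))
    # Keep the best few fragments, preferring earlier fragments on ties.
    best = sorted(scored, reverse=True)[:max_fragments]
    # Stitch them together in document order.
    ordered = [sentence for _, _, sentence in sorted(best, key=lambda t: -t[1])]
    return " … ".join(ordered)

print(query_biased_summary(
    "The Gold base series of Bobbing Heads was the last one issued in the 60's. "
    "They were sold from 1966 through 1971. Collectors prize the gold base today.",
    ["gold", "base", "bobblehead"]))
```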

I’ve spoken to engineers from the major search companies, and it’s pretty clear that producing snippets is one of the most computationally challenging problems they face. You only execute one query per user request, but you’ve got to produce ten summaries per results page. And unlike the search process, there’s no inverted index structure to help locate the terms in the documents. There are a few tricks that help:

  • Don’t process the entire document. In general, the fragments that are the best summaries are nearer the start of the documents, and so there’s diminishing returns in processing the entirety of large documents. It’s also possible to reorganize the document, by selecting fragments that are very likely to appear in snippets, and putting them together at the start of the document
  • Cache the popular results. Save the query-biased summary after it’s computed, and reuse it when the same query and document pair is requested again
  • Cache documents in memory. It’s been shown that caching just 1% of the documents in memory saves more than 75% of the disk seeks (which turn out to be a major bottleneck)
  • Compress the documents so they’re fast to retrieve from disk. Better still, compress the query words, and compare them directly to the compressed document. This is the subject of a paper I wrote with Dave Hawking, Andrew Turpin, and Yohannes Tsegay a few years ago. We showed that this makes producing snippets around twice as fast as without compression

Most search folks don’t think about the ranking techniques that are used in the snippet generator. The big ranking teams at major search companies focus on matching queries to documents. But the snippet generator does do interesting ranking: its task is to figure out the best fragments in the document that should be shown to the user, from all of the matching fragments in the document. Tombros and Sanderson discuss this briefly: they scored sentences using the square of the number of query words in the sentence divided by the number of query words in the query. Net, the more query words in the fragment, the better. I’d recommend adding a factor that gives more weight to fragments that are closer to the beginning of the document. There are also other things to consider: does the fragment make sense (is it a complete sentence)? Which of the query words are most important, and are those in the fragment? Does the set of fragments we’re showing cover the set of query terms? You could even consider how the set of query-biased summaries gives an overall summary of the topic. I’m sure there are lots of interesting ideas to try here.
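
As a rough illustration, here’s that sentence score in code, with a position factor of the kind recommended above bolted on. The decay constant and the counting of distinct (rather than repeated) query words are choices I’ve made for the sketch, not anything prescribed by the paper.

```python
def fragment_score(fragment_words, query_words, fragment_position, decay=0.05):
    """Score a candidate fragment for snippet selection (illustrative).

    Base score follows Tombros and Sanderson: the square of the number of query
    words present in the fragment, divided by the number of words in the query.
    The positional factor favours fragments near the start of the document;
    the decay constant is arbitrary.
    """
    present = len(set(query_words) & set(fragment_words))
    base = (present ** 2) / len(query_words)
    position_weight = 1.0 / (1.0 + decay * fragment_position)
    return base * position_weight

# 2 of the 3 query words appear in the document's 4th fragment (position 3).
print(fragment_score(["gold", "base", "series"], ["gold", "base", "bobblehead"], 3))
```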

Snippets have always fascinated me, and I believe their role in search is critical to users, unsung, and important to building a great search engine.

Hope you enjoyed this post — if you did, please share it with your friends through your favorite social network. There’ll be a new post next Monday…

Compression, Speed, and Search Engines

When you think compression, you probably think saving space. I think speed. In this post, I explain why compression is a performance tool, and how it saves tens of millions of dollars and helps speed up the search process in most modern search engines.

Some background on how search works

Before I can explain how compression delivers speed, I’ll need to explain some of the basics of search data structures. All search engines use an inverted index to support querying by users. An inverted index is a pretty simple idea: it’s kind of like the index at the back of a book. There’s a set of terms you can search for, and a list of places those terms occur.

Let’s suppose you want to learn about the merge sorting algorithm and you pick up your copy of the third volume of Knuth’s Art of Computer Programming book to begin your research. First, you flip to the index at the back, and you do a kind of binary search thing: oops, you went to “q”, that’s too far; flip; oh, “h”, that’s too far the other way; flip; “m”, that’s it; scan, scan, scan; ah ha, “merge sorting”, found it! Now you look and see that the topic is on page 98, and on pages 158 through 168. You turn to page 98 to get started.

A Simple Inverted Index

An inverted index used in a search engine is similar. Take a look at the picture on the right. What you can see to the left side of it is a structure that contains the searchable terms in the index (the lexicon); in this example I’ve just shown the term cat, and I’ve shown the search structure as a chained hash table. On the right, you can see a list of term occurrences; in this case I’ve just shown the list for the term cat, which occurs three times in our collection of documents: in documents 1, 2, and 7.

If a user wants to query our search engine and learn something about a cat, here’s how it works. We look in the search structure to see if we have a matching term. In this case, yes, we find the term cat in the search structure. We then retrieve the list for the term cat, and use the information in the list to compute a ranking of documents that match the query. In this case, we’ve got information about documents 1, 2, and 7. Let’s imagine our ranking function thinks document 7 is the best, document 1 is the next best, and document 2 is the least relevant of the three. We’d then show information about documents 7, 1, and 2 to the user, and get ready to process our next query. (I’ve simplified things here quite a bit, but that’s not too important in the story I’m going to tell.)

What you need to take away from this section is that inverted indexes are the backbone of search engines. And inverted indexes consist of terms and lists, and the lists are made up of numbers or, more specifically, integers.
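
To make that concrete, here’s a toy inverted index in Python. A dictionary stands in for the chained hash table in the picture, the postings are plain lists of document numbers, and the “ranking” is just a count of matching query terms; it’s purely illustrative.

```python
from collections import defaultdict

class TinyInvertedIndex:
    """A toy inverted index: a lexicon of terms, each with a postings list
    of document numbers (the lists of integers described above)."""

    def __init__(self):
        self.postings = defaultdict(list)        # term -> list of document numbers

    def add(self, doc_id, text):
        for term in set(text.lower().split()):   # each term recorded once per document
            self.postings[term].append(doc_id)

    def search(self, query):
        # Count how many query terms each document matches; a crude stand-in
        # for a real ranking function.
        scores = defaultdict(int)
        for term in query.lower().split():
            for doc_id in self.postings.get(term, []):
                scores[doc_id] += 1
        return sorted(scores, key=lambda d: scores[d], reverse=True)

index = TinyInvertedIndex()
index.add(1, "the cat sat on the mat")
index.add(2, "a cat and a dog")
index.add(7, "cat pictures for sale")
print(index.search("cat"))    # documents 1, 2, and 7 all contain the term "cat"
```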

Start with the seminal Managing Gigabytes if you’re interested in learning more about inverted indexes.

Compressing Integers

Inverted indexes have two parts: the terms in our searchable lexicon, and the term occurrences in our lists of integers. It turns out that compression isn’t very interesting for the terms, and I’m going to ignore that topic here. Where compression gets interesting is for our lists of integers.

You might be surprised to learn that integer compression is a big deal. There are scholars best known for their work on compressing integers: Solomon Golomb, Robert F. Rice, and Peter Elias immediately come to mind. I spent a few years between 1997 and 2003 largely focusing on integer compression. Here’s a summary I wrote with Justin Zobel on the topic (you can download a PDF here.)

It turns out that one particular type of integer compression, which I refer to as variable-byte compression, works pretty well for storing numbers in inverted index lists (PDF is here). It appears that Google started off using this variable byte scheme, and has now moved to a tuned-up mix of integer compression techniques.

I could spend a whole blog post talking about the pros, cons, and details of the popular integer compression techniques. But, for now, I’ll just say a few words about variable-byte compression. It’s a very simple idea: use 7 bits in every 8-bit byte to store the integer, and use the remaining 1 bit to indicate whether this is the last byte that stores the integer or whether another byte follows.

Suppose you want to store the decimal number 1,234. In binary, it’d be represented as 10011010010. With the variable-byte scheme, we take the first (lowest) seven bits (1010010), and we append a 0 to indicate this isn’t the last byte of the integer, and we get 10100100. We then take the remaining four bits of the binary value (1001), pad them out to seven bits (wasting a little space), and append a 1 to indicate this is the last byte of the integer (00010011). So, all up, we’ve got two bytes, 10100100 and 00010011; written with the last byte first, that’s 0001001110100100. When we’re reading this compressed version back, we can recreate the original value by throwing away the “indicator” bits (the final bit of each byte), and concatenating what’s left together: 10011010010 (ignoring the leading zeros). Voila: compression and decompression in a paragraph!

If you store decimal 1,234 using a typical computer architecture, without compression, it’d be typically stored in 4 bytes. Using variable-byte compression, we can store it in two. It isn’t perfect (we wasted 3 bits, and paid no attention to the frequency of different integers in the index), but we saved ourselves two bytes. Overall, variable-byte coding works pretty well.
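
The post points at real C code (vbyte.c) a little further down; here’s a short Python sketch of the same idea, following the convention in the worked example above, where the low bit of each byte is the indicator: 0 means another byte follows, 1 means this is the last byte.

```python
def vbyte_encode(n):
    """Encode a non-negative integer: 7 data bits per byte, low bit is the
    indicator (0 = another byte follows, 1 = last byte)."""
    out = bytearray()
    while True:
        low7 = n & 0x7F
        n >>= 7
        if n:
            out.append(low7 << 1)         # indicator 0: more bytes follow
        else:
            out.append((low7 << 1) | 1)   # indicator 1: last byte
            return bytes(out)

def vbyte_decode(data):
    """Decode a single integer from bytes produced by vbyte_encode."""
    n, shift = 0, 0
    for byte in data:
        n |= (byte >> 1) << shift         # strip the indicator, keep the 7 data bits
        shift += 7
        if byte & 1:                      # indicator set: this was the last byte
            break
    return n

encoded = vbyte_encode(1234)
print([f"{byte:08b}" for byte in encoded])   # ['10100100', '00010011'], as in the example
print(vbyte_decode(encoded))                 # 1234
```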

Why This All Matters

Index size: compressed versus uncompressed

I’ve gone down the path of explaining inverted indexes and integer compression. Why does this all matter?

Take a look at the graph on the right. It shows  the size of the inverted index as a percentage of the collection that’s being searched. The blue bar is when variable byte compression is applied, the black bar is when there’s no compression. All up, the index is about half the size when you use a simple compression scheme. That’s pretty cool, but it’s about to get more interesting.

Take a look at the next graph below. This time, I’m showing you the speed of the search engine, that is, how long it takes to process a query. Without compression, it’s taking about 0.012 seconds to process a query. With compression, it’s taking about 0.008 seconds to process a query. And what’s really amazing here is that this is when the inverted index is entirely in memory — there’s no disk, or SSD, or network involved. Yes, indeed, compression is making the search engine about 25% faster.

That’s the punchline of this post: compression is a major contributor to making a search engine fast.

Search engine querying speed: compressed versus uncompressed

How’s this possible? It’s actually pretty simple. When you don’t use compression, the time to process the inverted lists really consists of two basic components: moving the list from memory into the CPU cache, and processing the list. The problem is there’s lots of data to move, and that takes time. When you add in compression, you have three costs: moving the data from memory into the CPU cache, decompressing the list, and processing the list. But there’s much less data to move when it’s compressed, and so it takes a lot less time. And since the decompression is very, very simple, that takes almost no time. So, overall, you win — compression makes the search engine hum.

It gets more exciting when disk gets involved. You will typically see that compression makes the system about twice as fast. Yes, twice as fast at retrieving and processing inverted indexes. How neat is that? All you’ve got to do is add a few lines of code to read and write simple variable bytes, and you’ll save an amazing amount in processing time (or cost in hardware, or both).

Rest assured that these are realistic experiments with realistic data, queries, and a reasonable search engine. All of the details are here. The only critiques I’d offer are these:

  • The ranking function is very, very simple, and so the bottleneck in the system is retrieving data and not the processing of it. In my experience, that’s a reasonably accurate reflection of how search engines work — I/O is the bottleneck, not ranking
  • The experiments are old. In my experience, nothing has changed — if anything, you’ll get even better results now than you did back then

Here is some C code for reading and writing variable-byte integers (scroll to the bottom to where it says vbyte.c). The “writing” code is just 27 fairly simple, not-too-compact lines. The reading code is 17. It’s pretty simple stuff. Feel free to use it.

Is Compression Useful Elsewhere in Search?

Oh yes, indeed. It’s important to use compression wherever you can. Compress data that’s moving over a network. Compress documents. Compress data you’re writing to logs. Think compression equals speed equals a better customer experience equals much less hardware.

Alright, that ends another whirlwind introduction to a search topic. Hope you enjoy it, looking forward to reading your comments. See you next time.

Query Rewriting in Search Engines

There’s no shortage of information on search ranking – creating ranking functions, and their factors such as PageRank and text matching. Query rewriting is less conspicuous but equally important. Our experience at eBay is that query rewriting has the potential to deliver as much improvement to search as core ranking, and that’s what I’ve seen and heard at other companies.

What is query rewriting?

Let’s start with an example. Suppose a user queries for Gucci handbags at eBay. If we take this literally, the results will be those documents that contain the words Gucci and handbags somewhere. Unfortunately, many great answers aren’t returned. Why?

Consider a document that contains Gucci and handbag, but never uses the plural handbags. It won’t match the query, and won’t be returned. Same story if the document contains Gucci and purse (rather than handbag). And again for a document that contains Gucci but doesn’t contain handbags or a synonym – instead it’s tagged in the “handbags” category on eBay; the seller implicitly assumed it’d be returned when a buyer types Gucci handbags as their query.

To solve this problem, we need to do one of two things: add words to the documents so that they match other queries, or add words to the queries so that they match other documents. Query rewriting is the latter approach, and that’s the topic of this post. What I will say about expanding documents is there are tradeoffs: it’s always smart to compute something once in search and store it, rather than compute it for every query, and so there’s a certain attraction to modifying documents once. On the other hand, there are vastly more words in documents than there are words in queries, and doing too much to documents gets expensive and leads to imprecise matching (or returning too many irrelevant documents). I’ve also observed over the years that what works for queries doesn’t always work for documents.

Let’s go back to the example. If the user typed Gucci handbags as the query, we could build a query rewriting system that altered the query silently and ran a query such as Gucci and (handbag or handbags or purse or purses or category:handbags) behind the scenes. The user will see more matches, and hopefully the quality or relevance of the matches will remain as high as without the alteration.
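
To make the mechanics concrete, here’s a small sketch of that kind of rewrite: each query word is expanded into an OR-group using a hand-built dictionary. The dictionary entries, the category:handbags syntax, and the function name are all invented for the illustration; a real dictionary is mined from data, which is where this post is heading.

```python
# Toy rewrite dictionary: each entry maps a query word to its equivalents.
# The entries and the category constraint are invented for this example.
REWRITES = {
    "handbags": ["handbag", "purse", "purses", "category:handbags"],
    "nwt": ["new with tags"],
}

def rewrite_query(query):
    """Turn a user query into a boolean query with OR-groups for known terms."""
    parts = []
    for word in query.lower().split():
        equivalents = REWRITES.get(word, [])
        if equivalents:
            alternatives = [word] + [f'"{e}"' if " " in e else e for e in equivalents]
            parts.append("(" + " OR ".join(alternatives) + ")")
        else:
            parts.append(word)
    return " AND ".join(parts)

print(rewrite_query("Gucci handbags"))
# gucci AND (handbags OR handbag OR purse OR purses OR category:handbags)
```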

Query rewriting isn’t just about synonyms and plurals. I’ve shared another example where it’s about adding a category constraint in place of a query word. There are many other types of rewrites. Acronyms are another example: expanding NWT to New With Tags, or contracting Stadium Giveaway to SGA. So is dealing with global language variations: organization and organisation are the same. And then there’s conflation: t-shirt, tshirt, and t shirt are equivalent. There are plenty of domain specific examples: US shoe sizes 9.5 and 9 ½ are the same, and if we’re smart we’d know the equivalent UK sizing is 9 and European sizing is 42.5 (but 9 and 9.5 aren’t the same thing when we’re talking about something besides shoes). We can also get clever with brands and products, and common abbreviations: iPad and Apple iPad are the same, and Mac and Macintosh are the same (when it’s in the context of a computing device or a rain jacket).

Dealing with other languages is a different challenge, but many of the techniques that I’ll discuss later also work in non-English retrieval. Brian Johnson of eBay recently wrote an interesting post about the challenges of German language query rewriting.

Recall versus precision

Query rewriting sounds like a miraculous idea (as long as we have a way of creating the dictionary, which we’ll get to later). However, like everything else in search there is a recall versus precision tradeoff. Recall is the fraction of all relevant documents that are returned by the search engine. Precision is the fraction of the returned results that are relevant.

If our query rewriting system determines that handbag and handbags are the same, it seems likely we’ll get improved recall (more relevant answers) and we won’t hurt precision (the results we see will still be relevant). That’s good. But what if our system decides handbags and bags are the same? Not good news: we’ll certainly get more recall, but precision will be poor (since there’ll be many other types of bags). In general, increasing recall trades off against decreasing precision – the trick is to figure out the sweet spot where the user is most satisfied. (That’s a whole different topic, one I’ve written about here.)

Query rewriting is like everything else in search: it isn’t perfect; it’s an applied science. But my experience is that query rewriting is a silver bullet: it delivers as much improvement to search relevance as continuous innovation in core ranking.

How do you build query rewriting?

You mine vast quantities of data, and look for patterns.

Suppose we have a way of logging user queries and clicks in our system, and assume we’ve got these to work with in a Hadoop cluster. We could look for patterns that we think are interesting and might indicate equivalence. Here’s one simple idea: let’s look for a pattern where a user types a query, types another query, and then clicks on a result. What we’re looking for here are cases where a user tried something, wasn’t satisfied (they didn’t click on anything), tried something else, and then showed some satisfaction (they clicked on something). Here’s an example: suppose a user typed apple iapd 2 [sic], immediately corrected their query to apple ipad 2, and then clicked on the first result.

If we do this at a large scale, we’ll get potentially millions of “wrong query” and “right query” pairs. We could then sort the output, and count the frequency of the pairs. We might find that the pair “apple iapd 2” and “apple ipad 2” has a moderate frequency (say it occurs tens of times) in our data. Our top-frequency pairs are likely to be things like “iphon” and “iphone”, and “ipdo” and “ipod”. (By the way, it turns out that users very rarely mess up the first letter of a word. Errors are much more common at the ends of words.)
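
Assuming sessions are already grouped per user and ordered in time, a first cut of that mining step might look like the sketch below. The event fields and the frequency threshold are invented for the example; at real scale this runs as a distributed job on that Hadoop cluster, but the logic is the same.

```python
from collections import Counter

def mine_correction_pairs(sessions, min_count=5):
    """Mine (wrong query, right query) candidates from the
    'query, query, click' pattern: a query with no click, immediately
    followed by a query that did get a click."""
    pairs = Counter()
    for events in sessions:                    # each session: time-ordered events
        for first, second in zip(events, events[1:]):
            if (first["type"] == "query" and not first["clicked"]
                    and second["type"] == "query" and second["clicked"]):
                pairs[(first["query"], second["query"])] += 1
    # Keep only pairs we saw often enough to trust.
    return {pair: count for pair, count in pairs.items() if count >= min_count}

sessions = [
    [{"type": "query", "query": "apple iapd 2", "clicked": False},
     {"type": "query", "query": "apple ipad 2", "clicked": True}],
] * 10
print(mine_correction_pairs(sessions))   # {('apple iapd 2', 'apple ipad 2'): 10}
```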

Once we’ve got this data, we can use it naively to improve querying. Suppose a future user types apple iapd 2. We would look this query up on our servers in some fast lookup structure that manages the data we produced with our mining. If there’s a match, we might decide to:

  • Silently switch the user’s query from apple iapd 2 to apple ipad 2
  • Make a spelling suggestion to the user: “Did you mean apple ipad 2?”
  • Rewrite the query to be, say, apple AND (iapd OR ipad) AND 2

Let’s suppose in this case that we’re confident the user made a mistake. We’d probably do the first of these, saving the user a click and a little embarrassment, and deliver great results instead of none (or very few). Pretty neat – our past users have helped a future user be successful. Crowdsourcing at a grand scale!

Getting Clever in Mining

Mining for “query, query, click” patterns is a good start but fairly naïve. Probably a step better than saving the fact that apple iapd 2 and apple ipad 2 are the same is to save the fact that iapd and ipad are the same. When we see that “query, query, click” pattern, we could look for differences between the adjacent queries and save those – rather than saving the queries overall. This is a nice improvement: it’ll help us find many more cases where iapd and ipad are corrected in different contexts, give us a much higher frequency for the pair, and give us more confidence that our correction is right.
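
Here’s a deliberately narrow sketch of that difference extraction: it assumes the two queries have the same number of words and differ in exactly one of them, which is enough to pull the iapd and ipad pair out of the example above. Real mining also has to handle insertions, deletions, and reordering.

```python
def token_difference(wrong_query, right_query):
    """Return the (wrong, right) word pair if the two queries have the same
    number of words and differ in exactly one position; otherwise None."""
    a, b = wrong_query.lower().split(), right_query.lower().split()
    if len(a) != len(b):
        return None
    diffs = [(x, y) for x, y in zip(a, b) if x != y]
    return diffs[0] if len(diffs) == 1 else None

print(token_difference("apple iapd 2", "apple ipad 2"))   # ('iapd', 'ipad')
```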

We could also look for other patterns. For example, we could mine for a “query1, query2, query3, click” pattern. We could use this to find what’s different between query1 and query2, and what’s different between query2 and query3. We could even look for what’s different between query1 and query3, and perhaps give that a little less weight or confidence as a correction than for the adjacent queries.

Once you get started down this path, it’s pretty easy to get excited about other possibilities (what about “query, click, query”?). Rather than mine for every conceivable pattern, a nice way to think about this is as a graph: create a graph that connects queries (represented as nodes) to other queries, and store information about how they’re connected. For example, “apple iapd 2” and “apple ipad 2” might be connected and the edge that connects them annotated with the number of times we observed that correction in the data for all users. Maybe it’s also annotated with the average amount of time it took for users to make the correction. And anything else we think is worth storing that might help us look for interesting query rewrites.
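
One simple way to keep that bookkeeping is a nested dictionary keyed by query, with a count and a running total of correction time on each edge. Everything here (the class name, the fields, the timings) is illustrative, not a production structure.

```python
from collections import defaultdict

class QueryGraph:
    """Queries as nodes; an edge (q1 -> q2) records how often users moved from
    q1 to q2 before clicking, and how long the move took on average."""

    def __init__(self):
        self.edges = defaultdict(
            lambda: defaultdict(lambda: {"count": 0, "total_seconds": 0.0}))

    def observe(self, from_query, to_query, seconds_taken):
        edge = self.edges[from_query][to_query]
        edge["count"] += 1
        edge["total_seconds"] += seconds_taken

    def summary(self, from_query, to_query):
        edge = self.edges[from_query][to_query]
        average = edge["total_seconds"] / edge["count"] if edge["count"] else 0.0
        return edge["count"], average

graph = QueryGraph()
graph.observe("apple iapd 2", "apple ipad 2", seconds_taken=4.5)
graph.observe("apple iapd 2", "apple ipad 2", seconds_taken=6.0)
print(graph.summary("apple iapd 2", "apple ipad 2"))   # (2, 5.25)
```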

My former colleague Nick Craswell wrote a great paper on this topic of “random walking” click graphs, which has direct application to mining for query alterations.

What can you do with query rewriting?

Once you’ve created a query rewriting dictionary, you can begin to experiment with using it in your search engine. There are many different things you can consider doing:

  • Spelling correction, without messaging to the user
  • Spelling correction, where you tell the user you’ve flipped their query to something else, and offer them a way to get back to the original query
  • A spelling suggestion, but showing the results for the original query
  • Showing a few results from the original query, and then the rest from the query alteration
  • Showing a few results from the query alteration, and then the rest from the original query
  • Running a modified query, where extra terms are added to the user’s query (or perhaps some hints are given to the ranking function about the words in the query and the rewriting system’s confidence in them)
  • Running the original query, but highlighting words in the results that come from the rewriting system
  • Using the query rewriting data in autosuggest or related searches
  • Or something else… really, it is a gold mine of ideas

What are the challenges?

Everything in search comes with its problems. There are recall and precision tradeoffs. And there are just plain mistakes that you need to avoid.

Suppose a user attended Central Michigan University. To 99% of users, CMU is Carnegie Mellon University, but not this user and a good number like him. This user comes to our search engine, runs the query CMU, doesn’t get what he wants (not interested in Carnegie Mellon), types Central Michigan University instead, and clicks on the first result. Our mining will discover that Central Michigan University and CMU are equivalent, and maybe add a query rewrite to our system. This isn’t going to end well – most users don’t want anything from the lesser-known CMU. So, we need to be clever when a query has more than one equivalence.

There are also problems with context. What does HP mean to you? Horsepower? Hewlett-Packard? Homepage? The query rewriting system needs context – we can’t just expand HP to all of them. If the user types HP Printer, it’s Hewlett-Packard; 350 HP engine is horsepower. So, we will need to be a little more sophisticated in how we mine – in many cases, we’ll need to preserve context.

The good news is we can test and learn. That’s the amazing thing about search when you have lots of traffic – you can try things out, mine how users react, and make adjustments. The query rewrite system is really an ecosystem – our first dictionary is our first attempt, and we can watch how users react to it. If we correct HP printer to Horsepower printer, you can bet users won’t click, or they’ll try a different query. We’ll very quickly learn we’ve made a mistake, and we can correct it. If we’ve got the budget, we can even use crowdsourcing to check the accuracy of our corrections before we try them out on our customers.

Well, there you have it, a whirlwind introduction to query rewriting in search. Looking forward to your comments.

Ideas and Invention (and the story of Bing’s Image Search)

I was recently told that I am an ideas guy. Probably the best compliment I’ve received. It got me thinking, and I thought I’d share a story.

With two remarkable people, Nick Craswell and Julie Farago, I invented infinite scroll in MSN Search’s image search in 2005. Use Bing’s image search and you’ll see that there’s only one page of results – you can scroll and more images are loaded, unlike web search where you have to click on a pagination control to get to page two. Google released infinite scroll a couple of years ago in their image search, and Facebook, Twitter, and others use a similar infinite approach. Image search at Bing was the first to do this.

MSN Search's original image search with infinite scroll

How’d this idea come about? Most good ideas are small, obvious increments based on studying data, not lightning-bolt moments that are abstracted from the current reality. In this case, we began by studying data from image search engines, back when all image search engines had a pagination control and roughly twenty images per page.

Over a pizza lunch, Nick, Julie, and I spent time digging in user sessions from users who’d used web and image search. We learnt a couple of things in a few hours. At least, I recall a couple of things – somewhere along the way we invented a thumbnail slider, a cool hover-over feature, and a few other things. But I don’t think that was over pizza.

Back to the story. The first thing we learnt was that users paginate in image search.  A lot. In web search, you’ll typically see that for around 75% of queries, users stay on page one of the results; they don’t like pagination. In image search, it’s the opposite: 43% of queries stay on page 1 and it takes until page 8 to hit the 75% threshold.

Second, we learnt that users inspect large numbers of images before they click on a result. Nick remembers finding a session where a user went to page twenty-something before clicking on an image of a chocolate cake. That was a pretty wow moment – you don’t see that patience in web search (though, as it turns out, we do see it at eBay).

If you were there, having pizza with us, perhaps you would have invented infinite scroll. It’s an obvious step forward when you know that users are suffering through clicking on pagination for the bulk of their queries, and that they want to consume many images before they click. Well, perhaps the simplest invention would have been more than 20 images per page (say, 100 images per page) – but it’s a logical small leap from there to “infinity”. (While we called it “infinite scroll”, the limit was 1,000 images before you hit the bottom.) It was later publicly known as “smart scroll”.

To get from the inspiration to the implementation, we went through many incarnations of scroll bars, and ways to help users understand where they were in the results set (that’s the problem with infinite scroll – infinity is hard to navigate). In the end, the scroll bar was rather unremarkable looking – but watch what it does as you aggressively scroll down. It’s intuitive but it isn’t obvious that’s how the scroll bar should have worked based on the original idea.

This idea is an incremental one. Like many others, it was created through understanding the customer through data, figuring out what problem the customers are trying to solve, and having a simple idea that helps. It’s also about being able to let go of ideas that don’t work or clutter the experience – my advice is don’t hang onto a boat anchor for too long. (We had an idea of a kind of scratchpad where users could save their images. It was later dropped from the product.)

The Live Search scratchpad. No longer in Bing's Image Search

I’m still an ideas guy. That’s what fires me up. By the way, Nick’s still at Microsoft. Julie’s at Google. And I am at eBay.

(Here’s the patent document, and here’s a presentation I made about image search during my time at Microsoft. All of the details in this blog post are taken from the presentation or patent document.)