There’s countless information on search ranking – creating ranking functions, and their factors such as PageRank and text. Query rewriting is less conspicuous but equally important. Our experience at eBay is that query rewriting has the potential to deliver as much improvement to search as core ranking, and that’s what I’ve seen and heard at other companies.
What is query rewriting?
Let’s start with an example. Suppose a user queries for Gucci handbags at eBay. If we take this literally, the results will be those that have the words Gucci and handbags somewhere in the matching documents. Unfortunately, many great answers aren’t returned. Why?
Consider a document that contains Gucci and handbag, but never uses the plural handbags. It won’t match the query, and won’t be returned. Same story if the document contains Gucci and purse (rather than handbag). And again for a document that contains Gucci but doesn’t contain handbags or a synonym – instead it’s tagged in the “handbags” category on eBay; the user implicitly assumed it’d be returned when a buyer types Gucci handbags as their query.
To solve this problem, we need to do one of two things: add words to the documents so that they match other queries, or add words to the queries so that they match other documents. Query rewriting is the latter approach, and that’s the topic of this post. What I will say about expanding documents is there are tradeoffs: it’s always smart to compute something once in search and store it, rather than compute it for every query, and so there’s a certain attraction to modifying documents once. On the other hand, there are vastly more words in documents than there are words in queries, and doing too much to documents gets expensive and leads to imprecise matching (or returning too many irrelevant documents). I’ve also observed over the years that what works for queries doesn’t always work for documents.
Let’s go back to the example. If the user typed Gucci handbags as the query, we could build a query rewriting system that altered the query silently and ran a query such as Gucci and (handbag or handbags or purse or purses or category:handbags) behind the scenes. The user will see more matches, and hopefully the quality or relevance of the matches will remain as high as without the alteration.
Query rewriting isn’t just about synonyms and plurals. I’ve shared another example where it’s about adding a category constraint in place of a query word. There are many other types of rewrites. Acronyms are another example: expanding NWT to New With Tags, or contracting Stadium Giveaway to SGA. So is dealing with global language variations: organization and organisation are the same. And then there’s conflation: t-shirt, tshirt, and t shirt are equivalent. There are plenty of domain specific examples: US shoe sizes 9.5 and 9 ½ are the same, and if we’re smart we’d know the equivalent UK sizing is 9 and European sizing is 42.5 (but 9 and 9.5 aren’t the same thing when we’re talking about something besides shoes). We can also get clever with brands and products, and common abbreviations: iPad and Apple iPad are the same, and Mac and MacIntosh are the same (when it’s in the context of a computing device or a rain jacket).
Dealing with other languages is a different challenge, but many of the techniques that I’ll discuss later also work in non-English retrieval. Brian Johnson of eBay recently wrote an interesting post about the challenges of German language query rewriting.
Recall versus precision
Query rewriting sounds like a miraculous idea (as long as we have a way of creating the dictionary, which we’ll get to later). However, like everything else in search there is a recall versus precision tradeoff. Recall is the fraction of all relevant documents that are returned by the search engine. Precision is the fraction of the returned results that are relevant.
If our query rewriting system determines that handbag and handbags are the same, it seems likely we’ll get improved recall (more relevant answers) and we won’t hurt precision (the results we see will still be relevant). That’s good. But what about if our system decides handbags and bags are the same: not good news, we’ll certainly get more recall but precision will be poor (since there’ll be many other types of bags). In general, increasing recall trades with decreasing precision – the trick is to figure out the sweet spot where the user is most satisfied. (that’s a whole different topic, one I’ve written about here.)
Query rewriting is like everything else in search: it isn’t perfect, it’s an applied science. But my experience is that query rewriting is a silver bullet: it delivers as much improvement to search relevance as continuous innovation in core ranking.
How do you build query rewriting?
You mine vast quantities of data, and look for patterns.
Suppose we have a way of logging user queries and clicks in our system, and assume we’ve got these to work with in an Hadoop cluster. We could look for patterns that we think are interesting and might indicate equivalence. Here’s one simple idea: let’s look for a pattern where a user types a query, types another query, and then clicks on a result. What we’re looking for here is cases where a user tried something, wasn’t satisfied (they didn’t click on anything), tried something else, and then showed some satisfaction (they clicked on something). Here’s an example. Suppose a user tried apple iapd 2 [sic], immediately corrected their query to apple ipad 2, and then clicked on the first result.
If we do this at a large scale, we’ll get potentially millions of “wrong query” and “right query” pairs. We could then sort the output, and count the frequency of the pairs. We might find that “apple iapd 2” and “apple ipad 2” has a moderate frequency (say it occurs tens of times) in our data. Our top frequency pairs are likely to be things like “iphon” and “iphone”, and “ipdo” and “ipod”. (By the way, it turns out that users very rarely mess up the first letter of a word. Errors are much more common at the end of the words.)
Once we’ve got this data, we can use it naively to improve querying. Suppose a future user types apple iapd 2. We would look this query up on our servers in some fast lookup structure that manages the data we produced with our mining. If there’s a match, we might decide to:
- Silently switch the user’s query from apple iapd 2 to apple ipad 2
- Make a spelling suggestion to the user: “Did you mean apple ipad 2?”
- Rewrite the query to be, say, apple AND (iapd OR ipad) AND 2
Let’s suppose in this case that we’re confident the user made a mistake. We’d probably do the former, saving the user a click and a little embarrassment, and deliver great results instead of none (or very few). Pretty neat – our past users have helped a future user be successful. Crowdsourcing at a grand scale!
Getting Clever in Mining
Mining for “query, query, click” patterns is a good start but fairly naïve. Probably a step better than saving the fact apple iapd 2 and apple ipad 2 are same is to save the fact that iapd and ipad are the same. When we see that “query, query, click” pattern, we could look for differences between the adjacent queries and save those – rather than saving the query overall. This is a nice improvement: it’ll help us find many more cases where iapd and ipad are corrected in different contexts, and give us a much higher frequency of the pair, and more confidence that our correction is right.
We could also look for other patterns. For example, we could mine for a “query1, query2, query3, click” pattern. We could use this to find what’s different between query1 and query2, and what’s different between query2 and query3. We could even look for what’s different between query1 and query3, and perhaps give that a little less weight or confidence as a correction than for the adjacent queries.
Once you get started down this path, it’s pretty easy to get excited about other possibilities (what about “query, click, query”?). Rather than mine for every conceivable pattern, a nice way to think about this is as a graph: create a graph that connects queries (represented as nodes) to other queries, and store information about how they’re connected. For example, “apple iapd 2” and “apple ipad 2” might be connected and the edge that connects them annotated with the number of times we observed that correction in the data for all users. Maybe it’s also annotated with the average amount of time it took for users to make the correction. And anything else we think is worth storing that might help us look for interesting query rewrites.
What can you do with query rewriting?
Once you’ve created a query rewriting dictionary, you can begin to experiment with using it in your search engine. There are many different things you can consider doing:
- Spelling correction, without messaging to the user
- Spelling correction, where you tell the user you’ve flipped their query to something else, and offer them a way to get back to the original query
- A spelling suggestion, but showing the results for the original query
- Showing a few results from the original query, and then the rest from the query alteration
- Showing a few results from the query alteration, and then the rest from the original query
- Running a modified query, where extra terms are added to the user’s query (or perhaps some hints are given to the ranking function about the words in the query and the rewriting system’s confidence in them)
- Running the original query, but highlighting words in the results that come from the rewriting system
- Using the query rewriting data in autosuggest or related searches
- Or something else… really, it is a gold mine of ideas
What are the challenges?
Everything in search comes with its problems. There are recall and precision tradeoffs. And there are just plain mistakes that you need to avoid.
Suppose a user attended Central Michigan University. To 99% of users, CMU is Carnegie Mellon University, but not this user and a good number like him. This user comes to our search engine, runs the query CMU, doesn’t get what he wants (not interested in Carnegie Mellon), types Central Michigan University instead, and clicks on the first result. Our mining will discover that Central Michigan University and CMU are equivalent, and maybe add a query rewrite to our system. This isn’t going to end happy – most users don’t want anything from the less-known CMU. So, we need to be clever when a query has more than one equivalence.
There’s also problems with context. What does HP mean to you? Horsepower? Hewlett-Packard? Homepage? The query rewriting system needs context – we can’t just expand HP to all of them. If the user type HP Printer, it’s Hewlett-Packard. 350 HP engine is horsepower. So, we will need to be a little more sophisticated in how we mine – in many cases, we’ll need to preserve context.
The good news is we can test and learn. That’s the amazing thing about search when you have lots of traffic – you can try things out, mine how users react, and make adjustments. So, the good news is that the query rewrite system is an ecosystem – our first dictionary is our first attempt, and we can watch how users react to it. If we correct HP printer to Horsepower printer, you can bet users won’t click, or they’ll try a different query. We’ll very quickly learn we’ve made a mistake, and we can correct it. If we’ve got the budget, we can even use crowdsourcing to check the accuracy of our corrections before we even try them out on our customers.
Well, there you have it, a whirlwind introduction to query rewriting in search. Looking forward to your comments.