Did you know that the vast majority of results from search engines are served from a cache of previously-computed results? Probably only around one third of all queries are actually evaluated using the backend search infrastructure.
Caching provides fast response to most user queries, while allowing search companies to spend much less on hardware, or to devote the resources in their search infrastructure to better computation of results.
Why does caching work in search?
In this post on click curves, I explained that most everything in search follows an inverse power law distribution (a so-called “Zipf curve”). The implication is that a few queries account for the majority of distinct queries, that is, most users are searching for the same things.
AOL memorably released three months of query logs in 2006. They were slammed for doing so and pretty quickly apologized and took down the data. However, it’s a pretty nice data set for our purposes of discussing caching.
The most popular query at AOL in those three months of 2006 was google. Around 0.9% of the queries typed by users were those looking to leave AOL’s search and head over to Google. The second most popular query was ebay at 0.4% of all queries, and the third yahoo at 0.4%. If you sum the frequency of the top ten unique queries, you’ve seen around 3% of all the query volume. Here’s what happens as you inspect more unique queries:
- If you sum the total frequency of the top 100 queries, you get around 6% of all the user query volume
- The top 1,000 unique queries are around 11% of the query volume
- The top 10,000 are around 20% of the volume
- The top 100,000 are around 34% of the volume
- The top 1,000,000 are around 58% of the volume
Those points are plotted on a log-log graph below.
We’d expect there’s diminishing returns in caching more queries and their results. As the queries become less frequent, there’s less benefit in caching their results. There’s no benefit in caching a query that occurs once. By the time you’re caching the millionth query from this set, you’re caching queries that occur only 5 times in 3 months. By the way, there are just over 36 million queries in the log, and about 10 million unique queries when they’re normalized (which I didn’t do a very good job of).
The key point to take away is that if we only store only the results for the top 100,000 queries, we can save our search backend from having to evaluate around 34% of all the queries that users pose.
This is a slight exaggeration, since we can’t quite key our search cache on query string alone. Remember that web search users have different language preferences, safe search settings, and so on. All up, the key for our cache probably has around ten parts — but remember than most users likely stick with the defaults, and so the query is the most variable element in the key. I don’t know quite what effect this’d have — but I bet it’s small (say, it reduces the caching effectiveness of the top 100,000 queries from 34% to 32% or so). I expect that the recent Bing and Google pushes into personalization have made caching harder: but I also bet that personalization affects relatively few queries.
Storing Cached Data
The key to the cache is the query and around ten other elements including safe search settings, language preference, market, and so on.
What’s stored in the cache is the results of the query with those settings. For web search, the results includes the list of matching URLs, the snippets, freshness information (more in a moment), and other elements you need to build the page. You might, for example, store the related searches, or the images or news or videos that are associated with queries that show more than web search results.
One of the tricks of caching is knowing when to expire what’s in the cache. You don’t want to keep showing results for a query when the results have actually changed; for example, maybe there’s a new snippet, or a URL has changed, or a new result has entered the top ten results, or there’s some breaking news. Here’s a few factors I thought of that you could use to expire results in the cache:
- Historical change rate of the results (record how frequently the results change, and use that data to predict when it’ll change in the future)
- What data is being displayed (if the results contain, for example, news sites, perhaps you expire the cache entry earlier)
- Change in query frequency (if users suddenly start typing a query much more frequently, that’s a clue that’s something has changed)
- How long the results have been cached (perhaps you have some time limits that ensure everything is refreshed on a cycle)
- The load on the search engine (if the search engine is under heavy load, don’t expire the cache as aggressively; if it’s under low load, it’s a good time to refresh the cache)
When you expire a result in the cache, you fall back to the search backend to recompute the results for the query, and then you store those results in the cache.
Bottom line, caching is amazingly effective in search. It’s a super hard problem at eBay, given the dynamic nature of the auction and fixed price formats: items sell, bids change, prices change, and so on. We also blend auctions, fixed price items, and products dynamically based on the results — so even the mix of formats is dynamic. We’re excited about making caching work well at eBay, but we’ve so far not hit anywhere near the heights you’d expect from the analysis of AOL’s web search query logs. I’ll explain this more in the future.
You can learn more about the AOL query logs by downloading this paper from Abdur Chowdhury’s website. Here’s the full citation:
G. Pass, A. Chowdhury, C. Torgeson, “A Picture of Search“, The First
International Conference on Scalable Information Systems, Hong Kong, June,
I’ll explain some interesting facts about the AOL query logs in a future post.