How do you know when you’ve improved the relevance of a search engine? There are many ways to answer that question, for example by running A/B tests on your website or doing qualitative studies with a few customers in a lab environment. This blog post focuses on using large numbers of human judges to assess search performance.
The process of asking many judges to assess search performance is known as relevance judgment: collecting human judgments on the relevance of search results. The basic task goes like this: you present a judge with a search engine query and one of its results, and you ask the judge to assess how relevant the result is to the query on (say) a four-point scale.
Suppose the query you want to assess is ipod nano 16Gb. Imagine that one of the results is a link to Apple’s page that describes the latest Apple iPod nano 16Gb. A judge might decide that this is a “great result” (which might be, say, our top rating on the four-point scale). They’d then click on a radio button to record their vote and move on to the next task. If the result we showed them was a story about a giraffe, the judge might decide this result is “irrelevant” (say the lowest rating on the four-point scale). If it were information about an iPhone, it might be “partially relevant” (say the second-to-lowest), and if it were a review of the latest iPod nano, the judge might say “relevant” (it’s not perfect, but it sure is useful information about an Apple iPod).
The human judgment process itself is subjective, and different people will make different choices. You could argue that a review of the latest iPod nano is a “great result” – maybe you think it’s even better than Apple’s page on the topic. You could also argue that the definitive Apple page isn’t terribly useful in making a buying decision, and you might only rate it as relevant. A judge who knows everything about Apple’s products might make a different decision from someone who’s never owned a digital music player. You get the idea. In practice, judging decisions depend on training, experience, context, and knowledge: it’s an art at best.
There are a few different ways to address subjectivity and get meaningful results. First, you can ask multiple judges to assess the same results and average their scores. Second, you can judge thousands of queries, so that you can compute metrics and be statistically confident that the numbers you see represent true differences in performance between algorithms. Finally, you can train your judges carefully, and give them information about what you think relevance means.
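To make the first idea concrete, here’s a minimal sketch of averaging several judges’ scores for the same (query, result) pairs. The judgment data below is entirely hypothetical:

```python
from statistics import mean

# Hypothetical judgments: three judges rate each (query, result) pair
# on the 0-3 scale described above.
judgments = {
    ("ipod nano 16Gb", "apple.com/ipod-nano"): [3, 3, 2],
    ("ipod nano 16Gb", "giraffe-story.html"): [0, 1, 0],
}

# Average the judges' scores to smooth out individual subjectivity.
avg_scores = {pair: mean(scores) for pair, scores in judgments.items()}
```

With more judges per pair, the average becomes a more stable estimate of the “true” relevance.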
Choosing the Task and Running the Judging
You have to decide what queries to judge, how many queries to judge, how many answers to show the judges, and what search engines or algorithms you want to compare. One possible approach to choosing queries is to randomly sample queries from your search engine query logs. You might choose to judge hundreds or thousands of queries. For each query, you might choose to judge the first ten results, and you might choose to compare the first ten results from each of (say) Google, Bing, and DuckDuckGo.
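The sampling step might be sketched like this (the query log below is hypothetical; real logs hold millions of entries):

```python
import random

# Hypothetical query log: one entry per search, so popular queries repeat.
query_log = ["ipod nano 16Gb", "giraffe facts", "ipod nano 16Gb",
             "1968 porsche 912", "weather boston"]

# Sampling the raw log weights popular queries by their frequency;
# sampling set(query_log) instead would weight each distinct query equally.
random.seed(0)  # for reproducibility in this sketch
queries_to_judge = random.sample(query_log, k=3)
```

Whether to sample with or without frequency weighting is a real design choice: frequency-weighted samples measure the experience of the average search, while uniform samples give rare “tail” queries more influence.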
Most search companies do their relevance judgment with crowdsourcing: they publish tasks through services such as CrowdFlower, and pay independent workers to perform the judgments. This does create some problems – some people try to game the system by writing software that randomly answers questions, or they answer quickly and erroneously. Search companies have to work constantly on detecting problems, and on removing both the poor results and the unreliable judges from the system. To give you a flavor, one thing search folks do is inject questions where they know what the relevance score should be, and then check that the judges answer most of those correctly (this is known as a ringer test). Another thing folks do is look for judges who consistently answer differently from other judges on the same tasks.
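That second check can be sketched as follows: score each judge by their average distance from the consensus on shared tasks, and flag the outliers. The judges and scores here are hypothetical:

```python
from statistics import mean

# Hypothetical scores: judge -> scores on shared tasks, aligned by index.
scores = {
    "judge_a": [3, 0, 2, 2],
    "judge_b": [3, 1, 2, 3],
    "judge_c": [0, 3, 0, 1],  # frequently disagrees with the others
}

n_tasks = len(next(iter(scores.values())))
# Consensus score per task: the mean across all judges.
consensus = [mean(s[i] for s in scores.values()) for i in range(n_tasks)]

# Each judge's average absolute distance from the consensus.
deviation = {judge: mean(abs(s[i] - consensus[i]) for i in range(n_tasks))
             for judge, s in scores.items()}
```

Judges whose deviation is persistently high are candidates for retraining or removal; production systems use more robust agreement statistics, but the idea is the same.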
Scoring Relevance Judgments
When you’ve got tens of answers for each query, and you’ve completed judging at least a few hundred queries, you’re ready to compute a metric that allows you to compare algorithms.
An industry favorite is NDCG, Normalized Discounted Cumulative Gain. It sounds complicated, but it’s a common-sense measure. Suppose that on our four-point scale, you give a score of 0 for an irrelevant result, 1 for partially relevant, 2 for relevant, and 3 for great. Suppose also that a query is judged by one of the judges, and the first four results that the search engine returns are assessed as relevant, irrelevant, great, and relevant by the judge. The cumulative gain after four results is the sum of the scores for each result: 2 + 0 + 3 + 2 = 7. That’s shown in the table below: result position or rank in the first column, the judge’s score or gain in the second column, and a running total or cumulative gain in the third column.

| Rank | Gain | Cumulative Gain |
|------|------|-----------------|
| 1    | 2    | 2               |
| 2    | 0    | 2               |
| 3    | 3    | 5               |
| 4    | 2    | 7               |
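The cumulative gain calculation is just a running sum of the judged gains, for example:

```python
from itertools import accumulate

# Judged gains for the first four results, in rank order (from the example).
gains = [2, 0, 3, 2]

# Cumulative gain: the running total of the gains.
cumulative_gain = list(accumulate(gains))
```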
Now for the Discounted part in NDCG. Search engine companies know that the first result in the search results is more important than the second, the second more important than the third, and so on. They know this because users click on result one much more than result two, and so on. Moreover, there’s plenty of research that shows users expect search engines to return great results at the top of the page, that they are unlikely to view results low on the page, and that they dislike having to use pagination.
The Discounted part of NDCG adds in a weighting based on position: one simple way to make position one more important than two (and so on) is to sum the score divided by the rank. So, for example, if the third result is “great”, its contribution is 3 / 3 = 1 (since the score for “great” is 3, and the rank of the result is 3). If “great” were the first result, its contribution would be 3 / 1 = 3. In practice, the score is often divided by the log of the rank, which seems to better match users’ perception of relevance. Anyway, for our example and to keep it simple, the Discounted Cumulative Gain (DCG) after four results is 2 / 1 + 0 / 2 + 3 / 3 + 2 / 4 = 3.5. You can see this in the table below: the third column has the discounted gain (the gain divided by the rank), and the fourth column keeps the running total or discounted cumulative gain.
| Rank | Gain | Discounted Gain | Discounted Cumulative Gain (DCG) |
|------|------|-----------------|----------------------------------|
| 1    | 2    | 2 / 1 = 2.00    | 2.00                             |
| 2    | 0    | 0 / 2 = 0.00    | 2.00                             |
| 3    | 3    | 3 / 3 = 1.00    | 3.00                             |
| 4    | 2    | 2 / 4 = 0.50    | 3.50                             |
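Both discount schemes are easy to code up. Here’s a sketch of the simple 1/rank version used in the example, plus the common log-based variant:

```python
import math

def dcg(gains):
    """DCG with the simple 1/rank discount (ranks are 1-based)."""
    return sum(g / rank for rank, g in enumerate(gains, start=1))

def dcg_log(gains):
    """A common variant: discount by log2(rank + 1) instead of rank."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))
```

The log discount decays more gently, so lower-ranked results retain more influence than with the 1/rank discount.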
The Normalized part in NDCG allows us to compare DCG values between different queries. It’s not fair to compare DCG values across queries because some queries are easier than others: for example, maybe it’s easy to get four perfect results for the query ipod nano, and much harder to get four perfect results for 1968 Porsche 912 targa soft window. If the search engine gets a high score for the easy query, and a poor score for the hard query, it doesn’t mean it’s worse at hard queries – it might just mean the queries have different degrees of difficulty.
Normalization works like this: you figure out what the best possible score is given the results you’ve seen so far. In our previous example, the results scored 2, 0, 3, and 2. The best arrangement of these same results would have been: 3, 2, 2, 0, that is, if the “great” result had been ranked first, followed by the two “relevant” ones, and then the “irrelevant”. This best ranking would have a DCG score of 3 / 1 + 2 / 2 + 2 / 3 + 0 / 4 = 4.67. This is known as the “ideal DCG,” or iDCG. Our NDCG is the score we got (3.50) divided by the ideal DCG (4.67), or 3.50 / 4.67 = 0.75. Now we can compare scores across queries, since we’re comparing percentages of the best possible arrangements and not the raw scores.
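The whole NDCG calculation for this example fits in a few lines (using the 1/rank discount to match the worked numbers):

```python
def dcg(gains):
    """DCG with the simple 1/rank discount (ranks are 1-based)."""
    return sum(g / rank for rank, g in enumerate(gains, start=1))

def ndcg(gains):
    """NDCG: the DCG of the observed ordering divided by the ideal DCG.

    Assumes at least one nonzero gain (otherwise the ideal DCG is zero).
    """
    ideal = sorted(gains, reverse=True)  # best possible arrangement
    return dcg(gains) / dcg(ideal)
```

A perfectly ordered result list scores 1.0, and everything else scores somewhere below it, which is what makes the metric comparable across queries.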
The table below builds out the whole story. You’ve seen the first four columns before. The fifth and sixth columns show what would have happened if the search engine had returned the results in the ideal order. The seventh and final column shows the fourth column (the DCG) divided by the sixth column (the ideal DCG, or iDCG), and the overall NDCG for our task is shown as 0.75 in bold in the bottom-right corner.
| Rank | Gain | Discounted Gain | DCG  | Ideal Discounted Gain | Ideal DCG (iDCG) | NDCG     |
|------|------|-----------------|------|-----------------------|------------------|----------|
| 1    | 2    | 2.00            | 2.00 | 3 / 1 = 3.00          | 3.00             | 0.67     |
| 2    | 0    | 0.00            | 2.00 | 2 / 2 = 1.00          | 4.00             | 0.50     |
| 3    | 3    | 1.00            | 3.00 | 2 / 3 = 0.67          | 4.67             | 0.64     |
| 4    | 2    | 0.50            | 3.50 | 0 / 4 = 0.00          | 4.67             | **0.75** |
Comparing Search Systems
Once you’ve computed NDCG values for each query, you can average them across thousands of queries. You can now compare two algorithms or search engines: you take the mean NDCG for each system, and check using a statistical test (such as a two-sided t-test) whether one algorithm is better than the other, and with what confidence. You might, for example, be able to say with 90% confidence that Google is better than Bing.
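Here’s a sketch of that comparison step with hypothetical per-query NDCG values. It computes Welch’s t statistic (a t-test that doesn’t assume equal variances); the statistic would then be checked against a t distribution to get a confidence level:

```python
from statistics import mean, variance

# Hypothetical per-query NDCG values for two systems (real evaluations
# would use thousands of queries, not six).
ndcg_a = [0.75, 0.80, 0.62, 0.91, 0.70, 0.85]
ndcg_b = [0.60, 0.72, 0.55, 0.88, 0.64, 0.79]

def welch_t(xs, ys):
    """Welch's t statistic for two independent samples.

    A larger |t| is stronger evidence that the mean NDCGs truly differ.
    """
    return (mean(xs) - mean(ys)) / (
        variance(xs) / len(xs) + variance(ys) / len(ys)
    ) ** 0.5

t_stat = welch_t(ndcg_a, ndcg_b)
```

In practice you’d hand this to a statistics library (for example, scipy’s independent-samples t-test) to get the p-value directly rather than computing it by hand.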
As I mentioned at the beginning, this is one important factor you could consider when comparing two algorithms. But there’s more to search engine comparison than comparing NDCG metrics. As I’ve said in previous posts, I’m a huge fan of measuring in many different ways and making decisions with all the data at hand. It takes professional judgment to decide one algorithm is better than another, and that’s part of managing any search engineering effort.
Hope you found this useful, see you next time!