The Cassini Search Engine

Cassini has been in the news lately: Wired.com recently featured a story on Cassini, and I was interviewed by eCommerceBytes. Given the interest in the topic, this week’s post tells more of the inside story of Cassini.

The Beginning

At the end of 2010, we decided that we needed a new search platform. We’d done great work improving the existing Voyager search engine, and that work had contributed substantially to the turnaround of eBay’s Marketplaces business. But Voyager wasn’t a platform for the future: we needed a new, modern platform to innovate for our customers and business.

We agreed as a team to build Cassini, our new search engine platform. We started Cassini in around October 2010, with a team of a few folks led by industry veteran Nick Whyte.

Voyager

Our Voyager search engine was built in around 2002. It’s delivered impressively and reliably for over ten years. It’s a good old workhorse. But it isn’t up to what we need today: it was architected before many of the modern advances in how search works, having launched before Microsoft began its search effort and before Google’s current generation of search engine existed.

We’ve innovated on top of Voyager, and John Donahoe has discussed how our work on search has been important. But it hasn’t been easy and we haven’t been able to do everything we’ve wanted – and our teams want a new platform that allows us to innovate more.

Cassini

Since Voyager was named after the late 1970s space probes, we named our new search engine after the Cassini probe, launched in 1997. It’s a symbolic way of saying it’s still a search engine, but a more modern one.

The Cassini-Huygens probe, the namesake of our Cassini search engine.

In 2010, we made many fundamental decisions about the architecture and design principles of Cassini. In particular, we decided that:

  • It’d support searching over vastly more text; by default, Voyager lets users search over the title of our items and not the description. We decided that Cassini would allow search over the entire document
  • We’d build a data center automation suite to make deployment of the Cassini software much easier than deploying Voyager; we also built a vision around provisioning machines, fault remediation, and more
  • We decided to support sophisticated, modern approaches to search ranking, including being able to process a query in multiple rounds of ranking (where the first round was fast and approximate, and later rounds were intensive and accurate); there’s a small sketch of this idea after the list
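
To make the multi-round idea concrete, here’s a minimal sketch of two-round ranking in Python. The factor names, weights, and cutoff are invented for illustration and aren’t Cassini’s actual scoring; the point is the shape of the computation: a cheap first pass over all candidates, then an expensive second pass over only the survivors.

```python
# Hypothetical sketch of multi-round ranking: a cheap first pass prunes the
# candidate set, and an expensive second pass re-ranks only the survivors.

def cheap_score(item, query_words):
    # Round 1: fast and approximate -- how many query words appear in the title?
    title_words = set(item["title"].lower().split())
    return sum(1 for w in query_words if w in title_words)

def expensive_score(item, query_words):
    # Round 2: slower and more accurate -- pretend we can now afford to look at
    # the description and a seller-quality signal (both hypothetical factors).
    desc_words = item["description"].lower().split()
    desc_matches = sum(desc_words.count(w) for w in query_words)
    return 2.0 * cheap_score(item, query_words) + 0.5 * desc_matches + item["seller_quality"]

def rank(items, query, first_pass_cutoff=100):
    query_words = query.lower().split()
    # Round 1 over everything; keep only the best candidates.
    candidates = sorted(items, key=lambda it: cheap_score(it, query_words), reverse=True)
    candidates = candidates[:first_pass_cutoff]
    # Round 2 only over the survivors.
    return sorted(candidates, key=lambda it: expensive_score(it, query_words), reverse=True)

items = [
    {"title": "Polaroid 600 Land Camera", "description": "vintage polaroid camera, working", "seller_quality": 0.9},
    {"title": "Camera bag", "description": "fits most compact cameras", "seller_quality": 0.7},
]
print([it["title"] for it in rank(items, "polaroid camera")])
```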

Cassini is a true start-from-scratch rewrite of eBay’s search engine. Because of the unique nature of eBay’s search problem, we couldn’t use an existing solution. We also made a clean break from Voyager – we wanted to build a new, modular, layered solution.

Cassini in 2011

January 2011 marked the real beginning of the project. We’d been working for three or four months as a team of five or so, and we’d sketched out some of the key components. In 2011, we set the project in serious motion – adding more folks to the core team to begin to build key functionality, and embarking on the work needed to automate search. We also began to talk about the project externally.

Cassini in 2012

In 2012, we’d grown the team substantially from the handful of folks who began the project at the end of 2010. We’d also made substantial progress by June – Cassini was being used behind-the-scenes for a couple of minor features, and we’d been demoing it internally.

In the second half of 2012, we began to roll out Cassini for customer-facing scenarios. Our goal was to add value for our customers, but also to harden the system and understand how it performs when it’s operating at scale. This gave us practice in operating the system, and forced us to build many of the pieces that are needed to monitor and operate the system.

The two scenarios we rolled out Cassini for in 2012 were Completed Search worldwide, and null and low results search in North America. Mark Carges talked about this recently at eBay’s 2013 Analysts’ Day.

Completed search is the feature that allows our customers to search listings that have ended, whether they sold or not; it’s a key way that our customers do price research, whether for pricing an item they’re selling or to figure out what to pay when they’re buying. Because Cassini is a more scalable technology, we were able to provide search over 90 days of items rather than the 14 days that Voyager has always offered.

When users get no results from a query (or very few), Voyager offered a very basic solution – no results and a few suggested queries that you might try (this is still what you’ll see on, for example, the Australian site today). Cassini could do much more – it could help us try related queries and search more data to find related matches. After customer testing, we rolled out Cassini in North America, the UK, and Germany to support this scenario – a feature that’s been really well received.

Cassini in May 2013

In January 2013, we began testing Cassini for the default search with around 5% of US customers. As I’ve discussed previously, when we make platform changes, we like to aim for parity for the customers so that it’s a seamless transition. That’s our goal in testing – replace Voyager with a better platform, but offer equivalent results for queries. The testing has been going on with 5% of our US customers almost continuously this year. Parity isn’t easy: it’s a different platform, and our Search Science and Engineering teams have done great work to get us there.

We’ve recently launched Cassini in the US market. If you’re querying on ebay.com, you’re using Cassini. It’s been a smooth launch – our hardening of the platform in 2012 and our extensive testing in 2013 have set us on a good path. Cassini doesn’t yet do much that’s different from Voyager: for example, it isn’t searching over descriptions by default yet.

Cassini in 2013 and beyond

We’ll begin working on making Cassini the default search engine in other major markets later in the year. We’re beginning to test it in the UK, and we’ll work on Germany soon too. There’s also a ton of logistics to cover in launching the engine: Cassini requires thousands of computers in several data centers.

We’ll also begin to innovate even faster in making search better. Now that we’ve got our new platform, our search science team can begin trying new features and work even faster to deliver better results for our customers. As a team, we’ll blog more about those changes.

We have much more work to do in replacing Voyager. Our users run more than 250 million queries each day, but many more queries come from internal sources such as our selling tools, on-site merchandizing, and email generation tools. We need to add features to Cassini to support those scenarios, and move them over from Voyager.

It’ll be a while before Voyager is no longer needed to support a scenario somewhere at eBay. When that day comes, we’ll turn Voyager off, celebrate a little, and be entirely powered by Cassini.

See you next week.

The size and scale of eBay: 2013 edition

It’s time for an update of the eBay Marketplaces interesting statistics that I shared last year. eBay Marketplaces is the team that builds ebay.com, ebay.co.uk, ebay.de, ebay.com.au, and most of the other worldwide marketplaces under the eBay brand.

eBay Marketplaces sold over US$75.3 billion in merchandise in 2012

Here are some refreshed and new facts:

  • We have over 50 petabytes of data stored in our Hadoop and Teradata clusters
  • We have over 400 million items for sale
  • We process more than 250 million user queries per day
  • We serve over 100,000 pages per second
  • Our users spend over 180 years in total every day looking at items
  • We have over 112 million active users
  • We sold over US$75 billion in merchandise in 2012

eBay’s an exciting place to be — plenty of engineering, business, and commerce challenges that are driven by users, items, traffic, and sales. See you next week.

Selling on eBay

I gave a webinar last week on Search at eBay. Thank you to those who attended and asked great questions. The webinar was recorded and you can listen and view the slides here.

Search at eBay. A tour of search from a recent webinar.

In the webinar, I present a background of some of the facts and figures about eBay (an updated version of this post), a short introduction to how search works, and a tour of our current and future search platforms. The talk concludes with over thirty minutes of Q&A. The middle of the talk is dedicated to explaining how to sell successfully on eBay, and I summarize that advice in this post.

Listing for Success in Search

It’s important to remember that search isn’t the only step in selling on eBay. It’s a three-step process:

  1. Finding an item in search
  2. Choosing an item
  3. Purchasing the item

Visibility in search is therefore important, but it isn’t sufficient to guarantee sales. You must also focus on ensuring buyers click on your item and make the choice to purchase it.

On the first point: it’s true that most successful sales begin with a search on eBay, but many others occur through buyers finding items on another search engine, in an email,  through the merchandizing that we feature, or through another source such as a social network. Searches on eBay also don’t always have a keyword: many buyers browse through our categories to find the item they want.

Visibility in Search

Our search team focuses on three tenets of delivering a great experience for our buyers:

  1. Trust: ensuring our buyers get a retail-like experience from great sellers
  2. Value: ensuring our buyers find great deals, including shipping costs
  3. Relevance: ensuring our buyers see items that match what they want

If you focus on these three tenets, you will be successful in gaining visibility in search.

Delivering Trust, Value, and Relevance

Here is some specific advice on how you can be found in search, have your items chosen in search, and drive sales:

  • List in the correct category with a meaningful title; a meaningful title contains only words that accurately describe the item, omits “spam words” (off-topic words), and isn’t unnecessarily short or long
  • Use item specifics and item condition; we match queries against the item specifics, and to be found successfully we recommend you adopt them
  • When it makes sense for your business, use a long-duration, multi-quantity Buy It Now format
  • Study what it takes to be a Top Rated Seller on eBay; it is our definition of the key tenets of being a trusted seller on eBay, and we use those trust signals in search
  • Have great, large, clear pictures for your listings; we know from many experiments that pictures matter to our buyers, and you’re much more likely to be chosen in search and have sales if your pictures are high-quality
  • Be price competitive; market prices change frequently, and you need to understand the market price for your items and be prepared to make changes
  • Specify shipping costs; our customers want great value, and that includes value in shipping
  • Have a clear, structured, and comprehensive item description. Stay “on topic” and describe your item accurately
  • Offer fast shipping and international options; fast shipping matters to our buyers, and offering international options means you have more exposure to more customers

We don’t offer specific answers to specific questions about how Best Match works on eBay. For example, we don’t comment on how we use each of the item specifics in search or whether having certain keywords in certain places in your titles matters. Why not? We want sellers to focus on the key tenets of Trust, Value, and Relevance, and not on specific features that may change or that might give sellers a short-term unfair advantage over other great sellers. Indeed, if a shortcut works today, it may not work tomorrow — we want a level playing field for all sellers, and we’re continually improving Best Match to use more Trust, Value, and Relevance information.

I encourage you to listen to the webinar for a richer explanation of how to sell successfully on eBay with a search-centric focus. See you next week.

Changing platforms at Scale: Lessons from eBay

At eBay, we’ve been on a journey to modernize our platforms, and rebuild old, crufty systems as modern, flexible ones.

In 2011, Sri Shivananda, Mark Carges, and I decided to modernize our front-end development stack. We were building our web applications in v4, an entirely home-grown framework. It wasn’t intuitive to new engineers joining eBay, who were familiar with industry standards we didn’t use, and we’d also built very tightly coupled balls of spaghetti code over many years. (That’s not a criticism of our engineers – every system gets crufty and unwieldy eventually; software has an effective lifespan and then it needs to be replaced.)

Sri Shivananda, eBay Marketplaces’ VP of Platform and Infrastructure

We set design goals for our new Raptor framework, including that we wanted to do a better job separating presentation from business logic. We also wanted better tools for engineers, faster code build times, better monitoring and alerting when problems occur, the ability to test changes without restarting our web servers, and a framework that was intuitive to engineers who joined from other companies. It was an ambitious project, and one that Sri has led as a successful revolution in the Marketplaces business. We now build software much faster than ever before, and we’ve rewritten major parts of the front-end systems. (And we’ve open sourced part of the framework.)

That’s the context, but what this post is really about is how you execute a change in platforms in a large company with complex software systems.

The “Steel Thread” Model

Mark Carges, eBay’s Chief Technical Officer

Our CTO, Mark Carges, advocates building a “steel thread” use case when we rethink platforms. What he means is that when you build a new platform, build it at the same time as a single use case on top of the platform. That is, build a system with the platform, like a steel thread running end-to-end through everything we do.

A good platform team thinks broadly about all systems that’ll be built on the platform, and designs for the present and the future. The risk is they’ll build the whole thing – including features that no one ultimately needs for use cases that are three years away. Things change fast in this world. Large platform projects can go down very deep holes, and sometimes never come out.

The wisdom of the “steel thread” model is that the platform team still does the thinking, but it’s pushed by an application team to fully design and build only the parts that are immediately needed. The tension forces prioritization, tradeoffs, and pragmatism in the platform team. Once you’re done with the first use case, you can move on to subsequent ones and build more of the platform.

Rebuilding the Search Results Page

Our first steel thread use case on Raptor was the eBay Marketplaces Search Results Page (the SRP). We picked this use case because it was hard: it’s our second-most trafficked page, and one of our most complex; building the SRP on Raptor would exercise the new platform extensively.

We co-located our new Raptor platform team – which was a small team by design – together with one of our most mission critical teams, the search frontend team. We declared that their success was mutually dependent: we’re not celebrating until the SRP is built on Raptor.

We asked the team to rebuild the SRP together. We asked for an aggressive timeline. We set bold goals. But there was one twist: build the same functionality and look-and-feel as the existing SRP. That is, we asked the team to only change one variable: change the platform. We asked them not to change an important second variable: the functionality of the site.

This turned out to be important. The end result – after much hard work – was a shiny new SRP code base:  modular, cleaner, simpler, and built on a modern platform.  But it looked and behaved the same as the old one. This allowed us to test one thing: is it equivalent for our customers to the old one?

Testing the new Search Results Page

We ran a few weeks of A/B tests, where we showed different customer populations the old and new search results page. Remember, they’re pretty much the same SRPs from the customers’ perspective. What we were looking for were subtle problems: was the new experience slower for some scenarios than the old one? Did it break in certain browsers on some platforms? Was it as reliable? Could we operate it as effectively? We could compare the populations and spot the differences reasonably easily.

This was a substantial change in our platforms and systems, and the answer wasn’t always perfect. We took the new SRP out of service a few times, fixed bugs, and put it back in. Ultimately, we deemed it a fine replacement in North America, and turned it on for all our customers in North America. The next few months saw us repeat the process across our other major markets (where there are subtle differences between our SRPs).

What’s important is that we didn’t change the look and feel or functionality at first: if we’d done that, we may not have seen several of the small problems we did see as fast as we saw them.

Keeping the old application

Another wise choice was that we didn’t follow the old adage of “out with the old, and in with the new”. We kept the old SRP around running in our data centers for a few months, even though it wasn’t in service.

This gave us a fallback plan: when you make major changes, it’s never going to be entirely plain sailing. We knew that the new SRP would have problems, and that we’d want to take it out of service. When we did, we could put the old one back in service while we fixed the problem.

Eventually, we reached the confidence with our new SRP that we didn’t need the old one. And so it was retired, and the hardware put to other uses. That was over a year ago – it has been smooth sailing since.

The curse of dual-development

You might ask why we set bold goals and pushed the teams hard to build the new Raptor platform and the SRP. We like to do that at eBay, but there’s also a pragmatic reason: while there are two SRP code bases, there’s twice the engineering going on.

Imagine that we’ve got a new idea for an improvement to the SRP. While we’re building the new SRP, the team has to add that idea to the new code base. The team has to get the idea into the old code base too – both so we can get it out to our customers, and so that we can carry out that careful testing I described earlier.

To prevent dual development slowing down our project, we declared a moratorium on features in the SRP for a couple of months. This was tough on the broader team – lots of folks want features from the search team, and we delayed all requests. The benefit was we could move much faster in building the new SRP, and getting it out to customers. Of course, a moratorium can’t go on for too long.

And then we changed the page

After we were done with the rollout, the SRP application team could move with speed on modernizing the functionality and look-and-feel of the search results page.

Ultimately, this became an important part of eBay 2.0, a refresh of the site that we launched in 2012. And they’re now set up to move faster whenever they need to: we’re testing more new ideas that improve the customer experience than ever before, and that’s key to the continued technology-driven revolution at eBay.

See you next week.

The size, scale, and numbers of eBay.com

I work in the Marketplaces business at eBay. That’s the part of the company that builds ebay.com, ebay.co.uk, ebay.de, ebay.com.au, and most of the other worldwide marketplaces under the eBay brand. (The other major parts of eBay Inc are PayPal, GSI Commerce, x.commerce, and StubHub.)

I am lucky to have opportunities to speak publicly about eBay, and about the technology we’re building. It’s an exciting time to give a talk – we are in the middle of rewriting our search engine, we’ve improved search substantially, we’re automating our data centers, we’re retooling our user experience development stack, and much more.

At the beginning of most talks, I get the chance to share a few facts about our scale and size. I thought I’d share some with you:

  • We have over 10 petabytes of data stored in our Hadoop and Teradata clusters. Hadoop is primarily used by engineers who use data to build products, and Teradata is primarily used by our finance team to understand our business
  • We have over 300 million items for sale, and over a billion accessible at any time (including, for example, items that are no longer for sale but that are used by customers for price research)
  • We process around 250 million user queries per day (which become many billions of queries behind the scenes – query rewriting implies many calls to search to provide results for a single user query, and many other parts of our system use search for various reasons)
  • We serve over 2 billion pages to customers every day
  • We have over 100 million active users
  • We sold over US$68 billion in merchandise in 2011
  • We make over 75 billion database calls each day (our database tables are denormalized because doing relational joins at our scale is often too slow – and so we precompute and store the results, leading to many more queries that take much less time each)

They’re some pretty large numbers, ones that make our engineering challenges exciting and rewarding to solve.

Any surprises for you?

Ranking at eBay (Part #3)

Over the last two posts on this topic, I’ve explained some of the unique problems of eBay’s search challenge, and how we think about using different factors to build a ranking function. In this post, I’ll tell you more about how we use the factors to rank, how we decide if we’ve improved ranking at eBay, and where we are on the ranking journey.

Hand-tuning a Ranking Function

A ranking function combines different factors to give an overall score that can be used to rank documents from most- to least-relevant to a query. This involves computing each factor using the information that it needs, and then plugging the results into the overall function to combine the factors. Ranking functions are complicated: there are typically at least three factors in even the simplest functions, and they’re typically combined by multiplying constants by each of the factors. The output is just a score, which is simply used later to sort the results into rank order (by the way, the scores are typically meaningless across different queries).

If you’ve got two, three, or maybe ten different factors, you can combine them by hand, using a mix of intuition and experimentation. That’s pretty much what happens in the public domain research. For example, there’s a well-known ranking function, Okapi BM25, that brings together three major factors:

  1. Term frequency: How often does a word from the query occur in the document? (the intuition being that a document that contains a query word many times is more relevant than a document that contains it fewer times. For example, if your query is ipod, then a document that mentions ipod ten times is more relevant than one that mentions it once)
  2. Inverse document frequency: How rare is a query word across the whole collection? (the intuition being that a document that contains a rarer word from the query is more relevant than one that contains a more common word. For example, if your query was pink ipod nano, then a document that contains nano is more relevant than a document that contains pink)
  3. Inverse document length: How long is the document? (the intuition being that the longer the document, the more likely it is to contain a query word on the balance of probabilities. Therefore, longer documents need to be slightly penalized or they’ll dominate the results for no good reason)

How are these factors combined in BM25? Pretty much by hand. In the Wikipedia page for Okapi BM25, the community recommends that the term frequency be weighted slightly higher than the inverse document frequency (a multiplication of 1.2 or 2.0). I’ve heard different recommendations from different people, and it’s pretty much a hand-tuning game to try different approaches and see what works. You’ll often find that research papers talk about what constants they used, and how they selected them; for example, in this 2004 paper of mine, we explain the BM25 variant we use and the constants we chose.
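
To make the hand-tuning discussion concrete, here’s a minimal sketch of an Okapi BM25 scorer in Python. The k1 and b constants are set to commonly quoted textbook values, and the toy documents are invented; this is the public formula described above, not eBay’s ranking function.

```python
import math

def bm25_score(query_words, doc_words, doc_freqs, num_docs, avg_doc_len, k1=1.2, b=0.75):
    """Okapi BM25: combines term frequency, inverse document frequency, and
    document length, using hand-tuned constants k1 and b."""
    score = 0.0
    doc_len = len(doc_words)
    for word in query_words:
        tf = doc_words.count(word)                    # term frequency in this document
        n = doc_freqs.get(word, 0)                    # number of documents containing the word
        idf = math.log((num_docs - n + 0.5) / (n + 0.5) + 1.0)   # inverse document frequency
        # Length-normalized term frequency; longer documents are slightly penalized.
        score += idf * (tf * (k1 + 1.0)) / (tf + k1 * (1.0 - b + b * doc_len / avg_doc_len))
    return score

# Toy example: two "documents" (item titles) and one query.
docs = [
    "nice older polaroid 600 land camera sun auto focus 660".split(),
    "px 100 silver shade impossible project film for polaroid sx-70 camera".split(),
]
doc_freqs = {}
for d in docs:
    for w in set(d):
        doc_freqs[w] = doc_freqs.get(w, 0) + 1
avg_len = sum(len(d) for d in docs) / len(docs)

for d in docs:
    print(bm25_score(["polaroid", "camera"], d, doc_freqs, len(docs), avg_len))
```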

This all works to a certain point: it’s possible to tune factors, and still have a function you can intuitively understand, as long as you don’t have too many factors.

Training Algorithms to Combine Factors

At eBay, we’ve historically done just what I described to build the Best Match function. We created factors, and combined them by hand using intuition, and then used experimentation to see if what we’ve done is better than what’s currently running on the site. That worked for a time, and was key to making the progress we’ve made as a team.

At some point, combining factors by hand becomes very difficult to do — it becomes easier to learn how to combine the factors using algorithms (using what’s broadly known as machine learning). It’s claimed that AltaVista was the first to use algorithmic approaches to combine ranking factors, and that this is now prevalent in industry. It’s certainly true that everyone in the Valley talks about Yahoo!’s use of gradient boosted decision trees in their now-retired search engine, and that Microsoft announced they used machine-based approaches as early as 2005. Google’s approach isn’t known, though I’d guess there’s more hand tuning than in other search engines. Google has said they use more than 200 signals in ranking (I call these factors in this post).

Let me give you an example of how you’d go about using algorithms to combine factors.

First, you need to decide what you’re aiming to achieve, since you want to learn how to combine the factors so that you can achieve a specific goal. There are lots of choices of what you might optimize for: for example, we might want to deliver relevant results on a per query basis, we might want to maximize clicks on the results per query, we might want to sell more items by dollar value, we might want to sell more items, or we might want to increase the number of times that a user uses the search engine each month. Of course, there are many other choices. But this is the important first step — decide what you’re optimizing for.

Second, once you’ve chosen what you want to achieve, you need training data so that your algorithm can learn how to rank. Let’s suppose we’ve decided we want to maximize the number of clicks on results. If we’ve stored (logged or recorded) the interactions of users with our search engine, we have a vast amount of data to extract and use for this task. We go to our data repository and we extract queries and items that were clicked, and queries and items that were not clicked. So, for example, we might extract thousands of sessions where a user ran the query ipod, and the different item identifiers that they did and didn’t click on; it’s important to have both positive and negative training data. We’d do this at a vast scale: we’re likely looking to have hundreds of thousands of data points. (How much data you need depends on how many factors you have, and the algorithm you choose.)

So, now we’ve got examples of what users do and don’t click on a per query basis. Third, it’s time to go and extract the factors that we’re using in ranking. So, we get our hands on all the original data that we need to compute our factors — whether it’s the original items, information about sellers, information about buyers, information from the images, or other behavioral information. Consider an example from earlier: we might want to use term frequency in the item as a factor, so we need to go fetch the original item text, and from that item we’d extract the number of times that each of the query words occurs in the document. We’d do this for every query we’re using in training, and every document that is and isn’t clicked on. For the query ipod, it might have generated a click on this item. We’d inspect this item, count the number of times that ipod occurs, and record the fact that it occurred 44 times. Once we’ve got the factor values for all queries and items, we’re ready to start training our algorithm to combine the factors.

Fourth, we choose an algorithmic approach to learning how to combine the factors. Typical choices might be a support vector machine, decision tree, neural net, or Bayesian network. We then train the algorithm using the training data we’ve created, and give it the target or goal we’re optimizing for. The goal is that the algorithm learns how to separate good examples from bad examples using the factors we’ve provided, and can combine the factors in a way that will lead to relevant documents being ranked ahead of irrelevant ones. In the case we’ve described, we’re aiming for the algorithm to be able to put items that are going to be clicked ahead of items that aren’t going to be clicked, and we’re allowing the algorithm to choose which factors will help it do that and to combine them in a way that achieves the goal. Once we’re done training, we’d typically validate that our algorithm works by testing it on some data that we’ve set aside, and then we’re ready to do some serious analysis before testing it on customers.
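
Here’s a minimal sketch of steps two through four using scikit-learn’s gradient boosted trees, one of the algorithm families mentioned above. The factor names and the tiny in-line training set are invented for illustration; a real pipeline would extract hundreds of thousands of logged (query, item, clicked) examples and far richer factors.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

# Each row is one (query, item) pair; the columns are hypothetical ranking factors:
# [term frequency in title, summed inverse document frequency, title length, seller feedback score]
X = np.array([
    [3, 4.2, 10, 0.99],   # item that was clicked for its query
    [0, 1.1, 25, 0.80],   # item that was not clicked
    [2, 3.9, 12, 0.95],
    [1, 0.9, 30, 0.60],
    [4, 5.0,  8, 0.98],
    [0, 0.5, 40, 0.70],
])
y = np.array([1, 0, 1, 0, 1, 0])   # target: did the user click?

# Train a gradient boosted decision tree model to combine the factors.
model = GradientBoostingClassifier(n_estimators=50, max_depth=2)
model.fit(X, y)

# At query time, score every candidate item and sort by predicted click probability.
candidates = np.array([[2, 3.5, 11, 0.97], [0, 1.0, 28, 0.75]])
scores = model.predict_proba(candidates)[:, 1]
print(sorted(zip(scores, ["item A", "item B"]), reverse=True))
```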

Fifth, before you launch a new ranking algorithm, you want to know if it’s working sensibly enough for even a small set of customers to see. I’ll explain later how to launch a new approach.

If you’re looking for a simple, graphical way to play around with training using a variety of algorithms, I recommend Orange. It works on Mac OS X.

What about Best Match at eBay?

We launched a machine-learned version of Best Match earlier in 2012. You can learn more about the work we’re doing on machine learning at eBay here.

We now have tens of factors in our ranking function, and it isn’t practical to combine them by hand. And so the 2012 version of Best Match combines its factors by using a machine learned approach. As we add more factors — which we’re always trying to do — we retrain our algorithm, test, iterate, learn, and release new versions. We’re adding more factors because we want to bring more knowledge to the ranking process: the more different, useful data that the ranking algorithm has, the better it will do in separating relevant from irrelevant items.

We don’t talk about what target we’re optimizing for, nor have we explained in detail what factors are used in ranking. We might start sharing the factors soon — in the same way Google does for its ranking function.

Launching a New Ranking Algorithm

Before you launch a new ranking function, you should be sure it’s going to be a likely positive experience for your customers. No function is likely to be entirely better than a previous function — what you’re expecting is that the vast majority of experiences are the same or better, and that only a few scenarios are worse (and, hopefully, not much worse). It’s a little like buying a new car — you usually buy one that’s better than the old one, but there’s usually some compromise you’re making (like, say, not quite the right color, you don’t like the wheels as much, or maybe it doesn’t quite corner as well).

A good place to start in releasing a new function is to use it in the team. We have a side-by-side tool that allows us to see an existing ranking scheme alongside a new approach in a single screen. You run a query, and you see results for both approaches in the same screen. We use this tool to kick the tires of a new approach, and empirically observe whether there’s a benefit for the customers, and what kinds of issues we might see when we release it. I’ve included a simple example from our side-by-side tool, where you can see a comparison of two rankings for the query yarn, and slightly different results — the team saw that in the experiment on the left we were surfacing a great new result (in green), and on the right in the default control we were surfacing a result that wasn’t price competitive (in red).

Side-by-side results for the query yarn. On the left is an experiment; on the right is the default experience.

If a new approach passes our bar as a team, we’ll then do some human evaluation on a large scale. I explained this in this blog post, but in essence what we do is ask people to judge whether results are relevant or not to queries, and then compute an overall score that tells us how good our new algorithm is compared to the old one. This also allows us to dig into cases where it’s worse, and make sure it’s not significantly worse. We also look at the basic facts about the new approach: for example, for a large set of queries, how different are the results? (with the rationale that we don’t want to dramatically change the customer experience). If we see some quick fixes we can make, we do so.

Once a new algorithm looks good, it’s time to test it on our customers. We typically start very small, trying it out on a tiny fraction of customers, and comparing how those customers use search relative to those who are using the regular algorithms. As we get more confident, we increase the number of customers who are seeing the new approach. And after a few weeks’ testing, if the new approach is superior to the existing approach, we’ll replace the algorithm entirely. We measure many things about search — and we use all the different facts to make decisions. It’s a complex process, and rarely clear cut — there are facts that help, but in the end it’s usually a nuanced judgement to release a new function.
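
As an illustration of the comparison step, here’s a minimal sketch of one common way to compare a test population against a control: a two-proportion z-test on a rate such as searches that lead to a click. The numbers are invented, and this is just one generic technique, not a description of the actual measures eBay uses.

```python
import math

def two_proportion_ztest(successes_a, trials_a, successes_b, trials_b):
    """Compare a rate (e.g. searches that led to a click) between test and control."""
    p_a = successes_a / trials_a
    p_b = successes_b / trials_b
    p_pool = (successes_a + successes_b) / (trials_a + trials_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / trials_a + 1 / trials_b))
    z = (p_a - p_b) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))   # two-sided p-value
    return p_a, p_b, z, p_value

# Invented example: a small test bucket vs. control, measuring searches with a click.
print(two_proportion_ztest(successes_a=41_200, trials_a=100_000,
                           successes_b=40_500, trials_b=100_000))
```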

Hope you’ve enjoyed this post, the final one in my eBay ranking series. See you again next week, with something new on a new topic!

The race to build better search: a Reuters article

I spent a couple of hours with Alistair Barr from Reuters discussing search at eBay, and our Project Cassini rewrite of eBay’s search platform. Alistair published the story yesterday, and it’s a good, short read. Thanks, Alistair, for sharing the story with your readers.

Alistair discusses Walmart’s search rewrite (which didn’t take too long by the sounds of it — my recent blog post suggests why), quotes responses from Google’s PR team, and shares insights from my good friend Oren Etzioni who works at both the University of Washington and the rather awesome shopping decision engine, decide.com. He mentions Google’s features that obviously do a little light image processing to match color and shape intents in queries such as “red dress” or “v-neck dress” against the content in the images.

Google does do lots of things well, but they’re often not the first to do them — we built that color search into Bing’s image search in 2008 (try this “red dress” query). On a related note, eBay has a rather cool image search feature, which we really should make more prominent in our search experience (mental note: must work on that). Try this “red dress” query, and you’ll see results that use visual image features to find related items.

I’ll be back with part #3 of my Ranking at eBay series soon.

Ranking at eBay (Part #2)

In part 1 of Ranking at eBay, I explained what makes the eBay search problem different to other online search problems. I also explained why there’s a certain kinship with Twitter, the only other engine that deals with the same kinds of challenges that eBay does. To sum it up, eBay’s search problem is different because our items aren’t around for very long, the information about the items changes very quickly, and we have over 300 million items and the majority are not products like you’d find on major commerce web sites like Walmart or Amazon.

In this post, I explain how we think about using data in the eBay ranking problem. In the next post, I’ll explain how we combine all of that data to compute our Best Match function, and how it’s all coming together in a world where we are rebuilding search at eBay.

Ranking Factors at eBay

Let’s imagine that you and I work together and run the search science team at eBay. Part of our role is to help make sure that the items and products that are returned when a customer runs a query are ordered correctly. Correctly means that the most relevant item to the customer’s information need is in the first position in our search results, the next most relevant is in the second position, and so on.

What does relevant mean? In eBay’s case, you could abstract it to say that the item is great value from a trusted seller, it matches the intent of the query, and it’s something that buyers want to buy. For example, if the customer queries for a polaroid camera, our best result might be a great, used, vintage Polaroid camera in excellent condition. Of course, it’s subjective: you could argue it should be a new generation Polaroid camera, or some other plausible argument. In a general sense, relevance is approximated by computing some measure of statistical similarity — obviously, search engines can’t read a user’s mind, so they compute information to score how similar an item is to a query, and add any other information that’s query independent and can help. (In a future post, I’ll come back and explain how we understand whether we’ve got it right, and work to understand what the underlying intent is behind a query.)

Let’s agree for now that we want to order results from most- to least-relevant to a query, when the user is using our default Best Match sorting feature. So, how do we do that? The key is having information about what we’re ranking: and I’ll argue that the more, different information we have, the better job we can do. Let’s start simply: suppose we only have one data source, the title of the item. I’ve shown an item below, and you can see its title at the top, “NICE Older POLAROID 600 Land Camera SUN AUTO FOCUS 660”.

A Polaroid Camera on eBay. Notice the title of the item, "NICE Older POLAROID 600 Land Camera SUN AUTO FOCUS 660"

Let’s think about the factors we can use from the item title to help us order results in a likely relevant way:

  • Does the title contain the query words? The rationale for proposing this factor is pretty simple: if the words are in the title, the item is more relevant than an item that doesn’t contain the words.
  • How frequently are the query words repeated in the title? The rationale is: the more the words are repeated, the more likely that item is to be on the topic of the query, and so the more relevant the item.
  • How rare are each of the query words that match in the title? The rationale is that rarer words across all of the items at eBay are better discriminators between relevant and irrelevant items; in this example, we’d argue that items containing the rarer word polaroid are probably more likely to be relevant than items containing the less rare word camera.
  • How near are the query words to the beginning of the title? The argument is that items with query words near the beginning of the title are likely more relevant than those containing the query words later in the title, with the rationale that the key topic of the item is likely mentioned first or early in the title. Consider two examples to illustrate:  Polaroid land camera 420 1970s issued still in nice shape retro funk, and PX 100 Silver Shade Impossible Project Film for Polaroid SX-70 Camera. (The former example is a camera, the latter example is film for a camera.)

Before I move on, let me just say that these are example factors. I am not sharing that we do or don’t use these factors in ranking at eBay. What I’m illustrating is that you and I can successfully, rationally think about factors we might try in Best Match that might help separate relevant items from irrelevant items. And, overall, when we combine these factors in some way, we should be able to produce a complete ordering of eBay’s results from most- to least-relevant to the query.
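
To show what computing factors like these might look like, here’s a minimal sketch in Python that calculates the four illustrative title factors from the list above for a few toy items. Again, this is an illustration of the idea, not a statement of what Best Match actually computes.

```python
import math

items = [
    "NICE Older POLAROID 600 Land Camera SUN AUTO FOCUS 660",
    "Polaroid land camera 420 1970s issued still in nice shape retro funk",
    "PX 100 Silver Shade Impossible Project Film for Polaroid SX-70 Camera",
]
query = "polaroid camera".split()

# Document frequency of each word across all titles, used for the rarity factor.
df = {}
for title in items:
    for w in set(title.lower().split()):
        df[w] = df.get(w, 0) + 1

def title_factors(title, query_words):
    words = title.lower().split()
    contains_all = all(w in words for w in query_words)                          # factor 1
    total_tf = sum(words.count(w) for w in query_words)                          # factor 2
    rarity = sum(math.log(len(items) / df[w]) for w in query_words if w in words)  # factor 3
    positions = [words.index(w) for w in query_words if w in words]
    earliest = min(positions) if positions else len(words)                       # factor 4
    return contains_all, total_tf, rarity, earliest

for title in items:
    print(title_factors(title, query), title)
```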

So far, I’ve given you narrow examples about text factors from the title. There are many other text factors we could use: factors from the longer item description, category information, text that’s automatically painted onto the item by our algorithms at listing time, and more. If we worked through these methodically, we could together write down factors that we thought might intuitively help us rank items better. At the end of the process, I’m guessing we’d have written down tens of factors for eBay’s text alone.

You can see my argument coming together: if you used just one or two of these factors, you might do a good, basic job of ranking items. But if you use more information, you’ll do better. You’ll be able to more effectively discern differences between items, and you’ll do a better job of ranking the items. Net, the more (new, different, and useful) information you have, the better.

What’s key here is that we need different factors, and we need factors that actually do the right thing. There are some simple ways we can test the intuition about a factor before we use it. For example, we could ask a simple question: do users buy items that have this factor more often than items that don’t? In practice, there are much more sophisticated things we can do to validate a factor before we decide to actually build it into search (and I’ll leave that discussion to another time).

The Factor Buckets

I believe in a five bucket framework of factors to build our eBay Best Match ranking function:

  1. Text factors (discussed above)
  2. Image factors
  3. Seller factors
  4. Buyer factors
  5. Behavioral factors

Pictures or images are an important part of the items and products at eBay. Images are therefore an interesting possible source of ranking factors. For example, we know that users prefer pictures where the background is a single color, that is, where the object of interest is easily distinguished from the background.
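
As an illustration of how an observation like that might be turned into a factor, here’s a minimal sketch that estimates how uniform an image’s background is by measuring the spread of the pixel values around the border. It’s a deliberately naive approach, with an invented file name, and isn’t a description of how eBay actually processes images.

```python
import numpy as np
from PIL import Image

def background_uniformity(path, border=10):
    """Rough proxy for a single-colour background: low variance in the border pixels."""
    pixels = np.asarray(Image.open(path).convert("RGB"), dtype=float)
    top    = pixels[:border, :, :]
    bottom = pixels[-border:, :, :]
    left   = pixels[:, :border, :]
    right  = pixels[:, -border:, :]
    border_pixels = np.concatenate([p.reshape(-1, 3) for p in (top, bottom, left, right)])
    # Smaller standard deviation suggests a more uniform background.
    return border_pixels.std(axis=0).mean()

# Hypothetical usage: a lower score suggests the object stands out from a plain background.
# print(background_uniformity("item_photo.jpg"))
```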

The seller is an important part of the buyer’s decision to purchase. You can likely think of many factors that we could include in search: how long have they been selling? How’s their feedback? Do they ship on time? Are they a trusted seller?

Buyer factors is an interesting bucket. If you think about the buyer, there are many potential factors you might want to explore. Do they always buy fixed price items? What are the categories they buy in? What’s the shoe size they keep on asking for in their queries? Do they buy internationally?

Behavioral factors are also an exciting bucket. Here are a few examples we could work on: does this item get clicks from buyers for this query? What’s the watch count on the item? How many bids does the auction have? How many sales have there been of this fixed price item, given the number of times it’s been shown to users? If you want to dig deeper into this bucket, Mike Mathieson wrote a super blog post on part of our behavioral factor journey.

Where we are on the factors journey

We formed our search science team in late 2009, when Mike Mathieson joined our team. We’ve built the team from Mike to tens of folks in the past couple of years, and we’re on a journey to make search awesome at eBay. Indeed, if you want to join the team — and have an awesome engineering or applied science background, you can always reach out to me.

Right now, we use several text factors in Best Match, we have released a few seller factors and behavioral factors, and we have begun working on image and buyer factors. All up, we have tens of factors in our Best Match ranking function. You might ask: all of these factors seem like they’d be useful, so why haven’t you done more? There are a few good reasons:

  1. Our current search engine doesn’t make it easy to flexibly combine factors in ranking. (that’s one good reason why we’re rewriting search at eBay.)
  2. It takes engineering time to develop a factor, and make it available at query time for the search ranking process. In many cases, factors are extremely complex engineering projects — for example, imagine how hard it is to process images and extract factors when there are 10 million new items per day (and most items have more than 1 image), and you’re working hard to get additions to the index complete within 90 seconds. Or imagine how challenging it is to have real-time behavioral factors available in a multi-thousand computer search grid within a few seconds. (If you’ve read Part #1 of this series, you’ll appreciate just how real-time search is at eBay.)
  3. Experimentation takes time. Intuition is the easy part; building the factor, combining it with other factors, testing the new ranking function with users, and iterating and improving all take time. I’ll talk more about experimentation and testing in my next post.

In the third and final post in this series, I’ll explain more about how we combine factors and give you some insights into where we are on the search journey at eBay. Thanks for reading: please share this post with your friends and colleagues using the buttons below.

Ranking at eBay (Part #1)

Search ranking is the science of ordering search results from most- to least-relevant in response to user queries. In the case of eBay, the dominant user need is to find a great deal on something they want to purchase. And eBay search’s goal is to do a great job of finding relevant results in response to those customer needs.

eBay is amazingly dynamic. Around 10% of the 300+ million items for sale end each day (sell or end unsold), and a new 10% is listed. A large fraction of items have updates: they get bids, prices change, sellers revise descriptions, buyers watch, buyers offer, buyers ask questions, and so on. We process tens of millions of change events on items in a typical day, that is, our search engine receives that many signals that something important has changed about an item that should be used in the search ranking process. And all that is happening while we process around 250 million queries on a typical day.

In this post, I explain what makes eBay’s search ranking problem unique and complex. I’m aiming here to give you a sense of why we’ve built a custom search engine, and the types of technical search ranking challenges we’re dealing with as we rebuild search at eBay. Next week, I’ll continue this post and offer a few insights into how we’re working on the problem.

What’s different about eBay

Here are a few significantly different facets of eBay’s search problem space:

  1. Under typical load, it takes around 90 seconds from an item being listed by an eBay seller to when it can be found using the search engine. The same is true for any change that affects eBay’s search ranking — for example, if the number of sales of a fixed price multi-quantity item changes, it’s about 90 seconds until that count is updated in our index and can be used in search ranking. Even to an insider, that’s pretty impressive: there’s probably no other search engine that handles inserts, updates, and deletes at the scale and speed that eBay does. (I’ll explain real time index update in detail in a future post, but here’s a paper on the topic if you’d like to know more now.)
  2. In web search, there are many stable signals. Most documents persist and they don’t change very much. The link graph between documents on the web is reasonably stable; for example, my home page will always link to my blog, and my blog posts have links embedded in them that persist and lead to places on the web. All of this means that a web search engine can compute information about documents and their relationships, and use that as a strong signal in ranking. The same isn’t true of auction items at eBay (which are live for between 1 and 7 days), and it’s less true of fixed price items (many of which are live for only 30 days) — the link graph isn’t very valuable and static pages aren’t common at eBay
  3. eBay is an ecosystem, and not a search-and-leave search engine. The most important problem that web search engines solve is getting you somewhere else on the web — you run a query, you click on a link and you’re gone. eBay’s different: you run a query, you click on a link, and you’re typically still at eBay and interacting with a product, item, or hub page on eBay. This means that at eBay we know much more than at a web search engine: we know what our users are doing before and after they search, and have a much richer data set to draw from to build search ranking algorithms.
  4. Web search is largely unstructured. It’s mostly about searching blobs of text that form documents, and finding the highest precision matches. eBay certainly has plenty of text in its items and products, but there’s much more structure in the associated information. For example, items are listed in categories, and categories have a hierarchy. We also “paint” information on items as they’re listed in the form of attribute:value pairs; for example, if you list a men’s shirt, we might paint on the item that it is color:green, size:small, and brand:american apparel. We also often know the product that an item is: this is more often the case for listings that are books, DVDs, popular electronics, and motors. Net, eBay search isn’t just about matching text to blobs of text, it’s about matching text or preferences to structured information
  5. Anyone can author a web document, or create a web site. And it’ll happily be crawled by a search engine, perhaps indexed (depends on what they decide to put in their index), and perhaps available to be found. At eBay, sellers create listings (and sometimes products), and everything is always searchable (usually in 90 seconds under typical conditions). And we know much more about our sellers than a web search engine knows about its page authors
  6. We also know a lot about our buyers. A good fraction of the customers that search at eBay are logged in, or have cookies in their browser that identify them. Companies like Google and Microsoft also customize their search for their users when they are logged in (arguably, they do a pretty bad job of it — perhaps a post for another time too). The difference between web search and eBay is that we have information about our buyers’ purchase history, preferred categories, preferred buying formats, preferred sellers, what they’re watching, bidding on, and much more
  7. Almost every item and product has an image, and images play a key role in making purchase decisions (particularly for non-commodity products). We present images in our search results

There are more differences and challenges than these, but my goal here is to give you a taste, not an exhaustive list.

Who has similar problems?

Twitter is probably the closest analog technically to eBay:

  • They make use of changing signals in their ranking and so have to update their search indexes in near real-time too. But it’s not possible to edit a tweet and they don’t yet use clicks in ranking, so that means there’s probably much less updating going on than at eBay
  • Twitter explains that tweet rates go from 2,000 per second to 6,000–8,000 per second when there is a major event. eBay tends to have signals that change very quickly for a single item as it gets very close to ending (perhaps that’s similar to retweet characteristics). In both cases, signals about individual items are important in ranking those items, and those signals change quickly (whether they’re tweets or eBay items)
  • Twitter is largely an ecosystem like eBay (though many tweets contain links to external web sites)
  • Twitter makes everything searchable like eBay, though they typically truncate the result list and return only the top matches (with a link to see all matches). eBay shows you all the matches by default (you can argue whether or not we should)
  • Twitter doesn’t really have structured data in the sense that eBay does
  • Twitter isn’t as media rich as eBay
  • Twitter probably knows much less about their users’ buying and selling behaviors

(Thanks to Twitter engineering manager Krishna Gade for the links.)

Large commerce search engines (Amazon, Bestbuy, Walmart, and so on) bear similarity too: they are ecosystems, they have structure, they know about their buyers, they have imagery, and they probably search everything. The significant differences are they mostly sell products, and very few unique items, and they have vastly fewer sellers. They are also typically dominated by multi-quantity items (for example, a thousand copies of a book). The implication is there is likely vastly less data to search, relatively almost no index update issues, relatively much less inventory that ends, relatively much less diversity, and likely much fewer changing signals about the things they sell. That makes the search technical challenge vastly different; on the surface it seems simpler than eBay, though there are likely challenges I don’t fully appreciate.

Next week, I’ll continue this post by explaining how we think about ranking at eBay, and explain the framework we use for innovation in search.

Clicks in search

Have you heard of the Pareto principle? The idea that 80% of sales come from 20% of customers, or that the richest 20% of people control 80% of the world’s wealth.

How about George K. Zipf? The author of “Human Behavior and the Principle of Least Effort” and “The Psycho-Biology of Language” is best known for “Zipf’s Law”, the observation that the frequency of a word is inversely proportional to the rank of its frequency. Oversimplifying a little, the word “the” is about twice as frequent as the word “of”, and then comes “and”, and so on. This also applies to the populations of cities, corporation sizes, and many more natural occurrences.

I’ve spent time understanding and publishing work on how Zipf’s observations apply in search engines. The punchline in search is that the Pareto principle and Zipf’s Law are hard at work: the first item in a list gets about twice as many clicks as the second, and so on. There are inverse power law distributions everywhere.
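
Here’s a tiny sketch of what such an inverse power law looks like. The constant and exponent are illustrative rather than fitted to eBay’s data; the point is the shape: rapid decay, with the first position dominating.

```python
# A click curve following an inverse power law: clicks(rank) ~ C / rank**s.
# The constant and exponent here are illustrative, not fitted to eBay's data.

def expected_clicks(rank, first_position_clicks=100_000, s=1.0):
    return first_position_clicks / rank ** s

for rank in range(1, 11):
    clicks = expected_clicks(rank)
    print(f"position {rank:2d}: {clicks:9.0f} clicks, "
          f"{clicks / expected_clicks(1):5.1%} of position 1")
```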

The eBay Search Results Click Curve

Here’s the eBay search results click curve, averaged over a very large number of queries. The y-axis is the total number of clicks on each result position, and the x-axis is the result position. For example, you can see that the first result in search (the top result, the first item you see when you run a query) gets about eight times as many clicks on average as the fifteenth result. The x-axis is labelled from 1 to 200, which is typically four pages of eBay results since we show 50 results per page by default.

eBay Click Curve. The y-axis is number of clicks per result, and the x-axis is the result position.

As a search guy, I’m not surprised by this curve (more on that topic later). It’s a typical inverse power law distribution (a “Zipf’s Law” distribution). But there are a couple of interesting quirks.

Take a look at the little bump around result position 50 on the x-axis. Why’s that there? What’s happening is that after scrolling for a while through the results, many users scroll to the very bottom of the page. They then inspect the final few results on the page (results 46 to 50), just above the pagination control. Those final few results therefore get a few more clicks than the ones above that the user skipped. Again, this isn’t a surprise to me — you’ll often see little spikes after user scroll points (in web search, you’ll typically see a spike in result 6 or 7 on a 10-result page).

I’ve blown up the first ten positions a little more so that you can see the inverse power law distribution.

Search click curve for the first 10 results in eBay search.

You can see that result 1 gets about 4 times as many clicks as result 10. You can also see that result 2 gets about 5/9ths of the clicks that result 1 gets. This is pretty typical — it’s what you’d expect to see when search is working properly.

Interestingly, even if you randomize the first few results, you’ll still see a click curve that has an inverse power law distribution. Result 1 will almost always get more clicks than result 2, regardless of whether it’s less relevant.

Click Curves Are Everywhere

Here are some other examples of inverse power law distributions that you’ll typically see in search:

  • The query curve. The most popular query is much more popular than the second most popular query, and so on. The top 20% of queries account for at least 80% of the searches. That’s why caching works in search: most search engines serve more than 70% of their results from a cache (there’s a small sketch of this after the list)
  • The document access curve. Because the queries are skewed in distribution, and so are the clicks per result position, it’s probably not surprising that a few documents (or items or objects) are accessed much more frequently than others. As a rule of thumb, you’ll typically find that 80% of the document accesses go to 20% of the documents. Pareto at work.
  • Clicks on related searches. Most search engines show related searches, and there’s a click curve on those that’s an inverse power law distribution
  • Clicks on just about any list: left navigation, pagination controls, ads, and any other list will typically have an inverse power law distribution. That’s why there’s often such a huge price differential between what advertisers will pay in search for the top position versus the second position
  • Words in queries, documents, and ads. Just like Zipf illustrated all those years ago, word frequencies follow an inverse power law distribution. Interestingly, and I explain this in this paper, Zipf’s formal distribution doesn’t hold very well on words drawn from web documents (a thing called Heaps’ law does a better job). But the point remains: a few words account for much of the occurrences
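
To illustrate why a skewed query distribution makes caching work so well, here’s a small sketch that assumes query frequencies follow a Zipf-like law and asks what share of total query volume the most popular queries account for. The distinct-query count and exponent are illustrative, not eBay’s numbers.

```python
# If query frequencies follow a Zipf-like law, freq(rank) ~ 1 / rank**s, then a
# small fraction of distinct queries accounts for most of the query volume --
# which is why caching the results of popular queries pays off so well.

def share_of_volume(num_distinct_queries, cached_fraction, s=1.0):
    freqs = [1.0 / rank ** s for rank in range(1, num_distinct_queries + 1)]
    total = sum(freqs)
    cached = sum(freqs[: int(num_distinct_queries * cached_fraction)])
    return cached / total

# Illustrative numbers: 100,000 distinct queries, cache the most popular 20%.
print(f"{share_of_volume(100_000, 0.20):.0%} of query volume hits the cache")
```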

What does this all mean? To a search guy, it means that when you see a curve that isn’t an inverse power law distribution, you should worry. There’s probably something wrong — an issue with search relevance, a user experience quirk (like the little bump I explained above), or something else. Expect to see curves that decay rapidly, and worry if you don’t.

See you again next Monday for a new post. If you’re enjoying the posts, please share with your friends by clicking on the little buttons below. Thanks!