Changing platforms at Scale: Lessons from eBay

At eBay, we’ve been on a journey to modernize our platforms, and rebuild old, crufty systems as modern, flexible ones.

In 2011, Sri Shivananda, Mark Carges, and I decided to modernize our front-end development stack. We were building our web applications in v4, an entirely home-grown framework. It wasn’t intuitive to new engineers who joined eBay, they were familiar with industry standards we didn’t use, and we’d also built very tightly coupled balls of spaghetti code over many years. (That’s not a criticism of our engineers – every system gets crufty and unwieldy eventually; software has an effective timespan and then it needs to be replaced.)

Sri Shivananda, eBay Marketplaces' VP of Platform and Intrastructure

Sri Shivananda, eBay Marketplaces’ VP of Platform and Intrastructure

We set design goals for our new Raptor framework, including that we wanted to do a better job separating presentation from business logic. We also wanted better tools for engineers, faster code build times, better monitoring and alerting when problems occur, the ability to test changes without restarting our web servers, and a framework that was intuitive to engineers who joined from other companies. It was an ambitious project, and one that Sri’s lead as a successful revolution in the Marketplaces business. We now build software much faster than ever before, and we’ve rewritten major parts of the front-end systems. (And we’ve open sourced part of the framework.)

That’s the context, but what this post is really about is how you execute a change in platforms in a large company with complex software systems.

The “Steel Thread” Model

Mark Carges, eBay's Chief Technical Officer

Mark Carges, eBay’s Chief Technical Officer

Our CTO, Mark Carges, advocates building a “steel thread” use case when we rethink platforms. What he means is that when you build a new platform, build it at the same time as a single use case on top of the platform. That is, build a system with the platform, like a steel thread running end-to-end through everything we do.

A good platform team thinks broadly about all systems that’ll be built on the platform, and designs for the present and the future. The risk is they’ll build the whole thing – including features that no one ultimately needs for use cases that are three years away. Things change fast in this world. Large platform projects can go down very deep holes, and sometimes never come out.

The wisdom of the “steel thread” model is that the platform team still does the thinking, but it’s pushed by an application team to only fully design and build that parts that are immediately needed. The tension forces prioritization, tradeoffs, and a pragmatism in the platform team. Once you’re done with the first use case, you can move onto subsequent ones and build more of the platform.

Rebuilding the Search Results Page

Our first steel thread use case on Raptor was the eBay Marketplaces Search Results Page (the SRP). We picked this use case because it was hard: it’s our second-most trafficked page, and one of our most complex; building the SRP on Raptor would exercise the new platform extensively.

We co-located our new Raptor platform team – which was a small team by design – together with one of our most mission critical teams, the search frontend team. We declared that their success was mutually dependent: we’re not celebrating until the SRP is built on Raptor.

We asked the team to rebuild the SRP together. We asked for an aggressive timeline. We set bold goals. But there was one twist: build the same functionality and look-and-feel as the existing SRP. That is, we asked the team to only change one variable: change the platform. We asked them not to change an important second variable: the functionality of the site.

This turned out to be important. The end result – after much hard work – was a shiny new SRP code base:  modular, cleaner, simpler, and built on a modern platform.  But it looked and behaved the same as the old one. This allowed us to test one thing: is it equivalent for our customers to the old one?

Testing the new Search Results Page

We ran a few weeks of A/B tests, where we showed different customer populations the old and new search results page. Remember, they’re pretty much the same SRPs from the customers’ perspective. What we were looking for were subtle problems: was the new experience slower for some scenarios than the old one? Did it break in certain browsers on some platforms? Was it as reliable? Could we operate it as effectively? We could compare the populations and spot the differences reasonably easily.

This was a substantial change in our platforms and systems, and the answer wasn’t always perfect. We took the new SRP out of service a few times, fixed bugs, and put it back in. Ultimately, we deemed it a fine replacement in North America, and turned it on for all our customers in North America. The next few months saw us repeat the process across our other major markets (where there are subtle differences between our SRPs).

What’s important is that we didn’t change the look and feel or functionality at first: if we’d done that, we may not have seen several of the small problems we did see as fast as we saw them.

Keeping the old application

Another wise choice was we didn’t follow that old adage of “out with the old, and in with the new”. We kept the old SRP around running in our data centers for a few months, even though it wasn’t in service.

This gave us a fallback plan: when you make major changes, it’s never going to be entirely plain sailing. We knew that the new SRP would have problems, and that we’d want to take it out of service. When we did, we could put the old one back in service while we fixed the problem.

Eventually, we reached the confidence with our new SRP that we didn’t need the old one. And so it was retired, and the hardware put to other uses. That was over a year ago – it has been smooth sailing since.

The curse of dual-development

You might ask why we set bold goals and pushed the teams hard to build the new Raptor platform and the SRP. We like to do that at eBay, but there’s also a pragmatic reason: while there’s two SRP code bases, there’s twice the engineering going on.

Imagine that we’ve got a new idea for an improvement to the SRP. While we’re building the new SRP, the team has to add that idea to the new code base.  The team also has to get the idea into the old code base too – both so we can get it out to our customers, and so that we can carry out that careful testing I described earlier.

To prevent dual development slowing down our project, we declared a moratorium on features in the SRP for a couple of months. This was tough on the broader team – lots of folks want features from the search team, and we delayed all requests. The benefit was we could move much faster in building the new SRP, and getting it out to customers. Of course, a moratorium can’t go on for too long.

And then we changed the page

After we were done with the rollout, the SRP application team could move with speed on modernizing the functionality and look-and-feel of the search results page.

Ultimately, this became an important part of eBay 2.0, a  refresh of the site that we launched in 2012. And they’re now set up to move faster whenever they need to: we are testing more new ideas that improve the customer experience than we’ve been able to before, and that’s key to the continued technology-driven revolution at eBay.

See you next week.

One thought on “Changing platforms at Scale: Lessons from eBay

  1. Pingback: The Cassini Search Engine | Hugh E. Williams

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s