Category Archives: technology

CS in Schools: September Update

CS in Schools is a not-for-profit initiative at RMIT University that’s focused on helping high school teachers develop their confidence and competence in teaching computer science.

September was our third month out of stealth mode. Since our last update, we’ve formally joined forces with RMIT and advertised for teachers to join our programme. Please share this link with teacher colleagues, friends, and anyone who might want to work with us.

We signed up an eighth school into our free pilot, and we’re at capacity for 2019. We’re now deep in the logistics of timetabling, resourcing, and planning.

Woman Programming on a Notebook

Summary

CS in Schools is part of RMIT‘s Policy and Impact Portfolio. It’s a charity initiative that’s focused on helping teachers confidently teach computer science to high school students in Australia.

Today, most schools struggle to teach coding: there’s a shortage of teachers who feel qualified to teach computer science, and most successful coding classes are run outside of school hours. We believe that today’s teachers can effectively teach coding if they’re supported through in-class professional development. Microsoft’s TEALS programme exists in the US, and we want Australian teachers to have a similar opportunity.

Many of the important and best paid jobs of this and the next generation will require computational thinking. Even if a student doesn’t study computer science at university, it’s essential they have the basics because just about every job will be changed by technology. We want every student in Australia to have this opportunity.

In 2019, we are piloting a programme with eight schools, and studying how successfully we can help teachers ramp-up their skills. Beyond 2019, we plan to launch this programme broadly.

We’ve had an exciting September. In summary, in the past month:

We named ourselves CS in Schools, and bought the domain csinschools.io. You can now contact us at hugh@csinschools.io and selina@csinschools.io.
We’ve signed-up our eighth school into our 2019 pilot, and we’re at full capacity
We’ve become part of RMIT University, giving us access to resources, educators, facilities, and enabling us to become a Tier 1 DGR charity (an important Australian tax status)
We’ve advertised for teachers to work with us
We’re refining a roadmap for what happens from now until we start the programme in February

To learn more about our programme, see the “Programme” section in our August update.

September Update

September was another terrific month for our programme: it’s all coming together, and building momentum towards our 2019 pilot.

This month, we formally joined RMIT University. We’re now part of the Policy and Impact Portfolio, and have an office inside RMIT’s Activator space. RMIT’s generosity has been amazing: they’re providing offices, finance support, HR and legal support, and support for philanthropic engagement. Of course, being part of RMIT also gives us enormous credibility and opens all kinds of doors. We thank Vice-Chancellor Martin Bean and his team for their kindness and support in helping CS in Schools take a leap forward.

We finalised the schools that we’ll be working with in our pilot in 2019. These are:

Three schools on the Mornington Peninsula: Toorak College, Mount Erin College, and McClelland College
Two schools in Sale: Gippsland Grammar School and the Sale Catholic College
Three suburban schools, Greensborough College, Eltham High School, and Haileybury

It’s likely that somewhere between 10 and 15 teachers will be trained through our programme next year, and over 1,000 Year 7 students will experience our course. We’re excited to be able to study the outcomes of the programme, and hopefully scale it in 2020.

We are working hard on finalising financial support for 2019. After building our draft timetable and revising our budget, we could do with more money. If you have corporate connections or connections to philanthropic donors, we’d love an intro!

We are now hiring teachers. We plan to hire up to three qualified high school teachers who have experience teaching computing. Our job ad is live, and here’s a link to the position description. Please share it broadly with your colleagues, family, and friends who might be interested. Hopefully, in our next update, we can share the news of who we’ve hired, and have made progress on developing our lessons.

The final news from the month is that we’re getting closer to Microsoft’s TEALS programme. Kevin Wang at Microsoft leads this initiative, and he’s been generous in sharing his time to help us learn about onboarding schools, curriculum development, and structure of their engagement with volunteers, schools, and universities. Under an NDA between RMIT and Microsoft, we’ve recently been able to access their materials, and this has helped us not reinvent the wheel from first principles. We hope to further deepen our TEALS engagement over the next month.

Thanks again for making it all the way to the bottom of the update. Let us know if you’d like to learn more about any topic. Cheers, Hugh and Selina.

Computing in Schools: August update

1 Reply

We are a charity that’s focused on helping high school teachers develop their confidence and competence in teaching computer science.

It’s our second month out of stealth mode, and it’s been a good one for us. We’re exploring becoming part of RMIT University, and SEEK and REA became supporters of our programme. We have committed donations from generous supporters, and the first year of our programme is completely funded. We’ve been learning from a similar programme in the US that’s supported by Microsoft. We’ve refined our plans, with a sharper focus on Year 7 teacher professional development. We’ve signed up seven schools into our free pilot, and we’re in discussion with three more — we are pretty much at capacity for 2019!

Summary

We’ve launched a charity that’s focused on helping teachers confidently teach computer science to high school students in Australia. Today, most schools struggle to teach coding: there’s a shortage of teachers who feel qualified to teach computer science, and most successful coding classes are run outside of school hours. We believe that today’s teachers can effectively teach coding if they’re supported through in-class professional development. A somewhat similar and successful programme exists in the US, and we want Australian teachers to have this opportunity.

In 2019, we are piloting a programme with up to ten schools, and studying how successfully we can help teachers ramp-up their skills. Beyond 2019, we plan to launch this programme broadly.

We’ve had an exciting August. In summary:

We’ve signed-up seven schools into our 2019 pilot, and we’re speaking to
two or three more
We have committed donations from generous supporters, and the first year of our programme is completely funded
We’re exploring a partnership with RMIT University
We’ve made progress on becoming a charity
We’ve signed up corporate supporters
We’re developing a roadmap for what happens from now until we start the programme in February

Programme

Our programme continues to evolve. We’ve refined our goals to:

Trial a novel teacher professional development pilot programme in 2019
Work with between five and ten schools in 2019. This means we’ll work with between seven and twelve school teachers who’ll teach over 1,000 students
Study how effectively we can develop computer science teaching skills
Develop teachers who can continue to independently teach computing in 2020
Make this programme free for schools in 2019, supported by generous donations

Our plans to deliver the programme are as follows:

We will hire two or three teachers who have a computing background, and they will be paid by generous donations to our charity
Our teachers will be paired with teachers from the schools in the pilot, and will meet for background workshops between November and January
From February, our teachers will work in the classroom with the teachers from the
schools in the pilot
In our first term (or semester) working in a school, our teacher will deliver the syllabus to the students and the school’s teacher will learn through observation and by providing some student support
In our second term (or semester), it’s anticipated we’ll switch the roles: the school’s teacher will be the primary driver, with in-class support and training from our teacher
We’re hopeful that beyond the first two iterations, the school’s teacher will be independent, and require limited support from our programme

Our pilot is focused on:

Helping Year 7 teachers in Victorian high schools
A one-term (or semester) Year 7 computing subject of two or three hours per week
Providing everything that’s needed to teach the subject, including lesson plans, assignments, hardware, and software
Covering the “harder parts” of the Victorian DigiTech curriculum, especially the coding skills. In total, we’ll together deliver around half of the recommended 40 hours of the curriculum, and expect that the other half is delivered in other subjects

August Update

We have made significant progress in August, it’s been an exciting month.

SEEK Australia and Real Estate Australia (REA) have endorsed our plans. We’re now able to tell schools that SEEK and REA stand behind what we’re doing, and endorse the industry relevance of our programme. We’re also meeting and learning from other organisations, including Google and Microsoft. In particular, we’ve been meeting Kevin Wang at Microsoft who developed the
incredible TEALS programme in the US; it’s likely we will visit Kevin later this year.

We are hopeful of joining forces with RMIT University, and have met twice this month with Vice Chancellor Martin Bean and several times with his leadership team. This would be an amazing outcome, helping us in several ways:

It’d enhance our credibility, opening doors more easily to schools, academics, and
government institutions
It’d improve the breadth and depth of our pilot, by giving us easier access to relevant academics, policy experts, and education professionals
We’d have access to facilities, including an office, meeting rooms, payroll support, and rooms for running teacher workshops
We’d be a “Deductible Gift Recipient (DGR) Charity”, a desirable status for receiving donations
We are optimistic we will finalise this arrangement and announce it at the end of September.

We have signed up several more schools. With some help from good friends, we’re now working with three schools on the Mornington Peninsula: Toorak College, Mount Erin College, and McClelland College. We are working with two schools in Sale: Gippsland Grammar School and the Sale Catholic College. We have also signed up two suburban schools, Greensborough College and Haileybury. We are in discussions with three other schools, including two public schools and one large private girls school. We’re excited to have a mix of public, private, Catholic, city, and country schools to work with — we know this’ll help us better understand the effectiveness of our programme and provide a more persuasive argument as we go forward in 2020.

We’ve begun to work on our roadmap from September to February. This includes planning for hiring, syllabus development, and teacher workshops.

We look forward to an exciting September, and sharing even more progress at the end of the month. Thanks for reading all the way down here!

Cheers, Selina and Hugh.

It’s a Marathon, not a Sprint…

4 Replies

What makes a career successful over the long term? How do you sustain (or even increase) your professional impact, while also deriving meaning and enjoyment from your life? This was the question I set about answering in my talk at StretchCon last week. To see my answers, you can watch the presentation. You can also view the prezi. 1966321_313990785474109_1977531760847682083_o

The Backstory

I asked eleven colleagues about their success. I chose colleagues who have made it to the C-suite (whether as a CEO, CTO, or another C-level designation) and that appeared to do it with balance between their professional and personal lives. Ten of the eleven responded, and nine of the ten shared thoughts before my deadline. I thank Chris Caren, Adrian Colyer, John Donahoe, Ken Moss, Satya Nadella, Mike Olson, Christopher Payne, Stephanie Tilenius, and Joe Tucci for their help.

I sent each of these colleagues an email that went something like this:

I am speaking about successful careers being about sustained contribution (and not a series of sprints, all-nighters, or unsustainable peaks). Would you be up for giving me a quote I could use and attribute to you? I admire your ability to work hard and smart, while obviously also having a life outside of work.

Their replies were varied, but as you’ll see in the video, there were themes that repeated in their answers. I shared edited quotes in the talk, and promised that I’d share their complete thoughts in my blog. The remainder of this blog is their complete words.

Chris Caren

Chris is the CEO and Chairman of Turnitin. We worked together at Microsoft, and Chris was (and sometimes still is!) my mentor. Here are his thoughts in response to my questions:

My philosophy: I do my best work when my life is in balance — family, me, and work. I need a routine of hard work, but no more than 9-10 hours a day, solid exercise daily, low stress (via self-control), 7-8 hours of sleep at a minimum each day, and the time I want with my family and for myself. When I maintain this balance, I am maximally effective at work — both in terms of quality of thinking and decision making, and maximum output. More hours worked actually pull down my impact as a CEO.

Adrian Colyer

Adrian was the CTO of SpringSource, the custodians of the Spring Java programming framework. We worked together at Pivotal, where he was the CTO of the Application Fabric team. Recently, Adrian joined Accel Partners as an Executive-in-Residence. Here are Adrian’s thoughts:

A great topic! Maybe the most counter-intuitive lesson I’ve learned over the years is that I can make a much more valuable contribution when I work* less. So work-life balance isn’t really a trade-off as most people normally present it (I have more ‘life’, but sacrifice ‘work’ to get it), it’s actually a way of being better at both life *and* work!

* ‘work’ in the sense that most people would intuitively think of it – frenetic activity.

When I’ve analysed this, I came to realise that when work crowds everything else out I often end up in a very reactive mode. But the biggest and most impactful things you can do – especially as a leader – don’t come about during that constant fire-fighting mode. The vast majority of my important insights and decisions – the things that made the biggest positive impact on the organisations I was working with at the time – have come in the space I’ve made around the busy-ness of the office to actually allow myself the luxury of thinking! Running, cycling, walking and so on have all been very effective for me over the years. But so is just taking some time out in the evening and not necessarily even consciously thinking about work, the brain seems to be very good at background processing! That time has also given space to allow my natural curiosity and love of learning to be indulged. In turn that creates a broader perspective, exposes you to new ideas, and allows you to make connections and insights that you otherwise might not of. All of this feeds back into the work-related decisions and issues you are wrestling with and helps you to make breakthroughs from time to time.

To the extent I’ve been successful over the years, I attribute much of that not to being smarter than the people around me, nor to working ‘harder’, but to creating the space to think.

John Donahoe

John is the CEO of eBay Inc. John was an enthusiastic sponsor of my work while I was there. When I asked John for his thoughts, he sent me a speech he’d recently given to the graduating class at the Stanford Business School. In it, you’ll find John’s thoughts of his professional and personal journey.

Ken Moss

Ken recently became the CTO of Electronic Arts. Prior to that, Ken and I worked together on, off, and on over a period of nine years. Ken was the GM of Microsoft’s MSN Search when I joined Microsoft, and left to found his own company. I managed to help persuade Ken to come to eBay for a few years. Here are Ken’s thoughts:

Always focus on exceeding expectations in the present, while keeping your tank 100% full of gas for the future. There is no quicker way to stall your career than by burning yourself out. I’ve seen many potentially brilliant careers cut short as someone pushed themselves too far past their limits and became bitter under-performers. It’s always in your control.

Satya Nadella

Satya became the CEO of Microsoft at the beginning of 2014. Satya was the VP of the Bing search team at Microsoft for around half the time I was there, and we have stayed in touch since. Here are Satya’s thoughts:

I would say the thing that I am most focused on is to harmonize my work and life vs trying to find the elusive “balance”. Being present in the lives of my family in the moments I am with them is more important than any quantitative view of balance.

Mike Olson

Mike is the Chairman, Chief Strategy Officer, and former CEO of Cloudera. We have interacted during my time at Pivotal, and also during my time at eBay. Mike was kind enough to invite me to give the keynote at Hadoop World in 2011. Here’s Mike’s thoughts:

I have always tried to optimize for interesting — working on problems that are important to me, with people who blow my hair back. The combination has kept me challenged and inspired, and has guaranteed real happiness in the job.

By corollary, you have to be willing to walk away from a good paycheck and fat equity if the work or the people are wrong. Money is cheaper than meaning. I’ve done that a few times. There’s some short-term angst, but it’s paid off in the long term.

Christopher Payne

Christopher is the SVP of the North America business at eBay. Christopher and I have worked on, off, and on for nine years. Christopher was the founding VP of the search team at Microsoft. He left to found his own company, his company was bought by eBay, he hired me to eBay to help run engineering, and he then moved over to run the US and Canadian business teams. Here are Christopher’s thoughts:

I believe strongly in the need to maintain my energy level in order to have the most impact in my career. To do this I find I have to make the time to recharge. For me this means taking walks during the work day, taking all of my vacation, and not being on email 24/7. With my energy level high I find I can be significantly more creative and productive over the long term.

Stephanie Tilenius

Stephanie recently founded her own company, Vida. While she’s spent parts of her career at Kleiner-Perkins, Google, and other places, we met at eBay where we spent around six months working together. Here are Stephanie’s thoughts:

… my point of view is that you have to do something you love, that will sustain you. You also have to know what drives you, what gets you out of bed, for me it is having an impact (for others it may be making money or playing a sport, etc.) You will always be willing to give it your all and you are more likely to innovate if you love what you are doing and constantly growing, challenging the status quo (stagnation is really out of the question, humans don’t thrive on it!). I am committed to my work and to constant innovation but also to having a family and I could not be great at either without the other. They are symbiotic in my mind, they both make me happy and a better person. I have learned it is about integration not necessarily perfect balance. If you integrate life and work, you are much more likely to be successful. The other day my son was out of school early and our nanny had an issue so I brought him to work and he did code academy and talked to some of our engineers. He enjoyed himself and was inspired.

Joe Tucci

Joe is the Chairman of EMC, VMware, and Pivotal, and the CEO of EMC. I met Joe in the interview process at Pivotal, and have worked with him through board and other meetings over the past year. Here’s Joe’s thoughts:

Being a successful CEO is relatively straight forward… 1st – retain, hire, and develop the best talent, 2nd – get these talented individuals to work together as a team (do not tolerate selfishness), 3rd – get this leadership team to embrace a stretch goal that is bigger then any of them imagine they can attain, and 4th – maniacally focus the leadership team on our customers (always striving to exceed their expectations)

I enjoyed giving the talk at Stretch, and interacting with these colleagues in putting it together. I hope you enjoyed it too. See you next time.

Fireside chat at the DataEdge Conference

1 Reply

The video of my recent conversation with Michael Chui from McKinsey as part of the UC Berkeley DataEdge conference is now online. Here it is:

The discussion is around 30 minutes. I tell a few stories, and most of them are mostly true. We talk about my career in data, search, changing jobs, inventing infinite scroll, eBay, Microsoft, Pivotal, and more. Enjoy!

Armchair Guide to Data Compression

3 Replies

Data compression is used to represent information in less space than its original representation. This post explains the basics of lossless compression, and hopefully helps you think about what happens when you next compress data.

Similarly to my post on hash tables, it’s aimed at software folks who’ve forgotten about compression, or folks just interested in how software works; if you’re a compression expert, you’re in the wrong place.

A Simple Compression Example

Suppose you have four colored lights: red, green, blue, and yellow. These lights flash in a repeating pattern: red, red, green, red, red, blue, red, red, yellow, yellow (rrgrrbrryy). You think this is neat, and you want to send a message to a friend and share the light flashing sequence.

You know the binary (base 2) numbering system, and you know that you could represent each of the lights in two digits: 00 (for red, 0 in decimal), 01 (for green, 1 in decimal), 10 (for blue, 2 in decimal), and 11 (for yellow, 3 in decimal). To send the message rrgrrbrryy to your friend, you could therefore send them this message: 00 00 01 00 00 10 00 00 11 11 (of course, you don’t send the spaces, I’ve just included them to make the message easy for you to read).

Data compression. It’s hard to find a decent picture!

You also need to send a key or dictionary, so your friend can decode the message. You only need to send this once, even if the lights flash in a new sequence and you want to share that with your friend. The dictionary is: red green blue yellow. Your friend will be smart enough to know that this implies red is 00, green is 01, and so on.

Sending the message takes 20 bits: two bits per light flash and ten flashes in total (plus the one-off cost of sending the dictionary, which I’ll ignore for now). Let’s call that the original representation.

Alright, let’s do some simple compression. Notice that red flashes six times, yellow twice, and the other colors once each. Sending those red flashes is taking twelve bits, four for sending yellow, and the others a total of four for the total of twenty bits. If we’re smart, what we should be doing is sending the code for red in one bit (instead of two) since it’s sent often, and using more than two bits for green and blue since they’re sent once each. If we could send red in one bit, it’d cost us six bits to send the six red flashes, saving a total of six bits over the two-bit version.

How about we try something simple? Let’s represent red as 0. Let’s represent yellow as 10, green as 110, and blue as 1110 (this is a simple unary counting scheme). Why’d I use that scheme? Well, it’s a simple idea: sort the color flashes by decreasing frequency (red, yellow, green, blue), and assign increasingly longer codes to the colors in a very simple way: you can count 1s until you see a 0, and then you have a key you can use to look in the dictionary. When we see just a 0, we can look in the dictionary to find that seeing zero 1s means red. When we see 1110, we can look in the dictionary to find that seeing three 1s means blue.

Here’s what our original twenty-bit sequence would now look like: 0 0 110 0 0 1110 0 0 10 10. That’s a total of 17 bits, a saving of 3 bits — we’ve compressed the data! Of course, we need to send our friend the dictionary too: red yellow green blue.

It turns out we can do better than this using Huffman coding. We could assign 0 to red, 10 to yellow, 110 to blue, and 111 to green. Our message would then be 0 0 111 0 0 110 0 0 10 10. That’s 16 bits, 1 bit better than our simple scheme (and, again, we don’t need to send the spaces). We’d also need to share the dictionary: red <blank> yellow <blank> blue green to show that 0 is red, 1 isn’t anything, yellow is 10, 11 isn’t anything, blue is 110, and green is 111. A slightly more complicated dictionary for better message compression.

Semi-Static Compression

Our examples are two semi-static compression schemes. The dictionary is static, it doesn’t change. However, it’s built from a single-pass over the data to learn about the frequencies — so the dictionary is dependent on the data. For that reason, I’ll call our two schemes semi-static schemes.

Huffman coding (or minimum-redundancy coding) is the most famous example of a semi-static scheme.

Semi-static schemes have at least three interesting properties:

They require two passes over the data: one to build the dictionary, and another to emit the compressed representation
They’re data-dependent, meaning that the dictionary is built based on the symbols and frequencies in the original data. A dictionary that’s derived from one data set isn’t optimal for a different data set (one with different frequencies) — for example, if you figured out the dictionary for Shakespeare’s works, it isn’t going to be optimal for compressing War and Peace
You need to send the dictionary to the recipient, so the message can be decoded; lots of folks forget to include this cost when they share the compression ratios or savings they’re seeing, don’t do that

It doesn’t matter what kind of compression scheme you choose, you need to decide what the symbols are. For example, you could choose letters, words, or even phrases from English text.

Static Compression

Morse code is an example of a (fairly lame) static compression scheme. The dictionary is universally known (and doesn’t need to be communicated), but the dictionary isn’t derived from the input data. This means that the compression ratios you’ll see are at best the same as you’ll see from a semi-static compression scheme, and usually worse.

There aren’t many static compression schemes in widespread use. Two examples are Elias gamma and delta codes, which are used to compress inputs that consist of only integer values.

Adaptive Compression

The drawbacks of semi-static compression schemes are two-fold: you need to process the data twice, and they don’t adapt to local regions in the data where the frequencies of symbols might vary from the overall frequencies. Imagine, for example, that you’re compressing a very large image: you might find a region of blue sky, where there’s only a few blue colors. If you built a dictionary for only that blue section, you’d get a different (and better) dictionary than the one you’d get for the whole image.

Here’s the idea behind adaptive compression. Build the dictionary as you go: process the input, and see if you can find it in the (initially empty) dictionary. If you can’t, add the input to the dictionary. Now emit the compressed code, and keep on processing the input. In this way, your dictionary adapts as you go, and you only have to process the data once to create the dictionary and compress the data. The most famous example is LZW compression, which is used in many of the compression tools you’re probably using: gzip, pkzip, GIF images, and more.

Adaptive schemes have at least three interesting properties:

One pass over the data creates the dictionary and the compressed representation, an advantage over semi-static schemes
Adaptive schemes get better compression with the more input they see: since they don’t approximate the global symbol frequencies until lots of input has been processed, they’re usually much less optimal than semi-static schemes for small inputs
You can’t randomly access the compressed data, since the dictionary is derived from processing the data sequentially from the start. This is one good reason why folks use static and semi-static schemes for some applications

What about lossy compression?

The taxonomy I’ve presented is for lossless compression schemes — those that are used to compress and decompress data such that you get an exact copy of the original input. Lossy compression schemes don’t guarantee that: they’re schemes where some of the input is thrown away by approximation, and the decompressed representation isn’t guaranteed to be the same as the input. A great example is JPEG image compression: it’s an effective way to store an image, by throwing away (hopefully) unimportant data and approximating it instead. The MP3 music file format and the MP4 video format (usually) do the same thing.

In general, lossy compression schemes are more compact than lossless schemes. That’s why, for example, GIF image files are often much larger than JPEG image files.

Hope you learnt something useful. Tell me if you did or didn’t. See you next time.

The Cassini Search Engine

7 Replies

Cassini has been in the news lately: Wired.com recently featured a story on Cassini and I was interviewed by eCommerceBytes. Given the interest in the topic, this week’s post is more on the inside Cassini story.

The Beginning

At the end of 2010, we decided that we needed a new search platform. We’ve done great work in improving the existing Voyager search engine, and it had contributed substantially to the turnaround of eBay’s Marketplaces business. But it wasn’t a platform for the future: we needed a new, modern platform to innovate for our customers and business.

We agreed as a team to build Cassini, our new search engine platform. We started Cassini in around October 2010, with a team of a few folks lead by industry veteran Nick Whyte.

Voyager

Our Voyager search engine was built in around 2002. It’s delivered impressively and reliably for over ten years. It’s a good old workhorse. But it isn’t up for what we need today: it was architected before many of the modern advances in how search works, having launched before Microsoft began its search effort and before Google’s current generation of search engine.

We’ve innovated on top of Voyager, and John Donahoe has discussed how our work on search has been important. But it hasn’t been easy and we haven’t been able to do everything we’ve wanted – and our teams want a new platform that allows us to innovate more.

Cassini

Since Voyager was named after the late 1970s space probes, we named our new search engine after the 1996 Cassini probe. It’s a symbolic way of saying it’s still a search engine, but a more modern one.

The Cassini-Huygens probe, the namesake of our Cassini search engine.

In 2010, we made many fundamental decisions about the architecture and design principles of Cassini. In particular, we decided that:

It’d support searching over vastly more text; by default, Voyager lets users search over the title of our items and not the description. We decided that Cassini would allow search over the entire document
We’d build a data center automation suite to make deployment of the Cassini software much easier than deploying Voyager; we also built a vision around provisioning machines, fault remediation, and more
We decided to support sophisticated, modern approaches to search ranking, including being able to process a query in multiple rounds of ranking (where the first iteration was fast and approximate, and latter rounds were intensive and accurate)

Cassini is a true start-from-scratch rewrite of eBay’s search engine. Because of the unique nature of eBay’s search problem, we couldn’t use an existing solution. We also made a clean break from Voyager – we wanted to build a new, modular, layered solution.

Cassini in 2011

January 2011 marked the real beginning of the project. We’d been working for three or four months as a team of five or so, and we’d sketched out some of the key components. In 2011, we set the project in serious motion – adding more folks to the core team to begin to build key functionality, and embarking on the work needed to automate search. We also began to talk about the project externally.

Cassini in 2012

In 2012, we’d grown the team substantially from the handful of folks who began the project at the end of 2010. We’d also made substantial progress by June – Cassini was being used behind-the-scenes for a couple of minor features, and we’d been demoing it internally.

In the second half of 2012, we began to roll out Cassini for customer-facing scenarios. Our goal was to add value for our customers, but also to harden the system and understand how it performs when it’s operating at scale. This gave us practice in operating the system, and forced us to build many of the pieces that are needed to monitor and operate the system.

The two scenarios that we rolled out Cassini for in 2012 were Completed Search worldwide, and null and low search in North America. Mark Carges talked about this recently at eBay’s 2013 Analysts’ Day.

Completed search is the feature that allows our customers to search listings that have ended, sold or not; it’s a key way that our customers do price research, whether for pricing an item they’re selling or to figure out what to pay when they’re buying. Because Cassini is a more scalable technology, we were able to provide search over 90 days of items rather than the 14 days that Voyager has always offered.

When users get no results from a query (or very few), Voyager offered a very basic solution – no results and a few suggested queries that you might try (this is still what you’ll see on, for example, the Australian site today). Cassini could do much more – it could help us try related queries and search more data to find related matches. After customer testing, we rolled out Cassini in North America, the UK, and Germany to support this scenario – a feature that’s been really well received.

Cassini in May 2013

In January 2013, we began testing Cassini for the default search with around 5% of US customers. As I’ve discussed previously, when we make platform changes, we like to aim for parity for the customers so that it’s a seamless transition. That’s our goal in testing – replace Voyager with a better platform, but offer equivalent results for queries. The testing has been going on with 5% of our US customers almost continuously this year. Parity isn’t easy: it’s a different platform, and our Search Science and Engineering teams have done great work to get us there.

We’ve recently launched Cassini in the US market. If you’re querying on ebay.com, you’re using Cassini. It’s been a smooth launch – our hardening of the platform in 2012, and our extensive testing in 2013 have set us on a good path. Cassini doesn’t yet do anything much different to Voyager: for example, it isn’t by default searching over descriptions yet.

Cassini in 2013 and beyond

We’ll begin working on making Cassini the default search engine in other major markets later in the year. We’re beginning to test it in the UK, and we’ll work on Germany soon too. There’s also a ton of logistics to cover in launching the engine: cassini requires thousands of computers in several data centers.

We’ll also begin to innovate even faster in making search better. Now that we’ve got our new platform, our search science team can begin trying new features and work even faster to deliver better results for our customers. As a team, we’ll blog more about those changes.

We have much more work to do in replacing Voyager. Our users run more than 250+ million queries each day, but many more queries come from internal sources such as our selling tools, on site merchandizing, and email generation tools. We need to add features to Cassini to support those scenarios, and move them over from Voyager.

It’ll be a while before Voyager isn’t needed to support a scenario somewhere in eBay. We’ll then turn Voyager off, celebrate a little, and be entirely powered by Cassini.

See you next week.

The size and scale of eBay: 2013 edition

5 Replies

It’s time for an update of the eBay Marketplaces interesting statistics that I shared last year. eBay Marketplaces means we’re the team that builds ebay.com, ebay.co.uk, ebay.de, ebay.com.au, and most of the other worldwide marketplaces under the eBay brand.

eBay Marketplaces sold over US$75.3 billion in merchandise in 2012

Here are some refreshed and new facts:

We have over 50 petabytes of data stored in our Hadoop and Teradata clusters
We have over 400 million items for sale
We process more than 250 million user queries per day
We serve over 100,000 pages per second
Our users spend over 180 years in total every day looking at items
We have over 112 million active users
We sold over US$75 billion in merchandize in 2012

eBay’s an exciting place to be — plenty of engineering, business, and commerce challenges that are driven by users, items, traffic, and sales. See you next week.

Selling on eBay

2 Replies

I gave a webinar last week on Search at eBay. Thank you to those who attended and asked great questions. The webinar was recorded and you can listen and view the slides here.

Search at eBay. A tour of search from a recent webinar.

In the webinar, I present a background of some of the facts and figures about eBay (an updated version of this post), a short introduction to how search works, and a tour of our current and future search platforms. The talk concludes with over thirty minutes of Q&A. The middle of the talk is dedicated to explaining how to sell successfully on eBay, and I summarize that advice in this post.

Listing for Success in Search

It’s important to remember that search isn’t the only step in selling on eBay. It’s a three-step process:

Finding an item in search
Choosing an item
Purchasing the item

Visibility in search is therefore important, but it isn’t sufficient to guarantee sales. You must also focus on ensuring buyers click on your item and make the choice to purchase it.

On the first point: it’s true that most successful sales begin with a search on eBay, but many others occur through buyers finding items on another search engine, in an email, through the merchandizing that we feature, or through another source such as a social network. Searches on eBay also don’t always have a keyword: many buyers browse through our categories to find the item they want.

Visibility in Search

Our search team focuses on three tenets of delivering a great experience for our buyers:

Trust: ensuring our buyers get a retail-like experience from great sellers
Value: ensuring our buyers find great deals, including shipping costs
Relevance: ensuring our buyers see items that match what they want

If you focus on these three tenets, you will be successful in gaining visibility in search.

Delivering Trust, Value, and Relevance

Here is some specific advice on how you can be found in search, have your items chosen in search, and drive sales:

List in the correct category with a meaningful title; a meaningful title contains only words that accurately describe the item, it omits “spam words” (off-topic words) and isn’t unnecessarily short or long
Use item specifics and item condition; we match queries against the item specifics, and to be found successfully we recommended you adopt item specifics
When it makes sense for your business, use a long-duration, multi-quantity Buy It Now format
Study what it takes to be a Top Rated Seller on eBay; it is our definition of the key tenets of being a trusted seller on eBay, and we use those trust signals in search
Have great, large, clear pictures for your listings; we know from many experiments that pictures matter to our buyers, and you’re much more likely to be chosen in search and have sales if your pictures are high-quality
Be price competitive; this changes frequently, and you need to understand the market price for your items and be prepared to make changes
Specify shipping costs; our customers want great value, and that includes value in shipping
Have a clear, structured, and comprehensive item description. Stay “on topic” and describe your item accurately
Offer fast shipping and international options; fast shipping matters to our buyers, and offering international options means you have more exposure to more customers

We don’t offer specific answers to specific questions about how Best Match works on eBay. For example, we don’t comment on how we use each of the item specifics in search or whether having certain keywords in certain places in your titles matters. Why not? We want sellers to focus on the key tenets of Trust, Value, and Relevance, and not on specific features that may change or that might give sellers a short-term unfair advantage over other great sellers. Indeed, if a shortcut works today, it may not work tomorrow — we want a level playing field for all sellers, and we’re continually improving Best Match to use more Trust, Value, and Relevance information.

I encourage you to listen to the webinar for a richer explanation of how to sell successfully on eBay with a search-centric focus. See you next week.

Changing platforms at Scale: Lessons from eBay

2 Replies

At eBay, we’ve been on a journey to modernize our platforms, and rebuild old, crufty systems as modern, flexible ones.

In 2011, Sri Shivananda, Mark Carges, and I decided to modernize our front-end development stack. We were building our web applications in v4, an entirely home-grown framework. It wasn’t intuitive to new engineers who joined eBay, they were familiar with industry standards we didn’t use, and we’d also built very tightly coupled balls of spaghetti code over many years. (That’s not a criticism of our engineers – every system gets crufty and unwieldy eventually; software has an effective timespan and then it needs to be replaced.)

Sri Shivananda, eBay Marketplaces’ VP of Platform and Intrastructure

We set design goals for our new Raptor framework, including that we wanted to do a better job separating presentation from business logic. We also wanted better tools for engineers, faster code build times, better monitoring and alerting when problems occur, the ability to test changes without restarting our web servers, and a framework that was intuitive to engineers who joined from other companies. It was an ambitious project, and one that Sri’s lead as a successful revolution in the Marketplaces business. We now build software much faster than ever before, and we’ve rewritten major parts of the front-end systems. (And we’ve open sourced part of the framework.)

That’s the context, but what this post is really about is how you execute a change in platforms in a large company with complex software systems.

The “Steel Thread” Model

Mark Carges, eBay’s Chief Technical Officer

Our CTO, Mark Carges, advocates building a “steel thread” use case when we rethink platforms. What he means is that when you build a new platform, build it at the same time as a single use case on top of the platform. That is, build a system with the platform, like a steel thread running end-to-end through everything we do.

A good platform team thinks broadly about all systems that’ll be built on the platform, and designs for the present and the future. The risk is they’ll build the whole thing – including features that no one ultimately needs for use cases that are three years away. Things change fast in this world. Large platform projects can go down very deep holes, and sometimes never come out.

The wisdom of the “steel thread” model is that the platform team still does the thinking, but it’s pushed by an application team to only fully design and build that parts that are immediately needed. The tension forces prioritization, tradeoffs, and a pragmatism in the platform team. Once you’re done with the first use case, you can move onto subsequent ones and build more of the platform.

Rebuilding the Search Results Page

Our first steel thread use case on Raptor was the eBay Marketplaces Search Results Page (the SRP). We picked this use case because it was hard: it’s our second-most trafficked page, and one of our most complex; building the SRP on Raptor would exercise the new platform extensively.

We co-located our new Raptor platform team – which was a small team by design – together with one of our most mission critical teams, the search frontend team. We declared that their success was mutually dependent: we’re not celebrating until the SRP is built on Raptor.

We asked the team to rebuild the SRP together. We asked for an aggressive timeline. We set bold goals. But there was one twist: build the same functionality and look-and-feel as the existing SRP. That is, we asked the team to only change one variable: change the platform. We asked them not to change an important second variable: the functionality of the site.

This turned out to be important. The end result – after much hard work – was a shiny new SRP code base: modular, cleaner, simpler, and built on a modern platform. But it looked and behaved the same as the old one. This allowed us to test one thing: is it equivalent for our customers to the old one?

Testing the new Search Results Page

We ran a few weeks of A/B tests, where we showed different customer populations the old and new search results page. Remember, they’re pretty much the same SRPs from the customers’ perspective. What we were looking for were subtle problems: was the new experience slower for some scenarios than the old one? Did it break in certain browsers on some platforms? Was it as reliable? Could we operate it as effectively? We could compare the populations and spot the differences reasonably easily.

This was a substantial change in our platforms and systems, and the answer wasn’t always perfect. We took the new SRP out of service a few times, fixed bugs, and put it back in. Ultimately, we deemed it a fine replacement in North America, and turned it on for all our customers in North America. The next few months saw us repeat the process across our other major markets (where there are subtle differences between our SRPs).

What’s important is that we didn’t change the look and feel or functionality at first: if we’d done that, we may not have seen several of the small problems we did see as fast as we saw them.

Keeping the old application

Another wise choice was we didn’t follow that old adage of “out with the old, and in with the new”. We kept the old SRP around running in our data centers for a few months, even though it wasn’t in service.

This gave us a fallback plan: when you make major changes, it’s never going to be entirely plain sailing. We knew that the new SRP would have problems, and that we’d want to take it out of service. When we did, we could put the old one back in service while we fixed the problem.

Eventually, we reached the confidence with our new SRP that we didn’t need the old one. And so it was retired, and the hardware put to other uses. That was over a year ago – it has been smooth sailing since.

The curse of dual-development

You might ask why we set bold goals and pushed the teams hard to build the new Raptor platform and the SRP. We like to do that at eBay, but there’s also a pragmatic reason: while there’s two SRP code bases, there’s twice the engineering going on.

Imagine that we’ve got a new idea for an improvement to the SRP. While we’re building the new SRP, the team has to add that idea to the new code base. The team also has to get the idea into the old code base too – both so we can get it out to our customers, and so that we can carry out that careful testing I described earlier.

To prevent dual development slowing down our project, we declared a moratorium on features in the SRP for a couple of months. This was tough on the broader team – lots of folks want features from the search team, and we delayed all requests. The benefit was we could move much faster in building the new SRP, and getting it out to customers. Of course, a moratorium can’t go on for too long.

And then we changed the page

After we were done with the rollout, the SRP application team could move with speed on modernizing the functionality and look-and-feel of the search results page.

Ultimately, this became an important part of eBay 2.0, a refresh of the site that we launched in 2012. And they’re now set up to move faster whenever they need to: we are testing more new ideas that improve the customer experience than we’ve been able to before, and that’s key to the continued technology-driven revolution at eBay.

See you next week.

Five Myths about Hash Tables

32 Replies

A hash table is data structure that is used to search for exact matches to a search key. For searching, they work like this:

Take a search key (example: the word “cat”)
Hash the search key: pass the key to a function that returns an integer value (example: the hash function returns 47 when the input key is “cat”)
Use the integer value to inspect a slot to see if it contains the search key (example: look in an array at element 47)
If the key matches, great, you’ve found what you’re looking for and you can retrieve whatever metadata you’re looking for (example: in array element 47, we’re found the word “cat”. We retrieve metadata that tells us “cat” is an “animal”)
If the key doesn’t match and the slot contains something besides the key, carry out a secondary search process to make sure the search key really isn’t stored in the hash table; the secondary search process varies depending on the type of hashing algorithm used (example: in array element 47, we found the word “dog”. So, let’s look in slot 48 and see if “cat” is there. Slot 48 is empty, so “cat” isn’t in the hash table. This is called linear probing, it’s one kind of secondary search)

I had a minor obsession with hash tables for ten years.

The worst case is terrible

Engineers irrationally avoid hash tables because of the worst-case O(n) search time. In practice, that means they’re worried that everything they search for will hash to the same value; for example, imagine hashing every word in the English dictionary, and the hash function always returning 47. Put differently, if the hash function doesn’t do a good job of randomly distributing search keys throughout the hash table, the searching will become scanning instead.

This doesn’t happen in practice. At least, it won’t happen if you make a reasonable attempt to design a hash function using some knowledge of the search keys. Indeed, hashing is usually in the average case much closer to O(1). In practice, that means almost always you’ll find what you’re looking for on the first search attempt (and you will do very little secondary searching).

Hash tables become full, and bad things happen

The story goes like this: you’ve created a hash table with one million slots. Let’s say it’s an array. What happens when you try and insert the 1,000,001st item? Bad things: your hash function points you to a slot, it’s full, so you look in another slot, it’s full, and this never ends — you’re stuck in an infinite loop. The function never returns, CPU goes through the roof, your server stops responding, and you take down the whole of _LARGE_INTERNET_COMPANY.

This shouldn’t happen if you’ve spent time on design.

Here’s one solve: there’s a class of hash tables that deal with becoming full. They work like this: when the table becomes x% full, you create a new hash table that is (say) double the size, and move all the data into the new hash table by rehashing all of the elements that are stored in it. The downside is you have to rehash all the values, which is an O(n) [aka linear, scanning, time consuming] process.

Here’s another solve: instead of storing the metadata in the array, make your array elements pointers to a linked list of nodes that contain the metadata. That way, your hash table can never get full. You search for an element, and then traverse the linked list looking for a match; traversing the linked list is your secondary hash function. Here’s a diagram that shows a so-called chained hash table:

A chained hash table, showing “John Smith” and “Sandra Dee” both in slot 152 of the hash table. You can see that they’re chained together in a linked list. Taken from Wikipedia, http://en.wikipedia.org/wiki/Hash_table

Chained hash tables are a very good idea. Problem solved.

By the way, I recommend you create a hash table that’s about twice the size of the number of elements you expect to put into the table. So, suppose you’re planning on putting one million items in the table, go ahead and create a table with two million slots.

Trees are better

In general, trees aren’t faster for searching for exact matches. They’re slower, and here’s peer-reviewed evidence that compares B-trees, splay trees, and hashing for a variety of string types. So, why use a tree at all? Trees are good for inexact matches — say finding all names that begin with “B” — and that’s a task that hash tables can’t do.

Hash functions are slow

I’ve just pointed you to research that shows that must not be true. But there is something to watch out for: don’t use the traditional modulo based hash function you’ll find in your algorithms text book; for example, here’s Sedgewick’s version, note the “% M” modulo operation that’s performed once per character in the input string. The modulo is used to ensure that the hash value falls within the size of the hash table — for example, if we have 100 slots and the hash function returns 147, the modulo turns that into 47. Instead, do the modulo once, just before you return from the hash function.

Here’s the hash function I used in much of my research on hashing. You can download the code here.

A very fast hash function written in C. This uses bit shifts, and a single modulo at the end of the function. If you find a faster one, would love to hear about it.

Hash tables use too much memory

Again, not true. The same peer-reviewed evidence shows that a hash table uses about the same memory as an efficiently implemented tree structure. Remember that if you’re creating an array for a chained hash table, you’re just creating an array of pointers — you don’t really start using significant space until you’re inserting nodes into the linked lists that are chained from the hash table (and those nodes typically take less space than nodes in a tree, since they only have one pointer).

One Small Trick

If you want your hash tables to really fly for searching, move any node that matches your search to the front of the chained list that it’s in. Here’s more peer-reviewed evidence that shows this works great.

Alright. Viva Hash Tables! See you next time.

Summary

September Update

Share this:

Summary

Programme

August Update

Share this:

The Backstory

Chris Caren

Adrian Colyer

John Donahoe

Ken Moss

Satya Nadella

Mike Olson

Christopher Payne

Stephanie Tilenius

Joe Tucci

Share this:

Share this:

A Simple Compression Example

Semi-Static Compression

Static Compression

Adaptive Compression

What about lossy compression?

Share this:

The Beginning

Voyager

Cassini

Cassini in 2011

Cassini in 2012

Cassini in May 2013

Cassini in 2013 and beyond

Share this:

Share this:

Listing for Success in Search

Visibility in Search

Delivering Trust, Value, and Relevance

Share this:

The “Steel Thread” Model

Rebuilding the Search Results Page

Testing the new Search Results Page

Keeping the old application

The curse of dual-development

And then we changed the page

Share this:

The worst case is terrible

Hash tables become full, and bad things happen

Trees are better

Hash functions are slow

Hash tables use too much memory

One Small Trick

Share this: