Last week, I shared a few ways in which big data adds value. This week, I share a few more.
Predictions
You can predict the future using data. Google gets publicity from predicting flu outbreaks.
Years earlier, I did something thematically similar that illustrates the idea of using big data to predict the future. I was interested in what queries users typed before and after the query “stomach ache” (and a few synonymous queries). Google and Bing both give you examples of what users type next, including: diarrhea, nausea, constipation, peptic ulcer, and stomach acid symptoms. Why was I interested? I wanted to see if I could figure out which drugs had side effects that included stomach upsets.

Talking about eBay’s use of big data at the 2012 PHP UK conference
I collected all the queries that users typed before and after “stomach ache” (and its synonyms) over a period of about two years. I then threw away all queries that contained only English dictionary words, leaving queries that contained one or more non-dictionary words. What’s left? Drug names, and a ton of other junk (places, people, websites, misspellings, foreign words, and so on). What I found was that users were typing the names of drugs they were taking, learning about them, and then searching for information on stomach problems (and vice-versa). I could also see how frequently each drug was associated with a stomach ache.
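Here is a minimal sketch of that filtering step in Python. It is not the code I ran at the time: the tiny word list, the function names, and the sample queries (including the drug names) are all made up for illustration, and a real run would load a full English dictionary and read queries from logs.

```python
from collections import Counter

# Hypothetical stand-in for an English dictionary; a real run would load a full word list.
DICTIONARY = {"stomach", "ache", "side", "effects", "weather", "today",
              "pain", "relief", "dosage"}

def non_dictionary_words(query, dictionary=DICTIONARY):
    """Return the words in a query that are not in the dictionary."""
    return [word for word in query.lower().split() if word not in dictionary]

def count_candidate_terms(neighbor_queries, dictionary=DICTIONARY):
    """Count non-dictionary words across queries typed before or after the target query.

    Queries made up only of dictionary words contribute nothing, which is the
    same as throwing them away.
    """
    counts = Counter()
    for query in neighbor_queries:
        # Use a set so a word repeated inside one query is counted once.
        counts.update(set(non_dictionary_words(query, dictionary)))
    return counts

# Made-up neighboring queries; "drugalpha" and "drugbeta" are placeholder names.
sample_queries = [
    "drugalpha side effects",
    "weather today",
    "drugalpha dosage",
    "drugbeta pain relief",
]
candidates = count_candidate_terms(sample_queries)
```

Sorting `candidates` by count gives a rough ranking of how often each leftover term shows up near the target query. The junk (places, people, misspellings) survives this filter too, so the output still needs a manual or dictionary-based pass to pick out the drug names.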
I looked up some of the drugs on various websites, and learnt about the side effects. Guess what? More than half of the drugs I checked listed stomach ache as a side effect. The rest didn’t, but I suspect those listings are incomplete. If you have enough users, you can learn about the future, and I know that at least a couple of the drugs have since had their side effects updated to include rare incidences of stomach aches. See: you can predict the future!
The world of big data has many companies built on predicting the future using vast amounts of historical data. One of my favorites is The Climate Corporation (which was recently purchased by Monsanto). They invested their time in doing a better job of predicting the weather than existing weather providers, and commercialized the insights by selling insurance against weather events.
Relative Performance
Every major website is running A/B tests. The idea is pretty simple: show one set of users “experience A” and show another set of users “experience B”. You do this for a while, and then compare various metrics between the populations. You might learn, for example, that customers prefer a blue button over a grey button, or that customers buy more products if you show them better product imagery. I’ve written about this topic previously.
Why’s this related to big data? Well, you have to collect and process an enormous amount of data to derive insights. To find statistically significant differences between the behaviors of populations of users, you typically need tens of thousands of users in each test and a reasonable time period of tracking all of their behaviors. If you multiply this by the number of tests you’re running concurrently, plan to keep the data forever, and want to produce many different insights, you will have petabytes of data on your hands.
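To make the sample-size point concrete, here is a small sketch of one common way to compare two populations: a two-proportion z-test on conversion rates. This is an assumption on my part about how you might run the comparison, not a description of any particular site's tooling, and the conversion counts below are invented for illustration.

```python
import math

def two_proportion_z_test(conversions_a, users_a, conversions_b, users_b):
    """Return (z, two-sided p-value) for the difference in conversion rates of B versus A."""
    rate_a = conversions_a / users_a
    rate_b = conversions_b / users_b
    # Pooled rate under the null hypothesis that A and B convert equally.
    pooled = (conversions_a + conversions_b) / (users_a + users_b)
    std_error = math.sqrt(pooled * (1 - pooled) * (1 / users_a + 1 / users_b))
    z = (rate_b - rate_a) / std_error
    # Two-sided tail probability of the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Invented numbers: the same 5.0% versus 5.5% conversion rates at two test sizes.
z_small, p_small = two_proportion_z_test(100, 2_000, 110, 2_000)
z_large, p_large = two_proportion_z_test(1_000, 20_000, 1_100, 20_000)

print(f"2,000 users per arm:  z={z_small:.2f}, p={p_small:.3f}")
print(f"20,000 users per arm: z={z_large:.2f}, p={p_large:.3f}")
```

With these made-up numbers, the half-point difference is indistinguishable from noise at 2,000 users per experience, but clears the usual 0.05 threshold at 20,000. That is why small improvements need large populations, and large populations mean a lot of logged behavior.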
Creating Feature Ideas
My third ever blog post was about inventing infinite scroll on the Web. It’s a good example of how you can use data to understand customers, and then create intuitive insights based on that understanding. In that example, we saw that users of image search paginated a ton, and we created a future without pagination — what’s now known as “infinite scroll”. You need lots of data, you need to keep that data, and you need to be able to create insights from that data to have these kinds of feature ideas.
Afterword
I don’t intend this to be a taxonomy of big data themes. There’s much more you can do with data — this is a stream of consciousness of themes I’ve seen in action. In my world, very little happens without big data: you’re using data to understand users and systems, you’re creating new ideas with that data, and you’re iterating on those ideas by measuring them at scale. Even the big leaps — like infinite scroll — aren’t ideas that are created in the absence of data.
See you next time.