What’s Big Data anyway?

I spoke recently at SMX East on Leveraging Big Data in Search Marketing. I was the opening speaker, and I started by defining Big Data. I thought I’d share some of what I said.

First, I believe that Big Data itself isn’t valuable; it’s what you do with it that is. The name implies only that you have a large amount of data (more than you can process in Microsoft Excel) and that you’re investing to store it. It also implies that you want to store the data in one common infrastructure, so that you can organize, process, and extract value from it. That is a large topic in itself: it is hard to get data into one infrastructure, get it cleansed and organized, and create order and structure around how it’s processed. I’ll save that for another time.

In this post, I’m going to focus on examples of creating business and customer value using big data. It’s the first of two posts on the topic — stay tuned next week for the conclusion.

Discovering Patterns

I wrote early in 2012 on the topic of query alterations. They’re a great example of extracting customer value from big data: in this case, discovering patterns and using them to improve the experience of your users. Suppose you work at a search engine company. You decide to process vast amounts of data to discover examples where users have typed a query into a search engine, haven’t found what they wanted, and refined their query to improve the results. By processing hundreds of millions, or even billions, of such query patterns, you learn how to improve queries automatically. For example, you learn that users who type the misspelling ryhthm usually refine their query to rhythm, and so you can make that correction automatically with high confidence (as Google does today).
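To make the idea concrete, here is a minimal sketch of mining refinements from query sessions. It is an illustration only, not how any real search engine does it: the function name learn_rewrites, the thresholds min_support and min_confidence, and the sample sessions are all made up for this example. It counts how often one query is immediately followed by another in the same session, and keeps a rewrite only when it is seen often enough and dominates the alternatives.

```python
from collections import Counter, defaultdict


def learn_rewrites(sessions, min_support=3, min_confidence=0.75):
    """Return {query: rewrite} for refinements seen often and consistently.

    `sessions` is a list of query lists, each in the order the user typed them.
    The thresholds are illustrative assumptions, not tuned values.
    """
    pair_counts = defaultdict(Counter)
    for queries in sessions:
        # Count each consecutive (query, next query) pair within a session.
        for original, refined in zip(queries, queries[1:]):
            if original != refined:
                pair_counts[original][refined] += 1

    rewrites = {}
    for original, refinements in pair_counts.items():
        best, count = refinements.most_common(1)[0]
        confidence = count / sum(refinements.values())
        # Keep the rewrite only if it is both frequent and dominant.
        if count >= min_support and confidence >= min_confidence:
            rewrites[original] = best
    return rewrites


# Made-up sessions: most users who type "ryhthm" go on to type "rhythm".
sessions = [
    ["ryhthm", "rhythm"],
    ["ryhthm", "rhythm"],
    ["ryhthm", "rhythm", "rhythm games"],
    ["ryhthm", "rhythm"],
    ["ryhthm", "rythm"],
    ["cheap flights", "cheap flights sydney"],
]
rewrites = learn_rewrites(sessions)
print(rewrites)
```

With this toy data only the ryhthm to rhythm rewrite survives, because the other refinements are each seen once. At real scale the same counting would run as a distributed job over the query logs, and you would want far stricter thresholds.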

Finding Anomalies and Outliers

I’ve been lucky enough to run very large, distributed computing infrastructures at eBay and Microsoft. They’re incredibly complex — thousands of machines carrying out hundreds of different functions in several data centers, and all orchestrated to work together as a complex system. The vast majority of the time, it works almost perfectly — but there’s always some anomaly or quirky behavior at the margin. For example, users of a particular version of Internet Explorer 8 might be having a problem with one page on the site when they carry out four rare actions in a specific order; we might hear about this from a customer service representative who’d been speaking to a customer.

The customer probably simply stated that they’re having a specific issue on a specific page. That is, we’d typically learn about the symptoms, but not much about the problem itself. Here’s where big data comes along to help: we might look for a specific error message in our logs, and collect all the steps and information about all the customer experiences that led up to that error message. From there, we might discover that the common thread is the Internet Explorer 8 browser and the four rare actions in a specific order. That gives us clues, and then it’s down to the engineering team to diagnose the problem (say, some subtle issue where data isn’t synced across data centers because of a race condition) and to prepare a fix for the site. Splunk has built a successful business around mining system diagnostic big data.
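Here is a rough sketch of that "find the common thread" step. Everything in it is hypothetical: the record fields (browser, path, errors), the error code E1042, and the function name common_threads are invented for illustration and are not eBay’s or Splunk’s tooling. It compares how common each attribute value is among sessions that hit the error with how common it is across all sessions, and reports the values that are strongly over-represented.

```python
from collections import Counter


def common_threads(sessions, error_code, fields, min_lift=2.0):
    """List (field, value, lift) for values over-represented in failing sessions.

    Lift is the share of failing sessions with the value divided by the
    share of all sessions with the value.
    """
    failing = [s for s in sessions if error_code in s["errors"]]
    results = []
    for field in fields:
        overall = Counter(s[field] for s in sessions)
        in_failing = Counter(s[field] for s in failing)
        for value, count in in_failing.items():
            fail_share = count / len(failing)
            base_share = overall[value] / len(sessions)
            lift = fail_share / base_share
            if lift >= min_lift:
                results.append((field, value, round(lift, 2)))
    # Strongest signal first.
    return sorted(results, key=lambda r: -r[2])


def make(browser, path, errors, n):
    """Build n identical made-up session records."""
    return [{"browser": browser, "path": path, "errors": errors} for _ in range(n)]


# Ten made-up sessions; two of them hit the hypothetical error E1042.
sessions = (
    make("IE8", "edit>undo>relist>edit", ["E1042"], 2)
    + make("IE8", "search>view", [], 1)
    + make("Chrome", "search>view", [], 4)
    + make("Firefox", "search>buy", [], 3)
)
threads = common_threads(sessions, "E1042", ["browser", "path"])
for field, value, lift in threads:
    print(field, value, lift)
```

In this toy data both the IE8 browser and the unusual action path stand out. That only points to where to look; diagnosing the actual bug is still the engineering team’s job.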

Summarizing and Generalizing

On eBay, a cell phone is sold every five seconds. That’s amazing, and also a good example of how big data helps you summarize what’s happening in terms that people can understand and discuss. Similar examples include sharing that eBay has over 124 million users, that top rated sellers contribute 46% of US GMV, or that fixed price listings were 71% of global GMV.

You need big data to create these kinds of insights. Let’s take the top rated seller fact. First, you need to find all purchases in the relevant time period and sum the total dollar value of the purchases — I don’t know what the time period was, but let’s say for argument’s sake it was the past year. Then, you need to sum the total purchases of the top rated sellers, by joining together the purchases and seller information to ensure you’re only counting the dollars sold from the top rated sellers. From there, it’s simple division to get the 46% answer. The bottom line is you need a year of purchase data and your complete user information to find the answer — in eBay’s case, that’s 124 million active users and (a guess) at least 3,000 billion transactions that need to be processed.
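The calculation itself is simple once the data is in one place. Here is a small sketch of the steps just described, using made-up records: the field names (seller_id, amount_cents, date, top_rated), the function name top_rated_share, and the numbers are all assumptions, chosen only so the toy result comes out at 46%. It filters purchases to a time period, joins them to seller information, and divides.

```python
from datetime import date


def top_rated_share(purchases, sellers, start, end):
    """Share of purchase value in [start, end] sold by top rated sellers."""
    # "Join" step: the set of seller ids that are top rated.
    top_rated = {s["seller_id"] for s in sellers if s["top_rated"]}

    in_period = [p for p in purchases if start <= p["date"] <= end]
    total = sum(p["amount_cents"] for p in in_period)
    top_total = sum(
        p["amount_cents"] for p in in_period if p["seller_id"] in top_rated
    )
    return top_total / total if total else 0.0


# Made-up data for illustration only.
sellers = [
    {"seller_id": 1, "top_rated": True},
    {"seller_id": 2, "top_rated": False},
]
purchases = [
    {"seller_id": 1, "amount_cents": 2300, "date": date(2012, 3, 1)},
    {"seller_id": 1, "amount_cents": 2300, "date": date(2012, 6, 15)},
    {"seller_id": 2, "amount_cents": 5400, "date": date(2012, 9, 30)},
    {"seller_id": 2, "amount_cents": 9900, "date": date(2011, 12, 31)},  # outside period
]
share = top_rated_share(purchases, sellers, date(2012, 1, 1), date(2012, 12, 31))
print(f"{share:.0%}")
```

The hard part at eBay’s scale isn’t this arithmetic; it’s having a year of purchases and the full seller table available in one system so the join can run at all.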

In the follow-up post, I talk about three more examples of creating value using big data: predictions, relative performance, and creating new ideas with data.
