Armchair Guide to Data Compression

Data compression is used to represent information in less space than its original representation. This post explains the basics of lossless compression, and hopefully helps you think about what happens when you next compress data.

As with my post on hash tables, it’s aimed at software folks who’ve forgotten about compression, or folks just interested in how software works; if you’re a compression expert, you’re in the wrong place.

A Simple Compression Example

Suppose you have four colored lights: red, green, blue, and yellow. These lights flash in a repeating pattern: red, red, green, red, red, blue, red, red, yellow, yellow (rrgrrbrryy). You think this is neat, and you want to send a message to a friend and share the light flashing sequence.

You know the binary (base 2) numbering system, and you know that you could represent each of the lights in two digits: 00 (for red, 0 in decimal), 01 (for green, 1 in decimal), 10 (for blue, 2 in decimal), and 11 (for yellow, 3 in decimal). To send the message rrgrrbrryy to your friend, you could therefore send them this message: 00 00 01 00 00 10 00 00 11 11 (of course, you don’t send the spaces, I’ve just included them to make the message easy for you to read).
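
Here’s what that looks like in a few lines of Python (a sketch of the scheme above; the color initials and function name are just mine):

```python
# Fixed-width scheme: every color costs two bits, regardless of frequency.
CODES = {"r": "00", "g": "01", "b": "10", "y": "11"}

def encode_fixed(sequence):
    """Encode a string of color initials into a bit string."""
    return "".join(CODES[color] for color in sequence)

bits = encode_fixed("rrgrrbrryy")
print(bits)       # 00000100001000001111
print(len(bits))  # 20 bits
```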

Data compression. It’s hard to find a decent picture!

You also need to send a key or dictionary, so your friend can decode the message. You only need to send this once, even if the lights flash in a new sequence and you want to share that with your friend. The dictionary is: red green blue yellow. Your friend will be smart enough to know that this implies red is 00, green is 01, and so on.

Sending the message takes 20 bits: two bits per light flash and ten flashes in total (plus the one-off cost of sending the dictionary, which I’ll ignore for now). Let’s call that the original representation.

Alright, let’s do some simple compression. Notice that red flashes six times, yellow twice, and green and blue once each. At two bits per flash, the red flashes cost twelve bits, the yellow flashes four, and green and blue four between them: the total of twenty bits. If we’re smart, we should send the code for red in fewer bits since it’s sent often, and use more than two bits for green and blue since they’re sent once each. If we could send red in one bit, it’d cost us six bits to send the six red flashes, saving six bits over the two-bit version.

How about we try something simple? Let’s represent red as 0. Let’s represent yellow as 10, green as 110, and blue as 1110 (this is a simple unary counting scheme). Why’d I use that scheme? Well, it’s a simple idea: sort the color flashes by decreasing frequency (red, yellow, green, blue), and assign increasingly longer codes to the colors in a very simple way: you can count 1s until you see a 0, and then you have a key you can use to look in the dictionary. When we see just a 0, we can look in the dictionary to find that seeing zero 1s means red. When we see 1110, we can look in the dictionary to find that seeing three 1s means blue.

Here’s what our original twenty-bit sequence would now look like: 0 0 110 0 0 1110 0 0 10 10. That’s a total of 17 bits, a saving of 3 bits — we’ve compressed the data! Of course, we need to send our friend the dictionary too: red yellow green blue.
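
In Python, the unary scheme and its decoder are only a few lines (again, just a sketch of the scheme described above):

```python
# Unary scheme: count 1s until you hit a 0; the count of 1s is the key
# into the dictionary (zero 1s = red, one = yellow, two = green, three = blue).
DICTIONARY = ["r", "y", "g", "b"]  # sorted by decreasing flash frequency
UNARY = {"r": "0", "y": "10", "g": "110", "b": "1110"}

def encode_unary(sequence):
    return "".join(UNARY[color] for color in sequence)

def decode_unary(bits):
    out, ones = [], 0
    for bit in bits:
        if bit == "1":
            ones += 1
        else:                      # a 0 terminates the current code
            out.append(DICTIONARY[ones])
            ones = 0
    return "".join(out)

bits = encode_unary("rrgrrbrryy")
print(bits, len(bits))     # 00110001110001010 17
print(decode_unary(bits))  # rrgrrbrryy
```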

It turns out we can do better than this using Huffman coding. We could assign 0 to red, 10 to yellow, 110 to blue, and 111 to green. Our message would then be 0 0 111 0 0 110 0 0 10 10. That’s 16 bits, 1 bit better than our simple scheme (and, again, we don’t need to send the spaces). We’d also need to share the dictionary: red <blank> yellow <blank> blue green, to show that 0 is red, 1 isn’t a code on its own, 10 is yellow, 11 isn’t a code on its own, 110 is blue, and 111 is green. A slightly more complicated dictionary for better message compression.
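
If you’re curious where those codes come from, here’s a sketch of the textbook Huffman construction in Python (my illustration, not part of the message format; tie-breaking means the exact 0s and 1s can differ from the codes above, but the code lengths, and the 16-bit total, come out the same):

```python
import heapq
from collections import Counter

def huffman_codes(message):
    """Build Huffman codes by repeatedly merging the two least frequent
    subtrees, prefixing their codes with 0 and 1."""
    freqs = Counter(message)
    # Heap entries: (frequency, tiebreaker, {symbol: code so far}).
    heap = [(f, i, {sym: ""}) for i, (sym, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        merged = {sym: "0" + code for sym, code in left.items()}
        merged.update({sym: "1" + code for sym, code in right.items()})
        heapq.heappush(heap, (f1 + f2, next_id, merged))
        next_id += 1
    return heap[0][2]

codes = huffman_codes("rrgrrbrryy")
print(codes)  # code lengths: r=1, y=2, g=3, b=3
print(sum(len(codes[sym]) for sym in "rrgrrbrryy"))  # 16 bits
```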

Semi-Static Compression

Our two examples are semi-static compression schemes. The dictionary is static: once built, it doesn’t change. However, it’s built from a pass over the data to learn the symbol frequencies, so the dictionary depends on the data. For that reason, I’ll call these semi-static schemes.

Huffman coding (or minimum-redundancy coding) is the most famous example of a semi-static scheme.

Semi-static schemes have at least three interesting properties:

  1. They require two passes over the data: one to build the dictionary, and another to emit the compressed representation
  2. They’re data-dependent, meaning that the dictionary is built based on the symbols and frequencies in the original data. A dictionary that’s derived from one data set isn’t optimal for a different data set (one with different frequencies) — for example, if you figured out the dictionary for Shakespeare’s works, it isn’t going to be optimal for compressing War and Peace
  3. You need to send the dictionary to the recipient, so the message can be decoded; lots of folks forget to include this cost when they share the compression ratios or savings they’re seeing. Don’t do that

Whatever kind of compression scheme you choose, you need to decide what the symbols are. For example, you could choose letters, words, or even phrases from English text.

Static Compression

Morse code is an example of a (fairly lame) static compression scheme. The dictionary is universally known (and doesn’t need to be communicated), but it isn’t derived from the input data. This means that the compression ratios you’ll see are at best the same as you’ll see from a semi-static compression scheme, and usually worse.

There aren’t many static compression schemes in widespread use. Two examples are Elias gamma and delta codes, which are used to compress inputs that consist of only integer values.
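
As a concrete example, here’s a sketch of the Elias gamma code in Python: the binary representation of a positive integer, preceded by enough zeros to make its length self-describing. No dictionary is ever sent, and small numbers simply get short codes.

```python
def elias_gamma(n):
    """Elias gamma code: len(binary)-1 zeros, then the binary form of n."""
    if n < 1:
        raise ValueError("gamma codes are defined for integers >= 1")
    binary = bin(n)[2:]  # e.g. 5 -> '101'
    return "0" * (len(binary) - 1) + binary

for n in (1, 2, 3, 4, 9):
    print(n, elias_gamma(n))  # 1, 010, 011, 00100, 0001001
```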

Adaptive Compression

The drawbacks of semi-static compression schemes are two-fold: you need to process the data twice, and they don’t adapt to local regions in the data where the symbol frequencies might vary from the overall frequencies. Imagine, for example, that you’re compressing a very large image: you might find a region of blue sky, where there are only a few shades of blue. If you built a dictionary for only that blue section, you’d get a different (and better) dictionary than the one you’d get for the whole image.

Here’s the idea behind adaptive compression. Build the dictionary as you go: process the input, and see if you can find it in the (initially empty) dictionary. If you can’t, add the input to the dictionary. Now emit the compressed code, and keep on processing the input. In this way, your dictionary adapts as you go, and you only have to process the data once to create the dictionary and compress the data. The most famous examples come from the Lempel-Ziv family: the LZW variant is used in GIF images and the Unix compress tool, and the related LZ77 approach is at the heart of gzip and pkzip.
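
Here’s a minimal sketch of the LZW idea in Python (illustrative only; real implementations pack the codes into bits rather than returning a list of integers):

```python
def lzw_compress(text):
    """Grow the dictionary while compressing: emit a code for the longest
    already-seen phrase, then add that phrase plus one character."""
    dictionary = {chr(i): i for i in range(256)}  # start with single bytes
    phrase, codes = "", []
    for ch in text:
        candidate = phrase + ch
        if candidate in dictionary:
            phrase = candidate                       # keep extending the match
        else:
            codes.append(dictionary[phrase])
            dictionary[candidate] = len(dictionary)  # learn a new phrase
            phrase = ch
    if phrase:
        codes.append(dictionary[phrase])
    return codes

print(lzw_compress("rrgrrbrryy"))  # [114, 114, 103, 256, 98, 256, 121, 121]
```

Notice how the repeated "rr" is emitted as the single learned code 256 the second and third times it appears: the dictionary adapted to the data.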

Adaptive schemes have at least three interesting properties:

  1. One pass over the data creates the dictionary and the compressed representation, an advantage over semi-static schemes
  2. Adaptive schemes compress better the more input they see: they don’t approximate the global symbol frequencies until lots of input has been processed, so they’re usually much less effective than semi-static schemes on small inputs
  3. You can’t randomly access the compressed data, since the dictionary is derived from processing the data sequentially from the start. This is one good reason why folks use static and semi-static schemes for some applications

What about lossy compression?

The taxonomy I’ve presented is for lossless compression schemes — those that are used to compress and decompress data such that you get an exact copy of the original input. Lossy compression schemes don’t guarantee that: they’re schemes where some of the input is thrown away by approximation, and the decompressed representation isn’t guaranteed to be the same as the input. A great example is JPEG image compression: it’s an effective way to store an image, by throwing away (hopefully) unimportant data and approximating it instead. The MP3 music file format and the MP4 video format (usually) do the same thing.

In general, lossy compression schemes are more compact than lossless schemes. That’s why, for example, GIF image files are often much larger than JPEG image files.

Hope you learnt something useful. Tell me if you did or didn’t. See you next time.

Don’t use a Pie Chart

I don’t like pie charts. Why? It’s almost impossible to compare the relative size of the slices. Worse still, it is actually impossible to compare the slices between two pie charts.

Pie charts in action.

Take the example above. Take a look at Pie Chart A. The blue slice looks the same size as the red or green slice. You might draw the conclusion they’re roughly the same. In fact, take a look at the histogram below — the red is 17 units, blue is 18 units, and green is 20 units. The histogram is informative, useful for comparison, and clear for communication. (There’s still a cardinal sin here though: no labels on the axes; I’ll rant about that some other day.)

Compare pie charts B and C above. It sure looks like there’s the same quantity of blue in B and C, and about the same amount of green. The quantities aren’t the same: the histograms below show that green is 19 and 20 respectively, and blue is 20 and 22.

You might argue that pie charts are useful for comparing relative quantities. I’d argue that comparison is possible but hard to do accurately. Take a look at the three yellow slices in charts A, B, and C. They look to be similar percentages of the pies, but it’s hard to tell: they’re oriented slightly differently, making the comparison unclear. What’s the alternative? Use a histogram with the y-axis representing the percentage.
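
If you want to see the alternative concretely, here’s a matplotlib sketch (the numbers are loosely based on the example above; the values I didn’t quote, like the yellows, are invented for illustration):

```python
import matplotlib.pyplot as plt

# Loosely based on the example above; yellow and some values are invented.
values = {"A": [17, 18, 20, 15], "B": [17, 20, 19, 15], "C": [18, 22, 20, 16]}
colors = ["red", "blue", "green", "yellow"]

fig, axes = plt.subplots(1, 3, sharey=True)
for ax, (chart, vals) in zip(axes, values.items()):
    total = sum(vals)
    # Plot percentages, not raw counts, so the charts compare directly.
    ax.bar(colors, [100 * v / total for v in vals], color=colors)
    ax.set_title(f"Chart {chart}")
axes[0].set_ylabel("Percentage of total (%)")
plt.show()
```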

Pie charts get even uglier when there are lots of slices, or some of the slices are ridiculously small. If you’re going to use them — I still don’t think it’s a good idea — then use them when there are only a few values and the smallest slices can be labeled. The pie chart below is ridiculous.

A pie chart that’s nearly impossible to read. I’m not sure what conclusions you could draw from this one.

Why do I care? I’m in the business of communicating information, and I spend much of my time reviewing information that people share with me. Clarity is important — decisions are made on data. See you next time.

Joining Pivotal

I joined Pivotal Software Inc. last week as their SVP of Research and Development. I’m excited to be part of a team that will change the big data landscape. I’m already enjoying getting down to work.

Why’d I choose Pivotal? I’ve seen big data transform businesses; I’ve been lucky enough to watch it happen from the inside of two Internet giants. I believe we’re on the cusp of seeing that transformation in the next few thousand large businesses, and Pivotal has the opportunity to help drive it. One day, we will see health care, farming, manufacturing, education, finance, and communications iterate at the speed of an Internet company — and I believe it’s possible if they’re able to store their data, understand what it’s telling them, and iterate fast on their products. Indeed, I said as much yesterday in an interview with GigaOm.

I’m looking forward to the challenge. As I learn more, expect more blogs on Pivotal, its challenges, and more. See you next time.

Learning Lessons from the Apollo program

I’m fascinated by the Apollo program. It’s a triumph of human endeavour that we sent twenty-four men to the moon, and twelve to the surface between 1969 and 1972. It was a program built with reliable 1950s-designed equipment, and computers far less sophisticated than a cheap modern wristwatch. Achieving Kennedy’s audacious vision was made possible through teamwork, planning, hard work, and ingenuity.

Recently, I decided to learn more about the early Apollo missions to understand what work went into getting us to the moon. It’s an incredible story, and perhaps more interesting than several of the later missions.

An incremental approach

Apollo 7 was the first manned Apollo mission to fly in space. It was a confidence-builder, and the first time NASA had flown three people together in orbit around the Earth. After eleven days in 1968, the crew returned having tested the command module and successfully made a live TV appearance.

The Apollo 7 crew during the first live broadcast from space

The critical pieces of Apollo 7 had flown earlier. The unmanned Apollo 4 and 6 missions had tested launching a similar command module and returning it to earth, while Apollo 5 had flown the same Saturn IB rocket that launched Apollo 7. Apollo 7 focused on putting three astronauts in space atop a now-trusted setup.

Apollo 8 was a shorter, six-day mission. Three men flew to the moon, orbited it a few times, and returned to earth. They were the first to fly atop the Saturn V rocket — the smaller Saturn IB used in Apollo 7 wasn’t powerful enough to reach the moon. The Saturn V had been tested in Apollo 4 and 6. This was an audacious mission — testing both a manned Saturn V launch and a visit to the moon. Behind the scenes, it was a tough sell to management — they didn’t like being that aggressive. But it made sense as a mission: the lunar module was behind in development and wasn’t ready to be tested, and there was fear the Russians might get cosmonauts around the moon in 1968.

The far side of the moon as seen from Apollo 8. The Apollo 8 crew was the first ever to see the far side with their own eyes

Apollo 9 was a return to an incremental approach. The mission orbited the earth, tested the lunar module (the craft that would later land on the moon) in earth orbit, and included a space walk to test the spacesuits. It also included a rendezvous, necessary because the command and lunar modules had separated.

Testing the Apollo 9 lunar module Spider in earth orbit

Apollo 10 was the full dress rehearsal for landing on the moon. It was time to repeat the test of the lunar module, but this time in moon orbit. The lunar module detached from the command module, and the crew descended to within ten miles of the moon’s surface. They then returned to rendezvous with the command module, and journeyed back to earth. In total, the crew spent eight days in space, and the mission was a huge success — so successful that Apollo 11 was the mission that met Kennedy’s goal in July 1969.

The Apollo 10 lunar module Snoopy returns from almost landing on the moon

Interestingly, many people at NASA thought early in the Apollo program that Apollo 12 was likely to be the mission that landed first on the moon. Perhaps if Apollo 9 had happened before Apollo 8 (as was originally planned), there might have been two separate missions to test the manned Saturn V and then a manned Saturn V to the moon. Certainly, if something substantial had gone wrong between Apollo 7 and 10, Apollo 11 would have been repeating validation of the space craft, space suits, and processes.

The tortoise beats the hare

The Soviet Union went all-in with Soyuz 1. It was the first flight of the new Soyuz spacecraft and Soyuz rocket, and was planned to rendezvous with the three-man Soyuz 2. The mission had problems from the start — a solar panel failed to deploy, and this delayed the launch of Soyuz 2. The weather turned bad, and Soyuz 2 didn’t launch. This was fortunate, as the parachute on Soyuz 1 didn’t deploy due to a design fault, and its single cosmonaut died on reentry. If Soyuz 2 had launched, its crew wouldn’t have survived either.

Vladimir Komarov, the cosmonaut who died in the ill-fated Soyuz 1

This was 1967, a year before Apollo 7. The Soviets went for broke, testing rockets, capsules, rendezvous, and more in one mission. The result was failure: an 18-month delay in the program, and ultimately losing the race to the moon. On paper, they had looked like they were ahead. Indeed, Soyuz 4 and 5 in early 1969 eventually completed the mission aims of Soyuz 1 and 2.

The tortoise beat the hare. It was a pretty fast tortoise, but you see the point. The pragmatic approach of trying one complex new component in each mission ultimately made the Apollo program successful. Doing everything at once didn’t work. There’s something in that for all of us. See you next time.

Music everywhere with Sonos

I’ve embraced Sonos as the way to enjoy music and radio in my house.

What’s Sonos?

I was late to the game too, so don’t worry if you haven’t heard of Sonos or don’t quite know what it does. Sonos is a company that makes several powered speakers, that is, nice little units that contain an amplifier and speakers. They also make a product that allows you to connect your existing amplifier to the Sonos system.

The Sonos family of powered speakers and integration products. At the rear left is their subwoofer. The Play:3, Play:5, and Play:1 are grouped in the middle rear. At the front is Playbar for home theater. At the rear right are the integration products.

One thing that’s cool about Sonos is that the powered speakers don’t need to be wired to a system. You put them where you want, and they connect wirelessly to a base station that’s plugged into your home wireless Internet router. Alternatively, you can wire them to a standard Ethernet socket if you’ve wired your house. Sonos calls its base station a bridge, and right now one of those comes free with any of Sonos’s speakers.

What makes a Sonos system cool, though, isn’t just that it’s portable and unwired. It’s that it sounds pretty darn good, and it integrates reasonably nicely with popular music services such as Slacker and TuneIn Radio. That means you can pay a few bucks a month and play a large library of music, and you can listen to a vast array of radio stations. You control this experience using your smartphone, tablet, or PC.

Playing music

It’s pretty simple to play music. You select the room you want to play — the available rooms are shown on the left in the image below. Then you select a source you want to play — you can choose from your own music library, or one of the streaming services, or a line-in input into one of the devices.

The Sonos Mac OS X application. Very similar to the Sonos iPad app. On the left are rooms, on the right are sound sources.

You can group rooms together to create a zone, and have the same source playing throughout part or all of your house. For example, I often put on the radio, and group together my bedroom, main living areas, garage gym, and outside patio so the radio follows me as I move around the house.

I’ve got a turntable, and I’ve connected that to one of Sonos’s larger Play:5 systems; the smaller Play:1 and Play:3 don’t have a line-in input. I needed a pre-amp between the turntable and the Play:5, and picked up a reasonable one at an online store. With this setup, I can listen to vinyl throughout the house in the same way as I can listen to the rest of my music.

I sometimes plug other sources into another line-in socket in another Play:5. For example, when I want to listen to Major League Baseball, I fire up my MLB At:Bat app on my iPhone, and connect the iPhone to the Play:5. Then, I select the Line-in as a source in the Sonos app, and we’ve got baseball in the house. (Go Mariners!) The drawback is that if I want to adjust volume or settings, I have to walk to the Play:5 and fiddle with the iPhone.

What’s Great

Here are the top five things I love about Sonos:

  1. Sounds good to great. I can’t get over how much sound is in the Play:1 for the size and price. The thing is about as big as a coffee tin, and it has nice bass response and looks good. The bigger Play:5 is a serious unit, and has five amplifiers and five speakers — when you pair two together to create a stereo system, and add a subwoofer, you’ve got a serious sound system (and it’s priced like one too — you’re talking US$1500)
  2. Music and radio everywhere. Buy a few units, put them around the house, and your life will be better. You’ll be better connected to the world through radio, and you’ll enjoy your music even more
  3. Easy to set up. When you buy a new speaker, you can use any Sonos app on any device to register the unit. It takes about two minutes to add the unit to your house
  4. Range. I can put speakers anywhere in my house — in locations where I don’t get wifi on my laptop or phone — and it works just fine. I can take one of them out in the yard, and all is well
  5. It’s an alarm clock. It’s easy to set up an alarm on any Sonos device, and choose a source. I wake up to KQED radio, and it gently fades in. It turns off after an hour (that’s configurable). The rest of my family uses this feature too

What Needs Work

Here’s where there’s room for improvement:

  1. It’s expensive. The Play:1 is the first sub-$200 offering from Sonos, the Play:3 is $299, and it’s upward from there. The Play:1 is great value, but fitting out your house is an investment. Be warned: these things multiply; you’ll buy one or two, and you’ll be back for more
  2. The service integration is a bit clunky. I really like Slacker’s iPhone app — but you only get a fraction of the features when you use the Sonos app to stream the Slacker service. The Sonos folks use the APIs that these streaming companies provide, rather than the streaming companies integrating Sonos capabilities natively into their apps. You can also tell Sonos has no relationship with Apple — the music library integration is pretty clunky; it works at the file system level
  3. The apps need a bit of a rethink and redesign; they lack the beauty and simplicity of the hardware. The app paradigm is that you select a room, then you select music. That isn’t always how you think — sometimes you want to dive into the music, and then select the room. You can do it, but it’s a little clunky (and sometimes you’ll surprise someone in your house with a blast of music). Still, I’ve seen tweens using it easily enough
  4. The apps or the network or something can be sluggish. I find my iPhone a little frustrating as the interface to Sonos — my iPad and Mac are much better. It sometimes takes a while for the iPhone app to find my Sonos system, and the app is sometimes unresponsive to interactions. It’s also not a reliable device for streaming my music library
  5. It needs power. The Play:1 looks portable, but you need an electrical outlet

All up?

Pretty awesome. A game changer at my house. The hardware is amazing — and that’s what’s actually important. Software and music service integrations can be fixed, and they’re improving with every version.

See you again soon.

What’s Big Data anyway? Part Two

Last week, I shared a few ways in which big data adds value. This week, I share a few more.

Predictions

You can predict the future using data. Google gets publicity from predicting flu outbreaks.

Years earlier, I did something thematically similar that illustrates the idea of using big data to predict the future. I was interested in what queries users typed before and after the query stomach ache (and a few synonymous queries). Google and Bing both give you examples of what users type next, including: diarrhea, nausea, constipation, peptic ulcer, and stomach acid symptoms. Why was I interested? I wanted to see if I could figure out which drugs had side effects that included stomach upsets.

Talking about eBay’s use of big data at the 2012 PHP UK conference

I collected all the queries that users typed before and after stomach ache (and its synonyms) over a period of two or so years. I then threw away all queries that contained only English dictionary words, leaving queries that contained one or more non-dictionary words. What’s left? Drug names, and a ton of other junk (places, people, websites, misspellings, foreign words, and so on). What I found was that users were typing the names of drugs they were taking, learning about them, and then searching for information on stomach problems (and vice versa). I could also see how frequently each drug was associated with a stomach ache.
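
The filtering step is easy to sketch in Python (the word list and queries here are made up for illustration; a real run would use a full English dictionary and years of query logs):

```python
def has_non_dictionary_word(query, dictionary_words):
    """True if the query contains at least one non-dictionary word."""
    return any(word not in dictionary_words for word in query.lower().split())

# A real word list might come from /usr/share/dict/words.
dictionary_words = {"stomach", "ache", "pain", "relief", "side", "effects"}
queries = ["stomach ache relief", "omeprazole side effects", "naproxen stomach pain"]
print([q for q in queries if has_non_dictionary_word(q, dictionary_words)])
# ['omeprazole side effects', 'naproxen stomach pain']
```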

I looked up some of the drugs on various websites, and learnt about the side effects. Guess what? More than half of the drugs I checked listed stomach ache as a side effect. The rest didn’t — but I suspect that’s an omission rather than the truth. If you have enough users, you can learn about the future — and I know that at least a couple of the drug side effect lists have been updated to include rare incidences of stomach aches. See: you can predict the future!

The world of big data has many companies built on predicting the future using vast amounts of historical data. One of my favorites is The Climate Corporation (recently purchased by Monsanto): they invested in doing a better job of predicting the weather than existing providers, and commercialized the insights by selling insurance against weather events.

Relative Performance

Every major website is running A/B tests. The idea is pretty simple: show one set of users “experience A” and show another set of users “experience B”. You do this for a while, and then compare various metrics between the populations. You might learn, for example, that customers prefer a blue button over a grey button, or that customers buy more products if you show them better product imagery. I’ve written about this topic previously.

Why’s this related to big data? Well, you have to collect and process an enormous amount of data to derive insights. To find statistically significant differences between the behaviors of populations of users, you typically need tens of thousands of users in each test, tracked over a reasonable time period. Multiply this by the number of tests you’re concurrently running, add that you plan to keep the data forever and want to produce many different insights, and you’ll have petabytes of data on your hands.
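
To make the sample-size point concrete, here’s a sketch of a two-proportion z-test in Python (my illustration; real experimentation platforms are far more sophisticated). The same lift in conversion rate that is indistinguishable from noise with a thousand users per arm is clearly significant with ten thousand:

```python
from math import erf, sqrt

def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference in conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value

print(two_proportion_z(50, 1_000, 60, 1_000))      # p ~ 0.33: not significant
print(two_proportion_z(500, 10_000, 600, 10_000))  # p ~ 0.002: significant
```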

Creating Feature Ideas

My third ever blog post was about inventing infinite scroll on the Web. It’s a good example of how you can use data to understand customers, and then create intuitive insights based on that understanding. In that example, we saw that users of image search paginated a ton, and we created a future without pagination — what’s now known as “infinite scroll”. You need lots of data, you need to keep that data, and you need to be able to create insights from that data to have these kinds of feature ideas.

Afterword

I don’t intend this to be a taxonomy of big data themes. There’s much more you can do with data — this is a stream of consciousness of themes I’ve seen in action. In my world, very little happens without big data: you’re using data to understand users and systems, you’re creating new ideas with that data, and you’re iterating on those ideas by measuring them at scale. Even the big leaps — like infinite scroll — aren’t ideas that are created in the absence of data.

See you next time.

What’s Big Data anyway?

I spoke recently at SMX East on Leveraging Big Data in Search Marketing. I was the opening speaker, and I started by defining Big Data. I thought I’d share some of what I said.

First, I believe that Big Data itself isn’t valuable; it’s what you do with it that is. The name

I just bought the t-shirt. Grab yourself one too.

implies only that you have a large amount of data — more than you can process in Microsoft Excel — and that you’re investing to store it. It also implies that you want to store the data in one common infrastructure, so that you can organize, process, and extract value from it. That’s a large topic in itself — it’s hard to get data into one infrastructure, get it cleansed and organized, and create order and structure around how it’s processed — and I’ll save it for another time.

In this post, I’m going to focus on examples of creating business and customer value using big data. It’s the first of two posts on the topic — stay tuned next week for the conclusion.

Discovering Patterns

I wrote early in 2012 on the topic of query alterations. They’re a great example of extracting customer value from big data — in this case, discovering patterns and using them to improve the experience of your users. Suppose you work at a search engine company. You decide to process vast amounts of data to discover examples where users have typed a query into a search engine, haven’t found what they wanted, and refined their query to improve the results. By processing hundreds of millions or billions of such query patterns, you learn how to improve queries automatically. For example, you learn that users who misspell ryhthm [sic] refine their query to rhythm, and so you can make that alteration automatically with high confidence (as Google does today).
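
The counting at the heart of this is simple to sketch in Python (with toy session data; the names are mine):

```python
from collections import Counter

# Each session is the sequence of queries one user typed.
sessions = [
    ["ryhthm", "rhythm"],
    ["ryhthm", "rhythm guitar"],
    ["ryhthm", "rhythm"],
]

pairs = Counter()
for session in sessions:
    for before, after in zip(session, session[1:]):
        pairs[(before, after)] += 1

for (before, after), count in pairs.most_common():
    print(f"{before!r} -> {after!r}: {count}")
# At web scale, ('ryhthm', 'rhythm') dominates, so the alteration can be
# applied automatically with high confidence.
```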

Finding Anomalies and Outliers

I’ve been lucky enough to run very large, distributed computing infrastructures at eBay and Microsoft. They’re incredibly complex — thousands of machines carrying out hundreds of different functions in several data centers, all orchestrated to work together as a complex system. The vast majority of the time, it works almost perfectly — but there’s always some anomaly or quirky behavior at the margin. For example, users of Internet Explorer 8 might be having a problem with one page on the site when they carry out four rare actions in a specific order; we might hear about this from a customer service representative who’d been speaking to a customer.

The customer probably simply stated that they’re having a specific issue on a specific page. That is, we’d typically learn about the symptoms, but not much about the problem itself. Here’s where big data comes along to help: we might look for a specific error message in our logs, and collect all the steps and information about all customer experiences that lead up to that error message. From there, we might discover that the common thread is the Internet Explorer 8 browser, and the four rare actions in a specific order. That gives us clues, and then it’s down to the engineering team to diagnose the problem — say, it’s some subtle issue where data isn’t synced across data centers because of a race condition — and to prepare a fix for the site. Splunk has built a successful business around mining system diagnostic big data.
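
Here’s a sketch of that find-the-common-thread step in Python (the field names and sessions are hypothetical):

```python
from collections import Counter

# Sessions that ended in the error message we're investigating.
error_sessions = [
    {"browser": "IE8", "page": "checkout", "actions": ("a", "b", "c", "d")},
    {"browser": "IE8", "page": "checkout", "actions": ("a", "b", "c", "d")},
    {"browser": "Chrome", "page": "checkout", "actions": ("x", "b", "c", "d")},
]

for field in ("browser", "page", "actions"):
    top_value, count = Counter(s[field] for s in error_sessions).most_common(1)[0]
    print(f"{field}: {top_value} in {count} of {len(error_sessions)} sessions")
# A strong skew toward one value (here IE8 and one action sequence) is the
# clue the engineering team digs into.
```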

Summarizing and Generalizing

On eBay, a cell phone is sold every five seconds. That’s amazing, and also a good example of how big data helps you summarize what’s happening in terms that people can understand and discuss. Similar examples include sharing that eBay has over 124 million users, that top rated sellers contribute 46% of US GMV, or that fixed price listings were 71% of global GMV.

You need big data to create these kinds of insights. Let’s take the top rated seller fact. First, you need to find all purchases in the relevant time period and sum the total dollar value of the purchases — I don’t know what the time period was, but let’s say for argument’s sake it was the past year. Then, you need to sum the total purchases of the top rated sellers, by joining together the purchases and seller information to ensure you’re only counting the dollars sold from the top rated sellers. From there, it’s simple division to get the 46% answer. The bottom line is you need a year of purchase data and your complete user information to find the answer — in eBay’s case, that’s 124 million active users and (a guess) at least 3,000 billion transactions that need to be processed.
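
The computation itself boils down to a join and a division; here’s a toy sketch in Python (the records and field names are made up):

```python
# Join purchases to sellers, then divide top-rated GMV by total GMV.
purchases = [
    {"seller_id": 1, "amount": 120.0},
    {"seller_id": 2, "amount": 80.0},
    {"seller_id": 1, "amount": 40.0},
]
sellers = {1: {"top_rated": True}, 2: {"top_rated": False}}

total = sum(p["amount"] for p in purchases)
top_rated = sum(p["amount"] for p in purchases
                if sellers[p["seller_id"]]["top_rated"])
print(f"{100 * top_rated / total:.0f}% of GMV from top rated sellers")  # 67%
```

At eBay’s scale, the same join runs over a year of purchases and the complete user table, which is why it’s a big data problem.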

In the follow-up post, I talk about three more examples of creating value using big data: predictions, relative performance, and creating new ideas with data.