Predictions

I did something similar, years earlier, that is thematically similar and illustrates the idea of using big data to predict the future. I was interested in what queries users typed before and after the query stomach ache (and a few synonymous queries). Google and Bing both give you examples of what users type next, including: diarrhea, nausea, constipation, peptic ulcer, and stomach acid symptoms. Why was I interested? I wanted to see if I could figure out which drugs had side effects that included stomach upsets.

Talking about eBay’s use of big data at the 2012 PHP UK conference

I collected all the queries that users typed before and after stomach ache (and its synonyms) over the period of two or so years. I then threw away all queries that contained only English dictionary words, leaving queries that contained one or more non-dictionary words. What’s left? Drug names, and a ton of other junk (places, people, websites, misspellings, foreign words, and so on). What I found was that users were typing the names of drugs they were taking, learning about them, and then searching for information on stomach problems (and vice-versa). I could also see how frequently each drug was associated with a stomach ache.

I looked up some of the drugs on various websites, and learnt about the side effects. Guess what? More than half of the drugs I checked had a side effect of a stomach ache. Less than half didn’t — but I suspect that probably isn’t right. If you have enough users, you can learn about the future — and I know that at least a couple of the drug side effects have been updated to include rare incidences of stomach aches. See: you can predict the future!

The world of big data has many companies built on predicting the future using vast amounts of historical data. One of my favorites is The Climate Corporation (who recently were purchased by Monsanto) — they invested their time in doing a better job of predicting the weather than existing weather providers, and commercializing the insights through selling insurance against weather events.

Relative Performance

Every major website is running A/B tests. The idea is pretty simple: show one set of users “experience A” and show another set of users “experience B”. You do this for a while, and then compare various metrics between the populations. You might learn, for example, that customers prefer a blue button over a grey button, or that customers buy more products if you show them better product imagery. I’ve written about this topic previously.

Why’s this related to big data? Well, you have to collect and process an enormous amount of data to derive insights. To find statistically significant differences between the behaviors of populations of users, you typically need tens of thousands of users in each test and a reasonable time period of tracking all of their behaviors. If you multiply this by the number of tests you’re concurrently running, you plan to keep the data forever, and you want to produce many different insights, you will have petabytes of data on your hands.

Creating Feature Ideas

My third ever blog post was about inventing infinite scroll on the Web. It’s a good example of how you can use data to understand customers, and then create intuitive insights based on that understanding. In that example, we saw that users of image search paginated a ton, and we created a future without pagination — what’s now known as “infinite scroll”. You need lots of data, you need to keep that data, and you need to be able to create insights from that data to have these kinds of feature ideas.

Afterword

I don’t intend this to be a taxonomy of big data themes. There’s much more you can do with data — this is a stream of consciousness of themes I’ve seen in action. In my world, very little happens without big data: you’re using data to understand users and systems, you’re creating new ideas with that data, and you’re iterating on those ideas by measuring them at scale. Even the big leaps — like infinite scroll — aren’t ideas that are created in the absence of data.

First, I believe that Big Data itself isn’t valuable, it’s what you do with it that is. The name

I just bought the t-shirt. Grab yourself one too.

implies only that you have a large amount of data — more than you can process in Microsoft Excel — and that you’re investing to store it. It implicitly implies that you want to store the data in one common infrastructure, so that you can organize, process, and extract value from the data. This is a large topic in itself — it is hard to get data into one infrastructure, get it cleansed and organized, and to create order and structure around how its processed — and I’ll save that for another time.

In this post, I’m going to focus on examples of creating business and customer value using big data. It’s the first of two posts on the topic — stay tuned next week for the conclusion.

Discovering Patterns

I wrote early in 2012 on the topic of query alterations. They’re a great example of extracting customer value from big data — in this case, discovering patterns and using those to improve the experience of your users. Suppose you work at a search engine company. You decide to process vast amounts of data to discover examples where users have typed a query into a search engine, haven’t found what they wanted, and refined their query to improve the results. By processing hundreds or thousands of millions of such query patterns, you learn how to improve queries automatically. For example, you learn that users who misspelt ryhthm [sic] refine their query to rhythm, and so you learn that you can automatically do this with high confidence (as Google does today).

Finding Anomalies and Outliers

I’ve been lucky enough to run very large, distributed computing infrastructures at eBay and Microsoft. They’re incredibly complex — thousands of machines carrying out hundreds of different functions in several data centers, and all orchestrated to work together as a complex system. The vast majority of the time, it works almost perfectly — but there’s always some anomaly or quirky behavior at the margin. For example, users of a particular version of Internet Explorer 8 might be having a problem with one page on the site when they carry out four rare actions in a specific order; we might hear about this from a customer service representative who’d been speaking to a customer.

The customer probably simply stated that they’re having a specific issue on a specific page. That is, we’d typically learn about the symptoms, but not much about the problem itself. Here’s where big data comes along to help: we might look for a specific error message in our logs, and collect all the steps and information about all customer experiences that lead up to that error message. From there, we might discover that the common thread is the Internet Explorer 8 browser, and the four rare actions in a specific order. That gives us clues, and then it’s down to the engineering team to diagnose the problem — say, it’s some subtle issue where data isn’t synced across data centers because of a race condition — and to prepare a fix for the site. Splunk has built a successful business around mining system diagnostic big data.

Summarizing and Generalizing

On eBay, a cell phone is sold every five seconds. That’s amazing, and also a good example of how big data helps you summarize what’s happening in terms that people can understand and discuss. Similar examples include sharing that eBay has over 124 million users, that top rated sellers contribute 46% of US GMV, or that fixed price listings were 71% of global GMV.

You need big data to create these kinds of insights. Let’s take the top rated seller fact. First, you need to find all purchases in the relevant time period and sum the total dollar value of the purchases — I don’t know what the time period was, but let’s say for argument’s sake it was the past year. Then, you need to sum the total purchases of the top rated sellers, by joining together the purchases and seller information to ensure you’re only counting the dollars sold from the top rated sellers. From there, it’s simple division to get the 46% answer. The bottom line is you need a year of purchase data and your complete user information to find the answer — in eBay’s case, that’s 124 million active users and (a guess) at least 3,000 billion transactions that need to be processed.

In the follow-up post, I talk about three more examples of creating value using big data: predictions, relative performance, and creating new ideas with data.