The nature of cloud-based data science

Mike Driscoll is CEO of Metamarkets, a cloud-based analytics company he co-founded in San Francisco in 2010.

Mike Driscoll of Metamarkets talks about the analytics challenges and opportunities that businesses moving to the cloud face.

Interview conducted by Alan Morrison and Bo Parker

PwC: What’s your background, and how did you end up running a data science startup?

MD: I came to Silicon Valley after studying computer science and biology for five years, and trying to reverse engineer the genome network for uranium-breathing bacteria. That was my thesis work in grad school. There was lots of modeling and causal inference. If you were to knock this gene out, could you increase the uptake of the reduction of uranium from a soluble to an insoluble state? I was trying all these simulations and testing with the bugs to see whether you could achieve that.

PwC: You wanted to clean up radiation leaks at nuclear plants?

MD: Yes. The Department of Energy funded the research work I did. Then I came out here and I gave up on the idea of building a biotech company, because I didn’t think there was enough commercial viability there from what I’d seen.

I did think I could take this toolkit I’d developed and apply it to all these other businesses that have data. That was the genesis of the consultancy Dataspora. As we started working with companies at Dataspora, we found this huge gap between what was possible and what companies were actually doing.

Right now the real shift is that companies are moving from a very high-latency, coarse-grained era of reporting into one where they start to have lower latency, finer granularity, and better visibility into their operations. They realize the problem with being walking amnesiacs, knowing what happened to their customers in the last 30 days and then forgetting every 30 days.

Most businesses are just now figuring out that they have this wealth of information about their customers and how their customers interact with their products.

PwC: On its own, the new availability of data creates demand for analytics.

MD: Yes. The absolute number-one thing driving the current focus in analytics is the increase in data. What’s different now from what happened 30 years ago is that analytics is the province of people who have data to crunch.

What’s causing the data growth? I’ve called it the attack of the exponentials—the exponential decline in the cost of compute, storage, and bandwidth, and the exponential increase in the number of nodes on the Internet. Suddenly the economics of computing over data has shifted so that almost all the data that businesses generate is worth keeping around for its analysis.

PwC: And yet, companies are still throwing data away.

MD: So many businesses keep only 60 days’ worth of data. The storage cost is so minimal! Why would you throw it away? This is the shift at the big data layer; when these companies store data, they store it in a very expensive relational database. There need to be different temperatures of data, and companies need to put different values on the data: whether it’s hot or cold, whether it’s active. Most companies have only one temperature: they either keep it hot in a database, or they don’t keep it at all.
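[Editor's illustration: the idea of data temperatures can be sketched as a simple age-based rule that assigns every record to a storage tier instead of deleting it. This is only a sketch under stated assumptions. The tier names, the 60-day and 365-day boundaries, and the function `storage_tier` are hypothetical and do not come from the interview or from any Metamarkets system.]

```python
from datetime import date, timedelta

# Hypothetical tier boundaries; the interview gives no specific thresholds.
HOT_DAYS = 60     # recent data, kept in the fast (expensive) database
WARM_DAYS = 365   # older data, kept in cheaper storage that is still queryable


def storage_tier(record_date: date, today: date) -> str:
    """Return 'hot', 'warm', or 'cold' based on the record's age in days."""
    age_days = (today - record_date).days
    if age_days <= HOT_DAYS:
        return "hot"
    if age_days <= WARM_DAYS:
        return "warm"
    # Nothing is discarded: the oldest data simply moves to the cheapest tier.
    return "cold"


if __name__ == "__main__":
    today = date(2012, 6, 1)  # arbitrary reference date for the example
    for days_old in (10, 200, 2000):
        record_date = today - timedelta(days=days_old)
        print(days_old, "days old ->", storage_tier(record_date, today))
```

The design point is that the rule has no "delete" branch: a company with only one temperature effectively treats everything past the first boundary as disposable.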

PwC: So they could just keep it in the cloud?

MD: Absolutely. We’re starting to see the emergence of cloud-based databases where you say, “I don’t need to maintain my own database on the premises. I can just rent some boxes in the cloud and they can persist our customer data that way.”

Metamarkets is trying to deliver DaaS—data science as a service. If a company doesn’t have analytics as a core competency, it can use a service like ours instead. There’s no reason for companies to be doing a lot of tasks that they are doing in-house. You need to pick and choose your battles.

We will see a lot of IT functions being delivered as cloud-based services. And now inside of those cloud-based services, you often will find an open source stack.

Here at Metamarkets, we’ve drawn heavily on open source. We have Hadoop on the bottom of our stack, and then at the next layer we have our own in-memory distributed database. We’re running on Amazon Web Services and have hundreds of nodes there.

PwC: How are companies that do have data science groups meeting the challenge? Take the example of an orphan drug that is proven to be safe but isn’t particularly effective for the application it was designed for. Data scientists won’t know enough about a broad range of potential biological systems for which that drug might be applicable, but the people who do have that knowledge don’t know the first thing about data science. How do you bring those two groups together?

MD: My data science Venn diagram helps illustrate how you bring those groups together. The diagram has three circles. [See above.] The first circle is data science. Data scientists are good at this. They can take data strings, perform processing, and transform them into data structures. They have great modeling skills, so they can use something like R or SAS and start to build a hypothesis that, for example, if a metric is three standard deviations above or below the specific threshold then someone may be more likely to cancel their membership. And data scientists are great at visualization.
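[Editor's illustration: the three-standard-deviation hypothesis mentioned above can be sketched in a few lines. This is a minimal sketch, not a method described in the interview. It assumes the baseline is the mean of the observed values, and the function name `flag_outliers`, the metric (weekly logins per member), and the sample data are all hypothetical.]

```python
from statistics import mean, stdev


def flag_outliers(values, k=3.0):
    """Return indices of values more than k sample standard deviations from the mean."""
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation; needs at least two values
    if sigma == 0:
        return []  # all values identical, so nothing stands out
    return [i for i, v in enumerate(values) if abs(v - mu) > k * sigma]


if __name__ == "__main__":
    # Made-up weekly login counts for 20 members; the last one is unusual.
    weekly_logins = [10] * 19 + [100]
    print("Members to review:", flag_outliers(weekly_logins))
```

A flag like this is only a starting hypothesis. Whether flagged members are actually more likely to cancel would still have to be checked against real cancellation data.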

But companies that have the tools and expertise may not be focused on a critical business question. A company is trying to build what it calls the technology genome. If you give them a list of parts in the iPhone, they can look and see how all those different parts are related to other parts in camcorders and laptops. They built this amazingly intricate graph of the actual makeup. They’ve collected large amounts of data. They have PhDs from Caltech; they have Rhodes scholars; they have really brilliant people. But they don’t have any real critical business questions, like “How is this going to make me more money?”

The second circle in the diagram is critical business questions. Some companies have only the critical business questions, and many enterprises fall in this category. For instance, the CEO says, “We just released a new product and no one is buying it. Why?”

The third circle is good data. A beverage company or a retailer has lots of POS [point of sale] data, but it may not have the tools or expertise to dig in and figure out fast enough where a drink was selling and what demographics it was selling to, so that the company can react.

On the other hand, sometimes some web companies or small companies have critical business questions and they have the tools and expertise. But because they have no customers, they don’t have any data.

PwC: Without the data, they need to do a simulation.

MD: Right. The intersection in the Venn diagram is where value is created. Think of an e-commerce company that says, “How do we upsell people and reduce the number of abandoned shopping carts?” Well, the company has 600 million shopping cart flows that it has collected in the last six years. So the company says, “All right, data science group, build a sequential model that shows what we need to do to intervene with people who have abandoned their shopping carts and get them to complete the purchase.”

PwC: The questioning nature of business—the culture of inquiry—seems important here. Some who lack the critical business questions don’t ask enough questions to begin with.

MD: It’s interesting—a lot of businesses have this focus on real-time data, and yet it’s not helping them get answers to critical business questions. Some companies have invested a lot in getting real-time monitoring of their systems, and it’s expensive. It’s harder to do and more fragile.

A friend of mine worked on the data team at a web company. With real effort, that company developed a real-time log monitoring framework that showed how many people were logging in every second, with 15-second latency, across the ecosystem. It was hard to keep up and it was fragile. It kept breaking down and they kept bringing it back up, and then they realized that they took very few business actions in real time. So why devote all this effort to a real-time system?

PwC: In many cases, the data is going to be fresh enough, because the nature of the business doesn’t change that fast.

MD: Real time actually means two things. The first thing has to do with the freshness of data. The second has to do with the query speed.

By query speed, I mean how long it takes to answer a question such as, “What were your top products in Malaysia around Ramadan?”

PwC: There’s a third one also, which is the speed to knowledge. The data could be staring you in the face, and you could have incredibly insightful things in the data, but you’re sitting there with your eyes saying, “I don’t know what the message is here.”

MD: That’s right. This is about how fast you can pull the data and how fast you can actually develop an insight from it.

For learning about things quickly enough after they happen, query speed is really important. This becomes a challenge at scale. One of the problems in the big data space is that databases used to be fast. You used to be able to ask a question of your inventory and you’d get an answer in seconds. SQL was quick when the scale wasn’t large; you could have an interactive dialogue with your data.

But now, because we’re collecting millions and millions of events a day, data platforms have seen real performance degradation. Lagging performance has led to degradation of insights. Companies literally are drowning in their data.

In the 1970s, when the intelligence agencies first got reconnaissance satellites, there was this proliferation in the amount of photographic data they had, and they realized that it paralyzed their decision making. So to this point of speed, I think there are a number of dimensions here. Typically when things get big, they get slow.

PwC: Isn’t that the problem the new in-memory database appliances are intended to solve?

MD: Yes. Our Druid engine on the back end is directly competitive with those proprietary appliances. The biggest difference between those appliances and what we provide is that we’re cloud based and are available on Amazon.

If your data and operations are in the cloud, it does not make sense to have your analytics on some appliance. We solve the performance problem in the cloud. Our mantra is visibility and performance at scale.

Data in the cloud liberates companies from some of these physical box confines and constraints. That means that your data can be used as inputs to other types of services. Being a cloud service really reduces friction. The coefficient of friction around data has for a long time been high, and I think we’re seeing that start to drop. What is changing is not just the scale or amount of data being collected, but the ease with which data can interoperate with different services, both inside your company and out.