A few months ago, I spoke with UC Berkeley Professor and Databricks CEO Ion Stoica about the early days of Spark and the Berkeley Data Analytics Stack. Ion noted that by the time his students began work on Spark and Mesos, his experience at his other start-up, Conviva, had already informed some of the design choices:

“Actually, this story started back in 2009, and it started with a different project, Mesos. So, this was a class project in a class I taught in the spring of 2009. And that was to build a cluster management system, to be able to support multiple cluster computing frameworks like Hadoop, at that time, MPI and others. To share the same cluster as the data in the cluster. Pretty soon after that, we thought about what to build on top of Mesos, and that was Spark. Initially, we wanted to demonstrate that it was actually easier to build a new framework from scratch on top of Mesos, and of course we wanted it to be also special. So, we targeted workloads for which Hadoop at that time was not good enough. Hadoop was targeting batch computation. So, we targeted interactive queries and iterative computation, like machine learning.

…

“I also co-founded a company, Conviva. It was in the area of video management, and one of its products was an analytics tool. And as a part of that product, one feature was ad hoc queries. And, at that time, you know … we were using MySQL. MySQL was not good enough. I saw firsthand the limitations of the existing technologies, especially on the open source side. And finally, you’re pretty anchored in seeing the trends and the problems in the industry around us, because we are funded at Berkeley by many companies, like Facebook, Yahoo and so forth…so, after the initial building…batch jobs, on top of Hadoop, they were looking for the next level; you want to have something faster.”

It’s one thing to build something as an academic project, where papers and conference presentations are the standard metrics. Successful open source projects involve great developers coming together to tackle real problems, and great timing is usually an important and underappreciated factor as well. Ion explained:

“There are many components. And if you look back, you can always revise history. Especially if you had success. First of all, we had a fantastic group of students. Matei, the creator of Spark, and others who did Mesos. And then another great group of different students who contributed and built different modules on top of Spark, and made Spark what it is today, which is really a platform. So, that’s one: the students.

“The other one was a great collaboration with the industry. We are seeing firsthand what the problems are, challenges, so you’re pretty anchored in reality.

“The third thing is, we are early. In some sense, we started very early to look at big data, we started as early as 2006, 2007 starting to look at big data problems. We had a little bit of a first-mover advantage, at least in the academic space. So, all this together, plus the fact that the first releases of these tools, in particular Spark, was like 2000 lines of code…very small, so tractable.”

UC Berkeley has had many successful open source projects in the past — BSD and Postgres are prominent examples. I asked Ion if early on they thought Spark would be embraced so enthusiastically by industry:

“Absolutely not. We wanted to have some good, interesting research projects; we wanted to make it as real as possible, but in no way could we have anticipated the adoption and the enthusiasm of people and of the community around what we’ve built.”

Ben Lorica is the Chief Data Scientist and Director of Content Strategy for Data at O'Reilly Media, Inc. He has applied Business Intelligence, Data Mining, Machine Learning and Statistical Analysis in a variety of settings including Direct Marketing, Consumer and Market Research, Targeted Advertising, Text Mining, and Financial Engineering. His background includes stints with an investment management company, internet startups, and financial services.