Hadoop According To Hortonworks: An Insider's View

Hortonworks recently marked its second year in business and its first year of offering a distribution of Hadoop open-source software and related commercial support services. Next up, within a matter of weeks, will be the next release of the Hortonworks Data Platform, incorporating next-generation Hadoop 2.0.

YARN (a not-quite acronym for Yet Another Resource Negotiator) is a crucial new open-source component that will improve Hadoop performance and move it beyond the confines of batch MapReduce processing. Work is also underway, as part of the Hortonworks-supported Stinger project, to deliver a higher-performance, more SQL-compatible version of Hive. SQL-on-Hadoop capabilities are just one area in which Hortonworks is in a pitched competitive battle with Cloudera. While Hortonworks waits to ship foundation-approved open-source software, Cloudera has added Impala and other components to Hadoop that are best administered through its commercial management software.

Can Hortonworks innovate and build the value of its company, or is the company's "100% open source" strategy vulnerable to commoditization as the Hadoop platform matures? Shaun Connolly, Hortonworks' VP of Corporate Strategy, spoke with InformationWeek about a range of topics including CTO Eric Baldeschwieler's recent departure, prospects for Hadoop 2.0, acquisition rumors and the company's long-range plans.

InformationWeek: Hortonworks presents itself as the company that promotes Hadoop as an "enterprise viable platform," but isn't that a foregone conclusion at this point?

Shaun Connolly: I think that mission has a lot more legs. If I draw a corollary to how the Linux market played out, Linux started out with some very targeted workloads. Hadoop, in its first generation, was clearly batch-oriented MapReduce processing. As Linux matured you got secure Linux and virtualized Linux, and the platform took on a lot more mission-critical workloads. That's what we're seeing with Hadoop. With YARN, other types of workloads will be able to snap into Hadoop and be coordinated on the same platform.

IW: On the personnel front, Hortonworks' co-founder, CTO and former CEO Eric Baldeschwieler recently left the company. Have you selected a new CTO?

Connolly: Our new CTO is Ari Zilka, who was the CTO and one of the founders of Terracotta, which is an in-memory data-management technology that's now a part of Software AG. Ari was previously at Walmart, where he deployed massive-scale data systems. Ari has been at Hortonworks for almost a year and a half, and he has worn mostly a field-CTO-type hat as chief product officer. He has also helped customers leverage Hadoop and integrate it with lower-latency architectures.

IW: How big of a technical depth hole did Eric Baldeschwieler's departure leave?

Connolly: We've effectively grown 10X since our founding in terms of number of employees. We started with about 24 engineers from Yahoo, including Eric. Eric has chosen to move on and do other things, and that was a personal choice. The rest of the core team from Yahoo -- Arun Murthy, Owen O'Malley, Alan Gates, Sanjay Radia, Suresh Srinivas, Mahadev Konar and others -- are all active in their projects and are Hortonworks employees.

We've grown from those Yahoo roots and have a good many engineers from Oracle, IBM and MySQL. We also have folks from Microsoft and SAP as well as Amazon and Google. We have a good mix from Web-scale companies as well as enterprise software developers. Greg Pavlik [VP of engineering] in particular has been able to attract a bunch of folks because he spent many years at Oracle.

IW: What's your customer base like these days and what are the primary use cases you're seeing for Hadoop?

Connolly: We ended the last quarter with more than 120 customers. We're actively working with customers across Web, retail, media, telco and healthcare. We see a fair amount in the Web and retail spaces, including brick-and-mortar retailers. They'll typically get started with analytic applications taking advantage of new data sources, like clickstreams, social sentiment and devices. With clickstreams and social they're after the classic 360-degree customer view.

Hadoop offers a more economical solution where these customers can store way more data. In the case of healthcare, they're after a 360-degree view of the patient, and we're seeing electronic medical records applications as well as uses in pharma around manufacturing analytics.

IW: Are these net-new applications, or were these things firms were trying to do but without much success with relational databases?

Connolly: They were trying to do it, in many cases, but they had sprawl of systems and they could never tag one system as the place where they could pull all of that information together. They tended to have an incomplete view, and they were always focused on looking only at, say, 30 to 60 days of data when they had to put it in a traditional data warehouse. The cost structures of data warehouses are anywhere from 10 to 100 times higher than what they can drive per terabyte on a Hadoop cluster. Now they can store multiple years of data, not just a month or two.

Besides Stinger and Impala, another Apache project that provides SQL-on-Hadoop capabilities at interactive speed is Apache Drill. Drawing inspiration from Dremel and other projects, it has been making great progress through collaboration among multiple companies. It is soon to reach alpha and has a very flexible architecture.

Ari Zilka led a massive, modernizing redeployment of the Java backend of Wal-Mart's web site (Mark Towfiq led the user interaction side), then generalized the technology for Terracotta's in-memory data management system. Terracotta brought a big speedup to the way Java applications could handle data. He's the right successor to Hortonworks' founding CTO Eric Baldeschwieler.

This was a long interview and I had to cut some good stuff. I pressed the point about Hortonworks' strategy in this exchange:

IW: Doesn't Hortonworks' strategy kind of put it in the background -- a services company that takes a back seat to partners like Microsoft and Teradata?

Connolly: It doesn't put us in the background. If you look at the Teradata Unified Data Architecture, our box is one of three that they advertise to the market as part of a best-of-breed big data architecture. We're a technology platform provider. We're not a database provider. We're not going to focus only on SQL; that's just one of the workloads that the platform can and should support. So when you say, "are we going to run out of gas on things that can be done around Hadoop," we think the party has just started. If you look at the number of committers that we have, there are 21 at Hortonworks versus seven or eight at Cloudera. That's just the Apache Hadoop project. We have approaching 80 direct committers across Hadoop, Hive, Pig and other projects, and we do the open source project releases in many of those. That's why we're valuable to our partners.