EMC throws lots of hardware at Hadoop

Storage giant EMC is adding more muscle to its Hadoop strategy with a 1,000-node cluster for testing new Apache Hadoop releases and a new analytics appliance combining EMC’s Hadoop distribution with the EMC Greenplum Database.

Most EMC watchers should have seen the new Data Computing Appliance coming since the company launched its Hadoop plans in May, as it’s a great way for the company to differentiate itself by offering a unified, and high-profit-margin, big data system. Hadoop and analytic databases target different workloads and data types after all, and the traditional method of integrating the two technologies involves maintaining two separate environments. As I noted recently, however, startups such as Platfora and Hadapt are trying to change this by actually integrating Hadoop processing and data warehouse queries within single software products.

Greenplum Co-Founder and SVP of Products Scott Yara explained the new appliance to me as letting customers mix and match different processing modules like Legos. Not only does it support the Greenplum Database and Hadoop, but also partner products for data integration and business intelligence. Everything shares the same high-speed interconnects within the system.

The new testbed, called the Greenplum Analytics Workbench, is a little more surprising because there were some questions early on over how involved EMC would get with the Apache Hadoop project. EMC, after all, offers a free Hadoop distribution that’s based on Facebook’s implementation rather than on the core Apache code, and its enterprise-grade distribution comes through an OEM deal with Hadoop startup MapR. However, Yara told me that EMC is still very pro-Apache and that its decision to offer advanced distributions “wasn’t an aggressive forking strategy away from Apache.”

That’s a smart decision, as all Hadoop distributions ultimately rely on the quality of the Apache code from which they draw at least some of their functionality. Yara said his team wants to help enable some degree of standardization across all Hadoop distributions so that they work together, and he thinks the testbed is a step in that direction. Previously, the Apache team tasked with building new Hadoop releases hasn’t had easy access to infrastructure that lets it test the code at scale; now it has a 1,000-node cluster, which is a respectably sized Hadoop deployment.

Thus far, the EMC Greenplum Hadoop strategy has been hitting all the right notes, offering differentiated products designed for EMC’s large-enterprise customers while still maintaining a relationship with the open source Apache Hadoop community. That likely has a lot to do with the autonomy EMC has given Greenplum, which has been about big data and open source since its inception, to run EMC’s big data business. Yara told me Greenplum wants to be to EMC’s analytics efforts what the independent VMware is to EMC’s virtualization efforts.

2 Responses to “EMC throws lots of hardware at Hadoop”

>>Apache team tasked with building new Hadoop releases hasn’t had easy access to infrastructure that lets them test the code at scale, but now it has 1,000 nodes, which is still a respectably sized Hadoop cluster.

How different is this from the ~4K nodes Yahoo is using with Hadoop and the X number of nodes used by Facebook?

Those are internal deployments; any new code from them is generally contributed back to Apache. EMC is giving Apache itself a cluster on which to test the trunk Hadoop code that’s available for public download.