Hadoop Heads to the Enterprise via the Cloud

While technologies such as Hadoop have given IT organizations the ability to access large amounts of data more easily, the fact of the matter is that the infrastructure needed to process all that data remains beyond the means of most organizations.

Yahoo, for example, runs Hadoop on more than 40,000 nodes to manage 180 to 200 petabytes of data, and Facebook has 2,000 compute nodes holding 20 petabytes. Hadoop clusters on average consist of about 200 servers, while within most IT organizations a typical cluster might scale up to only about 30 servers.

In an IBM webcast, Leon Katsnelson, IBM program director for information management at the IBM Cloud Computing Center for Competence, makes the case that the cost of Hadoop will inevitably push these deployments into the cloud. Simply put, Katsnelson says, most companies can't make the upfront capital investment in the compute, storage and networking resources required.

Unless organizations find a way to access Hadoop resources in the cloud, adds Katsnelson, they are never going to develop the skills needed to use Hadoop. And yet a recent study conducted by the Ventana Group found that 94 percent of Hadoop users perform analytics on volumes of data that were not possible before; 88 percent analyze data in greater detail; and 82 percent can now retain more of their data. In effect, these organizations are starting to view Hadoop as a strategic technology they need in order to gain a competitive advantage.

Katsnelson argues that the only way the analytics playing field is going to remain level is if organizations take advantage of Hadoop as a cloud service. That is why IBM launched BigInsights on SmartCloud Enterprise, which makes Hadoop available to customers in less than 30 minutes at a cost starting at 30 cents per hour. At that price point, IBM figures most organizations can't afford to ignore the potential impact that analytics applications based on Hadoop might have on their business.

IBM is not the only cloud service provider making Hadoop available. What's most interesting about all this is how cloud computing should substantially reduce the amount of time it takes for organizations to gain access to new and emerging technologies. That might not only reduce the time it normally takes for new technologies to reach mainstream adoption; it might ultimately have a major impact on the rate at which companies of all sizes compete with each other.

I agree that cloud principles are appealing when it comes to big data, and it's nice to see Hadoop distribution support from the likes of Amazon, Microsoft, Google, etc. Along with this, it would be good to start a conversation around the challenges that need to be overcome to effectively leverage the public cloud, not only for simple storage and processing but also for more advanced forms of analytics.

Topics like:

Data integration: loading millions of records via the internet.

Data security: data-level security is easier to manage in on-premise systems. For example, you could argue that compliance initiatives are easier and cheaper when data is kept locally. Will cloud-based storage and processing drive a need for encryption of data at rest that may not be necessary on-premise?

Cloud-friendly interfaces: the interfaces to many big data technologies are not the most cloud friendly compared with other cloud solutions that leverage RESTful APIs, web services, etc.

Architecture complexity: introducing the public cloud as a factor will likely complicate the overall architecture design.
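On the cloud-friendly-interfaces point above, Hadoop does ship one genuinely REST-style interface: the WebHDFS API, which exposes HDFS operations over plain HTTP. As a minimal sketch (the hostname, path and default namenode port 50070 here are illustrative placeholders, not a real cluster), this shows how a WebHDFS request URL is formed:

```python
from urllib.parse import urlencode

def webhdfs_url(host, port, path, op, **params):
    """Build a WebHDFS REST URL for an HDFS path and operation.

    WebHDFS serves HDFS over HTTP, e.g.
    GET http://<namenode>:<port>/webhdfs/v1/<path>?op=LISTSTATUS
    """
    query = urlencode(dict(op=op, **params))
    return f"http://{host}:{port}/webhdfs/v1{path}?{query}"

# Example: list a directory on a hypothetical cluster's namenode.
url = webhdfs_url("namenode.example.com", 50070, "/user/demo", "LISTSTATUS")
print(url)
# Fetching this URL (e.g. with urllib.request.urlopen) would return the
# directory listing as JSON; no network call is made in this sketch.
```

Because it is just HTTP and JSON, an interface like this is far easier to consume from cloud services and web applications than Hadoop's native RPC protocols, which is exactly the gap the comment describes.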

I'm certainly not a naysayer; we have many customers and use cases that we host in the SAS cloud. But it would be good to identify, educate on and discuss these factors.