AWS Elastic MapReduce vs. Windows Azure HDInsight

In the past few years, Apache’s Hadoop software library has gained market share in Big Data analytics, which is a staple of business intelligence (BI) today. There are several reasons why Hadoop’s had such success, but our favorites are that it was one of the first to market and that it’s led by the open source community.

By offering a Hadoop-based service, public cloud vendors can offer their customers rapidly scalable processing power and storage. On its own, Hadoop requires significant customization depending on the processing needs of the organization using it. Hadoop also helps manage workloads that crank out volumes of data large enough to strain your storage resources. Yelp, a local business directory and review site with social networking features, and an AWS customer, was running Hadoop in-house and deploying big RAID storage arrays to keep up with its growing log file production. According to Yelp, it was pumping out up to 100GB of log files every day.
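To put that log volume in perspective, here is a rough back-of-the-envelope sketch in Python. It uses only the 100GB-per-day figure quoted above and treats it as a constant upper bound, which is an assumption on our part; the function name and the decimal conversion (1TB = 1,000GB) are ours, not Yelp’s.

```python
# Upper-bound daily log volume quoted by Yelp (GB per day).
GB_PER_DAY = 100


def cumulative_log_volume_tb(days, gb_per_day=GB_PER_DAY):
    """Return the total log volume in decimal TB after `days` days,
    assuming the daily rate stays constant (a simplification)."""
    return days * gb_per_day / 1000


if __name__ == "__main__":
    for days in (30, 90, 365):
        total = cumulative_log_volume_tb(days)
        print(f"{days:>3} days: {total:.1f} TB of logs")
```

Even as a ceiling, that kind of growth explains why provisioning storage in-house gets expensive quickly.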

AWS made Hadoop available via the cloud with its Elastic MapReduce (EMR) offering, which came out in early 2009. With AWS, customers run EMR on on-demand EC2 instances and can store data in DynamoDB or S3. By using AWS EMR and S3, Yelp was able to save $55,000 in upfront storage costs while meeting its performance needs. That’s a pretty compelling case for running Hadoop services in the cloud.

Recently, Microsoft released its own Hadoop-based service, Azure HDInsight, which went through three public pre-release versions in 2012. Microsoft partnered with Hortonworks to build out HDInsight.

Azure is certainly an important and up-and-coming public cloud provider, but it’s mainly been playing a “me too” game with AWS, trying to match the competing service feature for feature. That’s a lot of catch-up, as you’d expect, since EMR’s been in commercial operation since 2009 while HDInsight only just got off the ground.

That means EMR has a maturity of both service and technology that’s not quite there yet with HDInsight. One example is that with AWS EMR you can opt for an Elastic Load Balancer, which Azure doesn’t mention at all. And because EMR runs on EC2, its instances are also “available in minutes,” which matches the big virtualized-infrastructure benefit Azure is pitching.

Analyzing Big Data takes massive amounts of processing power (which is why it lends itself so well to cloud-based computing clusters) and huge volumes of data. That means you’ll at least want the option of a wide, well-managed WAN link for reliable connection uptime, as well as big storage buckets. EMR lets you store up to 48TB using multiple deployment choices depending on your needs, along with high-end compute cycles and up to 10Gbps of network throughput. EMR’s maturity provides for all that, while HDInsight still seems to be learning.

Another difference is EMR’s use of the AWS management console to build and manage Elastic MapReduce clusters. Cloud-oriented IT folks are very familiar with the AWS management console, so managing EMR means a much gentler learning curve than wading through a whole new set of tools on Azure. EMR also makes use of MapR technology, which adds important features to the Hadoop platform, like data snapshots and high-availability management, as well as Amazon-specific features, including the ability to mirror EMR clusters across AWS availability zones. MapR has had a long time to integrate with AWS EMR, so its tools are pretty much seamless with EMR’s management capabilities at this point.

Then there’s cost. AWS has been leading the cloud price wars for the last few years against all competitors, not just Microsoft, and those competitors are reacting to AWS rather than pushing ahead on their own. AWS has a free tier for running business applications, which covers an EMR implementation and lasts for one year from sign-up. That allows you to grow your application, understand its long-term scope, including spikes and dips, and then budget accordingly. After that, it moves to AWS’ pay-as-you-go model. At least, that’s the model we’d likely use for mission-critical BI, but there are plenty of customers with different priorities, so EMR supports all of AWS’ pricing models.

An example is BackType, a social analytics company and another AWS EMR customer, which uses approximately 25TB to hold over 100 billion records. To serve its business, BackType implemented an API that can process 400 requests per second, and that was seriously straining both its in-house hardware and its budget. The company now averages around 60 EMR instances, but by using both the reserved and spot instance payment models it can quickly scale up to 150 instances when needed. By leveraging one pricing model against another, the company says it’s saved up to 34% in costs. Those kinds of flexible pricing options aren’t available on other services.
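To make the pricing logic concrete, here is a minimal Python sketch of how mixing reserved and spot instances might compare with running everything on demand. Only the instance counts (60 baseline, 150 peak) come from the BackType example; the hourly rates are made-up placeholders, not real AWS prices, and the function names are ours. The resulting percentage is purely illustrative and is not BackType’s 34% figure.

```python
# Hypothetical hourly rates in USD per instance (NOT real AWS prices).
ON_DEMAND_RATE = 0.50
RESERVED_RATE = 0.30
SPOT_RATE = 0.15


def blended_hourly_cost(baseline, peak, reserved_rate, spot_rate):
    """Hourly cost when the steady baseline runs on reserved instances
    and any burst above the baseline runs on spot instances."""
    burst = max(peak - baseline, 0)
    return baseline * reserved_rate + burst * spot_rate


def on_demand_hourly_cost(count, on_demand_rate):
    """Hourly cost when every instance is billed on demand."""
    return count * on_demand_rate


def savings_fraction(blended, on_demand):
    """Fraction saved by the blended model relative to pure on-demand."""
    return 1 - blended / on_demand


if __name__ == "__main__":
    # Instance counts taken from the BackType example above.
    blended = blended_hourly_cost(60, 150, RESERVED_RATE, SPOT_RATE)
    on_demand = on_demand_hourly_cost(150, ON_DEMAND_RATE)
    print(f"Blended peak cost:   ${blended:.2f}/hour")
    print(f"On-demand peak cost: ${on_demand:.2f}/hour")
    print(f"Savings at peak:     {savings_fraction(blended, on_demand):.0%}")
```

The point is the shape of the calculation: the predictable baseline goes on the cheaper committed rate, and only the burst capacity is exposed to spot pricing, which can fluctuate and be reclaimed.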

The one place where Azure HDInsight may pull ahead is in end-user tools. If your Big Data analytics team is using Excel as its front-end analysis tool, then Azure delivers a Hive ODBC driver and a Hive add-in for Excel. That’s a smart move on Azure’s part, but it can be duplicated on EMR with some front-end planning.

Additionally, Microsoft is only now attempting to become a serious player in the business intelligence space, and whether SQL Server and Excel can really compete against dedicated and far more mature platforms, such as Karmasphere Analyst and other established players, is a big question mark. Those platforms are all allied with Amazon, are available as EMR extensions, and can also be had on AWS’ pricing models (including pay-as-you-go).

Whether Azure, SQL Server and Excel can really compete in BI against rivals like that is still up in the air. Azure needs to prove itself at the service level, not just as the infrastructure-as-a-service (IaaS) and cloud storage provider it’s been so far. Microsoft has introduced new development tools that should allow developers to build such services, so we’ll likely see a closer race in the future. Currently, however, AWS remains our cloud service platform of choice.