Amazon slides MapR into elastic Hadoop service

Rolls up 2.0 releases for M3 and M5 distros

Hadoop World 2012 MapR Technologies, one of the main distributors of commercial-grade Hadoop data-munching software, has been tapped by Amazon Web Services to be an alternative to the open source Hadoop stack in the Elastic MapReduce service that Amazon sells to people who don't want to manage their own Hadoop clusters.

At the same time, MapR is trotting out the 2.0 release of its M3 open source and M5 open-core Hadoop distributions.

Until this week, if you were using Elastic MapReduce and you went to the configuration file to set up a MapReduce service (which AWS automagically spits out onto an appropriately sized cluster to fit your budget and job size), you were given two options: the open source Hadoop 0.20 or Hadoop 0.20.205 from the Apache Software Foundation.

Starting this week, however, you now have two more options: MapR M3 v1.2 or M5 v1.2, which were announced in December 2011 and therefore have had the kinks worked out of them.

The M3 and M5 v1.2 releases were also packaged up to run inside of a VMware ESXi hypervisor to allow for the creation of a baby demo Hadoop cluster that could run on a laptop or server, and it was not that much of a leap to spin up the distros into Amazon Machine Image (AMI) formats to run atop Amazon's home-tweaked Xen hypervisor used for its EC2 compute cloud and therefore underneath the Elastic MapReduce service.

The big news is not that there is an AMI for running the MapR variant of Hadoop, but rather that Amazon has made the MapR code – rather than the Cloudera, Hortonworks, or IBM variants – a default alternative to its own rollup of Apache Hadoop.

Jack Norris, vice president of marketing at MapR, tells El Reg that this is particularly important given that 90 per cent of the Hadoop work running on the Amazon cloud is through Elastic MapReduce, not by companies setting up their own virty clusters on EC2 and S3.

"EMR is really how people consume Hadoop on Amazon," says Norris.

And while the AMIs for the M3 and M5 Hadoop distros are available for companies to license on the Amazon Marketplace that debuted two months ago, and run on clusters they configure themselves, the value in EMR is that it can spawn hundreds of virtual servers running the code in about five minutes, and then get to work data munching. See how many cans of Jolt or Red Bull it takes for you to do the same.

The M3 v1.2 distribution has the same cost on EMR as the two AMIs packaged up by Amazon for the service. If you want to use the M5 distribution, which offers NFS mounting of the Hadoop Distributed File System (HDFS) underneath Hadoop, then it costs an extra 10 cents per hour atop the EMR fees on a standard large instance (m1.large in the AWS lingo), and an extra 72 cents per hour for a dedicated cluster compute eight extra large instance (which is called cc2.8xlarge and which is essentially a whole physical server).

That price includes 24x7 tech support from MapR, which is just thrilled to get a piece of the Amazon action – particularly since MapR's only other route to market is through EMC's Greenplum data warehousing and analytics division.
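To put the M5 premium in perspective, here is a minimal sketch of the arithmetic in Python. The two hourly surcharges are the figures quoted above; the function name, the cluster sizes, and the assumption that the surcharge is billed per instance for each full hour are illustrative and not taken from Amazon's or MapR's price lists. The base EMR and EC2 fees are deliberately left out.

```python
from decimal import Decimal

# M5 surcharge per instance-hour in US dollars, as quoted in the article.
M5_SURCHARGE_PER_HOUR = {
    "m1.large": Decimal("0.10"),
    "cc2.8xlarge": Decimal("0.72"),
}


def m5_surcharge(instance_type: str, instances: int, hours: int) -> Decimal:
    """Return the extra cost of M5 over M3, excluding base EMR and EC2 fees.

    Assumes (hypothetically) that the surcharge applies to every instance
    in the cluster for every full hour it runs.
    """
    if instances < 0 or hours < 0:
        raise ValueError("instances and hours must be non-negative")
    return M5_SURCHARGE_PER_HOUR[instance_type] * instances * hours


if __name__ == "__main__":
    # A hypothetical 100-node m1.large cluster running for one day.
    print(m5_surcharge("m1.large", 100, 24))
    # A hypothetical 10-node cc2.8xlarge cluster running for five hours.
    print(m5_surcharge("cc2.8xlarge", 10, 5))
```

Under those assumptions, a 100-node m1.large cluster running for a day adds $240 in M5 fees, and ten cc2.8xlarge nodes running for five hours add $36.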

Amazon EMR configuration screen with MapR options

The M5 release also includes distributed NameNode and JobTracker management nodes for extra resiliency. In addition, Amazon and MapR have done tweaks to both the M3 and M5 code running in the AMI to tune it for the EC2 compute and S3 storage utilities; they have also done work to interface the MapR Hadoopery with Amazon's DynamoDB NoSQL data store and its CloudWatch management tool.

This last item is particularly important because of the data compression that M5 offers to help speed up throughput on Hadoop jobs and the snapshotting capability that M5 has, which allows for point-in-time recovery snapshots to be taken of HDFS and dumped to S3.

Amazon does the Level 1 tech support on the M3 and M5 instances running on EMR, while MapR does Levels 2 and 3 support.

While MapR is rolling out the 2.0 releases of its M3 and M5 distributions this week as well, these are not yet available on Amazon's EMR service. But they will be shortly, says Norris.

MapR M3 and M5 are based on the Apache Hadoop 1.0 stack, with lots of extra patches thrown in by MapR and Amazon. The Amazon tweaks are the tunings for EC2 and S3, while the MapR tweaks are for its in-memory sharding of the NameNode data and replication to disk. This eliminates the single-point-of-failure issue of the standard Hadoop NameNode, which keeps track of which chunks of data are stored on what spindle in what server in the Hadoop cluster.

There's only one NameNode in a normal Hadoop cluster, although the Apache Hadoop 2.0 stack, which is in alpha testing now, has some replication services to provide HA for this node. Cloudera is using it in its latest release, while Hortonworks is plunking the NameNode in a VMware ESXi VM and using vSphere and Site Recovery Manager high availability extensions to replicate the NameNode.
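For readers who want a feel for what spreading NameNode-style metadata across shards means, here is a toy Python sketch. It is emphatically not MapR's implementation: the class name, the hash-based placement, and the in-memory dictionaries are all assumptions made for illustration, and replication to disk is left out entirely. It only shows how a block-location index can be split so that no single map holds every entry.

```python
import hashlib
from collections import defaultdict


class ShardedBlockMap:
    """Toy block-location index split across several metadata shards."""

    def __init__(self, shard_count: int) -> None:
        if shard_count < 1:
            raise ValueError("shard_count must be at least 1")
        # Each shard maps a file path to a list of (server, disk) locations.
        self.shards = [defaultdict(list) for _ in range(shard_count)]

    def _shard_for(self, path: str) -> int:
        # A stable hash of the path picks the shard that owns its metadata.
        digest = hashlib.sha256(path.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % len(self.shards)

    def add_block(self, path: str, server: str, disk: str) -> None:
        """Record that a chunk of `path` lives on `disk` in `server`."""
        self.shards[self._shard_for(path)][path].append((server, disk))

    def locate(self, path: str) -> list:
        """Return the known (server, disk) locations for `path`."""
        return list(self.shards[self._shard_for(path)].get(path, []))


if __name__ == "__main__":
    index = ShardedBlockMap(shard_count=4)
    index.add_block("/logs/2012-06-13", "server-1", "disk-0")
    index.add_block("/logs/2012-06-13", "server-2", "disk-3")
    print(index.locate("/logs/2012-06-13"))
```

In a stock Hadoop 1.0 cluster the equivalent of this index lives in a single NameNode process, which is why losing that one node is such a problem.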

Norris says that the MapR v2.0 Hadoop distros have features to allow a single cluster to be carved up into isolated sections so you can do multi-tenancy and run multiple MapReduce jobs across those sections rather than having to set up multiple, separate clusters. You can also use MapR internally and do replication out to the Amazon cloud, or do inter-cluster mirroring between different AWS availability zones (which are isolated chunks of EC2 within a single AWS region).

The 2.0 release has centralized logging and central configuration – you don't have to hop from node to node tweaking the Hadoop cluster or troubleshooting it – and also sports LZ4, LZF, and GZIP compression algorithms. HBase (the distributed database that rides atop HDFS), Pig (the high-level language to create MapReduce routines), and Hive (the ad-hoc query language and data warehousing tool that works with HDFS) have been updated to the latest stable releases in the MapR stacks.

MapR is now supporting SELinux security with its Hadoop distros, and has added SUSE Linux Enterprise Server 11 (including the SP1 and SP2 updates) as an operating system on which M3 or M5 can run. Prior releases as well as the MapR 2.0 releases ran on Red Hat Enterprise Linux 5 and 6, Canonical Ubuntu 9.04 and higher, and CentOS 5 and 6.

The M3 and M5 v2.0 distros are in a public beta now, and the software will be generally available in the third quarter. M3 is free and M5 costs $4,000 per node for the license to the proprietary extensions to the stack and a year of technical support for the code. ®