In March, Microsoft corporate vice president Scott Guthrie announced a major upgrade of the company's HDInsight Service Preview to Hortonworks Data Platform (HDP) v1.1.0. The updated preview's DevOps features make it easy for developers with a Windows Azure subscription to create high-performance HDInsight compute clusters with the Windows Azure Management Portal (see Figure 1).

The SQL Server group's new version incorporates the following open source Apache Hadoop components and a redistributable Microsoft Java Database Connectivity (JDBC) driver for SQL Server:

Apache Hadoop, Version 1.0.3

Apache Hive, Version 0.9.0

Apache Pig, Version 0.9.3

Apache Sqoop, Version 1.4.3

Apache Oozie, Version 3.2.0

Apache HCatalog, Version 0.4.1

Apache Templeton, Version 0.1.4

SQL Server JDBC Driver, Version 3.0

Figure 1. Developers or DevOps specialists use the Windows Azure Management Portal to specify the number of compute nodes for an HDInsight Data Services cluster.

A cluster consists of an extra-large head node costing $0.48 per hour and one or more large compute nodes at $0.24 per hour. A small-scale cluster with four compute nodes therefore sets users back $1.44 per hour while deployed, or about $1,000 per month. (Microsoft bases its charges on a list price of $2.88 per hour but discounts billed time to 50% of actual clock hours; details are on Microsoft's HDInsight Preview pricing page.) The first Hadoop on Windows Azure preview provided a prebuilt, renewable three-node cluster with a 24-hour lifetime; a later update increased the lifetime to five days but disabled renewals. Data storage and egress bandwidth charges aren't discounted in the latest preview, but they're competitive with Amazon Web Services' Simple Storage Service (S3). The discounted Azure services don't carry a service-level agreement.
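The pricing above reduces to simple arithmetic. A minimal sketch, using the discounted rates quoted in this article (the 730-hour average month is an illustrative assumption, not a Microsoft billing figure):

```python
# HDInsight Preview back-of-the-envelope cost estimate.
# Rates are the discounted per-hour prices cited in the article;
# the 730-hour month is an assumed average (8,760 hours / 12).
HEAD_NODE_RATE = 0.48      # extra-large head node, $/hour
COMPUTE_NODE_RATE = 0.24   # each large compute node, $/hour
HOURS_PER_MONTH = 730      # assumed average hours in a month

def cluster_cost_per_hour(compute_nodes: int) -> float:
    """Hourly cost of one head node plus N large compute nodes."""
    return HEAD_NODE_RATE + compute_nodes * COMPUTE_NODE_RATE

hourly = cluster_cost_per_hour(4)
monthly = hourly * HOURS_PER_MONTH
print(f"4-node cluster: ${hourly:.2f}/hour, ~${monthly:,.0f}/month")
# → 4-node cluster: $1.44/hour, ~$1,051/month
```

A four-compute-node cluster works out to $1.44 per hour, or roughly $1,050 for a cluster left deployed around the clock, which matches the "about $1,000 per month" figure above and underscores why idle clusters should be deleted.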

Moving HDFS storage from the local file system to Windows Azure blobs

One of Hadoop's fundamental DevOps precepts is to "move compute to the data," which ordinarily requires hosting Hadoop Distributed File System (HDFS) data files and compute operations in the same local file system. Windows Azure-oriented developers are accustomed to working with Azure blob storage, which provides high availability by maintaining three replicas of every stored object. Durability and disaster recovery are enhanced by geo-replicating those three copies to a second Windows Azure data center in the same region, located more than 100 miles from the primary. For example, an Azure blob store created in Dublin (the West Europe subregion) is automatically replicated to Amsterdam (the North Europe subregion). HDFS doesn't provide such built-in availability and durability features.

HDFS running on the local file system delivered better performance than Azure blobs for MapReduce tasks in the HDInsight service's first-generation network architecture by residing in the same file system as the MapReduce compute executables. Windows Azure storage was hobbled by an early decision to separate virtual machines (VMs) for computation from those for storage to improve isolation for multiple tenants.

According to Brad Calder, who leads the Windows Azure storage team, second-generation Quantum 10, or Q10, storage "provides a fully nonblocking, 10-Gbps-based, fully meshed network, providing an aggregate backplane in excess of 50 Tbps of bandwidth for each Windows Azure datacenter." He said the team had set storage account scalability targets to achieve the following by the end of 2012:

Storage accounts have geo-replication on by default to provide geo-redundant storage. End users can turn geo-replication off and use locally redundant storage, which results in reduced prices relative to geo-redundant storage and higher ingress and egress targets.

Denny Lee, a member of the SQL Server team's business intelligence group, and Brad Sarsfield, a principal developer in the Windows Azure group, discussed the performance of blob storage with HDInsight on Azure. In short, their key findings were:

Availability: Azure's average response time was 25% faster than Amazon S3, which had the second-fastest average time.

Scalability: Amazon S3 varied only 0.6% from its average in the scaling tests, with Microsoft Windows Azure varying 1.9% -- both very acceptable levels of variance. HP and Rackspace, the two OpenStack-based clouds, showed a variance of 23.5% and 26.1%, respectively, with performance becoming more and more unpredictable as object counts increased.

The latest HDInsight Service Preview for Azure takes full advantage of the second-generation hardware for flat networking in Microsoft data centers. Azure Storage Vault (ASV) users gain the availability and durability benefits of Windows Azure blob storage for HDFS clusters without the previous performance penalty. Developers must be vigilant in deleting unneeded clusters to avoid substantial billings for unused compute and, to a lesser degree, ASV storage.

About the author: Roger Jennings is a data-oriented .NET developer and writer, a Windows Azure MVP, principal consultant at OakLeaf Systems, and curator of the OakLeaf Systems blog. He's also the author of more than 30 books on the Windows Azure Platform, Microsoft operating systems (Windows NT and 2000 Server), databases (SQL Azure, SQL Server and Access), .NET data access, Web services and InfoPath 2003. More than 1.25 million English copies of his books are in print, and they have been translated into more than 20 languages.
