
Developing for HDInsight

Windows Azure HDInsight provides the capability to dynamically provision clusters running Apache Hadoop to process Big Data. You can find more information here in the initial blog post for this series, and you can click here to get started using it in the Windows Azure portal. This post enumerates the different ways for a developer to interact with HDInsight, first by discussing the different scenarios and then by diving into the variety of capabilities in HDInsight. Because HDInsight is built on top of Apache Hadoop, there is a broad and rich ecosystem of tools and capabilities that one can leverage.

As we've worked with customers, two distinct scenarios have emerged: authoring jobs, where one uses the tools to process big data, and integrating HDInsight with your application, where the input and output of jobs are incorporated into a larger application architecture. One key design aspect of HDInsight is its integration with Windows Azure Blob Storage as the default file system. This means that you can use existing tools and APIs for accessing data in Blob Storage to interact with your data. This blog post goes into more detail on our utilization of Blob Storage.

For authoring jobs, a wide array of tools is available. At a high level, there is a set of tools that are part of the existing Hadoop ecosystem, a set of projects we've built to get .NET developers started with Hadoop, and work we've begun to leverage JavaScript for interacting with Hadoop.

Job Authoring

Existing Hadoop Tools

As HDInsight leverages Apache Hadoop via the Hortonworks Data Platform, there is a high degree of fidelity with the Hadoop ecosystem. As such, many capabilities will work “as-is.” This means that investments and knowledge in any of the following tools carry over to HDInsight. Clusters are created with the following Apache projects for distributed processing:

Hive uses a syntax similar to SQL to express queries that compile to a set of Map/Reduce programs. Hive has support for many of the constructs that one would expect in SQL (aggregation, groupings, filtering, etc.), and easily parallelizes across the nodes in your cluster.
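To make the "compiles to Map/Reduce" idea concrete, here is a minimal Python sketch of how an aggregation such as a COUNT(*) with a GROUP BY might break down into map, shuffle, and reduce phases. It is an illustration only and not Hive's actual execution plan; the function names and the sample rows are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(rows: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Emit (key, 1) for every row, as a mapper would for COUNT(*) grouped by key."""
    for row in rows:
        yield row.strip().lower(), 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group values by key, standing in for the framework's shuffle/sort step."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_phase(grouped: Dict[str, List[int]]) -> Dict[str, int]:
    """Sum the values for each key, as a reducer would."""
    return {key: sum(values) for key, values in grouped.items()}


def count_by_key(rows: Iterable[str]) -> Dict[str, int]:
    """Run the three phases in sequence on an in-memory data set."""
    return reduce_phase(shuffle(map_phase(rows)))


if __name__ == "__main__":
    # Hypothetical sample data: one log level per row.
    sample = ["error", "info", "error", "warn", "info", "error"]
    print(count_by_key(sample))
```

On a real cluster the map and reduce phases run in parallel across nodes; this sketch runs them sequentially in one process purely to show the shape of the computation.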

Oozie is a workflow scheduler for managing a directed acyclic graph of actions, where actions can be Map/Reduce, Pig, Hive or other jobs. You can find more details in the quick start guide here.
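As a rough illustration of what scheduling a directed acyclic graph of actions involves, the Python sketch below orders a small workflow so that each action runs only after the actions it depends on. This is not how Oozie workflows are defined (Oozie has its own workflow definition format); the action names and the dependency dictionary are hypothetical, and the sketch assumes Python 3.9 or later for the standard-library graphlib module.

```python
from graphlib import CycleError, TopologicalSorter
from typing import Dict, List, Set


def execution_order(workflow: Dict[str, Set[str]]) -> List[str]:
    """Return one valid run order for a workflow.

    `workflow` maps each action name to the set of actions that must
    finish before it can start. Raises CycleError if the graph is not acyclic.
    """
    return list(TopologicalSorter(workflow).static_order())


if __name__ == "__main__":
    # Hypothetical three-step pipeline: clean with Pig, aggregate with Hive,
    # then export the result with Sqoop.
    workflow = {
        "pig_clean": set(),
        "hive_aggregate": {"pig_clean"},
        "sqoop_export": {"hive_aggregate"},
    }
    print(execution_order(workflow))

    try:
        execution_order({"a": {"b"}, "b": {"a"}})
    except CycleError:
        print("cycle detected: not a valid workflow")
```

The acyclic requirement matters because a cycle would leave no action that is safe to run first, which is why the sketch treats a cycle as an error rather than trying to resolve it.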

You can find an updated list of Hadoop components here. The table below represents the versions for the current preview:

Component           Version
Apache Hadoop       1.0.3
Apache Hive         0.9.0
Apache Pig          0.9.3
Apache Sqoop        1.4.2
Apache Oozie        3.2.0
Apache HCatalog     0.4.1
Apache Templeton    0.1.4

Additionally, other projects in the Hadoop space, such as Mahout (see this sample) or Cascading, can easily be used on top of HDInsight. We will be publishing additional blog posts on these topics in the future.

.NET Tooling

We're working to build out a portfolio of tools that allow developers to leverage their skills and investments in .NET to use Hadoop. These projects are hosted on CodePlex, with packages available from NuGet to author jobs to run on HDInsight. For instructions on these, please see the getting started pages on the CodePlex site.

We currently provide .NET clients for these APIs, available here, and one can easily build clients using the HTTP stacks in other languages as well.
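As a sketch of what a client built on another language's HTTP stack could look like, the Python example below constructs (but does not send) a request that submits a Hive query. It assumes the APIs in question include the REST interface of Apache Templeton, which appears in the version table above; the /templeton/v1/hive path and the user.name, execute, and statusdir parameters come from the Templeton documentation, while the base URL, user name, query, and status directory are hypothetical placeholders. Authentication is deliberately left out, since how a given cluster secures this endpoint is not covered here.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_hive_request(base_url: str, user: str, query: str, status_dir: str) -> Request:
    """Build a POST request for Templeton's Hive endpoint without sending it."""
    form = urlencode(
        {
            "user.name": user,        # user to run the job as
            "execute": query,         # Hive statement to execute
            "statusdir": status_dir,  # directory where job status/output is written
        }
    ).encode("utf-8")
    return Request(
        url=f"{base_url.rstrip('/')}/templeton/v1/hive",
        data=form,
        method="POST",
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )


if __name__ == "__main__":
    # Hypothetical values; replace with your own cluster endpoint and query.
    request = build_hive_request(
        base_url="https://cluster.example.invalid",
        user="hadoopuser",
        query="SELECT COUNT(*) FROM sample_table",
        status_dir="/example/status",
    )
    print(request.get_method(), request.full_url)
    print(request.data.decode("utf-8"))
```

To actually submit the job, the request would be passed to urllib.request.urlopen together with whatever credentials the cluster requires.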

Connectivity via ODBC

By leveraging the ODBC client (instructions here), one can easily integrate existing applications, such as Excel, with data stored in Hive tables in HDInsight.

Debugging/Testing

To let you work disconnected from a cluster running in Azure, we have provided the HDInsight Developer Preview, a one-box setup that is easily installed from the Web Platform Installer. You can use this to experiment, debug, and test all of the technologies above on a smaller set of data. You can then deploy the artifacts to Azure and run against your big data in Blob Storage. To install it, simply search for HDInsight inside the Web Platform Installer, or click here to install directly from the web.

Summary

This post covered the wide array of options that you have for writing Hadoop jobs as well as for integrating HDInsight into your applications. HDInsight enables you to develop with the platform and tools of your choice, from Java to .NET to JavaScript, on top of clusters that are easily deployed and managed using Windows Azure.

The final post in our 5-part series on HDInsight will explore how to analyze data from HDInsight with Excel. Stay tuned!