HDInsight Services For Windows

This article is the main portal for technical information about HDInsight Services for Windows and related Microsoft technologies. It provides a brief overview of Apache Hadoop and information about the HDInsight Services that Microsoft provides for deployment on both Windows and Windows Azure.

It also provides links to more detailed technical content in various formats.

Note: Contributions are welcome and appreciated. Please feel free to update this and other articles on this Wiki, and to add links to relevant content from both within and outside Microsoft.

Hadoop Overview

Apache Hadoop is an open source software framework that allows for the distributed processing of large data sets across clusters of computers using a simple programming model. It consists of two primary components: the Hadoop Distributed File System (HDFS), a reliable, distributed data store, and MapReduce, a parallel, distributed processing system. A Hadoop cluster can consist of a single node or of thousands of nodes.

HDFS is the primary distributed storage used by Hadoop applications. As you load data into a Hadoop cluster, HDFS splits the data into blocks, creates multiple replicas of each block, and distributes the replicas across the nodes of the cluster. Keeping several copies on different nodes is what enables reliable and extremely rapid computations.
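
The split-and-replicate idea described above can be illustrated with a small Python sketch. This is a toy model, not HDFS code: the function names, the 8-byte block size, the node names, and the round-robin placement are all assumptions made for illustration. Real HDFS blocks are far larger, and its actual placement policy also considers things such as rack location.

```python
def split_into_blocks(data: bytes, block_size: int) -> list[bytes]:
    """Cut data into fixed-size blocks; the last block may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks: int, nodes: list[str], replication: int) -> dict[int, list[str]]:
    """Assign each block to `replication` distinct nodes (hypothetical round-robin policy)."""
    if replication > len(nodes):
        raise ValueError("replication factor exceeds number of nodes")
    placement = {}
    for block_id in range(num_blocks):
        # Consecutive node indices (mod cluster size) are always distinct
        # as long as replication <= len(nodes).
        placement[block_id] = [
            nodes[(block_id + r) % len(nodes)] for r in range(replication)
        ]
    return placement


data = b"abcdefghijklmnopqrstuvwxyz"
blocks = split_into_blocks(data, block_size=8)
placement = place_replicas(len(blocks), ["node1", "node2", "node3", "node4"], replication=3)

for block_id, holders in placement.items():
    print(block_id, blocks[block_id], holders)
```

Because every block lives on three of the four nodes in this sketch, losing any single node still leaves at least two copies of each block available.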

Hadoop MapReduce is a software framework for writing applications that rapidly process vast amounts of data in parallel on a large cluster of compute nodes. A MapReduce job usually splits the input data set into independent chunks. These chunks are processed by map tasks running across the nodes of the Hadoop cluster in a completely parallel manner. The framework sorts the outputs of the maps, which then become the input to the reduce tasks. Typically both the input and the output of the job are stored in a file system. The framework takes care of scheduling tasks, monitoring them, and re-executing failed tasks.
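
The map, sort, and reduce stages described above can be sketched as a single-process word count in Python. This is only an illustration of the programming model, not the Hadoop API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) are hypothetical, and everything runs sequentially in memory, whereas Hadoop runs the map and reduce tasks in parallel across the cluster.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group values by key and sort by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(key, values):
    """Reduce: collapse all counts for one word into a single total."""
    return key, sum(values)


def run_job(chunks):
    """Run map over every chunk, then shuffle and reduce the combined output."""
    mapped = [pair for chunk in chunks for line in chunk for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(mapped))


# Each inner list stands in for one independent chunk of the input data set.
chunks = [["the quick brown fox"], ["the lazy dog", "the end"]]
counts = run_job(chunks)
print(counts)
```

Because each chunk is mapped independently of the others, the map step is the part that a real cluster can spread across many nodes.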

Some of the main advantages of Hadoop are that it can process vast amounts of data (hundreds of terabytes or even petabytes) quickly and efficiently, process both structured and unstructured data, perform the processing where the data is located rather than moving the data to some processing location, and detect and handle failures by design.

There are two other key Apache technologies that are frequently used with Hadoop: Hive and Pig. Hive is a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop-compatible file systems such as HDFS. Hive provides a mechanism to project structure onto this data and to query the data using a SQL-like language called HiveQL. At the same time, this language allows MapReduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express this logic in HiveQL.
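
The idea of projecting structure onto stored data and then querying it can be sketched in plain Python. This is not Hive and does not use any Hive API: the sample lines, the `SCHEMA` list, the `project` helper, and the `visits` table name in the comment are all hypothetical, chosen only to show what applying a schema at read time looks like.

```python
from collections import Counter

# Raw, untyped lines as they might sit in a file: tab-separated text.
RAW_LINES = [
    "2012-10-01\t/home\t200",
    "2012-10-01\t/about\t200",
    "2012-10-02\t/home\t404",
]

# The "projected" structure: column names and types applied when reading.
SCHEMA = [("day", str), ("page", str), ("status", int)]


def project(line, schema):
    """Split one raw line and cast each field according to the schema."""
    fields = line.split("\t")
    return {name: cast(value) for (name, cast), value in zip(schema, fields)}


rows = [project(line, SCHEMA) for line in RAW_LINES]

# Roughly what a HiveQL query such as
#   SELECT page, COUNT(*) FROM visits GROUP BY page
# would compute over the same data.
hits_per_page = Counter(row["page"] for row in rows)
print(dict(hits_per_page))
```

The raw lines are never rewritten; the schema only determines how they are interpreted when the query runs.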

Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turn enables them to handle very large data sets.

Tour through the Microsoft HDInsight dashboard and resources for getting started with the developer preview.

Getting Started with HDInsight Services for Windows Azure

The links in this section provide information on deploying and using Apache Hadoop on the Microsoft Windows Azure Platform. Instead of setting up and managing a Hadoop cluster on Azure yourself, you can use the HDInsight Services for Windows Azure dashboard that Microsoft has made available at hadooponazure.com. This is a preview of HDInsight Services for Windows Azure: you submit MapReduce jobs, along with the data they process, and the service runs them. It lets you process vast amounts of structured and unstructured data without having to set up, configure, maintain, and manage a Hadoop cluster manually.

In this tutorial you will query, explore, and analyze data from Twitter using Apache™ Hadoop™-based Services for Windows Azure and a Hive query in Excel. Social web sites are one of the major
driving forces for Big Data adoption.

With the explosion of data, the open source Apache™ Hadoop™ framework is gaining traction thanks to the large ecosystem that has grown up around its core components, the Hadoop Distributed File System (HDFS™) and Hadoop MapReduce. Being able to use SQL Server together with Hadoop™ is increasingly important because the two are complementary. For instance, petabytes of unstructured data can be stored in Hadoop but may take hours to query, whereas terabytes of structured data can be stored in the SQL Server platform and queried in seconds. This leads to the need to transfer data between Hadoop and SQL Server.