Hadoop

What is Hadoop?

Apache Hadoop is open-source software for storing and analyzing massive amounts of structured and unstructured data: terabytes or more of emails, sensor readings, server logs, Twitter feeds, GPS signals, and just about anything else you can think of. Hadoop can process big, messy data sets for insights and answers, which helps explain all the buzz around it.

A brief history of Hadoop

Created in 2005 by Mike Cafarella and Doug Cutting (who named it after his son's toy elephant), Hadoop was originally intended for web-related search data. Today, it's an open-source, community-built project of the Apache Software Foundation that's used in all kinds of organizations and industries. Microsoft is an active contributor to the community development effort.

Microsoft has logged over 6,000 engineering hours in the last year, committing code and driving innovation in partnership with the open source community across a range of Hadoop projects. In addition, we have committers on Hadoop, and Microsoft employee Chris Douglas is the Apache Working Group Chair for Hadoop.

–David Campbell, Microsoft Fellow and CTO

Built for big data, everyday servers

One reason for Hadoop's popularity is simple economics. Processing big data sets once required supercomputers and other pricey, specialized hardware. Hadoop makes reliable, scalable, distributed computing possible on industry-standard servers, allowing you to tackle petabytes of data and beyond on smaller budgets. Hadoop is also designed to scale from a single server to thousands of machines, and to detect and handle failures at the application layer, so reliability does not depend on specialized hardware.

Researchers at Virginia Tech are using Hadoop to sift through petabytes of DNA data for new cancer therapies and antibiotics.

Insights from all kinds of data

By some estimates, as much as 80 percent of the data organizations deal with today isn't the kind that comes neatly packaged in columns and rows. Instead, it's a messy avalanche of emails, social media feeds, satellite images, GPS signals, server logs, and other unstructured, non-relational files. Hadoop can handle nearly any file or format–its other big advantage–so organizations can pose questions they never thought possible.

By using Windows Azure, HDInsight, and SQL Server 2012, we can collect, analyze, and generate near-real-time BI with Big Data collected from social media feeds, GPS signals, and data from government systems.

Why Hadoop in the cloud?

You can deploy Hadoop in a traditional on-site datacenter. Some companies–including Microsoft–also offer Hadoop as a cloud-based service. One obvious question is: why use Hadoop in the cloud? Here's why a growing number of organizations are choosing this option.

The cloud saves time and money

Open source doesn't mean free. Deploying Hadoop on-premises still requires servers and skilled Hadoop experts to set up, tune, and maintain them. A cloud service lets you spin up a Hadoop cluster in minutes without up-front costs.

See how Virginia Tech is using Microsoft's cloud instead of spending millions of dollars to establish their own supercomputing center.

The cloud is flexible and scales fast

In the Microsoft Azure cloud, you pay only for the compute and storage you use, when you use it. Spin up a Hadoop cluster, analyze your data, then shut it down to stop the meter.

We quickly spun up the Azure HDInsight cluster and processed six years' worth of data in just a few hours, and then we shut it down... processing the data in the cloud made it very affordable.
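
To make the pay-for-what-you-use arithmetic concrete, here is a minimal Python sketch that compares a cluster left running all month with one that is spun up for a few hours and then shut down. Everything in it is an assumption for illustration: the `ClusterRates` type, the `monthly_cost` function, the prices, the node count, and the hours are hypothetical and are not actual Azure or HDInsight rates. It also assumes the data stays in storage while the cluster is shut down, so storage is billed in both cases.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterRates:
    """Hypothetical prices for illustration only; not real Azure rates."""
    node_hour: float         # cost of one node running for one hour
    storage_tb_month: float  # cost of keeping one terabyte stored for a month


def monthly_cost(rates: ClusterRates, nodes: int,
                 cluster_hours: float, stored_tb: float) -> float:
    """Return compute cost (only while the cluster runs) plus storage cost."""
    compute = rates.node_hour * nodes * cluster_hours
    storage = rates.storage_tb_month * stored_tb
    return compute + storage


# Hypothetical scenario: 16 nodes, 10 TB of data kept in storage.
rates = ClusterRates(node_hour=0.50, storage_tb_month=20.0)

# Cluster left running for a 30-day month.
always_on = monthly_cost(rates, nodes=16, cluster_hours=24 * 30, stored_tb=10)

# Cluster spun up for a 5-hour job, then shut down.
on_demand = monthly_cost(rates, nodes=16, cluster_hours=5, stored_tb=10)

print(f"always on: {always_on:.2f}")
print(f"on demand: {on_demand:.2f}")
```

With these made-up numbers, nearly all of the always-on bill is compute time, which is the part that stops when the cluster is shut down. Actual savings depend on real prices and on how often the cluster needs to run.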

The cloud makes you nimble

It was simply so much faster to do this in the cloud with Windows Azure. We were able to implement the solution and start working with data in less than a week.

–Morten Meldgaard, Chr. Hansen

Meet HDInsight: Hadoop in the Azure cloud

Microsoft Azure HDInsight is a 100% Apache Hadoop-based service in the Azure cloud. It offers all the advantages of Hadoop, plus the ability to integrate with Excel, your on-premises Hadoop clusters, and the Microsoft ecosystem of business software and services.