This is a follow-up blog post based on the Intro to Data Factory session I gave on Training on the T’s with Pragmatic Works. You can find more free training, both past and upcoming, here. I presented my session on January 13, 2015.

Intro To Data Factory

In this session, I gave a simple introduction to the new Azure Data Factory using a CopyActivity pipeline between Azure Blob Storage and an Azure SQL Database. Below is a diagram illustrating the factory that is created in the demo.
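To give a rough idea of what the demo builds, a preview-era pipeline definition with a copy activity from blob storage to SQL might look like the JSON sketch below. The names (MovieCopyPipeline, MovieBlobInput, MovieSqlOutput) are placeholders rather than the actual demo files, and the exact schema changed during the preview, so treat this as illustrative only:

```json
{
  "name": "MovieCopyPipeline",
  "properties": {
    "description": "Copy Movies.csv from blob storage to Azure SQL Database",
    "activities": [
      {
        "name": "BlobToSqlCopy",
        "type": "CopyActivity",
        "inputs": [ { "name": "MovieBlobInput" } ],
        "outputs": [ { "name": "MovieSqlOutput" } ],
        "transformation": {
          "source": { "type": "BlobSource" },
          "sink": { "type": "SqlSink" }
        }
      }
    ],
    "start": "2015-01-13T00:00:00Z",
    "end": "2015-01-14T00:00:00Z"
  }
}
```

The pipeline itself only wires datasets together; the connection details live in the linked service and dataset definitions.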

I have published my presentation materials here. This includes the sample JSON files, the Movies.csv, and PowerShell scripts.

Q & A

Here are a few questions that were answered during the session.

1. Does Availability refer to when data that has been transferred will be available? Or when the data source is actually available for query?

Availability refers to when a dataset will make a slice available. This is when the dataset can be consumed as an input or targeted as an output. This means you can consume data hourly but choose to push it to its final destination on a different cadence to prevent issues on the receiving end.
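To make that concrete, here is a hedged sketch of where availability appears in a preview-era dataset (table) definition. The dataset, table, and linked service names are placeholders; the frequency/interval pair is what drives the hourly slices described above:

```json
{
  "name": "MovieSqlOutput",
  "properties": {
    "location": {
      "type": "AzureSqlTableLocation",
      "tableName": "Movies",
      "linkedServiceName": "AzureSqlLinkedService"
    },
    "availability": {
      "frequency": "Hour",
      "interval": 1
    }
  }
}
```

Setting "frequency": "Day" on a downstream dataset while the upstream dataset runs hourly is how you get the different cadences mentioned in the answer.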

2. What is required to use Data Factory?

An Azure account is the only real must-have. You could use two on-premises SQL Server instances as the source and destination. Depending on your scenario, you may also want:

An HDInsight cluster, if you want to use the HDInsight activities

An Azure Storage account, to use blob or table storage

3. How do you decide to use a Factory or Warehouse?

The factory is more of a data movement tool. A warehouse could be a source or target of a factory pipeline.

4. Is this similar to SSIS in SQL Server?

Yes and no. SSIS is definitely more mature and has more tooling available, such as data sources and transformations. SSIS also has a good workflow designer. The initial focus of Data Factory was to load HDInsight tables from a variety of sources with more flexibility. The other note here is that Data Factory is being built from the ground up to support cloud scale on Azure.

5. Can this be used for Big Data?

Absolutely. I would say that it is one of the primary reasons for the tool. In reference to the previous question, it will likely be the tool of choice for big data operations because it will be able to scale with Azure.

Links to Additional Resources on Data Factory or tools that were used in the presentation:

After all the hype about Big Data, Hadoop, and now HDInsight, I decided to build out my own big data cluster on HDInsight. My overall goal is to have a cluster I can use with Excel and Data Explorer; after all, I needed more data in my mashups. I am not going to get into the details or definitions of Big Data; there are entire books on the subject. I will discuss any issues or tidbits I run into along the way.

Setting Up the Environment

I am actually doing this on a VM on my Windows 8 laptop. I created a Windows Server 2012 VM with 1 GB of RAM and 50 GB of storage. (Need some help creating a VM in Windows 8? Check out my post on the subject.)

Installing the HDInsight Server

First, this product is still in Preview at the time of this writing, so your mileage may vary and things will likely change over the next few months. You will find the installer at http://www.microsoft.com/web/gallery/install.aspx?appid=HDINSIGHT-PREVIEW. This uses the Microsoft Web Platform Installer. When prompted, I just ran the installer. The installation took about an hour to complete on my VM setup. Once it completed, it opened the dashboard view in IE.

At this point we have installed a cluster called “local (hdfs)”.

Exploring My Local Cluster

Well, things did not go well at first. Whenever I clicked the big gray box to view my dashboard, I received the following error: “Your cluster ‘local (hdfs)’ is not responding. Please click here to navigate to cluster.” I clicked “here” and ended up on an IIS start page. Not really helpful. Let the troubleshooting begin.

Based on this forum issue response, I opened the services window to find that none of my Apache Hadoop services were running after a restart AND they were set to manual. To resolve this I took two steps. First, I changed all of my services to run automatically. This makes sense for my situation because the VM would be running when I wanted to use HDInsight. Second, I used the command line option to restart all of the services as also noted in the forum post above.

From a command prompt, execute the following command to restart all Hadoop services:

c:\hadoop\start-onebox
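If you would rather script the first step (switching the services to automatic startup) instead of clicking through the Services console, a small PowerShell sketch like the following could do it. It assumes the preview installs its services with display names starting with “Apache Hadoop” and must be run from an elevated prompt:

```powershell
# Find every service whose display name suggests it belongs to the
# Apache Hadoop install, and set each one to start automatically
# with Windows so a VM restart does not leave the cluster down.
Get-Service |
    Where-Object { $_.DisplayName -like "Apache Hadoop*" } |
    ForEach-Object { Set-Service -Name $_.Name -StartupType Automatic }
```

This only changes the startup type; you still need the start-onebox command above (or a restart) to actually bring the services up.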

And, VOILA!, my cluster is now running.

Maybe we can get a better error message next time.

At this point I walked through the Getting Started option on the home screen and proceeded to do “Hello World”. I used these samples as intended to get data into my cluster and start working with the various tools. Stay tuned for more posts in the future on my Big Data adventures.

Why Not HDInsight Service on Azure?

The primary reason I did not use the HDInsight Service on Azure was that I did not want to risk the related charges. Once I have a good understanding of how HDInsight Server works, I will be more comfortable working with HDInsight Service.

From my About.Me page

Family man - Scoutmaster - Data Pro

I have been married to my beautiful wife Sheila since 1993. We are the parents of four awesome kids who keep us busy with all of their adventures including Burnsville High School Marching Band, Winter Drum Line, basketball, dance, scouts (boy & girl) and even learning to drive.

I am also the Scoutmaster for BSA Troop 226 in Savage, MN where I enjoy mentoring boys and seeing them become young men. Along the way we get to serve the community and spend a lot of time outdoors.

I am a Business Intelligence Architect at Pragmatic Works. I am passionate about using data effectively and helping customers understand that data is valuable and profitable. Not only do I do this for customers, I have also delivered over 30 presentations on SQL Server and data architecture at local, regional, and national conferences. I am also a regular blog contributor at http://dataonwheels.com.