Upload data to HDInsight Cluster

Download the file w3c_2MB.txt to your local drive, say to ‘C:\w3c_2MB.txt’.

Since HDInsight uses Azure Blob Storage as its distributed file store (unlike non-cloud Hadoop clusters, which use the default HDFS implementation based on the local file system), you can choose your preferred tool to upload data. We will use Windows Azure PowerShell to upload the sample data.

Open the Windows Azure PowerShell console.

Run the command below; you will be prompted to log in to your Azure account:

Add-AzureAccount

Paste the following script into the Windows Azure PowerShell console window to run it.
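The upload script is not reproduced here, so the following is a sketch of what it typically looks like using the Azure PowerShell storage cmdlets. The subscription, storage account, container, and blob names are placeholders you must replace with your own values:

```powershell
# Placeholders - replace with your own subscription/storage details
$subscriptionName   = "<SubscriptionName>"
$storageAccountName = "<StorageAccountName>"
$containerName      = "<ContainerName>"

$fileName = "C:\w3c_2MB.txt"          # the file downloaded earlier
$blobName = "w3c/data/w3c_2MB.txt"    # assumed blob path for this walkthrough

# Select the subscription and build a storage context from the account key
Select-AzureSubscription $subscriptionName
$storageAccountKey = (Get-AzureStorageKey $storageAccountName).Primary
$destContext = New-AzureStorageContext -StorageAccountName $storageAccountName `
                                       -StorageAccountKey $storageAccountKey

# Upload the local file to the blob container
Set-AzureStorageBlobContent -File $fileName -Container $containerName `
                            -Blob $blobName -Context $destContext
```

Because HDInsight reads directly from Blob Storage, once the upload finishes the file is immediately visible to the cluster without any further copy step.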

Initialize Hive session in Windows Azure PowerShell

Since we are already using Windows Azure PowerShell, we will continue to use it to submit Hive jobs. Quite frankly, PowerShell is not the best tool for ad-hoc interaction with Hive. To debug Hive jobs, it is currently better to remote-connect to the cluster and use the built-in Hive CLI. Even for automation, it is simpler to submit jobs via the .NET SDK. The Hive support in Windows Azure PowerShell can come in handy if you already have an automated PowerShell script and want to keep everything in that script.

Before you start running Hive queries via Windows Azure PowerShell (using the Invoke-Hive cmdlet), you need to select your cluster. Paste the following script into the Windows Azure PowerShell console window to run it.
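The selection script is not shown, but it amounts to pointing the session at your cluster with Use-AzureHDInsightCluster; the cluster name below is a placeholder:

```powershell
# Placeholder - replace with your HDInsight cluster name
$clusterName = "<HDInsightClusterName>"

# Bind subsequent Invoke-Hive calls in this session to the cluster
Use-AzureHDInsightCluster $clusterName
```

After this, Invoke-Hive submits queries to the selected cluster without needing the cluster name on every call.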

Create Non-Partitioned Hive Table

The script below creates a new Hive table named ‘w3c’ and loads it with the data that we uploaded to the blob in the previous section.
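A sketch of such a script, run through Invoke-Hive: the column list is an assumption based on typical W3C/IIS log fields, and the wasb:// path mirrors the placeholder container, account, and blob names used for the upload. Adjust all of these to match your actual data:

```powershell
Invoke-Hive -Query @"
CREATE TABLE w3c(
    logdate string, logtime string, c_ip string,
    cs_username string, s_ip string, s_port string,
    cs_method string, cs_uri_stem string, cs_uri_query string,
    sc_status int, sc_bytes int, cs_bytes int,
    time_taken int, cs_agent string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
STORED AS TEXTFILE;

LOAD DATA INPATH 'wasb://<ContainerName>@<StorageAccountName>.blob.core.windows.net/w3c/data/w3c_2MB.txt'
OVERWRITE INTO TABLE w3c;
"@
```

Note that this is a non-partitioned table: every query over it scans the whole data set, which is the baseline we will compare against later.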

You should know that operations like create/delete execute faster when using New-AzureHDInsightHiveJobDefinition and Start-AzureHDInsightJob instead of Invoke-Hive (more details here). I chose to use Invoke-Hive for all the Hive queries in this post to keep things simple.
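For reference, a sketch of the job-definition approach, using a trivial DROP TABLE statement as the example query (the cluster name is a placeholder):

```powershell
$clusterName = "<HDInsightClusterName>"

# Package the HiveQL statement as a job definition and submit it to the cluster
$jobDef = New-AzureHDInsightHiveJobDefinition -Query "DROP TABLE IF EXISTS w3c;"
$job = Start-AzureHDInsightJob -Cluster $clusterName -JobDefinition $jobDef

# Block until the job finishes (timeout in seconds)
Wait-AzureHDInsightJob -Job $job -WaitTimeoutInSeconds 3600
```

The trade-off is verbosity: Invoke-Hive is one line and returns the query output directly, while the job-definition route gives you an asynchronous job handle you must wait on yourself.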