Big Data Support: This is the team blog for the Big Data Support team at Microsoft. We support HDInsight, which is Hadoop running on Windows Azure in the cloud, as well as other big data features.

Sqoop Job Performance Tuning in HDInsight (Hadoop)
http://blogs.msdn.com/b/bigdatasupport/archive/2015/02/17/sqoop-job-performance-tuning-in-hdinsight-hadoop.aspx
2015-02-18
<p><strong>Overview
</strong></p><p><a href="http://sqoop.apache.org/">Apache Sqoop</a> is designed for efficiently transferring bulk data between <a href="http://hadoop.apache.org/">Apache Hadoop</a> and structured datastores such as relational databases. HDInsight is Hadoop cluster deployed in Microsoft Azure and it includes Sqoop. When transferring small amount of data Sqoop performance is not an issue. However, when transferring huge amount of data it is important to consider the things that can improve the performance to keep the execution time within the desirable limit.
</p><p><img src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78-metablogapi/6622.021915_5F00_0206_5F00_SqoopJobPer1.png" alt=""/><strong>
</strong></p><p><strong>Increase the number of parallel tasks by using an appropriate value for the –m parameter
</strong></p><p>A Sqoop job essentially boils down to a set of map tasks (there is no reducer), so tuning a Sqoop job is much the same as optimizing a map-reduce job, or at least that is where one should start. Therefore, the <strong>first thing</strong> to consider when improving the performance of a Sqoop job is to increase the number of parallel tasks; in other words, increase the number of mappers to utilize the maximum available resources in the cluster. This may require some experimentation with your dataset and the system on which Sqoop is running. The argument is "-m, --num-mappers". By default –m is set to 4, so if it is not specified Sqoop will use only four map tasks in parallel. In general you want a higher value of –m to increase the degree of parallelism and hence the performance. However, it is not recommended to increase the degree of parallelism beyond the resources available in the cluster, because the extra mappers will simply run serially and will likely increase the time required to complete the job.
</p><p>Now the question arises: how do you determine the right value for –m? There is no perfect way to find that magic number, but you can determine an approximate range based on your cluster size and then test to find out which value gives you the best results. Hadoop 2.x (HDI 3.x) uses YARN, and each YARN task is assigned a container which has a memory limit; in other words, each mapper requires a container to run. So if you can get a rough estimate of the maximum number of containers available in your cluster, you can use that number for –m as a starting point, assuming there is no other job running in the cluster and you do not want to run multiple sets of mappers in serial. The number of available containers in a cluster depends on a few configuration settings. Based on the <a href="http://azure.microsoft.com/en-gb/documentation/articles/hdinsight-release-notes/">HDInsight release notes</a>, the following are the default settings in HDInsight for the mapper, reducer, and AM (Application Master) as of the 10/7/2014 release:
</p><p><span style="font-family:Courier New">mapreduce.map.memory.mb = 768
</span></p><p><span style="font-family:Courier New">mapreduce.reduce.memory.mb = 1536
</span></p><p><span style="font-family:Courier New">yarn.app.mapreduce.am.resource.mb = 768
</span></p><p style="text-align: center"><img src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78-metablogapi/6710.021915_5F00_0206_5F00_SqoopJobPer2.png" alt=""/>
</p><p>Currently each data node of an HDInsight cluster uses a <a href="http://msdn.microsoft.com/en-us/library/azure/dn197896.aspx">Large Size</a> Azure PaaS VM, which has 4 cores and 7 GB RAM. Of that, about 1 GB is used by the NodeManager (the NodeManager daemon's heap size is set via yarn-env.sh through its YARN_NODEMANAGER_HEAPSIZE environment variable). So with the remaining 6 GB you can have a maximum of (6*1024)/768 = 8 mapper containers per worker node. The reducers are configured to use twice as much memory (1536 MB) as mappers, but in a Sqoop job there is no reducer. Let's assume we have a 16-node cluster and no other job is running: the total number of available containers, or the maximum number of parallel map tasks, is 8x16 = 128. So if you do not want to run multiple sets of map tasks in serial and there is no other job running in the cluster, you can set –m to 128.</p>
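<p>To illustrate, here is a minimal PowerShell sketch that works through the same estimate and then submits a Sqoop export with that mapper count. The cluster name, connection string, table, and export directory are placeholders, and the cmdlet names are from the service management Azure PowerShell module referenced later in this post; treat it as a sketch rather than a drop-in script.</p>
<pre class="brush: ps; auto-links: false;"># Rough estimate of parallel map tasks, using the defaults quoted above
$usableMbPerNode   = 6 * 1024          # ~6 GB left per worker node after the NodeManager
$mapContainerMb    = 768               # mapreduce.map.memory.mb default
$workerNodes       = 16
$containersPerNode = [math]::Floor($usableMbPerNode / $mapContainerMb)   # 8
$numMappers        = $containersPerNode * $workerNodes                   # 128

# Submit the Sqoop export with that degree of parallelism (placeholder names and connection string)
$sqoopCommand = "export --connect 'jdbc:sqlserver://myserver.database.windows.net;database=mydb;user=myuser;password=...' --table MyTable --export-dir /example/data/export -m $numMappers"
$sqoopDef = New-AzureHDInsightSqoopJobDefinition -Command $sqoopCommand
$job = Start-AzureHDInsightJob -Cluster "mycluster" -JobDefinition $sqoopDef
Wait-AzureHDInsightJob -Job $job -WaitTimeoutInSeconds 3600
</pre>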
<p><strong>Use a smaller fs.azure.block.size to increase the number of mappers further.
</strong></p><p>However, the value passed for the –m parameter is only a guide, and the actual number of mappers may differ based on other factors such as the number and size of the input files, dfs.block.size (which is represented by fs.azure.block.size in <a href="http://azure.microsoft.com/en-us/documentation/articles/storage-dotnet-how-to-use-blobs/">Windows Azure Storage Blob</a>, WASB, and is set to 512 MB by default), the maximum split size, and so on. If an individual input file is smaller than the block size, there will be one map task for that input file. However, if an input file is bigger than the block size, the number of mappers for that file will be (file size/block size). Therefore, if you have resources available in the cluster, you can try to increase the number of mappers by setting a smaller value for the block size and see if that improves performance; this is the <strong>second thing</strong> you should consider when tuning the performance of a Sqoop job. For Hive or map-reduce jobs we can set the fs.azure.block.size property to a different value when running the job, but unfortunately the HDInsight PowerShell cmdlet <a href="http://msdn.microsoft.com/en-us/library/dn593741.aspx"><span style="font-family:Courier New">New-AzureHDInsightSqoopJobDefinition</span></a> does not include the [-Defines &lt;Hashtable&gt;] parameter, which allows Hadoop configuration values to be set at job execution time. However, we can always <a href="http://blogs.msdn.com/b/bigdatasupport/archive/2014/04/15/customizing-hdinsight-cluster-provisioning-via-powershell-and-net-sdk.aspx">provision a customized HDInsight cluster</a> and set fs.azure.block.size to a smaller value when creating the cluster if needed. This changes the default at the cluster level, and that value will then be used for all jobs running in the cluster.</p>
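<p>As a rough illustration, the following sketch sets a smaller fs.azure.block.size (256 MB here, expressed in bytes) while provisioning a customized cluster. The storage account, container, key, and cluster details are placeholders, and the Add-AzureHDInsightConfigValues usage assumes the service management Azure PowerShell module; adjust it to your environment before using it.</p>
<pre class="brush: ps; auto-links: false;"># Build a cluster config and override fs.azure.block.size at the cluster level (placeholder values)
$config = New-AzureHDInsightClusterConfig -ClusterSizeInNodes 16 |
    Set-AzureHDInsightDefaultStorage -StorageContainerName "somecontainer" `
        -StorageAccountName "somedefaultstorage.blob.core.windows.net" `
        -StorageAccountKey "U09NRVRISU5HIEVOQ09ERUQgSU4gQkFTRTY0Lg==" |
    Add-AzureHDInsightConfigValues -Core @{ "fs.azure.block.size" = "268435456" }   # 256 MB in bytes

# Provision the cluster with the customized configuration
$config | New-AzureHDInsightCluster -Name "mycluster" -Location "West US" -Credential (Get-Credential)
</pre>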
</p><p style="background: white"><strong>Is my cluster too small to handle the data?
</strong></p><p style="background: white">After you have tested enough to find optimum values for –m parameter and the fs.azure.block.size and feel that you can't improve the performance of your sqoop job any further then may be it is time to think about increasing the cluster resources or in other words increase the size of your cluster and is the <strong>third thing</strong> you should consider when performance tuning a sqoop job. You should especially consider this when you are transferring huge amount of data and you need to bring down the execution time significantly. To give you some idea I have recently worked on a case where the customer wanted to export about 75GB data and each input file was ~500 MB. Initially we used a 24 node cluster with –m set to 160 and the export took about ~58 hours to complete. Then we tested with a 36 node cluster setting –m as 300 and it took about ~34 hours to compete. This customer didn't want to try setting a smaller value for fs.azure.block.size as they didn't want to change the default in the cluster level. If you are transferring a reasonably huge amount of data then you should start with a reasonable size cluster even before starting to play with –m or fs.azure.block.size parameters. I hope the example of my customer's data and cluster size gives you some idea in that regard.
</p><p><strong>Is the database a bottleneck?
</strong></p><p>Increasing the cluster size or the degree of parallelism will not improve performance indefinitely. For example, if you increase the degree of parallelism beyond what your database can reasonably support, it will not improve the overall performance. Sqoop exports are performed by multiple writers in parallel, and each writer uses a separate connection to the database with its own transaction. Connecting 100 concurrent clients to your database may increase the load on the database server to a point where performance suffers as a result. This brings us to the <strong>fourth thing</strong> you should consider while tuning Sqoop job performance: check whether the database is the bottleneck. If logs captured from the database side show that is indeed the case, you need to figure out whether there is a way to scale up the database. The customer I mentioned earlier was using Azure SQL Database, and we found that the database was a significant bottleneck for performance. We scaled up their Azure SQL Database to a higher performance level (measured in Database Throughput Units, or DTUs), and as a result we were able to improve the overall performance by 50%. <a href="https://msdn.microsoft.com/en-us/library/azure/dn741336.aspx">This</a>
<a href="https://msdn.microsoft.com/en-us/library/azure/dn741336.aspx">Azure SQL Database Service Tiers and Performance Levels</a> MSDN article has more information on different scale up options for Azure SQL Database.
</p><p><strong>Is the storage a bottleneck?
</strong></p><p>HDInsight uses Windows Azure Blob Storage, WASB, for storing data, and <a href="http://azure.microsoft.com/blog/2013/03/21/azure-hdinsight-and-azure-storage/">this Azure document</a> details the benefits of using WASB. However, the HDInsight cluster can be throttled when the throughput rate against Windows Azure Storage blobs exceeds the limits detailed in <a href="http://blogs.msdn.com/b/brian_swan/archive/2013/11/25/maximizing-hdinsight-throughput-to-azure-blob-storage.aspx">this blog post</a>. Therefore, when running Sqoop jobs in an HDInsight cluster, another potential bottleneck is WASB throughput, and that is the <strong>fifth thing</strong> you should consider while tuning the performance of a Sqoop job. You can use the Windows Azure Storage Log Analysis Tool detailed in <a href="http://blogs.msdn.com/b/brian_swan/archive/2014/01/30/analyzing-windows-azure-storage-logs.aspx">this</a> blog post to determine whether that is the case, and then take appropriate measures to mitigate it. While importing data into WASB, you also want to make sure the data size does not cross the WASB blob size limits described in <a href="http://msdn.microsoft.com/en-us/library/dd135726.aspx">this</a> MSDN document; otherwise you may see an error like the one below.
</p><p><span style="font-family:Courier New">Caused by: com.microsoft.windowsazure.services.core.storage.StorageException: The request body is too large and exceeds the maximum permissible limit.<strong>
<p><strong>Two other scenario-specific Sqoop performance tips
</strong></p><p>Let's briefly discuss two other scenario-specific Sqoop performance tips. For Sqoop export you can use the<strong> --batch </strong>argument, which uses batch mode for the underlying statement execution and thus may improve performance; for example, you can set --batch=200 or higher. Note that if the table has many columns and you use a higher batch value, you may end up seeing OOM errors. The second tip is specific to Sqoop jobs launched from Oozie. Sqoop copies the jars in the $SQOOP_HOME/lib folder to the job cache every time a Sqoop job starts. When launched by Oozie this is unnecessary, because Oozie uses its own Sqoop share lib, which keeps the Sqoop dependencies in the distributed cache. Oozie localizes the Sqoop dependencies on each worker node only once, during the first Sqoop job, and reuses the jars on the worker nodes for subsequent jobs. Using the --skip-dist-cache option in the Sqoop command when it is launched by Oozie skips the step in which Sqoop copies its dependencies to the job cache and saves a significant amount of I/O.</p>
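<p>For illustration only, an Oozie-launched Sqoop export using both options might look roughly like the command below (the connection string, table, and export directory are placeholders); in an Oozie workflow this string would go in the sqoop action's command element.</p>
<pre class="brush: ps; auto-links: false;"># Placeholder Sqoop export command for an Oozie sqoop action: batch the JDBC statements and skip copying the Sqoop jars
export --connect "jdbc:sqlserver://myserver.database.windows.net;database=mydb;user=myuser;password=..." --table MyTable --export-dir /example/data/export --batch --skip-dist-cache -m 128
</pre>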
<p><strong>Conclusion
</strong></p><p>I am sure there are other ways one can think of optimizing the performance of a Sqoop job in HDInsight. I tried to cover the main ones in this blog post and I hope it either helps you to improve the performance of your Sqoop job or at least serves as a starting point for you.
</p><p><strong>References:
</strong></p><p><a href="http://sqoop.apache.org/docs/1.4.5/SqoopUserGuide.html">Apache <a name="idp91632"/>Sqoop User Guide (v1.4.5)</a>
</p><p>
</p><div style="clear:both;"></div><img src="http://blogs.msdn.com/aggbug.aspx?PostID=10594142" width="1" height="1">Mohammad Farooqhttp://blogs.msdn.com/mafarooq_4000_hotmail.com/ProfileUrlRedirect.ashxAzure PowerShell 0.8.14 Released, fixes problems with pipelining HDInsight configuration cmdletshttp://blogs.msdn.com/b/bigdatasupport/archive/2015/02/16/azure-powershell-0-8-14-released-fixes-problems-with-pipelining-hdinsight-configuration-cmdlets.aspx2015-02-16T15:16:00Z2015-02-16T15:16:00Z<script type="text/javascript">// <![CDATA[
// Telligent is stripping Style tags, so adding via DOM
var coreStyle = document.createElement("link");
coreStyle.setAttribute("rel", "stylesheet");
coreStyle.setAttribute("type", "text/css");
coreStyle.setAttribute("href", "http://alexgorbatchev.com/pub/sh/current/styles/shCore.css");
document.getElementsByTagName("head")[0].appendChild(coreStyle);
var shTheme = document.createElement("link");
shTheme.setAttribute("rel", "stylesheet");
shTheme.setAttribute("type", "text/css");
shTheme.setAttribute("href", "http://alexgorbatchev.com/pub/sh/current/styles/shThemeMidnight.css");
document.getElementsByTagName("head")[0].appendChild(shTheme);
var smallerfontdiv = document.createElement('style')
smallerfontdiv.innerHTML = "div.smallerfont {font-size: 90%;}\n.comments {display: inline !important;}\ndiv.syntaxhighlighter{overflow-y: hidden !important;overflow-x: auto !important;}";
document.body.appendChild(smallerfontdiv);
// ]]></script>
<script type="text/javascript" src="http://alexgorbatchev.com/pub/sh/current/scripts/shCore.js"></script>
<script type="text/javascript" src="http://alexgorbatchev.com/pub/sh/current/scripts/shBrushJScript.js"></script>
<script type="text/javascript" src="http://alexgorbatchev.com/pub/sh/current/scripts/shBrushPowerShell.js"></script>
<script type="text/javascript" src="http://alexgorbatchev.com/pub/sh/current/scripts/shBrushPlain.js"></script>
<p>We recently pushed out the <a href="https://github.com/Azure/azure-powershell/releases/tag/v0.8.14-February2015">0.8.14 release of Azure PowerShell</a>. This release includes some updates to the following cmdlets to ensure that values passed in via the PowerShell pipeline, or via the -Config parameter, are maintained:</p>
<ul>
<li>Set-AzureHDInsightDefaultStorage</li>
<li>Add-AzureHDInsightStorage</li>
<li>Add-AzureHDInsightMetastore</li>
</ul>
<p>Previously if you had done something like:</p>
<pre class="brush: ps; auto-links: false;">$myconfig = New-AzureHDInsightClusterConfig -ClusterSizeInNodes 2 -ClusterType HBase
$myconfig = Set-AzureHDInsightDefaultStorage -Config $myconfig `
-StorageContainerName "somecontainer" `
-StorageAccountName "somedefaultstorage.blob.core.windows.net" `
-StorageAccountKey "U09NRVRISU5HIEVOQ09ERUQgSU4gQkFTRTY0Lg=="
$myconfig = Add-AzureHDInsightStorage -Config $myconfig `
-StorageAccountName "someaddedstorage.blob.core.windows.net" `
-StorageAccountKey "U09NRVRISU5HIEVOQ09ERUQgSU4gQkFTRTY0Lg=="
$myconfig = Add-AzureHDInsightMetastore -Config $myconfig `
-Credential (Get-Credential) -DatabaseName "somedatabase" `
-MetastoreType HiveMetastore -SqlAzureServerName "someserver"
$myconfig | Format-Custom # This is where you would usually call New-AzureHDInsightCluster
</pre>
<p>or</p>
<pre class="brush: ps; auto-links: false;">New-AzureHDInsightClusterConfig -ClusterSizeInNodes 2 -ClusterType HBase | `
Set-AzureHDInsightDefaultStorage -StorageContainerName "somecontainer" `
-StorageAccountName "somedefaultstorage.blob.core.windows.net" `
-StorageAccountKey "U09NRVRISU5HIEVOQ09ERUQgSU4gQkFTRTY0Lg==" | `
Add-AzureHDInsightStorage -StorageAccountName "someaddedstorage.blob.core.windows.net" `
-StorageAccountKey "U09NRVRISU5HIEVOQ09ERUQgSU4gQkFTRTY0Lg==" | `
Add-AzureHDInsightMetastore -Credential (Get-Credential) -DatabaseName "somedatabase" `
-MetastoreType HiveMetastore -SqlAzureServerName "someserver" | `
Format-Custom # This is where you would usually call New-AzureHDInsightCluster
</pre>
<p>You would have found that some elements, like the initial ClusterType of "HBase", had been lost from the configuration. These values are now maintained as you add elements to the configuration. This should also address some scenarios where people found that they needed to set options in a particular order for them to be maintained.</p>
<p><em>Side Note: Passing a configuration object to <strong>Format-Custom</strong> before using it for <strong>New-AzureHDInsightCluster</strong> is a great way to troubleshoot whether the configuration object is set up as you expect.</em></p>
<script type="text/javascript">// <![CDATA[
SyntaxHighlighter.all()
// ]]></script><div style="clear:both;"></div><img src="http://blogs.msdn.com/aggbug.aspx?PostID=10593763" width="1" height="1">Rick_Hhttp://blogs.msdn.com/Rick_5F00_H/ProfileUrlRedirect.ashxProblems When Using a Shared Default Storage Container with Multiple HDInsight Clustershttp://blogs.msdn.com/b/bigdatasupport/archive/2015/02/12/problems-when-using-a-shared-default-storage-container-with-multiple-hdinsight-clusters.aspx2015-02-12T16:06:00Z2015-02-12T16:06:00Z<p>We have seen several cases come in to Microsoft Support that ended up being caused by having multiple HDInsight clusters using the same Azure Blob Storage container for default storage. While we don't currently block you from creating clusters using the same default storage container, we do know that this can cause some specific problems. Many folks have been asking whether this configuration is supported, and the short answer is that it is not.</p>
<p>When it comes to determining whether a particular setup is supportable, we typically look at whether the configuration is tested and proven to work reliably. Since HDInsight is based on Apache Hadoop, this is obviously a bit more complex. If you look out into the Hadoop ecosystem, there is not much precedent for primary storage being shared between multiple clusters. It just happens to be easy to manually configure HDInsight clusters in this way, and some customers have chosen to do so because it provides convenient access to shared data in the container. The problems may not manifest for many days or weeks, depending on some specific timing conditions on job completion and background maintenance, so the setup can appear to be working just fine for a while.</p>
<p>The types of problems that we have seen center around errors retrieving job status, which can cascade into unexpected errors, hangs or delays in Hive, Pig, WebHCat/Templeton, and Oozie. Each of these frameworks has different error handling and retry logic so the ways in which the problems surface are very broad.</p>
<p>What this means is that if you are using a shared default container between multiple HDInsight clusters and you call in to support, we will ask you to eliminate the shared default container configuration as a first troubleshooting step.</p>
<p>If you need to use a shared container to provide access to data for multiple HDInsight clusters then you should add it as an <strong>Additional Storage Account</strong> in the cluster configuration. This option is available when using the <a href="http://azure.microsoft.com/en-us/documentation/articles/hdinsight-provision-clusters/">Azure Portal</a>, PowerShell (<a href="https://msdn.microsoft.com/en-us/library/dn593756.aspx">Add-AzureHDInsightStorage</a>), or the SDK (<a href="https://msdn.microsoft.com/en-us/library/microsoft.windowsazure.management.hdinsight.clustercreateparameters.additionalstorageaccounts.aspx">AdditionalStorageAccounts</a>) to provision clusters.</p>
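<p>For illustration, here is a minimal sketch of wiring a shared container's storage account in as additional (not default) storage while building a cluster configuration, using the same cmdlets shown in the PowerShell post above; the account names, key, and cluster details are placeholders.</p>
<pre class="brush: ps; auto-links: false;"># Default storage stays dedicated to this cluster; the shared account is added as additional storage (placeholder values)
$config = New-AzureHDInsightClusterConfig -ClusterSizeInNodes 4 |
    Set-AzureHDInsightDefaultStorage -StorageContainerName "dedicatedcontainer" `
        -StorageAccountName "dedicatedstorage.blob.core.windows.net" `
        -StorageAccountKey "U09NRVRISU5HIEVOQ09ERUQgSU4gQkFTRTY0Lg==" |
    Add-AzureHDInsightStorage -StorageAccountName "sharedstorage.blob.core.windows.net" `
        -StorageAccountKey "U09NRVRISU5HIEVOQ09ERUQgSU4gQkFTRTY0Lg=="

$config | New-AzureHDInsightCluster -Name "mycluster" -Location "West US" -Credential (Get-Credential)
</pre>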
<p><em>Note: For detailed information about how HDInsight uses Blob storage check out: <a href="http://azure.microsoft.com/en-us/documentation/articles/hdinsight-use-blob-storage/">http://azure.microsoft.com/en-us/documentation/articles/hdinsight-use-blob-storage/</a> </em></p><div style="clear:both;"></div>Rick_H

Loading data in HBase Tables on HDInsight using built-in ImportTsv utility
http://blogs.msdn.com/b/bigdatasupport/archive/2014/12/12/loading-data-in-hbase-tables-on-hdinsight-using-bult-in-importtsv-utility.aspx
2014-12-12
<p><a href="http://hbase.apache.org/book/arch.bulk.load.html">Apache HBase</a> can give random access to very large tables -- billions of rows by millions of columns. But the question is, how do you upload that kind of data into HBase tables in the first place? HBase includes several methods of loading data into tables. The most straightforward method is to either use the <span style="font-family: Courier New; font-size: 10pt;">TableOutputFormat</span> class from a MapReduce job, or use the normal client APIs; however, these are not always the most efficient methods.</p>
<h2>Overview</h2>
<p>HBase ships with a built-in <a href="http://hbase.apache.org/book/ops_mgt.html">ImportTsv utility</a>, and in many cases it is much faster and easier to upload data into HBase using ImportTsv than with other methods. As the name suggests, the ImportTsv tool lets you upload data in <a href="http://en.wikipedia.org/wiki/Tab-separated_values">TSV format</a> into HBase. In a TSV file each field value of a record is separated from the next by a <a title="Tab stop" href="http://en.wikipedia.org/wiki/Tab_stop">tab stop</a> character. However, the tool has an option, <span style="font-family: Courier New;">importtsv.separator</span>, which allows you to specify a different separator if the fields are separated by something other than a tab &ndash; for example a pipe or a comma. ImportTsv has two distinct usages.</p>
<ol>
<li>Loading data from TSV format in HDFS into HBase via Puts (i.e., non-bulk loading)</li>
<li>Preparing StoreFiles to be loaded via <span style="font-family: Courier New; font-size: 10pt;">completebulkload</span> (bulk loading)</li>
</ol>
<p>If you don't have a huge amount of data, you may be able to upload it directly to HBase via Puts (#1). <a href="http://hbase.apache.org/book/arch.bulk.load.html">Bulk Loading</a> (#2), on the other hand, comes in handy when you have a huge amount of data to upload. Bulk loading is faster, as it uses less CPU and network resources than simply using the HBase API. However, keep in mind that bulk loading bypasses the write path; the Write Ahead Log (WAL) is not written as part of the process, which can cause issues for other processes, for example replication. To find out more about HBase bulk loading, please review the <a href="http://hbase.apache.org/book/arch.bulk.load.html">Bulk Loading</a> page in the Apache HBase reference guide. The HBase bulk load process consists of two main steps.</p>
<ol>
<li>The first step of a bulk load is to generate HBase data files (StoreFiles) from a MapReduce job using <span style="font-family: Courier New; font-size: 10pt;">HFileOutputFormat</span>.</li>
<li>After the data has been prepared using <span style="font-family: Courier New; font-size: 10pt;">HFileOutputFormat</span>, it is loaded into the cluster using <span style="font-family: Courier New; font-size: 10pt;">completebulkload</span>.</li>
</ol>
<p style="text-align: center; margin-left: 54pt;"><img src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78-metablogapi/3731.121614_5F00_2343_5F00_Loadingdata1.png" alt="" /></p>
<h2>Examples</h2>
<p><span style="color: black;">HBase in HDInsight (Hadoop in Microsoft Azure) is the same in its core as HBase in any other environment. However, someone not familiar with Microsoft Azure environment may get stuck by some minor differences when interacting with the HBase cluster in HDInsight. This is why the examples provided in this blog are specific to HBase cluster in HDInsight and I hope it will make your experience with Hbase cluster in HDInsight smoother. </span>We will provide detail steps for both the usage scenarios of ImportTsv utility.</p>
<h3>Prerequisites</h3>
<p>Before uploading the data to HBase, we first need to move the data to <a href="http://azure.microsoft.com/en-us/documentation/articles/hdinsight-use-blob-storage/">Windows Azure Storage Blob</a> (WASB), and we also need to create an empty HBase table to receive it. So let's do the following steps to get ready to upload the data into HBase using the ImportTsv utility.</p>
<ol>
<li>
<div><span style="color: black;">For this blog we will use the sample <span style="font-family: Courier New;">data.tsv</span> as shown below where each filed in a row is separated by a Tab. </span></div>
<p><span style="color: black; font-family: Courier New; font-size: 10pt;">row1&nbsp;&nbsp;&nbsp;&nbsp;c1&nbsp;&nbsp;&nbsp;&nbsp;c2 </span></p>
<p><span style="color: black; font-family: Courier New; font-size: 10pt;">row2&nbsp;&nbsp;&nbsp;&nbsp;c1&nbsp;&nbsp;&nbsp;&nbsp;c2 </span></p>
<p><span style="color: black; font-family: Courier New; font-size: 10pt;">row3&nbsp;&nbsp;&nbsp;&nbsp;c1&nbsp;&nbsp;&nbsp;&nbsp;c2 </span></p>
<p><span style="color: black; font-family: Courier New; font-size: 10pt;">row4&nbsp;&nbsp;&nbsp;&nbsp;c1&nbsp;&nbsp;&nbsp;&nbsp;c2 </span></p>
<p><span style="color: black; font-family: Courier New; font-size: 10pt;">row5&nbsp;&nbsp;&nbsp;&nbsp;c1&nbsp;&nbsp;&nbsp;&nbsp;c2 </span></p>
<p><span style="color: black; font-family: Courier New; font-size: 10pt;">row6&nbsp;&nbsp;&nbsp;&nbsp;c1&nbsp;&nbsp;&nbsp;&nbsp;c2 </span></p>
<p><span style="color: black; font-family: Courier New; font-size: 10pt;">row7&nbsp;&nbsp;&nbsp;&nbsp;c1&nbsp;&nbsp;&nbsp;&nbsp;c2 </span></p>
<p><span style="color: black; font-family: Courier New; font-size: 10pt;">row8&nbsp;&nbsp;&nbsp;&nbsp;c1&nbsp;&nbsp;&nbsp;&nbsp;c2 </span></p>
<p><span style="color: black; font-family: Courier New; font-size: 10pt;">row9&nbsp;&nbsp;&nbsp;&nbsp;c1&nbsp;&nbsp;&nbsp;&nbsp;c2 </span></p>
<p><span style="color: black; font-family: Courier New; font-size: 10pt;">row10&nbsp;&nbsp;&nbsp;&nbsp;c1&nbsp;&nbsp;&nbsp;&nbsp;c2 </span></p>
<p><span style="color: black; font-family: Courier New; font-size: 10pt;">row11&nbsp;&nbsp;&nbsp;&nbsp;c1&nbsp;&nbsp;&nbsp;&nbsp;c2 </span></p>
<p><span style="color: black; font-family: Courier New; font-size: 10pt;">row12&nbsp;&nbsp;&nbsp;&nbsp;c1&nbsp;&nbsp;&nbsp;&nbsp;c2 </span></p>
</li>
<li><span style="color: black;">Follow any of the methods/tools described in <a href="http://azure.microsoft.com/en-us/documentation/articles/hdinsight-upload-data/">Upload data for Hadoop jobs in HDInsight</a></span> Azure document to upload <span style="font-family: Courier New;">data.tsv</span> file to WASB. For example I used the PowerShell script sample provided in the above link to upload the <span style="font-family: Courier New;">data.tsv</span> file at <span style="font-family: Courier New;">example/data/data.tsv</span> and used the <a href="http://azurestorageexplorer.codeplex.com/">Azure Storage Explorer</a> tool to verify that the file is uploaded in the right location.</li>
<li>
<div><span style="color: black;">Now we need to create the table from HBase shell. We will call the table '<span style="font-family: Courier New;">t1</span>' and our row key will be the first column. We will have the two remaining columns in a column family called '<span style="font-family: Courier New;">cf</span>'. </span></div>
<p><span style="color: black;">If you are preparing a lot of data for bulk loading, you need to make sure the target HBase table is pre-split appropriately. The best practice when creating a table is to split it according to the row key distribution. If your rowkeys start with a letter or number, you can split your table at letter or number boundaries. In our sample <span style="font-family: Courier New;">data.tsv</span> file we only have 12 rows but we will use three splits just to show how it works. </span></p>
<p><span style="color: black;">To open HBase shell we need to RDP to the head node; open Hadoop command line; navigate to <span style="font-family: Courier New;">%hbase_home%\bin</span> and then type the following. </span></p>
<p><span style="color: black; font-family: Courier New;">C:\apps\dist\hbase-0.98.0.2.1.6.0-2103-hadoop2\bin&gt;hbase shell </span></p>
<p><span style="color: black;">Then run the following from Hbase shell to create the table with 3 splits. </span></p>
<p><span style="color: black; font-family: Courier New;">hbase(main):008:0&gt; create 't1', {NAME =&gt; 'cf1'}, {SPLITS =&gt; ['row5', 'row9']} </span></p>
</li>
<li>
<div>Now let's browse to the HBase dashboard at the link below from the head node to check the table we just created.</div>
<p style="margin-left: 36pt;">http://zookeeper2.MyHbaseCluster.d3.internal.cloudapp.net:60010/master-status</p>
<p><span style="color: black;">In the dashboard go to the <strong>Table Details</strong> tab and you will see the list of all tables and the one we just created '<span style="font-family: Courier New;">t1</span>'. Names of all the tables are hyper linked. Click '<span style="font-family: Courier New;">t1</span>' and should be able to view the three regions and other details as shown in the screenshot below. </span></p>
</li>
</ol>
<p style="text-align: center;"><img src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78-metablogapi/7382.121614_5F00_2343_5F00_Loadingdata2.png" alt="" width="512" height="330" border="2" /></p>
<h3>Usage 1: Upload the data from TSV format in HDFS into HBase via Puts (i.e., non-bulk loading)</h3>
<p><span style="color: black;">Open a new Hadoop command like and type '<span style="font-family: Courier New;">cd %hbase_home%\bin'</span> to navigate to the HBase home and then run the following to upload the data from the tsv file data.tsv in HDFS to Hbase table t1. </span></p>
<p><span style="font-family: Courier New;">hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.columns="HBASE_ROW_KEY,cf1:c1,cf1:c2" t1 /example/data/data.tsv </span></p>
<p>Note: If the fields in the file were separated by a comma instead of a tab and the corresponding file name were data.csv, then we would have used the following to upload the data to the HBase table 't1', where the comma separator (",") is specified using the option importtsv.separator.</p>
<p><span style="font-family: Courier New;">hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.columns="HBASE_ROW_KEY,cf1:c1,cf1:c2" -Dimporttsv.separator="," t1 /example/data/data.csv </span></p>
<p>To verify that the data is uploaded open HBase shell again and run the following.</p>
<p style="margin-left: 36pt;"><span style="font-family: Courier New;">scan 't1' </span></p>
<p>You should see the rows as below.</p>
<p style="text-align: center; margin-left: 36pt;"><img src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78-metablogapi/1121.121614_5F00_2343_5F00_Loadingdata3.png" alt="" /></p>
<h3>Usage 2: Preparing StoreFiles to be loaded via <span style="font-size: 10pt;">completebulkload</span> (bulk loading)</h3>
<p><span style="font-size: 11pt;"><span style="font-family: Calibri;">We will use the same table 't1' to bulk load the data from the same input file. So let's disable, drop and recreate table '</span>t1<span style="font-family: Calibri;">' from HBase shell as shown in the screen shot below. Our input data file data.tsv will remain in the same location in WASB. </span></span></p>
<p style="text-align: center;"><img src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78-metablogapi/2438.121614_5F00_2343_5F00_Loadingdata4.png" alt="" /></p>
<p>Now that table '<span style="font-family: Courier New;">t1</span>' is recreated let's follow the steps to prepare StoreFiles and then load them to the Hbase table via the <span style="font-family: Courier New;">completebulkload</span> tool.</p>
<ol>
<li>
<div>Run the following to transform the data file into StoreFiles and store them at the relative path specified by -Dimporttsv.bulk.output.</div>
<p><span style="font-family: Courier New;">hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.columns="HBASE_ROW_KEY,cf1:c1,cf1:c2" -Dimporttsv.bulk.output="/example/data/storeDataFileOutput" t1 /example/data/data.tsv </span></p>
<p>You should see the output as below in WASB (this screen shot is taken using Azure Storage Explorer). Notice there are three files under "<span style="font-family: Courier New;">example/data/storeDataFileOutput/cf1/",</span> one per region.</p>
</li>
</ol>
<p style="text-align: center;"><img src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78-metablogapi/5658.121614_5F00_2343_5F00_Loadingdata5.png" alt="" /></p>
<p style="margin-left: 36pt;">Note: If the fields in the file were separated by a comma instead of Tab and the corresponding file name were data.csv then we would have used the following.</p>
<p style="margin-left: 36pt;"><span style="font-family: Courier New;">hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.columns="HBASE_ROW_KEY,cf1:c1,cf1:c2" -Dimporttsv.separator="," -Dimporttsv.bulk.output="/example/data/storeDataFileOutput" t1 /example/data/data.csv </span></p>
<ol start="2">
<li><span style="color: black;">Now we need to use the <a href="http://hbase.apache.org/book/ops_mgt.html">completebulkload tool</a> to complete the bulk upload. Run the following to upload the data from the HFiles located at <span style="font-family: Courier New;">/example/data/storeDataFileOutput</span> to the HBase table t1. </span>
<p style="font-family: Courier New;">hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles <br />/example/data/storeDataFileOutput t1</p>
</li>
</ol>
<p>Again to verify that the data is uploaded open HBase shell and run the following.</p>
<p>scan 't1'</p>
<p>You can also use Hive and Pig to upload data into HBase tables on HDInsight; I intend to blog on those in the future. That is it for today, and I hope it was helpful.</p><div style="clear:both;"></div>Mohammad Farooq

Some Commonly Used Yarn Memory Settings
http://blogs.msdn.com/b/bigdatasupport/archive/2014/11/11/some-commonly-used-yarn-memory-settings.aspx
2014-11-11
<p style="text-align: justify;"><span style="color: black; font-size: 10pt; line-height: 1.1;">We were recently working on an out of memory issue that was occurring with certain workloads on HDInsight clusters. I thought it might be a good time to write on this topic based on our recent experience troubleshooting memory issues. There are a few memory settings that can be tuned to suit your specific workloads. The nice thing about some of these settings is that they can be configured either at the Hadoop cluster level or for specific queries known to exceed the cluster's memory limits. </span></p>
<h3>Some Key Memory Configuration Parameters</h3>
<p style="text-align: justify;"><span style="color: black; font-size: 10pt; line-height: 1.1;">So, as we all know by now, Yarn is the new data operating system that handles resource management and also serves batch workloads that can use MapReduce and other interactive and real-time workloads. There are memory settings that can be set at the Yarn container level and also at the mapper and reducer level. Memory is requested in increments of the Yarn container size. Mapper and reducer tasks run inside a container. Let us introduce some parameters here and understand what they mean.&nbsp;&nbsp; </span></p>
<h4>mapreduce.map.memory.mb and mapreduce.reduce.memory.mb</h4>
<p style="text-align: justify;"><span style="color: black; font-size: 10pt; line-height: 1.1;"><strong>Description: </strong>Upper memory limit for the map/reduce task and if memory subscribed by this task exceeds this limit, the corresponding container will be killed.<br /><br />These parameters determine the maximum amount of memory that can be assigned to mapper and reduce tasks respectively. Let us look at an example to understand this well. Say, a Hive job runs on MR framework and it needs a mapper and reducer. Mapper is bound by an upper limit for memory which is defined in the configuration parameter mapreduce.map.memory.mb. However, if the value for yarn.scheduler.minimum-allocation-mb is greater than this value of mapreduce.map.memory.mb, then the yarn.scheduler.minimum-allocation-mb is respected and the containers of that size are given out. </span></p>
<p style="text-align: justify;"><span style="color: black; font-size: 10pt; line-height: 1.1;">This parameter needs to be set carefully as this will restrict the amount of memory that a mapper/reducer task has to work with and if not set properly, this could lead to slower performance or OOM errors. Also, if it is set to a large value&nbsp;that is not typically needed for your workloads,&nbsp;it could result in a reduced concurrency&nbsp;on a busy system, as&nbsp;lesser applications can run in parallel due to larger container allocations.&nbsp;It is important to test workloads&nbsp;and set this parameter to&nbsp;an optimal&nbsp;value&nbsp;that can serve most workloads and tweak this value at the job level for&nbsp;unique memory requirements. &nbsp;Please note that for Hive queries on Tez, the configuration parameters hive.tez.container.size and hive.tez.java.opts can be used to set the container limits. By default these values are set to -1, which means that they default to the mapreduce settings, however there is an option to override this at the Tez level. Shanyu's <a href="http://blogs.msdn.com/b/shanyu/archive/2014/07/31/hadoop-yarn-memory-settings-in-hdinsigh.aspx" target="_blank">blog</a> covers this in greater detail.<br /><br /><strong>How to set this property: </strong>Can be set at site level with mapred-site.xml. This change does not require a service restart.</span></p>
<h4>yarn.scheduler.minimum-allocation-mb</h4>
<p style="text-align: justify;"><span style="color: black; font-size: 10pt; line-height: 1.1;"><strong>Description:</strong>This is the minimum allocation size of a container. All memory requests will be handed out as increments of this size.<br />.<br />Yarn uses the concept of containers for acquiring resources in increments. So, the minimum allocation size of a container is determined by the configuration property yarn.scheduler.minimum-allocation-mb. This is the minimum unit of container memory grant possible. So, even if a certain workload needs 100 MB of memory, it will still be granted 512 MB of memory as that is the minimum size as defined in yarn.scheduler.minimum-allocation-mb property for this cluster. This parameter needs to be chosen carefully based on your workload to ensure that it is providing the optimal performance you need while also utilizing the cluster resources.&nbsp;If this configuration parameter is not able to&nbsp;accommodate the memory needed by a task, it will lead to&nbsp;out of memory errors.&nbsp;The error below is from my cluster where&nbsp;a container has oversubscribed memory, so the NodeManager kills the container and you will notice an error message like below on the logs.&nbsp;So, I would need to look at my workload and increase the value for this configuration property to allow my workloads to complete without OOM errors.</span></p>
<p><span style="color: red; font-size: 8pt; background-color: white; line-height: 1;">Vertex failed,vertexName=Reducer 2, vertexId=vertex_1410869767158_0011_1_00,diagnostics=[Task failed, taskId=task_1410869767158_0011_1_00_000006,diagnostics=[AttemptID:attempt_1410869767158_0011_1_00_000006_0<br />Info:Container container_1410869767158_0011_01_000009 COMPLETED with diagnostics set to [Container [pid=container_1410869767158_0011_01_000009,containerID=container_1410869767158_0011_01_000009] is running beyond physical memory limits.<br />Current usage: 512.3 MB of 512MB physical memory used; 517.0 MB of 1.0 GB virtual memory used. Killing container.<br />Dump of the process-tree for container_1410869767158_0011_01_000009 : <br />|- PID CPU_TIME(MILLIS) VMEM(BYTES)WORKING_SET(BYTES)<br />|- 7612 62 1699840 2584576|- 9088 15 663552 2486272<br />|- 7620 3451328 539701248 532135936<br />Container killed on request. Exit code is 137<br />Container exited with a non-zero exit code 137</span></p>
<p style="text-align: justify;"><span style="color: black; font-size: 10pt; line-height: 1.1;"><strong>How to set this property: </strong>Can be set at site level&nbsp;with mapred-site.xml, or can be set at the job level. This change needs a recycle of the RM service. On HDInsight, this needs to be done when provisioning the cluster with custom configuration parameters.&nbsp;</span></p>
<h4>yarn.scheduler.maximum-allocation-mb</h4>
<p style="text-align: justify;"><span style="color: black; font-size: 10pt; line-height: 1.1;"><strong>Description: </strong>This is the maximum allocation size allowed for a container.<br /><br />This property defines the maximum memory allocation possible for an application master container allocation request. Again, this needs to be chosen carefully as if the value is not large enough to accommodate the needs for processing, then this would result in an OOM error. Say mapreduce.map.memory.mb is set to 1024 and if the yarn.scheduler.maximum-allocation-mb is set to 300, it leads to a problem as the maximum allocation possible for a container is 300 MB, but the mapper task would need 1024 MB as defined in the mapreduce.map.memory.mb setting. Here is the error you would see in the logs: </span></p>
<p><span style="color: red; font-size: 8pt; background-color: white; line-height: 1;">org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: Invalid resource request, requested memory &lt; 0, or requested memory &gt; max configured, requestedMemory=1024, maxMemory=300 </span></p>
<p style="text-align: justify;"><span style="color: black; font-size: 10pt; line-height: 1.1;"><strong>How to set this property: </strong>Can be set at site level by with&nbsp;mapred-site.xml, or can be set at the job level. This change needs a recycle of the RM service. On HDInsight, this needs to be done when provisioning the cluster with custom configuration parameters.</span></p>
<h4>mapreduce.reduce.java.opts and mapreduce.map.java.opts</h4>
<p style="text-align: justify;"><span style="color: black; font-size: 10pt; line-height: 1.1;"><strong>Description: </strong>This allows to configure the maximum and minimum JVM heap size. For maximum use, -Xmx and for minimum use &ndash;Xms. <br /><br />This property value needs to be less than the upper bound for map/reduce task as defined in mapreduce.map.memory.mb/mapreduce.reduce.memory.mb, as it should fit within the memory allocation for the map/reduce task. These configuration parameters specify the amount of heap space that is available for the JVM process to work within a container.If this parameter is not properly configured, this will lead to Java heap space errors as shown below. </span></p>
<p><a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78/2061.HeapError.jpg"><img src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78-metablogapi/7245.103114_5F00_1527_5F00_SomeCommonl1.jpg" alt="" border="0" /></a></p>
<p style="text-align: justify;"><span style="color: black; font-size: 10pt; line-height: 1.1;">You can get the yarn logs by executing the following command by feeding in the applicationId. </span></p>
<p><a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78/3580.YarnLogs.jpg"><img src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78-metablogapi/5125.103114_5F00_1527_5F00_SomeCommonl2.jpg" alt="" border="0" /></a></p>
<p style="text-align: justify;"><span style="color: black; font-size: 10pt; line-height: 1.1;">Now, let us look at the relevant error messages from the Yarn logs that has been extracted to the heaperr.txt by executing the command above &ndash; </span></p>
<p><span style="color: red; font-size: 7pt; background-color: white; line-height: 1;">2014-09-18 14:36:32,303 INFO [IPC Server handler 29 on 59890] org.apache.hadoop.mapred.TaskAttemptListenerImpl: JVM with ID : jvm_1410959438010_0051_r_000004 asked for a task<br />2014-09-18 14:36:32,303 INFO [IPC Server handler 29 on 59890] org.apache.hadoop.mapred.TaskAttemptListenerImpl: JVM with ID: jvm_1410959438010_0051_r_000004 given task: attempt_1410959438010_0051_r_000000_1<br />2014-09-18 14:41:11,809 FATAL [IPC Server handler 18 on 59890] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Task: attempt_1410959438010_0051_r_000000_1 - exited : Java heap space<br />2014-09-18 14:41:11,809 INFO [IPC Server handler 18 on 59890] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Diagnostics report from attempt_1410959438010_0051_r_000000_1: Error: Java heap space<br />2014-09-18 14:41:11,809 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1410959438010_0051_r_000000_1: Error: Java heap space </span></p>
<p style="text-align: justify;"><span style="color: black; font-size: 10pt; line-height: 1.1;">As we can see the reduce attempt 1 has exited with an error due to insufficient Java heap space. I am able to tell that this is a reduce attempt due to the letter 'r' in the task attempt ID as highlighted here - jvm_1410959438010_0051_<span style="color: red; background-color: yellow; line-height: 1;"><strong>r</strong></span><span style="color: black; font-size: 10pt; line-height: 1.1;">_000004. It would be a letter 'm' for the mapper. This reduce task would be tried 4 times as defined in mapred.reduce.max.attempts config property in mapred-site.xml and in my case all the four attempts failed due to the same error. When you are testing a workload to determine memory settings on a standalone one-node box, you can reduce the number of attempts by reducing the mapred.reduce.max.attempts config property and find the right amount of memory that the workload would need by tweaking the different memory settings and determining the right configuration for the cluster. From the above output, it is clear that a reduce task has a problem with the available heap space and I could solve this issue by increasing the heap space with a set statement just for this query as most of my other queries were happy with the default heap space as defined in the mapred-site.xml for the cluster. Counter committed heap bytes can be used to look at the heap memory that the job eventually consumed. This is accessible from the job counters page. </span></span></p>
<p style="text-align: justify;"><span style="color: black; font-size: 10pt; background-color: white; line-height: 1.1;"><strong>How to set this property: </strong>Can be set at site level with mapred-site.xml, or can be set at the job level. This change does not require a service restart. </span></p>
<h4>yarn.app.mapreduce.am.resource.mb</h4>
<p style="text-align: justify;"><span style="color: black; font-size: 10pt; line-height: 1.1;"><strong>Description: </strong>This is the amount of memory that the Application Master for MR framework would need. <br /><br />Again, this needs to be set with care as a larger allocation for the AM would mean lesser concurrency, as you can spin up only so many AMs before exhausting the containers on a busy system. This value also needs to be less than what is defined in yarn.scheduler.maximum-allocation-mb, if not, it will create an error condition &ndash; example below. </span></p>
<p><span style="color: red; font-size: 7pt; background-color: white; line-height: 1;">2014-10-23 13:31:13,816 ERROR [main]: exec.Task (TezTask.java:execute(186)) - Failed to execute tez graph. <br />org.apache.tez.dag.api.TezException: org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: Invalid resource request, requested memory &lt; 0, or requested memory &gt; max configured, requestedMemory=1536, maxMemory=699<br />&nbsp;&nbsp;&nbsp;&nbsp;at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.validateResourceRequest(SchedulerUtils.java:228)<br />&nbsp;&nbsp;&nbsp;&nbsp;at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.validateResourceRequest(RMAppManager.java:385) </span></p>
<p style="text-align: justify;"><span style="color: black; font-size: 10pt; background-color: white; line-height: 1.1;"><strong>How to set this property: </strong>Can be set at site level with mapred-site.xml, or can be set at the job level. This change does not require a service restart. </span></p>
<p style="text-align: justify;"><span style="color: black; font-size: 10pt; line-height: 1.1;">We looked at some common Yarn memory parameters in this post. This brings us to the end of our journey in highlighting some of the most commonly used memory config parameters with Yarn. There is a nice reference from Hortonworks <a href="http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.1.1/bk_installing_manually_book/content/rpm-chap1-11.html" target="_blank"><span style="color: blue; text-decoration: underline;">here</span></a> that talks about a tool that gives some best practice suggestions for memory settings and also goes over how to manually set these values. This can be used as a starting point for testing some of your workloads and tuning it&nbsp;iteratively from there to suit your specific needs.&nbsp;Hope you found some of these nuggets helpful. We will try and explore some more&nbsp;tuning parameters in further posts. </span></p>
<p><span style="color: black; font-family: Arial; font-size: 9pt; line-height: 1.1;">-Dharshana </span></p>
<p><span style="color: black; font-family: Arial; font-size: 9pt; line-height: 1.1;">@dharshb</span></p>
<p><span style="color: black; font-family: Arial; font-size: 9pt; line-height: 1.1;">Thanks to <a href="https://social.msdn.microsoft.com/profile/dan%20(msft)/" target="_blank">Dan</a> and <a href="https://social.msdn.microsoft.com/profile/jason%20h%20(hdinsight)/">JasonH</a> for reviewing this post!</span></p><div style="clear:both;"></div><img src="http://blogs.msdn.com/aggbug.aspx?PostID=10569040" width="1" height="1">Dharshana_Bharadwajhttp://blogs.msdn.com/dharshana_5F00_bharadwaj_4000_hotmail.com/ProfileUrlRedirect.ashxHow to use HBase Java API with HDInsight HBase cluster, part 1http://blogs.msdn.com/b/bigdatasupport/archive/2014/11/04/how-to-use-hbase-java-api-with-hdinsight-hbase-cluster-part-1.aspx2014-11-05T05:56:00Z2014-11-05T05:56:00Z<p>Recently we worked with a customer, who was trying to use HBase Java API to interact with an HDInsight HBase cluster. Having worked with the customer and trying to follow our existing documentations <a href="http://azure.microsoft.com/en-us/documentation/articles/hdinsight-hbase-build-java-maven/">here</a> and <a href="http://azure.microsoft.com/en-us/documentation/articles/hdinsight-hbase-provision-vnet/">here</a>, we realized that it may be helpful if we clarify a few things around HBase JAVA API connectivity to HBase cluster and show a simpler way of running the JAVA client application using HBase JAVA APIs. In this blog, we will explain the recommended steps for using HBase JAVA APIs to interact with HDInsight HBase cluster.</p>
<p><strong><em>The Background: </em></strong></p>
<p>Our existing documentation <a href="http://azure.microsoft.com/en-us/documentation/articles/hdinsight-hbase-build-java-maven/">here</a> does a nice job of explaining how to use Maven to develop a Java application that uses the HBase Java API to interact with an HDInsight HBase cluster &ndash; but one may wonder why we are packaging the HBase Java client code as a MapReduce JAR and running the jar as a MapReduce job. This part begs a little more clarity. Remember that the HBase Java API uses RPC (Remote Procedure Call) to communicate with an HBase cluster, which means that the client application running the HBase Java API code and the HBase cluster need to exist in the same network and subnet. In the absence of an Azure Virtual Network, aka VNet (I imagine the <a href="http://azure.microsoft.com/en-us/documentation/articles/hdinsight-hbase-build-java-maven/">documentation</a> was written before we introduced the capability of installing an HBase cluster in a virtual network), the example takes the approach of packaging the HBase client code as a MapReduce JAR and submitting it as a MapReduce job via WebHCat/Templeton. With this approach, the client Java JAR (containing the HBase Java API calls) runs on one of the worker nodes in the HBase cluster and runs successfully. However, with the current capability of provisioning an HDInsight HBase cluster in a virtual network, as shown in this <a href="http://azure.microsoft.com/en-us/documentation/articles/hdinsight-hbase-provision-vnet/">documentation</a>, we feel that a more realistic and better approach for using the HBase Java APIs is to provision the HDInsight HBase cluster in a VNet, provision the client machine/VM in the same VNet, and then run the HBase Java API client on that VM &ndash; this is shown in the diagram below.</p>
<p><img src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78-metablogapi/7028.110514_5F00_0556_5F00_HowtouseHBa1.png" alt="" /></p>
<p>We will touch on each of these steps below &ndash;</p>
<p><span style="font-size: 14pt;"><strong>Provision HDInsight HBase cluster in a VNet: </strong></span></p>
<p>You can follow our HDInsight HBase <a href="http://azure.microsoft.com/en-us/documentation/articles/hdinsight-hbase-provision-vnet/">documentation</a>, which has very detailed steps on how to do this via either the Azure Portal or Azure PowerShell; a rough PowerShell sketch is also shown below.</p>
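<p>As an illustration only: the virtual network ID, subnet name, storage details, and cluster name below are placeholders, and the parameter names assume the service management Azure PowerShell module of the time (check Get-Help New-AzureHDInsightCluster for your version).</p>
<pre class="brush: ps; auto-links: false;"># Provision an HBase cluster into an existing virtual network and subnet (placeholder values throughout)
New-AzureHDInsightCluster -Name "MyHBaseCluster" -Location "West US" `
    -ClusterType HBase -ClusterSizeInNodes 4 -Credential (Get-Credential) `
    -DefaultStorageAccountName "hbasestorage.blob.core.windows.net" `
    -DefaultStorageAccountKey "U09NRVRISU5HIEVOQ09ERUQgSU4gQkFTRTY0Lg==" `
    -DefaultStorageContainerName "hbasecontainer" `
    -VirtualNetworkId "11111111-2222-3333-4444-555555555555" `
    -SubnetName "Subnet-1"
</pre>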
<p><span style="font-size: 14pt;"><strong>Provision a Microsoft Azure VM in the same VNet and subnet: </strong></span></p>
<p>Following the same <a href="http://azure.microsoft.com/en-us/documentation/articles/hdinsight-hbase-provision-vnet/">documentation</a> above, provision a Microsoft Azure virtual machine in the same VNet and subnet as the HDInsight HBase cluster &ndash; a standard Windows Server 2012 image with a small VM size should be sufficient. Since we need the JDK installed on the VM in order to use the HBase Java API, we have found it convenient to use an Oracle JDK image from the gallery for our testing (<strong>this is not required though and may have special pricing</strong>), like below &ndash;</p>
<p><img src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78-metablogapi/3364.110514_5F00_0556_5F00_HowtouseHBa2.png" alt="" /></p>
<p><span style="font-size: small;">If you choose a standard windows&nbsp;server VM (that does not have JDK installed), you can install JDK from <a title="Zulu" href="http://www.azulsystems.com/products/zulu/downloads#Windows">Zulu</a>.</span></p>
<p><span style="font-size: 14pt;"><strong>Get the DNS Suffix to build FQDN of ZooKeeper nodes: </strong></span></p>
<p>When using the HBase Java API to connect to the HBase cluster remotely, we must use fully qualified domain names (FQDNs). To determine these, we need to get the connection-specific DNS suffix of the HDInsight HBase cluster. The <a href="http://azure.microsoft.com/en-us/documentation/articles/hdinsight-hbase-provision-vnet/">documentation</a> shows multiple ways to accomplish this. The simplest way is to RDP into the HDInsight HBase cluster, execute ipconfig /all, and copy the connection-specific DNS suffix for the Ethernet adapter as shown in the screenshot below &ndash;</p>
<p><img src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78-metablogapi/4278.110514_5F00_0556_5F00_HowtouseHBa3.png" alt="" /></p>
<p>So, in my cluster as shown above, the connection-specific DNS suffix is <strong>AzimHbaseTest.d3.internal.cloudapp.net</strong>. Please make a note of the value from your cluster; we will use it to build the ZooKeeper FQDNs in the next section. To verify that the virtual machine can communicate with the HBase cluster, run ping headnode0.&lt;dns suffix&gt; from the virtual machine, as shown below &ndash;</p>
<p style="background: #e7e6e6;"><span style="font-family: Segoe UI; font-size: 10pt;"><span style="color: #000080;">C:\Users\DBAdmin&gt;ping headnode0.AzimHbaseTest.d3.internal.cloudapp.net <br />Pinging headnode0.AzimHbaseTest.d3.internal.cloudapp.net [10.0.0.6] with 32 bytes of data: <br />Reply from 10.0.0.6: bytes=32 time=3ms TTL=128 <br />&hellip;... </span>&nbsp; </span></p>
<p>&nbsp;<span style="font-size: 14pt;"><strong>Develop/Test HBase JAVA API Client on the Azure VM: </strong></span></p>
<p>Our <a href="http://azure.microsoft.com/en-us/documentation/articles/hdinsight-hbase-build-java-maven/">documentation</a> has detailed steps on how to use <a href="http://maven.apache.org/">Maven</a> to develop a Java client using the HBase Java API, and we don't want to repeat all the steps here &ndash; but we would like to share our own experience and show a few different ways to use Maven for developing the Java client. A few options (though not the only ones) are &ndash;</p>
<ol>
<li>Use Maven command line and a JAVA IDE (like Eclipse, IntelliJ etc)</li>
<li>Use a JAVA IDE (that comes integrated with Maven) like Eclipse to develop the JAVA client</li>
</ol>
<p><span style="font-size: 14pt;"><strong>Using Maven command line and a Java IDE: </strong></span></p>
<p><strong><span style="font-size: x-small;">Note: I am using <a title="IntelliJ" href="https://www.jetbrains.com/idea/">IntelliJ</a> as an example - you can use your preferred JAVA IDE. Also, the steps below assume that you have installed IntelliJ on the Azure VM (client).</span></strong></p>
<ol>
<li>From the command-line on your Azure VM, go to the folder where you wish to create the project. For example, cd C:\Maven\MavenProjects</li>
<li>
<div>Use the mvn command, which is installed with Maven, to generate the project template, as shown below &ndash;</div>
<p style="background: #e7e6e6;"><span style="font-family: Segoe UI; font-size: 10pt;"><span style="color: #000080;">mvn archetype:generate -DgroupId=com.microsoft.css -DartifactId=HBaseJavaApiTest -DarchetypeArtifactId=maven-archetype-quickstart -DinteractiveMode=false</span>&nbsp; </span></p>
<p>This will create the src directory and pom.xml in the directory <strong>HBaseJavaApiTest </strong>(same as the artifactId)</p>
</li>
<li>
<div>Start IntelliJ, select 'Import Project', and point to the pom.xml created in the last step, as shown below &ndash;</div>
<p><img src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78-metablogapi/7532.110514_5F00_0556_5F00_HowtouseHBa4.png" alt="" /></p>
<p>&nbsp;</p>
</li>
<li>
<div>On the next window, in addition to the default settings, also select 'Import Maven projects automatically' and the options to automatically download sources and documentation, as shown below &ndash;</div>
<p><img src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78-metablogapi/0027.110514_5F00_0556_5F00_HowtouseHBa5.png" alt="" /></p>
<p>&nbsp;</p>
</li>
<li>
<div>Select the default options on the next few windows and the project will open inside IntelliJ &ndash; add the necessary Java source files and remove the 'test' folder if you don't plan to use it. In our case, we just tested the CreateTable.java from the above <a href="http://azure.microsoft.com/en-us/documentation/articles/hdinsight-hbase-build-java-maven/">documentation</a> page (a minimal sketch of similar code appears right after this list), and it looks something like this &ndash;</div>
<p><img src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78-metablogapi/7384.110514_5F00_0556_5F00_HowtouseHBa6.png" alt="" /></p>
<p>&nbsp;</p>
</li>
<li>Modify the pom.xml file as shown in the <a href="http://azure.microsoft.com/en-us/documentation/articles/hdinsight-hbase-build-java-maven/">documentation</a> &ndash; something like this &ndash;
<script type="text/javascript" src="https://gist.github.com/AzimUddin/0d352840299078895e5a.js"></script>
</li>
<li>Create a new directory named conf in the <strong>HBaseJavaApiTest </strong>directory. In the conf directory, create a new file named hbase-site.xml and use the ZooKeeper FQDNs&nbsp;built from the DNS suffix you noted previously, as shown below:
<script type="text/javascript" src="https://gist.github.com/AzimUddin/5a6c9665f268f0b1444f.js"></script>
</li>
<li>Open a command prompt and change directories to the <strong>HBaseJavaApiTest </strong>directory. Use the following command to build a JAR containing the application:
<p style="background: #e7e6e6;"><span style="font-family: Segoe UI; font-size: 10pt;"><span style="color: #000080;">mvn clean package</span>&nbsp; </span></p>
<p>This will clean any previous build artifacts, download any dependencies that have not already been installed, then build and package the application. The command will create a JAR file, <strong>HBaseJavaApiTest-1.0-SNAPSHOT.jar</strong>, in the directory <strong>HBaseJavaApiTest\target.</strong></p>
</li>
</ol>
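<p>For readers who cannot make out the screenshot above, here is a minimal, hedged sketch of the kind of client code such a project contains. It is not a copy of the CreateTable.java sample from the documentation &ndash; the table name and column family below are made up &ndash; but it illustrates the HBase 0.98-era API the sample relies on. Because the conf directory containing hbase-site.xml is on the classpath, HBaseConfiguration.create() picks up the ZooKeeper quorum (built from the DNS suffix) automatically &ndash;</p>
<p style="background: #e7e6e6;"><span style="font-family: Courier New; font-size: 10pt;"><span style="color: #000080;">package com.microsoft.css;<br /><br />import org.apache.hadoop.conf.Configuration;<br />import org.apache.hadoop.hbase.HBaseConfiguration;<br />import org.apache.hadoop.hbase.HColumnDescriptor;<br />import org.apache.hadoop.hbase.HTableDescriptor;<br />import org.apache.hadoop.hbase.TableName;<br />import org.apache.hadoop.hbase.client.HBaseAdmin;<br /><br />public class CreateTableSketch {<br />&nbsp;&nbsp;&nbsp; public static void main(String[] args) throws Exception {<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; // hbase-site.xml from the conf directory on the classpath supplies the ZooKeeper quorum<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Configuration conf = HBaseConfiguration.create();<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; HBaseAdmin admin = new HBaseAdmin(conf);<br /><br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; // Hypothetical table name and column family, for illustration only<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; HTableDescriptor table = new HTableDescriptor(TableName.valueOf("people"));<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; table.addFamily(new HColumnDescriptor("name"));<br /><br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; if (!admin.tableExists(table.getTableName())) {<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; admin.createTable(table);<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; }<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; admin.close();<br />&nbsp;&nbsp;&nbsp; }<br />}</span>&nbsp;</span></p>
<p>If you use a class like this as your entry point, remember that the mainClass entry in the maven-jar-plugin configuration shown later must match its fully qualified name.</p>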
<p>&nbsp;</p>
<p><span style="font-size: 14pt;"><strong>Using Eclipse IDE to develop and build the HBase JAVA client application: </strong></span></p>
<p style="text-align: justify;">You can use the same steps as above for generating the project template using Maven and then import the project (POM.xml) in Eclipse. Alternatively, you can use the Eclipse IDE itself (without using Maven command line) to create&nbsp;the Maven project, as shown below-</p>
<p style="text-align: justify;">1. Install a latest package of 'Eclipse IDE for Java EE Developers', such as <a title="Kepler SR2" href="http://www.eclipse.org/downloads/packages/release/Kepler/SR2">Kepler SR2</a> or <a title="LUNA SR1" href="http://www.eclipse.org/downloads/packages/eclipse-ide-java-ee-developers/lunasr1">Luna SR1</a></p>
<p style="text-align: justify;">2. Open Eclipse IDE and select File -&gt; new -&gt; Project -&gt; Maven -&gt; New Maven Project, Leave the default options and enter GroupId and ArtifactId</p>
<p style="text-align: justify;"><a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78/3073.Eclipse_5F00_New_5F00_Project_5F00_4.png"><img src="http://blogs.msdn.com/resized-image.ashx/__size/550x0/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78/3073.Eclipse_5F00_New_5F00_Project_5F00_4.png" alt="" border="0" /></a></p>
<p style="text-align: justify;">This will create a vanilla maven project. You can then modify the project to add dependencies to pom.xml, and add/modify the source code like CreateTable.java etc. When loading the project on Eclipse IDE, you&nbsp;may notice errors such as "Missing artifact jdk.tools:jdk.tools.jar:1.7". This can be fixed by either modifying eclipse.ini to add the -vm argument to point to the JDK\bin directory, or by including the following dependency within pom.xml -</p>
<p style="text-align: justify;" align="LEFT"><span style="background: #e7e6e6;">&nbsp;&nbsp;&nbsp; &lt;/dependency&gt;<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &lt;groupId&gt;jdk.tools&lt;/groupId&gt;<br /> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &lt;artifactId&gt;jdk.tools&lt;/artifactId&gt;<br /> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &lt;version&gt;${java.version}&lt;/version&gt;<br /> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &lt;scope&gt;system&lt;/scope&gt;<br /> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &lt;systemPath&gt;${JAVA_HOME}/lib/tools.jar&lt;/systemPath&gt;<br /> &nbsp;&nbsp;&nbsp; &lt;/dependency&gt;</span></p>
<p style="text-align: justify;" align="LEFT">This is due to a limitation with Maven support on Eclipse IDE. It is documented <a href="https://bugs.eclipse.org/bugs/show_bug.cgi?id=432992" target="_blank">here</a>. Once the above changes are done, you can build the project and run it from within Eclipse IDE and debug as needed.</p>
<p><span style="font-size: 14pt;"><strong>Running the HBase JAVA API Client on Azure VM: </strong></span></p>
<p>If you have made your JAR an executable one using a Maven&nbsp;build plugin (see the POM.xml file above) like this &ndash;</p>
<p style="background: #e7e6e6;"><span style="font-family: Segoe UI; font-size: 10pt;"><span style="color: #000080;">&lt;plugin&gt; <br /> &lt;!-- Build an executable JAR --&gt; <br />&lt;groupId&gt;org.apache.maven.plugins&lt;/groupId&gt; <br />&lt;artifactId&gt;maven-jar-plugin&lt;/artifactId&gt; <br />&lt;version&gt;2.4&lt;/version&gt; <br />&lt;configuration&gt; <br />&lt;archive&gt; <br />&lt;manifest&gt; <br />&lt;addClasspath&gt;true&lt;/addClasspath&gt; <br />&lt;classpathPrefix&gt;lib/&lt;/classpathPrefix&gt; <br />&lt;mainClass&gt;com.microsoft.css.CreateTable&lt;/mainClass&gt; <br />&lt;/manifest&gt; <br />&lt;/archive&gt; <br />&lt;/configuration&gt; <br />&lt;/plugin&gt; </span>&nbsp; </span></p>
<p>You can run the executable JAR from a command line. Change directory to <strong>HBaseJavaApiTest\target </strong>and run the following command &ndash;</p>
<p style="background: #e7e6e6;"><span style="font-family: Segoe UI; font-size: 10pt;"><span style="color: #000080;">java -jar HBaseJavaApiTest-1.0-SNAPSHOT.jar</span>&nbsp; </span></p>
<p>Alternatively, you can test and debug the code within the IDE itself, by setting a breakpoint and stepping through the code, as shown in the screenshot below-</p>
<p><img src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78-metablogapi/4101.110514_5F00_0556_5F00_HowtouseHBa7.png" alt="" /></p>
<p>&nbsp;</p>
<p>I hope you find this blog helpful in using the HBase Java API to interact with an HDInsight HBase cluster &ndash; we would love to hear your feedback! In part 2, we will discuss some troubleshooting tools you can use with an HBase Java API client application.</p>
<p>Thanks <a title="Farooq" href="https://social.msdn.microsoft.com/profile/mohammad%20farooq/">Farooq</a> for reviewing this!</p>
<p>- Azim Uddin and Dharshana Kumar</p><div style="clear:both;"></div><img src="http://blogs.msdn.com/aggbug.aspx?PostID=10569915" width="1" height="1">Azim Uddinhttp://blogs.msdn.com/azim91_4000_hotmail.com/ProfileUrlRedirect.ashxHow to use parameter substitution with Pig Latin and PowerShellhttp://blogs.msdn.com/b/bigdatasupport/archive/2014/08/12/passing-parameters-to-pig-latin-scripts.aspx2014-08-12T16:37:00Z2014-08-12T16:37:00Z<p>When running Pig in a production environment, you'll likely have one or more Pig Latin scripts that run on a recurring basis (daily, weekly, monthly, etc.) that need to locate their input data based on when or where they are run. For example, you may have a Pig job that performs daily log ingestion by geographic region. It would be costly and error prone to manually edit the script to reference the location of the input data each time log data needs to be ingested. Ideally, you'd like to pass the date and geographic region to the Pig script as parameters at the time the script is executed. Fortunately, Pig provides this capability via <strong>parameter substitution</strong>. There are four different mechanisms to define parameters that can be referenced in a Pig Latin script:</p>
<ul>
<li>Parameters can be defined as command line arguments; each parameter is passed to Pig as a separate argument using <strong>-param</strong> switches at script execution time</li>
<li>Parameters can be defined in a parameter file that's passed to Pig using the <strong>-param_file</strong> command line argument when the script is executed</li>
<li>Parameters can be defined inside Pig Latin scripts using the "<strong>%declare</strong>" preprocessor statement</li>
<li>Default parameter values can be defined inside Pig Latin scripts using the "<strong>%default</strong>" preprocessor statement</li>
</ul>
<p>You can use none, one or any combination of the above options.</p>
<p>Let's look at an example Pig script that could be run to perform IIS log ingestion. The script loads and filters an IIS log, looking for requests that didn't complete with a status code of 200 or 201.&nbsp;</p>
<p>Note that parameter names in Pig Latin scripts are preceded by a dollar sign, $. For example, the&nbsp;<strong>LOAD</strong>&nbsp;statement references six parameters: $WASB_SCHEME, $ROOT_FOLDER, $YEAR, $MONTH, $DAY and $INPUTFILE.</p>
<p>Note also the script makes use of the&nbsp;<strong>%default</strong>&nbsp;preprocessor statement to define default values for the WASB_SCHEME and ROOT_FOLDER parameters:&nbsp;</p>
<script type="text/javascript" src="https://gist.github.com/dgshaver/162740cf8eb70fc9f862.js"></script>
<h3><strong style="font-size: 1.17em;">Specifying Parameters in a Parameter File</strong></h3>
<p>Parameters are defined as key-value pairs. Below is an example parameter file that defines four parameters referenced by the above script: YEAR, MONTH, DAY and INPUTFILE. The YEAR key has a value of 2014, the MONTH key has a value of 07, the DAY key has a value of 27, and the INPUTFILE key has a value of iis.log:
<script type="text/javascript" src="https://gist.github.com/dgshaver/46eb37ab54cedb0d91e2.js"></script>
</p>
<p>The Pig preprocessor locates parameters in the Pig script by searching for the parameter name prepended with a dollar sign, $, and substitutes the value of the key for the parameter. You can pass the parameter file to Pig using the <strong>-param_file </strong>command line argument:</p>
<table style="width: 982px; height: 57px;" border="1" cellspacing="0" cellpadding="0">
<tbody>
<tr>
<td valign="top" width="623">
<p><span style="font-family: courier new,courier;">pig -param_file&nbsp;d:\users\rdpuser\documents\parameters.txt -f&nbsp;d:\users\rdpuser\documents\LoadLog.pig</span></p>
</td>
</tr>
</tbody>
</table>
<h3><strong>Specifying Parameters on the Command Line</strong></h3>
<p>The second method of passing parameters to your Pig script at execution time is to pass each parameter as a separate key-value pair using individual <strong>-param </strong>arguments.&nbsp;</p>
<table style="width: 984px; height: 42px;" border="1" cellspacing="0" cellpadding="0">
<tbody>
<tr>
<td valign="top" width="623">
<p><span style="font-family: courier new,courier;">pig -param&nbsp;"YEAR=2014" -param "MONTH=07" -param "DAY=27"&nbsp; -param "INPUTFILE=iis.log" -f&nbsp; d:\users\rdpuser\documents\LoadLog.pig</span></p>
<p>&nbsp;</p>
</td>
</tr>
</tbody>
</table>
<p>Note: On Windows, key-value pairs must be enclosed in double quotes because the equals sign, =, is treated as an argument delimiter by the Windows cmd shell.</p>
<h3><strong>Testing Parameter Substitution Using the -dryrun Command Line Option </strong></h3>
<p>Before&nbsp;submitting the Pig script to the cluster's Templeton endpoint for execution using PowerShell, let's make sure that parameter substitution will work as desired.&nbsp; There's a useful Pig command line parameter, <strong>-dryrun</strong>, that can be used to test parameter substitution. &nbsp;The -<strong>dryrun</strong> option directs Pig to substitute parameter values for parameters in the Pig script, write the resulting script to a file named <strong>&lt;original_script_name&gt;</strong>.<strong>substituted</strong> and shut down without executing the script. The best way to try -<strong>dryrun</strong> is to enable remote access to your cluster, and use&nbsp;RDP to log into your HDInsight cluster's active headnode.&nbsp; Once you're logged in,&nbsp;you can execute PIG.CMD interactively as demonstrated below. Pig will report the name and location of the substituted file before it shuts down:</p>
<div>
<table style="border-collapse: collapse;" border="0"><colgroup><col style="width: 875px;" /></colgroup>
<tbody valign="top">
<tr>
<td style="border: 0.5pt solid currentColor; padding-right: 7px; padding-left: 7px;">
<p><span style="font-family: Courier New; font-size: 10pt; background-color: #ffffff;"> C:\apps\dist\pig-0.12.1.2.1.3.0-1887\bin&gt;pig -param_file d:\users\rdpuser\documents\parameters.txt -param "MONTH=08" -param "DAY=24" <span style="background-color: #ffff00;">-dryrun&nbsp;</span> -f d:\users\rdpuser\documents\LoadLog.pig</span></p>
<p><span style="font-family: Courier New; font-size: 10pt;">. . . </span></p>
<p><span style="font-family: Courier New; font-size: 10pt;">2014-08-24 15:58:37,625 [main] INFO&nbsp; org.apache.pig.impl.util.Utils - Default bootup file D:\Users\dansha/.pigbootup not found<br />2014-08-24 15:58:37,638 [main] WARN&nbsp; org.apache.pig.tools.parameters.PreprocessorContext - <span style="background-color: #ffff00;">Warning : Multiple values found for MONTH. Using value 08</span><br />2014-08-24 15:58:37,638 [main] WARN&nbsp; org.apache.pig.tools.parameters.PreprocessorContext - <span style="background-color: #ffff00;">Warning : Multiple values found for DAY. Using value 24</span><br />2014-08-24 15:58:38,305 [main] INFO&nbsp; org.apache.pig.Main - Dry run completed. Substituted pig script is at d:\users\dansha\documents\<span style="background-color: #ffff00;">LoadLog.pig.substituted</span></span></p>
</td>
</tr>
</tbody>
</table>
</div>
<h3><strong>Precedence Rules for Parameter Substitution</strong></h3>
<p>Note the "Warning" messages that showed up in the -<strong>dryrun&nbsp;</strong>output.&nbsp; If a parameter is defined more than once, there are precedence rules that determine what the final value of the parameter will be.&nbsp; The following precedence order is documented in the <a href="http://pig.apache.org/docs/r0.13.0/cont.html">Pig parameter substitution documentation</a>. The list is ordered from highest to lowest precedence.&nbsp;</p>
<ol>
<li>Parameters defined using a <strong>declare</strong> preprocessor statement have the highest precedence</li>
<li>Parameters defined on the command line using -<strong>param</strong> have the second highest precedence</li>
<li>Parameters defined in parameter files have the third highest precedence</li>
<li>Parameters defined using the <strong>default</strong> preprocessor statement have the lowest precedence</li>
</ol>
<p>Given the above precedence rules,&nbsp;even though the MONTH and DAY parameters were defined in the parameter file, the individual command line parameters specified with the <strong>-param </strong>arguments overrode them.</p>
<p>Below please find the content of the <strong>LoadLog.pig.substituted</strong> file that was output by the <strong>-dryrun</strong> command. Note that all parameters were replaced with values. Some parameters were replaced by values specified in the parameter file, some were replaced by parameters passed via the -<strong>param</strong> argument,&nbsp;and others were replaced by parameters defined with the <strong>default </strong>preprocessor statements.</p>
<script type="text/javascript" src="https://gist.github.com/dgshaver/da422800490958ebd6b0.js"></script>
<p>&nbsp;</p>
<h3><strong>Submitting a Pig Job that Uses Parameters with PowerShell </strong></h3>
<p>Now, let's bring it all together with an example that demonstrates how to use the Azure HDInsight&nbsp;PowerShell cmdlets&nbsp;to&nbsp;submit a Pig job that uses&nbsp;command line parameters and a parameter file.</p>
<p>
<script type="text/javascript" src="https://gist.github.com/dgshaver/495cab7fb8d052f1d82c.js"></script>
There are a couple of things in the script that are worth closer examination. First, if the job references any files, they need to be copied to one of the storage accounts the target HDInsight cluster is configured to use. This gives the Templeton server access to the files when setting the job up for execution. For the example we've been referring to, we needed to copy the Pig Latin script, <strong>LoadLog.pig</strong>, and the parameter file, <strong>Parameters.txt</strong>, to Azure blob storage using the <strong>Set-AzureStorageBlobContent</strong> cmdlet.</p>
<div>
<table style="border-collapse: collapse;" border="0"><colgroup><col style="width: 875px;" /></colgroup>
<tbody valign="top">
<tr>
<td style="padding-left: 7px; padding-right: 7px; border: solid 0.5pt;">
<p style="background: white;"><span style="font-family: Lucida Console; font-size: 10pt;"><span style="color: darkgreen;"># Get storage context</span> </span></p>
<p style="background: white;"><span style="font-family: Lucida Console; font-size: 10pt;"><span style="color: orangered;">$AzureStorageContext</span> <span style="color: darkgray;">=</span> <span style="color: blue;">New-AzureStorageContext</span> <span style="color: navy;">-StorageAccountName</span> <span style="color: orangered;">$BlobStorageAccount</span> <span style="color: navy;">-StorageAccountKey</span> <span style="color: orangered;">$PrimaryStorageKey</span> </span></p>
<p style="background: white;"><span style="font-family: Lucida Console; font-size: 10pt;"><span style="color: darkgreen;"># Copy pig script and parameter file up to Azure storage where they can be accessed by the Templeton server while setting up the job for execution</span>&nbsp;</span></p>
<p style="background: white;"><span style="font-family: Lucida Console; font-size: 10pt;"><span style="color: blue;">Set-AzureStorageBlobContent</span> <span style="color: navy;">-File</span> <span style="color: blueviolet;">C:\src\Hadoop\Pig\LoadLog.pig</span> <span style="color: navy;">-BlobType</span> <span style="color: blueviolet;">Block</span> <span style="color: navy;">-Container</span> <span style="color: orangered;">$DefaultStorageContainer</span> <span style="color: navy;">-Context</span> <span style="color: orangered;">$AzureStorageContext</span> <span style="color: navy;">-Blob</span> <span style="color: blueviolet;">http://<span style="color: orangered;">$BlobStorageAccount<span style="color: blueviolet;">.blob.core.windows.net/<span style="color: orangered;">$DefaultStorageContainer<span style="color: blueviolet;">/<span style="color: orangered;">$ScriptsFolder<span style="color: blueviolet;">/<span style="color: orangered;">$ScriptName</span> </span></span></span></span></span></span></span></span></p>
<p style="background: white;"><span style="font-family: Lucida Console; font-size: 10pt;"><span style="color: blue;">Set-AzureStorageBlobContent</span> <span style="color: navy;">-File</span> <span style="color: blueviolet;">C:\src\Hadoop\Pig\ParamFile.txt</span> <span style="color: navy;">-BlobType</span> <span style="color: blueviolet;">Block</span> <span style="color: navy;">-Container</span> <span style="color: orangered;">$DefaultStorageContainer</span> <span style="color: navy;">-Context</span> <span style="color: orangered;">$AzureStorageContext</span> <span style="color: navy;">-Blob</span> <span style="color: blueviolet;">http://<span style="color: orangered;">$BlobStorageAccount<span style="color: blueviolet;">.blob.core.windows.net/<span style="color: orangered;">$DefaultStorageContainer<span style="color: blueviolet;">/<span style="color: orangered;">$ScriptsFolder<span style="color: blueviolet;">/<span style="color: orangered;">$ParamFile</span>&nbsp;</span></span></span></span></span></span></span></span>&nbsp;</p>
</td>
</tr>
</tbody>
</table>
</div>
<h3><strong>Passing Command Line Options via PowerShell </strong></h3>
<p>Passing parameters to Pig jobs via the PowerShell cmdlets can be a bit confusing, and we've received a number of inquiries about how to go about it.&nbsp; Keeping that in mind, the most important thing to "call out" from the job submission script is&nbsp;how to pass parameters to a Pig script using the -<strong>param</strong> and <strong>-param_file</strong> Pig command line arguments.&nbsp;Command line arguments are specified at the time the Pig job is defined with the&nbsp;<strong>New-AzureHDInsightPigJobDefinition </strong>cmdlet. A job's&nbsp;command line arguments must be passed to <strong>New-AzureHDInsightPigJobDefinition</strong> as an array of <strong>String</strong> objects using the -<strong>Arguments </strong>parameter. Each command line element that will be passed to Pig is stored as a separate array entry. This is straightforward for command line options that are "switches" with no associated arguments like "<strong>-verbose</strong>", "<strong>-warning</strong>" and "<strong>-stop_on_failure</strong>"; each of these command line arguments is added as a separate entry to the $pigParams array:</p>
<table style="width: 878px; height: 41px;" border="1" cellspacing="0" cellpadding="0">
<tbody>
<tr>
<td valign="top" width="623">
<p><span style="font-family: Lucida Console; font-size: 10pt;"><span style="color: orangered;">$pigParams</span> <span style="color: darkgray;">=</span> <span style="color: darkred;">"-verbose"<span style="color: darkgray;">,<span style="color: darkred;">"-warning"<span style="color: darkgray;">,<span style="color: darkred;">"-stop_on_failure"</span> </span></span></span></span></span></p>
</td>
</tr>
</tbody>
</table>
<p>However, things get tricky when passing command line arguments that have associated values.&nbsp; Individual&nbsp;Pig parameters are passed using a <strong>-param</strong> command line argument followed directly by its associated key-value pair.&nbsp; The&nbsp;key-value&nbsp;pair is&nbsp;added to the $pigParams array as a separate, but adjacent, array entry.</p>
<p>For example, consider the first line of code below, where the INPUTFILE parameter is added to the $pigParams parameter array. First the command line parameter, "-<strong>param</strong>", is added.&nbsp; Next, the key-value pair associated with the -<strong>param</strong> argument, "INPUTFILE=$InputFile", is added as the adjacent array entry.&nbsp; The pattern simply repeats for each successive command line parameter.</p>
<table style="width: 881px; height: 41px;" border="1" cellspacing="0" cellpadding="0">
<tbody>
<tr>
<td valign="top" width="623">
<p style="background: white;"><span style="font-family: Lucida Console; font-size: 10pt;"><span style="color: orangered;">$pigParams</span> <span style="color: darkgray;">+=</span> <span style="color: darkred;">"-param"<span style="color: darkgray;">,<span style="color: darkred;">"INPUTFILE=<span style="color: orangered;">$InputFile<span style="color: darkred;">"</span> </span></span></span></span></span></p>
<p style="background: white;"><span style="font-family: Lucida Console; font-size: 10pt;"><span style="color: orangered;">$pigParams</span> <span style="color: darkgray;">+=</span> <span style="color: darkred;">"-param"<span style="color: darkgray;">,<span style="color: darkred;">"MONTH=<span style="color: orangered;">$Month<span style="color: darkred;">"</span> </span></span></span></span></span></p>
<p style="background: white;"><span style="font-family: Lucida Console; font-size: 10pt;"><span style="color: orangered;">$pigParams</span> <span style="color: darkgray;">+=</span> <span style="color: darkred;">"-param"<span style="color: darkgray;">,<span style="color: darkred;">"DAY=<span style="color: orangered;">$Day<span style="color: darkred;">"</span> </span></span></span></span></span></p>
</td>
</tr>
</tbody>
</table>
<p>For the parameter file,&nbsp;the "-<strong>param_file</strong>" argument is added to the $pigParams&nbsp;array followed by a separate, but adjacent, array entry that&nbsp;specifies the parameter file name.&nbsp; Finally, the $pigParams are passed to <strong>New-AzureHDInsightPigJobDefinition</strong> using the -<strong>Arguments</strong> parameter.&nbsp;</p>
<div>
<table style="border-collapse: collapse;" border="0"><colgroup><col style="width: 881px;" /></colgroup>
<tbody valign="top">
<tr>
<td style="padding-left: 7px; padding-right: 7px; border: solid 0.5pt;">
<p style="background: white;"><span style="font-family: Lucida Console; font-size: 10pt;"><span style="color: orangered;">$pigParams</span> <span style="color: darkgray;">+=</span> <span style="color: darkred;">"-param_file"<span style="color: darkgray;">,<span style="color: darkred;">"<span style="color: orangered;">$param_file<span style="color: darkred;">"</span> </span></span></span></span></span></p>
<p style="background: white;"><span style="font-family: Lucida Console; font-size: 10pt;"><span style="color: darkgreen;"># Create pig job definition</span> </span></p>
<p style="background: white;"><span style="font-family: Lucida Console; font-size: 10pt;"><span style="color: orangered;">$pigJobDefinition</span> <span style="color: darkgray;">=</span> <span style="color: blue;">New-AzureHDInsightPigJobDefinition</span> <span style="color: navy;">-File</span> <span style="color: orangered;">$PigScript</span> <span style="color: navy;">-Arguments</span> <span style="color: orangered;">$PigParams</span></span></p>
</td>
</tr>
</tbody>
</table>
</div>
<p>The job definition created by <strong>New-AzureHDInsightPigJobDefinition</strong> is then used by&nbsp;the <strong>Start-AzureHDInsightJob</strong> cmdlet to submit&nbsp;the Pig script to the&nbsp;Azure HDInsight cluster for execution:</p>
<table style="width: 885px; height: 56px;" border="1" cellspacing="0" cellpadding="0">
<tbody>
<tr>
<td valign="top" width="623">
<p style="background: white;"><span style="font-family: Lucida Console; font-size: 10pt;"><span style="color: orangered;">$pigJob</span> <span style="color: darkgray;">=</span> <span style="color: blue;">Start-AzureHDInsightJob</span> <span style="color: navy;">-Subscription</span> <span style="color: orangered;">$subscriptionName&nbsp;</span><span style="color: navy;">-Cluster <span style="color: orangered;">$clusterDnsName</span> -JobDefinition <span style="font-family: Lucida Console; font-size: 10pt;"><span style="color: orangered;">$pigJobDefinition</span></span></span></span></p>
</td>
</tr>
</tbody>
</table>
<p>I hope&nbsp;this post clears up&nbsp;questions some have had about&nbsp;how to pass parameters to Pig jobs via PowerShell, and that you found it informative.&nbsp; Please let us know how we are doing, and what kind of content you would like us to write about in the future.</p><div style="clear:both;"></div><img src="http://blogs.msdn.com/aggbug.aspx?PostID=10549470" width="1" height="1">Dan (MSFT)http://blogs.msdn.com/dgshaver_4000_outlook.com/ProfileUrlRedirect.ashxHDInsight: - Creating, Deploying and Executing Pig UDFhttp://blogs.msdn.com/b/bigdatasupport/archive/2014/07/07/hdinsight-creating-deploying-and-executing-pig-udf.aspx2014-07-07T08:05:00Z2014-07-07T08:05:00Z<p>&nbsp;</p>
<p style="text-align: justify;">During my developer experience, I always look for how customization (write my own processing) can be done if functionality is not available in programming language. That thought was triggered again when I was working on Apache Pig in HDInsight. So I started researching it and thought it would be good to share.</p>
<p style="text-align: justify;">In this article I take a basic example (which might be doable with pig scripts alone) to focus on creating, deploying and executing Pig UDF.</p>
<h1>Scenario</h1>
<p style="text-align: justify;">In this article, we'll see how we can evaluate an employee's performance based on the sales they made for given years. Along with that we'll also calculate the commission based on certain condition. If an employee achieves the target then they were allowed for 8% commission else no commission. We will have employee's sales record stores on blob storage.</p>
<p>&nbsp;</p>
<h1>Step 1:- Creating a Pig UDF</h1>
<p>Pig UDFs can be implemented in Java, Python, and JavaScript. We will use Java to build the UDF, and <a href="https://www.eclipse.org/downloads/">Eclipse</a> to write the Java program.</p>
<p>Open Eclipse and click <span style="text-decoration: underline;">F</span>ile -&gt; New -&gt; Java Project</p>
<p><img src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78-metablogapi/8371.070714_5F00_0805_5F00_HDInsightCr1.png" alt="" /></p>
<p>Provide a suitable <strong>Project Name such </strong>as "Commission"<strong> </strong>and click on Finish.</p>
<p><img src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78-metablogapi/4454.070714_5F00_0805_5F00_HDInsightCr2.png" alt="" /></p>
<p>Right click on Project Name -&gt; Select New -&gt; Package</p>
<p><img src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78-metablogapi/2744.070714_5F00_0805_5F00_HDInsightCr3.png" alt="" /></p>
<p>Provide Java Package <strong>Name </strong>such as<strong> salescommission</strong></p>
<p><img src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78-metablogapi/3757.070714_5F00_0805_5F00_HDInsightCr4.png" alt="" /></p>
<p>Right click on package -&gt; Click New -&gt; Click on Class</p>
<p><img src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78-metablogapi/4278.070714_5F00_0805_5F00_HDInsightCr5.png" alt="" /></p>
<p>&nbsp;</p>
<p>Provide a Class <strong>Name </strong>such as calculatecommission and click on Finish</p>
<p><img src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78-metablogapi/3644.070714_5F00_0805_5F00_HDInsightCr6.png" alt="" /></p>
<p style="margin-left: 36pt;">The next step is to add logic in class, which you can see below. The same code can be copied from <a href="https://github.com/rawatsudhir/Samples/blob/master/calculatecommission.java">here</a>.</p>
<p style="margin-left: 36pt;"><img src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78-metablogapi/6327.070714_5F00_0805_5F00_HDInsightCr7.png" alt="" /></p>
<p style="text-align: justify; margin-left: 36pt;">Before we compile this code we need to add reference of Pig jar file for the reason of using some classes in code. One thing you need to be sure of is to add correct reference. I recommend to have the jar file targeting the Pig version that you will be utilizing in Hadoop.</p>
<p style="text-align: justify; margin-left: 36pt;">In HDInsight, we have pig 0.11 version available so I copied <strong>pig-0.11.0.1.3.7.1-01293.jar</strong> from "C:\apps\dist\pig-0.11.0.1.3.7.1-01293" after logging onto an RDP session on the head node. The version number will vary based on the version of HDInsight or Hadoop that you have deployed. You can also download from Apache website as well. If versions are mismatched we might get errors.</p>
<p style="text-align: justify;">Copy that .jar locally to a folder, and reference that. We have to add the reference in the eclipse solution and here how we can do it.</p>
<p><img src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78-metablogapi/7776.070714_5F00_0805_5F00_HDInsightCr8.png" alt="" /></p>
<p>Right click on the project, and select <strong>Build Path</strong> &gt; "<strong>Add External Archives&hellip;</strong>" and select the location where you have copied <strong>pig-0.11.0.1.3.7.1-01293.jar </strong>on<strong> </strong>your local hard drive.</p>
<p>The next step is to generate the JAR. Right-click on the project and click <strong>Export </strong></p>
<p><img src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78-metablogapi/4064.070714_5F00_0805_5F00_HDInsightCr9.png" alt="" /></p>
<p>Select <strong>JAR file</strong> and click Next</p>
<p><img src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78-metablogapi/5141.070714_5F00_0805_5F00_HDInsightCr10.png" alt="" /></p>
<p>Select the package and then select <strong>calculatecommission.java. </strong></p>
<p>Provide an export destination path and click <strong><span style="text-decoration: underline;">N</span>ext.</strong></p>
<p><img src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78-metablogapi/5001.070714_5F00_0805_5F00_HDInsightCr11.png" alt="" /></p>
<p>Use the default settings on the Packaging Options page and click <strong><span style="text-decoration: underline;">N</span>ext</strong></p>
<p><img src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78-metablogapi/3060.070714_5F00_0805_5F00_HDInsightCr12.png" alt="" /></p>
<p>&nbsp;</p>
<h1>Step 2:- Deploying a Pig UDF</h1>
<p>&nbsp;</p>
<p style="text-align: justify; margin-left: 36pt;">Once the .jar file is compiled and generated locally, copy the .jar file onto Microsoft Azure Blob storage. Details of how to copy files into Microsoft Azure Blob storage can be find <a href="http://azure.microsoft.com/en-us/documentation/articles/hdinsight-upload-data/">here</a>.</p>
<p style="text-align: justify; margin-left: 36pt;">As this stage, you may want to copy the data file as well so we can later try it out. I have uploaded a sample data text file <a href="https://github.com/rawatsudhir/Samples/blob/master/salesdata.txt">here</a>.</p>
<p style="text-align: justify; margin-left: 36pt;">In the next step we'll register the jar file and use the function we created.</p>
<h1>Step 3:- Executing a Pig UDF</h1>
<p>&nbsp;</p>
<p style="text-align: justify; margin-left: 36pt;">Before executing the Pig UDF, first register it. This is required because Pig should know which JAR file to load via classloader. You will get error if you try to use a function from a class without registering it. When PIG is running in distributed mode then it sends the JAR files to the cluster. Below is the screenshot of PowerShell and outcome. I have copied the sample PowerShell <a href="https://github.com/rawatsudhir/Samples/blob/master/RegisterAndExecutePigUDFPowershell">here</a>.</p>
<p>&nbsp;&nbsp;&nbsp;&nbsp;<img src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78-metablogapi/4774.070714_5F00_0805_5F00_HDInsightCr13.png" alt="" /></p>
<p>Instead of seeing the result on screen, I stored it in a file by changing the <strong>DUMP B</strong> command from above to <strong>STORE B INTO 'LOCATION_NAME'</strong>.</p>
<p>For example, you can add this line to the Pig Latin in the $QueryString variable in PowerShell by using the + sign to concatenate the strings.</p>
<p style="background: white;"><span style="font-family: Lucida Console;"><span style="font-size: 9pt;">&nbsp;&nbsp;<span style="color: darkred;">"STORE B INTO 'wasb://CONTAINER@STORAGEACCOUNT.blob.core.windows.net/user/pigsampleudf/output/';" </span></span></span></p>
<p>&nbsp;</p>
<p>The output will go into a folder named output, and the file will be named part-m-00000 for example.</p>
<p>&nbsp;</p>
<h1>Step 4:- Analyzing output in Excel 2013</h1>
<p>&nbsp;&nbsp;&nbsp;&nbsp;</p>
<p>Once the output is stored on Microsoft Azure Blob storage, you can open it in Power Query.</p>
<p>First Open Microsoft Excel 2013 and Click on the<strong> Power Query</strong> ribbon.</p>
<p>Next choose to import data <strong>From Other Sources</strong> then choose <strong>From Windows Azure HDInsight </strong></p>
<p><img src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78-metablogapi/6724.070714_5F00_0805_5F00_HDInsightCr14.png" alt="" /></p>
<p>&nbsp;</p>
<p>&nbsp;&nbsp;&nbsp;&nbsp;Provide the Microsoft Azure Blob storage name. This is the storage account name as you would see on your storage page in the Azure portal. All your containers in that account will be shown.</p>
<p><img src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78-metablogapi/7801.070714_5F00_0805_5F00_HDInsightCr15.png" alt="" /></p>
<p>&nbsp;</p>
<p>Once connected, select the container where output is stored.</p>
<p><img src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78-metablogapi/7750.070714_5F00_0805_5F00_HDInsightCr16.png" alt="" /></p>
<p>&nbsp;&nbsp;&nbsp;&nbsp;</p>
<p>Once connected, you will find many files in blob storage, so one way to locate the output is to sort on the <strong>Date modified</strong> column.</p>
<p>&nbsp;&nbsp;&nbsp;&nbsp;<img src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78-metablogapi/6746.070714_5F00_0805_5F00_HDInsightCr17.png" alt="" /></p>
<p>Once you select the output file from our Pig script, let's see how the result is displayed. Notice that year, name, sales and commission are all in one merged column, so the next step is to split that column into separate fields.</p>
<p><img src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78-metablogapi/6545.070714_5F00_0805_5F00_HDInsightCr18.png" alt="" /></p>
<p>&nbsp;&nbsp;&nbsp;&nbsp;Right click on <strong>Column1 </strong>and select <strong>Split Column </strong>and click on <strong>By Delimiter. </strong>Choose Tab as a delimiter.</p>
<p><img src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78-metablogapi/7484.070714_5F00_0805_5F00_HDInsightCr19.png" alt="" /></p>
<p>Once the split is done, the outcome is shown below; notice that all the fields display correctly. Rename the columns as well to make the fields easier to understand.</p>
<p><img src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78-metablogapi/5381.070714_5F00_0805_5F00_HDInsightCr20.png" alt="" /></p>
<p>&nbsp;</p>
<p style="margin-left: 36pt;">Once it's imported, I have add conditional formatting on <strong>Status</strong> and then did sort on <strong>FullName</strong> column. Now I can analyze the employee performance year by year for example if you look for employee "Amy Albert" not able to meet target (Status column shows -1 values) and if you look for employee "Jillian Carson" is able to meet the target all years (Status column shows 1 values).</p>
<p>&nbsp;</p>
<p><img src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78-metablogapi/7343.070714_5F00_0805_5F00_HDInsightCr21.png" alt="" />&nbsp;&nbsp;&nbsp;&nbsp;</p>
<p>&nbsp;</p>
<p>Finally, a few closing notes:</p>
<ul>
<li>Check out the Piggybank JAR file, since it might already have the functions you are looking for.</li>
<li>If a computation needs more memory than is available, data will have to spill to disk (which is expensive). Use the bag data type in such cases, since Pig manages bags and knows how to spill them &ndash; see the sketch after this list.</li>
</ul>
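<p>To illustrate that last point, a UDF that accumulates a large intermediate collection can hold it in a DataBag rather than an ordinary Java list, because Pig's default bag implementation is managed by Pig's memory manager and can spill to disk. The snippet below is only a sketch with made-up names, assuming the UDF's first input field is a bag &ndash;</p>
<p style="background: #e7e6e6;"><span style="font-family: Courier New; font-size: 10pt;"><span style="color: #000080;">import java.io.IOException;<br />import org.apache.pig.EvalFunc;<br />import org.apache.pig.data.BagFactory;<br />import org.apache.pig.data.DataBag;<br />import org.apache.pig.data.Tuple;<br /><br />public class CollectIntoBag extends EvalFunc&lt;DataBag&gt; {<br />&nbsp;&nbsp;&nbsp; @Override<br />&nbsp;&nbsp;&nbsp; public DataBag exec(Tuple input) throws IOException {<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; // A default bag can spill to disk when it grows, unlike an in-memory java.util.List<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; DataBag result = BagFactory.getInstance().newDefaultBag();<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; DataBag incoming = (DataBag) input.get(0); // assumes the first field is a bag<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; for (Tuple t : incoming) {<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; result.add(t); // do any per-tuple work here before adding<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; }<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; return result;<br />&nbsp;&nbsp;&nbsp; }<br />}</span>&nbsp;</span></p>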
<p>Thanks Jason for reviewing this post.</p>
<p>Thanks and Happy learning.</p>
<p>Sudhir Rawat</p>
<p>&nbsp;</p><div style="clear:both;"></div><img src="http://blogs.msdn.com/aggbug.aspx?PostID=10539984" width="1" height="1">sudhirbloghttp://blogs.msdn.com/sudhirrawat1_4000_live.com/ProfileUrlRedirect.ashxHow to use a Custom JSON Serde with Microsoft Azure HDInsighthttp://blogs.msdn.com/b/bigdatasupport/archive/2014/06/18/how-to-use-a-custom-json-serde-with-microsoft-azure-hdinsight.aspx2014-06-18T21:00:00Z2014-06-18T21:00:00Z<p>I had a recent need to parse JSON files using Hive. There were a couple of options that I could use. One is using native Hive JSON function such as <a href="https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-get_json_object" target="_blank">get_json_object</a> and the other is to use a JSON Serde to parse JSON&nbsp;objects containing nested elements with lesser code. I decided to go with the second approach for ease of use when parsing nested elements. Just a note here that this blog post just lists some of the ideas to make the custom JSON Serde with HDInsight, but this is not officially supported by the HDInsight Support team. So, please feel free to test this approach for your use case to see if it fits.</p>
<p>For this exercise, we will be using the custom JSON Serde <a href="https://github.com/rcongiu/Hive-JSON-Serde" target="_blank">https://github.com/rcongiu/Hive-JSON-Serde</a> as referenced in this Hortonworks <a href="http://hortonworks.com/blog/howto-use-hive-to-sqlize-your-own-tweets-part-two-loading-hive-sql-queries/" target="_blank">blog</a>.</p>
<h2><strong>Step 1: Download Hive-JSON-Serde from GitHub and build the target JARs </strong></h2>
<p>The first step is to download the ZIP from GitHub project <a href="https://github.com/rcongiu/Hive-JSON-Serde" target="_blank">here</a>. The screenshot below shows the downloaded file on my Windows machine.</p>
<p><img src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78-metablogapi/3681.061614_5F00_2356_5F00_HowtouseCus1.png" alt="" /></p>
<p>The screenshot below shows the extracted contents from the zip file.</p>
<p><img src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78-metablogapi/3034.061614_5F00_2356_5F00_HowtouseCus2.png" alt="" /></p>
<p>So, we have a pom.xml file that defines the project and the configuration details Maven needs to build it. There is also a src folder that contains all the source files. Now we need to download a tool like&nbsp;Maven to build this project and create the target JARs.</p>
<p>The current version of Maven can be downloaded&nbsp;from <a href="http://maven.apache.org/download.cgi" target="_blank">here</a> in binary ZIP format. The screenshot below shows the contents of the Maven ZIP file extracted into the C:\Tools\Maven folder on my Windows machine.</p>
<p><img src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78-metablogapi/7750.061614_5F00_2356_5F00_HowtouseCus3.png" alt="" /></p>
<p>It is a good idea to&nbsp;point the&nbsp;environment path variable to this folder so mvn can be executed from anywhere in the system. The article <a href="http://java.com/en/download/help/path.xml" target="_blank">here</a> describes how the environment variable path can be set. A tool like <a href="http://curl.haxx.se/" target="_blank">Curl</a>&nbsp;can&nbsp;be used on Windows machines to&nbsp;download a dependency for this JSON Serde project. The latest version of Curl can be downloaded from <a href="http://curl.haxx.se/latest.cgi?curl=win64-nossl" target="_blank">here</a>. The screen clip below shows the commands that need to be executed to build this package&nbsp;as outlined in this Hortonworks blog <a href="http://hortonworks.com/blog/howto-use-hive-to-sqlize-your-own-tweets-part-two-loading-hive-sql-queries/" target="_blank">here</a></p>
<p><img src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78-metablogapi/1057.061614_5F00_2356_5F00_HowtouseCus4.png" alt="" /></p>
<p>So, the command line on a Windows machine with Curl and mvn looks like this now. We need to execute this from the folder that contains the src folder (source files) for the JSON Serde project &ndash; in our case - C:\Blogs\CustomJSONSerde\Hive-JSON-Serde-develop</p>
<p><img src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78-metablogapi/6278.061614_5F00_2356_5F00_HowtouseCus5.png" alt="" /></p>
<p><img src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78-metablogapi/2072.061614_5F00_2356_5F00_HowtouseCus6.png" alt="" /></p>
<p><img src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78-metablogapi/4048.061614_5F00_2356_5F00_HowtouseCus7.png" alt="" width="640" height="43" /></p>
<p>Once the build is successful, you will notice that there is a new target folder created that contains the JAR files.&nbsp;</p>
<p><img src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78-metablogapi/0334.061614_5F00_2356_5F00_HowtouseCus8.png" alt="" /></p>
<p><strong>NOTE: Please see the special note <a href="#HDI31">here</a> on some concepts on pom.xml, and also a section on the changes needed to make this Serde work with Hive 0.13.</strong></p>
<h3><strong>Step 2: Upload these JAR files to WASB and create a customized HDInsight cluster </strong></h3>
<p>For this example, we will create an HDInsight cluster named MyHDI30 with two storage accounts, myhdi30primary and myhdi30libs. We will create a container named install within the myhdi30primary storage account, and a container named libs within the myhdi30libs account. Here is a screen clip of the storage account layout using the <a href="http://msdn.microsoft.com/en-us/library/ff683677.aspx" target="_blank">Server Explorer in Visual Studio.</a></p>
<p>We can go ahead and upload the target JAR files into the libs container in the myhdi30libs storage account using the Server Explorer.</p>
<p><a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78/6763.libs.PNG"><img style="border: 0px;" src="http://blogs.msdn.com/resized-image.ashx/__size/550x0/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78/6763.libs.PNG" alt="" width="2578" height="300" /></a></p>
<p>Now using a PowerShell script as shown below, you can create an Azure HDInsight cluster pointing to these storage accounts, and with a default metastore for simplicity. Please note that if we go with the default metastore, when the cluster is dropped the metastore is also deleted. You would need to provision a custom metastore if you need to retain the metastore post HDInsight cluster deletion.</p>
<p>The PS Script below creates a new HDInsight cluster with a default metastore, pointing to the storage accounts created above.</p>
<script type="text/javascript" src="https://gist.github.com/dharkum/1b42076dd81f46ac094f.js"></script>
<p>Now that we have a customized HDInsight cluster with the external JARs added to the aux JARs path, we are ready to see our JSON files get parsed in action!</p>
<h3><strong>Step 3: See the JSON Serde in action! </strong></h3>
<p>So, for this demo, I am just going to use the example script complex_test.sql that comes along with the JSON Serde. This can be found in the folder where the Serde ZIP file has been extracted to. The screenshot below shows the path on my machine.</p>
<p><img src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78-metablogapi/1263.061614_5F00_2356_5F00_HowtouseCus11.png" alt="" /></p>
<p>So, now we can make a slight change to the complex_test.sql to make it work with Azure HDInsight.</p>
<p>We no longer need the add jar command , since we have already provisioned the HDInsight cluster with the JAR. We also need to modify the LOAD DATA LOCAL INPATH command on that file such that we load from the WASB account, instead of the local file system. A copy of complex_test.sql with these changes is available <a href="https://gist.github.com/dharkum/47092457f279a4b63d7b" target="_blank">here</a>. This file can be uploaded to the default container on the primary storage account. In our case,&nbsp; it is the container install on the primary storage account: myhdi30primary. The complexdata.txt file is found with the Serde files, or it can also be downloaded from <a href="https://gist.github.com/dharkum/f753abc6a189483aab1d" target="_blank">here</a>&nbsp;and can be uploaded to the default container on the primary storage account.</p>
<script type="text/javascript" src="https://gist.github.com/dharkum/47092457f279a4b63d7b.js"></script>
<p>Next we can use PowerShell to test the JSON parsing with Hive. The PowerShell script is below:</p>
<script type="text/javascript" src="https://gist.github.com/dharkum/0a7605ff3db80571b746.js"></script>
<p style="background: white;">We can go ahead and invoke the <a href="https://gist.github.com/dharkum/47092457f279a4b63d7b" target="_blank">complex_test.sql</a> script that creates a Hive table with JSON Serde as the row format, and loads data and does a few selects to output the results form the data.</p>
<p style="background: white;"><a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78/4087.hiveresults.JPG"><img src="http://blogs.msdn.com/resized-image.ashx/__size/550x0/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78/4087.hiveresults.JPG" alt="" width="1017" height="175" border="0" /></a></p>
<p>So, with a very simple test we have seen how easy it is to load an external custom Serde&nbsp;JAR&nbsp;into Azure HDInsight! Hope you found this helpful.</p>
<p>Happy Customizing! <br />Dharshana (@dharshb) <br />Thanks to Gregory Suarez for reviewing this article!</p>
<p><strong><a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78/3808.warning.PNG"><img src="http://blogs.msdn.com/resized-image.ashx/__size/50x0/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78/3808.warning.PNG" alt="" width="20" height="16" border="0" /></a>&nbsp;A note on HDInsight 3.1 Clusters (Hive 0.13 versions and up) </strong></p>
<p>Please note that due to some API changes in Hive 0.13, you may run into build errors when trying to build the project to work with Hive 0.13. One such error when I tried to use this Serde with Hive 0.13 was:</p>
<p><span style="font-size: xx-small;">Caused by: java.lang.NoSuchMethodError: </span><br /><span style="font-size: xx-small;">org.apache.hadoop.hive.serde2.objectinspector.primitive.AbstractPrimitiveJavaObjectInspector.</span><br /><span style="font-size: xx-small;">&lt;init&gt;(Lorg/apache/hadoop/hive/serde2/objectinspector/primitive/PrimitiveObjectInspectorUtils$PrimitiveTypeEntry;)V</span><br /><span style="font-size: xx-small;">&nbsp;at org.openx.data.jsonserde.objectinspector.primitive.JavaStringByteObjectInspector.&lt;init&gt;(JavaStringByteObjectInspector.java:28)</span><br /><span style="font-size: xx-small;">&nbsp;at org.openx.data.jsonserde.objectinspector.JsonObjectInspectorFactory.&lt;clinit&gt;(JsonObjectInspectorFactory.java:174)</span></p>
<p>There are a couple of concepts that I would like to illustrate below, which in turn, will give some ideas for adapting this Serde for any future versions of Hive.</p>
<h4><strong>a. Pom.xml </strong></h4>
<p>Pom.xml defines the version of Hive that is used by Maven. We can change the version to pull the Hortonworks Hive version &ndash; the modification to the file is outlined below &ndash;</p>
<p><code class="html">&lt;properties&gt; <br /> &lt;project.build.sourceEncoding&gt;UTF-8&lt;/project.build.sourceEncoding&gt; <br /> &lt;cdh.version&gt;0.13.0.2.1.2.0-402&lt;/cdh.version&gt; <br /> &lt;/properties&gt;&nbsp;<br /></code></p>
<p>You can find the different versions in greater detail here - <a href="http://repo.hortonworks.com/content/repositories/releases/org/apache/hive/hive-serde/maven-metadata.xml">http://repo.hortonworks.com/content/repositories/releases/org/apache/hive/hive-serde/maven-metadata.xml</a></p>
<p>Next,&nbsp;the dependency&nbsp;of Hadoop-core&nbsp;can be changed&nbsp;to point to the right version. You can see the different versions from here - <a href="http://repo.hortonworks.com/content/repositories/releases/org/apache/hadoop/hadoop-core/maven-metadata.xml">http://repo.hortonworks.com/content/repositories/releases/org/apache/hadoop/hadoop-core/maven-metadata.xml</a><code class="html"></code></p>
<p><code class="html">&lt;dependency&gt; <br /> &lt;groupId&gt;org.apache.hadoop&lt;/groupId&gt; <br /> &lt;artifactId&gt;hadoop-core&lt;/artifactId&gt; <br /> &lt;version&gt;1.2.0.23&lt;/version&gt; <br /> &lt;scope&gt;provided&lt;/scope&gt; <br /> &lt;/dependency&gt;&nbsp;<br /></code></p>
<p>Finally, the repository can be modified to&nbsp;point to Hortonworks Maven Repo.</p>
<p><code class="html">&lt;repository&gt; <br />&lt;id&gt;Hortonworks&lt;/id&gt; <br />&lt;name&gt;Hortonworks Maven Repo&lt;/name&gt; <br />&lt;url&gt;http://repo.hortonworks.com/content/repositories/releases/&lt;/url&gt; <br />&lt;/repository&gt; </code></p>
<p class="scroll"><strong>Note</strong>: The modified pom.xml for Hive 0.13 can be found <a title="pom.xml" href="https://gist.github.com/dharkum/c1f53165abf658464395#file-pom-xml" target="_blank">here</a></p>
<h4 class="scroll">b. Source</h4>
<p class="scroll">Now, a few changes need to be made to the source code files to make it work with Hive 0.13. This specific issue is being tracked here - <a href="https://github.com/rcongiu/Hive-JSON-Serde/pull/64">https://github.com/rcongiu/Hive-JSON-Serde/pull/64</a>. The pull request outlines the source code changes that can be made to make it work with Hive 0.13. Once those source code changes are complete, the source builds without any errors! NOTE: the JARs that are built from this version of the source will not be compatible with older versions of Hive (Hive 0.12 and before).</p>
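<p>Once the source builds, the rebuilt JAR can be tried out against an HDInsight cluster from Azure PowerShell. Below is a minimal sketch, not the exact steps from this post; the cluster name, JAR location, table definition and data path are all hypothetical placeholders.</p>
<pre class="scroll"><code class="powershell"># Minimal sketch - cluster name, JAR path, table and data location are hypothetical
Use-AzureHDInsightCluster -Name "mycluster"
Invoke-Hive -Query @"
ADD JAR wasb:///jars/json-serde-with-dependencies.jar;
CREATE EXTERNAL TABLE IF NOT EXISTS sample_json (id string, msg string)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 'wasb:///data/json/';
"@
</code></pre>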
<p class="scroll"><strong>Note: </strong>The overall source changes that would need to be done are outlined in the pull request here - <a href="https://github.com/rcongiu/Hive-JSON-Serde/pull/64/files">https://github.com/rcongiu/Hive-JSON-Serde/pull/64/files</a>. Thanks to this pull request, I was able to build the source to make it work with Hive 0.13.</p><div style="clear:both;"></div><img src="http://blogs.msdn.com/aggbug.aspx?PostID=10534811" width="1" height="1">Dharshana_Bharadwajhttp://blogs.msdn.com/dharshana_5F00_bharadwaj_4000_hotmail.com/ProfileUrlRedirect.ashxSome Frequently Asked Questions on Microsoft Azure HDInsighthttp://blogs.msdn.com/b/bigdatasupport/archive/2014/05/22/some-frequently-added-questions-on-microsoft-azure-hdinsight.aspx2014-05-22T19:00:00Z2014-05-22T19:00:00Z<p>&nbsp;</p>
<p>We have seen some common questions on HDInsight when interacting with customers and partners.&nbsp;In this blog post, we are going to help answer some of those common questions.</p>
<p><strong>1. What is Microsoft Azure HDInsight?</strong></p>
<p>HDInsight is a Hadoop-based service from Microsoft that brings a 100 percent Apache Hadoop solution to the cloud. Through deep integration with BI tools such as PowerPivot and Power View, HDInsight enables end users to easily gain insight into big data. HDInsight also makes it very easy to deploy a Hadoop cluster within minutes with a few clicks. It also provides programmatic interfaces such as PowerShell and the .NET SDK for customized cluster provisioning.&nbsp;Please visit the landing page of the Microsoft Azure HDInsight Service for more information <a href="http://azure.microsoft.com/en-us/services/hdinsight/" target="_blank">here</a>. Also, here is a nice article written by Dan that gives an architectural overview of Microsoft Azure HDInsight <a href="http://blogs.msdn.com/b/bigdatasupport/archive/2013/11/01/the-hdinsight-support-team-is-open-for-business.aspx" target="_blank">here</a>.</p>
<p><strong>2. Where can I find product documentation on Microsoft Azure HDInsight?</strong></p>
<p>Please visit our product documentation page at <a href="http://azure.microsoft.com/en-us/documentation/services/hdinsight/" target="_blank">http://azure.microsoft.com/en-us/documentation/services/hdinsight/</a> , where you will find working samples and demos to provision and interact with HDInsight. Please visit our Big Data support page at&nbsp; <a href="http://blogs.msdn.com/b/bigdatasupport/" target="_blank">http://blogs.msdn.com/b/bigdatasupport/</a> to read some articles on HDInsight and also Hadoop core topics.</p>
<p><strong>3. How do I get Support for HDInsight?</strong></p>
<p>There are several options to get support for HDInsight. The&nbsp;technical product and billing support options available&nbsp;are detailed <a title="Azure Support" href="http://azure.microsoft.com/en-us/support/options/" target="_blank">here</a>. You can also access our online forums <a title="Azure HDInsight Forums" href="http://social.msdn.microsoft.com/Forums/windowsazure/en-US/home?forum=hdinsight" target="_blank">here</a>&nbsp;to collaborate with the rest of the community on Azure HDInsight-related questions. A service dashboard is available <a title="Azure Service Dashboard" href="http://azure.microsoft.com/en-us/support/service-dashboard/" target="_blank">here</a>, which shows the current health of all Azure services including Microsoft Azure HDInsight.&nbsp;</p>
<p><strong>4. What versions of Hadoop and HDP are available on Microsoft Azure HDInsight?</strong></p>
<p>The version page <a href="http://azure.microsoft.com/en-us/documentation/articles/hdinsight-component-versioning/" target="_blank">here</a>&nbsp;is updated with the new features on the cluster versions provided by HDInsight.</p>
<p><strong>5. How do I provision an HDInsight cluster?</strong></p>
<p>The blog post <a title="Provision HDInsight Cluster" href="http://blogs.msdn.com/b/bigdatasupport/archive/2014/05/09/hdinsight-news-new-videos-to-watch-hdinsight-provisioning-demonstrations.aspx" target="_blank">here</a>&nbsp;outlines the two common approaches for provisioning HDInsight: the Management Portal and PowerShell. It is also possible to create a customized HDInsight cluster with custom configuration options to suit your needs. This blog post <a href="http://blogs.msdn.com/b/bigdatasupport/archive/2014/04/15/customizing-hdinsight-cluster-provisioning-via-powershell-and-net-sdk.aspx" target="_blank">here</a>&nbsp;from Azim&nbsp;outlines a PowerShell approach for creating a customized cluster.</p>
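<p>For reference, below is a minimal Azure PowerShell sketch of provisioning a cluster; the subscription, storage account, container and cluster names are placeholders, and the customization options are covered in the linked posts.</p>
<pre class="scroll"><code class="powershell"># Minimal sketch - all names below are placeholders
Select-AzureSubscription -SubscriptionName "My Subscription"
$storageAccount = "mystorageaccount"
$storageKey     = Get-AzureStorageKey $storageAccount | %{ $_.Primary }
$creds          = Get-Credential    # admin user name and password for the new cluster

New-AzureHDInsightCluster -Name "mycluster" -Location "North Europe" `
    -DefaultStorageAccountName "$storageAccount.blob.core.windows.net" `
    -DefaultStorageAccountKey $storageKey `
    -DefaultStorageContainerName "myhdicontainer" `
    -ClusterSizeInNodes 4 -Credential $creds
</code></pre>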
<p><strong>6. Where does HDInsight store data, and metadata?</strong></p>
<p>HDInsight supports both HDFS and Windows Azure Storage BLOB (WASB) for data storage. However, using WASB is recommended due to several benefits like data reuse and sharing, archiving, storage cost, and elastic scale out as described in more detail <a href="http://azure.microsoft.com/en-us/documentation/articles/hdinsight-use-blob-storage/" target="_blank">here&nbsp;</a>. By default, the metastore for Hive and Oozie is provisioned on&nbsp;an Azure SQL Database.</p>
<p><strong>7. Is the WASB storage account retained even when the HDInsight cluster is dropped?</strong></p>
<p>Yes, the WASB storage account is retained even when the HDInsight cluster is dropped. This is really convenient because the same Azure storage account can be attached to another HDInsight cluster to reuse the data in it. This helps save compute hours when the cluster is not needed.</p>
<p><strong>8. Is the metastore database on Azure SQL Database&nbsp;retained when the HDInsight cluster is dropped?</strong></p>
<p>The short answer is no. By default, the metastore databases on&nbsp;Azure SQL Database&nbsp;are dropped when the HDInsight cluster is dropped. However, when you provision the HDInsight cluster using the custom create option, you can explicitly ask for a custom metastore, as shown in the screenshot below.</p>
<p><a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78/1614.CustomMetastore.png"><img src="http://blogs.msdn.com/resized-image.ashx/__size/550x0/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78/1614.CustomMetastore.png" alt="" border="0" /></a></p>
<p>If you have explicitly defined a custom metastore when provisioning the HDInsight cluster, then the metastore database is also retained when the cluster is dropped.</p>
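<p>If you provision with PowerShell rather than the portal, the custom metastore can be specified at cluster creation time as well. The following is a rough sketch under assumed names (server, database, storage account and cluster names are placeholders).</p>
<pre class="scroll"><code class="powershell"># Rough sketch - server, database, storage and cluster names are placeholders
$storageKey     = Get-AzureStorageKey "mystorageaccount" | %{ $_.Primary }
$metastoreCreds = Get-Credential    # SQL login with access to the metastore database

New-AzureHDInsightClusterConfig -ClusterSizeInNodes 4 |
    Set-AzureHDInsightDefaultStorage -StorageAccountName "mystorageaccount.blob.core.windows.net" `
        -StorageAccountKey $storageKey -StorageContainerName "myhdicontainer" |
    Add-AzureHDInsightMetastore -SqlAzureServerName "myserver.database.windows.net" `
        -DatabaseName "hivemetastore" -Credential $metastoreCreds -MetastoreType HiveMetastore |
    New-AzureHDInsightCluster -Name "mycluster" -Location "North Europe" -Credential (Get-Credential)
</code></pre>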
<p><strong>9. Is there a connect site for requesting or voting on feature requests for Microsoft Azure HDInsight?</strong></p>
<p>Yes, you can vote for feature requests or add your feature request <a href="http://feedback.azure.com/forums/217335-hdinsight" target="_blank">here</a>&nbsp;for Microsoft Azure HDInsight.</p>
<p><strong>10. Is there&nbsp;a System Center&nbsp;Management Pack available for HDInsight?</strong></p>
<p>Yes, it is available for download <a href="http://www.microsoft.com/en-us/download/details.aspx?id=42521" target="_blank">here</a></p>
<p><strong>11. How do I connect BI tools to HDInsight for gaining insights into the data?</strong></p>
<p>You can download Microsoft Hive ODBC driver from <a href="http://www.microsoft.com/en-us/download/details.aspx?id=40886" target="_blank">here</a>&nbsp;to connect from Excel, or PowerPivot to gain insight. Please find the walkthrough on how to connect Excel to HDInsight using Hive ODBC driver <a href="http://azure.microsoft.com/en-us/documentation/articles/hdinsight-connect-excel-hive-odbc-driver/" target="_blank">here</a></p>
<p><strong>12. How do I access logs on HDInsight?</strong></p>
<p>Hadoop service logs and Templeton logs, among other things, are stored in the Windows Azure Storage BLOB account. Brian's blog <a href="http://blogs.msdn.com/b/brian_swan/archive/2014/01/06/accessing-hadoop-logs-in-hdinsight.aspx" target="_blank">here</a>&nbsp;goes over&nbsp;logging on HDInsight in more detail.</p>
<p><strong>13. How can&nbsp;an HDInsight cluster be upgraded?</strong></p>
<p>Upgrading an HDInsight cluster is very simple to do. If you have plugged in Azure BLOB storage account(s) and a custom metastore to HDInsight, all you need to do is drop the existing HDInsight cluster, create a new HDInsight cluster on the version you need, and plug in the existing Azure BLOB Storage account(s) and the metastore to get an upgraded HDInsight cluster! Note that this is possible only if the data and metastore&nbsp;are externalized from the HDInsight compute cluster by using Azure BLOB storage for the file system and an Azure SQL database for the metastores.</p>
<p><strong>14. What are the different options available to move data into a Windows Azure Storage BLOB account? </strong></p>
<p>If you are looking to move a large amount of data, you can use the Microsoft Azure Import/Export Service to transfer data to Azure&nbsp;BLOB storage. More details on that&nbsp;are available <a href="http://azure.microsoft.com/en-us/documentation/articles/storage-import-export-service/" target="_blank">here</a>. For smaller incremental data,&nbsp;uploads can be scheduled into Azure BLOB storage using one of these tools as detailed on this article <a href="http://azure.microsoft.com/en-us/documentation/articles/hdinsight-upload-data/" target="_blank">here</a>. Also, ExpressRoute enables a faster, private connection into Azure. A technical overview on ExpressRoute can be found <a href="http://msdn.microsoft.com/en-us/library/azure/dn606309.aspx" target="_blank">here</a>.</p>
<p><strong>15. Is there a local development platform available for HDInsight?</strong></p>
<p>Yes, the HDInsight emulator provides a local development platform and comes with the same components from the Hadoop ecosystem as Azure HDInsight. It is available for download <a href="http://www.microsoft.com/web/gallery/install.aspx?appid=HDINSIGHT" target="_blank">here</a>. Some samples for working with the emulator are available <a href="http://azure.microsoft.com/en-us/documentation/articles/hdinsight-get-started-emulator/" target="_blank">here</a>.</p>
<p>This concludes the list of some common questions for this post. Hope&nbsp;you find this helpful!&nbsp;</p>
<p>Thank you.</p>
<p>Dharshana Bharadwaj (@dharshb)</p>
<p>Thanks to <a href="http://social.msdn.microsoft.com/profile/jason%20h%20(hdinsight)/" target="_blank">JasonH</a> for reviewing this!</p>
<p>&nbsp;</p><div style="clear:both;"></div><img src="http://blogs.msdn.com/aggbug.aspx?PostID=10525961" width="1" height="1">Dharshana_Bharadwajhttp://blogs.msdn.com/dharshana_5F00_bharadwaj_4000_hotmail.com/ProfileUrlRedirect.ashxHDInsight News - New Videos to watch - HDInsight Provisioning demonstrationshttp://blogs.msdn.com/b/bigdatasupport/archive/2014/05/09/hdinsight-news-new-videos-to-watch-hdinsight-provisioning-demonstrations.aspx2014-05-09T20:17:00Z2014-05-09T20:17:00Z<h2>Check out these two recent video&nbsp;demos regarding HDInsight provisioning</h2>
<p>These videos complement the product documentation outlined at <a href="http://azure.microsoft.com/en-us/documentation/articles/hdinsight-get-started/#provision">http://azure.microsoft.com/en-us/documentation/articles/hdinsight-get-started/#provision</a></p>
<p><em><strong>HDInsight</strong></em> is the name given to the Microsoft Azure service (in the Microsoft cloud data centers) running the Hortonworks Data Platform distribution of Apache Hadoop on Microsoft Windows.</p>
<p><em><strong>Provisioning</strong></em> is the word we use to describe the initial setup process&nbsp;of allocating all the compute resources, network, and storage&nbsp;in a Microsoft Azure data center to prepare a Hadoop installation automatically for you. With just a few clicks, or by filling in a few variables in a PowerShell script, you can easily allocate a whole Hadoop cluster with many worker nodes. A few minutes after submitting the provisioning request, you'll have a ready-to-use Hadoop cluster that you can connect to and run work on.</p>
<p>&nbsp;</p>
<h3>There are two ways to&nbsp;create HDInsight clusters shown in these videos:</h3>
<p>1. The first video shows how to provision HDInsight Hadoop clusters using the Azure&nbsp;Management Portal by interactively clicking through the web site.</p>
<p><iframe src="http://www.youtube.com/embed/JaW_uACHg10" frameborder="0" width="640" height="390"></iframe></p>
<p>2. The second video here shows how to provision HDInsight Hadoop clusters using Azure PowerShell cmdlets from a client computer. This means you can automate a script to quickly delete and create HDInsight clusters.</p>
<p><iframe src="http://www.youtube.com/embed/RxY_QgD-2Os" frameborder="0" width="640" height="390"></iframe></p>
<p>&nbsp;</p>
<p>&nbsp;</p>
<h2>HDInsight v3.0 == Hadoop v2.2</h2>
<p>This video from the HDInsight product group Lead PM, Matt Winkler, shows much of the new stuff in Hadoop 2.2 which is included in HDInsight v 3.0. You can pick this version when you provision a new cluster as shown above in the provisioning videos.</p>
<p>The complementary documentation for this topic is here <a href="http://azure.microsoft.com/en-us/documentation/articles/hdinsight-component-versioning/">http://azure.microsoft.com/en-us/documentation/articles/hdinsight-component-versioning/</a>,&nbsp;which explains what's in each version number.</p>
<p><a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78/5531.HDInsightDeploymentScreenshots.jpg"><img src="http://blogs.msdn.com/resized-image.ashx/__size/550x0/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78/5531.HDInsightDeploymentScreenshots.jpg" alt="" width="447" height="121" border="0" /></a></p>
<p>&nbsp;</p>
<p>Watch the video here: <a href="http://channel9.msdn.com/Events/Build/2014/3-612">http://channel9.msdn.com/Events/Build/2014/3-612</a></p>
<p><iframe style="height: 360px; width: 640px;" src="http://channel9.msdn.com/Events/Build/2014/3-612/player?h=360&amp;w=640" frameborder="0" scrolling="no"></iframe></p><div style="clear:both;"></div><img src="http://blogs.msdn.com/aggbug.aspx?PostID=10524158" width="1" height="1">Jason H (Azure)http://blogs.msdn.com/Jason-Howell/ProfileUrlRedirect.ashxHDInsight: - backup and restore hive tablehttp://blogs.msdn.com/b/bigdatasupport/archive/2014/05/01/hdinsight-backup-and-restore-hive-table.aspx2014-05-01T13:15:00Z2014-05-01T13:15:00Z<p><span style="font-size: small;"><strong>Introduction</strong></span></p>
<p style="text-align: justify;"><span style="font-size: small;">My name is Sudhir Rawat and I work on the Microsoft HDInsight support team. In this blog I am going to explain the options for backing up and restoring a Hive table on HDInsight. The general recommendation is to store the Hive metadata on SQL Azure when provisioning the cluster. Sometimes we may have many Hive tables in an HDInsight cluster and over time may need to increase the number of nodes in the cluster to provide more computation power. Currently, to change the number of compute nodes in an HDInsight cluster, we have to provision a fresh cluster. So the question is how to move the metastore and what options are available. There are various options to achieve a reusable metastore.</span></p>
<p><span style="font-size: small;"><a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78/2235.1.png"><img src="http://blogs.msdn.com/resized-image.ashx/__size/550x0/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78/2235.1.png" alt="" width="834" height="215" border="0" /></a></span></p>
<p>&nbsp;</p>
<p><span style="font-size: small;"><strong>Saving metadata using &ldquo;Enter the Hive/Oozie Metastore&rdquo;&nbsp; </strong></span></p>
<p style="text-align: justify;"><span style="font-size: small;">As mentioned earlier, this is the recommended approach. In this scenario you must specify an existing Azure SQL Database in the same data center as the HDInsight cluster. The provisioning process will automatically create the necessary Hive metastore tables from scratch in the database.</span><br /><span style="font-size: small;">This option will save the Hive metadata on SQL Azure and will save a good amount of time.</span></p>
<p style="text-align: justify;"><span style="font-size: small;">Below is the screenshot of the option which will be available during provisioning cluster in the Azure Portal.</span></p>
<p><span style="font-size: small;"><a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78/7077.2.png"><img src="http://blogs.msdn.com/resized-image.ashx/__size/550x0/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78/7077.2.png" alt="" border="0" /></a></span></p>
<p>&nbsp;</p>
<p>&nbsp;</p>
<p><span style="font-size: small;">Go to the Azure portal, click &ldquo;SQL DATABASES&rdquo; and create a new one. Make sure the Azure SQL Database is created in the same region where we will provision the cluster.</span></p>
<p><span style="font-size: small;"><a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78/6472.3.png"><img src="http://blogs.msdn.com/resized-image.ashx/__size/550x0/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78/6472.3.png" alt="" border="0" /></a></span></p>
<p>&nbsp;</p>
<p>&nbsp;</p>
<p style="text-align: justify;"><span style="font-size: small;">Now that we have a SQL Database ready, go ahead and provision the HDInsight cluster. When choosing the configuration options, at the second step select the checkbox &ldquo;Enter the Hive/Oozie Metastore&rdquo;. In the &ldquo;METASTORE DATABASE&rdquo; dropdown, choose the database which we created earlier.</span></p>
<p><span style="font-size: small;"><a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78/0268.4.png"><img src="http://blogs.msdn.com/resized-image.ashx/__size/550x0/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78/0268.4.png" alt="" border="0" /></a></span></p>
<p>&nbsp;</p>
<p style="text-align: justify;"><span style="font-size: small;">Once the cluster is provisioned successfully, upload your data files to Windows Azure blob storage. One of the ways to do it is through PowerShell which is described <a href="http://www.windowsazure.com/en-us/documentation/articles/hdinsight-upload-data/">here</a>.</span></p>
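<p><span style="font-size: small;">As an illustration, a single file can be uploaded to the cluster's container with the Azure storage cmdlets; the storage account, key, container and paths below are placeholders.</span></p>
<pre class="scroll"><code class="powershell"># Placeholders throughout - adjust to your storage account and container
$ctx = New-AzureStorageContext -StorageAccountName "mystorageaccount" `
        -StorageAccountKey "STORAGE_ACCOUNT_KEY"

Set-AzureStorageBlobContent -File "C:\data\sample.txt" `
    -Container "myhdicontainer" -Blob "data/sample.txt" -Context $ctx
</code></pre>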
<p style="text-align: justify;"><span style="font-size: small;">The next step is to create a Hive table manually. Below is the PowerShell script for creating the Hive table.</span></p>
<p><span style="font-size: small;"><a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78/5466.1.png"><img src="http://blogs.msdn.com/resized-image.ashx/__size/550x0/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78/5466.1.png" alt="" width="779" height="183" border="0" /></a></span></p>
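<p><span style="font-size: small;">Since the script above is shown only as a screenshot, here is a rough sketch of what such a script could look like using the HDInsight Hive cmdlets; the cluster name, table definition and data path are hypothetical and not necessarily what was used in this post.</span></p>
<pre class="scroll"><code class="powershell"># Hypothetical names - not necessarily the exact script shown in the screenshot
Use-AzureHDInsightCluster -Name "mycluster"
Invoke-Hive -Query @"
CREATE EXTERNAL TABLE IF NOT EXISTS salesdata (
    transaction_id string,
    amount double
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 'wasb:///data/salesdata/';
"@
</code></pre>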
<p><span style="font-size: small;">Now, let&rsquo;s query the table to verify it exists and the command works as expected.</span></p>
<p><span style="font-size: small;"><span style="font-size: small;"><a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78/4477.1.png"><img src="http://blogs.msdn.com/resized-image.ashx/__size/550x0/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78/4477.1.png" alt="" width="939" height="321" border="0" /></a></span></span></p>
<p><span style="font-size: small;">If I look at the SQL Azure Database and query the &ldquo;TBLS&rdquo; table, it shows there is one external table created.</span></p>
<p><span style="font-size: small;"><a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78/3386.1.png"><img src="http://blogs.msdn.com/resized-image.ashx/__size/550x0/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78/3386.1.png" alt="" width="928" height="302" border="0" /></a></span></p>
<p style="text-align: justify;"><span style="font-size: small;">Next, we will delete the HDInsight cluster and provision a new one with more nodes. During this cluster creation, make sure to select existing storage and &ldquo;Enter the Hive/Oozie Metastore&rdquo;.&nbsp; Once the new cluster is ready, query the table again and you will be able to retrieve results. Notice below the cluster name is different.</span></p>
<p><span style="font-size: small;"><a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78/4774.1.png"><img src="http://blogs.msdn.com/resized-image.ashx/__size/550x0/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78/4774.1.png" alt="" width="922" height="407" border="0" /></a></span></p>
<p>&nbsp;</p>
<p><span style="font-size: small;"><strong>Hive Import/Export</strong></span></p>
<p style="text-align: justify;"><span style="font-size: small;">The export/import feature in Apache Hive was introduced in version 0.8.0. The Hive export command allows us to take a backup of table or partition data along with the metadata. The Hive import command allows us to restore the metadata and data in a new cluster. More information about this addition can be found <a href="https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ImportExport">here</a>.</span></p>
<p><span style="font-size: small;"><strong>Hive Import/Export before HDI 3.0 Cluster</strong></span></p>
<p style="text-align: justify;"><span style="font-size: small;">If we are at the stage of provisioning the cluster, we can add the Hive configuration via a PowerShell script. To do that, please look at the PowerShell script <a href="https://github.com/rawatsudhir/Samples/blob/master/CreateHDInsightClusterusingHiveconfig">here</a>. The script will create the Azure BLOB storage and the cluster along with the changes in the Hive configuration. These changes are not required when provisioning an HDI 3.0 cluster.</span></p>
<p style="text-align: justify;"><span style="font-size: small;">Let us look at a slightly different scenario. This is the case of an operational cluster that already exists and was originally created without specifying an explicit Hive/Oozie metastore (that is, the &ldquo;Enter the Hive/Oozie Metastore&rdquo; checkbox was left unchecked). If there is a need to recreate this cluster with more nodes, then the Hive export/import commands can be configured and used.</span></p>
<p style="text-align: justify;"><span style="font-size: small;">Please note that this is not a recommended or supported approach. The reason is that when the HDInsight cluster gets reimaged, such configuration changes will not be retained. This scenario is described in more detail in our other blog post <a href="http://blogs.msdn.com/b/bigdatasupport/archive/2013/11/01/the-hdinsight-support-team-is-open-for-business.aspx">here</a>.</span></p>
<p style="text-align: justify;"><span style="font-size: small;">Now follow the steps below to get export/import working on an existing HDInsight cluster.</span></p>
<p>&nbsp;</p>
<p><span style="font-size: small;">1. Remote desktop (RDP) to the headnode and add the following property in hive-site.xml under &ldquo;C:\apps\dist\hive-0.11.0.1.3.2.0-05\conf\&rdquo;.</span></p>
<pre class="scroll"><code class="html">&lt;property&gt;
  &lt;name&gt;hive.exim.uri.scheme.whitelist&lt;/name&gt;
  &lt;value&gt;wasb,hdfs,pfile&lt;/value&gt;
&lt;/property&gt;
</code></pre>
<p>&nbsp;</p>
<p><span style="font-size: small;">2. Restart the hivemetastore service. To do that, remote (RDP) to the cluster, open the &ldquo;Hadoop Command Line&rdquo; and run the command below.&nbsp;</span></p>
<p><span style="font-size: small;"><a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78/8585.1.png"><img src="http://blogs.msdn.com/resized-image.ashx/__size/550x0/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78/8585.1.png" alt="" width="803" height="97" border="0" /></a></span></p>
<p><span style="font-size: small;">It will stop all the services related to Hive. Below is a screenshot of it.</span></p>
<p><span style="font-size: small;"><a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78/5315.1.png"><img src="http://blogs.msdn.com/resized-image.ashx/__size/550x0/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78/5315.1.png" alt="" width="812" height="229" border="0" /></a></span></p>
<p><span style="font-size: small;">To start all the services related to Hive, run the command below.</span></p>
<p><span style="font-size: small;"><a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78/7534.1.png"><img src="http://blogs.msdn.com/resized-image.ashx/__size/550x0/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78/7534.1.png" alt="" width="806" height="68" border="0" /></a></span></p>
<p><span style="font-size: small;">It will start all the services related to Hive. Below is a screenshot of it.</span></p>
<p><span style="font-size: small;"><a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78/7433.1.png"><img src="http://blogs.msdn.com/resized-image.ashx/__size/550x0/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78/7433.1.png" alt="" width="799" height="231" border="0" /></a></span></p>
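<p><span style="font-size: small;">The commands above are shown as screenshots of the Hadoop Command Line; equivalently, from an elevated PowerShell prompt on the headnode, the Hive-related Windows services can be listed and restarted. The exact service names can vary by HDP version, so list them first.</span></p>
<pre class="scroll"><code class="powershell"># Run on the headnode in an elevated PowerShell prompt
Get-Service -Name "hive*"                    # list the Hive-related services and their status
Get-Service -Name "hive*" | Restart-Service  # restart them (equivalent to the stop/start shown above)
</code></pre>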
<p>&nbsp;</p>
<p style="text-align: justify;"><span style="font-size: small;">Once done, run the export command in the Hive CLI: &ldquo;export table table_name to &lsquo;wasb://&lt;Container_Name&gt;@&lt;Storage_Name&gt;.blob.core.windows.net/&lt;Folder_Name&gt;&rsquo;;&rdquo;</span></p>
<p style="text-align: justify;"><span style="font-size: small;">Here is the outcome when I did <strong>export</strong> on external table.</span></p>
<p><span style="font-size: small;"><a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78/8750.1.png"><img src="http://blogs.msdn.com/resized-image.ashx/__size/550x0/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78/8750.1.png" alt="" width="894" height="239" border="0" /></a></span></p>
<p>&nbsp;</p>
<p style="text-align: justify;"><span style="font-size: small;">Once the Hive <strong>export </strong>is done, I see several files in the blob storage. You will notice two files: the first one for metadata (_metadata is the file name), and the second is the data file itself. I used a tool <a href="http://azureblobmanager.codeplex.com/">Azure Blob Manager</a> for navigation because I like the functionality of searching blobs. However, you can use whatever tool you are comfortable with.</span></p>
<p><span style="font-size: small;"><a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78/7658.1.png"><img src="http://blogs.msdn.com/resized-image.ashx/__size/550x0/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78/7658.1.png" alt="" width="880" height="208" border="0" /></a></span></p>
<p style="text-align: justify;"><span style="font-size: small;">Now let&rsquo;s say I create a new cluster and attach it to the existing storage (where I exported the metadata and data). Below is a screenshot of the tables in the newly created cluster.</span></p>
<p><span style="font-size: small;"><a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78/4186.1.png"><img src="http://blogs.msdn.com/resized-image.ashx/__size/550x0/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78/4186.1.png" alt="" width="880" height="104" border="0" /></a></span></p>
<p>&nbsp;</p>
<p><span style="font-size: small;">Below is the screenshot when I run the Hive <strong>import</strong> command followed by select command to check all the records.</span></p>
<p><span style="font-size: small;"><a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78/2330.1.png"><img src="http://blogs.msdn.com/resized-image.ashx/__size/550x0/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78/2330.1.png" alt="" width="876" height="284" border="0" /></a></span></p>
<p>&nbsp;</p>
<p style="text-align: justify;"><span style="font-size: small;">One thing you will notice is there are three copies of the data on blob storage &ndash;</span></p>
<p style="text-align: justify;">&nbsp;<span style="font-size: small;">1. The original location</span></p>
<p style="text-align: justify;"><span style="font-size: small;">2. Location to which data is exported</span></p>
<p style="text-align: justify;"><span style="font-size: small;">3. Location used for import.</span></p>
<p style="text-align: justify;"><span style="font-size: small;">You may want to delete other data files to reduce the space on blob storage.</span></p>
<p style="text-align: justify;">&nbsp;</p>
<p><span style="font-size: small;"><strong>Hive Import/Export in/after HDI 3.0 Cluster</strong></span></p>
<p><span style="font-size: small;">The configuration changes are not required in the case of an HDI 3.0 cluster. We can run the export/import commands directly from PowerShell. Below are the commands for reference.</span></p>
<p><span style="font-size: small;">export table TABLE_NAME to 'wasb://CONTAINER_NAME@STORAGE_NAME.blob.core.windows.net/DIRECTORY_NAME';</span></p>
<p><span style="font-size: small;">import table TABLE_NAME from 'wasb://CONTAINER_NAME@STORAGE_NAME.blob.core.windows.net/DIRECTORY_NAME';</span></p>
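<p><span style="font-size: small;">A rough sketch of wrapping these commands with the HDInsight cmdlets from a workstation is shown below; the cluster, table, container and storage account names are placeholders.</span></p>
<pre class="scroll"><code class="powershell"># Placeholders throughout - run the export against the old cluster and the import against the new one
Use-AzureHDInsightCluster -Name "myhdi30cluster"

Invoke-Hive -Query "export table mytable to 'wasb://mycontainer@mystorageaccount.blob.core.windows.net/backup/mytable';"

Invoke-Hive -Query "import table mytable from 'wasb://mycontainer@mystorageaccount.blob.core.windows.net/backup/mytable';"
</code></pre>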
<p>&nbsp;</p>
<p><span style="font-size: small;">Thanks Shuaishuai Nei for his input and thanks Dharshana, Jason and Dan for reviewing.</span></p>
<p><span style="font-size: small;">Thanks Much and Happy Learning</span></p>
<p><span style="font-size: small;">Sudhir Rawat</span></p><div style="clear:both;"></div><img src="http://blogs.msdn.com/aggbug.aspx?PostID=10521896" width="1" height="1">sudhirbloghttp://blogs.msdn.com/sudhirrawat1_4000_live.com/ProfileUrlRedirect.ashxStart using flume with HDInsight by installing HDP 2.0 on Windows Azure Virtual Machine http://blogs.msdn.com/b/bigdatasupport/archive/2014/04/24/start-using-flume-with-hdinsight-by-installing-hdp-2-0-on-windows-azure-virtual-machine.aspx2014-04-24T14:19:00Z2014-04-24T14:19:00Z<p>After reading Greg's article <a href="http://blogs.msdn.com/b/bigdatasupport/archive/2014/03/18/using-apache-flume-with-hdinsight.aspx" target="_blank">Using apache flume with HDInsight</a> I wanted to start to learn more about flume, but my Linux skills are non-existent and currently flume is not included in HDInsight. For more information on HDInsight see <a href="http://azure.microsoft.com/en-us/documentation/services/hdinsight/" target="_blank">Windows Azure HDInsight</a>. For more information on Apache Flume see <a href="http://flume.apache.org/" target="_blank">Apache Flume</a>.</p>
<p>So, I decided to set up a Windows Azure virtual machine and install HDP 2.0 for Windows in order to start to use flume with my HDInsight server. Based on the article I could also use cloudberry drive on my virtual machine to move my twitter data to my storage account that HDInsight could access. Because the data is all moved within the data center, no data movement cost would be incurred, but there would be storage costs (<a href="http://azure.microsoft.com/en-us/pricing/details/storage/" target="_blank">Windows Azure Storage Costs</a>) and of course the cost of the virtual machine.</p>
<p>Although I am going to set up a single HDP 2.0 server on a single virtual machine, in the past I had set up a 6 node HDP 1.0 cluster on multiple Windows Azure virtual machines. Here are a couple of suggestions in case you want to do this.</p>
<ul>
<li>Choose a location or data center.</li>
<li>Create a storage account.</li>
<li>Create a Virtual Private Network.</li>
<li>Use a Windows + SQL gallery image.</li>
</ul>
<p>For performance and cost reasons, choose a location or data center and create all of your Windows Azure services within this data center. You can create a storage account in your chosen data center using quick create from the Windows Azure portal. After it is created, review the storage configure tab and choose your local or geo replication options. These have cost implications. Also notice that you can turn monitoring for blobs on and off. This allows your monitor tab to be populated with IO information! You can use the same storage account for your HDInsight cluster and virtual machine. They will be placed in different containers. Take a quick look at the containers tab. For more information on Windows Azure storage see <a href="http://azure.microsoft.com/en-us/documentation/services/storage/" target="_blank">Windows Azure Storage</a>. When you create a virtual machine a vhds container will be created, which is the location of your virtual machine's vhd files. When you create an HDInsight cluster you can also use this existing storage account and create a container here. Below are a couple of screenshots showing the creation of a storage account and the configure tab. For more information on Windows Azure virtual machines see <a href="http://azure.microsoft.com/en-us/documentation/services/virtual-machines/" target="_blank">Windows Azure Virtual Machine</a>.</p>
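<p>For completeness, the storage account and a container can also be created from Azure PowerShell instead of the portal; the account name, container name and location below are placeholders.</p>
<pre class="scroll"><code class="powershell"># Placeholders - pick your own names and data center location
New-AzureStorageAccount -StorageAccountName "mystorageaccount" -Location "West US"

$key = Get-AzureStorageKey "mystorageaccount" | %{ $_.Primary }
$ctx = New-AzureStorageContext -StorageAccountName "mystorageaccount" -StorageAccountKey $key
New-AzureStorageContainer -Name "flumedata" -Context $ctx
</code></pre>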
<p>&nbsp;</p>
<p style="text-align: left; margin-left: 36pt;"><a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78/5488.test.png"><img src="http://blogs.msdn.com/resized-image.ashx/__size/550x0/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78/5488.test.png" alt="" width="553" height="297" border="0" /></a>&nbsp;</p>
<p style="text-align: left; margin-left: 36pt;"><a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78/5488.image2.png"><img src="http://blogs.msdn.com/resized-image.ashx/__size/550x0/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78/5488.image2.png" alt="" border="0" /></a></p>
<p style="text-align: left; margin-left: 36pt;">&nbsp;</p>
<p>If you want to install an HDP 2.0 cluster, the virtual machines must be able to communicate with each other. You can use either a Windows Azure network (virtual private network) or an Affinity group. I have found the network path to be easier. Below is a screen shot where you are giving a name to your network, an IP address range and the data center location. You will then install all of your virtual machines within this network and they will be able to communicate with each other. If you just want to install an HDP 2.0 single server, a network is not necessary and you can skip this step. For more information on Windows Azure Virtual Networks see <a href="http://azure.microsoft.com/en-us/documentation/services/virtual-network/" target="_blank">Windows Azure Virtual Network</a>.</p>
<p style="padding-left: 30px;"><br /><a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78/4276.image3.png"><img src="http://blogs.msdn.com/resized-image.ashx/__size/550x0/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78/4276.image3.png" alt="" border="0" /></a></p>
<p style="padding-left: 30px;">&nbsp;&nbsp;</p>
<p>Next you can install your virtual machine. Because I like to configure HDP 2.0 to use SQL Server for the Hive and Oozie metastore instead of Derby, I like to install a virtual machine image that has the Windows 2012 operating system and SQL Server 2012 Standard Edition. Instead of choosing quick create, choose from gallery. You can then choose an image with the operating system and SQL Server you want installed. If you are not interested in using SQL Server for your Hive and Oozie metastore and want to use Derby, you can choose an image without SQL Server from the gallery. Give your virtual machine a name and choose a size, which determines how many CPUs, how much memory and how many additional disks your virtual machine gets. A2 (2 cores, 3.5GB) or A3 (4 cores, 7GB) should be fine for a single install of HDP 2.0. If needed, you can increase or decrease the size after the virtual machine is created. When configuring your virtual machine, leave the cloud service boxes at the default. Choose the data center or the network you created before. Choose the storage account you created before. Leave the rest at the defaults. Below are some screenshots showing the steps.</p>
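<p>The virtual machine can also be created from PowerShell instead of the portal. Below is a rough sketch under assumed names; the cloud service, VM name, credentials and image filter are placeholders, and the gallery image names change over time, so query the gallery first.</p>
<pre class="scroll"><code class="powershell"># Rough sketch - service, VM name, credentials and image filter are placeholders
$image = Get-AzureVMImage |
         Where-Object { $_.Label -like "*SQL Server 2012*Windows Server 2012*" } |
         Select-Object -First 1

New-AzureVMConfig -Name "hdp20vm" -InstanceSize "Large" -ImageName $image.ImageName |
    Add-AzureProvisioningConfig -Windows -AdminUsername "hdpadmin" -Password "YourStrongPassword1!" |
    New-AzureVM -ServiceName "hdp20svc" -Location "West US"   # add -VNetName if you created a virtual network above
</code></pre>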
<p>&nbsp;</p>
<p style="padding-left: 30px;">&nbsp;<a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78/8666.image4.png"><img src="http://blogs.msdn.com/resized-image.ashx/__size/550x0/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78/8666.image4.png" alt="" border="0" /></a></p>
<p style="padding-left: 30px;">&nbsp;</p>
<p style="padding-left: 30px;"><a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78/8306.image6.png"><img src="http://blogs.msdn.com/resized-image.ashx/__size/550x0/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78/8306.image6.png" alt="" border="0" /></a></p>
<p style="padding-left: 30px;"><a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78/3750.image7.png"><img src="http://blogs.msdn.com/resized-image.ashx/__size/550x0/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78/3750.image7.png" alt="" border="0" /></a></p>
<p>&nbsp;</p>
<p>After the provisioning is complete you can highlight the virtual machine and click the connect button in order to RDP to your new virtual machine!</p>
<p>&nbsp;</p>
<p style="padding-left: 30px;">&nbsp;<a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78/0246.image8.png"><img src="http://blogs.msdn.com/resized-image.ashx/__size/550x0/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78/0246.image8.png" alt="" border="0" /></a></p>
<p>&nbsp;&nbsp;</p>
<p>Next we will make some configuration changes to enable SQL Server to be used as the Hive and Oozie metastore.</p>
<ul>
<li>Create an endpoint.</li>
<li>Change the authentication to Windows and SQL.</li>
<li>Create hive and oozie databases.</li>
<li>Create hive and oozie users with read and write permissions.</li>
</ul>
<p>From the Windows Azure portal click the endpoints tab for your newly created virtual machine. Click add, choose stand-alone endpoint, select MSSQL (TCP port 1433) and leave the rest at the defaults. Once the endpoint is created click the dashboard tab and then click connect to RDP to the virtual machine. Once logged in, start up SQL Server Management Studio and connect to your instance. Right-click the server and choose properties. On the security tab change the authentication to Windows and SQL Server authentication. Create a hive and an oozie database. Create hive and oozie SQL Server logins that are members of the db_datareader and db_datawriter roles in their respective databases. Also modify the enforce password policy. Restart the SQL Server service for the authentication change to take effect. Below are some screenshots.</p>
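<p>The MSSQL endpoint can also be added from PowerShell; a small sketch with placeholder cloud service and VM names follows.</p>
<pre class="scroll"><code class="powershell"># Placeholder cloud service and VM names
Get-AzureVM -ServiceName "hdp20svc" -Name "hdp20vm" |
    Add-AzureEndpoint -Name "MSSQL" -Protocol tcp -LocalPort 1433 -PublicPort 1433 |
    Update-AzureVM
</code></pre>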
<p>&nbsp;</p>
<p style="padding-left: 30px;">&nbsp;<a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78/6661.image9.png"><img src="http://blogs.msdn.com/resized-image.ashx/__size/550x0/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78/6661.image9.png" alt="" border="0" /></a></p>
<p style="padding-left: 30px;"><a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78/8535.image10.png"><img src="http://blogs.msdn.com/resized-image.ashx/__size/550x0/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78/8535.image10.png" alt="" border="0" /></a></p>
<p style="padding-left: 30px;"><a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78/8231.image11.png"><img src="http://blogs.msdn.com/resized-image.ashx/__size/550x0/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78/8231.image11.png" alt="" border="0" /></a></p>
<p style="padding-left: 30px;"><a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78/1537.image12.png"><img src="http://blogs.msdn.com/resized-image.ashx/__size/550x0/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78/1537.image12.png" alt="" border="0" /></a></p>
<p>&nbsp;</p>
<p>In SQL Configuration manager verify that SQL Server has TCP enabled and listening on port 1433.</p>
<p>&nbsp;</p>
<p style="padding-left: 30px;">&nbsp;<a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78/6472.image13.png"><img src="http://blogs.msdn.com/resized-image.ashx/__size/550x0/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78/6472.image13.png" alt="" border="0" /></a></p>
<p>&nbsp;</p>
<p>Now that SQL Server is configured it is time to prepare the operating system and install HDP 2.0 for Windows. Before proceeding I suggest reviewing Hortonworks' blog post on installing HDP 2.0 on Windows. <a href="http://hortonworks.com/blog/install-hadoop-windows-hortonworks-data-platform-2-0/" target="_blank">Install hadoop windows</a></p>
<ul>
<li>Install C++ runtime. <a href="http://www.microsoft.com/en-us/download/details.aspx?id=14632" target="_blank">C++ Runtime download</a></li>
<li>Install and configure Python 2.7.X. <a href="http://www.python.org/download/" target="_blank">Python download</a></li>
<li>Install and configure Java 1.6.X. <a href="http://www.oracle.com/technetwork/java/javasebusiness/downloads/java-archive-downloads-javase6-419409.html" target="_blank">Java jdk download</a></li>
<li>Install and configure HDP 2.0 for windows. <a href="http://hortonworks.com/products/hdp-windows/" target="_blank">HDP 2.0 for Windows</a></li>
</ul>
<p>If you have chosen the Windows 2012 Server operating system the C++ runtime is already installed and you can skip this step. Download Python and run the install. You will need to add your Python install location to the path statement. In control panel open the system icon and choose the advanced tab and the environment variables button. Find the path variable and append the python path. Don't forget to add your semicolon.</p>
<p>&nbsp;</p>
<p style="padding-left: 30px;">&nbsp;<a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78/8611.image14.gif"><img src="http://blogs.msdn.com/resized-image.ashx/__size/550x0/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78/8611.image14.gif" alt="" border="0" /></a></p>
<p>&nbsp;</p>
<p>Download and install Java. You can install Java 1.7.X instead of 1.6.X. When you install Java you should change the default install path for the JDK and JRE. Java does not like spaces in the path and the default location is c:\Program Files\. Choose a path without spaces like c:\Java\. Next, in the Control Panel System icon, click the Advanced tab and the Environment Variables button. This time create a new system variable named JAVA_HOME with the path to your Java install, C:\Java\jdk1.6.0_31.</p>
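<p>Both environment variable changes can also be made from an elevated PowerShell prompt instead of the Control Panel dialogs; the sketch below assumes the install locations mentioned above (C:\Python27 for Python and C:\Java\jdk1.6.0_31 for Java).</p>
<pre class="scroll"><code class="powershell"># Run from an elevated PowerShell prompt; paths assume the install locations above
[Environment]::SetEnvironmentVariable("JAVA_HOME", "C:\Java\jdk1.6.0_31", "Machine")

$path = [Environment]::GetEnvironmentVariable("Path", "Machine")
[Environment]::SetEnvironmentVariable("Path", "$path;C:\Python27", "Machine")
</code></pre>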
<p>&nbsp;</p>
<p style="padding-left: 30px;">&nbsp;<a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78/2821.image15.gif"><img src="http://blogs.msdn.com/resized-image.ashx/__size/550x0/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78/2821.image15.gif" alt="" border="0" /></a></p>
<p>&nbsp;</p>
<p>Now we are ready to install our stand-alone instance of HDP 2.0 on Windows. Unzip the hdp-2.0.6-GA.zip file. Open a PowerShell prompt as Administrator ("Run as Administrator") and execute the MSI file: msiexec /i "hdp-2.0.6.0.winpkg.msi". The installer will display the install window.</p>
<ul>
<li>Set the Hadoop User Password.</li>
<li>Check the "Delete Existing HDP Data".</li>
<li>Check "Install HDP Additional Components".</li>
<li>Set the Hive and Oozie database credentials.</li>
<li>Select Derby or MSSQL Server as the database flavor.</li>
</ul>
<p>&nbsp;</p>
<p style="padding-left: 30px;">&nbsp;<a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78/5086.image16.png"><img src="http://blogs.msdn.com/resized-image.ashx/__size/550x0/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78/5086.image16.png" alt="" border="0" /></a></p>
<p>&nbsp;</p>
<p>Once the install is successful you can open a command prompt or windows explorer and change directories to c:\hdp\ and execute the start_local_hdp_services.cmd. This will start the Hadoop services. You can also start and stop the services from your services icon in administrative tools in control panel. Below are screenshots of the c:\hdp folder and the services icon. Your flume service should remain stopped until we configure it.</p>
<p>&nbsp;</p>
<p style="padding-left: 30px;">&nbsp;<a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78/5707.image17.png"><img src="http://blogs.msdn.com/resized-image.ashx/__size/550x0/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78/5707.image17.png" alt="" border="0" /></a></p>
<p style="padding-left: 30px;"><a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78/7840.image18.png"><img src="http://blogs.msdn.com/resized-image.ashx/__size/550x0/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78/7840.image18.png" alt="" border="0" /></a></p>
<p style="padding-left: 30px;">&nbsp;</p>
<p>Next let's start to look at flume under C:\hdp\flume-1.4.0.2.0.6.0-0009. You have bin, conf, lib, and tools folders. In the conf folder there is a flume.conf file, which is the flume configuration file. We will be modifying this file.</p>
<p>&nbsp;</p>
<p style="padding-left: 30px;">&nbsp;<a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78/8831.image19.png"><img src="http://blogs.msdn.com/resized-image.ashx/__size/550x0/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78/8831.image19.png" alt="" border="0" /></a></p>
<p>&nbsp;</p>
<p>Before we start to configure flume we will need two things.</p>
<ul>
<li>Twitter account and a Twitter Streaming Application. <a href="http://twitter.com/signup" target="_blank">Twitter Signup</a>.</li>
<li>Java code to access Twitter. <a href="http://s3.amazonaws.com/hw-sandbox/tutorial13/SentimentFiles.zip" target="_blank">SentimentFile Download</a> from Hortonworks blog <a href="http://hortonworks.com/hadoop-tutorial/how-to-refine-and-visualize-sentiment-data/" target="_blank">How To Refine and Visualize Sentiment Data</a>.</li>
</ul>
<p>Once you have your twitter account go to <a href="https://apps.twitter.com/" target="_blank">Twitter Create Application</a> link and click create new app button and follow the instructions. Below is an example of a Twitter streaming Application I created on Twitter.</p>
<p>&nbsp;</p>
<p style="padding-left: 30px;">&nbsp;<a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78/0083.image20.png"><img src="http://blogs.msdn.com/resized-image.ashx/__size/550x0/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78/0083.image20.png" alt="" border="0" /></a></p>
<p>&nbsp;</p>
<p>Once you have created your Twitter streaming application you will need four pieces of information from Twitter. These can be found on the API Keys tab of your Twitter application. We will need these in order to configure our flume.conf file.</p>
<ul>
<li>API Key</li>
<li>API Secret</li>
<li>Access Token</li>
<li>Access Secret</li>
</ul>
<p>We need a way for flume to communicate with our Twitter streaming application. In the Hortonworks blog post on refine and visualize sentiment data there is a SentimentFile.zip download. See the links above. Within this zip file there is a flume-custom-source-1.0.0-SNAPSHOT.jar file that has a poc.hortonworks.flume.source.twitter.TwitterSource class. Take this jar file and copy it to C:\hdp\flume-1.4.0.2.0.6.0-0009\lib folder. We will configure flume to use this jar file to stream tweets to flume for certain Twitter keywords.</p>
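<p>Copying the jar into the Flume lib folder is a one-liner from PowerShell; the source path below is a placeholder, and the jar name should match the one extracted from the SentimentFiles.zip.</p>
<pre class="scroll"><code class="powershell"># Source path is a placeholder - use wherever you extracted SentimentFiles.zip
Copy-Item "C:\Downloads\SentimentFiles\flume-custom-sources-1.0.0-SNAPSHOT.jar" `
          "C:\hdp\flume-1.4.0.2.0.6.0-0009\lib\"
</code></pre>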
<p>&nbsp;</p>
<p style="padding-left: 30px;">&nbsp;<a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78/3755.image21.png"><img src="http://blogs.msdn.com/resized-image.ashx/__size/550x0/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78/3755.image21.png" alt="" border="0" /></a></p>
<p>&nbsp;</p>
<p>You can use the jar command to list the contents of the flume-custom-sources-1.0.0-SNAPSHOT.jar file.</p>
<p><code class="cplusplus">C:\Java\jdk1.7.0_51\bin&gt;jar -tvf C:\hdp\flume-1.4.0.2.0.6.0-0009\lib\flume-custom-sources-1.0.0-SNAPSHOT.jar<br /></code></p>
<p>&nbsp;</p>
<pre class="scroll"><code class="cplusplus"> <br /> 0 Thu Mar 14 14:38:32 UTC 2013 META-INF/ <br /> <br /> 130 Thu Mar 14 14:38:32 UTC 2013 META-INF/MANIFEST.MF <br /> <br /> 2227 Thu Mar 14 14:38:32 UTC 2013 flume.conf <br /> <br /> 1219 Thu Mar 14 14:38:32 UTC 2013 log4j.xml <br /> <br /> 0 Thu Mar 14 14:38:32 UTC 2013 poc/ <br /> <br /> 0 Thu Mar 14 14:38:32 UTC 2013 poc/hortonworks/ <br /> <br /> 0 Thu Mar 14 14:38:32 UTC 2013 poc/hortonworks/flume/ <br /> <br /> 0 Thu Mar 14 14:38:32 UTC 2013 poc/hortonworks/flume/source/ <br /> <br /> 0 Thu Mar 14 14:38:32 UTC 2013 poc/hortonworks/flume/source/twitter/ <br /> <br /> 2718 Thu Mar 14 14:38:32 UTC 2013 poc/hortonworks/flume/source/twitter/TwitterSource$1.class <br /> <br /> 3975 Thu Mar 14 14:38:32 UTC 2013 poc/hortonworks/flume/source/twitter/TwitterSource.class <br /> <br /> 772 Thu Mar 14 14:38:32 UTC 2013 poc/hortonworks/flume/source/twitter/TwitterSourceConstants.class <br /> <br /> &lt;Truncated output&gt; <br /> </code></pre>
<p>&nbsp;</p>
<p>&nbsp;</p>
<p>Now we need to configure our C:\hdp\flume-1.4.0.2.0.6.0-0009\conf\flume.conf file. As Greg's article describes, Flume has sources, channels, and sinks.</p>
<p>&nbsp;</p>
<p><a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78/4643.image22.png"><img src="http://blogs.msdn.com/resized-image.ashx/__size/550x0/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78/4643.image22.png" alt="" border="0" /></a></p>
<p>&nbsp;</p>
<p>Our source is the Hortonworks java jar file that connects to our Twitter application and reads tweet information for our Twitter keywords into the channel. The source's type is poc.hortonworks.flume.source.twitter.TwitterSource; Flume knows to look for this class in the jar files in its lib folder. In the flume.conf file we have configured the source with our Twitter API and token information, and we have also provided a list of keywords that we want to capture from Twitter. The channel is in memory with a capacity and transaction capacity of 10000. Our sink then reads from the channel and writes to local disk. I have created a c:\wcarroll\flume directory, which is where the sink will write our tweets; this directory location could be a CloudBerry drive that writes to Windows Azure blob storage. The sink will roll over to a new file every 7200 seconds (2 hours).</p>
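<p>The sink directory needs to exist before the agent starts writing, so create it up front. A quick way from PowerShell (the path matches the flume.conf that follows; substitute your own):</p>
<pre class="scroll"><code># Sketch: create the file_roll sink directory referenced in flume.conf below.
New-Item -ItemType Directory -Path "C:\wcarroll\flume" -Force
</code></pre>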
<p>&nbsp;</p>
<p><span style="text-decoration: underline;"><strong>Flume.conf </strong></span></p>
<p>&nbsp;</p>
<pre class="scroll"><code class="cplusplus"><span style="background-color: #ffffff;"> agent.sources = Twitter1 </span><br /> <br /><span style="background-color: #ffffff;"> agent.channels = MemChannel </span><br /> <br /><span style="background-color: #ffffff;"> agent.sinks = k1 </span><br /> <br /> <br /> <br /><span style="background-color: #ffffff;"> agent.sources.Twitter1.type = poc.hortonworks.flume.source.twitter.TwitterSource </span><br /> <br /><span style="background-color: #ffffff;"> agent.sources.Twitter1.channels = MemChannel </span><br /> <br /><span style="background-color: #ffffff;"> agent.sources.Twitter1.consumerKey = &lt;Twitter API Key&gt; </span><br /> <br /><span style="background-color: #ffffff;"> agent.sources.Twitter1.consumerSecret = &lt;Twitter API Secret&gt; </span><br /> <br /><span style="background-color: #ffffff;"> agent.sources.Twitter1.accessToken = &lt;Twitter Access Token&gt; </span><br /> <br /><span style="background-color: #ffffff;"> agent.sources.Twitter1.accessTokenSecret = &lt;Twitter Access Secret&gt; </span><br /> <br /><span style="background-color: #ffffff;"> agent.sources.Twitter1.keywords = Ukraine,Crimea,Crimean peninsula,Kiev,Kharkiv,Donetsk,Luhansk,Rostov,Kursk,Belgorod,Tambov </span><br /> <br /> <br /> <br /><span style="background-color: #ffffff;"> agent.sinks.k1.type = file_roll </span><br /> <br /><span style="background-color: #ffffff;"> agent.sinks.k1.sink.rollInterval = 7200 </span><br /> <br /><span style="background-color: #ffffff;"> agent.sinks.k1.channel = MemChannel </span><br /> <br /><span style="background-color: #ffffff;"> agent.sinks.k1.sink.directory = c:\\wcarroll\\flume </span><br /> <br /> <br /> <br /><span style="background-color: #ffffff;"> agent.channels.MemChannel.type = memory </span><br /> <br /><span style="background-color: #ffffff;"> agent.channels.MemChannel.capacity = 10000 </span><br /> <br /><span style="background-color: #ffffff;"> agent.channels.MemChannel.transactionCapacity = 10000</span> <br /> </code></pre>
<p>&nbsp;</p>
<p>&nbsp;&nbsp;</p>
<p>After configuring your flume.conf file, go ahead and start up the Flume service. If you have issues starting Flume, review the log files at C:\hdp\flume-1.4.0.2.0.6.0-0009\bin\logs. If it is successful, you should start to see a file generated in your sink directory. Below is a screenshot showing Flume writing tweets to a file in my sink directory.</p>
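<p>From PowerShell, starting the service and tailing its most recent log can look like the sketch below; again, the assumption that the service name contains &quot;flume&quot; is mine.</p>
<pre class="scroll"><code># Sketch: start the Flume service and watch the newest log file for errors.
# Assumption: the Windows service name contains "flume".
Get-Service -Name "*flume*" | Start-Service
Get-ChildItem "C:\hdp\flume-1.4.0.2.0.6.0-0009\bin\logs" |
    Sort-Object LastWriteTime -Descending | Select-Object -First 1 |
    Get-Content -Tail 20 -Wait
</code></pre>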
<p>&nbsp;</p>
<p style="padding-left: 30px;">&nbsp;<a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78/4331.image23.png"><img src="http://blogs.msdn.com/resized-image.ashx/__size/550x0/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78/4331.image23.png" alt="" border="0" /></a></p>
<p>&nbsp;</p>
<p>Now that we are successfully collecting Twitter data and storing it in our Windows Azure storage account, we can access the data from HDInsight. The Twitter data is in JSON format. Next, Greg and I will discuss using Pig and Hive to do ETL tasks so that we can start to analyze the Twitter data further.</p>
<p>&nbsp;</p>
<p>Bill</p>
<p>&nbsp;</p><div style="clear:both;"></div><img src="http://blogs.msdn.com/aggbug.aspx?PostID=10514256" width="1" height="1">carrollwphttp://blogs.msdn.com/carrollwp_4000_hotmail.com/ProfileUrlRedirect.ashxSliding Window Data Partitioning on Microsoft Azure HDInsighthttp://blogs.msdn.com/b/bigdatasupport/archive/2014/04/23/sliding-window-data-partitioning-on-hdinsight.aspx2014-04-24T02:56:00Z2014-04-24T02:56:00Z<p>HCatalog is a table and storage management layer for Hadoop that enables users with different data processing tools like Pig, MapReduce, Hive, and Oozie to read and write data. HCatalog's table abstraction presents these tools and users with a relational view of data in the cluster. HCatalog serves as a nice metadata integration point for Pig, MapReduce, Hive, and Oozie. You may have seen scenarios where Pig is being used for ETL, bringing in data and adding it to metastore partitions. When using Oozie for scheduling workflows, it is key to integrate with HCatalog to sense the presence of data on a partition and act on that information. HCatalog integration was introduced in Apache Oozie 4.0, and HDInsight clusters starting with version 3.0 support it as well. If you are interested in reading more about the HCatalog Integration feature, you can find more information in the Apache documentation <a href="http://oozie.apache.org/docs/4.0.0/DG_HCatalogIntegration.html"><span style="color: blue; text-decoration: underline;">here</span></a>.</p>
<h2><span style="color: #002060;">Configuration and Deployment </span></h2>
<p>The process of configuring HCatalog Integration with Oozie is detailed <a href="http://oozie.apache.org/docs/4.0.0/AG_Install.html"><span style="color: blue; text-decoration: underline;">here</span></a>. But, please note that these configuration steps are already done for you on HDInsight and the feature is available out of the box when you create a cluster! Let us do a quick run-down on the configuration highlights for this feature:</p>
<ol>
<li>PartitionDependencyManagerService and HCatAccessorService are required to work with HCatalog and support Coordinators having HCatalog URIs as data dependency. Both of these are configured on oozie-site.xml for HDInsight.</li>
<li>HCatalog polling frequency is determined by the configuration parameter oozie.service.coord.push.check.requeue.interval. This is set to 30000 milliseconds (30 seconds) by default on HDInsight, but can be changed to a polling frequency that suits your environment. The Add-AzureHDInsightConfigValues PowerShell cmdlet can be used to specify a custom Oozie configuration when building the HDInsight cluster as described <a href="http://msdn.microsoft.com/en-us/library/dn593759.aspx"><span style="color: blue; text-decoration: underline;">here</span></a>; a sketch is shown after this list.</li>
<li>All the necessary JARs that are needed for HCatalog integration are also configured out of the box for you with HDInsight cluster.</li>
</ol>
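<p>To make item 2 concrete, here is a minimal sketch of overriding that polling interval at provisioning time with the classic Azure PowerShell cmdlets. The cluster name, storage details and the 10000 value are placeholders, and the exact shape of the -Oozie parameter should be confirmed against the cmdlet reference linked above; treat this as an illustration rather than the definitive syntax.</p>
<pre class="scroll"><code># Sketch only: override the HCatalog polling interval while provisioning a cluster.
# Placeholder values (mycluster, mystorageacct, mycontainer, the key) are assumptions.
$storageKey = "yourStorageAccountKey"
$config = New-AzureHDInsightClusterConfig -ClusterSizeInNodes 4
$config = Set-AzureHDInsightDefaultStorage -Config $config `
    -StorageAccountName "mystorageacct.blob.core.windows.net" `
    -StorageAccountKey $storageKey -StorageContainerName "mycontainer"
# Assumption: -Oozie accepts a hashtable of oozie-site.xml overrides;
# confirm against the cmdlet reference linked above.
$config = Add-AzureHDInsightConfigValues -Config $config `
    -Oozie @{ "oozie.service.coord.push.check.requeue.interval" = "10000" }
New-AzureHDInsightCluster -Config $config -Name "mycluster" `
    -Location "North Europe" -Credential (Get-Credential)
</code></pre>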
<h2><span style="color: #002060;">HCatalog URI format </span></h2>
<p>HCatalog partitions can be defined as a data dependency now with a URI notation. The general notation to specify a HCatalog table partition URI is <em>hcat://[metastore server]:[port]/[database name]/[table name]/[partkey1]=[value];[partkey2]=[value];&hellip; </em></p>
<p>With HDInsight, the metastore server will point to the Oozie metastore that you configured when provisioning the cluster. The port by default is 9083.</p>
<p>Here is a sample URL from my HDInsight cluster:</p>
<p><em>&lt;uri-template&gt;hcat://headnode0:9083/default/samplelog/dt=${YEAR}-${MONTH}-${DAY}${HOUR}${MINUTE}&lt;/uri-template&gt; </em></p>
<p>This port value comes in from hive-site.xml property &ndash;</p>
<p><em>&lt;name&gt;hive.metastore.uris&lt;/name&gt; <br />&lt;value&gt;thrift://headnode0:9083&lt;/value&gt; </em></p>
<h2><span style="color: #002060;">What is Sliding Window Partitioning? </span></h2>
<p>Hive provides a nice way to organize data logically with partitions. As an example, you may want to retain only 3 months of server log data at any time. A sliding window is basically a window of a certain width, and can accommodate only a certain number of partitions at any given time. When a new partition comes into the window, the oldest partition is automatically archived out. Archival could mean either moving that data out from that partition into HAR (Hadoop archive), or into a different Hive table, or completely deleting that partition. Note that on HDInsight, if the partition is deleted, the data still exists on the BLOB storage if the table is external.</p>
<h2><span style="color: #002060;">HCatalog Integration and Sliding Window </span></h2>
<p>So, the key to implementing a sliding window mechanism is to sense the arrival of the next partition and automatically archive the oldest partition, in line with the retention policy determined by the business. In our example, we need to drop the partition for the oldest month in the sliding window when data for the next month arrives. This is a nice fit for using HCatalog integration from an Oozie coordinator app: you could schedule an Oozie coordinator application that runs at a frequency of once every month and checks whether the partition for the next month has been created, at which point it deletes the oldest partition. The HCatalog call is made using the HCatalog URI format given above.</p>
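<p>To make the archival step concrete, here is a minimal sketch (not taken from the demo scripts linked later) of what dropping the oldest monthly partition could look like when issued as an ad-hoc Hive job from the classic Azure PowerShell cmdlets; inside the coordinator application this statement would live in the workflow's Hive action instead. The cluster name and the partition value are placeholders.</p>
<pre class="scroll"><code># Sketch: drop the oldest monthly partition of the samplelog table.
# "mycluster" and the dt value below are placeholders for illustration.
$hiveJob = New-AzureHDInsightHiveJobDefinition `
    -Query "ALTER TABLE samplelog DROP IF EXISTS PARTITION (dt='2014-01');"
$job = Start-AzureHDInsightJob -Cluster "mycluster" -JobDefinition $hiveJob
Wait-AzureHDInsightJob -Job $job -WaitTimeoutInSeconds 600
# For an external table the underlying blobs remain in Azure storage after the drop.
</code></pre>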
<h2><span style="color: #002060;">Apache Oozie Terminology </span></h2>
<p>Oozie is a workflow scheduler system to manage Apache Hadoop jobs<sup> [1].</sup> Oozie workflow jobs are Directed Acyclical Graphs (DAGs) of actions<sup>[1]</sup>. Oozie Coordinator jobs are recurrent Oozie Workflow jobs triggered by time (frequency) and data availability<sup> [1]</sup>. With addition of HCatalog support, Coordinators also support specifying a set of HCatalog table partitions as a dataset<sup> [2].</sup> The workflow is triggered when the HCatalog table partitions are available and the workflow actions can then read the partition data<sup> [2]</sup>.</p>
<p>Let us take a specific example for the sliding window and try to implement it with an Oozie coordinator application. We need an Oozie coordinator application that runs at a frequency of once every month, and for demo purposes we schedule the application such that it starts in the current month and ends within a few minutes of the job start time, so that we don't leave the test Oozie coordinator app running on the cluster and taking up resources. Below, I will try to define some terminology for an Oozie coordinator application &ndash;</p>
<h3><span style="color: #000080;">Oozie Coordinator application &ndash; (implemented as slidingwindow/coordinator.xml in our example) </span></h3>
<p>&nbsp;A coordinator application defines the conditions under which coordinator actions should be created (the frequency) and when the actions can be started. The coordinator application also defines a start and an end time. Normally, coordinator applications are parameterized. A Coordinator application is written in XML<sup> [3]. </sup></p>
<p>The screen shot below shows the example coordinator application. You can give it any name, in this example we are giving it the name "MY_APP". <span style="color: black;">As you can see from the screen shot below, coordinator application is defined in the form of XML and the first line defines the name of the job, frequency of recurrence of the coordinator actions within the job, start time for the job, end time for the job, time zone and the XML name space. When a coordinator job is submitted to Oozie, the submitter may specify as many coordinator job configuration properties as required (similar to Hadoop JobConf properties). Configuration properties that are a valid Java identifier, [A-Za-z_][0-9A-Za-z_]*, are available as ${NAME} variables within the coordinator application definition.<sup> [3] </sup></span></p>
<p><span style="color: black;">In the example below, </span>the frequency is set to ${coord:months(1)}. ${coord:months(1)} is an expression language function that returns the number of minutes for 1 complete month. So, this indicates that the Oozie coordinator application is set to recur at a frequency of once a month. <span style="color: black;">${jobStart} indicates the start time for this job, which we will be passing on from the job properties payload from within PowerShell. ${jobEnd} indicates the time at which the job should end, and this again will be passed from the job properties payload. Job properties payload is described in greater detail further down on this blog. </span></p>
<p style="margin-left: 18pt;"><img src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78-metablogapi/2148.042514_5F00_1358_5F00_SlidingWind1.png" alt="" /></p>
<p><span style="color: black;">Within the coordinator application, one can define the datasets, input events, and actions as seen on the code snippet above. We will look at each of these in greater detail below: </span></p>
<h3><span style="color: #002060;">Synchronous Dataset </span></h3>
<p><span style="color: black;">The screen show below shows the synchronous dataset defined for our example. Synchronous dataset instances are generated at fixed time intervals and there is a dataset instance associated with each time interval.<sup>[3]</sup> In our example, the frequency of datasets getting produced is once every month. Initial-instance defines the lower bound/baseline for the dataset. So, if there are instances for the dataset that occur prior to the initial-instance, those will be silently ignored. For the purpose of this example, we are going to set the initial-instance to be the same as the job start time. This is because, in our case, we will be looking for the presence of a future partition that is the partition corresponding to the next month to be present for us to act on it and drop the oldest partition. </span></p>
<p><span style="color: black;">Notice on line#4 from the screen shot below, how we reference the HCatalog URI for HDInsight. We are looking for a specific partition in the format dt=YYYY-MM on the table samplelog for us to proceed. </span></p>
<p><img src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78-metablogapi/6305.042514_5F00_1358_5F00_SlidingWind2.png" alt="" /></p>
<h3><span style="color: #002060;">Input Events </span></h3>
<p>The input events of a coordinator application specify the input conditions that are required in order to execute a coordinator action<sup> [3]</sup>. In our example, we would need the partition for the next month to be present in order for the coordinator action to execute. So, we can use the EL (Expression Language) function ${coord:current(n)} which represents the n<sup>th </sup>dataset instance for a synchronous dataset, relative to the coordinator action creation (materialization) time. The coordinator action creation (materialization) time is computed based on the coordinator job start time and its frequency. So, ${coord:current(1)} would represent the instance corresponding to the 1<sup>st</sup> instance of month from the start time. So, given the start month of May, the instance represented by ${coord:current(1)} would be the month of June.</p>
<p><img src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78-metablogapi/3146.042514_5F00_1358_5F00_SlidingWind3.png" alt="" /></p>
<h3><span style="color: #002060;">Oozie Payload &ndash; job.properties </span></h3>
<p>Oozie payload/job.properties is used to pass the configuration parameters for the Oozie coordinator application. The XML payload below defines the various parameters passed to the Oozie coordinator job, like jobStart, jobEnd, initialInstance etc.</p>
<p><img src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78-metablogapi/2134.042514_5F00_1358_5F00_SlidingWind4.png" alt="" /></p>
<h2><span style="color: #002060;">Bringing it all together with PowerShell </span></h2>
<p>Please see the video&nbsp;for a demo of implementing the sliding window mechanism for a partitioned Hive table on HDInsight. All the scripts demonstrated in the video can be downloaded from <a title="Demo Scripts" href="https://gist.github.com/dharkum">here</a>.</p>
<p><iframe src="http://www.youtube.com/embed/SYtzY-RxgBI" frameborder="0" width="640" height="360"></iframe></p>
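<p>The submission pattern itself goes through the cluster's secure gateway, in the same way the WebHCat calls elsewhere on this blog do. As a rough sketch (with a placeholder cluster name, and assuming the gateway forwards /oozie the way it forwards /templeton), you can verify the Oozie endpoint from PowerShell before submitting the coordinator; the actual submission POSTs the job properties payload as an XML configuration to the Oozie jobs endpoint, as described in reference [4] below and in the downloadable demo scripts.</p>
<pre class="scroll"><code># Sketch: verify the Oozie REST endpoint behind the HDInsight secure gateway.
# "mycluster" is a placeholder; credentials are the cluster's HTTP user.
$clusterName = "mycluster"
$cred = Get-Credential
Invoke-RestMethod -Credential $cred `
    -Uri "https://$clusterName.azurehdinsight.net/oozie/versions"
# Submission then POSTs the coordinator properties (as XML) to the Oozie jobs
# endpoint, e.g. .../oozie/v2/jobs - see reference [4] for the full call.
</code></pre>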
<p>This is&nbsp;a simple example of how you can use the HCatalog polling available with the HCatalog Integration feature of Apache Oozie 4.0 to solve a business need for data archival, such as sliding window data partitioning. It is easy to do starting with version&nbsp;3.0&nbsp;HDInsight clusters.</p>
<p>Dharshana Bharadwaj (@dharshb)</p>
<p>Thank you to <a title="JasonH" href="http://social.msdn.microsoft.com/profile/jason%20h%20(hdinsight)/">JasonH</a> and <a title="Bill Carroll" href="http://social.msdn.microsoft.com/profile/carrollwp/">Bill Carroll</a> for reviewing this!</p>
<h2><span style="color: #003366;">References</span></h2>
<ol>
<li>Oozie Apache Wiki - <a href="https://oozie.apache.org/">https://oozie.apache.org/</a></li>
<li>Oozie 4.0: HCatalog Integration Explained - <a href="http://oozie.apache.org/docs/4.0.0/DG_HCatalogIntegration.html">http://oozie.apache.org/docs/4.0.0/DG_HCatalogIntegration.html</a></li>
<li>Oozie Coordinator Functional Specification - <a href="http://oozie.apache.org/docs/4.0.0/CoordinatorFunctionalSpec.html">http://oozie.apache.org/docs/4.0.0/CoordinatorFunctionalSpec.html</a></li>
<li>Use time-based Oozie Coordinator with HDInsight - <a href="http://azure.microsoft.com/en-us/documentation/articles/hdinsight-use-oozie-coordinator-time/">http://azure.microsoft.com/en-us/documentation/articles/hdinsight-use-oozie-coordinator-time/</a></li>
</ol><div style="clear:both;"></div><img src="http://blogs.msdn.com/aggbug.aspx?PostID=10520027" width="1" height="1">Dharshana_Bharadwajhttp://blogs.msdn.com/dharshana_5F00_bharadwaj_4000_hotmail.com/ProfileUrlRedirect.ashxQuerying HDInsight Job Status with WebHCat via Native PowerShell or Node.jshttp://blogs.msdn.com/b/bigdatasupport/archive/2014/04/22/querying-hdinsight-job-status-with-webhcat-via-native-powershell-or-node-js.aspx2014-04-22T15:00:00Z2014-04-22T15:00:00Z<script type="text/javascript">// <![CDATA[
// Telligent is stripping Style tags, so adding via DOM
var coreStyle = document.createElement("link");
coreStyle.setAttribute("rel", "stylesheet");
coreStyle.setAttribute("type", "text/css");
coreStyle.setAttribute("href", "http://alexgorbatchev.com/pub/sh/current/styles/shCore.css");
document.getElementsByTagName("head")[0].appendChild(coreStyle);
var shTheme = document.createElement("link");
shTheme.setAttribute("rel", "stylesheet");
shTheme.setAttribute("type", "text/css");
shTheme.setAttribute("href", "http://alexgorbatchev.com/pub/sh/current/styles/shThemeMidnight.css");
document.getElementsByTagName("head")[0].appendChild(shTheme);
var smallerfontdiv = document.createElement('style')
smallerfontdiv.innerHTML = "div.smallerfont {font-size: 90%;}\n.comments {display: inline !important;}\ndiv.syntaxhighlighter{overflow-y: hidden !important;overflow-x: auto !important;}";
document.body.appendChild(smallerfontdiv);
// ]]></script>
<script type="text/javascript" src="http://alexgorbatchev.com/pub/sh/current/scripts/shCore.js"></script>
<script type="text/javascript" src="http://alexgorbatchev.com/pub/sh/current/scripts/shBrushJScript.js"></script>
<script type="text/javascript" src="http://alexgorbatchev.com/pub/sh/current/scripts/shBrushPowerShell.js"></script>
<script type="text/javascript" src="http://alexgorbatchev.com/pub/sh/current/scripts/shBrushPlain.js"></script>
<p>One of the great things about HDInsight is that under the covers, it has the same capabilities as other Hadoop installations. This means that you can use regular Hadoop endpoints like <a href="http://www.windowsazure.com/en-us/documentation/articles/hdinsight-monitor-use-ambari-api/">Ambari</a> and WebHCat (formerly known as Templeton) to interact with an HDInsight Cluster.</p>
<p>In this blog post, I&rsquo;ll provide a couple of samples that show how to retrieve Job information from your HDInsight cluster. The first will use native PowerShell (Version 3.0 or higher) and the second will utilize <a href="http://nodejs.org/">Node.js</a> (Tested with version 0.10.26). I have tested the code against HDInsight 3.0 and 2.1 clusters.</p>
<p>While the samples below are using native capabilities in each language, there are also some fully featured command libraries available for PowerShell and for Node.js. Check out <a href="https://github.com/WindowsAzure/azure-sdk-tools">Azure Powershell</a> and the <a href="https://github.com/WindowsAzure/azure-sdk-tools-xplat">Azure Cross-platform CLI</a> to learn more.</p>
<p>First I&rsquo;d like to illustrate how this call will actually make it to your HDInsight cluster. Every cluster is isolated from the Internet behind a secure gateway. This gateway is responsible for forwarding requests to the appropriate endpoints in the private address space of your cluster. This is the reason why HDInsight can make use of REST URI&rsquo;s like: <strong>https://mycluster.azurehdinsight.net/ambari/api/v1/clusters</strong> and <strong>https://mycluster.azurehdinsight.net/templeton/v1/status</strong> without using different port numbers. On many standalone Hadoop clusters, you would have to direct these requests to different port numbers like 563 for Ambari or 50111 for WebHCat.</p>
<p><img src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78/2671.securegatewaydiag.png" alt="" border="0" /></p>
<p>Each of the samples below are functionally equivalent. They follow the basic pattern:</p>
<ol>
<li>Acquire credentials for the cluster&rsquo;s Hadoop Services.</li>
<li>Make a REST call to the WebHCat endpoint <strong>/templeton/v1/jobs</strong></li>
<li>Parse the response</li>
<li>For each Job Id in the response, get detailed job status information from <strong>/templeton/v1/jobs/:jobid</strong></li>
<li>Parse the response and display the <strong>.status</strong> element on the console.</li>
</ol>
<p>The first example shows how this is accomplished in PowerShell:</p>
<h3>PowerShell: listjobswebhcat.ps1 <a href="https://github.com/rickmsft/hdinsightsamples/raw/master/listjobswebhcat.ps1">Download</a></h3>
<div class="smallerfont">
<pre class="brush: ps; auto-links: false;"># Get credential object to use for authenticating to the cluster
if(!$ClusterCredential) { $ClusterCredential = Get-Credential }
$ClusterName = 'clustername' # just the first part, we'll add .azurehdinsight.net below when we build the Uri
# Make the REST call, defaults to GET and parses JSON response to PSObject
$Jobs = Invoke-RestMethod -Uri "https://$ClusterName.azurehdinsight.net/templeton/v1/jobs?user.name=$($ClusterCredential.UserName)&amp;showall=true" -Credential $ClusterCredential
Write-Host "The following job information was retrieved:`n"
$Jobs | ft
# Iterate through the jobs
foreach($JobId in $Jobs.id)
{
#Get details specific to this JobId
$Job = Invoke-RestMethod -Uri "https://$ClusterName.azurehdinsight.net/templeton/v1/jobs/$JobId`?user.name=$($ClusterCredential.UserName)" -Credential $ClusterCredential
Write-Host "The following details were retrieved for JobId $JobId`:`n"
$Job | ft
# Powershell doesn't like the fact that the response includes jobId and jobID elements, so I'm going to modify the one that contains a Hash.
# Invoke-RestMethod would have automatically converted the JSON to a PSObject if those two tags hadn't been there, so post-converting.
$Job = ConvertFrom-Json ($Job -creplace 'jobID','jobIDObj')
Write-Host "`nThe following is the parsed Status for JobId $JobId`:"
$Job.status | fl
Write-Host "-------------------------------------------------------"
}
</pre>
</div>
<p>Here's an example of the output from running this PowerShell script:</p>
<div class="smallerfont">
<pre class="brush: plain; auto-links: false">The following job information was retrieved:
id detail
-- ------
job_1397520080955_0001
job_1397520080955_0002
-------------------------------------------------------
The following details were retrieved for JobId job_1397520080955_0002:
{"status":{"mapProgress":1.0,"reduceProgress":1.0,"cleanupProgress":0.0,"setupProgress":0.0,"runState":2,"startTime":1397852941726,"queue":"default","priority":"NORMAL","schedulingInfo":"NA","failureInfo":"NA","job
ACLs":{},"jobName":"mapreduce.BaileyBorweinPlouffe_1_100","jobFile":"wasb://clustername@blobaccount.blob.core.windows.net/mapred/history/done/2014/04/18/000000/job_1397520080955_0002_conf.xml","finishTime":13978531674
44,"historyFile":"","trackingUrl":"headnodehost:19888/jobhistory/job/job_1397520080955_0002","numUsedSlots":0,"numReservedSlots":0,"usedMem":0,"reservedMem":0,"neededMem":0,"jobPriority":"NORMAL","jobID":{"id":2,"j
tIdentifier":"1397520080955"},"jobId":"job_1397520080955_0002","username":"userthatranjob","state":"SUCCEEDED","retired":false,"uber":false,"jobComplete":true},"profile":{"user":"userthatranjob","jobFile":"wasb://clustername@blob
account.blob.core.windows.net/mapred/history/done/2014/04/18/000000/job_1397520080955_0002_conf.xml","url":null,"queueName":"default","jobName":"mapreduce.BaileyBorweinPlouffe_1_100","jobID":{"id":2,"jtIdentifier":"1
397520080955"},"jobId":"job_1397520080955_0002"},"id":"job_1397520080955_0002","parentId":null,"percentComplete":null,"exitValue":null,"user":"userthatranjob","callback":null,"completed":null,"userargs":{}}
The following is the parsed Status for JobId job_1397520080955_0002:
mapProgress : 1.0
reduceProgress : 1.0
cleanupProgress : 0.0
setupProgress : 0.0
runState : 2
startTime : 1397852941726
queue : default
priority : NORMAL
schedulingInfo : NA
failureInfo : NA
jobACLs :
jobName : mapreduce.BaileyBorweinPlouffe_1_100
jobFile : wasb://clustername@blobaccount.blob.core.windows.net/mapred/history/done/2014/04/18/000000/job_1397520080955_0002_conf.xml
finishTime : 1397853167444
historyFile :
trackingUrl : headnodehost:19888/jobhistory/job/job_1397520080955_0002
numUsedSlots : 0
numReservedSlots : 0
usedMem : 0
reservedMem : 0
neededMem : 0
jobPriority : NORMAL
jobIDObj : @{id=2; jtIdentifier=1397520080955}
jobId : job_1397520080955_0002
username : userthatranjob
state : SUCCEEDED
retired : False
uber : False
jobComplete : True
-------------------------------------------------------
(continues...)
</pre>
</div>
<p>The next example shows similar steps in Node.js:</p>
<table>
<tbody>
<tr style="vertical-align: top;">
<td style="padding-right: 35px;">
<h3>Node.js: listjobswebhcat.js <a href="https://github.com/rickmsft/hdinsightsamples/raw/master/listjobswebhcat.js">Download</a></h3>
</td>
<td style="padding: 5px 25px 0px 25px; background: lightgray;"><strong><em>Notes:</em></strong><ol>
<li>If you need to download Node.js, get it from <a href="http://nodejs.org">here</a>.</li>
<li>On Windows, launch <em>Node.js command prompt</em> from the Start Screen/Menu</li>
<li>Change directory to the folder where you have saved <em>listjobswebhcat.js</em></li>
<li>Run <em>node listjobswebhcat.js</em></li>
</ol></td>
</tr>
</tbody>
</table>
<div class="smallerfont">
<pre class="brush: js; auto-links: false">var https = require('https');
// Cluster Authentication Setup
var clustername = "clustername";
var username = "clusterusername";
var password = "clusterpassword"; // Reminder: Don`t share this file with your password saved!
// Set up the options to get all known Jobs from WebHCat
var optionJobs = {
host: clustername + ".azurehdinsight.net",
path: "/templeton/v1/jobs?user.name=" + username + "&amp;showall=true",
auth: username + ":" + password, // this is basic auth over ssl
port: 443
};
// Make the call to the WebHCat Endpoint
https.get(optionJobs, function (res) {
console.log("\nHTTP Response Code: " + res.statusCode);
var responseString = ""; // Initialize the response string
res.on('data', function (data) {
responseString += data; // Accumulate any chunked data
});
res.on('end', function () {
// Parse the response, we know it`s going to be JSON, so we`re not checking Content-Type
var Jobs = JSON.parse(responseString);
console.log("The following job information was retrieved:");
console.log(Jobs);
Jobs.forEach(function (Job) {
// Set up the options to get information about a specific Job Id
var optionJob = {
host: clustername + ".azurehdinsight.net",
path: "/templeton/v1/jobs/" + Job.id + "?user.name=" + username + "&amp;showall=true",
auth: username + ":" + password, // this is basic auth over ssl
port: 443
};
https.get(optionJob, function (res) {
console.log("\nHTTP Response Code: " + res.statusCode);
var jobResponseString = ""; // Initialize the response string
res.on('data', function (data) {
jobResponseString += data; // Accumulate any chunked data
});
res.on('end', function () {
var thisJob = JSON.parse(jobResponseString); // Parse the JSON response
console.log("The following is the Status for JobId " + Job.id);
console.log(thisJob.status); // Just Log the status element.
});
});
});
});
});
</pre>
</div>
<p>Here's an example of the output from running this script through Node.js:</p>
<div class="smallerfont">
<pre class="brush: plain; auto-links: false">HTTP Response Code: 200
The following job information was retrieved:
[ { id: 'job_1397520080955_0001', detail: null },
{ id: 'job_1397520080955_0002', detail: null } ]
HTTP Response Code: 200
The following is the Status for JobId job_1397520080955_0002
{ mapProgress: 1,
reduceProgress: 1,
cleanupProgress: 0,
setupProgress: 0,
runState: 2,
startTime: 1397852941726,
queue: 'default',
priority: 'NORMAL',
schedulingInfo: 'NA',
failureInfo: 'NA',
jobACLs: {},
jobName: 'mapreduce.BaileyBorweinPlouffe_1_100',
jobFile: 'wasb://clustername@blobaccount.blob.core.windows.net/mapred/history/don
e/2014/04/18/000000/job_1397520080955_0002_conf.xml',
finishTime: 1397853167444,
historyFile: '',
trackingUrl: 'headnodehost:19888/jobhistory/job/job_1397520080955_0002',
numUsedSlots: 0,
numReservedSlots: 0,
usedMem: 0,
reservedMem: 0,
neededMem: 0,
jobPriority: 'NORMAL',
jobID: { id: 2, jtIdentifier: '1397520080955' },
jobId: 'job_1397520080955_0002',
username: 'userthatranjob',
state: 'SUCCEEDED',
retired: false,
uber: false,
jobComplete: true }
(continues...)
</pre>
</div>
<p>Hopefully this provides a good starting point to show how easy it is to integrate external or on-premises PowerShell or Node.js solutions with HDInsight. Listing Job information is just a simple example, but with the <a href="https://cwiki.apache.org/confluence/display/Hive/WebHCat+Reference">WebHCat API</a>, you can query the Hive/HCatalog metastore, Create/Drop Hive Tables, and even launch <a href="https://cwiki.apache.org/confluence/display/Hive/WebHCat+Reference+MapReduceJar">MapReduce Jobs</a>, <a href="https://cwiki.apache.org/confluence/display/Hive/WebHCat+Reference+Hive">Hive Queries</a> and <a href="https://cwiki.apache.org/confluence/display/Hive/WebHCat+Reference+Pig">Pig Jobs</a>. The ability to execute work and monitor your HDInsight cluster externally can allow you to integrate with other systems and solutions, using whatever language or platform works best for you.</p>
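<p>As a small taste of that, here is a hedged sketch in the same pattern as the job-listing script above that queues a Hive query through WebHCat. The cluster name, query and status folder are placeholders; WebHCat responds with the id of the job it queued, which you can then poll with the /templeton/v1/jobs/:jobid call shown earlier.</p>
<div class="smallerfont">
<pre class="brush: ps; auto-links: false;"># Sketch: queue a Hive query via WebHCat (placeholders: cluster name, query, status folder)
if(!$ClusterCredential) { $ClusterCredential = Get-Credential }
$ClusterName = 'clustername'
$Body = @{
    'user.name' = $ClusterCredential.UserName
    execute     = 'select count(*) from hivesampletable;'   # any HiveQL statement
    statusdir   = 'wasb:///example/webhcat/status'           # where WebHCat writes stdout/stderr
}
$Response = Invoke-RestMethod -Method Post -Credential $ClusterCredential `
    -Uri "https://$ClusterName.azurehdinsight.net/templeton/v1/hive" -Body $Body
Write-Host "Queued Hive job with id: $($Response.id)"
</pre>
</div>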
<script type="text/javascript">// <![CDATA[
SyntaxHighlighter.all()
// ]]></script><div style="clear:both;"></div><img src="http://blogs.msdn.com/aggbug.aspx?PostID=10519165" width="1" height="1">Rick_Hhttp://blogs.msdn.com/Rick_5F00_H/ProfileUrlRedirect.ashxCustomizing HDInsight Cluster provisioning http://blogs.msdn.com/b/bigdatasupport/archive/2014/04/15/customizing-hdinsight-cluster-provisioning-via-powershell-and-net-sdk.aspx2014-04-15T22:55:00Z2014-04-15T22:55:00Z<p>In my last <a href="http://blogs.msdn.com/b/bigdatasupport/archive/2014/02/13/how-to-pass-hadoop-configuration-values-for-a-job-via-hdinsight-powershell-and-net-sdk.aspx">blog</a>, I discussed how we can specify Hadoop configurations for a job on an HDInsight cluster. At the end of that blog, I also discussed the alternative approach where you may want to change certain Hadoop configurations from their default values and preserve the changes throughout the lifetime of the cluster &ndash; perhaps because the configurations have worked quite well for your workload during testing and apply to most of your jobs. You can do this via cluster customization while creating the HDInsight cluster. This approach also fits well with the 'elastic hadoop in the cloud' scenario where you would create a customized HDInsight cluster with specific configurations, run your workload and then remove the cluster. While creating my own customized cluster, I realized that it was not very obvious from our existing documentation what different customization options are available or how to use them without digging through the reference documentation. In this blog, I wanted to share a few examples (a PowerShell script and a .Net SDK example) with various customization options that can be used during HDInsight cluster provisioning.</p>
<p><span style="font-size: medium;"><strong><em>Can&nbsp;we do it using Azure Portal? </em></strong></span></p>
<p>The short answer is, yes &ndash; but with limitations. As shown in our HDInsight <a href="http://azure.microsoft.com/en-us/documentation/articles/hdinsight-provision-clusters/">documentation</a>, we can create a customized HDInsight cluster via the Azure Portal, Windows Azure PowerShell or the HDInsight .Net SDK. While I personally like the Azure Portal most for its simplicity and ease of use, not all the customization options are available via the portal as of today &ndash; for example, customizing Hadoop configuration files&nbsp;or adding additional libraries or JARs during cluster provisioning, as shown in this codeplex <a href="https://hadoopsdk.codeplex.com/wikipage?title=PowerShell%20Cmdlets%20for%20Cluster%20Management&amp;referringTitle=Home">example</a>. Also, the UI restricts us to a certain number of additional storage accounts we can specify on the portal. Windows Azure PowerShell and the HDInsight .Net SDK don't have such limitations, and with these tools you can use all the available customization options. Another benefit is that you can reuse the PowerShell script or .Net SDK code and make it part of your workflow.</p>
<p>The chart below shows&nbsp;a summary of a few important customizations that are available via portal, PowerShell and .Net SDK&nbsp;-</p>
<p><a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78/5460.ClusterCustomizationOptions.PNG"><img src="http://blogs.msdn.com/resized-image.ashx/__size/550x0/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78/5460.ClusterCustomizationOptions.PNG" alt="" border="0" /></a></p>
<p><span style="font-size: medium;"><strong><em>Example using Windows Azure PowerShell: </em></strong></span></p>
<p>Here is a sample PowerShell script&nbsp;with examples of almost all the possible customization options during provisioning of a cluster. You can omit the customizations that you don't need.</p>
<script type="text/javascript" src="https://gist.github.com/AzimUddin/0307dbf7d7e705cdc5e0.js"></script>
<p>&nbsp;</p>
<p><span style="font-size: medium;"><strong><em>Example using HDInsight .Net SDK: </em></strong></span></p>
<p><span style="font-size: small;">Here is an equivalent cluster customization sample with HDInsight .Net SDK. Like before, omit the customizations you don't need.</span></p>
<script type="text/javascript" src="https://gist.github.com/AzimUddin/11025548.js"></script>
<p><em><strong></strong></em>&nbsp;</p>
<p><em><strong><span style="font-size: medium;">Can we customize a cluster after&nbsp;Provisioning?</span></strong></em></p>
<p><span style="font-size: small;">We can, but as explained in Dan's <a href="http://blogs.msdn.com/b/bigdatasupport/archive/2013/11/01/the-hdinsight-support-team-is-open-for-business.aspx">blog</a>, outside of cluster customization during the install time, any manual modification of the Hadoop configuration files or any other file won't be preserved when the Azure VM nodes get updated - hence this is not recommended or supported. But the good news is, you can always customize or configure a Job and here are some of the possible options (not limited to)-</span></p>
<p><span style="font-size: small;">1. You can specify Hadoop configuration values for a job, as shown in this <a href="http://blogs.msdn.com/b/bigdatasupport/archive/2014/02/13/how-to-pass-hadoop-configuration-values-for-a-job-via-hdinsight-powershell-and-net-sdk.aspx">blog</a></span></p>
<p><span style="font-size: small;">2. You can use additional Azure Storage accounts (that are not associated with this HDInsight cluster) for a job, as shown in this <a href="http://social.technet.microsoft.com/wiki/contents/articles/23256.using-an-hdinsight-cluster-with-alternate-storage-accounts-and-metastores.aspx">TechNet article</a></span></p>
<p><span style="font-size: small;">3. You can upload a custom JAR to Window Azure Blob Storage and refer to that JAR from a job via MapReduce -libjars, Hive 'Add Jar' or Pig Register mechanisms.</span></p>
<p><span style="font-size: small;">That's all for today. I hope you find the blog helpful!</span></p>
<p><span style="font-size: small;">@Azim (MSFT)</span></p><div style="clear:both;"></div><img src="http://blogs.msdn.com/aggbug.aspx?PostID=10517653" width="1" height="1">Azim Uddinhttp://blogs.msdn.com/azim91_4000_hotmail.com/ProfileUrlRedirect.ashxUsing Apache Flume with HDInsight http://blogs.msdn.com/b/bigdatasupport/archive/2014/03/18/using-apache-flume-with-hdinsight.aspx2014-03-18T18:30:00Z2014-03-18T18:30:00Z<p><span style="font-family: Consolas;"><span style="font-size: 10pt;"><strong>Gregory Suarez &ndash; 03/18/2014</strong></span> </span></p>
<p>&nbsp;</p>
<p><span style="font-size: 10pt;">(This blog posting assumes some basic knowledge of Apache Flume) </span></p>
<p>&nbsp;</p>
<p><strong>Overview </strong></p>
<p>When asked if <a href="http://flume.apache.org/">Apache Flume</a> can be used with HDInsight, the response is typically no. We do not currently include Flume in our HDInsight service offering or in the HDInsight Server platform (which is a single node deployment that can be used as a local development environment for the HDInsight service). In addition, the vast majority of Flume consumers will land their streaming data into HDFS &ndash; and HDFS is not the default file system used with HDInsight. Even if it were, we do not expose public facing Name Node or HDFS endpoints, so the Flume agent would have a terrible time reaching the cluster! So, for these reasons and a few others, the answer is typically "no&hellip; it won't work or it's not supported<em>"</em>.</p>
<p>While Flume is not supported by Microsoft, there is no reason why it can't be used to stream your data to Azure Blob storage &ndash; thus making your data available to your HDInsight Cluster. If support is needed specifically for Flume, the forums and discussion groups associated with your Hadoop distribution can be used to answer questions related to Flume.</p>
<p>How can Flume be used with HDInsight? Considering the default (and recommended) file system used with HDInsight is an Azure Blob storage container, we can use techniques introduced in my earlier <a href="http://blogs.msdn.com/b/bigdatasupport/archive/2014/01/09/mount-azure-blob-storage-as-local-drive.aspx">blog</a> to create a local drive mapping to your Azure blob container from a Windows machine and then configure a Flume agent to use a <strong>file_roll</strong> sink which points to the newly created Windows drive. This allows your Flume agent to essentially land your data in Azure blob storage. The techniques introduced today can be used with existing Flume installations running Windows or Linux based Hadoop distributions &ndash; including <a href="http://hortonworks.com/">Hortonworks</a>, <a href="http://www.cloudera.com/content/cloudera/en/home.html">Cloudera</a>, <a href="http://www.mapr.com/">MapR</a> and others.</p>
<p>Why would someone with an existing localized Hadoop distribution want to send their streaming data to HDInsight versus streaming it locally to their HDFS cluster? Perhaps the local cluster is reaching its limits and provisioning additional machines is becoming less cost effective. Perhaps the idea of provisioning a cluster "on-demand" to process your data (which is growing every day &ndash; compliments of Flume) is starting to become very appealing. Data can continue to be ingested even if you have decommissioned your HDInsight cluster. Perhaps you are learning Hadoop from an existing sandbox (that includes Flume-NG) and using it to collect events from various server logs in your environment; the data has grown beyond the capabilities of the sandbox and you've been considering the HDInsight service. Making a simple Flume agent configuration change to an existing sink would land your data in the blob storage container, which makes it available to your HDInsight cluster. <strong> </strong></p>
<p><strong>Flume Overview </strong></p>
<p>Flume is all about data ingestion (ingres) into your cluster. In particular, log files that are accumulating on a few machines or even thousands of machines can be collected, aggregated, and streamed to a single entry point within your cluster. Below describes some Flume components and concepts:</p>
<ul>
<li><span style="color: #222222;"><strong>Event: </strong>The basic payload of data transported by Flume. It represent the unit of data that Flume can transport from its point of origination to its final destination. Optional headers are chained together via Interceptors and are typically used to inspect and alter events. </span></li>
<li><span style="color: #222222;"><strong>Client: </strong>An interface implementation that operates at the point of origin of events and delivers them to a Flume agent. Clients typically operate in the process space of the application they are consuming data from. </span></li>
<li><span style="color: #222222;"><strong>Agent:</strong> Core element in Flume's data path. Hosts flume components such as sources, channels and sinks, and has the ability to receive, store and forward events to their next-hop destination. </span></li>
<li><span style="color: #222222;"><strong>Source:</strong> Consumes events delivered to it via a client. When a source receives an event, it hands it over to one or more channels. </span></li>
<li><span style="color: #222222;"><strong>Channel:</strong> A transient store for events. It's the glue between the source and sink. Channels play an important role in ensuring durability of the flows. </span></li>
<li><span style="color: #222222;"><strong>Sink:</strong> Remove events from a channel and transmit them to the next agent in the flow, or to the event's final destination. Sinks that transmit the event to its final destination are also known as terminal sinks. </span></li>
</ul>
<p style="background: white;">Events flow from the client to the source. The source writes events to one or more channels. The channel is the holding area of events in flight and can be configured durable (file backed) or non-durable (memory backed). The events will wait in the channel until the consuming sink can drain it and send the data off to its final destination.</p>
<p style="background: white;">Below depicts a simple Flume agent configured with an HDFS terminal sink:</p>
<p style="background: white;"><img src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78-metablogapi/2061.031914_5F00_0101_5F00_UsingApache1.png" alt="" /></p>
<p style="background: white;"><span style="font-size: 9pt;">For additional details and configuration options available for Flume, please visit the <a href="http://flume.apache.org/FlumeUserGuide.html">Apache Flume website</a>. </span></p>
<p style="background: white;">&nbsp;</p>
<p><strong>Flume &amp; Azure Blob Storage </strong></p>
<p style="background: white;">To allow Flume to send event data to an HDInsight cluster, a <a href="http://flume.apache.org/FlumeUserGuide.html">File Roll Sink</a> will need to be configured within the Flume configuration. The sink directory (the directory where the agent will send events) must point to a Windows drive that is mapped to the Azure Blob storage container. Below is a diagram depicting the flow.</p>
<p style="background: white;">&nbsp;</p>
<p style="background: white;"><img src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78-metablogapi/7367.031914_5F00_0101_5F00_UsingApache2.png" alt="" /></p>
<p style="background: white;">Here's a sample Flume configuration that defines a file_roll sink connected to Azure Blob storage:</p>
<p style="background: white;">&nbsp;</p>
<p style="background: white;"><img src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78-metablogapi/2313.031914_5F00_0101_5F00_UsingApache3.png" alt="" /></p>
<p style="background: white;">&nbsp;</p>
<p style="background: white;">To further illustrate the connectivity concept, below I issue an <strong>&ndash;ls</strong> command from the same Linux machine hosting the above Flume configuration to demonstrate the Azure connectivity. Earlier, I placed a file called Test.txt in the logdata directory.</p>
<p style="background: white;">&nbsp;</p>
<p style="background: white;"><img src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78-metablogapi/5670.031914_5F00_0101_5F00_UsingApache4.png" alt="" /></p>
<p style="background: white;">&nbsp;</p>
<p><strong>Connect Flume file_roll sink to Azure Blob storage </strong></p>
<p style="background: white;">&nbsp;</p>
<p style="background: white;">This section provides high level configuration details to connect a Flume file_roll sink to Azure blob storage. Although all possible scenarios cannot be covered, the information below should be enough to get you started.</p>
<p style="background: white;">&nbsp;</p>
<p style="background: white; margin-left: 36pt;">Install <a href="http://www.cloudberrylab.com/amazon-cloud-storage-online-hard-drive.aspx">CloudBerry Drive</a> on a Windows machine. The Cloudberry drive will provide the central glue connecting the local file system to Azure blob storage endpoint. Note, Cloudberry Drive comes in two different flavors. If you plan on exposing the drive via a network share &ndash; perhaps to allow Flume agent running on Linux to access ) you'll need to install the server flavor of Cloudberry drive. If you install CloudBerry drive on a machine hosting virtual machines, the guests OS's should be able to access the drive via Shared Folders (depending on the VM technology you are using). Conifguration of the software is fairly trivial. I've included some screen shots showing my configuration:</p>
<p style="background: white;">&nbsp;</p>
<p style="background: white; margin-left: 45pt;">Supply Azure Storage account name and key</p>
<p style="background: white; margin-left: 45pt;">&nbsp;</p>
<p style="background: white; margin-left: 72pt;"><img src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78-metablogapi/1768.031914_5F00_0101_5F00_UsingApache5.png" alt="" /></p>
<p style="background: white;">&nbsp;</p>
<p style="background: white;">&nbsp;</p>
<p style="background: white; margin-left: 45pt;">Select the Storage account and specify the container used for HDInsight. In my configuration, I'm using CloudBerry Desktop. Be sure Network Mapped Drive is selected. Note for CloudBerry Server version, and additional option will be presented indicating whether this drive should be exposed via network share</p>
<p style="background: white;">&nbsp;</p>
<p style="background: white; margin-left: 72pt;"><img src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78-metablogapi/5584.031914_5F00_0101_5F00_UsingApache6.png" alt="" /></p>
<p style="background: white;">&nbsp;</p>
<p style="background: white;">&nbsp;</p>
<p style="background: white; margin-left: 72pt;"><img src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78-metablogapi/0702.031914_5F00_0101_5F00_UsingApache7.png" alt="" /></p>
<p style="background: white;">&nbsp;</p>
<p style="background: white;">&nbsp;</p>
<p style="background: white; margin-left: 36pt;">In my configuration, my Flume agent is installed on an edge node within my Hortonworks HDP 2.0.6 Linux cluster. The node is a VM inserted into the cluster via Ambari. I used VMWare Shared Folders to expose the CloudBerry drive to the guest OS running Centos 6.5</p>
<p style="background: white;">&nbsp;</p>
<p style="background: white; margin-left: 72pt;"><img src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78-metablogapi/0825.031914_5F00_0101_5F00_UsingApache8.png" alt="" /></p>
<p style="background: white;">&nbsp;</p>
<p style="background: white; margin-left: 36pt;">I installed VMWare tools and mounted the shared folder to the following directory on the local system -&gt; /blob using the following command:</p>
<p style="background: white; margin-left: 72pt;">&nbsp;</p>
<p style="background: white; margin-left: 72pt;">mount -t vmhgfs .host:/ /blob</p>
<p style="background: white;">&nbsp;</p>
<p style="background: white; margin-left: 36pt;">Next I verified the drive was successfully mounted by issuing an -ls /blob</p>
<p style="background: white; margin-left: 36pt;">Finally, I tested a few Flume sources (exec, Twitter, etc&hellip;) against a directory contained in my Azure blob</p>
<p style="background: white;">&nbsp;</p>
<p style="background: white; margin-left: 36pt;"><img src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78-metablogapi/7563.031914_5F00_0101_5F00_UsingApache9.png" alt="" /></p>
<p style="background: white;">&nbsp;</p>
<p><strong>Having fun with Twitter and Flume </strong></p>
<p style="background: white;">I downloaded the <a href="http://s3.amazonaws.com/hw-sandbox/tutorial13/SentimentFiles.zip">SentimentFiles.zip </a> files associated with Hortonworks <a href="http://hortonworks.com/hadoop-tutorial/how-to-refine-and-visualize-sentiment-data/">Tutorial 13: How to Refine and Visualize Sentiment Data</a> and unzipped the files into a directory on a node within my local HDP 2.0 Linux cluster. I wanted to stream Twitter data directly from the Flume agent directly to my HDInsight cluster. I modified the <strong><em>flumetwitter.conf</em></strong> file and changed the HDFS sink to a file_roll sink which points to the Azure blob container. (see partial configuration changes below)</p>
<p style="background: white;">&nbsp;</p>
<p style="background: white;">flumetwitter.conf</p>
<p style="background: white;"><img src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78-metablogapi/2388.031914_5F00_0101_5F00_UsingApache10.png" alt="" /></p>
<p style="background: white;">Next I launched the Flume agent using the command below and collected data for about 2 weeks</p>
<p style="background: white;">&nbsp;</p>
<p><span style="font-family: Consolas; font-size: 10pt;">root@hdp20-machine4 logdata]# <strong>flume-ng agent -c /root/SentimentFiles/SentimentFiles/flume/conf/ -f /root/SentimentFiles/SentimentFiles/flume/conf/flumetwitter.conf -Dflume.root.logger=DEBUG,console -n TwitterAgent</strong>&nbsp; </span></p>
<p style="background: white;">&nbsp;</p>
<p style="background: white;">Below charts the sink roll intervals over a timespan of about 2 weeks with some additional notes in the image below.</p>
<p style="background: white;">&nbsp;</p>
<p style="background: white;"><img src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78-metablogapi/1830.031914_5F00_0101_5F00_UsingApache11.png" alt="" /></p>
<p style="background: white;">&nbsp;</p>
<p style="background: white;">During this time span, I collected a little over 32 GB of Tweets pertaining to the keywords I defined in flumetwitter.conf file above.</p>
<p style="background: white;">&nbsp;</p>
<p style="background: white;">I wrote a very simple .NET stream based mapper and reducer to count over the source token within each tweet.</p>
<p style="background: white;">&nbsp;</p>
<p style="background: white;">Here's my <strong>mapper</strong> and <strong>reducer</strong> code:</p>
<script type="text/javascript" src="https://gist.github.com/gregorysuarez/9520005.js"></script>
<script type="text/javascript" src="https://gist.github.com/gregorysuarez/9520131.js"></script>
<p>After running the job, I imported the results into PowerView and answered some basic questions: "What are the top sources?" and "Which applications on Android or Windows mobile devices are the most common?"</p>
<p>Here's the top 20 by source.</p>
<p><img src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78-metablogapi/8512.031914_5F00_0101_5F00_UsingApache12.png" alt="" /></p>
<p>Admittedly, better analytics can be performed given the content and subject matter of the data collected, but today I wanted a very simple test job to extract some basic information.</p>
<p><strong>Conclusion </strong></p>
<p>Over the last few weeks, I've collected over 130 GB of streaming data ranging from log files, to Azure DataMarket content, to Twitter feeds. This is hardly "big data", but it could be &hellip; over time.</p>
<p>&nbsp;</p>
<p>Next time, Bill Carroll and I will analyze the data collected using Hive and the <a href="http://hadoopsdk.codeplex.com/wikipage?title=Map%2fReduce&amp;referringTitle=Home">.NET SDK for Hadoop- Map/Reduce</a> incubator api's.</p>
<p>&nbsp;</p>
<p>&nbsp;</p><div style="clear:both;"></div><img src="http://blogs.msdn.com/aggbug.aspx?PostID=10508851" width="1" height="1">Gregory Suarez - MSFThttp://blogs.msdn.com/cts_2D00_gregorys_4000_live.com/ProfileUrlRedirect.ashxOozie sqoop action hits primary key violationhttp://blogs.msdn.com/b/bigdatasupport/archive/2014/03/17/oozie-sqoop-action-hits-primary-key-violation.aspx2014-03-17T17:01:00Z2014-03-17T17:01:00Z<p>We have seen multiple customers contact us where an oozie job appears to hang. The oozie job involves a sqoop action which is exporting data from a file in HDInsight to a table in a SQL Azure database. For background on Sqoop see <a href="http://blogs.msdn.com/b/bigdatasupport/archive/2014/01/07/getting-started-with-sqoop-in-hdinsight.aspx" target="_blank">Getting Started with Sqoop</a> . We will use this blog to help understand HDInsight's behavior better. The actual problem is that SQL Azure database raises a primary key violation.</p>
<p>Typically a primary key violation is not going to be resolved by another attempt unless the record is removed from the SQL Server table before you re-execute the command.</p>
<p>Within SQL Server you will see the 2627 error message when a primary key violation is encountered.</p>
<p style="background: #d9d9d9; margin-left: 36pt;">Msg 2627, Level 14, State 1, Line 1 Violation of PRIMARY KEY constraint 'PK_Table_4'. Cannot insert duplicate key in object 'dbo.Table1'. The duplicate key value is (1).The statement has been terminated.</p>
<p>&nbsp;</p>
<p>In the HDInsight tasktracker logs, the same primary key violation error shows up.</p>
<p style="background: #d9d9d9; margin-left: 36pt;">2014-03-15 13:51:00,704 INFO org.apache.hadoop.util.NativeCodeLoader: Loaded the native-hadoop library<br />2014-03-15 13:51:02,501 INFO org.apache.hadoop.mapred.Task: Using ResourceCalculatorPlugin : <a href="mailto:org.apache.hadoop.util.WindowsResourceCalculatorPlugin@671aeb3">org.apache.hadoop.util.WindowsResourceCalculatorPlugin@671aeb3</a><br />2014-03-15 13:51:04,094 INFO org.apache.hadoop.mapred.MapTask: Processing split: Paths:/user/hdp/Table1/Table1.csv:26+13<br />2014-03-15 13:51:04,547 WARN org.apache.hadoop.io.compress.snappy.LoadSnappy: Snappy native library is available<br />2014-03-15 13:51:04,547 INFO org.apache.hadoop.io.compress.snappy.LoadSnappy: Snappy native library loaded<br />2014-03-15 13:51:04,719 INFO org.apache.sqoop.mapreduce.AutoProgressMapper: Auto-progress thread is finished. keepGoing=false<br />2014-03-15 13:51:05,329 WARN org.apache.sqoop.mapreduce.SQLServerExportDBExecThread: Error executing statement: java.sql.BatchUpdateException: Violation of PRIMARY KEY constraint &amp;apos;PK_Table_4&amp;apos;. Cannot insert duplicate key in object &amp;apos;dbo.Table1&amp;apos;. The duplicate key value is (1).</p>
<p>&nbsp;</p>
<p>I first followed the blog "Getting Started with Sqoop" above and created a table in a SQL Azure database with four records. I then created a table1.csv file with the same records and placed it in my WASB storage account, in the default container for my HDInsight cluster. Trying to export the Table1.csv records into Table1 in my SQL Azure database should raise the primary key violation. The actual sqoop command is:</p>
<p style="background: #d9d9d9; margin-left: 36pt;">sqoop export --connect "jdbc:sqlserver://xxxxx.database.windows.net:1433;username=hdp@xxxxx;password=xxxxx;database=wpc-wadb" --table Table1 --export-dir /user/hdp/Table1 --input-fields-terminated-by ","</p>
<p>&nbsp;</p>
<p>The map reduce job took 1 hour and 21 minutes to fail! Although the job eventually failed, to end users it appears hung because it ran for so long before failing. Why did it take so long to fail? By default on HDInsight, a mapper will be attempted 8 times and each attempt has a task timeout of 600 seconds. In the tasktracker log below you can see that the total finish time is 1 hour and 21 minutes: eight attempts were made and each attempt took 600 seconds (8 x 600 seconds = 80 minutes), which, with scheduling overhead, gets us to the 1 hour and 21 minutes it took to fail the job completely. The url is <a href="http://jobtrackerhost:50030/jobtasks.jsp?jobid=job_201403171425_0046&amp;type=map&amp;pagenum=1">http://jobtrackerhost:50030/jobtasks.jsp?jobid=job_201403141723_0012&amp;type=map&amp;pagenum=1</a></p>
<p>&nbsp;</p>
<p style="margin-left: 36pt;"><img src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78-metablogapi/2047.031814_5F00_2022_5F00_Ooziesqoopa1.jpg" alt="" /></p>
<p>&nbsp;</p>
<p>In your mapred-site.xml, located at C:\apps\dist\hadoop-1.2.0.1.3.6.0-0862\conf, there are two properties that affect this behavior: mapred.map.max.attempts and mapred.task.timeout.</p>
<p>&nbsp;</p>
<p style="background: #d9d9d9; margin-left: 36pt;">&lt;property&gt;<br />&lt;name&gt;mapred.map.max.attempts&lt;/name&gt;<br />&lt;value&gt;8&lt;/value&gt;<br />&lt;/property&gt;</p>
<p style="background: #d9d9d9; margin-left: 36pt;">&lt;property&gt;<br />&lt;name&gt;mapred.task.timeout&lt;/name&gt;<br />&lt;value&gt;600000&lt;/value&gt;<br />&lt;/property&gt;</p>
<p>&nbsp;&nbsp;</p>
<p>Let's change these properties, restart the namenode and jobtracker services, and run the sqoop command again to see if the behavior changes. Let's change mapred.map.max.attempts to 2 and mapred.task.timeout to 120000 (2 minutes). This time the map reduce job failed in 4 minutes and 40 seconds: it made two map attempts and each attempt timed out at 120 seconds.</p>
<p>&nbsp;</p>
<p style="margin-left: 36pt;"><img src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78-metablogapi/6663.031814_5F00_2022_5F00_Ooziesqoopa2.jpg" alt="" /></p>
<p>&nbsp;</p>
<p>Let's put mapred.map.max.attempts and mapred.task.timeout back to the HDInsight defaults and restart the namenode and jobtracker services. Also, let's create an oozie workflow and job.properties file and execute an oozie job with the sqoop action.</p>
<p>&nbsp;</p>
<p><strong>Workflow.xml </strong></p>
<p style="background: #d9d9d9; margin-left: 36pt;">&lt;workflow-app name="sample-wf" xmlns="uri:oozie:workflow:0.1"&gt;<br />&lt;start to = "OozieSqoopAction"/&gt;&lt;action name="OozieSqoopAction"&gt;<br />&lt;sqoop xmlns="uri:oozie:sqoop-action:0.2"&gt;<br />&lt;job-tracker&gt;${jobTracker}&lt;/job-tracker&gt;&lt;name-node&gt;${nameNode}<br />&lt;/name-node&gt;<br />&lt;configuration&gt;<br />&lt;property&gt;<br />&lt;name&gt;mapred.compress.map.output&lt;/name&gt;<br />&lt;value&gt;true&lt;/value&gt;<br />&lt;/property&gt;<br />&lt;/configuration&gt;<br />&lt;command&gt;export --connect jdbc:sqlserver://xxxxx.database.windows.net:1433;username=hdp@xxxxx;password=xxxxx;database=wpc-wadb --table Table1 --export-dir /user/hdp/Table1 --input-fields-terminated-by ","&lt;/command&gt;<br />&lt;/sqoop&gt;<br />&lt;ok to="end"/&gt;<br />&lt;error to="fail"/&gt;<br />&lt;/action&gt;<br />&lt;kill name="fail"&gt;<br />&lt;message&gt;Job failed, error message[${wf:errorMessage(wf:lastErrorNode())}] <br />&lt;/message&gt;<br />&lt;/kill&gt;<br />&lt;end name="end"/&gt;<br />&lt;/workflow-app&gt;</p>
<p>&nbsp;</p>
<p><strong>Job.properties file </strong></p>
<p style="background: #d9d9d9; margin-left: 36pt;">#oozie properties<br />oozie.wf.application.path=wasb://xxx21@portalvhdszmhjyc3mxxxxx.blob.core.windows.net/user/wcarroll/wf<br />oozie.use.system.libpath=true<br />#Hadoop mapred.job.tracker<br />jobTracker=jobtrackerhost:9010<br />#Hadoop fs.default.name<br />nameNode=wasb://xxx21@portalvhdszmhjyc3mxxxxx.blob.core.windows.net<br />#Hadoop mapred.queue.name<br />queueName=default</p>
<p>&nbsp;</p>
<p>I then RDPed into my headnode, opened a Hadoop command prompt, and changed directories to C:\apps\dist\oozie-3.3.2.1.3.6.0-0862\oozie-win-distro\bin. I copied my workflow.xml file into wasb and the job.properties file to a local folder, and then issued the command below to start the oozie job.</p>
<p style="background: #d9d9d9; margin-left: 36pt;">oozie job -oozie http://namenodehost:11000/oozie -config c:\temp\pk\job.properties &ndash;run</p>
<p>&nbsp;</p>
<p>If we look at the tasktracker logs again we see the job finished in 41 minutes and 1 second. It turns out that the default mapred.map.max.attempts on the Hortonworks Data Platform is actually 4, and the oozie service overrides the cluster's mapred-site.xml value for this parameter, so it only makes four attempts (4 x 600 seconds = 40 minutes) instead of the 8 attempts we saw when issuing the sqoop command outside of oozie.</p>
<p style="margin-left: 36pt;"><img src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78-metablogapi/8345.031814_5F00_2022_5F00_Ooziesqoopa3.png" alt="" /></p>
<p>&nbsp;</p>
<p>This is a little better, but we would still like to change this behavior. Let's modify the workflow.xml file to pass in <span style="color: #ff0000;">mapred.map.max.attempts = 2</span> and <span style="color: #800080;">mapred.task.timeout = 120000</span> and run the oozie job again. The new workflow.xml with the property changes is below. Copy it back to wasb and re-run the oozie job.</p>
<p style="margin-left: 36pt;"><strong>New Workflow.xml </strong></p>
<p style="background: #d9d9d9; margin-left: 36pt;">&lt;workflow-app name="sample-wf" xmlns="uri:oozie:workflow:0.1"&gt;<br />&lt;start to = "OozieSqoopAction"/&gt;<br />&lt;action name="OozieSqoopAction"&gt;<br />&lt;sqoop xmlns="uri:oozie:sqoop-action:0.2"&gt;<br />&lt;job-tracker&gt;${jobTracker}&lt;/job-tracker&gt;<br />&lt;name-node&gt;${nameNode}&lt;/name-node&gt;<br />&lt;configuration&gt;<br />&lt;property&gt;<br />&lt;name&gt;mapred.compress.map.output&lt;/name&gt;<br />&lt;value&gt;true&lt;/value&gt;<br />&lt;/property&gt;&nbsp;&nbsp;&nbsp; <br /><span style="color: #ff0000;">&lt;property&gt; </span><br /><span style="color: #ff0000;">&lt;name&gt;mapred.map.max.attempts&lt;/name&gt; </span><br /><span style="color: #ff0000;">&lt;value&gt;2&lt;/value&gt; </span><br /><span style="color: #ff0000;">&lt;/property&gt;&nbsp;&nbsp;</span>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <br /><span style="color: #800080;">&lt;property&gt; </span><br /><span style="color: #800080;">&lt;name&gt;mapred.task.timeout&lt;/name&gt; </span><br /><span style="color: #800080;">&lt;value&gt;120000&lt;/value&gt; </span><br /><span style="color: #800080;">&lt;/property&gt;</span> <br />&lt;/configuration&gt;<br />&lt;command&gt;export --connect jdbc:sqlserver://xxxxx.database.windows.net:1433;username=hdp@xxxxx;password=xxxxx;database=wpc-wadb --table Table1 --export-dir /user/hdp/Table1 --input-fields-terminated-by ","<br />&lt;/command&gt;<br />&lt;/sqoop&gt;<br />&lt;ok to="end"/&gt;<br />&lt;error to="fail"/&gt;<br />&lt;/action&gt;<br />&lt;kill name="fail"&gt;<br />&lt;message&gt;Job failed, error message[${wf:errorMessage(wf:lastErrorNode())}] &lt;/message&gt;<br />&lt;/kill&gt;<br />&lt;end name="end"/&gt;<br />&lt;/workflow-app&gt;</p>
<p style="margin-left: 36pt;">&nbsp;</p>
<p>Now we are back to the job finishing in 4 minutes and 33 seconds, with 2 attempts each timing out after 120 seconds. Changing the mapreduce configuration properties in the workflow lets us affect the mapreduce parameters of a specific oozie workflow without affecting all the other mapreduce jobs on the cluster. In this case it also allows our mapreduce job to fail more quickly, so we don't have to wait 1 hour and 21 minutes to realize our job failed because of a primary key violation in the SQL Azure database. We also have a method to pass these configuration parameters for the job.</p>
<p style="margin-left: 36pt;"><img src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78-metablogapi/0246.031814_5F00_2022_5F00_Ooziesqoopa4.png" alt="" /></p>
<p>&nbsp;</p>
<p>&nbsp;</p>
<p>For more information on using oozie with HDInsight see <a href="http://www.windowsazure.com/en-us/documentation/articles/hdinsight-use-oozie/" target="_blank">Use oozie with HDInsight</a>. This describes how to submit an oozie job remotely with PowerShell, passing the mapred configuration properties as part of the request.</p>
<p>For more information on passing configuration parameters for a job on HDInsight see <a href="http://blogs.msdn.com/b/bigdatasupport/archive/2014/02/13/how-to-pass-hadoop-configuration-values-for-a-job-via-hdinsight-powershell-and-net-sdk.aspx" target="_blank">Passing configuration parameters for a job on HDInsight</a> .</p>
<p>Before we start to change cluster properties it is best to understand HDInsight\Hadoop behavior. There are many errors that might be transient, and for those a retry attempt is beneficial; for example, a SQL Server deadlock error might benefit from a retry. However, if you understand the nature of the error, passing configuration properties in your oozie workflow is a good option to have. This way you can change the configuration properties of a single mapreduce job issued from oozie without affecting all the mapreduce jobs on the cluster.</p>
<p>&nbsp;</p>
<p>Hope this helps!</p>
<p>Bill</p><div style="clear:both;"></div><img src="http://blogs.msdn.com/aggbug.aspx?PostID=10508481" width="1" height="1">carrollwphttp://blogs.msdn.com/carrollwp_4000_hotmail.com/ProfileUrlRedirect.ashxHDInsight News – New Articles to readhttp://blogs.msdn.com/b/bigdatasupport/archive/2014/03/07/hdinsight-news-new-articles-published-for-you-to-read.aspx2014-03-07T23:05:16Z2014-03-07T23:05:16Z<p>Hi Folks,
</p><p>I'm Jason from the Microsoft Big Data Support team. Thanks for reading our blog, and for trying out HDInsight in your own business.
</p><p>I want to share some new articles Microsoft just published that will be helpful for getting started with HDInsight in your business. To help folks who are not so familiar with the Apache projects, I'll quickly compare each to an existing Microsoft SQL Server feature you might have heard of before, to help ease the transition.
</p><p>For even more articles, check out the main HDInsight documentation front page at <a href="http://www.windowsazure.com/en-us/documentation/services/hdinsight/" target="_blank">http://www.windowsazure.com/en-us/documentation/services/hdinsight/</a>
</p><p>Happy Hadooping! Jason Howell
</p><p>
</p><div><table style="border-collapse:collapse" border="0"><colgroup><col style="width:1061px"/></colgroup><tbody valign="top"><tr><td style="padding-left: 7px; padding-right: 7px; border-top: solid 0.5pt; border-left: solid 0.5pt; border-bottom: solid 0.5pt; border-right: solid 0.5pt"><p><h2>1. <a href="http://www.windowsazure.com/en-us/documentation/articles/hdinsight-monitor-use-ambari-api/" target="_blank"><span style="color:#0563c1; text-decoration:underline">Monitor HDInsight clusters using Ambari API</span></a>
</h2></p><p><a href="http://www.windowsazure.com/en-us/documentation/articles/hdinsight-monitor-use-ambari-api/" target="_blank"><span style="font-size:12pt">http://www.windowsazure.com/en-us/documentation/articles/hdinsight-monitor-use-ambari-api/</span></a><span style="font-size:12pt">
</span><span style="font-size:14pt">
</span></p><p style="margin-left: 72pt">
</p><ul style="margin-left: 72pt"><li><div><strong>Analogous to System Center:</strong> If you are familiar with Microsoft System Center to deploy applications, and SCOM Management packs to measure those system, you will feel at home with the concepts in Apache Ambari.
</div><p>
</p></li><li><div><strong>What is Ambari?</strong> The Apache Ambari project is aimed at making Hadoop management simpler by developing software for provisioning, managing, and monitoring Apache Hadoop clusters. Ambari provides an intuitive, easy-to-use Hadoop management web UI backed by its RESTful APIs. <a href="http://ambari.apache.org/" target="_blank">http://ambari.apache.org/</a>
</div><p><span style="font-size:9pt">
<img src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78-metablogapi/4186.030714_5F00_2327_5F00_HDInsightNe1.png" alt=""/></span>
</p></li></ul></td></tr><tr><td style="padding-left: 7px; padding-right: 7px; border-top: none; border-left: solid 0.5pt; border-bottom: solid 0.5pt; border-right: solid 0.5pt"><p><h2>2. <a href="http://www.windowsazure.com/en-us/documentation/articles/hdinsight-use-sqoop/" target="_blank"><span style="color:#0563c1; text-decoration:underline">Use Sqoop with HDInsight</span></a>
</h2></p><p><a href="http://www.windowsazure.com/en-us/documentation/articles/hdinsight-use-sqoop/" target="_blank"><span style="font-size:12pt">http://www.windowsazure.com/en-us/documentation/articles/hdinsight-use-sqoop/</span></a><span style="font-size:12pt">
</span></p><p style="margin-left: 72pt">
</p><ul style="margin-left: 72pt"><li><div><strong>Analogous to BCP: </strong>If you are familiar with SQL Server BCP.exe then Sqoop will be an easy tool for you to learn for Hadoop and HDInsight.
</div><p>
</p></li><li><div><strong>What is Sqoop? </strong> Apache Sqoop(TM) is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases. <a href="http://sqoop.apache.org/" target="_blank">http://sqoop.apache.org/</a>
<strong>
</strong></div><p><img src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78-metablogapi/2055.030714_5F00_2327_5F00_HDInsightNe2.png" alt=""/><strong>
</strong></p><p>See our <a href="http://blogs.msdn.com/b/bigdatasupport/archive/tags/sqoop/" target="_blank">blog posts</a> on Sqoop as well
</p></li></ul></td></tr><tr><td style="padding-left: 7px; padding-right: 7px; border-top: none; border-left: solid 0.5pt; border-bottom: solid 0.5pt; border-right: solid 0.5pt"><p><h2>3. <a href="http://www.windowsazure.com/en-us/documentation/articles/hdinsight-analyze-twitter-data/" target="_blank"><span style="color:#0563c1; text-decoration:underline">Analyze Twitter data with HDInsight</span></a>
</h2></p><p><a href="http://www.windowsazure.com/en-us/documentation/articles/hdinsight-analyze-twitter-data/" target="_blank"><span style="font-size:12pt">http://www.windowsazure.com/en-us/documentation/articles/hdinsight-analyze-twitter-data/</span></a><span style="font-size:12pt">
</span></p><p>
</p><ul style="margin-left: 72pt"><li>In this tutorial, you will connect to Twitter web service to get some Tweets using the Twitter streaming API, and then you will use Hive to get a list of Twitter users that sent most Tweets that contained a certain word.
</li></ul><p>
</p><p style="margin-left: 72pt"><img src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78-metablogapi/5001.030714_5F00_2327_5F00_HDInsightNe3.png" alt=""/>
<img src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78-metablogapi/4428.030714_5F00_2327_5F00_HDInsightNe4.png" alt=""/>
</p><p style="margin-left: 72pt">
</p><p style="margin-left: 72pt">By the way, shameless plug! Follow us on twitter as <strong>@MSBigDataSupp</strong>
<a href="https://twitter.com/MSBigDataSupp">https://twitter.com/MSBigDataSupp</a>
</p></td></tr><tr><td style="padding-left: 7px; padding-right: 7px; border-top: none; border-left: solid 0.5pt; border-bottom: solid 0.5pt; border-right: solid 0.5pt"><p><h2>4. <a href="http://www.windowsazure.com/en-us/documentation/articles/hdinsight-analyze-flight-delay-data/" target="_blank"><span style="color:#0563c1; text-decoration:underline">Analyze flight delay data with HDInsight</span></a>
</h2></p><p><a href="http://www.windowsazure.com/en-us/documentation/articles/hdinsight-analyze-flight-delay-data/" target="_blank"><span style="font-size:12pt">http://www.windowsazure.com/en-us/documentation/articles/hdinsight-analyze-flight-delay-data/</span></a><span style="font-size:12pt">
</span></p><p>
</p><ul style="margin-left: 72pt"><li>This tutorial shows you how to use Hive to calculate average delays among airports, and how to use Sqoop to export the results to SQL Database.
</li></ul><p><span style="font-size:14pt">
<img src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78-metablogapi/1881.030714_5F00_2327_5F00_HDInsightNe5.png" alt=""/>
</span></p></td></tr><tr><td style="padding-left: 7px; padding-right: 7px; border-top: none; border-left: solid 0.5pt; border-bottom: solid 0.5pt; border-right: solid 0.5pt"><p><h2>5. <a href="http://www.windowsazure.com/en-us/documentation/articles/hdinsight-use-oozie/" target="_blank"><span style="color:#0563c1; text-decoration:underline">Use Oozie with HDInsight</span></a>
</h2></p><p><a href="http://www.windowsazure.com/en-us/documentation/articles/hdinsight-use-oozie/" target="_blank"><span style="font-size:12pt">http://www.windowsazure.com/en-us/documentation/articles/hdinsight-use-oozie/</span></a><span style="font-size:12pt">
</span></p><p style="margin-left: 72pt">
</p><ul style="margin-left: 72pt"><li><div><strong>Analogous to SQL Agent: </strong>If you are familiar with SQL Server Agent for job scheduling, and SQL Server Integration Services (SSIS) Control flow tasks, Oozie might be a good tool for you to try on Hadoop and HDInsight.
</div><p>
</p></li><li><div><strong>What is Oozie?</strong> Apache Oozie (TM) is a workflow scheduler system to manage Apache Hadoop jobs. Oozie Workflow jobs are Directed Acyclical Graphs (DAGs) of actions. Oozie Coordinator jobs are recurrent Oozie Workflow jobs triggered by time (frequency) and data availability. Oozie is integrated with the rest of the Hadoop stack supporting several types of Hadoop jobs out of the box (such as Java map-reduce, Streaming map-reduce, Pig, Hive, Sqoop and Distcp) as well as system specific jobs (such as Java programs and shell scripts). Oozie is a scalable, reliable and extensible system. <a href="http://oozie.apache.org/" target="_blank">http://oozie.apache.org/</a>
</div><p><img src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78-metablogapi/4035.030714_5F00_2327_5F00_HDInsightNe6.png" alt=""/><span style="font-size:14pt">
</span></p></li></ul></td></tr></tbody></table></div><p>
</p><p style="margin-left: 54pt">
</p><p style="margin-left: 72pt"><span style="font-size:14pt"><strong>
</strong></span> </p><div style="clear:both;"></div><img src="http://blogs.msdn.com/aggbug.aspx?PostID=10506219" width="1" height="1">Jason H (Azure)http://blogs.msdn.com/Jason-Howell/ProfileUrlRedirect.ashxMahout with HDInsighthttp://blogs.msdn.com/b/bigdatasupport/archive/2014/02/19/mahout-with-hdinsight.aspx2014-02-19T12:46:00Z2014-02-19T12:46:00Z<p style="text-align: justify;"><span style="font-size: small;">My name is Sudhir and I work with the Microsoft HDInsight support team. The other day, my colleague, Dan and I were discussing MAHOUT, so I thought about how it can be used with HDInsight.</span></p>
<p style="text-align: justify;"><span style="font-size: small;">[* <span style="text-decoration: underline;"><em><strong>Note</strong></em></span>:- If you are using HDInsight 3.1, it's have mahout package installed in it. So in such case one need not to follow some part of this blog post like uploading mahout jar file. Please have a look <a href="http://azure.microsoft.com/en-us/documentation/articles/hdinsight-mahout/#recommendations">here</a>&nbsp;to find out how to run mahout job in HDInsight 3.1.]</span></p>
<p style="text-align: justify;"><span style="font-size: small;">I investigated more to see how MAHOUT can be used with HDInsight and I feel its good information to share. First, I tried through master node (RDP) and then I tried with PowerShell. Running MAHOUT from the head node won&rsquo;t be a recommended approach because if a cluster gets reimaged all the changes to the configuration will not be available. So I skipped this approach and started looking at PowerShell.&nbsp; &nbsp;&nbsp;</span></p>
<p style="text-align: justify;"><span style="font-size: small;"><strong>Before I start I want to mention that MAHOUT is <span style="text-decoration: underline;">not</span> supported by Microsoft.</strong></span></p>
<p style="text-align: justify;"><span style="font-size: small;"><span style="font-size: small;">In case you want to read more about MAHOUT click </span><a href="https://cwiki.apache.org/confluence/display/MAHOUT/Mahout+Wiki">here</a>.&nbsp; I&rsquo;ll be using RecommenderJob class for this example. More information about the class can be found </span><a href="https://cwiki.apache.org/confluence/display/MAHOUT/Itembased+Collaborative+Filtering">here</a>.</p>
<p style="text-align: justify;">Here are the step by step instruction to use MAHOUT on HDInsight through PowerShell. &nbsp;</p>
<p style="text-align: justify;">Copy following files to a folder on the Local Machine</p>
<p><a href="http://repo2.maven.org/maven2/org/apache/mahout/mahout-core/0.8/mahout-core-0.8-job.jar">http://repo2.maven.org/maven2/org/apache/mahout/mahout-core/0.8/mahout-core-0.8-job.jar</a></p>
<p><a href="https://github.com/rawatsudhir/Samples/blob/master/ItemID.txt">https://github.com/rawatsudhir/Samples/blob/master/ItemID.txt</a>&nbsp;:- contains userid, itemid and value</p>
<p><a href="https://github.com/rawatsudhir/Samples/blob/master/users.txt">https://github.com/rawatsudhir/Samples/blob/master/users.txt</a>&nbsp; :- contains userid</p>
<p><span style="font-size: small;">Next step is to upload above sample files. Open PowerShell window and use below script to upload each file.</span></p>
<p style="padding-left: 30px;"><span style="font-size: small;">$subscriptionName = "&lt;subscription name&gt;"</span></p>
<p style="padding-left: 30px;"><span style="font-size: small;">$storageAccountName = "&lt;storage account&gt;"</span></p>
<p style="padding-left: 30px;"><span style="font-size: small;">$containerName = "&lt;container&gt;"</span></p>
<p style="padding-left: 30px;"><span style="font-size: small;">$fileName ="&lt;Location\FileName&gt;"</span></p>
<p style="padding-left: 30px;"><span style="font-size: small;"># Uploading file under the folder mahout</span></p>
<p style="padding-left: 30px;"><span style="font-size: small;">$blobName = "&lt;mahout/FileName&gt;"&nbsp;&nbsp;</span></p>
<p style="padding-left: 30px;"><span style="font-size: small;"># Get the storage account key</span></p>
<p style="padding-left: 30px;"><span style="font-size: small;">Select-AzureSubscription $subscriptionName</span></p>
<p style="padding-left: 30px;"><span style="font-size: small;">$storageaccountkey = get-azurestoragekey $storageAccountName | %{$_.Primary}</span></p>
<p style="padding-left: 30px;"><span style="font-size: small;">&nbsp;# Create the storage context object</span></p>
<p style="padding-left: 30px;"><span style="font-size: small;">$destContext = New-AzureStorageContext -StorageAccountName $storageAccountName -StorageAccountKey $storageaccountkey</span></p>
<p style="padding-left: 30px;"><span style="font-size: small;">&nbsp;# Copy the file from local workstation to the Blob container&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</span></p>
<p style="padding-left: 30px;"><span style="font-size: small;">Set-AzureStorageBlobContent -File $filename -Container $containerName -Blob $blobName -context $destContext</span></p>
<p><span style="font-size: small;">Copy below command (or get the script from </span><a href="https://github.com/rawatsudhir/Samples/blob/master/MahoutfromPowershell">here</a>).&nbsp;</p>
<p style="padding-left: 30px;"><span style="font-size: small;"># Cluster Name</span></p>
<p style="padding-left: 30px;"><span style="font-size: small;">&nbsp;$clusterName="&lt;ClusterName&gt;"</span></p>
<p style="padding-left: 30px;"><span style="font-size: small;">&nbsp;# Subscription name</span></p>
<p style="padding-left: 30px;"><span style="font-size: small;">&nbsp;$subscriptionName="&lt;SubscriptionName&gt;"</span></p>
<p style="padding-left: 30px;"><span style="font-size: small;">$containerName = "&lt;containerName&gt;"</span></p>
<p style="padding-left: 30px;"><span style="font-size: small;">$storageAccountName = "&lt;StorageAccountName&gt;"</span></p>
<p style="padding-left: 30px;"><span style="font-size: small;">Select-AzureSubscription -SubscriptionName $subscriptionName</span></p>
<p style="padding-left: 30px;"><span style="font-size: small;">&nbsp;#&nbsp;&nbsp; Assuming mahout-core-0.8-job.jar copied to Mahout folder.</span></p>
<p style="padding-left: 30px;"><span style="font-size: small;">&nbsp;$mahoutJob = New-AzureHDInsightMapReduceJobDefinition -JarFile "wasb://$containerName@$storageAccountName.blob.core.windows.net/mahout/mahout-core-0.8-job.jar" -ClassName "org.apache.mahout.cf.taste.hadoop.item.RecommenderJob"</span></p>
<p style="padding-left: 30px;"><span style="font-size: small;">&nbsp;# Adding the similarityclassname argument</span></p>
<p style="padding-left: 30px;"><span style="font-size: small;">&nbsp;$mahoutJob.Arguments.Add("-s")</span></p>
<p style="padding-left: 30px;"><span style="font-size: small;">&nbsp;# Adding the name of similarityclassname. However other similarityclassname can be used. </span></p>
<p style="padding-left: 30px;"><span style="font-size: small;">&nbsp;$mahoutJob.Arguments.Add("SIMILARITY_COOCCURRENCE")</span></p>
<p style="padding-left: 30px;"><span style="font-size: small;">&nbsp;# Adding the input file argument</span></p>
<p style="padding-left: 30px;"><span style="font-size: small;">&nbsp;$mahoutJob.Arguments.Add("-i")</span></p>
<p style="padding-left: 30px;"><span style="font-size: small;">&nbsp;# Adding location of the file. The file is stored on Windows Azure Storage Blob.</span></p>
<p style="text-align: justify; padding-left: 30px;"><span style="font-size: small;">&nbsp;$mahoutJob.Arguments.Add("wasb://$containerName@$storageAccountName.blob.core.windows.net/mahout/itemID.txt")</span></p>
<p style="padding-left: 30px;"><span style="font-size: small;">&nbsp;# Adding usersFile as an argument.</span></p>
<p style="padding-left: 30px;"><span style="font-size: small;">&nbsp;$mahoutJob.Arguments.Add("--usersFile")</span></p>
<p style="padding-left: 30px;"><span style="font-size: small;">&nbsp;# Adding userFile location.</span></p>
<p style="padding-left: 30px;"><span style="font-size: small;">&nbsp;$mahoutJob.Arguments.Add("wasb://$containerName@$storageAccountName.blob.core.windows.net/mahout/users.txt")</span></p>
<p style="padding-left: 30px;"><span style="font-size: small;">&nbsp;# Adding output as an argument.</span></p>
<p style="padding-left: 30px;"><span style="font-size: small;">&nbsp;$mahoutJob.Arguments.Add("--output")</span></p>
<p style="padding-left: 30px;"><span style="font-size: small;">&nbsp;# Adding output location. This will be the location where result will be generated.</span></p>
<p style="padding-left: 30px;"><span style="font-size: small;">&nbsp;$mahoutJob.Arguments.Add("wasb://$containerName@$storageAccountName.blob.core.windows.net/mahout/output")</span></p>
<p style="padding-left: 30px;"><span style="font-size: small;">&nbsp;# Starting job</span></p>
<p style="padding-left: 30px;"><span style="font-size: small;">&nbsp;$MahoutJobProcessing = Start-AzureHDInsightJob -Cluster $clusterName&nbsp; -JobDefinition $mahoutJob</span></p>
<p style="padding-left: 30px;"><span style="font-size: small;">&nbsp;# Waiting Job for completion</span></p>
<p style="padding-left: 30px;"><span style="font-size: small;">Wait-AzureHDInsightJob&nbsp; -Job $MahoutJobProcessing -WaitTimeoutInSeconds 3600</span></p>
<p style="padding-left: 30px;"><span style="font-size: small;">&nbsp;# Getting error if any</span></p>
<p style="padding-left: 30px;">&nbsp;Get-AzureHDInsightJobOutput -Cluster $clusterName&nbsp;-JobId $MahoutJobProcessing.JobId -StandardError</p>
<p>&nbsp;</p>
<p><span style="font-size: small;">Run above scripts. Waiting job for completion</span></p>
<p style="padding-left: 30px;"><a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78/7077.1.png"><img src="http://blogs.msdn.com/resized-image.ashx/__size/550x0/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78/7077.1.png" alt="" border="0" /></a></p>
<p style="padding-left: 30px;">&nbsp;</p>
<p><span style="font-size: small;">Once job done, the result will be found on target directory.</span></p>
<p style="padding-left: 30px;"><a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78/8640.2.png"><img src="http://blogs.msdn.com/resized-image.ashx/__size/550x0/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78/8640.2.png" alt="" border="0" /></a></p>
<p><span style="font-size: small;">It basically outputs userIDs with associated recommended itemIDs and their scores.</span></p>
<p style="padding-left: 30px;"><a href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78/5621.3.png"><img src="http://blogs.msdn.com/resized-image.ashx/__size/550x0/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78/5621.3.png" alt="" border="0" /></a></p>
<p style="padding-left: 30px;">&nbsp;</p>
<p>&nbsp;<span style="font-size: small;"><strong><span style="text-decoration: underline;">Clean up process</span></strong>:-</span></p>
<p><span style="font-size: small;">&nbsp;Require to do rdp and run hdfs fs -rmr -skipTrash /user/hdp/temp to delete temp folder. Use any familiar tool to delete&nbsp;temp folder. &nbsp;</span></p>
<p>&nbsp;</p>
<p><span style="font-size: small;"><strong><span style="text-decoration: underline;">Tips</span></strong>:-</span></p>
<ul>
<li><span style="font-size: small;">Make sure of/understand algorithm which you are going to use.</span></li>
<li><span style="font-size: small;">Look for the input type by the algorithm.</span></li>
<li><span style="font-size: small;">You may want to prepare your data based on the input required by the algorithm.</span></li>
</ul>
<p>&nbsp;</p>
<p><span style="font-size: small;">Thanks to Bill and Sunil to review this blog post.</span></p>
<p>&nbsp;</p>
<p><span style="font-size: small;">Happy Learning!</span><br /><br /><span style="font-size: small;">Sudhir Rawat</span></p>
<p>&nbsp;</p><div style="clear:both;"></div><img src="http://blogs.msdn.com/aggbug.aspx?PostID=10501444" width="1" height="1">sudhirbloghttp://blogs.msdn.com/sudhirrawat1_4000_live.com/ProfileUrlRedirect.ashxHow to pass Hadoop configuration values for a job on HDInsighthttp://blogs.msdn.com/b/bigdatasupport/archive/2014/02/13/how-to-pass-hadoop-configuration-values-for-a-job-via-hdinsight-powershell-and-net-sdk.aspx2014-02-13T18:42:00Z2014-02-13T18:42:00Z<p>I came across the question a few times recently from several customers&ndash; "how do we pass hadoop configurations at runtime for a mapreduce job or Hive Query via HDInsight PowerShell or .Net SDK?" I thought of sharing the answer here with others who may run into the same question. It is pretty common in Hadoop world to customize Hadoop configuration values that exist in the configuraion files like core-site.xml, mapred-site.xml, hive-site.xml etc., for a specific workload or specific job. Hadoop Configurations, in general, is a broad topic and there are many different ways (site-level, node-level, application level etc) of speciying Hadoop configurations and I don't plan to cover each of these. My focus is&nbsp;on&nbsp;run-time configuraions for a specific job or application. In order to specify Hadoop configuration values for a specific job or application, we typically use 'hadoop &ndash;conf' or 'hadoop &ndash;D' generic options, as shown in this <a href="http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/CommandsManual.html">apache documentation</a>&nbsp;- for a MapReduce JAR, we use 'Hadoop jar -conf' or 'Hadoop jar -D'. In this blog, we will keep our focus on 'Hadoop jar &ndash;D' option and see how we can&nbsp;achieve the same capability on HDInsight, specifically from the HDInsight PowerShell or .Net SDK.</p>
<p>Let's take a look at a few&nbsp;examples.</p>
<p><span style="font-size: medium;"><em><strong>'Hadoop jar &ndash;D' in Apache Hadoop: </strong></em></span></p>
<p>With apache hadoop, if I wanted to run the wordcount mapreduce example via the hadoop command line and have the output of my mapreduce job compressed, I could do something like this &ndash;</p>
<p><span style="color: #000000; background-color: #e7e6e6;"><span style="font-family: Segoe UI; font-size: small;">hadoop jar hadoop-examples.jar wordcount -Dmapred.output.compress=true -Dmapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec /input /wordcount/output</span></span></p>
<p><span style="font-size: medium;"><em><strong>'Hadoop jar &ndash;D' on Windows Azure HDInsight: </strong></em></span></p>
<p>For Windows Azure HDInsight (or Hortonworks Data Platform on Windows), the syntax is slightly different (using double quotes); the command would be something like this:</p>
<p><span style="color: #000000; background-color: #e7e6e6;"><span style="font-family: Segoe UI; font-size: small;">hadoop jar hadoop-examples.jar wordcount "-Dmapred.output.compress=true" "-Dmapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec" /input /wordcount/output</span></span></p>
<p>This is explained nicely in this <a href="http://social.msdn.microsoft.com/Forums/windowsazure/en-US/ae00d95b-53a6-4f92-b9c3-7197401dff5a/hadoop-d-command-line-parameters?forum=hdinsight">HDinsight Forum Thread</a></p>
<p>So far, so good. But the above syntax is valid only for the Hadoop command line and, for an HDInsight cluster, it requires the user to RDP to the HDInsight cluster head node. On HDInsight, we envision that most users would use HDInsight PowerShell or the .Net SDK from a remote client or application server to run Hadoop MapReduce, Hive, Pig or .Net Streaming jobs and make them part of a richer workflow.</p>
<p><span style="font-size: medium;"><em><strong>Passing Hadoop configuration values for a job via HDInsight PowerShell: </strong></em></span></p>
<p>The HDInsight Job definition cmdlets <a href="http://msdn.microsoft.com/en-us/library/windowsazure/dn527652.aspx">New-AzureHDInsightMapReduceJobDefinition</a>, <a href="http://msdn.microsoft.com/en-us/library/windowsazure/dn527653.aspx">New-AzureHDInsightHiveJobDefinition</a>, <a href="http://msdn.microsoft.com/en-us/library/windowsazure/dn527632.aspx">Invoke-AzureHDInsightHiveJob</a> and <a href="http://msdn.microsoft.com/en-us/library/windowsazure/dn527638.aspx">New-AzureHDInsightStreamingMapReduceJobDefinition</a> have a parameter called <strong>"-Defines"</strong> that we can use to pass Hadoop configuration values for a specific job at run-time.</p>
<p>Here is a Powershell script with example for mapreduce and Hive jobs &ndash;</p>
<script type="text/javascript" src="https://gist.github.com/AzimUddin/8980629.js"></script>
<p>As you may have noted, the parameter "-Defines" is a HashTable and you can specify multiple configuration values separated by semicolons. By the way, HDInsight PowerShell cmdlets are now integrated with Windows Azure PowerShell and can be installed from <a href="http://www.windowsazure.com/en-us/documentation/articles/install-configure-powershell/">here</a>.</p>
<p><span style="font-size: medium;"><em><strong>Passing Hadoop configuration values for a job via HDInsight .Net SDK: </strong></em></span></p>
<p>Similarly, the HDInsight .Net SDK classes <a href="http://msdn.microsoft.com/en-us/library/windowsazure/microsoft.hadoop.client.mapreducejobcreateparameters.aspx">MapReduceJobCreateParameters</a>, <a href="http://msdn.microsoft.com/en-us/library/windowsazure/microsoft.hadoop.client.hivejobcreateparameters.aspx">HiveJobCreateParameters</a> and <a href="http://msdn.microsoft.com/en-us/library/windowsazure/microsoft.hadoop.client.streamingmapreducejobcreateparameters.aspx">StreamingMapReduceJobCreateParameters</a>&nbsp;have a property called 'Defines' that we can use to pass Hadoop configuration values for a specific job. An example is shown in the code snippet below- I have&nbsp;included just the relevant&nbsp;code &ndash; for full example of using HDInsight .Net SDK to run hadoop jobs, please review our HDInsight documentation <a href="http://www.windowsazure.com/en-us/documentation/articles/hdinsight-submit-hadoop-jobs-programmatically/">here</a></p>
<script type="text/javascript" src="https://gist.github.com/AzimUddin/8981045.js"></script>
<p><em><span style="font-size: medium;"><strong>Passing Hadoop Configuration values via WebHcat REST API: </strong></span></em></p>
<p>HDInsight PowerShell or .Net SDK use WebHcat (aka Templeton) REST API to submit a job remotely and leverages the Templeton <a href="http://people.apache.org/~thejas/templeton_doc_latest/mapreducejar.html">define</a> parameter for passing Hadoop job configuration values &ndash; the 'define' parameter is available as part of the REST API for Mapreduce, Hive and Streaming jobs. If you were wondering why certain job types, such as PIG, didn't have the "-Defines" parameter in the HDInsight PowerShell cmdlet or .Net SDK, the reason is, WebHcat or Templeton (v1) REST API does not have 'define' parameter for that job type.</p>
<p>In general, we recommend that you use HDInsight PowerShell or the .Net SDK to submit remote jobs via WebHcat/Templeton, because the SDK makes it easier for you and handles the underlying REST API details. But if you can't use HDInsight PowerShell or the .Net SDK for some reason and need to use the REST API directly, here is an example of passing hadoop configuration values via the WebHcat REST API, using Windows PowerShell. You can also use any utility, such as <a href="http://curl.haxx.se/">cURL</a>, that can invoke a REST API.</p>
<script type="text/javascript" src="https://gist.github.com/AzimUddin/8981835.js"></script>
<p><em><strong><span style="font-size: medium;">Persistent Hadoop configurations via HDInsight cluster customization: </span></strong></em></p>
<p>I know our focus in this blog has been on run-time Hadoop configurations, but I do want to call out that if there are certain hadoop configurations you want to change from the default values for the HDInsight cluster and preserve throughout the cluster lifetime, you can do this via cluster customization with HDInsight PowerShell or the .Net SDK, as shown <a href="http://hadoopsdk.codeplex.com/wikipage?title=PowerShell%20Cmdlets%20for%20Cluster%20Management&amp;referringTitle=Cluster%20Management">here</a>. This approach works well for a short-lived cluster or elastic services where you create a customized cluster with specific configurations, run your workload and then remove the cluster.</p>
<p>Also, as explained in Dan's <a href="http://blogs.msdn.com/b/bigdatasupport/archive/2013/11/01/the-hdinsight-support-team-is-open-for-business.aspx">blog</a>, outside of cluster customization during the install time, any manual modification of the Hadoop configuration files or any other file won't be preserved when the Azure VM nodes get updated.</p>
<p>That's it for today. I hope you find&nbsp;it&nbsp;helpful!</p>
<p>@Azim (MSFT)</p><div style="clear:both;"></div><img src="http://blogs.msdn.com/aggbug.aspx?PostID=10500015" width="1" height="1">Azim Uddinhttp://blogs.msdn.com/azim91_4000_hotmail.com/ProfileUrlRedirect.ashxStructured vs Semi-structured Datahttp://blogs.msdn.com/b/bigdatasupport/archive/2014/01/21/structured-verse-semi_2D00_structured-data.aspx2014-01-21T21:08:00Z2014-01-21T21:08:00Z<p>My name is Bill Carroll and I am a member of the Microsoft HDInsight support team. The majority of my working career has been spent on SQL Server, a relational database. Little did I think about it all these years, but relational databases are structured data. When we create a table we define the structure with a data type and whether we allow null data or not. When we insert or update a record for the table, checks are done to determine if the data adheres to the predefined structure. Below is a simple example where the check fails and terminates the statement. SQL Server raises the 8152 message.</p>
<p>&nbsp;</p>
<pre class="scroll"><code class="mysql"> create table table1 (c1 int not null, </code><code class="mysql">c2 varchar(10) null , c3 decimal(18,4) null) <br /> go <br /> <br /> insert into table1 (c1, c2, c3) </code><code class="mysql">values (1, 'A very long string ', 14.91918181) <br /> go <br /> <br /> <span style="color: #ff0000;">Msg 8152, Level 16, State 14, Line 1 </span><br /><span style="color: #ff0000;"> String or binary data would be truncated. </span><br /><span style="color: #ff0000;"> The statement has been terminated.</span> <br /></code></pre>
<p>&nbsp;</p>
<p>This last month I worked an issue with a customer on HDInsight that drove home the difference between the structured data of the relational database world and the semi-structured data of the big data world. I also found a new respect for the basic WordCount example and the wisdom of those who chose it as a starting point for mapreduce. We had a java mapreduce job that appeared to hang, or in reality progressed very slowly, in the reducer. The reducers eventually completed after 255 hours.</p>
<p>The mapper function parsed a tab separated line into separate java String variables. One variable was an ID, other variables were concatenated together to form a part of a JSON document. The mapper wrote out the ID as the key and the partial JSON document as the value. In the reducer function, for each key, the parts of the JSON document were concatenated to form one JSON document. The document was then written out by the reducer.</p>
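<p>To make the data flow concrete, here is a minimal sketch of the shape of the job being described (this is not the customer's actual code; the column positions and JSON layout are made up for illustration). Note that the reducer deliberately uses String concatenation, which is exactly the pattern that became the bottleneck:</p>
<pre class="scroll"><code class="java">import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class JsonAssembly
{
    public static class Map extends Mapper&lt;Object, Text, Text, Text&gt;
    {
        public void map(Object key, Text value, Context context) throws IOException, InterruptedException
        {
            // Split the tab-separated record; the column positions here are hypothetical.
            String[] cols = value.toString().split("\t");
            String id = cols[0];
            // Concatenate a couple of the remaining columns into a partial JSON fragment.
            String partialJson = "\"" + cols[1] + "\":\"" + cols[2] + "\"";
            context.write(new Text(id), new Text(partialJson));
        }
    }

    public static class Reduce extends Reducer&lt;Text, Text, Text, Text&gt;
    {
        public void reduce(Text key, Iterable&lt;Text&gt; values, Context context) throws IOException, InterruptedException
        {
            // Assemble one JSON document per key.
            String s = "{";
            boolean first = true;
            for (Text v : values)
            {
                if (!first) { s += ","; }
                // Each += rebuilds the whole String; harmless for a few fragments,
                // catastrophic for the millions of fragments behind one bad key.
                s += v.toString();
                first = false;
            }
            s += "}";
            context.write(key, new Text(s));
        }
    }
}
</code></pre>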
<p>The investigation revealed that in the mapper the parsing of the ID had some "invalid" data, but because we are dealing with semi-structured data, no nice error message is raised like in a relational database. One of the IDs was actually a space, with over 13 million partial JSON elements. What this meant was that, in the reducer for the invalid key, a single JSON document was being created with 13 million concatenations to a String. There were several invalid IDs, but this was the worst offender. Using the WordCount example as a model, in the reducer I counted the elements and summed their character lengths for each key, and wrote that out without doing the string concatenation. Once I saw the invalid keys and their counts, the problem became obvious. The techniques learned in the WordCount example are useful!</p>
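<p>A WordCount-style diagnostic reducer along those lines might look like the sketch below (again, an illustration rather than the exact code used); it emits per-key statistics instead of building the document, so a key with millions of fragments stands out immediately:</p>
<pre class="scroll"><code class="java">import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class DiagnosticReduce extends Reducer&lt;Text, Text, Text, Text&gt;
{
    public void reduce(Text key, Iterable&lt;Text&gt; values, Context context) throws IOException, InterruptedException
    {
        long count = 0;      // number of partial JSON fragments for this key
        long totalChars = 0; // combined size of those fragments (bytes of UTF-8 text)
        for (Text v : values)
        {
            count++;
            totalChars += v.getLength();
        }
        // No concatenation - just per-key statistics, so oversized or "invalid" keys are obvious.
        context.write(key, new Text("count=" + count + " chars=" + totalChars));
    }
}
</code></pre>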
<p>In the next part of the investigation I used two tools that are installed with java, jstack and jmap. With jstack I was able to get the stack while the reducer was "hung". I was able to line up the line of code on line=126 as <strong>s+="&lt;partial JSON documents&gt;";.</strong> I also used jmap heap dump to confirm that it was processing the invalid key. Below is part of the stack output from jstack.</p>
<p style="padding-left: 30px;">&nbsp;Thread 1: (state = IN_JAVA)<br />&nbsp; - <strong>java.util.Arrays.copyOfRange</strong>(char[], int, int) @bci=40, line=2694 (Compiled frame; information may be imprecise)<br />&nbsp; - java.lang.String.&lt;init&gt;(char[], int, int) @bci=60, line=203 (Compiled frame)<br />&nbsp; - java.lang.StringBuilder.toString() @bci=13, line=405 (Compiled frame)<br />&nbsp; -<strong> BigDataSupport.MyClass$Reduce.reduce</strong>(org.apache.hadoop.io.Text, java.lang.Iterable, org.apache.hadoop.mapreduce.Reducer$Context) @bci=70,<strong> line=126</strong> (Compiled frame)<br />&nbsp; - BigDataSupport.MyClass$Reduce.reduce(java.lang.Object, java.lang.Iterable, org.apache.hadoop.mapreduce.Reducer$Context) @bci=7, line=103 (Compiled frame)<br />&nbsp; - org.apache.hadoop.mapreduce.Reducer.run(org.apache.hadoop.mapreduce.Reducer$Context)</p>
<p>&nbsp;</p>
<p>More information about jstack can be found <a title="jstack" href="http://docs.oracle.com/javase/7/docs/technotes/tools/share/jstack.html" target="_blank">here</a>&nbsp;and jmap <a title="jmap" href="http://docs.oracle.com/javase/7/docs/technotes/tools/share/jmap.html" target="_blank">here</a>. To review the output of jmap you can use <a title="jvisualvm" href="http://docs.oracle.com/javase/7/docs/technotes/guides/visualvm/index.html" target="_blank">jvisualvm</a>.</p>
<p>Once we corrected the "invalid" key problem, I started to investigate the performance of Java string concatenation and did some simple tests using mapreduce. I was surprised. I took one of the largest keys and placed a stopwatch in the reducer to time it. I wrote out the elapsed time in MS, the length of the final string in char and the number of partial string elements that it was concatenating. I then tested using java String concatenation, StringBuilder's append() method, and StringBuffer's append() method. Java String concatenation is significantly slower. The results are below.</p>
<p style="padding-left: 30px;"><strong>String s = ""; s+= arg1.next().toString();</strong><br />Key = c1976429 reduce(): elapsed time = 61732 length = 1543427 ctr = 42556</p>
<p style="padding-left: 30px;"><br /><strong>StringBuilder s = new StringBuilder(); s.append(arg1.next().toString());</strong><br />Key = c1976429 reduce(): elapsed time = 226 length = 1543427 ctr = 42556</p>
<p style="padding-left: 30px;"><br /><strong>StringBuilder s = new StringBuilder(16000000); s.append(arg1.next().toString());</strong> <br />Key = c1976429 reduce(): elapsed time = 225 length = 1543427 ctr = 42556</p>
<p style="padding-left: 30px;"><br /><strong>StringBuffer s = new StringBuffer(); s.append(arg1.next().toString());</strong><br />Key = c1976429 reduce(): elapsed time = 227 length = 1543427 ctr = 42556</p>
<p>&nbsp;</p>
<p>As I continue to learn about java, mapreduce, and HDInsight\Hadoop some of the simplest lessons are the best. Here are just a few of things I learned.</p>
<ul>
<li>In dealing with semi structured data it is always a good idea to validate the keys and the size of the values you are writing out to the reducer. Modify the WordCount example to explore your data and get to know it.</li>
<li>Java has several utilities like JPS, JStack, JMap, JVisualVM and others that can help in your investigations.</li>
<li>If you are going to concatenate Strings use the append() method of StringBuilder or StringBuffer, it provides better performance. Performance matters when dealing with large amounts of data.</li>
<li>
<div>The world of big data and HDInsight\Hadoop is a world of semi-structured data. This provides us with a greater ability to explore insights in data, but we also need to be aware of its pit falls.</div>
<p>&nbsp;</p>
</li>
</ul>
<p>&nbsp;</p><div style="clear:both;"></div><img src="http://blogs.msdn.com/aggbug.aspx?PostID=10491520" width="1" height="1">carrollwphttp://blogs.msdn.com/carrollwp_4000_hotmail.com/ProfileUrlRedirect.ashxHow to manually compile and create your own jar file to execute on HDInsighthttp://blogs.msdn.com/b/bigdatasupport/archive/2014/01/21/how-to-manually-compile-and-create-your-own-jar-file-to-execute-on-hdinsight.aspx2014-01-21T19:30:00Z2014-01-21T19:30:00Z<p>Hi, my name is Bill Carroll and I am a member of the Microsoft HDInsight support team. At the heart of Hadoop is the MapReduce paradigm. Knowing how to compile your java code and create your own jar file is a useful skill, especially for those coming from the C++ or &nbsp;.Net programming world. So our goal for the post is to manually compile and create our own jar file and execute it on HDInsight</p>
<p>The data set I have chosen is <a href="http://en.wikipedia.org/wiki/Exchange-traded_fund">exchange-traded fund (ETF)</a> data from 2007 to 2014. From Wikipedia, an ETF is an investment fund traded on stock exchanges, much like stocks. An ETF holds assets such as stocks, commodities, or bonds, and trades close to its net asset value over the course of the trading day. Most ETFs track an index, such as a stock index or bond index. The data is common to other investments, so a much larger population could be used. Think of analyzing data from every stock, mutual fund, and ETF on exchanges all over the world at 5 second intervals. That's big data!</p>
<p>&nbsp;</p>
<h1>Data</h1>
<p>Let's familiarize ourselves with the data. You can download the ETF-WATCHLIST.CSV file and java code from&nbsp;<a title="ETF-WATCHLIST" href="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-components-postattachments/00-10-49-14-91/ETF_2D00_WATCHLIST.zip">here</a>. The data is an ANSI comma separated file with 11 columns. You can see it has both text and numeric data and is currently sorted by DATE and SYMBOL.</p>
<div>
<table style="border-collapse: collapse;" border="0"><colgroup><col style="width: 125px;" /><col style="width: 96px;" /><col style="width: 402px;" /></colgroup>
<tbody valign="top">
<tr>
<td style="padding-left: 7px; padding-right: 7px; border: solid 0.5pt;">
<p>SYMBOL</p>
</td>
<td style="padding-left: 7px; padding-right: 7px; border-top: solid 0.5pt; border-left: none; border-bottom: solid 0.5pt; border-right: solid 0.5pt;">
<p>TEXT</p>
</td>
<td style="padding-left: 7px; padding-right: 7px; border-top: solid 0.5pt; border-left: none; border-bottom: solid 0.5pt; border-right: solid 0.5pt;">
<p>ETF exchange symbol</p>
</td>
</tr>
<tr>
<td style="padding-left: 7px; padding-right: 7px; border-top: none; border-left: solid 0.5pt; border-bottom: solid 0.5pt; border-right: solid 0.5pt;">
<p>DATE</p>
</td>
<td style="padding-left: 7px; padding-right: 7px; border-top: none; border-left: none; border-bottom: solid 0.5pt; border-right: solid 0.5pt;">
<p>TEXT</p>
</td>
<td style="padding-left: 7px; padding-right: 7px; border-top: none; border-left: none; border-bottom: solid 0.5pt; border-right: solid 0.5pt;">
<p>Date in the file is YYYY-MM-DD format.</p>
</td>
</tr>
<tr>
<td style="padding-left: 7px; padding-right: 7px; border-top: none; border-left: solid 0.5pt; border-bottom: solid 0.5pt; border-right: solid 0.5pt;">
<p>OPEN</p>
</td>
<td style="padding-left: 7px; padding-right: 7px; border-top: none; border-left: none; border-bottom: solid 0.5pt; border-right: solid 0.5pt;">
<p>NUMERIC</p>
</td>
<td style="padding-left: 7px; padding-right: 7px; border-top: none; border-left: none; border-bottom: solid 0.5pt; border-right: solid 0.5pt;">
<p>Opening price</p>
</td>
</tr>
<tr>
<td style="padding-left: 7px; padding-right: 7px; border-top: none; border-left: solid 0.5pt; border-bottom: solid 0.5pt; border-right: solid 0.5pt;">
<p>HIGH</p>
</td>
<td style="padding-left: 7px; padding-right: 7px; border-top: none; border-left: none; border-bottom: solid 0.5pt; border-right: solid 0.5pt;">
<p>NUMERIC</p>
</td>
<td style="padding-left: 7px; padding-right: 7px; border-top: none; border-left: none; border-bottom: solid 0.5pt; border-right: solid 0.5pt;">
<p>High price for the day</p>
</td>
</tr>
<tr>
<td style="padding-left: 7px; padding-right: 7px; border-top: none; border-left: solid 0.5pt; border-bottom: solid 0.5pt; border-right: solid 0.5pt;">
<p>LOW</p>
</td>
<td style="padding-left: 7px; padding-right: 7px; border-top: none; border-left: none; border-bottom: solid 0.5pt; border-right: solid 0.5pt;">
<p>NUMERIC</p>
</td>
<td style="padding-left: 7px; padding-right: 7px; border-top: none; border-left: none; border-bottom: solid 0.5pt; border-right: solid 0.5pt;">
<p>Low price for the day</p>
</td>
</tr>
<tr>
<td style="padding-left: 7px; padding-right: 7px; border-top: none; border-left: solid 0.5pt; border-bottom: solid 0.5pt; border-right: solid 0.5pt;">
<p>CLOSE</p>
</td>
<td style="padding-left: 7px; padding-right: 7px; border-top: none; border-left: none; border-bottom: solid 0.5pt; border-right: solid 0.5pt;">
<p>NUMERIC</p>
</td>
<td style="padding-left: 7px; padding-right: 7px; border-top: none; border-left: none; border-bottom: solid 0.5pt; border-right: solid 0.5pt;">
<p>Closing price for the day</p>
</td>
</tr>
<tr>
<td style="padding-left: 7px; padding-right: 7px; border-top: none; border-left: solid 0.5pt; border-bottom: solid 0.5pt; border-right: solid 0.5pt;">
<p>VOLUME</p>
</td>
<td style="padding-left: 7px; padding-right: 7px; border-top: none; border-left: none; border-bottom: solid 0.5pt; border-right: solid 0.5pt;">
<p>NUMERIC</p>
</td>
<td style="padding-left: 7px; padding-right: 7px; border-top: none; border-left: none; border-bottom: solid 0.5pt; border-right: solid 0.5pt;">
<p>Number of shares traded for the day</p>
</td>
</tr>
<tr>
<td style="padding-left: 7px; padding-right: 7px; border-top: none; border-left: solid 0.5pt; border-bottom: solid 0.5pt; border-right: solid 0.5pt;">
<p>WEEK</p>
</td>
<td style="padding-left: 7px; padding-right: 7px; border-top: none; border-left: none; border-bottom: solid 0.5pt; border-right: solid 0.5pt;">
<p>NUMERIC</p>
</td>
<td style="padding-left: 7px; padding-right: 7px; border-top: none; border-left: none; border-bottom: solid 0.5pt; border-right: solid 0.5pt;">
<p>The week of the year</p>
</td>
</tr>
<tr>
<td style="padding-left: 7px; padding-right: 7px; border-top: none; border-left: solid 0.5pt; border-bottom: solid 0.5pt; border-right: solid 0.5pt;">
<p>WEEKDAY</p>
</td>
<td style="padding-left: 7px; padding-right: 7px; border-top: none; border-left: none; border-bottom: solid 0.5pt; border-right: solid 0.5pt;">
<p>NUMERIC</p>
</td>
<td style="padding-left: 7px; padding-right: 7px; border-top: none; border-left: none; border-bottom: solid 0.5pt; border-right: solid 0.5pt;">
<p>The day of the week (Sunday = 1)</p>
</td>
</tr>
<tr>
<td style="padding-left: 7px; padding-right: 7px; border-top: none; border-left: solid 0.5pt; border-bottom: solid 0.5pt; border-right: solid 0.5pt;">
<p>DESCRIPTION</p>
</td>
<td style="padding-left: 7px; padding-right: 7px; border-top: none; border-left: none; border-bottom: solid 0.5pt; border-right: solid 0.5pt;">
<p>TEXT</p>
</td>
<td style="padding-left: 7px; padding-right: 7px; border-top: none; border-left: none; border-bottom: solid 0.5pt; border-right: solid 0.5pt;">
<p>Description of the ETF</p>
</td>
</tr>
<tr>
<td style="padding-left: 7px; padding-right: 7px; border-top: none; border-left: solid 0.5pt; border-bottom: solid 0.5pt; border-right: solid 0.5pt;">
<p>CATEGORY</p>
</td>
<td style="padding-left: 7px; padding-right: 7px; border-top: none; border-left: none; border-bottom: solid 0.5pt; border-right: solid 0.5pt;">
<p>TEXT</p>
</td>
<td style="padding-left: 7px; padding-right: 7px; border-top: none; border-left: none; border-bottom: solid 0.5pt; border-right: solid 0.5pt;">
<p>Category of the ETF</p>
</td>
</tr>
</tbody>
</table>
</div>
<p>&nbsp;</p>
<p><img src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78-metablogapi/8176.012114_5F00_1929_5F00_Howtomanual1.png" alt="" /></p>
<p>&nbsp;</p>
<h1>Code</h1>
<p>Let's get to today's goal: manually compile and create our own jar file and execute it on HDInsight. I am going to assume you have already reviewed the classic <a href="http://hadoop.apache.org/docs/r1.0.4/mapred_tutorial.html">Apache WordCount Example</a>. Below is the Java code; its purpose is to write out the ETF symbol + description along with a count of trade days.</p>
<p>&nbsp;</p>
<pre class="scroll"><code class="java"> package BigDataSupport;<br /> <br /> import java.io.IOException;<br /> import java.util.*;<br /> <br /> import org.apache.hadoop.fs.Path;<br /> import org.apache.hadoop.io.Text;<br /> import org.apache.hadoop.mapred.JobConf;<br /> import org.apache.hadoop.mapreduce.Job;<br /> import org.apache.hadoop.mapreduce.Mapper;<br /> import org.apache.hadoop.mapreduce.Reducer;<br /> import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;<br /> import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;<br /> import org.apache.hadoop.util.GenericOptionsParser;<br /> <br /> public class ETFRun1 <br /> {<br /> public static class Map extends Mapper&lt;Object, Text, Text, Text&gt; <br /> {<br /> public void map(Object key, Text value, Context context) throws IOException, InterruptedException <br /> {<br /> String[] stringArray = value.toString().split(",");<br /> String line = value.toString().trim(); <br /> <br /> String symbol = stringArray[0].toString().trim(); <br /> String date = stringArray[1].toString().trim(); <br /> double open = Double.parseDouble(stringArray[2].toString().trim()); <br /> double high = Double.parseDouble(stringArray[3].toString().trim());<br /> double low = Double.parseDouble(stringArray[4].toString().trim());<br /> double close = Double.parseDouble(stringArray[5].toString().trim());<br /> int volume = Integer.parseInt(stringArray[6].toString().trim());<br /> int week = Integer.parseInt(stringArray[7].toString().trim());<br /> int weekday = Integer.parseInt(stringArray[8].toString().trim());<br /> String description = stringArray[9].toString().trim();<br /> String category = stringArray[10].toString().trim();<br /> context.write(new Text(symbol + " " + description), new Text(line));<br /> }<br /> }<br /> <br /> public static class Reduce extends Reducer&lt;Text, Text, Text, Text&gt; <br /> {<br /> public void reduce(Text key, Iterable&lt;Text&gt; values, Context context) throws IOException, InterruptedException <br /> {<br /> Integer ctr = 0; <br /> Iterator&lt;Text&gt; it = values.iterator();<br /> <br /> while(it.hasNext()) <br /> {<br /> ctr++;<br /> it.next(); <br /> }<br /> context.write(key, new Text(ctr.toString()));<br /> }<br /> }<br /> <br /> public static void main(String[] args) throws Exception <br /> {<br /> JobConf conf = new JobConf();<br /> String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();<br /> if (otherArgs.length != 2) <br /> {<br /> System.err.println("Usage: &lt;in&gt; &lt;out&gt;");<br /> System.exit(2);<br /> }<br /> <br /> Job job = new Job(conf, "ETFRun1");<br /> job.setJarByClass(ETFRun1.class);<br /> job.setMapperClass(Map.class);<br /> job.setReducerClass(Reduce.class);<br /> <br /> job.setMapOutputKeyClass(Text.class);<br /> job.setMapOutputValueClass(Text.class);<br /> job.setOutputKeyClass(Text.class);<br /> job.setOutputValueClass(Text.class);<br /> <br /> FileInputFormat.addInputPath(job, new Path(otherArgs[0]));<br /> FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));<br /> System.exit(job.waitForCompletion(true) ? 0 : 1);<br /> }<br /> }<br /> </code></pre>
<p>&nbsp;</p>
<p>There are some things I want to point out in the code above. We are using our own custom package name called BigDataSupport. This will become important when we create and execute our jar file on HDInsight. We have imported several Java and Apache Hadoop classes. On HDInsight, Java is installed at c:\apps\dist\java\bin, and that folder contains the javac.exe and jar.exe tools; we will use javac.exe to compile our Java code and jar.exe to create a jar file. We also need two Apache Hadoop jar files to compile successfully. These are located on HDInsight at C:\apps\dist\hadoop-1.2.0.1.3.2.0-05\hadoop-core-1.2.0.1.3.2.0-05.jar and C:\apps\dist\hadoop-1.2.0.1.3.2.0-05\lib\commons-cli-1.2.jar. Because we intend to compile and build multiple times, let's create a simple build-ETFRun.cmd file that we can reuse.</p>
<p>You are probably aware that the MapReduce paradigm works on key-value pairs. Our map function takes three parameters: a <strong>key</strong> of Object data type, a <strong>value</strong> of Text data type, and a <strong>context</strong> of Context data type. For the mapper, the key is generated automatically by MapReduce and is the offset into the file. The value is an individual line of our ETF-WATCHLIST.CSV file. The context is the object we use to write, or emit, a key-value pair to the next phase of MapReduce. We use the context.write function to write out the key-value pair, and the data types it writes out must match what is defined in our Mapper signature. Notice that Mapper is declared as Mapper&lt;Object, Text, Text, Text&gt;. The first two parameters, Object and Text, are what comes in, and the third and fourth parameters, Text and Text, are what goes out. This means that our context.write function must write out a key of data type Text and a value of data type Text. If we want to write out other data types in the context.write function, we must change our Mapper signature to match.</p>
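<p>For example, if we wanted the mapper to emit a numeric value instead of the whole line, the generic parameters and the context.write call would have to change together. The class below is only a hypothetical sketch, not part of ETFRun1, and it assumes one extra import (org.apache.hadoop.io.IntWritable):</p>
<pre class="scroll"><code class="java">// Hypothetical variant, not used in ETFRun1: the mapper emits an IntWritable value.
// Assumes: import org.apache.hadoop.io.IntWritable;
public static class MapVariant extends Mapper&lt;Object, Text, Text, IntWritable&gt;
{
    private static final IntWritable ONE = new IntWritable(1);

    public void map(Object key, Text value, Context context) throws IOException, InterruptedException
    {
        String[] stringArray = value.toString().split(",");
        String symbol = stringArray[0].trim();
        String description = stringArray[9].trim();
        // The key (Text) and value (IntWritable) types must match the Mapper signature above.
        context.write(new Text(symbol + " " + description), ONE);
    }
}</code></pre>
<p>With a variant like that, the driver would also need job.setMapOutputValueClass(IntWritable.class) so that all three places agree.</p>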
<p>In ETFRun1's map function we split the line on commas and place each of the 11 columns into a String array, then assign each column to a variable. We then write out our key-value pair in the context.write statement. Because we have to write out a Text data type, we create a new Text and pass it symbol + description as the key. For the value we also create a new Text and simply pass the entire line. For the first line in ETF-WATCHLIST.CSV, the map function should write out:</p>
<ul>
<li><strong>Key</strong> = AGG Aggregate Bond Ishares</li>
<li><strong>Value</strong> = AGG,2007-01-03,100.0000,100.0700,99.7900,99.9100,471100,1,4,Aggregate Bond Ishares,BOND AND FIXED INCOME</li>
</ul>
<p>The Reducer is declared as Reducer&lt;Text, Text, Text, Text&gt;. The first two parameters, Text and Text, are the key and value coming from the map function; they must match what the map function's context.write statement writes out. The third and fourth parameters, Text and Text, are the data types the reduce function will write out. Our reduce function takes three parameters: a <strong>key</strong> of Text data type, <strong>values</strong> as an Iterable of Text, and a <strong>context</strong> of Context data type. In between the map and reduce phases, MapReduce has a sort/shuffle phase, which automatically sorts the keys and places every value for a given key into the Iterable. In our reduce function we can then iterate over the values for each key, and we get one execution of the reduce function for each unique key the map function writes out. The key-value pair coming into the reduce function should look like:</p>
<ul>
<li>&nbsp;<strong>Key</strong> = AGG Aggregate Bond Ishares</li>
<li><strong>Iterable[0] = </strong>AGG,2007-01-03,100.0000,100.0700,99.7900,99.9100,471100,1,4,Aggregate Bond Ishares,BOND AND FIXED INCOME</li>
<li><strong>Iterable[1] = </strong>AGG,2007-01-04,100.0300,100.1900,99.9400,100.1200,1745500,1,5,Aggregate Bond Ishares,BOND AND FIXED INCOME</li>
<li><strong>Iterable[2] </strong>= AGG,2007-01-05,100.0000,100.0900,99.9000,100.0500,318200,1,6,Aggregate Bond Ishares,BOND AND FIXED INCOME</li>
</ul>
<p>In the reduce function, much like the WordCount example, we use the iterator to count the number of trade days for each ETF. We then use the context.write function to write out the key and the counter. The ctr variable is of type Integer, and we need to convert it to a String in order to create a new Text object for context.write. The key-value pair coming out of the reduce function should look like:</p>
<ul>
<li><strong>Key</strong> = AGG Aggregate Bond Ishares</li>
<li><strong>Value </strong>= 1765</li>
</ul>
<p>The main function creates a JobConf instance, parses the command-line arguments, and then configures the Job: it sets the mapper and reducer class names and the data types of the output keys and values. These settings need to match our Map and Reduce function signatures.</p>
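<p>For reference, on a newer Hadoop 2.x (HDInsight 3.x) cluster the same driver is usually written with the Job.getInstance factory method instead of the Job constructor. The sketch below is only a hedged illustration of that alternative; it is not the ETFRun1 code above, and it assumes the same imports plus org.apache.hadoop.conf.Configuration:</p>
<pre class="scroll"><code class="java">// Hypothetical driver sketch for Hadoop 2.x; not the ETFRun1 main() shown above.
// Assumes the same imports as ETFRun1 plus: import org.apache.hadoop.conf.Configuration;
Configuration conf = new Configuration();
String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();

Job job = Job.getInstance(conf, "ETFRun1");
job.setJarByClass(ETFRun1.class);
job.setMapperClass(Map.class);
job.setReducerClass(Reduce.class);

// These four calls must agree with the Mapper&lt;Object, Text, Text, Text&gt; and
// Reducer&lt;Text, Text, Text, Text&gt; signatures. If the classes emitted
// IntWritable values instead, the value classes here would become IntWritable.class.
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(Text.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);

FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);</code></pre>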
<p>&nbsp;</p>
<h1>Compile and Jar</h1>
<p>Now that we have our ETFRun1.java file created, let's compile it and create a jar file. Enable RDP on your HDInsight cluster through the portal and remote desktop to your head node. Temporarily create a c:\MyFiles folder and copy the ETFRun1.java file into it, along with our data file, ETF-WATCHLIST.CSV. Next create a build-ETFRun.cmd file with the following commands.</p>
<p><strong>build-ETFRUN.CMD </strong></p>
<pre class="scroll"><code class="java"> cls<br /> <br /> REM ===My Build File===<br /> <br /> cd c:\MyFiles<br /> <br /> del c:\MyFiles\ETFRun1.jar<br /> <br /> rd /s /q c:\MyFiles\MyJavaClasses<br /> <br /> rd /s /q c:\MyFiles\BigDataSupport<br /> <br /> mkdir c:\MyFiles\MyJavaClasses<br /> <br /> mkdir c:\MyFiles\MyJavaClasses\BigDataSupport<br /> <br /> c:\apps\dist\java\bin\javac.exe -classpath "C:\apps\dist\hadoop-1.2.0.1.3.2.0-05\hadoop-core-1.2.0.1.3.2.0-05.jar;C:\apps\dist\hadoop-1.2.0.1.3.2.0-05\lib\commons-cli-1.2.jar" -d c:\MyFiles ETFRun1.java <br /> <br /> copy c:\MyFiles\BigDataSupport\*.class c:\MyFiles\MyJavaClasses\BigDataSupport\<br /> <br /> c:\apps\dist\java\bin\jar.exe -cvf C:\MyFiles\ETFRun1.jar -C MyJavaClasses/ . <br /> <br /> c:\apps\dist\java\bin\jar.exe -tfv C:\MyFiles\ETFRun1.jar<br /> <br /> <br /> rem hadoop jar c:\MyFiles\ETFRun1.jar BigDataSupport.ETFRun1 /example/data/ETF-WATCHLIST.CSV /example/etf <br /></code></pre>
<p style="background: white;">The javac.exe takes a <strong>&ndash;classpath</strong> parameters which contains the path to the two apache Hadoop jar files we need in order to compile. A <strong>&ndash;d</strong> parameter which specifies the folder to place the .class files in. Javac.exe also takes the name of your .java file to compile into .class file. If you have multiple .java files just add them separated by a space. Because our code uses a package name the .class files are actually placed in folders under the c:\MyFiles folder to match the package name. This turns out to be c:\MyFiles\BigDataSupport. Below you can see the .class files generated by javac.exe</p>
<p style="background: white;"><img src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78-metablogapi/0576.012114_5F00_1929_5F00_Howtomanual2.png" alt="" /></p>
<p style="background: white;">Next we use jar.exe to create a jar file that Hadoop can run. We have created a c:\MyFiles\MyJavaClass\BigDataSupport folder and copied our .class files from c:\MyFiles\BigDataSupport into c:\MyFiles\MyJavaClass\BigDataSupport before creating the jar file. If we don't get the paths correct when we execute the jar file it will raise a ClassNotFoundException. Take another quick look at the build-ETFRUN.cmd file.</p>
<p style="background: white;">Note: If you have third party jar files that you want to include in your jar file you can modify the build script to create a MyJavaClass\Lib folder and copy your third party jar into the lib folder before you jar it.</p>
<p style="background: white;">&nbsp;Below is an example of the ClassNotFoundException if you don't get the path correct.</p>
<p style="background: white; padding-left: 30px;">Exception in thread "main" java.lang.ClassNotFoundException: BigDataSupport.ETFRun1<br />at java.net.URLClassLoader$1.run(URLClassLoader.java:366)<br />at java.net.URLClassLoader$1.run(URLClassLoader.java:355)<br />at java.security.AccessController.doPrivileged(Native Method)<br />at java.net.URLClassLoader.findClass(URLClassLoader.java:354)<br />at java.lang.ClassLoader.loadClass(ClassLoader.java:423)<br />at java.lang.ClassLoader.loadClass(ClassLoader.java:356)<br />at java.lang.Class.forName0(Native Method)<br />at java.lang.Class.forName(Class.java:264)<br />at org.apache.hadoop.util.RunJar.main(RunJar.java:153)</p>
<p style="background: white;">&nbsp;</p>
<p style="background: white;">The jar.exe takes &ndash;cvf parameters. Parameters c, creates a new archive, v generates verbose output and f specifies the archive file name. Jar.exe also takes the &ndash;C MyJavaClass /. parameter to tell it where the .class files are to jar up. You can also use the &ndash;tvf to display the contents of a jar file after it is created.</p>
<p style="background: white;">Execute the build-ETFRUN.CMD to compile and create your ETFRun1.jar file. If it compiles and builds successfully your output should look like:</p>
<p style="background: white;"><img src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78-metablogapi/6038.012114_5F00_1929_5F00_Howtomanual3.png" alt="" /></p>
<p style="background: white;">If you have syntax errors in your code they will be displayed in the output. Review them and modify your code and try again. Notice that the BigDataSupport path in the list of .class files in the jar file.</p>
<p style="background: white;">&nbsp;</p>
<p style="background: white;">Open a Hadoop command prompt and copy your data file or the ETF-WATCHLIST.CSV to HDFS. You can also copy the ETFRun1.jar file into HDFS, but for now I will leave it in c:\MyFiles folder.</p>
<p style="background: white;"><img src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78-metablogapi/7522.012114_5F00_1929_5F00_Howtomanual4.png" alt="" /></p>
<p style="background: white;">&nbsp;</p>
<p style="background: white;">To run your jar file, execute hadoop jar c:\MyFiles\ETFRun.jar BigDataSupport.ETFRun1 /example/data/ETF-WATCHLIST.CSV /example/etf. Hadoop jar takes four parameters. The path to your jar file. The path to the class to start executing from within your jar file. Data input and data output paths. You should see output similar to below when the job runs.</p>
<p style="background: white;"><img src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78-metablogapi/3240.012114_5F00_1929_5F00_Howtomanual5.png" alt="" /></p>
<p style="background: white;">&nbsp;</p>
<p style="background: white;">You should see the job output in the /example/etf/part-r-00000 file. You can copy it to your local drive to examine.</p>
<p style="background: white;"><img src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78-metablogapi/0118.012114_5F00_1929_5F00_Howtomanual6.png" alt="" /></p>
<p style="background: white;">&nbsp;</p>
<p style="background: white;">If we open up our part-r-00000 file in wordpad.exe we see that we have the symbol + description (key) and the count of trade days (value). We have 194 unique ETF's. In the output of the job look at the "Reduce input groups". We see the ETF symbol along with its description. Each ETF has a different number of trade days. They range from 688 to 1765 trade days. Some ETF's started trading after January of 2007.</p>
<p style="background: white;"><img src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78-metablogapi/6813.012114_5F00_1929_5F00_Howtomanual7.png" alt="" /></p>
<p style="background: white;">&nbsp;</p>
<h2>Conclusion</h2>
<p>This should give you a good foundation for taking your own data and your own Java code, compiling it, creating a jar file, and executing it as a MapReduce job on HDInsight.</p>
<p>&nbsp;</p><div style="clear:both;"></div><img src="http://blogs.msdn.com/aggbug.aspx?PostID=10491491" width="1" height="1">carrollwphttp://blogs.msdn.com/carrollwp_4000_hotmail.com/ProfileUrlRedirect.ashxHow to add custom Hive UDFs to HDInsighthttp://blogs.msdn.com/b/bigdatasupport/archive/2014/01/14/how-to-add-custom-hive-udfs-to-hdinsight.aspx2014-01-14T21:23:00Z2014-01-14T21:23:00Z<p style="text-align: justify;">I recently had a need to add a UDF to Hive on HDInsight. I thought that it would be good to share that experience on a blog post. Hive provides a library of <a title="Apache Hive Reference on Built-in Functions" href="https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF" target="_blank">built-in functions</a> to achieve the most common needs. The cool thing is that it also provides the framework to create your own UDF. I had a recent need to extract a string from a column if the string matches a Java RegEx pattern. For those of us coming from a SQL Server background, we can extract a pattern from a string using a combination of the <a title="MSDN Documentation on PATINDEX" href="http://msdn.microsoft.com/en-us/library/fooc0dfb17f-2230-4e36-98da-a9b630bab656.aspx" target="_blank">PATINDEX</a> and <a title="MSDN Documentation on SUBSTRING" href="http://msdn.microsoft.com/en-us/library/ms187748.aspx" target="_blank">SUBSTRING</a> functions. The idea was to provide that same functionality with Hive.</p>
<p>When an HDInsight cluster is provisioned (created), you can pass in customizations to suit specific needs. Behind the scenes an HDInsight cluster is built on a series of virtual machine images. Those images sometimes move to a new host or are reimaged for various reasons &ndash; customizations passed in when you provision the cluster will survive this process and stay available in the HDInsight cluster. Therefore, we recommend passing in custom JARs, such as additions to the Hive function library, at the time you provision your HDInsight cluster.</p>
<p style="text-align: justify;">You can find details on setting up the system for using PowerShell with HDInsight <a title="Reference on Provisioning Cluster using PowerShell" href="http://www.windowsazure.com/en-us/documentation/articles/hdinsight-get-started/" target="_blank">here</a>. Let me walk you through the process that I used to deploy the custom UDF on Azure HDInsight.</p>
<h3>Compile your customized code into JAR</h3>
<p>The first thing we need to do here is to encapsulate the logic for the functionality within a class. For this example, I had a piece of Java code that I could re-use, and I decided to call my class <span style="color: black;">FindPattern</span>, defined in a source file called FindPattern.java. I created a Java project called HiveUDF, then created a package called HiveUDF and started adding my custom functionality there; the FindPattern class is one of the classes in that package.</p>
<p style="text-align: justify;">In order to implement a Hive UDF, you would need to extend the class "UDF" available in org.apache.hadoop.hive.ql.exec.UDF. I am showing the code snippet <a title="Java code for Hive UDF" href="https://gist.github.com/dharkum/8454979" target="_blank">here </a>&ndash;&nbsp;<br />
<script type="text/javascript" src="https://gist.github.com/dharkum/8454979.js"></script>
</p>
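<p>The gist linked above has the actual source. As a rough, hedged sketch only, a UDF of this shape typically extends the UDF base class and exposes an evaluate method; the class below is an illustrative approximation (the method signature and null handling are my assumptions, not necessarily the author's exact implementation):</p>
<pre class="scroll"><code class="java">package HiveUDF;

// Illustrative sketch of a pattern-extraction UDF; not the exact gist code.
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

public class FindPattern extends UDF
{
    private final Text result = new Text();

    // Returns the first substring of 'column' that matches the Java regex 'pattern',
    // or NULL when there is no match or the input is NULL.
    public Text evaluate(Text column, String pattern)
    {
        if (column == null || pattern == null)
        {
            return null;
        }
        Matcher m = Pattern.compile(pattern).matcher(column.toString());
        if (m.find())
        {
            result.set(m.group());
            return result;
        }
        return null;
    }
}</code></pre>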
<p style="text-align: left;">The overall layout of the project on Eclipse IDE is shown below &ndash; notice how I have referenced the jar that I downloaded from HDInsight. This will help with resolution of the package org.apache.hadoop.hive.ql.exec.UDF.</p>
<p>&nbsp;<img src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78-metablogapi/8304.011414_5F00_2123_5F00_Howtoaddcus1.png" alt="" /></p>
<p style="text-align: justify;">Notice how I have the classes separated by the encapsulated functionality within a package: HiveUDF. I have the JAR file - hive-exec-0.11.0.1.3.1.0-06.jar - downloaded from my HDInsight installation. This JAR file is available on the default container of the primary storage account of any existing Azure HDInsight cluster and can be reused for this project as long as the versions of the HDInsight clusters match. You can find a lot more information on provisioning the HDInsight cluster on my previous blog <a title="Blog Post on Hive Architecture" href="http://blogs.msdn.com/b/bigdatasupport/archive/2013/11/11/get-started-with-hive-on-hdinsight.aspx" target="_blank">post</a>.</p>
<p style="text-align: justify;">I have taken the quick and easy approach of compiling this Java code with javac and packaging the class file with jar, as shown below. You can find a bit more detailed write up on this <a title="Reference to Denny's blog post" href="http://dennyglee.com/2013/05/09/compile-and-add-hive-udf-via-add-jar-in-hdinsight-on-azure/" target="_blank">here</a>.</p>
<p><img src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78-metablogapi/2451.011414_5F00_2123_5F00_Howtoaddcus2.png" alt="" /></p>
<p>&nbsp;Now, it is time for us to compile our code. I am using javac for the sake of this example &ndash;</p>
<p style="text-align: justify;">Javac -cp &lt;Path to your JARs separated by ;&gt; -d &lt;Output location&gt; &lt;Source file&gt;</p>
<pre class="scroll"><code class="js">c:\MyDevelopment\HiveCustomization\Classes\target\classes&gt; javac -cp "c:\MyDevelopment\HiveCustomization\Source\hive-exec-0.11.0.1.3.1.0-06.jar" -d "C:\MyDevelopment\HiveCustomization\Classes\target\classes" "c:\MyDevelopment\HiveCustomization\Source\HiveUDF\src\HiveUDF\FindPattern.java" </code></pre>
<p class="scroll"><code class="js"></code>As a result of the above compilation, you will notice the FindPattern.class file generated within target\classes. Let us go ahead and package this into a JAR file, to upload into the HDInsight cluster!</p>
<p style="text-align: justify;">Switch to: C:\MyDevelopment\HiveCustomization\Classes\target\classes and execute the following jar command -</p>
<p style="text-align: justify;"><span style="background-color: #ffffff;"><strong>Jar cvf HiveUDF.jar .</strong></span></p>
<p style="text-align: justify;">As you can see from the screenshot below, a manifest file and the class files are packaged into a jar file. There are two other class files that are part of my HiveUDF project that exist for other UDFs.</p>
<p style="text-align: justify;"><img src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78-metablogapi/4431.011414_5F00_2123_5F00_Howtoaddcus4.png" alt="" /></p>
<p>A Java project like this can hold a class for each piece of Hive UDF functionality, be built as shown above, and be pre-provisioned into the cluster when the cluster is created. Once the rhythm of development is established, the build process and the provisioning and creation of the cluster can be automated with ease. A copy of HiveUDF.jar developed using the above steps is also attached to this post.</p>
<p>I pre-provisioned the Azure BLOB storage layout as described on my previous <a title="Blog Post on Hive Architecture" href="http://blogs.msdn.com/b/bigdatasupport/archive/2013/11/11/get-started-with-hive-on-hdinsight.aspx" target="_blank">post</a> and uploaded this custom JAR to the "myhivelibs" storage account &ndash; use the same location in the $HiveLibStorageAccount<br />and $HiveLibStorageContainer variables in PowerShell. You can use a tool like Visual Studio Server Explorer to explore Windows Azure Storage resources as described <a title="MSDN Reference to VS Explorer for Azure" href="http://msdn.microsoft.com/en-us/library/windowsazure/ff683677.aspx" target="_blank">here</a>.</p>
<p>Now, it is time to create a customized HDInsight Cluster, adding this JAR to the Hive Configuration! I will demonstrate it with PowerShell &ndash; detailed reference <a title="Reference on Powershell SDK" href="http://hadoopsdk.codeplex.com/wikipage?title=PowerShell%20Cmdlets%20for%20Cluster%20Management" target="_blank">here</a>.</p>
<p style="text-align: justify;">You can populate the parameters with your Subscription information and it will prompt for Credentials twice &ndash; the first time for creating the admin credentials for the HDInsight cluster that is being provisioned and the second time for the credentials for the MetaStore database. The PowerShell script assumes that the PrimaryStorageAccount, PrimaryStorageContainer, HiveLibStorageAccount, HiveLibStorageContainer, and HiveMetaStoreDB, indicated by respective variables on the PowerShell script below, have been pre-provisioned prior to execution of the script. The credentials for the MetaStore database will need to be the same as the one that is used to connect to the SQL Azure server. The authentication to SQL Azure server can also be independently validated through the Azure management portal.</p>
<p style="text-align: justify;">The code snippet that I used to create the customized HDInsight Cluster is <a title="CreateCustomHDICluster.ps1" href="https://gist.github.com/dharkum/8455459" target="_blank">here</a>&nbsp;-
<script type="text/javascript" src="https://gist.github.com/dharkum/8455459.js"></script>
</p>
<p style="text-align: justify;">If the script is successful, it will show the details of the cluster that it created for us &ndash; snippet below &ndash;</p>
<p style="background: #012456;"><span style="color: whitesmoke; font-family: Lucida Console; font-size: 9pt;">Name : MyHDICluster <br />HttpUserName : admin <br />HttpPassword : Mypass123! <br />Version : 2.1.3.0.432823 <br />VersionStatus : Compatible <br />ConnectionUrl : https://MyHDICluster.azurehdinsight.net <br />State : Running <br />CreateDate : 1/14/2014 5:21:51 AM <br />UserName : admin <br />Location : West US <br />ClusterSizeInNodes : 4 <br />DefaultStorageAccount : myprimarystorage.blob.core.windows.net <br />SubscriptionId : Your Subscription ID <br />StorageAccounts : {} </span></p>
<p style="text-align: justify;">What we have done above is created a nice customized HDInsight cluster, which deploys our additional library &ndash; HiveUDF.jar - for us!&nbsp;This specific configuration change will survive any background node re-images and so it is important to keep the content on the HiveLibStorageAccount intact so that any re-images that happen in the background can still fetch the library files from this location. Now that we have the JAR deployed, we don't need to do any ADD JAR and can directly create a temporary function and start using our UDF!</p>
<p style="text-align: justify;">With our pattern matching UDF, we can extract a given pattern from the message. Let us see an example where we would like to extract embedded phone numbers from messages &ndash; just for the sake of demonstration J For the sake of this demonstration, I am creating a Hive table called WebFeeds and loading it with some data as shown below.</p>
<p><span style="text-decoration: underline;">Test Data:</span></p>
<p>1, "Call me at 123-1234567"<br />2, "Talk to you later"<br />3, "You have reached the voicemail for 111-2223344. Please leave a message"<br />4, "Have a good day"</p>
<p style="text-align: justify;">You can create a file holding the above text data using any text editor like Notepad and save that file as MyUDFTestData.txt. Next using a storage explorer like Visual Studio Server Explorer, you can attach to the storage account "myprimarystorage" and upload this data file on the container "install" as "webfeeds/MyUDFTestData.txt"</p>
<p style="text-align: justify;">Then, let us create a file called MyUDFQuery.hql and save the below Hive code in there and upload that to the storage account "myprimarystorage" on the container "install". The code for MyUDFQuery.hql can be found <a title="Hive QL Code for UDF" href="https://gist.github.com/dharkum/8455163" target="_blank">here</a> -</p>
<script type="text/javascript" src="https://gist.github.com/dharkum/8455163.js"></script>
<p>I am just choosing to do a quick test here, as you can see above. The Invoke-Hive cmdlet can take an HQL file as a parameter, but the file needs to reside on the BLOB storage account. Notice how the HQL file that was uploaded to BLOB storage as shown above is referenced in the code snippet <a title="PS Script for testing the UDF" href="https://gist.github.com/dharkum/8455537" target="_blank">below</a> -</p>
<script type="text/javascript" src="https://gist.github.com/dharkum/8455537.js"></script>
<p>If everything has been set correctly and a phone pattern is embedded in the message, you will see the UserID and the extracted phone number from the message as the output!</p>
<p>This is a simple example of how you can customize your HDInsight Hadoop cluster. Adding your own Hive UDFs lets you simplify and standardize Hive queries, and it is easy to do when you provision your HDInsight cluster. Happy customizing!</p>
<p style="text-align: justify;">@Dharshana (MSFT)</p>
<p>Thanks to <a href="http://blogs.msdn.com/cindygross/">Cindy Gross</a>&nbsp;| <a href="https://twitter.com/sqlcindy">@SQLCindy</a>, Rick_H, Azim, Farooq and Jason for reviewing this!</p><div style="clear:both;"></div><img src="http://blogs.msdn.com/aggbug.aspx?PostID=10489724" width="1" height="1">Dharshana_Bharadwajhttp://blogs.msdn.com/dharshana_5F00_bharadwaj_4000_hotmail.com/ProfileUrlRedirect.ashxMount Azure Blob Storage as Local Drivehttp://blogs.msdn.com/b/bigdatasupport/archive/2014/01/09/mount-azure-blob-storage-as-local-drive.aspx2014-01-09T15:47:02Z2014-01-09T15:47:02Z<p><span style="font-family:Arial; font-size:10pt"><strong>Gregory Suarez – 01/09/2014
</strong></span></p><p>
</p><p><span style="font-family:Arial; font-size:10pt">I was recently working with a colleague of mine who submitted a MapReduce job via an HDInsight Powershell script and he needed a quick way to visually inspect the last several lines of the output after it had completed. He was looking for an easy and flexible way to do this considering the results were stored in Azure blob storage.
</span></p><p>
</p><p><span style="font-family:Arial; font-size:10pt">There are a couple of approaches one could take here. First, you could connect to the head node via remote desktop and execute the Hadoop <em>tail</em> command to retrieve the last several rows of the file. The following shows an example that could be used from the rdp session to send the results to the console for visual inspection.
</span></p><p>
</p><p style="margin-left: 36pt"><span style="font-family:Arial; font-size:10pt">hadoop fs -tail /&lt;location to filename&gt;
</span></p><p style="margin-left: 36pt">
</p><p><span style="font-family:Arial; font-size:10pt">Other standard ways all involve <em>copying</em> the results from the Azure container down to the local file system. This could be performed using external utilities such as AZCopy or could be accomplished programmatically via <a href="http://www.windowsazure.com/en-us/manage/services/hdinsight/submit-hadoop-jobs-programmatically/">Azure Powershell script</a> . Once the results are retrieved from the remote system you could use whatever tools are at your disposal to interrogate the file.
</span></p><p>
</p><p><span style="font-family:Arial; font-size:10pt">I recommended a third option which was to mount the blob storage as a local drive to the Windows 8.1 machine that submitted the job.
</span></p><p>
</p><p style="margin-left: 72pt"><img src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78-metablogapi/4645.010914_5F00_1556_5F00_MountAzureB1.png" alt=""/><span style="font-family:Arial; font-size:10pt">
</span></p><p><span style="font-family:Arial; font-size:10pt">Above, the drive letter Z: maps to &lt;<span style="color:black"><em>storageaccountname&gt;.blob.core.windows.net</em><span style="color:#666666">.</span>
</span>When you do this, you gain the flexibility of drag and drop for uploading and downloading files to blob storage, as well as in-place editing and direct random access to the files contained in the remote location. This indirection allows one to execute local tools such as <em>tail</em> and <em>grep</em> against the remote blob location without having to explicitly copy the files to the local file system. When configured like this, you can <em>think</em> of Azure Blob storage as a USB drive that was just plugged into your system. My colleague simply opened a Windows 8.1 command prompt after the job was submitted and issued a tail command on the output file contained in blob storage to retrieve the desired results.
</span></p><p><span style="font-family:Arial; font-size:10pt">There are a few tools that offer Azure Blob Storage drive functionality, but one in particular is <a href="http://gladinet.com/p/download_starter_V4.htm">Gladinet Drive Access</a>. I have had much success with this tool, which is why I recommend it.
</span></p><p><span style="font-family:Arial; font-size:10pt">Configuration requires just a few steps.
</span></p><p>
</p><ol><li><div><span style="font-family:Arial; font-size:10pt">Open Gladinet Desktop Management Console and select Mount Virtual Directory
</span></div><p>
</p><p style="margin-left: 36pt"><img src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78-metablogapi/0028.010914_5F00_1556_5F00_MountAzureB2.png" alt=""/><span style="font-family:Arial; font-size:10pt">
</span></p></li><li><div><span style="font-family:Arial; font-size:10pt">Next, from the <strong>Mounting Virtual Directory</strong> dialog box, select '<em>Windows Azure Blob Storage</em>' from the drop-down list of options. Then specify a Virtual Directory name (any name) and press Next to continue
</span></div><p style="margin-left: 36pt">
</p><p style="margin-left: 36pt"><img src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78-metablogapi/7701.010914_5F00_1556_5F00_MountAzureB3.png" alt=""/><span style="font-family:Arial; font-size:10pt">
</span></p></li><li><div><span style="font-size:10pt"><span style="font-family:Arial">From the Login Information tab, specify the Azure storage name used for your HDInsight container. For example, the URI to access the default file system on my HDInsight cluster is </span><span style="font-family:Wingdings">à</span><span style="font-family:Arial"> wasb://suaro22@<strong>bigsuaro6</strong>.blob.core.windows.net . In this case, I would enter bigsuaro6 as the storage name into the Account Name dialog box below. Primary Access Key - go to your Azure storage accounts, select your desired storage account. Click Manage Access Keys in the bottom toolbar. Copy the Primary Access Key code and paste it into the Primary Access Key textbox.
</span></span></div><p>
</p><p style="margin-left: 36pt"><img src="http://blogs.msdn.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-01-61-78-metablogapi/1440.010914_5F00_1556_5F00_MountAzureB4.png" alt=""/><span style="font-family:Arial; font-size:10pt">
</span></p></li></ol><p><span style="font-family:Arial; font-size:10pt">Once the drive is configured, you can use it like any other drive on your system.
</span></p><p><span style="font-family:Arial; font-size:10pt">While we do not endorse or support this tool, it may come in handy for you to use on daily basis.
</span></p><p>
</p><p><span style="font-size:12pt">
</span> </p><div style="clear:both;"></div><img src="http://blogs.msdn.com/aggbug.aspx?PostID=10488464" width="1" height="1">Gregory Suarez - MSFThttp://blogs.msdn.com/cts_2D00_gregorys_4000_live.com/ProfileUrlRedirect.ashx