Search results matching tags 'Windows Azure' and 'Data Science'
http://sqlblog.com/search/SearchResults.aspx?o=DateDescending&tag=Windows+Azure,Data+Science&orTags=0
Data Science and the Cloud
http://sqlblog.com/blogs/buck_woody/archive/2013/08/26/data-science-and-the-cloud.aspx
Mon, 26 Aug 2013 15:56:42 GMT, BuckWoody
<p><a href="http://officeimg.vo.msecnd.net/en-us/images/MH900433830.jpg"><img style="border:0px currentColor;margin-top:1px;margin-bottom:1px;float:right;max-width:550px;" src="http://officeimg.vo.msecnd.net/en-us/images/MH900433830.jpg" alt="" width="259" height="193" /></a></p>
<p>More than perhaps any other computing discipline, Data Science lends itself to Cloud Computing in general, and to Windows Azure in particular. That's a big claim, but before I offer some evidence, I need to explain what I mean by "Data Science". I've written before on Data Science (<a href="http://sqlblog.com/b/buckwoody/archive/2012/10/16/is-data-science-science.aspx" target="_blank">http://blogs.msdn.com/b/buckwoody/archive/2012/10/16/is-data-science-science.aspx</a> and <a href="https://www.simple-talk.com/cloud/data-science/data-science-laboratory-system---keyvalue-pair-systems/" target="_blank">https://www.simple-talk.com/cloud/data-science/data-science-laboratory-system---keyvalue-pair-systems/</a>), but since it's an evolving field, here's what I've observed as the areas that a Data Scientist focuses on:</p>
<ul>
<li>Research - Standard researching techniques such as domain knowledge, data sources and impact analysis</li>
<li>Statistics - Probability and descriptive statistics</li>
<li>Programming - At least one functional or object-oriented language, often Python, F#, LISP, Haskell, Java or JavaScript</li>
<li>Sources of data -&nbsp;Internal organizational data as well as external sources such as weather, economics, spatial, geo-political sources and more</li>
<li>Data movement - Traditional Extract, Transform and Load (ETL), along with ingress or referencing external data sources</li>
<li>Complex Event Processing (CEP) - Analyzing or triggering computing as data moves through a source</li>
<li>Data storage - Storage systems including distributed storage and remote storage</li>
<li>Data processing - Both single-node and distributed processing systems, including RDBMS and NoSQL (Hadoop, Key/Value Pair, Document Store, Graph databases, etc.)</li>
<li>Machine learning - Data-instructive programming as well as Artificial Intelligence and Natural Language Processing</li>
<li>Decision analysis - Interpreting processed data to identify patterns and make predictions, including data mining</li>
<li>Business Intelligence - Designing data exploration and visualizations, assessing business and organizational impact, and communicating to stakeholders how the data and visualization tools are used</li>
</ul>
<p>There are of course other aspects of data science, but I believe this list covers the majority of skills I've seen in individuals with the Data Scientist title. And it is normally an individual, or at least a very limited group of people. As you examine the list above, you can see this person requires a fairly extensive technical background, and the domain knowledge area in particular takes a lot of time to build. That isn't to say a very bright person couldn't ramp up on these areas, just that having all of that in your portfolio takes time.</p>
<p>Given that these are the skillsets, why is cloud computing well suited to assisting in the data science function?</p>
<p>It's obvious that a researcher needs good Internet skills, beyond simply referencing a Wikipedia article - although that's certainly a good thing to include from time to time. While searching isn't specific to Windows Azure, there are platform components that allow the programming function to call out to the web for data access. Windows Azure includes a platform that supports languages ranging from Python to F#, JavaScript (including Node.js), Java and more.</p>
<p>Cloud computing allows the data scientist to access data stored in Windows Azure (Blobs, Tables, Queues, and RDBMSs as a service such as SQL Server and MySQL) as well as IaaS systems that can run full RDBMS systems such as SQL Server, Oracle, PostgreSQL and others. In addition, the Windows Azure Marketplace contains "Data as a Service", which offers free and fee-based data to include in a single application.</p>
<p>The Windows Azure Service Bus allows you to architect a CEP system, and SQL Server provides the StreamInsight feature for CEP as well. These can communicate with on-premises systems, Windows Azure IaaS and PaaS, and other data sources.</p>
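<p>To make the CEP idea concrete, here is a minimal sketch in plain Python of analyzing data as it moves through a source: it keeps a trailing time window over a stream of readings and triggers whenever the window average crosses a threshold. It does not use the Service Bus or StreamInsight APIs; the function name <code>detect_spikes</code>, the event shape and the parameters are assumptions made up for illustration.</p>

```python
from collections import deque
from typing import Deque, Iterable, Iterator, Tuple

# Hypothetical event shape: (timestamp in seconds, sensor reading).
Event = Tuple[float, float]


def detect_spikes(
    events: Iterable[Event], window_seconds: float, threshold: float
) -> Iterator[Tuple[float, float]]:
    """Yield (timestamp, window_average) each time the average reading
    over the trailing time window exceeds the threshold.

    Assumes events arrive in timestamp order and window_seconds is positive.
    """
    window: Deque[Event] = deque()
    total = 0.0
    for timestamp, reading in events:
        window.append((timestamp, reading))
        total += reading
        # Evict events that have aged out of the trailing window.
        while window and timestamp - window[0][0] >= window_seconds:
            _, old_reading = window.popleft()
            total -= old_reading
        average = total / len(window)
        if average > threshold:
            yield timestamp, average


if __name__ == "__main__":
    stream = [(0, 10), (1, 12), (2, 50), (3, 55), (10, 5)]
    for when, avg in detect_spikes(stream, window_seconds=3, threshold=30):
        print(f"spike at t={when}: window average {avg}")
```

<p>Because the function is a generator, it reacts to each event as it arrives rather than waiting for the whole data set, which is the essential difference between CEP and batch processing.</p>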
<p>For data storage and computing, Windows Azure allows everything from traditional RDBMSs as described to any NoSQL in IaaS, on both Windows and Linux operating systems. Statistical packages such as "R" are also supported. The elasticity allows the data scientist to spin up huge clusters, such as Hadoop or other NoSQL offerings, perform some analysis, and then stop the process when complete, saving cost, and bypassing the internal IT systems (which may carry its own dangers, to be sure). Windows Azure also offers the High Performance Computing (HPC) version of Windows Server on Windows Azure, for large-scale massively parallel data processing, in constant and "burst" modes.</p>
<p>In addition, Windows Azure has many services, such as the HDInsight Service (Hadoop on demand) and other analysis offerings that don't even require the data scientist to stand up and manage a Virtual Machine in IaaS. For visualization, Microsoft has included the ability to use Excel with the HDInsight Service, and of course that works with all Microsoft Business Intelligence functions, and there are several other data visualization tools such as Power View. You can enter the tools you have in the Microsoft stack into this tool (<a href="http://www.microsoft.com/en-us/bi/Products/bi-solution-builder.aspx">http://www.microsoft.com/en-us/bi/Products/bi-solution-builder.aspx</a>) to learn more about the visualization options you have. The data scientist can also build visualizations in web pages, on iPhone, Android or Windows mobile devices, or in full client-code installations.</p>
<p>Because of the need for elasticity, multiple operating systems, and changing landscapes for data and processing, data science is well served by cloud computing - and by Windows Azure in particular because of the services and features offered, not only on Microsoft Windows but also on Open Source platforms.</p>
Using Hadoop (HDInsight) with Microsoft - Two (OK, Three) Options
http://sqlblog.com/blogs/buck_woody/archive/2012/12/04/using-hadooop-hdinsight-with-microsoft-two-ok-three-options.aspx
Tue, 04 Dec 2012 15:28:23 GMT, BuckWoody
<p>Microsoft has many tools for &ldquo;Big Data&rdquo;. In fact, you need many tools &ndash; there&rsquo;s no product called &ldquo;Big Data Solution&rdquo; in a shrink-wrapped box &ndash; if you find one, you probably shouldn&rsquo;t buy it. It&rsquo;s tempting to want a single tool that handles everything in a problem domain, but with large, complex data, that isn&rsquo;t a reality. You&rsquo;ll mix and match several systems, open and closed source, to solve a given problem.</p>
<p>But there are tools that help with handling data at large, complex scales. Normally the best way to do this is to break up the data into parts, and then put the calculation engines for that chunk of data right on the node where the data is stored. These systems are in a family called &ldquo;Distributed File and Compute&rdquo;. Microsoft has a couple of these, including the <a href="http://www.microsoft.com/hpc/en/us/default.aspx">High Performance Computing edition of Windows Server</a>. Recently we partnered with <a href="http://hortonworks.com/">Hortonworks</a> to bring the <a href="http://hadoop.apache.org/">Apache Foundation&rsquo;s release of Hadoop</a> to Windows. And as it turns out, there are actually two (technically three) ways you can use it.</p>
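<p>The &ldquo;break up the data and compute where it lives&rdquo; idea can be sketched in a few lines of plain Python. This is only a single-process simulation, not Hadoop code: each list stands in for the chunk stored on one node, and the function names (<code>split_into_chunks</code>, <code>count_words_on_node</code>, <code>merge_partial_counts</code>) are hypothetical names chosen for this illustration.</p>

```python
from collections import Counter
from typing import Dict, List


def split_into_chunks(records: List[str], node_count: int) -> List[List[str]]:
    """Deal the records round-robin into one chunk per simulated node."""
    return [records[i::node_count] for i in range(node_count)]


def count_words_on_node(chunk: List[str]) -> Counter:
    """The per-node step: process only the local chunk and return a
    small partial result instead of shipping the raw data elsewhere."""
    counts: Counter = Counter()
    for line in chunk:
        counts.update(line.lower().split())
    return counts


def merge_partial_counts(partials: List[Counter]) -> Dict[str, int]:
    """The combining step: add up the partial results from every node."""
    total: Counter = Counter()
    for partial in partials:
        total.update(partial)
    return dict(total)


def distributed_word_count(records: List[str], node_count: int) -> Dict[str, int]:
    chunks = split_into_chunks(records, node_count)
    # On a real cluster each call below would run on the node holding the chunk.
    partials = [count_words_on_node(chunk) for chunk in chunks]
    return merge_partial_counts(partials)


if __name__ == "__main__":
    lines = ["big data", "Big tools", "data data"]
    print(distributed_word_count(lines, node_count=2))
```

<p>The point of the design is that only the small partial counts travel between nodes; the bulk data stays where it was stored.</p>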
<p style="padding-left:30px;"><span style="color:#993300;"><em>(There&rsquo;s a more detailed set of information here: <a href="http://www.microsoft.com/sqlserver/en/us/solutions-technologies/business-intelligence/big-data.aspx"><span style="color:#993300;">http://www.microsoft.com/sqlserver/en/us/solutions-technologies/business-intelligence/big-data.aspx</span></a>; I&rsquo;ll cover the options at a general level below)</em></span></p>
<h1>First Option: Windows Azure HDInsight Service</h1>
<p>&nbsp;Your first option is that you can simply log on to a Hadoop control node and begin to run Pig or Hive statements against data that you have stored in Windows Azure. There&rsquo;s nothing to set up (although you can configure things where needed), and you can send the commands, get the output of the job(s), and stop using the service when you are done &ndash; and repeat the process later if you wish.</p>
<p>(There are also connectors to run jobs from Microsoft Excel, but that&rsquo;s another post)</p>
<p>&nbsp;<a href="http://sqlblog.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-00-79-79/0572.option_2D00_1.png"><img src="http://sqlblog.com/resized-image.ashx/__size/550x0/__key/communityserver-blogs-components-weblogfiles/00-00-00-79-79/0572.option_2D00_1.png" alt="" width="367" height="212" border="0" /></a></p>
<p>This option is useful when you have a periodic burst of work for a Hadoop workload, or when the data is already being collected in Windows Azure storage anyway. That data might come from a web application, the logs from a web application, <a href="http://en.wikipedia.org/wiki/Telemetry">telemetry</a> (remote sensor input), and other modes of constant collection.</p>
<p>You can read more about this option here: &nbsp;<a href="http://sqlblog.com/b/windowsazure/archive/2012/10/24/getting-started-with-windows-azure-hdinsight-service.aspx">http://blogs.msdn.com/b/windowsazure/archive/2012/10/24/getting-started-with-windows-azure-hdinsight-service.aspx</a></p>
<h1>Second Option: Microsoft HDInsight Server</h1>
<p>Your second option is to use the Hadoop Distribution for on-premises Windows called Microsoft HDInsight Server. You set up the Name Node(s), Job Tracker(s), and Data Node(s), among other components, and you have control over the entire ecosystem.</p>
<p><a href="http://sqlblog.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-00-79-79/7041.option_2D00_2.png"><img src="http://sqlblog.com/resized-image.ashx/__size/550x0/__key/communityserver-blogs-components-weblogfiles/00-00-00-79-79/7041.option_2D00_2.png" alt="" width="152" height="179" border="0" /></a>&nbsp;</p>
<p>This option is useful if you want to have complete control over the system, leave it running all the time, or you have a huge quantity of data that you have to bulk-load constantly &ndash; something that isn&rsquo;t going to be practical with a network transfer or disk-mailing scheme.</p>
<p>You can read more about this option here: <a href="http://www.microsoft.com/sqlserver/en/us/solutions-technologies/business-intelligence/big-data.aspx">http://www.microsoft.com/sqlserver/en/us/solutions-technologies/business-intelligence/big-data.aspx</a></p>
<h1>Third Option (unsupported): Installation on Windows Azure Virtual Machines</h1>
<p>&nbsp;Although unsupported, you could simply use a Windows Azure Virtual Machine (we support both Windows and Linux servers) and install Hadoop yourself &ndash; it&rsquo;s open-source, so there&rsquo;s nothing preventing you from doing that.</p>
<p>&nbsp;<a href="http://sqlblog.com/cfs-file.ashx/__key/communityserver-blogs-components-weblogfiles/00-00-00-79-79/0121.option_2D00_3.png"><img src="http://sqlblog.com/resized-image.ashx/__size/550x0/__key/communityserver-blogs-components-weblogfiles/00-00-00-79-79/0121.option_2D00_3.png" alt="" width="326" height="188" border="0" /></a></p>
<p>Aside from being unsupported, there are other issues you&rsquo;ll run into with this approach &ndash; primarily involving performance and the amount of configuration you&rsquo;ll need to do to access the data nodes properly. But for a single-node installation (where all components run on one system) used for learning, demos, training and the like, this isn&rsquo;t a bad option.</p>
<p>Did I mention that&rsquo;s unsupported? :) </p>
<p>You can learn more about Windows Azure Virtual Machines here: <a href="http://www.windowsazure.com/en-us/home/scenarios/virtual-machines/">http://www.windowsazure.com/en-us/home/scenarios/virtual-machines/</a></p>
<p>And more about Hadoop and the installation/configuration (on Linux) here: <a href="http://en.wikipedia.org/wiki/Apache_Hadoop">http://en.wikipedia.org/wiki/Apache_Hadoop</a></p>
<p>And more about the HDInsight installation here: <a href="http://www.microsoft.com/web/gallery/install.aspx?appid=HDINSIGHT-PREVIEW">http://www.microsoft.com/web/gallery/install.aspx?appid=HDINSIGHT-PREVIEW</a></p>
<h1>Choosing the right option</h1>
<p>Since you have two or three routes you can go, the best thing to do is evaluate the need you have, and place the workload where it makes the most sense.&nbsp; My suggestion is to install the HDInsight Server locally on a test system, and play around with it. Read up on the best ways to use Hadoop for a given workload, understand the parts, write a little Pig and Hive, and get your feet wet. Then sign up for a test account on HDInsight Service, and see how that leverages what you know. If you're a true tinkerer, go ahead and try the VM route as well. </p>
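<p>If it helps to see the trade-offs from the three option descriptions in one place, here is a rough sketch that encodes them as a small Python function. The field names, the <code>suggest_option</code> function and especially the order in which the criteria are checked are my own assumptions for illustration; treat the result as a starting point for your evaluation, not as a rule.</p>

```python
from dataclasses import dataclass


@dataclass
class Workload:
    """Hypothetical description of a Hadoop workload; all fields are illustrative."""
    periodic_burst: bool = False
    data_already_in_azure_storage: bool = False
    needs_complete_control: bool = False
    runs_all_the_time: bool = False
    constant_huge_bulk_load: bool = False
    learning_or_demo_only: bool = False


def suggest_option(workload: Workload) -> str:
    """Map the criteria described for each option to a suggestion.

    The precedence below (on-premises needs first, then the service,
    then the VM route) is an assumption, not something the options mandate.
    """
    if (
        workload.needs_complete_control
        or workload.runs_all_the_time
        or workload.constant_huge_bulk_load
    ):
        return "Microsoft HDInsight Server (on-premises)"
    if workload.periodic_burst or workload.data_already_in_azure_storage:
        return "Windows Azure HDInsight Service"
    if workload.learning_or_demo_only:
        return "Hadoop on a Windows Azure Virtual Machine (unsupported)"
    return "No clear fit: evaluate the workload further"


if __name__ == "__main__":
    print(suggest_option(Workload(periodic_burst=True)))
    print(suggest_option(Workload(constant_huge_bulk_load=True)))
    print(suggest_option(Workload(learning_or_demo_only=True)))
```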
<p>Oh - there&rsquo;s another great reference on the Windows Azure HDInsight that just came out, here: <a href="http://sqlblog.com/b/brunoterkaly/archive/2012/11/16/hadoop-on-azure-introduction.aspx">http://blogs.msdn.com/b/brunoterkaly/archive/2012/11/16/hadoop-on-azure-introduction.aspx</a> &nbsp;</p>