Search results matching tags 'Business Intelligence' and 'SQL Server' (http://sqlblog.com/search/SearchResults.aspx?o=DateDescending&tag=Business+Intelligence,SQL+Server&orTags=0)

Data Mining Algorithms – an Introduction (http://sqlblog.com/blogs/dejan_sarka/archive/2015/02/19/data-mining-algorithms-an-introduction.aspx) Thu, 19 Feb 2015 18:08:58 GMT Dejan Sarka

<p>Data mining is the most advanced part of business intelligence. With statistical and other mathematical algorithms, you can automatically discover patterns and rules in your data that are hard to notice with on-line analytical processing and reporting. However, you need to thoroughly understand how the data mining algorithms work in order to interpret the results correctly. In this blog post I introduce data mining, and in the following posts I will unveil the black box of data mining and explain how the most popular algorithms work.</p> <h3>Data Mining Definition</h3> <p>Data mining is a process of exploration and analysis, by automatic or semiautomatic means, of historical data in order to discover patterns and rules, which can later be used on new data for predictions and forecasting. With data mining, you deduce hidden knowledge by examining, or training, the data. The unit of examination is called a <i>case</i>, which can be interpreted as one appearance of an entity, or a row, in a table. The knowledge consists of <i>patterns</i> and <i>rules</i>. In the process, you use the attributes of a case, which are called <i>variables</i> in data mining terminology. For a better understanding, you can compare data mining to On-Line Analytical Processing (OLAP), which is a model-driven analysis where you build the model in advance. Data mining is a data-driven analysis, where you search for the model.
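</p>

<p>To make the terminology concrete, here is a small, purely illustrative sketch (not part of the original definitions): each case is a row, each key is a variable, and the "model" is found by examining the data rather than being built in advance. The variable names, values, and the "churn" target below are invented for the example.</p>

```python
# Each case is a row; each key is a variable. "churn" is the
# (hypothetical) target variable we want to find a pattern for.
cases = [
    {"contract": "monthly", "intl_plan": "yes", "churn": True},
    {"contract": "monthly", "intl_plan": "no",  "churn": True},
    {"contract": "yearly",  "intl_plan": "yes", "churn": False},
    {"contract": "yearly",  "intl_plan": "no",  "churn": False},
    {"contract": "monthly", "intl_plan": "no",  "churn": True},
    {"contract": "yearly",  "intl_plan": "no",  "churn": False},
]

def best_split(cases, target):
    """Data-driven model search in miniature: find the single variable
    whose values best separate the states of the target variable."""
    best_var, best_correct = None, -1
    for var in cases[0]:
        if var == target:
            continue
        # Group the target values by the states of this variable.
        groups = {}
        for case in cases:
            groups.setdefault(case[var], []).append(case[target])
        # Predict the majority target state per group and count the hits.
        correct = sum(max(vals.count(s) for s in set(vals))
                      for vals in groups.values())
        if correct > best_correct:
            best_var, best_correct = var, correct
    return best_var

print(best_split(cases, "churn"))  # prints: contract
```

<p>This is essentially a one-level decision tree. The real algorithms introduced later in this post are far more refined, but the idea of searching the data for the model, instead of building the model in advance, is the same.</p> <p>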
You examine the data with data mining algorithms.</p> <p>There are many alternative names for data mining, such as knowledge discovery in databases (KDD) and predictive analytics. Originally, data mining was not the same as machine learning: data mining gives business users insights for actionable decisions, while machine learning focuses on determining which algorithm performs best for a specific task. However, nowadays data mining and machine learning are in many cases used as synonyms.</p> <h3>The Two Types of Data Mining</h3> <p>Data mining techniques are divided into two main classes:</p> <ul> <li>The <i>directed</i>, or <i>supervised</i>, approach: You use known examples and apply the gleaned information to unknown examples to predict selected target variable(s). </li> <li>The <i>undirected</i>, or <i>unsupervised</i>, approach: You discover new patterns inside the dataset as a whole. </li> </ul> <p>Some of the most important directed techniques include classification, estimation, and forecasting. Classification means examining a new case and assigning it to a predefined discrete class. Examples are assigning keywords to articles and assigning customers to known segments. Estimation is very similar, except that you estimate the value of a variable of a new case from a continuous range of values. You can, for example, estimate the number of children or the family income. Forecasting is somewhat similar to classification and estimation. The main difference is that you can’t check the forecasted value at the time of the forecast. Of course, you can evaluate it if you just wait long enough. Examples include forecasting which customers will leave in the future, which customers will order additional services, and the sales amount in a specific region at a specific time in the future.</p> <p>The most common undirected techniques are clustering and affinity grouping.
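</p>

<p>One of these undirected techniques, affinity grouping (the market basket analysis discussed below), is easy to sketch in code. The following purely illustrative example (the transactions and item names are invented) counts how often pairs of items appear together in the same basket, which is the core of the technique:</p>

```python
from itertools import combinations
from collections import Counter

# Hypothetical transactions; each set is one market basket.
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"beer", "chips"},
    {"bread", "milk"},
    {"beer", "chips", "bread"},
]

# Count every two-item itemset that occurs within a single transaction.
pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# Frequent pairs hint at cross-selling opportunities.
for pair, count in pair_counts.most_common(3):
    print(pair, count)
```

<p>A real Association Rules algorithm goes on to compute measures such as support, confidence, and lift over these counts, but the raw co-occurrence counting shown here is where it starts.</p> <p>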
An example of clustering is looking through a large number of initially undifferentiated customers and trying to see if they fall into natural groupings. This is a pure example of &quot;undirected data mining&quot;, where the user has no preordained agenda and hopes that the data mining tool will reveal some meaningful structure. Affinity grouping is a special kind of clustering that identifies events or transactions that occur simultaneously. A well-known example of affinity grouping is market basket analysis. Market basket analysis attempts to understand which items are sold together at the same time.</p> <h3>Common Business Use Cases</h3> <p>Some of the most common business questions that you can answer with data mining include:</p> <ul> <li>What’s the credit risk of this customer? </li> <li>Are there any groups of my customers? </li> <li>What products do customers tend to buy together? </li> <li>How much of a specific product can I sell in the next time period? </li> <li>What is the potential number of customers shopping in this store? </li> <li>What are the major groups of my web-click customers? </li> <li>Is this a spam email? </li> </ul> <p>However, the actual questions you might want to answer with data mining can be far broader; they are limited only by your imagination. For an unconventional example, you might use data mining to try to lower the mortality rate in a hospital.</p> <p>Data mining is already widely used in many different applications. Some of the typical usages, along with the most commonly used algorithms for a specific task, include the following:</p> <ul> <li><i>Cross-selling</i>: Widely used for web sales with the Association Rules and Decision Trees algorithms. </li> <li><i>Fraud detection</i>: An important task for banks and credit card issuers, who want to limit the damage that fraud creates, including that experienced by customers and companies. The Clustering and Decision Trees algorithms are commonly used for fraud detection.
</li> <li><i>Churn detection</i>: Service providers, including telecommunications, banking, and insurance companies, perform this to detect which of their subscribers are about to leave, in an attempt to prevent it. Any of the directed methods, including the Naive Bayes, Decision Trees, or Neural Network algorithm, is suitable for this task. </li> <li><i>Customer Relationship Management (CRM) applications</i>: Based on knowledge about customers, which you can extract with segmentation using, for example, the Clustering or Decision Trees algorithm. </li> <li><i>Website optimization</i>: To do this, you should know how your website is used. Microsoft developed a special algorithm, the Sequence Clustering algorithm, for this task. </li> <li><i>Forecasting</i>: Nearly any business would like to have some forecasting, in order to prepare better plans and budgets. The Time Series algorithm is specially designed for this task. </li> </ul> <h3>A Quick Introduction to the Most Popular Algorithms</h3> <p>To raise expectations for the upcoming posts, here is a brief, condensed introduction to the most popular data mining algorithms, in a table. <table cellspacing="0" cellpadding="0"> <tr> <td> <p><strong>Algorithm</strong></p> </td> <td> <p><strong>Usage</strong></p> </td> </tr> <tr> <td> <p>Association Rules</p> </td> <td> <p>The algorithm used for market basket analysis, this defines an itemset as a combination of items in a single transaction. It then scans the data and counts the number of times the itemsets appear together in transactions. Market basket analysis is useful for detecting cross-selling opportunities.</p> </td> </tr> <tr> <td> <p>Clustering</p> </td> <td> <p>This groups cases from a dataset into clusters containing similar characteristics. You can use the Clustering method in your CRM application to find distinguishable groups of your customers. In addition, you can use it for finding anomalies in your data.
If a case does not fit well into any cluster, it is a kind of exception. For example, this might be a fraudulent transaction.</p> </td> </tr> <tr> <td> <p>Naïve Bayes</p> </td> <td> <p>This calculates probabilities for each possible state of the input attribute for every single state of the predictable variable. These probabilities are then used to predict the target attribute based on the known input attributes of new cases. The Naïve Bayes algorithm is quite simple; it builds the models quickly. Therefore, it is very suitable as a starting point in your predictive analytics project. </p> </td> </tr> <tr> <td> <p>Decision Trees </p> </td> <td> <p>The most popular data mining algorithm, this predicts both discrete and continuous variables. It uses the discrete input variables to split the tree into nodes in such a way that each node is purer in terms of the target variable, i.e., each split leads to nodes where a single state of the target variable is represented better than the other states.</p> </td> </tr> <tr> <td> <p>Regression Trees</p> </td> <td> <p>For continuous predictable variables, you get a piecewise multiple linear regression formula, with a separate formula in each node of the tree. Discrete input variables are used to split the tree into nodes. A tree that predicts continuous variables is a Regression Tree. Use Regression Trees for the estimation of a continuous variable; for example, a bank might use this technique to estimate the family income of a loan applicant.</p> </td> </tr> <tr> <td> <p>Linear Regression</p> </td> <td> <p>This predicts continuous variables using a single multiple linear regression formula. The input variables must be continuous as well. Linear Regression is a simple case of a Regression Tree: a tree with no splits. Use it for the same purposes as Regression Trees.</p> </td> </tr> <tr> <td> <p>Neural Network</p> </td> <td> <p>This algorithm comes from artificial intelligence, but you can use it for predictions as well.
Neural networks search for nonlinear functional dependencies by performing nonlinear transformations on the data in layers, from the input layer through hidden layers to the output layer. Because of the multiple nonlinear transformations, neural networks are harder to interpret than Decision Trees.</p> </td> </tr> <tr> <td> <p>Logistic Regression</p> </td> <td> <p>Just as Linear Regression is a simple Regression Tree, Logistic Regression is a Neural Network without any hidden layers.</p> </td> </tr> <tr> <td> <p align="left">Support Vector Machines</p> </td> <td> <p>Support Vector Machines are supervised learning models with associated learning algorithms that analyze data and recognize patterns; they are used for classification. A support vector machine constructs a hyperplane or set of hyperplanes in a high-dimensional space, where the input variables define the dimensions. The hyperplanes split the data points into discrete groups of the target variable. Support Vector Machines are powerful for some specific classification tasks, such as text and image classification and handwritten character recognition.</p> </td> </tr> <tr> <td> <p>Sequence Clustering</p> </td> <td> <p>This searches for clusters based on a model, and not on the similarity of cases as Clustering does. The models are defined on sequences of events by using Markov Chains. A typical usage of Sequence Clustering is the analysis of your company’s Web site usage, although you can use this algorithm on any sequential data.</p> </td> </tr> <tr> <td> <p>Time Series</p> </td> <td> <p>You can use this algorithm to forecast continuous variables. The Time Series name often denotes two different internal algorithms. For short-term forecasting, the Auto-Regression Trees (ART) algorithm is used. For long-term prediction, the Auto-Regressive Integrated Moving Average (ARIMA) algorithm is used.
</p> </td> </tr> </table> </p> <h3>Conclusion</h3> <p>This brief introduction to data mining should give you an idea of what you can use it for, and an overview of which algorithms are appropriate for the business problem you are trying to solve. I guess you also noticed that I am not talking about any specific technology here. These most popular data mining algorithms are available in many different products. For example, you can find them in SQL Server Analysis Services, Excel with the Data Mining Add-ins, R, Azure ML, and more. Please learn how to use them with your specific product through the documentation of the product, by reading books that deal with your product, or by attending a course about the product.</p> <p>I hope you got excited enough to read the upcoming posts and attend some of my presentations at various conferences.</p>

PASS SQL Saturday #356 Slovenia Pre-Conference Seminars (http://sqlblog.com/blogs/dejan_sarka/archive/2014/09/29/pass-sql-saturday-356-slovenia-pre-conference-seminars.aspx) Mon, 29 Sep 2014 06:47:53 GMT Dejan Sarka

<p>I am proud and glad to announce two top pre-conference seminars at the <a href="https://www.sqlsaturday.com/356/eventhome.aspx">PASS SQL Saturday #356 Slovenia</a> conference. The speakers and seminar titles are:</p> <ul> <li>Stacia Misner - <a href="http://www.kompas-xnet.si/koledar-tecajev/pre-conference-seminarspower-up-your-data-with-excel-and-power-bi/SQL1">Power Up Your Data with Excel and Power BI</a> </li> <li>Kevin Boles - <a href="http://www.kompas-xnet.si/koledar-tecajev/pre-conference-seminars-tune-like-a-guru/SQL">Tune Like A Guru!</a></li> </ul> <p>Both seminars will take place on Friday, December 12th, in the classrooms of our sponsor <a href="http://www.kompas-xnet.si/">Kompas Xnet</a>. The price for a seminar is € 149, with an early bird price of € 119.
The early bird price is valid until October 31st.</p> <p>I am also using this opportunity to explain how and why we decided on these two seminars. The decision was made by the conference organizers, Matija Lah, Mladen Prajdič, and Dejan Sarka. Lately there has been a lot of discussion in different social networks about PASS Summit pre-conference seminars. If you have any objections to our seminars, please do not start big discussions in public; please raise them with the three of us directly.</p> <p>First of all, unlike at the PASS Summit seminars, the speakers are not going to earn big money here, and therefore it is not really worth spending much time and energy on our decision. We think that any of the speakers who sent proposals for our SQL Saturday could present a top-quality seminar. We would like to enable seminars for every speaker who wants to deliver one. However, in a small country, we will already have a hard time filling up the two seminars we currently have. Our intention is to reimburse at least part of the money the speakers spent out of their own pockets on travel expenses and accommodation. In our opinion, it makes sense to do this for the speakers who spent the most on travel. Coming here from the USA is expensive, and it also takes three days in both directions. That’s why we decided to organize the seminars for the first two speakers from the USA. </p> <p>Of course, this is not the last event.
If everything goes well with SQL Saturday #356 and with the seminars, we will definitely try to organize more events in the future, and invite more speakers to deliver seminars as well.</p> <p>Thank you for understanding!</p>

24 Hours of PASS (September 2014): Recordings Now Available! (http://sqlblog.com/blogs/sergio_govoni/archive/2014/09/24/24-hours-of-pass-september-2014-recordings-now-available.aspx) Wed, 24 Sep 2014 17:20:00 GMT Sergio Govoni

<p>The sessions of the event <a href="http://www.sqlpass.org/24hours/2014/summitpreview/About.aspx" target="_blank" mce_href="http://www.sqlpass.org/24hours/2014/summitpreview/About.aspx">24 Hours of PASS: Summit Preview Edition</a>&nbsp;(which was held on September 9th) were recorded and are now available for online streaming!</p><p><a href="http://www.sqlpass.org/24hours/2014/summitpreview/Schedule.aspx" target="_blank" mce_href="http://www.sqlpass.org/24hours/2014/summitpreview/Schedule.aspx"><img width="984" height="175" style="width:495px;height:90px;" src="http://sqlblog.com/files/folders/54790/download.aspx" border="0"></a></p><p>If you missed one session in particular, or the entire event, you can view or review your preferred sessions; you can find all the details <a href="http://www.sqlpass.org/24hours/2014/summitpreview/Schedule.aspx" target="_blank" mce_href="http://www.sqlpass.org/24hours/2014/summitpreview/Schedule.aspx">here</a>.</p><p>What can you expect from the next PASS Summit?
Find out in the recorded sessions of this edition of 24 Hours of PASS.</p>

24 Hours of PASS (September 2014): Summit Preview Edition (http://sqlblog.com/blogs/sergio_govoni/archive/2014/08/12/24-hours-of-pass-september-2014-summit-preview-edition.aspx) Tue, 12 Aug 2014 21:22:00 GMT Sergio Govoni

<p>Which sessions can you expect to find at the next <a href="http://www.sqlpass.org/summit/2014/Home.aspx" target="_blank" mce_href="http://www.sqlpass.org/summit/2014/Home.aspx">PASS Summit 2014</a>? Find out on September 9, 2014 (12:00 GMT) at the free online event: <a href="http://www.sqlpass.org/24hours/2014/summitpreview/Sessions.aspx" target="_blank" mce_href="http://www.sqlpass.org/24hours/2014/summitpreview/Sessions.aspx">24 Hours of PASS: Summit Preview Edition</a>.<br></p><p><a href="http://www.sqlpass.org/24hours/2014/summitpreview/Sessions.aspx" target="_blank" mce_href="http://www.sqlpass.org/24hours/2014/summitpreview/Sessions.aspx"><img width="600" height="110" style="width:600px;height:110px;" src="http://sqlblog.com/files/folders/54790/download.aspx" border="0"></a></p><p>Register now at this <a href="http://www.sqlpass.org/24hours/2014/summitpreview/Registration.aspx" target="_blank" mce_href="http://www.sqlpass.org/24hours/2014/summitpreview/Registration.aspx">link</a>.</p><p>No matter what part of the world you follow the event from, the important thing to know is that&nbsp;it will be 24 hours of continuous training on SQL Server and Business Intelligence, on your computer!</p>

24 Hours of PASS (June 2014): Recordings Now Available! (http://sqlblog.com/blogs/sergio_govoni/archive/2014/07/08/24-hours-of-pass-june-2014-recordings-now-available.aspx) Tue, 08 Jul 2014 17:57:00 GMT Sergio Govoni

<P>The sessions of the event <A href="http://www.sqlpass.org/24hours/2014/ss2014launch/Sessions.aspx" target=_blank
mce_href="http://www.sqlpass.org/24hours/2014/ss2014launch/Sessions.aspx">24 Hours of PASS: SQL Server 2014</A> (which was held on June 25th and 26th) were recorded and are now available for online streaming!</P>
<P><IMG style="WIDTH:622px;HEIGHT:111px;" border=0 src="http://sqlblog.com/files/folders/54122/download.aspx" width=622 height=111>&nbsp;</P>
<P>If you missed one session in particular, or the entire live event, you can view or review your preferred sessions <A href="http://www.sqlpass.org/24hours/2014/ss2014launch/Schedule.aspx" target=_blank mce_href="http://www.sqlpass.org/24hours/2014/ss2014launch/Schedule.aspx">here</A>.</P>
<P>What can you expect from the next PASS Summit? Find out on September 9 at 24 Hours of PASS: Summit 2014 Preview Edition!</P>

24 hours of PASS is back! (http://sqlblog.com/blogs/sergio_govoni/archive/2014/06/05/24-hours-of-pass-is-back.aspx) Fri, 06 Jun 2014 03:47:00 GMT Sergio Govoni

<p>The most important free on-line event on SQL Server and Business Intelligence is back!&nbsp;The 24 Hours of PASS is coming back with a great edition fully based on the new features of SQL Server 2014.</p>
<p><a href="http://www.sqlpass.org/24hours/2014/ss2014launch/Sessions.aspx" target="_blank" mce_href="http://www.sqlpass.org/24hours/2014/ss2014launch/Sessions.aspx"><img width="622" height="111" style="width:622px;height:111px;" src="http://sqlblog.com/files/folders/54122/download.aspx" border="0"></a></p>
<p>Register now at this <a href="http://www.sqlpass.org/24hours/2014/ss2014launch/Registration.aspx" target="_blank" mce_href="http://www.sqlpass.org/24hours/2014/ss2014launch/Registration.aspx">link</a>.</p>
<p>No matter what part of the world you follow the event from, the important thing to know is that it will be 24 hours of continuous training on SQL Server and Business Intelligence.</p>

PASS DW/BI Virtual Chapter Upcoming Sessions (December 2013) (http://sqlblog.com/blogs/sergio_govoni/archive/2013/12/06/pass-dw-bi-virtual-chapter-upcoming-sessions-december-2013.aspx) Fri, 06 Dec 2013 23:33:00 GMT Sergio Govoni

<p><span class="hps">Let me point out the upcoming live events scheduled for December 2013, organized by&nbsp;the <a href="http://bi.sqlpass.org/" target="_blank" mce_href="http://bi.sqlpass.org/">PASS Business Intelligence Virtual Chapter</a>.</span></p><p>&nbsp;</p><p><b>Create and Load a Staging Environment from Scratch in an Hour with Biml</b></p><p>Date: Thursday 12 December Noon PST / 3 PM EST / 8 PM GMT<br>Speaker: Scott Currie<br>URL: <a href="https://attendee.gotowebinar.com/register/7424713205660411905" target="_blank" mce_href="https://attendee.gotowebinar.com/register/7424713205660411905">https://attendee.gotowebinar.com/register/7424713205660411905</a> </p><p>Business Intelligence Markup Language (Biml) automates your BI patterns and eliminates the manual repetition that consumes most of your SSIS development time. During this hour-long presentation, Scott Currie from Varigence will use the free BIDSHelper add-in for BIDS and SSDT to introduce Biml and use it to automatically generate large quantities of custom SSIS packages. The session will be largely demonstration-driven, and reusable sample code will be distributed for you to use in your own projects. Using a live-typing approach, Scott will start from scratch and by the end of the session create a full-blown staging environment. This will include the creation of *hundreds* of target table creation scripts, data load packages, data scrubbing rules, logging, and more.
The best part is that you can freely reuse the code in your own environment just by changing the connection strings - or make small changes to implement your own data load patterns.</p><p>&nbsp;</p><p><b>Inferred Dimension Members within MDS and SSIS</b></p><p>Date: Monday 16 December 3 PM PST / 6 PM EST / 11 PM GMT<br>Speaker: Reza Rad<br>URL: <a href="https://attendee.gotowebinar.com/register/7123625140094491905" target="_blank" mce_href="https://attendee.gotowebinar.com/register/7123625140094491905">https://attendee.gotowebinar.com/register/7123625140094491905</a> </p><p>Combining Master Data Services with data warehouses causes some challenges in ETL scenarios. In this session we will go through a demo of an Inferred Dimension Members implementation with SSIS, considering the fact that MDS keeps the single version of truth for the dimension record. You will learn how to write a new record's data back into the MDS entity as an inferred member. The staging structure of Master Data Services and batch processing will be used for this. Then you will learn what the best practice is for adding the inferred record into the data warehouse dimension. Updating an existing dimension member also takes the inferred member into account and applies SCD types only if it is not an inferred member.</p><p>&nbsp;</p><p><b>Guerrilla MDS/MDM The Road To Data Governance</b></p><p>Date: Thursday 19 December Noon PST / 3 PM EST / 8 PM GMT<br>Speakers: Ira Whiteside and Victoria Stasiewicz<br>URL: <a href="https://attendee.gotowebinar.com/register/336644427020709122" target="_blank" mce_href="https://attendee.gotowebinar.com/register/336644427020709122">https://attendee.gotowebinar.com/register/336644427020709122</a> </p><p>Ira and Vic's session "Guerrilla MDS" will be a walk-through of a real-world implementation of a master data model (MDM) and metadata mart utilizing SSIS, MDS, and Power BI Excel add-ins, as well as applying proper data quality techniques.
We will walk through in detail the processes necessary for utilizing the complete MDS functionality, as follows: creating entities and attributes, relating entities through domain-based attributes, staging leaf tables, updating entity content, applying business rules, creating subscription views, and setting up security. Source code for all samples and the PowerPoint deck will be made available.&nbsp;</p>

24 Hours of PASS (July 2013): Recordings Now Available! (http://sqlblog.com/blogs/sergio_govoni/archive/2013/08/06/24-hours-of-pass-july-2013-recordings-now-available.aspx) Wed, 07 Aug 2013 03:10:00 GMT Sergio Govoni

<P>The sessions of the event <A href="http://www.sqlpass.org/24hours/2013/summitpreview/" target=_blank mce_href="http://www.sqlpass.org/24hours/2013/summitpreview/">24 Hours of PASS: Summit Preview</A> (which was held on July 31) were recorded and are now available for online streaming!</P>
<P><A href="http://www.sqlpass.org/24hours/2013/summitpreview/" target=_blank mce_href="http://www.sqlpass.org/24hours/2013/summitpreview/"><IMG style="WIDTH:212px;HEIGHT:84px;" border=0 src="http://sqlblog.com/files/folders/50399/download.aspx" width=212 height=84></A></P>
<P>If you have missed one session in particular or the entire live event, you can view and review your preferred sessions; you can find all details <A href="http://www.sqlpass.org/summit/2013/Sessions/SneakPeeks.aspx" target=_blank mce_href="http://www.sqlpass.org/summit/2013/Sessions/SneakPeeks.aspx">here</A>.</P>
<P>This edition of 24 Hours of PASS is meant to be a sneak peek at what you&nbsp;can&nbsp;expect from the next <A href="http://www.sqlpass.org/summit/2013/" target=_blank mce_href="http://www.sqlpass.org/summit/2013/">PASS Summit</A>, which this year will be in Charlotte (NC) from 15 to 18 October 2013.</P>

SSIS Design Patterns, the Book (http://sqlblog.com/blogs/andy_leonard/archive/2012/08/06/ssis-design-patterns-the-book.aspx) Mon, 06 Aug 2012 16:37:43 GMT andyleonard

<p>For the past two years, I have had the honor and privilege of authoring <a href="http://www.amazon.com/SSIS-Design-Patterns-Matt-Masson/dp/1430237716" target="_blank">SSIS Design Patterns</a> alongside Jessica Moss, Michelle Ufford, Tim Mitchell, and Matt Masson. Publication of the book – like many projects of this scope – has been delayed. The current publication date is 27 Aug 2012 and I have high confidence in this date. </p> <p>I take responsibility for the publication delays and apologize to those who pre-ordered the book. The reasons for the delays are not important. I have built a career as a software developer and architect based on the following maxim:</p> <blockquote> <p><em>Deliver quality late, no one remembers. <br />Deliver junk on time, no one forgets.</em></p> </blockquote> <p>The shared goal of everyone working on this project has been to deliver quality. Proofing the manuscripts, I believe we have achieved that goal. </p> <p>:{&gt;</p>

The Data Scientist (http://sqlblog.com/blogs/buck_woody/archive/2011/11/15/the-data-scientist.aspx) Tue, 15 Nov 2011 15:00:18 GMT BuckWoody

<p>A new term - well, perhaps not that new - has come up and I’m actually very excited about it. The term is Data Scientist, and since it’s new, it’s fairly undefined. I’ll explain what I <em>think</em> it means, and why I’m excited about it.</p> <p>In general, I’ve found the term deals at its most basic with analyzing data.
Of course, we all do that, and the term itself in that definition is redundant. There is no science that I know of that does not work with analyzing lots of data. But the term seems to refer to more than the common practices of looking at data visually, putting it in a spreadsheet or report, or even using simple coding to examine data sets. </p> <p>The term Data Scientist (as far as I can make out this early in its use) is someone who has a strong understanding of data sources, relevance (statistical and otherwise) and processing methods as well as front-end displays of large sets of complicated data. Some - but not all - Business Intelligence professionals have these skills. In other cases, senior developers, database architects or others fill these needs, but in my experience, many lack the strong mathematical skills needed to make these choices properly. </p> <p>I’ve divided the knowledge base for someone who would wear this title into three large segments. It remains to be seen if a given Data Scientist would be responsible for knowing all these areas or would specialize. There are pretty high requirements on the math side, specifically in graduate-degree level statistics, but in my experience a company will only have a few of these folks, so they are expected to know quite a bit in each of these areas. </p> <p><strong>Persistence</strong></p> <p>The first area is finding, cleaning and storing the data. In some cases, no cleaning is done prior to storage - it’s just identified and the cleansing is done in a later step. This area is where the professional would be able to tell if a particular data set should be stored in a Relational Database Management System (RDBMS), across a set of key/value pair storage (NoSQL) or in a file system like HDFS (part of the Hadoop landscape) or other methods. Or do you examine the stream of data without storing it in another system at all?
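</p>

<p>That storage decision can be summarized in a deliberately oversimplified sketch. The categories, parameters, and rules of thumb below are my own illustrative assumptions for the example, not a definitive rule set:</p>

```python
def suggest_store(relational, needs_acid, volume_tb, streaming):
    """A toy decision helper for the persistence choice described above.
    The thresholds and store categories are illustrative assumptions only."""
    if streaming:
        return "analyze the stream directly (no long-term store)"
    if relational and needs_acid:
        return "RDBMS"
    if volume_tb > 100:
        return "distributed file system (e.g., HDFS)"
    return "key/value (NoSQL) store"

# A transactional line-of-business dataset lands in the relational bucket.
print(suggest_store(relational=True, needs_acid=True, volume_tb=5, streaming=False))
```

<p>In practice, of course, the decision weighs far more factors than four booleans and a size threshold, which is exactly why it calls for someone with this breadth of knowledge.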
</p> <p>This is an important decision - it’s a foundation choice that deals not only with a lot of the expense of purchasing systems, or even of using Cloud Computing (PaaS, SaaS or IaaS) to source it, but also with the skillsets and other resources needed to care for and feed the system for a long time. The Data Scientist sets something into motion that will probably outlast his or her career at a company or organization.</p> <p>Often these choices are made by senior developers, database administrators or architects in a company. But sometimes each of these has a certain bias towards making a decision one way or another. The Data Scientist would examine these choices in light of the data itself, starting perhaps even before the business requirements are created. The business may not even be aware of all the strategic and tactical data sources that they have access to. </p> <p><strong>Processing</strong></p> <p>Once the decision is made to store the data, the next set of decisions is based on how to process the data. An RDBMS scales well to a certain level, and provides a high degree of ACID compliance as well as offering a well-known set-based language to work with this data. In other cases, scale should be spread among multiple nodes (as in the case of Hadoop landscapes or NoSQL offerings) or even across a Cloud provider like Windows Azure Table Storage. In fact, in many cases - most of the ones I’m dealing with lately - the data should be split among multiple types of processing environments. This is a newer idea. Many data professionals simply pick a methodology (RDBMS with Star Schemas, NoSQL, etc.) and put all data there, regardless of its shape, processing needs and so on. </p> <p>A Data Scientist is familiar not only with the various processing methods, but also with how they work, so that they can choose the right one for a given need. This is a huge time commitment, hence the need for a dedicated title like this one.
</p> <p><strong>Presentation</strong></p> <p>This is where the need for a Data Scientist is most often already being filled, sometimes with more or less success. The latest Business Intelligence systems are quite good at allowing you to create amazing graphics - but it’s the data behind the graphics that is the most important component of truly effective displays. </p> <p>This is where the mathematics requirement of the Data Scientist title is the most unforgiving. In fact, someone without a good foundation in statistics is not a good candidate for creating reports. Even a basic level of statistics can be dangerous. Anyone who works in analyzing data will tell you that there are multiple errors possible when data just seems right - and basic statistics bears out that you’re on the right track - that are only solvable when you understand why the statistical formula works the way it does. </p> <p>And there are lots of ways of presenting data. Sometimes all you need is a “yes” or “no” answer that can only come after heavy analysis work. In that case, a simple e-mail might be all the reporting you need. In others, complex relationships and multiple components require a deep understanding of the various graphical methods of presenting data. Knowing which kind of chart, color, graphic or shape conveys a particular datum best is essential knowledge for the Data Scientist. </p> <p><strong>Why I’m excited</strong></p> <p>I love this area of study. I like math, stats, and computing technologies, but it goes beyond that. I love what data can do - how it can help an organization. I’ve been fortunate enough in my professional career these past two decades to work with lots of folks who perform this role at companies from aerospace to medical firms, from manufacturing to retail. </p> <p>Interestingly, the size of the company really isn’t germane here. I worked with one very small bio-tech (cryogenics) company that worked deeply with analysis of complex interrelated data.
</p> <p>So watch this space. No, I’m not leaving Azure or distributed computing or Microsoft. In fact, I think I’m perfectly situated to investigate this role further. We have a huge set of tools, from RDBMS to Hadoop, to allow me to explore. And I’m happy to share what I learn along the way. </p>