Just a three hour tour…

Menu

Articles

My last blog post showed how to copy data to a blob storage account, which I needed to do to be able to call an Azure Machine Learning [ML] Web Service. When calling a ML Web Service, the data must be in an Azure Blob Storage account. Once a ML model has been trained, and a web services has been created, it’s ready for production. Calling the experiment in Data Factory allows the ML to be run with tens of thousands of rows as part of a scheduled process. Prior to inserting the ML web service in Data Factory, make sure that you test it to ensure there are no errors with the web service, as Data Factory does not expose all of the ML errors which may be encountered by the web service.

Creating Azure Machine Learning Data Factory Pipelines

Two new steps need to be added to the existing Data Factory Pipeline, one to call the ML Web Service and one for the output. The ML pipeline requires two pieces of JSON code, a linked service to make the connection to the web service and a pipeline to invoke the job and specify the inputs and the Outputs. For the Output, the first step requires no JSON as first a blob storage container in Azure needs to be created to store it. The next steps involve writing JSON to create a linked service to connect to it and lastly an Output dataset needs to be defined.

Calling Machine Learning Service

The Linked Service for ML is going to need some information from the Web Service, the URL and the API key. Chances are neither of these have been committed to memory, instead open up Azure ML, go to Web Service and copy them. For the URL, look under the API Help Page grid, there are two options, Request/Response and Batch Execution. Clicking on Batch Execution loads a new page Batch Execution API Document. The URL can be found under Request URI. When copying the URL, you do not need to include any text after the word “jobs”. The rest of the URL, “?api-version=2.0”. Copying the entire URL will cause an error. Going back to the web Services page, The API Key appears on the dashboard section of Azure ML and there is a convenient button for copying it. Using these two pieces of information, it is now possible to create the Data Factory Linked Service to make the connection to the web service, which here I called AzureMLLinkedService
{
"name": "AzureMLLinkedService",
"properties": {
"description": "Connecting ML Experiment”
"hubName": " GingerDataFactoryTest_hub",
"type": "AzureML",
"typeProperties": {
"mlEndpoint": "https://ussouthcentral.services.azureml.net/workspaces/fbe056b6d4c74d7f9d1954367dc3fa61/services/xxa56efd75b745e28cd0512822d17eae/jobs",
"apiKey": "**********"
}
}
}

Lastly another Dataset needs to be created to process the output. The data will be written to a file called Output.csv, which is in a folder called mloutput01/ which is located in the Blob storage container, which is the same one I used previously for the input folder used earlier. This file will be overwritten every single time this is run.
{
"name": "OutputDataSetBlob",
"properties": {
"published": false,
"type": "AzureBlob",
"linkedServiceName": "AzureBlobStorageLinkedService",
"typeProperties": {
"fileName": "output.csv",
"folderPath": "mloutput01/",
"format": {
"type": "TextFormat",
"columnDelimiter": ","
}
},
"availability": {
"frequency": "Hour",
"interval": 1
},
"external": false,
"policy": {}
}
}

If you add this code onto the previous Data Factory code, you can take data from the database and use it to run a Azure ML experiment and run as much data as you want through the experiment.

When starting with Azure Machine Learning, it is sort of hard to wrap one’s brain around what kind of insight that Machine Learning can provide. When doing data analysis, often times we are looking for patterns. Does the volume of data really go up at the end of the month or is just the additional processes that make it seem that way? Does anyone really know if sales really pick up in August or is that just legerdemain from the sales department? Linear Regression can help determine that.

Relationships between Different Items

There are two types of indicators for linear correlation, positive and negative as shown on the following charts. The Y axis represents Grades, and the x axis is changed to show positive and negative correlation of the amount of X on grades. When X is the amount of study hours, there is a positive correlation and the line goes up. When X is changed to watching cat videos, there is a negative correlation. If you can’t draw a line around the points there is no correlation. If I were to create a graph where X indicated the quantity of the bags of Cheese Doodles consumed on grades, it would not be possible to draw a straight line, where the data points cluster around it. Since this is Line-ar regression, if that line doesn’t exist there is no correlation. Knowing there is no correlation is also useful.

Calculating Linear Regression

While the variable relationship is really easy to see without Math, there is an underlying formula that describes Linear Regression, and lest all of the math majors get upset I thought I would include the formula

Yi = a0 + b0Xi + ei

Y – is the value of the Y axis, which in our example is grades

a – Is the point where the line intersects Y, or more clearly stated, where the line is. Now ideally your data should intersect at those points but since the line is sort of a guide, this won’t exactly match.

b – Contains the slope of the line

X – Is the value of the X axis, which depending on the example you are looking at is

E – This contains the error

Machine Learning with Linear Regression

In the blog examples, there are only two values, grades and something else. Machine learning can take all of your input variables and determine which values, if any impact the result. Hopefully this information provides you with a good use case for machine learning. In case you were unaware, Azure ML is availablefor free. All you need to do is sign up for an account at https://studio.azureml.net . There are a few size limitations as far as how much data you can load, but you can load enough to determine if machine learning will work in your environment.

As 2015 draws to a close, I started thinking back about everything that has happened this year. Technically this has been a big year as a many new applications were released. Here are just some of them, with links included to provide more detail.

This short list could be a lot longer as it doesn’t count the number of updates released to Power BI, which occur several times a month, the CTP releases for SQL Server 2016, the new web version of BIML, or PowerShell. It’s really hard to keep up with everything that is changing. It’s a good thing that so many people are willing to help others learn how through speaking and blogs which make learning new things easier.

Community Involvement in 2015

Keeping up with all of these events is difficult, especially given the pace of releases. I spend a lot of time reading various blogs, watching videos and going to hear people speak. I also have been able to talk about topics of particular interest, many Power BI and Machine Learning. This year I spoke a different times at a number of different events including: Speaker Idol, two different user groups, seven webinars, five SQL Saturdays and other Tech Events. I’ve got a number of engagements on the books for next year, including PASS BA Con and SQL Saturday #461 – Austin. 2016 is shaping up to be busy too and hopefully our paths will cross. I list all of my speaking events on my Engagement Page and I hope that you might take a look at it from time to time if you are interested in catching up in person sometime. Next year I am hoping my list of speaking engagements changes somewhat as I plan on trying harder to get accepted to speak at events where I submitted and was turned down in 2015. On a more positive note, views of my blog are up 1000%, and the number of website subscribers has more than doubled. Thank you very much for continuing to read this site and I hope you find my thoughts helpful. I posted once a week this year, which I thought was pretty good until I talked to Ken Fischer b | t who blogs twice a week. I’ll have to try harder next year. If you think of a topic you think would make a good blog post, let me know as I am always interested in feedback.

Keeping Up the Pace in 2016

Next year there will be no slowdown in the things to learn as SQL Server 2016 is going to be released. Although the exact date has not been announced, my sources tell me to look for it around May-June. The next release of SQL Server is going to be huge as it will include new tools Microsoft added to integrate Big Data and open source platforms to SQL Server. PolyBase, JSON and R are all going to be part of with SQL Server. Personally, I find the R integration most Datazen and SSRS are going to be integrated in the next release too which should really increase the implementation of mobile reporting visualizations.

When incorporating streaming data from Azure Streaming Analytics, it is important to select the data to accomplish the goals of the tasks at hand. The sample company Eosh is streaming a lot of sensor data and has a variety of questions which the data is will answer. Eosh is a concrete delivery company which is streaming the data back from their vehicles to the central dispatch system using Microsoft’s Stream Analytics. There’s a much more detailed description of what Eosh is streaming and their data needs in the first post in this series. After reviewing when Tumbling Windows and Sliding Windows, are used, in this post we are going to discuss another option for streaming data, Hopping Windows.

When to Use Hopping Windows

Eosh wants to use Hopping Windows to determine the previous action when water is added to the concrete mix. There is a flow meter sensor in the water tank which detects when the driver flips the switch to add more water. There are a number of different reasons for adding water, one being that the pouring is complete and the driver is washing out the remaining concrete. Another reason could be that the driver is stuck in traffic and the water is added to keep the concrete from setting up within the mixer. Depending on the type of concrete in the mixer, if too much water is added, the concrete will no longer have the required strength and can’t be used to create a load bearing structure. It is very important that concrete used in structural concrete be created according to specification, as concrete mixed incorrectly will crumble over time, something commonly seen in Detroit. If too much water is added the vehicle may be routed to a different location so the concrete can be used for a non-load bearing purpose, like creating sidewalks.

Overlapping Hops

By design, all hops contain an overlapping previous time slice. The picture provides a good visualization for how the data slices are created. Eohs wants to look at the events which happened 5 minutes prior so that the adding water event can be appropriately categorized. The following Streaming query can provide that data

This query will create 10 minute slices of time. Each slice will look at the last 5 minutes previous reported and 5 minutes past that. By slicing the data in this way, the context around adding water can be evaluated to determine what kind of water add event took place. Eosh can then use this data to determine if the concrete can be delivered to the original location or if it needs to be rerouted. This later processing will be accomplished via machine learning, which I will talk about in a later post.

Tumbling Windows are another way of selecting data from an Azure Stream to drive an Azure ML Experiment. Once again of my examples here are going to be based on the concrete company Eohs, which I referenced in a previous post when talking about Streaming Windows. Eohs is streaming data, via Azure Stream Analytics [ASA] and we need evaluate a portion of that data for an Azure Machine Learning experiment. The experiments don’t need all of the data; only a portion of that data is required. Some of the data will be reported on in real time, and other portions of data will be used for analysis at a longer window. The necessary data will be extracted via an Azure Stream Analytics Query using Windowing. In this post, we will be talking about Tumbling Windows.

Eohs: Streaming Sensor Data

Eohs has installed a tracking system which sends GPS positioning and sensor data which is sent back in near real time to the dispatching company. The dispatchers are able to monitor on their screens the location of the truck, speed, heading and some sensor information delivered every 20 seconds which allow them to know if the truck is loading concrete, pouring concrete, adding water, seatbelt information, and if the passenger door is opened. Eohs is interested in using the sensor data received to figure out if they will need to perform maintenance on their concrete mixing drums. The drums need to have maintenance performed on them based on the drum speed, concrete pouring sensor, and the amount of water added when in use.

Using Azure ML to Determine when to Perform Maintenance

Since Eohs is streaming their data with ASA, we monitor the sensor information for the water and the drum speed over time to see if maintenance is required on the concrete drum. The Azure ML experiment will look at the combination of the water, drum speed and time of day to determine if maintenance is required. We will need to evaluate the sensors every 15 minutes.

Tumbling Windows in Azure Stream Analytics

We want to look at the performance of the sensors in 15 minute increments, so to do this we are going to use a tumbling window. Tumbling windows are designed to read data in fixed increments, so our query is going to read them every 30 minutes. Using the Stream Analytics Query Language, this query will provide the data.

This query will return the data every 15 mint. The EvalTime will be the single time value when the query was run. TIMESTAMP BY EntryTime will ensure that the data is evaluated based upon when the data was created instead of the time that the data reached the Azure server as sometimes data packets may be received out of order. Having our data split into multiple streams like this will allow for multiple experiments to be performed on our Azure Data Stream.

For Part 3 of this series we will talk about Hopping Windows and how and when to use that technique on our data. If you are interested in knowing when my next post will be available, please subscribe and you’ll receive an email when the next post is available.

Recently I read a twitter message where the person said they had just run an Azure Machine Learning [ML] experiment and it worked successfully. He wondered what the results mean? I thought it might be helpful to explain how to interpret some of the results. There are a number of different types of algorithms you can use in Azure ML, each has a different way of evaluating the result of the experiment. Azure ML has three major algorithm categories: classification, clustering and regression. Since I believe classification and clustering are more often discussed when reading about ML, first I thought I’d talk about regression.

When should one use Regression Models?

While it is possible to make a reasonable guess on the other algorithm categories mean, regression algorithms are not so intuitive. A regression algorithms are used to predict a specific value output number based upon a series of variables. In order to run this kind of experiment, you will need a key to uniquely identify the record to be able to use this model, and be looking to return a number for a results.

Reading the Results

If you visualize the output of a ML project, the visualization looks like the picture on the left. The number you want to pay attention to is the coefficient of determination. This value tells you how good your predictions are. The closer this number is to 1, the better your variables are at predicting the results. If you look at the values in my experiment, it will round up to the number 1, since the value is 0.642447. That means the variables I am evaluating can mostly be used to determine a value. If you read about when to use a regression models, the common example using them is to predict a home value, which if you have all the information people use for creating real estate comparisons, such as number of rooms, closeness to a busy street, lot size etc.

No Result is a Result

Sometime there is a value in finding nothing. Recently I did some ML analysis for a company, who provided a set of data for analysis to determine how it was influencing the outcome. After running a number of different algorithms, weights, and other changes in the experiment, I determined that their data didn’t impact the final outcome in any significant way. Prior to presenting the findings to them, I thought that they would not have a favorable opinion of the analysis I did, as they were looking for an answer I didn’t find. What did happen is they were surprised and happy that I was able to show them that their thoughts about the data were not true, so they could focus their efforts to different areas. Keep that in mind next time you find nothing, because no result is an answer too.

Microsoft previewed Cortana Analytics in July 13, 2015, and since then, they have published a lot of information on their site about it. Based on what I’ve seen on the internet, there appears to be a lot of confusion as to what Cortana Analytics is. This is completely understandable when you consider the number of different products the name Cortana has represented for Microsoft. My favorite is the image with the picture of a blue girl, which is from the Xbox game Halo 3. A video game was character was the first place Microsoft used the name Cortana in 2007. At the Microsoft BUILD Developer Conference in April 2014, the name Cortana was used for the Microsoft version of the Apple’s Siri phone application. If you are interested in hearing about it, I’ve included a link to the Channel 9 video here where they talk about Cortana. Finally, a year later Microsoft comes out with a product called Cortana Analytics. No wonder people are confused.

Cortana Analytics is not a Product

Cortana Analytics: the bow tying different applications together

To help bring clarity to what Cortana Analytics is and is not, I wanted to start out with what I think is the most confusing point. Cortana Analytics is not a product, but a name given to a bunch of other applications which are designed to work together. In essence, Microsoft tied a bow around a bunch of applications and called the bow, Cortana Analytics. Here’s an example scenario. Start by sending water meter data from the physical meters to the cloud, where you aggregate, analyze, store and end up with a Power BI application on your phone showing you a visualization of some aspect of the data. To make this happen from a technical perspective using Microsoft’s tools, one would need to probably create an Event Hub, run a Streaming Analytics Process, use Data Factory to call a Machine Learning experiment, migrate the data to an Azure Storage account of some kind and then create a Power BI report to be sent to your phone. All of that, is Cortana Analytics. It is not one product, but a big bow tying all of the applications they have designed to work together under one name. Power BI is part of it. On that note, I recently saw Microsoft do a demo with Power BI where they integrated the Cortana-phone like functionality of talking to Power BI and it displayed the information it was asked. I have no idea when this will be released, but it sure was a neat demonstration. In this demo, they mentioned they were adding Cortana funcationality to Power BI, which really didn’t help the confusion level with the name.

Cortana Analytics Web Presentation

I recently recorded a video presentation of Cortana Analytics where I described in greater detail the components which make up Cortana Analytics and how they work together. That video is available here. As I am working more with the components which make up Cortana Analtics, such as Machine Learning and Power BI, I will definitely be devoting more blog post to the topic, so please subscribe to my blog if you are interested in learning more about it.

The PASS organization is a professional organization which sponsors a number of different technical events in the technical community. Recently, I have been honored to be selected to speak at not one but two events hosted by PASS, a professional organization which provides a lot of great resources to improve knowledge of all things SQL Server and related technologies to the world. The PASS Business Intelligence Chapter provides training on all things related to Business Intelligence via the web. I was selected to talk at the last meeting in May. Thank you to all of the people who were able to attend my talk on Top 10 SSIS Tuning Tricks live. If you had to work, no problem all of the talks hosted by the PASS Business Intelligence Virtual Chapter Recordings are available on www.Youtube.com. The recording of my Top 10 SSIS Tuning Tricks session is available here.

24 Hours of PASS

Periodically PASS provides a 24 Hour Training session on SQL Related topics to provide training live to every time zone in the world. As this event is watched by people around the world, it is a real honor to be selected for this event. This time the speakers were selected from people who had not yet spoken at the PASS Summit Convention, as the theme was Growing Our Community. The theme is just another way the PASS organization is working to improve people’s skills. Not only do they provide the opportunity to learn all things data, but also provide professional development through growing the speaking skills by providing many avenues to practice these skills.

Data Analytics with Azure Machine Learning

My abstract on Improving Data Analytics with Azure Machine Learning was selected by the 24 Hours of PASS. As readers of my blog are aware, I have been working on Azure Machine Learning [ML] this year and look forward to discussing how to integrate Azure ML into current environments. Data analytics with ML are yet another way to derive meaning from data being collected and stored. I find the application of data analytic fascinating, and hope to show you why if you are able to attend. There are a number of wonderful talks scheduled at this event, so I encourage you to check out the schedule at attend as many as you can. To be sure I’ll be signing up for a number of sessions as well.

Recently I was afforded the opportunity to speak at several different events, all of which I thoroughly enjoyed. I was able to speak on Azure Machine learning first at the Arizona SQL Server Users Group meeting. I really appreciate all who attended as we had quite a crowd. Since the meeting is held practically on Arizona State University’s Tempe Campus, it was great to see a number of students attending, most likely due to Ram’s continued marketing efforts on meetup.com. After talking to him about it, I was impressed at his success at improving attendance by promoting the event on Meetup, and wonder if many SQL Server User Groups have experienced the same benefits. If you have, please let me know. Thanks Joe for taking a picture of the event too.

Modern Data Warehousing Precon

The second event where I had the opportunity to talk about technology was at the Precon at SQL Saturday in Huntington Beach, where I spoke about Modern Data Warehousing. It was a real honor to be selected for this event, and I really enjoyed interacting with all of the attendees. Special thanks to Alan Faulkner for his assistance. We discussed the changing data environment including cloud based storage, analytics, Hadoop, handling ever increasing amounts of data from different sources, increasing demands of users, the review of technology solutions demonstrate ways to resolve these issues in their environments.

Talking and More Importantly Listening

The following day was SQL Saturday in Huntington Beach #389. Thanks to Andrew, Laurie, Thomas and the rest of the volunteers for making this a great event as I know a little bit about the work that goes into planning and pulling off the event. My sessions on Azure ML, Predicting the future with Machine Learning and Top 10 SSIS Tuning Tricks were both selected and I had great turnout on both sessions. To follow-up with a question I received during my SSIS Session, Balanced Data Distributor was first released as a new SSIS transform for SQL Server 2008 and 2008 R2, so you can use it for versions prior to SQL Server 2012. I’ve posted more information about it here. I also got a chance to meet a real live data scientist, the first time that has happened. Not only did I get a chance to speak but a chance to listen. I really enjoyed the sessions from Steve Hughes on the Building a Modern Data Warehouse and Analytics Solution in Azure, Kevin Kline on , and Julie Koesmarno on Interactive & Actionable Data Visualisation With Power View. As always it’s wonderful to get a chance to visit in person with the people who’s technical expertise I read. In addition to listening to technical jokes which people outside of the SQL community would not find humorous, it’s great to discuss technology with other practitioners. Thanks to Mr. Smith for providing me a question which I didn’t know the answer, which now I feel compelled to go find. I’ll be investigating the scalability of Azure ML and R so that I will be able to have an answer for him next time I see him. I really enjoy the challenge of not only investigating and applying new technology but figuring out how to explain what I’ve learned. I look forward to the opportunity to present again, and when I do I’ll be sure to update this site so hopefully I get a chance to meet the people who read this.Yours Always

As I was honored enough to be selected to give a PreCon on the Internals of the Modern Data Warehouse, I thought that I would take the time to explain why I felt drawn to the topic. There are a lot of places that haven’t given much thought to the changes in technology which have happened over the last few years. The major feature upgrades to SQL Server in 2012 and 2014 have meant that they can use column store indexes which makes things faster and maybe better High Availability. While those things are certainly valuable improvements there is a lot more that you can do to derive value from your data and companies want more than just a well-organized, running data warehouse.

Data is a Valuable Asset

In 2010, Borders Group Inc. was allowed by the Federal Trade Commission to sell their customer information to Barnes and Noble as part of their bankruptcy sale of their assets. In 2015, RadioShack is doing the same thing. Businesses understand that data is valuable and they are interested in using it to drive decision making. Amazon, Netflix and Target are well known for their use of customer information to drive sales, but they are far from the only ones doing this. This is one of the bigger trends identified recently in the business press. The heads of companies are now looking for their data teams to do more with their data so that they too can have the dream information systems they are reading about.

Total Destruction of the Existing DW is Not Required

While a lot of the time, it might be nice to level everything and start over, that is not always an option. The major reason for this is that the data warehouse environment already in place has a lot of value. You want to add to the value already there, not destroy what you have. Also it would take a long time to recreate the environment and no one is patient enough to wait for that. Alternatively you could expand into areas of new technology as your data grows. Perhaps this mean you archive some of your data from your database to a Hadoop cluster instead of backing up the data in some far off location. This would allow you to use Sqoop to bring the data back when you need it, providing ready access to the data. Perhaps you want to provide the users more self-service BI capabilities, moving the data analysis into the hands of the people who are more familiar with the data? You could add the capabilities of Power View in Excel, Power Designer or Tableau to your environment.

Incorporating Social Media Information

The business world operates not only on a batch cycle. More and more companies want to know what is being said about them so they can respond appropriately. With tools like Azure Event Hubs, Data Factory, Streaming Analytics, and Machine Learning this isn’t as hard to do as it might sound. We’ll review these products so that attendees will understand how these tools can provide greater insight not only into their own data, but the data building about them outside of the company firewall.

For More Information

I really hope you can join me in Huntington Beach on April 10 for a full day of exploring these concepts. I always look forward to events like the precon and of course SQL Saturday #389 – Huntington Beach which is the following day.