Category Archives

Another great resource for learning about Power BI is the course on edX: Analyzing and Visualizing Data with Power BI. Granted, it has been around for a while, but I forgot to blog about it; maybe it is a bit easier to find now.


All this talk about Big Data and Advanced Analytics is all well and good; in fact, it is what I spend most of my time on. The technology is there and has great potential. The biggest question now is how to use these technologies to their full extent and maximize their benefits for your organization. The answer lies in becoming a data-driven organization.

A data-driven organization is an organization that breathes data, not only in the sense of producing data, but also in the sense of analyzing, consuming and really understanding data, both its own and the data others can provide. In order to become a data-driven organization, you will need to change People, Process and Technology. There is enough talk about the Technology in the market already (and on this blog), so I will come back to that later and not go into much detail now. Let’s look at the other two: People and Process. I view Process as very much related to People: bringing in new skills without the proper Process in place for how to work with them, and for the new People to work together, will not be very useful.

So, what People do you need? In other words: what roles do you need in a data-driven organization? I see four required roles in any organization that wants to be more data-driven. This is not to say that these four roles should be four different people; it is very well possible that someone might take on more than one role. I am, however, confident that very few people will be able to do all four roles, since each requires specific skills, focus and passion.

The four roles are: Wrangler, Scientist, Artist and Communicator. Let’s look at the four roles in more detail.

Wrangler

The Wrangler, or data wrangler as others call this role, is responsible for identifying, qualifying and providing access to data sets. In this sense the data is the wild horse that the wrangler tames. This role is needed so that the Scientist can work with qualified, trustworthy and managed data sources. In many situations, this looks a lot like the data management roles already present in organizations. This role lives mostly in IT. Keywords here are databases, connection strings, Hadoop, protocols, file formats, data quality, master data management, data classification.

Scientist

More popularly called the Data Scientist. A lot of people seem to believe that as long as you hire a Data Scientist you are a data-driven company. This is much the same as saying that if you have Hadoop you ‘do Big Data’. This is about as smart as saying that if you got your driver’s license you make an excellent Formula 1 driver. It is just not true, sorry. Note also that the opposite applies: if you are a great Formula 1 driver you could be a very bad driver on open roads. Running Hadoop does not mean you use Big Data. Hiring a Data Scientist does not mean you are a data-driven company.

A Scientist is someone who applies maths, a lot of maths, to convert data into information. He or she applies statistical models and things like deep learning, data mining and machine learning to make this happen. Scientists are the rock stars of this data-focused world since they are the ones actually making the magic happen. However, they cannot do it alone. They need good-quality, trustworthy data, which is what the Wrangler supplies. Also, these Scientists happen to be ill-understood by the rest of the organization. Try this experiment: have your (Data) Scientist stick around the water cooler for 15 minutes every day and let him / her talk to people (I know, for some this is hard already). Then, check how quickly the person the Scientist is talking to disconnects. My experience is that someone who is not a fellow Scientist or Communicator will not make it for 15 minutes. Just try it, you will see what I mean.

Artist

The Artist role converts the information the Scientist brewed up into insight that the consumers can understand and use. This role focusses on esthetics and on choosing the data visualization that best brings the message across. While the Wrangler is a very IT focused role and the Scientist is very mathematical, the Artist often comes from the creative arts world. The Artist just loves making things understandable and loves making the world a better place by creating beautiful things, such as great looking reports and dashboards. They often employ storytelling and other powerful visual methods such as infographics to convey their message to the consumers.

Communicator

The last role in data-driven organizations is a chameleon. If you look at the types of people in the Wrangler, Scientist and Artist roles, it is clear that these are very different people, with different backgrounds and different passions. Just as some of them find it hard to talk to the rest of the organization, they can find it difficult to talk among themselves and work together. In order to make sure there is no communications breakdown, many organizations invest in a Communicator: someone who has enough understanding of the passion of the people in the other roles to be able to level with them, understand their needs and explain the needs of others to them. Subtypes of the Communicator are the Wrangler-Scientist communicator and the Scientist-Artist communicator.

This concludes the roles I see in a data-driven organization; of course these roles will need to be supported with the right Processes and Technology. Having a Technology platform instead of disparate tools will help you to achieve this and make the most of the investments you are making in these roles.


Many customers have asked me questions about Azure Machine Learning (Microsoft’s fully managed machine learning and data mining solution) and more specifically about its pricing. In this post I will try to explain how the pricing works and what components you need to be aware of.

Azure Machine Learning is offered in two tiers: Free and Standard. The Free tier is obviously, well, free. It is, however, as you would expect, limited compared to Standard. Differences are mostly in performance (multiple nodes for execution in Standard vs. just one node in Free) and storage (10 GB in Free, unlimited in Standard). There is no SLA for the Free tier; you cannot set up a production Web API to automate experiments in Free, and the staging web API is throttled.

For the standard tier, the following items need to be taken into consideration:

Seat; Azure ML has a monthly fee per seat, which translates to a user (mostly your data scientist) using the Azure ML web interface to develop and tune experiments. This price is per month per subscription/seat.

Studio usage; This is an hourly price for running experiments. You will pay this according to the number of hours your experiments run and thus claim computing resources.

API Usage; Azure ML allows you to bring an experiment online through the use of RESTful web services. This means you can automate scoring and training and have applications, websites, etc. use the experiment without human intervention. With this you could do automated credit scoring, recommendation or churn prediction directly from your app or website. In order to make this work you will need to create a web service in Azure ML (also called an API). Azure ML charges per hour for compute used by an API that is in production, so that is the fee you will need to pay per hour the web service / API is ‘online’ and usable. Also, you will need to pay per 1000 transactions. Transactions in this case are interactions with the API, such as one recommendation, one churn prediction or one credit score.
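To see how these items add up, here is a minimal sketch of a monthly cost estimate. The rates in the example are made-up placeholders, not actual Azure ML prices, and the class and function names are hypothetical. I also assume transactions are charged pro rata per 1000, which may not match how they are actually billed.

```python
from dataclasses import dataclass


@dataclass
class StandardTierRates:
    """Placeholder rates for the Standard tier; NOT real Azure ML prices."""
    seat_per_month: float             # fee per seat per month
    studio_per_hour: float            # fee per hour of experiment run time
    api_compute_per_hour: float       # fee per hour a production API is online
    api_per_1000_transactions: float  # fee per 1000 API transactions


def monthly_cost(rates, seats, studio_hours, api_hours, api_transactions):
    """Sum the pricing items described above for one month."""
    seat_cost = rates.seat_per_month * seats
    studio_cost = rates.studio_per_hour * studio_hours
    api_compute_cost = rates.api_compute_per_hour * api_hours
    # Assumption: transactions are charged pro rata, not per started block of 1000.
    api_transaction_cost = rates.api_per_1000_transactions * api_transactions / 1000
    return seat_cost + studio_cost + api_compute_cost + api_transaction_cost


# Made-up rates, for illustration only.
example_rates = StandardTierRates(
    seat_per_month=10.0,
    studio_per_hour=1.0,
    api_compute_per_hour=2.0,
    api_per_1000_transactions=0.5,
)

total = monthly_cost(
    example_rates,
    seats=2,
    studio_hours=30,
    api_hours=100,
    api_transactions=50_000,
)
print(total)
```

Plug in the rates from the official pricing page to get a real estimate; the point is only that seats, Studio hours, API compute hours and API transactions are four separate items on the bill.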

Hope this clarifies a bit. Please refer to the official page linked above for more details and for the pricing details.


With the big news of the Power BI and Cortana integration I could not wait until next week to publish this short video of me demo-ing this cool technology! In the video I ask Cortana a couple of questions on stats from a part of my blog that I record using Google Analytics. How cool is that? This shows the unique ability of Microsoft to integrate a BI technology such as Power BI with Windows to make it very easy for users to get the information they need, when they need it, where they need it. Do you speak BI? Great stuff, don’t you think?


The pricing table on the Power BI website does a good job of explaining when a free account is acceptable and when a pro account is required. However, it does not explain everything, nor is it really clear (in my opinion). So, after some digging I came up with this: a step-wise wizard that helps you determine if you can use a free account for Power BI or if you need pro (below); simply answer a series of Yes/No questions and you will know if you can use free or really need pro. Please note that this is not an official communication and that I am by no means responsible for any errors. Use this at your own risk.


Azure Data Factory provides a great number of data processing activities out of the box (for example running Hive or Pig scripts on Hadoop / HDInsight).

In many cases though, you just need to run an activity that you have already built or know how to build in .NET. So, how would you go about that? Would you need to convert all those items to Hive scripts?

Actually, no. Enter Custom .NET activities. Using these you can run a .NET library on Azure Batch or HDInsight (whichever you like) and make it part of your Data Factory pipeline. I prefer using Batch since it provides more auto-scaling options, is cheaper and makes more sense to me in general; I mean, why run .NET code on an HDInsight service that runs Hive and Pig? It feels weird. However, if you already have HDInsight running and prefer to minimize the number of components to manage, choosing HDInsight might make more sense than using Batch.

Switching from Batch to HDInsight means changing the LinkedServiceName for the activity to point to your HDInsight or HDInsight on-demand cluster.
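As a rough illustration of how small that change is, here is a sketch in Python that treats the activity definition as a plain dictionary and swaps the linked service. All names and values are hypothetical, and the property layout (linkedServiceName, typeProperties) follows my recollection of the ADF JSON format, so treat it as an assumption and check it against your own pipeline definition.

```python
import copy


def switch_linked_service(activity, linked_service_name):
    """Return a copy of the activity definition that points to another linked service."""
    updated = copy.deepcopy(activity)
    updated["linkedServiceName"] = linked_service_name
    return updated


# Hypothetical custom .NET activity definition; every value here is illustrative.
custom_activity = {
    "name": "MyDotNetActivity",
    "type": "DotNetActivity",
    "linkedServiceName": "AzureBatchLinkedService",
    "typeProperties": {
        "assemblyName": "MyActivity.dll",        # assumed property name
        "entryPoint": "MyNamespace.MyActivity",  # assumed property name
    },
}

# Point the same activity at an HDInsight cluster instead of Batch.
on_hdinsight = switch_linked_service(custom_activity, "HDInsightLinkedService")
print(on_hdinsight["linkedServiceName"])
```

Nothing else in the activity changes: the assembly and entry point stay the same, only the compute it runs on is different.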

Tables are passed to the .NET activity using a connection string, so essentially if you have both input and output tables defined as blob storage items, your custom assembly will get a connection string to the blob storage items, read the input files, do its processing and write the output files before passing on the control to ADF.

Using this framework the sky is the limit: anything you can run in .NET can now be part of your ADF processing pipeline…pretty cool!


A little while ago an R package for AzureML was released, which enables R users to interface with Azure Machine Learning (Azure ML). Specifically, it enables you to easily use one of the coolest features of Azure ML: publishing and consuming algorithms / experiments as web services.


The First Look series focusses on new products, recent announcements, previews or things I have not had the time to provide a first look at and serves as introduction to the subject. First look posts are fairly short and high level.

Cortana Analytics Suite is Microsoft’s connecting and integrating suite of products for Big Data and Advanced Analytics. It combines a number of technologies Microsoft had before into one suite and adds new, ready to use capabilities for business solutions such as churn analysis.


The First Look series focusses on new products, recent announcements, previews or things I have not had the time to provide a first look at and serves as introduction to the subject. First look posts are fairly short and high level.

Azure Data Catalog is a service, now in public preview, that provides a one-stop access layer to data sources; it abstracts away specifics of accessing data that depend on where and how the data is stored, such as server names, protocols, ports, etc. It includes easy to use search and publishing tools, so business and IT can collaborate on providing a general, easy to use data access layer to all employees.

For more info on Azure Data Catalog see: http://azure.microsoft.com/en-us/services/data-catalog/