Archive for: July 4th, 2016

In this post, we focus on sourcing R and Python’s external dependencies, such as R libraries and Python modules, which are not already installed on Azure ML and require code compilation. Commonly the compiled code comes from a variety of other languages such as C, C++ and Fortran. One could also use this approach to wrap their compiled code with R or Python wrappers and run it on Azure ML.

To illustrate the process, we will build two MurmurHash modules from C++ for R and Python using the following two implementations on GitHub, and link them to Azure ML from a zipped folder

Link via David Smith. I knew it was possible to call compiled C code from Python and R, but didn’t expect to be able to do it within Azure ML, so that’s good to know.

So the takeaway that many DB developers would have you believe is ‘Hadoop Good’, ‘RDBMS Bad’.

But wait. RDBMS EDW hasn’t gone away and won’t. That’s where we keep our single version of the truth, the business data that record legal transactions with customers, suppliers, and employees.We also get strong SLAs, strong fault tolerance, and highly curated data based on strong ETL, provenance, and governance.Those are all things that are missing in our Data Lake.

Anybody who sells you on one technology to solve all problems is shilling snake oil. Bill’s answer is an Adjunct Data Warehouse, which sits separate from the Enterprise Data Warehouse. You go to the EDW when you risk getting fired or going to jail if the data’s wrong; you go to the ADW when you need data not in your EDW, or when you need larger-scale analytics in which it’s okay to be 1% off.

K-Means takes in an unlabeled data set and a whole real number, k. K is the number of centroids, or clusters you wish to find. If you do not know how many clusters there should be, it is possible to do some pre-processing to find that more automatically, however that is out of the scope of this article. Once you have a data set and defined the size of k, K-Means begins its iterative process. It starts by selecting centroids by moving them to the average of the data associated with them. It then reshuffles all of the data into new groups based on the proximity to each centroid.

Row Level security is about applying security on a data row level. For example sales manager of united states, should only see data for United States and not for the Europe. Sales Manager of Europe won’t be able to see sales of Australia or United States. And someone from board of directors can see everything. Row Level Security is a feature that is still in preview mode, and it was available in Power BI service, here I mentioned how to use it in the service. However big limitation that I mentioned in that post was that with every update of the report or data set from Power BI Desktop, or in other words with every publish from Power BI Desktop, the whole row level security will be wiped out. The reason was that Row Level Security wasn’t part of Power BI model. Now in the new version of Power BI Desktop, the security configuration is part of the model, and will be deployed with the model.

This is a great security feature, so I’m happy to see the Power BI team taking it the next step forward and integrating RLS directly into Power BI desktop.

One of the most interesting use cases of Polybase is the ability to store historical data from relational databases into a Hadoop File System. The storage costs could be reduced while keeping the data accessible and still can be joined with the regular relational tables. So let`s do the first steps for our new archiving solution.

Archival is a very good use case for external table insertion, and if you don’t have a Hadoop cluster, you could insert into Azure blob storage.

As you can see below, TV and Video total sales were up $178m vs prior year, yet there was also a decline in sales of $79m caused by lower sell prices. And Cameras and camcorders actually had an increase in sales due to sell price, and that drove the total result higher than it otherwise would have been. Of course there is normally an inverse relationship between price and volume (the lower the price, the more you sell). The trick is to maximise sales (or more correctly margin $).

Think well about this third step: you can save quite a few dollars annually by just keeping a local cache of travel distances, and only querying when the distance is unknown. My ETL process for this part consists of three global steps:

Add unknown departure/destination pairs to the ‘to be queried’ table (PK of this table is start & end point address in less-structured format, ensuring uniqueness of commutes)

Use the local cache of travel distance (as complete as it gets at this moment) as the primary lookup for travel distance

The Google Maps API allows free tier users to make about 1000 requests per day. If you don’t need to pull more than that many data points back (or can queue them to run over the necessary time frame), there’s no marginal cost to calls. Otherwise, it ends up being a few dollars per thousand calls, so that shouldn’t break your company’s budget.

Register only data sources that users interact with. Usually the first priority is to register data sources that the users see-for instance, the reporting database or DW that you want users to go to rather than the original source data. Depending on how you want to use the data catalog, you might also want to register the original source. In that case you probably want to hide it from business users so it’s not confusing. Which leads me to the next tip…

Use security capabilities to hide unnecessary sources. The Standard (paid) version will allow you to have some sources registered but only discoverable by certain users & hidden from other users (i.e., asset level authorization). This is great for sensitive data like HR. It’s also useful for situations when, say, IT wants to document certain data sources that business users don’t access directly.