Data Scientist, Machine learning, R, SAS, Python – Amsterdam (NL)

Post navigation

Introduction

Not long ago I was reading on t-Distributed Stochastic Neighbor Embedding (t-sne), a very interesting dimension reduction technique, and on Mel frequency cepstrum a sound processing technique. Details of both techniques can be found here and here. Can we combine the two in a data analysis exercise? Yes, and with not too much R code you can already quickly create some visuals to get ‘musical’ insights.

Spotify Data

Where can you get some sample audio files? Spotify! There is a Spotify API which allows you to get information on playlists, artists, tracks, etc. Moreover, for many songs (not all though) Spotify provides downloadable preview mp3’s of 30 seconds. The link to the preview mp3 can be retrieved from the API. I am going to use some of these mp3’s for analysis.

In the web interface of Spotify you can look for interesting playlists. In the search field type in for example ‘Bach‘ (my favorite classical composer). In the search results go to the playlists tab, you’ll find many ‘Bach’ playlists from different users, including the ‘user’ Spotify itself. Now, given the user_id (spotify) and the specific playlist_id (37i9dQZF1DWZnzwzLBft6A for the Bach playlist from Spotify) we can extract all the songs using the API:

GET https://api.spotify.com/v1/users/{user_id}/playlists/{playlist_id}

You will get the 50 Bach songs from the playlist, most of them have a preview mp3. Let’s also get the songs from a Heavy Metal play list, and a Michael Jackson play list. In total I have 146 songs with preview mp3’s in three ‘categories’:

Bach,

Heavy Metal,

Michael Jackson.

Transforming audio mp3’s to features

The mp3 files need to be transformed to data that I can use for machine learning, I am going to use the Python librosa package to do this. It is easy to call it from R using the reticulate package.

For sound processing, features extraction on the raw audio signal is often applied first. A commonly used feature extraction method is Mel-Frequency Cepstral Coefficients (MFCC). We can calculate the MFCC for a song with librosa.

Each mp3 is now a matrix of MFC Coefficients as shown in the figure above. We have less data points than the original 661.500 data points but still quit a lot. In our example the MFCC are a 96 by 1292 matrix, so 124.032 values. We apply a the t-sne dimension reduction on the MFCC values.

Calculating t-sne

A simple and easy approach, each matrix is just flattened. So a song becomes a vector of length 124.032. The data set on which we apply t-sne consist of 146 records with 124.032 columns, which we will reduce to 3 columns with the Rtsne package:

tsne_out = Rtsne(AllSongsMFCCMatrix, dims=3)

The output object contains the 3 columns, I have joined it back with the data of the artists and song names so that I can create an interactive 3D scatter plot with R plotly. Below is a screen shot, the interactive one can be found here.

Conclusion

It is obvious that Bach music, heavy metal and Michael Jackson are different, you don’t need machine learning to hear that. So as expected, it turns out that a straight forward dimension reduction on these songs with MFCC and t-sne clearly shows the differences in a 3D space. Some Michael Jackson songs are very close to heavy metal 🙂 The complete R code can be found here.

Introduction

On the 8th, 9th and 10th of December I participated at the IKEA hackaton. In one word it was just FANTASTIC! Well organized, good food, and participants from literally all over the world, even the heavy snow fall on Sunday did not stop us from getting there!

I formed a team with Jos van Dongen and his son Thomas van Dongen and we created the “I-Love-IKEA” app to help customers find IKEA products. And of course using R.

The “I-Love-IKEA” Shiny R app

The idea is simple. Suppose you are in the unfortunate situation that you are not in an IKEA store, and you see a chair, or nice piece of furniture, or something completely else…. Now does IKEA have something similar? Just take a picture, upload it using the I-Love-IKEA R Shinyapp and get the best matching IKEA products back.

Implementation in R

How was this app created? The following steps outline steps that we took during the creation of the web app for the hackaton.

First, we have scraped 9000 IKEA product images using Rvest, then each image is ‘scored’ using a pre-trained VGG16 network, where the top layers are removed.

That means that for each IKEA image we have a 7*7*512 tensor, we flattened this tensor to a 25088 dimensional vector. Putting all these vectors in a matrix we have a 9000 by 25088 matrix.

If you have a new image, we use the same pre-trained VGG16 network to generate a 25088 dimensional vector for the new image. Now we can calculate all the 9000 distances (for example cosine similarity) between this new image and the 9000 IKEA images. We select, say, the top 7 matches.

A few examples

A Shiny web app

To make this useful for an average consumer, we have put it all in an R Shiny app using the library ‘minUI‘ so that the web site is mobile friendly. A few screen shots:

The web app starts with an ‘IKEA-style’ instruction, then it allows you to take a picture with your phone, or use one that you already have on your phone. It uploads the image and searches for the best matching IKEA products.

The R code is available from my GitHub, and a live running Shiny app can be found here.

Conclusions

Obviously, there are still many adjustments you can make to the app to improve the matching. For example pre process the images before they are sent through the VGG network. But there was no more time.

Unfortunately, we did not win a price during the hackaton, the jury did however find our idea very interesting. More importantly, we had a lot of fun. In Dutch “Het waren 3 fantastische dagen!”.

Introduction

Market Basket Analysis or association rules mining can be a very useful technique to gain insights in transactional data sets, and it can be useful for product recommendation. The classical example is data in a supermarket. For each customer we know what the individual products (items) are that he has bought. With association rules mining we can identify items that are frequently bought together. Other use cases for MBA could be web click data, log files, and even questionnaires.

In R there is a package arules to calculate association rules, it makes use of the so-called Apriori algorithm. For data sets that are not too big, calculating rules with arules in R (on a laptop) is not a problem. But when you have very huge data sets, you need to do something else, you can:

use more computing power (or cluster of computing nodes).

use another algorithm, for example FP Growth, which is more scalable. See this blog for some details on Apriori vs. FP Growth.

Or do both of the above points by using FPGrowth in Spark MLlib on a cluster. And the nice thing is: you can stay in your familiar R Studio environment!

Spark MLlib and sparklyr

Example Data set

We use the example groceries transactions data in the arules package. It is not a big data set and you would definitely not need more than a laptop, but it is much more realistic than the example given in the Spark MLlib documentation :-).

Preparing the data

I am a fan of sparklyr 🙂 It offers a good R interface to Spark and MLlib. You can use dplyr syntax to prepare data on Spark, it exposes many of the MLlib machine learning algorithms in a uniform way. Moreover, it is nicely integrated into the RStudio environment offering the user views on Spark data and a way to manage the Spark connection.

First connect to spark and read in the groceries transactional data, and upload the data to Spark. I am just using a local spark install on my Ubuntu laptop.

For demonstration purposes, data is copied in this example from the local R session to Spark. For large data sets this is not feasible anymore, in that case data can come from hive tables (on the cluster).

The figure above shows the products purchased by the first four customers in Spark in an RStudio grid. Although transactional systems will often output the data in this structure, it is not what the FPGrowth model in MLlib expects. It expects the data aggregated by id (customer) and the products inside an array. So there is one more preparation step.

The figure above shows the aggregated data, customer 12, has a list of 9 items that he has purchased.

Running the FPGrowth algorithm

We can now run the FPGrowth algorithm, but there is one more thing. Sparklyr does not expose the FPGrowth algorithm (yet), there is no R interface to the FPGrowth algorithm. Luckily, sparklyr allows the user to invoke the underlying Scala methods in Spark. We can define an new object with invoke_new

And by looking at the Scala documentation of FPGrowth we see that there are more methods that you can use. We need to use the function invoke, to specify which column contains the list of items, to specify the minimum confidence and to specify the minimum support.

By invoking fit, the FPGrowth algorithm is fitted and an FPGrowthModel object is returned where we can invoke associationRules to get the calculated rules in a spark data frame

rules = FPGmodel %>% invoke("associationRules")

The rules in the spark data frame consists of an antecedent column (the left hand side of the rule), a consequent column (the right hand side of the rule) and a column with the confidence of the rule. Note that the antecedent and consequent are lists of items! If needed we can split these lists and collect them to R for plotting for further analysis.

The invoke statements and rules extractions statements can of course be wrapped inside functions to make it more reusable. So given the aggregated transactions in a spark table trx_agg, you can get something like:

Introduction

Recently, Dataiku 4.1.0 was released, it now offers much more support for R users. But wait a minute, Data-what? I guess some of you do not know Dataiku, so what is Dataiku in the first place? It is a collaborative data science platform created to design and run data products at scale. The main themes of the product are:

Collaboration & Orchestration: A data science project often involves a team of people with different skills and interests. To name a few, we have data engineers, data scientists, business analysts, business stakeholders, hardcore coders, R users and Python users. Dataiku provides a platform to accommodate the needs of these different roles to work together on data science projects.

Productivity: Whether you like hard core coding or are more GUI oriented, the platform offers an environment for both. A flow interface can handle most of the steps needed in a data science project, and this can be enriched by Python or R recipes. Moreover, a managed notebook environment is integrated in Dataiku to do whatever you want with code.

Deployment of data science products: As a data scientist you can produce many interesting stuff, i.e. graphs, data transformations, analysis, predictive models. The Dataiku platform facilitates the deployment of these deliverables, so that others (in your organization) can consume them. There are dashboards, web-apps, model API’s, productionized model API’s and data pipelines.

There is a free version which contains already a lot of features to be very useful, and there is an paid version, with “enterprise features“. See for the Dataiku website for more info.

Improved R Support in 4.1.0

Among many new features, and the one that interests me the most as an R user, is the improved support for R. In previous versions of Dataiku there was already some support for R, this version has the following improvements. There is now support for:

R Code environments

In Dataiku you can now create so-called code environments for R (and Python). A code environment is a standalone and self-contained environment to run R code. Each environment can have its own set of packages (and specific versions of packages). Dataiku provides a handy GUI to manage different code environments. The figure below shows an example code environment with specific packages.

In Dataiku whenever you make use of R –> in R recipes, Shiny, R Markdown or creating R API’s you can select a specific R code environment to use.

R Markdown reports & Shiny applications

If you are working in RStudio you most likely already know R Markdown documents and Shiny applications. In this version, you can also create them in Dataiku. Now, why would you do that and not just create them in RStudio? Well, the reports and shiny apps become part of the Dataiku environment and so:

They are managed in the environment. You will have a good overview of all reports and apps and see who has created/edited them.

You can make use of all data that is already available in the Dataiku environment.

Moreover, the resulting reports and Shiny apps can be embedded inside Dataiku dashboards.

The figure above shows a R markdown report in Dataiku, the interface provides a nice way to edit the report, alter settings and publish the report. Below is an example dashboard in Dataiku with a markdown and Shiny report.

Creating R API’s

Once you created an analytical model in R, you want to deploy it and make use of its predictions. With Dataiku you can now easily expose R prediction models as an API. In fact, you can expose any R function as an API. The Dataiku GUI provides an environment where you can easily set up and test an R API’s. Moreover, The Dataiku API Node, which can be installed on a (separate) production server imports the R models that you have created in the GUI and can take care of load balancing, high availability and scaling of real-time scoring.

The following three figures give you an overview of how easy it is to work with the R API functionality.

First, define an API endpoint and R (prediction) function.

Then, define the R function, it can make use of data in Dataiku, R objects created earlier or any R library you need.

Finally, once you are satisfied with the R API you can make a package of the API, that package can then be imported on a production server with Dataiku API node installed. From which you can then serve API requests.

Conclusion

The new Dataiku 4.1.0 version has a lot to offer for anyone involved in a data science project. The system already has a wide range support for Python, but now with the improved support for R, the system is even more useful to a very large group of data scientists.

Introduction

I often see advertisement for The Bold and The Beautiful, I have never watched a single episode of the series. Still, even as a data scientist you might be wondering how these beautiful ladies and gentlemen from the show are related to each other. I do not have the time to watch all these episodes to find out, so I am going to use word embeddings on recaps instead…

Calculating word embeddings

First, we need some data, from the first few google hits I got to the site soap central. Recaps can be found from the show that date back to 1997. Then, I used a little bit of rvest code to scrape the daily recaps into an R data set.

Word embedding is a technique to transform a word onto a vector of numbers, there are several approaches to do this. I have used the so-called Global Vector word embedding. See here for details, it makes use of word co-occurrences that are determined from a (large) collection of documents and there is fast implementation in the R text2vec package.

Once words are transformed to vectors, you can calculate distances (similarities) between the words, for a specific word you can calculate the top 10 closest words for example. More over linguistic regularities can be determined, for example:

amsterdam - netherlands + germany

would result in a vector that would be close to the vector for berlin.

Results for The B&B recaps

It takes about an hour on my laptop to determine the word vectors (length 250) from 3645 B&B recaps (15 seasons). After removing some common stop words, I have 10.293 unique words, text2vec puts the embeddings in a matrix (10.293 by 250).

The following figure shows some other B&B characters and their closest matches.

If you want to see the top n characters for other B&B characters use my little shiny app. The R code for scraping B&B recaps, calculating glove word-embeddings and a small shiny app can be found on my Git Hub.

Conclusion

This is a Mickey Mouse use case, but it might be handy if you are in the train and hear people next to you talking about the B&B, you can join their conversation. Especially if you have had a look at my B&B shiny app……

Introduction

Two things that recently came to my attention were AutoML(Automatic Machine Learning) by h2o.ai and thefashion MNIST by Zalando Research. So as a test, I ran AutoML on the fashion mnist data set.

H2o AutoML

As you all know a large part of the work in predictive modeling is in preparing the data. But once you have done that, ideally you don’t want to spend too much work in trying many different machine learning models. That’s were AutoML from h2o.ai comes in. With one function call you automate the process of training a large, diverse, selection of candidate models.

AutoML trains and cross-validates a Random Forest, an Extremely-Randomized Forest, GLM’s, Gradient Boosting Machines (GBMs) and Neural Nets. And then as “bonus” it trains a Stacked Ensemble using all of the models. The function to use in the h2o R interface is: h2o.automl. (There is also a python interface)

So the first 784 columns in the data set are used as inputs and column 785 is the column with labels. There are more input arguments that you can use. For example, maximum running time or maximum number of models to use, a stopping metric.

It can take some time to run all these models, so I have spun up a so-called high CPU droplet on Digital Ocean: 32 dedicated cores ($0.92 /h).

h2o utilizing all 32 cores to create models

The output in R is an object containing the models and a ‘leaderboard‘ ranking the different models. I have the following accuracies on the fashion mnist test set.

Gradient Boosting (0.90)

Deep learning (0.89)

Random forests (0.89)

Extremely randomized forests (0.88)

GLM (0.86)

There is no ensemble model, because it’s not supported yet for multi label classifiers. The deeplearning in h2o are fully connected hidden layers, for this specific Zalando images data set, you’re better of pursuing more fancy convolutional neural networks. As a comparison I just ran a simple 2 layer CNN with keras, resulting in an test accuracy of 0.92. It outperforms all the models here!

Conclusion

If you have prepared your modeling data set, the first thing you can always do now is to run h2o.automl.

Introduction

My first car was a 13 year Mitsubishi Colt, I paid 3000 Dutch Guilders for it. I can still remember a friend that would not like me to park this car in front of his house because of possible oil leakage.

Can you get an idea of which cars will likely to leak oil? Well, with open car data from the Dutch RDW you can. RDW is the Netherlands Vehicle Authority in the mobility chain.

RDW Data

There are many data sets that you can download. I have used the following:

Observed Defects. This set contains 22 mln. records on observed defects at car level (license plate number). Cars in The Netherlands have to be checked yearly, and the findings of each check are submitted to RDW.

Basic car details. This set contains 9 mln. records, they are all the cars in the Netherlands, license plate number, brand, make, weight and type of car.

Defects code. This little table provides a description of all the possible defect codes. So I know that code ‘RA02’ in the observed defects data set represents ‘oil leakage’.

Simple Analysis in R

I have imported the data in R and with some simple dplyr statements I have determined per car make and age (in years) the number of cars with an observed oil leakage defect. Then I have determined how many cars there are per make and age, then dividing those two numbers will result in a so called oil leak percentage.

For example, in the Netherlands there are 2043 Opel Astra’s that are four years old, there are three observed with an oil leak, so we have an oil leak percentage of 0.15%.

The graph below shows the oil leak percentages for different car brands and ages. Obviously, the older the car the higher the leak percentage. But look at BMW: waaauwww those old BMW’s are leaking like oil crazy… 🙂 The few lines of R code can be found here.

Conclusion

There is a lot in the open car data from RDW, you can look at much more aspects / defects of cars. Regarding my old car that i had, according to this data Mitsubishi’s have a low oil leak percentage, even older ones.