
Distilled News

Did you know that the XGBoost algorithm is one of the most popular winning recipes in data science competitions? So, what makes it more powerful than a traditional Random Forest or Neural Network? In broad terms, it’s the efficiency, accuracy and feasibility of this algorithm (I’ve discussed this part in detail below). In the last few years, predictive modeling has become much faster and more accurate. I remember spending long hours on feature engineering to improve a model by a few decimals. A lot of that difficult work can now be done by using better algorithms. Technically, “XGBoost” is short for Extreme Gradient Boosting. It gained popularity in data science after the famous Kaggle competition called the Otto Classification challenge. The latest implementation of “xgboost” in R was launched in August 2015. We will refer to this version (0.4-2) in this post. In this article, I’ve explained a simple approach to using xgboost in R. So, the next time you build a model, do consider this algorithm. I’m sure it would be a moment of shock and then happiness!

Natural Language Processing (NLP) is a vast area of Computer Science that is concerned with the interaction between Computers and Human Language[1]. Within NLP, many tasks are, or can be reformulated as, classification tasks. In a classification task we try to produce a classification function that captures the correlation between a certain ‘feature’ D and a class C. This classifier first has to be trained on a training dataset, and only then can it be used to classify documents. Training means determining its model parameters. If the set of training examples is chosen well (that is, it is representative of the documents to be classified), the classifier should predict the class probabilities of the actual documents with an accuracy similar to the one it achieves on the training examples.
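The train-then-classify cycle described above can be sketched in a few lines of Python. The excerpt does not say which classifier the original article uses, so this sketch assumes a multinomial Naive Bayes model with add-one smoothing. The function names `train` and `predict_proba` and the four toy documents are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter, defaultdict


def train(docs):
    """Estimate the model parameters from (text, label) pairs.

    The "parameters" here are simple counts: documents per class
    and word occurrences per class.
    """
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in docs:
        class_counts[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocab.add(word)
    return class_counts, word_counts, vocab


def predict_proba(model, text):
    """Return P(class | text) for every class seen during training."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    log_scores = {}
    for label, n_docs in class_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(n_docs / total_docs)  # log prior of the class
        for word in text.lower().split():
            if word not in vocab:
                continue  # skip words never seen in training
            # Add-one (Laplace) smoothing avoids zero probabilities.
            count = word_counts[label][word] + 1
            score += math.log(count / (total_words + len(vocab)))
        log_scores[label] = score
    # Turn log scores into probabilities (shift by the max for stability).
    top = max(log_scores.values())
    exp_scores = {c: math.exp(s - top) for c, s in log_scores.items()}
    norm = sum(exp_scores.values())
    return {c: v / norm for c, v in exp_scores.items()}


# Hypothetical toy training set, not taken from the article.
TRAINING_DOCS = [
    ("cheap pills buy now", "spam"),
    ("buy cheap watches", "spam"),
    ("meeting agenda for monday", "ham"),
    ("project meeting notes", "ham"),
]

model = train(TRAINING_DOCS)
probs = predict_proba(model, "buy cheap pills")
predicted = max(probs, key=probs.get)
```

With such a tiny training set the probabilities only show the mechanics. How well they carry over to real documents depends on how representative the training examples are.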

It’s that time of year that I review everything that I’ve written over the past year and share my favorite blogs. As many of you know, I travel frequently and because I’ve continuously seen every airline movie, I have plenty of time to write. And according to the commentary, every now and then I have a good one. So here are my Top 10 Blogs from 2015!

In multiple SF Text meetups and Text By the Bay presentations, modern NLP reveals itself as a means to extract actionable knowledge from the world’s Internet data and user data in a specific context, derive human intent, and follow up on it. This used to be called AI. Machine Learning (ML) is much more popular now that it has been rebranded as Deep Learning, which is when it really went mainstream. Carlos Guestrin, the CEO and founder of Dato (GraphLab), keynoted the Data Summit 2015 with a strong prediction: all apps will be ML-enabled. ML will be a commodity, and each ML user or provider will need to be able to package it as (micro)services and then aggregate and consume them. Consumer applications will need to know about the real world and context. Google has its Knowledge Graph, LinkedIn has its Economic Graph. Knowledge bases and providers such as Factual can be used by startups, but the key problem is not yet fully realized.

What does this have to do with parallel computation? Briefly, the code generates 5,000 standard normal random variates, repeats this 5,000 times and stores the results in a 5,000 x 5,000 matrix (‘x’). Then it computes the matrix product x x’. The second part is key, because it involves a matrix multiplication.
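The code that this excerpt refers to is not reproduced here, so the following is only a sketch of the two steps in plain Python. It assumes that x’ denotes the transpose of x. The helper names `random_normal_matrix` and `times_own_transpose` are hypothetical, and the matrix size is reduced from 5,000 to 50 because pure Python loops are slow.

```python
import random


def random_normal_matrix(n, seed=None):
    """Build an n x n matrix (list of rows) of standard normal variates."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(n)]


def times_own_transpose(x):
    """Compute x x' (assumed: x times its transpose).

    Entry (i, j) of the result is the dot product of rows i and j of x.
    """
    n = len(x)
    result = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):  # the product is symmetric: fill both halves
            dot = sum(a * b for a, b in zip(x[i], x[j]))
            result[i][j] = dot
            result[j][i] = dot
    return result


N = 50  # the text uses 5,000; kept small here for illustration
x = random_normal_matrix(N, seed=1)
xxt = times_own_transpose(x)
```

Each entry of the product is an independent dot product, so the work can be split across cores. That independence is what makes the multiplication step a natural target for parallel or multithreaded linear algebra routines.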

There are many benefits to teaching undergraduate statistics with R, especially in the RStudio environment, but it must be admitted that the learning curve is fairly steep, especially when it comes to tinkering with plots to get them to look just the way one wants. If there were ever a situation in which I would prefer that the students have access to a graphical user interface, production of plots would be it. You can, of course, write Shiny apps like this one, where the user controls features of the graphs through various input widgets. But then the user must visit the remote site, and if he or she wishes to build a graph from a data frame not supplied by the app, then the app has to deal with thorny issues surrounding the uploading and processing of .csv files, and in the end the user still has to copy and paste the relevant graph-making code back to wherever it is needed.

I’ll be running an R course soon and I am looking for fun (public) datasets to use in data manipulation and visualization. I would like to use a single dataset that has some easy variables for the first days, but also some more challenging ones for the final days. And when I set exercises, I want the students* to be curious about finding out the answer.