Issue #25

May 15, 2014

Editor Picks

An article this week proclaimed, much to the data science community’s chagrin, that “most of a data scientist’s time is spent creating predictive models.” Forget about cleaning data, doing historical analyses that go into basic reports, etc. Apparently, the core job is predictive modeling. I fear for the company who hires any data scientist that believes that... [they risk not asking] one of the most important predictive modeling questions of all: Do I really need to build this model? Can I do something simpler?...

This week’s Spotlight is on Dr. Dan Ciresan, a senior researcher at IDSIA in Switzerland and a pioneer in using CUDA for Deep Neural Networks (DNNs). His methods have won international competitions on topics such as classifying traffic signs and recognizing handwritten Chinese characters. The following is an excerpt from our interview...

Data Science Articles & Videos

Building a Business around Machine Learning APIs
I got a variety of reactions on Twitter following my GigaOM piece on how Data Scientists work at automating themselves. One of them I want to discuss today is about building businesses on top / around Prediction APIs such as Google's or BigML's (a.k.a. machine learning APIs)...

Large-Scale Machine Learning with Apache Spark
Spark is a new cluster computing engine that is rapidly gaining popularity; with over 150 contributors in the past year, it is one of the most active open source projects in big data, surpassing even Hadoop MapReduce. In this talk, we’ll introduce Spark and show how to use it to build fast, end-to-end machine learning workflows...

Data Science Anti-Pattern: The SQLoppelganger
Data scientists, attention! The time has come to call out one of the egregious anti-patterns of data science. I call it… the SQLoppelganger. Definition: A SQLoppelganger is a database query (or other analytics code) that reproduces business logic that already exists somewhere else...

Predicting Stock Swings with PsychSignal, Quandl and BigML
People like to tweet about stocks, so much so that ticker symbols get their own special dollar sign like $AAPL or $FB. What if you could mine this data for insight into public sentiment about these stocks? Even better, what if you could use this data to predict activity in the stock market?...
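As a rough illustration of the first step described above, pulling ticker symbols out of tweet text, here is a minimal sketch. The regular expression (a dollar sign followed by one to five letters), the function name `count_cashtags`, and the sample tweets are all assumptions made for this example; they are not taken from the linked article or from PsychSignal's data.

```python
import re
from collections import Counter

# Assumed cashtag shape: "$" followed by 1-5 letters, not glued to a preceding word.
# Real ticker formats vary (e.g. share classes), so treat this as a simplification.
CASHTAG = re.compile(r"(?<![\w$])\$([A-Za-z]{1,5})\b")

def count_cashtags(tweets):
    """Count how often each ticker symbol is mentioned across the tweets."""
    counts = Counter()
    for text in tweets:
        counts.update(symbol.upper() for symbol in CASHTAG.findall(text))
    return counts

# Hypothetical sample data for illustration only.
sample_tweets = [
    "Long $AAPL into earnings",
    "$FB and $aapl both look strong",
    "Lunch cost me $12 today",  # a price, not a ticker
]
cashtag_counts = count_cashtags(sample_tweets)
```

Mention counts like these would only be a starting feature; sentiment scoring and any predictive model would come on top of them.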

Robust Regression and Outlier Detection via Gaussian Processes
In the last post, I showed that after removing the outliers, one can run a linear regression on the remaining data, which is called robust linear regression. However, instead of detecting the outliers and then fitting the regression model, we can do better: choose a model that is robust to outliers and flexible enough to capture the main signal while excluding the outliers...

Garbage In, Garbage Out: How Anomalies Can Wreck Your Data
Flawed census data is used every year to build scientific models, do in-depth analysis, and even make large-scale policy decisions. If the data backing up a model is wildly inaccurate, then our model is useless. That is: “garbage in, garbage out.” This incident is an example of a wider issue in data analysis: anomalous data, or data that contains errors. Let’s look at a couple more examples, and how data visualization can catch these errors....

Consistency of Random Forests
Random forests are a learning algorithm proposed by Breiman (2001) which combines several randomized decision trees and aggregates their predictions by averaging. In the present paper, we take a step forward in forest exploration by proving a consistency result for Breiman's (2001) original algorithm in the context of additive regression models. Our analysis also sheds an interesting light on how random forests can nicely adapt to sparsity in high-dimensional settings...
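The averaging idea in the abstract above can be sketched in a few lines. This is a toy under stated assumptions, not Breiman's algorithm and not the paper's setting: each "tree" is a one-split stump with a randomly chosen threshold, fitted on a bootstrap sample of one-dimensional data, and all function names are made up for the example.

```python
import random
from statistics import mean

def fit_stump(xs, ys, rng):
    """Fit a depth-1 randomized tree: split at a randomly chosen observed x."""
    threshold = rng.choice(xs)
    left = [y for x, y in zip(xs, ys) if x <= threshold]
    right = [y for x, y in zip(xs, ys) if x > threshold]
    overall = mean(ys)
    # The left side always contains the threshold point; the right may be empty.
    left_value = mean(left) if left else overall
    right_value = mean(right) if right else overall
    return threshold, left_value, right_value

def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return left_value if x <= threshold else right_value

def fit_forest(xs, ys, n_trees=50, seed=0):
    """Fit each stump on its own bootstrap sample (sampling with replacement)."""
    rng = random.Random(seed)
    n = len(xs)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        forest.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx], rng))
    return forest

def predict_forest(forest, x):
    """Aggregate the individual tree predictions by averaging."""
    return mean(predict_stump(stump, x) for stump in forest)

# Hypothetical step-shaped data: y jumps from 0 to 10 at x = 5.
xs = list(range(10))
ys = [0.0] * 5 + [10.0] * 5
forest = fit_forest(xs, ys)
low, high = predict_forest(forest, 0), predict_forest(forest, 9)
```

Each stump alone is a crude estimate, but the average over many randomized stumps follows the step in the data more smoothly.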

Jobs

We’re looking for a Sr. Software Engineer for our Personalization and Machine Learning team, with experience using machine learning algorithms and techniques for use and content modeling, content recommendation, as well as experience gathering and analyzing data from disparate sources...

Training & Resources

The LION Community
The LIONcommunity page contains mixed materials about machine learning and optimization made available by our lab and by a growing community of active researchers and users, in particular slides related to the LIONbook, usage cases in selected application areas, tutorial movies, etc...

SQL Server Analysis Services Neural Network Data Mining Algorithm
In data mining and machine learning circles, the neural network is one of the most difficult algorithms to explain. Fortunately, SQL Server Analysis Services allows for a simple implementation of the algorithm for data analytics. Check out this tip to learn more...

What is the data science industry? The Data Analytics handbook was created to inform students and young professionals and answer this question. Hear from over 30 data scientists, data analysts, CEOs, and academics from Facebook, LinkedIn, Yelp, Cloudera, and many more!...

Books

"This book is a treasure trove of intuitive, practical, and brilliant mathematical techniques. Every person with an interest in mathematics, science, or engineering will enjoy this highly stimulating and fun book."