Issue #39

Aug 21 2014

Editor Picks

Data Carpentry
The New York Times has an article titled For Big-Data Scientists, ‘Janitor Work’ Is Key Hurdle to Insights. Mostly I really like it. What I’m less thrilled about is calling this “janitor work”. For one thing, it’s not particularly respectful of custodians, whose work I really appreciate. But it also mischaracterizes what this type of work is about. I’d like to propose a different analogy that I think fits a lot better: data carpentry...

Supervised Machine Learning: A Review of Classification Techniques
This paper describes various supervised machine learning classification techniques. Of course, a single article cannot be a complete review of all supervised machine learning classification algorithms (also known as induction classification algorithms), yet we hope that the references cited will cover the major theoretical issues, guiding the researcher in interesting research directions and suggesting possible bias combinations that have yet to be explored...

Data Science Articles & Videos

Machine Learning, Heart Rate Variability & Highway Congestion
Using a machine learning algorithm I had determined my baseline for relaxation and stress. When I began, I used 11 features to train the algorithm, which means the algorithm drew on 11 sources of data to try to predict a relaxed or stressed state. The WEKA software can show which of the data sources are the most useful in determining the outcome, and this allowed me to narrow the features to three...
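The narrowing step described above (scoring each data source, then keeping the most useful few) can be sketched in plain Python. This is only an illustration under stated assumptions: it ranks features by absolute Pearson correlation with the label, which is a stand-in and not WEKA's own attribute evaluation, and the feature names and values are hypothetical toy data, not the author's measurements.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no signal
    return cov / math.sqrt(var_x * var_y)


def top_features(columns, labels, k=3):
    """Return the k feature names with the largest |correlation| to the labels."""
    scores = {name: abs(pearson(values, labels)) for name, values in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Hypothetical toy data: label 1 = stressed, 0 = relaxed.
labels = [0, 0, 0, 1, 1, 1]
columns = {
    "heart_rate": [60, 62, 61, 80, 82, 85],
    "hrv": [70, 68, 72, 40, 38, 35],
    "speed": [55, 60, 58, 20, 15, 25],
    "temperature": [21, 22, 21, 22, 21, 22],
    "constant": [1, 1, 1, 1, 1, 1],
}

selected = top_features(columns, labels, k=3)
print(selected)
```

On this toy data the three strongly related columns are kept, while the weakly related and constant columns are dropped. A real analysis would use a held-out set to check that the reduced feature set still predicts well.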

Which cities get the most sleep?
People in Melbourne sleep the most, people in Tokyo sleep the least, and Americans just need more sleep overall. Those are some of the findings from a vast new dataset released to The Wall Street Journal by Jawbone, the makers of the UP, a digitized wristband that tracks how its users move and sleep...

Jobs

Booking.com is looking for an E-Commerce Product Owner with substantial experience in a data science related discipline, preferably recommender engines or other personalization methods. We are looking for someone to extend our already highly impactful teams with industry experience and a different perspective... you will be responsible for creating products and features improving our customers' experience using our data, with a strong focus on driving conversion and customer loyalty...

Training & Resources

Theano Tutorial
Theano is a software package which allows you to write symbolic code and compile it onto different architectures (in particular, CPU and GPU). It's especially good for machine learning techniques which are CPU-intensive and benefit from parallelization (e.g. large neural networks). This tutorial will cover the basic principles of Theano, including some common mental blocks which come up...

Baby steps in Python – Exploratory analysis in Python (using Pandas)
Pandas is one of the most useful data analysis libraries in Python (I know these names sound weird, but hang on!). It has been instrumental in increasing the use of Python in the data science community. In this tutorial, we will use Pandas to read a data set from a Kaggle competition, perform exploratory analysis and build our first basic categorization algorithm for solving this problem...

How-to: Use IPython Notebook with Apache Spark
The developers of Apache Spark have given thoughtful consideration to Python as a language of choice for data analysis. They have developed the PySpark API for working with RDDs in Python, and further support using the powerful IPython shell instead of the built-in Python REPL...

Books

"A terrific book if you are interested in understanding how these algorithms work. The author is superb at explaining the core ideas in clear, understandable terms. You don't need to be a computer geek to follow this book. All you need is a desire to understand. I wish I had had more teachers like this guy when I was in school. I am truly impressed with his ability to explain..."