Issue #16

March 13, 2014

Editor Picks

Yesterday a journalist asked me to comment on Vincent Granville's post about the $30/hr data scientist for hire on Elance. What started as a quick reply in an email spiraled a bit, so I figured I'd post the entire reply here to get your thoughts in the comments...

In Speeding up isotonic regression in scikit-learn, we dropped down into Cython to improve the performance of a regression algorithm. I thought it would be interesting to compare the performance of this (optimized) code in Python against the naive Julia implementation...

Predicting ad click-through rates (CTR) is a massive-scale learning problem that is central to the multi-billion dollar online advertising industry. We present a selection of case studies and topics drawn from recent experiments in the setting of a deployed CTR prediction system...

Data Science Articles & Videos

Driving the Future of Smart Cities - How to Beat the Traffic
Pivotal’s Data Science Team has developed several innovative methods to analyze traffic flow information harvested from real-time and in-car data sources including GPS. These methods by themselves are highly useful for predicting future traffic conditions and dissecting traffic data. We will describe how we created these algorithms and show different interesting results...

"How do I become a Data Scientist?"
I got an email recently asking something along these lines: "I'm a smart ex-engineer who likes stats. I want to be a data scientist. How difficult will it be for me to find a job doing data science work at a startup?". I sent back an email which looked more or less like the following post...

Apache Spark: 3 Real-World Use Cases
The Hadoop processing engine Spark has risen to become one of the hottest big data technologies in a short amount of time. And while Spark has been a Top-Level Project at the Apache Software Foundation for barely a week, the technology has already proven itself in the production systems of early adopters, including Conviva, ClearStory, and Yahoo...

San Francisco Neighborhood Recommender
At the end of my time at Zipfian Academy (an intensive data-science program), we were given two weeks to design and implement a data-science project... I decided to build a neighborhood recommender for San Francisco...

Redlining for the 21st Century
Using personal information gathered about you on the Internet to provide better choice is very different from using the same information to control your behavior. The former is a service to the consumer. The latter is exploitation. I call the use of big data to exploit consumers “personal redlining.”...

How to Nab a Job using LinkedIn's "Who's Viewed Your Profile"
If you look at the right side of your LinkedIn profile, you'll see an intriguing text box: Who's Viewed Your Profile, the networking equivalent of catching someone checking you out on the subway. And if you know how to use the feature right, it can land you business or a job...

Team Chemistry is the New Holy Grail of Performance Analytics
“Makes teams better” is fast-becoming both an essential ingredient to getting hired and a mission-critical skill-set worth measuring. After spending the weekend at the annual MIT Sloan Sports Analytics Conference, “quantifying chemistry”—identifying those talents, attributes and combinatorial skills that make a team play so much better than a group of talented individuals—has clearly become the new Holy Grail of sports analytics...

Parallelism in One Line
Python has a terrible rep when it comes to its parallel processing capabilities. Ignoring the standard arguments about its threads and the GIL (which are mostly valid), the real problem I see with parallelism in Python isn't a technical one, but a pedagogical one...

Top Ten Reasons To "Kaggle"
Kaggle has been a tremendous learning experience to expand my depth and breadth of knowledge. Here's why...

Jobs

Big data at Nike comes with big opportunities for innovation. Nike is turning to big data and Hadoop to help better understand customers, improve marketing efforts and fine-tune their data-driven strategies. As a Data Engineering Lead you will be working with our highly motivated application engineers and data scientists, working in a small agile group to solve sophisticated and high impact problems...

Training & Resources

Some Useful Machine Learning Libraries. With the advent of many different and intricate Machine Learning algorithms, it is very hard to write your own code for every problem. Choosing a suitable library before you start the project is therefore imperative...

This is a collection of 120 real data science interview questions, covering a wide range of questions you might face when interviewing for a data science position. It was made by individuals interviewing for data science positions, with contributions also from data scientists...

Data Science 101: Deep Learning Methods and Applications
Microsoft Research is a hotbed of data science and machine learning research. A recent publication is available for download (PDF): “Deep Learning: Methods and Applications”... The 134-page book aims to provide an overview of general deep learning methodology and its applications to a variety of signal and information processing tasks...

Whenever I have a classification task with lots of data and lots of features, I love throwing Vowpal Wabbit or VW at the problem. Unfortunately, I find the array of command-line options in vw very intimidating. The GitHub wiki is really good, but the information you need to be productive is scattered all over the place. This is my attempt to put everything you need in one place...

P.S. Did you enjoy the newsletter? Do you have friends/colleagues who might like it too? If so, please forward it along - we would love to have them onboard :)
