Issue #38

Aug 14 2014

Editor Picks

So You Wanna Try Deep Learning?
I’m keeping this post quick and dirty, but at least it’s out there. The gist of this post is that I put out a one-file gist that does all the basics, so that you can play around with it yourself...

Scholar Octopus
Fun hack: I took 7200 papers from 34 CV/ML conferences and laid them out with t-SNE based on bigram tf-idf. Explore...

Is HBase’s slow and steady approach winning the NoSQL race?
In the world of NoSQL databases, the products that have dominated the conversation are MongoDB and DataStax Enterprise, a leading distribution of Apache Cassandra. But a couple of headlines this week bring into focus a perhaps less-splashy, though rather tenacious player: Apache HBase, which is included with most major Hadoop distributions...

Data Science Articles & Videos

Building a Production Machine Learning Infrastructure
Josh Wills, Director of Data Science at Cloudera, has a gift for making fairly complicated technology explanations very digestible to the novice and intermediate techie. What I most love about this video is how Josh explains, very clearly, the issue of taking machine learning analytics built on a large set of data records (see: individuals) and making them work in a production environment on one individual (think eCommerce)...

Using scikit-learn Pipelines and FeatureUnions
Since I posted a postmortem of my entry to Kaggle's See Click Fix competition, I've meant to keep sharing things that I learn as I improve my machine learning skills. One that I've been meaning to share is scikit-learn's pipeline module. The following is a moderately detailed explanation and a few examples of how I use pipelining when I work on competitions...

An Empirical Analysis of Stop-and-Frisk in New York City
Between 2006 and 2012, the New York City Police Department made roughly four million stops as part of the city’s controversial stop-and-frisk program. We empirically study two aspects of the program by analyzing a large public dataset released by the police department that records all documented stops in the city...

Interfaces, Efficiency and Big Data
The recording of John Chambers' keynote presentation from the useR! 2014 conference, Interfaces, Efficiency and Big Data, is now available for viewing thanks to Data Science LA...

The Top 5 Questions A Data Scientist Should Ask During a Job Interview
The data science job market is hot, and an incredible number of companies, large and small, are advertising a desperate need for talent. Before jumping on the first 6-figure offer you get, it would be wise to ask the penetrating questions below to make sure that the seemingly golden opportunity in front of you isn’t actually pyrite...

The Question to Ask Before Hiring a Data Scientist
When hiring data scientists, there’s nothing more frustrating than making the wrong hire. Data scientists are in notoriously high demand, hard to attract, and command large salaries — compounding the cost of a mistake...

Visualizing product relationships in a Market Basket analysis
I came up with this technique to visualize and explain market basket analysis in a very simple visualization. This was the core thought behind this technique: algorithms used in text mining can be leveraged to create relationship plots in a market basket analysis...

Jobs

zulily is seeking an intellectually curious, collaborative data expert to work as an acquisition-focused data scientist and statistician. As a zulily Data Scientist, you will use statistical analysis and machine learning to better understand how users engage with zulily, and you will use that information to build models that inform our retention and acquisition practices and recommender systems, and that optimize content. You should have a strong background in statistics and probability, machine learning, and working with large datasets. Additionally, you should have knowledge of and experience in online marketing practices and metrics...

Training & Resources

Data Science at the Command Line - Webcast
We data scientists love to create exciting data visualizations and insightful statistical models. However, before we get to that point, usually much effort goes into obtaining, scrubbing, and exploring the required data. The command line, although invented decades ago, is an amazing environment for performing such data science tasks...

Books

"Andrew Gelman is a top researcher in Bayesian statistics as well as an excellent writer. He has written an excellent text on Bayesian data analysis that uses Markov chain Monte Carlo methods for dealing with hierarchical linear models. This book starts out on an introductory level, covering a wide variety of statistical modeling problems including logistic regression and multilevel logistic regression, generalized linear models and causal inference..."