Issue #29

June 12, 2014

Editor Picks

There's no denying that 'data scientist' is a hot job title to have right now, and for good reason. It's a tremendously fun and challenging field to be in, and despite all of the often undeserved hoopla that surrounds it, data scientists are doing some pretty amazing things. So it's no surprise that many people are clamoring to find out how to become data scientists. As I run a blog that attempts to teach some basic data science using sports analytics, I often get email asking how one gets started in data science and/or how quickly one can learn the prerequisites for being a data scientist. Instead of replying to these all the time, I thought I'd write my thoughts up here...

Using a technique called regression discontinuity, an Uber blog post claimed they'd reduced DUI arrests by about 10% on average. The post bugged me though, because there is not a lot of detail on the methods, and regression discontinuity is the sort of research design that is very much dependent on specification. In this post I replicate the study and walk through what regression discontinuity is and why it can be a very effective research design. Ultimately I think it’s plausible that Uber did in fact reduce DUIs in Seattle, but the story is a bit more complex than the blog post lets on...

Ever wonder why statisticians flock to specialized languages? It’s not hard to figure out. They’re not just flocking; in part, they’re fleeing. The C standard math library — and by extension, nearly every standard math library out there — lacks even the most basic functionality for doing statistical analysis...

What Big Data Visualization Analytics can learn from Radiology
As I research Part III of the “What Healthcare can learn from Wall Street” series, which is probably going to turn into a Part III, Part IV, and Part V, I was thinking about visualization tools in big data and how a human (or a deep unsupervised learning type algorithm) can use them to analyze large data sets (relatively) rapidly, and it came to me that we radiologists have been doing this for years...

Sentiment Classification Using scikit-learn (Ryan Rosario talk)
Facebook produces millions of pieces of text content every day. In this talk we discuss a system based on scikit-learn and the Python scientific computing ecosystem that describes and models positive and negative sentiment of user-generated content on Facebook...

Open-sourcing Haxl, a library for Haskell
Today we’re open-sourcing Haxl, a Haskell library that simplifies access to remote data, such as databases or web-based services. Haxl is a layer that sits between the application code and one or more “data sources”, APIs for fetching remote data...

Frequentism and Bayesianism II: When Results Differ
While it is easy to show that the two approaches are often equivalent for simple problems, it is also true that they can diverge greatly for more complicated problems. I've found that in practice, this divergence makes itself most clear in two different situations...

Finding Entity Names in Google's Knowledge Graph
I wrote about this patent because it describes how data janitors might use anchor text pointed to a page about an entity to help find other names for that entity. I wrote about it because it does a great job of showing how this knowledge graph kind of fact extraction differs from the web crawling and indexing that we often talk about when we talk about the indexing of things found on the web...

Self-Teaching Neural Network
What is this? My experiments creating a self-teaching neural network (NN) using genetic algorithms in JavaScript. What the hell is that? A neural network can be thought of as a computational model of a biological brain, made up of a network of neurons that receive inputs and create outputs...

The Colors of Chemistry
This notebook documents my exploration of color theory and its applications to photochemistry. It also shows off the functionality of several Julia packages: Color.jl for color theory and colorimetry, SIUnits.jl for unitful computations, and Gadfly.jl for graph plotting...

Jobs

When you join Dow Jones, you become part of one of the most dynamic, creative and savvy news and information companies in the world. The Data Scientist is an integral member of the Dow Jones Data Science and Engineering team. The role will support the data science and engineering strategy at Dow Jones...

Training & Resources

Data Science Stack Exchange
Data Science Stack Exchange is a question and answer site for data science professionals, machine learning specialists, and those interested in learning more about the field. It's 100% free, no registration required...

Learning Data Science isn’t easy; even just working out what you need to learn about is tricky. This is what made Swami Chandrasekaran come up with his Curriculum via Metromap (well worth a look!). Inspired by that, I’ve created a Data Science Clock...

Books

"Building a simple but powerful recommendation system is much easier than you think. This report explains innovations that make machine learning practical for business production settings — and demonstrates how even a small-scale development team can design an effective large-scale recommender. The style of the report makes this subject approachable for all levels of expertise."