Navigating Themes in Restaurant Reviews with Word Mover’s Distance
What does the sentence “The Sicilian gelato was extremely rich” have in common with “The Italian ice-cream was very velvety”? To a human, the two sentences (incidentally taken from two different reviews of the same restaurant) have a similar key theme: This restaurant has a gelato dish diners are raving about. But consider this from the perspective of a machine: Apart from the words “the” and “was” (which are ubiquitous across reviews and considered stop-words), there are no words in common. How can we teach a machine to learn that these two sentences have similar themes? ...
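
To make the "no words in common" point concrete, here is a minimal sketch in plain Python. It is not code from the linked post: the `content_words` helper and the two-word stop list are assumptions for illustration, using only the stop-words the blurb names.

```python
# Toy illustration: after dropping stop-words, the two review sentences
# share no tokens, so a plain word-overlap measure sees them as unrelated.
STOP_WORDS = {"the", "was"}  # assumption: only the two stop-words named above


def content_words(sentence):
    """Lowercase, split on whitespace, strip edge punctuation, drop stop-words."""
    tokens = (tok.strip(".,!?") for tok in sentence.lower().split())
    return {tok for tok in tokens if tok and tok not in STOP_WORDS}


a = content_words("The Sicilian gelato was extremely rich")
b = content_words("The Italian ice-cream was very velvety")

shared = a & b
jaccard = len(shared) / len(a | b)  # word-overlap similarity between the two sets

print(shared)   # the empty set: no content words in common
print(jaccard)  # overlap-based similarity is zero
```

An overlap score of zero is what motivates distance measures that compare word meanings rather than exact tokens, which is the approach the article's title points to.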

Data Science Articles & Videos

Building the Next New York Times Recommendation Engine
The New York Times publishes over 300 articles, blog posts and interactive stories a day. Refining the path our readers take through this content, by personalizing the placement of articles on our apps and website, can help readers find information relevant to them, such as the right news at the right times, personalized supplements to major events and stories in their preferred multimedia format. In this post, I’ll discuss our recent work revamping The New York Times’s article recommendation algorithm...

Survival analysis in R – step by step guide
I was recently looking for methods to apply to time-to-event data and started exploring Survival Analysis Models. In this post, I explore the basic Kaplan-Meier (KM) estimator, a nonparametric estimator of the survival function, using a real dataset (on time to death for 80 males who were diagnosed with different types of tongue cancer, from the package KMsurv) and a simulated dataset (using the package survsim)...
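
For readers new to the KM estimator, the following is a minimal sketch of the product-limit calculation. It is written in Python rather than the R used in the linked post, and it is not the author's code: the `kaplan_meier` function name and the toy data are assumptions made up for illustration. At each distinct time with an observed event, the running survival estimate is multiplied by (1 - deaths / number still at risk).

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] for each distinct time with at least one observed event.

    times  -- observed durations (time of event or of censoring)
    events -- 1 if the event was observed at that time, 0 if censored
    """
    survival = 1.0
    curve = []
    for t in sorted(set(times)):
        # Subjects still under observation just before time t.
        at_risk = sum(1 for u in times if u >= t)
        # Observed (uncensored) events occurring exactly at time t.
        deaths = sum(1 for u, e in zip(times, events) if u == t and e == 1)
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
    return curve


# Made-up toy data (hypothetical), NOT the KMsurv tongue-cancer dataset.
toy_times = [1, 2, 2, 3, 4]
toy_events = [1, 1, 0, 1, 0]

for t, s in kaplan_meier(toy_times, toy_events):
    print(t, round(s, 3))
```

Censored subjects count toward the at-risk total until their censoring time but never trigger a drop in the curve, which is the behaviour that distinguishes this estimator from a naive fraction-surviving calculation.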

Machine Learning Used To Predict Fine Wine Price Moves
Curiosity about the limits of machine learning led former trader, UCL academic and startup founder, Dr Tristan Fletcher, to apply complex AI techniques to the (on the surface) rather chaotic arena of fine wine pricing, comparing them with trading techniques used for more typical asset classes...

Frequentism and Bayesianism V: Model Selection
Here I am going to dive into an important topic that I've not yet covered: model selection. We will take a look at this from both a frequentist and Bayesian standpoint, and along the way gain some more insight into the fundamental philosophical divide between frequentist and Bayesian methods, and the practical consequences of this divide...

Learning Seattle’s Work Habits from Bicycle Counts (with R!)
This is an R version of Learning Seattle’s Work Habits from Bicycle Counts [featured in last week's newsletter!]. It more or less mimics the original Python code to offer an equivalent output. If all goes well, you might just run this .Rmd file and a nice HTML output will be generated...

Teaching Machines to Understand Us
A reincarnation of one of the oldest ideas in artificial intelligence could finally make it possible to truly converse with our computers. And Facebook has a chance to make it happen first...

Baidu explains how it’s mastering Mandarin with deep learning
Baidu senior research engineer Awni Hannun presented on a new model that the Chinese search giant has developed for handling voice queries in Mandarin. The model, which is accurate 94 percent of the time in tests, is based on a powerful deep learning system called Deep Speech that Baidu first unveiled in December 2014...

Google details how it cut Google Voice transcription error rates by 50%
Google today explained how its researchers have improved the speech recognition systems underlying the transcription for voicemails in Google Voice. Last month Google disclosed that the recognition error rate in Google Voice had gone down by 50 percent, and now Google is talking about how it achieved that success...

Jobs

At WSI, weather means business. We are the world's leading provider of weather-driven business solutions that enable enterprises to make better decisions using the most accurate, precise and resolute weather data available. We serve some of the world's biggest brands in the aviation, energy, insurance, and media markets, plus multiple federal and state government agencies. Based on growth and expansion in analytics, we are searching for a leader to build WSI’s capabilities in the data sciences, working closely with leaders across the company to build a variety of models, recommenders, and algorithms used by WSI customers to make critical weather-related business decisions...

Training & Resources

Understanding Statistical Power and Significance Testing
Much has been said about significance testing – most of it negative. Methodologists constantly point out that researchers misinterpret p-values. Some say that it is at best a meaningless exercise and at worst an impediment to scientific discoveries. Consequently, I believe it is extremely important that students and researchers correctly interpret statistical tests. This visualization is meant as an aid for students when they are learning about statistical hypothesis testing...

Comparison of machine learning libraries used for classification
This project aims at a minimal benchmark for scalability, speed and accuracy of commonly used implementations of a few machine learning algorithms. The target of this study is binary classification with numeric and categorical inputs (of limited cardinality, i.e. not very sparse) and no missing data... The algorithms studied are a) linear (logistic regression, linear SVM), b) random forest, c) boosting, and d) deep neural network... in various commonly used open source implementations like 1) R packages, 2) Python scikit-learn, 3) Vowpal Wabbit, 4) H2O, 5) xgboost, and 6) Spark MLlib...

Books

"Effective Python is a time-efficient way to learn – or remind yourself – what the best practices are and why we use them. It’s a concise book of practical techniques to write maintainable, performant and robust code using practices widely accepted in the community..."