Issue #91

August 20, 2015

Editor Picks

The State of Artificial Intelligence in Six Visuals
We cover many emerging markets in the startup ecosystem. Previously, we published posts that summarized Financial Technology, Internet of Things, Bitcoin, and MarTech in six visuals. This week, we do the same with Artificial Intelligence (AI). At this time, we are tracking 855 AI companies across 13 categories, with a combined funding amount of $8.75 billion...

Eigenstyle: Principal Component Analysis and Fashion
Any set of images can be broken down with Principal Component Analysis. This has been done pretty successfully with faces. Here we’ll take a look at style. Our dataset is 807 pictures of dresses from Amazon...
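As a rough illustration of what "breaking down" a set of images with Principal Component Analysis can look like, here is a minimal NumPy sketch. It is not the article's code: the function names, the random stand-in images, and the 8x8 image size are all assumptions made for the example.

```python
import numpy as np

def pca_components(images, k):
    """Return the mean image (flattened) and the top-k principal components."""
    n = images.shape[0]
    flat = images.reshape(n, -1).astype(float)  # one row per image
    mean = flat.mean(axis=0)
    centered = flat - mean
    # Rows of vt are orthonormal directions, sorted by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return mean, vt[:k]

def reconstruct(image, mean, components):
    """Approximate one image as the mean plus a weighted sum of components."""
    flat = image.reshape(-1).astype(float) - mean
    weights = components @ flat  # one coefficient per component
    return (mean + weights @ components).reshape(image.shape)

# Hypothetical stand-in data: 50 random 8x8 "images" instead of dress photos.
rng = np.random.default_rng(0)
images = rng.random((50, 8, 8))
mean, comps = pca_components(images, k=10)
approx = reconstruct(images[0], mean, comps)
```

Each image is then described by a handful of weights rather than by every pixel, and keeping more components gives a closer reconstruction.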

A Message from this week's Sponsor

Data Science Articles & Videos

Deep learning for assisting the process of music composition (part 1)
This is part 1 of my explorations of using deep learning for assisting the process of music composition. In this part, I look at some almost-winning output of a model trained by deep learning methods on over 23,000 folk tunes, and make improvements to produce a session-ready piece...

Mining Administrative Data to Spur Urban Revitalization
After decades of urban investment dominated by sprawl and outward growth, municipal governments in the United States are responsible for the upkeep of urban neighborhoods that have not received sufficient resources or maintenance in many years. One of city governments' biggest challenges is to revitalize decaying neighborhoods given only limited resources. In this paper, we apply data science techniques to administrative data to help the City of Memphis, Tennessee improve distressed neighborhoods...

The reusable holdout: Preserving validity in adaptive data analysis
We present a new methodology for navigating the challenges of adaptivity. A central application of our general approach is the reusable holdout mechanism that allows the analyst to safely validate the results of many adaptively chosen analyses without the need to collect costly fresh data each time...

Deep Convolutional Networks on Graph-Structured Data
In this paper we consider the general question of how to construct deep architectures with small learning complexity on general non-Euclidean domains, which are typically unknown and need to be estimated from the data. In particular, we develop an extension of Spectral Networks which incorporates a Graph Estimation procedure, that we test on large-scale classification problems, matching or improving over Dropout Networks with far less parameters to estimate...

Adventures in Data Mining - What we talk about when we talk about space heaters
I thought maybe I could save on the gas bill by getting space heaters for just a couple rooms... I faced the problem with a vision for what might be useful to a consumer: a web app containing interactive scatterplot visualizations of the product features... Overall, the problem can then be broken down into two sub-tasks: 1) Identify the relevant product “features” or aspects that people need to know about (label the axes), and 2) Determine consumer attitudes to each feature or aspect of a product as implied by the reviews (score each product along the axes)...

Data Science on Firesquads: Classifying Emails with Naive Bayes
This year, for our Firesquad rotation, we on the Data Science squad wanted to help automate the classification of support emails. The short-term goal was to reduce the time Coach Relations needs to spend when answering emails. Longer term, this tool could allow us to automatically detect patterns and raise alarms when specific support requests are occurring at an abnormal rate...
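For readers unfamiliar with the technique, a minimal sketch of multinomial Naive Bayes with add-one smoothing is shown below. It is not the article's implementation: the sample emails, the "account" and "billing" labels, and the function names are hypothetical, chosen only to show the idea in plain Python.

```python
import math
from collections import Counter, defaultdict

def train(examples):
    """Count classes and per-class word frequencies from (text, label) pairs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in examples:
        words = text.lower().split()
        class_counts[label] += 1
        word_counts[label].update(words)
        vocab.update(words)
    return class_counts, word_counts, vocab

def predict(text, model):
    """Return the label with the highest log-probability under Naive Bayes."""
    class_counts, word_counts, vocab = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        score = math.log(count / total)  # log prior
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in text.lower().split():
            if word in vocab:  # words never seen in training are skipped
                score += math.log((word_counts[label][word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical training emails and labels, for illustration only.
examples = [
    ("cannot reset my password", "account"),
    ("password link expired", "account"),
    ("charged twice this month", "billing"),
    ("refund for duplicate charge", "billing"),
]
model = train(examples)
```

Calling `predict("forgot my password", model)` on this toy data picks the "account" label, because "my" and "password" appear only in the account examples.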

The Effects of Hyperparameters on SGD Training of Neural Networks
The performance of neural network classifiers is determined by a number of hyperparameters, including learning rate, batch size, and depth. A number of attempts have been made to explore these parameters in the literature, and at times, to develop methods for optimizing them. However, exploration of parameter spaces has often been limited. In this note, I report the results of large scale experiments exploring these different parameters and their interactions...

With Discovery, 3 Scientists Chip Away At An Unsolvable Math Problem
Armed with an algorithm, McLoud-Mann, along with her husband, Casey Mann, and David Von Derau — all of the University of Washington, Bothell — had been trying to help unravel one of math's long-standing unanswered questions.
How many shapes are able to "tile the plane" — meaning the shapes can fit together perfectly to cover any flat surface without overlapping or leaving any gaps...

Jobs

Remind helps teachers, students, and parents engage in safe, simple communication. With more than 25 million users, we are one of the fastest-growing companies in edtech. We are hiring data scientists who are energized about solving the communication challenges that teachers face every day. Apply here

Denoising Dirty Documents: Part 1
So this blog is the first in a series of blogs about how to put together a reasonable solution to Kaggle’s Denoising Dirty Documents competition...

Out-of-Core Dataframes in Python: Dask and OpenStreetMap
In this post, I'll take a look at how dask can be useful when looking at a large dataset: the full extracted points of interest from OpenStreetMap. We will use Dask to manipulate and explore the data, and also see the use of matplotlib's Basemap toolkit to visualize the results on a map...

Books

Comprehensive history of statistics from its beginnings around 1700 to its emergence as a distinct and mature discipline around 1900...

"This book is THE definitive work on the early development of statistics. Obviously written by a man in love with his subject. Bernoulli, de Moivre, Bayes, Laplace, Gauss, Quetelet, Lexis, Galton, Edgeworth and Pearson all but come alive. I particularly enjoyed the reproductions of first sources included that you would otherwise have to travel to Paris to see..."