Posts Categorized: Data Science

From Synonyms to Object Properties
It’s well known that word embeddings are excellent for finding similarities between words, specifically synonyms. We achieve this using unsupervised machine learning techniques by showing a neural net a dataset of hundreds of millions of pieces of text. The algorithm looks at the context and frequency in which particular… Read more

I recently stumbled across an old Data Science Stack Exchange answer of mine on the topic of the “Best Python library for neural networks”, and it struck me how much the Python deep learning ecosystem has evolved over the course of the past 2.5 years. The library I recommended in July 2014, pylearn2, is no… Read more

Deploying machine learning models has always been a struggle. Most of the software industry has adopted container engines like Docker for deploying code to production, but since accessing hardware resources like GPUs from Docker was difficult and required hacky, driver-specific workarounds, the machine learning community has shied away from this option.… Read more

Maybe you’re training a machine learning model on a really big dataset. Perhaps you’ve got a big database dump and you want to extract some information. Or maybe you’re crawling web scrapes or mining text files. Modern computers are really quite powerful for processing streams of data. You shouldn’t have to resort to a Hadoop… Read more

In the frothy sea of Big Data buzz, there’s a tidbit: “More data beats a better model.” But if you’re not Google, and you’re not building distributed language models…well, haven’t you ever wondered how much improvement a model should yield when scaling up to a bigger dataset? Here we look at a specific example to… Read more

Earlier this week, Google released TensorFlow, an open source library for numerical computation. Given the general frothiness around machine learning, we thought folks might appreciate a simple, straightshootin’ take from indico’s Machine Learning team. Unlike a random person on the Internet, we deal with this stuff daily, and can hopefully shed some light on how… Read more

Last month Alec Radford and I had the great pleasure of attending the SIGGRAPH 2015 conference in Los Angeles. If you don’t know about SIGGRAPH, here’s a quick snippet from their website: “Since its beginning in 1974 as a small group of specialists in a previously unknown discipline, ACM SIGGRAPH has evolved to become an international… Read more

Introduction to Neural Image Captioning
Image Captioning is a damn hard problem, one of those frontier-AI problems that defy what we think computers can really do. This summer, I had an opportunity to work on this problem for the Advanced Development team during my internship at indico. The work I did was fascinating but not revolutionary … Read more

Data visualization
A big part of working with data is getting intuition on what those data show. Staring at raw data points, especially when there are many of them, is almost never the correct way to tackle a problem. Low dimensional data are easy to visually inspect. You can simply pick pairs of dimensions and… Read more

Get the IPython/Jupyter notebook on GitHub: indico-plotlines
A few months ago, a great video of Kurt Vonnegut circulated around the web. He describes an idea for plotting the simple shapes of stories as good vs. ill fortune across the length of the story. He says: “There’s no reason why these simple shapes of stories can’t be fed… Read more