Sometimes data scientists find themselves with a map-reduce cloud architecture and a computation that needs to be done on a large scale, but the data isn’t actually cloud scale. One great way to get the cloud working for you is a … Read More

I had the privilege of presenting my work on “Calculating Feature Importance in Data Streams with Concept Drift using Online Random Forest” at IEEE BigData 2014 in Washington, DC this past week. The conference was an interesting mix of presentations … Read More

In my last post, I described how we used Elias, an exploratory analysis tool for large-scale information extractions, to look at which (person, location) pairs are mentioned together most often, and then extended the analysis to distinguish how those entities are co-mentioned. Today, … Read More

In this and my next post, I’ll be showing a few quick analyses we performed using a new tool we developed, called Elias. In today’s post, we’ll see how topic modeling can be used to characterize how entities are co-mentioned, not … Read More

CCRi was delighted to host the second meeting of the Cville Data Science group earlier this month. A full house packed our conference room, and a good time was had by all. The lineup for the talks included three CCRi … Read More

Most machine learning algorithms and statistical inference techniques operate on the entire dataset. Think of ordinary least squares regression or estimating generalized linear models. The minimization step of these algorithms is either performed in place in the case of OLS … Read More

A technique that I have found particularly useful in Lisp-like languages such as Mathematica and Clojure is destructuring. Destructuring is a mechanism for extracting parts of an expression. The Lisp “code as data” paradigm lends itself to destructuring techniques. I recently leveraged … Read More

I recently pushed a very alpha Solr plugin to GitHub that does unsupervised clustering on unstructured text documents. The plugin is written in Clojure and utilizes the Incanter and associated Parallel Colt libraries. Solr/Lucene builds an inverted index of term to … Read More

A standard query on geospatial data is the nearest neighbor query, e.g., select the five police stations closest to a given point. The brute force approach to this problem is joining the two tables spatially and sorting by distance limiting … Read More

There is a ton of information in the TIGER Census files at the U.S. Gov Census site. Unfortunately, it is not easily mapped to geolocations. I had to get the tract level shapefiles and then transform the variables in the … Read More