
So, we've probably mentioned that we're using Elasticsearch to power Gander. What we haven't talked about so much is why, and how we're using it under the hood. When we were building out the back end for Gander, the very first thing we wanted to know was which search platform would best meet our needs. It needed to be powerful and it needed to be fast. If it had a great API to work against and was easy to set up, well, that's just gravy on top. Elasticsearch had these qualities in spades. The other major platform we looked at was Solr, which, like Elasticsearch, is built on top of Lucene. The two are actually somewhat similar, with Solr being older and designed more for standard search applications. Elasticsearch is newer, and its API, setup, and underlying model reflect this. For the kind of application we're building, Elasticsearch is the better fit.

Elasticsearch works by creating indexes over whatever data you're trying to search. These indexes track a subset of that data and are organized so that searches against them are fast. To actually keep track of our data, we store it all in CouchDB. However, we still needed to get our data into Elasticsearch so that it could build indexes over it. Luckily, there's a pretty smooth way to do it. CouchDB exposes a changes feed where any modification to the underlying data shows up as a new change, and Elasticsearch has a river plugin for reading that feed. The changes flow downstream along the river, one might say. With the right configuration, any change to the data we want to search is automatically propagated to Elasticsearch and reindexed, so our search results are always up to date.
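For the curious, wiring that up looks roughly like the sketch below. This isn't our production setup; the host/port values and the database and index name ("gander_docs") are placeholders, but the general shape of registering a CouchDB river by PUTting a _meta document is the standard approach.

```ruby
require 'net/http'
require 'json'

# Sketch only: register the CouchDB river plugin so Elasticsearch follows
# CouchDB's changes feed and keeps its index up to date. The database and
# index names ("gander_docs") are placeholders, not our real config.
river = {
  type:    "couchdb",
  couchdb: { host: "localhost", port: 5984, db: "gander_docs" },
  index:   { index: "gander_docs", type: "gander_docs" }
}

http = Net::HTTP.new("localhost", 9200)
http.put("/_river/gander_docs/_meta", river.to_json,
         "Content-Type" => "application/json")
```

Once that document is in place, the river tails the changes feed on its own; there's no cron job or manual reindexing step to babysit.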

Gander is primarily a Ruby on Rails app, and it's there that we create new search requests and display our search results, so Rails needs to be able to talk to Elasticsearch too. Elasticsearch has a pretty great client-side API, but rather than reinvent the wheel (or tire, so to speak), we're using the Tire gem. It lets us make requests and display their results with a minimum of fuss. Basically, when a user enters a query, we parse it and pass it to Tire. Tire then uses the Elasticsearch API to make the search request on our behalf and hands back the results in a convenient format, which we display on the page. Easy peasy, lemon squeezy.
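To make that concrete, here's a minimal sketch of what a Tire search can look like. The index name ("gander_docs"), the result size, and the title field on the results are assumptions for illustration, not our actual code.

```ruby
require 'tire'

# Minimal sketch of handing a user's query string to Tire.
# "gander_docs" and the size are placeholder values.
def search_documents(user_query)
  search = Tire.search('gander_docs') do
    query { string user_query }
    size  25
  end
  search.results  # enumerable of result objects, ready to hand to a view
end

# Assumes the indexed documents have a "title" field.
search_documents("lemon squeezy").each { |doc| puts doc.title }
```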

And that's pretty much it. Elasticsearch is an easy search platform to work with: aside from some details of running it as a distributed cluster, this is a basic overview of our entire search architecture. Questions? Comments? Want to know about some specifics? Contact us!

My original plan for this afternoon was to attend Jeremy Howard and Mike Bowles' session on predictive modeling, but, after a morning of focused web crawls, I decided to go listen to Simon Rogers (@smfrogers) and Michael Brunton-Spall (@bruntonspall) talk about data journalism instead. To cop a Britishism, it was brilliant. Rogers is the pioneering journalist behind The Guardian's uber-popular Datablog, and Brunton-Spall is one of the developers tasked with transforming reams of raw data into journalist-searchable information.

If you haven't ever read the Datablog, you should: it's a model for transparent, accessible journalism, giving readers a variety of ways to consume the news, the numbers behind the news, and the methodology for obtaining those numbers. The Datablog does a lot of the UK government's work for it, and a decent amount of our government's as well, turning paper and web documents into public Google spreadsheets, interactive charts and visualizations, and editorial stories. As Rogers noted, while data used to be the domain of long-form journalism, our new crawling, parsing, and processing skills make it highly suitable for short-form news as well. It's pretty easy to imagine it becoming a real-time news source (I'm sure Automated Insights would agree).

This session used a bunch of Datablog posts and datasets to illustrate the parts of data journalism, which boil down to:

1) collect the data: material that's sent in, recurring events, breaking news, and theories to be explored

2) figure out what to compare or what change to show, what the data means, and what other datasets to use with it

3) shove the chosen data into spreadsheets

4) clean up the data: check for data in the wrong format, merged cells, unnecessary columns, and values measured in different units. About 80% of their time is spent here (a toy cleanup example follows this list)

5) perform calculations on the data, recalculate if needed, and sanity-check the results

6) present the data in one or more formats (graphics, free viz tools, a Google Fusion Table, a story, and/or just publish the raw numbers)
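Step 4 is where the grunt work lives. As a toy example of that kind of cleanup pass, here's what it might look like in Ruby; the file names, column names ("department", "spend_gbp_m", "unit"), and the unit conversion are all invented for illustration, not anything The Guardian actually uses.

```ruby
require 'csv'

# Toy cleanup pass: skip blank/merged cells, strip thousands separators,
# normalize a value reported in mixed units, and drop columns we don't need.
rows = CSV.read("raw_spending.csv", headers: true)

cleaned = rows.each_with_object([]) do |row, out|
  next if row["spend_gbp_m"].to_s.strip.empty?   # blank or merged cell
  spend = row["spend_gbp_m"].delete(",").to_f    # "1,234.5" -> 1234.5
  spend *= 1000 if row["unit"] == "bn"           # billions -> millions
  out << { department: row["department"], spend_gbp_m: spend }
end

CSV.open("clean_spending.csv", "w") do |csv|
  csv << cleaned.first.keys
  cleaned.each { |r| csv << r.values }
end unless cleaned.empty?
```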

To help The Guardian's journalists identify the needles in the data haystack, the developers came up with a guideline they call "The Philosophy of Interesting Information." What qualifies as interesting?

metadata, as revealed in the WikiLeaks cables (US soldiers are much better at entering tags than diplomats are)

the habitual: it betrays the people behind the info

distress

anomalies

visualizations

Journalists parse datasets for these qualities using Ajax Solr, which puts a more user-friendly interface atop Solr. It includes search, interactive graphs, and tag clouds, and looks quite nice, but is not available to the public.

Occasionally, the Datablog has turned to its readers for help in parsing massive piles of PDFs. What they've found is that (a) you need to recognize and reward contributors for their help or else they'll get bored midway through, and (b) for the crowdsourced data to be useful, you still need people to comb through it. Long story short: cool concept, great for tips, pretty bad for data.

Since many of the Datablog's datasets have a geographic component, the journalists often use Google's Fusion Tables to visualize them. There are two types of Fusion Table maps that really work: ones with borders and ones with dots. In the last part of the session, Rogers showed us how to create a dot map that displayed where all the session attendees were from, along with their age and eye color. If you have a Google account, it's incredibly simple.

1) create or upload a spreadsheet/CSV (a quick sketch of building one follows these steps)

2) create a table based on that spreadsheet

3) visualize it as a map (Fusion Tables geocodes the location column for you)

4) set the info window (custom or automatic)
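Step 1 is the only part that happens outside the Fusion Tables UI. Here's a throwaway sketch of generating the kind of attendee CSV Rogers used, with a location column Fusion Tables can geocode; every row and column name here is invented for illustration.

```ruby
require 'csv'

# Throwaway sketch: build an attendees CSV for Fusion Tables to geocode
# on the "location" column. All rows are made-up examples.
attendees = [
  { name: "Alice", location: "Portland, OR",    age: 29, eye_color: "brown" },
  { name: "Bob",   location: "London, UK",      age: 41, eye_color: "green" },
  { name: "Carol", location: "Santa Clara, CA", age: 35, eye_color: "blue"  }
]

CSV.open("attendees.csv", "w") do |csv|
  csv << %w[name location age eye_color]
  attendees.each { |a| csv << a.values_at(:name, :location, :age, :eye_color) }
end
```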

One thing to note is that Fusion Tables don't yet work with real-time databases, though the Google API team is working on it.