
The Freebase user group meetup is a great place to meet other data nerds and learn what’s happening in the data community. The next one, on 17th Sept (this Wednesday), will feature several interesting talks, as well as one by me (Flip Kromer).

I’ll super-briefly show off the alpha version of infochimps.org, and then describe how some stuff we learned in philosophy class, and some lessons from exploring data about data about data, have each helped solve some thorny engineering problems.

Looking forward to any feedback on the new design, and to beers afterwards if anyone’s in the SF area.

We’re starting beta testing of infochimps.org v1.0 — see the following post. In order to start really populating infochimps.org with dataset payloads, the Infinite Monkeywrench is about to get some major love. The following syntax is still evolving, but we’re already using it to do some really fun stuff: here’s a preview.

One of the datasets we’re proud to be liberating is the National Climatic Data Center’s global weather data. To use that data, you need the file describing each of the NCDC weather stations. (I’ll just describe the stations metadata file — the extraction cartoon for the main dataset is basically the same but like 10 feet wide.)

The weather station metadata is found at ftp://ftp.ncdc.noaa.gov/pub/data/gsod/ish-history.txt. It’s a flat file with a 17-line header, whose fields describe each station’s latitude, longitude, call sign and all that, and whose lines look like

Almost all of that is setup and teardown. Once the infochimps schema has field names, the only part you really have to figure out is the cartoon:

s6 .s5 .s30 s2.s2.s2.s4 ..ci5 .ci6 .ci5

If you’ve used Perl’s unpack(), you’ll get the syntax — this says ‘take the USAF call sign from the initial 6-character string; ignore one junk character; … take one character as the latitude sign, and an integer of up to 5 digits as the scaled latitude, ….’
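Here’s a minimal sketch, in plain Python rather than the Monkeywrench’s own syntax, of what that cartoon amounts to. The column offsets, field names, and the 1/1000 scale factor below are illustrative guesses, not the real schema:

def parse_station_line(line):
    """Slice one fixed-width NCDC station record into named fields."""
    return {
        'usaf_id':  line[0:6].strip(),          # s6  -- USAF call sign
        'wban_id':  line[7:12].strip(),         # .s5 -- skip a junk char, 5-char id
        'name':     line[13:43].strip(),        # .s30
        # ... country / state / call-sign columns elided ...
        'lat_sign': line[57:58],                # c   -- '+' or '-'
        'lat':      int(line[58:63]) / 1000.0,  # i5  -- latitude, scaled
    }

# skip the 17-line header, ignore short lines, keep the rest
with open('ish-history.txt') as f:
    stations = [parse_station_line(l) for l in list(f)[17:] if len(l) > 63]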

Rather load it into a database? Leave the last line out, and stage right into your DB. (MySQL 4.x+, Postgres 8.2+, or SQLite 3+ all work.)
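If SQL is more your speed, here’s a sketch of staging those records straight into SQLite; the table and column names are made up for illustration:

import sqlite3

def stage_stations(records, db_path='ncdc_stations.sqlite3'):
    """Load parsed station dicts into a local SQLite table."""
    db = sqlite3.connect(db_path)
    db.execute('CREATE TABLE IF NOT EXISTS stations (usaf_id TEXT, name TEXT, lat REAL)')
    db.executemany(
        'INSERT INTO stations (usaf_id, name, lat) VALUES (:usaf_id, :name, :lat)',
        records)
    db.commit()
    db.close()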

Surely a hand-tuned script will do this more thoroughly (and more quickly), but you can write this in a few minutes, set it loose on the gigabytes of data, and do all the rest from the comfort of your DB, your Hadoop cluster, or a script that starts with populated data structures loaded from a YAML file.
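The YAML route really is that comfortable; a sketch, assuming the run above dumped its records to a file named ncdc_stations.yaml:

import yaml  # PyYAML

with open('ncdc_stations.yaml') as f:
    stations = yaml.safe_load(f)        # a list of dicts, ready to munge

southern_hemisphere = [s for s in stations if s['lat'] < 0]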

—

Another example. The US National Institute of Standards and Technology (NIST) publishes an authoritative guide to conversion factors for units of measurement. It is, unhelpfully, only available as an HTML table or a PDF file.
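For flavor, here’s a rough Python sketch of that kind of HTML-table pull (not the Monkeywrench’s actual code); the URL argument is a placeholder, and the three-column layout (unit to convert, target unit, factor) is an assumption about the table:

from urllib.request import urlopen
from bs4 import BeautifulSoup

def conversion_factors(url):
    """Scrape (from_unit, to_unit, factor) rows out of an HTML table."""
    soup = BeautifulSoup(urlopen(url).read(), 'html.parser')
    rows = []
    for tr in soup.find_all('tr'):
        cells = [td.get_text(strip=True) for td in tr.find_all(['td', 'th'])]
        if len(cells) == 3:              # assumed: from-unit, to-unit, factor
            rows.append(dict(zip(['from_unit', 'to_unit', 'factor'], cells)))
    return rows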

Now, with some tweaking you could do even more (and you’ll find you need to hand-correct a couple of rows), but note:

Once one person’s done it, nobody else has to.

This snippet gets you most of the way to a semantic dataset in your choice of universal formats.

In fact, there’s so little actual code left over that we can eventually just take schema + URL + cartoon as entered on the website, crawl the relevant pages, and provide each such dataset as CSV, XML, YAML, JSON, zip’d sqlite3 file … you get the idea — and we can do that without having to run code from strangers on our server.
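The fan-out itself is trivial; a sketch (function and file names here are illustrative, not infochimps.org’s actual code):

import csv, json
import yaml  # PyYAML

def export_all(records, basename):
    """Write one list of record dicts out as CSV, JSON, and YAML."""
    fields = sorted(records[0].keys())
    with open(basename + '.csv', 'w', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=fields)
        writer.writeheader()
        writer.writerows(records)
    with open(basename + '.json', 'w') as f:
        json.dump(records, f, indent=2)
    with open(basename + '.yaml', 'w') as f:
        yaml.safe_dump(records, f)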

Most importantly, for an end user this isn’t like trusting some random dude’s CSV file uploaded to a site named after a chimpanzee. The transformation from NIST’s data to something useful is so simple you can verify it by inspection. Of course, you can run the scripts yourself to check; or you can trace the Monkeywrench code itself; and once we have digital fingerprinting set up on infochimps.org anyone willing to stake their reputation on the veracity of a file can sign it — but it’s pretty easy to accept something this terse but expressive as valid. Our goal is to give transparent provenance of infochimps.org data to any desired degree.