A million songs and MapReduce

Earlier this year, Echo Nest and LabROSA at Columbia University released the Million Song Dataset, a freely available collection of audio features and metadata for a million contemporary popular music tracks. One purpose of the dataset is to encourage research on music algorithms. But as Paul Lamere, director of Echo Nest’s Developer Platform, makes clear, getting started with the dataset can be daunting.

In a post on his Music Machinery blog, Lamere explains how to use Amazon’s Elastic MapReduce to process the data. Echo Nest has loaded the entire Million Song Dataset into a single S3 bucket, available at http://tbmmsd.s3.amazonaws.com/. The bucket contains approximately 300 files, each with data on about 3,000 tracks. Lamere also points to a small subset of the data (just 20 tracks) available in a file on GitHub, and he created track.py, which parses track data and returns a dictionary containing all of it.
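To make the map and reduce steps concrete, here is a minimal local sketch in Python that counts tracks per artist. It is not Lamere's code: the tab-separated record layout, the `parse_track` helper (a stand-in for what track.py does), and the sample lines are all assumptions made for illustration, and nothing here touches S3 or Elastic MapReduce.

```python
from collections import defaultdict

# Hypothetical record layout, for illustration only:
#   "track_id<TAB>artist_name<TAB>title"
# The real dataset files have their own format, which track.py is meant to parse.
def parse_track(line):
    track_id, artist_name, title = line.rstrip("\n").split("\t")
    return {"track_id": track_id, "artist_name": artist_name, "title": title}

def map_track(line):
    """Map step: emit one (artist, 1) pair per track."""
    track = parse_track(line)
    yield track["artist_name"], 1

def reduce_counts(artist, counts):
    """Reduce step: sum the counts emitted for one artist."""
    return artist, sum(counts)

def run_job(lines):
    """Run map, shuffle (group by key), and reduce in a single local process."""
    grouped = defaultdict(list)
    for line in lines:
        for key, value in map_track(line):
            grouped[key].append(value)
    return dict(reduce_counts(key, values) for key, values in grouped.items())

# Made-up sample records, not taken from the dataset.
SAMPLE_LINES = [
    "T1\tArtist A\tSong one",
    "T2\tArtist B\tSong two",
    "T3\tArtist A\tSong three",
]
```

Calling `run_job(SAMPLE_LINES)` groups the three made-up tracks by artist and sums the counts. On a real cluster, the framework performs the grouping between the map and reduce steps and spreads the work across machines.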


GPS steps in where memory fails

After decades of cycling without incident, The New York Times science writer John Markoff experienced what every cyclist dreads: a major crash, one that resulted in a broken nose, a deep gash on his knee, and road rash aplenty. The crash knocked him unconscious, and he could not remember what had caused it. In a recent piece in the NYT, he chronicled the steps he took to reconstruct the accident.

He did so by turning to the GPS data tracked by the Garmin 305 on his bicycle. Typically, devices like this are used to track the distance and location of rides as well as a cyclist’s pedaling and heart rates. But as Markoff investigated his own crash, he found that the data stored in these types of devices can be used to ascertain what happens in cycling accidents.

In investigating his own memory-less crash, Markoff was able to piece together data about his trip:

My Garmin was unharmed, and when I uploaded the data I could see that in the roughly eight seconds before I crashed, my speed went from 30 to 10 miles per hour — and then 0 — while my heart rate stayed a constant 126. By entering the GPS data into Google Maps, I could see just where I crashed. I realized I did have several disconnected memories. One was of my hands being thrown off the handlebars violently, but I had no sense of where I was when it happened. With a friend, Bill Duvall, who many years ago also raced for the local bike club Pedali Alpini, I went back to the spot. La Honda Road cuts a steep and curving path through the redwoods. Just above where the GPS data said I crashed, we could see a long, thin, deep pothole. (It was even visible in Google’s street view.) If my tire hit that, it could easily have taken me down. I also had a fleeting recollection of my mangled dark glasses, and on the side of the road, I stooped and picked up one of the lenses, which was deeply scratched. From the swift deceleration, I deduced that when my hands were thrown from the handlebars, I must have managed to reach my brakes again in time to slow down before I fell. My right hand was pinned under the brake lever when I hit the ground, causing the nasty road rash.
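The kind of reading Markoff describes, a sharp drop in speed over a few seconds, can be picked out of a ride log programmatically. The Python sketch below is one simple way to do it, under stated assumptions: the log is a time-ordered list of `(seconds, speed_mph)` pairs, the function name and the eight-second window are hypothetical choices, and the sample values are invented for illustration rather than taken from Markoff's device.

```python
def find_sharpest_slowdown(samples, window_s=8.0):
    """Return (start_time, end_time, speed_drop) for the largest speed drop
    between any two samples at most window_s seconds apart.

    samples must be sorted by time. Returns None if there are fewer
    than two samples within a window of each other.
    """
    best = None
    for i, (t0, v0) in enumerate(samples):
        for t1, v1 in samples[i + 1:]:
            if t1 - t0 > window_s:
                break  # later samples are even further away in time
            drop = v0 - v1
            if best is None or drop > best[2]:
                best = (t0, t1, drop)
    return best

# Invented ride log: (seconds since start, speed in miles per hour).
SAMPLES = [
    (0, 29.0), (2, 30.0), (4, 30.0), (6, 22.0),
    (8, 10.0), (10, 4.0), (12, 0.0), (14, 0.0),
]
```

With the invented log above, `find_sharpest_slowdown(SAMPLES)` reports the eight-second stretch in which the rider goes from 30 mph to a stop. Pairing that time span with the logged coordinates would then point to the place on the road worth revisiting.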

It’s one thing for a rider to reconstruct his own accident, but Markoff says insurance companies are also starting to pay attention to this sort of data. As one lawyer notes in the Times article, “Frankly, it’s probably going to be a booming new industry for experts.”

Crowdsourcing and crisis mapping from WWI

The explosion of mobile, mapping, and web technologies has facilitated the rise of crowdsourcing during crisis situations, giving citizens and NGOs, among others, the ability to contribute to and coordinate emergency responses. But as Patrick Meier, director of crisis mapping and partnerships at Ushahidi, has found, there are examples of crisis mapping that pre-date our Internet age.

Meier highlights maps he discovered from World War I at the National Air and Space Museum, pointing to the government’s request for citizens to help with the mapping process:

In the event of a hostile aircraft being seen in country districts, the nearest Naval, Military or Police Authorities should, if possible, be advised immediately by Telephone of the time of appearance, the direction of flight, and whether the aircraft is an Airship or an Aeroplane.

And he asks a number of very interesting questions: How often were these maps updated? What sources were used? And “would public opinion at the time have differed had live crowdsourced crisis maps existed?”
