
Data sources

Posted by Cheng Soon Ong on September 19, 2008

Those of us interested in algorithm development often face the "have a hammer, looking for a nail" problem. Once we have confirmed that the standard machine learning datasets (for example, at UCI) do not offer a useful application area, where does one go? Below, I look at four websites that list data as well as software associated with data. The information is not collected with machine learning in mind, so a user would probably need to write preprocessing scripts to convert it into a usable form.

A common theme is that just providing blobs of data isn't enough; one has to provide interfaces or processing tools along with the data. The other common theme is that these sites are just listings of data, not archival copies.

This is a site for large data sets and the people who love them:
the scrapers and crawlers who collect them,
the academics and geeks who process them,
the designers and artists who visualize them.
It's a place where they can exchange tips and tricks,
develop and share tools together, and begin to integrate their particular projects.

theinfo.org classifies what people want to do with data into three activities: get, process, and view. In the get section, they provide a list of links to data sources, ranging from US congressional district boundaries to stock ticker data (which requires a free registration). Unfortunately, the list of datasets is static and offers no way to slice or filter it. In the view section, there is a nice list of dataset visualizations, for example a visualization of trends in Twitter, or worldmapper, which morphs the area of each country to correspond to the size of a variable of interest, such as the number of internet users.

However, the really nice thing about this site is that each section lists tools of the trade and tips and tricks: bits of software for collecting, processing, and visualizing data. These are the kinds of things that simplify our data analysis tasks. There doesn't yet seem to be a tool for each of the data sources listed, which means that a machine learner may still need to write their own scraping tool to get the data.

There are many sources to find out something about everything.
Until now, there’s been no good place for you to find out everything about something.

This site is still in beta and currently only provides a list of datasets. They promise to allow uploading of your own datasets in the full version. What's nice about the design is that you can slice the list of datasets by predefined fields or tags. So, in a sense, the design is very much like mloss.org: it depends on community involvement to keep the repository fresh and up to date. Most of the data seems to be in tabular formats (CSV, XLS), but they also support YAML, which means that in principle more complex structures can exist.
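As a rough illustration of the kind of preprocessing script a machine learner might end up writing for tabular data like this, here is a minimal Python sketch that turns CSV text into a numeric feature matrix plus labels. The function name `csv_to_matrix`, the column names, and the sample data are all made up for illustration; none of them come from any of the sites discussed here.

```python
import csv
import io


def csv_to_matrix(text, label_column):
    """Parse CSV text into (feature_names, rows, labels).

    Cells that are empty, missing, or not parseable as numbers become
    None, so the caller can decide how to impute or drop them.
    """
    reader = csv.DictReader(io.StringIO(text))
    feature_names = [n for n in reader.fieldnames if n != label_column]
    rows, labels = [], []
    for record in reader:
        labels.append(record[label_column])
        row = []
        for name in feature_names:
            try:
                row.append(float(record[name]))
            except (TypeError, ValueError):
                # TypeError: the row was short, so the cell is None.
                # ValueError: the cell is empty or not a number.
                row.append(None)
        rows.append(row)
    return feature_names, rows, labels


# Hypothetical sample: two numeric columns and a label column.
sample = "height,weight,species\n1.5,30,a\n2.0,,b\n"
names, X, y = csv_to_matrix(sample, "species")
```

Leaving missing cells as `None` rather than guessing a value keeps the decision about imputation with whoever knows the dataset.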

Datamob highlights the connection between public data sources
and the interfaces people are building for them

They list hot new datasets and hot new interfaces, which are the latest listings. They have a short list of machine learning data, which includes the venerable UCI and also Netflix. A simple submit form allows one to add a link to a data source or interface. They don't aim to be comprehensive, but rather to be the best place to see how public data is being put to use online. However, it is a pity that the two lists seem to be independent. It would be nice to see which interfaces use which datasets.

One of the visualizations (under interfaces), of the 2008 presidential donations, pointed out something interesting: often when visualizing data, there are not enough pixels on the screen to represent what you want.
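One common workaround for the pixel problem, though not necessarily what that visualization does, is to aggregate the data into one bin per available pixel and draw only a summary of each bin. Below is a minimal Python sketch that keeps the minimum and maximum of each bin, so that spikes survive the reduction. The function name `bin_to_pixels` and the sample numbers are my own, purely for illustration.

```python
def bin_to_pixels(values, n_pixels):
    """Reduce a sequence to at most n_pixels (min, max) pairs.

    Each pair summarizes one contiguous chunk of the input, so a
    plot can draw one vertical segment per pixel column.
    """
    if n_pixels <= 0:
        raise ValueError("n_pixels must be positive")
    n = len(values)
    if n == 0:
        return []
    bins = min(n_pixels, n)  # never create more bins than data points
    summary = []
    for i in range(bins):
        start = i * n // bins
        end = (i + 1) * n // bins
        chunk = values[start:end]
        summary.append((min(chunk), max(chunk)))
    return summary


# Six data points squeezed into three pixel columns.
columns = bin_to_pixels([1, 5, 2, 8, 3, 9], 3)
```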

Those familiar with freshmeat, CPAN or PyPI
can think of CKAN as providing an analogous service for open knowledge.

They package data in a predefined format, which allows them to design an API. In particular, they encourage open data, that is, material that people are free to use, reuse, and redistribute without restriction. The predefined package format allows them to attach much more metadata to each submission, and in the long run would allow more automated processing. For example, they allow the download of the metadata of CiteSeer, which is Dublin Core compliant with additional metadata fields, including citation relationships (References and IsReferencedBy), author affiliations, and author addresses.

The REST API essentially defines how client software can upload and download data, and allows querying of what resources are available.
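To make the idea concrete, here is a minimal Python sketch of what the download and query side of such a client could look like. This is not CKAN's actual API: the base URL, the `/package` paths, the JSON response shapes, and the class name `CatalogClient` are all assumptions chosen for illustration, so check the real API documentation before relying on any of them.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen


def http_get(url):
    """Fetch a URL and return the response body as text."""
    with urlopen(url) as response:
        return response.read().decode("utf-8")


class CatalogClient:
    """Tiny read-only client for a hypothetical REST data catalog."""

    def __init__(self, base_url, fetch=http_get):
        self.base_url = base_url.rstrip("/")
        self.fetch = fetch  # injectable, so the client works offline

    def list_packages(self):
        # Assumed endpoint: returns a JSON list of package names.
        return json.loads(self.fetch(self.base_url + "/package"))

    def get_package(self, name):
        # Assumed endpoint: returns a JSON object of package metadata.
        url = self.base_url + "/package/" + quote(name, safe="")
        return json.loads(self.fetch(url))


# Offline usage with canned responses in place of a real server.
canned = {
    "https://example.org/api/package": '["citeseer"]',
    "https://example.org/api/package/citeseer":
        '{"name": "citeseer", "tags": ["citations"]}',
}
client = CatalogClient("https://example.org/api/", fetch=canned.__getitem__)
available = client.list_packages()
metadata = client.get_package("citeseer")
```

Passing the fetch function in as a parameter keeps the network out of the core logic, which makes the client easy to try against saved responses.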

Comments

lauren (on September 21, 2008, 04:56:46)

Hey Cheng, thanks for the mention.

Actually, Datamob does link interfaces to the datasets they draw on and vice versa.

This is much more interesting when you look at the wide variety of interfaces that draw on that same campaign finance data set -- six listed on Datamob -- or when you check out our listing for a site like EveryBlock, which draws from a variety of data sources.