Google Refine Blog

Wednesday, November 10, 2010

Our acquisition of Metaweb back in July also brought along Freebase Gridworks, an open source software project for cleaning and enhancing entire data sets. Today we’re announcing that the project has been renamed to Google Refine and version 2.0 is now available.

Google Refine is a power tool for working with messy data sets, including cleaning up inconsistencies, transforming them from one format into another, and extending them with new data from external web services or other databases. Version 2.0 introduces a new extensions architecture, a reconciliation framework for linking records to other databases (like Freebase), and a ton of new transformation commands and expressions.

Freebase Gridworks 1.0 has already been well received by the data journalism and open government data communities (you can read how the Chicago Tribune, ProPublica and data.gov.uk have used it) and we are very excited by what they and others will be able to do with this new release. To learn more about what you can do with Google Refine 2.0, watch the following screencasts:

Tuesday, June 29, 2010

Gridworks 1.1 provides a feature for reconciling cells in a data set to topics on Freebase (e.g., matching "Tom Hanks" to the topic identified by "/en/tom_hanks" and viewable here). There are a few advantages to performing reconciliation. First, reconciliation can match both "Tom Hanks" and "Thomas Jeffrey Hanks" to the same topic, making the data more internally consistent and allowing trends to emerge more clearly. Second, by connecting the data to Freebase topics, we can now pull data from Freebase to augment the original data set, say, adding nationalities and birthplaces for each person mentioned in the data set. Finally, as there are implicit relationships between cells in the data set (e.g., a "directed" relationship between a "Director" cell and a "Movie" cell), by connecting cells to Freebase topics, we can now load those relationships into Freebase, enriching Freebase for other people to benefit from.
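To make the first two advantages concrete, here is a minimal sketch in Python of what reconciliation does conceptually. The alias table, the `reconcile` and `augment` helpers, and the property data are all made up for illustration; Gridworks queries Freebase rather than a local dictionary, and its matching is far less naive than an exact lookup.

```python
# Toy alias table: several spellings point at one topic identifier.
# The identifier follows the "/en/tom_hanks" example above; the table
# itself is hypothetical.
ALIASES = {
    "tom hanks": "/en/tom_hanks",
    "thomas jeffrey hanks": "/en/tom_hanks",
}

# Hypothetical per-topic data, standing in for what could be pulled
# from the database once a cell has been matched to a topic.
TOPIC_DATA = {
    "/en/tom_hanks": {"nationality": "United States of America"},
}


def reconcile(cell):
    """Return the topic identifier for a cell value, or None if unmatched."""
    return ALIASES.get(cell.strip().lower())


def augment(cell, prop):
    """Look up a property of the topic that a cell reconciles to."""
    topic = reconcile(cell)
    if topic is None:
        return None
    return TOPIC_DATA.get(topic, {}).get(prop)


rows = ["Tom Hanks", "Thomas Jeffrey Hanks", "Unknown Actor"]
matched = [reconcile(row) for row in rows]
```

Both spellings resolve to the same identifier, which is what makes the data internally consistent, while the unmatched cell is left as `None` for a person to review.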

But of course, all those benefits are applicable to other databases, too. If you have your own private database, or if you work primarily with your university's data, or the Library of Congress' data, or any other data source, then it makes sense to reconcile your data set with that source. We are re-working Gridworks to support reconciliation against arbitrary data sources (discussion thread). Here is a brief update on the current development.

In the source trunk, the Reconcile dialog box has been changed to support registering standard reconciliation services that adhere to this developing API spec:

You can experiment with some sample APIs "in the raw" here. Each reconciliation service can have its own semantics. For example, for the Netflix reconciliation service below, "types" means film genres rather than Freebase types.
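As a rough illustration of service-specific semantics, the sketch below models a tiny in-memory service whose "types" are film genres. Everything in it is an assumption made for the example: the `Candidate` record, the catalog entries and their identifiers, and the `reconcile` signature are hypothetical and are not taken from the developing API spec or from the Netflix service.

```python
from dataclasses import dataclass, field


@dataclass
class Candidate:
    """One possible match returned by the (hypothetical) service."""
    id: str
    name: str
    types: list = field(default_factory=list)  # genres for this service


# Made-up catalog; the identifiers and titles are placeholders.
CATALOG = [
    Candidate("example-1", "Chicago", ["Musicals"]),
    Candidate("example-2", "Chicago Nights", ["Documentary"]),
]


def reconcile(query, types=None):
    """Return candidates whose name contains the query text.

    If `types` is given, keep only candidates sharing at least one type.
    What a "type" means is up to the service; here it is a film genre.
    """
    needle = query.strip().lower()
    results = [c for c in CATALOG if needle in c.name.lower()]
    if types:
        wanted = set(types)
        results = [c for c in results if wanted & set(c.types)]
    return results


musicals = reconcile("chicago", types=["Musicals"])
```

The client only passes the type values through; interpreting them is left entirely to the service, which is why the same field can carry Freebase types for one service and genres for another.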

The service can also specify how to formulate URLs from identifiers. In this example, the Netflix identifier is "60020675".
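One simple way a service could describe such URLs is a template with a placeholder for the identifier. The sketch below assumes that approach; the `{id}` placeholder syntax, the `example.com` address, and the `view_url` helper are hypothetical and are not how the Netflix service or the API spec necessarily does it.

```python
from urllib.parse import quote

# Hypothetical template; a real service would supply its own.
VIEW_URL_TEMPLATE = "http://example.com/movie/{id}"


def view_url(identifier, template=VIEW_URL_TEMPLATE):
    """Fill the identifier into the template, percent-encoding it first."""
    return template.replace("{id}", quote(str(identifier), safe=""))


url = view_url("60020675")
```

Percent-encoding the identifier keeps the resulting URL valid even when an identifier contains spaces or slashes.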

The service can also customize various auto-suggest widgets in the UI. For example, here the Netflix service automatically suggests only film topics (rather than cities) for "Chicago".

If you're interested in this development, check out Gridworks' trunk and try out the feature. The Netflix reconciliation service mentioned is implemented as an Acre app (open source). Feel free to develop your own reconciliation service, plug it into Gridworks, and tell us what works and what doesn't (mailing list).