Wednesday, November 27, 2013

Detail on distribution of artworks in the Tate collection by birthdate of artists, visualized by Florian Krautli.

What does "big data" look like for museums? Collecting institutions have enormous stacks of data about the artifacts and artworks in their stores. Several museums around the world have worked hard to make their data accessible by providing free access to datasets, applying Creative Commons licenses to digital content, or creating APIs (application programming interfaces) that allow programmers to build their own software on the museum's data.

Last month, the Tate joined the party when it opened up its collection database to the world on GitHub, a website where programmers collaborate on projects. The Tate is providing metadata about artworks and artists in its collection--over 70,000 artworks in all. The data is in a format called JSON (JavaScript Object Notation) that is commonly used for data sharing and processing. Even if you don't speak database, it's worth seeing how the Tate is presenting its collection to programmers on GitHub.
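To give a taste of what those files look like, here's a minimal Python sketch that parses a single artwork record. The field names below are illustrative stand-ins, not a guaranteed match to the Tate's actual schema:

```python
import json

# One artwork record, modeled loosely on the Tate's metadata.
# (Field names here are hypothetical examples, not the repo's exact schema.)
record = '''
{
  "artist": "Joseph Mallord William Turner",
  "title": "Norham Castle, Sunrise",
  "medium": "Oil paint on canvas",
  "width": "1219",
  "height": "908",
  "units": "mm"
}
'''

artwork = json.loads(record)
print(artwork["artist"])
print(artwork["width"] + " x " + artwork["height"] + " " + artwork["units"])
```

Once a record is parsed like this, it's an ordinary Python dictionary, which is why building visualizations on top of the dataset is such a low bar to entry.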

What can you do with these JSON files? Anyone can pull down the data and use it for their own purposes, subject to some simple goodwill guidelines. Here are two examples of visualizations created by GitHub users:

These visualizations are fun. They are beautiful. They raise interesting questions about the Tate's collection and the imperfections of collections data.

But the discussions they raise are limited. Florian's blog post centers on the question of why there are so many pieces by William Turner in the Tate's collection. A commenter pointed out that there must be an error in the data, as it is highly unlikely that Turner produced more than 40,000 works in his lifetime. Jim's post suggests some fun but somewhat silly conclusions about the height/width ratio of artworks.
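One way to sanity-check a surprise like the Turner count is to tally works per artist yourself before drawing conclusions. A minimal sketch, with a toy list of records standing in for the real dataset (in practice you would load every artwork file from the GitHub repository):

```python
from collections import Counter

# Toy records standing in for the Tate dataset; the "artist" field name
# is an illustrative assumption, not the repo's guaranteed schema.
artworks = [
    {"artist": "Joseph Mallord William Turner", "title": "Sketch A"},
    {"artist": "Joseph Mallord William Turner", "title": "Sketch B"},
    {"artist": "William Blake", "title": "Newton"},
]

counts = Counter(a["artist"] for a in artworks)

# An implausibly large count for one artist is a cue to inspect the data
# itself (duplicated records? bequests catalogued item-by-item?) rather
# than a finding in its own right.
for artist, n in counts.most_common():
    print(artist, n)
```

The point isn't the tally itself but the habit it encourages: every striking pattern in collections data deserves a quick look at how the underlying records were produced.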

Reading these posts and the related conversations, I was struck by two conflicting feelings:

1. It's awesome that data-sharing is causing people to have a conversation about what artists are represented in a museum collection, what kind of artwork the Tate has, what surprising things can be visualized and learned from the collections data, and how the data can be improved.

2. The data is sufficiently flawed and idiosyncratic to yield conclusions of questionable value. Knowing the dimensions of the frame a painting is in is much less compelling than many, many other things that could be known and explored about works of art. I'm imagining visualizations focusing on the gender or race of artists in the collection, frequency of loans (and to whom), frequency of display, common words used in label text... the list goes on.

To me, the fact that #1 is exciting and promising makes addressing #2 worth it. Opening up data is just the first (big) step to make it usable and useful. These experiments prompt questions, identify gaps in the data, and promote new forms of collection, dissemination, and analysis. The data you have is not always the data you want, but you often don't know that until you start monkeying with it. Future iterations of data sharing and use will help institutions and citizen-participants take the next steps to make it meaningful.
