Compatible Data: Challenges and Solutions

“In the humanities, many sources of data are linkable and open. But we’re not operating at a maximum level for linked, open data. We need to move toward making our data and our research more linkable.” These were the opening remarks by Micki McGee, chair of Fordham University’s Digital Humanities Initiative, at the recent Compatible Data Initiative Meeting (September 23-25, 2011), hosted by Fordham University and The New York Public Library Labs in collaboration with the Yaddo Corporation. Daniel Pitti, Richard Edwards (who organized the event with Micki), Edward Whitley, Susan Brown, Alan Liu, Craig Dietrich, Katy Börner, and others came together to share their research and the problems they’ve encountered with making data linkable, interoperable, useful, and easy to use.

I attended the meeting mainly to listen. The Wordle above was created from my extensive notes on the entire day's proceedings. I love that "data," "people," and "visualization" were the top three words because they are among the most important words to the digital humanities. Also, "different" and "relationships" are prominent and quite meaningful. Relationships and difference are part of what makes our work compelling.

At the meeting, many questions were raised about compatible data and few conclusive answers were reached. Some of the questions included:

What kinds of standards of evidence are required for suggesting a relationship of influence among people, ideas, movements, etc.?

When multiple relationships exist among individuals (such as historical figures), how do we privilege some individuals over others?

How do we show changes over time (such as relationships or developing ideas) and source back to primary documents?

How do we represent or record literary, artistic, and intellectual influences in datasets?

Michael writes: “Digital scholarship involves archiving previous scholarly information collections, using social media to assist in the creation of new scholarship, and synthesizing and creating structure by modifying archives as new work emerges. Synthesizing and creating structure around scholarship is a skill that will become more and more important as the Internet provides us with information (and scholarship) overload.”

Michael gets to the heart of an important compatibility: synthesis. It’s easier said than done, at this point.

AnaMaria Seglie, in her post “Building a Cake, Baking an Archive,” identifies an important verb: to interact. “While we have made several efforts to spread the word, we are still working on ways to interact with growing digital, scholarly communities. So, here is where I come to my question(s) for you, HASTAC scholars: how do you find your scholarly information? And with this question, I don’t simply mean the articles, books, and data you collect. What channels do you follow for retrieving information? How do you learn about search tools? For teachers, where do you go to find new materials for your classroom? In this increasingly vast archive we call the web, how does a small, developing archive make itself known?”

Getting archives to communicate with each other is an important part of data compatibility. At the meeting, Jon Ippolito spoke about his work on developing a metaserver that connects data across different databases. He explains, “When you add a record to the database and connect it to the metaserver, if the record does not exist, it will add it to the metaserver. Anytime someone adds something to the metaserver, it checks to see if an entry was already created. The intention is not just to say there’s another record, but to offer pointers to multiple resources, different artifacts registered in museums and libraries across the world, that might be valuable.”

Is there anyone who is compiling data, building a database, or working across multiple databases? What challenges have you encountered? How have you overcome them?


7 comments

Great post, Elizabeth! What were some of the repositories mentioned in the meeting you attended? How much of this work involves creating centralized searching resources across repositories vs. creating larger federated archives and repositories that centrally control collections? Are the access systems geared towards academic researchers or a wider population?

I guess I've got more questions than answers here, but in any case you've got me thinking...

These sorts of questions are on my mind a lot as I'm working on a semantic web-based project that seeks to combine data from a bunch of different sources. Have you heard of LODLAM (http://lod-lam.net/summit/)? It's a summit on "Linked Open Data in Libraries, Archives, and Museums" where the participants tackle these questions of data interoperability via linked open data. I'm excited to see so many in the (digital) humanities starting to recognize the importance of formatting their data this way.

I'm a little confused by the idea of a metaserver, though. One of the promises of LOD is that we don't need a central clearinghouse for information. So long as data is semantically marked and the ontology is available, we can link and share data seamlessly. As someone said at the NEH Digital Humanities Project Directors Meeting a few weeks ago, "your data is my data and my data is your data".
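The linked-open-data promise described here can be illustrated with a toy sketch (not from any of the projects discussed): two independently published datasets describe the same entity using a shared URI, so their statements can simply be pooled and queried together, with no central clearinghouse. The URIs and facts below are invented for the example.

```python
# Each dataset is a set of (subject, predicate, object) statements.
# Because both use the same identifier ("ex:whitman"), merging is
# just a set union -- "your data is my data and my data is your data."
archive_a = {
    ("ex:whitman", "ex:name", "Walt Whitman"),
    ("ex:whitman", "ex:frequented", "ex:pfaffs"),
}
archive_b = {
    ("ex:whitman", "ex:wrote", "ex:leaves_of_grass"),
    ("ex:pfaffs", "ex:locatedIn", "ex:nyc"),
}

graph = archive_a | archive_b  # pool the two sources

def about(subject, graph):
    """All statements whose subject is the given URI, across sources."""
    return {(p, o) for s, p, o in graph if s == subject}

print(sorted(about("ex:whitman", graph)))
```

The point of the sketch: neither archive had to know about the other in advance; agreeing on identifiers and vocabulary is what makes the merge trivial.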

The questions you quote from AnaMaria are ones I come up against far too often. Just finding the right tools for my own projects and for my teaching can take years of research, even though everything is already available on the Internet somewhere. There are just so many different resources and tools in so many disparate locations that we probably do need something like a metaserver, not necessarily for data, but for indices of tools, projects, etc. It would certainly make my life easier.

Thanks for sharing those notes. It sounds like they're trying to navigate the relationship between structure of ontologies and the flexibility of folksonomies. I wonder how (and if) they intend to share the data they generate. It also sounds like it will require a lot of human intervention to flesh out those messy relationships. I'm all for human-machine collaboration when it comes to data harvesting; it often seems like the only way to capture the kinds of information we're interested in.

Adam, Several repositories were mentioned, among them the Social Networks and Archival Context (SNAC) project, which Daniel Pitti discussed at length. This project addresses the "ongoing challenge of transforming description of and improving access to primary humanities resources through the use of advanced technologies. The project will test the feasibility of using existing archival descriptions in new ways, in order to enhance access and understanding of cultural resources in archives, libraries, and museums."

Another speaker, Edward Whitley, discussed his project, "The Vault at Pfaff's" (hosted by Lehigh U), a database about the relationships among people who frequented the Vault at Pfaff's, a nineteenth-century beer cellar in NYC that attracted a range of characters, including Walt Whitman. The goal of that project is to develop tools for data mining and visualization to reveal the workings of literary communities and to allow for serendipitous discovery of new knowledge. The project seeks to link visualizations of data directly to the source documents and databases, and it wants to make that scholarship transparent.

The overall goal of the meeting was to figure out ways to make databases and data linkable. What do we need, how do we ask for it, how do we fund it? How do we account for digital obsolescence? The sciences are much further along, but some of their methods don't work for the humanities. As far as data is concerned, much of our research involves contradiction, dualities, emotion, identities, and so on. There is definitely a wish to link to the Library of Congress and other large repositories of information. One of the problems is that sometimes records are wrong, or people disagree on how a record should be described. And yes, most everyone wants to make the data systems available to all, not just to academics.

Michael, You seem to know so much more about this subject than me! As I understand it, the goal of the Metaserver is to be a space that allows researchers, curators, librarians and anyone else to link their data. Here's an excerpt from my notes on Jon Ippolito's talk about it:

"Metaserver leaves data where they are and connects them with pointers, using title and theme.

When you add a record to the database and connect it to the metaserver, if the record does not exist, it will add it to the metaserver. Anytime someone adds something to the metaserver, it checks to see if an entry was already created. The intention is not just to say there’s another record, but to offer pointers to multiple resources, different artifacts registered in museums and libraries across the world, that might be valuable.

It’s sort of like having an ISBN for different artifacts. The amount of data required is so minimal, it fits into a lot of different standards. It can have a common field that another database has. Across disciplines, you find the kind of data stored in these data sets are so different it’s hard to have a one-to-one crosswalk."
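The design in those notes can be sketched in a few lines. This is a minimal, hypothetical illustration of the idea, not the actual metaserver's code: the data stay in their home databases, and the metaserver keeps only lightweight pointers keyed by a shared field (here, a normalized title plus theme). All names and URLs below are made up.

```python
class Metaserver:
    """Toy index that leaves records where they are and stores pointers."""

    def __init__(self):
        self.index = {}  # (title, theme) -> list of pointer dicts

    @staticmethod
    def _key(title, theme):
        # Normalize the shared fields so minor variations still match.
        return (title.strip().lower(), theme.strip().lower())

    def register(self, title, theme, institution, url):
        """Add a pointer; create the entry only if the record is new."""
        pointers = self.index.setdefault(self._key(title, theme), [])
        pointer = {"institution": institution, "url": url}
        if pointer not in pointers:  # skip duplicate registrations
            pointers.append(pointer)
        return pointers

    def lookup(self, title, theme):
        """Every known pointer for an artifact, across institutions."""
        return self.index.get(self._key(title, theme), [])


ms = Metaserver()
ms.register("Leaves of Grass", "poetry", "NYPL", "https://example.org/nypl/123")
ms.register("Leaves of Grass", "poetry", "Library X", "https://example.org/x/9")
# A single lookup now points a researcher to holdings at both institutions.
print(ms.lookup("leaves of grass", "poetry"))
```

Because only the key fields and a URL are stored, almost any database with a common field could participate, which is the "ISBN for artifacts" intuition above.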

I think I understand better now, Elizabeth. It sounds like they're trying to create a centralized repository of different data sources, but not copy the data itself, if I understand correctly. That would be very valuable. We need more indices of projects and data stores, preferably one or two centralized ones for the humanities that are, themselves, linked data and well publicized. LODLAM is similar, but more interested (afaik) in getting several different groups to use interoperable data, not to provide a centralized index.

When they discussed SNAC, was there any mention of RoSE (Research-oriented Social Environment)? That's Alan Liu's project on linking authors and primary texts into a network. They have stated plans to collaborate with SNAC. I'm looking forward to seeing what both SNAC and RoSE produce for public consumption; they sound like they'd be great partners with Bibliopedia, my project that's focused on secondary literature rather than primary texts, as those two are.

"What’s unique about RoSE is it is not an entirely controlled vocabulary system.

Person-to-person relationships, influence, rich descriptors. We’re thinking of our system as an evolutionary system of thought; a pseudosystem of thought. The majority of metadata systems are harvested from sources that follow standards. That provides a thin matrix about knowledge. We’re designing the thin matrix to be a platform for more thickly described relationships and annotations on top of that. Then you could layer very thick knowledge / understanding on top of that."

Richard Edwards tweeted: "RoSE is unique in its pseudo-controlled vocabulary to create more thickly described relationships, layering new data."

Messy relationships. That's part of what makes history and literature interesting, but they're difficult to capture in a standardized database. It's one of the things Liu's project takes under consideration.