Archive for October, 2010

Michelle Hudson and I have visited more teaching & research faculty across campus to learn about their uses, needs, and wants when it comes to data. As one person put it, we are preparing for the “Data Tsunami”, or as another person put it, the “Data Explosion”. We have learned a few more things:

Brokering – At least a couple of the people we visited thought libraries ought to play a central role in the brokering of data sets. In their view, libraries would serve as repositories of data and manage the licensing of the data both inside and outside the University community. “Libraries can make it easy for us to find data sets.” The Institute for Quantitative Social Science at Harvard University may be a good model. This clearinghouse function needs to include services educating people on how to use the data, “cool” interfaces for finding and using the data, and links to bibliographic materials like books and journal articles. “We would also like somebody to help us keep track of who uses our data and where it is cited in the literature.”

Curation – Some people have “archived” original data sets in the form of paper-based surveys. These things are stored in file cabinets in basements. Others have elaborate computer systems complete with redundant backups, rsync functionality, and data refreshment protocols. One person alluded to HubZero as a possible tool for these tasks.

Data origination – Most of the people we have talked to generate their own data either through surveys or scientific equipment. Fewer people, so far, have gotten their data from other people or companies. When it has come from companies, the data has been anonymized as well as encrypted before it gets here.

Data types – The formats of the data fall into a couple of categories: 1) binary data such as images, video, & simulation output, and 2) plain text data in the form of spreadsheets (mostly) or sometimes relational databases. “We know that the use of relational databases is the ‘best’ way to organize this information, but none of us want to take the time to learn SQL.”
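
For what it is worth, moving a spreadsheet into a relational database may require less SQL than people fear. Below is a minimal sketch, assuming a spreadsheet has been exported as CSV. The column names, the sample values, and the helper functions are all made up for illustration; only Python's built-in csv and sqlite3 modules are used.

```python
import csv
import io
import sqlite3

# Hypothetical spreadsheet export; columns and values are invented for illustration.
SAMPLE_CSV = """subject_id,year,score
1,2008,72
1,2009,75
2,2008,64
2,2009,70
"""

def load_csv(connection, table, csv_text):
    """Create `table` from CSV text and return the number of data rows loaded."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, records = rows[0], rows[1:]
    columns = ", ".join(f'"{name}"' for name in header)
    placeholders = ", ".join("?" for _ in header)
    connection.execute(f'CREATE TABLE "{table}" ({columns})')
    connection.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', records)
    return len(records)

def average_by_year(connection, table):
    """Return (year, average score) pairs using a single SQL statement."""
    query = f'SELECT year, AVG(score) FROM "{table}" GROUP BY year ORDER BY year'
    return connection.execute(query).fetchall()

if __name__ == "__main__":
    db = sqlite3.connect(":memory:")  # an in-memory database, nothing is written to disk
    print(load_csv(db, "survey", SAMPLE_CSV))
    print(average_by_year(db, "survey"))
```

The point of the sketch is that the loading step can be written once and reused, leaving the researcher with one SELECT statement per question.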

Licensing – At least a couple of the people we visited license their data to others. After working with General Counsel, contracts between parties are signed and the data is exchanged. We have yet to see any money changing hands. The licenses are used to protect the University from liability when the data gets used in ways not allowed by the license. A couple of people would like the University (or specifically the library) to handle this sort of paperwork.

Metadata – There is a wide spectrum of metadata application against the data sets. Some people have no metadata at all. Others maintain multi-volume books filled with “protocols” describing their data and how it is to be collected. One person said, “We spend a lot of our time correcting metadata tabulating what camera was used, when, and by whom… Our reputation rests on the quality of our data (and metadata). We’ve formatted our metadata as CSV files as well as XML files. In order for our data to be characterized as ‘good’ we need an error rate of 1000/1”.

Sharing – We are learning that the sharing of data is a complicated decision-making process. Many things come into play including but not necessarily limited to: the culture of the subject discipline, patents, the competitive nature of the researcher, intellectual property rights, funding agency requirements, embargoes, and the inclusion of human subjects. Some people are more willing to share than others. So far, no one will share their data until the first paper has been written. They want (need) “publication rights”.

Size – Everybody believes they have “large” data sets, but the definition of large needs to be qualified. On one hand large may be equated with sizable files. Videos are a good example. On the other hand large may mean many records. Big longitudinal studies complete with many fields per subject are a good example.

We are also learning that no one person or group seems to have a complete handle on the issues surrounding data. Michelle and I certainly don’t. Everybody knows a lot but not everything. Consequently, we are thinking of hosting a “Data Day”, a time and place when many of the people who deal with data for teaching and research get together, share war stories, and learn from each other’s experience. In the end we may understand how to be more efficient and prepared when the “tsunami” is actually upon us.

Yesterday David Dickinson came to visit the libraries to share and discuss some of his work regarding optical character recognition of New Testament manuscripts.

David Dickinson is a South Bend resident and Renaissance Man with a multifaceted educational background and vocational history. Along the way he became keenly interested in religion as well as computer programming. On and off for the past five years or so, and working in conjunction with the Center for the Study of New Testament Manuscripts, he has been exploring the possibilities of optical character recognition against New Testament manuscripts. The process goes something like this. Input very large digitized images of really, really old original New Testament manuscripts. Programmatically examine each man-made mark in the image. Use artificial intelligence computing techniques to determine (or guess) which “letter” the mark represents. Save the resulting transcription to a file. And finally, provide a means for the Biblical scholar to simultaneously compare the image with the resulting transcription and a “canonical” version of a displayed chapter/verse.
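
To make the steps above a bit more concrete, here is a toy sketch of the "find each mark, then guess its letter" part of the process. It is not David's software. It assumes the image has already been reduced to rows of ink ("#") and blank (".") cells, it uses three invented 3x3 letter shapes, and it swaps in simple nearest-template matching where David uses artificial intelligence techniques. Saving the transcription and the side-by-side comparison are left out.

```python
from collections import deque

# Hypothetical 3x3 letter shapes standing in for a trained model; "#" is ink.
TEMPLATES = {
    "L": ("#..", "#..", "###"),
    "O": ("###", "#.#", "###"),
    "T": ("###", ".#.", ".#."),
}

def ink_cells(rows):
    """Return the set of (row, column) positions that contain ink."""
    return {(r, c) for r, line in enumerate(rows) for c, ch in enumerate(line) if ch == "#"}

def find_marks(rows):
    """Group touching ink cells into marks, ordered left to right."""
    remaining = ink_cells(rows)
    marks = []
    while remaining:
        start = remaining.pop()
        mark, queue = {start}, deque([start])
        while queue:  # breadth-first flood fill over up/down/left/right neighbors
            r, c = queue.popleft()
            for neighbor in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
                if neighbor in remaining:
                    remaining.remove(neighbor)
                    mark.add(neighbor)
                    queue.append(neighbor)
        marks.append(mark)
    return sorted(marks, key=lambda m: min(c for _, c in m))

def normalize(mark):
    """Shift a mark so its bounding box starts at (0, 0)."""
    top = min(r for r, _ in mark)
    left = min(c for _, c in mark)
    return {(r - top, c - left) for r, c in mark}

def guess_letter(mark):
    """Return the template letter whose ink differs from the mark in the fewest cells."""
    shape = normalize(mark)
    return min(TEMPLATES, key=lambda letter: len(shape ^ ink_cells(TEMPLATES[letter])))

def transcribe(rows):
    """Transcribe an image, given as rows of '#' and '.', into a string of letters."""
    return "".join(guess_letter(mark) for mark in find_marks(rows))

if __name__ == "__main__":
    page = ["#...###.###",
            "#...#.#..#.",
            "###.###..#."]
    print(transcribe(page))
```

Because each mark is matched to its closest template rather than an exact one, a slightly damaged letter (say, an "O" missing one cell of ink) can still be guessed, which is the spirit of the "determine (or guess)" step.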

David’s goal is not so much to replace the work being done by scholars but rather to save them time. Using statistical techniques, he knows computer programs can work tirelessly to transcribe texts. These transcriptions are then expected to be reviewed by people. The results are then expected to be shared widely thus enabling other scholars to benefit.

David’s presentation was attended by approximately twenty people representing the Libraries, the Center for Social Research, and the Center for Research Computing. After the formal presentation a number of us discussed how David’s technology may or may not be applicable to the learning, teaching, and scholarship being done here at the University. For example, there are a number of Biblical scholars on campus, but many of them seem to focus on the Old Testament as opposed to the New Testament. The technology was deemed interesting but some people thought it could not replace man-made transcriptions. Others wondered about the degree to which the technology could be applied against manuscripts other than the New Testament. In the end there were more questions than answers.

Next steps? Most of us thought David’s ideas were not dead-ends. Consequently, it was agreed that next steps will include presenting the technology to local scholars in an effort to learn whether or not it is applicable to their needs and the University’s.

At the most recent ECDL conference in Glasgow (Scotland) there was a panel discussion on data curation called “Developing services to support research data management and sharing”. Below are some of the things I learned:

My take-away from Sarah Jones’s (DCC) remarks was, “There are no incentives for sharing research data”, and when given the opportunity for sharing, data owners react by saying things like, “I’m giving my baby away… I don’t know the best practices… What are my roles and responsibilities?”

Veerle Van den Eynden (United Kingdom Data Archive) outlined how she puts together infrastructure, policy, and support (such as workshops) to create successful data archives. As she put it, “infrastructure + support + policy = data sharing”. She enumerated time, attitudes, and privacy/confidentiality as the bigger challenges.

Robin Rice (EDINA) outlined services similar to Van den Eynden’s but was particularly interested in social science data and its re-use. There is a much longer tradition of sharing social science data and it is definitely not intended to be a dark archive. She enumerated a similar but different set of barriers to sharing: ownership, freedom from errors, fear of scooping, poor documentation, and lack of rewards.

Rob Grim (Tilburg University) was the final panelist. He said, “We want to link publications with data sets as in Economists Online, and we want to provide a number of additional services against the data.” He described one data sharing incentive this way: “I will only give you my data if you provide me with sets of services against it such as who is using it as well as where it is being cited.” Grim described the social issues surrounding data sharing as the most important. He compared & contrasted sharing with preservation, and re-use with archiving. “Not only is it important to have the data but it is also important to have the tools that created the data.”