Code4Lib 2010 Tuesday notes

The morning started with a nice 12 mile run based on a few different routes. A closed road near the Biltmore House had me walking (terrified) across a very short railroad trestle before I noticed a nice pedestrian bridge (way to go Asheville!). Code4Lib is a very connected conference, which leads to in-depth online note-taking and an active IRC channel.

Code4Lib 2010 kicked off with a keynote by Cathy Marshall, titled "People, their digital stuff, and time: Opportunities, challenges, and life-long challenges." She mentioned a few interesting stats (a webpage weighs 80 micrograms – Brewster Kahle; 4.5 billion personal photos on Flickr) and went on to talk about how people preserve, use, and manage their digital artifacts.

For me a compelling question she asked was: "digital lets us keep everything – should we?" She had a good point – "you never get rewarded for deletion, but sometimes you get rewarded for saving" (à la Gordon Bell).

She talked about a study she did on amassed data in which she typed the metadata (e.g. place, artifact, context) assigned to digital objects. The interesting aspect of typing that metadata is that it lets you think about metadata assignment in the context of the idea that people don't approach digital object description and archiving in a deliberate, systematic way. Her final message was that "new opportunities lie in the aggregation of individual archives and efforts."

They are working on building this at http://cloud4lib.sourceforge.net. They have set up a collaborative workspace, which they are running out of OSU libraries using a research account on Amazon EC2. They are holding a breakout session at Code4Lib today to talk about how to structure this idea.

Ross Singer – Linked data

Ross talked about how he took MARC data and built a linked data service. It's too complicated to cover in depth here, but you can find his presentation on the Code4Lib site. The really neat thing he did was take a VuFind instance and embed some RDFa into the template to make the linked data discoverable.

Data librarian, UC Berkeley. Like many things, data analysis in libraries is getting more complicated – libraries have an opportunity to help people learn how to analyze data and to use data analysis in our own work. His definition of the cloud: "a replacement for the desktop, when it makes us work smarter, extendable." He asserted that decision makers are not statisticians but rather need processed charts and data. rApache is an Apache module with the R interpreter compiled into it; it lets you embed an R script in web pages and provides an interface to GET/POST (baseball example). Some ideas he had for use: interactive web/e-journal use visualizations, real-time survey results, data visualization for instruction, network analysis. . .
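The rApache idea – run an analysis script server-side and splice the processed result into a page for decision makers – can be sketched in Python too. This is a hypothetical analogue, not rApache itself; the function name and the sample e-journal numbers are made up for illustration:

```python
# Hypothetical sketch of the "embed analysis in a web page" idea,
# using Python's stdlib instead of an R interpreter in Apache.
import statistics


def usage_report_html(monthly_downloads):
    """Summarize e-journal download counts as a small HTML fragment,
    the kind of processed chart/number a decision maker would see."""
    mean = statistics.mean(monthly_downloads)
    median = statistics.median(monthly_downloads)
    return (
        "<div class='usage-report'>"
        f"<p>Mean downloads/month: {mean:.1f}</p>"
        f"<p>Median downloads/month: {median}</p>"
        "</div>"
    )


if __name__ == "__main__":
    # Invented sample data standing in for real e-journal usage stats.
    print(usage_report_html([120, 95, 143, 210, 88, 176]))
```

In rApache the equivalent script would live in the page template and be evaluated per request; the point either way is that the server returns processed results, not raw data.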

Public datasets in the cloud – Rosalyn Metz, Michael B. Klein

Their cloud definition: on-demand self-service, broad network access, resource pooling, rapid elasticity, IaaS, PaaS, SaaS. She demoed launching an EC2 instance and mounting datasets hosted on EC2 for analysis. She mentioned Socrata and Google Fusion Tables, in which you can create tables and visualizations of your data. Michael talked about data access issues, wondered where and how libraries might get data, and asked what the role of the IRB might be in data analysis.

Extensible catalog – Jennifer Bowen

The afternoon started off with an overview of the eXtensible Catalog. They have developed a suite of tools covering user interface, metadata, and connectivity (NCIP, OAI) to serve as a holistic replacement for discovery and management services. XC includes FRBRized metadata, "why you got that record" in the index display, works as a Drupal module, and includes some nice complex staff metadata management tools so that staff have flexibility in defining how metadata is displayed. The toolkits (Bowen indicated that there were lots) let you automate data loading and processing! It appears that right now XC is still in a semi-release phase, but all the software is available.

Conference fatigue has set in – the rest of the sessions are only covered in abstract here.

Ok, I take it back. Jeff Sherwood talked about using Levenshtein string distance as an algorithmic method for matching records – very neat. They also used the Jaro-Winkler algorithm for string comparison. Code and related resources: http://pypi.python.org/pypi/editdis/0.1, bit.ly/ZGSmF, SecondString for Java, and MARCXimiL – a MARC deduplication package – at http://snurl.com/uggtn.
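The Levenshtein distance behind that record-matching approach is short enough to sketch; this is a minimal illustration of the algorithm, not Jeff Sherwood's actual code (the links above point to that):

```python
# Levenshtein edit distance: the number of single-character insertions,
# deletions, and substitutions needed to turn one string into another.
# A small distance between two bibliographic fields suggests the records
# may be duplicates differing only by typos.
def levenshtein(a: str, b: str) -> int:
    # prev[j] holds the distance between the first i-1 chars of a
    # and the first j chars of b (classic dynamic-programming row).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(levenshtein("kitten", "sitting"))  # → 3
```

Jaro-Winkler works differently – it scores similarity in [0, 1] and weights agreement in the leading characters more heavily – which is why the two measures are often used together for deduplication.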