2012-08-27 Edit: this class has been moved to Spring 2012. CS 895: Web-Based Information Retrieval. The instructor for this class will be Dr. Nelson and the class will be similar to the class taught in Fall 2011, featuring a review of IR models, ranking, evaluation, DM/ML, etc.

This is the first time we've had four WS-DL classes in a single semester. The CS 418 and 495 classes will count toward the Web Programming Minor, and the upper level graduate classes will count toward the 24 hours of course work required for the PhD.

WS-DL's Contributions to Digital Preservation 2012

Mat Kelly (@machawk1) presented a demo of WARCreate, a Google Chrome extension he developed, building on the initial poster/demo presented at WS-DL's trip to JCDL this past June. WARCreate allows a user to create a Web ARChive (WARC) file from any viewable webpage. Mat's main focus was on preserving content behind authentication, namely, social media content that is currently not being preserved by institutions like the Internet Archive.

Hany SalahEldeen (@hanysalaheldeen) presented his poster "The Revolution Will Not Be Archived" showing that nearly 11% of a sample of shared social media content was lost within a year of the 2012 Egyptian Revolution. By sampling other culturally important events, Hany was able to confirm similar rates of loss. Further information about Hany's methods and findings is detailed in a blog post he wrote about the study.
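The core measurement behind studies like this is simple: probe each sampled URL and compute the fraction that no longer resolve. A minimal sketch of that idea (not Hany's actual code; the function names are illustrative):

```python
# Rough sketch of estimating resource loss: probe each shared URL
# and count those that no longer resolve.
import urllib.request
import urllib.error

def is_missing(url, timeout=10):
    """Return True if the resource appears lost (4xx/5xx or unreachable)."""
    try:
        req = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(req, timeout=timeout):
            return False
    except (urllib.error.HTTPError, urllib.error.URLError):
        return True

def loss_rate(probes):
    """Fraction of sampled resources that are missing, given boolean probe results."""
    return sum(probes) / len(probes)

# e.g. 11 lost out of 100 sampled URLs gives a loss rate of 0.11
```

A real study would also need to distinguish soft-404s and transient failures from genuine loss, which is where most of the actual effort lies.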

Presenters

Much like last year, Martha Anderson (@MarthaBunton), the director of program management for the National Digital Information Infrastructure and Preservation Program (NDIIPP), stated that this was the largest ever meeting for NDIIPP at about 260 registrants. "Currently, NDSA has 128 members", she said, "with 3 more in the process of being admitted." This chimes in harmony with her welcoming remark at last year's meetup: "We are growing!". She emphasized that NDSA, being an initiative of NDIIPP, promotes leadership, generosity and commitment. Other goals of the organization she mentioned were stewardship, collaboration, inclusiveness and exchange. Martha then passed the mic to emcee Bill LeFurgy (@blefurgy).

Anil Dash (@anildash), co-founder of ThinkUp, an open source app that allows creators to keep track of activity in social media, gave his presentation entitled "Make a Copy". He brought the unique perspective of an entrepreneur and an activist, describing himself as "a geek interested in the social impact of tech on culture, government and others". "Good tools impact culture in a positive way", he said. "Simply recording the conversation in social media would change the nature of the dialog taking place." On the screen he put an image with the writing, "The wholesale destruction of your wedding photos", then asked the audience if they would be offended if there were a "Secret Ivy League" conspiracy committed to that act. Emphasizing the point, he said that this is exactly what is happening with Facebook, blatantly hinted at the website's origins as being Harvard-only, and cited the terms of service about the service's right to remove any content at will.

"Often", he alluded, "a service's Terms of Service trump the Constitution. But we are locked into these services." Oftentimes, the social and career cost a user would incur by opting out of these services makes doing so unfeasible. "There's a war raging against the open web", Anil stated, admitting the hyperbole. "The majority of time spent on the web is spent within an application experience, not web pages... They are gaslighting the web", he said after explaining the metaphor, and further expounded that though his app is authenticated, because he is not a large service like the Washington Post, his app is given a "warning" message when users try to utilize it.

Moving on, Anil said, "Our tools for archiving our culture suck. Silent movies", referencing that the archival of the most popular video format comes in the form of animated GIFs, which are usually preserved without all frames intact. "Proprietary formats win if they have more users", he said, emphasizing that popularity is the force that drives the formats utilized by the masses. "We are losing metadata like crazy. You can't find an Instagram photo on the web. We have a billion photos that have no web presence", he went on. Continuing, "In the best case, if some young Facebook engineer came along and integrated the ability to tag metadata on Instagram photos, we would get the same level of metadata as flickr has today, so we will have lost 5 years of sharing - essentially a billion photos lost."

Anil stated that it will be illegal to copy your own data off of your own device once everyone has moved to smartphones. "I can't copy my own data without breaking DRM rules", he said. "Device obsolescence is getting faster. Even with open formats, data will be lost." Anil changed the subject slightly with, "The way the web works is by making copies ... If we can encourage things to be on the web, we start to win."
Anil mentioned Timehop, a service that will e-mail you what you were doing a year ago according to your subscribed services (e.g. Twitter, foursquare). "How do we mine our personal archives and gain meaning/insight from them?", he asked, affirming that Timehop provides some indirect degree of temporally-displaced preservation. "People revere the name 'Library of Congress'", he said to a crowd made up of quite a few members of LoC. "There has been careful stewardship of the institution for a long time; people respect a loc.gov e-mail address." Anil recounted an instance where Facebook enforced a 24-hour limit on its social graph system, which the government utilized to show those that had interacted with the White House. The White House's policies required the graphs to be preserved for longer than Facebook allowed, and Facebook's terms of service were in the process of being reformed to ease these restrictions for all; conveniently, the White House was a good use case as to why this content should not have such limitations. "PR trumps Terms of Service", Anil said, noting that if Facebook were to stand firm with their policy, they would have a lot to lose. He finished by sharing that he is currently unable to buy the Michael Jackson "making of" Thriller video from Sony, yet it can easily be found on YouTube. "Uploaders want to be our allies", he stated while referencing the backlash by various websites and Internet users against the recently rejected SOPA and PIPA legislation. As a final note on the topic, Anil said, in reference to the masses' unwitting actions toward preservation: "They want to be advocates for you."

David Weinberger (@dweinberger) of the Berkman Center at Harvard University followed Anil with his presentation, "Big Data, Really Big Data", starting off with the notion that the only way to manage data initially was by eliminating it. "The original strategy", David stated, "of wisdom, knowledge, information, data" can now be inverted.

David went on to talk about the merits of knowledge by stating:

Knowledge was a matter of filtering out to get to the nuggets.

Knowledge drives out difference

Knowledge is a series of stopping points

He went on to talk about finding stopping points when acquiring knowledge by reducing it to, "Find an expert, get knowledge, move on", "We don't have to redo the experience. We can build on what is known", and "Books are stopping points."

Regarding books, and specifically what should be the writer's task: "When writing, put everything in it that the reader reads." Referencing the Ancient Greeks and the accessibility of knowledge, David said, "The web of knowledge only has value if there is disagreement." David then showed a quote from Senator Daniel Patrick Moynihan stating, "Everyone is entitled to his own opinion. Not his own facts."

"We want to believe that though we disagree, if we sat down long enough, we could come together over the facts.... From the web, we learn that we don't agree", David said, making the crowd aware of a long hidden truth.

"We have come up with ways that we can benefit from disagreement", David said, and proceeded to show a platypus and the opposition to its existence because it was ill-fit for the time's Linnaean nomenclature.

Regarding knowledge, David said, "Among software developers, we have the best rapid learning development humans have ever had in the Internet." Here, he went on to show examples of programming-related websites like stackoverflow.com (related: Proposed Digital Preservation StackExchange).

David showed that, from these sorts of websites, we can learn from developers who exhibit humility and generosity in assisting others with nothing tangible to gain. David showed the more than 3 million questions that have already been asked on StackOverflow.com and emphasized that "iteration is an incredible tool at scale."

"The act of education itself should be public. There is social benefit in that", he said. In reference to initial efforts like these, David concluded with the need for messiness:

"Messiness is how you scale meaning. If you want these objects to be rich, saturated with meaning, then you have to allow messiness. Ultimately, it is disagreement that scales knowledge."

After David, the room was dismissed for a short break to be greeted by Michael Carroll upon returning.

Michael started off by comparing electronic resource depletion to environmental depletion, stating that the analogy is a good correlative fit. "The links to the data are part of what makes it meaningful. Some resources are allocated to metadata collection but should be re-shifted to preservation efforts", Michael said. "Intellectual copyrights are intangible things that we attach to things. The rights can be looked at as a thing as well, even though they're intangible", Michael continued.

"There are risks in the environment about what you can and can't do. The legal environment of copyright is not stable, it's dynamic." Many copyright holders let their copyrights expire. Michael, describing the holders' reasoning, said, "It's a natural response to have no long-term copyright intention." Michael then explained the Orphan Works issue, wherein a work's author is not known. Because of this, the resource cannot be reused by others, including libraries, for fear of being sued if the copyright owner ever lays claim to the work. "Orphan works get in the way of digital publication projects", Michael said. Michael then provided his opinion on fair use for preservation:

"Making copies for the purpose of preserving them is a socially beneficial use that is unlikely to have any market harm to the copyright owner. I think it is a presumption that making a copy itself, storing it, is fair use unless there is an active market for that kind of activity. The storing is not something that you should be deterred by."

Fair use has an active role to play in the digital preservation field. "Grab it, store it, figure out how to work with the copyright", Michael said. He then shifted focus with, "Because copyright attaches automatically, Creative Commons wanted a way to opt out of it and opt into a more sharing environment." He then went on to list the 6 variations of Creative Commons licensing. Michael further emphasized that a CC license is not limited to digital works and that any work can declare that it's under the license by simply providing the generic CC URL.

He closed by rhetorically asking, "What can the preservation community do to assist in preservation and access to preserved websites?" He recommended that the first step is to mark the digital public domain (things that are already there) so people can be aware of what is in copyright and what is not. Further, people can encourage the use of open licenses at the time of publication or consider using a "springing" open license that grants CC rights based on some contingency. Doing so will remedy the Orphan Works problem. Lastly, he said that even if a work is under full copyright, it should at least be marked as such; questionable copyright is the source of many of the issues he described.

Lightning Talks

A session of short lightning talks on various topics followed.

Christie Moffatt is developing a "Health and Medicine Blogs" collection at the U.S. National Library of Medicine. She is archiving blogs of medical professionals and patients and thus far has "been able to capture select blogs fairly well". She proposed the question, "What does it mean to completely capture a blog and how far should you go to preserve content?"

Test crawls she performed were helpful in finding problems relating to crawl frequency, quality, etc. She then asked, "How much of a blog's content needs to be about medicine for it to be a medical blog (e.g. doctors that quilt)?", but left this question open-ended.

Terry Plum of Simmons GSLIS presented "Teaching Digital Preservation in a Digital Curriculum Laboratory". He worked on efforts to set up labs that pull in students and faculty to perform preservation procedures. The labs consisted of a wide range of preservation software including DSpace, Fedora and espace, with his intention being to bring in any open source apps that relate to archives.

He turned students loose in the apps to see what they could do, with broad guidelines. He also provided preservation exercises consisting of step-by-step instructions for executing preservation procedures like ingest and encapsulation, with all exercises having a focus on preservation.

Kelcy Shepherd of Amherst College presented "Our Collective Task: Digital Preservation at the Five Colleges", which described the logistics of five colleges' efforts to best utilize their resources for preservation. She found that it was more cost-efficient for the five schools to collaborate, as fewer personnel were required.

"Now", she said, "is the right time for collaboration. Each college is taking responsibility or creating a need for digital stewardship". An initial assessment was done to evaluate each school's readiness, with each school having developed its own digital preservation policy. A key question she concluded with was, "Should we be doing this within the five colleges, forging a larger collaborated effort, or joining a larger effort in the region?"

Jefferson Bailey (@jefferson_bail) of the Library of Congress presented "Personal Digital Archiving at the Library of Congress", with the intention of putting emphasis on the value of personal digital archiving. Martha Ballard recorded events of her life from 1785 to 1815. This analog record was passed down through her family until it was discovered in the 1980s; a follow-up book, "A Midwife's Tale", was based on these records.

Jefferson noted that though this collection was able to sit on a shelf for over 200 years until it was discovered, other objects are not so lucky. LoC wants to be supportive toward community efforts of digital archiving. He also noted recent activities supported by LoC, including a Personal Archiving table at the National Book Festival (the giant floppy disk, he said, is always a big hit), a presence at ALA Preservation Week and representation at many conferences. LoC also now has a downloadable Personal Digital Archiving Day kit. "The mission and goal of the program", Jefferson stated, "is to work collaboratively with the community to deliver basic, practical tips and guidance to assist individuals seeking to preserve personal digital objects."

Carol Minton Morris (@mintonmorris) of DuraSpace was up next with "Painting Crowdsourced Microfinance Platforms and Projects Into the Big Digital", investigating NDSA initiatives to leverage platforms like Kickstarter (Yancey Strickler spoke about this at the 2011 Meetup) for funding. She looked to determine whether there was public interest in crowdsourced microfinancing and how effective it would be in overcoming funding issues.

"We've seen an NDSA effect in curating projects related to NDSA goals", she continued, "and Kickstarter has matched backers. However, Kickstarter Projects are mostly about the arts. Tech and publishing projects are the outliers. We might want to think about changing that."

Carol also mentioned financing means outside of Kickstarter, like IndieGogo. Unlike Kickstarter, IndieGogo allows all funds raised to be kept by the party doing the raising; however, its fees are higher. She encouraged everyone to think about microfinancing. "If you can split off parts and pieces into small chewable components", she said, "you may be able to find a way to fund things in small bites."

"The way we needed to think of this was in long-term scenarios", she said. "We just funded a grant for the Hibani state archives that proved to be a success story. After funding, the archives began to poll for additional external funding." She continued, "I am here today to get a response from all of you that are constantly looking at programs to make sure that the program fits the needs of scenarios and addresses the emerging challenge in appropriate ways. Reach a little beyond but not too far. I think we need to collaborate collectively in our own program."

Joel Wurl of the National Endowment for the Humanities spoke next, stating, "Best practices are always plural and, perhaps, always will be. Solutions in one domain may not ever transfer to another. ... I suggest that there will always be a need to be fulfilled and tasks to accomplish, and thus a need for funders." He followed with other quotables like, "Digital preservation and fixity do not go together neatly" and "Is it preservation or is it rendering? YES!", with little comprehensible segue.

"There are serious gaps to be filled in workforce expertise", he continued. "These gaps can be seen at all professional levels and sizes of institutions." He reiterated earlier talks' call for a paradigm shift from project to program, and the need to really think of this as an institutionalized activity.

Describing current projects, he said, "Digital preservation is also something we have highlighted in our education and training program. We have had very successful projects in recent years", giving the example of the University of Michigan's "virtual lab" in the works. He finished by saying that they can and are willing to work with smaller institutions and are using the tactic of offering money for staff attendance at workshops to engage these institutions.

A second Individual award went to Dr. Anthony Cocciolo (@acocciolo) for his approaches to teaching digital preservation practices.

Each award winner spoke for a few moments about their respective contributions.

After lunch, the crowd was split into breakout sessions. Documented here are the sessions the WS-DL group chose to attend. From Concurrent Session 1, we attended Tools Demo #1, regarding "Access". Demoed here were Viewshare and Neatline. Other sessions that we did not attend were:

"Bridging the Gap through Digital Stewardship Training and Education", with presenters George Coulbourne, Kris Nelson and Jefferson Bailey of the Library of Congress.

From the first concurrent session, Trevor Owens of the Library of Congress presented a demo of his open source tool Viewshare. Built upon open source tools itself, Viewshare is an easy way to visualize data. After a user signs up (and gets a confirmation of access in a couple of days), they have an account that allows them to upload data, or access data existing in the system, and manipulate it to ensure that their future uploaded data will be interpreted correctly.

Viewshare utilizes Dublin Core data and possesses GeoNames integration, allowing a user to aggregate various descriptors of a place to return latitude/longitude coordinates. Trevor repeatedly emphasized that "We're not editing data, we are only augmenting data", assuring the crowd that their data was safe. As Viewshare allows the manipulated data to be exported, this was important to point out to give those watching confidence in using the tool. Viewshare also contains user-customizable privacy options that will prevent embedding of the data and results and instead require access through a specific webpage to see the content, as controlled by the creator.

The second tool presentation of the session was from David McClure of the University of Virginia's Scholars' Lab. He presented his creation, Neatline, a set of plugins for Omeka that allow for external integration and visualization of data. In manipulating the data, he stated that "everything is a record", including the annotations that a user can add to an image. The use case he demonstrated was adding annotations to a pre-annotated Hotchkiss map to make it interactive and more usable.

"Once you have this archival collection", he said, "it gives you a way to expand on those materials and gives you an extra level of functionality and interactivity." By converting these annotations to be more interactive, he is also able to attach extra metadata to the marks to convey the relevance of the movement represented by the map's annotations.

Further work was done with an informal map created by a Confederate mapmaker in a letter to his daughter, roughly sketched on the bottom half of the third page of the letter. Using Neatline, he was able to overlay this on an authentic map of the area being described and display the other transcribed parts of the letter to give contextual relevance. Neatline is capable of utilizing geo-referenced historical maps from a GeoServer as well as a few other services (including Stamen map tiles). He also emphasized that Neatline puts a lot of focus on giving the user control over the appearance of the map, with customizable annotations. Further, images can be used as waypoints, and options exist to allow annotations to phase on and off if a temporal aspect is added to the aggregated annotations on the map. Annotations can also establish hierarchical relationships, where a child annotation inherits the settings of its parent annotation.

After a short break, the second set of concurrent sessions began. As before, the WS-DL team was only able to experience one of these sessions. Those missed were as follows.

David Minor (@dhminor) of Chronopolis at UCSD Library, Jeremy York of HathiTrust and Amy Kirchhoff of Portico presented on the topic of "Experiences with TRAC: Chronopolis, HathiTrust and Portico".

We were able to attend Tools Demo #4, as it pertained most to some of our research work. The first presentation of the session, titled "Web Archiving", was by Lisa Gregory (@gregorylisa) of the State Library of North Carolina, who presented "CINCH: Capture, INgest and CHecksum tool". The tool automates the transfer of online content to a repository using ingest technologies. Lisa asked, "How can we get those archives out of our repository and deposit them?"

The first step in Cinch, a web-based tool, is for a user to log in and upload a file consisting of a list of URLs for the system to download. The URLs can be uploaded in CSV or simple text format. To generate this list, users can utilize Archive-It, which outputs a PDF that Cinch is capable of taking as input.

Once Cinch locates the files at the URLs, it does a "duplicate check" to see if the system has downloaded a file before or if you have requested it before. If the file is a duplicate or some other error occurs, Cinch writes the errors to a "problem files" folder and provides an audit trail that allows the user to correct the issue.

Assuming success, once the file is on the destination file system, Cinch does a virus check, verifies and resets the last-modified date, extracts the metadata, calculates the checksum (currently SHA-1, but it will support MD5 in the future) and again checks for a duplicate of the current download to catch any resultant checksum issues.

The final step of the process is for the tool to package everything up in a zip file. Cinch currently allows files no greater than 0.5 GB; if the result exceeds this, it will be split into multiple zips. Once the files in the user-specified list have been completely received and processed, Cinch e-mails the user with a URL to access the content.
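The checksum-based duplicate check at the heart of this pipeline can be sketched in a few lines. This is an assumed illustration of the idea, not Cinch's actual code:

```python
# Minimal sketch of a CINCH-style duplicate check: hash each
# downloaded file and route repeats to a "problem files" list.
import hashlib

def sha1_of(data: bytes) -> str:
    """SHA-1 digest of a file's bytes (the checksum Cinch currently uses)."""
    return hashlib.sha1(data).hexdigest()

def ingest(files, seen=None):
    """Partition (name, bytes) downloads into new files and duplicates."""
    seen = set() if seen is None else seen
    fresh, problems = [], []
    for name, data in files:
        digest = sha1_of(data)
        if digest in seen:
            problems.append(name)      # would go to the "problem files" folder
        else:
            seen.add(digest)
            fresh.append((name, digest))
    return fresh, problems
```

Passing the `seen` set between runs is what lets the tool recognize files the system has downloaded before, not just duplicates within one batch.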

The second presentation of the demo session was by Kris Carpenter Negulescu of the Internet Archive on the tool Heritrix, a web crawler created with preservation in mind. "Instead of just grabbing the text," she said, "we wanted to create a body of software that will grab only the data as defined by the user. The complexity of the software is in how it can be configured to collect a subset of a resource. For example, one video or one image."

Heritrix obeys the robots.txt file of a website and only downloads permitted content in its crawl. "Unless you are an institution that has a good reason to violate the robots file", Kris said, "you ought to obey it." Heritrix also supports "revisit" records that allow the tool to bypass recording the payload of a resource that has not changed and instead just record metadata on the subsequent crawl.

Speaking of Heritrix's interoperability, Kris said, "It's inherently designed to integrate with other capturing tools and services." Heritrix has many bells and whistles as well as many options intended for very large scale crawling; for small crawls, however, it is overkill. "When you distribute a crawl, you can be very efficient with your collection", Kris reiterated. Heritrix outputs WARC files, though she emphasized that this format is intended for storage and not access.
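The robots.txt politeness check that Heritrix performs before each fetch can be illustrated with Python's standard-library parser (standing in for Heritrix's own Java implementation):

```python
# Sketch of the robots.txt check a polite crawler performs before
# fetching a URL; urllib.robotparser stands in for Heritrix's logic.
from urllib.robotparser import RobotFileParser

def allowed(robots_txt: str, agent: str, url: str) -> bool:
    """Return True if robots_txt permits `agent` to fetch `url`."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(agent, url)

RULES = """User-agent: *
Disallow: /private/
"""
# With these rules, a crawler skips /private/ but may fetch everything else.
```

A crawler would fetch `http://site/robots.txt` once per host, cache the parsed rules, and consult them for every candidate URL in the frontier.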

After another short break, the third set of concurrent sessions started. As before, we were only able to attend a limited number of sessions. Those that we could not were:

Funders Speed Dating, where conference attendees were able to get face-to-face with some funders, receive advice and pitch ideas.

Panel Discussion on Assessing and Mitigating Bit-Level Preservation Risks, an NDSA Infrastructure Working Group with presentations by John Spencer of BMS/Chace LLC, Priscilla Caplan of Florida Virtual Campus, Andrea Goethals (@andreagoethals) of Harvard University and Micah Altman (@drmaltman) of MIT Libraries.

A sixth tools demo titled "Ingest Tools", with presentations of CTS by Brendan Mannix (@b_mannix), Rosie Storey (@rosie_storey) and Kate Zwaard (@kzwa) of the Library of Congress and Open Source Digitization Tools by Kate Murray, Courtney Egan and Jeff Reed of the National Archives and Records Administration.

The session we did attend was Tools Demo #5 titled "Web Archiving" with a presentation of Archive-It by Lori Donovan of Internet Archive and WS-DL's own Mat Kelly presenting WARCreate.

Mat went first, presenting his Google Chrome extension that allows a user to create a WARC file from any browsable webpage. The process requires a user only to browse to the webpage they want preserved and click a single button. "To overcome some of the limitations in JavaScript and browser-instigated personal web archiving", Mat said, "I have also created a personal web archiving suite that not only makes up for browser extensions' capability but also provides additional tools to allow the user to leverage the web browser for personal web archiving."
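The WARC file such a tool produces is a simple container of headered records. As an illustration of the "response" record WARCreate assembles from the rendered page, here is a hand-rolled sketch using only the standard library (field values are examples, not WARCreate's actual output):

```python
# Illustrative serialization of a single WARC/1.0 "response" record:
# WARC headers, a blank line, the captured HTTP message, then a
# record terminator of two CRLFs.
import uuid
from datetime import datetime, timezone

def warc_response_record(uri: str, http_payload: bytes) -> bytes:
    """Serialize one WARC response record for the capture of `uri`."""
    headers = [
        ("WARC-Type", "response"),
        ("WARC-Target-URI", uri),
        ("WARC-Date", datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")),
        ("WARC-Record-ID", f"<urn:uuid:{uuid.uuid4()}>"),
        ("Content-Type", "application/http; msgtype=response"),
        ("Content-Length", str(len(http_payload))),  # length of the content block
    ]
    head = "WARC/1.0\r\n" + "".join(f"{k}: {v}\r\n" for k, v in headers)
    return head.encode() + b"\r\n" + http_payload + b"\r\n\r\n"

page = b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html>hi</html>"
record = warc_response_record("http://example.com/", page)
```

A full WARC file would open with a `warcinfo` record and typically be gzip-compressed per record; tools like Wayback can then replay the stored responses.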

He proceeded to show a capture of the conference's #digpres12 hashtag and that no further work must be done to make this preserved content replayable in a local Wayback instance. The suite provides an easy way for a user to utilize the WARCs generated by the tool through the included Wayback instance, configured for personal web archiving.

Lori Donovan of the Internet Archive followed Mat with a presentation of Archive-It, a hosted service that provides access, storage and collection specification for the resources at user-specified URLs. Archive-It allows users to specify crawl frequency among a plethora of other options for a completely customizable crawl.

Lori proceeded to show example collections, including the K12 Collection as well as the Japan Earthquake collection. Archive-It allows the collecting of news content as it unfolds and making sure that it's preserved. Archive-It also provides facilities to initiate "closure crawls" to capture specific content that may have been missed by the initial crawl. There are also efforts by Archive-It to "capture content before it goes away", much like Archive Team. Archive-It provides a simple way to get a list of all of the collections, as well as a UI to show which hosts the content was captured from and an Ajax-driven updated status of the crawl, showing errors, redirects, etc.

Lori then went through the various sorts of reports that a user can retrieve regarding their crawls, including a seed source report, which shows for each seed how many URLs and documents were captured. Multimedia like captured videos can have additional metadata attached in Archive-It, which allows a user to search the archive's metadata at the collection, seed and document level.

After she finished, an audience member asked what happens if a user is unable to continue to pay for the subscription, to which Lori replied that the archive is retained and there are procedures in place to allow the content to be hosted directly at the Internet Archive. After this last demo, Martha Anderson provided closing remarks to wrap up the conference.

Memento and Source Code Repositories — Harihar Shankar (LANL)

Memento allows temporal access to web resources using datetime. Version control services such as GitHub also allow temporal access, but using a version number instead of datetime. Harihar Shankar of the Los Alamos National Laboratory (LANL) Research Library presented Memento and a Memento/GitHub proxy prototyped at LANL. The proxy enables access to GitHub projects through datetime. For many use cases, datetime is much simpler than Git's 40-hex-character commit id.
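The core of such a proxy is datetime negotiation: given a requested datetime, select the latest version at or before it. A sketch of that selection logic (assumed here; not LANL's actual code — a real proxy would pull the commit history from the GitHub API):

```python
# Memento-style datetime negotiation over a version history:
# return the last commit made at or before the requested datetime.
import bisect
from datetime import datetime

def memento_for(history, accept_datetime):
    """history: list of (datetime, commit_sha) pairs sorted by datetime."""
    times = [t for t, _ in history]
    i = bisect.bisect_right(times, accept_datetime)
    if i == 0:
        return None                   # requested time precedes all commits
    return history[i - 1][1]

HISTORY = [
    (datetime(2012, 1, 5), "a1b2c3d"),   # shortened example shas
    (datetime(2012, 3, 9), "d4e5f6a"),
    (datetime(2012, 6, 1), "0f9e8d7"),
]
```

This is the same "closest memento at or before Accept-Datetime" rule that Memento TimeGates apply to web archive captures.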

A Research Agenda for “Obsolete Data or Resources” — Michael Nelson (ODU)

Old Dominion University’s Michael Nelson presented WAC’s research agenda for obsolete data and resources. His presentation covered the public’s misconceptions about web archiving, where the web archiving community can improve, the origin of the current notion of time on the web, the gaps bridged by Memento, and some of the progress made to date. Many details and examples are available in the slides.

Google knows how to index the Web and allow casual users to discover resources in mere seconds. Add time to the mix and current indexing and search solutions break down. Eric Hetzner described the challenges and approaches of temporal search currently being addressed at the California Digital Library (CDL). CDL has 49 public archives, 19 partners, and nearly 1 billion URLs across archives of 3684 web sites. Nearly 50 TB of archives must be stored, indexed, archived, and searched. CDL's current solutions, which use NutchWAX, do not easily allow for deduplication, metadata indexing, and other optimizations. These and other architectural limitations motivated CDL to begin building anew.

Laura Wynholds studies scientists and what they do with their data. She has been working with scientists from the Center for Embedded Network Sensing (CENS) and the Sloan Digital Sky Survey. At both, she has found a variety of data lifecycles and standards. She has found that data and its associated documentation is shared in many ways, from formal institutional stewardship and repositories to informal means such as email, FedEx, and web sites. Large, well-used data sets tend to have very good preservation arrangements. Medium and small data sets do not. However, many medium and small data sets are shared on the web and could be subject to web archiving. The web archive status of two data sets (The VLA FIRST Survey and COMPLETE) was assessed. Neither was well-represented in public web archives. Those data that were archived were not in formats required by scientists (e.g. low-resolution images). So, web archiving can preserve scientific data, but changes in selection criteria are required for web archiving to be truly effective.

Cathy Marshall presented her current findings on the public’s views on ownership and reuse of visual media. In the web archiving community we feel the need to preserve the historical Web just as libraries have traditionally preserved copies of books, newspapers, and magazines. Cathy’s research addresses the social issues with which we in the web archiving community must contend. Many photographs, blogs, and tweets are publicly accessible on the web, which makes archiving them technically simple. However, when people learn that their pictures and posts are being archived, they are frequently surprised and upset. This is especially true if the archiving organization is a government entity such as the Library of Congress. Much of Cathy’s presentation is covered in detail in her JCDL’12 paper “On the institutional archiving of social media”.

Another important consideration for web archivists is copyright. The “Legal Opportunities for Web Archiving” panel discussion focused on approaches to ensure web archiving is and remains free of legal burden and litigation. In the United States, copyright is derived from Article I, Section 8 of the Constitution and USC Title 17, Chapter 1. There are legal opportunities for web archiving in § 107 (Fair Use), § 108 (Libraries and Archives), § 109 (“First Sale” Doctrine), and § 110 (Non-profit performances). The panel discussed the structure of copyright and the issues and problems with copyright in the web archiving context. More information is available on the Berkeley Digital Library Copyright Project web site.

Web archives have been collecting information for nearly two decades, but making this information easily accessible to non-computer scientists continues to be a challenge. Andreas Paepcke is working with social scientists to build tools that allow high-level interaction with archives. The ArcSpread tool (Narrated demo) uses the Stanford WebBase as its data source. A spreadsheet metaphor provides a working environment familiar to most computer users.

Marc Spaniol is a member of the Longitudinal Analytics of Web Archive Data (LAWA) project, where he studies temporal aspects of Web evolution. A detailed description is presented in "Tracking entities in web archives: the LAWA project". Web archives are a gold mine of information, but we lack effective mining tools. Currently, entity tracking is a labor-intensive and tedious process. The relevant URIs must be known, and web archive searching is notoriously difficult. Additionally, following web archive links creates time diffusion, and web archive crawls suffer from temporal incoherence. Text-Entity-Time Analytics focuses on tracking entities (people, places, etc.) over time. The AIDA framework is an online tool for entity detection and disambiguation. Measuring temporal incoherence is key to understanding its sources. Spaniol has developed the SHARC framework, which allows incoherence measurement, and demonstrated that simple changes to crawling strategies will improve temporal coherence.

The Internet Archive (IA) currently holds over 176,000,000,000 resources that require nearly 3 petabytes of storage, stored as Web ARChive (WARC), CDX, and Web Archive Transformation (WAT) files. IA processes this mass of resources using Hadoop and Pig. The problem definition, big data description, and architectural overview presented by Aaron Binns were excellent. The slides contain many more details and are well worth a look even without Aaron’s live explanation.
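For readers unfamiliar with CDX files: each line of a CDX index is a space-delimited record describing one capture inside a WARC file, which is what makes the collection amenable to bulk processing with Hadoop and Pig. A minimal sketch of reading one record (assuming the common 11-field layout; real CDX files declare their actual fields in a header line, so treat these names as an assumption):

```python
# Sketch: parse one line of a CDX index, assuming the common 11-field layout.
# Real CDX files begin with a header line naming the fields actually present.
CDX_FIELDS = ["urlkey", "timestamp", "original", "mimetype", "statuscode",
              "digest", "redirect", "metaflags", "length", "offset", "filename"]

def parse_cdx_line(line):
    """Split a whitespace-delimited CDX record into a field -> value dict."""
    return dict(zip(CDX_FIELDS, line.split()))

record = parse_cdx_line(
    "org,example)/ 20120723000000 http://example.org/ text/html 200 "
    "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA - - 1043 5120 crawl-00001.warc.gz"
)
```

The `offset` and `filename` fields are what let a replay or analysis job seek directly to one capture inside a multi-gigabyte WARC file without scanning it.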

When it comes to big data, few would argue that Facebook has more data to crunch than nearly anyone else. Sameet Agarwal manages Facebook’s 100PB (yes, petabytes!) Hadoop cluster, the largest Hadoop cluster in the world. Facebook’s needs have driven it to contribute to Hadoop and to lead the development of Hive, a peta-scale data warehouse built on Hadoop. This data warehouse has been the source for several interesting studies, including the recently-publicized reduction of six degrees of separation to four (actually 4.74). While a 100PB Hadoop cluster may seem like a solved problem, many issues still need research and resolution. How do you keep a 100PB cluster running? How do you fairly allocate resources to multiple tenants? How do you coordinate multiple, geographically dispersed clusters? Currently, log data from www.facebook.com is delivered overnight; how can this latency be reduced or eliminated? Facebook’s data is naturally a graph: is a set of tables the best way to represent it, and is converting graph computations into a set of map-reduce jobs the right approach?
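The "four degrees" figure is an average shortest-path length over the friendship graph. As a toy illustration only (a hypothetical four-user graph, not Facebook's data or their map-reduce pipeline), breadth-first search computes the degrees of separation between two users:

```python
from collections import deque

# Sketch: degrees of separation via breadth-first search on a toy
# friendship graph (hypothetical data, purely illustrative).
friends = {
    "alice": {"bob"},
    "bob":   {"alice", "carol"},
    "carol": {"bob", "dave"},
    "dave":  {"carol"},
}

def degrees(start, goal):
    """Return the number of friendship hops from start to goal,
    or None if the two users are not connected."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        user = queue.popleft()
        if user == goal:
            return dist[user]
        for friend in friends[user]:
            if friend not in dist:
                dist[friend] = dist[user] + 1
                queue.append(friend)
    return None
```

At Facebook's scale this single-source search becomes a massive distributed computation, which is exactly the tension behind the "are map-reduce jobs the right fit for graph data?" question.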