Friday, May 18, 2012

The International Internet Preservation Consortium (IIPC) held its 2012 annual general assembly from April 30 to May 4, 2012, at the Library of Congress in Washington, D.C. I compiled this report from tweets tagged #IIPC12 during the meeting and from my personal notes. In it, I try to assess how complete a view of an event the tweets about it can give. More details about this new approach will be in the blog comment.

The first day, April 30, 2012, was open to the public. It was entitled "The Broad Value of Web Archives: Demonstrated Use". @hhockx: #iipc12 opened just now by Martha Anderson. Laura Campbell welcomes the participants. @netpreserve: IIPC starts with 11 members, now has 42. @gregorylisa: Apropos quote at #IIPC12 "The great use of life is to spend it for something that will outlast it." William James.

@netpreserve: Gildas Illien from BnF sets the stage for researcher use case panel. @cleymour: Gildas Illien: from 3 to 6 university libraries in IIPC in 2012, signal of a stronger link with researchers. @cleymour: Kalev Leetaru from university of Illinois opens his speech with a demonstration of what big data is. Leetaru's talk was entitled "A decade and a half of archiving the web for data mining: Lessons learned and how users use web archives". @MarthaBunton: 8 billion words a day added to written record on Twitter daily. @MarthaBunton: Most web archives today are black boxes--need context of captures. @saraaubry: crawl scope and policies are helpful to evaluate the original websites and its archives. @HRDocumentation: What's worth keeping on the web? Need to document why we choose to capture certain content. @cleymour: big issue: no way to access the archive in its entirety, because archives designed primarily for a small usage.

@netpreserve: Next up: Ian Soboroff from NIST. His talk was entitled "How web archives are used in the Text REtrieval Conference (TREC)". @lljohnston: Ian Soboroff talking about research into search and the TREC program. Search is not a fully solved problem. @marthaindc: Everything you do has search built into it--email, mobile phone, word processors. Ian Soboroff: search is indispensable. Ian also discussed how to judge whether a response to a search query is a good one.

@netpreserve: Bruce Hoffman from Georgetown University now talking about use of web archives for terrorism research. Between 1998 and 2006, new communication media came into use. @alexisan75: Hoffman: terrorists can frame their own message on Web in ways not possible before Internet. @marthaindc: if you don't have a website, you don't exist as a terrorist. @agrotke: Most terrorist websites today password protected. Major barrier for archivists.

@netpreserve: Next is Monica Omodei from National Library of Australia speaking on trends in Pandora archive. @kboughida: Bad news no legal deposit legislation (Australia) missed web collecting content lack of permission. After analyzing the access logs, they found several patterns of use. @hhockx: People tend to use web archives to look for websites which have disappeared from the live web, former Prime Minister website that has been changed after the election, redirection to a new name or used as an archive from the live site.

The following session covered business uses of Web archives. @netpreserve: Rod Wittenberg from Reed Technology on web archives use in legal industry. @hhockx: Archiving websites as evidence to fight piracy and infringement. @cleymour: Rod Wittenberg: examples of cases where the judge decided if an archived website was authenticated or not. Reed Technology preserves the fake sites at the highest quality so that the captures can be used in court. @kboughida: Rod Wittenberg: re Rule 901. AUTHENTICATING OR IDENTIFYING EVIDENCE for courts Lawyers create pdf from web pages. @kboughida: Rod Wittenberg: sha-1 is used for every web page as digital signature. Accepted as legal.
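To make the SHA-1 remark concrete, here is a minimal sketch of fingerprinting a captured page and checking it again later. Strictly speaking, SHA-1 is a hash function and not a signature scheme, so the digest acts as a fingerprint of the captured bytes. The function names and the sample page are my own assumptions for illustration; the talk did not describe Reed Technology's actual workflow.

```python
import hashlib


def sha1_fingerprint(page_bytes: bytes) -> str:
    """Return the hex SHA-1 digest of a captured page's raw bytes."""
    return hashlib.sha1(page_bytes).hexdigest()


def matches_fingerprint(page_bytes: bytes, recorded_digest: str) -> bool:
    """Check that a stored capture still hashes to the digest recorded at capture time."""
    return sha1_fingerprint(page_bytes) == recorded_digest.lower()


# Hypothetical capture: record the digest when the page is archived ...
captured = b"<html><body>Example captured page</body></html>"
recorded = sha1_fingerprint(captured)

# ... and verify it later; changing even one byte gives a different digest.
assert matches_fingerprint(captured, recorded)
assert not matches_fingerprint(captured + b" ", recorded)
```

The digest only shows that the bytes have not changed since capture; it says nothing about who made the capture or when.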

@netpreserve: About to start day 2 of #IIPC12 - the general assembly meeting of IIPC members. @netpreserve: You may have noticed we went from orange to green with our avatar, we are rolling out our new logo today! @kboughida: @MarthaBunton: explaining the new logo: Angle brackets for tech work. Blue color of trust + green

@kboughida: Daniel Chudnov, George Washington University talking now about GW libraries and web archiving. By the end of 2012, their goal is to select in depth and capture a modest collection of web archive materials. Collection areas are history and international relations, matching research at GW. Dan also reminded the audience that, as @kboughida tweeted: developers are people too (audience laughs).

Rick Fitzgerald from the Library of Congress gave an update about "HIVE for LC Web Archives: Web Archives and Automatic Subject Indexing". @hvdsomp: Ongoing experiment at Library of Congress looks into automatic classification of web archive with LCSH.

Leïla Medjkoune from Internet Memory gave an update about LAWA (Longitudinal Analytics of Web Archive Data). Analysis of Web data is essential for many R&D services. The challenges include scalability, selection of essential resources, multilingual support, and handling the time dimension.

Masaki Shibata from the National Diet Library gave an update about "Web Archiving in 2012 at National Diet Library". Masaki presented "Web Archiving 3.11 Japanese earthquake & Tsunami". Crawling started on a daily basis; over time the frequency dropped to weekly, then monthly. The volume of data is about 4 TB/month. He announced that a new deduplication system will be in use by the end of 2012.
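The talk did not go into how the new deduplication system works, so the following is only a generic sketch of one common approach: content-hash deduplication, where repeated crawls that fetch identical bytes store the payload once and record later captures as references to it. The `PayloadStore` class, its fields, and the example URL and dates are all hypothetical.

```python
import hashlib


class PayloadStore:
    """Hypothetical store that keeps each distinct payload once, keyed by its digest."""

    def __init__(self):
        self.payloads = {}  # digest -> payload bytes, stored once
        self.captures = []  # (url, crawl_date, digest) for every capture

    def add_capture(self, url: str, crawl_date: str, payload: bytes) -> bool:
        """Record a capture; return True only if the payload was new and had to be stored."""
        digest = hashlib.sha1(payload).hexdigest()
        is_new = digest not in self.payloads
        if is_new:
            self.payloads[digest] = payload
        # Every capture is logged, even when its payload is a duplicate.
        self.captures.append((url, crawl_date, digest))
        return is_new


# Two daily crawls of an unchanged page store the bytes only once.
store = PayloadStore()
store.add_capture("http://example.org/", "2011-03-12", b"unchanged page")
store.add_capture("http://example.org/", "2011-03-13", b"unchanged page")
assert len(store.payloads) == 1
assert len(store.captures) == 2
```

With frequent crawls of pages that rarely change, this kind of scheme saves the most storage, which is presumably why deduplication matters at a rate of several terabytes per month.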

Helen Hockx-Yu from the British Library gave the British Library update. @netpreserve: Helen highlighting recent activities at British Library: new access tool, QA module improvements to web curator tool.

@kboughida: Barbara Sierman from KB National Library of the Netherlands talking about SCAPE project SCAlable Preservation Environments. Five IIPC members are involved in SCAPE, which aims to provide infrastructure and tools for scalable preservation actions, a framework for automated, quality-assured preservation workflows, and integration of these components with policy-based automated preservation planning and watch.

Working Group meetings were held on the third day. @netpreserve: Working group meetings starting today: access, harvesting, and preservation.

Thursday May 3 2012 - Workshops & Cross Working Group meetings

The fourth day was for Workshops and Cross Working Group meetings. The Web Archiving "Lifecycles" Workshop started with an introduction from Kris Carpenter from Internet Archive. Kris discussed the main challenges in the Web archiving life cycle, and an open discussion about these challenges followed. The NetarchiveSuite workshop ran in parallel. It covered the curatorial and technical aspects of integrating NetarchiveSuite into an automated Web harvesting workflow and using it day to day, with a focus on crawl preparation: scheduling, packaging, configuring, and data structuring.

There were two other afternoon workshops. The Legal Roundtable brought Web archivists and lawyers together to discuss and compare the impact of international and national legislation and policies on Web archiving activities. In parallel ran "Harvesting and Preserving the Future Web", which was divided into three panel discussions: Capture, Replay, and Scale. David Rosenthal wrote about this workshop in his blog.