IIPC 2015 Recap

I had a fantastic time at the International Internet Preservation Consortium’s Annual General Meeting this year, held on the beautiful campus of Stanford University (with a day trip to the Internet Archive in San Francisco). It’s hard to write these sorts of recaps: I had such an amazing time, my head filled with great ideas, that it’s difficult to give everything the justice it deserves. Many of the presentation slide decks are available on the schedule, and videos will be forthcoming.

My main takeaways: we’re continuing to see the development of sophisticated access tools for web archives, coupled with increasingly exciting and sophisticated researcher use of them. There’s a recognition that context matters when understanding archived webpages, a phrase that came up a few times throughout the event. Crucially, there was a lot of energy in the room: there’s a real enthusiasm for making these archives as accessible as possible and facilitating their use. I wasn’t exaggerating when I noted to one of the organizers that I wish every conference were like this: leaving me on my flight home with lots of fantastic ideas, hope for the future, and excitement about what can be done. As the recent “Conference Manifesto” in the New York Times noted, that’s not the experience at all conferences!

Read on for a short day-by-day breakdown, with apologies for presentations I couldn’t include or didn’t do full justice to:

Day One

The first day kicked off with a keynote from Vint Cerf and Mahadev Satyanarayanan, who set the tone well. I’d been a bit hesitant about Cerf’s talk given the reaction it engendered from digital preservationists and web archivists, but his general “why this matters” point about what the 22nd-century Doris Kearns Goodwin would find in our archives was a good one. Satyanarayanan, or Satya, then advanced his own solution of streaming virtual machines. It was a cool idea (like seriously), but more interesting to me was the marrying of technical solutions to the big-picture questions raised by Cerf. Niels Brügger and Ditte Laursen then discussed national web domains, explaining both how one could (and why one should) study a nation’s web domain – in their case, the .dk domain.

A landmark presentation came from Andy Jackson – I’d recommend downloading his slides here (PPTX). Through large-scale data mining, the UK Web Archive team has been able to plot how many sites have disappeared as well as the pace at which they change. I tweeted some of the slides:

Following these pieces, a series of presentations did a great job of exploring both how people interact with technology and what they think about archiving it. Cathy Marshall in particular weighed the various ideas around whether Facebook should or shouldn’t be archived, and I found the interviews she did fascinating. People were hesitant: they foresaw ugly future uses of a Facebook archive, they didn’t see it as an archive but rather as a compilation of social connections, and as this slide showed, they just didn’t see the value:

Then Meghan Dougherty and Susan Aasman gave very good complementary presentations, looking at personal digital archiving and “everyday digitally lived life.” Aasman’s talk had a great rumination on different dynamics of memory saving, the impact of storage and memory on this, and the chances we’re taking with our digital heritage (check out her slides from the schedule page). Dougherty’s wide-ranging talk had quite a bit on how people actually use their technology: webcams tracing how people interact with Facebook, for example. It raised several questions about how we might not be preserving the right things – we’re preserving content (text, etc.) but not always a good sense of how people actually use these things.

Rounding out the first day was a great panel on the UK Web Archive and research projects carried out with it. Jane Winters provided a great summary of what their researchers wanted – ethics, tools, corpora, and no black boxes – and also the challenges we historians bring to the table (i.e. our deep disengagement with quantitative methods). Helen Hockx-Yu gave a good overview of the UK Web Archive holdings, size of data tranches, and how people can interact with them using the prototype Shine Interface. Finally, Josh Cowls rounded off the day with a look at research that had been done using UK Domain Data, news of a forthcoming edited collection, and a spotlight on four cool projects: the growth of the UK Web; the evolution of the BBC’s online presence; the evolution of UK universities’ online presence; and a sense of UK Web cultures.

What a day!

Day Two and Beyond

This is turning into a play-by-play so I’ll be a bit briefer.

Day Two was the “Open Workshops Day” and largely had two parallel tracks: one focused roughly on content, the other on technical solutions. Herbert Van de Sompel kicked off a discussion of the Memento Time Travel Portal, which lets you reconstruct various websites using different archived versions – i.e. you can “reconstruct” a page that has components from different crawls, becoming more complete. Most importantly, it has a graph up along the top that lets you see where various things come from. Daniel Gomes, of the Portuguese Web Archive, then gave a great presentation on different web archive information retrieval undertakings they’ve been doing. One in particular used machine learning to enhance search, which I found really intriguing and exciting. Such amazing stuff!

The rest of the day I was off learning about content: processes to identify national parts of the Internet (Eld Zierau); how to use social media and other processes to create a web archive around disasters, both natural and man-made (Mohammed Farag); how Archive-It is deploying amazing research datasets (which we’ve discussed earlier, but it was nice to hear it from Jefferson Bailey himself); and fascinating digital archeology projects spanning from restoring SLAC’s first website (Ahmed AlSum) to the University of Bologna’s page (Federico Nanni). I also gave a presentation on my own work – you can view my slides here.

Finally, Michael Nelson gave a sobering presentation on the degree of temporal violation within many web archives – where for example an image was captured a month later than the rest of the text on the page, and all mashed together in the Internet Archive. We might have a separate post about this later.
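
To make the idea of temporal violation concrete, here is a minimal sketch in Python. It is not the method from Nelson’s presentation, just my own illustration: the URLs, the timestamps, the one-day tolerance, and the function names are all made up. It takes the 14-digit capture timestamp of a page, compares it to the capture timestamps of the resources embedded in it, and flags anything captured too far away in time.

```python
from datetime import datetime, timedelta

# Hypothetical capture records: (resource URL, 14-digit capture timestamp).
# These URLs and timestamps are invented for illustration only.
PAGE_CAPTURE = ("http://example.com/", "20150401120000")
EMBEDDED_CAPTURES = [
    ("http://example.com/style.css", "20150401120003"),
    ("http://example.com/photo.jpg", "20150502093000"),
]


def parse_ts(ts: str) -> datetime:
    """Parse a YYYYMMDDhhmmss capture timestamp."""
    return datetime.strptime(ts, "%Y%m%d%H%M%S")


def temporal_drift(page, embedded, tolerance=timedelta(days=1)):
    """Return (url, drift) for embedded resources captured more than
    `tolerance` away from the page itself. The tolerance is an arbitrary
    assumption, not an established threshold."""
    page_time = parse_ts(page[1])
    flagged = []
    for url, ts in embedded:
        drift = abs(parse_ts(ts) - page_time)
        if drift > tolerance:
            flagged.append((url, drift))
    return flagged


if __name__ == "__main__":
    for url, drift in temporal_drift(PAGE_CAPTURE, EMBEDDED_CAPTURES):
        print(f"{url} was captured {drift.days} days away from the page")
```

In this toy example the stylesheet was captured seconds after the page and passes, while the image was captured about a month later and gets flagged, which is the kind of mismatch the talk was describing.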

On Day Three, we went down to the beautiful Internet Archive and participated in discussions around the IIPC itself: new ideas for membership models, revenue streams, researcher participation (please let me keep coming! ;)), and fascinating talks on how to profile contents of a web archive, as well as invited talks by Brewster Kahle and Abby Smith Rumsey. Kahle spoke about the various ambitious undertakings the Internet Archive has carried out and is carrying out, and made a plea for co-operative collection strategies, distributing presentations, and building libraries together. Rumsey’s talk was a neat reflection on the future of memory in a digital age. I took lots of notes but don’t want to fill out this post much more.

Finally, on Day Four, we had a great session on both WAT files and full-text search. It was a bit technical, but left me full of neat ideas for more sophisticated and formal ways of implementing full-text search on my own collections, and some other tricks with WAT files, and feeling very confident and happy about the field. I tweeted a lot.

Overall

Overall, the IIPC remains one of my favourite conferences: a supportive community, a great mix of content and technological analyses, and a real sense of wonder and ambition about the way forward. This year quite a bit of emphasis was on the issues of contextualizing findings from web archives: both within their culture and within their place on the web. WAT files, as I argued, make a lot of this possible, so I found that those sessions brought both aspects together well.

A bit of a long post, but it was a long conference. I can’t wait for IIPC 2016 in Iceland next summer!