<h1>Continuous, incremental, scalable, higher-quality web crawls with Heritrix</h1>
<h2 id="abstract">Abstract</h2>
<p>Under Legal Deposit, our crawl capacity needs grew from a few hundred time-limited snapshot crawls to the continuous crawling of hundreds of sites every day, plus annual domain crawling. We have struggled to make this transition, as our Heritrix set-up was cumbersome to work with when running large numbers of separate crawl jobs, and the way it managed the crawl process and crawl state made it difficult to gain insight into what was going on and harder still to augment the process with automated quality checks. To attempt to address this, we have combined three main tactics; we have moved to containerised deployment, reduced the amount of crawl state exclusively managed by Heritrix, and switched to a continuous crawl model where hundreds of sites can be crawled independently in a single crawl. These changes have significantly improved the quality and robustness of our crawl processes, while requiring minimal changes to Heritrix itself. We will present some results from this improved crawl engine, and explore some of the lessons learned along the way.</p>
<h2 id="introduction">Introduction</h2>
<p>Since we shifted to crawling under Legal Deposit in 2013, the size and complexity of our crawling has massively increased. Instead of crawling a handful of sites for limited periods, we have hundreds of sites to crawl every day, and a domain crawl to perform at least once a year. But neither the technical team nor the QA team is any bigger than it was before. So, while we’re sticking with Heritrix as our core crawl engine, we’ve been experimenting with a range of modifications to it (and our supporting systems) over the last year, with the following goals in mind:</p>
<ul>
<li>High level of automation</li>
<li>High level of transparency</li>
<li>Capture JavaScript-heavy web pages</li>
<li>Stable domain crawls</li>
<li>Continuous crawling</li>
</ul>
<p>Our highest priority is automation of any and all automatable tasks and, where necessary, changing Heritrix to make common tasks easier to automate. Crucially, this includes automating recovery from common errors and interruptions, such as network outages, server maintenance and so on.</p>
<p>However, when we can’t resolve issues automatically, we need to be able to tell what’s going on. This means we need to be able to monitor progress easily, and be able to inspect what the crawler has been doing so we can debug problems. With this in mind, we’ve done quite a lot of work modifying Heritrix to be less of a ‘black box’ by moving or cloning data usually managed by Heritrix into external systems.</p>
<p>As we’ve discussed at previous conferences, we need to be able to cope with JavaScript-heavy websites by integrating browser-based crawling into the crawl process. We’ve been doing that for a while, but it’s only recently that we’ve really been able to see the benefits.</p>
<p>Even when all that is working fine, we’ve had repeated problems ensuring long, large crawls are stable, so we’ve also spent time looking at that over the last year. If you’d like to hear about the twelve-year-old Heritrix bug I managed to reproduce and resolve during all this, or indeed about any of these other areas, please come and talk to me later on.</p>
<p>However, in this presentation I’d like to focus on the work we’ve done to enable us to use Heritrix to perform continuous crawls.</p>
<h2 id="continuous-crawling">Continuous Crawling</h2>
<p>One of the most significant changes under Legal Deposit is the need for continuous crawling, and the case of news sites is particularly crucial. Under selective archiving, we might choose one or two news articles a day to archive. Now, we are expected to get the latest news from all UK news sites at least once a day, and ideally more often than that.</p>
<p><a href="http://sheepfilms.co.uk/2006/12/29/float-and-crawl/"><img src="/building-web-archives/images/floatcrawl.gif" alt="float crawl" /></a></p>
<p>But in terms of automation, we found that having hundreds of different Heritrix crawl jobs was simply not practical. The way Heritrix works means it’s difficult to manage the available resources effectively if there are lots of small jobs, and the automation of the stopping and starting of crawl jobs becomes very challenging when there are so many.</p>
<h2 id="batch-crawling">Batch Crawling</h2>
<p>Therefore (like many other institutions), we’ve handled all these little jobs using a smaller set of batch crawls. Specifically, we’ve run six crawl streams with different frequencies, and each crawl gets stopped and restarted on that time-scale. For example, the daily crawl launches once a day, at midday, and runs for one day. This means we get the news sites once a day, but also means we never get more than one day’s worth of crawl time on each site. If we want to get a deeper crawl, we need to duplicate the crawl activity in a separate stream (and risk putting too much pressure on the publisher’s web site), or rely on the domain crawl to pick up everything else (which, as indicated earlier, has not proven reliable).</p>
<p><a href="/building-web-archives/images/batches.odg"><img src="/building-web-archives/images/batches.png" alt="Batch crawl schedule" /></a></p>
<p>This has worked okay, but we wanted to improve things further. For example, being restricted to the batch time frames means we can’t harvest sites when our curators would like us to do so. Similarly, the artificially high load created by launching all the daily crawl activity at once makes it difficult to ensure our crawl seeds are successfully passed through the browser-based rendering engine we use.</p>
<p>What we’d rather do is run one big crawl job, where different sites (or different parts of the same site) can be re-crawled at different intervals. But this is not how Heritrix was designed to work, and we met a number of barriers when trying to work this way:</p>
<ul>
<li>Stability &amp; restartability</li>
<li>Quotas &amp; statistics</li>
<li>Unique URI filtering</li>
</ul>
<p>Firstly, we need Heritrix to be stable when run for a long time, but also reliably resumable if we do need to stop and restart it for some reason. This may sound obvious, but we’ve met a number of challenges while trying to get this part right!</p>
<p>Secondly, we still want to apply quotas to crawls and monitor statistics, so we need some way to manage or reset this data when we start re-crawling.</p>
<p>Finally, and most significantly in terms of changes to Heritrix, we need to be able to change how it filters out unique URIs as the crawl proceeds.</p>
<h2 id="already-seen">Already Seen?</h2>
<p>Every crawler has some kind of record that remembers which URLs it’s already seen. Without that, the crawler would constantly crawl and re-crawl common URLs over and over again. We’d have a million copies of the BBC News homepage and barely any of the articles!</p>
<p><a href="/building-web-archives/images/uri-filter.jpg"><img src="/building-web-archives/images/uri-filter.jpg" alt="Unique URI Filtering" /></a></p>
<p>A standard Heritrix crawl uses a special and highly efficient data structure called a Bloom filter to do this. This works great for batch crawling: you put in the URLs you’ve dealt with, and if you see the same URI again it will tell you so. But it’s not a database, and so it can’t do anything else. You can’t ask it to ‘forget’ a URL. You can’t ask it <em>when</em> you saw that URL, or what happened last time.</p>
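<p>To make this limitation concrete, here is a minimal, illustrative Bloom-filter-style sketch in Python. It is not the data structure Heritrix actually uses, and the sizes and hashing scheme are arbitrary, but it shows why this kind of filter can only ever answer ‘have I seen this before?’ and nothing else:</p>
<pre><code>import hashlib

class AlreadySeen:
    """A toy Bloom-filter-style 'already seen?' check (illustrative only)."""

    def __init__(self, size_bits=2 ** 24, hashes=4):
        self.size = size_bits
        self.hashes = hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, url):
        # Derive several bit positions from the URL.
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{url}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, url):
        for pos in self._positions(url):
            self.bits[pos // 8] |= 1 &lt;&lt; (pos % 8)

    def seen(self, url):
        return all(self.bits[pos // 8] &amp; (1 &lt;&lt; (pos % 8)) for pos in self._positions(url))

seen = AlreadySeen()
seen.add("https://www.bbc.co.uk/news")
print(seen.seen("https://www.bbc.co.uk/news"))   # True
print(seen.seen("https://www.bbc.co.uk/sport"))  # False (with high probability)
# There is no seen.when() or seen.forget() -- the filter only answers yes or no.
</code></pre>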
<p>For that kind of detail, you need a real database of some kind. But this database will end up with billions of entries, and we really don’t want to take on any more technologies because managing large databases is hard. If only we had some kind of standard system we already support that indexes what we’ve captured.</p>
<p>A capture index.</p>
<p>Wait…</p>
<h2 id="that-seems-familiar">That seems familiar…</h2>
<p><a href="/building-web-archives/images/wabac.jpg"><img src="/building-web-archives/images/wabac.jpg" alt="WABAC MACHINE" /></a></p>
<p>A capture index is precisely what any web archive playback system needs. We used to do this using big CDX files, but recently we’ve started using a dedicated database in the form of OutbackCDX.</p>
<h2 id="outbackcdx">OutbackCDX</h2>
<p><a href="https://github.com/nla/outbackcdx">OutbackCDX</a> is a dedicated CDX service built for web archives. It stores just what you need for playback, and is easy to integrate with playback tools. It’s fast, efficient, and because it’s a real database it can be updated and queried in real time, rather than relying on batch updates. In short, it’s awesome, and as there’s likely a few NLA staff members here I’d like to take this opportunity to thank the National Library of Australia for making OutbackCDX openly available.</p>
<p><a href="/building-web-archives/images/thats-not-a-cdx.jpg"><img src="/building-web-archives/images/thats-not-a-cdx.jpg" alt="That's a CDX index" /></a></p>
<p>So, we’ve found that as well as being great for playback, it’s also handy for mid-crawl data, and we already know how to use it and how to manage large indexes. So what does a crawl look like with OutbackCDX in the loop?</p>
<h2 id="relaunching-a-crawl">(Re)launching a Crawl</h2>
<p><a href="/building-web-archives/images/launch.jpg"><img src="/building-web-archives/images/launch.jpg" alt="Crawl architecture" /></a></p>
<p>Well, to start the story, we need a new way of launching a crawl. We no longer have individual jobs, so instead we have a single, long-running crawl job that we’ve coupled to a message queue - in our case we’re using Apache Kafka. So, when the time comes, we drop a ‘launch’ message on this queue, which the crawl job picks up. It takes the seed URL, and sets up the crawl configuration for that URL using Heritrix’s ‘sheets’ configuration system. This defines the re-crawl frequency (i.e. whether that site should be crawled daily/weekly/etc.) and other parameters like resetting crawl quotas, whether to obey <code>robots.txt</code> and so on.</p>
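<p>As a rough illustration of this step, this is what enqueuing a launch request could look like. It is only a sketch: the Kafka topic name and the message fields used here are hypothetical, not our actual message schema.</p>
<pre><code>import json
from kafka import KafkaProducer  # kafka-python client

# Drop a 'launch' message onto the crawl queue. The topic name and the field
# names below are illustrative, not the real message format.
producer = KafkaProducer(
    bootstrap_servers="kafka:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

launch_message = {
    "url": "https://www.gov.uk/government/publications",
    "recrawl_frequency": "DAILY",    # used to pick the right Heritrix 'sheet'
    "ignore_robots_txt": False,
    "reset_quotas": True,
}

producer.send("uris-to-crawl", value=launch_message)
producer.flush()
</code></pre>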
<p>The seed URL and all subsequent URLs are passed through the usual scoping rules, plus an additional step that calls out to OutbackCDX to see when we last dealt with that URL. This <code>RecentlySeenURIFilter</code> works out how long ago the URL was last crawled, and whether that falls within the re-crawl interval. If we have already downloaded the item recently enough, the URL is discarded. However, if a re-crawl is due, it gets enqueued into the Heritrix frontier.</p>
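<p>The decision logic is roughly as follows. This is a simplified Python sketch of the idea rather than the actual Java filter inside Heritrix, and it assumes an OutbackCDX collection that answers plain CDX queries of the form <code>?url=&lt;url&gt;</code>; the server address, collection name and interval table are illustrative.</p>
<pre><code>import datetime
import requests

CDX_SERVER = "http://cdx:8080/crawl-index"  # hypothetical OutbackCDX collection

RECRAWL_INTERVALS = {
    "DAILY": datetime.timedelta(days=1),
    "WEEKLY": datetime.timedelta(weeks=1),
    "MONTHLY": datetime.timedelta(days=30),
}

def recrawl_due(url, frequency):
    """Return True if the URL should be enqueued, False if it was seen recently."""
    response = requests.get(CDX_SERVER, params={"url": url})
    lines = [l for l in response.text.splitlines() if l.strip()]
    if not lines:
        return True  # never crawled before
    # CDX lines start with: urlkey timestamp original ... -- take the latest capture.
    last_timestamp = max(line.split(" ")[1] for line in lines)
    last_crawl = datetime.datetime.strptime(last_timestamp, "%Y%m%d%H%M%S")
    return datetime.datetime.utcnow() - last_crawl &gt; RECRAWL_INTERVALS[frequency]
</code></pre>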
<p>Later, after attempting to download the URL, the crawler registers the outcome in OutbackCDX. The resulting WARCs and log files accumulate as normal, and, as in any batch crawl, material from multiple hosts is stored in a single unified stream of archived resources and metadata.</p>
<h2 id="advantages">Advantages</h2>
<p>This approach to continuous crawling has had a number of advantages:</p>
<ul>
<li>The One Big Job</li>
<li>Re-crawl or Refresh?</li>
<li>Efficient De-duplication</li>
<li>Real-time Playback</li>
<li>Access Crawler Screenshots</li>
</ul>
<p>The main advantage is simply that we crawl the sites the curators want, on the time-scales they have requested. Because this is all done in one large job, this is relatively easy to manage and monitor.</p>
<p>A second advantage comes from the new opportunities this arrangement enables. Because we can set the re-crawl frequency for individual pages and for sections of sites, we can start to explore the possibilities for distinguishing between doing a full re-crawl of a site, and just refreshing parts of a site so we pick up new URLs. We can now easily force the crawler to re-crawl the BBC News homepage every few hours in order to pick up new links, but allow more time for the crawler to pick up the deeper links.</p>
<p>Another plus is that OutbackCDX also stores the checksums of the content downloaded previously, and so can be used to de-duplicate the WARC records and avoid storing multiple copies of resources that haven’t changed. We were already doing this, but using Heritrix’s <code>PersistLog</code> database, which worked well for short crawls but has caused major problems for larger and long-running crawls. Using OutbackCDX for this is much more efficient in terms of disk space requirements, and allows a single deduplication database to be shared between multiple crawlers.</p>
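<p>The deduplication decision itself is simple, and can be sketched in the same style as above. Again, this assumes the hypothetical OutbackCDX address used earlier and the standard CDX field order (with the payload digest in the sixth field); in a real crawl the crawler would then write a WARC ‘revisit’ record rather than a second copy of the payload.</p>
<pre><code>import requests

CDX_SERVER = "http://cdx:8080/crawl-index"  # hypothetical, as above

def is_duplicate(url, new_payload_digest):
    """True if this payload digest matches an earlier capture of the same URL."""
    response = requests.get(CDX_SERVER, params={"url": url})
    for line in response.text.splitlines():
        fields = line.split(" ")
        # Standard CDX field order: urlkey, timestamp, original, mimetype,
        # statuscode, digest, ...
        if len(fields) &gt; 5 and fields[5] == new_payload_digest:
            return True  # unchanged: record a revisit instead of storing a new copy
    return False
</code></pre>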
<p>And of course, OutbackCDX is a plain old capture index, so we can also hook up a playback service and use it to inspect the results of the crawl where needed. This works in real-time, with content becoming available immediately after capture.</p>
<p>Finally, OutbackCDX also gives us a handy place to store references to the WARC records that capture what happened when we used the embedded web browsers to capture our seed pages.</p>
<h2 id="crawler-screenshots">Crawler Screenshots</h2>
<p><a href="/building-web-archives/images/screenshots-dashboard.jpg"><img src="/building-web-archives/images/screenshots-dashboard.jpg" alt="Screenshots dashboard" /></a></p>
<p>For example, by monitoring activity via the Kafka queues and combining that with information in OutbackCDX, we’ve finally been able to put together a simple dashboard that shows what the browser saw when it rendered the original web sites during the crawl. It’s been great to finally see what the crawler is doing, and makes it easy to spot the kind of problems that lead to blank pages.</p>
<p>Better still, because we’re also running Python Wayback, which supports HTTPS in proxy mode, we can easily point the rendering service at the archived copy, and re-render what we’ve managed to download in order to compare what we saw with what we got.</p>
<p>For example, here’s a page from the main UK Government publications site, with a strange little difference in the ordering of the entries between the original and the archived version.</p>
<p><a href="/building-web-archives/images/gov-comparison.png"><img src="/building-web-archives/images/gov-comparison.png" alt="GOV.UK Comparison" /></a></p>
<p>And here’s a BBC News page, showing that we capture full-length screenshots. Here the archived version is extremely similar apart from a couple of minor dynamic changes. It’s not all roses though. If we look at the same page using the usual re-written mode rather than our embedded browser, we start to see some gaps arising from the difference in how the two browsers render the page. So, there’s more work to be done, but nevertheless the quality is much better than it used to be, and we’ve got a way to evaluate the quality of the crawled version, and a better understanding of where the problems are coming from.</p>
<p><a href="/building-web-archives/images/bbc-comparison.png"><img src="/building-web-archives/images/bbc-comparison.png" alt="BBC Comparison" /></a></p>
<h2 id="conclusion">Conclusion</h2>
<p>Back at <a href="http://www.ncdd.nl/digital-preservation-how-are-we-doing-as-a-community-ipres2012-7/">iPres in 2012</a>, Steve Knight gave a deliberately challenging <a href="https://digitalpreservationchallenges.files.wordpress.com/2012/09/mckinney.pdf">presentation</a> about the kind of cultural change needed to move from hobbyist to artisan and then to industrial-scale digital preservation. I don’t think I really grasped the consequences at the time, but the last five years have shown me that the technical changes are only one aspect of the broader changes we’ve had to make to how the whole team works, in response to the scale of the Legal Deposit challenge.</p>
<p>What we’ve seen this year is that, by making a few careful changes (and fixes) to Heritrix, we’ve been able to radically alter the way we operate our crawls. This new arrangement is more robust, reliable and transparent, requires less manual intervention, and provides the kind of foundation we need in order to explore how we might automate more of our quality assurance work in the future.</p>
<p>However, we’ve also tried to keep these changes fairly modular so that they can be re-used without necessarily changing how you run Heritrix. For example, if you want to use OutbackCDX just for de-duplication, you can. So, if any of this sounds at all interesting, please get in touch so we can learn what kind of components we might be able to share between our institutions.</p>
<p>Thank you.</p>
<p><a href="http://anjackson.net/2018/11/13/continuous-incremental-heritrix/">Continuous, incremental, scalable, higher-quality web crawls with Heritrix</a> was originally published by Andy Jackson at <a href="http://anjackson.net">andrew.n.jackson</a> on November 13, 2018.</p>http://anjackson.net/2018/03/15/story-of-a-bad-deed2018-03-15T00:00:00+00:002018-03-15T00:00:00+00:00Andy Jacksonhttp://anjackson.netanj@anjackson.net<p>I love a digital preservation mystery, and this one started with question from <a href="https://digipres.club/@joe/">@joe</a> on <a href="https://digipres.club/">digipres.club</a>:</p>
<iframe src="https://digipres.club/@joe/99650486509645352/embed" class="mastodon-embed" style="max-width: 100%; border: 0" width="400"></iframe>
<script src="https://digipres.club/embed.js" async="async"></script>
<p>A mystery file, starting with <code>0x0baddeed</code>, eh? Fascinating. Those hex digits didn’t happen by accident. Using four-byte hex patterns to signal format is an extremely common design pattern, but no authority hands them out – each format designer mints them independently. There must be a story here…
<!--break--></p>
<p>The first step is to find other examples to work with. For exactly this reason, I deliberately built a special feature into our search indexes: the ability to search for files based on the first four bytes. Gratifyingly, <a href="https://digipres.club/@nkrabben">someone else</a> beat me to it:</p>
<iframe src="https://digipres.club/@nkrabben/99650687654066239/embed" class="mastodon-embed" style="max-width: 100%; border: 0" width="400"></iframe>
<script src="https://digipres.club/embed.js" async="async"></script>
<p>Poking around in <a href="https://gist.github.com/anjackson/1cb69ae72eadea65a50e348b71d93d2d">the underlying data</a> it was clear that the <a href="https://www.webarchive.org.uk/shine/search?query=content_ffb:%220baddeed%22">179 files that matched this query</a> all appeared to be PowerPoint files based on the file extension, but neither <a href="https://www.nationalarchives.gov.uk/information-management/manage-information/policy-process/digital-continuity/file-profiling-tool-droid/">DROID</a> nor <a href="https://tika.apache.org/">Apache Tika</a> could say any more.</p>
<p>A <a href="http://www.nationalarchives.gov.uk/PRONOM/Format/proFormatSearch.aspx?status=new">search of PRONOM</a> showed two separate records for <a href="http://www.nationalarchives.gov.uk/PRONOM/Format/proFormatSearch.aspx?status=detailReport&amp;id=885">Microsoft PowerPoint for Macintosh 4.0</a> and <a href="http://www.nationalarchives.gov.uk/PRONOM/Format/proFormatSearch.aspx?status=detailReport&amp;id=133">Microsoft Powerpoint Presentation 4.0</a>, but no earlier versions. In this case, <a href="https://en.wikipedia.org/wiki/Microsoft_PowerPoint">Wikipedia faired better</a>, <a href="https://en.wikipedia.org/wiki/Microsoft_PowerPoint#cite_note-early-file-compatibility-252">linking to</a> this <a href="https://web.archive.org/web/20130510103008/http://www.bitbetter.com/powerfaq.htm#versions">nice overview of the format compatability between PowerPoint versions</a>.</p>
<p>While the <a href="http://justsolve.archiveteam.org/wiki/PPT">File Format Wiki</a> did not have much detail for the early versions of PowerPoint, it did <a href="http://justsolve.archiveteam.org/wiki/PPT#Sample_files">link</a> to <a href="https://web.archive.org/web/20020313074855/http://ftp.sunet.se/pub/Internet-documents/isoc/charts/presentations/">a source of sample files</a><sup id="fnref:2"><a href="#fn:2" class="footnote">1</a></sup>. This proved to be very fortunate indeed…</p>
<p>I downloaded some of the old sample files from there, and compared them against the <code>0x0baddeed</code> files. Here’s the start of one of the sample files:</p>
<pre><code>$ hexdump -C nii.ppt | head
00000000 ed de ad 0b 03 00 00 00 45 17 00 00 3f 01 31 17 |........E...?.1.|
00000010 6f 20 0f 00 50 00 3e 01 28 17 00 00 28 00 00 00 |o ..P.&gt;.(...(...|
00000020 79 00 00 00 5b 00 00 00 01 00 04 00 00 00 00 00 |y...[...........|
00000030 c0 16 00 00 00 00 00 00 00 00 00 00 10 00 00 00 |................|
00000040 00 00 00 00 00 00 00 00 00 00 80 00 00 80 00 00 |................|
00000050 00 80 80 00 80 00 00 00 80 00 80 00 80 80 00 00 |................|
00000060 c0 c0 c0 00 80 80 80 00 00 00 ff 00 00 ff 00 00 |................|
00000070 00 ff ff 00 ff 00 00 00 ff 00 ff 00 ff ff 00 00 |................|
00000080 ff ff ff 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
00000090 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
</code></pre>
<p>and here’s the start of one of the <code>0x0baddeed</code> files…</p>
<pre><code>$ hexdump -C BidStrat.ppt | head
00000000 0b ad de ed 00 00 00 03 00 00 00 1e 00 7b 00 0a |.............{..|
00000010 00 00 be cd 00 50 00 7a 00 00 00 00 00 00 80 00 |.....P.z........|
00000020 00 18 00 00 03 f6 80 00 00 00 00 00 04 0e 80 00 |................|
00000030 03 c0 00 00 04 0e 80 00 01 a4 00 00 07 ce 80 00 |................|
00000040 0d 16 00 00 09 72 80 00 00 26 00 00 16 88 80 00 |.....r...&amp;......|
00000050 00 00 00 00 16 ae 80 00 00 40 00 00 16 ae 80 00 |.........@......|
00000060 00 00 00 00 16 ee 80 00 00 60 00 00 16 ee 80 00 |.........`......|
00000070 00 26 00 00 17 4e 80 00 00 20 00 00 17 74 80 00 |.&amp;...N... ...t..|
00000080 00 18 00 00 17 94 80 00 00 00 00 00 17 ac 80 00 |................|
00000090 03 40 00 00 17 ac 80 00 01 54 00 00 1a ec 80 00 |.@.......T......|
</code></pre>
<p>Do you see it? Look closer…</p>
<pre><code>$ hexdump -C BidStrat.ppt | head -1
00000000 0b ad de ed 00 00 00 03 00 00 00 1e 00 7b 00 0a |.............{..|
$ hexdump -C nii.ppt | head -1
00000000 ed de ad 0b 03 00 00 00 45 17 00 00 3f 01 31 17 |........E...?.1.|
</code></pre>
<p>Both the first and the second four bytes match, but with their byte order reversed! Welcome to the confusing world of <a href="https://en.wikipedia.org/wiki/Endianness">endianness</a> (see also <a href="https://developer.apple.com/library/content/documentation/CoreFoundation/Conceptual/CFMemoryMgmt/Concepts/ByteOrdering.html">Apple’s docs on byte ordering</a>)<sup id="fnref:1"><a href="#fn:1" class="footnote">2</a></sup>.</p>
<p>Most computers use a byte-ordering called ‘little-endian’, but the older Mac used an alternative ordering called ‘big-endian’. This is just two different conventions for storing data, and I can’t look at <code>0x0baddeed</code> and know which ordering it is. However, the discovery of <code>ppt</code> files starting with either <code>0x0baddeed</code> or <code>0xeddead0b</code> is consistent with the same type of data being stored in different endian-orders.</p>
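<p>To make the byte ordering concrete, here is a quick check in Python using the byte values from the hexdumps above:</p>
<pre><code>import struct

bidstrat_magic = bytes.fromhex("0baddeed")  # first four bytes of BidStrat.ppt
nii_magic = bytes.fromhex("eddead0b")       # first four bytes of nii.ppt

# Read each as a 32-bit unsigned integer in the appropriate byte order,
# and both yield the same magic number:
print(f"{struct.unpack('&gt;I', bidstrat_magic)[0]:#010x}")  # 0x0baddeed (big-endian read)
print(f"{struct.unpack('&lt;I', nii_magic)[0]:#010x}")       # 0x0baddeed (little-endian read)
</code></pre>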
<p>Indeed, <a href="https://www.webarchive.org.uk/shine/search?query=content_ffb:%22eddead0b%22">searching for the reversed pattern finds 430 files</a>, and better still the <a href="http://mark0.net/onlinetrid.aspx">online version of TRiD</a> determines these to be early PowerPoint files.</p>
<p><img src="http://anjackson.net/digipres-lessons-learned/images/trid-result-for-nii-ppt.png" alt="TRiD Result" /></p>
<p>In fact, <a href="http://file-extension.net/seeker/file_extension_ppt">it looks like TRiD is also able to distinguish between versions 2.0 and 3.0</a>, but only for the more common byte-ordering.</p>
<p>Sadly, there is no trivial ‘fix’ for this. You can’t just go through the whole file and flip the bytes, because only some chunks are stored like that. If you use the <code>strings</code> command to extract the text, it’s in the expected order, not half-flipped, because it’s stored as a byte stream not <a href="https://en.wikipedia.org/wiki/Word_(computer_architecture)">32-bit ‘words’</a>. The only way to open these files will be to use PowerPoint 2.0 or 3.0 in an emulator, and although either should be able to open both, they are effectively distinct formats. I’m not able to test this, but maybe someone else can?</p>
<p>But why <code>0x0baddeed</code>? Well, I started to speculate that this was a statement from a disgruntled developer. PowerPoint 2.0 was the first version of PowerPoint that ran on both Macs and PCs, and a joke that only reveals itself on one platform would be just the kind of thing I’d expect from a community that relishes Rick-rolls. But then, after I thought I’d had this idea, I realised it feels more like a memory. Can anyone else remember a story like this? Please let me know!</p>
<hr />
<div class="footnotes">
<ol>
<li id="fn:2">
<p>Thanks to <a href="https://twitter.com/NKrabben">Nick Krabbenhöft</a> for <a href="https://twitter.com/NKrabben/status/974386913464537099">pointing out</a> that I’d mis-remembered where I’d got these samples from. This updated blog post should now be accurate! <a href="#fnref:2" class="reversefootnote">&#8617;</a></p>
</li>
<li id="fn:1">
<p>I remain frustrated that I still find endianness confusing. In the past, I managed to cobble together a port of a <a href="https://www.marutan.net/rpcemu/">little-endian platform emulation</a> that ran on my big-endian Mac, and despite the fact I got it working I never felt like I really understood it properly! <a href="#fnref:1" class="reversefootnote">&#8617;</a></p>
</li>
</ol>
</div>
<p><a href="http://anjackson.net/2018/03/15/story-of-a-bad-deed/">Story of a Bad Deed</a> was originally published by Andy Jackson at <a href="http://anjackson.net">andrew.n.jackson</a> on March 15, 2018.</p>http://anjackson.net/2017/11/30/sustaining-the-software-that-preserves-access-to-web-archives2017-11-30T00:00:00+00:002017-11-30T00:00:00+00:00Andy Jacksonhttp://anjackson.netanj@anjackson.net<p>Today is the inaugural <a href="http://www.dpconline.org/events/international-digital-preservation-day">International Digital Preservation Day</a>, and as a small contribution to that excellent global effort I thought I’d write about the current state of the open source tools that enable access to web archives.</p>
<p>Most web archive access happens thanks to the <a href="http://web.archive.org/">Internet Archive’s Wayback Machine</a>. The underlying software that delivers that service has gone through at least three iterations (as far as I know). The first was written in Perl and was never made public, but is referred to in papers and bits of documentation. The second was written in Java, and was made <a href="https://github.com/internetarchive/wayback">open source</a>. The third implementation <a href="https://web.archive.org/web/20160617073306/https://archive.org/about/jobs.php#wayback">appears to be written in Python</a> and offers some <a href="https://blog.archive.org/2017/10/05/wayback-machine-playback-now-with-timestamps/">exciting new features</a>, but is <em>not</em> open source<sup id="fnref:1"><a href="#fn:1" class="footnote">1</a></sup>. As far as I can tell, the Internet Archive is currently using both the Java and Python versions of Wayback (for the Archive-It service and the global Wayback Machine respectively), but the direction of travel is away from the Java version.</p>
<p>This matters because like most of the <a href="http://mementoweb.org/depot/">web archives in the world</a>, we <a href="https://www.webarchive.org.uk/wayback/archive/*/http://www.webarchive.org.uk">built our own access system</a> upon the open source version of the Wayback software. Now that the lead developers are moving away, what should we do?</p>
<p>Long before I heard about the newest version of Wayback, I was frustrated by what I perceived to be a lot of wasted effort by the other users of the Java Wayback software. Each of us seemed to be working alone, patching bugs and struggling to upgrade and update the Wayback software, when we really needed to pool our resources. This was why, working with others in the <a href="http://netpreserve.org/">International Internet Preservation Consortium</a>, I helped set up the <a href="https://github.com/iipc/openwayback">OpenWayback</a> fork of the Internet Archive’s open source repository.</p>
<p>My original idea was to stay as close to the Internet Archive version as possible. We could pool bug-fixes and testing efforts, and try to coordinate with the Internet Archive to get these small fixes back into the core while staying in sync. We could make our own releases, and use those to help manage our own deployment processes. The hope was to build a larger community of practice around this shared code-base.</p>
<p>This didn’t really go to plan.</p>
<p>We did manage to set up a <a href="https://groups.google.com/forum/#!forum/openwayback-dev">developer community</a> and <a href="https://groups.google.com/d/msg/openwayback-dev/wWa4BJTH6hk/Ny95ATyEBAAJ">bug-fixes are being rolled into managed releases</a>. We did grow the community (at least a little), and currently Lauren Ko is doing an excellent job leading the work. But given how important this tool is, there are still too few of us actively engaged in the work. To my shame and frustration, this includes the UK Web Archive – we have not had a lot of time or resources to put back into OpenWayback recently.</p>
<p>Furthermore, rather than staying close to the Internet Archive version, we’ve ended up with a separate fork. Bug-fixes and improvements made in the Internet Archive version can no longer be easily merged into the OpenWayback repository because the two code-bases have diverged and are now too far apart.</p>
<p>Looking back, I underestimated how hard it would be to build this community of practice. I didn’t realise how fortunate I am to be able to work in the open, and to be in a position where it is safe and comfortable for me to do so. I didn’t understand how much effort it would take to agree and maintain our shared goals. Perhaps I should have pushed back harder when the OpenWayback code started to diverge from the original version. I don’t know. But then, given that the Internet Archive is leaving the Java version behind, perhaps that no longer matters.</p>
<p>No matter. We are where we are, and those of us responsible for preserving access to web archives need to decide how to respond to this situation. For now, the UK Web Archive continues to use the Java OpenWayback software, but we are also evaluating <a href="https://github.com/ikreymer/pywb">pywb</a>. This Python tool-kit for accessing web archives is part of the <a href="https://webrecorder.io/">Webrecorder</a> project, and appears to provide a modern and powerful alternative implementation that is being run as a true open source project<sup id="fnref:2"><a href="#fn:2" class="footnote">2</a></sup>. It’s already being used by the <a href="https://www.fccn.pt/en/new-release-of-arquivo-pt-with-improved-replay-quality/">Portuguese Web Archive</a>, <a href="https://perma.cc/">perma.cc</a>, the <a href="http://blog.nationalarchives.gov.uk/blog/uk-government-web-archive-now-even-better/">UK National Archives</a>, the <a href="http://webarchive.parliament.uk/">UK Parliamentary Archive</a>, and <a href="https://github.com/ikreymer/pywb/wiki/Public-Projects-using-pywb">a number of others</a>, so it’s certainly a highly credible alternative.</p>
<p>I still believe that the web archives of the world need to pool our scant resources, and avoid vendor lock-in, by sharing the development and maintenance of open source access tools. The organisations and individuals that <em>do</em> consider shared open tools to be a <em>strategic objective</em> need to find each other and find ways to collaborate. I’m hopeful the <a href="http://netpreserve.org/about-us/working-groups/training-working-group/">new focus on training within IIPC</a> will generate learning resources to help get new people up to speed, but perhaps we can also find ways to support those who recognised the value of working in the open but are not able or permitted to do so.</p>
<p>Finally, we must try to find ways of funding the development and maintenance of these tools. All digital resources need software to make them accessible, and therefore maintaining software is now a critical need for every memory institution that wants to preserve access to the digital or born-digital items in their collections.</p>
<p>It’s not easy, so let’s share the load.</p>
<hr />
<div class="footnotes">
<ol>
<li id="fn:1">
<p>Perhaps it will be one day, but for now it appears to be proprietary. <a href="#fnref:1" class="reversefootnote">&#8617;</a></p>
</li>
<li id="fn:2">
<p>Note that although the main author used to work there, pywb is entirely separate from the Internet Archive work. <a href="#fnref:2" class="reversefootnote">&#8617;</a></p>
</li>
</ol>
</div>
<p><a href="http://anjackson.net/2017/11/30/sustaining-the-software-that-preserves-access-to-web-archives/">Sustaining the Software that Preserves Access to Web Archives</a> was originally published by Andy Jackson at <a href="http://anjackson.net">andrew.n.jackson</a> on November 30, 2017.</p>http://anjackson.net/2017/11/16/driving-crawls-via-annotations2017-11-16T00:00:00+00:002017-11-16T00:00:00+00:00Andy Jacksonhttp://anjackson.netanj@anjackson.net<p><em>Originally published <a href="http://blogs.bl.uk/webarchive/2017/11/driving-crawls-with-web-annotations.html">on the UK Web Archive blog</a> on the 10th of November 2017.</em></p>
<p>The heart of the idea was simple. Rather than <a href="/2017/10/19/tools-for-legal-deposit">our traditional linear harvesting process</a>, we would think in terms of annotating the live web, and imagine how we might use those annotations to drive the web-archiving process. From this perspective, each Target in the Web Curator Tool is really very similar to a bookmark on a social bookmarking service (like <a href="https://pinboard.in/">Pinboard</a>, <a href="https://www.diigo.com/">Diigo</a> or <a href="https://delicious.com/">Delicious</a><sup id="fnref:1"><a href="#fn:1" class="footnote">1</a></sup>), except that as well as describing the web site, the annotations also drive the archiving of that site<sup id="fnref:2"><a href="#fn:2" class="footnote">2</a></sup>.</p>
<p>In this unified model, some annotations may simply highlight a specific site or URL at some point in time, using descriptive metadata to help ensure important resources are made available to our users. Others might more explicitly drive the crawling process, by describing how often the site should be re-crawled, whether robots.txt should be obeyed, and so on. Crucially, where a particular website cannot be ruled as in-scope for UK legal deposit automatically, the annotations can be used to record any additional evidence that permits us to crawl the site. Any permissions we have sought in order to make an archived web site available under open access can also be recorded in much the same way.</p>
<p>Once we have crawled the URLs and sites of interest, we can then apply the same annotation model to the captured material. In particular, we can combine one or more targets with a selection of annotated snapshots to form a collection. These ‘instance annotations’ could be quite detailed, similar to those supported by web annotation services like <a href="https://hypothes.is/">Hypothes.is</a>, and indeed this may provide a way for web archives to support and interoperate with services like that.<sup id="fnref:3"><a href="#fn:3" class="footnote">3</a></sup></p>
<p>Thinking in terms of annotations also makes it easier to peel processes apart from their results. For example, metadata that indicates whether we have passed those instances through a QA process can be recorded as annotations on our archived web, but the actual QA process itself can be done entirely outside of the tool that records the annotations.</p>
<p>To test out this approach, we built a prototype Annotation &amp; Curation Tool (ACT) based on <a href="https://www.drupal.org/">Drupal</a>. Drupal makes it easy to create web UIs for <a href="https://www.drupal.org/node/21947">custom content types</a>, and we were able to create a simple, usable interface very quickly. This allowed curators to register URLs and specify the additional metadata we needed, including the crawl permissions, schedules and frequencies. But how do we use this to drive the crawl?</p>
<p>Our solution was to configure Drupal so that it provided a ‘crawl feed’ in a machine-readable format. This was initially a simple list of data objects (one per Target), containing all the information we held about that Target, which could be filtered by crawl frequency (daily, weekly, monthly, and so on). However, as the number of entries in the system grew, having the <em>entire</em> set of data associated with each Target eventually became unmanageable. This led to a simplified description that just contains the information we need to run a crawl, which looks something like this:</p>
<pre><code>[
    {
        "id": 1,
        "title": "gov.uk Publications",
        "seeds": [
            "https://www.gov.uk/government/publications"
        ],
        "schedules": [
            {
                "frequency": "MONTHLY",
                "startDate": 1438246800000,
                "endDate": null
            }
        ],
        "scope": "root",
        "depth": "DEEP",
        "ignoreRobotsTxt": false,
        "documentUrlScheme": null,
        "loginPageUrl": null,
        "secretId": null,
        "logoutUrl": null,
        "watched": false
    },
    ...
</code></pre>
<p>This simple data export became the first of our <a href="https://kris-sigur.blogspot.co.uk/2015/06/even-though-it-didnt-feature-heavily-on.html">web archiving APIs</a> – a set of <a href="https://en.wikipedia.org/wiki/Application_programming_interface">application programming interfaces</a> we use to try to <a href="https://programmingisterrible.com/post/162346490883/how-do-you-cut-a-monolith-in-half">split large services</a> into <a href="http://blog.dshr.org/2015/06/brief-talk-at-columbia.html">modular components</a><sup id="fnref:4"><a href="#fn:4" class="footnote">4</a></sup>.</p>
<p>Of course, the output of the crawl engines also needs to meet some kind of standard so that the downstream indexing, ingesting and access tools know what to do. This works much like the API concept described above, but is even simpler, as we just rely on standard file formats in a fixed directory layout. Any crawler can be used as long as it outputs standard WARCs and logs, and puts them into the following directory layout:</p>
<pre><code>/output/logs/{job-identifier}/{launch-timestamp}/*.log
/output/warcs/{job-identifier}/{launch-timestamp}/*.warc.gz
</code></pre>
<p>Where the <code>{job-identifier}</code> is used to specify which crawl job (and hence which crawl configuration) is being used, and the <code>{launch-timestamp}</code> is used to separate distinct jobs launched using the same overall configuration, reflecting repeated re-crawling of the same sites over time.</p>
<p>In other words, if we have two different crawler engines that can be driven by the same crawl feed data and output the same format results, we can switch between them easily. Similarly, we can make any kind of changes to our Annotation &amp; Curation Tool, or even replace it entirely, and as long as it generates the same crawl feed data, the crawler engine doesn’t have to care. Finally, as we’ve also standardised the crawler output, the tools we use to post-process our crawl data can also be independent of the specific crawl engine in use.</p>
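<p>As a minimal sketch of what this layout buys us, any post-processing tool can discover the output of the most recent launch of a job without knowing which crawler produced it. The job identifier here is hypothetical; only the directory layout described above is assumed:</p>
<pre><code>import glob
import os

def latest_warcs(job_identifier, output_root="/output"):
    """Find the WARCs from the most recent launch of a crawl job."""
    launches = sorted(glob.glob(os.path.join(output_root, "warcs", job_identifier, "*")))
    if not launches:
        return []
    # Launch directories are timestamps, so the lexicographically last is the newest.
    return sorted(glob.glob(os.path.join(launches[-1], "*.warc.gz")))

print(latest_warcs("daily-news-sites"))  # hypothetical job identifier
</code></pre>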
<p>This separation of components has been crucial to our recent progress. By de-coupling the different processes within the crawl lifecycle, each of the individual parts is able to move at its own pace. Each can be modified, tested and rolled out without affecting the others, if we so choose. True, making large changes that affect multiple components does require more careful management of the development process, but this is a small price to pay for the ease with which we can roll out improvements and bugfixes to individual components.</p>
<p>A prime example of this is how our Heritrix crawl engine itself has evolved over time, and that will be the subject of the next blog post.</p>
<hr />
<div class="footnotes">
<ol>
<li id="fn:1">
<p>Although, noting that <a href="https://blog.pinboard.in/2017/06/pinboard_acquires_delicious/">Delicious is now owned by Pinboard</a>, I would like to make it clear that we are not attempting to compete with Pinboard. <a href="#fnref:1" class="reversefootnote">&#8617;</a></p>
</li>
<li id="fn:2">
<p>Note that this is also <a href="https://pinboard.in/upgrade/">a feature of some bookmarking sites</a>. But we are <em>not</em> attempting to compete with Pinboard. <a href="#fnref:2" class="reversefootnote">&#8617;</a></p>
</li>
<li id="fn:3">
<p>I’m not yet sure how this might work, but some combination of the <a href="http://www.openannotation.org/">Open Annotation Specification</a> and <a href="http://timetravel.mementoweb.org/about/">Memento</a> might be a good starting point. <a href="#fnref:3" class="reversefootnote">&#8617;</a></p>
</li>
<li id="fn:4">
<p>For more information, see the <em>Architecture</em> section of <a href="http://blog.dshr.org/2016/03/talk-on-evolving-lockss-technology-at.html">this follow-up blog post</a> <a href="#fnref:4" class="reversefootnote">&#8617;</a></p>
</li>
</ol>
</div>
<p><a href="http://anjackson.net/2017/11/16/driving-crawls-via-annotations/">Driving Crawls With Web Annotations</a> was originally published by Andy Jackson at <a href="http://anjackson.net">andrew.n.jackson</a> on November 16, 2017.</p>http://anjackson.net/2017/10/19/tools-for-legal-deposit2017-10-19T00:00:00+01:002017-10-19T00:00:00+01:00Andy Jacksonhttp://anjackson.netanj@anjackson.net<p><em>Before I revisit the ideas explored in <a href="/2016/04/11/building-tools-to-archive-the-modern-web/">the first post in the blog series</a> I need to go back to the start of this story…</em></p>
<p>Between 2003 and 2013 – before the <a href="http://www.bl.uk/aboutus/legaldeposit/introduction/">Non-Print Legal Deposit</a> regulations came into force – the <a href="http://www.webarchive.org.uk/">UK Web Archive</a> could only archive websites by explicit permission. During this time, the <a href="http://dia-nz.github.io/webcurator/">Web Curator Tool</a> (WCT) was used to manage almost the entire life-cycle of the material in the archive. Initial processing of nominations was done via a separate Selection &amp; Permission Tool (SPT), and the final playback was via a separate instance of Wayback, but WCT drove the rest of the process.</p>
<p>Of course, selective archiving is valuable in its own right, but this was also seen as a way of building up the experience and expertise required to implement full domain crawling under Legal Deposit. However, WCT was not deemed to be a good match for a domain crawl. The <a href="https://webarchive.jira.com/wiki/display/Heritrix/Heritrix#Heritrix-Heritrix1.14.4%28May2010%29">old version of Heritrix</a> embedded inside WCT was not considered very scalable, was not expected to be supported for much longer, and was difficult to re-use or replace because of the way it was baked inside WCT.<sup id="fnref:1"><a href="#fn:1" class="footnote">1</a></sup></p>
<p>The chosen solution was to use <a href="https://github.com/internetarchive/heritrix3">Heritrix 3</a> to perform the domain crawl separately from the selective harvesting process. While this was rather different to Heritrix 1, requiring incompatible methods of set-up and configuration, it scaled fairly effectively, allowing us to perform a full domain crawl on a single server<sup id="fnref:2"><a href="#fn:2" class="footnote">2</a></sup>.</p>
<p>This was the proposed arrangement when I joined the UK Web Archive team, and this was retained through the onset of the Non-Print Legal Deposit regulations. The domain crawls and the WCT crawls continued side by side, but were treated as separate collections. It would be possible to move between them by following links in Wayback, but no more.</p>
<p>This is not necessarily a bad idea, but it seemed to be a terrible shame, largely because it made it very difficult to effectively re-use material that had been collected as part of the domain crawl. For example, what if we found we’d missed an important website that should have been in one of our high-profile collections, but, because we didn’t know about it, it had only been captured under the domain crawl? Well, we’d want to go and add those old instances to that collection, of course.</p>
<p>Similarly, what if we wanted to merge material collected using a range of different web archiving tools or services into our main collections? For example, for some difficult sites we may have to drive the archiving process manually. We need to be able to properly integrate that content into our systems and present them as part of a coherent whole.</p>
<p>But WCT makes these kind of things really hard.</p>
<p>If you look at the overall architecture, the Web Curator Tool enforces what is essentially (despite the odd loop or dead-end) a linear workflow (figure taken from <a href="http://webcurator.sourceforge.net/docs/1.6.1/Web%20Curator%20Tool%20Quick%20Start%20Guide%20%28WCT%201.6%29.pdf">here</a>). First you sort out the permissions, then you define your Target and its metadata, then you crawl it (and maybe re-crawl it for QA), then you store it, then you make it available. In that order.</p>
<p><a href="/building-web-archives/images/WCT-workflow.svg"><img src="/building-web-archives/images/WCT-workflow.png" alt="WCT Overall Workflow" /></a></p>
<p>But what if we’ve already crawled it? Or collected it some other way? What if we want to add metadata to existing Targets? What if we want to store something but not make it available? What if we want to make domain crawl material available even if we haven’t QA’d it?</p>
<p>Looking at WCT, the components we needed were there, but tightly integrated in one monolithic application and baked into the expected workflow. I could not see how to take it apart and rebuild it in a way that would make sense and enable us to do what we needed. Furthermore, we had already built up a rather complex arrangement of additional components around WCT (this includes applications like SPT but also a rather messy nest of database triggers, cronjobs and scripts). It therefore made some sense to revisit our architecture as a whole.</p>
<p>So, I made the decision to make a fresh start. Instead of the WCT and SPT, we would develop a new, more modular archiving architecture built around the concept of annotations…</p>
<hr />
<div class="footnotes">
<ol>
<li id="fn:1">
<p>Although we have moved away from WCT it is still <a href="https://github.com/DIA-NZ/webcurator">under active development</a> thanks to the National Library of New Zealand, including <a href="https://github.com/DIA-NZ/webcurator/tree/h3impl">Heritrix3 integration</a>! <a href="#fnref:1" class="reversefootnote">&#8617;</a></p>
</li>
<li id="fn:2">
<p>Not without some stability and robustness problems. I’ll return to this point in a later post. <a href="#fnref:2" class="reversefootnote">&#8617;</a></p>
</li>
</ol>
</div>
<p><a href="http://anjackson.net/2017/10/19/tools-for-legal-deposit/">Tools for Legal Deposit</a> was originally published by Andy Jackson at <a href="http://anjackson.net">andrew.n.jackson</a> on October 19, 2017.</p>http://anjackson.net/2017/06/30/waw-digging-documents-out-of-the-archived-web2017-06-30T00:00:00+01:002017-06-30T00:00:00+01:00Andy Jacksonhttp://anjackson.netanj@anjackson.net<h2 id="abstract">Abstract</h2>
<p>As an increasing number of government and other publications move towards online-only publication, we are forced to move beyond our traditional Legal Deposit processes based on cataloguing printed media. As we are already tasked with archiving UK web publications, the question is not so much ‘how do we collect these documents?’ but rather ‘how do we find the documents we’ve already collected?’. This presentation will explore the issues we’ve uncovered as we’ve sought to integrate our web archives with our traditional document cataloguing processes, especially around official publications and e-journals. Our current Document Harvester will be described, and its advantages and limitations explored. Our current methods for exploiting machine-generated metadata will be discussed, and an outline of our future plans for this type of work will be presented.</p>
<!--break-->
<p><em>These are the slides for the presentation I gave as part of <a href="https://archivedweb.blogs.sas.ac.uk/">Web Archiving Week 2017</a>, on <a href="http://netpreserve.org/wac2017/thursday-15-june/">Thursday 15th of June</a>.</em></p>
<p><img src="/blog/images/2017-06-WAW-digging-out-documents/Slide02.png" alt="Journey of a (print) collection item" /></p>
<p><img src="/blog/images/2017-06-WAW-digging-out-documents/Slide03.png" alt="Original Digital Processing Workflow" /></p>
<p><img src="/blog/images/2017-06-WAW-digging-out-documents/Slide04.png" alt="Document Harvester Workflow" /></p>
<p><img src="/blog/images/2017-06-WAW-digging-out-documents/Slide05.png" alt="What is a Publication?" /></p>
<p><img src="/blog/images/2017-06-WAW-digging-out-documents/Slide06.png" alt="Example gov.uk publication" /></p>
<p><img src="/blog/images/2017-06-WAW-digging-out-documents/Slide07.png" alt="MARC &amp; Cataloguing Standards" /></p>
<p><img src="/blog/images/2017-06-WAW-digging-out-documents/Slide08.png" alt="Metadata Extraction" /></p>
<p><img src="/blog/images/2017-06-WAW-digging-out-documents/Slide09.png" alt="Example gov.uk API Data" /></p>
<p><img src="/blog/images/2017-06-WAW-digging-out-documents/Slide10.png" alt="Resolving References" /></p>
<p><img src="/blog/images/2017-06-WAW-digging-out-documents/Slide11.png" alt="Layers of Transformation" /></p>
<p><img src="/blog/images/2017-06-WAW-digging-out-documents/Slide12.png" alt="Oh No! I Made Another Chain" /></p>
<p><img src="/blog/images/2017-06-WAW-digging-out-documents/Slide13.png" alt="Future Experimentation" /></p>
<p><a href="http://anjackson.net/2017/06/30/waw-digging-documents-out-of-the-archived-web/">Digging Documents Out Of The Archived Web</a> was originally published by Andy Jackson at <a href="http://anjackson.net">andrew.n.jackson</a> on June 30, 2017.</p>http://anjackson.net/2017/06/29/waw-your-lying-archives2017-06-29T00:00:00+01:002017-06-29T00:00:00+01:00Andy Jacksonhttp://anjackson.netanj@anjackson.net<p><em>This is the script for the introduction I gave as part of a ‘Digital Conversations at the BL’ panel event: <a href="https://www.bl.uk/events/web-archives-truth-lies-and-politics-in-the-21st-century#">Web Archives: truth, lies and politics in the 21st century</a> on <a href="http://netpreserve.org/wac2017/wednesday-14-june/">Wednesday 14th of June</a>, as part of <a href="https://archivedweb.blogs.sas.ac.uk/">Web Archiving Week 2017</a>.</em></p>
<p>My role is to help build a web archive for the United Kingdom that can be used to separate fact from fiction. But to do that, you need to be able to trust that the archived content we present to you is what it purports to be.</p>
<p>Which raises the question: Can a web archive lie?</p>
<p>Well it can certainly be confusing. For example, one seemingly simple question that we are sometimes asked is: How big is the UK web? Unfortunately, this is actually quite a difficult question. First, unlike print, many web pages are generated by algorithms, which means the web is technically infinite. Even putting that aside, to answer this question precisely we’d need to capture every version of everything on the web. We just can’t do that. Even if we could download every version of everything we know about, there’s also the problem of all the sites that we failed to even try to capture.</p>
<p>A web archive can also be misleading – most obviously through omission. Sometimes, this might be because of unintended biases introduced by the process by which we select sites for inclusion and higher-quality crawling. Having an open nominations process for the archive can help, but the diversity of those involved with web archives is pretty low. We also know that we lose a lot of content due to the complexity of current web sites and the limitations of our current crawling technologies.</p>
<p>A web archive can also mislead in other ways. When presenting web archives, we use the date we captured the resource as our time axis. This matters because users usually expect that documents appear arranged by their date of publication. We generally don’t know the publication date, and because of the way the web crawling process works, the time line can get mixed up because the crawler will tend to discover documents based on their popularity, in terms of how many other sites link to them. With fast-moving events like news and current affairs, this can become very misleading and is something I expect we’ll have to address more directly in the future.</p>
<p>One way to do this is to start to bring in more structured data from multiple sources, like Twitter or other APIs. These systems usually do provide authoritative publication dates and timestamps. The trick is going to be working out how to blend these different data sources together to improve the way we present our time line.</p>
<p>But can a web archive outright lie? For example, can it say something was on the web at a particular time when in truth it was not even written yet?</p>
<p>Well, yes, it certainly could. Digital material is very malleable - if someone modifies it, that often leaves no traces upon the item itself. As digital resources become increasingly important as historical records, we must expect more and more attempts to hack our archives. Obviously, we take steps to prevent ourselves from being hacked, but how on earth do we prove to you that we’ve done a good job? If not getting hacked is our only answer, it just becomes a massive single point of failure. One breach, and the authenticity of every digital item we hold is brought into question.</p>
<p>When building large, distributed computer systems, you have to engineer for failure. When we build large, long-lived information systems, we have to take the same approach. We have to work out how to ensure the historical record is trustworthy, even if our institution is hacked.</p>
<p>A hacker may be able to hack one organisation, but simultaneously and consistently hacking multiple independent organisations is much, much harder. As our historical record becomes born-digital, the libraries and archives of the world are going to have to find ways to support each other, and build a chain of care that is so wide and so entangled, it simply can’t be hacked without a trace.</p>
<p><a href="http://anjackson.net/2017/06/29/waw-your-lying-archives/">Can a Web Archive Lie?</a> was originally published by Andy Jackson at <a href="http://anjackson.net">andrew.n.jackson</a> on June 29, 2017.</p>http://anjackson.net/2017/06/28/waw-the-shelves-and-the-mine2017-06-28T00:00:00+01:002017-06-28T00:00:00+01:00Andy Jacksonhttp://anjackson.netanj@anjackson.net<h2 id="abstract">Abstract</h2>
<p>The British Library has a long tradition of preserving the heritage of the United Kingdom, and processes for handling and cataloguing print-based media are deeply ingrained in the organisation’s structure and thinking. However, as an increasing number of government and other publications move towards online-only publication, we are forced to revisit these processes and explore what needs to be changed in order to avoid the web archive becoming a massive, isolated silo, poorly integrated with other collection material. We have started this journey by looking at how we collect official documents, like government publications and e-journals. As we are already tasked with archiving UK web publications, the question is not so much ‘how do we collect these documents?’ but rather ‘how do we find the documents we’ve already collected?’. Our current methods for combining curatorial expertise with machine-generated metadata will be discussed, leading to an outline of the lessons we have learned. Finally, we will explore how the ability to compare the library’s print catalogue data with the web archive enables us to study the steps institutions and organisations have taken as they have moved online.</p>
<!--break-->
<p><em>This is the script for the presentation I gave as part of <a href="https://archivedweb.blogs.sas.ac.uk/">Web Archiving Week 2017</a>, on <a href="http://netpreserve.org/wac2017/wednesday-14-june/">Wednesday 14th of June</a>.</em></p>
<h2 id="introduction">Introduction</h2>
<p>This year, I’ll have worked for the British Library for ten years. But it’s only in this last year that I’ve finally had to get to grips with that quintessential piece of library technology… The Catalogue.</p>
<p>We needed to bring the catalogue and the web archive closer together, linking the traditional cataloguing process to the content from the web. I won’t get into the more technical details here. If you’re interested in that, I’ll be giving a longer presentation on that subject tomorrow morning where I’ll talk about how the implementation worked and how I’m coping with the psychological scarring caused by being exposed to MARC metadata.</p>
<p>Instead, today I want to compare the overall architecture of the two systems, because the web archive and the catalogue manage their data in fundamentally different ways.</p>
<h2 id="the-catalogue">The Catalogue</h2>
<p><img src="/blog/images/2017-06-WAW-archive-v-catalogue/Slide02.png" alt="The Journey Of A Collection Item" /></p>
<p>These are some still frames from a British Library video that shows the journey of a print collection item, from being posted to us, through acquisition, cataloguing, finishing and to the shelf. Then, when a reader requests it, from the shelf and out to the reading room. These processes represent a very large proportion of the day-to-day work of the British Library, and much of the library is built around supporting the efficient ingest and delivery of printed material. I don’t just mean the work of the teams involved, but the very existence of those teams, the roles they hold and the management hierarchies that knit them together. Even the physical structure of our buildings, and the way they are connected, are all shaped by this chain of operations.</p>
<p>At the heart of this process sits the catalogue. At every step, the catalogue is updated to reflect this workflow, with the record of each item being created and then updated along the way. The British Library catalogue is not just a collection of bibliographic metadata, it’s also a process management system.</p>
<p><img src="/blog/images/2017-06-WAW-archive-v-catalogue/Slide03.png" alt="Models with chains of events" /></p>
<p>This is a very natural and sensible way to manage the chain of events that must occur in order to deal with print material, and we’ve been doing it like this for a very long time. Consequently, this approach has become deeply embedded in our thinking, and reappears elsewhere. Sometimes the events are in a line, and sometimes in a circle, but always implying a step-by-step approach. Ever forwards.</p>
<p>We found another example last year, when we were looking at the way the library has reacted to the shifting of traditional print publications to online forms, particularly government publications.</p>
<p><img src="/blog/images/2017-06-WAW-archive-v-catalogue/Slide04.png" alt="Document Processing Workflow" /></p>
<p>We learned that the curation and cataloguing teams that handle printed material had started manually downloading these publications from the web, processing them and submitting them into our digital library system, and recording this all in the catalogue. But these very same publications were being downloaded and stored through our regular web archiving activities. This duplication of content reflected a significant duplication of effort across teams, and so we set out to resolve it by building a document harvester on top of the web archive.</p>
<h2 id="the-web-archive">The Web Archive</h2>
<p><img src="/blog/images/2017-06-WAW-archive-v-catalogue/Slide05.png" alt="Web Curator Tool Workflow" /></p>
<p>Before legal deposit, the web archive also worked as a chain of events. Every crawl target was defined by a period of time. Usually, we’d crawl a site a few times while we built up a collection, but at some point the crawl target was deemed complete and we’d move on to a new collection and a new set of sites to capture. The Web Curator Tool was built around this process.</p>
<p>But when Legal Deposit came along, this had to change. The most obvious change under Legal Deposit is the sheer scale of the operation, going from thousands of sites a year to millions. But it’s also about the way the workflow has changed.</p>
<p><img src="/blog/images/2017-06-WAW-archive-v-catalogue/Slide06.png" alt="Decoupled Collection &amp; Curation Workflows" /></p>
<p>Under Legal Deposit, we have to try to get it all, all the time! One very important example is that we now collect hundreds of news sites every day, because we want to have a good snapshot of web news no matter what, and this is an ongoing effort. We’re not going to stop one day and declare we’ve had enough news (although lately that has become more tempting!).</p>
<p>But of course, we still want to make our web archive more accessible by building collections around specific events or areas of interest. So the curation process now looks rather different. It <em>might</em> mean adding a new set of web sites to be crawled, but it’s more likely to mean pulling in snapshots of sites or specific pages we’ve already crawled. The process of cataloguing and curating the web has become largely de-coupled from the process of collecting the web.</p>
<p><img src="/blog/images/2017-06-WAW-archive-v-catalogue/Slide07.png" alt="Document Harvester Workflow" /></p>
<p>Harvesting documents works in much the same way - rather than launching specific crawls for this purpose, we can pick out documents from the crawls and expose them for cataloguing.</p>
<p>But now we have two systems that hold metadata we need to bring into play when we make our data discoverable and accessible. Somehow we have to knit all these different sources of information together. To make things even more complicated, we have to cope with the fact that some of these processes need to evolve quite rapidly over time. We are still learning how to index our born-digital material, and we need a way to experiment with different tactics without creating a mess of metadata that we can’t unpick.</p>
<h2 id="bringing-them-together">Bringing them Together</h2>
<p>For print material, our cataloguing standards do change slowly over time, but we accept that you can’t just go back and re-catalogue everything. Because of the manual work involved, updates to the catalogue for print material are rare, and there’s no sense optimising your workflow around rare events. This is why the library focusses on managing items passing through a step-by-step workflow.</p>
<p>But digital is different. Digital means you <em>can</em> go back and re-process everything, and this gives you new ways to bring data together. It’s like suddenly having a massive army of dexterous androids to walk the shelves and re-catalogue all the books.</p>
<p>Maybe like this…</p>
<p><img src="/blog/images/2017-06-WAW-archive-v-catalogue/Slide08.png" alt="Helpful Robot Army" /></p>
<p>Or…</p>
<p><img src="/blog/images/2017-06-WAW-archive-v-catalogue/Slide09.png" alt="Disagreeable Robot Army" /></p>
<p>Maybe not.</p>
<p><img src="/blog/images/2017-06-WAW-archive-v-catalogue/Slide10.png" alt="Friendly Robot Army" /></p>
<p>(it’s surprisingly difficult to find an image of a large group of robots that isn’t creepy)</p>
<p>How does this help us bring disparate data sources together? Well, a good analogy can be found in the recent geo-referencing work on the British Library maps collections.</p>
<p><img src="/blog/images/2017-06-WAW-archive-v-catalogue/Slide11.png" alt="Geo-referencer Pins" /></p>
<p>This picture shows a user helping us understand where an old digitised map lines up with a modern one. They do this by picking out specific features that appear on both, and this allows the old map to be projected on top of the new one.</p>
<p><img src="/blog/images/2017-06-WAW-archive-v-catalogue/Slide12.png" alt="Geo-referencer Layers" /></p>
<p>If you imagine the modern map represents the web archive, then we can overlay different sources of data if we can find common points of reference. These might be the names of publications, journals or authors, identifiers like ISBNs or DOIs, or other entities like dates or possibly even place names, just like the maps. If we can find these entities and forge links with the catalogue, we can pull in more concepts from more sources and start to align the layers. As we do this, it opens up the possibility of studying the transition from print to online publication directly.</p>
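<p>As a rough illustration of what ‘common points of reference’ might look like in practice (the field names here are made up, not our actual data model), the simplest case is a straight join on a shared identifier such as an ISBN or DOI:</p>
<pre><code># Hypothetical catalogue and web archive records, joined on a shared ISBN.
catalogue = [
    {"id": "cat-1", "title": "Annual Report 2016", "isbn": "978-0-11-100000-0"},
]
web_archive = [
    {"url": "http://example.gov.uk/report.pdf", "isbn": "978-0-11-100000-0"},
    {"url": "http://example.gov.uk/news", "isbn": None},
]

by_isbn = {c["isbn"]: c for c in catalogue if c["isbn"]}

for doc in web_archive:
    match = by_isbn.get(doc["isbn"])
    if match:
        print(doc["url"], "aligns with catalogue record", match["id"])
    else:
        print(doc["url"], "has no catalogue match yet")
</code></pre>
<p>Real entity matching is of course fuzzier than this, but the principle of pinning layers together on shared reference points is the same.</p>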
<h2 id="thinking-in-layers">Thinking in Layers</h2>
<p>Now, instead of thinking in terms of chains of events, we’re thinking in terms of layers of information.</p>
<p><img src="/blog/images/2017-06-WAW-archive-v-catalogue/Slide13.png" alt="Thinking in layers" /></p>
<p>Starting at the bottom: because we can’t curate or catalogue everything, we use automated metadata and full-text extraction on the bulk of our archived content. We then bring in web archive annotations that describe sites, and the collections and subject areas those sites belong to. From the main catalogue we can also start to bring in publications and their identifiers, and then all of these sources can be merged together (with the manually created metadata taking precedence). From there, we can now populate our full-text search system, or re-generate our datasets and reports.</p>
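<p>The merge itself can be thought of as nothing more exotic than overlaying dictionaries in order of precedence. A minimal sketch (field names purely illustrative):</p>
<pre><code># Layers listed in increasing order of precedence, so the manually
# created metadata wins wherever the layers overlap.
automated   = {"title": "annual report", "language": "en"}
annotations = {"collection": "Official Publications"}
catalogue   = {"title": "Annual Report 2016", "isbn": "978-0-11-100000-0"}

merged = {}
for layer in (automated, annotations, catalogue):
    merged.update(layer)

print(merged)
# {'title': 'Annual Report 2016', 'language': 'en',
#  'collection': 'Official Publications', 'isbn': '978-0-11-100000-0'}
</code></pre>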
<p>If the outcome leads to the catalogue being updated, or if we want to add a new source of data, we can update the sources and transformations, and re-generate the whole thing all over again. Because this might mean re-processing a large amount of data, it’s probably not something we can do that often, but by taking this layered approach we can still experiment using subsets or samples of data and then re-build the whole thing when we’re confident it will work.</p>
<p>We can also use this approach to make the automated metadata extraction smarter - using analysis of the manual metadata to improve the extraction process. This feedback loop, with curators and cataloguers in the driving seat, will help us teach the computers to do a better job of automated extraction, so the thousands of data points we can manually curate can help make billions of resources more useful.</p>
<p><img src="/blog/images/2017-06-WAW-archive-v-catalogue/Slide14.png" alt="Summary" /></p>
<p><a href="http://anjackson.net/2017/06/28/waw-the-shelves-and-the-mine/">The Web Archive and the Catalogue</a> was originally published by Andy Jackson at <a href="http://anjackson.net">andrew.n.jackson</a> on June 28, 2017.</p>http://anjackson.net/2017/06/09/discovery-and-access-plans2017-06-09T00:00:00+01:002017-06-09T00:00:00+01:00Andy Jacksonhttp://anjackson.netanj@anjackson.net<p><em>Originally published <a href="http://blogs.bl.uk/webarchive/2017/06/revitalising-the-uk-web-archive.html">on the UK Web Archive blog</a> on the 8th of June 2017.</em></p>
<p>It’s been over a year since we made <a href="http://britishlibrary.typepad.co.uk/webarchive/2016/02/updating-our-historical-search-service.html">our historical search system available</a>, and it’s proven itself to be stable and useful. Since then, we’ve been largely focussed on changes to our crawl system, but we’ve also been planning how to take what we learned in the <a href="http://buddah.projects.history.ac.uk/">Big UK Domain Data for the Arts and Humanities</a> project and use it to re-develop <a href="https://www.webarchive.org.uk/ukwa/">the UK Web Archive</a>.</p>
<p>Our current website has not changed much since 2013, and doesn’t describe <a href="http://www.bl.uk/aboutus/legaldeposit/introduction/">who we are</a> and <a href="http://britishlibrary.typepad.co.uk/webarchive/2015/06/ten-years-of-archiving-the-web.html">what we do</a> now that the <a href="http://www.bl.uk/aboutus/legaldeposit/websites/">UK Legal Deposit regulations</a> are in place. It only describes the sites we have crawled by permission, and does not reflect the tens of thousands of sites and URLs that we have curated and categorised under Legal Deposit, nor the billions of web pages in the full collection. To try to address these issues, we’re currently developing a new website that will open-up and refresh our archives.</p>
<p>One of the biggest challenges is the search index. The 3.5 billion resources we’ve indexed for <a href="https://www.webarchive.org.uk/shine">SHINE</a> represents less than a third of our holdings, so now we need to scale our system up to cope with over ten billion documents, and a growth rate of 2-3 billion resources per year. We will continue working with the <a href="https://github.com/ukwa/webarchive-discovery">open source indexer we have developed</a>, while updating our data processing platform (<a href="http://hadoop.apache.org/">Apache Hadoop</a>) and dedicating more hardware to the <a href="http://blogs.bl.uk/webarchive/2014/11/powering-the-uk-web-archive-search-with-solr.html">SolrCloud that holds our search indexes</a>. If this all works as planned, we will be able to offer a complete search service that covers our entire archive, from 1995 to yesterday.</p>
<p>The first release of the new website is not expected to include all of the functionality offered by the <a href="https://github.com/ukwa/shine">SHINE prototype</a>, just the core functionality we need to make our content and collections more available to a general audience. Quite how we bring together these two distinct views of the same underlying search index is an open question at this point in time. Later in the year, we will make the new website available as a public beta, and we’ll be looking for feedback from all our users, to help us decide how things should evolve from here.</p>
<p>As well as scaling up search, we’ve also been working to scale up our access service. While it doesn’t look all that different, our <a href="https://www.webarchive.org.uk/wayback/archive/">website playback service</a> has been overhauled to cope with the scale of our full collection. This allows us to make our full holdings knowable, even if they aren’t openly accessible, so you get a more informative error message (and <a href="https://en.wikipedia.org/wiki/HTTP_451">HTTP status code</a>) if you attempt to access content that we can only make available on site at the present time. For example, if you look at <a href="https://www.webarchive.org.uk/wayback/archive/*/http://www.google.co.uk">our archive of google.co.uk</a>, you can see that we have captured the Google U.K. homepage during our crawls but can’t make it openly available due to the legal framework we operate within.</p>
<p>The upgrades to our infrastructure will also allow us update the tools we use to analyse our holdings. In particular, we will be attending the <a href="http://archivesunleashed.com/au4-0-british-invasion/">Archives Unleashed 4.0 Datathon</a> and looking at the <a href="https://lintool.github.io/warcbase-docs/">Warcbase</a> and <a href="https://github.com/helgeho/ArchiveSpark">ArchiveSpark</a> projects, as they provide a powerful set of open source tools and would enable us to collaborate directly with our research community. A stable data-analysis framework will also provide a platform for automated QA and report generation and make it much easier to update our <a href="https://data.bl.uk/UKWA/">datasets</a>.</p>
<p>Taken together, we believe these developments will revolutionise the way readers and researchers can use the UK Web Archive. It’s going to be an interesting year.</p>
<p><a href="http://anjackson.net/2017/06/09/discovery-and-access-plans/">Revitalising the UK Web Archive</a> was originally published by Andy Jackson at <a href="http://anjackson.net">andrew.n.jackson</a> on June 09, 2017.</p>http://anjackson.net/2017/04/30/more-than-just-a-copy2017-04-30T00:00:00+01:002017-04-30T00:00:00+01:00Andy Jacksonhttp://anjackson.netanj@anjackson.net<p>Following my <a href="/2017/04/19/access-starts-with-loading/">previous post</a>, a <a href="https://twitter.com/atomotic/status/854982711076950017">tweet from Raffaele Messuti</a> lead me to this quote:</p>
<blockquote>
<p>“Computers, by their nature, copy. Typing this line, the computer has copied the text multiple times in a variety of memory registers. I touch a button to type a letter, this releases a voltage that is then translated into digital value, which is then copied into a memory buffer and sent to another part of the computer, copied again into RAM and sent to the graphics card where it is copied again, and so on. The entire operation of a computer is built around copying data: copying is one of the most essential characteristics of computer science. One of the ontological facts of digital storage is that there is no difference between a computer program, a video, mp3-song, or an e-book. They are all composed of voltage represented by ones and zeros. Therefore they are all subject to the same electronic fact: they exist to be copied and can only ever exist as copies.”
<small>From <a href="http://networkcultures.org/wp-content/uploads/2014/06/NN07_complete.pdf">Radical Tactics of the Offline Library</a> via <a href="https://via.hypothes.is/http://networkcultures.org/wp-content/uploads/2014/06/NN07_complete.pdf#annotations:N5kgkCWnEeeWyJNvazzkJg">an annotation</a> by <a href="https://twitter.com/atomotic/status/854982711076950017">@atomotic</a>.</small></p>
</blockquote>
<p>Copying is indeed fundamental to how computers function, and we need to understand that to understand some of the limits and affordances of digital resources.</p>
<p>However, this isn’t quite what I was trying to say. The thing you interact with is <em>more</em> than just a copy.</p>
<p>Admittedly, sometimes the distance between the two is small. In the case of the ZX Spectrum loading screen they are almost identical: the data loaded from the tape just has to be unpacked<sup id="fnref:1"><a href="#fn:1" class="footnote">1</a></sup> and streamed into the right place in memory in order to produce the image.</p>
<p><img src="http://anjackson.net/digipres-lessons-learned/images/access-layers-spectrum.png" alt="Schematic of how a ZX Spectrum loading scheme works." /></p>
<p>But in general, the bitstream format (and its data model) are quite different to the run-time representation. Broadly speaking, it’s more common to have some internal data model that is manipulated when you interact with the software, which is then used to compose a ‘view’ of that model that can be delivered to the user (via the host operating system).</p>
<p><img src="http://anjackson.net/digipres-lessons-learned/images/access-layers-modern.png" alt="Schematic of how a modern software application usually works." /></p>
<p>This kind of software is often much richer than most of the formats it supports. For example, when I restored the image by saving it from the GIMP image editor, I didn’t <em>have</em> to save it as a JPEG. I could re-save it in a wide range of formats, translating the in-memory representation of the image into any of the bitstream formats GIMP supports. Like most editor software, it has a ‘native’ format that closely matches the capabilities of the software, and then a wider range of ‘export’ formats that are useful but may lose some functionality.</p>
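<p>Using the Pillow library as a stand-in for GIMP (the filenames are hypothetical), the point is simply that once a bitstream has been decoded into an in-memory representation, that representation can be written back out in any of the formats the software supports:</p>
<pre><code>from PIL import Image

img = Image.open("precious-photo.jpg")  # decode the JPEG bitstream into memory
img.save("precious-photo.png")          # re-encode the in-memory pixels as PNG
img.save("precious-photo.tiff")         # ...or as TIFF, and so on
</code></pre>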
<p>All of this only makes sense if the thing we’re interacting with is much more than just a copy of the original file.</p>
<p>Which makes it all the more surprising that we don’t have a good name for the thing we actually interact with.</p>
<hr />
<div class="footnotes">
<ol>
<li id="fn:1">
<p>The ZX Spectrum <a href="http://www.shadowmagic.org.uk/cssfaq/reference/48kreference.htm#TapeDataStructure">tape data structure</a> describes some interesting details of the packaging format, including the checksum used to ensure the data has loaded correctly. <a href="#fnref:1" class="reversefootnote">&#8617;</a></p>
</li>
</ol>
</div>
<p><a href="http://anjackson.net/2017/04/30/more-than-just-a-copy/">More than just a copy</a> was originally published by Andy Jackson at <a href="http://anjackson.net">andrew.n.jackson</a> on April 30, 2017.</p>http://anjackson.net/2017/04/19/access-starts-with-loading2017-04-19T00:00:00+01:002017-04-19T00:00:00+01:00Andy Jacksonhttp://anjackson.netanj@anjackson.net<p>So what was going on in <a href="/2017/04/14/unsafe-removal-results/">our little experiment in data destruction?</a> Well, to understand what happens when we open up digital files, I want to take you back to my childhood, back when ‘Loading…’ really <em>meant</em> something…
<!--break--></p>
<p>I’d like you to watch the following video. Please enjoy the sweet ‘music’ of the bytes of the bitstream as they stream off the tape and into the memory of the machine.</p>
<p>And no skipping to the end! Sit through the whole <em>damn</em> thing, just like I had to, all those years ago!</p>
<div style="text-align: center;">
<iframe width="560" height="315" src="https://www.youtube.com/embed/V0EfycbDhiw?rel=0" frameborder="0" allowfullscreen=""></iframe>
</div>
<p>I particularly like the bit from <a href="https://youtu.be/V0EfycbDhiw?t=24s">about 0:24s in</a>, as the loading screen loads…</p>
<p><img src="http://anjackson.net/digipres-lessons-learned/images/jetpac-screen-loading-montage.png" alt="JETPAC: loading the loading screen" /></p>
<p>First, we can see a monochrome image being loaded, section-by-section, with individual pixels flowing in row-after-row. The ones and zeros you can see are the same ones as the ones you can hear, but they are being copied from the tape, unpacked by the <a href="https://en.wikipedia.org/wiki/Central_processing_unit">CPU</a>, and being stored in a special part of the <a href="https://en.wikipedia.org/wiki/ZX_Spectrum">machine’s</a> memory, called the <a href="http://whatnotandgobbleaduke.blogspot.co.uk/2011/07/zx-spectrum-screen-memory-layout.html">screen memory</a>.</p>
<p>This screen memory is special because another bit of hardware (called the <a href="http://www.worldofspectrum.org/faq/reference/48kreference.htm#Contention">ULA</a>) can see what’s there, and uses it to compose the signal that gets sent to the television screen. As well as forming the binary pixels, it also uses the last chunk of memory to define what colours should be used, and combines these two sets of information to make the final image. You can see this as the final part of the screen-loading process happens, and the monochrome image suddenly fills with colour. You can even <em>hear</em> the difference between the pixel data and the colour information.</p>
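<p>For the curious, here is a rough sketch of that decoding process in Python, working on a raw 6,912-byte screen dump (6,144 bytes of pixel bitmap followed by 768 attribute bytes). It mimics what the ULA does in hardware, combining the interleaved pixel rows with the per-cell colour attributes; the filename is hypothetical and this is a crude text rendering rather than a faithful one.</p>
<pre><code>data = open("loading-screen.scr", "rb").read()  # hypothetical raw screen dump

def pixel(x, y):
    # The bitmap rows are stored in the Spectrum's interleaved order.
    offset = (y // 64) * 2048 + (y % 8) * 256 + ((y % 64) // 8) * 32 + x // 8
    return (data[offset] // (2 ** (7 - x % 8))) % 2

def colours(x, y):
    # One attribute byte per 8x8 cell: ink (bits 0-2), paper (bits 3-5), bright (bit 6).
    attr = data[6144 + (y // 8) * 32 + x // 8]
    return attr % 8, (attr // 8) % 8, (attr // 64) % 2

for y in range(192):
    row = []
    for x in range(256):
        ink, _paper, _bright = colours(x, y)
        row.append(str(ink) if pixel(x, y) else " ")
    print("".join(row))
</code></pre>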
<p>After that, the tape moves on and we have to wait even longer while the actual game loads.<sup id="fnref:1"><a href="#fn:1" class="footnote">1</a></sup></p>
<p>The point I want to emphasize is that this is just a slow-motion version of what still happens today. The notion of ‘screen memory’ has become more complex and layered, and it all happens <em>much</em> faster, but you’re still interacting with the computer’s memory, not the persistent bitstream.</p>
<p>Because working with memory is faster and simpler than working directly with storage devices, the kind of software that creates and edits files is much easier to write if you can load the whole file into memory to work on it there. The GIMP works like this, and that’s why I was able to re-save my test image out of it.</p>
<p>However, Apple Preview works differently. Based on my results, it seems likely that Preview retains a reference to the original file, which it uses to generate an intermediate in-memory image for display purposes (e.g. a scaled-down version). The cached intermediate image can still be shown, even if future operations may fail because the software can no longer find the original file.</p>
<p>These results only make sense because the thing you are interacting with via the computer screen is <em>not</em> the original bitstream, but a version of that data that has been loaded into the computer’s memory. The relationship between these two representations depends on the software involved, can be quite complicated, and the two forms can be quite different.<sup id="fnref:2"><a href="#fn:2" class="footnote">2</a></sup> My suspicion is that we need a better understanding of this relationship in order to better understand what it is we are actually trying to preserve.</p>
<hr />
<div class="footnotes">
<ol>
<li id="fn:1">
<p>What’s that? You skipped to the end!? Shame on you. <a href="#fnref:1" class="reversefootnote">&#8617;</a></p>
</li>
<li id="fn:2">
<p>As we’ve seen, this is true even for a very common and well standardised bitstream format like JPEG. <a href="#fnref:2" class="reversefootnote">&#8617;</a></p>
</li>
</ol>
</div>
<p><a href="http://anjackson.net/2017/04/19/access-starts-with-loading/">Access starts with 'Loading...'</a> was originally published by Andy Jackson at <a href="http://anjackson.net">andrew.n.jackson</a> on April 19, 2017.</p>http://anjackson.net/2017/04/14/unsafe-removal-results2017-04-14T00:00:00+01:002017-04-14T00:00:00+01:00Andy Jacksonhttp://anjackson.netanj@anjackson.net<p>Following my <a href="/2017/04/10/unsafe-device-removal/">proposed experiment in data destruction</a>, a few kind readers tried it out and let me know what happened<sup id="fnref:1"><a href="#fn:1" class="footnote">1</a></sup>. I’ve summarised the results below, to try and see if there’s any common pattern.</p>
<!--break-->
<table>
<thead>
<tr>
<th>Software</th>
<th>Format</th>
<th>Was recovery possible?</th>
</tr>
</thead>
<tbody>
<tr>
<td>Apple Preview</td>
<td>JPEG</td>
<td>No (rendered image still shown and could be captured via screenshot)<sup id="fnref:2"><a href="#fn:2" class="footnote">2</a></sup></td>
</tr>
<tr>
<td>GIMP</td>
<td>JPEG</td>
<td>Yes (with minor alterations to the data, likely <a href="https://photo.stackexchange.com/a/83892/62442">within allowed limits for JPEG</a>)<sup id="fnref:2:1"><a href="#fn:2" class="footnote">2</a></sup></td>
</tr>
<tr>
<td>Imagemagick display</td>
<td>JPEG</td>
<td>Yes (result not binary-identical)<sup id="fnref:3"><a href="#fn:3" class="footnote">3</a></sup></td>
</tr>
<tr>
<td>Ubuntu Image Viewer</td>
<td>JPEG</td>
<td>No<sup id="fnref:4"><a href="#fn:4" class="footnote">4</a></sup></td>
</tr>
<tr>
<td>Ubuntu Document Viewer</td>
<td>PDF</td>
<td>Yes<sup id="fnref:4:1"><a href="#fn:4" class="footnote">4</a></sup></td>
</tr>
<tr>
<td>PDF reader</td>
<td>PDF</td>
<td>No (a PDF opened from a browser stays open in the PDF reader after the browser closes, but can’t be saved)<sup id="fnref:5"><a href="#fn:5" class="footnote">5</a></sup></td>
</tr>
<tr>
<td>Word (Windows 95)</td>
<td>DOC (on a floppy!)</td>
<td>No (but re-inserting the floppy worked!)<sup id="fnref:6"><a href="#fn:6" class="footnote">6</a></sup></td>
</tr>
</tbody>
</table>
<p>As far as I can tell from this data, there isn’t much of a pattern here. Broadly, the observed behaviour seems to depend on the software rather than the format, and ‘viewer’ style applications appear less likely to allow re-saving than ‘editor’ apps (but the behaviour of the Ubuntu Document Viewer shows this is not a robust finding). All we can be sure of at this point is this: “It’s complicated”.</p>
<p>To find out what’s going on, we’ll need to look more closely at what happens when we open a file…</p>
<hr />
<div class="footnotes">
<ol>
<li id="fn:1">
<p>Thanks also to <a href="https://twitter.com/nkrabben">Nick Krabbenhöft</a> for <a href="/2017/04/10/unsafe-device-removal/#comment-3249002689">pointing out</a> that I could have been a bit more careful about my original experiment, and that would have helped work out where the JPEG differences came from in the case of re-saving the image from GIMP. That said, I expect such minor differences are down to small variations in the implementation of the JPEG decompression scheme, <a href="https://photo.stackexchange.com/a/83892/62442">as permitted by the standard</a>. i.e. my final image is likely no <em>more</em> different than the <em>same original image</em> might be when rendered by a <em>different software application</em>. <a href="#fnref:1" class="reversefootnote">&#8617;</a></p>
</li>
<li id="fn:2">
<p>See <a href="/2017/04/10/unsafe-device-removal/">the original post</a> <a href="#fnref:2" class="reversefootnote">&#8617;</a> <a href="#fnref:2:1" class="reversefootnote">&#8617;<sup>2</sup></a></p>
</li>
<li id="fn:3">
<p>Result from <a href="http://anjackson.net/2017/04/10/unsafe-device-removal/#comment-3249487142">@atomotic</a> <a href="#fnref:3" class="reversefootnote">&#8617;</a></p>
</li>
<li id="fn:4">
<p>Result from <a href="https://twitter.com/archivalistic/status/851907815673286656">@archivalistic</a> <a href="#fnref:4" class="reversefootnote">&#8617;</a> <a href="#fnref:4:1" class="reversefootnote">&#8617;<sup>2</sup></a></p>
</li>
<li id="fn:5">
<p>From <a href="https://twitter.com/andrewjbtw/status/851530416590790656">@andrewjbtw</a> <a href="#fnref:5" class="reversefootnote">&#8617;</a></p>
</li>
<li id="fn:6">
<p>Also from <a href="https://twitter.com/andrewjbtw/status/851531680632365056">@andrewjbtw</a> <a href="#fnref:6" class="reversefootnote">&#8617;</a></p>
</li>
</ol>
</div>
<p><a href="http://anjackson.net/2017/04/14/unsafe-removal-results/">Unsafe Device Removal: The Results</a> was originally published by Andy Jackson at <a href="http://anjackson.net">andrew.n.jackson</a> on April 14, 2017.</p>http://anjackson.net/2017/04/10/unsafe-device-removal2017-04-10T00:00:00+01:002017-04-10T00:00:00+01:00Andy Jacksonhttp://anjackson.netanj@anjackson.net<p>Let’s start with an experiment…</p>
<!--break-->
<h2 id="materials">Materials</h2>
<p>For this experiment, you will need:</p>
<ol>
<li>A USB flash drive of little importance. One of those old sub-GB ones you got from that conference will do.</li>
<li>A copy of a digital file of great importance. Any format will do, as long as you can open it.</li>
</ol>
<p>I’m going to use this drive:</p>
<p><img src="http://anjackson.net/digipres-lessons-learned/images/save-as/save-as-test-drive.jpg" alt="Test Drive" /></p>
<p>…and this JPEG:</p>
<p><img src="/digipres-lessons-learned/images/save-as/best-test-image.jpg" alt="My father and my son, alike." /></p>
<h2 id="method">Method</h2>
<ol>
<li>Copy the test file to the USB flash drive. <em>Do not use your only copy of the precious file!</em></li>
<li>Open up the test file from the USB drive, as you usually would (i.e. using the usual app for that format).</li>
<li>Pull out the USB flash drive. <em>Do Not Eject It Properly!</em> <strong>Just yank it right out!</strong><sup id="fnref:1"><a href="#fn:1" class="footnote">1</a></sup>
<ul>
<li><strong>Optional:</strong> <a href="https://www.youtube.com/watch?v=y2eNhPC8wCQ">Throw the USB drive into a blender and destroy it utterly</a>.</li>
</ul>
</li>
<li>Observe what happens.</li>
</ol>
<h2 id="results">Results</h2>
<p>In my experiment, the first thing that happened was…</p>
<p><img src="http://anjackson.net/digipres-lessons-learned/images/save-as/save-as-oops.jpg" alt="Disk Not Ejected Properly" /></p>
<p>…but beside this admonishment, the image was still there…</p>
<p><img src="http://anjackson.net/digipres-lessons-learned/images/save-as/save-as-still-there.jpg" alt="But Still There" /></p>
<p>The bitstream was gone (optionally blended into oblivion – the Digital Object destroyed). But the image was still on the screen. I bet yours is still there too.</p>
<p>But right now, it’s at risk. All it takes is loss of power to this machine, and the file will blink out of existence.<sup id="fnref:2"><a href="#fn:2" class="footnote">2</a></sup></p>
<p>Can you press ‘Save as…’, and get a new bitstream back? It depends on the software.</p>
<p>When I <a href="https://www.flickr.com/photos/anjacks0n/sets/72157655724233440">tried this with Apple Preview</a>, I couldn’t save the image, even though I could see it.</p>
<p><img src="http://anjackson.net/digipres-lessons-learned/images/save-as/save-as-preview-says-no.png" alt="Apple Preview Says No" /></p>
<p>The only way to save it seemed to be as a desktop screenshot, which I would then need to crop to get back an acceptable image.</p>
<p>But re-running the same experiment with image editing software (specifically the <a href="http://www.gimp.org/">GIMP</a>), I could press ‘Save as…’ and a new bitstream was written. Not <em>exactly</em> the same as the original, but good enough.<sup id="fnref:3"><a href="#fn:3" class="footnote">3</a></sup></p>
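<p>If you want to put a number on ‘not exactly the same’, one option is to compute the <a href="https://en.wikipedia.org/wiki/Peak_signal-to-noise_ratio">PSNR</a> between the original file and the re-saved copy. A small sketch using Pillow and NumPy (filenames hypothetical):</p>
<pre><code>import numpy as np
from PIL import Image

# Load both images as arrays of floats so we can compare pixel values.
a = np.asarray(Image.open("original.jpg").convert("RGB"), dtype=float)
b = np.asarray(Image.open("resaved.jpg").convert("RGB"), dtype=float)

mse = np.mean((a - b) ** 2)
psnr = 10 * np.log10((255.0 ** 2) / mse)  # higher means closer (identical images would divide by zero)
print("PSNR: {:.1f} dB".format(psnr))
</code></pre>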
<h2 id="over-to-you">Over to you</h2>
<p>I’d be fascinated to know what happens on other platforms and with other software, so please get in touch if you’ve tried this. I’d also be curious to know how the choice of format affects the outcome. If anyone has any results to share, I’ll collect them together in a follow-up post.</p>
<hr />
<div class="footnotes">
<ol>
<li id="fn:1">
<p>Go on, admit it, you’ve always wanted to try this and see what happens. Well, now you get to do it. For Science. <a href="#fnref:1" class="reversefootnote">&#8617;</a></p>
</li>
<li id="fn:2">
<p>And entropy will win. And we don’t want <em>that</em>. <a href="#fnref:2" class="reversefootnote">&#8617;</a></p>
</li>
<li id="fn:3">
<p>The two images were highly similar, with a <a href="https://en.wikipedia.org/wiki/Peak_signal-to-noise_ratio">PSNR</a> of just over 56dB and with a distribution of differences that looks like <a href="http://anjackson.net/digipres-lessons-learned/images/save-as/difference.png">this</a>. It is not clear if the variation is due to small differences in JPEG compression parameters, or if all the parameters are the same but the implementations have small differences in execution (e.g. rounding errors). <a href="#fnref:3" class="reversefootnote">&#8617;</a></p>
</li>
</ol>
</div>
<p><a href="http://anjackson.net/2017/04/10/unsafe-device-removal/">Unsafe Device Removal</a> was originally published by Andy Jackson at <a href="http://anjackson.net">andrew.n.jackson</a> on April 10, 2017.</p>http://anjackson.net/2017/04/04/digipres-lessons-learned2017-04-04T00:00:00+01:002017-04-04T00:00:00+01:00Andy Jacksonhttp://anjackson.netanj@anjackson.net<p>I find working in digital preservation fascinating.</p>
<p>It’s not where I expected to end up. I started off interested in computing and science, and happened to find out about what was then a fairly young <a href="http://web.archive.org/web/19970730073841/http://www.york.ac.uk/depts/phys/ugrad/courses/cphy_ss.htm">MPhys degree in Computational Physics offered by the University of York</a><sup id="fnref:1"><a href="#fn:1" class="footnote">1</a></sup>. I then did a Ph.D. in Computational Physics at Edinburgh University, working in statistical physics. After that, I spent my time oscillating between being a post-graduate researcher who used large-scale computational methods, and being a computational specialist who helped other scientists make use of those kinds of techniques.</p>
<p>I’d decided to move away from research and get a ‘normal’ industry programmer job, so when we moved to Leeds I applied for a few different positions. One of them turned out to be for the <a href="http://www.planets-project.eu/">PLANETS Project</a>, based at the British Library. I liked the place and the people, and the work sounded interesting, allowing me to expand my previous experience (not just in computation, but also the information theory that underlies statistical physics) to a new field. And <em>Industry</em> was spared my woolly ways.</p>
<p>I spent a happy few years working on the PLANETS Project and helping kick-off the follow-on <a href="http://scape-project.eu/">SCAPE Project</a>. I saw ‘Significant Properties’ <a href="http://www.dpconline.org/events/past-events/significant-properties">peak</a>, only to be <a href="http://www.planets-project.eu/docs/papers/Dappert_Significant_Characteristics_ECDL2009.pdf">fatally wounded shortly afterwards</a>, learned that <a href="http://blog.dshr.org/">DSHR</a> is usually right, found that it’s <a href="https://blogs.bodleian.ox.ac.uk/archivesandmanuscripts/2009/02/24/shoot-those-files/">fun</a> to <a href="http://web.archive.org/web/20090323121026/http://www.hki.uni-koeln.de/material/shotGun/">mash bits</a>, and that keeping the bits <a href="http://blog.dshr.org/2017/03/threats-to-stored-data.html">safe</a> ain’t as easy as I thought (even with <a href="https://www.youtube.com/watch?v=pbBa6Oam7-w">these folks</a> around).</p>
<p>But I also grew frustrated with working on digital preservation in the abstract. The research work was fun and challenging, but the gap between that and helping these old institutions navigate the digital turn seemed too vast. This same schism seemed to cause tension within the digital preservation community, for example during <a href="https://ipres-conference.org/">iPres</a> conferences, where I’d hear completely opposite ideas as to what the conference was really <em>for</em>. One group, more academic in composition, was drawn to the ‘big picture’ of very long time-scales, worst-cases, and grand solutions. Another group featured a higher concentration of individuals caring for digital collections that needed better preservation <em>now</em>, and needed to share research results and best practices about what they should be doing.</p>
<p>I felt like I’d spent too much time on the ‘grand solutions’ side of things, and so when the role of technical lead for the <a href="https://www.webarchive.org.uk/">UK Web Archive</a> came up, I leapt at the chance. Being <em>responsible</em> for the actual digital preservation of a large, complex collection of national importance would surely help focus the mind? Indeed it has.</p>
<p>My career trajectory might be unusual, but because the field of digital preservation is <em>all</em> interface, there’s a wide range of people from a lot of different backgrounds working in this nexus between the cultural and the technical. This range is powerful, but also ripe for miscommunication, and I worry we fail to learn from our own history. Sometimes it feels like the same mistakes are being made over and over again, and I can’t tell if we are failing to pool our knowledge, or if each of us needs to fall many times before we can learn to walk, and then to run(.exe)<sup id="fnref:2"><a href="#fn:2" class="footnote">2</a></sup>.</p>
<p>I feel like I’ve learned a lot in my decade of digital preservation, and I want to know if what I’ve learned might help others. I’d also like to know whether or not my own theories and opinions are worth the bytes and/or neurons they are encoded in. To this end, I’m going to try to use this blog series as a way of communicating the things I <em>think</em> I’ve learned over the years, in the hope that others find it useful. Or at least interestingly wrong.</p>
<hr />
<div class="footnotes">
<ol>
<li id="fn:1">
<p>Sometime between 1997 and <a href="http://web.archive.org/web/20050405163214/http://www.york.ac.uk/depts/phys/ugrad/courses/">2004</a> it had become ‘Physics with Computer Simulation’, but then by <a href="http://web.archive.org/web/20060508230240/http://www.york.ac.uk/depts/phys/ugrad/courses/">2006</a> it had been merged with the theory course, becoming ‘Theoretical &amp; Computational Physics’. This didn’t last long, and <a href="http://web.archive.org/web/20060908134758/http://www.york.ac.uk/depts/phys/ugrad/courses/">later that year</a> had been fully subsumed into the ‘Theoretical Physics’ course. These days, all physicists are at least somewhat computational, and almost all theorists use computational methods of some sort, even if they don’t all use simulation. <a href="#fnref:1" class="reversefootnote">&#8617;</a></p>
</li>
<li id="fn:2">
<p>Sorrynotsorry <a href="#fnref:2" class="reversefootnote">&#8617;</a></p>
</li>
</ol>
</div>
<p><a href="http://anjackson.net/2017/04/04/digipres-lessons-learned/">Digital Preservation: Lessons Learned?</a> was originally published by Andy Jackson at <a href="http://anjackson.net">andrew.n.jackson</a> on April 04, 2017.</p>http://anjackson.net/2016/06/08/frontiers-in-format-identification2016-06-08T00:00:00+01:002016-06-08T00:00:00+01:00Andy Jacksonhttp://anjackson.netanj@anjackson.net<p>I came to work on digital preservation through <a href="http://www.planets-project.eu/">the PLANETS project</a>, and later <a href="http://scape-project.eu/">the SCAPE project</a> (for the first year) before moving over to web archiving. These were inspiring projects which achieved a great deal, but we were left with lessons to be learned.</p>
<!--break-->
<p>In particular, I remember a meeting between the representatives of the various content-holding institutions – the libraries and archives that were intended to benefit the most from the results of those projects. All of us agreed that, while we appreciated the importance of planning and implementing preservation actions, we were concerned that the fundamental evidence base wasn’t really strong enough to make this planning reliable. Quite simply, we were (and still are) worried that we just don’t understand our collections well enough to make sensible decisions about their future care. In retrospect, I think we should have directed more resources at the problem of characterising and analysing the content we have – at understanding the question rather than rushing to the answer.</p>
<p>I was reminded of this when Jenny Mitcham tweeted about the problems you hit when trying to identify research data formats:</p>
<blockquote>
<p>Any ideas how we solve DAT file format problem? <a href="https://digital-archiving.blogspot.co.uk/2016/05/research-data-what-does-it-really-look.html">https://digital-archiving.blogspot.co.uk/2016/05/research-data-what-does-it-really-look.html</a>
<small><a href="https://twitter.com/jenny_mitcham/status/740513668807462912">@Jenny_Mitcham</a></small></p>
</blockquote>
<p>and so I thought it might be worth sharing some of the ideas that were borne out of those big projects, but we had little or no time to pursue.</p>
<h2 id="sharing-format-profiles">Sharing format profiles</h2>
<p>I think there is much we could do to improve how we use data about formats to drive the improvement of format identification and analysis tools. For example I was very happy to see that shortly after Jenny’s tweet, the Bentley Historical Library followed up with a similar format profile:</p>
<blockquote>
<p>New post looking at our born-digital file formats (in response to @Jenny_Mitcham’s on research data file formats): <a href="http://archival-integration.blogspot.co.uk/2016/06/born-digital-data-what-does-it-really.html">http://archival-integration.blogspot.co.uk/2016/06/born-digital-data-what-does-it-really.html</a>
<small><a href="https://twitter.com/umbhlcuration/status/740643410479026176">@UMBHLCuration</a></small></p>
</blockquote>
<p>Comparing these two sets of results, you immediately get a flavour for the shared problems and some of the more surprising challenges that lie ahead. How do we tease apart different .dat files? Why did some of the simpler formats to identify, like PDF and PNG, appear to fail format identification?</p>
<p>I believe routinely and systematically sharing this kind of information would be extremely useful, encouraging a more data-driven approach to tool development, and helping us to explore the relative strengths and weaknesses of various tools. As <a href="https://twitter.com/nkrabben/status/740649460804521988">Nick Krabbenhöft indicated</a>, there are some tools being developed in this area, and it would be good to see this kind of summary information being shared<sup id="fnref:1"><a href="#fn:1" class="footnote">1</a></sup>.</p>
<p><strong>UPDATE:</strong> See also <a href="https://gist.github.com/bitsgalore/96423db2e3cb48e3c5d0cbc1bd41a85e">this excellent comment</a> from <a href="https://twitter.com/bitsgalore">Johan van der Knijff</a> that includes links to format profile results from the KB e-Depot.</p>
<h2 id="aggregating-format-registries">Aggregating format registries</h2>
<p>There are also gains to be made by making the most of what we already have. To this end, some time ago I built a website that aggregates the contents of five different format registries: <a href="http://www.digipres.org/formats/">http://www.digipres.org/formats/</a></p>
<p>This merges the various sources into a single, coherent format hierarchy <a href="http://www.digipres.org/formats/mime-types/#application/xml">(e.g. XML)</a>, and allows you to see what five different registries say about a given file extension <a href="http://www.digipres.org/formats/extensions/#*.flac">(e.g. flac)</a>. It also allows you to <a href="http://www.digipres.org/formats/overlaps/">compare the contents of the registries</a>, leading to the surprising realisation that the degree of overlap between them is really rather small. Currently, only 77 file extensions are known to all five registries, but perhaps publicising the gaps between them will help encourage those gaps to be closed.</p>
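<p>The overlap comparison itself is just set arithmetic. A toy sketch (the registry names and contents here are placeholders, not the real data):</p>
<pre><code># Map each registry to the set of file extensions it knows about.
registries = {
    "registry-a": {"pdf", "png", "flac", "xml", "doc"},
    "registry-b": {"pdf", "png", "xml", "csv"},
    "registry-c": {"pdf", "png", "flac", "dat"},
}

common = set.intersection(*registries.values())
print("Known to all registries:", sorted(common))

for name, exts in registries.items():
    others = set.union(*(e for n, e in registries.items() if n != name))
    print("Only in", name + ":", sorted(exts - others))
</code></pre>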
<p>Obviously, it would be really interesting to combine this kind of approach with the idea of sharing format profiles from real collections, and to hook it up to sources of information about tools (like <a href="http://www.digipres.org/tools/">COPTR/POWRR</a>).</p>
<h2 id="exploring-semi-automated-format-identification">Exploring semi-automated format identification</h2>
<p>In the web archive, we’ve done some work on exploring formats by combining search facets based on the first few bytes of a file with facets built from the file extension. This has been a very useful way to start exploring files based on common internal features, because most software uses a fixed file header format as its own format identification technique. This makes it possible to automatically generate an initial format signature (this was also the basis of the <a href="https://github.com/blekinge/percipio">Percipio tool</a>, which in turn is similar to how <a href="http://mark0.net/soft-trid-e.html">TrID</a> works).</p>
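<p>The underlying idea is simple enough to sketch: walk a sample collection (the path below is hypothetical), take the hex of the first few bytes of each file together with its extension, and count the combinations. Files sharing a common header then cluster together, and unfamiliar headers stand out.</p>
<pre><code>import os
from collections import Counter

profile = Counter()
for root, _dirs, files in os.walk("sample-collection"):
    for name in files:
        with open(os.path.join(root, name), "rb") as f:
            prefix = f.read(4).hex()  # first four bytes, as hex
        ext = os.path.splitext(name)[1].lower() or "(none)"
        profile[(prefix, ext)] += 1

for (prefix, ext), count in profile.most_common(20):
    print(prefix, ext, count)
</code></pre>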
<p>We’ve tried to build on some of these ideas in our web archive analysis systems, but we’ve not had enough time to explore the results. I suspect we could learn a lot about the more obscure file formats very quickly if we were able to cluster them based on common internal structures and patterns.</p>
<h2 id="developing-identification-techniques-for-text-formats">Developing identification techniques for text formats</h2>
<p>One of the long-standing problems in format identification is how to cope with text-based formats. This causes problems for web archives when there are wrong or missing file extensions or MIME types, because formats like CSS and JavaScript are hard to spot. Similarly, CSV, TSV and other text data formats are unfortunately reliant on file extension for identification, as are computer source code formats (C, Java, etc.).</p>
<p>My starting point here would be to extend the n-gram frequency methods used in natural language detection (see <a href="http://cloudmark.github.io/Language-Detection/">here</a> for an introduction). However, instead of relying on the frequencies of individual character combinations, the idea would be to build language profiles based on classes of text entities, like punctuation, quoted strings, keywords, and so on. For example, CSV files mostly use alphanumeric strings, commas and newlines, but rarely use curly braces or tabs. If this initial analysis is inconclusive, we can clarify the situation by attempting to actually parse the first few lines of the file. I’m reasonably confident these kinds of tactics would vastly improve our current format identification capabilities.</p>
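<p>As a crude illustration of the character-class idea (very much a sketch, not a finished identifier), you can profile a snippet of text by the proportions of different character classes; a profile heavy on commas and light on braces points towards CSV rather than JSON or source code:</p>
<pre><code>def class_profile(text):
    counts = {"alnum": 0, "comma": 0, "brace": 0, "tab": 0, "quote": 0, "other": 0}
    for ch in text:
        if ch.isalnum():
            counts["alnum"] += 1
        elif ch == ",":
            counts["comma"] += 1
        elif ch in "{}[]":
            counts["brace"] += 1
        elif ch == "\t":
            counts["tab"] += 1
        elif ch in "\"'":
            counts["quote"] += 1
        else:
            counts["other"] += 1
    total = max(sum(counts.values()), 1)
    return {k: round(v / total, 2) for k, v in counts.items()}

print(class_profile("id,name,value\n1,apple,3.2\n2,pear,1.7\n"))
print(class_profile('{"id": 1, "name": "apple", "value": 3.2}'))
</code></pre>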
<h2 id="documenting-obsolescence">Documenting obsolescence</h2>
<p>Finally, another area which would really benefit from a stronger base of actual evidence, based on real experience, would be format obsolescence itself. I’m hoping to share more stories of difficult formats on this blog in the near future, but I’d also like to see us all collect more examples of how formats are born and how formats die, so we can better understand the challenges ahead of us.</p>
<h2 id="building-the-community">Building the community</h2>
<p>There’s a lot to be done, and I don’t have nearly as much time as I’d like to work on these issues. We need to collaborate effectively, between content-holding organisations and with researchers and tool developers. I’d love to hear from anyone who wants to work on these problems, or about any related challenges that I’ve missed out here.</p>
<div class="footnotes">
<ol>
<li id="fn:1">
<p>We have <a href="https://github.com/peshkira/c3po">C3PO</a> as a standalone tool, and <a href="https://github.com/timothyryanwalsh/brunnhilde">Brunnhilde</a> and <a href="http://openpreservation.org/blog/2016/05/24/while-were-on-the-subject-a-few-more-points-of-interest-about-the-siegfrieddroid-analysis-tool/">droid-sqlite-analysis</a> designed to analyse the output from Siegfried and/or DROID. For web archives, similar functionality has been built into the UKWA <a href="https://github.com/ukwa/webarchive-discovery">webarchive-discovery</a> stack, and we <a href="http://data.webarchive.org.uk/opendata/ukwa.ds.2/fmt/">publish format profiles as open data</a>. <a href="#fnref:1" class="reversefootnote">&#8617;</a></p>
</li>
</ol>
</div>
<p><a href="http://anjackson.net/2016/06/08/frontiers-in-format-identification/">Frontiers in Format Identification</a> was originally published by Andy Jackson at <a href="http://anjackson.net">andrew.n.jackson</a> on June 08, 2016.</p>http://anjackson.net/2016/04/11/building-tools-to-archive-the-modern-web2016-04-11T00:00:00+01:002016-04-11T00:00:00+01:00Andy Jacksonhttp://anjackson.netanj@anjackson.net<p>Four years ago, during the 2012 IIPC General Assembly, we came together to discuss the recent and upcoming challenges to web archiving in the <a href="http://netpreserve.org/sites/default/files/resources/OverviewFutureWebWorkshop.pdf">Future of the Web Workshop</a> (see also <a href="http://blog.dshr.org/2012/05/harvesting-and-preserving-future-web.html">this related coverage on David Rosenthal’s blog</a>). That workshop made it clear that our tools are failing to satisfy many of these challenges:</p>
<ul>
<li>Database driven features</li>
<li>Complex/variable URI formats</li>
<li>Dynamically generated URIs</li>
<li>Rich, streamed media</li>
<li>Incremental display mechanisms</li>
<li>Form-filling</li>
<li>Multi-sourced, embedded content</li>
<li>Dynamic login, user-sensitive embeds</li>
<li>User agent adaptation</li>
<li>Exclusions (robots.txt, user-agent, …)</li>
<li>Exclusion by design (i.e. site architecture intended to inhibit crawling and indexing)</li>
<li>Server-side scripts, RPCs</li>
<li>HTML5 web sockets</li>
<li>Mobile sites</li>
<li>DRM protected content, now part of the HTML standard</li>
<li>Paywalls</li>
</ul>
<p>I wish I could stand here and tell you how much great progress we’ve made in the last four years, ticking entries off this list, but I can’t. Although we’ve made some progress, our crawl development resources have been consumed by more basic issues. We knew moving to domain crawling under Legal Deposit would bring big changes in scale, but I’d underestimated how much the <em>dynamics</em> of the crawl workflow would need to change.</p>
<p>News websites are a great example.</p>
<p><a href="/building-web-archives/images/iipc-ga-2016/bbc-news-visits-open-ukwa.png"><img src="/building-web-archives/images/iipc-ga-2016/bbc-news-visits-open-ukwa.png" alt="BBC News under selective archiving" /></a></p>
<p>Under selective archiving, we would generally only archive specific articles, relating to themes or events. For example, since 2008, we had captured just 13 snapshots of the BBC News home page and just a few hundred individual news articles.</p>
<p><a href="/building-web-archives/images/iipc-ga-2016/bbc-news-visits-ld-ukwa.png"><img src="/building-web-archives/images/iipc-ga-2016/bbc-news-visits-ld-ukwa.png" alt="BBC News under Legal Deposit" /></a></p>
<p>Under Legal Deposit, we expect to archive <em>every single news article</em>. This means we need to visit each news site at least once a day, and ideally more frequently than that, while also going back and re-crawling everything at least once every few months in case the content or presentation is changed.</p>
<p><a href="/building-web-archives/images/iipc-ga-2016/daily-pulse.png"><img src="/building-web-archives/images/iipc-ga-2016/daily-pulse.png" alt="Daily crawl pulse" /></a></p>
<p>Our first large-scale ‘frequent crawler’ worked by re-starting the whole crawl job every time, but this stop-start approach meant the sites we visited very frequently would never be crawled very deeply - it’s obviously impossible to crawl all of BBC News in a day. The problem was the need to capture both depth and frequent changes.</p>
<p>One approach would have been to create many different crawl jobs for different combinations of sites, depths and frequencies, but this ran into our second problem - the poor manageability of our tools and workflows. Managing multiple overlapping jobs in Heritrix has proven to be awkward, needing a lot of manual intervention to move different jobs around in order to balance out resource usage over time.<sup id="fnref:1"><a href="#fn:1" class="footnote">1</a></sup> As well as being difficult to fully automate, multiple jobs are also more difficult to monitor effectively.</p>
<p>We’ve learned that <em>everything</em> that can be automated <em>should</em> be, and <em>must</em> also be <em>seen</em> to be working. This is not just monitoring uptime or disk space, but runs all the way up to automating quality assurance wherever possible.</p>
<p>However, while exploring how to improve our automated QA, we met the third problem. Our tools were failing to archive the web, making automated QA irrelevant. Not only are our tools difficult to configure, manage and monitor, but they have also fallen too far behind current web development practices. Many of those ‘future web challenges’ we talked about in 2012, are here, now.</p>
<p>Here’s the BBC News homepage on the 20th of March last year:</p>
<p><a href="/building-web-archives/images/iipc-ga-2016/bbc-just-before.png"><img src="/building-web-archives/images/iipc-ga-2016/bbc-just-before.png" alt="BBC News on 20th March 2015" /></a></p>
<!--
This was fine: https://www.webarchive.org.uk/act/wayback/20150320152832/http://www.bbc.co.uk/news
-->
<p>And here it is four days later:</p>
<p><a href="/building-web-archives/images/iipc-ga-2016/bbc-just-after.png"><img src="/building-web-archives/images/iipc-ga-2016/bbc-just-after-clipped.png" alt="BBC News on 24th March 2015 ...mmmmm jaggies..." /></a></p>
<!--
This was not fine: https://www.webarchive.org.uk/act/wayback/20150324131847/http://www.bbc.co.uk/news
-->
<p>We knew the growth in mobile browsing would change the web, and in this case the problems are due to the BBC’s adoption of “responsive design”. Like many other sites, they now use JavaScript to adapt to different displays, although they have aggressively optimised for mobile first. The site delivers a basic, functional page (with only a few low resolution images), but then uses JavaScript to enhance the display by loading in different stylesheets and more, higher resolution images. Our Heritrix only got the version intended for very small screens.</p>
<p><a href="/building-web-archives/images/iipc-ga-2016/the-guardian.png"><img src="/building-web-archives/images/iipc-ga-2016/the-guardian-clipped.png" alt="The Guardian on 31th March 2015" /></a></p>
<p>A more representative example of these issues comes from The Guardian. Their responsive design was more standard and used the <code>srcset</code> attribute on images to provide different resolution versions. However, even though the srcset image attribute <a href="https://github.com/whatwg/html/commit/969543cd259a0cc41a0a5cbe97e0010c6999eb09?diff=split">has been around since 2012</a> and <a href="http://caniuse.com/srcset">implemented widely since 2014</a>, it’s still not supported by Heritrix out-of-the-box and has only just become supported by OpenWayback (<a href="https://github.com/iipc/openwayback/issues/310">in version 2.3.1</a>).</p>
<p><a href="/building-web-archives/images/iipc-ga-2016/srcset-standard-example.png"><img src="/building-web-archives/images/iipc-ga-2016/srcset-standard-example.png" alt="Example of srcset from the HTML 5 standard" /></a></p>
<p><a href="/building-web-archives/images/iipc-ga-2016/can-i-use-srcset.png"><img src="/building-web-archives/images/iipc-ga-2016/can-i-use-srcset.png" alt="Can I Use srcset?" /></a></p>
<p>In summary, we needed a new crawl process that:</p>
<ul>
<li>Crawls more often as well as crawling sites deeply.</li>
<li>Is easier to develop, monitor and manage.</li>
<li>Uses a browser to render at least the main seed URLs in order to capture more dependencies.</li>
</ul>
<!--
* Can spot documents from selected sites that should be catalogued, and generates initial metadata for them.
-->
<p>To do this, we’ve moved our crawling process away from separate crawl jobs, and instead we run one continuous crawl. To launch crawls for new material, we inject seeds into the crawl as part of an overall workflow. This workflow breaks the overall crawl process down into smaller, separate processes that perform individual crawl tasks, joined together using a standard message queue system.</p>
<p><a href="/building-web-archives/images/iipc-ga-2016/crawl-agents-hires.png"><img src="/building-web-archives/images/iipc-ga-2016/crawl-agents.png" alt="Crawl System Overview" /></a></p>
<p>This architectural pattern, with relatively simple processes joined together by queues acting as buffers, is fairly common in mainstream distributed computing systems. As long as clear APIs define how each microservice talks to the next, the individual components can be restarted, upgraded or replaced without bringing down the whole system. The chain of messages also provides a handy way of monitoring the overall progress of the crawl, and the approach encourages a more modular design, where each component has a clear role and function, making it easier to understand what’s happening and to bring new developers up to speed.</p>
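<p>As a rough sketch of what each link in that chain looks like (this assumes a RabbitMQ-style broker and the Python <code>pika</code> client, neither of which is implied by the diagram below, and the <code>process()</code> function is a stand-in for whatever work a given stage does), every component is essentially a small consume-process-publish loop:</p>
<pre><code>import json
import pika  # assuming a RabbitMQ-style broker and the pika 1.x client API


def process(message):
    """Placeholder for the real work of a stage (rendering, crawling, indexing, ...)."""
    return message


def run_worker(in_queue, out_queue, host="localhost"):
    connection = pika.BlockingConnection(pika.ConnectionParameters(host=host))
    channel = connection.channel()
    channel.queue_declare(queue=in_queue, durable=True)
    channel.queue_declare(queue=out_queue, durable=True)

    def on_message(ch, method, properties, body):
        result = process(json.loads(body))
        # Pass the outcome on to the next stage in the chain:
        ch.basic_publish(exchange="", routing_key=out_queue, body=json.dumps(result))
        # Only acknowledge once the follow-on message has been queued:
        ch.basic_ack(delivery_tag=method.delivery_tag)

    channel.basic_consume(queue=in_queue, on_message_callback=on_message)
    channel.start_consuming()


# e.g. run_worker("uris-to-render", "uris-to-crawl")
</code></pre>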
<p><a href="/building-web-archives/images/iipc-ga-2016/act-bbc-example.png"><img src="/building-web-archives/images/iipc-ga-2016/act-bbc-example.png" alt="W3ACT BBC News Crawl Schedule" /></a></p>
<p>Our workflow starts with our annotation and curation tool. This is used by our staff to configure which sites should be visited and how frequently, but unlike its predecessor, the Web Curator Tool, none of the more complex crawl processes are tightly bound inside it. All it does is provide a crawl feed, which looks like this:</p>
<pre><code>[
{
"id": 1,
"title": "gov.uk Publications",
"seeds": [
"https://www.gov.uk/government/publications"
],
"schedules": [
{
"frequency": "MONTHLY",
"startDate": 1438246800000,
"endDate": null
}
],
"scope": "root",
"depth": "DEEP",
"ignoreRobotsTxt": false,
"documentUrlScheme": null,
"loginPageUrl": null,
"secretId": null,
"logoutUrl": null,
"watched": false
},
...
</code></pre>
<p><a href="/building-web-archives/images/iipc-ga-2016/crawl-agents-1.png"><img src="/building-web-archives/images/iipc-ga-2016/crawl-agents-1.png" alt="Crawl System Crawl Stage 1" /></a></p>
<p>A separate launcher process runs every hour, downloads this feed, and checks if any crawls should be launched in the next hour. If so, it posts the seeds of the crawl onto a message queue called ‘uris-to-render’.<sup id="fnref:2"><a href="#fn:2" class="footnote">2</a></sup></p>
<pre><code>{
"clientId": "FC-3-uris-to-crawl",
"isSeed": true,
"metadata": {
"heritableData": {
"heritable": [
"source",
"heritable"
],
"source": ""
},
"pathFromSeed": ""
},
"url": "https://www.gov.uk/government/publications"
}
</code></pre>
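<p>A minimal sketch of that launcher logic might look like the following (the feed URL is hypothetical, <code>publish()</code> is assumed to wrap whatever message-queue client is in use, and the frequency matching is deliberately crude, ignoring end dates):</p>
<pre><code>import datetime
import json
import requests

FEED_URL = "http://w3act.example/api/crawl/feed"  # hypothetical feed endpoint


def due_this_hour(schedule, now):
    """Very rough check: does this schedule fire within the current hour?"""
    start = datetime.datetime.fromtimestamp(schedule["startDate"] / 1000.0)
    if start > now:
        return False
    if schedule["frequency"] == "DAILY":
        return start.hour == now.hour
    if schedule["frequency"] == "WEEKLY":
        return start.weekday() == now.weekday() and start.hour == now.hour
    if schedule["frequency"] == "MONTHLY":
        return start.day == now.day and start.hour == now.hour
    return False


def launch_due_targets(publish):
    now = datetime.datetime.now()
    for target in requests.get(FEED_URL).json():
        if any(due_this_hour(s, now) for s in target["schedules"]):
            for seed in target["seeds"]:
                publish("uris-to-render", json.dumps({
                    "url": seed,
                    "isSeed": True,
                    "metadata": {"heritableData": {"source": ""}, "pathFromSeed": ""},
                }))
</code></pre>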
<p><a href="/building-web-archives/images/iipc-ga-2016/crawl-agents-2.png"><img src="/building-web-archives/images/iipc-ga-2016/crawl-agents-2.png" alt="Crawl System Crawl Stage 2" /></a></p>
<p>In this new workflow, the first actual crawl activity is now a browser-based rendering step. Before we do anything else, we run the seeds through an embedded web browser and attempt to capture a good rendering of the original site. We keep the rendered pages, and extract the embedded and navigational links that we find. These URIs are then passed on to a ‘uris-to-crawl’ queue.</p>
<pre><code>{
"headers": {},
"isSeed": true,
"method": "GET",
"parentUrl": "https://www.gov.uk/government/publications",
"parentUrlMetadata": {
"heritableData": {
"heritable": [
"source",
"heritable"
],
"source": ""
},
"pathFromSeed": ""
},
"url": "https://www.gov.uk/government/publications"
}
</code></pre>
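<p>A sketch of that rendering stage, using Selenium-driven headless Chrome purely as a stand-in for whichever embedded browser is used (and, again, with <code>publish()</code> assumed to wrap the queue client), might look like this:</p>
<pre><code>import json
from selenium import webdriver
from selenium.webdriver.common.by import By


def render_and_extract(seed_url, publish):
    """Render a seed in a headless browser, keep a rendering, and queue the links we find."""
    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(seed_url)
        driver.save_screenshot("rendered-seed.png")  # keep a rendering of the page as the browser saw it
        links = {a.get_attribute("href") for a in driver.find_elements(By.CSS_SELECTOR, "a[href]")}
        images = {i.get_attribute("src") for i in driver.find_elements(By.CSS_SELECTOR, "img[src]")}
        for url in sorted(u for u in links | images if u):
            publish("uris-to-crawl", json.dumps({
                "url": url,
                "isSeed": False,
                "method": "GET",
                "parentUrl": seed_url,
            }))
    finally:
        driver.quit()
</code></pre>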
<p><a href="/building-web-archives/images/iipc-ga-2016/crawl-agents-3.png"><img src="/building-web-archives/images/iipc-ga-2016/crawl-agents-3.png" alt="Crawl System Crawl Stage 3" /></a></p>
<p>Our long-running Heritrix crawl is configured to pull messages off this queue and push the URLs into the crawl frontier. At this point, we have broken our rule of having small, simple processes, and we’re using Heritrix in quite a similar way to our earlier crawls. The way it runs and the way it queues and prioritises URLs have been modified to suit this continuous approach, but it’s still an uncomfortably monolithic chunk that manages a lot of crawl state in a way that is difficult to monitor.</p>
<p>To complete the workflow, we also configured Heritrix to push a message onto a ‘uris-to-index’ queue after each resource is crawled.</p>
<p><a href="/building-web-archives/images/iipc-ga-2016/crawl-agents-4.png"><img src="/building-web-archives/images/iipc-ga-2016/crawl-agents-4.png" alt="Crawl System Crawl Stage 4" /></a></p>
<pre><code>{
"annotations": "ip:173.236.225.186,duplicate:digest",
"content_digest": "sha1:44KA4PQA5TYRAXDIVJIAFD72RN55OQHJ",
"content_length": 324,
"extra_info": {},
"hop_path": "IE",
"host": "acid.matkelly.com",
"jobName": "frequent",
"mimetype": "text/html",
"seed": "WTID:12321444",
"size": 511,
"start_time_plus_duration": "20160127211938966+230",
"status_code": 404,
"thread": 189,
"timestamp": "2016-01-27T21:19:39.200Z",
"url": "http://acid.matkelly.com/img.png",
"via": "http://acid.matkelly.com/",
"warc_filename": "BL-20160127211918391-00001-35~ce37d8d00c1f~8443.warc.gz",
"warc_offset": 36748
}
</code></pre>
<p>This is very similar to the Heritrix crawl log, but in the form of a stream of crawl event messages, which are then submitted to a dedicated CDX server. This standalone component, developed by the National Library of Australia, provides a clear API for both adding and querying CDX data, and can cope with the submission of many hundreds of CDX records per second.</p>
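<p>The exact submission format depends on the CDX server, but as a sketch (the endpoint and the eleven-field layout below are assumptions, not a documented API), each crawl event can be flattened into a CDX line and POSTed in batches:</p>
<pre><code>import requests

CDX_SERVER = "http://cdxserver.example:8080/fc"  # hypothetical collection endpoint


def to_cdx_line(event):
    """Map a crawl-event message onto a common eleven-field CDX layout (sketch only)."""
    ts = event["timestamp"].replace("-", "").replace(":", "").replace("T", "")[:14]
    return " ".join([
        event["url"],  # using the plain URL in place of a canonicalised URL key
        ts,            # 14-digit timestamp
        event["url"],
        event["mimetype"],
        str(event["status_code"]),
        event["content_digest"].replace("sha1:", ""),
        "-", "-",      # redirect and meta fields, unused here
        str(event["size"]),
        str(event["warc_offset"]),
        event["warc_filename"],
    ])


def submit(events):
    body = "\n".join(to_cdx_line(e) for e in events) + "\n"
    requests.post(CDX_SERVER, data=body.encode("utf-8")).raise_for_status()
</code></pre>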
<p>Our QA OpenWayback is configured to use this real-time CDX server, and to be able to look at the WARC files that are currently being written as well as the older ones. This means that crawled resources become available in Wayback almost instantly.</p>
<p><a href="/building-web-archives/images/iipc-ga-2016/crawl-agents-5.png"><img src="/building-web-archives/images/iipc-ga-2016/crawl-agents-5.png" alt="Crawl System Crawl Stage 5" /></a></p>
<p>Under our previous crawl procedure, we were only able to update the CDX index overnight, so this live feedback is a big improvement for us. It also allows the part of the workflow that extracts documents from the crawl to check that they are available before passing them back to W3ACT for cataloguing.</p>
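<p>That availability check can be as simple as asking the CDX server whether it has seen the URL yet, along these lines (the endpoint and query syntax are assumptions, as above):</p>
<pre><code>import requests

CDX_SERVER = "http://cdxserver.example:8080/fc"  # hypothetical collection endpoint


def is_archived(url):
    """Return True if the CDX server already holds at least one capture of this URL."""
    response = requests.get(CDX_SERVER, params={"url": url}, timeout=30)
    return response.ok and bool(response.text.strip())
</code></pre>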
<p>Putting it all together, we have a more robust crawl system where, depending on the complexity of the page, the whole process of rendering, archiving and making a seed available can complete in seconds.</p>
<p><a href="/building-web-archives/images/iipc-ga-2016/success-kid-archived-in-20-seconds.jpg"><img src="/building-web-archives/images/iipc-ga-2016/success-kid-archived-in-20-seconds.jpg" alt="Live to Archived in 20 seconds" /></a></p>
<p>Smaller components are easier for our developers to work on, and it’s easy to scale up and tune the number of clients running at each individual stage. For example, if the system is under a lot of load, we can re-configure the number of web page renderers on the fly without bringing down the whole system.</p>
<p>Unfortunately, although leading with the web render step has improved things, it’s still not good enough.</p>
<p><a href="/building-web-archives/images/iipc-ga-2016/bbc-news-improved.png"><img src="/building-web-archives/images/iipc-ga-2016/bbc-news-improved.png" alt="BBC News - wider but with missing images" /></a></p>
<p>We’ve got the stylesheet and more of the images for the BBC News homepage, but some images are still missing.</p>
<p>To really solve these issues, we need to capture the resources precisely as the embedded web browser received them, rather than downloading them once in the browser and then passing the URLs to Heritrix to be downloaded again. We also need to be able to run more pages through browser engines, which means we need to find a way of making Heritrix itself more modular and scalable.</p>
<p>To summarise, out of that big list of “future challenges” from 2012, the most critical ones we are facing right now are:</p>
<ul>
<li>Incremental display mechanisms
<ul>
<li><em>Especially multi-platform ‘responsive design’</em></li>
<li><em>The <code>srcset</code> attribute is still not supported by Heritrix3 out-of-the-box</em></li>
</ul>
</li>
<li>Multi-sourced, embedded content
<ul>
<li><em>Identifying embedded resources now requires JavaScript execution</em></li>
<li><em>Need to run more pages through browsers</em></li>
</ul>
</li>
<li>Rich, streamed media
<ul>
<li><em>Can often be captured, but the tools are not integrated, and storage and playback are problematic and unstandardised</em></li>
</ul>
</li>
<li>Paywalls
<ul>
<li><em>The crawlers need to be able to login</em></li>
</ul>
</li>
</ul>
<p>And I’d now add to the list:</p>
<ul>
<li>SSL is more and more common
<ul>
<li><em>OpenWayback support for SSL is poor out-of-the-box</em></li>
<li><em>and what of HTTP/2 around the corner?</em></li>
</ul>
</li>
</ul>
<p>I know many of you have already faced many of these issues in various ways, and I’d love to hear about the different tactics that you’ve been using.</p>
<p>Because we (the British Library) need a big push to improve our tools, and we are running out of time. Fortunately, my organisation is willing to put some funding behind solving this problem over the next two years. But we’d really rather not do this alone.</p>
<p>Whatever we do will be open source, but I’d much rather be part of an <em>open source project</em>. Our tools and our results will be much better if we can find ways of pooling our resources, but more importantly, I think many of us enjoy working with our peers in other organisations. Simply put, I think open source projects are more fun, especially when you’re working in a complex niche like web archiving. I suspect most of us are working in small teams or as individuals in much larger organisations, and I believe we work best when we work together.</p>
<p>There is a whole conference track dedicated to Exploring Collaboration tomorrow, where we can spend time examining the ways this kind of collaboration might work. Please do come along if you’re interested in building tools together.</p>
<p>Thank you.</p>
<div class="footnotes">
<ol>
<li id="fn:1">
<p>It also means we can unintentionally violate the per-host crawl delays due to running multiple crawlers over the same site simultaneously. <a href="#fnref:1" class="reversefootnote">&#8617;</a></p>
</li>
<li id="fn:2">
<p>It also posts a message to the ‘uris-to-check’ queue, which will be used to check if the crawl launched successfully. <a href="#fnref:2" class="reversefootnote">&#8617;</a></p>
</li>
</ol>
</div>
<p><a href="http://anjackson.net/2016/04/11/building-tools-to-archive-the-modern-web/">Building Tools to Archive the Modern Web</a> was originally published by Andy Jackson at <a href="http://anjackson.net">andrew.n.jackson</a> on April 11, 2016.</p>http://anjackson.net/2016/02/16/shine-release-two-historical-search-update2016-02-16T00:00:00+00:002016-02-16T00:00:00+00:00Andy Jacksonhttp://anjackson.netanj@anjackson.net<p>Originally published <a href="http://britishlibrary.typepad.co.uk/webarchive/2016/02/updating-our-historical-search-service.html">on the UK Web Archive blog</a> on the 15th of February 2016.</p>
<p>Earlier this year, as part of the <a href="http://buddah.projects.history.ac.uk/">Big UK Domain Data for the Arts and Humanities</a> project, we released <a href="http://britishlibrary.typepad.co.uk/webarchive/2015/02/building-a-historical-search-engine-is-no-easy-thing.html">our first ‘historical search engine’ service</a>. We’ve <a href="http://britishlibrary.typepad.co.uk/webarchive/2015/06/towards-a-macroscope-for-uk-web-history.html">publicised it</a> at <a href="http://www.dcc.ac.uk/events/idcc15/programme-presentations">IDCC15</a>, the <a href="http://netpreserve.org/general-assembly/2015/overview">2015 IIPC GA</a> and at the <a href="http://events.netlab.dk/conference/index.php/resaw/june2015">first RESAW conference</a>, and it’s been very well received. Not only has it led to some <a href="http://buddah.projects.history.ac.uk/2015/07/09/project-case-studies-now-available/">excellent case studies</a> that we can use to improve our services, but other web archives have shown interest in re-using <a href="https://github.com/ukwa/shine">the underlying open source code</a>. In particular, some of our Canadian colleagues have successfully launched <a href="http://webarchives.ca/">webarchives.ca</a>, which lets users search ten years’ worth of archived websites from Canadian political parties and political interest groups (see <a href="http://webarchives.ca/about/">here</a> for more details).</p>
<p>But we remained frustrated, for two reasons. Firstly, when we built that first service, we could not cope with the full scale of the 1996-2013 dataset, and we only managed to index the two billion resources up to 2010. Secondly, we had not yet learned how to cope with more than one or two users at a time, so we were loath to publicise the website too widely in case it crashed. So, over the last six months, and with the guidance of <a href="https://twitter.com/TokeEskildsen">Toke Eskildsen</a> and Thomas Egense at the <a href="https://en.statsbiblioteket.dk/">State Library of Denmark</a>, we’ve been working on resolving these scaling issues (their <a href="https://sbdevel.wordpress.com/">tech blog</a> is definitely worth a look if you’re into this kind of thing).</p>
<p>Thanks to their input, I’m happy to be able to announce that our <a href="https://www.webarchive.org.uk/shine/graph">historical search prototype</a> now spans the whole period from 1996 to <a href="http://www.bl.uk/catalogues/search/non-print_legal_deposit.html#april">the 6th April 2013</a>, and contains 3,520,628,647 distinct records.</p>
<p><img src="/blog/images/shine-release-two-total-resources-over-time.png" alt="Total Resource By Crawl Year" /></p>
<p>Broken down by year, you can see there’s a lot of variation, depending on the timings of the global crawls from which this collection was drawn. This is why <a href="https://www.webarchive.org.uk/shine/graph">our trends visualisation</a> plots query results as a percentage of all the resources crawled in each year rather than absolute figures. However, the overall variation and the fact that the 2013 chunk only covers the first three months should be kept in mind when interpreting the results.</p>
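<p>To make that normalisation concrete (with made-up counts, not the Shine implementation), the trend value for each year is simply the matching count expressed as a percentage of everything crawled that year:</p>
<pre><code>def normalise_by_year(matches_per_year, totals_per_year):
    """Express yearly match counts as a percentage of all resources crawled that year."""
    return {year: 100.0 * matches_per_year.get(year, 0) / totals_per_year[year]
            for year in totals_per_year}

# normalise_by_year({2005: 120, 2006: 300}, {2005: 40000, 2006: 150000})
# returns {2005: 0.3, 2006: 0.2}
</code></pre>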
<p>You might also notice there seem to be a few data points from as early as 1938, and even from 2072! This tiny proportion of results corresponds to malformed or erroneous records, although currently it’s not clear if the <a href="https://www.webarchive.org.uk/shine/search?query=*:*&amp;tab=results&amp;action=search&amp;facet.in.crawl_year=%221995%22">1,714 results from 1995</a> are genuine or not. No one ever said <strong>Big Data</strong> would be <em>Clean Data</em>.</p>
<p>Furthermore, we’ve decided to change the way we handle web archiving records that have been ‘<a href="https://iipc.github.io/warc-specifications/specifications/warc-deduplication/recording-arbitrary-duplicates-1.0/">de-duplicated</a>’. When the crawler visits a page and finds <em>precisely</em> the same item as before, instead of storing another copy, we can store a so-called “revisit record” and refer to the earlier copy rather than duplicating it. This crude form of data compression can save a lot of disk space for frequently crawled material, and its use has grown over time. For example, looking at the historical dataset, you can see that 30% of the 2013 results were duplicates.</p>
<p><a href="https://www.webarchive.org.uk/shine/graph?query=record_type:%22revisit%22"><img src="/blog/images/shine-release-two-revisits.png" alt="Total Resource By Crawl Year" /></a></p>
<p>However, as these records don’t hold the actual item, our indexing process was not able to index these items properly. Over the next few weeks, we shall scan through these 65 million revisit records and ‘reduplicate’ them. This does mean that, for now, the results from 2013 might be a bit misleading in some cases. We also failed to index the last 11,031 of the 515,031 WARC files that make up this dataset (about 2% of the total, likely affecting the 2010-2013 results only), simply because we ran out of disk space. The index is using up 18.7TB of SSD storage, and if we can find more space, we’ll fill in the rest.</p>
<p>In the meantime, please explore our historical archive and tell us what you find! It might be slow sometimes (maybe 10-20 seconds), so please be patient, but we’re pretty confident that it will be stable from now on.</p>
<p><a href="https://www.webarchive.org.uk/shine/graph?query=%22geocities%22%2C%22friendster%22%2C%22orkut%22&amp;year_start=1996&amp;year_end=2013&amp;action=update"><img src="/blog/images/shine-release-two-early-social-media.png" alt="Early social media" /></a></p>
<p><a href="https://www.webarchive.org.uk/shine/graph?query=%22myspace%22%2C%22facebook%22%2C%22twitter%22&amp;year_start=1996&amp;year_end=2013&amp;action=update"><img src="/blog/images/shine-release-two-later-social-media.png" alt="Later social media" /></a></p>
<p><a href="https://www.webarchive.org.uk/shine/graph?query=%28%22sub-prime%22+OR+%22subprime%22%29+AND+mortgage%2C+austerity+NOT+domain%3Amotorbooks.co.uk%2C+%22financial+crisis%22&amp;year_start=1996&amp;year_end=2013&amp;action=update"><img src="/blog/images/shine-release-two-austerity.png" alt="Austerity" /></a></p>
<p>Anj</p>
<p><a href="http://anjackson.net/2016/02/16/shine-release-two-historical-search-update/">Updating our historical search service</a> was originally published by Andy Jackson at <a href="http://anjackson.net">andrew.n.jackson</a> on February 16, 2016.</p>http://anjackson.net/2015/11/20/provenance-of-web-archives2015-11-20T00:00:00+00:002015-11-20T00:00:00+00:00Andy Jacksonhttp://anjackson.netanj@anjackson.net<p>Originally published <a href="http://britishlibrary.typepad.co.uk/webarchive/2015/11/the-provenance-of-web-archives.html">on the UK Web Archive blog</a> on the 20th November 2015.</p>
<p>Over the last few years, it’s been wonderful to see more and more researchers taking an interest in web archives. Perhaps we’re even teetering into the mainstream when a publication like Forbes carries an article digging into the gory details of how we should document our crawls in <a href="http://www.forbes.com/sites/kalevleetaru/2015/11/16/how-much-of-the-internet-does-the-wayback-machine-really-archive/">How Much Of The Internet Does The Wayback Machine Really Archive?</a></p>
<!--break-->
<p>Even before the data-mining <a href="http://buddah.projects.history.ac.uk/">BUDDAH project</a> raised these issues, we’d spent a long time thinking about this, and we’ve tried our best to capture as much of our own crawl context as we can. We don’t just store the WARC request and response records (which themselves are much better at storing crawl context than the older ARC format), we also store:</p>
<ul>
<li>The list of links that the crawler found when it analysed each resource (this is a standard Heritrix3 feature).</li>
<li>The full crawl log, which records DNS results and other situations that may not be reflected in the WARCs.</li>
<li>The crawler configuration, including seed lists, scope rules, exclusions etc.</li>
<li>The versions of the software we used (in WARC Info records and in the PREMIS/METS packaging).</li>
<li>Rendered versions of original seeds and home pages, as PNG and as HTML, and associated metadata.</li>
</ul>
<p>In principle, we believe that the vast majority of questions about how and why a particular resource has been archived can be answered by studying this additional information. However, it’s not clear how this would really work in practice. Even assuming we have caught the most important crawl information, reconstructing the history behind any particular URL is going to be highly technical and challenging work because you can’t really understand the crawl without understanding the software (to some degree at least).</p>
<p>But there are definitely gaps that remain - in particular, we don’t document absences well. We don’t explicitly document precisely why certain URLs were rejected from the crawl, and if we make a mistake and miss a daily crawl, or mis-classify a site, it’s hard to tell the difference between accident and intent from the data. Similarly, we don’t document every aspect of our curatorial decisions, e.g. precisely why we choose to pursue permissions to crawl specific sites that are not in the UK domain. Capturing every mistake, decision or rationale simply isn’t possible, and realistically we’re only going to record information when the process of doing so can be largely or completely automated (as above, see also <a href="http://blog.dshr.org/2015/11/you-get-what-you-get-and-you-dont-get.html">You get what you get and you don’t get upset</a>).</p>
<p>And this is all just at the level of individual URLs. When performing corpus analysis, things get even more complex because crawl configurations vary within the crawls and change over time. Right now, it’s not at all clear how best to combine or summarize fine-grained provenance information in order to support data-mining and things like trend analysis. But, in the context of working on the BUDDAH project, we did start to explore how this <em>might</em> work.</p>
<p>For example, the Forbes article brings up the fact that crawl schedules vary, and so not every site has been crawled consistently, e.g. every day. Of course, we found exactly the same kind of thing when building the <a href="https://www.webarchive.org.uk/shine/graph">Shine search interface</a>, and this is precisely why our trend graphs currently summarize the trends by year. In other words, if you average the crawled pages by year, you can wash out the short-lived variations. Of course, large crawls can last months, so really you want to be able to switch between different sampling parameters (quarterly, six-monthly, or annual, starting at any point in the year, etc.), so that you can check whether any perceptible trend may be a consequence of the sampling strategy (not that we got as far as implementing that, <a href="https://github.com/ukwa/shine/pulls">yet</a>).</p>
<p><a href="https://www.webarchive.org.uk/shine/graph?query=%22Global+Financial+Crisis%22&amp;year_start=1996&amp;year_end=2013&amp;action=update"><img src="/blog/images/shine-global-financial-crisis.png" alt="&quot;Global Financial Crisis&quot;" /></a></p>
<p>Similarly, notice that Shine shows you the <em>percentage</em> of matching resources by year, rather than the absolute number of matching documents. Showing the fraction of the crawled web that matches your query is generally more useful than the raw number of matching resources, where the crawl scheduling tends to obscure what’s going on (again, it would be even better to be able to switch between the two so you can better understand what any given trend means, although if you download the data for the graph you get the absolute figures as well as the relative ones).</p>
<p>More useful still would be the ability to pick any other arbitrary query to be the normalization baseline, so you could plot matching words against total number of words per year, or matching links per total number of links, and so on. The crucial point is that if your trend is genuine, you can use sampling and normalization techniques to test that, and to <a href="https://acerbialberto.wordpress.com/2013/04/14/normalisation-biases-in-google-ngram/">find or rule out particular kinds of biases</a> within the data set.</p>
<p>This is also why the trend interface offers to show you a random sample of the results underlying a trend. For example, it makes it much easier to quickly ascertain whether the apparent trend is due to a large number of false-positive hits coming from a small number of hosts, thus skewing the data.</p>
<p>I believe there will be practical ways of summarizing provenance information in order to describe the systematic biases within web archive collections, but it’s going to take a while to work out how to do this, particularly if we want this to be something we can compare across different web archives. My suspicion is that this will start from the top and work down - i.e. we will start by trying different sampling and normalization techniques, and discover what seems to work, then later on we’ll be able to work out how this arises from the fine details of the crawling and curation processes involved.</p>
<p>So, while I hope it is clear that I agree with the main thrust of the article, I must admit I am a little disappointed by its tone.</p>
<blockquote>
<p>If the Archive simply opens its doors and releases tools to allow data mining of its web archive without conducting this kind of research into the collection’s biases, it is clear that the findings that result will be highly skewed and in many cases fail to accurately reflect the phenomena being studied.</p>
<p><small>Kalev Leetaru, <a href="http://www.forbes.com/sites/kalevleetaru/2015/11/16/how-much-of-the-internet-does-the-wayback-machine-really-archive/">How Much Of The Internet Does The Wayback Machine Really Archive?</a></small></p>
</blockquote>
<p>The implication that we should not enable access to our collections <em>until we have deduced their every bias</em> is not at all constructive (and if it inhibits other organisations from making their data available, potentially quite damaging).</p>
<p>No corpus, digital or <a href="http://www.wired.com/2015/10/pitfalls-of-studying-language-with-google-ngram/">otherwise</a>, is perfect. Every <a href="http://inkdroid.org/2013/11/26/the-web-as-a-preservation-medium/">archival sliver</a> can only come to be understood through use, and we must open up to and engage with researchers in order to discover what provenance we need and how our crawls and curation can be improved.</p>
<p>There are problems we need to document, certainly. Our BUDDAH project is using Internet Archive data, so none of the provenance I listed above was there to help us. And yes, when providing access to the data we do need to explain the crawl dynamics and parameters - you need to know that most of the Internet Archive crawls omit items over 10MB in size (see e.g. <a href="http://readme.lk/archiving-internet-wayback-machine/">here</a>), that they largely obey <a href="https://en.wikipedia.org/wiki/Robots_exclusion_standard">robots.txt</a> (which is often why mainstream sites are missing), and that right now <a href="http://blog.dshr.org/2012/05/harvesting-and-preserving-future-web.html">everyone’s harvesting processes are falling behind the development of the web</a>.</p>
<p>But researchers can’t expect the archives to already know what they need to know, or to know exactly how these factors will influence your research questions. You should expect to have to learn why the dynamics of a web crawler mean that any data-mined ranking is highly unlikely to match up to the popularity as defined by Alexa (which is based on web visitors rather than site-to-site links). You should expect to have to explore the data to test for biases, to confirm the known data issues and to help find the unknown ones.</p>
<p>“Know your data” applies to both of us. Meet us half way.</p>
<p>What we do lack, perhaps, is an adequate way of aggregating these experiences so new researchers do not have to waste time re-discovering and re-learning these things. I don’t know exactly what this would look like, but the <a href="http://netpreserve.org/2016-general-assembly-web-archiving-conference-reykjav%C3%ADk-overview">IIPC Web Archiving Conferences</a> provide a strong starting point and a forum to take these issues forward.</p>
<p><a href="http://anjackson.net/2015/11/20/provenance-of-web-archives/">The provenance of web archives</a> was originally published by Andy Jackson at <a href="http://anjackson.net">andrew.n.jackson</a> on November 20, 2015.</p>http://anjackson.net/2015/08/19/web-archiving-twine2015-08-19T00:00:00+01:002015-08-19T00:00:00+01:00Andy Jacksonhttp://anjackson.netanj@anjackson.net<p>A few months ago, a colleague suggested that we should come up with ways of helping people learn about the main stages of web archiving, and to help them understand some of the more common technical terminology.</p>
<p>I got a bit carried away…</p>
<!--break-->
<p>…because at the same time, I’d been hearing a lot about <a href="https://en.wikipedia.org/wiki/Twine_(software)">Twine</a> and about the interactive fiction that people can build using it. So, I thought, why not use an interactive fiction engine to build a ‘web archiving simulator’ that takes you through the core web archiving life-cycle? A way to ‘learn by doing’ without having all the baggage involved in doing it for real?</p>
<p>Well, because it’ll suck up a tonne of time learning about Twine and <a href="http://twinery.org/">twinery.org</a> and the <a href="http://twine2.neocities.org/">two different versions</a> and fiddling about with the structure and with the prose…</p>
<p><img src="/keeping-codes/experiments/uwa/editing-the-twine.png" alt="Editing the Twine" /></p>
<p>After a few evenings I ran out of steam, and the experiment has been sitting in a browser tab since then, unfinished.</p>
<p>I enjoyed building it, but it’s really not going to get finished any time soon. I’m not even sure what ‘finished’ would look like any more. So I may as well publish it as it is. If you want to play the game of web archiving, click the link below…</p>
<div style="text-align:center; font-size: large;">
<a href="http://anjackson.net/keeping-codes/experiments/uwa/">Understanding Web Archiving</a>
</div>
<p>I’ve also made the <a href="http://anjackson.net/keeping-codes/experiments/uwa/8.19.2015, 8.47.54 PM Twine Archive.html">source export</a> available, which you should be able to upload at <a href="http://twinery.org/2/">twinery.org</a> if you want to extend it or just see how it works.</p>
<p>Let me know what you think!</p>
<p><a href="https://twitter.com/anjacks0n">Anj</a></p>
<p><a href="http://anjackson.net/2015/08/19/web-archiving-twine/">Playing at Web Archiving</a> was originally published by Andy Jackson at <a href="http://anjackson.net">andrew.n.jackson</a> on August 19, 2015.</p>http://anjackson.net/2015/05/08/let-them-emulate2015-05-08T00:00:00+01:002015-05-08T00:00:00+01:00Andy Jacksonhttp://anjackson.netanj@anjackson.net<p>On the first day of the <a href="http://netpreserve.org/general-assembly/ga2015-schedule">IIPC GA 2015</a>, the morning keynote was <a href="http://netpreserve.org/sites/default/files/attachments/2015-IIPC-GA_Abstract_01_Cerf-Satya.pdf">Digital Vellum: Interacting with Digital Objects Over Centuries</a>, presented by <a href="http://research.google.com/pubs/author32412.html">Vint Cerf</a> and <a href="http://www.cs.cmu.edu/~satya/">Mahadev Satyanarayanan</a>. This included some more details and demonstrations of the proposed preservation solution <a href="/2015/03/09/vellum">I blogged about before</a>, so I thought it was worth returning to the subject now that I know a little more about it.</p>
<!--break-->
<p>Certainly, the <a href="https://olivearchive.org/">Olive Executable Archive</a> is an impressive system, and the demonstration was great fun. However, it’s worth noting that most of the ‘preservation work’ was done elsewhere. For example, the emulation was actually done using <a href="http://wiki.qemu.org/Main_Page">QEMU</a>. This is nothing to be ashamed of, but it’s important to understand the technical architecture of these kinds of systems in order to be able to compare them.</p>
<p>Specifically, in terms of overall architecture, Olive is very similar to the <a href="http://emuframework.sourceforge.net/">KEEP Emulation Framework</a> (in that it is an attempt to re-package existing tools in a way that makes deployment much easier), while also providing a ‘cloud’ mode very similar to <a href="http://bw-fla.uni-freiburg.de/">bwFLA’s Emulation as a Service</a>. Olive does have <a href="https://olivearchive.org/about/">some additional advantages</a>: it’s a very smooth experience, and it has a number of pleasant optimizations in terms of reducing the storage size when maintaining a library of software systems, and in terms of the clever <a href="https://olivearchive.org/docs/vmnetx/install/">local virtualization client</a> that can run from streamed/partially downloaded disk images. But these are only optimizations, and I’d argue that the KEEP and bwFLA systems are actually more technically advanced, in terms of things like having the ability to identify formats in order to create the association with the right software.</p>
<p>Will this tactic preserve access for centuries? Possibly - but in the context of the IIPC, the bigger problem is that this technique is not terribly well-suited to <em>internet</em> archiving. As Satya explained, it would be possible (in principle) to preserve specific web service systems by preserving machine images of the servers involved and reconstructing them at ‘playback’ time, from IPv4 and DNS all the way up to the web application itself. This is a pretty neat idea, and one I can imagine applying to a small number of particularly important web services, but it is simply not scalable to millions of websites. Consequently, a number of attendees commented that it was more of an <a href="http://ipres-conference.org">iPres</a> presentation than an IIPC one.</p>
<p>However, even accepting that, I found the presentation rather frustrating. In part, this was because they seem to have ignored almost all the existing work in this area. We know we should be archiving software. We’ve been studying it for a while. We know about most of these tactics and we’ve got a few tricks of our own. But if that’s the case, why aren’t we doing more? Why aren’t we preserving software routinely?</p>
<p>Well, on one slide, Vint presented a number of challenges to software preservation<sup id="fnref:1"><a href="#fn:1" class="footnote">1</a></sup>:</p>
<p><img src="/blog/images/digital-vellum-challenges.png" alt="Challenges" /></p>
<p>which manages to reproduce the challenges that the digital preservation community has been chewing over for the last decade (at least), while omitting the central challenge: the economics of preservation.</p>
<p>Those organizations with the remit to think truly long term are also those that are under the yoke of austerity, and many are already struggling to negotiate the massive expansion of their remit due to the growth of digital media. To have big players like Google come along and tell us what we <em>ought</em> to be doing, when we cannot compete with their budgets (and their ability to attract people with the technical skills we need), sounds altogether too much like <a href="http://en.wikipedia.org/wiki/Let_them_eat_cake">“Let them emulate!”</a></p>
<p>Yes, we should be archiving software. Yes, we should be supporting frameworks to make emulation more accessible. Yes, we should be collecting the information and relationships that will support this kind of access. A number of organisations are already working hard on these issues (e.g. the <a href="http://www.nsrl.nist.gov/">National Software Reference Library</a>, <a href="http://rhizome.org/editorial/2015/apr/17/theresa-duncan-cd-roms-are-now-playable-online/">Rhizome</a> or <a href="https://archive.org/details/softwarelibrary">the Internet Archive</a>), but I’m pretty sure that every archive and library I know of would agree that as a community we should be doing more. But we are too poor, and too few. Perhaps Vint could find a way Google could help us out, rather than telling us what we already know.</p>
<p>This doesn’t have to be monetary assistance – perhaps they could help address some of those other challenges. Perhaps they could donate disk images of some of their most influential systems (e.g. Google Docs, official Android releases) to a body like the <a href="http://www.nsrl.nist.gov/">NSRL</a>? Perhaps they could help us get the DRM off their eBooks? Or help us cope with the <a href="http://en.wikipedia.org/wiki/Encrypted_Media_Extensions">Encrypted Media Extensions</a> that they’ve helped push onto the web?</p>
<p>Having said that, <em>we</em> should also try to learn from this situation. For example, it appears that the IIPC and iPres communities (and those big EU projects that have been working on digital preservation) have not done enough to spread the word about what we’ve been doing. Not just in terms of publicizing EU research in the US, but also in terms of reaching out to the broader IT world and to computer science departments and similar organizations working in this area<sup id="fnref:2"><a href="#fn:2" class="footnote">2</a></sup>. We may not be entirely isolated, but it seems like we’re not managing to work together, and that’s something we can afford to change. Something we can’t afford <em>not</em> to change.</p>
<p>Anj</p>
<div class="footnotes">
<ol>
<li id="fn:1">
<p>This might not be exactly the right slide, as Vint’s IIPC GA 2015 slides are not up yet. I got this slide from <a href="http://wirth-symposium.ethz.ch/slides/cerf.pdf">here</a>. <a href="#fnref:1" class="reversefootnote">&#8617;</a></p>
</li>
<li id="fn:2">
<p>For example, where is <a href="http://www.cs.cmu.edu/~satya/">Satya</a> publishing stuff about Olive? Are the iPres bunch there? <a href="#fnref:2" class="reversefootnote">&#8617;</a></p>
</li>
</ol>
</div>
<p><a href="http://anjackson.net/2015/05/08/let-them-emulate/">Let Them Emulate!</a> was originally published by Andy Jackson at <a href="http://anjackson.net">andrew.n.jackson</a> on May 08, 2015.</p>