Tuesday, December 20, 2016

One of the pages I came across, "Hundreds of Louisiana Flood Victims Owe Their Lives to the 'Cajun Navy'", highlighted the work of the volunteer "Cajun Navy" in rescuing people from their flooded homes. The page is fairly complex, with a Flash video, a YouTube video, 14 embedded tweets (one of which contained a video), and 2 embedded Instagram posts. Here's a screenshot of the original page:

Live page, screenshot generated on Sep 9, 2016

To me, the most important resources here were the tweets and their pictures, so I'll focus here on how well they were archived.

First, let's look at how embedded Tweets work on the live web. According to Twitter: "An Embedded Tweet comes in two parts: a <blockquote> containing Tweet information and the JavaScript file on Twitter’s servers which converts the <blockquote> into a fully-rendered Tweet."
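To make the first part of that description concrete, here is a minimal Python sketch that finds unrendered embeds in a page's HTML and pulls the tweet ID out of each one. The class and function names are hypothetical, and the permalink pattern (twitter.com/<user>/status/<id>) is an assumption about how the tweet link inside the <blockquote> is written. It is not how widgets.js itself works.

```python
from html.parser import HTMLParser
import re

# Assumed permalink shape: twitter.com/<user>/status/<numeric id>
STATUS_RE = re.compile(r"twitter\.com/[^/]+/status/(\d+)")


class EmbeddedTweetFinder(HTMLParser):
    """Collects tweet IDs from <blockquote class="twitter-tweet"> embeds."""

    def __init__(self):
        super().__init__()
        self.in_embed = False
        self.tweet_ids = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "blockquote" and "twitter-tweet" in classes:
            self.in_embed = True
        elif tag == "a" and self.in_embed:
            # Only links inside an embed are inspected; hashtag links do not match.
            match = STATUS_RE.search(attrs.get("href") or "")
            if match:
                self.tweet_ids.append(match.group(1))

    def handle_endtag(self, tag):
        if tag == "blockquote":
            self.in_embed = False


def find_embedded_tweets(html):
    """Return the tweet IDs of all unrendered embeds found in html."""
    finder = EmbeddedTweetFinder()
    finder.feed(html)
    return finder.tweet_ids


sample = (
    '<blockquote class="twitter-tweet" data-width="500"><p lang="en" dir="ltr">'
    '<a href="https://twitter.com/hashtag/CajunNavy?src=hash">#CajunNavy</a> '
    'on the way to help those stranded by the flood.</p>&mdash; Vernon Ernst '
    '(@vernonernst) <a href="https://twitter.com/vernonernst/status/'
    '765398679649943552">August 16, 2016</a></blockquote>'
)
print(find_embedded_tweets(sample))  # ['765398679649943552']
```

A page that has already been transformed by widgets.js would return an empty list here, because the <blockquote> is no longer in the DOM.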

Since I'd been using Archive-It to create the collection, that was the first tool I used to capture the page. Archive-It uses the Internet Archive's Heritrix crawler and Wayback Machine for replay. I set the crawler to archive the page and embedded resources, but not to follow links. No special scoping rules were applied.

The HTML source (https://wbrc.io/mweigle/south-louisiana-flood---2016/20160909144135mp_/http://ijr.com/2016/08/674271-hundreds-of-louisiana-flood-victims-owe-their-lives-to-the-cajun-navy/) looks similar to the original embed (except for rewritten links):

<blockquote class="twitter-tweet" data-width="500"><p lang="en" dir="ltr"><a target="_blank" href="https://wbrc.io/mweigle/south-louisiana-flood---2016/20160909144135mp_/https://twitter.com/hashtag/CajunNavy?src=hash">#CajunNavy</a> on the way to help those stranded by the flood. Nothing like it in the world! <a href="https://wbrc.io/mweigle/south-louisiana-flood---2016/20160909144135mp_/https://twitter.com/hashtag/LouisianaFlood?src=hash">#LouisianaFlood</a> <a href="https://wbrc.io/mweigle/south-louisiana-flood---2016/20160909144135mp_/https://t.co/HaugQ7Jvgg">pic.twitter.com/HaugQ7Jvgg</a></p>&mdash; Vernon Ernst (@vernonernst) <a href="https://wbrc.io/mweigle/south-louisiana-flood---2016/20160909144135mp_/https://twitter.com/vernonernst/status/765398679649943552">August 16, 2016</a></blockquote><script async src="//wbrc.io/mweigle/south-louisiana-flood---2016/20160909144135js_///platform.twitter.com/widgets.js" charset="utf-8"></script>
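The rewritten links in this markup follow a recognizable shape: a replay prefix, a 14-digit timestamp, an optional two-letter modifier such as mp_ or js_, and then the original URL. The Python sketch below splits a rewritten URL back into those parts. The function name and the regular expression are my own assumptions, inferred only from the URLs shown in this post, and may not cover every rewriting scheme a replay system uses.

```python
import re

# Assumed shape: <prefix>/<14-digit timestamp><optional modifier>/<original URL>
REWRITE_RE = re.compile(
    r"^(?P<prefix>.*?)/(?P<timestamp>\d{14})(?P<modifier>[a-z]{2}_)?/(?P<original>.+)$"
)


def split_rewritten_url(url):
    """Return the timestamp, modifier, and original URL, or None if url is not rewritten."""
    match = REWRITE_RE.match(url)
    if match is None:
        return None
    return {
        "timestamp": match.group("timestamp"),
        "modifier": match.group("modifier"),  # None when no modifier is present
        "original": match.group("original"),  # protocol-relative URLs are left as-is
    }


script_src = (
    "//wbrc.io/mweigle/south-louisiana-flood---2016/"
    "20160909144135js_///platform.twitter.com/widgets.js"
)
print(split_rewritten_url(script_src))
# {'timestamp': '20160909144135', 'modifier': 'js_', 'original': '//platform.twitter.com/widgets.js'}
```

Applied to the script tag above, this shows that the archived page still points at an archived copy of widgets.js, which is what lets the <blockquote> be converted into a rendered tweet on replay.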

WARCreate is a Google Chrome extension that our group developed to allow users to archive the page they are currently viewing in their browser. It was last actively updated in 2014, though we are currently working on updates to be released in 2017.

The image below shows the result of the page being captured with WARCreate and replayed in webarchiveplayer.

WARCreate, captured Sep 9, 2016, replayed in webarchiveplayer

Upon replay, the WARCreate capture does not display the tweets at all. Here's a close-up of where the tweets should be:

WARCreate capture replayed in webarchiveplayer, with tweets missing

Examining both the WARC file and the source of the archived page helps to explain what's happening.

This is the same markup that's in the DOM upon replay in webarchiveplayer, except for the script source being rewritten to localhost:

<h4>In stepped a group known as the “Cajun Navy”:</h4><twitterwidget class="twitter-tweet twitter-tweet-rendered" id="twitter-widget-1" data-tweet-id="765398679649943552" style="position: static; visibility: visible; display: block; transform: rotate(0deg); max-width: 100%; width: 500px; min-width: 220px; margin-top: 10px; margin-bottom: 10px;"></twitterwidget><p><script async="" src="//localhost:8090/20160822124810js_///platform.twitter.com/widgets.js?4fad35" charset="utf-8"></script></p>

WARCreate captures the HTML after the page has fully loaded. So what's happening here is that the page loads, widgets.js is executed, the DOM is changed (thus the <twitterwidget> tag), and then WARCreate saves the transformed HTML. What the capture does not include is the widgets.js script itself, which is needed to properly display <twitterwidget> on replay. Our expectation is that with fixes to allow WARCreate to archive the loaded JavaScript, the embedded tweet would be displayed as on the live web.
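The distinction between HTML saved before and after widgets.js has run can be checked mechanically. The following Python sketch labels a saved page by which form of the embed it contains. The names and the label strings are hypothetical, and the check relies only on the two forms of markup shown in this post (<blockquote class="twitter-tweet"> before rendering, <twitterwidget> after), so treat it as an illustration rather than a general detector.

```python
from html.parser import HTMLParser


class TweetEmbedStateParser(HTMLParser):
    """Counts pre-render and post-render tweet embeds in a saved page."""

    def __init__(self):
        super().__init__()
        self.unrendered = 0   # <blockquote class="twitter-tweet">
        self.rendered = 0     # <twitterwidget ...>
        self.widgets_js = []  # script src values that mention widgets.js

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "blockquote" and "twitter-tweet" in classes:
            self.unrendered += 1
        elif tag == "twitterwidget":
            self.rendered += 1
        elif tag == "script" and "widgets.js" in (attrs.get("src") or ""):
            self.widgets_js.append(attrs["src"])


def describe_capture(html):
    """Label the saved HTML by the form of tweet embed it contains."""
    parser = TweetEmbedStateParser()
    parser.feed(html)
    if parser.rendered and parser.unrendered:
        return "mixed"
    if parser.rendered:
        return "post-render DOM"
    if parser.unrendered:
        return "pre-render HTML"
    return "no embedded tweets found"


saved = (
    '<twitterwidget class="twitter-tweet twitter-tweet-rendered" '
    'id="twitter-widget-1" data-tweet-id="765398679649943552"></twitterwidget>'
    '<p><script async="" src="//localhost:8090/20160822124810js_///'
    'platform.twitter.com/widgets.js?4fad35" charset="utf-8"></script></p>'
)
print(describe_capture(saved))  # post-render DOM
```

A "post-render DOM" result only says which markup was saved. Whether the tweet displays on replay still depends on whether the script referenced in the page was itself archived.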

Discussion
Each of these four archiving tools operates on the embedded tweet in a different way, highlighting the complexities of archiving asynchronously loaded JavaScript and DOM transformations.

Archive-It (Heritrix/Wayback) - archives the HTML returned in the HTTP response and JavaScript loaded from the HTML

Webrecorder.io - archives the HTML returned in the HTTP response, JavaScript loaded from the HTML, and JavaScript loaded after execution in the browser

It is useful to examine how different archiving tools and playback engines render complex webpages, especially those that contain embedded media. Our forthcoming update to the Archival Acid Test will include tests for embedded content replay.