a local version of Webrecorder that has not been patched to fix known exploits,
and a number of challenges for you to learn how they might apply to web archives in general.

Archiving local server files

A page being archived might have links that, when interpreted in the context of the crawler, point to local resources that should not end up in public Web archives. Examples include:

http://localhost:8080/

http://192.168.1.1/

file:///etc/passwd

It is necessary to implement restrictions in the crawler to prevent it collecting from local addresses or from protocols other than http(s). It is also a good idea to run the crawler in an isolated container or VM to maintain control over the set of resources local to the crawler.
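As a rough illustration of such a restriction, the Python sketch below rejects any URL whose scheme is not http(s) or whose host resolves to a non-public address. The names `is_safe_to_crawl` and `resolve_host` are hypothetical and not taken from any particular crawler. The resolver is a parameter so the check can be exercised without DNS.

```python
import ipaddress
import socket
from urllib.parse import urlsplit

ALLOWED_SCHEMES = {"http", "https"}


def resolve_host(host):
    """Return every IP address the host name resolves to (uses system DNS)."""
    infos = socket.getaddrinfo(host, None)
    return [info[4][0] for info in infos]


def is_safe_to_crawl(url, resolver=resolve_host):
    """Return True only for http(s) URLs whose host resolves to public addresses."""
    parts = urlsplit(url)
    if parts.scheme.lower() not in ALLOWED_SCHEMES:
        return False  # rejects file:, ftp:, etc.
    host = parts.hostname
    if not host:
        return False
    try:
        addresses = resolver(host)
    except OSError:
        return False  # unresolvable hosts are refused, not crawled
    if not addresses:
        return False
    for addr in addresses:
        ip = ipaddress.ip_address(addr.split("%")[0])  # strip any IPv6 zone id
        if not ip.is_global:
            return False  # loopback, private, link-local, reserved, ...
    return True
```

A check like this would need to be repeated for every redirect and every embedded resource, and ideally enforced again at connection time, since a host name can resolve differently between the check and the fetch. That is one more reason to keep the container or VM isolation as a backstop.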

Hacking the headless browser

Nowadays, collecting many Web sites requires executing their content in a headless browser such as PhantomJS. All such browsers have vulnerabilities, only some of which are known at any given time. The same is true of the virtualization infrastructure. Isolating the crawler in a VM or a container does add another layer of complexity for the attacker, who now needs exploits not just for the headless browser but also for the virtualization infrastructure. But this only helps if both layers are kept up-to-date. This isn't a panacea, just risk reduction.

Stealing user secrets during capture

User-driven Web recorders place user data at risk, because they typically hand URLs to be captured to the recording process as suffixes to a URL for the Web recorder, thus vitiating the normal cross-domain protections. Everything (login pages, third-party ads, etc.) is regarded as part of the Web recorder domain.
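To see why the suffix scheme collapses the browser's protections, the sketch below compares origins before and after rewriting. The recorder prefix and the bank and ad URLs are made-up placeholders, and `origin` is a simplified stand-in for the browser's same-origin comparison.

```python
from urllib.parse import urlsplit

RECORDER_PREFIX = "https://recorder.example/record/"  # hypothetical recorder URL


def origin(url):
    """Return the (scheme, host, port) triple used for same-origin checks."""
    parts = urlsplit(url)
    default_port = {"http": 80, "https": 443}.get(parts.scheme)
    return (parts.scheme, parts.hostname, parts.port or default_port)


def recorder_url(target):
    """Build the URL the browser actually loads while recording `target`."""
    return RECORDER_PREFIX + target


bank = "https://bank.example/login"
ads = "https://ads.example/banner.js"

# On the live Web the two resources belong to different origins...
live_isolated = origin(bank) != origin(ads)
# ...but through the recorder both are loaded from the recorder's origin.
recorded_shared = origin(recorder_url(bank)) == origin(recorder_url(ads))
```

Both flags come out true: the script from the ad network, which the browser would normally keep away from the bank's login page, now shares an origin with it.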

Cross site scripting to steal archive logins

Similarly, the URLs used to replay content must be carefully chosen to avoid the risk of cross-site scripting attacks on the archive. When replaying preserved content, the archive must serve all of it from a different top-level domain from the one that users log in to and from which the archive serves the parts of a replay page that are not preserved content (e.g. the Wayback Machine's timeline). The preserved content should be isolated in an iframe. For example:

Archive domain: https://perma.cc/

Content domain: https://perma-archives.org/
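A minimal sketch of that split, using the two domains above: the wrapper page (banner, timeline) would be served from the archive domain, and the preserved content is only ever referenced through an iframe on the content domain. The `/capture/` path, the function names and the markup are illustrative assumptions, not Perma.cc's actual URL scheme.

```python
from html import escape
from urllib.parse import quote, urlsplit

ARCHIVE_ORIGIN = "https://perma.cc"            # logins, banner, timeline
CONTENT_ORIGIN = "https://perma-archives.org"  # preserved content only


def content_url(capture_id):
    """URL of the preserved content, always on the content domain."""
    return f"{CONTENT_ORIGIN}/capture/{quote(capture_id, safe='')}"


def replay_page(capture_id):
    """Wrapper page for the archive domain; the content sits in an iframe."""
    src = content_url(capture_id)
    # Guard against a misconfiguration that puts both roles on one host.
    assert urlsplit(src).hostname != urlsplit(ARCHIVE_ORIGIN).hostname
    return (
        "<!doctype html>\n"
        "<title>Archive replay</title>\n"
        '<div id="banner">Archived copy</div>\n'
        f'<iframe src="{escape(src, quote=True)}"></iframe>\n'
    )
```

Because the iframe's origin differs from the wrapper's, scripts in the preserved content cannot read the archive's login cookies or reach into the wrapper page's DOM.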

Live web leakage on playback

Especially with JavaScript in archived pages, it is hard to make sure that all resources in a replayed page come from the archive, not from the live Web. If live Web JavaScript is executed, all sorts of bad things can happen. Malicious JavaScript could exfiltrate information from the archive, track users, or modify the content displayed.

Injecting the Content-Security-Policy (CSP) header into replayed content can mitigate these risks by preventing compliant browsers from loading resources except from the specified domain(s), which would be the archive's replay domain(s).
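As a sketch of what such an injection might look like, the functions below build a `default-src` policy limited to the replay origin and add it to the headers of a replayed response. The function names are hypothetical. Allowing `'unsafe-inline'`, `'unsafe-eval'`, `data:` and `blob:` is an assumption made here because archived pages commonly rely on inline scripts and data URLs; a real archive would tune the policy to its own replay system.

```python
def replay_csp(origins):
    """Build a CSP value that limits resource loads to the replay origins."""
    sources = " ".join(origins)
    return f"default-src {sources} 'unsafe-inline' 'unsafe-eval' data: blob:"


def inject_csp(headers, origins):
    """Return a copy of the replay response headers with the archive's CSP set."""
    # Drop any policy captured with the page so only the archive's applies.
    out = {
        name: value
        for name, value in headers.items()
        if name.lower() != "content-security-policy"
    }
    out["Content-Security-Policy"] = replay_csp(origins)
    return out


archived_headers = {
    "Content-Type": "text/html",
    "content-security-policy": "default-src *",
}
replay_headers = inject_csp(archived_headers, ["https://perma-archives.org"])
```

With this header in place, a compliant browser refuses to fetch scripts, images or other resources that the replayed page requests from live Web hosts.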

Show different page contents when archived

I wrote previously about the fact that these days the content of almost all web pages depends not just on the browser, but also on the user, the time, the state of the advertising network and other things. Thus it is possible for an attacker to create pages that detect when they are being archived, so that the archive's content will be unrepresentative and possibly hostile. Alternatively, the page can detect that it is being replayed, and display different content or attack the replayer.

This is another reason why both the crawler and the replayer should be run in isolated containers or VMs. The bigger question of how crawlers can be configured to obtain representative content from personalized, geolocated, advert-supported web-sites is unresolved, but out of scope for Cushman and Kreymer's talk.

Banner spoofing

When replayed, malicious pages can overwrite the archive's banner, misleading the reader about the provenance of the page.

4 comments:

"During the past month, both Google and Mozilla developers have added support in their respective browsers for "headless mode," a mechanism that allows browsers to run silently in the OS background and with no visible GUI."

And that this is risky:

"excellent news for malware authors, and especially for the ones dabbling with adware.

In the future, adware or clickfraud bots could boot-up Chrome or Firefox in headless mode (no visible GUI), load pages, and click on ads without the user's knowledge. The adware won't need to include or download any extra tools and could use locally installed software to perform most of its malicious actions.

In the past, there have been quite a few adware families that used headless browsers to perform clickfraud [1, 2, 3, report where miscreants had abused PhantomJS, a headless browser, to post forum spam.

The addition of headless mode in Chrome and Firefox will most likely provide adware devs with a new method of performing surreptitious ad clicks."

"Intel's ME consists of a microcontroller that works with the Platform Controller Hub chip, in conjunction with integrated peripherals. It handles much of the data travelling between the processor and external devices, and thus has access to most of the data on the host computer.

If compromised, it becomes a backdoor, giving an attacker control over the affected device."