I'm David Rosenthal, and this is a place to discuss the work I'm doing in Digital Preservation.

Tuesday, December 15, 2015

Talk on Emulation at CNI

When Cliff Lynch found out that I was writing a report for the Mellon Foundation, the Sloan Foundation and IMLS entitled Emulation & Virtualization as Preservation Strategies he asked me to give a talk about it at the Fall CNI meeting, and to debug the talk beforehand by giving it at UC Berkeley iSchool's "Information Access Seminars". The abstract was:

20 years ago, Jeff Rothenberg's seminal Ensuring the Longevity of Digital Documents compared migration and emulation as strategies for digital preservation, strongly favoring emulation. Emulation was already a long-established technology; as Rothenberg wrote, Apple was using it as the basis for their transition from the Motorola 68K to the PowerPC. Despite this, the strategy of almost all digital preservation systems since has been migration. Why was this?

Preservation systems using emulation have recently been deployed for public use by the Internet Archive and the Rhizome Project, and for restricted use by the Olive Archive at Carnegie-Mellon and others. What are the advantages and limitations of current emulation technology, and what are the barriers to more general adoption?

Below the fold, the text of the talk with links to the sources. The demos in the talk were crippled by the saturated hotel network; please click on the linked images below for Smarty, oldweb.today and VisiCalc to experience them for yourself. The Olive demo of TurboTax is not publicly available, but it is greatly to Olive's credit that it worked well even on a heavily-loaded network.

Title

Once again, I need to thank Cliff Lynch for inviting me to give this talk, and for letting me use the participants in Berkeley iSchool's "Information Access Seminars" as guinea-pigs to debug it. This one is basically "what I did on my summer vacation", writing a report under contract to the Mellon Foundation entitled Emulation and Virtualization as Preservation Strategies. As usual, you don't have to take notes or ask for the slides; an expanded text with links to the sources will go up on my blog shortly. The report itself is available from the Mellon Foundation and from the LOCKSS website.

I'm old enough to know that giving talks that include live demos over the Internet is a really bad idea, so I must start by invoking the blessing of the demo gods.

History

Emulation and virtualization technologies have been a feature of the
information technology landscape for a long time,
going back at least to the IBM 709 in 1958, but their importance for preservation was first brought to public attention
in Jeff Rothenberg's seminal 1995 Scientific American article Ensuring the Longevity of Digital Documents.
As he wrote,
Apple was using emulation in the transition of the Macintosh from the
Motorola 68000 to the PowerPC.
The experience he drew on was the rapid evolution of digital storage media
such as tapes and floppy disks,
and of applications such as word processors each with their own incompatible
format.

His vision can be summed up as follows:
documents are stored on off-line media which decay quickly,
whose readers become obsolete quickly,
as do the proprietary,
closed formats in which they are stored.
If this isn't enough,
operating systems and hardware change quickly in ways that break the
applications that render the documents.

Rothenberg identified two techniques by which digital documents could
survive in this unstable environment,
contrasting the inability of format migration to guarantee fidelity
with emulation's ability to precisely mimic the behavior of obsolete
hardware.

Rothenberg's advocacy notwithstanding,
most digital preservation efforts since have used format migration as their
preservation strategy.
The isolated demonstrations of emulation's feasibility,
such as the collaboration between the UK National Archives and Microsoft,
had little effect.
Emulation was regarded as impractical because it was thought
(correctly at the time) to
require more skill and knowledge to both create and invoke emulations
than scholars wanting access to preserved materials would possess.

Overview

MacOS7 on Apple Watch

Nintendo 64 on Android Wear

It took Nick Lee about 4 hours to get this emulation of MacOS 7 running on his Apple Watch. Hacking Jules followed with Nintendo 64 and PSP emulators on his Android Wear. Simply getting one of the many available emulators running in a new environment isn't that hard, but that isn't enough to make them useful.

Recently, teams at the Internet Archive, Freiburg University and Carnegie Mellon University have shown frameworks that can make emulations appear as normal parts of Web pages; readers need not be aware that emulation is occurring. Some of these frameworks have attracted substantial audiences and demonstrated that they can scale to match. This talk is in four parts:

First I will show some examples of how these frameworks make emulations of legacy digital artefacts, those from before about the turn of the century, usable for unskilled readers.

Next I will discuss some of the issues that are hampering the use of these frameworks for legacy artefacts.

Then I will describe the changes in digital technologies over the last two decades, and how they impact the effectiveness of emulation and migration in providing access to current digital artefacts.

I will conclude with a look at the single biggest barrier that has and will continue to hamper emulation as a preservation strategy.

A digital preservation system that uses emulation will consist of three
main components:

One or more emulators capable of executing preserved system images.

A collection of preserved system images,
together with the metadata describing which emulator configured in which way
is appropriate for executing them.

A framework that connects the user with the collection and the
emulators so that the preserved system image of the user's choice is
executed with the appropriately configured emulator connected to the
appropriate user interface.

Theresa Duncan's CD-ROMs

The Theresa Duncan CD-ROMs.

From 1995 to 1997 Theresa Duncan produced three seminal feminist CD-ROM games, Chop Suey, Smarty and Zero Zero. Rhizome, a project hosted by the New Museum in New York, has put emulations of them on the Web. You can click any of the "Play" buttons,
as I'm going to in a Chromium browser on my Ubuntu 14.04 system,
and have an experience very close to that of playing the CD on MacOS 7.5. These experiences have proved popular. For several days after their initial release they were being invoked on average every 3 minutes.

What Is Going On?

What happened when I clicked Smarty's Play button?

The browser connects to a session manager in Amazon's cloud, which notices that this is a new session.

Normally it would authenticate the user, but because this CD-ROM emulation is open access it doesn't need to.

It assigns one of its pool of running Amazon instances to run the session's emulator.
Each instance can run a limited number of emulators.
If no instance is available when the request comes in it can take up to 90 seconds to start another.

It starts the emulation on the assigned instance,
supplying metadata telling the emulator what to run.

The emulator starts.
After a short delay the user sees the Mac boot sequence,
and then the CD-ROM starts running.

At intervals,
the emulator sends the session manager a keep-alive signal.
Emulators that haven't sent one in 30 seconds are presumed dead,
and their resources are reclaimed to avoid paying the cloud provider
for unused resources.
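
The session lifecycle described above can be sketched in a few lines of Python. This is a toy model, not bwFLA's or Rhizome's code: the class and method names, the limit of four emulators per instance, and the injected clock are all assumptions made for illustration. Only the 30-second keep-alive timeout comes from the description above.

```python
import time
from dataclasses import dataclass, field

KEEPALIVE_TIMEOUT = 30.0         # seconds; from the description above
MAX_EMULATORS_PER_INSTANCE = 4   # ASSUMPTION: the real limit is not stated


@dataclass
class Instance:
    """Stand-in for one running cloud instance (hypothetical)."""
    name: str
    sessions: set = field(default_factory=set)


class SessionManager:
    """Toy session manager: assigns sessions to instances and reaps dead ones."""

    def __init__(self, instances, clock=time.monotonic):
        self.instances = list(instances)
        self.clock = clock        # injectable so the logic can be tested
        self.last_seen = {}       # session id -> time of last keep-alive
        self.placement = {}       # session id -> Instance running it
        self.metadata = {}        # session id -> what the emulator should run

    def start_session(self, session_id, metadata):
        # Pick the first instance with spare capacity.
        for inst in self.instances:
            if len(inst.sessions) < MAX_EMULATORS_PER_INSTANCE:
                break
        else:
            # No capacity left: a real system would boot another cloud
            # instance here, which is the slow path mentioned above.
            inst = Instance(f"instance-{len(self.instances)}")
            self.instances.append(inst)
        inst.sessions.add(session_id)
        self.placement[session_id] = inst
        self.metadata[session_id] = metadata
        self.last_seen[session_id] = self.clock()
        return inst.name

    def keep_alive(self, session_id):
        # Record the keep-alive signal sent by a running emulator.
        if session_id in self.last_seen:
            self.last_seen[session_id] = self.clock()

    def reap(self):
        # Presume dead any emulator silent for longer than the timeout,
        # and release its slot so it is no longer paid for.
        now = self.clock()
        dead = [s for s, t in self.last_seen.items()
                if now - t > KEEPALIVE_TIMEOUT]
        for s in dead:
            self.placement.pop(s).sessions.discard(s)
            del self.last_seen[s]
            del self.metadata[s]
        return dead
```

A production system would call something like reap() periodically and would also shut down instances left with no sessions; both are omitted here to keep the sketch short.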

bwFLA

Rhizome, and others such as Yale, the DNB and ZKM Karlsruhe, use technology from the bwFLA team at the University of Freiburg
to provide Emulation As A Service (EAAS).
Their GPLv3 licensed framework runs in "the cloud" to provide comprehensive management
and access facilities wrapped around a number of emulators. It can also run as a bootable USB image or via Docker. bwFLA encapsulates each emulator so that the framework sees three standard interfaces:

Data I/O, connecting the emulator to data sources such as disk
images,
user files,
an emulated network containing other emulators,
and the Internet.

Interactive Access, connecting the emulator to the user using standard HTML5 facilities.

Control, providing a Web Services interface that bwFLA's resource
management can use to control the emulator.

The communication between the emulator and the user takes place via
standard HTTP on port 80;
there is no need for a user to install software,
or browser plugins,
and no need to use ports other than 80.
Both of these are important for systems targeted at use by the general
public.

bwFLA's preserved system images are stored as a stack of overlays in
QEMU's "qcow2" format.
Each overlay on top of the base system image represents a set of writes
to the underlying image.
For example,
the base system image might be the result of an initial install of
Windows 95,
and the next overlay up might be the result of installing Word Perfect
into the base system.
Or, as Cal Lee mentioned yesterday, the next overlay up might be the result of redaction. Each overlay contains only those disk blocks that differ from the stack of
overlays below it.
The stack of overlays is exposed to the emulator as if it were a normal
file system via FUSE.
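
The overlay idea can be illustrated with a toy copy-on-write model. This is not the qcow2 on-disk format, nor bwFLA's code: the OverlayStack class, its method names and the 512-byte block size are assumptions chosen only to show how a read falls through the stack and how a write touches only the top layer.

```python
BLOCK_SIZE = 512  # ASSUMPTION: illustrative block size only


class OverlayStack:
    """Toy copy-on-write stack of disk images.

    layers[0] is the base system image; each later layer is an overlay
    holding only the blocks that differ from everything below it.
    """

    def __init__(self, base_blocks):
        self.layers = [dict(base_blocks)]

    def push_overlay(self):
        # Start a new, empty overlay, e.g. before installing an application.
        self.layers.append({})

    def write(self, block_no, data):
        # Writes only ever touch the top overlay; lower layers stay pristine.
        self.layers[-1][block_no] = data

    def read(self, block_no):
        # Search from the top overlay down to the base image.
        for layer in reversed(self.layers):
            if block_no in layer:
                return layer[block_no]
        # A block never written in any layer reads as zeros.
        return b"\x00" * BLOCK_SIZE


# Hypothetical usage: a base install plus one application overlay.
stack = OverlayStack({0: b"boot sector", 1: b"Windows 95 files"})
stack.push_overlay()
stack.write(2, b"Word Perfect files")
stack.write(1, b"Windows 95 files, patched by the installer")
```

Because the base layer is never modified, many preserved environments can share one base image and differ only in their small overlays.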

The technical metadata that encapsulates the system disk image is
described in a paper presented to the iPRES conference in November 2015,
using the example of emulating CD-ROMs.
Broadly,
it falls into two parts,
describing the software and hardware environments needed by the CD-ROM in
XML.
The XML refers to the software image components via the Handle system,
providing a location-independent link to access them.

oldweb.today

BBC News via oldweb.today

Ilya Kreymer has used the same Docker (see his comment) technology to implement oldweb.today, a site through which you can view Web pages from nearly a dozen Web archives using a contemporary browser. Here, for example, is the front page of the BBC News site as of 07:53GMT on 13th October 1999 viewed with Internet Explorer 4.01 on Windows. This is a particularly nice example of the way that emulation frameworks can deliver useful services layered on archived content. Note that the URL for this page, http://oldweb.today/ie4/19991210182302/http://news.bbc.co.uk/, as with the Wayback Machine doesn't specify the technology used.

TurboTax

TurboTax97 on Windows 3.1

Here, again from a Chromium browser on my Ubuntu 14.04 system, is 1997's TurboTax running on Windows 3.1. The pane in the browser window has top and bottom menu bars, and between them is the familiar Windows 3.1 user interface.

What Is Going On?

The top and bottom menu bars come from a program called VMNetX that is running on my system. Chromium invoked it via a MIME-type binding, and VMNetX then created a suitable environment in which it could invoke the emulator that is running Windows 3.1 and TurboTax. The menu bars include buttons to power-off the emulated system, control its settings, grab the screen, and control the assignment of the keyboard and mouse to the emulated system.

The interesting question is "where is the Windows 3.1 system disk with TurboTax installed on it?"

Olive

The answer is that the "system disk" is actually a file on a remote Apache Web server.
The emulator's disk accesses are being demand-paged over the Internet using standard HTTP range queries to the file's URL.

This system is Olive,
developed at Carnegie Mellon University by a team under my friend Prof. Mahadev Satyanarayanan,
and released under GPLv2.
VMNetX uses a sophisticated two-level caching scheme to provide good emulated performance even over slow Internet connections.
A "pristine cache" contains copies of unmodified disk blocks from the "system disk". When a program writes to disk, the data is captured in a "modified cache". When the program reads a disk block, it is delivered from the modified cache, the pristine cache or the Web server, in that order. One reason this works well is that successive emulations of the same preserved system image are very similar,
so pre-fetching blocks into the pristine cache is effective in producing YouTube-like performance over 4G cellular networks.
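
The read order described above (modified cache, then pristine cache, then the Web server) can be sketched as follows. This is a simplified model and not VMNetX's implementation: the class name, the 4096-byte block size and the fetch_block callable are assumptions. In a real client fetch_block would issue an HTTP request carrying a Range header like the one range_header() builds; here it is passed in so the sketch runs without a network.

```python
BLOCK_SIZE = 4096  # ASSUMPTION: illustrative block size only


def range_header(block_no, block_size=BLOCK_SIZE):
    """HTTP Range header selecting one block; byte ranges are inclusive."""
    start = block_no * block_size
    return {"Range": f"bytes={start}-{start + block_size - 1}"}


class TwoLevelDisk:
    """Toy demand-paged disk with a pristine and a modified cache."""

    def __init__(self, fetch_block):
        self.fetch_block = fetch_block  # callable: block number -> bytes
        self.pristine = {}              # unmodified blocks from the server
        self.modified = {}              # blocks written during this session

    def read(self, block_no):
        # 1. Blocks the emulated system has written take priority.
        if block_no in self.modified:
            return self.modified[block_no]
        # 2. Otherwise use the pristine cache, 3. filling it from the
        #    remote image on a miss.
        if block_no not in self.pristine:
            self.pristine[block_no] = self.fetch_block(block_no)
        return self.pristine[block_no]

    def write(self, block_no, data):
        # Writes never go back to the server; the remote image stays intact.
        self.modified[block_no] = data

    def prefetch(self, block_nos):
        # Warm the pristine cache with blocks earlier sessions needed.
        for n in block_nos:
            if n not in self.pristine:
                self.pristine[n] = self.fetch_block(n)
```

Keeping writes in a separate cache means the pristine cache can be reused across sessions of the same preserved system image, which is what makes pre-fetching worthwhile.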

VisiCalc

VisiCalc on Apple ][

This, from 1979, is Dan Bricklin and Bob Frankston's VisiCalc. It was the world's first spreadsheet. It is running on an emulated Apple ][ via a Chromium browser on my Ubuntu 14.04 system. Some of the key-bindings are strange to users conditioned by decades of Excel, but once you've found the original VisiCalc reference card, it is perfectly usable.

What Is Going On?

The Apple ][ emulator isn't running in the cloud, as bwFLA's does, nor is it running as a process on my machine, as Olive's does. Instead, it is running inside my browser. The emulators have been compiled into JavaScript, using emscripten. When I clicked on the link to the emulation, metadata describing the emulation including the emulator to use was downloaded into my browser, which then downloaded the JavaScript for the emulator and the system image for the Apple ][ with VisiCalc installed.

It might be thought that the performance of running the emulator locally by adding another layer of emulation (the JavaScript virtual machine) would be inadequate, but this is not the case for two reasons. First, the user’s computer is vastly more powerful than an Apple ][ and, second, the performance of the JavaScript engine in a browser is critical to its success, so large resources are expended on optimizing it.

Internet Archive

This is the framework underlying
the Internet Archive's software library, which currently holds nearly 36,000 items, including more than 7,300 for MS-DOS, 3,600 for Apple, 2,900 console games and 600 arcade games. Some can be downloaded, but most can only be streamed.

The oldest is an emulation of a PDP-1 with a DEC 30 display running the Space War game from 1962, more than half a century ago. As I can testify having played this and similar games on Cambridge University’s PDP-7 with a DEC 340 display seven years later, this emulation works well.

The quality of the others is mixed. Resources for QA and fixing problems are limited; with a collection this size problems are to be expected. Jason Scott crowd-sources most of the QA; his method is to see if the software boots up and if so, put it up and wait to see whether visitors who remember it post comments identifying problems, or whether the copyright owner objects. The most common problem is the sound; problems with sound support in JavaScript affect bwFLA and Olive as well.

Concerns: Emulators

All three groups share a set of concerns about emulation technology. The first is about the emulators themselves. There are a lot of different emulators out there, but the open source emulators used for preservation fall into two groups:

QEMU is well-supported, mainstream open source software, part of most Linux distributions.
It emulates or virtualizes a range of architectures including X86, X86-64, ARM, MIPS and SPARC.
It is used by both bwFLA and Olive, but both groups have encountered irritating regressions in its emulations of older systems, such as Windows 95. It is hard to get the QEMU developers to prioritize fixing these, since emulating current hardware is its primary focus. The recent SOSP workshop featured a paper from the Technion and Intel describing their use of the tools Intel uses to verify chips to verify QEMU. They found and mostly fixed 117 bugs.

Enthusiast-supported emulators for old hardware including MAME/MESS, Basilisk II, SheepShaver, and DOSBox. These generally do an excellent job of mimicking the performance of a wide range of obsolete CPU architectures, but have some issues mapping the original user interface to modern hardware. Jason Scott at the Internet Archive has done great work encouraging the retro-gaming community to fix problems with these emulators but, for long-term preservation, their support causes concerns.

Developing and fixing bugs in emulators requires considerable programming skill; computer science grad students can do it but the average programmer in a library cannot.

Concerns: Metadata

Emulations of preserved software such as those I've demonstrated require not just the bits forming the image of a CD-ROM or system disk, but also several kinds of metadata:

Technical metadata, describing the environment needed in order for the bits to function. Tools for extracting technical metadata for migration such as JHOVE and DROID exist, as do the databases on which they rely such as PRONOM, but they are inadequate for emulation. The DNB and bwFLA teams' iPRES 2015 paper describes an initial implementation of a tool for
compiling and packaging this metadata which worked quite well
for the restricted domain of CD-ROMs. But much better, broadly applicable tools and databases are needed if emulation is to be affordable.

Bibliographic metadata, describing what the bits are so that they can be discovered by potential "readers".

Usability metadata, describing how to use the emulated software. An example is the VisiCalc reference card, describing the key bindings of the first spreadsheet.

Usage metadata, describing how the emulations get used by "readers", which is needed by cloud-based emulation systems for provisioning, and for "page-rank" type assistance in discovery. The Web provides high-quality tools in this area, although a balance has to be maintained with user privacy. The Internet Archive's praiseworthy policy of minimizing logging does make it hard to know how much their emulations are used.

There are no standards or effective tools for automatically extracting the bibliographic or usability metadata; it has to be hand-created. The Internet Archive's approach of crowd-sourcing enhancement of initially minimal metadata of these kinds works fairly well, at least for games.

Concerns: Fidelity

In a Turing sense all computers are equivalent, so it is possible and indeed common for an emulator to precisely mimic the behavior of a computer's CPU and memory. But physical computers are more than a CPU and memory. They have I/O devices whose behavior in the digital domain is more complex than Turing's model. Some of these devices translate between the digital and analog domains to provide the computer's user interface.

PDP1 front panel by fjarlq / Matt.
Licensed under CC BY 2.0.

A user experiences an emulation via its analog behavior, and this can be sufficiently different to impair the experience. Smarty's sound glitches are an example. Consider also the emulation of Space War on the PDP-1.
The experience of pointing and clicking at the Internet Archive's web
page,
pressing LEFT-CTRL and ENTER to start,
watching a small patch in one window on your screen among many others,
and controlling your spaceship from the keyboard is not the same as
the original.
That experience included loading the paper tape into the reader,
entering the paper tape bootstrap from the Address and Test Word switches at the left, pressing the Start switch at the bottom left, and then each player controlling their ship with three of the six Sense switches on the right.
The display was a large,
round,
flickering CRT.

Concerns: Loads & Scaling

Daily emulation counts

One advantage of frameworks such as the Internet Archive's and
Olive's
is that each additional user brings along the compute power needed to
run their emulation.
Frameworks in which the emulation runs remotely must add resources to
support added users.
The release of the Theresa Duncan CD-ROMs attracted considerable media attention,
and the load on Rhizome's emulation infrastructure spiked.

Amazon EC2 charges for an 8 CPU machine about €0.50 per hour.
In case of [Bomb Iraq],
the average session time of a user playing with the emulated machine was
15 minutes,
hence,
the average cost per user is about €0.02 if a machine is fully
utilized.

In the peak,
this would have been about €10/day,
ignoring Amazon's charges for data out to the Internet.
Nevertheless,
automatically scaling to handle unpredictable spikes in demand always
carries budget risks,
and rate limits are
essential for cloud deployment.
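
The per-user figure above can be checked with a little arithmetic. The text does not say how many emulators share the machine; the sketch below assumes one emulator per CPU, which is an assumption but does reproduce the quoted €0.02 after rounding. The function names are illustrative.

```python
HOURLY_RATE_EUR = 0.50    # 8-CPU EC2 machine, from the text
CPUS_PER_MACHINE = 8      # from the text
SESSION_MINUTES = 15      # average session length, from the text
EMULATORS_PER_CPU = 1     # ASSUMPTION: not stated in the text


def cost_per_session(hourly_rate=HOURLY_RATE_EUR, cpus=CPUS_PER_MACHINE,
                     minutes=SESSION_MINUTES, per_cpu=EMULATORS_PER_CPU):
    """Cost in euros of one session on a fully utilized machine."""
    concurrent_sessions = cpus * per_cpu
    return hourly_rate / concurrent_sessions * (minutes / 60)


def sessions_for_budget(daily_budget_eur, session_cost):
    """Number of sessions a given daily spend would pay for."""
    return daily_budget_eur / session_cost


cost = cost_per_session()
print(f"cost per session: EUR {cost:.4f}")
print(f"sessions per day for EUR 10: {sessions_for_budget(10, cost):.0f}")
```

Under that assumption a session costs €0.015625, and a €10 day corresponds to about 640 fully packed sessions. Real utilization is lower than 100%, so the number of sessions actually served for that money would be smaller.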

Why Mostly Games?

Using emulation for preservation was pioneered by video game enthusiasts. This reflects a significant audience demand for retro gaming which,
despite the easy informal availability of free games,
is estimated to be a $200M/year segment
of the $100B/year video games industry.
Commercial attention to the value of the game industry's back catalog
is increasing; a company called Digital Eclipse aspires to become the Criterion Collection of gaming, selling high-quality re-issues of old games. Because preserving content for scholars lacks the business model and fan base of
retro gaming,
it is likely that it will continue to be a minority interest in the emulation
community.

There are relatively few preserved system images other than games for several
reasons:

The retro gaming community has established an informal modus vivendi with the copyright owners.
Most institutions require formal agreements covering preservation and
access and,
just as with academic journals and books,
identifying and negotiating individually with every copyright owner
in the software stack is extremely expensive.

If a game is to be successful enough to be worth preserving,
it must be easy for an unskilled person to install,
execute and understand,
and thus easy for a curator to create a preserved system image.
The same is not true for artefacts such as art-works or
scientific computations,
and thus the cost per preserved system image is much higher.

A large base of volunteers is interested in creating
preserved game images,
and there is commercial interest in doing so.
Preserving other genres requires funding.

Techniques have been developed for mass preservation of,
for example,
Web pages,
academic journals,
and e-books, but no such mass preservation technology is available for emulations. Until it is, the cost per artefact preserved will remain many orders of magnitude
higher.

Artefact Evolution

As we have seen, emulation can be very effective at re-creating the
experience of using the kinds of digital artefacts that were being
created two decades ago. But the artefacts being created now are very
different, in ways that have a big impact on their preservation, whether
by migration or emulation.

Before the advent of the Web digital artefacts had easily identified
boundaries.
They consisted of a stack of components,
starting at the base with some specified hardware,
an operating system,
an application program and some data.
In typical discussions of digital preservation,
the bottom two layers were assumed and the top two instantiated in a
physical storage medium such as a CD.

The connectivity provided by the Internet and subsequently by the Web
makes it difficult to determine where the boundaries of a digital object are.
For example,
the full functionality of what appear on the surface to be traditional
digital documents such as spreadsheets or PDFs can invoke services
elsewhere on the network,
even if only by including links.
The crawlers that collect Web content for preservation have to be
carefully programmed to define the boundaries of their crawls.
Doing so imposes artificial boundaries,
breaking what appears to the reader as a homogeneous information space
into discrete digital "objects".

Indeed,
what a reader thinks of as "a web page" typically now consists of
components from dozens of different Web servers,
most of which do not contribute to the reader's experience of the page.
They are deliberately invisible,
implementing the Web's business model of universal fine-grained surveillance.

Sir Tim Berners-Lee's original Web
was essentially an implementation of Vannevar Bush's Memex hypertext concept,
an information space of passive,
quasi-static hyper-linked documents.
The content a user obtained by dereferencing a link was highly likely to
be the same as that obtained by a different user,
or by the same user at a different time.

Since then,
the Web has gradually evolved from the original static linked document model
whose language was HTML,
to a model of interconnected programming environments whose language is
JavaScript.
Indeed,
none of the emulation frameworks I've described would be possible without
this evolution.
The probability that two dereferences of the same link will yield the same
content is now low;
the content is dynamic.
This raises fundamental questions for preservation:
what exactly does it mean to "preserve" an artefact that is different
every time it is examined?

The fact that the artefacts to be preserved are now active
makes emulation a far better strategy than migration, but it increases
the difficulty of defining their boundaries. One invocation of an object
may include a different set of components from the next invocation, so
how do you determine which components to preserve?

In 1995,
a typical desktop 3.5" hard disk held 1-2GB of data.
Today,
the same form factor holds 4-10TB, say 4-5 thousand times as much.
In 1995,
there were estimated to be 16 million Web users.
Today,
there are estimated to be over 3 billion,
nearly 200 times as many.
At the end of 1996,
the Internet Archive estimated the total size of the Web at 1.5TB,
but today they ingest that much data roughly every 30 minutes.

The technology has grown,
but the world of data has grown much faster,
and this has transformed the problems of preserving digital artefacts.
Take an everyday artefact such as Google Maps.
It is simply too big and worth too much money for any possibility of
preservation by a third party such as an archive,
and its owner has no interest in preserving its previous states.

Infrastructure Evolution

While the digital artefacts being created were evolving, the infrastructure they depend on was evolving too. For preservation, the key changes were:

GPUs: As Rothenberg was writing,
PC hardware was undergoing a major architectural change.
The connection between early PCs and their I/O devices was the
ISA bus,
whose bandwidth and latency constraints made it effectively impossible
to deliver multimedia applications such as movies and computer games.
This was replaced by the PCI bus,
with much better performance, and multimedia became an essential ingredient of computing devices. This forced a division
of system architecture into a Central Processing Unit (CPU) and what became
known as Graphics Processing Units (GPUs).
The reason was that CPUs were essentially sequential processors,
incapable of performing the highly parallel task of rendering the graphics
fast enough to deliver an acceptable user experience.
Now,
much of the silicon in essentially every device with a user interface
implements a massively parallel GPU whose connection to the display is both very
high bandwidth and very low latency.
Most high-end scientific computation now also depends on the massive parallelism
of GPUs rather than traditional super-computer technology. Partial para-virtualization of GPUs was recently mainstreamed in Linux 4.4, but its usefulness for preservation is strictly limited.

Smartphones: Both desktop and laptop PC sales are in free-fall,
and even tablet sales are no longer growing. Smartphones are the hardware of choice.
They,
and tablets,
amplify interconnectedness;
they are designed not as autonomous computing resources but as
interfaces to the Internet. The concept of a stand-alone "application" is no longer really relevant to
these devices.
Their "App Store" supplies custom front-ends to network services,
as these are more effective at implementing the Web's business model
of pervasive surveillance.
Apps are notoriously difficult to collect and preserve.
Emulation can help with their tight connection to their hardware platform,
but not with their dependence on network services. The user interface hardware of mobile devices is much more diverse.
In some cases the hardware is technically compatible with traditional
PCs,
but not functionally compatible.
For example,
mobile screens typically are both smaller and have much smaller pixels,
so an image from a PC may be displayable on a mobile display but it may be
either too small to be readable,
or if scaled to be readable may be clipped to fit the screen.
In other cases the hardware isn't even technically compatible.
The physical keyboard of a laptop and the on-screen virtual keyboard of
a tablet are not compatible.

Moore's Law: Gordon Moore predicted in 1965
that the number of transistors per unit area
of a state-of-the-art integrated circuit would double about every two years.
For about the first four decades of Moore's Law,
what CPU designers used the extra transistors for was to make the CPU
faster.
This was advantageous for emulation;
the modern CPU that was emulating an older CPU would be much faster.
The computational cost of emulating the old hardware in software would
be swamped by the faster hardware being used to do it. Although Moore's Law continued into its fifth decade,
each extra transistor gradually became less effective at increasing CPU speed.
Further,
as GPUs took over much of the intense computation,
customer demand evolved from maximum performance per CPU,
to processing throughput per unit power.
Emulation is a sequential process,
so the fact that the CPUs are no longer getting rapidly faster
is disadvantageous for emulation.

Architectural Consolidation:
W. Brian Arthur's 1994 book Increasing Returns and Path Dependence in the Economy
described the way the strongly increasing returns to scale in
technology markets drove consolidation.
Over the past two decades this has happened to system architectures.
Although it is impressive that MAME/MESS emulates nearly two thousand different
systems from the past,
going forward emulating only two architectures (Intel and ARM) will capture
the overwhelming majority of digital artefacts.

Threats:
Although the Morris Worm took down the Internet in 1988,
the Internet environment two decades ago was still fairly benign. Now,
Internet crime is one of the world's most profitable activities,
as can be judged by the fact that the price for a single zero-day iOS exploit is $1M.
Because users are so bad at keeping their systems up-to-date with patches,
once a vulnerability is exploited
it becomes a semi-permanent feature of the Internet. For example, the 7-year-old Conficker worm was recently found infecting brand-new police body-cameras. This threat persistence is a particular concern for emulation as a
preservation strategy. Familiarity Breeds Contempt by Clark et al.
shows that the interval between discoveries of new vulnerabilities in released
software decreases through time.
Thus the older the preserved system image,
the (exponentially) more vulnerabilities it will contain, and the more likely it is to be compromised as soon as its emulation starts.

Legal Issues

Warnings: I Am Not A Lawyer, and this is US-specific.

Most libraries and archives are very reluctant to operate in ways whose legal foundations are
less than crystal clear.
There are two areas of law that affect using emulation to re-execute
preserved software,
copyright and,
except for open-source software,
the end user license agreement (EULA), a contract between the original purchaser and the vendor.

Software must be assumed to be copyright,
and thus absent specific permission such as a Creative Commons or
open source license,
making persistent copies such as are needed to form collections of
preserved system images is generally not permitted.
The Digital Millennium Copyright Act (DMCA) contains a "safe harbor"
provision under which sites that remove copies if copyright owners
send "takedown notices" are permitted;
this is the basis upon which the Internet Archive's collection
operates.
Further,
under the DMCA it is forbidden to circumvent any form of copy
protection or Digital Rights Management (DRM) technology.
These constraints apply independently to every component in the
software stack contained in a preserved system image,
thus there may be many parties with an interest in an emulation's
legality.

The Internet Archive and others have repeatedly worked through the "Section 108"
process to obtain an exemption to the circumvention ban
for programs and video games "distributed in formats that have become
obsolete and that require the original media or hardware as a condition of
access,
when circumvention is accomplished for the purpose of preservation or archival
reproduction of published digital works by a library or archive."
This exemption appears to cover the Internet Archive's circumvention of
any DRM on their preserved software,
and its subsequent "archival reproduction" which presumably includes
execution.
It does not,
however,
exempt the archive from taking down preserved system images if the claimed copyright owner objects,
and the Internet Archive routinely does so.
Neither does the DMCA exemption cover the issue of whether the emulation
violates the EULA.

Streaming media services such as Spotify,
which do not result in the proliferation of copies of content,
have significantly reduced although not eliminated intellectual property
concerns around access to digital media. "Streaming" emulation systems should have
a similar effect on access to preserved digital artefacts. The success of the Internet Archive's collections,
much of which can only be streamed,
and Rhizome's is encouraging in this respect.
Nevertheless,
it is clear that institutions will not build,
and provide access even on a restricted basis to,
collections of preserved system images at the scale needed
to preserve our cultural heritage unless the legal basis for doing so is
clarified.

Negotiating with copyright holders piecemeal is very expensive and time-consuming. Trying to negotiate a global agreement that would obviate the need for individual agreements would, in the best case, take a long time. I predict the time would be infinite rather than long. If we wait to build collections until we have permission in one of these ways, much software will be lost.

An alternative approach worth considering would separate the issues
of permission to collect from the issues of permission to provide
access.
Software is copyright.
In the paper world,
many countries had copyright deposit legislation allowing their national
library to acquire,
preserve and provide access (generally restricted to readers
physically at the library) to copyright material.
Many countries,
including most of the major software producing countries,
have passed legislation extending their national library's rights to the
digital domain.

The result is that most of the relevant national libraries already
have the right to acquire and preserve digital works,
although not the right to provide unrestricted access to them.
Many national libraries have collected digital works in physical form.
For example,
the DNB's CD-ROM collection includes half a million
items.
Many national libraries are crawling the Web to ingest Web pages relevant to their collections.

It does not appear that national libraries are consistently exercising
their right to acquire and preserve the software components needed to
support future emulations,
such as operating systems,
libraries and databases.
A simple change of policy by major national libraries could
be effective immediately in ensuring that these components were
archived.
Each national library's collection could be accessed by
emulations on-site in "reading-room" conditions,
as envisaged by the DNB.
No time-consuming negotiations with publishers would be needed.

If national libraries stepped up to the plate in this way, the problem of access would remain. One idea that might be worth exploring as a way to address it is lending.
The Internet Archive has successfully implemented a lending system for their collection of digitized books. Readers can check a book out for a limited period; each book can be checked out to at most one reader at a time.
This has not encountered much opposition from copyright holders. A similar system for emulation would be feasible;
readers would check out an emulation for a limited period,
and each emulation could be checked out to at most one reader at a time.
One issue would be dependencies.
An archive might have,
say,
10,000 emulations based on Windows 3.1.
If checking out one blocked access to all 10,000 that might be too
restrictive to be useful.
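
To make the dependency issue concrete, here is a minimal sketch in Python of a one-reader-at-a-time lending desk. It is not how the Internet Archive's book lending works, and no emulation service implements it as far as I know. The class name LendingDesk, the lock_components switch and the catalog entries are all hypothetical, chosen only to illustrate the two possible policies.

```python
from dataclasses import dataclass, field


@dataclass
class LendingDesk:
    """Toy lending model for emulations (hypothetical, for illustration only)."""

    # Maps each emulation to the preserved components it depends on.
    depends_on: dict[str, set[str]]
    # If True, a shared component such as an OS image counts as a single copy,
    # so lending one emulation blocks every other emulation that needs it.
    lock_components: bool = True
    # Maps each checked-out emulation to the time at which its loan ends.
    loans: dict[str, float] = field(default_factory=dict)

    def _expire(self, now: float) -> None:
        # Loans run for a limited period; forget the ones that have ended.
        self.loans = {e: end for e, end in self.loans.items() if end > now}

    def can_check_out(self, emulation: str, now: float) -> bool:
        self._expire(now)
        if emulation in self.loans:
            return False  # at most one reader per emulation at a time
        if not self.lock_components:
            return True
        in_use: set[str] = set()
        for lent in self.loans:
            in_use |= self.depends_on[lent]
        # Blocked if any component it needs is part of a current loan.
        return not (self.depends_on[emulation] & in_use)

    def check_out(self, emulation: str, now: float, period: float) -> bool:
        if not self.can_check_out(emulation, now):
            return False
        self.loans[emulation] = now + period
        return True


# Hypothetical catalog: two titles share one operating system image.
catalog = {
    "title-a": {"windows-3.1"},
    "title-b": {"windows-3.1"},
    "title-c": {"other-os"},
}
strict = LendingDesk(catalog)
strict.check_out("title-a", now=0, period=14)
blocked = not strict.can_check_out("title-b", now=1)  # shares windows-3.1
```

With lock_components set to True, lending title-a makes title-b unavailable until the loan ends, which is the 10,000-emulation problem in miniature. Setting it to False applies the one-reader rule to each emulation separately; whether copyright holders would accept that reading is exactly the open question.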

Conclusion

I hope I have shown that the technical problems of delivering emulations of preserved software have largely been solved. Concerns remain, but most are manageable. The legal issues are intractable unless national libraries are prepared to use their copyright deposit rights to build collections of software. If they do, some way to provide off-site access will be needed, but at least the software will be around to be emulated when agreement is reached on it.

3 comments:

Thank you for including the newly released oldweb.today in this comprehensive presentation and post.

I just wanted to clarify that oldweb.today does not actually use bwFLA technology, but is built using some of the same building blocks, including the Wine, Sheepshaver and Basilisk emulators. It relies entirely on Docker containers and does not have a separate image system from the one that is provided by Docker. The images for all the browsers are also available on Docker's public image repository, Docker Hub. The project leverages various Docker technologies, including Docker Swarm for scaling across multiple machines.

The technology used in oldweb.today is mentioned briefly on the home page in the 'About / How it works' section, but perhaps this is not easily accessible (it requires a click to expand that section) and could be improved.