I'm David Rosenthal, and this is a place to discuss the work I'm doing in Digital Preservation.

Tuesday, November 10, 2015

Follow-up to the Emulation Report

Enough has happened while my report on emulation was in the review process that,
although I announced its release only last week,
I already have material for a follow-up post. Below the fold, the details,
including a really important paper from the recent SOSP workshop.
First, a few links to reinforce points that I made in the report:

One important assumption that lies behind the use of emulation for preservation is that
future hardware will be much more powerful than the hardware that originally ran the
preserved digital artefact.
Moore's Law used to make this a no-brainer for CPU performance and memory size.
Although it has recently slowed,
the long time scales implicit in preservation mean that these are still good bets.
But the hardware capabilities that emulation needs are not limited to CPU and memory.
They include the I/O resources needed for communication with the user.
The report points out that this is no longer a good bet.
Desktop and laptop sales are in free-fall
and as The Register reports, even
tablet sales
have been cratering over the last year.
The hardware future users will use to interact with emulations will be a smartphone.
It won't have a physical keyboard, and both its display and its pixels will be much smaller.
Most current emulations are unusable on a smartphone.

The report starts with an image of a Mac emulator running on an Apple Watch.
Nick Lee started a trend:
Hacking Jules has Nintendo 64 and PSP emulators running on his Android Wear watch. Not, of course, that these emulated games really recreate the experience of playing on a Nintendo 64 or a PSP. But, as with Nick Lee's Mac, they show that simply running an emulation is not that hard.

Some papers at iPRES2015 addressed issues that were raised in the report:

Functional Access to Forensic Disk Images in a Web Service by Kam Woods et al.
describes using Freiburg's emulation-as-a-service on a collection of forensic disk images.

Characterization of CDROMs for Emulation-based Access
by Klaus Rechert et al. is a paper I cited in the report, thanks to a pre-print from Klaus.
It describes the DNB's efforts using Freiburg's EAAS to provide access to their collection of CD-ROM images. In particular it describes an automated workflow for extracting the necessary technical metadata.

Getting to the Bottom Line: 20 Digital Preservation Cost Questions. Cost is the single most important cause of the Half-Empty Archive. One concern the report raises is that, absent better ingest tools, the per-artefact cost of emulation is too high. Matt Schultz et al. describe a resource to help institutions identify the full range of costs that might be associated with any particular
digital preservation service.

Based on the facts that cloud services depend heavily on virtualization, and that preserved system images generally work well, the report is cautiously enthusiastic about the fidelity with which emulators execute their target's instruction set. But it does flag several concerns in this area, such as an apparent regression in QEMU's ability to run Windows 95.

A paper at the recent SOSP by Nadav Amit et al. entitled Virtual CPU Verification casts light on the causes and cures of fidelity failures in emulators. They observed that the problem of verifying virtualized or emulated CPUs is closely related to the problem of verifying a real CPU. Real CPU vendors sink huge resources into verifying their products, and this team from the Technion and Intel were able to base their research into x86 emulation on the tools that Intel uses to verify its CPU products.

Although QEMU running on an x86 tries hard to virtualize rather than emulate, it is capable of emulating, and the team were able to force it into emulation mode. Using their tools, they were able to find and analyze 117 bugs in QEMU, and fix most of them. Their testing also triggered a bug in the VM BIOS:
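For readers who want to see the distinction in practice, QEMU selects between hardware-assisted virtualization and pure software emulation via its accelerator setting. A rough sketch of the two invocations (the disk image name is a placeholder, and exact option syntax varies between QEMU versions):

```shell
# Hardware-assisted virtualization: guest code runs mostly on the
# host CPU under KVM, so instruction-level fidelity is largely the
# CPU vendor's problem.
qemu-system-x86_64 -machine accel=kvm -m 2048 -hda guest.img

# Forcing the TCG accelerator instead makes QEMU emulate every
# instruction in software - the mode the researchers exercised
# to expose bugs in QEMU's own instruction-set emulation.
qemu-system-x86_64 -machine accel=tcg -m 2048 -hda guest.img
```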

But the VM BIOS can also introduce bugs of its own.
In our research, as we addressed one of the disparities in
the behavior of VCPUs and CPUs, we unintentionally
triggered a bug in the VM BIOS that caused the 32-bit version
of Windows 7 to display the so-called blue screen of death.

Their conclusion is worth quoting:

Hardware-assisted virtualization is popular, arguably allowing users to run multiple workloads robustly and securely
while incurring low performance overheads. But the robustness and security are not to be taken for granted, as it is
challenging to virtualize the CPU correctly, notably in the
face of newly added features and use cases. CPU vendors
invest a lot of effort—hundreds of person years or more—to develop validation tools, and they exclusively enjoy the
benefit of having an accurate reference system. We therefore
speculate that effective hypervisor validation could truly be
made possible only with their help. We further contend that
it is in their interest to provide such help, as the majority of
server workloads already run on virtual hardware, and this
trend is expected to continue. We hope that open source hypervisors will be validated on a regular basis by Intel Open
Source Technology Center.

Having Intel validate the open source hypervisors,
especially doing so by forcing them to emulate rather than virtualize,
would be a big step forward.
But note the focus on current uses of virtualization.
To what extent the validation process would test the emulation of the hardware features of
legacy CPUs important for preservation is uncertain,
though the fact that their verification caught a bug relevant only to Windows 7 is encouraging.

18 comments:

I would like to add that I am continuing work on Netcapsule (https://github.com/ikreymer/netcapsule), which I think is a next step in the "Internet Emulator" effort. It is a fully open source Docker-based system, currently supporting 13 browsers, each running in its own Docker container on-demand, and allowing browsing across 10+ Memento-enabled archives.

I do have one piece of feedback. One thing you seem to have missed, in both of your blog posts and in the report, is how the use case you mentioned us having at Yale relates to the cost calculations for using emulation long term and at a large scale.

One of my ongoing concerns is the problem of software-dependent content, which I consider to be extremely prevalent and problematic. For this reason, and due to the cost considerations outlined below, we are exploring at Yale the option of using emulation to enable interaction with every born-digital object in our archive (and potentially everything digital) using software that is contemporaneous with the objects. On first hearing, this can sound daunting and expensive. Fortunately, much of it can be achieved quite inexpensively, as a huge number of our files can best be interacted with using a relatively small set of environments. For example, if we set up each version of Microsoft Office and/or WordPerfect Office on one disk image each, and made them available via Emulation as a Service for use with our content, we would be able to enable interaction with many millions of files. The per-file cost of enabling this would be minimal. Furthermore, the cost could mostly be borne at the point of access, i.e. just in time rather than just in case (which I'd consider applies to migrating everything - it is a relatively large and recurring expense incurred "just in case" the content ever gets used).

The team in Freiburg have also made great progress with implementing something similar to the approach used in the KEEP project to enable the above use case in an even more automated and therefore cost-effective way. They have mapped PRONOM IDs to a number of preconfigured environments so that individual files to be accessed can be analyzed and a set of environments automatically identified that can interact with the files. Under their current implementation a default environment is automatically selected and booted but users can optionally choose another "compatible" environment.
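To make this concrete, the workflow amounts to a lookup from format identifiers to preconfigured environments. A minimal sketch in Python - the PRONOM IDs, environment names, and default-first ordering below are made-up placeholders for illustration, not bwFLA's actual configuration:

```python
# Hypothetical sketch: map PRONOM format IDs (obtained by
# characterizing each file) to preconfigured emulation environments.
# The default environment is listed first; users may pick another.
PRONOM_TO_ENVIRONMENTS = {
    "fmt/40":   ["win98-office97", "winxp-office2003"],
    "x-fmt/44": ["win95-wordperfect7"],
}

def compatible_environments(pronom_id):
    """All preconfigured environments that can open this format."""
    return PRONOM_TO_ENVIRONMENTS.get(pronom_id, [])

def default_environment(pronom_id):
    """The environment booted automatically for this format."""
    envs = compatible_environments(pronom_id)
    return envs[0] if envs else None
```

The point of the sketch is the economics: one entry in the table covers every file of that format, so the per-file cost of enabling access approaches zero.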

Overall this use case is quite different from, for example, the CD-ROM use case, as there is minimal initial effort, and therefore cost, per item to be preserved. Due to its "just in time" rather than "just in case" nature, I think it is quite attractive as an option over the long term, especially compared with the just-in-case migration alternative.

One other update, my student workers in the Library at Yale are currently processing about 500 CD-ROMs and floppy disks from our general collections per week. We have around 8000, probably 5-6000 of which will eventually go into the EaaS framework. These will take some effort to process (unlike the use case I described above). But I'm hoping the characterization service the bwFLA team have implemented may help with that.

The report points out that the persistence of malware in the Internet is a significant problem for the use of emulation in preservation. It uses 2008's Conficker worm as an illustration. Today, we get a reminder that these threats never go away. Conficker was just discovered in brand-new factory shipments of police body cameras.

The report discusses the problems GPUs pose for emulation (Section 3.2.1) and the efforts to provide paravirtualized GPU support in QEMU (Section 4.2.1). This limited but valuable support is now mainstreamed in the Linux 4.4 kernel.
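Concretely, this means a Linux guest with a 4.4 or later kernel can use QEMU's paravirtualized virtio GPU. A rough sketch of the invocation (the image name is a placeholder, and option availability depends on the QEMU build):

```shell
# Attach the paravirtualized virtio GPU and enable host OpenGL
# acceleration for its display; the guest needs the virtio-gpu
# DRM driver that was mainlined in Linux 4.4.
qemu-system-x86_64 -enable-kvm -m 2048 \
    -vga virtio -display gtk,gl=on \
    -hda linux-guest.img
```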

"The goals of the JavaScript Machines project are to create fast, full-featured simulations of classic computer hardware, help people understand how these early machines worked, make it easy to experiment with different machine configurations, and provide a platform for running and analyzing old computer software."

The report concludes that widespread use of emulation depends on a solution to the obstacles posed by copyright. Anyone in doubt about how hard this will be should read Zachary Crockett's How Mickey Mouse Evades the Public Domain at Priceonomics:

"Disney has done everything in its power to make sure it retains the copyright on Mickey -- even if that means changing federal statutes. Every time Mickey’s copyright is about to expire, Disney spends millions lobbying Congress for extensions, and trading campaign contributions for legislative support. With crushing legal force, they’ve squelched anyone who attempts to disagree with them."

"Microsoft is trying to change its business model so it can in theory make money even if no one ever buys a new PC again. Meanwhile, Intel and the PC makers still generate sales from each new PC sold and therefore want personal computers to fly off the shelves. And all of the PC companies are trying everything they can to get out of the PC business -- or at least become less dependent on selling computers."

"These movies have always been in print," Cifaldi said. "Games could have been the same way, except we demonized emulation, and devalued our heritage. We've relegated a majority of our past to piracy."