I'm David Rosenthal, and this is a place to discuss the work I'm doing in Digital Preservation.

Tuesday, March 29, 2016

Following Up On The Emulation Report

A meeting was held at the Mellon Foundation to follow up on my report Emulation and Virtualization as Preservation Strategies. I was asked to provide a brief introduction to get discussion going. The discussions were confidential, but below the fold is an edited text of my introduction with links to the sources.

I think the two most useful things I can do this morning are:

A quick run-down of developments I'm aware of since the report came out.

A summary of the key problem areas and recommendations from the report.

I'm going to ignore developments by the teams represented here. Not that they aren't important, but they can explain them better than I can.

The quality of the emulators, especially when running legacy artefacts, is a significant concern. A paper at last year's SOSP by Nadav Amit et al entitled Virtual CPU Verification casts light on the causes and cures of fidelity failures in emulators. They observed that the problem of verifying virtualized or emulated CPUs is closely related to the problem of verifying a real CPU. Real CPU vendors sink huge resources into verifying their products, and this team from the Technion and Intel were able to base their research into X86 emulation on the tools that Intel uses to verify its CPU products.

Although QEMU running on an X86 tries hard to virtualize rather than emulate, it is capable of emulating and the team were able to force it into emulation mode. Using their tools, they were able to find and analyze 117 bugs in QEMU, and fix most of them. Their testing also triggered a bug in the VM BIOS:

But the VM BIOS can also introduce bugs of its own. In our research, as we addressed one of the disparities in the behavior of VCPUs and CPUs, we unintentionally triggered a bug in the VM BIOS that caused the 32-bit version of Windows 7 to display the so-called blue screen of death.

Having Intel validate the open source hypervisors, especially doing so by forcing them to emulate rather than virtualize, would be a big step forward. To what extent the validation process would test the emulation of the hardware features of legacy CPUs important for preservation is uncertain, though the fact that their verification caught a bug that was relevant only to Windows 7 is encouraging.

Frameworks

Second, the frameworks. The performance of the Internet Archive's JSMESS framework, now being called Emularity, depends completely on the performance of the JavaScript virtual machine. Other frameworks are less dependent, but its performance is still important. The movement supported by major browser vendors to replace this virtual machine with a byte-code virtual machine called WebAssembly has borne fruit. A week ago four major browsers announced initial support, all running the same game, a port of Unity's Angry Bots. This should greatly reduce the pressure for multi-core and parallelism support in JavaScript, which was always likely to be a kludge. Improved performance for in-browser emulation is also likely to make in-browser emulation more competitive with techniques that need software installation and/or cloud infrastructure, reducing the barrier to entry.

The report discusses the problems GPUs pose for emulation and the efforts to provide paravirtualized GPU support in QEMU. This limited but valuable support is now mainstreamed in the Linux 4.4 kernel.

Mozilla among others has been working to change the way in which Web pages are rendered in the browser to exploit the capabilities of GPUs. Their experimental "servo" rendering engine gains a huge performance advantage by doing so. For us, this is a double-edged sword. It makes the browser dependent on GPU support in a way it wasn't before, and thus makes the task of browser emulations such as oldweb.today harder. If, on the other hand, it means that GPU capabilities will be exposed to WebAssembly, it raises the prospect of worthwhile GPU-dependent emulations running in browsers, further reducing the barrier to entry.

Collections

Third, the collections. The Internet Archive has continued to release collections of legacy software using Emularity. The Malware Museum, a collection of currently 47 viruses from the '80s and '90s, has proven very popular, with over 850K views in about 6 weeks. The Windows 3.X Showcase, a curated sample of the over 1500 Windows emulations in the collection, has received 380K views in the same period. It is particularly interesting because it includes a stock install of Windows 3.11. Despite that the team has yet to receive a takedown request from Microsoft.

About the same time as my report, a team at Cornell led by Oya Rieger and Tim Murray produced a white paper for the National Endowment for the Humanities entitled Preserving and Emulating Digital Art Objects. I blogged about it. To summarize my post, I believe that outside their controlled "reading room" conditions the concern they express for experiential fidelity is underestimated, because smartphones and tablets are rapidly replacing PCs. But two other concerns, for emulator obsolescence and the fidelity of access to web resources, are overblown.

Tools

Fourth, the tools. The Internet Archive has a page describing how DOS software to be emulated can be submitted. Currently about 65 submissions a day are being received, despite the somewhat technical process it lays out. Each is given minimal initial QA to ensure that it comes up, and is then fed into the crowd-sourced QA process described in the report. It seems clear that improved tooling, especially automating the process via an interactive Web page that ran the emulation locally before submission, would result in more and better quality submissions.

Internet of Things

The Internet of Things has been getting a lot of attention, especially the catastrophic state of IoT security. Updating the software of Things in the Internet to keep them even marginally secure is often impossible because the Things are so cheap there are no dollars for software support and updates, and because customers have no way to tell that one device is less insecure than another. This is exactly the problem faced by preserved software that connects to the Internet, as discussed in the report. Thus efforts to improve the security of the IoT and efforts such as Freiburg's to build an "Internet Emulator" to protect emulations of preserved software may be highly synergistic.

Off on a tangent, it is worth thinking about the problems of preserving the Internet of Things. The software and hardware are intimately linked, even more so than smartphone apps. So does preserving the Internet of Things reduce to preserving the Things in the Internet, or does emulation have a role to play?

The To-Do List

To refresh your memories, here are the highlights of the To-Do List that ends the report, with some additional commentary. I introduce the list by pointing out the downsides of the lack of standardization among the current frameworks, in particular:

There will be multiple emulators and emulation frameworks, and they will evolve through time. Re-extracting or re-packaging preserved artefacts for different, or different versions of, emulators or emulation frameworks would be wasted effort.

The most appropriate framework configuration for a given user will depend on many factors, including the bandwidth and latency of their network connection, and the capabilities of their device. Thus the way in which emulations are advertised to users, for example by being embedded in a Web page, should not specify a particular framework or configuration; this should be determined individually for each access.

I stressed that:

If the access paths to the emulations link directly to evanescent services emulating the preserved artefacts, not to the artefacts themselves, the preserved artefacts are not themselves discoverable or preservable.

In summary, the To-Do list was:

Standardize Preserved System Images so that the work of preparing preserved system images for emulation will not have to be redone repeatedly as emulation technology evolves, and

Standardize Access To System Images and

Standardize Invoking Emulators so that the work of presenting emulations of preserved system images to the "reader" will not have to be redone repeatedly as emulation technology evolve.

Improve Tools For Preserving System Images: The Internet Archive's experience shows that even minimal support for submission of system images can be effective. Better support should be a high priority. If the format of system images could be standardized, submissions would be available to any interested archive.

Enhance Metadata Databases: these tools, and standardized methods for invoking emulators, rely on metadata database, which need significant enhancement for this purpose.

Support Emulators: The involvement of Intel in QA-ing QEMU is a major step forward, but it must be remembered that most emulations of old software depend on enthusiast-supported emulators such as MAME/MESS. Supporting ways to improve emulator quality, such as for example external code reviews to identify critical quality issues, and a "bounty" program for fixing them, should be a high priority. It would be important that any such program be "bottom-up"; a "top-down" approach would not work in the enthusiast-dependent emulator world.

Develop Internet Emulators: oldweb.today is already demonstrating the value of emulating software that connects to the Internet. Doing so carries significant risks, and developing technology to address them (before the risks become real and cause a backlash) needs high priority. The synergies between this and the security of the Internet of Things should be explored urgently.

Tackle Legalities: As always, the legal issues are the hardest to address. I haven't heard that the PERSIST meeting in Paris last November came up with any new ideas in this area. The lack of a reaction to the Internet Archive's Windows 3.X Showcase is encouraging, and I'm looking forward to hearing whether others have made progress in this area.

Experiential fidelity issues are caused by hardware differences so its really hard to see them being cured by specific different hardware.

VR is interesting. I helped get one of the very first VR companies (Sense8) started and have the polo shirt to prove it. But experiencing simulations of things in VR is not the same as experiencing the real thing in the physical world. The differences are conceptually similar to the differences the Cornell team are concerned about.

Thanks for the great update. I wanted to address the security concern re: Internet emulators. I don't think that it is necessarily all that different than other concerns related to remote code execution outside the intended purpose of the emulator. For example, a user using a remote emulator running a CD-ROM on MacOS 7.5+ could instead open and write AppleScript that does malicious things. Providing access to the Internet increases the capability for doing 'malicious things'. In this context, 'malicious things' is anything that affects things outside the emulator, such as affecting the host machine and degrading the experience for other users. The proper solution is some sort of sandboxing of what an emulator.

As it happens, container technologies (such as Docker) already come with a variety of resource allocation controls, allowing the operator to limit cpu, ram, disk space and network activity, including fine grained network access controls, for containers. Running emulators in containers thereby allows emulators to take advantage of these controls (and improvements in container security), without having to develop any tools specific to emulators.

Victoria Stodden's CNI plenary Defining the Scholarly Record for Computational Research is a comprehensive look at the problem of what exactly needs to be collected and preserved in order for the in silico part of science to be reproducible and replicable. She quotes David Donoho from 1998:

“The idea is: An article about computational science in a scientific publication is not the scholarship itself, it is merely advertising of the scholarship. The actual scholarship is the complete ... set of instructions [and data] which generated the figures.”

This isn't actually quite enough. The example I used in my report on emulation was the Olive project's emulation of CHASTE 3.1, a large simulation package for computationally demanding problems in biology and physiology, released in 2013 for Ubuntu 12.04. Although the source code was released, building and running it requires a specific set of library versions and Ubuntu 12.04. So in general it is necessary to collect and preserve the entire stack all the way down to a description of the hardware in order to guarantee reproducibility.

Apps, as I wrote, are extremely problematic for preservation. But Peter Kafka at ReCode points out that The app boom is over. App saturation has arrived, user have the apps they need, and the space for new downloads is very restricted.

"The last ones to ship WebAssembly in the stable branches were Apple in Safari 11.0 and Microsoft in Microsoft Edge (EdgeHTML 16), the version that shipped with the Windows 10 Fall Creators Update, both released last month.

Currently, the standard is a resounding success. WebAssembly has already shipped in many Facebook games thanks to wasm-centric game engines released by the likes of Unity and Epic. In addition, WebAssembly has already made a name for itself on the malware scene where in-browser cryptocurrency miners like Coinhive and CryptoLoot wouldn't have ever been possible without it."