Saturday, December 4, 2010

Architectural simulators

As a researcher in the areas of operating systems, compilers, and computer architecture, I spend a lot of time dealing with simulators. An inordinate amount of time. Much of this time is spent figuring out how well a given simulator provides:

Useful timing measurements (cycle-accurate)

Support for fully functional OSs (full-system)

Auxiliary features for specific experiments

Support for useful hardware platforms and instruction sets

Openness to modifying microarchitectural features

Availability of technical support

In my current work, I'm interested in actively developed simulators that support cycle-accurate full-system simulation with a detailed memory hierarchy on out-of-order uniprocessor or multicore RISC architecture models with the SPARC, MIPS, or ARM instruction sets, and that support extending the pipeline and instruction set. The rest of this article discusses my efforts to find simulators that meet my needs.

A brief aside: I received a complaint that I write too much, so I have taken the effort to highlight in blue the important points I want to make; coloring is less work than thinking about how to be more concise. If you are familiar with architectural simulators, this should speed up processing my drivel. I go into detail about the six features listed above, then review some architectural simulators and processor emulators with respect to those criteria.

The first two features, cycle accuracy and full-system simulation, tend to be unavailable simultaneously. Cycle-accurate simulation improves on simulators that provide only instruction set simulation by adding detailed timing (delay) and modelling of the microarchitectural features of a processor. Cycle accuracy has been available in open academic simulators for a number of years, with the dominant player being SimpleScalar. However, SimpleScalar fell behind the fast-moving industry, so its usefulness in modern research is limited, although it is still an approachable system for students or for embedded systems work. Full-system simulation models the low-level hardware necessary to support running an OS and applications without modification. The defining features of a full-system simulator include support for interrupts, memory management hardware (MMU / TLB), low-level buses/interconnects, peripherals (e.g. keyboard, mouse), IO devices including disk and networking, and other functionally relevant system components.
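To make the distinction between instruction set simulation and timing simulation concrete, here is a toy sketch in Python. Everything in it is an assumption for illustration: the four-opcode ISA, the `LATENCY` table, and the `Machine` class are invented, and a fixed latency per instruction is only a stand-in for a real timing model, which would track pipeline overlap, stalls, and the memory hierarchy.

```python
from dataclasses import dataclass, field

# Hypothetical per-opcode latencies in cycles; real values depend on the pipeline modelled.
LATENCY = {"li": 1, "add": 1, "mul": 3, "load": 2}

@dataclass
class Machine:
    regs: dict = field(default_factory=dict)
    mem: dict = field(default_factory=dict)
    instructions: int = 0
    cycles: int = 0

def execute(m, op, dst, a, b=None):
    """Functional step: update architected state only, with no notion of time."""
    if op == "li":
        m.regs[dst] = a
    elif op == "add":
        m.regs[dst] = m.regs[a] + m.regs[b]
    elif op == "mul":
        m.regs[dst] = m.regs[a] * m.regs[b]
    elif op == "load":
        m.regs[dst] = m.mem.get(a, 0)
    else:
        raise ValueError(f"unknown opcode {op}")
    m.instructions += 1

def run(program, timing=False, mem=None):
    """Run the program; with timing=True, also charge each instruction its latency."""
    m = Machine(mem=dict(mem or {}))
    for inst in program:
        execute(m, *inst)
        if timing:
            m.cycles += LATENCY[inst[0]]
    return m

program = [
    ("li", "r1", 6),
    ("load", "r2", 0x10),
    ("mul", "r3", "r1", "r2"),
    ("add", "r4", "r3", "r1"),
]

functional = run(program, mem={0x10: 7})
timed = run(program, timing=True, mem={0x10: 7})
print(functional.regs["r4"], functional.instructions)  # 48 4
print(timed.regs["r4"], timed.cycles)                  # 48 7
```

Both runs compute the same architected state; only the timed run says anything about how long the program took, and that answer is only as good as the timing model behind it.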

The third feature that I investigate is the additional architectural modelling that a simulator provides to support research. Perhaps the two most important "auxiliary" features of a simulator are a detailed memory hierarchy and an accurate power model. A detailed memory hierarchy exposes the timing and functionality of elements along the path from the CPU to main memory. Among other things, this includes the cache parameters (line size, set size, associativity, access latency), cache behavior or policy, memory access latencies on cache misses, and, more recently, details of the interconnect between caches and memories. A detailed memory hierarchy is necessary for both cycle-accurate and functional simulation, although the details each requires may vary. Accurate power estimation involves modelling the power demand of an architecture by estimating the energy dissipated by accessing architectural features (dynamic power) and the power lost to leakage regardless of changes in hardware state (static/leakage power). Approaches to power estimation are derived from efforts to measure the power of real platforms, using repeated executions of instructions to construct estimates of the power of accessing architectural features. Estimates are validated by running complex workloads, measuring the observed power of a real platform, and comparing it with the power predicted by the simulator. A common tool for microarchitectural power research in academia is WATTCH, which has been integrated into a number of simulators. CACTI is commonly used for estimating the power dissipated by caches. It is also important to account for memory power, especially when proposing solutions that trade off performance and power, such as dynamic voltage and frequency scaling (DVFS); current tools fall short on accounting for off-chip memory costs.
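As a rough illustration of how the cache parameters mentioned above (line size, associativity, access latency, replacement policy) feed into timing, here is a minimal set-associative cache model in Python. The `Cache` class, its parameter names, and the latencies are all made up for this sketch; it tracks tags only, uses LRU replacement, and ignores writes, coherence, and the interconnect.

```python
from collections import OrderedDict

class Cache:
    """Set-associative cache with LRU replacement (tags and timing only, no data)."""

    def __init__(self, size_bytes, line_bytes, ways, hit_cycles, miss_cycles):
        self.line_bytes = line_bytes
        self.ways = ways
        self.num_sets = size_bytes // (line_bytes * ways)
        self.hit_cycles = hit_cycles
        self.miss_cycles = miss_cycles
        # One ordered dict per set; insertion order doubles as LRU order.
        self.sets = [OrderedDict() for _ in range(self.num_sets)]
        self.hits = 0
        self.misses = 0

    def access(self, addr):
        """Return the latency in cycles of accessing byte address addr."""
        line = addr // self.line_bytes
        index = line % self.num_sets
        tag = line // self.num_sets
        lines = self.sets[index]
        if tag in lines:
            lines.move_to_end(tag)      # mark as most recently used
            self.hits += 1
            return self.hit_cycles
        if len(lines) >= self.ways:
            lines.popitem(last=False)   # evict the least recently used line
        lines[tag] = True
        self.misses += 1
        return self.hit_cycles + self.miss_cycles

# 256-byte, 2-way cache with 32-byte lines gives 4 sets.
cache = Cache(size_bytes=256, line_bytes=32, ways=2, hit_cycles=2, miss_cycles=100)
total = sum(cache.access(a) for a in [0, 4, 128, 256, 0])
print(cache.hits, cache.misses, total)  # 1 4 410
```

Addresses 0, 128, and 256 all map to set 0, so the third distinct line evicts the first, and the final access to address 0 misses again. Changing `ways` to 4 turns that last access into a hit, which is the kind of sensitivity a detailed memory hierarchy lets you explore.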

The supported hardware platforms and instruction sets of a given simulator also interact with the ability to modify its low-level features. Some modern simulators target the x86(_64) architecture, but actual x86 implementations do not execute x86 instructions directly, which makes x86 difficult to model accurately. Instead, the hardware translates the machine code at run-time into one or more RISC-like instructions called micro-ops (μops). (PTLsim does claim to capture the cycle overheads of all elements of actual x86 implementations.) This indirection makes it difficult to do low-level compiler work, and pretty much impossible to build an architecture that can trap x86 instructions at the assembly language level. It is also difficult to create realistic prototypes, since the actual implementation details are obscured and proprietary. I mainly look for support of superscalar out-of-order (OoO) RISC architectures, which are useful for architectural and language research due to the straightforward mapping from instructions to architecture pipelines while still having complexity in the pipeline. One caveat to that last statement is that many academics, and even some in industry, have stated that the future of CMP is in simpler cores, so CMP researchers may seek less sophisticated processor pipelines. Also of importance when considering which simulator to use is the set of architectures modelled, for example OoO versus in-order, superscalar vs VLIW, CMP vs SMP vs SMT vs uni, and so on.
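The μop indirection can be pictured with a toy decoder. The sketch below is not how any real x86 front end works: the tuple instruction format, the `crack` function, and the temporary register `t0` are all invented for illustration. It only shows why one instruction at the assembly level can become several operations in the pipeline.

```python
def crack(inst):
    """Split one CISC-style (op, dst, src) instruction into RISC-like micro-ops.

    Operands written as "[addr]" are memory operands; anything else is a register.
    """
    op, dst, src = inst
    if dst.startswith("["):
        # Read-modify-write on memory: load, operate on a temporary, store back.
        return [("load", "t0", dst), (op, "t0", src), ("store", dst, "t0")]
    if src.startswith("["):
        # Memory source: load into a temporary first, then operate.
        return [("load", "t0", src), (op, dst, "t0")]
    # Register-to-register: already RISC-like.
    return [(op, dst, src)]

program = [
    ("add", "eax", "ebx"),
    ("add", "eax", "[0x40]"),
    ("add", "[0x40]", "ebx"),
]

uops = [u for inst in program for u in crack(inst)]
for u in uops:
    print(u)
print(len(program), "instructions ->", len(uops), "micro-ops")  # 3 instructions -> 6 micro-ops
```

The pipeline being modelled sees the six micro-ops, not the three instructions, so a hook placed at the micro-op level no longer corresponds one-to-one with what the compiler emitted.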

The last two items on my list tend to be opposed to each other: open-source simulators tend to be supported by small communities of academic (mostly graduate student) users, whereas commercially supported simulators tend not to allow much tinkering with the microarchitecture (since the source code is not open, and the simulator implementation is proprietary). It is critical that the simulated microarchitecture be open or extensible to enable research that modifies pipeline elements, adds new features, changes dataflow and control paths, etc. All the simulators of which I'm aware that provide enough flexibility for architectural research do so by providing the source code for modification, thus precluding proprietary/commercial simulators. There have been efforts to provide a plug-in framework for microarchitecture research, although I'm not aware of any current commercial simulators that support such plug-ins.

Simulators
The following notes cover some simulators that I have used or looked into using. The UW architecture group has a page of links that covers many of the available architecture simulators and related tools.

SimpleScalar is an easy-to-modify, academically licensed, source-available cycle-accurate simulator that lacks full-system capabilities and supports only uniprocessor architectures, with the Alpha, MIPS, and ARM instruction sets among those supported. This simulator is no longer actively maintained or supported.

An academic project out of UIUC, SESC is a cycle-accurate superscalar OoO simulator that supports CMP (multicore) platforms. The only instruction set supported is MIPS, and there is no support for full-system simulation.

Originally an academic project, Simics was commercialized by Virtutech and sold to Wind River, an Intel subsidiary. Virtutech, and now Wind River, provide academic licensing for Simics, which provides a limited set of the full Simics suite of processor models. The most detailed models in the academic suite are the SPARC models, followed by the x86. Simics provides full-system functional simulation with multicore platforms that executes an in-order model with 1 cycle per instruction. The microarchitectural interface (MAI) provided a plug-in framework for researchers to observe instructions and hook timing functionality to simulate varying feature latency; however, MAI is no longer supported by versions of Simics past 4.0. Simics also allows for user decoders to be defined that can re-define the functional behavior of instructions. User decoders support ISA extensions and tweaks while still maintaining functional fidelity.

Multifacet GEMS (more commonly just Simics GEMS) is a project from UW-Madison that provides modules for Simics without using the MAI. GEMS implements an OoO processor (Alpha) model for the SPARC-V9 instruction set (specifically, the UltraSPARC III+ instruction set) with a detailed CMP memory network. GEMS is composed of two primary modules, Opal and Ruby. Opal is the OoO cycle-accurate simulator that relies on Simics for some of the difficult full-system features; when Opal does something that is detected as functionally incorrect, it squashes its work and reads architected state from Simics. Ruby is a complex memory subsystem intended for easing research in the protocols and interconnects of CMPs. Opal is designed to call into Ruby for its detailed memory hierarchy, although Opal also has a simple two-level memory hierarchy that is usable in uniprocessor mode. Most work uses Opal+Ruby, some researchers use only Ruby (since they are interested only in the memory subsystem), and there are even efforts to port Ruby to other simulators to provide the detailed memory hierarchy.
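The squash-and-reload behaviour described for Opal can be sketched in a few lines of Python. This is a toy under loud assumptions, not GEMS code: `reference_step` stands in for the trusted functional simulator (the role Simics plays), `timing_step` stands in for the timing simulator's own execution and deliberately does not implement `mul`, and `SQUASH_PENALTY` is an invented cost.

```python
SQUASH_PENALTY = 10  # hypothetical cycles charged for a squash and state reload

def reference_step(regs, inst):
    """Trusted functional simulator: always computes the correct architected state."""
    op, dst, a, b = inst
    if op == "add":
        regs[dst] = regs[a] + regs[b]
    elif op == "sub":
        regs[dst] = regs[a] - regs[b]
    elif op == "mul":
        regs[dst] = regs[a] * regs[b]

def timing_step(regs, inst):
    """Timing simulator's own execution; returns the cycles it charges.

    It deliberately lacks 'mul', to mimic a timing model that does not
    implement every corner of the ISA.
    """
    op, dst, a, b = inst
    if op == "add":
        regs[dst] = regs[a] + regs[b]
    elif op == "sub":
        regs[dst] = regs[a] - regs[b]
    return 1

def run_timing_first(program, initial):
    ref = dict(initial)
    timing = dict(initial)
    cycles = squashes = 0
    for inst in program:
        cycles += timing_step(timing, inst)   # the timing simulator runs first
        reference_step(ref, inst)             # then the reference checks its work
        if timing != ref:
            squashes += 1
            cycles += SQUASH_PENALTY
            timing = dict(ref)                # reload architected state
    return timing, cycles, squashes

program = [
    ("add", "r3", "r1", "r2"),
    ("mul", "r4", "r3", "r1"),
    ("sub", "r5", "r4", "r2"),
]
state, cycles, squashes = run_timing_first(program, {"r1": 2, "r2": 3})
print(state["r5"], cycles, squashes)  # 7 13 1
```

The appeal of the arrangement is that the timing model never has to be functionally complete: whatever it gets wrong is caught by the comparison and corrected from the reference, at the cost of some timing error around each squash.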

PTLsim is an open-source cycle-accurate x86 simulator. The base PTLsim does not provide full-system simulation nor does it model multicore, although full-system features are provided through Xen and more recently KVM/Qemu. MPTLsim was also presented at DAC, but I have not seen any mention of its implementation. It is an interesting project though, especially if x86 is a compelling architecture for a particular research problem. I haven't personally tried to use PTLsim, although our group did task an undergrad to give it a whirl -- he had difficulty with building and running it, although this was for PTLsim/X. The newer version relying on KVM might be better supported.

I have previously talked about M5 on this blog. It is another academic simulator, originating at Michigan, that provides a combination of full-system (FS) or syscall emulation (SE) modes, cycle accuracy, and architectural models, although at present only the Alpha instruction set is supported in FS mode with a cycle-accurate (OoO) model. I believe the origin of M5 was to study networking, so the memory hierarchy is not particularly robust. There is an effort called GEM5 (clever!) to port the Ruby memory hierarchy to M5. Community support for M5 is decent, although best-effort.

Emulators
Loosely related to architectural simulators are processor emulators and hypervisors. I looked at using two of these, but they do not model the architecture in detail enough to support easy microarchitectural research or cycle-accurate timing.

Bochs is a full-system emulator for some of the x86 and x86_64 ISAs that interprets guest instructions rather than translating them. When I looked into Bochs, it did not support multicore or SMP, although that may have changed as it is still an active project. I also read, but never verified, that emulation in Bochs is very slow.

A binary-translating processor emulator, QEMU supports a broader range of architectures and is fairly efficient. I actually do use it for rapid prototyping in some work that I do, but only for application and kernel development, not for architectural research. It is an active project.

Well, that wraps up my view on the current architectural simulators. If you have any experiences, differing opinions, or other simulators that have compelling features please feel free to drop me a line.

A short note on Simics: the MAI reached end of life in Simics 3.0 and is not available in later versions of Simics, which put greater emphasis on fast simulation of complex machines. The current academic version of Simics is 4.2.

> There have been efforts to provide a plug-in framework for microarchitecture research, although I'm not aware of any current commercial simulators that support such plug-ins.

This is very true - in practice, all real work on processor architecture has turned out to be done using source-code modification to a simulator. It simply seems to be the fact that any framework is going to be built based on today's architectures - and therefore they are probably unable to model the fun things people are doing tomorrow.

Examples of this were clear when SMT came along, not to mention transactional memory or scout threads.

The GEMS timing-first approach is probably the best compromise.

Note that another way around this is to use a commercial simulator for the device ring and machine, and then plug in a detailed, home-made, complete simulator of the processor and memory hierarchy. For this to work, you do need to have a CPU model complete enough to run a real OS, though.

Hi Gedare, thanks for the very interesting post. Did any of you look into COTSon (http://sites.google.com/site/hplabscotson/)? It is based on AMD's SimNow and only simulates AMD processors, but beyond that it is fairly complete and relatively easy to set up.

Hi Gedare, thanks for such a nice blog. I was wondering if it's possible to combine McPAT (a power tool) with GEM5. I would like to simulate power-aware dynamic cache resizing using GEM5. Do you have any idea which file I have to work on? Thanks.

Thanks. I took a look at Sniper (http://snipersim.org) recently. It admits up to 25% error in timing accuracy, so its usefulness for estimating the performance of architectural modifications is questionable. However, it offers a good tradeoff for simulation speed if you don't need to measure performance with less than 25% error.

I wish to simulate the addition of some new pipeline stages with specific interconnections to the memory hierarchy and branch prediction unit. Which simulator do you think would be the best to model this and observe the delays with the changes?