Individual Benchmark Results

FAQ

asm.js appears in one test in Octane and in several
tests in JetStream, which is great, but some aspects of asm.js performance are
not fully measured in those. In particular, asm.js
often appears as large (you might even say massive) source files, which can have different performance characteristics than the typical
small programs appearing in most benchmarks. For example, a very large codebase may contain very large functions,
which due to their size are difficult to optimize efficiently (either not fully optimized, or optimized slowly in a noticeable way),
or just the sheer number of functions may be very high and cause the browser to pause as the codebase is parsed and starts to execute.
Such very large codebases can therefore bring new challenges to JavaScript engines (or rather, more extreme versions of familiar challenges),
and it is important to measure performance on them because they are showing up with increasing frequency on the web (for example,
as native plugins are fading out, game engine companies like Unity and
Epic are starting to compile their large codebases to asm.js).
For these reasons, the Massive benchmark includes several very large codebases (Poppler, SQLite, etc.), and measures throughput as well as
responsiveness, variability and startup time (see details below).

The Emscripten benchmark suite evolved over time in order
to benchmark Emscripten itself, and therefore mainly focused on throughput, and is runnable in both shell and
browser. Massive, on the other hand, tests not just
throughput but also browser responsiveness and other factors that only make sense when running in a browser, things
not measured by the Emscripten benchmark suite (or by the main JavaScript benchmarks).

Main Thread Responsiveness measures the user experience as a large codebase is loaded. What is tested is whether
the main thread stalls as the codebase is prepared and executed for a short while. The score here can be improved
by parsing the code off the main thread, for example. This does not measure how much time is spent,
but only how responsive or unresponsive the user experience is (how much time is spent is measured by
Preparation, and to some extent Throughput). Technically, we measure responsiveness by seeing if events on
the main thread execute at the proper interval (as when the main thread stalls, it stalls both the user
experience and other events).
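The timer-drift technique described above can be sketched as follows. This is an illustrative sketch, not Massive's actual code: schedule a repeating timer at a fixed interval, record when each callback actually fires, and treat late firings as evidence that the main thread was stalled. All function names here are made up for the example.

```javascript
// Given the timestamps at which a repeating timer actually fired, and
// the interval it was scheduled at, compute how late each firing was.
// A stalled main thread shows up as large lag values.
function computeLags(timestamps, intervalMs) {
  const lags = [];
  for (let i = 1; i < timestamps.length; i++) {
    const actual = timestamps[i] - timestamps[i - 1];
    lags.push(Math.max(0, actual - intervalMs));
  }
  return lags;
}

// Summarize a run: the worst single stall, and the average lag.
function summarizeResponsiveness(lags) {
  const worst = Math.max(...lags);
  const mean = lags.reduce((a, b) => a + b, 0) / lags.length;
  return { worst, mean };
}

// In a browser, the timestamps would be collected on the main thread
// while the script tag for the large codebase is being prepared, e.g.:
//   const stamps = [];
//   const id = setInterval(() => stamps.push(performance.now()), 10);
//   // ...later: clearInterval(id); computeLags(stamps, 10);
```

For example, a timer scheduled every 10ms that fires at 0, 10, 60 and 70ms had one 40ms stall in the middle, which this sketch reports as the worst case.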

Throughput measures how fast a large computational workload runs. This is what is typically measured by
benchmarks. Massive's throughput tests focus on very large real-world codebases.

Preparation measures how much wall-clock time is spent getting a codebase ready to execute, before
any of it actually runs. This measures how much
time passes between adding a script tag with that code and being able to call the code (this may or may not
cause a user-noticeable pause, depending on whether it is parsed on or off the main thread; Main Thread Responsiveness
tests that aspect). "Preparation" is basically all the time before code is actually able to run; that may include
parsing, conversion to bitcode, JIT compilation, etc., depending on the JS engine.

Variance measures how variable the frame rate is in an application that needs to run in each frame
(this is important in things like games, which must finish all their work every 1/60th of a second in order to be
smooth). Specifically, we run many frames and then calculate the statistical variance and worst case. Note that
one VM might have a much faster overall frame rate than another, but also more variance: in general, given two
VMs with the same average, the one with less variance is "better" since it's smoother. But given a different mean,
things are less clear (perhaps we are happy to accept some average slowdown in order to reduce variance, which can
cause noticeable but rare pauses?). Hence we measure variance separately from throughput (which is a measurement of the total
speed, and is proportional to the average).
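The calculation described above — run many frames, then take the statistical variance and the worst case of the frame times — can be sketched like this (the function name is illustrative, not from Massive's source):

```javascript
// Compute the mean, statistical (population) variance, and worst case
// of a set of per-frame times, in milliseconds.
function frameStats(frameTimesMs) {
  const n = frameTimesMs.length;
  const mean = frameTimesMs.reduce((a, b) => a + b, 0) / n;
  const variance =
    frameTimesMs.reduce((a, t) => a + (t - mean) * (t - mean), 0) / n;
  const worst = Math.max(...frameTimesMs);
  return { mean, variance, worst };
}
```

This makes the point in the text concrete: frame times of [16, 16, 16, 16] and [8, 24, 8, 24] have the same mean (16ms, i.e. the same throughput), but the first has zero variance while the second has variance 64 and a 24ms worst case, so the first is smoother.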

Most of the tests, in particular the throughput ones, are generally very consistent, as we run a deterministic workload
in a web worker, which minimizes outside noise. We also run a few repetitions and average the results. However, in particular
the Main Thread Responsiveness tests need to run on the main thread, and they involve DOM events like adding a script tag,
setInterval, etc., which can be fairly variable. We run a larger number of repetitions on those tests to average out the
noise, but even so they appear to be less consistent between runs on some browsers.

When we see the results of a test are too variable, we mark it with "(±X%!)" next to the score. The cause of such variability
might be something else on your machine (perhaps a background indexing service happened to use a CPU core during a test, etc.),
or it might be that the browser behaves unpredictably for some reason.
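The averaging and "(±X%!)" flagging just described can be sketched as below. The exact formula and cutoff Massive uses are not stated here, so this sketch simply takes the half-spread of the repetitions relative to their mean and flags it when it exceeds an assumed 10% cutoff; both are assumptions for illustration.

```javascript
// Average repeated runs of one test, and flag the result as too
// variable when the spread across runs is large relative to the mean.
// The 10% default cutoff is an assumption, not Massive's actual value.
function averageRuns(scores, maxSpreadPercent = 10) {
  const mean = scores.reduce((a, b) => a + b, 0) / scores.length;
  // Half the min-to-max spread, as a percentage of the mean ("±X%").
  const spread =
    (100 * (Math.max(...scores) - Math.min(...scores))) / (2 * mean);
  return { mean, spread, flagged: spread > maxSpreadPercent };
}
```

For example, runs of [95, 100, 105] average to 100 with a ±5% spread (not flagged), while [50, 100, 150] also average to 100 but with a ±50% spread, which would be reported with a "(±50%!)" marker.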

Box2D: A 2D physics engine, used in many games, for example Angry Birds. Stresses
floating-point processing performance. The workload is based on
jgw's bench2D. (~30KLOC)

Lua: A script language that is used in many games as well as on Wikipedia. Here the
entire Lua VM is compiled down to JavaScript, including interpreter, garbage
collector, etc. The workloads used are the scimark and binarytrees benchmarks,
which test raw computation and garbage collection, respectively. (~16KLOC)

Poppler: A PDF rendering engine, used by many applications, for example LibreOffice.
Rendering PDFs requires many capabilities (font rendering, graphics, etc.),
making this the largest of the codebases tested here, especially since it is
built together with the FreeType font rendering library. The workload is
Lawrence Lessig's "Free Culture". (~250KLOC)

SQLite: A complete transactional SQL database engine. Parsing and executing SQL
queries is done using a large interpreter-loop type function, which is
challenging to optimize. The workload is the SQLite speedtest1.c benchmark,
which SQLite devs constructed to represent real-world usage patterns. (~128KLOC)

All of these codebases are open source, so you can build and inspect them yourself (the build
tool, Emscripten, is of course open source as well).

Note that the KLOC numbers mentioned above do not include system libraries like libc and libc++,
even though the parts of them that the benchmarks need are included in the builds.

Massive generally takes quite a while to run, as it is designed to execute fixed workloads of sufficient length to
measure real-world performance on large applications. How long it takes will depend on the machine and
browser, of course, but you can probably expect it to take at least a few minutes (on a desktop or laptop
machine; a mobile device may take much more). Massive should not lock up your
browser as it runs, however - except for the Main Thread Responsiveness tests, which run first, benchmarks
are run in web workers (and even the Main Thread Responsiveness tests should not reduce responsiveness
very much). Note that results of individual benchmarks show up when ready, so you can view those
before all of Massive is complete.

Some calculations have an "absolute optimal" result. For example, Variance measures how variable the frame rate is. If the frame rate is practically still - no jumping around at all - then the result is the maximum score of 10,000. For practical reasons, there is an absolute threshold: In the case of Variance, anything under 5ms is considered perfect; this avoids large differences between results like 2ms and 4ms (double the variance in the second!), because 5ms is already so small as to be below the threshold of noticeability.
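The thresholding idea above can be sketched as follows. The maximum score of 10,000 and the 5ms cutoff for Variance are stated in the text; how scores scale above the threshold is not, so the inverse scaling below is an assumption for illustration only.

```javascript
const MAX_SCORE = 10000;  // maximum score, as stated above
const THRESHOLD_MS = 5;   // Variance's "already imperceptible" cutoff

// Score a variance result: anything at or under the threshold is
// treated as perfect, so 2ms and 4ms both score the maximum rather
// than differing by 2x. Above the threshold, this sketch assumes
// simple inverse scaling (not necessarily Massive's exact formula).
function varianceScore(ms) {
  if (ms <= THRESHOLD_MS) return MAX_SCORE;
  return MAX_SCORE * (THRESHOLD_MS / ms);
}
```

Under this sketch, 2ms and 4ms both score 10,000, while 10ms — twice the threshold — scores 5,000.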