On 2D Performance Measurement

Trying to get a handle on 2D graphics rendering performance can be a
difficult task. Obviously, people care about the performance of their
2D applications. Nobody wants to wait for a web browser to scroll past
tacky banner ads or for an email client to render a screen full of
spam. And it's easy for users to notice "my programs aren't rendering
as fast with the latest drivers". But what developers need is a way to
quantify exactly what that means, in order to track improvements and
avoid regressions. And that measurement is the hard part. Or at least
it always has been hard, until Chris Wilson's recent cairo-perf-trace.

Previous attempts at 2D benchmarking

Various attempts at 2D-rendering benchmark suites have appeared and
even become popular. Notable examples are x11perf and gtkperf. My
claim is that these tools range from useless to actively harmful when
the task is understanding performance of real applications.

These traditional benchmark suites are collections of synthetic
micro-benchmarks. Within a given benchmark, some tiny operation, (such
as "render a line of text" or "draw a radio box"), is performed
hundreds of times in a tight loop and the total time is measured. The
hope is that these operations will simulate the workload of actual
applications.

Unfortunately, the workload of things like x11perf and gtkperf rarely
comes close to simulating practical workloads. In the worst case, the
operation being tested might never be used at all in modern
applications, (notice that x11perf tests things like stippled fills
and wide ellipses which are obsolete graphics operations). Similarly,
even if the operation is used, (such as a GTK+ radio button), it might
not represent a significant fraction of time spent rendering by the
application, (which might spend most of its time drawing its primary
display area rather than any stock widget).

So that's just the well-known idea to not focus on the performance of
things other than the primary bottlenecks. But even when we have
identified a bottleneck in an application, x11perf can still be the
wrong answer for measurement. For example, "text rendering" is a
common bottleneck for 2D applications. However, a test like "x11perf
aa10text" which seems like a tempting way to measure text performance
is far from ideal. This benchmark draws a small number of glyphs from
a single font at a single size over and over. Meanwhile, a real
application will use many glyphs from many fonts at many sizes. With
layers and layers of caches throughout the graphics stack, it's really
not possible to accurately simulate what "text rendering" means for a
real application without actually just running the actual application.
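To make the caching point concrete, here is a toy sketch (not a model of any real font stack). It assumes a single LRU glyph cache of 256 entries and a made-up `rasterize` stand-in; the fonts, sizes, and request counts are all hypothetical. It compares the cache hit rate of a micro-benchmark-style request stream with a more application-like one:

```python
import random
from functools import lru_cache


def cache_hit_rate(requests, cache_size=256):
    """Return the fraction of glyph requests served from a toy LRU cache."""

    @lru_cache(maxsize=cache_size)
    def rasterize(font, size, char):
        # Hypothetical stand-in for expensive glyph rasterization.
        return (font, size, char)

    for font, size, char in requests:
        rasterize(font, size, char)
    info = rasterize.cache_info()
    return info.hits / (info.hits + info.misses)


# Micro-benchmark style: the same ten glyphs, one font, one size, repeated.
micro = [("Sans", 10, c) for c in "abcdefghij"] * 1000

# Application style: many glyphs from many fonts at many sizes
# (assumed values, chosen only for illustration).
rng = random.Random(0)
fonts = ["Sans", "Serif", "Mono", "Display"]
sizes = [8, 10, 12, 14, 18, 24]
chars = [chr(c) for c in range(0x21, 0x7F)]
app = [(rng.choice(fonts), rng.choice(sizes), rng.choice(chars))
       for _ in range(10_000)]

micro_rate = cache_hit_rate(micro)
app_rate = cache_hit_rate(app)
print(f"micro-benchmark hit rate: {micro_rate:.3f}")
print(f"application-like hit rate: {app_rate:.3f}")
```

The repeated stream hits the cache almost every time, while the varied stream mostly misses. A real stack has several such caches with different sizes and policies, which is exactly why a synthetic loop tells you so little.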

And yes, I myself have used and perhaps indirectly advocated for using
things like x11perf in the past. I won't recommend it again in the
future. See below for what I suggest instead.

What do the 3D folks do?

For 3D performance, everybody knows this lesson already. Nobody
measures the performance of "draw the same triangles over and
over". And if someone does, (by seriously quoting glxgears fps numbers,
for example), then everybody gets a good laugh. In fact, the phrase
"glxgears is not a benchmark" is a catchphrase among 3D
developers. Instead, 3D measurement is made with "benchmark modes" in
the 3D applications that people actually care about, (which as far as
I can tell is just games for some reason). In the benchmark mode, a
sample session of recorded input is replayed as quickly as possible
and a performance measurement is reported.

As a rule, our 2D applications don't have similar benchmark
modes. (There are some exceptions such as the trender utility for
mozilla and special command-line options for the swfdec player.) And
coding up application-specific benchmarking code for every interesting
application isn't something that anyone is signing up to do right now.

Introducing cairo-perf-trace

Over the past year or so, Chris "ickle" Wilson has been putting a lot
of work into a debugging utility known as cairo-trace, (inspired by
work on an earlier tool known as libcairowrap by Benjamin Otte and
Jeff Muizelaar). The cairo-trace utility produces a trace of all
cairo-based rendering operations made by an application. The trace is
complete and accurate enough to allow all operations to be replayed
with a separate tool.

The cairo-trace utility has long proven invaluable as a way to capture
otherwise hard-to-reproduce test cases. People with complex
applications that exhibit cairo bugs can generate a cairo-trace and
often easily trim it down to a minimal test case. Then after
submitting this trace, a developer can replicate this bug without
needing to have a copy of the complex application nor its state.

More recently, Chris wrote a new "cairo-trace --profile" mode and a
tool named cairo-perf-trace for replaying traces for benchmarking
purposes. These tools are currently available by obtaining the cairo
source code, (either from git or in the 1.9.2 development snapshot or
eventually the 1.10 release or later). Hopefully we'll see them get
packaged up so they're easier to use soon.

With cairo-perf-trace, it's a simple matter to get rendering
performance measurements of real applications without having to do any
modification of the application itself. And you can collect a trace
based on exactly the workload you want, (as long as the application
you are interested in performs its rendering with cairo). Simply run:

cairo-trace --profile some-application

Which will generate a compressed file named something like
some-application.$pid.lzma. To later benchmark this trace, first
uncompress it:

lzma -cd some-application.$pid.lzma > some-application.trace

And then run cairo-perf-trace on the trace file:

cairo-perf-trace some-application.trace

The cairo-perf-trace utility will replay several iterations of the
trace, (waiting for the standard deviation among reported times to
drop below a threshold), and will report timing results for both the
"image" backend (cairo's software backend) and whatever native backend
is compiled into cairo, (xlib, quartz, win32, etc.). So one
immediately useful result is that it's obvious when the native backend
is slower than the all-software backend. Then, after making changes to
the graphics stack, subsequent runs can be compared to ensure
regressions are avoided and performance improvements actually help.
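The "replay until the timings settle" idea can be sketched as follows. This is not cairo-perf-trace's actual implementation; the function names, the 5% relative threshold, and the minimum and maximum run counts are all assumptions made for illustration:

```python
import statistics
import time


def is_stable(times, rel_threshold=0.05, min_runs=5):
    """True once the standard deviation is a small fraction of the mean."""
    if len(times) < min_runs:
        return False
    mean = statistics.mean(times)
    return mean > 0 and statistics.stdev(times) / mean < rel_threshold


def measure_until_stable(sample, rel_threshold=0.05, min_runs=5, max_runs=100):
    """Call sample() (which returns one elapsed time in seconds) until the
    collected times are stable, or until max_runs is reached."""
    times = []
    while len(times) < max_runs:
        times.append(sample())
        if is_stable(times, rel_threshold, min_runs):
            break
    return times


def timed(run_once):
    """Wrap a callable so that each call returns its wall-clock duration."""
    def sample():
        start = time.perf_counter()
        run_once()
        return time.perf_counter() - start
    return sample


# Hypothetical usage: replay_trace would replay one iteration of a trace.
# times = measure_until_stable(timed(replay_trace))
# print(min(times), statistics.median(times))
```

Capping the number of runs matters in practice: on a noisy machine the spread may never drop below the threshold, and the loop should still terminate.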

Finally, Chris has also established a cairo-traces git repository
which collects useful traces that can be shared and compared. It
already contains
several different browsing sessions with firefox, swfdec traces (one
with youtube), and traces of poppler, gnome-terminal, and
evolution. Obviously, anyone should feel free to generate and propose
new traces to contribute.

Putting cairo-perf-trace to use

In the few days that cairo-perf-trace has existed, we're already
seeing great results from it. When Kristian Høgsberg recently proposed
a memory-saving patch for the Intel driver, Chris Wilson followed up
with a cairo-perf-trace report showing that the memory-saving had no
negative impact on a traced firefox session, which addressed the
concern that Eric had about the patch.

As another example, we've known that there's been a performance
regression in UXA (compared to EXA) for trapezoid rendering. The
problem was that UXA was allocating a pixmap only to then use
software-based rasterization to that pixmap (resulting in slow
read-modify-write cycles). The obvious fix I implemented is to simply
malloc a buffer, do the rasterization, and only then copy the result
to a pixmap.

After I wrote the patch, it was very satisfying to be able to validate
its real-world impact with a swfdec-based trace. This trace is based
on using swfdec to view the Giant Steps movie. When running this
trace, sysprof makes it obvious that
trapezoid rendering is the primary bottleneck. Here is the output of
cairo-perf-trace on a GM965 machine before my patch:

The performance problem is quite plain here. Replaying the swfdec
trace to the X server takes 194 seconds compared to only 45 seconds to
replay it only to cairo's all-software image backend. Note that 194
seconds is longer than the full video clip, meaning that my system
isn't going to be able to keep up without skipping here. That's
obviously not what we want.

After my patch, the xlib result has improved from 194 seconds to 81
seconds. That's a 2.4x improvement, and fast enough to now play the
movie without skipping. It's very satisfying to validate performance
patches with real-world application code like this. This commit is in
the recent 2.7.99.901 version of the Intel driver, by the way. (Of course,
there's still a 1.8x slowdown of the xlib backend compared to the
image backend, so there's still more to be fixed here.)
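For anyone who wants to check the arithmetic, the ratios quoted above come straight from the three timings in the text (194, 81, and 45 seconds). The helper name here is just for illustration:

```python
def ratio(slower_s, faster_s):
    """How many times longer the slower run took than the faster one."""
    return slower_s / faster_s


# Timings in seconds, as reported in the text.
xlib_before = 194.0
xlib_after = 81.0
image = 45.0

improvement = ratio(xlib_before, xlib_after)   # xlib, before vs. after
remaining_gap = ratio(xlib_after, image)       # patched xlib vs. image
original_gap = ratio(xlib_before, image)       # unpatched xlib vs. image

print(f"xlib improvement from the patch: {improvement:.1f}x")
print(f"xlib vs. image after the patch:  {remaining_gap:.1f}x slower")
print(f"xlib vs. image before the patch: {original_gap:.1f}x slower")
```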

The punchline is that we now have an easy way to benchmark 2D
rendering in actual, real-world applications. If you see someone
benchmarking with only toys like x11perf or gtkperf, go ahead and
point them to this post, or the cairo-perf-trace entry in the cairo
FAQ, and insist on benchmarks from real applications.