Synchronous compositing in the i965 driver

A couple of weeks ago I aired my confusion about a
situation in which using EXA with an i965 card was 4 times slower than
using NoAccel, but that there weren't any large bottlenecks visible in
the profile at all, (nothing over 5%). And that didn't seem likely at all.

As is often the
case
this wasn't a wrong answer, but was instead a problem with me not
understanding the question I had asked. I had been profiling over
enttire runs of mozilla's Trender benchmark, but the 4x slowdown was
based on the results it reported which measured only the inner loop of
rendering itself. So my profiles including extraneous computation such
as page loading that Trender was explicitly not measuring. This was
actually easy to determine by simply measuring the total runtime of
Trender and seeing that it slowed down by only 1.5x instead of 4x.

So to get better profiling data, I changed strategy slightly and
started using an on-demand Trender technique that
Vladimir offered me. It makes it easy to run a one-off Trender loop
for hundreds of iterations so I can profile exactly the rendering and
nothing else.

So I used that approach to measure the rendering performance of a
worthwhile web page, with
both NoAccel and EXA, (again with an Intel i965 card). I used my
standard profile script to generate
piles of data.

The trickiest part was then coming up with a reasonable way to
visualize the data to try to figure out what was going on. My brother,
Richard, pulled his usual HTML+CSS magic and came up with the
following for me.

Note: If you click through to my blog, you'll get to see these charts
in their colorful glory. Otherwise, if you're reading this
somewhere like http://planet.gnome.org then you'll probably
just see the boring list version. But one really cool thing
about Richard's hack here is that in the absence of CSS
styling this chart does degrade nicely to just list with all the data
available. So these charts should be much more accessible than things
I've done previously with PNG bar chart images. Plus, I get to
embed hyperlinks at all the intuitive spots as well. Anyway, Richard
totally rocks for having put this together.

So there are some big percentages there---much nicer than the little
4% and 5% things I was seeing a couple of weeks ago. So, yes, it
definitely helps to know that you're asking the question you mean to
be. There are several links in the chart above to different profile
reports. If you drill down the libc profile, you'll see that there's a
lot of memcpy going on.

An easy guess is that the memcpy is due to pixmap migration due to
software fallbacks. To verify this guess, I first disabled all pixmap
migration from video memory to system memory, (which was as simple as
if (! can_accel) return; in exaDoMigration. This actually got rid of
most of the memcpy problem, and didn't cause any incorrect
rendering. Next I enabled DEBUG_TRACE_FALL in exa_priv.h to
identify all fallbacks that were getting hit by the benchmark and I
put early returns in place to disable them. Unsuprisingly, this did
cause a lot of incorrect rendering.

And of course, what we really want to do is to find what is triggering
those fallbacks and accelerate them. But for now, I wanted to see how
performance would be if we were already at the point that we had no
software fallbacks at all. Here's what I got:

So, as hoped, a lot of the libc time went away. But there's still
a heck of a lot left, as well as kernel system-call time. That's
clearly in gettimeofday and that's easy to track down to the
I830WaitLpRing function. Fortunately, it's also extremely easy to
eliminate that system call almost entirely there. The only reason it's
being called is to ensure that we don't busy-wait forever, (due to a
video card lockup). So we could call gettimeofday only once every
several thousand iterations and still detect lockup very
quickly. Here's a
patch
to do exactly that.

Again, a lot of the libc time was eliminated, and a huge amount
of the kernel time was eliminated as well. But, sadly, and
dramatically, the overall performance did not improve much at all. The
time spent in the Intel driver increased not only in percentage, but
also increased in absolute timing, taking up almost all of the time
saved from libc and the kernel.

What's going on here? What's happening is that the driver is spending
an awful lot of time busy-waiting, and all we did here was to help it
busy-wait even faster. Now, we know that there aren't any software
fallbacks happening here, so what's all the waiting about? It appears
to be caused primarily by i965_prepare_composite which includes the
following comment and code:

/* Because we only have a single static buffer for our state currently,
* we have to sync before updating it every time.
*/
i830WaitSync(pScrn);

That is, when we want to render multiple composite operations in
sequence, instead of doing some nice, pipelined fire-and-forget
operations, the driver is currently waiting for the previous composite
operation to entirely complete before starting off another. And it's
doing that waiting with CPU-burning busy-waiting.

I'll need to learn a bit more about this hardware before I can know
the right way to fix this, (my, but documentation would certainly be
helpful about now). For example, it would be easy to setup a circular
buffer of state structures. But can I get the hardware to tell me when
composite operations complete so that I can efficiently reuse those?
Another approach would be to stuff the necessary compositing state
into the LP ring the same way the compositing commands are being
sent. Is there any reason not to do that?

Meanwhile, it looks like there's a bunch of state in the strucure that
needn't be sent every time, (much of it is always constant). So there
are likely easy ways to improve that as well. I'm definitely looking
forward to people with more Intel-hardware-specific knowledge helping
me out here, (or just plain fixing the thing).

Meanwhile, even if the driver time were completely eliminated, the
rest of the time in this profile would still exceed what was spent in
the NoAccel case. Some things are plainly working well, such as the
fact that the pixman time has gone down. Some things are not working
well, such as the new time being spent in libexa, (which appears to be
mostly an inefficient approach for computing an LRU heuristic as I
mentioned earlier). Some of the increased time, such as that of libc,
might be related to driver problems that we hope to fix.

But one item in particular looks very strange. The libxul code is
spending signficantly more time in the EXA case than in the NoAccel
case, when we wouldn't expect it to be aware of the difference at
all. That will be an interesting thing to track down, (and it sure
would be nice if that were the biggest thing to worry about right
now). But clearly we've got a lot of work to do to get the i965
performing properly with EXA.