Core Rendering with Glamor

I've hacked up the intel driver to bypass all of the UXA paths
when Glamor is enabled so I'm currently running an X server that uses
only Glamor for all rendering. There are still too many fallbacks, and
performance for some operations is not what I'd like, but it's
entirely usable. It supports DRI3, so I even have GL applications
running.

Core Rendering Status

I've continued to focus on getting the core X protocol rendering
operations complete and correct; those remain a critical part of many
X applications and are a poor match for GL. At this point, I've got
accelerated versions of the basic spans functions, filled rectangles,
text and copies.

GL and Scrolling

OpenGL has been on a many-year vendetta against one of the most common
2D accelerated operations -- copying data within the same object, even
when that operation overlaps itself. This used to be the most
performance-critical operation in X; it was used for scrolling your
terminal windows and when moving windows around on the screen.

Reviewing the OpenGL 3.x spec, Eric and I both read glCopyPixels as
clearly requiring correct semantics for overlapping
copy operations -- it says that the operation must be equivalent to
reading the pixels and then writing the pixels. My CopyArea
acceleration thus uses this path for the self-copy case. However, the
ARB decided that having a well defined blt operation was too nice to
the users, so the current 4.4 specification adds explicit language to
assert that this is not well defined anymore (even in the face of the
existing language which is pretty darn unambiguous).

I suspect we'll end up creating an extension that offers what we need
here; applications are unlikely to stop scrolling stuff around, and
GPUs (at least Intel) will continue to do what we want. This is the
kind of thing that makes GL maddening for 2D graphics -- the GPU does
what we want, and the GL doesn't let us get at it.

For implementations not capable of supporting the required semantic,
someone will presumably need to write code that creates a temporary
copy of the data.
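A minimal sketch of that fallback in plain C (a hypothetical function, operating on ordinary memory rather than GL objects) makes the required read-then-write semantics concrete:

```c
#include <stdlib.h>
#include <string.h>

/* Copy a span of pixels within one buffer with read-then-write
 * semantics, even when source and destination overlap: stage all of
 * the source pixels in a temporary buffer, then write them out. */
static void
copy_overlapping(unsigned char *pixels, size_t dst_off,
                 size_t src_off, size_t len)
{
    unsigned char *tmp = malloc(len);
    if (!tmp)
        return;
    memcpy(tmp, pixels + src_off, len);   /* read every source pixel */
    memcpy(pixels + dst_off, tmp, len);   /* then write them all */
    free(tmp);
}
```

The temporary buffer guarantees that every source pixel is read before any destination pixel is written, which is exactly the behavior the older glCopyPixels language promised.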

PBOs for fallbacks

For operations which Glamor can't manage, we need to fall back to
using a software solution. Direct-to-hardware acceleration
architectures do this by simply mapping the underlying GPU object to
the CPU. GL doesn't provide this access, and it's probably a good
thing as such access needs to be carefully synchronized with GPU
access, and attempting to access tiled GPU objects with the CPU
require either piles of CPU code to 'de-tile' accesses (ala wfb), or
special hardware detilers (like the Intel GTT).

However, GL does provide a fairly nice abstraction called pixel buffer
objects (PBOs), which speed CPU access to GPU data.

The fallback code allocates a PBO for each relevant X drawable, asks
GL to copy pixels in, and then calls fb (the software rasterizer), with
referencing the temporary buffer. On the way back out, any potentially
modified pixels are copied back through GL and the PBOs are freed.

This turns out to be dramatically faster than malloc'ing temporary
buffers as it allows the GL to allocate memory that it likes, and for
it to manage the data upload and buffer destruction asynchronously.

Because X pixmaps can contain many X windows (the root pixmap being
the most obvious example), they are often significantly larger than
the actual rendering target area. As an optimization, the code only
copies data from the relevant area of the pixmap, saving considerable
time as a result. There's even an interface which further restricts
that to a subset of the target drawable which the Composite
function uses.
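That region-restricted copy can be modeled in plain C (hypothetical names; the real code moves the data through GL and a PBO, and handles real pixel formats): only the target rectangle is staged, not the whole pixmap.

```c
#include <string.h>

/* Copy just the target rectangle of a pixmap (one byte per pixel for
 * simplicity) into a tightly packed staging buffer, mimicking the
 * download side of the PBO fallback path. */
static void
stage_subrect(const unsigned char *pixmap, int pixmap_width,
              int x, int y, int rect_w, int rect_h,
              unsigned char *staging)
{
    for (int row = 0; row < rect_h; row++)
        memcpy(staging + row * rect_w,
               pixmap + (y + row) * pixmap_width + x,
               rect_w);
}
```

For a small rendering operation on a root-sized pixmap, the staged rectangle can be orders of magnitude smaller than the pixmap itself.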

Using Scissoring for Clipping

The GL scissor operation provides a single clipping rectangle. X
provides a list of rectangles to clip to. There are two obvious ways
to perform clipping here -- either perform all clipping in software,
or hand each X clipping rectangle in turn to GL and re-execute the
entire rendering operation for each rectangle.

You'd think that the former plan would be the obvious choice; clearly
re-executing the entire rendering operation potentially many times is
going to take a lot of time in the GPU.

However, the reality is that most X drawing occurs under a single
clipping rectangle. Accelerating this common case by using the
hardware clipper provides enough benefit that we definitely want to
use it when it works. We could duplicate all of the rendering paths
and perform CPU-based clipping when the number of rectangles rises
above some threshold, but the additional code complexity isn't
obviously worth the effort, given how infrequently it would be
used. So I haven't bothered; most operations simply loop over the
clip rectangles, setting the scissor to each one in turn and
re-executing the whole operation.
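A runnable sketch of that pattern, with hypothetical names and the "draw" reduced to counting the pixels that survive the scissor, so it stands alone without a GL context:

```c
typedef struct { int x1, y1, x2, y2; } box_t;

static int imax(int a, int b) { return a > b ? a : b; }
static int imin(int a, int b) { return a < b ? a : b; }

/* For each clip box, set the scissor to that box and re-execute the
 * entire drawing operation. The real code calls glScissor() and
 * re-issues the draw; here the draw is modeled as counting how many
 * pixels of the operation's extent land inside the scissor. */
static long
draw_with_clip(box_t extent, const box_t *clip, int nclip)
{
    long drawn = 0;
    for (int i = 0; i < nclip; i++) {
        /* glScissor(clip[i]...) in the real code, then redraw */
        int x1 = imax(extent.x1, clip[i].x1);
        int y1 = imax(extent.y1, clip[i].y1);
        int x2 = imin(extent.x2, clip[i].x2);
        int y2 = imin(extent.y2, clip[i].y2);
        if (x1 < x2 && y1 < y2)
            drawn += (long)(x2 - x1) * (y2 - y1);
    }
    return drawn;
}
```

With a single clip rectangle (the common case) the loop body runs exactly once, and the hardware clipper does all of the work.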

Each operation then allocates the VBO space and copies all of the X
data into it. Note that the data transfer is simply a 'memcpy' --
that's because we break the X objects apart in the vertex shader
using instancing, avoiding the CPU overhead of computing four corner
coordinates.
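The win is that the CPU side stays a straight memcpy of the X request data; the expansion into corners happens per instance on the GPU. A plain-C model of that expansion (the real work is done in GLSL; the names here are hypothetical):

```c
#include <stdint.h>

/* Laid out like the X protocol's xRectangle */
typedef struct { int16_t x, y; uint16_t w, h; } x_rect_t;
typedef struct { float x, y; } vert_t;

/* What the instanced vertex shader does: each of the four vertex IDs
 * (0..3) selects one corner of the rectangle for this instance, so
 * the CPU never computes corner coordinates at all. */
static vert_t
expand_corner(x_rect_t r, int vertex_id)
{
    vert_t v;
    v.x = r.x + ((vertex_id & 1) ? r.w : 0);
    v.y = r.y + ((vertex_id & 2) ? r.h : 0);
    return v;
}
```

One rectangle per instance, four vertices per instance; the VBO holds the rectangles exactly as X delivered them.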

GL texture size limits

X pixmaps use 16 bit dimensions for width and height, allowing them to
be up to 65535 x 65535 pixels. Because the X coordinate space is
signed, only a quarter of this space is actually useful, which makes
the useful size of X pixmaps only 32767 x 32767. This is still larger
than most GL implementations offer as a maximum texture size though,
and while it would be nice to just say 'we don't allow pixmaps larger
than GL textures', the reality is that many applications expect to be
able to allocate such pixmaps today, largely to hold the ever
increasing size of digital photographs.

Glamor has always supported large X pixmaps; it does this by splitting
them up into tiles, each of which is no larger than the largest
texture supported by the driver. What I've added to Glamor is some
simple macros that walk over the array of tiles, making it easy for
the rendering code to support large pixmaps without needing any
special case code.
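A minimal model of that tile walk (hypothetical names; the real glamor macros carry more state, such as per-tile texture handles): given the pixmap size and the maximum texture size, visit each tile's bounds, clamping the edge tiles.

```c
typedef struct { int x, y, w, h; } tile_box_t;

/* Visit every tile of a large pixmap; tiles on the right and bottom
 * edges are clamped to the pixmap bounds. Returns the tile count. */
static int
for_each_tile(int pix_w, int pix_h, int max_tex,
              void (*visit)(tile_box_t))
{
    int n = 0;
    for (int y = 0; y < pix_h; y += max_tex)
        for (int x = 0; x < pix_w; x += max_tex) {
            tile_box_t t = { x, y,
                             (pix_w - x < max_tex) ? pix_w - x : max_tex,
                             (pix_h - y < max_tex) ? pix_h - y : max_tex };
            if (visit)
                visit(t);
            n++;
        }
    return n;
}
```

A pixmap no larger than the maximum texture size comes out as a single tile, so small pixmaps pay essentially nothing for the generality.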

Glamor also has some simple testing support -- you can compile the
code to ignore the system-provided maximum texture size and supply
your own value. This code had gone stale, and couldn't work as there
were parts of the code for which tiling support just doesn't make
sense, like the glyph cache, or the X scanout buffer. I fixed things
so that you could leave those special cases as solitary large tiles
while breaking up all other pixmaps into tiles no larger than 32
pixels square.

I hope to remove the single-tile case and leave the code supporting
only the multiple-tile case; we have to have the latter code, and so
having the single-tile code around simply increases our code size for
no obvious benefit.

Getting accelerated copies between tiled pixmaps added a new
coordinate system to the mix and took a few hours of fussing until it
was working.

Rebasing Many (many) Times

I'm sure most of us remember the days before git; changes were often
monolithic, and the notion of reworking a series of changes for the
sake of clarity never occurred to anyone. It used to be that the
final code was the only interesting artifact; how you got there didn't
matter to anyone. Things are different today; I probably spend a third
of my development time communicating with other developers about how
the code should change, by revising the sequence of patches to be
applied.

In the case of Glamor, I've now got a set of 28 patches. The first few
are fixes outside of the glamor tree that make the rest of the server
work better. Then there are a few general glamor infrastructure
additions. After that, each core operation is replaced, one at a
time. Finally, a bit of stale code is removed. By sequencing things in
a logical fashion, I hope to make review of the code easier, which
should mean that people will spend less time trying to figure out what
I did and be able to spend more time figuring out if what I did is
reasonable and correct.

Supporting Older Versions of GL

All of the new code uses vertex instancing to move coordinate
computation from the CPU to the GPU. I'm also pulling textures apart
using integer operations. Right now, we should correctly fall back to
software for older hardware, but it would probably be nicer to just
fall back to simpler GL instead. Unless everyone decides to go buy
hardware with new enough GL driver support, someone is going to need
to write simplified code paths for glamor.

If you've got such hardware, and are interested in making it work
well, please take this as an opportunity to help yourself and others.

Near-term Glamor Goals

I'm pretty sure we'll have the code in good enough shape to merge
before the window closes for X server 1.16. Eric is in charge of the
glamor tree, so it's up to him when stuff is pulled in. He and Markus
Wick have also been generating code and reviewing stuff, but we could
always use additional testing and review to make the code as good as
possible before the merge window closes.

Markus has expressed an interest in working on Glamor as a part of the
X.org summer of code this year; there's clearly plenty of work to do
here. Eric and I haven't touched the render acceleration code at all,
and it could definitely use some updating to use more modern GL
features.

If that works as well as the core rendering code changes, then we can
look forward to a Glamor which offers GPU-limited performance for
classic X applications, without requiring GPU-specific drivers for
every generation of every chip.