
What's up this week

The last week has certainly been entertaining. We're quickly merging a pile
of new code into the driver and trying to get everything building in one
place so that people can play with stuff before we release.

Getting 2D on top of GEM

One of the big missing pieces last week was getting the 2D driver working
with Pixmaps as GEM objects. This is critical as we move towards unified
kernel memory management for rendering resources, allowing us to use objects
across multiple APIs. The most pressing need here is to enable the
GLX_EXT_texture_from_pixmap extension in an efficient fashion.

So, what's the plan then? Fairly simple: allocate a GEM object for every
pixmap and then use GEM relocations to manage access to them. No need for
the 2D driver to even know what's bound to the GTT; it can treat every
Pixmap exactly alike and let the kernel manage the low-level hardware
details. Our experience with the 3D driver has been quite good; GEM is easy
to use and reasonably efficient.
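
For a sense of what that looks like from the driver side, here's a minimal
sketch of allocating a backing object for a pixmap; the ioctl and structure
names come from the i915 GEM interface, but consider this illustrative rather
than actual driver code:

    #include <stdint.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <drm/i915_drm.h>

    /* Allocate a GEM object large enough to back a pixmap.
     * Returns the GEM handle, or 0 on failure. */
    static uint32_t
    pixmap_gem_create(int drm_fd, uint32_t pitch, uint32_t height)
    {
        struct drm_i915_gem_create create;

        memset(&create, 0, sizeof(create));
        create.size = (uint64_t) pitch * height;  /* rounded up to pages */

        if (ioctl(drm_fd, DRM_IOCTL_I915_GEM_CREATE, &create) != 0)
            return 0;

        return create.handle;   /* per-client handle naming the object */
    }

Rendering commands then refer to pixmaps by handle, and the kernel patches in
real GTT addresses through the relocation lists at execbuffer time, so the 2D
driver never needs to know where anything actually lives.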

The initial thought was that we'd use EXA's ability to forward pixmap
creation back to the driver and have our driver call-back create the GEM
object. However, in looking at that, it turns out to have a terrible (and
incomplete) API. The driver has no say in the pixmap layout; it must use the
EXA-enforced pixel organization. In a land of tiled pixmaps, that's not OK.
Further enquiry showed a wealth of other code which is useless in our
uniform Pixmap environment: damage tracking and enforced hardware
synchronization are wasteful, performance-robbing activities.

Ok, so if EXA isn't what we want, then what is? Well, I like the basic EXA
acceleration plan -- accelerate solid fills, copy area and the composite
operation and leave everything else to software. In fact, the whole EXA
drawing API is just fine; it's just the wasteful EXA code behind it that
isn't necessary.
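
For reference, the hook set in question is small; this is an illustrative
sketch in the spirit of the EXA driver interface (simplified types and
signatures, not copied from any header):

    /* Illustrative hook set, modeled loosely on the EXA drawing API.
     * "pixmap_t" stands in for the server's PixmapPtr; the real hooks
     * carry more detail (planemask, alu, picture formats, ...). */
    typedef struct pixmap pixmap_t;

    struct accel_hooks {
        /* solid fills */
        int  (*prepare_solid)(pixmap_t *dst, int alu, unsigned long fg);
        void (*solid)(pixmap_t *dst, int x1, int y1, int x2, int y2);
        void (*done_solid)(pixmap_t *dst);

        /* screen-to-screen copies */
        int  (*prepare_copy)(pixmap_t *src, pixmap_t *dst, int alu);
        void (*copy)(pixmap_t *dst, int src_x, int src_y,
                     int dst_x, int dst_y, int width, int height);
        void (*done_copy)(pixmap_t *dst);

        /* Render composite */
        int  (*prepare_composite)(int op, pixmap_t *src, pixmap_t *mask,
                                  pixmap_t *dst);
        void (*composite)(pixmap_t *dst, int src_x, int src_y,
                          int mask_x, int mask_y,
                          int dst_x, int dst_y, int width, int height);
        void (*done_composite)(pixmap_t *dst);

        /* bracket software fallbacks (described below) */
        int  (*prepare_access)(pixmap_t *pixmap, int for_writing);
        void (*finish_access)(pixmap_t *pixmap);
    };

Everything not covered by these hooks simply falls back to software.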

UXA -- the UMA Acceleration Architecture

Ok, so instead of hacking up EXA and trying to make it work for the GEM
driver and existing drivers, I decided to just make it work for GEM on UMA
hardware and see what it looked like. The hope is that we'll find some way
either to patch EXA or at least to share the low-level rendering
code between UXA and EXA. For now, UXA lives in the intel driver itself;
once we figure out how we want the X server rendering infrastructure to
work, we'll merge whatever results back into the core server.

I started UXA by just copying the existing EXA code and running an edit
script to change all of the names. Then, I went through the code and removed
everything dealing with pixmap migration, damage computation or explicit
global hardware synchronization. The only synchronization primitive left is
the prepare_access/finish_access pair which signals the start and end of
software drawing. The hardware driver is expected to deal with all other
synchronization issues itself.

Handily, GEM does rendering synchronization automatically when rendering with
the hardware, and provides simple primitives to support software fallbacks.
The key here is that we never need to idle the whole chip; we only need to
wait for it to finish working on whatever objects are currently being drawn
with. The goal is to avoid artificial serialization.
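
As a concrete sketch (using the i915 GEM set_domain ioctl as I understand it;
treat the details as illustrative), prepare_access for a software fallback
boils down to asking the kernel to move just that object into the CPU
domain -- the kernel waits for any rendering involving the object and flushes
caches as needed, without stalling the rest of the chip:

    #include <stdint.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <drm/i915_drm.h>

    /* Move a single object into the CPU domain before software drawing.
     * Only rendering that touches this object is waited for. */
    static int
    pixmap_prepare_cpu_access(int drm_fd, uint32_t handle, int for_writing)
    {
        struct drm_i915_gem_set_domain sd;

        memset(&sd, 0, sizeof(sd));
        sd.handle = handle;
        sd.read_domains = I915_GEM_DOMAIN_CPU;
        sd.write_domain = for_writing ? I915_GEM_DOMAIN_CPU : 0;

        return ioctl(drm_fd, DRM_IOCTL_I915_GEM_SET_DOMAIN, &sd);
    }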

The result is fewer than 5000 lines of code, compared to EXA's roughly 7500
lines.

Yeah, but does it work?

The short answer is "Yes, it works". The longer answer is "Yes, with
limitations". The biggest limitation right now is that GEM objects can only
be mapped directly by the CPU. For lots of operations, this is exactly what
you want: a fully cached view into the object that offers full performance
for CPU-bound rendering operations.
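
That direct mapping looks roughly like this (again using the i915 GEM mmap
ioctl as I understand it; a sketch, not driver code) -- the kernel hands back
an ordinary, fully cached pointer into the object:

    #include <stdint.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <drm/i915_drm.h>

    /* Get a cached CPU pointer covering the whole object. */
    static void *
    pixmap_gem_map_cpu(int drm_fd, uint32_t handle, uint64_t size)
    {
        struct drm_i915_gem_mmap arg;

        memset(&arg, 0, sizeof(arg));
        arg.handle = handle;
        arg.offset = 0;         /* map from the start of the object */
        arg.size = size;

        if (ioctl(drm_fd, DRM_IOCTL_I915_GEM_MMAP, &arg) != 0)
            return NULL;

        return (void *)(uintptr_t) arg.addr_ptr;
    }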

However, it has one performance problem and one functional limitation.

The performance problem is that using the CPU cache with these objects means
flushing the CPU cache whenever switching between CPU and GPU rendering.
CPU cache flushing is horribly expensive, enough so that it's often far
better to take the huge performance penalty of using un-cached reads if the
number of reads is small.

Yes, we could create write-combining PTEs for this direct mapping, but
constructing write-combining PTEs is also really expensive as that
involves flushing those PTEs from every CPU TLB, which requires an
inter-processor interrupt. Of course, you can't just create a
write-combining PTE, you have to make sure that the page it maps is not in
any CPU cache, so you have to perform a CPU cache flush as well.

Someday maybe this won't be true; there are plans afoot within the Linux
kernel to make this reasonably efficient. Perhaps this will happen before we
get our flying cars.

So, it's a performance problem; we can deal with that.

Tiled Surfaces

What we can't deal with is how tiled surfaces work under a CPU map. A normal
surface has an entire scanline mapped to a linear section of memory. This
places vertically adjacent pixels a fair distance apart in memory. Drawing a
vertical line means touching two different cache lines and two different
pages. Even a large cache and TLB will not help much if you draw tall
objects. Tiled surfaces arrange for nearby screen pixels to be nearby in
memory, usually by constructing the surface from a set of rectangular
page-sized tiles. Vertically adjacent pixels will then be in the same page,
and can even be in the same cache line in some cases.

The performance benefits for tiled surfaces are obvious; fewer cache and TLB
misses. The cost to the hardware is fairly small; just some gates to stir
addresses around when fetching and storing pixels. However, the cost to
software is fairly large; computing the address of a pixel now involves some
fairly ugly computation.
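
To make that concrete, here's roughly what the per-pixel address computation
turns into; the tile dimensions (512 bytes wide, 8 rows tall) are in the
ballpark of Intel's X tiling, but treat the exact numbers and the lack of
low-order bit swizzling as simplifications:

    #include <stdint.h>

    #define TILE_WIDTH_BYTES  512
    #define TILE_HEIGHT       8
    #define TILE_SIZE         (TILE_WIDTH_BYTES * TILE_HEIGHT)

    /* Byte offset of pixel (x, y) in a tiled surface.  A linear surface
     * would simply be  y * pitch + x * cpp. */
    static uint32_t
    tiled_offset(uint32_t x, uint32_t y, uint32_t pitch, uint32_t cpp)
    {
        uint32_t xbyte = x * cpp;                       /* byte column */
        uint32_t tiles_per_row = pitch / TILE_WIDTH_BYTES;

        uint32_t tile_x = xbyte / TILE_WIDTH_BYTES;     /* which tile */
        uint32_t tile_y = y / TILE_HEIGHT;

        uint32_t sub_x = xbyte % TILE_WIDTH_BYTES;      /* where inside it */
        uint32_t sub_y = y % TILE_HEIGHT;

        return (tile_y * tiles_per_row + tile_x) * TILE_SIZE
               + sub_y * TILE_WIDTH_BYTES + sub_x;
    }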

We already managed to make Mesa deal with tiled surfaces. That was fairly
easy as Mesa has a single span-based pixel fetch and store architecture.
Write new span accessing functions and the rest of the software rendering
code just works.

Fixing the X server 2D software rendering code is another matter entirely --
there's a lot of it, and it all wants to touch memory in a linear fashion.
Aaron Plattner from nVidia actually did go and whack fb to make it work;
every pixel fetch or store goes through a function call which is passed the
nominal linear address of the pixel. These accessor/setter functions then
munge that address into the actual tiled address. However, that's yet
another huge performance impact for software rendering.

Hardware De-Tiling

A better solution is to just use the hardware. When a tiled surface is bound
to the GTT, it is visible to everyone using linear addresses; those
addresses are swizzled in the hardware and head out to memory in tiled form.
There's no performance benefit for the CPU, as its TLBs and caches all see
the linear address, but at least it doesn't have to deal with a non-linear
address space.

The second benefit of the GTT map is that it lives under a write-combining
MTRR, so all accesses to memory are write-combining and not write-back. This
eliminates all of the CPU cache coherence issues and leaves us back with the
old performance that we know and love -- fast writes and really slow reads,
but no penalty for switching rapidly between GPU and CPU.

What's Next?

So, the basic Pixmaps-in-GEM code is up and running in the gem-pixmap branch
of my driver repository,
git://people.freedesktop.org/~keithp/xf86-video-intel.
The next step will be to integrate Carl Worth's 965 render changes, which
place all of the temporary data the render code uses into GEM objects as
well. That
will finish the DRI2 enabling work and allow us to provide zero-copy
texture-from-pixmap support.

However, before that can really go mainstream, we need to get the GTT object
mapping working to fix tiled surface support and to recover some of the
performance lost to CPU cache flushing. We'll see if Kristian is ready with
DRI2 tomorrow; if not, I'll probably spend the day figuring out enough
additional parts of the Linux MM code to get my GTT maps working.