Shared Memory Fences

In our last adventure, 'dri3k first steps', one of the 'future work'
items was to deal with synchronization between the direct rendering
application and the X server. DRI2 "handles" this by performing a
round trip each time the application starts using a buffer that was
being used by the X server.

As DRI3 manages buffer allocation within the application, there's
really no reason to talk to the server, so this implicit
serialization point just isn't available to us. As I mentioned last
time, James Jones and Aaron Plattner added an explicit GPU
serialization system to the Sync extension. These SyncFences
serialize rendering between two X clients, and within the server
there are hooks provided for the driver to use hardware-specific
serialization primitives.

The existing Linux DRM interfaces queue rendering to the GPU in the
order requests are made to the kernel, so we don't need the ability to
serialize within the GPU; we just need to serialize requests to the
kernel. Simple CPU-based serialization gating access to the GPU will
suffice here, at least for the current set of drivers. GPU access
which is not mediated by the kernel will presumably require
serialization that involves the GPU itself. We'll leave that for a
future adventure though; the goal today is to build something that
works with the current Linux DRM interfaces.

SyncFence Semantics

The semantics required by SyncFences are for multiple clients to block
on a fence which a single client then triggers. All of the blocked
clients start executing requests immediately after the trigger fires.

There are four basic operations on SyncFences:

Trigger. Mark the fence as ready and wake up all waiting clients.

Await. Block until the fence is ready.

Query. Retrieve the current state of the fence.

Reset. Unset the fence; future Await requests will block.
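
In code, that interface amounts to four operations on a shared 32-bit
word. A sketch of the shape, using the fence_* names that appear in
the implementation below:

    #include <stdint.h>

    void fence_trigger(int32_t *f); /* mark ready, wake all waiters */
    void fence_await(int32_t *f);   /* block until the fence is ready */
    int  fence_query(int32_t *f);   /* non-blocking readiness test */
    void fence_reset(int32_t *f);   /* unset; future Awaits block */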

SyncFences are the same as Events as provided by Python and other
systems. Of course all of the names have been changed to keep things
interesting. I'll call them Fences here, to be consistent with the
current X usage.

Using Pthread Primitives

One fact about pthreads that I recently learned is that the
synchronization primitives (mutexes, barriers and semaphores) are
actually supposed to work across process boundaries, if those objects
are in shared memory mapped by each process. That seemed like a great
simplification for this project: allocate a page of shared memory, map
it into the X server and the direct rendering application, and use the
existing pthreads APIs.
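
Had that panned out, the setup would have looked something like this
sketch: put the object in a shared mapping and mark it
PTHREAD_PROCESS_SHARED (error handling omitted; in the X case the page
would presumably arrive via a file descriptor passed between the
processes rather than MAP_ANONYMOUS):

    #include <pthread.h>
    #include <sys/mman.h>

    /* Sketch: a mutex usable by any process that maps this memory. */
    static pthread_mutex_t *shared_mutex_create(void)
    {
        pthread_mutex_t *m = mmap(NULL, sizeof (pthread_mutex_t),
                                  PROT_READ | PROT_WRITE,
                                  MAP_SHARED | MAP_ANONYMOUS, -1, 0);
        pthread_mutexattr_t attr;

        pthread_mutexattr_init(&attr);
        pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
        pthread_mutex_init(m, &attr);
        pthread_mutexattr_destroy(&attr);
        return m;
    }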

Alas, the pthread objects are architecture specific. I'm pretty sure
that when that spec was written, no-one ever thought of running
multiple architectures within the same memory space. I went and looked
at the code to check, and found that each of these objects has a
different size and structure on the x86 and x86_64 architectures. That
makes it pretty hard to use this API within X, as we often have both
32- and 64-bit applications talking to the same (presumably 64-bit) X
server.

As a last resort, I read through a bunch of articles on using futexes
directly within applications and decided that it was probably possible
to implement what I needed in an architecture-independent fashion.

Futexes

Linux Futexes live in this strange
limbo of being a not-quite-public kernel interface. Glibc uses them
internally to implement locking primitives, but it doesn't export any
direct interface to the system call. Certainly they're easy to use
incorrectly, but it's unusual in the Linux space to have our
fundamental tools locked away 'for our own safety'.

Fortunately, we can still get at futexes by creating our own syscall
wrappers.
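
Glibc doesn't expose a futex() function, so syscall(2) has to do the
job. A minimal sketch of the two wrappers the fence code needs, a
wake-all and a compare-and-block wait:

    #include <stdint.h>
    #include <unistd.h>
    #include <limits.h>
    #include <sys/syscall.h>
    #include <linux/futex.h>

    static inline long sys_futex(void *addr, int op, int32_t val)
    {
        return syscall(SYS_futex, addr, op, val, NULL, NULL, 0);
    }

    /* Wake every thread blocked on this address. */
    static inline long futex_wake(int32_t *addr)
    {
        return sys_futex(addr, FUTEX_WAKE, INT_MAX);
    }

    /* Sleep, but only if *addr still holds 'value' when the kernel
     * checks; otherwise return immediately. */
    static inline long futex_wait(int32_t *addr, int32_t value)
    {
        return sys_futex(addr, FUTEX_WAIT, value);
    }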

Atomic Memory Operations

I need atomic memory operations to keep separate cores from seeing
different values of the fence. GCC defines a few such primitives, and
I picked __sync_bool_compare_and_swap and
__sync_val_compare_and_swap. I also need fetch and store operations
that the compiler won't shuffle around.
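
Something along these lines suffices on machines where aligned 32-bit
loads and stores are themselves atomic; barrier() is just a compiler
fence, and atomic_fetch and atomic_store are the helpers the fence
code below relies on:

    #include <stdint.h>

    #define barrier() __asm__ __volatile__("" : : : "memory")

    /* A store the compiler can't move past other memory operations. */
    static inline void atomic_store(int32_t *f, int32_t v)
    {
        barrier();
        *f = v;
        barrier();
    }

    /* A fetch the compiler can't move past other memory operations. */
    static inline int32_t atomic_fetch(int32_t *f)
    {
        int32_t v;

        barrier();
        v = *f;
        barrier();
        return v;
    }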

If your machine doesn't make these two operations atomic, then you
would redefine these as needed.

Futex-based Fences

The wake-all semantics of Fences greatly simplify reasoning about the
operation, as there's no need to ensure that only a single thread runs
past Await; the only requirement is that no threads pass the Await
operation until the fence is triggered.

A Fence is defined by a single 32-bit integer which can take one of
three values:

0: The fence is not triggered, and no threads are waiting.

1: The fence is triggered.

-1: The fence is not triggered, and there are threads waiting in
futex_wait.

The basic requirement, that a thread not run until the fence is
triggered, is met by fetching the current value of the fence and
comparing it with 1. Until the fence is triggered, that comparison
will return false.

The compare_and_swap operation makes sure the fence is -1 before the
thread calls futex_wait: either it was already -1, in the case where
there were other waiters, or it was 0 before and is now -1, in the
case where there were no waiters. This needs to be an atomic operation
so that the fence value will be seen as -1 by the Trigger operation if
there are any threads in the syscall.

The futex_wait call will return once the value is no longer -1; it
also ensures that the thread won't block if the trigger occurs between
the swap and the syscall.
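
Putting those pieces together, Await is just a loop around the swap
and the wait (a sketch, built on the wrappers above):

    /* Block until the fence is triggered (value 1). */
    static inline void fence_await(int32_t *f)
    {
        /* Record that there's a waiter (0 -> -1), then sleep;
         * futex_wait returns immediately if *f is no longer -1. */
        while (__sync_val_compare_and_swap(f, 0, -1) != 1)
            futex_wait(f, -1);
    }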

On the Trigger side, the atomic compare_and_swap operation makes sure
that no Await thread swaps the 0 for a -1 while the Trigger is
changing the value from 0 to 1; either the Await switches from 0 to -1
or the Trigger switches from 0 to 1.

If the value before the compare_and_swap was -1, then there may be
threads waiting on the Fence. An atomic store, constructed with two
memory barriers around a regular store operation, marks the Fence
triggered; that is followed by the futex_wake call to unblock all
Awaiting threads.
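
Here's what Trigger looks like with those two cases handled (again a
sketch, using the helpers above):

    /* Mark the fence triggered and wake anyone blocked in Await. */
    static inline void fence_trigger(int32_t *f)
    {
        /* No waiters: swap 0 for 1 and we're done. Otherwise there
         * may be threads in futex_wait, so store 1 and wake them. */
        if (__sync_val_compare_and_swap(f, 0, 1) == -1) {
            atomic_store(f, 1);
            futex_wake(f);
        }
    }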

The Query function is just an atomic fetch:

    int fence_query(int32_t *f)
    {
        return atomic_fetch(f) == 1;
    }

Reset requires a compare_and_swap so that it doesn't disturb things if
the fence has already been reset and there are threads waiting on it.
Something like this does the trick:
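
    /* Unset the fence, but only if it is currently triggered; a fence
     * that is already 0, or -1 with waiters, is left alone. */
    static inline void fence_reset(int32_t *f)
    {
        __sync_bool_compare_and_swap(f, 1, 0);
    }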

A Request for Review

Ok, so we've all tried to create synchronization primitives only to
find that our 'obvious' implementations were full of holes. I'd love
to hear from you if you've identified any problems in the above code,
or if you can figure out how to use the existing glibc primitives for
this operation.