RPi Xorg rpi Driver

This is the documentation for the Xorg Raspberry Pi driver developed in this thread [1].

By default, each Raspberry Pi Linux distro uses the generic framebuffer driver to draw the X display. All rendering of the display is done by the CPU into off- or on-screen buffers, which are eventually shown on the output by the scan-out hardware. As the CPU on the Raspberry Pi is relatively weak, this makes for a sluggish user interface and, at the same time, a high CPU load that slows down other programs.

In the modern 2D X11 desktop environment, however, there are two major ways in which an application can choose to render itself:

all rendering done by the X server

nearly all rendering done by the application, with the X server simply presenting the rendered output

The primary goals of this project are to improve the performance of the first case and leave the second to other projects - there are many different user libraries that can do application-side rendering and boosting the performance of each of those would be a huge undertaking. The driver accomplishes this by offloading common tasks onto other hardware on the SoC that can process the work asynchronously, allowing the X server to be pre-empted by the OS yet still allowing progress to be made. This should allow other processes to see more CPU time.

Unfortunately this means that even if the X server runs infinitely faster, applications can still seem unresponsive if extensive application-side rendering is used.

No effort has been made so far to allow applications access to OpenGL/GL ES through the X server - basic 2D has been the priority so far.

Design

Xorg provides a mechanism for drivers to accelerate a number of important rendering tasks. This is called EXA [2]. EXA allows easy overriding of:

block copy aka blitting

solid colour fills

compositing, aka alpha blending

This driver implements the required functionality of EXA by using different parts of the Raspberry Pi SoC.

2D A->B block copies are performed using asynchronous DMA. A DMA engine is programmed to copy - in either a linear or 2D fashion - from A(x, y) to B(x2, y2) with an incrementing source address and incrementing destination address.

Solid colour fills are also performed by asynchronous DMA. A DMA engine is programmed to copy (again either linear or 2D) from a non-moving source address to a moving destination address starting at B(x, y). The source address holds the colour that is to be used in the fill.

DMA commands are enqueued one after the other to ensure correct results. They are constructed as a chain of DMA control blocks (CBs), and passed to the DMA controller to be kicked in one go.
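The copy and fill descriptions above can be sketched as a small software model. This is a minimal illustration, not the driver's code: `ControlBlock`, `make_copy`, `make_fill` and `run_chain` are hypothetical names, the struct is a simplified stand-in rather than the hardware's control block layout, and `run_chain` only simulates what a DMA engine would do with a chain. The point it shows is that a fill is a copy whose source address does not move, and that CBs in a chain are executed strictly in order.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical, simplified control block. This is NOT the hardware register
// layout; all offsets are in 32-bit pixels within one flat memory array.
struct ControlBlock {
    std::size_t src;        // first source pixel (or the colour word for a fill)
    std::size_t dst;        // first destination pixel
    std::size_t width;      // pixels per row
    std::size_t height;     // rows (1 gives a linear transfer)
    std::size_t src_pitch;  // pixels between row starts in the source
    std::size_t dst_pitch;  // pixels between row starts in the destination
    bool src_inc;           // false: keep re-reading the same source word
};

// 2D block copy: incrementing source, incrementing destination.
ControlBlock make_copy(std::size_t src, std::size_t dst, std::size_t w,
                       std::size_t h, std::size_t src_pitch,
                       std::size_t dst_pitch) {
    return ControlBlock{src, dst, w, h, src_pitch, dst_pitch, true};
}

// Solid fill: non-moving source holding the colour, incrementing destination.
ControlBlock make_fill(std::size_t colour_addr, std::size_t dst, std::size_t w,
                       std::size_t h, std::size_t dst_pitch) {
    return ControlBlock{colour_addr, dst, w, h, 0, dst_pitch, false};
}

// Software stand-in for a DMA engine: runs every CB of the chain in order.
void run_chain(const std::vector<ControlBlock>& chain,
               std::vector<std::uint32_t>& mem) {
    for (const ControlBlock& cb : chain) {
        for (std::size_t y = 0; y < cb.height; ++y) {
            for (std::size_t x = 0; x < cb.width; ++x) {
                const std::size_t s =
                    cb.src_inc ? cb.src + y * cb.src_pitch + x : cb.src;
                const std::size_t d = cb.dst + y * cb.dst_pitch + x;
                assert(s < mem.size() && d < mem.size());
                mem[d] = mem[s];
            }
        }
    }
}
```

In a real driver the chain would be handed to the hardware in one go rather than walked by the CPU; the model is only meant to make the addressing rules concrete.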

For cases where the DMA set-up time would take longer than a naive CPU copy or fill, a CPU fallback is used instead.

Composition is more complex as the inputs vary much more. The operation allows many different operating modes, eg with different filtering, transformations, pixel formats, blend equations, wrapping modes - but some operations are much more common than others. These are handled in three different ways:

synchronous acceleration by the VPU's vector unit [3]. This covers the fewest cases but ideally the ones in which the most pixels need to be processed, where a speed-up will be most appreciated. These routines are hand-coded in assembly.

synchronous low-latency CPU implementation using 32-bit ARM SIMD, catching all common cases. This should perform a "good enough" job, as it is primarily designed for low-pixel-count operations, eg rendering small antialiased characters.

fallback to the generic X implementation: this covers all other composition modes. This is the worst case, as the overhead reaching the first actual image processing instruction is high.
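The three-way split above amounts to a dispatch decision per composite request. The sketch below is one plausible shape for it, under stated assumptions: `Path`, `CompositeRequest` and `choose_path` are hypothetical names, and the idea of a minimum pixel count for the VPU path (`vpu_min_pixels`) is an illustrative parameter, not a value taken from the driver.

```cpp
#include <cassert>
#include <cstddef>

// Which implementation handles a composite request.
enum class Path { Vpu, ArmSimd, Generic };

// Hypothetical summary of a request: which accelerated paths recognise this
// combination of operator/format/transform, and how many pixels it touches.
struct CompositeRequest {
    bool vpu_supported;
    bool simd_supported;
    std::size_t pixels;
};

// Prefer the VPU for large supported jobs, the ARM SIMD code for other
// supported jobs, and the generic X implementation for everything else.
Path choose_path(const CompositeRequest& r, std::size_t vpu_min_pixels) {
    if (r.vpu_supported && r.pixels >= vpu_min_pixels) return Path::Vpu;
    if (r.simd_supported) return Path::ArmSimd;
    return Path::Generic;
}
```

For example, with a threshold of 4096 pixels, a small antialiased glyph that both accelerated paths support would be handled by the ARM SIMD code, while a full-window blend would go to the VPU.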

SoC hardware used

As mentioned above, the driver leverages three things that the generic driver wouldn't otherwise use.

The ARM CPU's SIMD instruction set

This is not as comprehensive [4] as the NEON instruction set found in some v7 ARM implementations but is still useful for the task of composition. Through C++ template metaprogramming, careful consideration of what a compiler can and cannot optimise, and finally eyeballing the generated code, we can have composition functions that are of comparable speed to the hand-optimised functions in pixman. The templating helps here by generating hundreds of these functions, rather than the handful that are specially implemented in pixman.
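To illustrate how templating can generate many specialised composition functions from one body, here is a minimal sketch. It is not the driver's code: `Op`, `composite_row` and `select_row_fn` are hypothetical names, only two operators and one pixel format (premultiplied 32-bit ARGB) are covered, and it uses plain C++ rather than ARM SIMD. The conditions on `op` and `HasMask` are compile-time constants, so each instantiation compiles down to a loop without those branches.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

enum class Op { Src, Over };

// Rounded x * a / 255 for 8-bit x and a.
inline std::uint32_t mul_div255(std::uint32_t x, std::uint32_t a) {
    const std::uint32_t t = x * a + 128;
    return (t + (t >> 8)) >> 8;
}

// Scale all four 8-bit channels of a premultiplied ARGB pixel by a / 255.
inline std::uint32_t scale(std::uint32_t p, std::uint32_t a) {
    return (mul_div255(p >> 24, a) << 24) |
           (mul_div255((p >> 16) & 0xFF, a) << 16) |
           (mul_div255((p >> 8) & 0xFF, a) << 8) |
           mul_div255(p & 0xFF, a);
}

// One generic body; every <op, HasMask> pair becomes its own function.
template <Op op, bool HasMask>
void composite_row(std::uint32_t* dst, const std::uint32_t* src,
                   const std::uint8_t* mask, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        std::uint32_t s = src[i];
        if (HasMask) s = scale(s, mask[i]);  // constant condition
        if (op == Op::Src) {                 // constant condition
            dst[i] = s;
        } else {
            // Porter-Duff OVER on premultiplied pixels.
            dst[i] = s + scale(dst[i], 255 - (s >> 24));
        }
    }
}

using RowFn = void (*)(std::uint32_t*, const std::uint32_t*,
                       const std::uint8_t*, std::size_t);

// Run-time lookup of the compile-time generated specialisations.
RowFn select_row_fn(Op op, bool has_mask) {
    static const RowFn table[2][2] = {
        {composite_row<Op::Src, false>, composite_row<Op::Src, true>},
        {composite_row<Op::Over, false>, composite_row<Op::Over, true>},
    };
    return table[op == Op::Over ? 1 : 0][has_mask ? 1 : 0];
}
```

Adding further template parameters (source format, destination format, wrapping mode and so on) multiplies the number of generated functions without multiplying the amount of hand-written code, which is the effect described above.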

DMA engines

The Raspberry Pi SoC includes a decent number of DMA engines which can be used by the ARM for moving data around memory. They can all access the full bandwidth of the memory - more than the ARM could itself. They all share the bandwidth, and one DMA is sufficient to saturate the bus.
The DMA engines are not all the same however - some have greater performance or features than the others. For instance, half of them have the ability to perform '2D' DMAs rather than straight linear transfers. Also one of the DMA engines (DMA zero) has a deeper FIFO allowing it to do larger read bursts.

The DMA hardware does not live within the same address space as the ARM CPU. It uses the bus address space instead, so an address must be translated to get from one to the other. Also, the ARM's page tables are not used by this hardware. Virtually contiguous ARM addresses are not necessarily physically contiguous, and the DMA hardware won't know about this - as a result a DMA transfer sometimes needs to be broken up into 4 KB blocks to ensure the correct result.
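The page-boundary splitting can be sketched as follows. This is an illustrative helper, not the driver's code: `Chunk` and `split_at_pages` are hypothetical names, and a 4096-byte page size is assumed. Each resulting chunk lies within a single page, so it could then be translated to a bus address on its own; that translation step is not shown.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// One piece of a transfer that does not cross a page boundary.
struct Chunk {
    std::uintptr_t addr;  // virtual start address of the piece
    std::size_t len;      // length in bytes
};

constexpr std::size_t kPageSize = 4096;  // assumed page size

// Split [addr, addr + len) into pieces that each stay inside one page.
std::vector<Chunk> split_at_pages(std::uintptr_t addr, std::size_t len) {
    std::vector<Chunk> out;
    while (len > 0) {
        const std::size_t to_boundary = kPageSize - (addr % kPageSize);
        const std::size_t n = std::min(len, to_boundary);
        out.push_back(Chunk{addr, n});
        addr += n;
        len -= n;
    }
    return out;
}
```

A 5000-byte transfer starting 96 bytes before a page boundary, for instance, becomes three pieces of 96, 4096 and 808 bytes, each of which would need its own control block.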

There is a start-up cost associated with DMA, and as a result it is sometimes not efficient to use this hardware. Steps include:

breaking a large transfer up into pieces that respect 4 KB page boundaries

entering the kernel

translating user virtual addresses into bus addresses for each DMA CB

flushing and invalidating parts of the data cache, then kicking off the DMA chain

returning to user mode to do more work

entering the kernel

waiting for the DMA to complete

returning to user mode

For reference, a user->kernel->user transition takes roughly 1 us. Each DMA CB appears to take the DMA engine around 5 us to start.
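These figures allow a rough break-even estimate. The sketch below is a simplified cost model under stated assumptions: it counts two kernel round trips per batch (one to kick the chain, one to wait) at 1 us each and 5 us per CB as given above, ignores the cost of address translation and cache maintenance, and takes the DMA and CPU throughputs as caller-supplied guesses. The function names and the throughput numbers in the example are illustrative only, not measurements.

```cpp
#include <cassert>
#include <cstddef>

constexpr double kKernelRoundTripUs = 1.0;  // user->kernel->user, from the text
constexpr double kCbStartUs = 5.0;          // per control block, from the text

// Estimated wall time for one DMA batch of `cbs` control blocks moving
// `total_bytes` in all. `dma_bytes_per_us` is an assumed throughput.
double dma_batch_cost_us(std::size_t cbs, std::size_t total_bytes,
                         double dma_bytes_per_us) {
    assert(dma_bytes_per_us > 0.0);
    return 2.0 * kKernelRoundTripUs +
           static_cast<double>(cbs) * kCbStartUs +
           static_cast<double>(total_bytes) / dma_bytes_per_us;
}

// Estimated time for the CPU to do the same work itself.
double cpu_cost_us(std::size_t total_bytes, double cpu_bytes_per_us) {
    assert(cpu_bytes_per_us > 0.0);
    return static_cast<double>(total_bytes) / cpu_bytes_per_us;
}

bool dma_is_faster(std::size_t cbs, std::size_t total_bytes,
                   double dma_bytes_per_us, double cpu_bytes_per_us) {
    return dma_batch_cost_us(cbs, total_bytes, dma_bytes_per_us) <
           cpu_cost_us(total_bytes, cpu_bytes_per_us);
}
```

With assumed rates of 1000 bytes/us for DMA and 200 bytes/us for the CPU, a single 256-byte operation costs about 7.3 us by DMA against about 1.3 us on the CPU, whereas a 64 KB operation costs about 73 us by DMA against about 328 us on the CPU. Because the two kernel round trips are paid once per batch, the model also shows why many small operations submitted together fare better than the same operations submitted one at a time.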

VPU

Also on the SoC is a custom processor, the VPU, that appears to be the controlling brains of the GPU. This is what the 'firmware' runs on (see Rpi_Software#Overview).

Within this processor is a 16-way vector unit which is well-suited to image processing operations. Although this processor is ordinarily clocked nearly three times slower than the ARM core, the vector unit and improved memory interface more than make up for it. Some composition functions that commonly operate on thousands of pixels have been coded to run on this unit.

Like the DMA hardware, it lives within the bus address memory space. User virtual ARM addresses need translation, and the 4 KB page boundary/page table issues still apply. Also, from the viewpoint of the driver the VPU appears as an asynchronous co-processor: there is a ~56 us overhead at stock clock speeds when communicating with it from X, so work should only be sent to it if worthwhile.

Blocking waits

The workload sent to the driver from the running applications is not known in advance. The structure of the work is generally the same though: allocate some images, upload some data from the user application, perform a handful of operations, reach a synchronisation point; perform some more operations, reach another synchronisation point.

The point of the frequent synchronisation points is to allow the application to get hold of the rendered pixel data. They are also there to allow the application to release memory. By knowing that all rendering has completed by a given point, the application knows that what it sees in the buffer is correct and also that a given image buffer is no longer in use and can be freed.

This behaviour is contrary to a game-style render loop. There, for the majority of a frame a command buffer is filled with drawing commands, and at the end of the frame this command buffer is sent to the GPU for processing. Whilst this is going on, the GPU is processing the previous frame's command buffer. The X update loop differs in that:

there are many more synchronisation points, and it is unknown when they will appear

there is not a double-buffered command buffer, meaning the GPU cannot be processing "last frame's image" whilst the CPU is building this frame's command buffer

image data cannot go too far away from the CPU as the application may want to inspect it with the minimum of delay

This means OpenGL and other run-times are not appropriate for an X driver on a system like the Raspberry Pi that lacks horsepower. If GL or a similar run-time were used, the high CPU cost of setting up and tearing down images, command lists etc. would dwarf the actual amount of time the GPU would spend rendering. Also, the fact that on the Raspberry Pi textures are not trivially accessible by the CPU means that when an application needs to gain access to pixel data, a high cost must be paid stopping the GPU and then downloading the textures.
Even the start-up time of a single DMA CB, a few microseconds, is noticeable in some applications - if instead the whole 3D stack were traversed, many applications would be far slower than with the generic CPU route.

That said, applications which expect the driver to be implemented on a run-time like OpenGL will tailor their workload to suit it; the Chromium browser is one example. This is an exception though - most applications expect a synchronous driver with easily-reachable memory.

When to synchronise

As mentioned before, the driver treats the VPU and DMA hardware as asynchronous co-processors. The CPU overhead of reaching them is relatively high, and as CPU time is not in abundance this overhead needs to be amortised over as many operations as possible. This means the CPU builds up a list of work to send to the DMA hardware and a list of work to send to the VPU hardware. Eventually these lists are sent off for processing, and the overhead cost is only paid once per list.

Yet as it is not known when a synchronisation point may appear, it is not clear when a DMA or VPU command list should be started. By waiting longer the overhead per unit of work is decreased, yet there is a greater chance that the application asks for the work to be finished before it has even started, and the application then blocks. The opposite is also true - if work is kicked too frequently, too much overhead results, yet the application will wait less as there is a greater chance the work has completed by the time it requests a sync point.
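One simple way to explore this trade-off is a work list that is kicked either when it reaches a fixed length or when a synchronisation point arrives. The sketch below is only an illustration of that idea: `BatchQueue` and its members are hypothetical names, operations are represented by plain integers, the fixed threshold is an assumed policy rather than the driver's, and the actual submission and blocking wait are replaced by counters.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Accumulates work and submits it in batches. A small threshold means more
// kicks (more overhead); a large one means more work still unstarted when
// the application asks for a sync.
class BatchQueue {
public:
    explicit BatchQueue(std::size_t kick_threshold)
        : threshold_(kick_threshold) {
        assert(kick_threshold > 0);
    }

    // Add one operation; kick the list once it is long enough.
    void enqueue(int op_id) {
        pending_.push_back(op_id);
        if (pending_.size() >= threshold_) kick();
    }

    // Synchronisation point: anything still pending must be started now.
    // A real driver would then block until the hardware has finished.
    void sync() {
        if (!pending_.empty()) kick();
    }

    std::size_t kicks() const { return kicks_; }
    std::size_t submitted() const { return submitted_; }
    std::size_t pending() const { return pending_.size(); }

private:
    // Stand-in for handing the list to the DMA or VPU hardware.
    void kick() {
        ++kicks_;
        submitted_ += pending_.size();
        pending_.clear();
    }

    std::size_t threshold_;
    std::size_t kicks_ = 0;
    std::size_t submitted_ = 0;
    std::vector<int> pending_;
};
```

With a threshold of 4, ten operations followed by a sync produce three kicks, and the last two operations only start when the sync arrives, which is exactly the situation in which the application ends up waiting.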

Performing composition with the VPU is tricky too - as there is a 56 us start-up cost to performing work there, are there sufficient pixels to process that it is worth this cost? An estimate needs to be made of the relative speeds of the two processors (ARM and VPU) and based on this a decision is made whether the task is run on the co-processor or not. Something to also consider is that on the ARM CPU the run-time is variable. There are early-out optimisations that can be performed based on the value of the mask (if present). These optimisations can't be used on the VPU.
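The break-even point can be written down directly: if the ARM needs a ns per pixel and the VPU needs v ns per pixel plus the fixed 56 us overhead, the VPU is faster once n * a > 56000 + n * v. The sketch below computes that pixel count. The function names are hypothetical and the per-pixel rates are caller-supplied estimates; since the ARM run-time varies with mask-based early-outs, its rate can only be an average.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>

constexpr double kVpuOverheadNs = 56.0 * 1000.0;  // ~56 us, from the text

// Smallest pixel count at which the VPU is strictly faster than the ARM,
// given estimated per-pixel costs for each. Returns the maximum size_t
// value if the VPU never wins.
std::size_t vpu_break_even_pixels(double arm_ns_per_pixel,
                                  double vpu_ns_per_pixel) {
    assert(arm_ns_per_pixel > 0.0 && vpu_ns_per_pixel > 0.0);
    if (arm_ns_per_pixel <= vpu_ns_per_pixel) {
        return std::numeric_limits<std::size_t>::max();
    }
    const double n = kVpuOverheadNs / (arm_ns_per_pixel - vpu_ns_per_pixel);
    return static_cast<std::size_t>(std::floor(n)) + 1;
}

bool worth_sending_to_vpu(std::size_t pixels, double arm_ns_per_pixel,
                          double vpu_ns_per_pixel) {
    return pixels >=
           vpu_break_even_pixels(arm_ns_per_pixel, vpu_ns_per_pixel);
}
```

As an example with made-up rates of 10 ns per pixel on the ARM and 3 ns per pixel on the VPU, the two cost the same at 8000 pixels, so only operations of 8001 pixels or more would be sent to the VPU.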

Finally there is also the case where work to be processed with DMA is so small that it would be quicker to process on the CPU. Yet if a million tiny pieces of work come along back-to-back it would still have been faster to do with DMA - but it is not known how many pieces of work will be enqueued. This means there is a decision to be made: when should the CPU perform work that it thinks it could do faster than the DMA hardware?