Accelerated OpenCL using Gallium3D

Summary

The goal of the project is to address the architectural issues standing in the way of an open accelerated implementation of OpenCL, placing as much of the burden as possible on the common Gallium3D infrastructure, and, at the same time, to get most of the relevant low-level changes done in the Nouveau driver. The project was carried out by Francisco Jerez as part of the EVoC program, building upon earlier work in the same direction done by Zack Rusin and Denis Steckelmacher over the preceding months and years.

Proposed schedule

At the end of this period the Nouveau driver will be able to run compute grids of arbitrary machine code reliably on the hardware in question.

Implement compute kernel set-up and execution on the nv50 platform, using the "compute" object of the graphics engine. I can only test it on cards with the "0x85c0" variant of this object (nvA3 and up), but I don't rule out adding support for earlier or newer generations if I find volunteers willing to send traces and test patches on different hardware.

Extend the nv50 compiler to support the peculiarities of compute programs, including the defined TGSI language extensions.

Extend the Nouveau DRM interface to sidestep the memory addressing limitations of compute shaders on that hardware generation.

Entry points for assigning buffer objects, textures and samplers to a set of binding points defined by the shader - not necessarily the compute shader, in the spirit of Direct3D 11.

The main design principles would be those of the OpenCL API, but the idiosyncrasies of other similar APIs like DirectCompute, AMD FireStream and CUDA would be taken into account. For example, the buffer management peculiarities of CUDA make it difficult to write a clean implementation of it in terms of OpenCL; the proposed API should address this problem.

At the end of this period all the mentioned OpenCL APIs will be functional (modulo bugs and lack of documentation to be addressed in the next point), assuming the library is provided with TGSI bytecode as input instead of C source code.

Implement context, queue, buffer, texture and sampler management on top of Gallium3D.

Implement accelerated memory and image transfer operations on top of Gallium3D.

Work done so far

The device-independent part of this project (i.e. the OpenCL state tracker and remaining Gallium support changes) has already been included in mesa master. Most of the driver-specific work done until now can be found in a git repository: https://github.com/curro/mesa/commits/master

A preliminary compute API was written to provide a framework under which the subsequent work could be tested. Afterwards the TGSI language was extended with preliminary resource writeback, memory access and grid parameter support.

A series of unit tests for the mentioned API was written.

Code was added to init the GPU compute subsystem, set up and execute a grid of given dimensions.

A form of raw resource access was implemented. "Raw" here means that there is no channel conversion: the color values returned by the opcode are in the exact form they're found in memory, without float/integer conversions or scaling. This involved adding a way to track which resources are intended to be written to, then setting up the GPU surface slots in the compute object, teaching the nv50 compiler front-end how to translate/lower the new TGSI opcodes, and then fixing the back-end code emitter to generate the correct binary form of the corresponding hardware instructions.

Access to the grid parameters (thread id, group id/size, grid layout) from the shader code was implemented.
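As a rough illustration of how these grid parameters relate to each other, the following CPU-side sketch derives a flattened global index from the group id, group size and thread id along one dimension. The struct and function names are hypothetical and are not taken from the driver or from TGSI.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical CPU-side model of the values a kernel can read
// (one dimension only, for brevity).
struct GridParams {
    std::size_t group_id;    // index of the work-group within the grid
    std::size_t group_size;  // number of threads in each work-group
    std::size_t thread_id;   // index of the thread within its work-group
};

// Flattened index of the thread within the whole grid.
inline std::size_t global_id(const GridParams &p) {
    assert(p.thread_id < p.group_size);  // a thread id is local to its group
    return p.group_id * p.group_size + p.thread_id;
}
```

For example, thread 5 of group 2 with 64 threads per group gets global index 2 * 64 + 5 = 133.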

At this point I realized that the changes required in the compiler for the long term would be somewhat deeper than I expected, and probably not worth doing because Christoph Bumiller had been working on a complete rewrite of it in C++ that would replace the current one once it was ready. So I decided to focus my work on the new code and port what had been done so far to it.

A number of bug fixes were done on the new compiler code, especially in the optimization passes and in the handling of texture, control flow and integer ops. Afterwards a number of clean-ups in the internal compiler data structures were carried out.

The new compiler code was adapted to support proper subroutines (so far it had been inlining the whole input unconditionally). This involved substantial changes in the IR objects the compiler works with. The SSA conversion, live analysis and register allocation passes were modified to be able to deal with undefined "formal" arguments and return values, and the register allocator was adapted to assign matching physical registers to the arguments and return value(s) of a function in both the caller and callee. The TGSI parser was extended to emit separate code units for each function, and to infer which TGSI registers need to be treated as function inputs or outputs. Finally an inlining optimization pass was implemented to optionally bring back the old behavior, with a view to a future inlining heuristic that would do something smarter than simply inlining everything.

The compiler code was modified to emit a kind of "symbol table" with the offsets of each input function in the generated binary. This was used to add support for loading compute programs with multiple executable kernels stored in them.
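The idea of a symbol table mapping kernel names to code offsets can be sketched as follows. This is only a minimal model under assumed names (Symbol, ComputeProgram and kernel_entry are hypothetical); the actual layout emitted by the nv50 compiler is not described here.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical symbol table entry: one per function in the binary.
struct Symbol {
    std::string name;    // kernel (function) name
    std::size_t offset;  // byte offset of its first instruction
};

// Hypothetical compute program: machine code plus its symbol table.
struct ComputeProgram {
    std::vector<unsigned char> code;
    std::vector<Symbol> symbols;
};

// Look up the entry point of a kernel by name.
inline std::size_t kernel_entry(const ComputeProgram &prog,
                                const std::string &name) {
    for (const Symbol &s : prog.symbols) {
        if (s.name == name) {
            assert(s.offset < prog.code.size());  // must point into the code
            return s.offset;
        }
    }
    throw std::out_of_range("no such kernel: " + name);
}
```

With a table like this, launching one of several kernels in the same program amounts to looking up its offset and starting execution there.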

Access to the global, local, private, and input memory was implemented. Like resource access, it involved making fixes in both the front- and back-end of the compiler for the new ops to be dealt with correctly. Code was written for uploading the input arguments of the compute kernel to the GPU, at the same time filling in the blanks left inside for GPU pointers.
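The "filling in the blanks" step can be pictured with the sketch below: the argument block is copied and every pointer slot is patched with the address the buffer ended up at. The names and the 32-bit pointer width are assumptions made for illustration, not details taken from the driver.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical description of one blank left in the kernel input block.
struct PointerSlot {
    std::size_t offset;     // byte offset of the blank inside the input block
    std::uint32_t address;  // GPU address to write there (32-bit is assumed)
};

// Return a copy of the argument block with all pointer slots filled in.
inline std::vector<unsigned char>
build_input(std::vector<unsigned char> args,
            const std::vector<PointerSlot> &slots) {
    for (const PointerSlot &s : slots) {
        assert(s.offset + sizeof(s.address) <= args.size());
        std::memcpy(&args[s.offset], &s.address, sizeof(s.address));
    }
    return args;
}
```

The resulting block is what would then be uploaded to the GPU as the kernel's input memory.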

Proper texture sampling from compute programs was implemented. This involved no compiler work aside from bug fixes, because it functions the same way as in vertex and fragment programs. The sampler/texture unit setup code was extended to make it able to deal with its compute counterparts.

The kernel was modified to start allocating the graphics virtual memory space of each process at the 4GB mark, in order to leave the low memory reserved for buffer objects with special addressing requirements, like the ones intended to be used for GPGPU. This uncovered a number of kernel bugs that had to be fixed, caused by incorrect handling of high memory in the command submission and notifier block allocation code.
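The allocation policy can be modelled with a toy bump allocator that keeps two cursors, one below and one at the 4GB mark. This is a sketch in user-space C++ under assumed names, not the Nouveau kernel code, and it ignores freeing and alignment.

```cpp
#include <cassert>
#include <cstdint>

// The 4GB boundary separating the reserved low range from the rest.
constexpr std::uint64_t kLowLimit = 1ull << 32;

// Hypothetical toy allocator for one process's GPU virtual address space.
struct VmAllocator {
    std::uint64_t next_low = 0;           // cursor for the reserved low range
    std::uint64_t next_high = kLowLimit;  // ordinary buffers start at 4GB

    // Allocate `size` bytes; `needs_low` marks buffers that must be
    // addressable below 4GB (e.g. the ones intended for GPGPU).
    std::uint64_t alloc(std::uint64_t size, bool needs_low) {
        std::uint64_t &next = needs_low ? next_low : next_high;
        std::uint64_t addr = next;
        if (needs_low)
            assert(addr + size <= kLowLimit);  // low range must not overflow
        next += size;
        return addr;
    }
};
```

Ordinary allocations never consume low addresses, so the low range stays available for the buffers that actually need it.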

A winsys loader API was created to provide a common interface to enumerate and initialize all the winsys implementations present in a platform. "drm" and "software" backends for this API were implemented. The loader code of the gbm state tracker was replaced with calls to this API, and the gallium "trivial" tests were changed to use it, with a view to running them on top of some hardware pipe driver instead of softpipe.

The problem of subroutine parameter passing in TGSI was considered. Until now some of the temporary registers touched by a subroutine were unnecessarily being treated as if they were taking part in the calling protocol. This meant that some optimization opportunities were missed, because different registers holding separate return value(s) of a function couldn't be reused and merged together by the register allocator. For this reason the TGSI language was extended with the "LOCAL" declaration modifier, which means that a given register isn't intended for parameter passing and the compiler doesn't have to make any guarantees about it being preserved across function boundaries. It's a backwards-compatible change in the sense that implementations lacking a register allocator have the freedom to ignore the LOCAL keyword and treat such declarations as normal ones without changing the semantics of the program. In order to make room for the LOCAL flag some restructuring had to be done in the TGSI declaration tokens, and afterwards the nv50 compiler was modified to take advantage of local declarations during the register allocation pass.

The possibility of using regular register files for addressing the GLOBAL, LOCAL, CONSTANT and INPUT spaces was explored, but the fact that they require byte-based addressing instead of float4-based addressing led to inconsistencies with the way other register files work. So it seemed more satisfactory either to keep using the explicit resource load/store opcodes for them, or possibly to change all the other register files to byte addressing as well, though the latter seemed considerably more intrusive.
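The mismatch between the two addressing schemes can be seen in a small sketch: every float4 register component maps to a byte offset, but not every byte address maps back to a register component. The helper names are hypothetical and the layout (four packed 32-bit components per register) is an assumption.

```cpp
#include <cassert>
#include <cstddef>

// Byte offset of component `comp` (0..3) of float4 register `reg`,
// assuming four packed 32-bit components per register.
inline std::size_t float4_to_byte(std::size_t reg, unsigned comp) {
    assert(comp < 4);
    return reg * 16 + comp * 4;
}

// Whether a byte address can be named as a (register, component) pair
// under the same assumption.
inline bool has_register_equivalent(std::size_t byte_addr) {
    return byte_addr % 4 == 0;
}
```

Component 1 of register 2 lives at byte 36, while byte 37 has no register equivalent at all, which is the kind of inconsistency the paragraph above refers to.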

The code that handles resource access opcodes in the nv50 compiler was restructured to make room for constant buffer access, formatted surface access, and (at some point) an nvc0 implementation of them.

Access to constant buffers from a compute program using the resource access opcodes was implemented in the nouveau driver. At this point it became obvious that indirect resource indexing would be necessary if OpenCL's constant address space is to be implemented using constant buffers. nv50 hardware doesn't have native support for it though (nvc0 does), so it was worked out using emulation.

Surface loads and stores with formatting (transparent (un)packing, channel conversion, etc.) were implemented. This is something the OpenCL spec requires but nv50 doesn't support, so a lowering pass had to be implemented to emulate it in terms of raw resource access. The possibility of moving the implementation to some sort of built-in shader "standard library" was considered but not completed for lack of time. It's probably still worth doing, because right now each surface opcode generates a considerable amount of machine code.
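To give an idea of what such a lowering has to compute, here is a CPU-side sketch of a formatted load and store for one format, expressed on top of a raw 32-bit word. The choice of an 8-bit-per-channel normalized format and the channel order (first channel in the low byte) are assumptions for illustration; the actual lowering pass works on compiler IR, not C++.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Formatted load: unpack a raw 32-bit word into four floats in [0, 1].
inline std::array<float, 4> unpack_rgba8_unorm(std::uint32_t raw) {
    std::array<float, 4> out;
    for (unsigned c = 0; c < 4; ++c) {
        std::uint32_t bits = (raw >> (8 * c)) & 0xffu;  // extract one channel
        out[c] = static_cast<float>(bits) / 255.0f;     // scale to [0, 1]
    }
    return out;
}

// Formatted store: clamp, scale and pack four floats into a raw word.
inline std::uint32_t pack_rgba8_unorm(const std::array<float, 4> &v) {
    std::uint32_t raw = 0;
    for (unsigned c = 0; c < 4; ++c) {
        assert(v[c] == v[c]);  // NaN input is not handled by this sketch
        float clamped = v[c] < 0.0f ? 0.0f : (v[c] > 1.0f ? 1.0f : v[c]);
        std::uint32_t bits =
            static_cast<std::uint32_t>(clamped * 255.0f + 0.5f);  // round
        raw |= bits << (8 * c);
    }
    return raw;
}
```

Even for this simple format each access expands into several shifts, masks, conversions and multiplies per channel, which hints at why the generated machine code is sizeable.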

Opcodes for work-group barriers and memory fences were defined and implemented in the nv50 driver. Several unit tests were written for them.

Opcodes for a number of atomic operations were defined, implemented, and extensively tested in the nouveau driver. Due to the fact that nv50 only supports arbitrary atomics in global memory and not in work-group shared memory, some of them had to be emulated in terms of "locked" loads (a kind of 1-bit atomic compare-and-swap operation coupled with a normal load/store op).
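The emulation strategy can be approximated on the CPU as follows: a "locked" load takes a 1-bit lock together with the read, and the matching store releases it, so a read-modify-write built from the pair behaves atomically. This is only an analogy using std::atomic with hypothetical names; the real implementation is a sequence of nv50 hardware instructions.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical model of a shared-memory word with its 1-bit lock.
struct LockedWord {
    std::atomic<bool> locked{false};
    std::uint32_t value = 0;
};

// Locked load: try to take the lock and read the value.
// Returns false (leaving `out` untouched) if the lock is already held.
inline bool load_locked(LockedWord &w, std::uint32_t &out) {
    bool expected = false;
    if (!w.locked.compare_exchange_strong(expected, true,
                                          std::memory_order_acquire))
        return false;
    out = w.value;
    return true;
}

// Store and release the lock taken by a successful load_locked().
inline void store_unlock(LockedWord &w, std::uint32_t v) {
    assert(w.locked.load());  // the caller must hold the lock
    w.value = v;
    w.locked.store(false, std::memory_order_release);
}

// Atomic add emulated with the pair above; returns the previous value.
inline std::uint32_t emulated_atomic_add(LockedWord &w, std::uint32_t operand) {
    std::uint32_t old = 0;
    while (!load_locked(w, old)) {
        // Another thread holds the lock: retry.
    }
    store_unlock(w, old + operand);
    return old;
}
```

Other read-modify-write operations (min, max, exchange and so on) would follow the same pattern with a different computation between the locked load and the unlocking store.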

The general structure of the clover state tracker was redesigned and simplified. The intermediate device abstraction layer was removed because the Gallium API is already playing the role of driver interface and there's no point in supporting other kinds of hardware back-ends: the Gallium API seems to be a good fit even for non-GPU devices (e.g. llvmpipe, cell, svga).

Most of the state tracker was rewritten, making extensive use of C++11: range-for loops, variadic templates, lambda expressions, type inference and some of the new STL functionality. This was an immense relief for the code and my mental health, but it could also be argued that it might become an obstacle to portability at some point in the future. As of now it requires gcc-4.6 or newer to build.
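For readers unfamiliar with the C++11 features listed above, the following small sketch shows them together. The functions are hypothetical and unrelated to the actual clover sources.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Variadic template: build a grid description from any number of sizes.
template<typename... Sizes>
std::vector<std::size_t> make_grid(Sizes... sizes) {
    return { static_cast<std::size_t>(sizes)... };
}

// Range-for loop and type inference (auto).
inline std::size_t total_threads(const std::vector<std::size_t> &grid) {
    assert(!grid.empty());  // a grid has at least one dimension
    std::size_t total = 1;
    for (auto n : grid)
        total *= n;
    return total;
}

// Lambda expression passed to a standard algorithm.
inline std::size_t dims_larger_than(const std::vector<std::size_t> &grid,
                                    std::size_t limit) {
    return static_cast<std::size_t>(
        std::count_if(grid.begin(), grid.end(),
                      [limit](std::size_t n) { return n > limit; }));
}
```

A call such as total_threads(make_grid(4, 2, 8)) multiplies the three dimensions together and yields 64.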

Right now the library expects TGSI assembly as input. An LLVM back-end that will take care of the translation to TGSI is being worked on.

There are still many limitations and deficiencies: the parameter checking of most API calls is still quite naive, the implementation of some of the less widely-used APIs is still incomplete, and buffer sharing across different devices doesn't work (though that's going to require special kernel support anyway).