Dysfunctional Programming

Splitting DRM and KMS device nodes

While most devices from the three major x86 desktop GPU vendors have the GPU and display-controller merged on a single card, recent developments (especially on ARM) show that rendering (via GPU) and mode-setting (via display-controller) are not necessarily bound to the same device. To better support such devices, several changes are being worked on for DRM.

In its current form, the DRM subsystem provides one general-purpose device-node for each registered DRM device: /dev/dri/card<num>. An additional control-node is also created, but it remains unused as of this writing. While in general a kernel driver is allowed to register multiple DRM devices for a single physical device, no driver has made use of this yet. That means, whatever hardware you use, both mode-setting and rendering are done via the same device node. This entails some rather serious consequences:

Access-management to mode-setting and rendering is done via the same file-system node

Mode-setting resources of a single card cannot be split among multiple graphics-servers

Sharing display-controllers between cards is rather complicated

In the following sections, I want to look closer at each of these points and describe what has been done and what is still planned to overcome these restrictions. This is a highly technical description of the changes and serves as an outline for the Linux-Plumbers session on this topic. I expect the reader to be familiar with DRM internals.

1) Render-nodes

While render-nodes have been discussed on dri-devel since 2009, several mmap-related security issues have prevented them from being merged. Those have all been fixed, and three days ago the basic render-node infrastructure was merged. While it’s still marked as experimental and hidden behind the drm.rnodes module parameter, I’m confident we will enable it by default in one of the next kernel releases.

What are render-nodes?

From a user-space perspective, render-nodes are “like a big FPU” (krh) that can be used by applications to speed up computations and rendering. They are accessible via /dev/dri/renderD<num> and provide the basic DRM rendering interface. Compared to the old card<num> nodes, they lack some features:

No mode-setting (KMS) ioctls allowed

No insecure gem-flink allowed (use dma-buf instead!)

No DRM-auth required/supported

No legacy pre-KMS DRM-API supported

So whenever an application wants hardware-accelerated rendering, GPGPU access or offscreen-rendering, it no longer needs to ask a graphics-server (via DRI or wl_drm) but can instead open any available render node and start using it. Access-control to render-nodes is done via standard file-system modes. It’s no longer shared with mode-setting resources and thus can be provided for less-privileged applications.

It is important to note that render-nodes do not provide any new APIs. Instead, they just split a subset of the already available DRM-API off to a new device-node. The legacy node is not changed but kept for backwards-compatibility (and, obviously, for mode-setting).
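To illustrate the restrictions listed above, here is a toy model of a per-ioctl permission check, written in Python for brevity. It is only a sketch: the flag names follow the kernel’s DRM_AUTH, DRM_MASTER, DRM_ROOT_ONLY and DRM_RENDER_ALLOW, but the ioctl table (including the made-up DRIVER_SUBMIT entry) and the permitted() helper are assumptions for this example, not the actual kernel code.

```python
from enum import Flag, auto


class IoctlFlag(Flag):
    """Per-ioctl permission flags, named after the kernel's DRM_* flags."""
    NONE = 0
    AUTH = auto()          # caller must be DRM-authenticated
    MASTER = auto()        # caller must be DRM-Master
    ROOT_ONLY = auto()     # caller needs CAP_SYS_ADMIN
    RENDER_ALLOW = auto()  # ioctl is also exposed on render-nodes


# Illustrative table only; names and flag sets are assumptions for this sketch.
IOCTLS = {
    "GEM_FLINK": IoctlFlag.AUTH,
    "PRIME_HANDLE_TO_FD": IoctlFlag.AUTH | IoctlFlag.RENDER_ALLOW,
    "MODE_SETCRTC": IoctlFlag.MASTER,
    "DRIVER_SUBMIT": IoctlFlag.AUTH | IoctlFlag.RENDER_ALLOW,  # hypothetical driver ioctl
}


def permitted(name, *, render_node, authenticated=False, master=False, admin=False):
    """Return True if the named ioctl may be issued on this kind of open file."""
    flags = IOCTLS[name]
    if IoctlFlag.ROOT_ONLY in flags and not admin:
        return False
    # Render clients skip DRM-auth: file-system modes already gate access.
    if IoctlFlag.AUTH in flags and not (render_node or authenticated):
        return False
    if IoctlFlag.MASTER in flags and not master:
        return False
    # On a render-node, anything not explicitly allowed is rejected.
    if render_node and IoctlFlag.RENDER_ALLOW not in flags:
        return False
    return True
```

In this model, rendering and dma-buf export work on a render-node without any authentication, while gem-flink and mode-setting are rejected there regardless of the caller’s privileges.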

It’s also important to know that render-nodes are not bound to a specific card. While internally they’re created by the same driver as the legacy node, user-space should never assume any connection between a render-node and a legacy/mode-setting node. Instead, if user-space requires hardware-acceleration, it should open any node and use it. For communication back to the graphics-server, dma-buf shall be used. Really! Questions like “how do I find the render-node for a given card?” don’t make any sense. Yes, driver-specific user-space can figure out whether and which render-node was created by which driver, but driver-independent user-space should never do that! Depending on your use-cases, either open any render-node you want (maybe allow an environment-variable to select it) or let the graphics-server do that for you and pass the FD via your graphics-API (X11, wayland, …).
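A minimal sketch of the “open any render-node, maybe allow an environment-variable to select it” approach, in Python. The variable name MYAPP_RENDER_NODE and both helper functions are hypothetical; only the /dev/dri/renderD<num> naming comes from the description above.

```python
import os
import re

_RENDER_NAME = re.compile(r"^renderD(\d+)$")


def render_nodes(names):
    """Filter directory entries down to render-nodes, ordered by node number."""
    found = []
    for name in names:
        match = _RENDER_NAME.match(name)
        if match:
            found.append((int(match.group(1)), name))
    return [name for _, name in sorted(found)]


def pick_render_node(dri_dir="/dev/dri", env=None, names=None):
    """Return the path of a render-node to use, or None if there is none.

    An explicit override from the environment wins; otherwise the
    lowest-numbered render-node in dri_dir is taken.
    """
    env = os.environ if env is None else env
    override = env.get("MYAPP_RENDER_NODE")  # hypothetical variable name
    if override:
        return override
    if names is None:
        try:
            names = os.listdir(dri_dir)
        except OSError:
            names = []  # no DRM devices at all
    nodes = render_nodes(names)
    return os.path.join(dri_dir, nodes[0]) if nodes else None
```

An application would then call os.open(path, os.O_RDWR) on the returned path and talk to the driver through that FD. Note that the sketch never looks at card<num> nodes to decide which render-node to pick.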

So with render-nodes, kernel drivers can now provide an interface only for off-screen rendering and GPGPU work. Devices without any display-controller can avoid any mode-setting nodes and just provide a render-node. User-space, on the other hand, can finally use GPUs without requiring any privileged graphics-server running. They’re independent of the kernel-internal DRM-Master concept!

2) Mode-setting nodes

While splitting off render-nodes from the legacy node simplifies the situation for most applications, we didn’t simplify it for mode-setting applications. Currently, if a graphics-server wants to program a display-controller, it needs to be DRM-Master for the given card. It can acquire it via drmSetMaster() and drop it via drmDropMaster(). But only one application can be DRM-Master at a time. Moreover, only applications with CAP_SYS_ADMIN privileges can acquire DRM-Master. This prevents some quite fancy features:

Running an XServer without root-privileges

Using two different XServers to control two independent monitors/connectors of the same card

The initial idea (and Ilija Hadzic’s follow-up) to support this were mode-setting nodes. A privileged ioctl on the control-node would allow applications to split mode-setting resources across different device-nodes. You could have /dev/dri/modesetD1 and /dev/dri/modesetD2 to split your KMS CRTC and Connector resources. An XServer could use one of these nodes to program the now reduced set of resources. We would have one DRM-Master per node and we’d be fine. We could remove the CAP_SYS_ADMIN restriction and instead rely on file-system access-modes to control access to KMS resources.

Another idea that was discussed, to avoid creating a bunch of file-system nodes, is to allocate these resources on-the-fly. All mode-setting resources would now be bound to a DRM-Master object. An application can only access the resources available on the DRM-Master that it is assigned to. Initially, all resources are bound to the default DRM-Master as usual, which everyone gets assigned to when opening a legacy node. A new ioctl, DRM_CLONE_MASTER, is used to create a new DRM-Master with the same resources as the previous DRM-Master of an application. Via a DRM_DROP_MASTER_RESOURCE ioctl, an application can drop KMS resources from its DRM-Master object. Due to their design, neither requires a CAP_SYS_ADMIN restriction, as they only clone or drop privileges; they never acquire new privs! So they can be used by any application with access to the control node to create two new DRM-Master objects and pass them to two independent XServers. These use the passed FD to access the card, instead of opening the legacy or mode-setting nodes.
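The clone-and-drop idea can be pictured with a small toy model in Python. This is purely illustrative: the Master class and its methods are invented for this sketch and only mimic what the proposed DRM_CLONE_MASTER and DRM_DROP_MASTER_RESOURCE ioctls would do, and the resource names are made up.

```python
class Master:
    """Toy stand-in for a DRM-Master object owning a set of KMS resources."""

    def __init__(self, resources):
        self._resources = frozenset(resources)

    @property
    def resources(self):
        return self._resources

    def clone(self):
        """Like DRM_CLONE_MASTER: a new master with the same resources."""
        return Master(self._resources)

    def drop(self, resource):
        """Like DRM_DROP_MASTER_RESOURCE: give up one resource for good."""
        if resource not in self._resources:
            raise KeyError(resource)
        self._resources = self._resources - {resource}

    def can_access(self, resource):
        return resource in self._resources


# The default master owns everything on the card (names are made up).
default = Master({"crtc0", "crtc1", "connector0", "connector1"})

# A session manager clones it twice and trims each clone for one XServer.
first = default.clone()
first.drop("crtc1")
first.drop("connector1")

second = default.clone()
second.drop("crtc0")
second.drop("connector0")
```

There is no operation that adds a resource to a master, so a clone can never end up with more than the master it was cloned from. That is the property that makes the CAP_SYS_ADMIN check unnecessary.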

From the kernel side, the only thing that changes is that we can have multiple active DRM-Master objects. In fact, per DRM-Master one open-file might be allowed KMS access. However, this doesn’t require any driver-modifications (which were mostly “master-agnostic”, anyway) and only a few core DRM changes (except for vmwgfx-ttm-lock..).

3) DRM infrastructure

The previous two chapters focused on user-space APIs, but we also want the kernel-internal infrastructure to account for split hardware. However, the fact is we already have everything we need. If some hardware exists without a display-controller, you simply omit the DRIVER_MODESET flag and only set DRIVER_RENDER. DRM core will only create a render-node for this device then. If your hardware only provides a display-controller, but no real rendering hardware, you simply set DRIVER_MODESET but omit DRIVER_RENDER (which is what SimpleDRM is doing).

Yes, you currently get a bunch of unused DRM code compiled-in if you don’t use some features. However, this is not because DRM requires it, but only because no-one sent any patches for it, yet! DRM-core is driven by DRM-driver developers!

There is a reason why mid-layers are frowned upon in DRM land. There is no group of core DRM developers, but rather a bunch of driver-authors who write fancy driver-extensions. And once multiple drivers use them, they factor it out and move it to DRM core. So don’t complain about missing DRM features, but rather extend your drivers. If it’s a nice feature, you can count on it being incorporated into DRM-core at some point. It might be you doing most of the work, though!

With custom heuristics. There is currently no notion of “speed” in the DRM API, but afair Ian was implementing an OpenGL extension to give some useful information to the user. So you could just open all render-nodes, see what they provide and then use them.

On the other hand, on systems with multiple GPUs, vga-switcheroo should just completely disable the unselected GPU so you only see one render-node (or a vga-switcheroo flag marking it as inactive). So a user has to explicitly select the active GPU.

Note that this isn’t a regression – it’s a case of not making things better instead.

Today, if you have multiple GPUs, you need to use custom heuristics to find the one that meets your needs, then ask the existing DRM master on that GPU to give you permission to use it.

In the render-nodes world, you use the same heuristics to find the GPU, but instead of then tracking down the right DRM master, and getting consent to use it, you just go ahead, open the rnode, and get on with your task.

This plays nicely with OpenCL calls already, BTW. You’d walk the platforms returned by clGetPlatformIDs to get a list of platforms; on render-node platforms, you’d use clGetDeviceIDs to find all the render-nodes a platform supports, and clGetDeviceInfo to find the “best” render node for your needs, based on abstract OpenCL properties (like parallel compute count, max job size etc). When you call clCreateContext, you point to a specific device, and that call opens the right render-node for you.
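As a rough sketch of that last selection step, in Python: assume the values that clGetDeviceInfo would report (CL_DEVICE_MAX_COMPUTE_UNITS and CL_DEVICE_MAX_WORK_GROUP_SIZE) have already been collected into plain dictionaries. The device entries and the scoring rule below are made up for illustration; a real application would query them through an OpenCL binding and probably weigh them differently.

```python
def best_device(devices):
    """Pick the device with the most compute units; ties go to the larger work-group size."""
    if not devices:
        return None
    return max(
        devices,
        key=lambda d: (d["max_compute_units"], d["max_work_group_size"]),
    )


# Hypothetical query results for two devices.
devices = [
    {"name": "integrated", "max_compute_units": 16, "max_work_group_size": 512},
    {"name": "discrete", "max_compute_units": 32, "max_work_group_size": 256},
]

chosen = best_device(devices)
```

The chosen entry is what would then be handed to clCreateContext, which opens the matching render-node behind the scenes.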

The thing is, we need to know which information to provide to user-space for them to judge which GPU to use. And I doubt this information can be provided independent of OpenGL. So yeah, similar to your OpenCL example, for OpenGL you would simply look for all available render-nodes, retrieve information via some GL extensions and then use the one you think fits best.
But I am probably not the right person to speak to. Everything mesa/GL related is better discussed on mesa-devel@fdo. If it turns out we need additional kernel interfaces, I will be glad to work on it, though.

This really is a huge failure to make a step forward. At the same time the D3D/DXGI folks are looking to solve this issue, the Free drivers should be doing the same rather than needing to play catch-up yet again in 4-5 years. The use case I’m most interested in is the case of an integrated GPGPU along with a discrete GPU and gaming. At the end of the day, gaming graphics will want to use the _entire_ discrete GPU. This is one reason that GPGPU physics haven’t become a must-have feature; nobody wants to give physics much time on the GPU at all. If instead the physics, AI, and so on could be accelerated by the integrated GPU while the discrete GPU is left all to graphics, that would be ideal. Yes, there are some heuristics to figure out, but this is not even remotely a hard problem to solve compared to all the architectural coding work going on. Just let a driver set some flags based on its knowledge of the hardware’s purposes. HIGH_EFFICIENCY, HIGH_SPEED, etc. An app should then be able to query for a render node based on which API and version it wants, whether it wants HIGH_SPEED or not, etc. With a flexible query API (akin to fbconfig queries) this could be extended in time to allow more query flags or even discrete levels for a particular flag. The game can query which GPU it wants for graphics and then take whichever other card it likes for physics. vga-switcheroo would among other things simply disable the discrete GPU during queries so games will always end up with the power-efficient integrated GPU during battery time.

Yeah, it seems kind of useless not being able to select which hardware device your code runs on. If we had a scheduler that could move gpu tasks between gpus, I wouldn’t mind and wouldn’t care, but right now I’d at least want to know if the program is running on nVidia or intel (or, god forbid, amd).

This is very interesting. I’m a little put-off by (2) though. I imagine rendering off-screen on a GPU for various things (HD Video, 3D games) then displaying the video stream on one of several display seats connected via USB 3.0 video adapter (Amazon example: http://goo.gl/YSpiAz). I’m admittedly a layman, but I’ve taken a lot of interest in your work since first encountering KMSCON last year, and am excited about the possibilities presented here, but am unsure if I’m interpreting them correctly.

Perhaps I’m alone in this, but (2) – especially paragraph (3) – confusingly seemed to indicate that though capabilities like those I hope for – such as farming a powerful gpu as renderer for separately and far-less capably gpu-driven displays – were discussed as possibilities, and were even planned implementations at one point, for whatever reason such ideas were set aside. Probably I’m just (for the second time) overlooking some point too technical for me to understand, but when you say “initial idea” and follow with “another idea discussed” I take it to mean that both were superseded by a third option that somehow obviates or negates the features you describe of the first two.

Then again, the language you use in paragraph (3) of (2) does seem to be a little too positively promotional, though, for an idea which you eventually chose not to implement after all. This is (largely) your project after all, and, from what I can tell, it is definitely praiseworthy even if it is lacking those features you describe in paragraph (3), so I find it strange that you would seemingly promote an idea which you eventually rule out for reasons not obvious to me. And so, as I hope, and as your reply to me seems to indicate, you did in fact successfully abstract the hardware interfaces to this degree and gpus should now all basically support native multiplexing?

If true, it solves a lot of problems and opens a lot of doors. Even if not, I suspect you’ve gone a long way toward accomplishing this if you’re able to detail the mechanism so well. I imagine a virtual machine – even Windows – happily allocating pooled 3D render resources for whatever purpose it requires and politely surrendering them back to the stack upon task completion or as otherwise directed by the “DRM Master” as necessary alongside concurrent processes happily sharing alike and socketing pre-rendered video streams to minimally equipped displays. I imagine very attainable gpu over-commits and stacked render-node queues – all on consumer grade hardware (provided, of course, practical use of available open-source drivers). Is this what you’ve done, or at least, is this the sort of application you’ve made possible?

If so, then either I am missing something fundamental or other commenters here are. It seems some here object to the display/render division on the grounds that pooling of the render resources in multi-gpu environments will result in priority tasks being performed on sub-par hardware. This is strange to me; shouldn’t it instead result in those tasks having access to more resources overall? Surely, even if it hasn’t yet been implemented, such a system will inevitably require a scheduler and a means of render prioritization, I expect, just as smp cpus necessitated the current Linux kernel scheduling system.

I really like where this is going, and I think I really like where you’ve already taken it (though I’m still unclear there). If I’m right in my understanding, though, what I like best about this is that it seems to be device agnostic – as the kernel smp scheduler became – and I think this could broaden the horizon immensely for the state of Linux graphics.

All that said, I’m usually wrong about this stuff, so I’d like to know: what have I got wrong this time? -Mike

All that is not really related to my work. GPU contexts are implemented by hardware drivers and already have been working for some time. All I did is split the API (and dev-nodes) so you can use the GPU independently of the display-controller (I didn’t even change the API). Your conclusions are mostly correct, but it’s not the work I have done.

Yes, with modern GPUs we are reaching a point where they resemble CPUs quite a lot. Generic job-scheduling would be nice, but is currently done on a per-driver basis (same for resource-management). If you want more information on that, I recommend talking to driver developers (i915, radeon, nouveau, .. or #dri-devel on freenode).

Yes, since writing that I’ve read some of the relevant mailing list archives and put a lot of that together myself. Still, job well done David. Right now I’m trying to puzzle my way through an X–free (funny that has such a different meaning now than it did only a few years ago) Wayland install on Arch Linux running KMScon, Chromium (–Ozone), and hopefully little else. I’ve got SystemD in initramfs, KMScon successfully installed and (almost) configured, but I keep running into seat–related permission issues when attempting to load Weston because, as I suppose, I haven’t correctly set up the DRMMaster handoff. Well, back to it I guess.

“X-Free”.. nice.
Regarding weston, you must run it as root. Or use weston-launch (but I recommend running it directly as root). If you have systemd-207 you can run weston-git with systemd as normal user (code merged yesterday into weston).

Oh nice. I’d seen your chain of recent commits regarding seats while tree–browsing the day before yesterday for SystemD and hoped it would follow in soon. Since then I’ve mostly been in the manpages. Seriously, SystemD is friggin complicated. Not that it’s a bad thing – I definitely appreciate the security implications for PID1 – but wow. man systemd.index is mind-blowing. Isn’t it only a couple years old now? I hope to harness it and nspawn to do something like bedrock linux but hopefully a little less… brittle.

>Instead, if user-space requires hardware-acceleration, it should open any node and use it.

How does a program figure out whether a node is actually free to use? Iterate over all renderD* nodes? Should we be having some special device node that just returns the next free render, something like ptmx?

If there are multiple render nodes and multiple controllers, isn’t there a bandwidth issue when some nodes are “further” from some controllers? It sounds like under this system, given a box with two full plug-in graphics cards, it’d be easy to make the mistake of drawing on one card and then having to send it over the bus to go out a display port on the other card, unnecessarily eating PCI bandwidth in the process.

This is an independent problem. This happens regardless of whether you use render-nodes or not (it just isn’t as obvious, anymore). The way to solve it is to negotiate a shared memory region which is best for both devices (there are ideas for dma-buf to support this, but still not done). Obviously, this doesn’t help if there is no fast shared memory. But in this case, it is up to the client to resolve this problem. If we have multiple GPUs, it is up to the client to either choose the one “close” to the scanout-hardware, or use a remote one which _may_ be faster.

Luckily, multi-GPU systems are normally designed in such a way that scanout is possible from VMEM and sysmem, so you shouldn’t run into this problem in real systems. But your point is valid for other use-cases than scanout. Anyhow, that’s an issue of “device selection”, not “device access”. Render-nodes solve problems related to device-access; for device-selection you’re still left to the same hackish stuff as before. Feedback always appreciated! (especially with real world scenarios)

I assume that this technique works on the Intel integrated GPU. However, it seems to fail on the AMD ARUBA integrated GPU under Fedora 21 alpha.
I can create a buffer for egl rendering and scanout from /dev/dri/card0 using gbm. However, when I use gbm to create a buffer from a render node, then it is impossible to scanout the buffer (using drmModeAddFB). Furthermore, using gbm_bo_get_fd / gbm_bo_import to attempt a dma-buf transfer from a render node buffer to a card0 buffer compiles and runs, but produces garbage on scanout.
Does this problem arise in the hardware, the gallium/r600 drivers, the gbm libraries or more generally?

On render-nodes, the DUMB-buffer interface is disabled. That means drivers are responsible for creating suitable buffers for scanout. If gbm cannot create scanout buffers on render-nodes, it’s a bug in the mesa code of the respective driver. Usually, the problem is a wrong tiling mode. The same is true for dma-buf sharing. Especially radeon cards use tiling heavily, so it must really be configured correctly.
But maybe we should just enable the DUMB-buffer interface on render-nodes..

… typically shows a static screen formed from a mixture of previously rendered images (with the mixture coming from modeset changes). I would expect a tiling problem to produce a garbled screen that changed over time. What I get looks like a failure to transfer anything from the render buffer to the scanout buffer.
I think that this looks like a combination of memory management issues. Do you have any test results for your code on non-Intel or non-integrated graphics?

Your first code-example doesn’t make any sense. You cannot use a gbm buffer from one device on the other. But you probably noticed that yourself.

Your second example is how it should work. But as I said, using gbm_surface_create() on a render node does not create buffers suitable for scan-out. I’d expect drmModeAddFB() to fail, in case the buffer is unsuited for scan-out. However, given that this is cross-device, the importing driver might not have enough information. There might be a lot of problems involved. For instance, the scan-out hardware might not even be able to access the memory backing the buffer. Tiling-problems are one of many possible errors that might occur. Cache-coherency, GTT mappings, and a lot more might go wrong.

DRM still lacks a cross-device buffer allocation technique. And given the huge differences between hardware, cross-device sharing is still not recommended.

This rather limits the practical uses of render nodes on AMD hardware. The gallium drivers do not provide EGL configs for rendering onto EGLImages when the EGLDisplay is created from a render node. The drivers do allow rendering into window surfaces, but this rendering goes into buffers that can not be scanned out, even using the same device driver on the same physical device and the same physical memory.
Can the nouveau drivers render to EGLImages using render nodes?
I am raising these questions to sound a note of caution. There are problems in sharing resources that can not be addressed using the naive code that I presented above. These problems need to be emphasized or developers will lose track of the hardware.

Seriously, talk to the mesa/drivers authors about this. This is driver dependent and can be easily fixed by proper allocation parameters.

What I recommend doing: The graphics server allocates scan-out buffers on the primary node, exports them and then imports them into the render node it wants to work on. Clients allocate buffers on the render-node and pass them to the graphics-server, which then performs compositing into the scan-out buffer.

Obviously, you can skip the export/import step on the graphics-server as the primary node allows rendering, too.

Furthermore, if the client uses a render-node that is attached to another GPU than the server, buffer sharing may fail! Cross-GPU sharing is still being worked on and is a driver issue, not a render-node issue.

If the client wants to render directly into a scan-out buffer so you can skip the compositing-step, the client *must* know how to allocate such buffers. Same is true if the client renders on a GPU other than the one used for compositing. Therefore, the graphics-server _has_ to tell the client what kind of buffers to allocate. This is, again, a driver issue (buffer allocation is *not* generic).

The only generic buffer-allocation is software-rendering (buffer is in main-memory). DUMB-buffers are *not* meant as a generic buffer allocation technique. They only provide _fallback_ scan-out buffers for DRM software renderers (as main-mem buffers might not be available for scan-out).

Yes, the buffer-allocation is a mess, but so far no-one found generic descriptions for tiling, accessibility and format. Feel free to propose one. So far, buffers allocated on a node should only be used on nodes of the same GPU.