Leveraging the GPU to accelerate the Linux kernel

Powerful graphics cards are pretty affordable these days. Even though we rarely do high-end gaming on our daily machine we still have a GeForce 9800 GT. That goes to waste on a machine used mainly to publish posts and write code for microcontrollers. But perhaps we can put the GPU to good use when it comes compile time. The KGPU package enlists your graphics card to help the kernel do some heavy lifting.

This won’t work for just any GPU. The technique uses CUDA, which is a parallel computing package for NVIDIA hardware. But don’t let lack of hardware keep you from checking it out. [Weibin Sun] is one of the researchers behind the technique. He posted a whitepaper (PDF) on the topic over at his website.

I have been using nVidia’s binary drivers myself ever since I bought myself an MX200. In the very beginning there were a couple of rough patches but overall I have had very few issues with nVidia’s binary driver’s stability. ATI is a mess.

I’ve been through nvidia driver hell more than once, for years, but the same for ATI though, they both are very experimental and beta, both in software as well as hardware, but that’s the name of the game, but nvidia said they would embrace OpenCL (and directcompute but that’s a windows thing), and seeing OpenCL is the open standard that belongs to the open source linux it’s annoying if they try to get the CUDA nonsense (which they at some point said they’d get rid of even) on linux. It’s like they would encourage people to make DirectX stuff for linux.. you’d also be WTF??!?

Right. At least their linux drivers aren’t discontinued. And of course we all know how usable the open source alternatives are. In my view, nvidia is currently the only way to go on linux if you want decent 3d and/or gpu acceleration.

Which series? There are quite a number of them. Going for cheap n good you can get high end GT series cards with CUDA 2.1 for 60 dollars. What is ‘useful work’? What do you want to do? What can you do now? Have you integrated with CUDA systems before? All the cards will improve your power, by how much varies based on, well, everything.

Any series. I thought it might be useful to practice programming a GPU. I have no project in mind, other than learning. I left the definition of useful work up to people who might answer. I have not interfaced with Cuda before. I have stayed away from nvidia cards for years, since nvidia would not produce open drivers. I don’t play games(unless learning is a game), Intel and ATI video has done everythingg I needed. I’ll probably start with the author’s encrypted filesystem drivers using GPUs and some benchmarking of code to factor integers.
Based on fluffers comment, I ordered a GT 610 based card today from Amazon.

I have a 9800 GT myself – its best feature is that its passive [no fan noise]. To reach CUDA 2.0 or higher, but stay passive, I’m looking at the ZOTAC GT 640. Lousy for fancy games, fine for a little CUDA powered processing oomph.

I’ve gotten significant improvement (orders of magnitude) porting personal projects to CUDA on the GeForce 8400. It’s a different computing architecture, and with the proper work load, 8 cuda cores can beat the crap out of a single x86 core.

you can get a brand new GTX640 for about £75 which will support CUDA 2.1. I run a GTX460 which is a hair over £100 normally, it also runs 2.1. I have no idea which is the more powerful card, the x60 models are intended for games whereas the x40 models are intended for media but then the 4xx is older than the 5xx or 6xx so it isn’t unreasonable to suggest that it is slower (mine is factory overclocked though and I am thinking about cranking it further).

It seems the 610 also supports CUDA 2.1 so that might also be a cheap option.

As for useful work, define useful…. CUDA will be faster than CPU any day but the gaming performance of the cheap cards will barely beat using the old motherboard/CPU graphics. The 640 isn’t a bad card, not intended for gaming but I imagine most things will at least run on it. The 460 although old copes very nicely with newer games, I run skyrim with most settings maxed (a few knocked a notch or 2 down) and with Fraps being used to test framerate over a 1 hour period I got a minimum framerate of 59 and a max of 61 so nicely V-Synced as it should be.

Just bare in mind: laptops cannot have their GPU’s upgraded and thats not an old hackaday overcomeable limitation of laptops (exception to the rule, thunderbolt ports now have external PCIe x8 ports available for them, many compatible with running GPU’s). Also keep note of what your computers power supply can kick out as it may need upgrading before a GPU upgrade.

1)GPUs are not energy-efficient. Not even remotely. Intel is cranking out server CPUs with boatloads of cores that use less than 50W. A GTX480 uses upwards of 200W-250W.

2)It has long been the stance of many Linux folks that things like network and crypto accelerators are a waste of money – that the money is better spent on a slightly faster CPU or better overall architecture – because when you’re not using the extra capacity for network or crypto (for example), it’s available for everything else to go a bit faster.

A pretty nice graphics card is about $200. That’s more than the difference between the bottom end and top end Intel desktop CPUs on Newegg…

Except that, with the right problem, A GPU performs *orders of magnitude* faster than a CPU — I don’t know what definition of efficiency you use, but 5x power draw for 10x, 100x, even 1000x performance seems to fit the definition I know.

On your second point, modern GPUs are not the same as crypto or network accelerators because they stopped being single-purpose circuits many years ago. In point of fact, GPUs can be used to accelerate both crypto and network functions, audio processing, image processing — any application where you do the same basic operation over a large amount of data — think the map portion of map-reduce.

Spending $200 on a GPU gives you something on the order of 1 or 2 *thousand* simple CPU cores, plus a gig or two of high-bandwidth RAM, and the memory architecture to make it all work nicely. Spending $200 on a CPU upgrade buys you a few hundred Mhz and *maybe* another couple cores, depending where your baseline CPU is.

Its very simple really, a CPU is optimized for low-latency processing — to give you a single result as quickly as it can. GPUs are optimized for high throughput processing — to give you many results reasonably fast, essentially by utilizing economies of scale at a silicon and software level (which, to achieve you sacrifice some ISA capability, though that will change in the next round or two of GPUs.)

Finally, your point falls flat because many people, indeed most, already have a GPU that’s capable of performing compute functions, and that number is only going to grow as even integrated GPUs provide some pretty non-trivial grunt. Integrated GPUs in some cases even perform better than either CPUs *or* high-end GPUs because the problem isn’t bottle-necked by the PCIe bus.

Honestly, if you went out and bought/built a computer in the last 3 years, you literally have to go out of your way to come out with a system that’s not capable of any GPU computing.

Cuda doesnt run all of your kernels in parallel, instead it glues them into bundles called “warps” and runs them in series :/ You need a whole lot of tehm to get decent performance. Oh , did I mention how branches make whole “warps” invalid? if you have a branch inside a warp GPU will run WHOLE “warp” twice per every branch.

Where did you get the idea that CUDA runs all the warps in series? Cuda bundles the kernels into warps, yes, and then the bundles within a warp run in lockstep on a set of cores — but that’s only using 32 cores. The cores are massively hardware-multithreaded (like hyperthreading but much more so!), so several warps are running in parallel on that set of 32 cores, and there are dozens and dozens of sets of cores — so you can easily have a couple of hundred warps running in parallel.

This, of course, is why you need to create so many parallel kernels at the same time to get decent performance.

Also, if you have a branch inside a warp, CUDA will NOT run the whole warp twice; that wasn’t true even on the early versions. All it does is run both sides of the branch, in series (but it doesn’t run code outside the branch twice), and it only even does that if some threads in the warp take one side and some the other. If all the threads take the same side of the branch, there’s no slowdown, and if the branched code is small, it’s only a minor issue. The only way it will effectively “run the whole warp twice” is if you’ve got the entire kernel wrapped in a big if statement and some threads in the warp take the “true” branch and some take the “false” branch.

This isnt a very accurate post. Firstly, the GPU has ~10x greater memory throughput than a single socket CPU. Secondly, submitting literally thousands of threads to the CUDA cores is a mechanism to mask latency while threads are busy waiting for memory to move around. It’s not a very serial operation. If you want fast serial code, use x86. Which brings me to your complaint about warp branching…

On x86 one of the heuristics to improve serial code execution performance is something called “speculative execution”. When a processor sees a branch, it speculates on the path taken, and continues executing code along that path before fully deciding which path to take! The purpose of this is to keep the pipeline full while waiting for a potentially lengthy conditional statement. It’s part of the reason an x86 processor uses more power than a simple embedded processor. On some RISC processors, this latency hiding was accomplished with a “delay slot”. So thirdly, the GPU uses the immense hardware parallelism to execute both sides of a branch, and pick the one that is correct after the conditional. There is a limit to the depth which can be done, just as with speculative execution.

However, latency to send data back and forth from the GPU and CPU is one of the big bottlenecks that makes me scratch my head about this. Ideally you would have some sort of task that you want to run over a large chunk of data before returning, such as crypto.

CUDA is actually a very nice framework and toolchain, compared to other tools for high performance massively parallel architectures. Take a look at programming for the Cell architecture. Without even optimizing for branch prediction, it is fairly easy to port data parallel C code to CUDA and achieve 10x or more speedup. I would say that your complaints towards a GPU architecture applies even more so for x86, which is by far uglier, with an instruction set to legitimately have nightmares over…

cptfalcon: “So thirdly, the GPU uses the immense hardware parallelism to execute both sides of a branch, and pick the one that is correct after the conditional.”

No; this is not speculative execution, nor “immense hardware parallelism”, nor anything relating to hiding latency. There is no speculation involved. It’s much more like lane-masking on SIMD vector operations.

What’s happening is that, for the 32-“core” warp, there is only one instruction decoder, and so all 32 cores must be executing the same instruction at the same time. Thus, if there is a branch where some cores go one way and some go the other, the whole set needs to execute the instructions for both branches. However, the instructions from each branch are “masked” so that, when they are executed on a core that supposed to be taking the other branch, they don’t write their results into the registers. This gives the effect of one set of cores taking one branch and the others taking the other branch — but a key point is that the mask is known before the instruction executes.

There are no direct performance advantages for this, unlike with speculative execution on CPUs. The performance advantage is indirect: If you only have one instruction scheduler for every 32 compute units, you can put many more compute units in the same size-and-power budget.

The likely hood of most people running higher end GPU’s and using Linux enough to warrant kernel compilation is rather slim as gaming on Linux while fully achievable is a head ache at best. That being said I’m tinkering with facial recognition and CUDA using the Linux platform.

I don’t think so. I have a GTX570 and I even write kernel code. And I know more people that do too. I think that the picture that kernel developers only use intel cards is some sort of a stereotype.
And then there are tons of people who use Gentoo and derivatives. I think most of us compile our kernels. I think most people multiboot for gaming. Using windows for anything besides gaming is a headache.

I was coming from the average computer tinkerer deciding to see the performance differences in CPU vs GPU (aka me). Now, I’m no fan boy of either OS but I do use windows as my main platform as I do a lot of audio/visual work and I really do like coding in C#.

Unfortunately my main laptop has switchable graphics and there is just no hope of ever getting it to work effectively on Linux :(. I have done some kernel mods and looked into kernel coding but I have never needed to reinvent the wheel for most of my needs.

A lot of my friends and majority of my work peers use lower end GPU’s in the there Linux based systems because they feel that it is sufficient for there needs. Noise is also a big factor with GPU’s.

I’ve been waiting for this to happen for a while now. I can’t wait for someone to implement this with radeon cards as well. I believe this is the future of software RAID, crypto and many other application not yet imagined as well.

No. Technically if you are doing a number of small parallel threads the GPU will be much, much faster. RAID stands for Redundant Array of Independent Disks. Meaning all RAID does is increase storage reliability or storage speed. RAID (disk I/O) and processing speed are not related. If your program is not multiprocessing aware it will only run on one of the many computational units inside a GPU, just like if it was only running on your CPU. If you want faster software RAID then buy many of the latest SATA solid state drives and set them up for striping. CUDA will do nothing for RAID performance just as putting a more powerful engine in your car wont make the suspension better.

I detect a little newspeak in your comment, “RAID stands for Redundant Array of Independent Disks.” The I in RAID is supposed to stand for inexpensive. I can see how independent would suit some better though.

My guess is that there’s not much in the kernel that would be worth accelerating — because GPUs are designed for high data throughput at the expense of latency, and nearly everything in an operating system kernel needs to have very low latency and doesn’t involve a lot of data.

I did recently see an interesting research paper that was using GPU acceleration to implement automated cache prefetch, but it involved using something closer to an AMD Fusion core with the GPU on the same chip as the CPU, and I think it also involved some specialized hardware to make it really work properly. And even then it wasn’t clear to me that it was really doing anything that wasn’t just “let’s power up these GPUs that aren’t running at the moment and use lots of energy for a marginal speedup of our CPU code.”

Pretty awesome. I wonder how resource management works though? For example what happens when many applications (and the kernel) ask for more resources from the GPU than it can provide? Will their compute time get prioritized fairly? Or will new requests just fail outright?

The GPU kernel driver is responsible for marshalling all of those requests into a single execution queue that gets fed to the GPU, so the answer to most of those questions depends on how the driver is written. However, there’s no task pre-emption once things are actually running on the GPU (with current technology), so once a task is running, it will run to completion unless completely cancelled; there’s nothing the driver can do about it.

Since it’s a queue rather than a “do this now” programming model, there’s no real sense of “more resources from the GPU than it can provide”; the queue just keeps getting longer and tasks wait longer to execute. I assume at some point you get an error when the driver runs out of queue space, but that may only happen when the whole machine is out of memory.

what would be nice is a modified version of the Linux kernel that support AMD GPU processing automatically, not all of us are lucky to have a Nvidia video chip-sets on our laptops ! So it would be nice to see some AMD GPU programming :P