AMD wants to talk about Heterogeneous Systems Architecture (HSA), its vision for the future of system architectures. To that end, it held a press conference last week to discuss what it's calling "heterogeneous Uniform Memory Access" (hUMA). The company outlined what it was doing, and why, reaffirming the things it has been saying for the last couple of years.

The central HSA concept is that systems will have multiple different kinds of processors, connected together and operating as peers. The two main kinds of processor are the conventional, versatile CPU and the more specialized GPU.

Modern GPUs have enormous parallel arithmetic power, especially floating point arithmetic, but are poorly-suited to single-threaded code with lots of branches. Modern CPUs are well-suited to single-threaded code with lots of branches, but less well-suited to massively parallel number crunching. Splitting workloads between a CPU and a GPU, using each for the work it's good at, has driven the development of general purpose GPU (GPGPU) computing.

Even with the integration of GPUs and CPUs into the same chip, GPGPU is quite awkward for software developers. The CPU and GPU have their own pools of memory. Physically, these might use the same chips on the motherboard (as most integrated GPUs carve off a portion of system memory for their own purposes). From a software perspective, however, these are completely separate.

This means that whenever a CPU program wants to do some computation on the GPU, it has to copy all the data from the CPU's memory into the GPU's memory. When the GPU computation is finished, all the data has to be copied back. This need to copy back and forth wastes time and makes it difficult to mix and match code that runs on the CPU and code that runs on the GPU.
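
To make that overhead concrete, here is a minimal sketch of the copy-in/compute/copy-out pattern as it looks today with the OpenCL buffer API, one common GPGPU interface. The context, queue, and kernel are assumed to already exist, and error handling is omitted for brevity:

```c
#include <CL/cl.h>

/* Traditional discrete-memory GPGPU: copy in, compute, copy out. */
void run_with_copies(cl_context ctx, cl_command_queue queue,
                     cl_kernel kernel, float *host_data, size_t n)
{
    /* Allocate a separate buffer in the GPU's own memory pool. */
    cl_mem gpu_buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE,
                                    n * sizeof(float), NULL, NULL);

    /* Copy the CPU's data into GPU memory before the kernel can touch it. */
    clEnqueueWriteBuffer(queue, gpu_buf, CL_TRUE, 0, n * sizeof(float),
                         host_data, 0, NULL, NULL);

    /* Run the GPU computation on the copied data. */
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &gpu_buf);
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &n, NULL, 0, NULL, NULL);

    /* Copy the results back so the CPU can see them. */
    clEnqueueReadBuffer(queue, gpu_buf, CL_TRUE, 0, n * sizeof(float),
                        host_data, 0, NULL, NULL);

    clReleaseMemObject(gpu_buf);
}
```

Both enqueued copies are pure overhead; they exist only because the CPU and GPU memory pools are separate.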

The need to copy data also means that the GPU can't use the same data structures that the CPU is using. While the exact terminology varies from programming language to programming language, CPU data structures make extensive use of pointers: essentially, memory addresses that refer (or, indeed, point) to other pieces of data. These structures can't simply be copied into GPU memory, because CPU pointers refer to locations in CPU memory. Since GPU memory is separate, these locations would be all wrong when copied.
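
A small, hypothetical C example shows the problem. The linked-list node below stores a pointer to the next node; a byte-for-byte copy of the list into a separate GPU memory pool would carry those CPU addresses along with it, where they point at nothing useful:

```c
#include <stdlib.h>

struct node {
    float        value;
    struct node *next;   /* holds a CPU virtual address */
};

int main(void)
{
    struct node *b = malloc(sizeof *b);
    struct node *a = malloc(sizeof *a);
    a->value = 1.0f; a->next = b;     /* only meaningful in CPU memory */
    b->value = 2.0f; b->next = NULL;

    /* Copying the raw bytes of 'a' and 'b' into GPU memory preserves
     * a->next unchanged, but that address refers to a location in the
     * CPU's memory, not to the GPU's copy of 'b'. The structure has to
     * be translated or flattened instead of simply copied. With a
     * single shared address space, the same pointers would work on
     * both processors. */

    free(a);
    free(b);
    return 0;
}
```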

hUMA is the way AMD proposes to solve this problem. With hUMA, the CPU and GPU share a single memory space. The GPU can directly access CPU memory addresses, allowing it to both read and write data that the CPU is also reading and writing.
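
AMD hasn't spelled out the programming interface here, but OpenCL's shared virtual memory (SVM) calls, introduced in OpenCL 2.0, expose a similar model and give a sense of what a single shared address space looks like to a developer. A minimal sketch, assuming a device with fine-grained SVM support (setup and error handling omitted):

```c
#include <CL/cl.h>

void run_on_shared_memory(cl_context ctx, cl_command_queue queue,
                          cl_kernel kernel, size_t n)
{
    /* One allocation, visible to CPU and GPU at the same addresses. */
    float *data = (float *)clSVMAlloc(ctx,
                      CL_MEM_READ_WRITE | CL_MEM_SVM_FINE_GRAIN_BUFFER,
                      n * sizeof(float), 0);

    /* The CPU writes through the pointer directly; no staging buffer. */
    for (size_t i = 0; i < n; i++)
        data[i] = (float)i;

    /* The same pointer is handed to the GPU kernel; nothing is copied. */
    clSetKernelArgSVMPointer(kernel, 0, data);
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &n, NULL, 0, NULL, NULL);
    clFinish(queue);

    /* The CPU reads the GPU's results back through the very same pointer. */
    float first = data[0];
    (void)first;

    clSVMFree(ctx, data);
}
```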

hUMA is a cache coherent system, meaning that the CPU and GPU will always see a consistent view of data in memory. If one processor makes a change then the other processor will see that changed data, even if the old value was being cached.

This is important from an ease-of-use perspective. In non-cache-coherent systems, programs have to explicitly signal that they have changed data that other processors might have cached, so that those other processors can discard their stale cached copy. This makes the hardware simpler, but introduces great scope for software errors that result in bugs that are hard to detect, diagnose, and fix. Making the hardware enforce cache coherence is consistent with AMD's motivation behind HSA: making it easier for developers to use different processor types in concert.

The memory addresses used in the CPU are not, in general, addresses that correspond to physical locations in RAM. Modern operating systems and software all use virtual memory. Each process has its own private set of addresses. Whenever an address is accessed, the CPU maps these virtual memory addresses to physical memory addresses. The set of virtual addresses can be, and often is, larger, in total, than the amount of physical memory installed on the system. Operating systems use paging to make up the difference: memory from some virtual addresses can be written out to disk instead of being kept in physical memory, allowing that physical memory to be used for some other virtual address.

Whenever the CPU tries to access a virtual address that's been written out to disk, rather than being resident in physical memory, it calls into the operating system to retrieve the data it needs. The operating system then reads it from disk and puts it into memory. This system, called demand-paged virtual memory, is common to every operating system in regular use today.
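
As a small illustration of demand paging from a program's point of view, the POSIX sketch below maps a hypothetical file into memory without reading it; the operating system only pulls each page in from disk when it is first touched (error handling omitted):

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    int fd = open("big_data.bin", O_RDONLY);   /* example file name */
    struct stat st;
    fstat(fd, &st);

    /* Nothing is read from disk yet; only virtual addresses are set up. */
    unsigned char *data = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);

    /* This first access faults, and the OS reads the containing page in
     * from disk on demand. Under hUMA, a GPU access to a non-resident
     * page would be satisfied the same way. */
    printf("first byte: %u\n", data[0]);

    munmap(data, st.st_size);
    close(fd);
    return 0;
}
```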

It is, however, a problem for traditional CPU/GPU designs. As mentioned before, in traditional systems, data has to be copied from the CPU's memory to the GPU's memory before the GPU can access it. This copying process is often performed in hardware independently of the CPU. This makes it efficient but limited in capability. In particular, it often cannot cope with memory that has been written out to disk. All the data being copied has to be resident in physical RAM, and pinned there, to make sure that it doesn't get moved out to disk during the copy operation.

hUMA addresses this, too. Not only can the GPU in a hUMA system use the CPU's addresses, it can also use the CPU's demand-paged virtual memory. If the GPU tries to access an address that's written out to disk, the CPU springs into life, calling on the operating system to find the relevant bit of data and load it into memory.

Together, these features of hUMA make switching between CPU-based computation and GPU-based computation much simpler. The GPU can use CPU data structures and memory directly. The support for demand-paged virtual memory means that the GPU can also seamlessly make use of data sets larger than physical memory, with the operating system using its tried and true demand paging mechanisms.

As well as being useful for GPGPU programming, this may also find use in the GPU's traditional domain: graphics. Normally, 3D programs have to use lots of relatively small textures to cover their 3D models. When the GPU has access to demand paging, it becomes practical to use single large textures—larger than will even fit into the GPU's memory—loading portions of the texture on an as-needed basis. id Software devised a similar technique using existing hardware for Enemy Territory: Quake Wars and called it MegaTexture. With hUMA, developers will get MegaTexture-like functionality built in.

AMD has needed HSA for a long time. The Bulldozer processor module, introduced in late 2011, paired two integer cores to a single shared floating point unit. Each core pair can run two threads, but if both threads make intensive use of floating point code, they have to compete for that shared floating point unit.

AMD's theory was that floating point-intensive code would use the GPU, so the relative lack of floating point power in the CPU wouldn't matter. But that didn't happen, and it still hasn't. There are several reasons for this, but one of the biggest is the inconvenience and inefficiency of mixing CPU and GPU code, due to the memory copying and pinning that has to take place. HSA eliminates these steps. While this still doesn't make programming the GPU easy—many programmers will have to learn new techniques to take advantage of GPUs' massively parallel nature—it certainly makes it easier.

The first processor to come to market with hUMA is codenamed Kaveri. It will combine two to three compute units (each pairing two integer cores with a shared floating point unit), built from AMD's Bulldozer-derived Steamroller cores, with a GPU. The GPU will have full access to system memory.

Kaveri is due to be released some time in the second half of the year.

HSA isn't just for CPUs with integrated GPUs. In principle, the other processors that share access to system memory could be anything, such as cryptographic accelerators, or programmable hardware such as FPGAs. They might also be other CPUs, with a combined x86/ARM chip often conjectured. Kaveri will in fact embed a small ARM core for creation of secure execution environments on the CPU. Discrete GPUs could similarly use HSA to access system memory.

The big difficulty for AMD is that merely having hUMA isn't enough. Developers actually have to write programs that take advantage of it. hUMA will certainly make developing mixed CPU/GPU software easier, but given AMD's low market share, it's not likely that developers will be in any great hurry to rewrite their software to take advantage of it. We asked company representatives whether Intel or NVIDIA were going to implement HSA. We're still awaiting an answer.

The company boasts that its HSA Foundation does have wide industry support, including ARM Ltd, major ARM vendors Qualcomm, Samsung, and TI, and embedded graphics company Imagination. If this set of companies embraced HSA, it's certainly possible that we could see it become a standard feature of the ARM systems-on-chips that power so many tablets and smartphones. What's not clear is how this would do much to help AMD, given its minor position in the tablet market and non-existent place in the smartphone market.

AMD's penultimate slide did point to one possible (though potentially temporary) salvation: games consoles. The PlayStation 4, released later this year, will contain an AMD CPU/GPU. It's widely believed that the next generation Xbox will follow suit. Though there's no official news either way, it's possible that one or both of these processors will be HSA parts. That would give AMD a steady stream of income and also ensure a steady stream of software written and designed for HSA systems. That in turn could provide the impetus to see HSA used more widely.

Update: A reader has pointed out that in an interview with Gamasutra, PlayStation 4 lead architect Mark Cerny said that both CPU and GPU have full access to all the system's memory, strongly suggesting that it is indeed an HSA system.

Is this a hint of what's powering the next generation consoles?

124 Reader Comments

This is interesting, but on the other hand, it's not anything new. What was done previously with multiple CPUs is now being done with CPUs and GPUs. It's just a logical progression of computer systems architecture given the increasing importance and use of GPU-type hardware. That said, I don't mean to talk it down; it's more along the lines of "it's about time!"

Actually, we do know (or, at least, the specs they've announced point to it) that the PS4 is supposed to have 8GB of unified RAM. Now, that's a Jaguar-based architecture (instead of Kaveri), but since it's still a custom chip, AMD may have used it as a testbed for HSA.

A few questions:
1. Does this hUMA need an operating system that's aware of (explicitly supports) the feature?
2. If yes, will Windows "Blue" contain the required enhancement(s)?
3. Also, is the current Linux kernel already prepared for this/are patches already being put in (if yes to question #1)?
4. Will it have any impact on existing applications?
5. And, will current games benefit from video driver-side updates/tweaks (with this hUMA)?

BTW, I'm a bit disappointed that the imminent "Temash" and "Kabini" don't have this feature, considering they use the same "Jaguar" cores as the PS4, which has unified memory (between CPU & GPU). Yes, the PS4 most likely implements its unified memory using AMD's hUMA technique.

The consoles alluded to at the end reflect the same rationale as teaming with ARM designs and licensees. Mobile is experiencing huge growth; if they start using hUMA, that's a big win in getting developers on board too.

I'm not seeing any info on performance estimates. Not even in the slides. Any unqualified percentages of performance increases floating around out there?

Assuming that the hardware is otherwise similar to current chips, a two-module 4GHz Bulldozer-derived CPU with a 384-shader 800MHz GCN GPU gives 128 SP GFLOPS on the CPU side and 614.4 SP GFLOPS on the GPU.

If they have three CPU modules, then 192 vs 614.4.

By comparison, a four-core Sandy/Ivy Bridge chip has twice the FP resources of a two-module Bulldozer/Piledriver, so 256 GFLOPS at 4GHz.
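
For reference, those peak figures follow from counting a fused multiply-add (FMA) as two floating point operations, which is how such numbers are usually quoted:

GPU (GCN): 384 shaders × 2 FLOPs/cycle × 0.8 GHz = 614.4 GFLOPS
CPU (2 Bulldozer modules): 2 modules × 2 FMAC units × 4 SP lanes × 2 FLOPs × 4 GHz = 128 GFLOPS
CPU (3 modules): 3 × 16 FLOPs/cycle × 4 GHz = 192 GFLOPS
Sandy/Ivy Bridge (4 cores): 4 cores × (8 adds + 8 multiplies per cycle with 256-bit AVX) × 4 GHz = 256 GFLOPS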

Actually, we do know (or, at least, the specs they've announced point to it) that the PS4 is supposed to have 8GB of unified RAM. Now, that's a Jaguar-based architecture (instead of Kaveri), but since it's still a custom chip, AMD may have used it as a testbed for HSA.

I didn't think that we knew exactly how unified it was; a common memory pool with a fixed partition (as is done with most integrated GPUs already), or HSA?

As a developer myself and someone who transcodes a lot of video, anything that makes it simpler and more mainstream to offload operations to the GPU is welcome. It's often such a missed opportunity. It kills me that none of the Linux video encoders will leverage my fancy GPU - at least that I've been able to discover. Most of them won't even take advantage of multiple CPU cores.

Actually, we do know (or, at least, the specs they've announced point to it) that the PS4 is supposed to have 8GB of unified RAM. Now, that's a Jaguar-based architecture (instead of Kaveri), but since it's still a custom chip, AMD may have used it as a testbed for HSA.

I didn't think that we knew exactly how unified it was; a common memory pool with a fixed partition (as is done with most integrated GPUs already), or HSA?

Actually we've known for the past two years at least.... Information privy to the C++ community only

I'm not seeing any info on performance estimates. Not even in the slides. Any unqualified percentages of performance increases floating around out there?

Cache coherence is going to decrease performance, but by possibly making it simpler to program, more programmers will use GPGPU. It's a mixed bag.

If we are talking about ease of use, this is a far stretch from Xeon Phi. Imagine a dozen MIC cores integrated into the next iteration of Core i.

Without hardware-enforced cache coherence, the software will still have to perform the same cache flushing operations as it otherwise would have (or else you'd have corruption). So I would think implementing these operations in hardware would actually be faster.

I think it was in the recent mobile SoC articles that a chip was mentioned where the memory was just stacked on top of the CPU die. I mean, why is there such a limit on die space? Laptop RAM is packaged so small these days, even if you just took all those transistors and stuck them on the CPU die, it would only maybe be 4cm x 4cm of die space. Ordinarily, too big of a die is a bad thing because of heat issues and the difficulty of supplying power to the whole area evenly, but you could just power the memory part separately, and DRAM chips don't really produce much heat anyway.

I think it's also about yields; if you put everything on a single die then for the chip to work the entire die must come out correctly. If you compose the chip as a package of, say, three smaller dies then you just need to find any three functioning examples. So each imperfection costs you only a third of an output component rather than an entire one.

What I'd like to see in the way of RAM technology is full integration with the CPU as a single package, and thus the only access delay being in the D flip-flops, with no more delay due to sending signals over the wires on the motherboard.

......

Yeah, PoP is cool for mobile. So, what happens when I want 16GB of RAM when "8GB should be enough for anybody"? Should there be another SKU for each combination of clock freq., cache, hyper-threading capability (or lack thereof), and memory, on each process model and micro architecture?

Laptop RAM is packaged so small these days, even if you just took all those transistors and stuck them on the CPU die, it would only maybe be 4cm x 4cm of die space.

Yields plummet as you increase chip size. That hypothetical 4cm x 4cm chip you mention would be absurdly, hilariously expensive.

In any case, companies like Apple who obsess over miniaturization and who can effortlessly market premium pricing would have already been all over this if it was even remotely viable.

Keep in mind, that 4cm x 4cm is at whatever process size laptop SODIMM-packaged RAM uses, not at 22nm. It was just a random figure. Actually, I'd expect it to be roughly comparable to the CPU die size, or less (since much of the space on the CPU die is for the complex wiring patterns, whereas in RAM it's a regular grid).

As a developer myself and someone who transcodes a lot of video, anything that makes it simpler and more mainstream to offload operations to the GPU is welcome. It's often such a missed opportunity. It kills me that none of the Linux video encoders will leverage my fancy GPU - at least that I've been able to discover. Most of them won't even take advantage of multiple CPU cores.

GPU transcoded video still suffers in quality, though. You get better results with the CPU. It's like cooking; a microwave may be fast enough to do the job in seconds, but it rarely gives comparable results to a few minutes in the oven.

I hadn't heard that. Why would that be the case? Shouldn't it be the same mathematical operations either way, just broken down differently? Do they need to substantially alter the encoding algorithms to take advantage of the parallelism? I would have expected it to be pretty similar to running one of those climate models where every geographical cell's state depends on the states of the adjacent cells, and those are run on massively parallel setups all the time.

Do they experience the same quality loss when leveraging multiple CPU cores? If not, why not? Not disputing, just interested. Any links?

If they were beside each other, it would take up about as much space as doubling the number of cores (minus whatever gains you get from shrinking the RAM down to 22nm). I don't think yields would be THAT terrible, considering it would be a premium product with a HUGE performance increase.

I assume there's a way to keep the GPU/CPU from stomping on each other?

I'm not a hardcore developer, but I imagine that it's the job of the software developer to keep his mental model accurate and his address-space in order....

i.e. a process sharing GPU and CPU code has a shared virtual address space, and existing practices should keep the memory from being stomped in the same way that current multithreaded apps don't stomp on the memory of other threads in the same process.

Laptop RAM is packaged so small these days, even if you just took all those transistors and stuck them on the CPU die, it would only maybe be 4cm x 4cm of die space.

Yields plummet as you increase chip size. That hypothetical 4cm x 4cm chip you mention would be absurdly, hilariously expensive.

In any case, companies like Apple who obsess over miniaturization and who can effortlessly market premium prices would have already been all over this if it was even remotely viable.

PoP could be utilized, where the processor and memory are fabbed separately and then combined later in the manufacturing process. I still don't like the idea, but I don't see how yield issues come into play here.

I assume there's a way to keep the GPU/CPU from stomping on each other?

If they're sharing memory space it should be just like programming in any other multi-threaded scenario. The only difference would be specifically ear-marking certain code to run on GPU vs CPU, where normally you would just spin up another thread and not care which identical core handled it.

What I'd like to see in the way of RAM technology is full integration with the CPU as a single package, and thus the only access delay being in the D flip-flops, with no more delay due to sending signals over the wires on the motherboard.

......

Yeah, PoP is cool for mobile. So, what happens when I want 16GB of RAM when "8GB should be enough for anybody"? Should there be another SKU for each combination of clock freq., cache, hyper-threading capability (or lack thereof), and memory, on each process model and micro architecture?

.........no answer?

Yes. You can have 8GB, 16GB, 32GB, 64GB SKUs. I don't see the problem here. Intel already has an absurd number of SKUs (and, importantly, no competition). Or just have one special SKU with the highest amount currently feasible, since this is really only for enthusiasts anyway?

What I'd like to see in the way of RAM technology is full integration with the CPU as a single package, and thus the only access delay being in the D flip-flops, with no more delay due to sending signals over the wires on the motherboard.

......

Yeah, PoP is cool for mobile. So, what happens when I want 16GB of RAM when "8GB should be enough for anybody"? Should there be another SKU for each combination of clock freq., cache, hyper-threading capability (or lack thereof), and memory, on each process model and micro architecture?

.........no answer?

Yes. You can have 8GB, 16GB, 32GB, 64GB SKUs. I don't see the problem here. Intel already has an absurd number of SKUs (and, importantly, no competition). Or just have one special SKU with the highest amount currently feasible, since this is really only for enthusiasts anyway?

But that's my point! The number of SKUs is, to borrow your adjective, absurd! I don't see the performance benefits being enough to make this worthwhile. The reason this is done in mobile isn't because of performance (although that's a nice bonus), it's because it has a smaller footprint! They can make our phones smaller by stacking, but if they weren't worried about this then it would have taken us a lot longer (if ever) to start stacking chips like this.

Actually, we do know (or, at least, the specs they've announced point to it) that the PS4 is supposed to have 8GB of unified RAM. Now, that's a Jaguar-based architecture (instead of Kaveri), but since it's still a custom chip, AMD may have used it as a testbed for HSA.

I didn't think that we knew exactly how unified it was; a common memory pool with a fixed partition (as is done with most integrated GPUs already), or HSA?

If it's not unified it's usually referred to as "shared" memory. The PS4's is unified like that of Kaveri.

Quote:

"The 'supercharged' part, a lot of that comes from the use of the single unified pool of high-speed memory," said Cerny. The PS4 packs 8GB of GDDR5 RAM that's easily and fully addressable by both the CPU and GPU.

If you look at a PC, said Cerny, "if it had 8 gigabytes of memory on it, the CPU or GPU could only share about 1 percent of that memory on any given frame. That's simply a limit imposed by the speed of the PCIe. So, yes, there is substantial benefit to having a unified architecture on PS4, and it’s a very straightforward benefit that you get even on your first day of coding with the system. The growth in the system in later years will come more from having the enhanced PC GPU. And I guess that conversation gets into everything we did to enhance it."

I'm not seeing any info on performance estimates. Not even in the slides. Any unqualified percentages of performance increases floating around out there?

Cache coherence is going to decrease performance, but by possibly making it simpler to program, more programmers will use GPGPU. It's a mixed bag.

If we are talking about ease of use, this is a far stretch from Xeon Phi. Imagine a dozen MIC cores integrated into the next iteration of Core i.

Without hardware-enforced cache coherence, the software will still have to perform the same cache flushing operations as it otherwise would have (or else you'd have corruption). So I would think implementing these operations in hardware would actually be faster.

It's sync-when-needed vs. sync-always. There's also a question about GPU L1 coherence. Tile-based rendering is faster, yet not simpler.

hUMA sounds a lot like NUMA, and we know its performance characteristics.

What I'd like to see in the way of RAM technology is full integration with the CPU as a single package, and thus the only access delay being in the D flip-flops, with no more delay due to sending signals over the wires on the motherboard.

......

Yeah, PoP is cool for mobile. So, what happens when I want 16GB of RAM when "8GB should be enough for anybody"? Should there be another SKU for each combination of clock freq., cache, hyper-threading capability (or lack thereof), and memory, on each process model and micro architecture?

.........no answer?

Yes. You can have 8GB, 16GB, 32GB, 64GB SKUs. I don't see the problem here. Intel already has an absurd number of SKUs (and, importantly, no competition). Or just have one special SKU with the highest amount currently feasible, since this is really only for enthusiasts anyway?

But that's my point! The number of SKUs is, to borrow your adjective, absurd! I don't see the performance benefits being enough to make this worthwhile. The reason this is done in mobile isn't because of performance (although that's a nice bonus), it's because it has a smaller footprint! They can make our phones smaller by stacking, but if they weren't worried about this then it would have taken us a lot longer (if ever) to start stacking chips like this.

This isn't being done in mobile SoCs. I'm talking about having both on the same piece of silicon. The stacked SoC was just an example of something similar. The point is not to save space, it's to have essentially ZERO access latency to main memory (assuming you use SRAM, not DRAM, which is just changing to a different type of flip-flop). That has huge performance implications.

Also, customers don't care about the number of SKUs of the CPU in the computer they buy. They look for a computer with a certain combination of specs, and they buy it. On the suppliers' side, Intel could just take chips where some of the RAM blocks didn't turn out well and disable them to get the different RAM sizes necessary without having multiple assembly lines.

Besides, if you don't like how many different SKUs Intel has, what are you going to do? Buy AMD? rofl

I hadn't heard that. Why would that be the case? Shouldn't it be the same mathematical operations either way, just broken down differently?

There's a lot of math that GPUs just are not good at, and they're not flexible enough to get good at it anytime soon. Unfortunately, a lot of the processes in transcoding video rely on that kind of math, which is not easy to make massively parallel. What most modern GPUs do is include some special-purpose hardware that is specifically dedicated to handling h.264 video, and run the conversion on that. But these are basically black boxes that don't leverage real GPGPU programming advancements like CUDA and OpenCL anyway, and it's up to the conversion software manufacturers to deal with each company's specific hardware their own way.

Quote:

Do they experience the same quality loss when leveraging multiple CPU cores? If not, why not? Not disputing, just interested. Any links?

No, because CPUs are far more general-purpose than GPUs and can handle all kinds of math easily; they're just not decked out to do massively parallelized operations as effectively as GPUs, which are almost entirely just huge banks of parallel hardware. GPUs are bad at the kind of serial operations CPUs eat for breakfast, and a lot of the math that goes into encoding video is inherently serial.

As a developer myself and someone who transcodes a lot of video, anything that makes it simpler and more mainstream to offload operations to the GPU is welcome. It's often such a missed opportunity. It kills me that none of the Linux video encoders will leverage my fancy GPU - at least that I've been able to discover. Most of them won't even take advantage of multiple CPU cores.

GPU transcoded video still suffers in quality, though. You get better results with the CPU. It's like cooking; a microwave may be fast enough to do the job in seconds, but it rarely gives comparable results to a few minutes in the oven.

I hadn't heard that. Why would that be the case? Shouldn't it be the same mathematical operations either way, just broken down differently? Do they need to substantially alter the encoding algorithms to take advantage of the parallelism? I would have expected it to be pretty similar to running one of those climate models where every geographical cell's state depends on the states of the adjacent cells, and those are run on massively parallel setups all the time.

Do they experience the same quality loss when leveraging multiple CPU cores? If not, why not? Not disputing, just interested. Any links?

I think he's thinking of dedicated hardware encoders, which have their own way of encoding things, and generally don't have many settings to tweak. And yeah, most GPU accelerated encoders currently kinda suck when it comes to quality.

With this, you could possibly rewrite something like x264 to take advantage of the GPU, which would still let you tweak all the options and whatnot.
