More Memory and Cache Changes

This memory virtualization will be shared between discrete GPUs as well as integrated parts. The iGPUs of course have access to the CPU’s memory controller, and in the Fusion parts the iGPU is actually given priority over the CPU. As Eric Demers described it, current quad core CPUs only consume around 8 to 12 GB/sec of bandwidth when fully loaded. This explains why we see so little benefit from faster DDR-3 on modern processors: going from 1333 MHz to 1600 MHz often shows no real performance improvement on the AMD side, and Intel does not even officially support speeds above DDR-3 1333. The iGPU changes that. AMD reworked its memory controller, and it can feed upwards of 30 GB/sec of data to the GPU. Stream benchmarks will not show this kind of utilization, but in testing there are very distinct performance improvements in graphics applications when going from 1333 speeds up to the top 1866 speed supported by the new AMD processors.
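Those bandwidth figures fall straight out of DDR-3 signaling rates. A quick back-of-the-envelope sketch, assuming the standard dual-channel, 64-bit-per-channel configuration:

```python
# Rough theoretical peak bandwidth for a dual-channel DDR-3 system.
# Assumes the standard 64-bit (8-byte) channel width; real-world
# sustained bandwidth will come in under these peak figures.

def ddr3_peak_gbs(mt_per_sec, channels=2, channel_bytes=8):
    """Peak bandwidth in GB/sec for a given DDR-3 transfer rate (MT/s)."""
    return mt_per_sec * 1e6 * channels * channel_bytes / 1e9

for speed in (1333, 1600, 1866):
    print(f"DDR3-{speed}: {ddr3_peak_gbs(speed):.1f} GB/sec")
```

DDR3-1866 works out to roughly 29.9 GB/sec, which is where the ~30 GB/sec figure for feeding the iGPU comes from.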

On the discrete GPU side, virtual memory will be accomplished by tunneling through the PCI-E connection. For both the iGPU and dGPU to work in this manner, an IOMMU must be present and supported by the OS. The platform is really the primary goal of all these changes. By implementing them at the platform level, we get better memory management plus the virtual memory, which makes programming easier and therefore more approachable to developers who may not have the time or budget to address a more closed system. Parallelism between the CPU and GPU will be complementary rather than antagonistic: the CPU handles the more serial operations while the GPU goes for the highly parallel, and with the shared virtual memory space the CPU and GPU can schedule work for each other. AMD wants to keep the platform open and work with other IHVs and ISVs to address their needs. Finally, AMD has worked very hard to improve overall memory efficiency, not just in the caches but also in the memory controllers and the virtualization. With each generation of cards starting with the HD 2900 XT, AMD has focused on adding compute enhancements, but at a rate which would not significantly impact die size or graphics performance. Here AMD has taken a big step towards far greater compute performance, but has also looked to improve graphics performance with these particular changes.
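To make the shared virtual memory idea concrete, here is a toy sketch. This is not AMD's actual API (which has not been detailed); it only models the conceptual difference between two separate address spaces joined by explicit copies and a single page table that both devices walk:

```python
# Toy model of shared virtual memory between a CPU and a GPU.
# Purely illustrative: a "page table" is just a dict mapping a
# virtual page number to its data.

class PageTable:
    def __init__(self):
        self.pages = {}  # virtual page number -> data

    def write(self, vpn, data):
        self.pages[vpn] = data

    def read(self, vpn):
        return self.pages[vpn]

# Without shared virtual memory: the CPU must copy the buffer into a
# separate GPU address space before the GPU can touch it.
cpu_space, gpu_space = PageTable(), PageTable()
cpu_space.write(0x10, b"scene data")
gpu_space.write(0x10, cpu_space.read(0x10))  # explicit copy over PCI-E

# With shared virtual memory: one page table, one pointer, zero copies.
shared = PageTable()
shared.write(0x10, b"scene data")            # CPU writes...
assert shared.read(0x10) == b"scene data"    # ...GPU reads the same page
```

The second half is what lets the CPU and GPU hand each other work by passing pointers instead of marshaling buffers.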

Double precision is again supported, but it is far more flexible than in previous iterations. Peak double precision on older parts was typically 1/5th that of single precision, due to the nature of the VLIW-5 and VLIW-4 architectures. With the new vector based units, peak double precision is now up to one half that of single precision. But AMD has given itself the option to turn down that performance depending on the product: the top end FirePro cards would see one half, the top end gaming card might see one fourth, and mainstream and integrated parts could see as low as one sixteenth. AMD has stated that all of its products based upon this architecture will have the ability to do double precision; it is just a matter of how much performance is enabled.
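The ratios work out simply: peak double precision is just peak single precision scaled by whatever fraction AMD enables on a given product. A sketch using a purely hypothetical 1 TFLOPS single-precision part (the ratios are the ones AMD quoted):

```python
# Peak double-precision throughput as a fraction of single precision.
# The 1000 GFLOPS single-precision figure is hypothetical; only the
# 1/2, 1/4, and 1/16 ratios come from AMD's statements.

def peak_dp_gflops(sp_gflops, ratio):
    return sp_gflops * ratio

sp = 1000.0  # hypothetical single-precision peak, in GFLOPS
for segment, ratio in (("FirePro", 1/2), ("gaming", 1/4), ("mainstream", 1/16)):
    print(f"{segment}: {peak_dp_gflops(sp, ratio):.1f} GFLOPS DP")
```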

Graphics performance is still the primary goal of this new architecture. There will still be plenty of fixed function units which have not changed in ages. ROPs and Z-units will remain, and it is quite likely their numbers will grow with each process shrink, allowing more rendering power to push pixels to the screen. With performance sinks like Eyefinity and stereoscopic 3D, pixel fillrates are still very important.

In Closing

We will see the first generation of parts come out in Q4 2011; these will be discrete GPUs. The current Llano APUs are based on the older VLIW-5 architecture, and it looks like Trinity (Bulldozer + iGPU) will be based on the VLIW-4 architecture. After that, though, we can expect the integrated parts to use this new architecture, which opens up a new realm of possibilities for AMD. One scenario discussed was physics acceleration. Instead of the dGPU doing both rendering and physics/compute work, the iGPU on the CPU would handle the latter. The iGPU has the advantage of being located on the CPU, sharing the same memory controller and accessing main memory very quickly, as well as greater memory localization of the data. This would reduce latency by a significant degree as compared to the dGPU doing the same thing over the PCI-E bus. With the iGPU handling that work, the dGPU could better handle other operations such as geometry, pixel shading, and tessellation.

The first iterations of virtual memory will likely be featured on AMD-only platforms. Intel would have to buy into this concept and allow it to work on its CPUs and platforms; enabling this functionality with Intel processors would not be a simple driver addition. AMD is committed to this being an open architecture, so it will be interesting to see if NVIDIA jumps onboard. NVIDIA currently does have a virtual memory mode for its GPUs, but it is not shared with the CPU and does not exist in the same virtual address space.

This is a big deal for AMD. While they have had trouble keeping up with Intel on the CPU side, they have had no problem staying ahead in graphics. Their push towards heterogeneous computing is also shared with NVIDIA, and their combined efforts towards utilizing this functionality will benefit both in the long run. Intel is still more CPU-centric, but we are starting to see them take a larger interest in this technology. The cancelled Larrabee project may have been misguided in terms of addressing the gaming and graphics market, but the parallel computing ramifications of that part and its extreme programmability hint at things to come.

It will be interesting to see how they organize the rest of the chip. We know from a previous presentation at Fusion 11 that primitive setup will be pretty flexible, and different chips will have a different number of units doing this setup. How many CUs will be grouped into a larger functional unit, and how will that tie into the texture units? There is a lot that was left uncovered.

What AMD has essentially done is broken down the CPU into compute units that can either act as x86+AVX scalar units or as GPU compute units.

The vector units are 16-wide, which means they can be used as SIMD units for GPU tasks or as AVX SIMD units for the x86 scalar unit.

It is likely that we will see this as the first 512-bit AVX implementation (16-wide), as half the units would sit idle if AMD stuck with a 256-bit (8-wide) AVX implementation.
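The lane counts behind that speculation are just register width divided by element size; a quick sketch:

```python
# Vector lane counts: register width in bits divided by element size.
# 32-bit elements correspond to single-precision floats.

def lanes(register_bits, element_bits=32):
    return register_bits // element_bits

print(lanes(512))      # a 512-bit AVX register fills a 16-wide unit
print(lanes(256))      # a 256-bit register would leave half the lanes idle
print(lanes(512, 64))  # double precision halves the lane count
```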

We can expect to see a very efficient use of silicon with this architecture as the CPU can tune the ratio of x86 vs compute units depending on the workload, potentially hundreds if not thousands of times per second.

From the third-to-last and last slides it seems like everything is going to be an "APU" now, even though it really isn't. Makes me wonder how confusing the future will be, but this is why AMD is a very cool company. They don't rest on their laurels; they do things, they change things, they try things, which makes it all a lot more interesting to follow.

I wonder how much of a boost these cards will see. I know the die shrink helps, but this whole new architecture with the cache and memory could either help across the board or only in calculation-intensive work like folding. Who knows; fingers crossed, and glad as heck that I waited out the 5xxx and 6xxx series.

I agree. I still have a socket 775 CPU from Intel. If AMD can pull this off with good real-world numbers, I will switch over to the new hybrids. It all comes down to high gaming numbers, though; I go with whichever company has the best scores, whether it's cheaper or not. I do have a 5850 in my system at the moment, and it's one of the best GPUs I have ever had.

I'm really curious about how it will all turn out. Does anyone have a clue yet about density, for instance? I'm not much of a chip engineer, but the main advantage of VLIW-5 and VLIW-4 to me seemed to be that they could cram so many ALUs into their chips. In getting rid of that, will they be able to maintain a similar performance per mm^2? Will the increased utilisation be enough?

IOMMU support from M$? CPU/GPU virtual memory in Fermi 2? That would then also force Intel to follow. It looks like NVIDIA is missing an x86 license, but at least Tegra 3 is a winner. The future looks very exciting, and I'm waiting for Haswell to be my future CPU. But what will my dGPU be?