I know that Haswell is supposed to have much more powerful integrated graphics, but in theory, how powerful could an integrated GPU become? Just as a thought experiment, let's say we throw 80%, 90%, or 95% of the transistor budget at the IGP: would we ever get the performance of, say, a mid- or lower-mid-range discrete GPU?

Sure, you could do that; and assuming competent engineers who know their way around a GPU, the performance of the IGP would be similar to that of a discrete GPU with a similar transistor budget. But the performance of the crippled CPU would suck, so you would end up needing to shove a real CPU into your PCIe x16 slot to get anything done!

It would end up memory-bandwidth starved well before that much die space was dedicated to it. The solution is on-die high-bandwidth memory, or maybe a memory chip on the package as an MCM with a special dedicated link to the IGP, or at least tied into the CPU/IGP cache. Intel's ring bus architecture is great, but it hasn't got the sheer capacity needed for great graphics performance, so the IGP ends up doing main-memory fetches. AMD, on the other hand, has done a fantastic job balancing the CPU, IGP, and memory bandwidth for both in Trinity, but it ends up middle of the road for both (OK, it's great for an IGP, but only with the qualifier 'for an IGP').

So regardless of die space dedicated to the IGP, it comes down to memory bandwidth. I can see something like an MCM graphics memory chip of 512MB happening a few process shrinks out - maybe 10nm. Of course at that point 512MB won't be a whole lot of graphics memory. Right now the memory chips themselves just take up too much die space. Or something like AMD's sideport making a comeback, at least that didn't share main memory bandwidth.

All else being equal, it pretty much comes down to transistors... Perhaps they'll figure out a way for the IGP and CPU to actually support each other on certain tasks (like the CPU helping render certain elements in the scene)... sharing cache, balancing graphics against the CPU load, and offloading some GPU work to the CPU when there's headroom... or vice versa... Like OpenCL, only automatic and done in hardware.

There could be a lot of different ways to approach that... All else not being equal, Intel IGPs may become more and more powerful and eventually be more efficient and powerful than standalone GPUs... Of course, if that were the case, they'd simply figure out a way to scale the IGP up and make a more powerful standalone GPU.

Ignorant question here, so why is the CPU bandwidth so much less than the GPU bandwidth (that's PCIe, right?) Is this a historical limitation going back to the original 8086 CPU or the PC bus architecture?

If we knew what we know now (or if we knew where things stood currently) would the original PC architecture be engineered differently?


Guess it's too late to start over with a clean slate?

Part of it is that you just don't need as much bandwidth in a CPU. Look at Sandy Bridge-E, and how it's not all that much faster than regular Sandy Bridge systems despite having twice the bandwidth available.

Not to mention cost. GDDR5 is more expensive than DDR3, and putting more traces for more memory channels on the motherboard is also an expensive proposition. Right now we have an (effective) 128-bit path to memory and the next step would be to double it.

Since iGPUs have a smaller transistor allocation than discrete GPUs, they go into lower-cost systems, and when you're trying to cut costs as it is, doubling memory bandwidth just for a budget GPU doesn't make sense.


Ignorant question here, so why is the CPU bandwidth so much less than the GPU bandwidth (that's PCIe, right?)

It's a question of space, distance and signal integrity.

Memory slots on a motherboard are much further away from the memory controller than on a GPU. This means it's harder to:
a) route lots of lanes
b) run the memory/bus at high clock speeds
c) have more operations per clock

Conversely, GPUs have wider buses, shorter traces, faster signalling and more operations per clock. A high-end GPU like the 7970 has a 384-bit bus running at an effective 5.5 GHz (quad-pumped) for 264 GB/s of bandwidth. Compare that to a typical desktop processor like a Core i5 with two channels of DDR3-1600 RAM, giving an effective 128 bits at an effective 800 MHz, which is 12.8 GB/s* of bandwidth. As you can see, the desktop CPU simply can't compare. Even a midrange GPU like the 7850 ($150-175) has 153 GB/s of memory bandwidth, far outstripping the CPU (and, for the near future, any IGP).
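For anyone who wants to check these peak figures, they all fall out of one trivial formula: bus width in bytes times effective transfer rate. A quick Python sanity check (the function name is mine, not from any API):

```python
def peak_bandwidth_gbs(bus_width_bits, effective_rate_gts):
    """Peak memory bandwidth in GB/s: bytes per transfer times
    billions of transfers per second."""
    return (bus_width_bits / 8) * effective_rate_gts

# Radeon HD 7970: 384-bit bus at an effective 5.5 GT/s
print(peak_bandwidth_gbs(384, 5.5))  # 264.0 GB/s

# Radeon HD 7850: 256-bit bus at an effective 4.8 GT/s
print(peak_bandwidth_gbs(256, 4.8))  # ~153.6 GB/s
```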

So I'm with MadmanOriginal - IGP performance will be bottlenecked by memory bandwidth in the near future.

* EDIT: messed up the original calculation, see following posts.

Last edited by Voldenuit on Tue Jan 15, 2013 2:17 pm, edited 2 times in total.

Voldenuit wrote:Compare that to a typical desktop processor like a Core i5 with two channels of DDR3-1600 RAM, giving an effective 128 bits at an effective 800 MHz, which is 12.8 GB/s of bandwidth.

I don't mean to nitpick, but that bandwidth is 25.6 GB/s, not 12.8. But yes, it is far, far below the 250+ GB/s available on the AMD 7970 GPU!

WhatMeWorry wrote:Ignorant question here, so why is the CPU bandwith so much less than the GPU bandwidth?

Any processing device will need some amount of local cache memory to act as a high-speed buffer between information it needs RIGHT NOW versus information it will need to get to a little bit later. A CPU tends to import, process, and export data in relatively small and discrete chunks. So, a lot of the intermediate steps can be held in a relatively small local cache without hindering the execution process.

GPUs tend to work on very massive datasets in a wide-parallel fashion. Suppose you have a 1920x1080 display operating at 32bpp. The raw math says you have 2,073,600 pixels to address and 66,355,200 bits of data, or 7.9 MB, just to write out ONE frame. Now let's also suppose that we're processing a relatively simple 3D application. That one frame is being updated to the display buffer multiple times per second. In order to do so, the GPU may be drawing elements from wireframe models totaling 100 MB or more and overlaying them with several hundred MB of textures, all of which are being rotated and scaled in real time. What this means is that the GPU needs a LOT of information RIGHT NOW, far more than can be efficiently incorporated on a single die at present.
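The frame-buffer arithmetic above can be checked directly; a trivial sketch (variable names are mine):

```python
width, height, bpp = 1920, 1080, 32          # 1080p display at 32 bits per pixel

pixels = width * height                      # pixels per frame
bits_per_frame = pixels * bpp                # raw bits to write one frame
mb_per_frame = bits_per_frame / 8 / 2**20    # convert to megabytes

print(pixels, bits_per_frame, round(mb_per_frame, 2))  # 2073600 66355200 7.91
```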

Last edited by ludi on Tue Jan 15, 2013 12:32 pm, edited 1 time in total.

phileasfogg wrote:I don't mean to nitpick, but that bandwidth is 25.6 GB/s, not 12.8. But yes, it is far, far below the 250+ GB/s available on the AMD 7970 GPU!

You are correct: the DDR3-1600 bus is 800 MHz, double-pumped. I made a mistake and used the 200 MHz memory clock (quad-pumped) instead. So yes, two channels of DDR3-1600 give ~25.6 GB/s of bandwidth.
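Spelled out, the corrected arithmetic is just the standard DDR bandwidth formula (a minimal sketch, variable names mine):

```python
channels = 2       # dual-channel Core i5 memory controller
bus_bits = 64      # bits per DDR3 channel
mt_per_s = 1600    # DDR3-1600: 800 MHz I/O clock, double-pumped

bytes_per_transfer = channels * bus_bits // 8       # 16 bytes moved per transfer
bandwidth_gbs = bytes_per_transfer * mt_per_s / 1000

print(bandwidth_gbs)  # 25.6 GB/s
```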

Since this is all hypothetical anyway, you could solve the bandwidth issue by having dedicated VRAM for the IGP. The downside is that this would require a lot more pins on the CPU socket.

On a somewhat related note, in the past few years there have been a few AMD motherboards that had what they called "SidePort" memory. This was a dedicated bank of RAM on the motherboard connected to the northbridge for the Radeon IGP to use. But in this case the idea wasn't to make the IGP perform like a dedicated GPU; the idea was to keep the IGP from stealing memory bandwidth away from the CPU.

Well for maximum power, I think that Nvidia will show us with Maxwell since it will be an "IGP" where the "GP" is the vast majority of the chip and the dinky integrated part is actually some ARM cores that are really there to keep the GPU fed with data.

The real limitations that AMD and even Intel face come down to power and memory bandwidth, both of which are discussed above. I will add that the ultra-wide memory controllers used in discrete GPUs would not do all that well in a regular CPU, because the memory access patterns are so different. For example: need to load a 64 MB contiguous block of memory for in-order linear processing? Great for the memory controller on the GPU. Want to load random 4K blocks from different locations in memory? The CPU's memory controller wins every time. The memory controllers in both CPUs and GPUs are set up for the optimal access patterns of their most common workloads. APUs are splitting the difference, and as we've seen with AMD, the GPU can get a big boost from added bandwidth. The problem with that added bandwidth is the expense of higher-specced memory and the increased power consumption in mobile platforms.
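As a toy illustration of the sequential-versus-random distinction (pure Python, so interpreter overhead blunts the effect; names and sizes are mine, not from any benchmark): the same set of reads is issued once in order and once in shuffled order, and only the access pattern differs.

```python
import random
import time

N = 2_000_000
data = list(range(N))

sequential = list(range(N))             # in-order, prefetch-friendly walk
scattered = sequential[:]
random.Random(42).shuffle(scattered)    # same indices, random order

def sum_via(indices):
    """Read data[i] for every i in indices, timing the pass."""
    start = time.perf_counter()
    total = sum(data[i] for i in indices)
    return total, time.perf_counter() - start

seq_total, seq_time = sum_via(sequential)
rnd_total, rnd_time = sum_via(scattered)

# Identical work either way; any timing gap comes from the access pattern
assert seq_total == rnd_total
print(f"sequential {seq_time:.3f}s vs scattered {rnd_time:.3f}s")
```

On real hardware the scattered pass tends to be slower because it defeats caching and prefetching, which is exactly the effect a GPU-style streaming memory controller is built to avoid.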

just brew it! wrote:Texturing operations still require a fair bit of random access to read the source textures...

Ahh... Texture Atlases are your friends. You pack lots of mini textures that have spatial locality into a single chunk of memory and then bind the whole atlas at once.

That avoids the (substantial) overhead of repeatedly binding the textures, but does not help with the random access issue. Unless the entire thing fits into an internal cache inside the GPU you still need to pull the sub-textures out of the atlas in the order that they are requested by the texturing pipelines, and this is not going to result in the entire block of memory being streamed out sequentially!

Edit: In fact, it potentially causes the memory accesses made by the texturing pipelines to be *less* sequential, since you're pulling a small rectangular region of a large texture instead of an entire small texture. (It's still a net win because the overhead of changing the texture bindings is huge compared to the penalty for having to do what amounts to a gather operation on the rows of the source texture data.)
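To make the "gather on the rows" point concrete, here's a minimal sketch of pulling a rectangular sub-texture out of a flat, row-major atlas (the function and layout are my own illustration, not any real graphics API):

```python
def extract_subtexture(atlas, atlas_width, x, y, w, h):
    """Copy a w x h block out of a flat, row-major atlas.
    Each output row is a separate contiguous run inside the atlas,
    so the copy is h small slices (a gather), not one sequential read."""
    out = []
    for row in range(h):
        start = (y + row) * atlas_width + x
        out.extend(atlas[start:start + w])
    return out

# 8x8 atlas holding texel values 0..63; grab the 2x2 block at (2, 1)
atlas = list(range(64))
print(extract_subtexture(atlas, 8, 2, 1, 2, 2))  # [10, 11, 18, 19]
```

Note how the fetched texels (10, 11, 18, 19) come from two disjoint runs of atlas memory, which is the non-sequential pattern described above.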

Actually, there are specific on-chip texture caches in modern GPUs. There are (at least) two reasons the impact of irregular memory accesses can be larger on GPUs than on CPUs:

1. GPUs have caches (and cache sizes are growing rapidly in newer compute-oriented GPUs), but they are still much smaller than in modern CPUs. The caches in modern CPUs mask the impact of small random accesses.

2. To get great performance, GPUs need to keep lots of execution units (SMs, stream processors, "cores", whatever you want to call them) busy at once. Fetching a small random page out of memory stalls accesses to other parts of memory, and often starves parts of the GPU that have to wait for that irregularly accessed chunk of data to reach just one of the execution units.

Newer GPUs are gradually implementing gather operations to help alleviate these issues.
