I want to better understand how the GPU works, especially what causes performance issues, and why and when they occur.

For example:

A) If a texture needed for a triangle is stored in VRAM, that means that when using a tex2D(...) instruction within the shader code, the GPU stalls waiting to get the appropriate pixel from VRAM, am I right?... Or does the whole texture get stored in cache?... If so, does that mean that all the textures used are stored in cache (bump, diffuse, etc.)?

B) When rendering, the GPU needs to write to the appropriate render target. Would the whole RT also be in a local cache?... So does that mean that when changing RTs it needs to send the old RT to VRAM and bring the new one into cache?

C) When changing render states, I believe this would be a matter of just changing a flag in the GPU, so that wouldn't cause any performance issues, would it?... That is, I could go crazy changing states without changing RTs, textures or shader code and it would not have any relevant penalty, right?

D) If VRAM runs out of space, would the textures be stored in system RAM?

Thanks!

A) Yes, there will usually be a stall here. But rather than sitting idle, the GPU will start working on other pixels/vertices instead. GPUs can have many thousands of pixels/vertices in some stage of execution at any point in time. One limiting factor is that each element in progress requires some registers to store intermediate values, so optimizing the shader to use fewer registers helps ensure there are enough elements in flight to hide these stalls.
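As a rough illustration of the register point, here is a minimal Python sketch of how per-element register use caps the number of elements in flight. The register file size and the per-shader register counts are made-up numbers, not figures for any real GPU, and `max_elements_in_flight` is a hypothetical helper.

```python
def max_elements_in_flight(register_file_size, registers_per_element):
    """Upper bound on elements (pixels/vertices) that can be resident at once.

    Assumes, for illustration only, that the register file is the sole limit
    and is divided evenly between elements.
    """
    if registers_per_element <= 0:
        raise ValueError("registers_per_element must be positive")
    return register_file_size // registers_per_element


# Hypothetical register file with 16384 registers.
REGISTER_FILE_SIZE = 16384

heavy = max_elements_in_flight(REGISTER_FILE_SIZE, 64)  # register-hungry shader
light = max_elements_in_flight(REGISTER_FILE_SIZE, 16)  # optimized shader

print(heavy, light)  # 256 1024
```

Under these assumptions, cutting register use from 64 to 16 per element gives four times as many elements to switch between while fetches are outstanding.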

B) Typically RTs are not held in cache, but the GPU does have local ROP tiles which can cache data. These ROP tiles are flushed to VRAM when they have finished being written to or when there is an RT switch.

C) Some render states can be pipelined with the draw call. Some can't and are set in one of many state contexts. Some render state changes could potentially cause the pipeline to flush or partially flush, leading to bubbles where the GPU goes idle. Which states can and can't be pipelined is very much hardware dependent. Also note that some render state switches could cause a lot of work in the driver on the CPU side if the hardware doesn't directly support the feature or the CPU has to do some kind of processing on the data first.

D) If VRAM runs out of space, would the textures be stored in system RAM?

I'm not sure, actually. When I was doing OpenCL work, I seemed to observe some surprising memory paging effects (unused buffers getting swapped to system memory when required), so I suspect this would be the case. This could be implementation-defined behaviour, however.

C) Pixels are batched up into "segments" on the GPU side. If multiple successive draw calls have the same state, then their pixels will probably end up in the same "segment". Some state changes will force the end of a segment and the start of a new one, while other state changes won't. There are no rules here; each card may be different. Generally, bigger changes, like changing the shader program, will definitely end a segment, while smaller changes, like changing a texture, may not.
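To make the segment idea concrete, here is a small Python sketch that counts segments for a list of draw calls. It assumes, purely for illustration, that a shader change always ends a segment and a texture change never does; `DrawCall` and `count_segments` are hypothetical names, and real hardware draws this line differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DrawCall:
    shader: str
    texture: str


def count_segments(draw_calls):
    """Count segments, assuming only a shader change ends a segment.

    Real hardware decides this differently per GPU; the rule here is an
    assumption made for illustration.
    """
    segments = 0
    previous_shader = None
    for call in draw_calls:
        if call.shader != previous_shader:
            segments += 1
            previous_shader = call.shader
    return segments


draws = [
    DrawCall("skin", "face"),
    DrawCall("metal", "sword"),
    DrawCall("skin", "hands"),
    DrawCall("metal", "shield"),
]

unsorted_segments = count_segments(draws)
sorted_segments = count_segments(sorted(draws, key=lambda d: d.shader))
print(unsorted_segments, sorted_segments)  # 4 2
```

Sorting the same four draws by shader halves the number of segments in this toy model, which is the usual motivation for ordering draw calls by state.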

Also, as mentioned by AliasBinman, changing states may have a significant CPU-side overhead within the driver or API code.

A) As above, when processing pixels, the GPU has a whole "segment" worth of pixels that need to be processed. It can break the pixel shader up into several "passes" of several instructions each, and then perform pass 1 over all pixels in the segment, then pass 2, and so on.
For example, given this code, with comments showing how it might be broken up into passes:
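As a hypothetical stand-in, sketched in Python rather than a shading language: the three-pass split (fetch, pow, combine), the 400-pixel segment, the 40-pixel group size and the `fake_fetch` helper are all assumptions chosen to line up with the walkthrough that follows, and pixels are numbered from 0 here.

```python
GROUP_SIZE = 40      # assumed number of pixels processed together
SEGMENT_SIZE = 400   # assumed number of pixels in the segment


def fake_fetch(texture_id, pixel):
    """Stand-in for a texture fetch; returns a deterministic value in [0, 1)."""
    return ((pixel * 31 + texture_id * 17) % 100) / 100.0


def shade_segment(segment_size=SEGMENT_SIZE, group_size=GROUP_SIZE):
    groups = [range(start, min(start + group_size, segment_size))
              for start in range(0, segment_size, group_size)]
    schedule = []   # (pass number, first pixel, last pixel), in issue order
    fetched, powered, result = {}, {}, {}

    # pass 1: issue the texture fetches for every group in the segment
    for g in groups:
        schedule.append((1, g[0], g[-1]))
        for p in g:
            fetched[p] = (fake_fetch(0, p), fake_fetch(1, p))

    # pass 2: ALU work (pow) on the fetched values
    for g in groups:
        schedule.append((2, g[0], g[-1]))
        for p in g:
            a, b = fetched[p]
            powered[p] = (a ** 2.0, b ** 2.0)

    # pass 3: combine into the final colour, which would go to the ROP stage
    for g in groups:
        schedule.append((3, g[0], g[-1]))
        for p in g:
            a, b = powered[p]
            result[p] = a * b

    return schedule, result


schedule, result = shade_segment()
print(schedule[9], schedule[10])  # (1, 360, 399) (2, 0, 39)
```

The printed pair shows the hand-over point: the last group's fetches are issued, then the first group starts its second pass.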

So to begin with, it executes pass #1, issuing all the texture fetch instructions, which will read texture data out of VRAM (or the cache) and write that data into the cache. Then, after it has issued the fetch instructions for pixels #360-400, it will move on to pass #2 for pixels #1-40. Hopefully by this point in time, the fetch instructions for these pixels have completed, and there's no waiting around (if the fetches are still in progress, there will be a stall). Then, after this pass has performed all its pow calls, the next pass is run, which does some shuffling and multiplication, generating the final result. These results are then sent to the ROP stage.

The bigger your "segments", the better the GPU can hide latency by working on many pixels at once. Shaders that require a lot of temporary variables will reduce the maximum segment size, because the current state of execution for every pixel needs to be saved when moving on to other pixels (and more temporary variables == bigger state). Also, certain state changes (like changing shaders) will end a segment. So if you have a shader with lots of fetches, you want to draw hundreds (or thousands) of pixels before switching to a different shader.

B) Some GPUs work this way, especially older ones, or ones that boast having "EDRAM" -- there's a certain (small) bit of memory where render targets must exist to be written to. When setting a target, it has to be copied from VRAM into this area (unless you issue a clear command before drawing), and afterwards it has to be copied from this area back to VRAM (unless you issue a special "no resolve" request). On other GPUs, render-targets can exist anywhere in VRAM (or even main RAM) and there is no unnecessary copying. The ROP stage will perform buffering of writes to deal with the latency issues, similar to the above ideas in (A).
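A back-of-the-envelope Python sketch of the copy traffic in the EDRAM-style scheme: `edram_copy_bytes` is a hypothetical helper, and the 1280x720, 4-bytes-per-pixel target is just an example size, not a description of any particular GPU.

```python
def edram_copy_bytes(width, height, bytes_per_pixel,
                     cleared_before_draw=False, resolve=True):
    """Bytes copied when binding and later unbinding one render target.

    Models the scheme described above: a copy into the on-chip area unless
    the target is cleared first, and a copy back out unless the resolve is
    skipped. Purely illustrative; no real GPU is modelled.
    """
    target_bytes = width * height * bytes_per_pixel
    copy_in = 0 if cleared_before_draw else target_bytes
    copy_out = target_bytes if resolve else 0
    return copy_in + copy_out


full = edram_copy_bytes(1280, 720, 4)
cleared = edram_copy_bytes(1280, 720, 4, cleared_before_draw=True)
scratch = edram_copy_bytes(1280, 720, 4, cleared_before_draw=True, resolve=False)
print(full, cleared, scratch)  # 7372800 3686400 0
```

In this model, clearing before drawing halves the traffic for a render target switch, and a cleared target that is never resolved costs no copies at all.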

D) This depends on the API, driver and GPU. On some systems, the GPU may be able to read from main RAM just like it reads from VRAM, so storing textures in main RAM is not much of a problem. On other systems, the driver will have to reserve an area of VRAM and move textures back and forth between main RAM and VRAM as required. On other systems, texture allocation may just fail when VRAM is full.
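The second case (a driver moving textures between main RAM and VRAM) can be sketched as a toy residency manager in Python. The `TextureResidency` class, the least-recently-used eviction policy and the sizes are all assumptions for illustration; real drivers choose their own policies.

```python
from collections import OrderedDict


class TextureResidency:
    """Toy model of a driver paging textures between VRAM and main RAM.

    Uses least-recently-used eviction, which is an assumption; real drivers
    choose their own policies.
    """

    def __init__(self, vram_capacity):
        self.vram_capacity = vram_capacity
        self.in_vram = OrderedDict()   # name -> size, oldest use first
        self.in_main_ram = {}          # name -> size
        self.uploads = 0               # main RAM -> VRAM copies

    def add(self, name, size):
        if size > self.vram_capacity:
            raise MemoryError(f"{name} can never fit in VRAM")
        self.in_main_ram[name] = size

    def use(self, name):
        """Make a texture resident before drawing with it."""
        if name in self.in_vram:
            self.in_vram.move_to_end(name)   # mark as most recently used
            return
        size = self.in_main_ram.pop(name)
        while sum(self.in_vram.values()) + size > self.vram_capacity:
            evicted, evicted_size = self.in_vram.popitem(last=False)
            self.in_main_ram[evicted] = evicted_size
        self.in_vram[name] = size
        self.uploads += 1


vram = TextureResidency(vram_capacity=100)
for name, size in [("diffuse", 60), ("bump", 30), ("shadow", 40)]:
    vram.add(name, size)

for name in ["diffuse", "bump", "shadow", "diffuse"]:
    vram.use(name)

print(list(vram.in_vram), vram.uploads)  # ['shadow', 'diffuse'] 4
```

With 130 units of textures and 100 units of VRAM, the second use of "diffuse" forces a fourth upload, which is the kind of back-and-forth traffic that makes oversubscribing VRAM expensive.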

* Disclaimer -- all of this post is highly GPU dependent, and the details will be different on different systems. This is just an illustration of how things can work.