Veteran

It doesn't seem like a useful feature (time-slicing shaders): there are so many cores that it would be better to implement some work-queue scheme with resource reservation (e.g. "I'm enqueuing this task, and I need at least N cores for it"). The context saving/restoration is too costly IMHO.
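
To sketch what I mean (toy host-side code, entirely made-up API, no relation to any real driver):

```cuda
// A toy version of the work-queue-with-reservation scheme described
// above -- purely hypothetical host-side logic, not any real driver API.
// Tasks declare the minimum number of cores they need, and the queue
// only dispatches a task once that many cores are free, so no context
// save/restore is ever needed.
#include <cstdio>
#include <queue>
#include <functional>

struct GpuTask {
    int min_cores;                 // the reservation: "I need at least N cores"
    std::function<void()> work;    // stand-in for a real kernel launch
};

class ReservingQueue {
    std::queue<GpuTask> pending;
    int free_cores;
public:
    explicit ReservingQueue(int total) : free_cores(total) {}

    void enqueue(GpuTask t) { pending.push(std::move(t)); }

    // Dispatch in FIFO order, but only when the head's reservation fits.
    void pump() {
        while (!pending.empty() && pending.front().min_cores <= free_cores) {
            GpuTask t = std::move(pending.front());
            pending.pop();
            free_cores -= t.min_cores;   // reserve
            t.work();                    // runs to completion, cooperatively
            free_cores += t.min_cores;   // release
        }
    }
};

int main() {
    ReservingQueue q(512);                // e.g. a 512-core GPU
    q.enqueue({128, [] { std::puts("physics kernel on >=128 cores"); }});
    q.enqueue({512, [] { std::puts("post-process on the whole chip"); }});
    q.pump();
}
```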

Moderator
Veteran

It doesn't seem like a useful feature (time-slicing shaders): there are so many cores that it would be better to implement some work-queue scheme with resource reservation (e.g. "I'm enqueuing this task, and I need at least N cores for it"). The context saving/restoration is too costly IMHO.


While I agree that cooperative scheduling is definitely desirable and more efficient, being *able* to do preemption is important for correctness guarantees on producer/consumer workloads. These workloads cannot be expressed robustly in the current APIs/languages, but it will probably be desirable to support them at least at a coarse granularity in the future.
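
To make the correctness point concrete, here's the classic failure case: a consumer kernel spin-waiting on a producer that the hardware can't schedule. This is a hypothetical CUDA sketch; on a GPU that can't preempt, the synchronize below never returns.

```cuda
// Producer/consumer without preemption: if the consumer grid occupies
// the machine and spins on a flag, and the scheduler cannot preempt it
// to run the producer, the GPU makes no forward progress. There's no
// way in current APIs to demand that both kernels stay runnable.
__device__ volatile int ready = 0;
__device__ int payload = 0;

__global__ void producer() {
    payload = 42;
    __threadfence();            // publish payload before raising the flag
    ready = 1;
}

__global__ void consumer(int* out) {
    while (ready == 0) { }      // spin-wait: only safe if producer can run
    *out = payload;
}

int main() {
    int* out;
    cudaMalloc(&out, sizeof(int));
    consumer<<<1024, 256>>>(out);   // fills the machine and spins...
    producer<<<1, 1>>>();           // ...so this never gets to execute
    cudaDeviceSynchronize();        // hangs without preemption
    cudaFree(out);
}
```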

Veteran

GPUs already multitask, don't they? For example, graphics runs at the same time as CUDA-calculated water in Just Cause 2.


I don't think so; it would be nuts if nV had to issue separate instructions to their CUDA Cores. Best to have one thread and one instruction per cycle. Heck, next thing we know someone will come up and insist that CUDA is made of nothing more than GPU commands.

Legend

I don't think so; it would be nuts if nV had to issue separate instructions to their CUDA Cores. Best to have one thread and one instruction per cycle. Heck, next thing we know someone will come up and insist that CUDA is made of nothing more than GPU commands.

sorry


Yeah, it makes more sense for the game to execute all of the CUDA commands on the GPU so that the GPU is in pure CUDA mode, finish preparing the frame, and then render the frame in normal graphics mode.
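
That's roughly the phase structure the CUDA/GL interop API gives you today. Rough sketch below; it assumes a GL vertex buffer already registered with cudaGraphicsGLRegisterBuffer, and simulate_water is just a stand-in kernel:

```cuda
// One frame, phase-ordered: map the GL buffer into CUDA, run the
// simulation ("pure CUDA mode"), unmap, then let GL render from it.
#include <cuda_gl_interop.h>

__global__ void simulate_water(float4* verts, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) verts[i].y = 0.1f * __sinf(verts[i].x + verts[i].z);
}

void frame(cudaGraphicsResource_t vbo_res, int n_verts) {
    float4* d_verts;
    size_t bytes;
    cudaGraphicsMapResources(1, &vbo_res);                  // enter the CUDA phase
    cudaGraphicsResourceGetMappedPointer((void**)&d_verts, &bytes, vbo_res);
    simulate_water<<<(n_verts + 255) / 256, 256>>>(d_verts, n_verts);
    cudaGraphicsUnmapResources(1, &vbo_res);                // hand back to graphics
    // ... issue the GL draw calls that consume the VBO here ...
}
```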

Veteran
Subscriber

I guess that depends on your definition of "at the same time". Obviously, one ALU can only execute one instruction at a time, and that instruction cannot belong to both a CUDA kernel and a graphics kernel.

On earlier-than-Fermi GPUs, though, the whole chip could only execute one kernel at a time and needed to be in either "CUDA state" or "graphics state", with significant switching times between the two. That has been remedied in Fermi, where multiple kernels can run simultaneously, though they still have to belong to the same group (compute or graphics); according to Nvidia, the switching time has been drastically shortened. AMD has claimed the same capability, running multiple kernels at once, for their chips since I don't know when, so they have probably had it much longer.
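
The multiple-kernels-at-once part is visible from CUDA on Fermi: kernels launched into different streams are allowed to overlap on the chip, as long as everything is compute work. A minimal sketch (kernel_a/kernel_b are placeholders):

```cuda
// Concurrent kernels on Fermi: independent launches in different CUDA
// streams may execute simultaneously -- but the compute/graphics split
// described above still applies; these are both compute kernels.
__global__ void kernel_a(float* x) { x[threadIdx.x] *= 2.0f; }
__global__ void kernel_b(float* y) { y[threadIdx.x] += 1.0f; }

void launch_concurrently(float* d_x, float* d_y) {
    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);
    kernel_a<<<1, 256, 0, s0>>>(d_x);   // small grids leave SMs free to share
    kernel_b<<<1, 256, 0, s1>>>(d_y);
    cudaStreamSynchronize(s0);
    cudaStreamSynchronize(s1);
    cudaStreamDestroy(s0);
    cudaStreamDestroy(s1);
}
```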

Legend

In graphics mode conceptually you have, as a minimum, vertex shader and pixel shader kernels running "independently" on the GPU at the same time.

Xenos/R600 run these as independent kernels.

G80 onwards, theoretically, is the same. Though there's always the chance that NVidia implemented VS/GS/PS by running a single uber-kernel and then simply using a run-time constant in the context (hardware thread's context) to specify which sub-section of the uber-kernel to run, one section for VS, another for GS and a third for PS. Would require a fair bit of digging to find out for sure...
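
For illustration, the uber-kernel trick would look something like the sketch below. To be clear, this is pure speculation about G80 with made-up names; the point is just that one compiled kernel plus a per-context constant can stand in for three:

```cuda
// Speculative "uber-kernel": one kernel containing all three stages,
// with a run-time constant (set per launch, e.g. via cudaMemcpyToSymbol)
// selecting which section actually executes.
enum Stage { STAGE_VS = 0, STAGE_GS = 1, STAGE_PS = 2 };

__constant__ int stage_select;   // the "run-time constant in the context"

__device__ void run_vs() { /* vertex shading */ }
__device__ void run_gs() { /* geometry shading */ }
__device__ void run_ps() { /* pixel shading */ }

__global__ void uber_kernel() {
    switch (stage_select) {      // uniform branch: every warp takes one path
        case STAGE_VS: run_vs(); break;
        case STAGE_GS: run_gs(); break;
        case STAGE_PS: run_ps(); break;
    }
}
```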

Regular

Evergreen can do it, and in fact can do it under DXCS right now; it's just tricky. On the CL side, we have some work to do to get this through, namely getting out-of-order queues and fixing up some dependency-tracking issues that are too aggressive. Om stated the status a little too strongly. Getting this exposed is being looked at, but there's no ETA for CL.
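
For reference, the out-of-order part is already in the CL spec as a queue property; whether an implementation honours it is another matter. Sketch, error handling mostly omitted:

```cuda
// Requesting an out-of-order command queue in OpenCL 1.x. With this
// property set, enqueued commands may run in any order consistent with
// the event dependencies you declare -- the plumbing concurrent kernels
// need on the CL side.
#include <CL/cl.h>

cl_command_queue make_ooo_queue(cl_context ctx, cl_device_id dev) {
    cl_int err;
    cl_command_queue q = clCreateCommandQueue(
        ctx, dev, CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE, &err);
    // CL_INVALID_QUEUE_PROPERTIES comes back if the device recognises
    // the flag but doesn't support the behaviour -- "no ETA" territory.
    return (err == CL_SUCCESS) ? q : nullptr;
}
```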

Legend

Ooh, the pipeline is shorter, which means fewer hardware threads are required to hide pipeline latency, in general.

Throughput is peculiar: it never goes above 28 ops/clk as far as I can see, for things one would expect to be 32 ops/clk.

Throughput for a 32-bit integer MUL is spot-on, though: 15.9 ops/clk.

Double-precision throughput looks like a disaster zone. Register bandwidth spoilt by bad register allocation? In other words, I think the nature of the test is screwing things up, and this will eventually come out right once the driver is more mature.

Some of the GT200 numbers in your test differ from that paper (GTX280 in both cases): e.g. you report 6.0 ops/clk for MAD but the paper has 7.9, and 12.4 ops/clk for MUL where the paper has 11.2. Something going on in the driver/compiler or register allocation?...

Notice that this is the peak ops/clk, at only 6 warps, whereas with GTX480 the peak ops/clk occurs at 16 warps. This entirely contradicts my earlier suggestion that the reduced pipeline length would reduce the number of hardware threads required to hide pipeline latency.

This may just be another artefact of this test/compilation. If the chip truly needs to be this heavily populated with hardware threads, that's going to be quite awkward.

Maybe this is because the register file has been substantially re-worked for load/store operations: ALU operands always come from registers, never from anywhere else, so the register file has to support more clients.
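
For anyone who wants to poke at this themselves, the usual shape of these tests is a long dependent-MAD chain, swept over launch sizes to vary the resident warp count. A rough sketch; the 1.401 GHz shader clock is an assumption (GTX480), and this reports a whole-GPU figure, so divide by SM count to compare with per-SM numbers:

```cuda
// Dependent-MAD throughput microbenchmark: each thread runs a serial
// chain of MADs, so ops/clk only approaches peak once enough warps are
// resident to hide the pipeline latency. Sweep `blocks` and watch where
// the curve saturates.
#include <cstdio>

__global__ void mad_chain(float* out, int iters) {
    float a = threadIdx.x, b = 1.000001f, c = 0.5f;
    for (int i = 0; i < iters; ++i) {
        a = a * b + c;    // dependent on the previous result...
        a = a * b + c;    // ...so a single warp exposes the full latency
    }
    out[threadIdx.x + blockIdx.x * blockDim.x] = a;   // defeat dead-code elim
}

int main() {
    const int iters = 100000, threads = 32;           // one warp per block
    float* d_out;
    cudaMalloc(&d_out, sizeof(float) * threads * 64);
    for (int blocks = 1; blocks <= 64; blocks *= 2) {
        cudaEvent_t t0, t1;
        cudaEventCreate(&t0);
        cudaEventCreate(&t1);
        cudaEventRecord(t0);
        mad_chain<<<blocks, threads>>>(d_out, iters);
        cudaEventRecord(t1);
        cudaEventSynchronize(t1);
        float ms;
        cudaEventElapsedTime(&ms, t0, t1);
        double mads   = 2.0 * iters * threads * blocks;   // MADs issued in total
        double clocks = ms * 1e-3 * 1.401e9;              // assumed shader clock
        printf("%2d blocks: %.1f ops/clk (whole GPU)\n", blocks, mads / clocks);
        cudaEventDestroy(t0);
        cudaEventDestroy(t1);
    }
    cudaFree(d_out);
}
```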

Veteran

Double-precision throughput looks like a disaster zone. Register bandwidth spoilt by bad register allocation? In other words, I think the nature of the test is screwing things up, and this will eventually come out right once the driver is more mature.


I don't think so, as Nvidia has artificially limited the DP throughput on consumer cards to 1/4 of the Tesla cards'. So 4 ops/clock per SM is exactly what one expects: 15 SMs × 4 FMA/clk × 2 flops × 1.401 GHz ≈ 168 GFlop/s peak for a GTX480.
