
"It is possible to get good performance, just not with a direct port from CUDA (OpenCL)."

Every app is different, of course, but it's possible to write apps which are heavily optimized for a specific vendor's hardware, in which case a "direct port" will still have those optimizations and may not perform well on different hardware.

Nothing wrong with that if you only plan to run on one hardware type, of course, but developers are learning to follow "generic" best practices (rather than vendor-specific ones) which allow good performance on multiple vendors' hardware, including CPUs. What seems to make the most difference is memory access patterns - tweaking the ALU code can give maybe a 100% speedup, but you can easily get a 10:1 or better improvement (or worsening) by changing the way memory is used.

Here's an example that goes the other way - IIRC this app was written in OpenCL from the beginning.

In general I think the performance results will lie somewhere in between. If you follow some of the GPGPU threads you can see the cross-platform issues gradually being knocked off so that the final code runs fast on at least three different platforms (CPU as well as NVidia/ATI GPUs).

Comment

"GPU Ray-tracing for OpenCL"

Nice :-) "Sample/sec -- 17,298.6K" - is that enough for rendering a game with full ray tracing?

Why are NVidia cards so slow in this benchmark?

What's up with OpenCL for HD3xxx hardware?

What's up with OpenCL for the open-source driver?


Comment

It seems the guy overclocks the graphics cards. Not really useful for this purpose - there should be test apps used to verify correctness. That's why NV is working on ECC memory checking for next-gen Quadro cards: to detect (and correct) memory errors. For games, usually only visual checks are done to prove that an overclock is working right, but that can lead to completely useless results in GPU computing.

It probably just happens to be coded in a way that maps better onto ATI strengths (e.g. math-intensive work) than NVidia strengths. Revisit the thread in a month and I expect the gap between the two GPU vendors will be smaller, and the code will be running faster on all hardware.

Each new generation has additional inter-thread hardware support and there's a certain level required for a full, fast OpenCL implementation. IIRC the global data share (GDS) was added to rv670 first, and local data share (LDS) was added to rv770 first, and both of them are required for OpenCL.

It's probably possible to implement an OpenCL subset that runs fine on older hardware but that is probably more likely to happen with the open drivers than with the proprietary stack.

Nothing has changed AFAIK - it'll probably run over Gallium3D drivers, and there may need to be some changes to TGSI before that happens. VMWare's short term priority was getting their SVGA driver ready for production along with the graphics state trackers, so the devs have mostly been working on that instead of OpenCL. Zack's blog is still the best reference AFAIK.

Comment

"Each new generation has additional inter-thread hardware support and there's a certain level required for a full, fast OpenCL implementation. IIRC the global data share (GDS) was added to rv670 first, and local data share (LDS) was added to rv770 first, and both of them are required for OpenCL."

Yes, LDS was first in RV770, but for OpenCL the RV770's LDS was apparently wrong somehow - too small, or missing the right features...

Comment

That doesn't sound right. It's possible that LDS is not used to implement OpenCL global memory (the fit there isn't great) but IIRC it is used for something, probably synchronization. If LDS isn't being used then the main alternatives would be direct shader access to memory (which was expanded a lot in 7xx) or "global" GPRs, which were only added in 7xx, and either way I don't see any reason to think that implementation on 6xx would be easy.

The graphics-related programming model didn't change much between 6xx and 7xx, but the compute-related parts changed a lot more. Evergreen has non-trivial changes in both areas -- you can see a summary in the front of the ISA guide.

Comment

"That doesn't sound right. It's possible that LDS is not used to implement OpenCL global memory (the fit there isn't great) but IIRC it is used for something, probably synchronization. If LDS isn't being used then the main alternatives would be direct shader access to memory (which was expanded a lot in 7xx) or "global" GPRs, which were only added in 7xx, and either way I don't see any reason to think that implementation on 6xx would be easy."

To be (hopefully) precise, the LDS in RV770 isn't used in OCL, and will probably never be used there, as its access model is too restrictive to fit the specification. My understanding of your implementation is that you emulate shared memory via global memory in RV770, so implementations relying on "heavy" shared memory usage will be a bit underperforming there. On the other hand, one has a pretty fat register file there, which can offset some of the pain (such aspects do make direct ports from CUDA less than great ideas, as more often than not you'd be using shared mem on G80+). There's no GDS in R6xx parts (again, IIRC), there is a memory R/W cache that's similar in certain aspects, but it's small-ish and not wholly equivalent.

As for an R6xx implementation, lack of a Compute Shader mode (this was added with RV770) could be a limitation, since it'd mean somewhat higher overhead for launching kernels (you'd run them as Pixel Shaders). However, given the current state of the software stack and aspects of the architecture, one can end up with a pretty fast implementation of compute via Pixel Shaders anyway, so the main hold-up for older parts, IMHO, is lack of relevance coupled with lack of resources. ATI could probably coerce R6xx parts into OpenCL compliance, but I fail to see why that would provide any benefit whatsoever, whilst it'd cost their already over-stretched compute-centric guys a fair chunk of time.
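To make the shared-memory point concrete, here is a minimal OpenCL C kernel sketch (not from the thread - names and sizes are illustrative) of the usual staging pattern: each work-group copies its slice into `__local` memory, synchronizes with a barrier, then reduces in place. On hardware where `__local` maps to a real LDS this is cheap; where it has to be emulated through global memory, as described above for RV770, the same kernel pays full memory latency at every step - which is exactly why a direct port of LDS-heavy CUDA code can underperform.

```c
/* OpenCL C (C99-based) kernel: per-work-group sum via local memory.
 * 'tile' must be sized to one float per work-item when the kernel
 * is enqueued (clSetKernelArg with a NULL buffer and a byte size). */
__kernel void tile_sum(__global const float *in,
                       __global float *out,
                       __local  float *tile)
{
    size_t lid = get_local_id(0);

    tile[lid] = in[get_global_id(0)];   /* stage through local memory */
    barrier(CLK_LOCAL_MEM_FENCE);       /* work-group sync point */

    /* tree reduction within the work-group */
    for (size_t s = get_local_size(0) / 2; s > 0; s >>= 1) {
        if (lid < s)
            tile[lid] += tile[lid + s];
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    if (lid == 0)
        out[get_group_id(0)] = tile[0]; /* one partial sum per group */
}
```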

Comment

"There's no GDS in R6xx parts (again, IIRC), there is a memory R/W cache that's similar in certain aspects, but it's small-ish and not wholly equivalent."

Yeah, that matches what I'm seeing in the documentation, although "conventional wisdom on the internet" seems to be that rv670 at least did have the GDS. Maybe confusion between GDS and the scatter/gather memory access functionality.

I'm pretty sure there was also a hardware issue related to synchronization on older parts; I'll see if I can remember what it was.

It's probably obvious that the open source team has been focusing on the graphics functionality so far, and not the compute bits.

Comment

"To be (hopefully) precise, the LDS in RV770 isn't used in OCL, and will probably never be used there, as its access model is too restrictive to fit the specification."