My 9800GTX+ was still using almost 100% CPU with the 301.24 & 302.59 drivers and the r1305 app (though I was purposely leaving a core free). I haven't tried the r1316 app on it yet; I'll try it once my present CUDA testing is complete.

I did the same, leaving 1 'thread' free, which ups the GPU load by 5-7%. Also,
the 2nd has an even higher load than the first!? 98%-99% doing 2 MB WUs.
This was with MB work, but I noticed the same with AstroPulse work.
Unfortunately, no AstroPulse tasks are available.

In the meantime, with no 610 work available either, I did some MW work; with a resource share
of 650 (SETI) and 75 (MW) it's OK. AMD ATI 5870 GPUs.
But it crashed (heat? I don't know), and I immediately received a survey from HP
asking what had happened and offering to help?!

Now doing 4 610 tasks on the GPUs.
But some, or better a lot of, AstroPulse work would be nice since rev. 1316,
which really is faster; 2 results out of 2 were validated.

Ah, so maybe that's the key? I'm only running one WU at a time on the GTX 260 Core 216 as well - it seems that multiple simultaneous WUs don't scale well unless on Fermi or later cards. For my Fermi and Kepler cards, I'm running two at a time.

And yes, I also noticed that the pre-Fermi G200-era GPU designs have several more compute units than Fermi and Kepler. Different architecture, different implementation.

Edit: This is interesting... with the current lack of AP WUs, one of my Fermi cards received only one AP WU to work on... even though it was processing an MB/CUDA WU simultaneously, the AP WU it was working on exhibited lower CPU usage, similar to that seen on the pre-Fermi card. So perhaps the contributing factor to the high CPU usage isn't the GPU architecture, but rather how many OpenCL tasks you try to run on it simultaneously. MB/CUDA WUs don't appear to suffer high CPU usage.

it seems that multiple simultaneous WUs don't scale well unless on Fermi or later cards.

Yes, that's the way they were built.

Tasks don't actually run 'simultaneously' (any more than they do on a single-core CPU under a multi-tasking operating system). The hardware is switched from task to task on a scale of milliseconds, perhaps even microseconds.

Fermis and later have specialised silicon to handle this task-switching at high speed: earlier GPUs don't.

If you have something to say on topic, or can explain why this is important for users, please do post in the corresponding threads.

I thought I'd do a little light reading tonight and check out the Nvidia thread mentioned above (since all I run are their cards), and got this notice when I went to their site:

Posted July 12, 2012

NVIDIA suspended operations today of the NVIDIA Developer Zone (developer.nvidia.com). We did this in response to attacks on the site by unauthorized third parties who may have gained access to hashed passwords.

We are investigating this matter and working around the clock to ensure that secure operations can be restored.

As a precautionary measure, we strongly recommend that you change any identical passwords that you may be using elsewhere.

NVIDIA does not request sensitive information by email. Do not provide personal, financial or sensitive information (including new passwords) in response to any email purporting to be sent by an NVIDIA employee or representative.

We will post updates about this matter here. For any questions, email us at devzoneupdate@nvidia.com.

Yes, I received an e-mail with the same message; NV was hacked recently. Someone tired of their bugs, perhaps ;)

Well, the last reports actually imply a change in the latest NV drivers.
When I reproduced this bug on my GPU I saw the CPU usage increase on a GTX 260 with a single MB task. So apparently something has changed since then. I will update from the rock-stable 263.06 (no CPU usage bug there at all, even with 2 AP tasks running) to the latest drivers and check what we really have now.

EDIT: As our internal investigation is starting to show, GPU-host syncing is a very complex matter, not only on NV but on ATi hardware too. The CPU time increase on low-end ATi GPUs with r1316 over r1305 I now attribute to a change in the syncing mode inside the APP runtime when the wait time increases. I will make a post with these observations on the ATi forums and post a link here, so anyone interested in the topic can follow the discussion with the ATi specialists (in case they bother to answer, of course).

I don't fully understand all the details of that, but I think I get the gist of it. I suppose the lower end GPUs have different implementations which are more sensitive to the differences in the two algorithms you tried.

I don't fully understand all the details of that, but I think I get the gist of it. I suppose the lower end GPUs have different implementations which are more sensitive to the differences in the two algorithms you tried.

Not quite... low-end GPUs just manifest it very noticeably.
Mike did a test on his quite fast GPU - he got 4s CPU time vs 10s CPU time... The values are low, so more testing will be needed, but the difference is very big in relative terms...

On the other hand, I think it's some switch between 2 sync modes inside the driver itself, depending on the time to wait.
If that's true, a high-end GPU might still not pass the threshold for such a switch and so not show such a big increase in CPU time... we'll see, more testing is needed.
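To illustrate the idea (this is not the actual driver code, just a rough Python analogy of the two sync modes): a spin-polling wait burns a whole CPU core for as long as the GPU is busy, while a blocking wait sleeps in the OS and is charged almost no CPU time. A driver that switches between the two depending on the expected wait time would show exactly this kind of jump in CPU time.

# Illustration only - not the real driver, just the two host-side wait strategies.
import threading
import time

gpu_done = threading.Event()              # stands in for "the GPU kernel finished"

def fake_gpu(duration=0.5):
    time.sleep(duration)                  # pretend the GPU is crunching
    gpu_done.set()

def busy_wait():
    while not gpu_done.is_set():          # spin-polling: ~100% of one core while waiting
        pass

def blocking_wait():
    gpu_done.wait()                       # thread sleeps in the OS: almost no CPU time charged

threading.Thread(target=fake_gpu).start()
t0 = time.process_time()
busy_wait()                               # swap in blocking_wait() to compare
print("CPU time spent waiting: %.3f s" % (time.process_time() - t0))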

Edit: I only just saw your private message (after making the below post). I feel pretty stupid now. x.x

---

<GLaDOS>Continue testing...</GLaDOS>

Ten seconds, wow. What card is that, out of curiosity?

There were no instructions so it took me a while, but I finally managed to work out how to run your dummy work-unit in stand-alone mode and here are my results. I kept parameters the same across all executions (for the same GPU) and also made sure that the binary caches were built before making any timings (so run-time should be just the actual GPU processing and not skewed by the time for the CPU to create those caches).

I'm including the CPUs in this summary for reference, though I don't expect them to have too great an influence on the times.

Not sure what to make of these trends, but hopefully they'll be of some use to you. Maybe. I suppose a single kind of work-unit doesn't reflect performance increases or decreases as well as we'd like - I'm not sure if the differences are significant enough to overcome the margin of error.

The times seem to confirm my manual testing for the mid-range and high-end cards - not sure what the script does differently such that the other two GPUs took twice as long compared with my manual testing. But the results should still be relative to each other here.

1) Strange result for the very first GPU, r555 looks faster %) [C-50 - the same...]
2) The results for the C-50 resemble the results from my own C-60; it looks like it likes r1305 more.
3) "Twice as long" - no idea, it's worth understanding why.
So could you upload the full TestData directory, archived, somewhere? Or, if you prefer, just e-mail me the archive.

I will upload a modified version soon, then post a link here. It should have the same performance as r1305 & r1316 but is able to switch between different modes of execution via a command-line switch.

EDIT: BTW, for the C-60 I am receiving very inconsistent results. The time fluctuations are big. So it's worth making a few copies of the same task and running them all, or just repeating the run a few times to get an estimate of the error range.
BTW, did you test with the CPU free, or was the CPU busy with BOINC tasks?
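If it helps, here is a minimal sketch (Python; the executable name is only a placeholder for whatever stand-alone command line you actually use) of the "repeat the run a few times and estimate the error range" idea, with one untimed warm-up run first so the binary caches are already built:

# Rough benchmark helper: warm up once, then time several identical runs
# and report the mean elapsed time and its spread.
import statistics
import subprocess
import time

CMD = ["ap_test_app.exe"]   # placeholder: your stand-alone test executable + parameters
RUNS = 5

subprocess.run(CMD, check=True)            # untimed warm-up run (lets the caches get built)

times = []
for i in range(RUNS):
    start = time.perf_counter()
    subprocess.run(CMD, check=True)
    times.append(time.perf_counter() - start)
    print("run %d: %.1f s" % (i + 1, times[-1]))

mean = statistics.mean(times)
spread = statistics.stdev(times)
print("mean %.1f s, stdev %.1f s (%.1f%% of mean)" % (mean, spread, 100.0 * spread / mean))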

When you say 'faster', are you only looking at CPU times? Because for the HD 6950, I can definitely say that the overall run-time is better with r1316. On the old r555, it would finish a zero-blanked AP task in about 50-55 minutes. With r1316, it finishes in about 40 minutes, maybe less. Maybe this test WU is not the best representation of 'typical' AP tasks, if there is such a thing.

Again, when you say 'like', are you only considering CPU time?

Maybe it's just due to normal variance - will have to test some more.

I sent you a private message with a link - let me know if you can't download it. I had to re-do the test for the HD 5670 because I lost the first results, though the results are very similar this time around, even though I suspended CPU tasks as well.

I will try to do some more testing later (need to sleep now, very tired), especially for the C-50. Let me know what you want me to focus on. For the results I posted, I only suspended BOINC entirely on the HD 6950. For the others, I only suspended the ATI tasks because I wanted to test 'real-world' conditions, so I left the other work-units running. Since the CPU priority for GPU tasks is higher than for the CPU-only ones, it shouldn't have too much effect. Except for the C-50, of course, because the CPU and GPU are on the same chip, so I will suspend CPU tasks for that one as well for the next test.

...as it turns out, I think testing (suspending/resuming repeatedly) killed the AP task on the HD 4670 prematurely (30/30 repeating pulses). So I took the opportunity to make another test run on it with CPU tasks suspended - I included the results in the link I gave you.

Raistmer, I did another series of tests and I'm satisfied with these results, so I won't be doing any more testing until the next build you want tested. These times are the average of three runs; all tests had BOINC fully suspended, and all parameters (on a given host) were kept the same throughout. The variance in times seems to be no more than about 5% for all tests - in most cases even less than that (this includes the C-50) - so I'm confident these times are accurate for my particular hosts.

In conclusion, I still think r1316 is the overall winner out of these three versions across a broad range of GPU architectures. About the only disadvantage I can see is an increase in CPU usage on the C-50, even though it produced the shortest overall run-times.

Now, I'm curious to know how GCN compares... x.x I'm still limited to WinXP / Catalyst 12.1 at the moment, so no HD 7000-series for me just yet.

Thanks!
Some slowdown of r1305 on a fast GPU can come from a defaults change. The default FFA block size was decreased vs r555. If you want to change this, try adding the -ffa_block N and -ffa_block_fetch N params (where N should be the same for all revisions, of course).
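For example (the numbers here are purely illustrative, not recommended values - pick whatever N suits your card, just keep it identical across every revision you compare): adding -ffa_block 6144 -ffa_block_fetch 1536 to the app's command line for each run.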