It's passed all the tests I've run with known 'problem' WUs. If you have a certain Test WU you'd like to see tested, post it and I'll try it with the benchmark App.
It's been running a week without a single Error or Invalid, and has a Lower Inconclusive count than any other 'Special' version. Something has been solved.
SETI@home v8 (anonymous platform, NVIDIA GPU)
Number of tasks completed: 1018
Consecutive valid tasks: 1185
https://setiathome.berkeley.edu/host_app_versions.php?hostid=7769537

Average processing rate: 306.75 GFLOPS
For a GTX 750Ti, very nice.

And inconclusives are around 7.66%; not quite the 5% target, but damn close.
Very, very nice.
Grant
Darwin NT

With the zi+a sources, I can confirm I was never able to reproduce the pulse finding issue spotted on Petri's machine/build on my Windows build. It has a number of issues of its own to solve (seemingly Windows specific). There are differences in the two codebases I have not examined in detail.
"Living by the wisdom of computer science doesn't sound so bad after all. And unlike most advice, it's backed up by proofs." -- Algorithms to Live By: The Computer Science of Human Decisions

Does it basically come down to precision/rounding issues due to differences in the libraries used on the different Operating Systems?

My guess is there is some difference in the Paths, at least with the OSX and Linux builds. The OSX builds are still running about twice as many Inconclusives as the Linux builds even though they are almost exactly the same builds. Chris was running the original p_zi at around 45 Inconclusives; now the machine's count is climbing. I expect it to level out around where my Mac is running. They should be running about the same with p_zi+ as they were with p_zi.

Does it basically come down to precision/rounding issues due to differences in the libraries used on the different Operating Systems?

My initial attempts using my normal compiler/precision approaches yielded the right ballpark accuracy-wise, though that's since been broken during various attempts to fix other issues. My problems with the codebase on Windows have been related more to the different driver demands on this OS, with respect to execution times of kernel launches.

In a nutshell, Windows drivers have DirectX-based, gaming-oriented optimisations the other OSes don't. These hidden 'features' fuse kernels in their streams into single large launches, removing synchronisation that really needs to be there. I've experimented with some methods to limit/mitigate this, which have worked to some extent, though they introduced other instability along the way (still to be isolated).
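The kind of synchronisation point at stake can be sketched concretely. Below is a hypothetical example (the kernel names and the peak-detection logic are invented for illustration, not taken from the actual app): two dependent kernels issued into one stream, with an explicit completion point at the boundary the host relies on. Within a stream CUDA guarantees kernel ordering, but the host must not read device results until the stream has drained, and that is exactly the boundary a launch-batching driver must be forced to respect.

```cuda
#include <cuda_runtime.h>

// Hypothetical stage 1: flag bins whose power exceeds a threshold.
__global__ void findPeaks(const float* power, float* peaks, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) peaks[i] = (power[i] > 1.0f) ? power[i] : 0.0f;
}

// Hypothetical stage 2: count the flagged bins.
__global__ void countPeaks(const float* peaks, int* count, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && peaks[i] > 0.0f) atomicAdd(count, 1);
}

void runPipeline(const float* d_power, float* d_peaks, int* d_count, int n,
                 cudaStream_t stream) {
    int threads = 256;
    int blocks  = (n + threads - 1) / threads;
    // Same stream, so countPeaks sees findPeaks' output in order.
    findPeaks<<<blocks, threads, 0, stream>>>(d_power, d_peaks, n);
    countPeaks<<<blocks, threads, 0, stream>>>(d_peaks, d_count, n);
    // Explicit completion point before the host copies d_count back.
    // If the driver fuses/batches the launches, this is the sync that
    // must survive the optimisation.
    cudaStreamSynchronize(stream);
}
```

An alternative to a full stream sync is recording a cudaEvent after the second kernel and having the host wait on that, which lets other streams keep working in the meantime.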

Switching to the new 1050ti verified the same behaviour I was seeing on my GTX 980 on Win7, and running the 1050ti on the generic baseline since last night, it looks like the instability is gone (so far) [...so specific to the alpha code and my breakages]. It just means that, as I lay down some of the new infrastructure for x42, I'll need to include some comprehensive timing and debug code aimed at improved automatic scaling and giving some control. That's what was intended for x42 anyway; Petri's contributions just change the direction a bit. Ideally I'd like the compatibility broadened along the way, since stock integration at any level requires as broad a support as possible (a Boinc server/scheduler limitation). The breaking changes by Cuda version and deprecated devices are complicating what the next generation will look like.

My next dev run in the generic stock direction will end up polymorphic, so as to support multiple Cuda versions/devices. The current alpha code, embedded in a clean framework adapted from stock CPU and 'pluginised', is likely to supplant the x41 baseline and serve as a platform for my Vulkan compute kernels.

My guess is there is some difference in the Paths, at least with the OSX and Linux builds. The OSX builds are still running about twice as many Inconclusives as the Linux builds even though they are almost exactly the same builds. Chris was running the original p_zi at around 45 Inconclusives; now the machine's count is climbing. I expect it to level out around where my Mac is running. They should be running about the same with p_zi+ as they were with p_zi.

Yeah, whatever that difference is will likely turn up as I assemble the various pieces (the alpha sources, new buildsystem, and cleaned-out codebase). Now having the parts here to turn my Mac Pro into a triple-OS dev machine should aid tracking down any compiler/library/build differences. It's just a matter of a lot of rejigging of the development environment first, which will be easier now that work will let up for Christmas.

It would also be good if those who install the "special" app would list the corresponding hosts here. It would also be good to have those hosts joined to Beta (with the "special" app too).
SETI apps news
We're not gonna fight them. We're gonna transcend them.

Jason, this "fused kernels" issue in Windows... is that with the latest, optimised-for-gaming Windows drivers? Was/is the problem present with much older Windows drivers? I'm thinking that BOINC users interested not in gaming but in stable and productive systems avoid, to a great extent, getting on the "latest Windows drivers" carousel, with releases seemingly every other day to coincide with the latest popular game. I'm sure a lot of us just use the earliest stable driver that supports the board architecture of the cards we use.
Seti@Home classic workunits: 20,676 CPU time: 74,226 hours

The Cuda-specific portion of the optimisations, whereby implicit synchronisations can be optimised out by the driver, traces back a fair way, to slightly earlier than whichever driver it was where 'trusty old' Cuda 3.2 magically became incompatible with later-gen cards. At that point I was prompted to alert Eric to block stock distribution of that build if there was any Maxwell [even Kepler, now I think back...] or newer GPU in the system (which he promptly did).

Most relevant seems to be that Cuda switched to an LLVM-based compiler after that, which resides in the driver, replacing the old Cuda one. ~Cuda 4/4.1 were too buggy to use here, though 4.2 and 5 vastly improved the picture. The mechanism is that the drivers ignore the embedded PTX binaries in favour of a JIT recompile that gets cached in %APPDATA%\NVIDIA\ComputeCache.

Since the 'old school' Cuda code relies on synchronisation points that get 'optimised out', since the Cuda 3.2, 4.2 and 5.0 sources are identical, and since debugging reveals underlying DirectX calls that won't be in play on Linux or Mac, it all points to nVidia's deprecation of the pre-Fermi architecture as a vector for introducing new bugs, with deprecation of the Fermi class and the x86 platform starting with Cuda [~6.5-8.0].

It's that complex round of breaking changes and deprecations with Cuda that leads Petri, wisely, to choose the path of least resistance: supporting the newest generations only, taking advantage of the improved streaming optimisation capabilities. At the same time, Boinc's limitations in identifying mixed-GPU systems block stock integration of the newest forms (i.e. what app to send if someone puts a Fermi-class card and a Pascal in the same system).

The 'obvious' solution then is to engineer dispatch into the next generation of Cuda-enabled applications, such that internal regression tests can choose the code based on what works. With Raistmer having chosen the OpenCL squillion-build route, which is handling things nicely for GBT at the moment, it does give some breathing room for the daunting amount of software engineering to take place. We have the stock CPU example of such a mechanism working in a different context. For the generic stock distribution route, that means I personally become tied up in creating new supporting infrastructure. Fortunately that doesn't mean non-stock third-party development becomes impeded, though it does mean the alpha code here will be very situation-specific, and will have quirks across the platforms and devices that prevent widespread adoption for the time being.
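A minimal sketch of what such per-device dispatch could look like, assuming the decision keys off the compute capability reported by the CUDA runtime (the enum, names and sm_50 threshold here are invented for illustration; the real x42 selection would be driven by internal regression tests, as described above):

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Hypothetical code paths: a conservative legacy kernel set versus the
// heavily streamed one that only newer generations run correctly.
enum class PulsePath { Legacy, Streamed };

// Illustrative assumption: the streamed path wants Maxwell (sm_50) or
// newer; anything older falls back to the legacy path.
PulsePath choosePath(int device) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, device);
    return (prop.major >= 5) ? PulsePath::Streamed : PulsePath::Legacy;
}

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);
    // On a mixed-GPU host each device gets its own code path, which
    // sidesteps the "which app to send" scheduler problem inside the app.
    for (int d = 0; d < count; ++d) {
        printf("device %d -> %s\n", d,
               choosePath(d) == PulsePath::Streamed ? "streamed" : "legacy");
    }
    return 0;
}
```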

Thanks for the technical explanation, Jason. I was hoping for development of two paths, where XXX.app for Windows drivers <<37X.XX or whatever had a disclaimer to run the XXX.app on the XXX family of cards, and another fork where ZZZ.app for Windows drivers >>38Z.ZZ or whatever had a disclaimer not to run on less than the ZZZ family of cards. I was thinking that might simplify your development, where you otherwise have to develop an app for all possible card architectures and all the older, current and future drivers. I see now that is not easy, or desirable in fact. At some point in time I would think that the developers have to just deprecate support for old hardware. The manufacturers do it for their latest drivers. Why can't the BOINC developers? Again, thanks for the explanation of your development methodology.

At some point in time I would think that the developers have to just deprecate support for old hardware. The manufacturers do it for their latest drivers. Why can't the BOINC developers?

Because of opposite goals.
The vendor's goal is to take as much of your money as it can.
The goal of BOINC-based project developers is to let you use what you have without spending additional money.

TBar: Have you compared the speed of your compile to Petri's different builds? It's good that the invalid rates are down, but as we all know by now, we can't eliminate the way the validator works either.
The more a host produces, and the faster it does so, the higher its inconclusive ratio seems to be, until the work vanishes off.

What I write below is my theory:

By that I mean: if you have a slow host that doesn't process that many WUs per day, you tend to end up crunching units that your wingman has already crunched. If the validator compares the work of a (I call it Petri Cuda) WU to one that has already been crunched, you get a validation pass, both get rewarded credits, and the WU is soon cleared from the system ("Cannot find the WU"), as we can see once they have been processed; thus the invalid ratio is low!

When you have the opposite, an ultra-speedy system that crunches thousands of WUs per day, you will see more inconclusives, because that machine is so fast it returns the work first of all and waits for the other computers to catch up; when they start to return WUs and the overflowed results come pouring in, that speedy machine's inconclusive ratio will rise faster than the others'.

/End of Theory

What actually matters, of course, is that the code does the work properly! Q ratio as high as possible in various tasks, GBT, high/low AR, etc.; you all know that part. But the value TBar refers to as "Consecutive valid tasks" is the main thing to keep track of, in my mind, not the inconclusive part, because the more parallel the code, the more inconclusives we will get, whether it's a CPU, GPU, FPGA, PS4, yada.

Thanks for your work TBar, and thank you Petri for going the brute-force route of taking advantage of the newer hardware that made this leap. The latest SoG is also speedy as hell! My 1080 is utilised better now than when running multiple parallel Cudas! Thank you Raistmer, Jason, Urs and all you Alpha/Beta testers and others who have contributed to getting us where we are at the moment! The list of people would get long.
_________________________________________________________________________
Addicted to SETI crunching!
Founder of GPU Users Group

At some point in time I would think that the developers have to just deprecate support for old hardware. The manufacturers do it for their latest drivers. Why can't the BOINC developers?

Because of opposite goals.
The vendor's goal is to take as much of your money as it can.
The goal of BOINC-based project developers is to let you use what you have without spending additional money.

Agreed. The $ amount of nVidia cards already retired, given away to the needy, or on my shelf collecting dust is already worth way more than my car. IMO AMD's got a pretty golden opportunity right now, with a vacuum created by NV and Intel gouging.

TBar: Have you compared the speed of your compile to Petri's different builds?.....

I was one of the first testers; I've been at it for over a year now. I've tested hundreds of builds during that year, right up to p_zi3i. I haven't been sent any version newer than zi3i.
Your other theory doesn't take into account the use of offline benchmarking. The Benchmark App will identify the source of the problem. I just ran another series of tests which show the Pulsefind error that was addressed in zi3f is still present; it's just a little better in the zi+ build than in the zi3i build.

...The interesting part is the Linux builds are different than the Mac builds...they shouldn't be. ...

The differences underneath between Linux (OpenGL/Vulkan), OSX (Metal), and Windows (DirectX) are directly in the way synchronisation is done, which is key in the new optimisations. IMO, of the three, the Linux one looks the most solid/stable (despite some pretty radical changes to cope with 4k-block NVMe devices in recent kernels). Gremlins can certainly turn up in the app code, but all three of those systems are in a state of flux.

Except the problem doesn't happen with other Apps. It doesn't even happen with the old version of the same App. I'm more inclined to think it's similar to the Pulsefind problem prior to zi3f: some overlooked character that induces a random error when accessed just right. That would explain why the same WU on one platform can end up with a Bad Pulse while working fine on a different platform with the same build number. That happened twice, BTW. The first WU worked on the Mac but gave a Bad Pulse in Linux. The normal BLC3 worked in Linux but gave a Bad Pulse on the Mac. Seriously strange in my book.
Have you noticed it's always just one Bad Pulse? Never 2 or more, always one, no matter how many Pulses are found.

That's right. That's how race conditions (due to omissions or typos) tend to manifest. The architecture is virtualised, so ordering and correctness (or otherwise) depend almost completely on the underlying implementation. The same situation arose with the introduction of Fermi, whereby NV had to return to produce 6.10: the [6.09] code worked as-is on pre-Fermi but produced garbage pulses on Fermi, simply due to cache/thread behaviour.

Quite possibly there are one or more reduction pointers that Petri hadn't realised need to be marked 'volatile'. That different systems, drivers and GPUs manage virtualised memory and caching differently is not surprising, but either way the omission or other problem is in the app code rather than the drivers. It's just complicated by the fact that the implementations are changing underneath.
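For what it's worth, the classic form of that pitfall looks like the sketch below (illustrative only; whether Petri's reduction takes this exact shape is speculation). In old-style warp-synchronous code, omitting 'volatile' lets the compiler keep partial results in registers, so whether one lane sees its neighbour's freshly written value depends on the particular compiler/driver/GPU combination, which matches the occasional-single-bad-pulse symptom:

```cuda
// Sketch of an old-style warp-synchronous max-reduction tail.
// Caller guarantees: blockDim.x >= 64, only threads with tid < 32 call
// this, and sdata holds at least 64 floats of per-thread partial maxima.
__device__ void warpReduceMax(volatile float* sdata, int tid) {
    // 'volatile' forces every read/write to go through shared memory,
    // so each lane observes the value its neighbour just stored.
    // Without it, stale register copies can yield an occasional wrong
    // peak on some driver/GPU combinations.
    sdata[tid] = fmaxf(sdata[tid], sdata[tid + 32]);
    sdata[tid] = fmaxf(sdata[tid], sdata[tid + 16]);
    sdata[tid] = fmaxf(sdata[tid], sdata[tid + 8]);
    sdata[tid] = fmaxf(sdata[tid], sdata[tid + 4]);
    sdata[tid] = fmaxf(sdata[tid], sdata[tid + 2]);
    sdata[tid] = fmaxf(sdata[tid], sdata[tid + 1]);
}
```

On current CUDA the preferred fix is __shfl_down_sync() or a library block reduce, since implicit warp synchrony is no longer guaranteed on the newest architectures; 'volatile' is just the minimal patch for the old pattern.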