I'm supposed to be a professional Linux guru, but this is driving me mental!

A few months ago I managed to get the Lunatics Linux+CUDA app running on FC16, but since upgrading to FC17 and boinc-client-7.0.29-1.r25790svn.fc17.x86_64, GPU tasks are aborting with this error about 90% of the time.

In directory /var/lib/boinc/projects/setiathome.berkeley.edu
I copied a failed WU to work_unit.sah and ran
./setiathome_x41g_x86_64-pc-linux-gnu_cuda32 -standalone
and it completed OK.
I'm at a loss to explain why it won't work reliably under boinc-client :(

The crux of the issue is the continually evolving specifications on which the hardware/firmware and software infrastructure are based. Whether or not you subscribe to Microsoft 'wisdom', WDDM (the Windows Display Driver Model, introduced from Vista onwards) was designed to replace the more traditional hardware-based XP driver model (XPDM, a la DirectX/Xbox) with a fully virtualised environment meant to last some 10-15 years into the future.

That involves significant architectural difficulties across generations, at all levels of abstraction, not least synchronisation and security concepts long familiar from multithreading/OS implementations but otherwise alien to 'coprocessors'. I naively suspect the latest Linux kernels may have introduced multithreaded display drivers, inducing the same set of issues.

Some recent Windows updates related to texture/font cache functionality. It wouldn't surprise me at all if recent Linux kernel/driver updates mirror that 'odd behaviour' to at least some extent, and that Aaron, Martin, myself and maybe Fedora/Red Hat/CentOS need to slap each other around a bit to figure out what'll work there as an intermediate fix... until the technologies stabilise a bit, at least.

Don't feel too bad. Getting the message about the rapid pace of development through to Windows developers is at least as difficult, as evidenced by game developers similarly struggling to keep up. Somebody needs to strap devs down to listen, and engineers to rein in progress enough for us Luddites to keep up.

Jason

"It is not the strongest of the species that survives, nor the most intelligent that survives. It is the one that is the most adaptable to change." - Charles Darwin

Always worth reproducing & characterising the issues where practical, particularly isolating the particular technology change to OS/Driver/library specific areas. Notifying the right parties with as much info as possible seems to work for me.

Would you have a relatively recent CentOS install, Martin? Being further down the development chain, there might be some insights there as well.

Jason, I've identified why it happens! I just don't know how to solve it yet, though.

I get a ton of spam - many thousands of messages per day - and DKIM, virus and SpamAssassin scanning takes up a lot of webserver CPU.
Given that most of the spam comes from countries that none of my clients do business with (VN, CN, UA, TH, ID, BR etc.), I did something clever on the webserver to ease the load.

I set up iptables rules to preroute traffic from just those countries and send it over OpenVPN to here, then on to the i7 with the NVIDIA 550 card, to process the spam and automatically report it to SpamCop etc.

Not only do the rules defend the servers from attack, the bad IP addresses are also distributed over MySQL replication to all the other servers and added to their iptables too. (They are purged automatically after a few days, depending on history.)
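A minimal sketch of that kind of setup, assuming an ipset to hold the country networks and policy routing to push matching traffic over the VPN. All names and numbers here (the set name `geoblock`, mark 7, table 100, `tun0`) are illustrative assumptions, not taken from the post:

```shell
# Build a set of offending networks and mark matching SMTP traffic
# so policy routing can shunt it over the VPN. All names/numbers
# below (geoblock, mark 7, table 100, tun0) are illustrative.
ipset create geoblock hash:net -exist
ipset add geoblock 1.2.3.0/24 -exist          # example network only

iptables -t mangle -A PREROUTING -p tcp --dport 25 \
         -m set --match-set geoblock src -j MARK --set-mark 7

ip rule add fwmark 7 table 100                # route marked packets...
ip route add default dev tun0 table 100       # ...out over the VPN
```

An ipset lookup stays O(1) as the blocklist grows, which matters when the list is replicated to every server.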

The really weird thing: I noticed in the messages log that CUDA errors were happening when iptables blocked an address! WTF???

I've stopped proxying mail traffic to the i7 for now, and the CUDA errors have gone away.

It looks like there is a path to investigate, but how in the world does a net packet rejection cause CUDA to fail?

The kernel and rsyslog would be involved, and maybe writing a line to the console and messages log introduced some kind of delay that caused the error.
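If the log writes themselves are suspect, one mitigation worth trying is to rate-limit the firewall's kernel logging so bursts of rejects can't flood the console/messages log. A sketch only; the chain name and log prefix are assumptions, not from the thread:

```shell
# Sketch: rate-limit firewall logging so bursts of rejects don't
# hammer the console/messages log. Chain and prefix names are assumed.
iptables -N GEOBLOCK
iptables -A GEOBLOCK -m limit --limit 5/min --limit-burst 10 \
         -j LOG --log-prefix "geo-reject: "
iptables -A GEOBLOCK -j REJECT --reject-with icmp-port-unreachable
```

Packets beyond the limit are still rejected, just not logged, which decouples the packet rate from the syslog write rate.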

I noticed it also happened with the Einstein CUDA app, and the occurrence correlates very strongly with entries in the messages file:
the app that was running at the time was aborted with an error, and the SETI app also generated similar NVRM errors.
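One way to make that correlation concrete is to pull both kinds of lines out of the messages file with their timestamps. A sketch, where the "geo-reject:" prefix is an assumption and the sample log is synthetic:

```shell
# Extract firewall-block and NVRM lines with their timestamps, to
# eyeball the correlation. The "geo-reject:" prefix and the sample
# lines below are assumptions/synthetic, not taken from the thread.
MESSAGES=/tmp/messages.sample   # use /var/log/messages on the real host
cat > "$MESSAGES" <<'EOF'
Jun  1 10:00:01 i7 kernel: geo-reject: IN=eth0 SRC=1.2.3.4 DST=5.6.7.8
Jun  1 10:00:01 i7 kernel: NVRM: Xid (0000:01:00): 13, Graphics Exception
Jun  1 10:05:30 i7 kernel: geo-reject: IN=eth0 SRC=9.8.7.6 DST=5.6.7.8
EOF
grep -E 'geo-reject:|NVRM' "$MESSAGES"
```

Matching second-for-second timestamps between a reject and an NVRM Xid line is about as direct as log evidence gets.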

On Windows you can starkly see when enabling any system device that has a poor-quality driver (such as a wifi card, USB devices etc.) by its impact on DPC latency, which directly affects memory transfer 'timeouts' (streaming multimedia performance/dropouts). In later app builds I use the BOINC API's new temporary-exit facilities, catching and resolving these failures when they occur (still preferable not to trigger them, of course).

In your case, barring some equivalent Linux DPC-latency checking tool, I would suspect your NIC driver may be the culprit... Failing that, iptables itself or other supporting infrastructure may not be following best practices for a multithreaded environment - in particular, not returning from interrupts immediately and handling IO completion in a separate (deferred) thread, but instead processing for too long in a kernel callback or similar. Worth delving deeper if better NICs/drivers yield no improvement.
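Linux has no direct DPC-latency viewer, but a poor man's equivalent is to snapshot interrupt counts across a failure window and see which devices fire heavily. A sketch; which device names you would grep for (NIC, nvidia) depends on the host:

```shell
# Poor man's IRQ-activity check: snapshot /proc/interrupts twice and
# diff, to see which devices' interrupt lines are busiest during a
# suspect window. Which rows matter (eth0, nvidia) is host-specific.
before=$(cat /proc/interrupts)
sleep 1
after=$(cat /proc/interrupts)
diff <(printf '%s\n' "$before") <(printf '%s\n' "$after") | grep '^>' | head -n 10
```

A NIC row climbing by thousands per second while NVRM errors appear would point the same way as the mail-proxy evidence.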

Jason

Unlikely to be the onboard NIC and drivers - it deals with gigabytes of transfers without upsetting CUDA, and a packet has to be received before being rejected, so I suspect something else. (I'll modify the iptables rules to drop silently and see if that helps.)
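Switching from REJECT to a silent DROP is a one-line change per rule; a sketch, where the set name `geoblock` is an assumption:

```shell
# DROP discards the packet silently; REJECT generates an ICMP error
# in response (extra kernel work per packet). Set name is illustrative.
iptables -A INPUT -m set --match-set geoblock src -j DROP
# previously, something like:
# iptables -A INPUT -m set --match-set geoblock src -j REJECT --reject-with icmp-port-unreachable
```

If the errors stop with DROP, that narrows the suspect to the reject/logging path rather than packet reception itself.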

I grew up with hardware interrupts and non-maskable interrupts on a 6502 micro!
I don't know the intricacies of CUDA and hardware-level architecture, but it is my understanding that non-urgent interrupts should themselves be interruptible, and I would class a BOINC task, and any functions it calls, CUDA or otherwise, as non-urgent!

Please have a look at the CUDA app code again and consider a retry if a routine fails. I think NVIDIA should also look at the CUDA library, to make it respect the demands of the system and yield - adapting the kernel to accommodate a library's deficiencies or inaccurate assumptions is a bigger task, though probably not without precedent.
I'll have a look to see where I can file a bug report!
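While waiting on app-level retries, a crude process-level workaround is to re-run the whole task on transient failure. A sketch of a shell wrapper; the retry count and delay are arbitrary choices, and only the binary name comes from the earlier post:

```shell
# Re-run a command up to 3 times, pausing between attempts, to ride
# out transient failures. Retry count and delay are arbitrary choices.
retry_run() {
    local attempt
    for attempt in 1 2 3; do
        "$@" && return 0
        echo "attempt $attempt failed (exit $?)" >&2
        sleep 1
    done
    return 1
}

# usage, e.g. with the standalone test from the first post:
# retry_run ./setiathome_x41g_x86_64-pc-linux-gnu_cuda32 -standalone
```

This only helps for failures that are genuinely transient; a corrupted CUDA context usually needs the full exit/reinitialise path anyway.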

Yes, those aspects have been ongoing development matters for all concerned, including MS, nVidia (through backstage reports by myself and others), and in app development. Those issues really only came to the fore as driver infrastructure moved toward being more multithreaded for scalability.

As mentioned, the most recent app codebase (up to x41zb) takes the path of using the BOINC API's temporary-exit mechanisms where needed, with retries under certain known recoverable/transient conditions.

After the initialisation steps were made much more atomic than in the original 6.08-6.10 code, various forms of recovery were experimented with. The primary reason BOINC temporary exits have been found preferable in many cases is that, historically, various driver, library, OS and genuine hardware issues have been pretty much unrecoverable at the application level (context corruption). A full exit/restart completely reinitialises the CUDA context etc., often after the OS has decided to reset the device. Another aspect to consider is that we are dealing with consumer-level hardware, in many cases being pushed too far, which is going to take a whole new approach toward ensuring some level of reliability/integrity.

Much of this challenge originates from the incorporation of DMA engines and helper threads into a traditionally not very thread-safe programming environment. For example, boincapi freeing host memory buffers underneath active transfers was found to be the dominant cause of the 'sticky downclocks' seen when snoozing/exiting BOINC, effectively sending everything lower-level into failsafe. That's one example only, and in retrospect it would have been helpful if nVidia had pushed out thread-safety warnings as the architectures moved toward the current highly threaded implementation (rather than waiting for our reports :) ).

Once someone gets around to making an updated build for Linux, I'd be interested to see if there are any special considerations there as well. It's certainly been a long haul on shifting ground, as was expected.

Such a build would require some makefile updates to reflect the V7 autocorrelation additions in testing on the beta project, along with other refinements. Hopefully, as the project resolves some technical issues server-side, and the current approaches to improving cross-device-generation, cross-CUDA-version matches can be verified/improved, the focus can shift more toward dealing with the likes of rogue hosts and misconfigured systems, which tend to dominate over the more pedestrian issues of the past.

Jason