Over in the PST forum, Ken and the crew have been working on a GPU application for sieving. The program is called ppsieve and is currently being used on the PPSE sieve. If all goes well, we may be able to merge the PPS and PPSE sieves, bringing PPSE into BOINC.

Currently available for 32 & 64 bit Linux AND 32 bit Windows (which will also run on 64 bit Windows). It should work on cards with any compute capability. You can download it here:

The "Elapsed time" on a GTX 260-192 @ 667 MHz is nearly the same as on the Q9550 @ 3.4 GHz. So the output of the GTX 260-192 at stock clock is roughly the same as that of 4 cores of my Q9550 at stock clock. This ratio also applies to the current AP26 apps (1.01 (cuda23) vs 1.04).

Whoops! It looks like I left some flags in the ppconfig.txt file for the CPU version of PPSieve that was on my site, including one that forces sieving for Riesel numbers instead of Proth.

The timing information here is fine, but before doing anything in the PPSE sieve, anyone getting "-1"s in their results file should either download what I just uploaded, or edit ppconfig.txt to remove the "riesel" line.

OK, now that's over with, about the GPU testing...

3M is actually a small test range for the GPU code. I use it because it has good known factors, and because I don't have a compute-capable GPU and the emulator's really slow! But a 30M range or something would probably be better for speed comparison, if you have a minute or four:

I find no checks for a specific compute capability (1.0, 1.1 or 1.3) in the source code. You can download the zip file with the binaries in it (< 100 K) and simply start the 32 or the 64 bit binary with the commands John gave in his post. The test range is very short.

Or, of course, we could simply read John's initial post. I overlooked it too:

Currently available for 32 & 64 bit Linux. It should work on cards with any compute capability. You can download it here:

Has anyone given any thought to what's going to happen when this goes live via BOINC?

In particular, consider how I've got PrimeGrid set up, which is probably a fairly common arrangement for people with CUDA capable GPUs:

1) I've got a quad-core CPU and a GTX280 GPU.

2) I run AP26 on the GPU

3) I run other PrimeGrid stuff on the CPU

4) I do not want to run AP26 on the CPU because I can run it much, much faster on the GPU (about 5 min/WU).

5) There's no explicit BOINC mechanism to say "run X on the CPU and Y on the GPU", unless PrimeGrid makes two separate sub-projects, "AP26-CPU" and "AP26-GPU".

6) So, I have ONLY the CPU tasks selected on the project preferences page to feed the right tasks to the CPU ...

7) ... and "Send work from any subproject..." to send AP26 to the GPU. This works because nothing else exists for the GPU, so it has to send AP26 tasks.

As soon as there's more than one GPU project, there will be no way of selecting what you want to run on the GPU, unless you also want to allow those projects to run on the CPU (which I think most people would prefer not to do).

One possible solution is to make separate sub-projects for the GPU versions, but I realize that's far from ideal.

I think it would be a mistake to build too much complexity into local bespoke code.

We need the BOINC client to do the right thing with GPU scheduling. We need the BOINC server to allow per subproject CPU/GPU preferences. We should try and feed those requirements into the mainstream BOINC development process and then make use of it when released.

All IMHO of course, and somewhat idealistic, since BOINC development seems to follow whatever direction the Berkeley folks are meandering in at a particular moment in time...

I intentionally avoided the whole concept of "well, the BOINC software really *should* do this..." for the obvious reasons, primarily that we're going to have a problem in the near future, whereas the BOINC client might get around to solving a problem like this anywhere from tomorrow to never. They've got much bigger GPU scheduling problems to solve before they could get around to this one.

As soon as there's more than one GPU project, there will be no way of selecting what you want to run on the GPU ...

The anonymous platform mechanism lets you control exactly what you want to run. I deployed it today on the Linux side of my quad and it even picked up existing tasks correctly.

Obviously, this isn't a solution for the masses, but it works for me.

It's been a loooooong time since I've set up app_info by hand, and I don't really remember how to do it. Anyone know of a good reference for what exactly needs to be done?


The BOINC wiki has this article. You can use the example in the format section as a template.

Open the client_state.xml file in the BOINC data directory and find the PrimeGrid project data. Then, for each subproject you want to run:

1) Copy that subproject's <app_version> section and paste it over the <app_version> in the template, removing only the <platform> tag.
2) Edit the <flops> value if you know it's off for your current duration correction factor.
3) Correct the <app> <name>, and also declare the files used in the <app_version> with <file_info> tags, like in the example.

Save the result as app_info.xml in the PG project folder and restart BOINC. Make sure all the files you declared actually exist in the PG folder. (A rough skeleton is sketched below.)
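As a very rough skeleton (every name, version number, and value below is a placeholder; copy the real ones from your own client_state.xml), an app_info.xml for a single GPU subproject ends up looking something like this:

    <app_info>
        <app>
            <name>pps_sieve</name>                <!-- placeholder: real subproject name from client_state.xml -->
        </app>
        <file_info>
            <name>ppsieve_cuda_example</name>     <!-- placeholder: the app binary sitting in the PG folder -->
            <executable/>
        </file_info>
        <app_version>
            <app_name>pps_sieve</app_name>
            <version_num>101</version_num>
            <flops>1.0e10</flops>                 <!-- optional tweak, as noted above -->
            <avg_ncpus>0.05</avg_ncpus>
            <max_ncpus>1</max_ncpus>
            <coproc>
                <type>CUDA</type>
                <count>1</count>
            </coproc>
            <file_ref>
                <file_name>ppsieve_cuda_example</file_name>
                <main_program/>
            </file_ref>
        </app_version>
    </app_info>

Repeat the <app> / <app_version> (and <file_info>) entries for each additional subproject you want to run.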

It is probably a good idea to run down your cache and/or make a backup of the data directory (suspend network activity, too!) before deploying.

Alright, I've posted a release candidate version to the locations given at the beginning of the thread. It may be slightly faster; it may also lock up your GPU until it's done. Let me know how it goes.

In any case, I've recently been looking at some other projects' GPU speeds, and I'm finding myself disappointed with my speeds. When Milkyway@Home is 17 times faster on a high-end NVidia card (PDF), and even a simple Collatz app (not the official Collatz app, just the only source code I could find) is more than twice as fast as a CPU on a mid-range card, while my code is only as fast as a CPU on a high-end card, I wonder if I'm doing something wrong. Would any of the experienced CUDA developers around here care to give my code the once-over, to see if I'm doing something obviously stupid like not giving the card enough threads?

I suppose the other side of the coin could be that my CUDA code isn't bad, but that my and Geoff's CPU code is extraordinarily good.


One thing to consider is that some problems just don't lend themselves very well to parallel processing. Even with the best code in the world, such a problem still might not run very well on a GPU. Remember, an individual GPU shader isn't all that fast compared to a CPU core; it's the ability to run several hundred calculations simultaneously that makes the GPU fast. If the problem doesn't fit the hardware well, the GPU won't be able to crunch it very quickly.


Yeah...the application here is compute-bound and doesn't require much memory. Probably the slowest parts are the 64-bit multiplies. When Fermi comes out, I expect that each stream processor will run my app (once recompiled) twice as fast.
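(For anyone wondering why 64-bit multiplies hurt: current NVidia GPUs have no native 64-bit integer multiplier, so each 64-bit product gets pieced together from several 32-bit multiplies, roughly like the illustration below. This is just an illustration of the cost, not the actual ppsieve code.)

    /* Illustration only: the low 64 bits of a 64x64 product, built from 32-bit halves. */
    __device__ unsigned long long mul64_lo(unsigned long long a, unsigned long long b)
    {
        unsigned int alo = (unsigned int)a, ahi = (unsigned int)(a >> 32);
        unsigned int blo = (unsigned int)b, bhi = (unsigned int)(b >> 32);

        unsigned long long lo  = (unsigned long long)alo * blo;      /* full 32x32 -> 64 */
        unsigned long long mid = (unsigned long long)alo * bhi
                               + (unsigned long long)ahi * blo;      /* cross terms */
        return lo + (mid << 32);   /* bits above 2^64 fall away; this is the low half */
    }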

Another part of it could be that others are comparing GPU speed to CPU speed on one core. In that case my app is 4 times as fast as the CPU version. :)

Nice work so far, Ken. I only see one issue: this appears to be a compute-mode-only CUDA application, meaning it will not run on the primary display adapter under Windows in its current form (driver watchdog timer). Correct?

If you mean it's not using the driver API, that's correct. I was hoping to avoid it.

In Windows, if a CUDA kernel runs longer than 5 seconds the program will be terminated by the driver. Briefly looking at your posted source, it appears you're running one huge kernel.

RE: app speeds - currently in AP26, a compute capability 1.3 CUDA card is about 5.5 times as fast as one core of an Intel Q6600 CPU. So your app isn't exactly slow; it's just doing things the GPU isn't good at.

Have you investigated whether some of the later compute capabilities add features that would increase speed? It is nice to see an application that works on all CUDA cards, but given that only a handful of models are compute capability 1.0 (G80 chips), added features such as the atomic functions in compute capability 1.1 cards might help with speed, depending on the operations the application performs.

In Windows, if a CUDA kernel runs longer than 5 seconds the program will be terminated by the driver. Briefly looking at your posted source, it appears you're running one huge kernel.

Not exactly. I load up the GPU with either 384 or 768 P's per multiprocessor, run just those, check on the CPU any that reported a factor, then repeat. There's no specific time checking, but I estimate the kernel won't run more than 1 or 2 seconds at a time.
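In outline, it's something like this (heavily simplified, with made-up names; the kernel body and helpers are omitted):

    #include <stdlib.h>
    /* Hypothetical prototypes, for illustration only. */
    void fill_batch(unsigned long long *P, int n);                  /* next batch of candidate P's */
    void verify_on_cpu(unsigned long long P);                       /* exact factor check on the CPU */
    __global__ void test_batch(const unsigned long long *P, int *flag);   /* kernel body omitted */

    void sieve_range(int batches, int batch_size)
    {
        size_t pbytes = batch_size * sizeof(unsigned long long);
        size_t fbytes = batch_size * sizeof(int);
        unsigned long long *hP = (unsigned long long *)malloc(pbytes), *dP;
        int *hflag = (int *)malloc(fbytes), *dflag;
        cudaMalloc((void **)&dP, pbytes);
        cudaMalloc((void **)&dflag, fbytes);

        for (int b = 0; b < batches; b++) {
            fill_batch(hP, batch_size);
            cudaMemcpy(dP, hP, pbytes, cudaMemcpyHostToDevice);
            cudaMemset(dflag, 0, fbytes);
            test_batch<<<batch_size / 256, 256>>>(dP, dflag);          /* one short-lived kernel */
            cudaMemcpy(hflag, dflag, fbytes, cudaMemcpyDeviceToHost);  /* implicit sync */
            for (int i = 0; i < batch_size; i++)
                if (hflag[i]) verify_on_cpu(hP[i]);                    /* CPU confirms any hit */
        }
        cudaFree(dP); cudaFree(dflag); free(hP); free(hflag);
    }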

Scott: I looked into it. I'm not using much global memory, or any shared memory, so atomic functions don't matter. I'm not sure about double precision; it might have enough precision to be useful in one case, but it would be tricky. Otherwise there's nothing until compute capability 2.0, which, as I mentioned, makes multiplication faster.

Thanks for all your testing help, guys! I've got one more thing to test, and I hope you won't mind because I expect it to be slower. (But I've been wrong before!)

Linux 64-bit users *only*, with pre-Fermi cards (Fermi isn't in stores yet, if you didn't know), please try the two binaries in this zipfile. This is an experiment in 24-bit multiplies instead of 64-bit ones. Both binaries do 24-bit multiplies, despite their names, but they do other things differently. Even if it doesn't work out here, this is a plausible algorithm for ATI, if I can ever figure out how to develop for OpenCL without one of their GPUs.
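(Background for testers: compute capability 1.x hardware has a fast 24-bit integer multiply, exposed in CUDA through the __umul24() intrinsic, while full 32-bit and 64-bit multiplies cost several instructions each. A trivial illustration of the intrinsic - not the actual ppsieve kernel:)

    __global__ void mul24_demo(const unsigned int *a, const unsigned int *b,
                               unsigned int *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            /* Multiplies the low 24 bits of each operand; a single fast
               instruction on G8x/G9x/GT200, unlike a full 32-bit multiply. */
            out[i] = __umul24(a[i], b[i]);
        }
    }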

If anyone reading this *does* have a Fermi (GTX4xx), I'd love to see a benchmark from the original code (linked in the first post by John). If Fermi doesn't run 50-100% faster per shader, I may have to recompile or something for maximum speed.

OK, new version uploaded, probably the finalized version, at the links in the first post. I found a somewhat major bug in previous versions: around 30 of the highest N's in the ranges were being skipped! But that's fixed now.

Bryan, see if you can get this code to build in VS, perhaps without BOINC first. If you need to make changes, perhaps I should set up a GitHub account?

Well, I thought I was done, but I've made a few more changes for the release version of PPSieve-CUDA. The biggest change is compiling with CUDA 3.0. I hope it works for everyone!

The biggest code change is that I gave up using boinc_init_parallel() in favor of boinc_init(), because it's more compatible. The rest of the code changes are to header files and paths to BOINC header files. So nothing major there.
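(For reference, the change amounts to one call at startup - a sketch, not the exact PPSieve-CUDA code:)

    #include "boinc_api.h"

    int main(int argc, char **argv)
    {
        /* previously: boinc_init_parallel();  - the variant intended for multithreaded apps */
        boinc_init();                           /* plain init; more widely compatible */

        /* ... sieving, with boinc_fraction_done() progress updates ... */

        boinc_finish(0);
        return 0;
    }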

By the way, apparently CUDA 3.0 introduces an easier way to lower CPU usage. It might go from 5% down to 1 or 2%. But I'm going to leave lowering CPU usage for V0.1.2, if it's needed.
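(If the mechanism in question is blocking synchronization - I haven't verified that this is what's meant - the idea is to let the host thread sleep in the driver while waiting for the GPU instead of spin-polling, roughly:)

    #include <cuda_runtime.h>

    /* Illustration only: request blocking sync before the device is initialized. */
    void init_device_low_cpu(int dev)
    {
        cudaSetDeviceFlags(cudaDeviceBlockingSync);   /* CUDA 3.0-era name of the flag */
        cudaSetDevice(dev);
        /* Afterwards, cudaThreadSynchronize() and blocking event waits put the
           host thread to sleep instead of burning a CPU core while polling. */
    }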

Thanks Ken - I just got the 0.1.1-rc2 version ported to Mac OS X (only minor tweaks were required, as the __thread attribute is not supported by GCC on Mac OS X). I'll pick up the new code and rebuild ASAP and post here when it's done (hopefully in a couple of days)...

Compiling with CUDA 3.0 will probably mean that many will be forced to upgrade drivers to at least the 195.xx series. Just FYI...the 196.xx and 197.xx drivers have been noted to slow down many cards' computational speeds compared to the 190.xx and 191.xx drivers (especially 8xxx and 9xxx series cards under Win7 and Vista), so the gain in freed-up CPU may actually be lost (and maybe exceeded) by the loss in speed on some cards.


The slowdowns some people have reported with some driver versions have been significant - around 25%, IIRC.


OK, it doesn't have to be compiled with 3.0 (yet). I wasn't sure if 2.3 would support Fermi. Since it looks like it does (PDF), I'll see about going back to 2.3.

Edit: To be clear, only the binaries will change, not the source code.

That's good...am I reading it correctly that CUDA 2.3 devices will use the native CUBIN, which works with the older drivers, but Fermi devices will need the 195.xx driver or higher to utilize the PTX code?

Note that only 32-bit CUDA executables are supported on the Mac, but as most of the run time is spent on the GPU, this is not a problem. Since upgrading to Mac OS 10.6.3, Apple now only supports CUDA 3.0, so this app is built and linked with the CUDA 3.0 libraries. However, it should work fine on machines where CUDA 2.3 is installed. If you have a Mac running OS 10.5 and/or CUDA 2.3, I'd be very grateful for your testing.

To test the app, please use the same inputs as in the original post, and obviously the output should be the same!

Two possibilities for computation errors. Most likely this one is because you're using 0.1.1-rc1. I believe I fixed a bug between rc2 and the final release that could rarely cause this error. It could definitely cause factors to be missed near NMax.

So please download the latest version from the link in the top post.

A computation error means that the GPU says it found some factor (it doesn't return what factor), but the CPU failed to find a factor in that range. So it could also be caused by an unstable GPU or rarely an unstable CPU.

So now you can see the various directions I've considered. The redc branch is the current one, but I have an idea for the other branch that might pull it ahead, if I can find a large enough, fast-enough area of memory; maybe texture memory.

But first, since I've heard nothing from mfl0p, I think I'd better try to set up a WinXP VM and build a version for Windows.

Thank you very much for setting up the repository. This makes it easier to follow the developments. I think it is time for me to reinstall the NVIDIA drivers and their CUDA toolkit under Lucid Lynx and get my GTX 260-192 out of hibernation mode again (in the last few weeks I've crunched with a HD 4770 under Windows and Linux).

By the way: The repo contains a file named pps/ppse_37TE1.txt that is a link to a file in a /downloads/... directory that is not in the repo. Is this file too large to include in the repository or are there other reasons not to include the file?

Heh, didn't know that file was in there. It's a 1.2 GB file, so there's no way to include it. Plus it's not going to be used with BOINC, so its only purpose here would be for testing with many_n_test.sh and maybe some of the other testing scripts. It's not useful for the testing we're doing in this thread.

Edit: By the way, the code hasn't changed in about a month. I just made the code and its previous changes easier to access.

I tried compiling the source with VC++ 2008 Express. The files compile fine, but when linking, it's as if no file can see any other file's headers. I included the header files - even some new ones to replace missing Linux versions - so I'm not sure what's going on.

If you know anything about MSVC++ (since I don't), could you please take a look at my source code?

Thanks!

P.S. What all has to be included with the source code to preserve the proper build instructions? Does the .sln file need to be there? I really want to avoid including the gigantic .ncb file.

Tanya, angle brackets (<>) are used for including system and library headers that are (or should be) on your compiler's include path. You use quotes ("") when the header you're including is in the same directory as the file that includes it (or in the directory of another file that includes it). If it's still not found there, I think the compiler then falls back to the include path.
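For example, using the header names from this thread (toy illustration):

    #include <iostream>    /* angle brackets: found on the compiler's include path */
    #include "gfn_app.h"   /* quotes: looked for next to the including file first,
                              then (on most compilers) on the include path as a fallback */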

Jay, what you are saying sounds mostly right, so I'm not sure if you're saying I got something wrong in my earlier message. I do know that for including headers, at least with the 2010 version of Visual C++, I need to use quotes or the header won't be found, and I have used a non-user-written library before: #include <iostream>. I didn't think that the non-user-written library had to be on the compiler's path, although I don't know where that would be.

Perhaps I have misunderstood something, as I have done very little with C++.

I couldn't find any reference to gfn_app.h or gfn_main.h. Where did you see that?

Instead of bringing up the full project in Visual C++, I went looking through the folder at the individual files. One file was named gfn_main.c, which I opened and looked at. That was where I saw the line
#include <gfn_app.h>
That is also where the line "#include <gfn_main.c>" is.

By the way, I wasn't sure if project-level preprocessor directives got included in the file I zipped up. You should make sure NDEBUG is #defined in the project, or you may get more errors than I did.

I downloaded the zipfile and tried to run the exe in the release folder. I got a message that the program can't start because cudart.dll is missing from my computer. I think I may have found a place to get the cudart.dll. Do I need to get it and put it in the directory with the exe, or is there something else I need to do?


I did a spit take! But then I realized that's the wrong line. What did the line that starts with "Elapsed time" say?

Sorry, I stopped it at about 25% complete (I am switching the card out this evening for an ATI 4670 that I just picked up). A wall-clock estimate for the total run time, based on that 25%, would be in the neighborhood of 3-3.5 hours. Also, interestingly, I had very little delayed screen response.

I am also curious as to what this may be...it depends on what version of the CUDA SDK this was compiled with...newer versions run considerably faster on newer cards and also add capabilities (double precision, anyone?).

I'll try getting the cudart.dll from a project like GPUGrid or Milkyway, which both use at least CUDA 2.2 (due to double precision support) and see what, if any, difference that makes...

EDIT: Whoa, put my foot in my mouth a bit there...Collatz would use 2.2...my bad... :p
And also, when I switched the cudart.dll with the one from GPUGrid, it made NO difference whatsoever...

OK, I had thought it was running at over 1MP/s; it was actually just 1KP/s. I think something may be wrong with my sleep timing. I'll look into it and get back to you.

I did notice (through watching GPU usage on EVGA Precision) that the GPU usage never stayed constant...it would spike for a second or two to around 75% and then fall to zero for about 10-20 seconds....

Hope that helps!

And BTW, thanks Ken for building a Windows version! It seems like it has some more ground to cover to catch up with the Linux builds, but great job nonetheless!

Taking a look at the GPU utilization graph on GPU-Z, it showed that the vast majority of the time the utilization was 0%. About every 15 seconds or so, the utilization briefly spiked way up, then returned back to 0. Even stranger was that it wasn't using the CPU during the time the GPU was idle. CPU utilization was at about 10% to 20% of a single core according to task manager. (The output from the program said it was using 0.03 CPU cores, which was significantly lower than what task manager was showing.)

So, for most of the run time, it's not using the GPU or the CPU. I would guess that it's either waiting on a resource or sleeping.
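(Purely a guess at the mechanism - I haven't seen the relevant code - but a generic wait loop like the sketch below would produce exactly that pattern: the host thread sleeps in ever-longer intervals while polling the GPU, so if the kernel finishes early in a long sleep, both the GPU and the CPU sit idle until the host wakes up.)

    #include <windows.h>
    #include <cuda_runtime.h>

    /* Hypothetical illustration, NOT ppsieve's actual code. */
    void wait_for_kernel(cudaEvent_t done)
    {
        DWORD ms = 50;
        while (cudaEventQuery(done) == cudaErrorNotReady) {
            Sleep(ms);                  /* CPU idle here...                           */
            if (ms < 30000) ms *= 2;    /* ...and the GPU may be idle too, if the     */
        }                               /* kernel already finished during the sleep   */
    }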


Hmmm...this might show something about Vista specifically. On my 9600GSO under 32-bit XP Pro, GPU-Z shows the GPU utilization at 99% for the whole test. My unusually long 9500GT results (which on the OC'ed 32-shader card should be similar to the stock clocked 9600 GS 48-shader card) are also obtained on Vista (albeit 64-bit). Looks like something in the code is not activating the GPU properly under Vista (and I'd suspect under Win 7 also).

A stop error with that code is usually associated with a problematic boot device (usually a hard drive)...kinda weird to see it with this CUDA application. You aren't by chance trying to run it off of a USB stick?

Alright, I'm getting the impression that sleep isn't fixed - at least not in all cases.

So, I'd like those of you who had problems with it in particular, and anyone else, to test the version I just uploaded, and please report the sleep diagnostics it outputs. By the way, "OVERslept by WAY TOO LONG!" is the line that indicates trouble; but I can only learn the magnitude of the trouble from the other lines.

The other option is to use one CPU core 100%, but I'd like to avoid that if I can.

Ok, here's the result using that DLL, which I believe is the same as the one I was already using. Note that the sleep time starts at just under 1 second and steadily increases to around 30 seconds. During the sleep time, the GPU is idle.

EDIT: Correction, the new DLL is not the same one I used before. WinRAR seems to have a bug and was telling me they were the same when they were not. Nevertheless, this test was done with the correct DLL.

With the same DLL, I am getting the exact same results (just taking much longer) on my 9500GT in a Vista 64-bit system (driver 191.07). GPU-Z shows off-and-on spikes to about 30%, but never higher, for GPU load.

So far, all the issues are on Vista even with the same DLL and across multiple driver versions. We really need a Win7 machine to test this, too (unfortunately my one Win7 box has an ATI card).

GPU temps were going up nicely, although this ran quickly enough that it didn't have enough time to go all the way up. GPU-Z was showing GPU utilization in the 90-100% range, which is excellent. I had other BOINC stuff running, but that shouldn't affect this test because it's running at a higher priority than normal BOINC tasks. CPU utilization was shown as 1%, which is about 4% on a single core.

That's the good news.

The bad news is that while this was running, the video display was rather unresponsive. The problem was bad enough that I would never run this on a computer that had a live user. I think that is a side effect of the individual kernels running too long, but I may be wrong. If I'm correct, shortening the amount of work done by each kernel will make the screen more responsive, but will increase the amount of work the CPU has to do because it has to launch more kernels.


Aha! I'd gotten reports of slow displays before, but I never knew what to do about it! So thanks!

Now that I know what to Google for, I'm seeing suggestions of 10-20ms kernel runtimes to avoid slow screens. Based on the runtimes (sleep times) above, I think I can do that by breaking the one kernel up into one setup kernel and 20 or so calls to one iteration kernel. But not tonight.
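(Roughly what I have in mind - names and structure are illustrative only, not the final code:)

    /* Before: one long kernel does a whole batch in a single launch, hogging the
       display GPU. After: one setup launch plus ~20 short iteration launches,
       each sized for roughly 10-20 ms of work so the screen stays responsive. */
    __global__ void setup_kernel(unsigned long long *state);            /* hypothetical */
    __global__ void iterate_kernel(unsigned long long *state, int i);   /* hypothetical */

    void run_batch(unsigned long long *d_state, int iterations, dim3 grid, dim3 block)
    {
        setup_kernel<<<grid, block>>>(d_state);
        for (int i = 0; i < iterations; i++) {
            iterate_kernel<<<grid, block>>>(d_state, i);  /* short kernel */
            cudaThreadSynchronize();    /* drain the queue so the driver can
                                           service the display between launches */
        }
    }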

Now I'd like everyone who tried the last Windows version to try the version I just uploaded. It should make it easier to use your computer while it runs, but it may slow the computation; how much, I'm not sure yet.

For those who missed it, the previous version is here, so you can compare the old and new versions too.

By the way, if this works, I will have a Linux version of it; I'm just in the middle of Windows coding right now.

It's actually a second or two *faster* than before, but that could be due to outside factors. However, the screen lag seemed to be about as bad as before too.

I'm not sure I was running the right executable. The date of the .exe in the zip file was June 4th, at around 3PM. Is that the most recent file? Could you post a link to the correct download? I had to scroll back pretty far to find it and maybe I followed the wrong link.

Compiling for BOINC with VS didn't work on my virtual WinXP machine, and I've been putting off doing it on a real XP machine. If anyone with experience compiling BOINC apps on VS wants to do it themselves, they're more than welcome to do so. :)