Overall I'm very satisfied with the program, and would gladly release a partial/full bounty, pending a quick answer on why I had to edit the registry to get the program to operate. Equipoise, I will send you a few XMR outside of the bounty, thanks a lot for providing it! I would like to hear some more from tsiv.

It is actually explained on the project's front page on GitHub, I believe.

The initial release had the entire algorithm stuffed into a single huge CUDA kernel. Having to do the whole slow algorithm in one go had a tendency to take just a bit over 2 seconds per kernel launch, with 2 seconds being the timeout for Windows getting impatient and going "hmmh, I haven't heard from the GPU in 2 seconds. Must've crashed, better reset the driver." The registry tweak works around the problem by increasing the time that Windows allows the GPU to be "unresponsive" aka stuck running a CUDA kernel.

This has been addressed in later releases, mainly by splitting the single huge kernel into smaller pieces and making parts of the hash faster. The slowest part is still quite slow, taking roughly 1.4 seconds with launch config 8x60 on a 750 Ti but it should stay well within the default 2 second window.

There is something more to be said about -l MxN.

About the first number, M: "First of all, your thread block size should always be a multiple of 32, because kernels issue instructions in warps (32 threads). For example, if you have a block size of 50 threads, the GPU will still issue commands to 64 threads and you'd just be wasting them."

About the second number, N: you can find it by gradually increasing it until your card stops working (showing an impossible hash rate such as 3474958.52 H/s). After such a crash a restart is needed to get maximum performance back (though not for further testing), because without a restart my hash rate is about 2x lower than with the same options before the crash.
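
To put numbers on the warp point quoted above, here is a minimal Python sketch of how a block size gets rounded up to whole 32-thread warps. It is not part of ccminer; the function names are made up for illustration.

```python
WARP_SIZE = 32  # threads per warp on NVIDIA GPUs


def warp_padded_threads(block_size: int) -> int:
    """Threads the GPU actually schedules for one block of `block_size` threads."""
    warps = -(-block_size // WARP_SIZE)  # ceiling division
    return warps * WARP_SIZE


def wasted_lanes(block_size: int) -> int:
    """Scheduled thread slots that do no useful work."""
    return warp_padded_threads(block_size) - block_size


if __name__ == "__main__":
    for size in (8, 32, 50, 64, 128):
        print(size, warp_padded_threads(size), wasted_lanes(size))
```

A block of 50 threads comes out as 64 scheduled threads with 14 idle, which is the case described in the quote.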

The "magic numbers" for the 650M seem to be -l 128x5.

I realize that 8x60 or 8x40 make absolutely no sense, they're something I ran into while trying out different values. The reasonable values would be based on the number of SMM/SMX on the GPU, and 32 or 64 threads per block would make a lot of sense. I can't tell exactly why performance takes a dive if you try 64x5 for example, it should be a very good value to start at. It might have something to do with the huge amount of random global memory access in the second major loop of the algo: maybe trying to do more work in parallel just bottlenecks on memory access?

Good news is that I've since modified 2 of the 3 main loops to use 8 parallel threads per hash as opposed to the original 1 thread per hash. So essentially 8x60 leads to running 64 threads per block for those two loops. Still working on the last loop, it does seem a fair bit harder to make it more parallel.
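
The arithmetic behind "8x60 leads to 64 threads per block" can be sketched in Python as below. This is only an illustration, not ccminer code. It assumes, going by the "using 40 blocks of 8 threads" log line reported in this thread for the default 8x40, that M in -l MxN is the threads per block and N is the number of blocks; the names LaunchConfig, parse_launch and so on are hypothetical.

```python
from typing import NamedTuple


class LaunchConfig(NamedTuple):
    threads_per_block: int  # M in "-l MxN" (assumed meaning)
    blocks: int             # N in "-l MxN" (assumed meaning)


def parse_launch(text: str) -> LaunchConfig:
    """Parse a launch string such as '8x60'."""
    m, n = text.lower().split("x")
    return LaunchConfig(int(m), int(n))


def hashes_per_launch(cfg: LaunchConfig) -> int:
    """With the original 1 thread per hash, every thread works on one hash."""
    return cfg.threads_per_block * cfg.blocks


def parallel_threads_per_block(cfg: LaunchConfig, threads_per_hash: int = 8) -> int:
    """Threads per block in the loops that use several threads per hash."""
    return cfg.threads_per_block * threads_per_hash


if __name__ == "__main__":
    cfg = parse_launch("8x60")
    print(cfg, hashes_per_launch(cfg), parallel_threads_per_block(cfg))
```

Under these assumptions 8x60 works on 480 hashes per launch, and the two parallelized loops run 8 * 8 = 64 threads per block, which is two full warps.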

At launch, I get a series of results like:

GPU #0: GeForce GTX 750 Ti, using 40 blocks of 8 threads
Pool set diff to 15000
GPU #0: GeForce GTX 750 Ti, 93.81 H/s

then a popup says the display driver stopped responding and has recovered. After that I see results with crazy high numbers of hashes like this:

GPU #0: GeForce GTX 750 Ti, 163611988.12 H/s

interspersed with 'stratum detected new block' but no accepted results within a half hour check period.

I also tried downloading the previous release, but switching to that one makes the cmd.exe pop up and vanish immediately on my system (Windows 8.1, Driver 337.88). The GTX750ti is not attached to a display output.
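
A driver reset of this kind can be spotted from the miner output alone, since the reported rate jumps to an absurd value afterwards. Below is a minimal Python sketch that parses hash rate lines in the format shown above and flags such values. The 10,000 H/s ceiling is an arbitrary assumption chosen to sit far above anything reported for these cards in this thread, and the function names are hypothetical.

```python
import re
from typing import Optional, Tuple

# Matches lines such as "GPU #0: GeForce GTX 750 Ti, 93.81 H/s"
HASHRATE_RE = re.compile(r"GPU #(\d+): (.+?), ([\d.]+) H/s")


def parse_hashrate(line: str) -> Optional[Tuple[int, float]]:
    """Return (gpu_index, hashes_per_second), or None for other log lines."""
    match = HASHRATE_RE.search(line)
    if match is None:
        return None
    return int(match.group(1)), float(match.group(3))


def looks_bogus(rate: float, ceiling: float = 10_000.0) -> bool:
    """True if the rate is implausibly high (ceiling is an assumed threshold)."""
    return rate > ceiling


if __name__ == "__main__":
    log = [
        "GPU #0: GeForce GTX 750 Ti, using 40 blocks of 8 threads",
        "GPU #0: GeForce GTX 750 Ti, 93.81 H/s",
        "GPU #0: GeForce GTX 750 Ti, 163611988.12 H/s",
    ]
    for line in log:
        parsed = parse_hashrate(line)
        if parsed is not None and looks_bogus(parsed[1]):
            print("GPU", parsed[0], "reports an impossible rate; restart the miner")
```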

On Windows 8.1 with a 750 Ti, 8x60 crashes the driver. A lower setting of 6x40 works but it hurts performance. Any way to fix that?

Any help appreciated. Not sure what's going wrong.

Pretty sure it's still a TDR issue: the biggest part of the cryptonight core still gets run as a single launch, and it just might take those 2 seconds, so Windows with the default TDR delay considers the GPU stuck and does a driver reset. See https://bitcointalk.org/index.php?topic=656841.msg7529269#msg7529269 for a workaround. I plan on looking at splitting the work down; at a quick glance it looks like it could be run piece by piece. It will probably hurt performance a bit, since you have to save and reload the encryption keys on every kernel launch and the launches themselves have some overhead. My thought was to make it a command-line option, allowing the user to decide how much (or if) they want to split it up. Maybe add a few microseconds of sleep between the launches, to stop the display freezing for 1+ seconds at a time and make the computer at least semi-usable.
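
The trade-off in splitting a long launch can be put into rough numbers. The Python sketch below is a back-of-the-envelope model, not the miner's code: the 2 second limit comes from this thread, while the safety factor and the per-launch overhead are assumed values, and both function names are hypothetical.

```python
import math


def pieces_needed(single_launch_s: float,
                  tdr_limit_s: float = 2.0,
                  safety: float = 0.5) -> int:
    """Smallest number of equal pieces so each launch stays under
    tdr_limit_s * safety (safety is an assumed margin)."""
    budget = tdr_limit_s * safety
    return max(1, math.ceil(single_launch_s / budget))


def total_time(single_launch_s: float, pieces: int, overhead_s: float) -> float:
    """Total run time if every piece pays a fixed extra cost for the launch
    itself plus saving and reloading state (overhead_s is an assumed value)."""
    return single_launch_s + pieces * overhead_s


if __name__ == "__main__":
    work = 2.1  # "just a bit over 2 seconds" for the single huge kernel
    n = pieces_needed(work)
    print(n, total_time(work, n, overhead_s=0.01))
```

With these assumptions a 2.1 second kernel is cut into 3 launches of 0.7 seconds each, at the cost of a few hundredths of a second in total overhead.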

I tried the TDR delay reg edit on my rig, and it now seems to be running stably at 45-50 H/s (40*, 750 ti, diff 15000. I just got my first accepted after about 10 mins so it seems to be working, thanks. Is the hash rate a bit low though?

What do you have in your command line for the "-l" parameter?

If you don't specify the parameter it should be 8x40 by default. The 750 Ti should be hashing in the 200's I believe.

If not, perhaps HardwarePal could make one, publish the view key? I don't want to hold it, just send to it.

I think smooth was collecting, but if HardwarePal is hiring someone directly then that would probably be okay, so long as the view key is published. Maybe we should check with smooth to see if he's collected anything? It doesn't seem like anyone's working on this bounty besides HardwarePal, so my part of the ATI bounty still stands to be claimed by what's being worked on (150 XMR - Keyboard-Mash), and it looks like Tsiv will be claiming the Nvidia miner bounty. Tsiv, can you please provide an XMR address and view key here?

Due to all the talk about Claymore's closed-source 5% GPU miner, I have paid Wolf to release his OpenCL code for the GPU miner on GitHub, so that the open-source community can contribute, and Wolf himself as well.

The initial idea was to pay him 10 BTC to do the project and release a working miner. I have changed plans due to Wolf being tired, plus pool owners wanting to cut Claymore's 5%, which could cause other, bigger problems.

I paid him a total of 3 BTC to release the code. He will be updating on the main Monero thread in a few hours.

It makes Windows think the driver crashed, because a kernel takes too long without returning a result (the Windows default is 2 seconds for the GPU). That's why Windows restarts the driver, after which ccminer stops working and shows you an impossible hash rate. I set the timeout to 40 seconds on my machine (I tried 10 first; it worked for some time, but then it crashed again). This got the second NVIDIA card in my laptop happily mining at about 22 H/s. Don't do this if you don't have a second video card, because your PC will become completely unusable while mining (if you are CPU mining on the same machine, start the CPU miner before the GPU miner, because otherwise it'll be difficult to even start the CPU miner). Here is a link to a .reg file which will set the timeout to 40 seconds: just double-click it and it'll add the setting to the registry (it'll ask you if you are sure). Then restart Windows and ccminer should work after the restart. If you find it useful don't forget to tip me :) https://www.dropbox.com/s/ci8b3h7oxtvd6dq/TdrDelaySetTo40.reg

Link from Wolf0 regarding the OpenCL GPU miner release: https://bitcointalk.org/index.php?topic=671784.0

No command line "-l". Default 40x8. Tried 80x8 and 40x16, but they were worse. Surely Windows can't be nerfing the GPU so badly?

I can't run tsiv's ccminer under Win 8.1. The driver crashes every time I start ccminer. Any solution?

The NVIDIA drivers will "time out" and give you a crash message. This is Windows basically detecting that the drivers haven't responded properly for a while, and so it stops them and of course your mining quits as well. You can usually get around this via a registry edit, but in some cases even that may not work all the time so be prepared to fiddle around a bit. The registry hack is easy enough:

1. Run "regedit.exe" from the Start Menu.
2. Navigate to "HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\GraphicsDrivers".
3. Right-click in the right panel and create a new 32-bit DWORD value.
4. Name the value "TdrDelay" and assign it a value of anywhere from 10 to 30 (decimal; 0A to 1E hex).
5. Reboot and you should be set.
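
The same setting can also be written out as a .reg file instead of clicking through regedit. The Python sketch below only builds the text of such a file for a chosen delay; it does not touch the registry itself. The function name is made up for illustration, and the output should be reviewed before it is imported on a real machine.

```python
REG_HEADER = "Windows Registry Editor Version 5.00"
REG_KEY = r"[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\GraphicsDrivers]"


def tdr_delay_reg(seconds: int) -> str:
    """Return .reg file text that sets the TdrDelay DWORD to `seconds`."""
    if not 0 < seconds <= 0xFFFFFFFF:
        raise ValueError("seconds must fit in a positive 32-bit DWORD")
    # .reg files store DWORD values as 8 hexadecimal digits
    value_line = f'"TdrDelay"=dword:{seconds:08x}'
    return "\n".join([REG_HEADER, "", REG_KEY, value_line, ""])


if __name__ == "__main__":
    print(tdr_delay_reg(30))
```

For a delay of 30 seconds the value line reads "TdrDelay"=dword:0000001e, matching the 1E hex upper bound given above. Save the output as a .reg file, double-click it, then reboot.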