I'm running multiple NVIDIA GTX 680 cards under Ubuntu 10.04 in a fairly hot environment (we have trouble with rack cooling), and they sometimes exceed 95 °C. When I detect overheating, can I somehow tell the driver to reduce the resources in use, e.g.

number of threads

number of cores

GPU clock frequency

memory clock frequency

..?

dynamically, without restarting the process, so that the GPU can cool down a little?
Perhaps there is something like nvidia-smi or nvidia-settings that would allow me to do so? The one constraint is that I need to do this externally, without modifying the actual code.

The process runs for several days and performs scientific calculations without any graphical output, so it would be fine if the matrix multiplication slowed down for a while.

What you're trying to do is a work-around; fix the root cause, which is insufficient cooling. I use HP SL270s Gen8 servers with 8 x M2090s per server (i.e. 4096 CUDA cores) and they never get near 80 degrees C. You need to cool better.
– Chopper3, Jul 23 '13 at 12:23

I'd appreciate a work-around though. Proper cooling is a different topic; for the moment I'm looking for a way to control the GPU resources as I described.
– Pavel, Jul 23 '13 at 13:17

NVIDIA states the GTX 680 can run safely up to 98 °C; you're much too close to the maximum. You can underclock the GPU, which will make it run cooler. Usually this functionality is built into the calculation software itself.
– Chris S, Jul 23 '13 at 13:40

The Linux kernel doesn't really have any say in the process. The CUDA driver provides clocking functionality, but the application would have to support it. Try the nvclock utility. Also, we've got a couple of system builders around the site (Chopper being one) who run many cards in a single rack server without these issues. You're definitely better off fixing your cooling problems.
– Chris S, Jul 23 '13 at 14:02

Have you seen secHighTmprMon.sh? It throttles the CPU based on its temperature, and I think GPU throttling is planned too.
– Aquarius Power, Oct 24 '14 at 16:57

1 Answer

Trying to "fix" the problem by throttling the GPUs when you detect overheating is a Bad Idea.
You're operating on the ragged edge of the envelope, and even if you start throttling back at, say, 90 degrees (8 degrees before the "redline" that NVIDIA specifies), there's no guarantee you won't overshoot the limits of your cooling (and the hardware's safe operating range).

Down this road lies only misery - in the form of computation errors, hardware damage, and large repair/replacement bills.

Throttling the GPUs can help if you do it early enough.
You could throttle the GPUs all the time, preventing them from ever exceeding their maximum operating temperature. This will save your hardware, but you're crippling performance to keep the system at a safe temperature.
You could implement this with a PID algorithm that starts throttling the GPUs around, say, 80 degrees, in order to hold them at or below 90 degrees.
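
To make the PID idea concrete, here is a minimal sketch in Python. It only shows the control loop against a toy thermal model. The class name `ThrottlePID`, the `simulate` helper, the gains, and every temperature constant are illustrative assumptions, not measured values for a GTX 680 and not part of any NVIDIA tool. How the resulting throttle fraction would actually be applied to a card is deliberately left out, since that depends on what tooling supports your hardware.

```python
from dataclasses import dataclass
from typing import List, Optional

# Toy thermal model (assumed numbers, not measurements of a real card).
FULL_LOAD_C = 97.0   # steady-state temperature with no throttling
THROTTLED_C = 60.0   # steady-state temperature at maximum throttle
TAU_S = 30.0         # thermal time constant, in seconds


@dataclass
class ThrottlePID:
    """Hypothetical PID controller that maps temperature to a throttle fraction."""
    setpoint_c: float = 80.0  # throttling starts to engage around this temperature
    kp: float = 0.1           # illustrative gains, untuned for real hardware
    ki: float = 0.01
    kd: float = 0.05
    integral: float = 0.0
    prev_error: Optional[float] = None

    def update(self, temp_c: float, dt_s: float) -> float:
        """Return a throttle fraction: 0.0 = full speed, 1.0 = maximum throttle."""
        error = temp_c - self.setpoint_c
        # Keep ki * integral within [0, 1] so the integral term cannot wind up.
        self.integral = min(1.0 / self.ki, max(0.0, self.integral + error * dt_s))
        # Skip the derivative on the first sample to avoid a spurious kick.
        if self.prev_error is None:
            derivative = 0.0
        else:
            derivative = (error - self.prev_error) / dt_s
        self.prev_error = error
        raw = self.kp * error + self.ki * self.integral + self.kd * derivative
        return min(1.0, max(0.0, raw))


def simulate(seconds: float = 600.0, dt_s: float = 1.0, start_c: float = 70.0,
             controller: Optional[ThrottlePID] = None) -> List[float]:
    """Step the toy model and return the temperature after each time step."""
    temp = start_c
    history = []
    for _ in range(int(seconds / dt_s)):
        throttle = controller.update(temp, dt_s) if controller else 0.0
        # More throttle lowers the temperature the card is heading toward.
        steady = FULL_LOAD_C - (FULL_LOAD_C - THROTTLED_C) * throttle
        temp += dt_s * (steady - temp) / TAU_S
        history.append(temp)
    return history


if __name__ == "__main__":
    print("peak without throttling:", round(max(simulate()), 1))
    print("peak with PID throttling:", round(max(simulate(controller=ThrottlePID())), 1))
```

In a real deployment the `simulate` loop would be replaced by periodic temperature readings from the card, and the returned fraction would have to be translated into whatever clock or load limit your tooling can set. The gains would need tuning against the actual thermal behaviour of the rack.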

Presumably, though, you're spending a lot of money on this compute farm, and throttling it rather defeats the purpose (getting results fast).

Fixing your cooling problem is the only Real Solution.
As the commenters pointed out, your core problem is bad or insufficient cooling.

We don't know WHY you have insufficient cooling, and the solutions would depend on the underlying cause.

If the case has poor airflow you can add blowers to move a higher volume of air through the system.

If your datacenter has poor cooling airflow you can redesign your room to ensure the intake air is cooler.

If your datacenter is chronically overheated you may need to add more cooling (however much is necessary to handle your heat load).

Thank you for your answer! You're right about everything you said, but I'm still looking for technical details on how to perform the throttling. nvclock was a good start, but it seems that it doesn't support my video cards. Perhaps there are other ways?
– Pavel, Jul 24 '13 at 13:04

@Pavel You will need to research your card and determine what tools (if any) are available that support it.
– voretaq7♦, Jul 24 '13 at 16:01