Comments

I'm running TensorFlow 0.9.0 installed from wheel on Python 2.7, on a K40 with CUDA 7.0.

The following test case attempts to minimize the mean of a vector through gradient descent. The script finds that the vectors are equal at all steps, but the means are not. I believe the vectors being equal at all steps is pure numerical luck: a non-deterministic loss likely means a non-deterministic gradient, which in turn means non-reproducible iterative optimization. I've observed cases where training runs reach different final losses even though the only source of non-determinism is reduce_mean.

Atomic floating-point adds on the GPU are the problem: performing floating-point additions to the same address in an undefined order is inherently non-deterministic due to the non-associativity of floating-point arithmetic.
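The non-associativity is easy to see with plain Python floats (the same IEEE-754 arithmetic the kernels use); this is a minimal stand-alone illustration, not TensorFlow code:

```python
# Floating-point addition is not associative: the result depends on
# the order in which partial sums are combined.
a, b, c = 0.1, 0.2, 0.3

left = (a + b) + c   # 0.6000000000000001
right = a + (b + c)  # 0.6

print(left == right)  # False

# An atomic-add reduction commits its additions in whatever order the
# GPU threads happen to reach the atomic, so the final sum can differ
# between otherwise identical runs.
```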

This issue could be solved (and reduction performance improved) by using a reduction tree to reduce within blocks, and then launching a second kernel (or using manual block-synchronization tricks) to reduce across blocks.
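A fixed-shape pairwise (tree) reduction is deterministic because the order of additions is a function of the input length alone, not of thread scheduling. A rough CPU-side sketch of the idea (`tree_sum` is a hypothetical helper, not the actual TF kernel):

```python
def tree_sum(values):
    """Deterministic pairwise reduction: repeatedly sum adjacent pairs.

    The combination order depends only on len(values), mirroring a
    within-block reduction tree followed by a cross-block combine.
    """
    vals = list(values)
    while len(vals) > 1:
        paired = []
        for i in range(0, len(vals) - 1, 2):
            paired.append(vals[i] + vals[i + 1])
        if len(vals) % 2:          # carry the odd element up unchanged
            paired.append(vals[-1])
        vals = paired
    return vals[0] if vals else 0.0

data = [0.1] * 10
# The same input always yields a bit-identical result, run after run.
print(tree_sum(data) == tree_sum(data))  # True
```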


@lightcatcher as you point out, our current implementation of sum reduction (and the various ops that depend on it) is not deterministic on either GPUs or multi-threaded CPUs. It is primarily a speed/accuracy trade-off, and if we could get comparable speed from one of the approaches you mention, we would be happy to switch the implementation. In short, this is working as intended for now, but contributions for a more accurate or even deterministic sum reduction (possibly as a separate op) would certainly be welcomed.


@TimZaman Unless something has changed in recent months, some of the cuDNN code is non-deterministic, for instance cudnn.SpatialConvolution. So I guess that some of the CNN-related stuff in TensorFlow may be non-reproducible (if run on GPU). It would probably be a bit of work, but it would be nice to have a flag or a note in the TF docstrings of the affected functions.


As @zheng-xq mentioned earlier, anything using CUDA atomics is non-deterministic, so one way to narrow it down is to check which cuDNN algorithms use CUDA atomics. For CPU ops, the way to check might be to track down parallel ops (see which ops use tensorflow/core/util/work_sharder.cc) and verify that the result is independent of the order in which individual work shards complete. Note that there are trickier cases of non-determinism: for instance, the same sequence of SSE instructions can give different results on a rerun, so to get a stronger guarantee of determinism you need to disable both multi-threading and special instruction sets: http://blog.nag.com/2011/02/wandering-precision.html
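The shard-order check can be simulated on the CPU: combine the same per-shard partial sums in two different orders and compare. This is an illustrative sketch (the shard values and `combine` helper are made up, not from work_sharder.cc); in general the two totals agree only approximately, not bit-for-bit:

```python
import random

# Simulate per-shard partial sums being combined in whatever order the
# shards happen to finish, as a work_sharder-style parallel op might.
random.seed(0)
shard_sums = [random.uniform(-1.0, 1.0) for _ in range(1000)]

def combine(parts, order):
    # Accumulate the partial sums in the given completion order.
    total = 0.0
    for i in order:
        total += parts[i]
    return total

in_order = combine(shard_sums, range(len(shard_sums)))

shuffled = list(range(len(shard_sums)))
random.shuffle(shuffled)
reordered = combine(shard_sums, shuffled)

# Mathematically equal, but the last few bits typically differ, so the
# op is only deterministic if it fixes the combination order.
print(in_order, reordered)
```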


Are reductions still non-deterministic by default on GPU? If so, can this issue be re-opened? Deterministic computation is critical for reproducibility, and reductions are a critical part of neural nets.

Finally:
Several comments on this issue and other linked issues claim that "reductions are non-deterministic for performance". This is not the case. A reduction tree is both deterministic and generally faster than using atomic adds (which cause the non-determinism). https://devblogs.nvidia.com/faster-parallel-reductions-kepler/ describes reduction trees and shows that a reduction tree plus very limited use of atomic adds is the fastest option, with a pure reduction tree (deterministic) only marginally slower.
Last I checked (quite a while ago), the TF reduction implementation exclusively used atomics and no reduction tree, so a switch to a tree-only implementation could even provide a performance boost. My guess is that atomics were used in the TF implementation not for performance but because the atomics-based implementation is somewhat simpler to write.


@yaroslavvb
I just re-ran my initial examples on TF 1.5.0 and found that the reduce_mean call produced consistent results across 10K trials with the mean kernel running on a K80 GPU. This matches @ahmedhosny's observation.