To run the unit tests:

luajit -l cunn -e'cunn.test()'

GPU Training Concepts

Performance

Data should be transferred between main memory and the GPU in batches; otherwise the transfer time will be dominated by per-transfer latency (speed-of-light delays and execution overheads) rather than by bandwidth.
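As an illustration of the batching point, compare one bulk host-to-device copy with many small ones. This is a sketch, not from the original text; the tensor names and size are arbitrary:

```
require 'cutorch'

local src = torch.FloatTensor(1000000):uniform()  -- host (CPU) memory
local dst = torch.CudaTensor(1000000)             -- device (GPU) memory

-- One bulk copy: the transfer time is dominated by bandwidth.
dst:copy(src)

-- Copying element by element would instead pay the per-transfer
-- latency a million times, swamping the actual data-movement cost.
```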

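The snippet elided above presumably allocated a fresh result tensor on every iteration, along these lines (a reconstruction based on the description that follows; the exact original code is not shown):

```
require 'cutorch'

local a = torch.CudaTensor(1000):uniform()
for it=1,1000 do
  -- torch.add with no destination tensor allocates a new CudaTensor
  -- on every iteration
  local b = torch.add(a, 1)
end
```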
... this will allocate one thousand new CudaTensors, one for each call to torch.add(a, 1).

Use this form instead:

require 'cutorch'
local a = torch.CudaTensor(1000):uniform()
local b = torch.CudaTensor(1000):uniform()
for it=1,1000 do
  b:add(a, 1)
end

In this form, b is allocated only once, before the loop. The b:add(a, 1) operation then performs the add inside the GPU kernel and stores the result into the original b CudaTensor. This will generally run noticeably faster. It is also much less likely to eat up arbitrary amounts of memory, and less likely to need frequent calls to collectgarbage(); collectgarbage().

Benchmarking

GPU operations are asynchronous: they will typically continue running after the instruction that launched them has returned. Accurate timings therefore require calling cutorch.synchronize() before reading the clock.
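Because kernels run asynchronously, a wall-clock measurement must be bracketed by device synchronization. A minimal sketch using torch.Timer (the tensor sizes and loop count are arbitrary):

```
require 'cutorch'

local a = torch.CudaTensor(1000, 1000):uniform()
local b = torch.CudaTensor(1000, 1000)

cutorch.synchronize()   -- ensure pending work is done before starting the timer
local timer = torch.Timer()
for it = 1, 100 do
  b:add(a, 1)
end
cutorch.synchronize()   -- wait for all launched kernels to finish
print('elapsed (s): ' .. timer:time().real)
```

Without the second synchronize, the timer would mostly measure kernel-launch overhead rather than the actual GPU work.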