Theano on Debian: maintenance, BLAS and CUDA

[2016-08-30: some corrections, added some info on nvidia-smi and cuda_check;
2016-10-17: added Lasagne example at the end;
2016-10-26: a few language improvements]

I’m glad to announce that the current release of Theano (0.8.2) is now in Debian unstable. It’s on its way into the testing branch and the Debian derivatives, heading for Debian 9.
The Debian package is maintained on behalf of the Debian Science Team.

We have a binary package with the modules in the Python 2.7 import path (python-theano), if you want or need to stick to that Python flavor a little longer (as a matter of fact, in the current popcon stats it’s the most installed package), and a package running on the default Python 3 version (python3-theano).
The comprehensive documentation is available for offline usage in another binary package (theano-doc).

Although Theano builds its extensions at run time and therefore all binary packages contain the same code, the source package generates arch-specific packages[1] so that the exhaustive test suite can run on all architectures to detect problems anywhere (#824116).

what’s this?

In a nutshell, Theano is a computer algebra system (CAS) and expression compiler, which is implemented in Python as a library.
It is named after a Classical Greek female mathematician and it’s developed at the LISA lab (located at MILA, the Montreal Institute for Learning Algorithms) at the Université de Montréal.

Theano tightly integrates multi-dimensional arrays (N-dimensional, ND-array) from NumPy (numpy.ndarray), which are broadly used in Scientific Python for the representation of numeric data.
It features a declarative, Python-based language with symbolic operations for the functional definition of mathematical expressions, which makes it possible to create functions that compute values for them.
Internally the expressions are represented as directed graphs with nodes for variables and operations.
The internal compiler then optimizes those graphs for stability and speed and generates high-performance native machine code to evaluate these mathematical expressions[2].
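
To make the idea of an expression graph a bit more tangible, here is a toy sketch in plain Python. It is not Theano's internal representation or API; the classes Var, Const and Op and the functions simplify and evaluate are hypothetical names invented for this illustration. It only shows the principle: an expression is a graph of variable and operation nodes, a rewrite pass simplifies the graph, and evaluation happens afterwards.

```python
# Toy illustration only: this is NOT how Theano stores or optimizes graphs.
class Var:
    def __init__(self, name):
        self.name = name


class Const:
    def __init__(self, value):
        self.value = value


class Op:
    def __init__(self, kind, left, right):
        self.kind = kind  # either "add" or "mul"
        self.left = left
        self.right = right


def simplify(node):
    """Rewrite x * 1 -> x and x + 0 -> x, working bottom-up through the graph."""
    if not isinstance(node, Op):
        return node
    left, right = simplify(node.left), simplify(node.right)
    neutral = 1 if node.kind == "mul" else 0
    if isinstance(right, Const) and right.value == neutral:
        return left
    if isinstance(left, Const) and left.value == neutral:
        return right
    return Op(node.kind, left, right)


def evaluate(node, env):
    """Compute the value of a graph for the variable bindings in env."""
    if isinstance(node, Var):
        return env[node.name]
    if isinstance(node, Const):
        return node.value
    a, b = evaluate(node.left, env), evaluate(node.right, env)
    return a + b if node.kind == "add" else a * b


x, y = Var("x"), Var("y")
expr = Op("add", Op("mul", x, Const(1)), Op("add", y, Const(0)))  # x*1 + (y+0)
optimized = simplify(expr)  # the graph is now just x + y
print(evaluate(optimized, {"x": 2.0, "y": 3.0}))
```

Theano's real optimizer does far more than this (numerical stabilization, fusing operations, moving work to the GPU), and it generates machine code instead of interpreting the graph.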

One of the main features of Theano is that it can also compute on GPUs (graphics processing units), like those on ordinary graphics cards (e.g. the developers are using a GeForce GTX Titan X for benchmarks).
Today’s GPUs have become very powerful parallel floating point devices which can also be employed for scientific computations instead of 3D video games[3].
The acronym “GPGPU” (general-purpose computing on graphics processing units) is also used for special cards like NVIDIA’s Tesla[4], which can be used in the same way (more on that below).
All together, Theano is a heavy-duty number cruncher with its own computing engine which can be used for large-scale scientific computations.

If you haven’t come across Theano yet as a professional mathematician with an interest in Python software: it’s also one of the most prevalent frameworks for implementing deep learning applications (training multi-layered, “deep” artificial neural networks, DNN) around[5], and has been developed with a focus on machine learning from the ground up.
There are several higher-level user interfaces built on top of Theano (for DNN: Keras, Lasagne, Blocks, and others; for probabilistic programming in Python: PyMC3).
I’ll try to get them into Debian, too.

helper scripts

Both binary packages ship three convenience scripts, theano-cache, theano-test, and theano-nose.
Instead of being copied into /usr/bin, which would result in a binaries-have-conflict violation between the two binary packages, the scripts are to be found in /usr/share/python-theano (python3-theano respectively), so that both module packages of Theano can be installed at the same time.

The scripts can be run directly from these folders, e.g. $ python /usr/share/python-theano/theano-nose.
If you’re going to use them heavily, you can add the directory of the flavour you prefer (Python 2 or Python 3) to the $PATH environment variable manually, either by typing e.g. $ export PATH=/usr/share/python-theano:$PATH at the prompt, or by saving that line into ~/.bashrc.

Manpages aren’t available for these little helper scripts[6], but you can always get info on what they do and which arguments they accept by invoking them with the -h flag (for theano-nose) or the help argument (for theano-cache).

running the tests

On some occasions you might want to run the test suite of the installed library, for example to check whether everything runs fine on your GPU hardware.
There are two different ways to run the tests (either way you need to have python{,3}-nose installed).
First, you can launch the test suite with $ python -c 'import theano; theano.test()' (or the same with python3 to test the other flavour), which is the same as what the helper script theano-test does.
However, when run that way, some tests that belong to the group of known failures might also be reported as errors.

Known failures aren’t counted as errors if you run the tests with theano-nose, which is a wrapper around nosetests, so this is probably always the better choice.
You can run this convenience script with the option --theano on the installed library, or from the source package root, which you can fetch with $ apt-get source theano (there you also have the option to use bin/theano-nose).
The script accepts the options of nosetests, so you might run it with -v to increase verbosity.

For the tests the configuration switch config.device must be set to cpu.
The GPU tests are still included when a suitable, accessible device is detected, so the name is a little misleading: it doesn’t mean “run everything on the CPU”.
If you’ve set config.device to gpu in your ~/.theanorc, you’re on the safe side if you always run it like this: $ THEANO_FLAGS=device=cpu theano-nose.
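
If you drive the tests from a script, the same override can be put together in Python. The following is a minimal sketch, assuming only that THEANO_FLAGS is a comma-separated list of key=value pairs as in the command lines in this post; the helper names theano_flags and run_with_flags are made up for this example and aren’t part of Theano.

```python
import os
import subprocess


def theano_flags(**settings):
    """Join settings into the comma-separated key=value form of THEANO_FLAGS."""
    return ",".join("%s=%s" % (key, value) for key, value in sorted(settings.items()))


def run_with_flags(command, **settings):
    """Run a command with THEANO_FLAGS set for that one process only."""
    env = dict(os.environ, THEANO_FLAGS=theano_flags(**settings))
    return subprocess.call(command, env=env)


print(theano_flags(device="cpu"))
# On a machine with the package installed (not executed here):
# run_with_flags(["python", "/usr/share/python-theano/theano-nose", "--theano"],
#                device="cpu")
```

Because the variable is only set in the environment of the child process, the settings in ~/.theanorc stay untouched for everything else.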

Depending on the available hardware and the BLAS implementation used (see below), it can take quite a long time to run the whole test suite: on the Core i5 in my laptop it takes around an hour, even without the GPU-related tests (which run pretty fast, though).
Theano features a couple of switches to manipulate the default configuration for optimization and compilation, however.
There is a trade-off between the cost of optimization and compilation and the run time of the test suite, and it turned out that the test suite runs quicker with less graph optimization.
config.optimizer can be set to two different values here: fast_run enables maximal optimization, while fast_compile runs only a minimal set of graph optimizations.
These settings are used by the general mode switch config.mode, which is either FAST_RUN (the default) or FAST_COMPILE.
The default mode FAST_RUN (optimizer=fast_run, linker=cvm) needs around 72 minutes on my lower mid-level machine (with un-optimized BLAS).
Setting mode=FAST_COMPILE (optimizer=fast_compile, linker=py) speeds the test suite up: the whole suite runs in 46 minutes.
The downside is that C code compilation is disabled in this mode because of the linker py, and the GPU-related tests aren’t included either.
I’ve played around with combining the optimizer fast_compile with some of the other linkers (c|py and cvm, and their versions without garbage collection) as an alternative to FAST_COMPILE, to get minimal optimization together with machine code compilation and GPU testing.
But in my experience, fast_compile with any linker other than py results in new errors and failures in some tests on amd64, and this might also be the case on other architectures.

By the way, another useful feature is DebugMode for config.mode, which verifies the correctness of all optimizations and compares the C to Python results.
If you want detailed info on the configuration settings of Theano, run $ python -c 'import theano; print theano.config' | less, and check out the chapter config in the library documentation.

cache maintenance

Theano isn’t a JIT (just-in-time) compiler like Numba, which generates native machine code in memory and executes it immediately; instead it saves the generated native machine code into so-called “compiledirs”.
The reason for doing it that way is quite practical: as the docs explain, the persistent cache on disk makes it possible to avoid generating code for the same operation again, and to avoid compiling again when different operations generate the same code.
By default the compiledirs are located within $HOME/.theano/.

After some time using Theano the folder becomes quite large, and might look something like this:

If the Python version in use has changed, like in this example, you might want to purge the obsolete cache.
For working with the cache and the compiledirs, the helper theano-cache comes in handy.
If you invoke it without any arguments, it prints the current cache location, like ~/.theano/compiledir_Linux-4.5--amd64-x86_64-with-debian-stretch-sid--2.7.12-64 (the script is run from /usr/share/python-theano).
So the compiledirs for the old Python versions in this example (11+ and 12rc1) can be removed to free the space they occupy.
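
To see which compiledirs take up how much space before removing anything, a few lines of Python are enough. This is only a sketch: it assumes the default base directory ~/.theano and the compiledir_ name prefix seen in the path above, and the function names compiledir_sizes and obsolete_compiledirs are invented for this example. It just reports and doesn’t delete anything.

```python
import os


def compiledir_sizes(base):
    """Return {directory name: size in bytes} for each compiledir_* entry under base."""
    sizes = {}
    for name in sorted(os.listdir(base)):
        path = os.path.join(base, name)
        if not (name.startswith("compiledir_") and os.path.isdir(path)):
            continue
        total = 0
        for root, _dirs, files in os.walk(path):
            for filename in files:
                total += os.path.getsize(os.path.join(root, filename))
        sizes[name] = total
    return sizes


def obsolete_compiledirs(base, current):
    """Names of all compiledirs under base except the one currently in use."""
    return sorted(name for name in compiledir_sizes(base) if name != current)


base = os.path.expanduser("~/.theano")  # assumed default location
if os.path.isdir(base):
    for name, size in compiledir_sizes(base).items():
        print("%8.1f MB  %s" % (size / 1e6, name))
```

The name of the current compiledir to pass as the second argument of obsolete_compiledirs is the last path component of what theano-cache prints.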

All compiledirs, meaning the whole cache, can be erased with $ theano-cache basecompiledir purge; the effect is the same as running $ rm -rf ~/.theano.
You might want to do that e.g. if you’ve changed your hardware, like when you got yourself another graphics card.
Or routinely from time to time, when the compiledirs have filled up so much that processing slows down because the hard disk is busy all the time (if you don’t have an SSD available).
For example, the build chroots carrying (mainly) the tests completely compiled through on default Python 2 and Python 3 consume around 1.3 GB of disk space (see here).

BLAS implementations

Theano needs a level 3 implementation of BLAS (Basic Linear Algebra Subprograms) for operations between vectors (one-dimensional mathematical objects) and matrices (two-dimensional objects) carried out on the CPU.
NumPy is already built on BLAS and pulls in the standard implementation (libblas3, source package: lapack), but Theano links directly to the BLAS library instead of using NumPy as an intermediate layer, to reduce the computational overhead.
For this Theano needs development headers, and the binary packages pull in libblas-dev by default unless a development package of another BLAS implementation (like OpenBLAS or ATLAS, which provide the virtual package libblas.so) is already installed or gets pulled in with them.
The linker flags could be manipulated directly through the configuration switch config.blas.ldflags, which is by default set to -L/usr/lib -lblas -lblas.
By the way, if you set it to an empty value, Theano falls back to using BLAS through NumPy, if you want to have that for some reason.

On Debian, there is a very convenient way to switch between BLAS implementations through the alternatives mechanism.
If you have several alternative implementations installed at the same time, you can switch from one to another easily by just doing:

The implementations perform differently on different hardware, so you might want to take the time to compare which one does best on your processor (the other packages are libatlas-base-dev and libopenblas-dev), and choose that one to optimize your system.
If you want to squeeze everything out of the hardware you’ve paid for when carrying out Theano’s computations on the CPU, another option is to compile an optimized version of a BLAS library especially for your processor.
I’m going to write another blog post on this issue.

The binary packages of Theano ship the script check_blas.py to check how well a BLAS implementation performs with it, and whether everything works right.
That script is located in the misc subfolder of the library; you can locate it with $ dpkg -L python-theano | grep check_blas (or for the package python3-theano accordingly), and run it with the Python interpreter.
By default the script prints a lot of info: a huge performance comparison reference table, the current setting of blas.ldflags, the compiledir, the setting of floatX, OS information, the GCC version, the current NumPy config regarding BLAS, the NumPy location and version, whether Theano linked directly or used the NumPy binding, and finally and most importantly, the execution time.
If only the execution time is needed for quick performance comparisons, the script can be invoked with the option -q.

Theano on CUDA

The function compiler of Theano works with alternative backends to carry out the computations, like the ones for graphics and GPGPU cards.
Currently, there are two different backends for GPU processing available: one docks onto NVIDIA’s CUDA (Compute Unified Device Architecture) technology[7], and the other one is based on libgpuarray, which is developed in parallel by the Theano developers.

The libgpuarray library is an interesting alternative for Theano: it’s a GPU tensor (multi-dimensional mathematical object) array library written in C with Python bindings based on Cython, which has the advantage of also running on OpenCL[8].
OpenCL, unlike CUDA[9], is fully free software and vendor neutral, and it overcomes the limitation of the CUDA toolkit being available only for amd64 and the ppc64el port (see here).
I’ve opened an ITP on libgpuarray and we’ll see if and how this works out.
Another reason why it would be great to have it available is that CUDA currently seems to run into problems with GCC 6[10].
More on that soon.

Here’s a little checklist for setting up your CUDA device so that you don’t have to experience something like this:

$ THEANO_FLAGS=device=gpu,floatX=float32 python ./cat_dog_classifier.py
WARNING (theano.sandbox.cuda): CUDA is installed, but device gpu is not available (error: Unable to get the number of gpus available: no CUDA-capable device is detected)

hardware check

For running Theano on CUDA you need an NVIDIA graphics card which is capable of doing that.
You can check whether your device is supported by CUDA here.
As long as the hardware isn’t too old (CUDA support started with the GeForce 8 and Quadro X series) or too exotic, I think it fails to work only in exceptional cases.
You can check your model, and whether the device is present in the system at the bare hardware level, by doing this:

If no line like this gets returned, your device is most probably broken or not properly connected (ouch).
If rev ff appears at the end of the line, the device is off, meaning powered down.
This might happen if you have a laptop with Optimus graphics hardware and the related drivers have switched off the idle device to save energy[11].
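
The rev ff check can also be scripted. The sketch below assumes the usual one-line-per-device format of lspci output, with the revision in parentheses at the end of the line; the function name nvidia_devices and the two sample lines are hypothetical and only serve as an illustration.

```python
import re


def nvidia_devices(lspci_output):
    """Return (line, powered_down) pairs for the NVIDIA devices in lspci output."""
    devices = []
    for line in lspci_output.splitlines():
        if "nvidia" not in line.lower():
            continue
        # The revision is expected as "(rev xx)" at the very end of the line.
        match = re.search(r"\(rev ([0-9a-f]{2})\)\s*$", line)
        powered_down = bool(match) and match.group(1) == "ff"
        devices.append((line.strip(), powered_down))
    return devices


# Hypothetical sample lines, for illustration only.
sample = """00:02.0 VGA compatible controller: Intel Corporation Example Graphics (rev 09)
01:00.0 3D controller: NVIDIA Corporation Example GPU (rev ff)"""

for line, powered_down in nvidia_devices(sample):
    state = "powered down" if powered_down else "powered up"
    print("%s | %s" % (state, line))
```

To use it on a real system, feed it the text printed by lspci instead of the sample string.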

kernel module

Running CUDA applications requires the proprietary NVIDIA driver kernel module to be loaded into the kernel and working.

If you haven’t already installed it for another purpose: the NVIDIA driver and the CUDA toolkit are both in the non-free (non-DFSG-free) section of the Debian archive, which is not enabled by default.
To get non-free packages you have to add non-free (and preferably also contrib) to your package source in /etc/apt/sources.list, which might then look like this:

deb http://httpredir.debian.org/debian/ testing main contrib non-free

After doing that, run $ sudo apt-get update to update the package lists, and there you go with the non-(DFSG-)free packages.

The headers of the running kernel are needed to compile modules; you can get them together with the NVIDIA kernel module package by running:

DKMS will then build the NVIDIA module for all the kernels on the system.
The module then can be loaded into the running kernel with $ sudo modprobe nvidia-current.
If you want to load the kernel driver at boot time, add nvidia-current to /etc/modules.

A quick working check could be performed with nvidia-smi (package: nvidia-smi):

troubleshooting

If you have problems with the CUDA device, it’s advisable to verify that the following things concerning the NVIDIA driver and its kernel module are in order:

blacklist Nouveau

Check whether the default Nouveau kernel module driver (which blocks the NVIDIA module) still gets loaded for some reason by running $ lsmod | grep nouveau.
If nothing gets returned, everything is all right.
If it’s still in the kernel, just add blacklist nouveau to /etc/modprobe.d/blacklist.conf, and update the boot ramdisk with $ sudo update-initramfs -u afterwards.
Then reboot once more; after that it shouldn’t be loaded anymore.
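
The same check can be done from Python by reading /proc/modules, which is where lsmod gets its information from. This is a rough sketch: the helper names loaded_modules and driver_status are invented here, and matching any module whose name starts with nvidia is an assumption about how the driver module is named on your system.

```python
import os


def loaded_modules(proc_modules_text):
    """Return the set of module names from text in /proc/modules format."""
    # The module name is the first whitespace-separated field of each line.
    return {line.split()[0] for line in proc_modules_text.splitlines() if line.strip()}


def driver_status(modules):
    """Report whether Nouveau and any module named nvidia* are loaded."""
    return {
        "nouveau": "nouveau" in modules,
        "nvidia": any(name.startswith("nvidia") for name in modules),
    }


if os.path.exists("/proc/modules"):
    with open("/proc/modules") as handle:
        print(driver_status(loaded_modules(handle.read())))
```

For running CUDA you want to see nouveau reported as False and nvidia as True.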

rebuild kernel module

If the module hasn’t been compiled properly for some reason, you can trigger a rebuild of the NVIDIA kernel module with $ sudo dpkg-reconfigure nvidia-kernel-dkms.
When you’re about to send your hardware in for repair because everything looks all right but the device just isn’t working, that really can help (own experience).

After the rebuild of the module or modules (if you have a few kernel packages installed) has completed, you could recheck if the module really is available by running:

The kernel module can also be loaded and reloaded without superuser privileges with nvidia-modprobe (that tool is from the package nvidia-modprobe).

unsupported graphics card

Make sure that your graphics card is supported by the current driver kernel module.
If you have bought the latest hardware, that can quite possibly turn out to be a problem.
You can get the version of the current NVIDIA driver with:

Then google the version number, like nvidia 352.79; this should get you onto an official driver download page like this.
There, check what’s listed under “Supported Products”.

If you’re stuck with that there are two options: wait until the driver in Debian gets updated, or replace it with the latest driver package from NVIDIA.
That’s possible to do, but it’s something more for experienced users.

occupied graphics card

The CUDA driver cannot work while the graphics device is busy, e.g. with rendering the graphical display of your X.Org server.
Which kernel driver is actually used to render the desktop can be examined with this command[12]:

This example shows that the rendering of the desktop is performed by the graphics device of the Intel CPU, which is exactly what’s needed for running CUDA applications on your NVIDIA graphics card if you don’t have another one.

nvidia-cuda-toolkit

With the Debian package of the CUDA toolkit (nvidia-cuda-toolkit) everything pretty much runs out of the box for Theano.
Just install it with apt-get and you’re ready to go; the CUDA backend is the default one.
PyCUDA is also a suggested dependency of the binary packages; it can be pulled in together with the CUDA toolkit.

A quick check whether your CUDA device works correctly can be done with this tool.
You have to compile it with the CUDA compiler by running nvcc -o cuda_check cuda_check.c -lcuda, and there you go[13]:

The up-to-date CUDA release 7.5 is currently available; with that you have Maxwell architecture support, so that you can run Theano on e.g. a GeForce GTX Titan X with 6.2 TFLOPS at single precision[14] at an affordable price.
CUDA 8[15] is around the corner with support for the new Pascal architecture[16].
For example, the GeForce GTX 1080 high-end gaming graphics card already has 8.23 TFLOPS[17], and the new Pascal Titan X (GP102 GPU) has 11 TFLOPS at single precision.
When it comes to professional GPGPU hardware like the Tesla P100, there is much more computational power available, scalable by multiplying cores and cards up to genuine little supercomputers which fit on a desk, like the DGX-1[18].
Theano can use multiple GPUs for calculations to work with highly scaled hardware; I’ll write another blog post on this issue.

Theano on the GPU

Only single precision floating point numbers (float32) are supported on the GPU, but that is sufficient for deep learning applications.
Theano uses double precision floats (float64) by default, so you have to set the configuration variable config.floatX to float32, as shown above, either with the THEANO_FLAGS environment variable or, better, in your .theanorc file if you’re going to use the GPU a lot.

The actual switch to the GPU happens with the config.device configuration variable, which must be set to either gpu, or to gpu0, gpu1 etc. to choose a particular one if multiple devices are available.
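
For the permanent variant, the two settings go into the [global] section of ~/.theanorc, which is an INI-style file. The following sketch generates such a fragment with Python 3’s configparser; the function name theanorc_text is made up for this example, and the code only prints the text instead of writing to your home directory.

```python
import configparser
import io


def theanorc_text(device="gpu", floatX="float32"):
    """Build the text of a minimal .theanorc with a [global] section."""
    parser = configparser.ConfigParser()
    parser.optionxform = str  # keep the capital X in floatX
    parser["global"] = {"device": device, "floatX": floatX}
    buffer = io.StringIO()
    parser.write(buffer)
    return buffer.getvalue()


print(theanorc_text())
```

To pick a particular card, pass e.g. device="gpu1"; the printed lines can then be copied into ~/.theanorc by hand.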

Here’s a little test script check1.py; it’s taken from the docs and slightly altered.
You can run that script with either python or python3 (there was a single test failure on the Python 3 package, so the Python 2 library might currently be a little more stable).
For comparison, here’s an example of how it performs on my hardware, once on the CPU and once on the GPU:

If you got a result like this you’re ready to go with Theano on Debian, training computer vision classifiers for your gladiator drone or whatever you want to do with it.
The MNIST example of Lasagne can be used for a quick check whether the whole library stack works properly[19]:

[1] Some ports are disabled because they are currently not supported by Theano. There are NotImplementedErrors and other errors in the tests because of the numpy.ndarray object not being aligned. The developers commented on that, see here. And on some ports the build flags -m32 and -m64 of Theano aren’t supported by g++, and the build flags can’t be manipulated easily.

[11] If Optimus (hybrid) graphics hardware is present (as is common today on PC laptops), Debian launches the X server on the graphics processing unit of the CPU, which is ideal for CUDA. The problem with Optimus actually is graphics processing on the dedicated GPU. If you are using Bumblebee, the Python interpreter which you want to run Theano on has to be started with the launcher optirun (primusrun doesn’t work!), because Bumblebee powers the GPU down with the tool bbswitch every time it isn’t used, and I think the kernel module of the driver is also loaded dynamically.