I am going to use the same approach highlighted in the previous post: use the CUDA 6.5 runtime and cuDNN v2, but compile the code with the newer 7.0 compiler.

Install the 7.0.76 compiler:

Before starting, you will need to download the new compiler. NVIDIA does not make it easy to find the link (they would like you to use JetPack, but I don't like to reformat a working system unless absolutely needed), but you can download the .deb package directly on your Jetson with:

At this point we need to restore the standard 6.5 toolchain as the default (we just want the 7.0 compiler to generate the object files), since the current driver on the Jetson TK1 will only work with the 6.5 runtime. Go to the /usr/local directory, remove the cuda symlink pointing to cuda-7.0, and create a new one for 6.5:

ubuntu@tegra-ubuntu:/usr/local$ sudo rm cuda

ubuntu@tegra-ubuntu:/usr/local$ sudo ln -s cuda-6.5/ cuda

You should see this output:

ubuntu@tegra-ubuntu:~$ nvcc -V

nvcc: NVIDIA (R) Cuda compiler driver

Copyright (c) 2005-2014 NVIDIA Corporation

Built on Fri_Dec_12_11:12:07_CST_2014

Cuda compilation tools, release 6.5, V6.5.35

ubuntu@tegra-ubuntu:~$ /usr/local/cuda-7.0/bin/nvcc -V

nvcc: NVIDIA (R) Cuda compiler driver

Copyright (c) 2005-2015 NVIDIA Corporation

Built on Mon_Feb_22_15:38:26_CST_2016

Cuda compilation tools, release 7.0, V7.0.74

Install protobuf and Bazel:

For protobuf you can follow the instructions from the previous blog post (the only change is the location of protobuf-java-3.0.0-beta-x.jar, now in the java/core/target subdirectory).

The procedure for Bazel is also similar; the only change required is the version. TF 0.8 requires Bazel 0.1.4, so after cloning Bazel you will need to check out the proper tag:

$ git clone https://github.com/bazelbuild/bazel.git

$ cd bazel

$ git checkout tags/0.1.4

Install TensorFlow 0.8:

The first thing to do is to check out the source code and select the proper version:

The new memory allocator is going to cause a floating point exception unless you change the following code:

if (kCudaHostMemoryUseBFC) {
    allocator =
#ifdef __arm__
        new BFCAllocator(new CUDAHostAllocator(se), 1LL << 31,
                         true /*allow_growth*/, "cuda_host_bfc" /*name*/);
#else
        new BFCAllocator(new CUDAHostAllocator(se), 1LL << 36 /*64GB max*/,
                         true /*allow_growth*/, "cuda_host_bfc" /*name*/);
#endif
} else {
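The change caps the pinned host memory pool on ARM: the default 1LL << 36 (64 GB) upper bound is presumably too large for a 32-bit platform, while 1LL << 31 (2 GB) still fits. A quick check of the two constants (bash arithmetic is 64-bit):

```shell
# The two BFC pool bounds from the snippet above
echo $(( 1 << 36 ))   # 68719476736 bytes = 64 GB, the non-ARM default
echo $(( 1 << 31 ))   # 2147483648 bytes = 2 GB, used on __arm__
```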

We are now ready to build. The only thing left to do is to remove the check that disables the use of variadic templates in Eigen. I have not found a clean way to do it (someone with better Bazel skills may have a better idea). My solution is to start the build and then wait for the first failure:

Friday, November 27, 2015

Google recently released TensorFlow, an open source software library for numerical computation using data flow graphs.

TensorFlow has a GPU backend built on CUDA, so I wanted to install it on a Jetson TK1. Even though the system did not meet the requirements (CUDA 7.0 is not available and the GPU has compute capability 3.2), I decided to give it a try anyway. This post documents all the steps required to build TensorFlow from source; it is quite challenging, but it can be done. Including all the prerequisites, the whole build will take several hours. (If you just want to try TensorFlow, you can download the wheel file I generated and do a pip install. The file is at https://drive.google.com/file/d/0B1uGKNpQ7xNqZ2pvSmc3SlZJS2c/view?usp=sharing .)

TensorFlow is under active development and the code uses a lot of advanced C++ features that really push the compiler. These instructions worked with the version available on 11/26, but newer snapshots may require additional changes.

The first challenge is to build Bazel, another piece of software developed at Google, used as the build system for TensorFlow. Bazel requires a protobuf version newer than the one present in the Ubuntu 14.04 repos, so the first step will be to install protobuf 3 from source, since there are no prebuilt binaries for ARM32.

Java 8:

The first step is to install Java 8; this is quite simple since the webupd8team PPA provides an installer package for Oracle Java:

$ sudo add-apt-repository ppa:webupd8team/java

$ sudo apt-get update

$ sudo apt-get install oracle-java8-installer

Protobuf:

In order to build protobuf and Bazel, we will need several other packages. The exact list will depend on the state of your Jetson, but you will need at least these:

At the end of the compilation, the bazel binary will be in the output directory. You can add this directory to your PATH or copy the binary to /usr/local/bin.

TensorFlow:

We are now ready to tackle the TensorFlow build for GPU. Just be sure to have CUDA 6.5 and the matching cuDNN release installed on your Jetson TK1.

You will also need some files from the CUDA 7.0 package (cuda-repo-l4t-r23.1-7-0-local_7.0-71_armhf.deb) that you can download from the NVIDIA web site (it is the one for the Jetson TX1).

While the Jetson TK1 cannot run the 7.0 runtime, since the driver shipped with the system does not support it, it is still possible to run the CUDA 7.0 compiler. We need the 7.0 compiler because some of the TensorFlow source files generate an internal compiler error with the 6.5 nvcc.

All the libraries and runtime will be the standard 6.5 ones.

On my system I have also enabled some swap space: you can plug in a USB memory stick, create a swap file, and enable it with:

TensorFlow expects a 64-bit system and has a number of library paths and library names hard-coded in its files.

Before starting the installation, we will need to modify several files, changing all the references from lib64 to lib and all the 7.0 library names to 6.5. We can find all the files containing these strings with grep and apply the changes with these commands:
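As a sketch, a small helper of the kind I mean (the directory and the exact substitutions are assumptions; adjust them to whatever your grep turns up):

```shell
# Hypothetical helper: rewrite lib64 -> lib and 7.0 -> 6.5 in every file
# under the given directory that mentions either string.
patch_cuda_refs() {
  grep -rl -e 'lib64' -e '7\.0' "$1" 2>/dev/null | while read -r f; do
    sed -i -e 's/lib64/lib/g' -e 's/7\.0/6.5/g' "$f"
  done
}

# e.g.: patch_cuda_refs third_party/gpus/cuda
```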

Wednesday, September 11, 2013

The latest MATLAB versions, starting from 2010b, have a very cool feature that enables calling CUDA C kernels from MATLAB code.
This is much better and simpler than writing MEX files to call CUDA code (being the original author of the first CUDA MEX files and of the NVIDIA white-paper, I am speaking from experience), and it is a very powerful tool.

Let's take a very simple CUDA C code, add.cu, that adds a scalar to a vector:

To generate the PTX file, instead of invoking nvcc we will call pgf90 with the right flags:

pgf90 -c -Mcuda=keepptx,cc20 addf.cuf

The keepptx flag will generate the PTX file for compute capability 2.0, addf.n001.ptx.
If the compute capability is missing, or if you specify multiple targets, the PGI compiler will generate several PTX files; the numbering is just an enumeration, so you will need to inspect the .ptx files to check their compute capabilities. We can perform this step from an OS shell or from inside MATLAB.
In order to invoke the compiler from the MATLAB prompt, we need to load the proper bash variables by issuing the command:

setenv('BASH_ENV','~/.bash_profile');

and then invoking pgf90 preceded by an exclamation point, which indicates that the rest of the input line is issued as a command to the operating system.

!pgf90 -c -Mcuda=keepptx,cc20 addf.cuf

In order to load the PTX file in MATLAB, we need to slightly change the syntax.

When loading a PTX file generated from CUDA C, we were passing both the PTX file name and the original CUDA C file; in this way MATLAB automatically discovers the prototype of the function. There is an alternative form in which we explicitly pass the prototype signature to parallel.gpu.CUDAKernel.

This is what we need to load the PTX file generated from CUDA Fortran.

The entry point is now sumgpu_sum_, even though the subroutine was named sum. This is a consequence of being embedded in a module.

When the CUDA Fortran compiler generates the PTX file, it renames the subroutine entry point to a concatenation of the module name, the subroutine name, and a trailing underscore.

While this is not important when the module contains a single subroutine, it is crucial when multiple entry points are defined. If the module had contained multiple subroutines, we would have received an error when trying to load the PTX file: