As promised, here is a tutorial on configuring and running a 20-node CUDA 5 multi-GPU cluster on Amazon's AWS cloud infrastructure. The trick to avoiding the $2.10 × 20 = $42/hour on-demand cost is to use Spot Instances together with the awesome StarCluster Python package, which takes the pain out of creating clusters on AWS. For the purposes of this post we will stick to just 2 nodes, and will point out the place where you can easily add more nodes, all the way up to 20. So let's get started!

Prerequisites

The first thing we need to do is install StarCluster and configure our Amazon AWS credentials and keys. On my 64-bit Mac OS X, I had to install pycrypto first with the following commands (you may need to sudo):
➜ export ARCHFLAGS='-arch x86_64'
➜ easy_install pycrypto
...
➜ easy_install starcluster
...

Once installed, we run it with the help command; since no config file exists yet, StarCluster offers to write a template config, which we select by pressing 2:
➜ starcluster help
StarCluster - (http://web.mit.edu/starcluster) (v. 0.9999)
Software Tools for Academics and Researchers (STAR)
Please submit bug reports to starcluster@mit.edu

Now would be a good time to create a key via:
➜ starcluster createkey cuda -o ~/.ssh/cuda.rsa
...
>>> keypair written to /Users/kashif/.ssh/cuda.rsa

and add its location to ~/.starcluster/config under the [key cuda] section.
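For example, with the key we just created:

[key cuda]
KEY_LOCATION = ~/.ssh/cuda.rsa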

It's also a good idea to create a ~/.aws-credentials-master file and fill it in with the same information, so that we can use the Amazon command line tools as well:
➜ cat ~/.aws-credentials-master
# Enter the AWS Keys without the < or >
# You can either use the AWS Accounts access keys and they can be found at
# http://aws.amazon.com under Account->Security Credentials
# or you can use the access keys of a user created with IAM
AWSAccessKeyId=blahblah
AWSSecretKey=blahblahblah

Basic Idea

What we are going to do is take an official StarCluster HVM AMI, update it, and create a new EBS-backed AMI from it. Then we will use this new AMI to run the cluster. The updated AMI will then have the latest CUDA 5 as well as other goodies.

Customizing an Image Host

We first launch a new single-node cluster called imagehost as a spot instance, based off an existing StarCluster AMI, on a GPU-enabled instance. We need to choose an AMI (machine image) which supports HVM so that we have access to the GPU. We can list all the StarCluster AMIs via:
➜ starcluster listpublic
StarCluster - (http://web.mit.edu/starcluster) (v. 0.93.3)
Software Tools for Academics and Researchers (STAR)
Please submit bug reports to starcluster@mit.edu
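With a suitable HVM AMI id in hand, launching the single-node image host as a spot instance looks something like this (the -o flag launches without configuring the node, the AMI id here is a placeholder, and the bid is yours to choose):

➜ starcluster start -o -s 1 -b x.xx -i cg1.4xlarge -n ami-xxxxxxx imagehost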

The last thing we need to do is to ensure that the device files /dev/nvidia* exist and have the correct file permissions. This can be done with a small startup script, e.g. (a minimal sketch; running nvidia-smi as root creates the device nodes):
$ cat /etc/init.d/nvidia
#!/bin/bash
PATH=/sbin:/bin:/usr/bin:$PATH
# Loading the driver and running nvidia-smi creates the /dev/nvidia* nodes
/sbin/modprobe nvidia && nvidia-smi > /dev/null 2>&1
chmod 0666 /dev/nvidia*

Cluster Template

Now we can set up the cluster template in the StarCluster config file, using the AMI we just created. Here ami-9f6ed8f6 is our new AMI, and the small cluster template looks like this:
...
[cluster smallcluster]
KEYNAME = cuda
CLUSTER_SIZE = 2
CLUSTER_USER = sgeadmin
CLUSTER_SHELL = bash
NODE_IMAGE_ID = ami-9f6ed8f6
NODE_INSTANCE_TYPE = cg1.4xlarge
SPOT_BID = x.xx

It's important to set SPOT_BID = x.xx (your bid in dollars), otherwise regular on-demand instances will be launched at the full price, which is not what we want. Also, to run a bigger cluster, just replace CLUSTER_SIZE = 2 with the number of nodes you need, all the way up to 20.

Finally in the [global] section of the config file we need to tell StarCluster to use this template:
[global]
DEFAULT_TEMPLATE=smallcluster
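With the template in place, bringing the cluster up and logging into the master node is simply:

➜ starcluster start smallcluster
...
➜ starcluster sshmaster smallcluster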

This post will take you through starting and configuring an Amazon EC2 instance to use the multi-GPU features of CUDA 4.0.

Motivation

CUDA 4.0 comes with some exciting new features, such as:

- the ability to share GPUs across multiple threads;
- the ability to use all GPUs in the system concurrently from a single host thread;
- unified virtual addressing for faster multi-GPU programming;

and many more.

The ability to access all the GPUs in a system is particularly nice on Amazon, since the large GPU-enabled instances come with two Tesla M2050 Fermi boards, each capable of 1030 GFlops theoretical peak (single precision) performance, with 448 cores and 3 GB of memory.

Getting started

Signing up for Amazon's AWS is easy enough with a credit card, and once you are logged in, go to the EC2 tab of your console, which should look something like this:

The EC2 console page.

Now press the Launch Instance button, and in the Community AMIs tab set the Viewing option to Amazon Images, search for gpu, select the CentOS 5.5 GPU HVM AMI, and press Continue:

Choose the CentOS 5.5 GPU HVM AMI (bottom one).

Next we need to select the Instance Type; it's important here to select the Cluster GPU type. Then press Continue:

Select the Cluster GPU Instance Type.

Next we need to Create a New Key Pair by giving it a name like amazon-gpu, and then press Create & Download your Key Pair to save it to your local computer as a file called amazon-gpu.pem:

Create and download Key Pair.

We press Continue to go to the Firewall settings. Here we Create a new Security Group, give it a name and description, and then Create a new rule for ssh so that we can log into our instance once it's up and running. Then press Continue:

Create a new Security Group and a new ssh rule.

And finally we can review our settings and Launch it:

Review and Launch instance.

Back in our EC2 console we can go to our Instances and check our new instance's Status. It should be booting or running, rather than stopped as in the screenshot below:

AMI Instance's Status and Description.

The Description tab will also contain the Public DNS which we can use together with the Key Pair we downloaded locally to ssh into our instance:
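Something like this, using the Public DNS shown in the console (the CentOS AMI of this era logs you in as root; the hostname below is a placeholder):

➜ chmod 400 amazon-gpu.pem
➜ ssh -i amazon-gpu.pem root@ec2-xxx-xxx-xxx-xxx.compute-1.amazonaws.com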

I leave it as an exercise to figure out how to reboot the instance from the console, but once it's back up and running, we can ssh back in to download and install the CUDA 4.0 driver, toolkit, and SDK. For example:
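The installer names below are the CUDA 4.0 packages for 64-bit RHEL/CentOS 5.5; double-check the exact URLs on NVIDIA's CUDA download page before running this:

$ wget http://developer.download.nvidia.com/compute/cuda/4_0/drivers/devdriver_4.0_linux_64_270.41.19.run
$ wget http://developer.download.nvidia.com/compute/cuda/4_0/toolkit/cudatoolkit_4.0.17_linux_64_rhel5.5.run
$ wget http://developer.download.nvidia.com/compute/cuda/4_0/sdk/gpucomputingsdk_4.0.17_linux.run
$ sh devdriver_4.0_linux_64_270.41.19.run
$ sh cudatoolkit_4.0.17_linux_64_rhel5.5.run
$ sh gpucomputingsdk_4.0.17_linux.run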

After that, we can go to the NVIDIA_GPU_Computing_SDK/C/ folder and type make. The binaries will be installed in the NVIDIA_GPU_Computing_SDK/C/bin/linux/release/ directory and if we go there, we can run the simpleMultiGPU example:
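Concretely, assuming the SDK was installed in the default location under your home directory:

$ cd ~/NVIDIA_GPU_Computing_SDK/C
$ make
$ cd bin/linux/release
$ ./simpleMultiGPU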

MultiGPU Cluster Setup

Using the above setup and this video, it is also possible to configure an 8-node cluster of GPU instances, as described here, for high-performance computing applications. I will try to do a multi-GPU and Open MPI example in another blog post, so stay tuned.

We will be speaking at this year's GPU Technology Conference in San Jose, which runs from Sept. 30 to Oct. 2, about using CUDA within Mathematica. The slides are almost ready and we are just organizing some logistics, so I thought we might write a bit about the talk in order to get some initial feedback on the content.

The talk is divided into three parts. Initially we introduce the structure of Mathematica, in particular its MathLink API, and go into the basic idea of creating a simple C++ application which we can call from Mathematica. Then we discuss the API in a bit more detail, especially receiving and sending arrays to and from Mathematica. It's here that we also discuss how to receive and send complex numbers, which is handy when doing FFTs for example. We then briefly discuss running MathLink applications on remote computers, which is especially useful if you share your CUDA-enabled computer with others. Finally we go through some basic error and interruption handling in the MathLink API.
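As a small taste of that part: a complex number travels over the link as a Complex[re, im] expression, so sending one from C boils down to something like this (a minimal sketch):

/* Send a complex number to Mathematica as Complex[re, im]. */
void putComplex(double re, double im)
{
    MLPutFunction(stdlink, "Complex", 2);
    MLPutReal(stdlink, re);
    MLPutReal(stdlink, im);
}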

The second part then concentrates on the CUDA aspect of the MathLink application, which is in some sense the whole philosophy of the talk: if we create a CUDA application that can send and receive data to and from Mathematica via the MathLink API, then we are done! In particular we give an overview of a simple example using the mathematica_cuda plugin, which lets you do just this. For a more universal solution, one that also works under Windows, there is the excellent CMake module FindCUDA together with my FindMathLink module, which I wrote about previously. We then finish this part by going through a complete example, an FFT via CUFFT, and show how one goes about getting it working in Mathematica.

The last part, time permitting, is where we show some of the work we have been doing with sending computations to the GPU from Mathematica. In particular I will show some of the work I have been doing with image deconvolution of confocal and wide-field images: I am using the GPU to run my deconvolution experiments and Mathematica to read in the images and analyze the results. Shoaib will present his work on calculating vegetation indices in multi- and hyper-spectral satellite images.

I hope you find this overview helpful. We will put the slides up here when the tutorial is over, and if you plan to attend the conference it would be great to see you and get your feedback. Also if there is something specific you would like us to cover, you still have a few days to let us know.

In order to get a more universal solution for my mathematica_cuda plugin, one that works on Windows as well as on Mac and Linux, I decided to use CMake, which comes with the excellent FindCUDA module, together with a FindMathLink module that would offer the same functionality as the current mathematica_cuda plugin, and more.
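To give an idea of where this ends up, a CMakeLists.txt using the two modules looks roughly like the following (the MATHLINK_* variable names depend on the module version, so treat them as illustrative):

cmake_minimum_required(VERSION 2.6)
project(mathematica_cuda)

# FindMathLink.cmake lives in the project's own module directory
set(CMAKE_MODULE_PATH ${CMAKE_SOURCE_DIR}/CMake ${CMAKE_MODULE_PATH})
find_package(CUDA REQUIRED)
find_package(MathLink REQUIRED)

include_directories(${MATHLINK_INCLUDE_DIR})
# nvcc compiles the .cu sources; the result links against the MathLink library
cuda_add_executable(scalarProd scalarProd.cu scalarProdtm.cu)
target_link_libraries(scalarProd ${MATHLINK_LIBRARY})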

I looked around the web to see if someone else had already written such a module for MathLink, and eventually found Erik Franken, who sent me a version he had modified from one by Jan Woetzel and others.

By this time I had already written up a version of my own on github. Feel free to download it from here.

Recently Markus van Almsick sent me a more advanced version, which I will integrate into my version soon.

A great article about Twittering with Mathematica is up on the Wolfram blog. A while ago I had investigated a Mathematica Twitter bot for doing "micro-calculations", with the results from Mathematica being less than 140 characters. Not very useful, but a fun bot.

Anyway, if you are interested, I made a gist for it. It's in Java and uses JLink to communicate with Mathematica. It was never running for long, as I suspect it violated some end-user license, but basically you would send a Mathematica command to @mathematica and it would tweet back your result as evaluated by the MathKernel. I am hoping Wolfram might create a similar bot themselves, for when you need the value of a special function quickly.

I have decided to push the initial Mathematica CUDA plug-in to a public repo on github. Feel free to download or fork it.

The basic structure of the project follows that of Nvidia's CUDA SDK, in that the individual projects live in their own folders inside the projects folder. Right now I have the scalarProd example from Nvidia. I have also included Nvidia's cutil utilities and extended the make system to handle Mathematica template files.

Currently I have tested it only on 64-bit Linux, but hopefully I will get it working under Mac and Windows as well. I also plan to add more documentation in the project's wiki on github, and to implement some more useful examples, perhaps an FFT.

Since there is a Matlab plug-in for CUDA that provides some examples of off-loading computation to the GPU, I thought it might be neat to have something similar for Mathematica. So, as a start, I decided to try out a simple scalar product example using MathLink.

The initial template of my function is in the scalarProd.tm file:
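A minimal version looks something like this (the exact pattern and argument types here are illustrative rather than the original file's; Manual means the C function reads its arguments from the link itself):

:Begin:
:Function:      scalarProd
:Pattern:       ScalarProd[a_List, b_List]
:Arguments:     {a, b}
:ArgumentTypes: {Manual}
:ReturnType:    Manual
:End: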

This template describes the ScalarProd[] function in Mathematica and links it to the scalarProd() C function, which is where we receive the two arrays from Mathematica, use CUDA to calculate their scalar product, and send the result back. This, together with the main() function for Linux and Mac (which is what I was using), lives in the scalarProd.cu file. Note that Windows needs a slightly different main() function.
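The Linux and Mac entry point is just the standard MathLink boilerplate:

/* Standard MathLink main() for Linux and Mac; Windows uses a WinMain-based variant. */
int main(int argc, char *argv[])
{
    return MLMain(argc, argv);
}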

In the same scalarProd.cu we also include the scalarProd_kernel.cu kernel from CUDA's SDK, together with our scalarProd() C function:
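Roughly, it looks like the following. This is a sketch rather than the exact original: error checking is omitted, and the inputs are converted to float since the SDK kernel works in single precision.

#include <stdlib.h>
#include <mathlink.h>
#include "scalarProd_kernel.cu" /* defines the SDK's scalarProdGPU() kernel */

/* Receive two real lists from Mathematica, compute their scalar
   product on the GPU, and send the result back over the link. */
void scalarProd(void)
{
    double *a, *b;
    long na, nb;
    MLGetRealList(stdlink, &a, &na);
    MLGetRealList(stdlink, &b, &nb);
    int n = (int)(na < nb ? na : nb);

    /* The SDK kernel works on floats, so convert the input. */
    float *h_a = (float *)malloc(n * sizeof(float));
    float *h_b = (float *)malloc(n * sizeof(float));
    for (int i = 0; i < n; ++i) {
        h_a[i] = (float)a[i];
        h_b[i] = (float)b[i];
    }

    float *d_a, *d_b, *d_c, result;
    cudaMalloc((void **)&d_a, n * sizeof(float));
    cudaMalloc((void **)&d_b, n * sizeof(float));
    cudaMalloc((void **)&d_c, sizeof(float));
    cudaMemcpy(d_a, h_a, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, n * sizeof(float), cudaMemcpyHostToDevice);

    /* A single scalar product (vectorN = 1) over n elements. */
    scalarProdGPU<<<128, 256>>>(d_c, d_a, d_b, 1, n);
    cudaMemcpy(&result, d_c, sizeof(float), cudaMemcpyDeviceToHost);

    MLPutReal(stdlink, (double)result);

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b);
    MLDisownRealList(stdlink, a, na);
    MLDisownRealList(stdlink, b, nb);
}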

Now we are ready to run Mathematica's mprep pre-processor from MathLink to generate a scalarProdtm.cu file. We then run CUDA's compiler nvcc on everything, linking against the appropriate CUDA and MathLink libraries, to generate our scalarProd binary, which we can call from within Mathematica:
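The build boils down to something like this; the include and library paths, and the MathLink library name (-lML64i3 on my 64-bit Linux setup), depend on where Mathematica's MathLink Developer Kit lives on your machine:

$ mprep scalarProd.tm -o scalarProdtm.cu
$ nvcc scalarProd.cu scalarProdtm.cu -o scalarProd \
      -I${MLINKDIR}/include -L${MLINKDIR}/lib -lML64i3 -lpthread -lrt -lm

And then from within Mathematica:

In[1]:= link = Install["./scalarProd"];
In[2]:= ScalarProd[{1., 2., 3.}, {4., 5., 6.}]
Out[2]= 32.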