At work we use Google Cloud Platform to run our machine learning jobs on
multiple machines. GCP has a monitoring platform called Stackdriver which can
be used to view all kinds of metrics about your VMs. Unfortunately, it doesn't
collect any metrics about GPUs, neither usage nor memory. The good news is
that it is extensible and you can "easily" set up a new kind of metric and
monitor it.

To get GPU metrics, we can use the nvidia-smi program, which is installed
along with the necessary drivers for your graphics card.
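If you ask it for GPU and memory utilization in CSV format, using the same
query flags as the script below, it prints one line per GPU. The numbers here
are made up for illustration:

$ nvidia-smi --query-gpu=utilization.gpu,utilization.memory --format=csv,noheader,nounits
85, 52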

The first value is the GPU utilization, as a percentage, and the second value
is the memory utilization (the fraction of time the GPU's memory was being
read or written), also as a percentage.

We are going to write a Python script that opens a subprocess to call
nvidia-smi once a second and aggregates the statistics on a per-minute basis.
We have to do this because we cannot write to Stackdriver metrics more than
once a minute per label (labels are a sort of identifier for these time
series).

from subprocess import Popen, PIPE
import os
import time
import sys

def compute_stats():
    all_gpu = []
    all_mem = []
    # Sample nvidia-smi ten times, one second apart
    for i in range(10):
        p = Popen(["nvidia-smi",
                   "--query-gpu=utilization.gpu,utilization.memory",
                   "--format=csv,noheader,nounits"], stdout=PIPE)
        stdout, stderror = p.communicate()
        output = stdout.decode('UTF-8')
        # Split on line break
        lines = output.split(os.linesep)
        numDevices = len(lines) - 1
        gpu = []
        mem = []
        for g in range(numDevices):
            line = lines[g]
            vals = line.split(', ')
            gpu.append(float(vals[0]))
            mem.append(float(vals[1]))
        all_gpu.append(gpu)
        all_mem.append(mem)
        time.sleep(1)
    # Reduce the samples to per-device maxima and averages
    max_gpu = [max(x[i] for x in all_gpu) for i in range(numDevices)]
    avg_gpu = [sum(x[i] for x in all_gpu) / len(all_gpu) for i in range(numDevices)]
    max_mem = [max(x[i] for x in all_mem) for i in range(numDevices)]
    avg_mem = [sum(x[i] for x in all_mem) / len(all_mem) for i in range(numDevices)]
    return max_gpu, avg_gpu, max_mem, avg_mem

Here we compute both the average and the maximum over the sampling window.
This can be changed to other statistics if they are more relevant for your
use case, as sketched below.
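For instance, a median could be added inside compute_stats, next to the other
reductions. This is a sketch: statistics is in the standard library, and
med_gpu is a hypothetical name.

import statistics

# Median GPU utilization per device, alongside the max/avg reductions:
med_gpu = [statistics.median(x[i] for x in all_gpu) for i in range(numDevices)]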

To write the data to Stackdriver, we have to build up the appropriate
protobufs. We will set two labels: one for the zone in which our machines are
and one for the instance_id, which we will hack to contain both the name of
the machine and the number of the GPU (this is useful in case you attach
multiple GPUs to one machine). I hacked the instance_id because Stackdriver
kept refusing any API calls with custom labels, even though the docs said it
supported them.
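The write_time_series helper called in the final snippet is not shown in the
listing; a minimal sketch of it, assuming the google-cloud-monitoring client
library, could look like this. PROJECT_ID and ZONE are placeholders, and the
instance_id value packs the machine name and GPU number together as described
above:

import sys
import time
from google.cloud import monitoring_v3

PROJECT_ID = "your-gcp-project"  # placeholder: your GCP project ID
ZONE = "us-central1-a"           # placeholder: the zone your VMs run in

client = monitoring_v3.MetricServiceClient()
project_name = f"projects/{PROJECT_ID}"

def write_time_series(metric_name, gpu_index, value):
    # One point of a custom metric, attached to a gce_instance resource
    series = monitoring_v3.TimeSeries()
    series.metric.type = f"custom.googleapis.com/{metric_name}"
    series.resource.type = "gce_instance"
    series.resource.labels["zone"] = ZONE
    # Hack: pack the instance name (first CLI argument) and the GPU number
    # into instance_id, since custom labels were being rejected
    series.resource.labels["instance_id"] = f"{sys.argv[1]}_gpu_{gpu_index}"
    interval = monitoring_v3.TimeInterval(
        {"end_time": {"seconds": int(time.time())}})
    point = monitoring_v3.Point({"interval": interval,
                                 "value": {"double_value": value}})
    series.points = [point]
    client.create_time_series(name=project_name, time_series=[series])

This assumes application default credentials are available, which is the case
on a GCP VM with the right scopes.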

And now, we put everything together. The program must be called with the name
of the instance as its first parameter. If you only run it on GCP, you can
use the GCP APIs to get the name of the instance automatically.
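For example, on a GCE VM the name can be fetched from the metadata server;
this helper is an illustration using only the standard library, not part of
the original script:

from urllib.request import Request, urlopen

def get_instance_name():
    # The metadata server returns the VM's name at this well-known path
    req = Request(
        "http://metadata.google.internal/computeMetadata/v1/instance/name",
        headers={"Metadata-Flavor": "Google"},
    )
    return urlopen(req).read().decode()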

if len(sys.argv) < 2:
    print("You need to pass the instance name as first argument")
    sys.exit(1)
try:
    max_gpu, avg_gpu, max_mem, avg_mem = compute_stats()
    for i in range(len(max_gpu)):
        write_time_series('max_gpu_utilization', i, max_gpu[i])
        write_time_series('max_gpu_memory', i, max_mem[i])
        write_time_series('avg_gpu_utilization', i, avg_gpu[i])
        write_time_series('avg_gpu_memory', i, avg_mem[i])
except Exception as e:
    print(e)

If you save all this code to a file called gpu_monitoring.py and run it
locally on a machine with an NVIDIA GPU, after a minute you should start
seeing the new metrics in the Stackdriver console associated with your GCP
project.

This script can then be called from cron once a minute, or it can be changed
so that it runs without stopping, posting results once a minute.
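For the cron route, a crontab entry along these lines would work; the
interpreter path, script path, and instance name are placeholders:

* * * * * /usr/bin/python3 /opt/monitoring/gpu_monitoring.py my-instance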