At work we use Google Cloud Platform to run our machine learning jobs on
multiple machines. GCP has a monitoring platform called Stackdriver, which can be
used to view all kinds of metrics about your VMs. Unfortunately, it doesn't
collect any metrics about GPUs, neither usage nor memory. The good news is that
it is extensible and you can "easily" set up a new kind of metric and monitor
it.

To get GPU metrics, we can use the nvidia-smi program, which is installed along
with the NVIDIA drivers for your graphics card. If you run it without any
arguments, it will give you the following output: