Advertising the GPU

A key challenge of advertising GPUs is that a GPU can only be used by one job at a time. If an execute node has multiple slots (a likely case!), you'll want to limit each GPU to only being advertised to a single slot.

You have several options for advertising your GPUs. In increasing order of complexity they are:

Static configuration

Automatic configuration

Dynamic advertising

This progression may be a useful way to do initial setup and testing. Start with a static configuration to ensure everything works. Move to an automatic configuration to develop and test partial automation. Finally, a few small changes should make it possible to turn your automatic configuration into dynamic advertising.

Static configuration

If you have a small number of nodes, or perhaps a large number
of identical nodes, you can add static attributes manually using
STARTD_ATTRS on a per slot basis. In the simplest case, it might just be:

SLOT1_HAS_GPU=TRUE
SLOT1_GPU_DEV=0
STARTD_ATTRS=HAS_GPU,GPU_DEV

This limits the GPU to only being advertised by the first slot. A job can use HAS_GPU to identify slots with GPUs, and GPU_DEV to identify which GPU device to use. (A job could instead use the presence of GPU_DEV to identify slots with GPUs, but "HAS_GPU" is a bit easier to read than "(GPU_DEV =!= UNDEFINED)".)
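With those attributes advertised, a submit description can steer jobs to GPU slots. A minimal sketch (the executable name here is hypothetical):

```
universe     = vanilla
executable   = gpu_job.sh
requirements = (HAS_GPU =?= TRUE)
queue
```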

Dynamic advertising

One step beyond automatic configuration is dynamic advertising. Instead of a static or automatically generated configuration, HTCondor itself can run your program periodically and incorporate the information it reports.
This is HTCondor's "Daemon ClassAd Hooks" functionality,
previously known as HawkEye and HTCondor Cron. This is the route taken by the condorgpu project. (Note that the condorgpu project has no affiliation with HTCondor; we have not tested or reviewed that code and cannot promise anything about it!)

Such a configuration might look something like this, assuming that each machine has at most two GPUs.
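A sketch using HTCondor's STARTD_CRON knobs (the knob names follow the Daemon ClassAd Hooks documentation, and the $(MODULES) location is an assumption; check both against your HTCondor version):

```
MODULES = $(LIBEXEC)/modules
STARTD_CRON_JOBLIST = GPU0 GPU1

STARTD_CRON_GPU0_EXECUTABLE = $(MODULES)/get-gpu-info
STARTD_CRON_GPU0_ARGS = 0
STARTD_CRON_GPU0_PREFIX = GPU_
STARTD_CRON_GPU0_SLOTS = 1
STARTD_CRON_GPU0_PERIOD = 10m

STARTD_CRON_GPU1_EXECUTABLE = $(MODULES)/get-gpu-info
STARTD_CRON_GPU1_ARGS = 1
STARTD_CRON_GPU1_PREFIX = GPU_
STARTD_CRON_GPU1_SLOTS = 2
STARTD_CRON_GPU1_PERIOD = 10m
```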

$(MODULES)/get-gpu-info will be invoked twice, once for each of the two possible GPUs. (You can support more by copying the above entries and increasing the integers. #2196, if implemented, may allow for a simpler configuration.) get-gpu-info will be passed the device ID to probe (0 or 1). The output should be a ClassAd; entries will have GPU_ prepended, and they will then be added to the slot ClassAds for slots 1 and 2.
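One way to write get-gpu-info is a small script that probes the requested device and prints one ClassAd attribute per line. This Python sketch fakes the probe with a static table (a real version might shell out to nvidia-smi or query NVML; the attribute names and device data are illustrative, not from a real machine):

```python
#!/usr/bin/env python3
import sys

def classad_value(v):
    """Format a Python value as a ClassAd literal."""
    if isinstance(v, bool):
        return "TRUE" if v else "FALSE"
    if isinstance(v, str):
        return '"%s"' % v
    return str(v)

def format_classad(attrs):
    """Render a dict as ClassAd attribute lines, one attribute per line."""
    return "\n".join("%s = %s" % (k, classad_value(v)) for k, v in attrs.items())

def probe_gpu(dev):
    """Hypothetical static data; a real probe would query the driver."""
    gpus = {
        0: {"DEV": 0, "API": "CUDA", "NumCores": 448, "Name": "Tesla C2075"},
    }
    return gpus.get(dev)

if __name__ == "__main__" and len(sys.argv) > 1:
    info = probe_gpu(int(sys.argv[1]))  # device ID passed by STARTD_CRON
    if info is not None:
        print(format_classad(info))
```

HTCondor then prepends the configured GPU_ prefix, so DEV becomes GPU_DEV in the slot ClassAd.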

get-gpu-info would write output to its standard output that looked something like:
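For example (the device name and core count here are purely illustrative):

```
DEV = 0
API = "CUDA"
NumCores = 448
Name = "Tesla C2075"
```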

specifying that the job requires the CUDA GPU API (as opposed to OpenCL or another), that it wants a GPU with at least 16 cores, and it wants a GPU with a name of "Tesla".
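Such a description might correspond to a requirements expression along these lines (a sketch, assuming the GPU_-prefixed attribute names discussed above):

```
requirements = (GPU_API =?= "CUDA") && (GPU_NumCores >= 16) && regexp("Tesla", GPU_Name)
```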

Identify the GPU

Once a job matches to a given slot, it needs to know which GPU to use if multiple are present. Assuming the slot advertised the information, you can access it through the job's arguments or environment using the $$() syntax. For example, if your job takes an argument "--device=X", where X is the device to use, you might do something like:

arguments = "--device=$$(GPU_DEV)"

Or your job might consult the environment variable GPU_DEVICE_ID:

environment = "GPU_DEVICE_ID=$$(GPU_DEV)"
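Inside the job, the program or wrapper then reads that variable to pick its device. A Python sketch (GPU_DEVICE_ID comes from the environment line above; the CUDA_VISIBLE_DEVICES hand-off mentioned in the comment is one common convention, not something HTCondor does for you):

```python
import os

def chosen_device(default=0):
    """Return the GPU device ID the slot assigned to this job."""
    raw = os.environ.get("GPU_DEVICE_ID")
    return int(raw) if raw is not None else default

# For CUDA programs, one common way to apply the choice is to restrict
# device visibility before initializing the GPU library:
# os.environ["CUDA_VISIBLE_DEVICES"] = str(chosen_device())
```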

The Future

The HTCondor team is working on various improvements in how HTCondor can manage GPUs. We're interested in how you are currently using GPUs in your cluster and how you plan on using them. If you have thoughts or questions, you can post to the public condor-users mailing list, or email us directly.

This work supported in part by NSF grants MCS-8105904, OCI-0437810, OCI-0850745, and/or ACI-1321762. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.