Using → GPU Metrics

Open XDMoD includes support for GPU metrics in the jobs realm starting with version 9.0.0. Specifically, the number of GPUs allocated to each job is tracked and used to calculate the number of GPU hours and to allow grouping by the number of GPUs allocated.

Only Slurm and PBS are supported at this time.

Please note that if your resource manager is not supported or GPU data is not available/parsable, that Open XDMoD will report zero GPU hours and a GPU count of zero.

Slurm

The GPU count source for Slurm data is the AllocTRES accounting field taken from sacct output. This field contains the Trackable Resources allocated to the job. The specific resource that is used to determine the GPU count is the Generic Resource identified by gres/gpu.

For example:

billing=10,cpu=10,gres/gpu=4,mem=374000M,node=2

This AllocTRES value would indicate that the job was allocated 4 GPUs.

PBS

The GPU source count for PBS data is the Resource_List.nodes field in the accounting log files. If a value is specified for gpus then that is used as the number of GPUs per node.

For example:

Resource_List.nodes=2:ppn=32:gpus=2

This would indicate that the job used 4 GPUs (2 GPUs per node * 2 nodes).

Non-Standard PBS GPU data

In addition to the standard way of logging GPUs, two non-standard ways of logging this data are also supported. If the GPU count cannot be determined from Resource_List.nodes then these will be used when present.

If the Resource_List.nodect field specifies a value for gpus then that will be used.

For example:

Resource_List.nodect=2:ppn=32:gpu=2

This would indicate that the job used 4 GPUs (2 GPUs per node * 2 nodes).

If neither of those methods produce a value, then Resource_List.gpu may be used to determine the number of GPUs.

For example:

Resource_List.gpu=2

This would indicate that the job used 2 GPUs.