Checking GPU Usage and Performance

Once a program built for GPU is up and running on a Grace Hopper compute node, it's useful to monitor the GPU usage of that program. The command nvidia-smi is a simple tool to check the current status of the GPU in the GH compute node(s) where your job is running on Vista, either interactively or in batch. It displays basic information about all the GPUs on the GH node, including memory usage, temperature, power consumption, and a list of processes using the GPU. Note that nvidia-smi is not available on the nodes without GPUs, i.e., the login nodes and the GG nodes.

To update the display periodically, use the -l (loop) option followed by an interval in seconds. For example, the command nvidia-smi -l 2 refreshes the display every two seconds.

Common metrics to monitor are the memory usage, the GPU utilization rate, and the list of running processes. If your program does not appear in the list of running processes, it has either finished executing or is not using the GPUs. If the program needs more memory than the GPU provides, out-of-memory errors may occur, and the program or its inputs may need to be adjusted. For more advanced diagnostics and performance analysis, NVIDIA provides the Nsight Systems and Nsight Compute tools, available via the commands nsys and ncu.
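The metrics mentioned above can also be collected in script form through the query interface of nvidia-smi. The following is a minimal sketch, not a site-provided tool: the helper names mem_percent and report_gpus are made up for illustration, and the sketch assumes the driver reports memory values in MiB as plain integers (anything else is printed as "n/a").

```shell
#!/bin/sh
# Sketch: summarize GPU utilization and memory use from nvidia-smi CSV output.
# mem_percent and report_gpus are illustrative helper names, not standard tools.

# mem_percent USED TOTAL -> integer percentage, or "n/a" for non-numeric input.
mem_percent() {
    case "$1$2" in
        ''|*[!0-9]*) echo "n/a"; return ;;
    esac
    [ "$2" -gt 0 ] || { echo "n/a"; return; }
    echo "$(( 100 * $1 / $2 ))%"
}

# report_gpus reads "index, utilization, used, total" CSV lines on stdin.
report_gpus() {
    while IFS=', ' read -r idx util used total; do
        [ -n "$idx" ] || continue
        echo "GPU $idx: ${util}% busy, memory $(mem_percent "$used" "$total") used"
    done
}

# Query only where nvidia-smi exists (it is absent on login and GG nodes).
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi --query-gpu=index,utilization.gpu,memory.used,memory.total \
               --format=csv,noheader,nounits | report_gpus
fi
```

Run inside a job on a GH node, this prints one summary line per GPU. Memory use that stays close to 100% suggests the program is near the GPU's capacity.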

Checking the GPU Compute Capability

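In addition, nvidia-smi can report the compute capability of the current device through its query interface, using the option --query-gpu=compute_cap. For the gh or gh-dev nodes with H200s, the reported compute capability is 9.0, the value for the Hopper architecture.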
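As a hedged illustration, the compute capability query can be scripted and converted into the architecture name that nvcc expects (for example, with -arch). The helper cc_to_sm is a hypothetical name introduced here, and the sketch assumes a driver recent enough to support the compute_cap query field.

```shell
#!/bin/sh
# Sketch: print each GPU's compute capability and the matching nvcc arch name.
# cc_to_sm is an illustrative helper, not part of any NVIDIA tool.

# cc_to_sm CAP -> architecture name, e.g. 9.0 -> sm_90.
cc_to_sm() {
    echo "sm_$(echo "$1" | tr -d '.')"
}

# Query only where nvidia-smi exists (it is absent on nodes without GPUs).
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi --query-gpu=compute_cap --format=csv,noheader |
    while read -r cap; do
        echo "compute capability $cap -> $(cc_to_sm "$cap")"
    done
fi
```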

Checking the Compute Capabilities of an Executable

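Conversely, to check which compute capabilities an executable targets, use cuobjdump, which is part of the CUDA Toolkit. Its --list-elf option lists the device binaries (cubin files) embedded in the executable, and its --list-ptx option lists the embedded PTX. In both listings, the file names include the target architecture, such as sm_90.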
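One possible way to condense the cuobjdump listings into a short list of target architectures is sketched below. The helper list_sm_targets and the default executable name ./my_app are hypothetical, and the sketch assumes that the listed file names contain the architecture in the form sm_XX.

```shell
#!/bin/sh
# Sketch: list the GPU architectures embedded in a CUDA executable.
# list_sm_targets is an illustrative helper; ./my_app is a placeholder name.

# list_sm_targets reads cuobjdump listing text on stdin and prints each
# distinct sm_XX architecture name found, one per line.
list_sm_targets() {
    grep -o 'sm_[0-9][0-9]*[a-z]*' | sort -u
}

exe=${1:-./my_app}
# Run only if cuobjdump is on the PATH and the executable exists.
if command -v cuobjdump >/dev/null 2>&1 && [ -f "$exe" ]; then
    echo "Device binary (cubin) targets:"
    cuobjdump --list-elf "$exe" | list_sm_targets
    echo "PTX targets:"
    cuobjdump --list-ptx "$exe" | list_sm_targets
fi
```

If the architecture matching the node's GPU (sm_90 on the GH nodes) appears under the cubin targets, the executable contains native code for that GPU. If it appears only under PTX, the driver has to compile the PTX just in time when the program starts.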

 
© Cornell University | Center for Advanced Computing
CVW material development is supported by NSF OAC awards 1854828, 2321040, 2323116 (UT Austin) and 2005506 (Indiana University)