Checking GPU Usage and Performance

Once a program built for the GPU is up and running on a Grace Hopper compute node, it is useful to monitor that program's GPU usage. The nvidia-smi command is a simple tool for checking the current status of the GPU on the GH compute node(s) where your job is running on Vista, either interactively or in batch. It displays basic information about all the GPUs on the GH node, including memory usage, temperature, power consumption, and a list of processes using the GPU. Note that nvidia-smi is not available on nodes without GPUs, i.e., the login nodes and the GG nodes.
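For example, a minimal job-script fragment might take a snapshot of the GPU status immediately before and after the application runs. This is only a sketch; the executable name ./my_gpu_app is a placeholder for your own GPU program.

# snapshot before the run: memory, temperature, power, processes
nvidia-smi
# run your GPU application (placeholder name)
./my_gpu_app
# snapshot after the run
nvidia-smi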

To update the display periodically, you can use the -l (loop) option. The command below refreshes the display every two seconds:
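nvidia-smi -l 2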

Common metrics to monitor are the memory usage, the GPU utilization rate, and the list of running processes. If your program does not appear in the process list, it has either finished executing or is not using the GPUs. If memory usage exceeds the GPU's capacity, out-of-memory errors may occur, which may require adjusting the program or its inputs. For more advanced diagnostics and performance analysis, NVIDIA provides the Nsight Systems and Nsight Compute tools, available via the commands nsys and ncu.
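As a rough illustration, the commands below log a few of these metrics in machine-readable form and show one common way the Nsight tools are invoked. The output file names and ./my_gpu_app are placeholders, and the options may need adjusting for your workflow.

# log memory usage and GPU utilization to a CSV file, sampling every 5 seconds
nvidia-smi --query-gpu=timestamp,memory.used,memory.total,utilization.gpu --format=csv -l 5 > gpu_usage.csv

# collect a system-wide timeline with Nsight Systems
nsys profile -o my_timeline ./my_gpu_app

# collect detailed kernel-level metrics with Nsight Compute
ncu -o my_kernels ./my_gpu_app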

 