Monitoring Jobs
TensorFlow/Keras, PyTorch, and other deep learning packages are all complex software systems, providing high-level programming interfaces that coordinate the construction and operation of an impressive collection of computational machinery underneath those interfaces. While the user-friendliness of the interfaces provides convenience, it can be a challenge to figure out what is going on underneath and to determine whether computational resources are being used effectively. This is especially true on powerful cluster nodes such as the GPU nodes on Frontera, which combine multi-core CPUs with 4 attached GPUs per node.
As noted, both TensorFlow/Keras and PyTorch are capable of running either solely on one or more CPUs, or on combined CPU/GPU platforms if one or more GPUs are available. Furthermore, given that compatible CUDA libraries are available, both systems are able to auto-detect the presence of GPUs and use them where needed to accelerate computations. On nodes with more than one GPU, such as on Frontera at TACC, both TensorFlow/Keras and PyTorch can be programmed to make use of multiple GPUs, but neither will do so automatically without changes to the user's code to coordinate that sort of distributed computation.
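For example, in TensorFlow/Keras one common approach is to build and compile the model inside a tf.distribute.MirroredStrategy scope, after which each training batch is split across the detected GPUs; the sketch below is illustrative only, and the model is a hypothetical placeholder. (PyTorch offers analogous mechanisms, such as torch.nn.parallel.DistributedDataParallel.)

import tensorflow as tf

# MirroredStrategy detects all GPUs visible on the node and keeps a model replica on each.
strategy = tf.distribute.MirroredStrategy()
print("Replicas in sync:", strategy.num_replicas_in_sync)

with strategy.scope():
    # Any Keras model built and compiled inside the scope is mirrored across the GPUs.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# A subsequent model.fit(...) call then shards each batch across the replicas.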
TensorFlow and PyTorch can both make use of more than one core on a multi-core CPU without alterations to a program, by detecting the presence of multiple cores and starting up multiple concurrent threads. The number of threads used can be controlled to some extent by setting environment variables (TF_NUM_INTEROP_THREADS and TF_NUM_INTRAOP_THREADS for TensorFlow, and OMP_NUM_THREADS for PyTorch), although you are probably better off relying on each system's internal management to make optimal use of the available resources.
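If you do want to experiment with thread counts, the same knobs can also be set programmatically; the sketch below is illustrative (the specific values are arbitrary, and the TensorFlow calls must be made before any operations run):

import tensorflow as tf
import torch

# Threads used to run independent TensorFlow operations concurrently.
tf.config.threading.set_inter_op_parallelism_threads(2)
# Threads used within a single TensorFlow operation, e.g. a large matrix multiply.
tf.config.threading.set_intra_op_parallelism_threads(8)
# Intra-op thread count for PyTorch (roughly what OMP_NUM_THREADS controls).
torch.set_num_threads(8)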
Both TensorFlow and PyTorch provide low-level, fine-grained control of the underlying computational processes and their mapping to hardware resources, but the high-level, user-friendly APIs hide much of that. In what follows, we describe a few tools that can help you to monitor some of the hardware resources that are being leveraged by your deep learning programs, so that you can efficiently use the resources you are requesting.
Performance assessment
Please consult our companion material on performance assessment within Python programs, including the use of profiling and timing tools.
time
time is a standard Linux/Unix utility that reports the time used to run a command from the shell, broken down into real time ("wallclock time"), user time, and system time. If you are trying to speed up your code, either by using more CPU and/or GPU resources or by choosing different algorithms in your deep learning code, then your primary interest is in reducing the real time. If you run a job in parallel such that it uses multiple cores or CPU nodes, that will be reflected in the user time. (In other words, the user time can be larger than the real time if multiple CPU resources are used concurrently.)
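For example, prefixing a training run with time might produce output like the following (the script name and the timings are made up for illustration):

time python3 train_model.py
real    3m42.117s
user    26m05.480s
sys     0m58.334s

Here the user time is roughly seven times the real time, suggesting that several cores were kept busy concurrently.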
top
top is another standard Linux/Unix utility that reports running processes, their current CPU usage, and other attributes. Entries are ordered from the top down, in order of decreasing CPU usage, so top is a useful tool for seeing what is running and using the CPU. On multi-core CPUs, such as those on the systems at TACC, each core can potentially run full-out at 100% utilization. Therefore, if a process is using multiple cores, its utilization can be listed as greater than 100%. To get a rough sense of how many cores are in use for a process, you can divide the CPU usage listed in top by 100 (percent). Of course, a process might not be getting all of any single core if other processes are also running on that core, but this can be a useful first step in identifying how many cores are actually in use by a given process.
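For instance, a single training process spread across about eight cores might show up in top along these lines (values and layout are illustrative):

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
10904 username  20   0   95.1g  30.2g   5.1g R 780.3   7.9  25:14.02 python3

Dividing 780% by 100% suggests that roughly eight cores are being used by this process.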
nvidia-smi
nvidia-smi is a utility available on systems with NVIDIA GPUs, formally known as the NVIDIA System Management Interface. It provides detailed information about the GPUs attached to a particular node, such as GPU utilization, memory usage, and power consumption. As such, it is somewhat analogous to what top provides for CPUs. If you are not sure how many GPUs on a system your code is actually using, or how efficiently those GPUs are being used for a given application, you can run nvidia-smi to get information such as that reproduced below from the Frontera system at TACC during the run of a TensorFlow/Keras application:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.86       Driver Version: 470.86       CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Quadro RTX 5000     On   | 00000000:02:00.0 Off |                  Off |
|  0%   39C    P2    50W / 230W |  15676MiB / 16125MiB |      21%     Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Quadro RTX 5000     On   | 00000000:03:00.0 Off |                  Off |
|  0%   38C    P2    46W / 230W |  15676MiB / 16125MiB |      20%     Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Quadro RTX 5000     On   | 00000000:82:00.0 Off |                  Off |
|  0%   39C    P2    56W / 230W |  15780MiB / 16125MiB |      20%     Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Quadro RTX 5000     On   | 00000000:83:00.0 Off |                  Off |
|  0%   38C    P2    44W / 230W |  15734MiB / 16125MiB |      20%     Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     10904      C   python3                         15673MiB |
|    1   N/A  N/A     10904      C   python3                         15673MiB |
|    2   N/A  N/A     10904      C   python3                         15777MiB |
|    3   N/A  N/A     10904      C   python3                         15731MiB |
+-----------------------------------------------------------------------------+
From the nvidia-smi output, we can see that all 4 GPUs are being used by the running program, although each of them is being utilized at only about 20% of its full capacity. This is presumably because the size of the problem and the associated data being analyzed are not big enough to fully feed all 4 GPU pipelines at the same time, although that does not necessarily mean that you are not getting some speedup by using multiple GPUs.
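If you want to see how GPU utilization evolves over the course of a run, nvidia-smi can also be invoked repeatedly rather than once; for example, the following (illustrative) invocation refreshes the report every 5 seconds:

nvidia-smi -l 5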
remora
Remora is a tool developed by TACC to provide users with information about the resources being used by their running programs; the name stands for "Resource Monitoring for Remote Applications". Remora is managed on TACC machines through the Lmod module system, and it is run in conjunction with the user code of interest. The Remora web page at TACC indicates a sample usage along the lines of the sketch below.
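This sketch is illustrative: the training script name is a hypothetical placeholder, and you should check the TACC documentation for the exact module name and invocation on your system.

module load remora
remora python3 my_training_script.py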
Once a job run via remora has finished, an HTML-formatted report is generated, consisting of multiple pages with embedded figures that illustrate the time course of CPU and GPU usage, memory utilization, etc., throughout the execution of the job. To get a more detailed sense of resource usage, especially for long-running jobs, generating an archived report through remora might be preferable to using the various command-line tools described above.
Package-specific profilers
In addition to the general-purpose tools described above, both TensorFlow and PyTorch provide their own profiling capabilities tailored to those specific systems. For more information, see the TensorFlow Profiler guide and the PyTorch profiler documentation.
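As a brief illustration of what these package-specific profilers look like in practice, the sketch below enables TensorFlow's profiler through the Keras TensorBoard callback (profiling a few training batches) and uses torch.profiler around a forward pass. The models, synthetic data, and log directory are hypothetical stand-ins, and viewing the TensorFlow trace typically requires TensorBoard with its profiler plugin pointed at the chosen log directory.

import numpy as np
import tensorflow as tf
import torch
from torch.profiler import profile, ProfilerActivity

# --- TensorFlow/Keras: profile training batches 2 through 4 via TensorBoard ---
x = np.random.rand(256, 32).astype("float32")
y = np.random.randint(0, 10, size=(256,))
tf_model = tf.keras.Sequential([tf.keras.layers.Dense(64, activation="relu"),
                                tf.keras.layers.Dense(10)])
tf_model.compile(optimizer="adam",
                 loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))
tb = tf.keras.callbacks.TensorBoard(log_dir="logs/profile", profile_batch=(2, 4))
tf_model.fit(x, y, epochs=1, batch_size=32, callbacks=[tb])

# --- PyTorch: profile CPU (and GPU, if present) activity around a forward pass ---
device = "cuda" if torch.cuda.is_available() else "cpu"
pt_model = torch.nn.Linear(1024, 1024).to(device)
inputs = torch.randn(64, 1024, device=device)
activities = [ProfilerActivity.CPU]
if device == "cuda":
    activities.append(ProfilerActivity.CUDA)
with profile(activities=activities) as prof:
    outputs = pt_model(inputs)
print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=10))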