TensorFlow/Keras, PyTorch, and other deep learning packages are complex software systems, providing high-level programming interfaces that coordinate the construction and operation of an impressive collection of computational machinery underneath. While those interfaces are convenient, they can make it a challenge to figure out what is going on underneath and whether computational resources are being used effectively. This is especially true on powerful cluster nodes such as the GPU nodes on Frontera, each of which combines multi-core CPUs with 4 attached GPUs.

As noted, both TensorFlow/Keras and PyTorch are capable of running either solely on one or more CPUs, or on combined CPU/GPU platforms if one or more GPUs are available. Provided that compatible CUDA libraries are installed, both systems can auto-detect the presence of GPUs and use them to accelerate computations. On nodes with more than one GPU, such as those on Frontera at TACC, both TensorFlow/Keras and PyTorch can be programmed to make use of multiple GPUs, but neither will do so automatically: the user's code must be changed to coordinate that sort of distributed computation.
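
As an illustration of the kind of code change involved, the following is a minimal TensorFlow/Keras sketch using tf.distribute.MirroredStrategy, which replicates a model across all GPUs visible on a single node (the small model shown is just a stand-in for your own); PyTorch offers analogous mechanisms such as torch.nn.DataParallel and torch.nn.parallel.DistributedDataParallel:

import tensorflow as tf

# MirroredStrategy detects all GPUs visible on the node and replicates
# the model across them, splitting each training batch among the replicas.
strategy = tf.distribute.MirroredStrategy()
print("Number of replicas in sync:", strategy.num_replicas_in_sync)

# Model construction and compilation must happen inside the strategy scope,
# so that variables are created on (and kept synchronized across) all GPUs.
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(32,)),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")

# Training with model.fit(...) then proceeds as usual, with the work
# distributed across the available GPUs.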

TensorFlow and PyTorch can both make use of more than one core on a multi-core CPU without alterations to a program, by detecting the presence of multiple cores and starting up multiple concurrent threads. The number of threads used can be controlled to some extent by setting environment variables (TF_NUM_INTEROP_THREADS and TF_NUM_INTRAOP_THREADS for TensorFlow, and OMP_NUM_THREADS for PyTorch), although you are probably better off relying on each system's internal management to optimally use the available resources.
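
If you do want to impose limits explicitly, the same controls are available programmatically; the following sketch shows the relevant calls in each framework (the thread counts are arbitrary placeholders, and in practice only one of the two frameworks would appear in a given script):

import tensorflow as tf
import torch

# TensorFlow: programmatic equivalents of TF_NUM_INTEROP_THREADS and
# TF_NUM_INTRAOP_THREADS; these must be called before TensorFlow
# executes any operations and initializes its thread pools.
tf.config.threading.set_inter_op_parallelism_threads(2)
tf.config.threading.set_intra_op_parallelism_threads(8)

# PyTorch: torch.set_num_threads plays a role analogous to OMP_NUM_THREADS
# for intra-op parallelism; torch.set_num_interop_threads must likewise be
# called before any parallel work begins.
torch.set_num_threads(8)
torch.set_num_interop_threads(2)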

Both TensorFlow and PyTorch provide low-level, fine-grained control over the underlying computational processes and their mapping to hardware resources, but their high-level, user-friendly APIs hide much of that detail. In what follows, we describe a few tools that can help you monitor the hardware resources being leveraged by your deep learning programs, so that you can make efficient use of the resources you are requesting.

Performance assessment

Please consult our companion material on performance assessment within Python programs, including the use of profiling and timing tools.

time

time is a standard Linux/Unix utility that reports the time used to run a command from the shell, broken down into real time ("wallclock time"), user time, and system time. If you are trying to speed up your code, either by using more CPU and/or GPU resources or by choosing different algorithms in your deep learning code, then your primary interest is in reducing the real time. If a job runs in parallel across multiple cores or CPUs, that will be reflected in the user time: the user time can be larger than the real time if multiple CPU resources are used concurrently.
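
For example, timing a hypothetical training script from the bash shell might produce output like the following (the script name and the times shown are purely illustrative). Here the user time is roughly four times the real time, suggesting that about four cores were kept busy on average:

$ time python3 my_training_script.py

real    2m10.306s
user    8m25.742s
sys     0m14.218s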

top

top is another standard Linux/Unix utility that reports on running processes, their current CPU usage, and other attributes. Entries are ordered from the top down in order of decreasing CPU usage, so top is a useful tool for seeing what is running and using the CPU. On multi-core CPUs, such as those on the systems at TACC, each core can potentially run flat-out at 100% utilization; therefore, a process that is using multiple cores can be listed at greater than 100%. To get a rough sense of how many cores a process is using, divide the CPU usage listed in top by 100 percent: a process listed at 400%, for example, is keeping roughly four cores busy. Of course, a process might not be getting 100% of any single core if other processes are also running on that core, but this is a useful first step toward determining how many cores are actually in use by a given process.
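
If you want to drill down further into the individual threads that TensorFlow or PyTorch has spawned within a single process, top can be restricted to a given process and switched into threads mode (the PID shown here is just a placeholder for the one top or ps reports for your own job):

$ top -H -p 10904

With the -H flag, each thread appears as a separate entry with its own CPU percentage, while -p restricts the display to the specified process.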

nvidia-smi

nvidia-smi, formally known as the NVIDIA System Management Interface, is a utility available on systems with NVIDIA GPUs. It provides detailed information about the GPUs attached to a node, such as GPU utilization, memory usage, and power consumption; as such, it is somewhat analogous to what top provides for CPUs. If you are not sure how many GPUs on a system your code is actually using, or how efficiently those GPUs are being used by a given application, you can run nvidia-smi to get information such as that reproduced below, captured on the Frontera system at TACC while a TensorFlow/Keras application was running:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.86       Driver Version: 470.86       CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Quadro RTX 5000     On   | 00000000:02:00.0 Off |                  Off |
|  0%   39C    P2    50W / 230W |  15676MiB / 16125MiB |     21%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Quadro RTX 5000     On   | 00000000:03:00.0 Off |                  Off |
|  0%   38C    P2    46W / 230W |  15676MiB / 16125MiB |     20%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Quadro RTX 5000     On   | 00000000:82:00.0 Off |                  Off |
|  0%   39C    P2    56W / 230W |  15780MiB / 16125MiB |     20%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Quadro RTX 5000     On   | 00000000:83:00.0 Off |                  Off |
|  0%   38C    P2    44W / 230W |  15734MiB / 16125MiB |     20%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     10904      C   python3                         15673MiB |
|    1   N/A  N/A     10904      C   python3                         15673MiB |
|    2   N/A  N/A     10904      C   python3                         15777MiB |
|    3   N/A  N/A     10904      C   python3                         15731MiB |
+-----------------------------------------------------------------------------+

From the nvidia-smi output, we can see that all 4 GPUs are being used by the running program, although each is utilized at only about 20% of its full capacity. This is presumably because the problem and the associated data being analyzed are not big enough to keep all 4 GPU pipelines fully fed at the same time, although even partial utilization of multiple GPUs can still provide some speedup.
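
Because GPU utilization fluctuates over the course of a run, a single snapshot can be misleading; nvidia-smi can instead be told to re-query the GPUs periodically, e.g., every 5 seconds, until interrupted:

$ nvidia-smi -l 5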

remora

Remora is a tool developed by TACC to provide users with information about the resources being used by their running programs; the name stands for "Resource Monitoring for Remote Applications". Remora is managed on TACC machines through the Lmod module system, and is run in conjunction with the user code of interest. The Remora web page at TACC indicates a sample usage along the following lines (the script name here is illustrative):

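$ module load remora
$ remora python3 my_training_script.py
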
Once a job run via remora is finished, an HTML-formatted report is generated, consisting of multiple pages with embedded figures that illustrate the time course of CPU and GPU usage, memory utilization, etc. throughout the execution of the job. To get a more detailed picture of resource usage, especially for long-running jobs, generating an archived report through remora might be preferable to using the various command-line tools described above.

Package-specific profilers

In addition to the general-purpose tools described above, both TensorFlow and PyTorch provide their own profiling capabilities tailored to those specific systems. For more information, see the TensorFlow Profiler guide (https://www.tensorflow.org/guide/profiler) and the PyTorch profiler recipe (https://pytorch.org/tutorials/recipes/recipes/profiler_recipe.html).
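
As a brief illustration of the latter, the following sketch uses torch.profiler to profile a single forward pass on the CPU and print a summary table (the model and input here are placeholders for your own network and data):

import torch
from torch.profiler import profile, ProfilerActivity

# Placeholder model and input; substitute your own network and data.
model = torch.nn.Linear(512, 512)
x = torch.randn(64, 512)

# Profile CPU activity during one forward pass; add ProfilerActivity.CUDA
# to the activities list to capture GPU kernel timings as well.
with profile(activities=[ProfilerActivity.CPU], record_shapes=True) as prof:
    model(x)

# Print the operators that consumed the most CPU time.
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))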
