Exercise: Device Bandwidth
In this exercise, we will build and run one of the sample programs that NVIDIA provides with CUDA, in order to test the bandwidth between the host and an attached CUDA device. In the bonus exercise, if more than one device is present, we will test the bandwidth between a pair of the devices using a second sample code.
Again, no CUDA programming is required for these exercises. The goals are just to gain experience in submitting GPU jobs, and to learn the rates at which data are transferred to and from (even within and among!) GPUs. While the exercises are again geared for Frontera, with suitable modifications they will work on a wide range of systems that include NVIDIA GPUs.
After locating the source and header files and the Makefile, we will build the code, prepare the batch file, and submit the job.
1. Load the CUDA software using the module utility (as necessary).
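On Frontera, for example, the commands below make nvcc and the CUDA libraries available; the module name and default version may differ on other systems.

    module load cuda
    module list     # confirm that a cuda module is now loaded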
2. Set CUDA_PATH to the location where CUDA is installed. This can be deduced from the path to the nvcc compiler. The CUDA_PATH environment variable will be needed later by the Makefile.
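One way to set it is to strip the trailing /bin/nvcc from the compiler's full path. This is a sketch; the installation path shown in the comments is only an illustration and will differ from system to system.

    which nvcc                                            # e.g., /opt/apps/cuda/11.3/bin/nvcc
    export CUDA_PATH=$(dirname $(dirname $(which nvcc)))
    echo $CUDA_PATH                                       # e.g., /opt/apps/cuda/11.3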
3. Copy the desired CUDA sample files to a directory in $HOME.
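Assuming the samples are bundled with the toolkit under $CUDA_PATH/samples, as they were through CUDA 10.x (later toolkits distribute the samples separately on GitHub), the copy might look like this; the directory name cuda_samples is an arbitrary choice.

    mkdir -p $HOME/cuda_samples
    cp -r $CUDA_PATH/samples/1_Utilities/bandwidthTest $HOME/cuda_samples
    cd $HOME/cuda_samples/bandwidthTest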
4. Set INCLUDES as an environment variable for the Makefile you just copied, then build the executable with make -e. At the end of the make process, you are likely to see a "cannot create directory" error, which can be safely ignored.
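With make -e, environment variables take precedence over the corresponding assignments inside the Makefile, so pointing INCLUDES at the toolkit's copy of the common sample headers lets the build find helper_cuda.h and related files. A sketch, again assuming the CUDA 10-style samples layout:

    export INCLUDES="-I$CUDA_PATH/samples/common/inc"
    make -e
    ls bandwidthTest     # the executable should now be in the current directory

(The ignorable error likely arises because the Makefile also tries to copy the executable into the read-only samples tree.)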
5. Prepare (or download) the batch file and save it as batch_test.sh (or pick any filename). Remember to submit the job to one of the GPU queues, such as Frontera's rtx-dev queue.
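A minimal Slurm script along these lines should work; the -A line is a placeholder for your own allocation, and the -o pattern produces the gpu_test.o[job ID] file mentioned in step 7.

    #!/bin/bash
    #SBATCH -J gpu_test          # job name
    #SBATCH -o gpu_test.o%j      # write output to gpu_test.o<jobID>
    #SBATCH -p rtx-dev           # Frontera's GPU development queue
    #SBATCH -N 1                 # one node
    #SBATCH -n 1                 # one task
    #SBATCH -t 00:05:00          # a few minutes is plenty
    #SBATCH -A myproject         # placeholder: your allocation/project name

    ./bandwidthTest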
6. Submit your job using the sbatch command.
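From the directory containing both the executable and the script:

    sbatch batch_test.sh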
7. Retrieve the results. If your job ran successfully, they will be stored in the file gpu_test.o[job ID]. Assuming you submitted your job to Frontera's rtx-dev queue, the output will report three bandwidth measurements: host to device, device to host, and device to device.
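You can check on the job's progress with squeue, then examine the output file once the job finishes; <jobID> stands for the numeric ID that sbatch reported.

    squeue -u $USER          # wait until the job disappears from the queue
    cat gpu_test.o<jobID>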
The final bandwidth test, device to device, has a rather deceptive name. It is not a measurement of the bandwidth of a memory copy between two devices, but rather of a memory copy from the device to itself. In other words, it measures the device's own memory bandwidth. The result should be a reasonably high fraction of the nominal peak rate of 448 GB/s.
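(The 448 GB/s nominal peak follows directly from the Quadro RTX 5000's memory configuration: a 256-bit GDDR6 interface at 14 Gbps per pin transfers 256/8 = 32 bytes × 14 × 10⁹ per second, or 448 GB/s.)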
Bonus exercise
As a follow-on to the above tests, you can try building and running a different sample code that is supplied with CUDA, in order to measure the speed of transfers that go directly from each of the attached GPU devices to each of the other attached devices. NVIDIA's term for this is peer-to-peer transfers. If no direct path is present, the transfer is routed through the host.
In principle, for the types of multi-GPU platforms described in this Virtual Workshop topic, the direct transfers should be at least as fast as the host-to-device or device-to-host transfers. However, as mentioned in the discussion of interconnects for the RTX 5000, direct routing may not turn out to be faster in practice.
To complete this bonus exercise, repeat the above steps, but replace step 3 with the following:
3. Copy the desired CUDA sample files to a second directory in $HOME.
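Following the same pattern as before, and again assuming the toolkit-bundled samples layout, the copy might look like:

    mkdir -p $HOME/cuda_samples_p2p
    cp -r $CUDA_PATH/samples/1_Utilities/p2pBandwidthLatencyTest $HOME/cuda_samples_p2p
    cd $HOME/cuda_samples_p2p/p2pBandwidthLatencyTest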
The only other change to the instructions is that the batch script should run the executable p2pBandwidthLatencyTest.
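That is, only the final line of batch_test.sh changes:

    ./p2pBandwidthLatencyTest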
In the output, you should first see a matrix displaying exactly how the attached GPUs are coupled to one another, followed by a succession of matrices displaying the bandwidths and latencies between the devices. The latter are shown with P2P (peer-to-peer) disabled and enabled. For the Frontera nodes with 4 Quadro RTX 5000 GPUs, the peer-to-peer latency improvements will likely be more encouraging than the ones for bandwidth.