Submit a Job to Slurm
Running GPU code on a personal computer is very different from running it on a large cluster system like Frontera or Vista. On a personal computer, you install the required packages, then compile and execute the application. On an HPC system like Frontera or Vista, there are more steps: first you run the module command to load CUDA-related utilities such as nvcc; then you compile your GPU code on one of the login nodes; then you prepare a batch script and submit the job to one of the GPU queues using Slurm.
This section provides step-by-step instructions for running a simple GPU job on Frontera or Vista. With some modification, the same steps should work on any cluster that features NVIDIA GPUs and a Slurm scheduler. Programs for the other lab exercises in this roadmap can be compiled and run in a similar manner.
Compilation
The first and most important step is to load the CUDA module, which gives you access to the NVIDIA nvcc compiler used to compile CUDA code on Frontera and Vista. The following command loads the CUDA module:
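```
# Load the default CUDA toolkit module, which puts nvcc on your PATH
module load cuda
```

(If an exercise calls for a particular CUDA version, module spider cuda lists the versions installed on the system.)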
In this roadmap, all lab exercises can be compiled using the following command and compiler flags.
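A representative invocation looks like this; gpu_example.cu and gpu_example are placeholder names for a lab's source file and executable:

```
# Compile CUDA source for the Turing GPUs (compute capability 7.5) on Frontera
nvcc -arch=compute_75 -code=sm_75 -o gpu_example gpu_example.cu
```

On Vista, whose GPUs are Hopper-based, the corresponding architecture values would be compute_90 and sm_90.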
The compute_75 and sm_75 values instruct the compiler to generate code for Turing-based NVIDIA GPUs, the family to which the Quadro RTX 5000 GPUs on Frontera belong. The -o flag specifies the name of the output executable.
Batch File Preparation
Your GPU job batch script looks like any other job script, but GPU jobs can only be submitted to a GPU queue:
- No GPU device is attached to the login nodes, so you cannot test your code there
- Not every Frontera and Vista compute node is equipped with a CUDA device—only the nodes in the GPU queues
- Note that you can request more than one GPU node
- In Frontera's GPU queues, each node has 4 Quadro RTX 5000 cards and 2 Intel Xeon E5-2620 v4 (“Broadwell”) CPUs
- In Vista's GPU queues, each node has 1 GH200 Grace Hopper Superchip, consisting of 1 "Hopper" (H200) GPU and 1 "Grace" CPU
Frontera and Vista schedule batch jobs using the Slurm resource manager. A sample Slurm batch script follows. The queue name used here is rtx-dev, Frontera's GPU development queue. The rtx queue also works for GPU jobs, but you may have to wait longer in the queue.
The following tables give the names and time limits of the Slurm partitions (queues) suitable for GPU jobs on Frontera and Vista, respectively.
Frontera:

| Partition Name | Time Limit | Description |
|---|---|---|
| rtx | 48 hrs | GPU nodes |
| rtx-dev | 2 hrs | GPU development nodes |

Vista:

| Partition Name | Time Limit | Description |
|---|---|---|
| gh | 48 hrs | GPU nodes |
| gh-dev | 2 hrs | GPU development nodes |
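To run the same job on Vista, only the partition line of the sample script needs to change (along with recompiling for the Hopper architecture, as noted above), for example:

```
#SBATCH -p gh-dev             # Vista's GPU development queue
```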
Job Submission
Once your source code is compiled and your batch file is prepared, submit your job using the sbatch command:
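Assuming the script above is saved as gpu_test.sh (a placeholder name):

```
# Submit the batch script to Slurm
sbatch gpu_test.sh
```

Slurm responds with the ID of the submitted job, and the job's standard output will appear in the file named by the -o directive once the job runs.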