Advanced Slurm: Introduction
The Slurm Workload Manager (originally the Simple Linux Utility for Resource Management) is a collection of utilities for managing workloads on compute clusters. Slurm is commonly used to manage the jobs that run on large-scale HPC resources such as Stampede2 and Frontera at TACC, among others. The basic knowledge required to submit, monitor, and control jobs with Slurm on the compute nodes of Stampede2 is provided in the Stampede2 Environment topic, as well as in the Stampede2 User Guide. Similar guidance for Frontera is found in Getting Started on Frontera and in the Frontera User Guide.
This topic is for users who are already familiar with submitting jobs via Slurm, but whose needs go beyond simple batch scripts or interactive jobs. We will discuss some of the lesser-known but powerful features of Slurm that offer potential strategies for setting up advanced workflows such as parameter sweeps. The goal is to impart practical techniques and a broader understanding of Slurm from the user's perspective, without attempting to cover every aspect of Slurm.
Steve Lantz (2021 author), Aaron Birkland (2014 author)
Cornell Center for Advanced Computing
With contributions from:
Texas Advanced Computing Center
Revisions: 8/2021, 4/2014 (original)