Exercise - Managing Jobs
In this exercise, we will try out some of the Slurm commands that let you track the progress of your jobs and manage them in the queues. Our batch script in this case doesn't really have to do anything useful, but the job does have to last long enough for you to take certain actions on it.
The following script will suffice. Create it in an editor (or with cat >) and call it simpleJob.sh:
#!/bin/bash
#SBATCH -J simpleJob
#SBATCH -o simpleJob.%j.out
#SBATCH -e simpleJob.%j.err
#SBATCH -p development
#SBATCH -N 1
#SBATCH -n 1
#SBATCH -t 00:10:00
# Uncomment next line and specify a project, if you have more than 1 project
##SBATCH -A <your account>
sleep 300
As you can see, the script's only function is to nap for 5 minutes! In spite of its laziness, the above script does illustrate how to give a name to a job, and how to give names to the files containing the output and error streams that are produced.
Submit the above script with sbatch, taking note of the job number:
$ sbatch simpleJob.sh
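If you would rather not copy the job number by hand, it can be pulled out of the confirmation line that sbatch prints ("Submitted batch job" followed by the number). The following is only a sketch: the function name extract_job_id and the variable first_job_number are made up for illustration, and the submission step is skipped unless sbatch and simpleJob.sh are both present.

```shell
#!/bin/bash
# Print the job number from sbatch's confirmation line
# ("Submitted batch job <number>"); prints nothing if the line is absent.
extract_job_id() {
    printf '%s\n' "$1" |
        sed -n 's/^Submitted batch job \([0-9][0-9]*\).*$/\1/p' |
        head -n 1
}

# Submit only where sbatch is installed, so the sketch is harmless elsewhere.
if command -v sbatch >/dev/null 2>&1 && [ -f simpleJob.sh ]; then
    first_job_number=$(extract_job_id "$(sbatch simpleJob.sh)")
    echo "First job number: $first_job_number"
fi
```

Because the pattern is matched line by line, any banner text printed before the confirmation line is ignored.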
Monitor the progress of your job with one or both of the following Slurm commands:
$ squeue -u $USER
$ showq -u
Now submit a second job that depends on the successful completion of the first.
$ sbatch -d afterok:<first_job_number> simpleJob.sh
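A common slip at this step is to paste the command with the placeholder still in it. As a small, optional safeguard, the dependency argument can be built by a helper that accepts only a numeric job number. The helper name afterok_spec and the job number 123456 are hypothetical; the sketch prints the command rather than running it.

```shell
#!/bin/bash
# Build the argument for "sbatch -d" from a job number.
# Rejects empty or non-numeric input, such as an unreplaced placeholder.
afterok_spec() {
    case "$1" in
        ''|*[!0-9]*)
            echo "not a job number: $1" >&2
            return 1
            ;;
        *)
            echo "afterok:$1"
            ;;
    esac
}

first_job_number=123456   # hypothetical; use the number sbatch reported
if spec=$(afterok_spec "$first_job_number"); then
    # Print the command for inspection instead of submitting it.
    echo "sbatch -d $spec simpleJob.sh"
fi
```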
Can Slurm estimate a start time for this new job?
$ squeue --start -j <second_job_number>
The answer is no; the dependency makes such an estimate impossible, because the scheduler cannot predict what will happen to the first job. For instance, we may choose to cancel the first job, so that the second job is no longer waiting on it and the scheduler must decide its fate according to the prescribed dependency:
$ scancel <first_job_number>
But remember that the second job is supposed to run only if the first one succeeds. So what now happens to the second job? Let's find out:
$ squeue -u $USER
This command may need to be repeated a few times. Eventually, you'll see that the job remains in a pending (PD) state, but the reason changes from "Dependency" to "DependencyNeverSatisfied". The job will stay in that state indefinitely until you cancel it explicitly.
This means you have one final cleanup step to take care of:
$ scancel <second_job_number>
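The check-and-cancel sequence above can also be sketched as a short script. It assumes that squeue's format fields %T (job state) and %r (reason) report PENDING and DependencyNeverSatisfied for the stuck job; the function classify_job and the environment variable SECOND_JOB_NUMBER are hypothetical names, and the polling limit of 30 tries at 10-second intervals is an arbitrary choice.

```shell
#!/bin/bash
# Map a job's squeue state and reason to the action this exercise calls for.
# $1: state (e.g. PENDING), $2: reason (e.g. Dependency)
classify_job() {
    case "$1:$2" in
        PENDING:DependencyNeverSatisfied) echo "cancel" ;;
        PENDING:*) echo "wait" ;;
        :*) echo "gone" ;;      # squeue printed nothing for this job
        *) echo "active" ;;
    esac
}

# SECOND_JOB_NUMBER is a hypothetical variable; export it before running.
if command -v squeue >/dev/null 2>&1 && [ -n "${SECOND_JOB_NUMBER:-}" ]; then
    tries=0
    while [ "$tries" -lt 30 ]; do
        # -h drops the header; %T is the job state, %r the reason
        line=$(squeue -h -j "$SECOND_JOB_NUMBER" -o "%T %r" 2>/dev/null)
        action=$(classify_job "${line%% *}" "${line#* }")
        [ "$action" = "wait" ] || break
        tries=$((tries + 1))
        sleep 10
    done
    echo "Job $SECOND_JOB_NUMBER: $action"
    if [ "$action" = "cancel" ]; then
        scancel "$SECOND_JOB_NUMBER"
    fi
fi
```

Keeping the decision in a separate function means it can be checked with sample state and reason strings without touching the queue.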
(Note that when a job is on indefinite hold like this, you can choose to release it back into the active queue instead of canceling it. This is accomplished with the scontrol release command; see man scontrol for details.)