Exercise

In this exercise, we will script the submission of a series of jobs followed by a dependent job that should run only if all the jobs in the series complete successfully. Each job in the series will have a parameter passed to it via a command-line argument to the batch script. To confirm that the parameter was received correctly, the batch script for these jobs simply prints out its parameter value and exits. The final, dependent job scans the output of the others, collects the lines that reported each job's input parameter, and shows just those lines.

Although the "calculation" in this case is trivial, we will see that scripting the job submissions and correctly defining the dependencies for the final job takes some care.

Create a batch script mainJobScript.sh that does nothing more than sleep for 10 seconds before printing out the value of its input argument. (If needed, supply your preferred account number.)
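One possible version of mainJobScript.sh is sketched below; it is not the workshop's own script. The job name, the queue (normal, as in the transcript later in this exercise), the time limit, and the commented-out placeholder account myAccount are all assumptions to adapt to your system. Because no -o option is given, each job writes to Slurm's default output file, slurm-<jobid>.out. The heredoc merely creates the file from the command line; pasting the body into an editor works equally well.

```shell
# Write the batch script to disk; the quoted 'EOF' keeps $1 and
# ${SLURM_JOB_ID} unexpanded until the job actually runs.
cat > mainJobScript.sh <<'EOF'
#!/bin/bash
#SBATCH -J seriesJob      # job name (illustrative)
#SBATCH -p normal         # queue (assumption; use one you can access)
#SBATCH -N 1              # one node
#SBATCH -n 1              # one task
#SBATCH -t 00:02:00       # short time limit, ample for a 10-second sleep
##SBATCH -A myAccount     # placeholder: remove one '#' and set your account if needed

# Sleep for 10 seconds, then report the job ID and the input argument.
sleep 10
echo "Job ${SLURM_JOB_ID}: parameter value = $1"
EOF
```

The printed line deliberately contains the phrase "parameter value", which is what the reporting job searches for.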
Create another batch script reportScript.sh that processes a series of files whose names incorporate a set of strings (i.e., job IDs) that are passed in as a colon-delimited list. As the script processes these files, it should print out only the lines that include the phrase "parameter value".
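A minimal sketch of reportScript.sh follows; again, it is not the workshop's own script. It assumes that each series job wrote its output to Slurm's default file, slurm-<jobid>.out, in the submission directory. If your jobs set a different name with -o, change the file pattern accordingly. The #SBATCH settings are illustrative.

```shell
# Write the reporting script to disk (quoted 'EOF' prevents expansion here).
cat > reportScript.sh <<'EOF'
#!/bin/bash
#SBATCH -J reportJob      # job name (illustrative)
#SBATCH -p normal         # queue (assumption; use one you can access)
#SBATCH -N 1
#SBATCH -n 1
#SBATCH -t 00:02:00

# $1 is a colon-delimited list of job IDs, for example 101:102:103.
# Split it on ':' into a bash array.
IFS=':' read -r -a job_ids <<< "$1"

for id in "${job_ids[@]}"; do
    outfile="slurm-${id}.out"     # assumed default Slurm output file name
    if [[ -r "$outfile" ]]; then
        # Print only the lines that report a parameter value.
        grep "parameter value" "$outfile"
    else
        echo "Missing output file: $outfile" >&2
    fi
done
EOF
```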
Now set up a series of jobs based on the first script, together with a dependent "postprocessing" job based on the second, using a submission script: series_dependency.sh. This script needs to pass the right input arguments to the series of jobs. It must also collect the job IDs from sbatch as it goes along, so it can prepare a properly formatted argument for the dependent job.
This script has a few noteworthy aspects:
- We use the recipe explained earlier to determine the job ID when submitting mainJobScript.sh.
- We build up a dependency string in the for loop by appending the job ID of each job in the series as it is submitted, then submit the final, dependent job using the full colon-delimited string of job IDs.
- We pass this same string to the second batch script as its argument, so that the script can parse it to find the output files.
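The three points above can be sketched as follows. This is an assumed reconstruction, not the workshop's script: the awk filter stands in for the job ID recipe referred to above, and reading SERIES_COUNT from the environment with a default of 4 (the number of series jobs in the transcript below) is our own choice.

```shell
# Write the submission script to disk (quoted 'EOF' prevents expansion here).
cat > series_dependency.sh <<'EOF'
# series_dependency.sh: intended to be sourced from a login shell.
SERIES_COUNT=${SERIES_COUNT:-4}   # number of jobs in the series (assumed default)
dep_ids=""                        # colon-delimited list of job IDs

for (( i = 0; i < SERIES_COUNT; i++ )); do
    echo "Running sbatch mainJobScript.sh $i"
    # Capture sbatch's output and keep the last field of the line
    # "Submitted batch job <id>" (stand-in for the earlier recipe).
    job_id=$(sbatch mainJobScript.sh "$i" | awk '/^Submitted batch job/ {print $NF}')
    if [[ -z "$job_id" ]]; then
        echo "Submission of job $i failed; not submitting the report job" >&2
        # return when sourced; fall back to exit when run as a script
        return 1 2>/dev/null || exit 1
    fi
    # Append the new ID, inserting ':' only if the list is non-empty.
    dep_ids="${dep_ids:+${dep_ids}:}${job_id}"
done

# The same string serves as the dependency list and as the script argument.
echo "Running sbatch --dependency=afterok:${dep_ids} reportScript.sh ${dep_ids}"
sbatch --dependency=afterok:"${dep_ids}" reportScript.sh "${dep_ids}"
EOF
```

Because the script is meant to be sourced, it uses return rather than a bare exit on failure, so a failed submission does not close your login shell.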

Run the submission script by sourcing it in the current shell:

login3.stampede2(1021)$ source series_dependency.sh
Running sbatch mainJobScript.sh 0
Running sbatch mainJobScript.sh 1
Running sbatch mainJobScript.sh 2
Running sbatch mainJobScript.sh 3
Running sbatch --dependency=afterok:8095643:8095644:8095645:8095646 reportScript.sh 8095643:8095644:8095645:8095646

-----------------------------------------------------------------
          Welcome to the Stampede2 Supercomputer                 
-----------------------------------------------------------------

No reservation for this job
--> Verifying valid submit host (login3)...OK
--> Verifying valid jobname...OK
--> Enforcing max jobs per user...OK
--> Verifying availability of your home dir (/home1/00933/tg459572)...OK
--> Verifying availability of your work2 dir (/work2/00933/tg459572/stampede2)...OK
--> Verifying availability of your scratch dir (/scratch/00933/tg459572)...OK
--> Verifying valid ssh keys...OK
--> Verifying access to desired queue (normal)...OK
--> Verifying job request is within current queue limits...OK
--> Checking available allocation (TG-CDA170005)...OK
--> Verifying that quota for filesystem /home1/00933/tg459572 is at  3.45% allocated...OK
--> Verifying that quota for filesystem /work2/00933/tg459572/stampede2 is at  0.00% allocated...OK
Submitted batch job 8095647

Now take a look at the jobs in the queue. All the jobs in the series should be visible, as well as the dependent job. The final dependent job should be blocked from execution ("PD" = pending) with a reason of "(Dependency)".
After all jobs finish, look at the output of the final dependent job. It should list each series job's unique Slurm job ID together with its input argument.
Try invoking the script with different values of SERIES_COUNT and/or intentionally canceling individual jobs in the series, to verify that everything behaves as you would expect.
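The checks in the last three steps can be wrapped in two small shell functions, sketched below. The function names show_my_jobs and cancel_and_check are hypothetical; squeue -u and scancel are standard Slurm commands. With an afterok dependency, canceling a series job means the dependency can never be satisfied, so on many systems the final job stays pending with the reason DependencyNeverSatisfied until you cancel it too. Site configuration may remove it automatically instead.

```shell
# Hypothetical helper: list only your own jobs. In squeue's default layout,
# the last column shows the pending reason, such as (Dependency).
show_my_jobs() {
    squeue -u "$USER"
}

# Hypothetical helper: cancel one job by ID, then show what remains queued.
cancel_and_check() {
    scancel "$1"
    show_my_jobs
}

# Example on a cluster, using a job ID from the transcript above:
#   cancel_and_check 8095644
```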
