Monitoring and Canceling Jobs
squeue
When a job has been successfully submitted with sbatch, it will return a job number:
$ sbatch -p development --tasks-per-node 1 -N 1 -t 00:02:00 myJob.sh
...
Submitted batch job 838514
Then you can run squeue and look for the job number, to see if the job is still in the queue or if it's already running:
$ squeue -u tg459572
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
838514 development myJob.sh tg459572 PD 0:00 1 (Resources)
In this case, we see that the job is in the "pending" (PD) state while it awaits resources. Other possible states are "running" (R) and "completing" (CG); in those states, squeue will list the nodes that were allocated to your job. Generally you'll want to run squeue -u with your user ID as the argument, because without any arguments, squeue retrieves information on all the scheduled jobs from every user on Frontera.
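For scripting, submission and status lookup can be combined. The following is a minimal sketch rather than an official TACC recipe: the helper names job_id_from_sbatch and describe_state are hypothetical, and the parser assumes sbatch prints the "Submitted batch job <number>" line shown earlier.

```shell
# Hypothetical helper: read sbatch's output on stdin and print the job number
# taken from the "Submitted batch job <N>" line.
job_id_from_sbatch() {
    awk '/^Submitted batch job [0-9]+/ { print $4; exit }'
}

# Hypothetical helper: translate the state codes discussed above into words.
describe_state() {
    case "$1" in
        PD) echo "pending" ;;
        R)  echo "running" ;;
        CG) echo "completing" ;;
        *)  echo "other ($1)" ;;
    esac
}

# Possible usage on a login node (not executed here):
#   jobid=$(sbatch -p development -N 1 -t 00:02:00 myJob.sh | job_id_from_sbatch)
#   describe_state "$(squeue -h -j "$jobid" -o %t)"
# squeue's -h option omits the header line and -o %t prints only the state code.
```

Only the three states named in the text are spelled out; any other code that squeue reports is passed through unchanged.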
The squeue --start command gives rough estimates of the start times for pending jobs that are sufficiently high in the queue. To learn the expected start time of the job above, you would type:
$ squeue --start -j 838514
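After a job has finished, you will find its stdout and stderr streams in files named with the job number. Unless the job script specifies otherwise, Slurm writes these files to the directory from which the job was submitted (your $HOME directory, if you submitted from there). You may examine these output files to verify that the job ran successfully.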
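If you would rather have a script wait for the job before you examine its output files, one option is to poll squeue until the job no longer appears. This is a sketch under stated assumptions: wait_for_job is a hypothetical helper name, the 30-second default interval is arbitrary, and an empty squeue listing is taken to mean the job has left the queue.

```shell
# Hypothetical helper: block until the given job no longer appears in squeue.
# Usage: wait_for_job <jobID> [poll_seconds]
wait_for_job() {
    jobid=$1
    interval=${2:-30}   # assumed default; choose a value that is kind to the scheduler
    # -h omits the header line, so empty output means the job is no longer listed.
    while [ -n "$(squeue -h -j "$jobid" 2>/dev/null)" ]; do
        sleep "$interval"
    done
}

# Possible usage (not executed here):
#   wait_for_job 838514 60
```

A long polling interval is deliberate, since each squeue call places load on the scheduler.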
showq
TACC provides showq as an alternate way of displaying job status within queues/partitions. Like squeue, its default is to show job info from all users. Thus, the most useful form for individual users is:
$ showq -u
This displays all the jobs you own, grouped into four categories: active, waiting, blocked, and completing/errored. A blocked job is one that is not yet able to run due to circumstances such as a scheduled maintenance, or user hold, or job dependency.
scancel
The scancel
command is used to terminate jobs that are running or queued. Though it is capable of providing fine-grained control of exactly how a job is terminated, the most common usage is simply:
$ scancel <jobID>
where <jobID> is the number of a particular job. This causes the job to terminate immediately, regardless of whether it is running or still queued.
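When several jobs need to be canceled, a small wrapper can guard against typos. The sketch below is illustrative only: cancel_jobs and the DRY_RUN variable are hypothetical names, and the function accepts plain numeric job IDs only, skipping anything else.

```shell
# Hypothetical helper: cancel each job ID given as an argument.
# With DRY_RUN=1 set, print the scancel commands instead of running them.
cancel_jobs() {
    for jobid in "$@"; do
        case "$jobid" in
            ''|*[!0-9]*)
                # Skip empty or non-numeric arguments rather than passing them on.
                echo "skipping invalid job ID: $jobid" >&2
                continue
                ;;
        esac
        if [ "${DRY_RUN:-0}" = 1 ]; then
            echo "scancel $jobid"
        else
            scancel "$jobid"
        fi
    done
}

# Possible usage (not executed here):
#   DRY_RUN=1; cancel_jobs 838514 838515   # preview the commands
#   DRY_RUN=0; cancel_jobs 838514 838515   # actually cancel
```

Previewing with DRY_RUN=1 first is a cheap safeguard, because a canceled job cannot be resumed and must be resubmitted.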