Problems Running Jobs
There can be many reasons why jobs don't advance in the queue or fail after they start running. Here we cover a few common issues and suggest some things to try.
Job gets "stuck" in the queue
When your job doesn't seem to be making progress in the queue, it might not be your fault at all, but rather a matter of other jobs having priority over yours for a variety of reasons. Some of the factors that influence scheduling on Frontera are:
- Fairshare - Jobs submitted by users who haven't run many jobs recently will get some degree of priority over jobs submitted by users who have been using resources heavily.
- Reservations - Blocks of time may be reserved on all or part of Frontera for special situations such as workshops, full-scale runs, or maintenance windows. A job will not be scheduled if it would interfere with such a reservation.
- Backfill - Sometimes a small, quick job can run to completion while larger, queued jobs are waiting for nodes to free up. If so, the smaller job will be scheduled ahead of the larger ones.
- Queue limits - As outlined in Queues, all users are subject to queue-dependent limits on the number of jobs that they can run at one time. Jobs in excess of the limit will remain in the queue, even if resources are available.
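Frontera's scheduler is Slurm, so you can ask it directly why a job is still pending. The commands below are standard Slurm utilities; 123456 is a placeholder for your actual job ID.

```bash
# List your queued and running jobs, including the reason a job is pending
squeue -u $USER

# Ask Slurm for its current estimate of when a pending job will start
squeue --start -j 123456

# Break a pending job's priority into its components, including fairshare
sprio -j 123456

# List active or upcoming reservations that could be delaying your job
scontrol show reservation
```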
No data from job
When there is no output, there is a strong chance that the job simply failed. Check the stderr and stdout files (the job ID is usually part of the filenames) to make sure nothing went wrong.
Another possibility is that the output files were actually written, but not to the place you expected. In particular, if your application writes any files to /tmp, be sure that your batch script copies these files to $WORK, $HOME, or $SCRATCH prior to exiting.
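As a concrete illustration, the fragment below sketches how a batch script might stage results out of /tmp before it exits; my_app, its --outdir flag, and the queue settings are hypothetical placeholders for your own application and resource needs.

```bash
#!/bin/bash
#SBATCH -J mytest            # job name (hypothetical)
#SBATCH -o mytest.o%j        # stdout file; %j expands to the job ID
#SBATCH -e mytest.e%j        # stderr file
#SBATCH -p development       # queue; adjust for your workload
#SBATCH -N 1
#SBATCH -n 56
#SBATCH -t 00:30:00

# Run the application; here it is assumed to write its results to /tmp
./my_app --outdir /tmp/my_app_out

# /tmp is node-local and is cleaned up when the job ends, so copy the
# results to permanent storage before the script exits
cp -r /tmp/my_app_out $SCRATCH/my_app_out.$SLURM_JOB_ID
```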
File or command not found
Generally a missing file is due to a path problem. For example, you will get an error if you try to launch an MPI executable with ibrun my_prog, because your $PATH does not automatically include the current working directory. The correct syntax is ibrun ./my_prog (assuming the file is in the current working directory). Similarly, be sure to load any environment modules needed to set up $PATH and $LD_LIBRARY_PATH for your job, either at the time you submit the job or explicitly in your batch script.
Also, as a rule, full paths into your directory space at TACC should start with $WORK, $HOME, or $SCRATCH.
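Putting these points together, a batch script might look like the sketch below. The module names and my_prog are examples only; load whatever modules match the compiler and MPI stack your code was actually built with.

```bash
#!/bin/bash
#SBATCH -N 2
#SBATCH -n 112
#SBATCH -p normal
#SBATCH -t 01:00:00

# Load the modules that provide the MPI stack the executable was built
# against, so $PATH and $LD_LIBRARY_PATH are set correctly at run time
module load intel impi

# Launch with an explicit ./ prefix: the working directory is not on $PATH
ibrun ./my_prog

# Refer to your own storage through $WORK, $HOME, or $SCRATCH rather
# than hard-coded absolute paths
cp results.dat $WORK/my_project/
```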
Reason for failure is unclear
Perhaps your job failed because it was trying to do too much at once. Some things to try:
- Break down your batch script into smaller steps that run on fewer nodes, and test the steps separately or incrementally.
- Give yourself more output: echo the name of each step prior to taking that step in the script, list the contents of relevant directories, echo the values of environment variables, etc.
- Be sure you understand what your job is doing at each stage so you can diagnose the underlying issue.
Starting an interactive job with idev and executing the job script line by line, while checking the output at each step, may be helpful in this regard.
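As a rough sketch of this kind of instrumentation: bash's set -x option makes a batch script echo every command as it runs, which pins down exactly where a failure occurs. The step names and programs below are hypothetical.

```bash
#!/bin/bash
#SBATCH -N 1
#SBATCH -n 56
#SBATCH -p development
#SBATCH -t 00:30:00

# Echo every command before it runs, so the stdout/stderr files show
# exactly how far the script got before failing
set -x

echo "step 1: preprocessing"     # hypothetical steps and programs
./preprocess input.dat

echo "step 2: main computation"
ibrun ./my_prog

echo "step 3: inspecting results"
ls -l $SCRATCH/results/
```

To run the same steps interactively instead, something like idev -p development -N 1 -m 30 (queue, node count, and minutes) should open a session on a compute node where you can paste these commands one at a time; check idev's help output for the exact options available.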