The available clauses for the for and do directives are:

private, firstprivate, and reduction

The private, firstprivate, and reduction clauses behave just as they would on a parallel region, except that their scope is the OpenMP loop they modify rather than the enclosing parallel region (which may extend beyond the loop). In particular, if a parallel region contains several consecutive loop constructs, the same variable can be treated differently in each loop.
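
For example, a reduction specified on the loop construct, rather than on the enclosing parallel region, applies only to that loop. The following is a minimal sketch under that assumption; the array, its size, and the variable names are purely illustrative:

#include <stdio.h>

#define N 1000

int main(void) {
    double a[N], sum = 0.0;
    for (int i = 0; i < N; i++) a[i] = 0.5 * i;   /* sample data */

    #pragma omp parallel
    {
        /* the reduction is scoped to this loop construct only: each thread
           accumulates into a private copy of sum, and the copies are
           combined at the end of the loop */
        #pragma omp for reduction(+:sum)
        for (int i = 0; i < N; i++) {
            sum += a[i];
        }
    }

    printf("sum = %f\n", sum);
    return 0;
}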

lastprivate

The lastprivate clause is like the private clause in that it lists variables that are to have private instances in each thread that works on the loop. In addition, though, the value of the variable after the last iteration of the loop is available after the loop is complete. In this definition, the last iteration should be understood to be the iteration that would be performed last in a serial execution of the code. Depending on the load balance and execution environment of the various threads, this "last iteration" may be completed before some of the other iterations are complete, but a lastprivate variable's value doesn't become valid in the enclosing context until the entire OpenMP loop is complete.
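
A minimal sketch of lastprivate (the loop body and variable names are illustrative): after the loop is complete, x holds the value it was assigned in the sequentially last iteration.

#include <stdio.h>

#define N 100

int main(void) {
    double x = 0.0;

    /* each thread gets a private x; when the loop is complete, x in the
       enclosing context takes the value from iteration i == N-1 */
    #pragma omp parallel for lastprivate(x)
    for (int i = 0; i < N; i++) {
        x = 0.5 * i;
    }

    printf("x after the loop = %f\n", x);   /* 49.5, from i == 99 */
    return 0;
}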

ordered

The ordered clause on a loop construct indicates that OpenMP ordered constructs will be found inside the loop. It does not mean that the entire loop will be executed in order. (If it did, there would be no benefit from using OpenMP!) See more on the OpenMP ordered construct here.
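
A minimal sketch of the ordered clause together with an ordered construct (the loop body is illustrative): only the block inside the ordered construct executes in iteration order, while the rest of each iteration still runs in parallel.

#include <stdio.h>

int main(void) {
    /* the ordered clause flags that an ordered construct appears inside */
    #pragma omp parallel for ordered
    for (int i = 0; i < 20; i++) {
        double result = 0.5 * i * i;       /* computed in parallel */

        /* this block alone is executed in sequential iteration order */
        #pragma omp ordered
        {
            printf("i = %d, result = %f\n", i, result);
        }
    }
    return 0;
}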

schedule

The schedule clause on a loop construct can take one of four kind values. The first three kinds can take an optional chunk-size argument. An example showing the clause in use follows the descriptions below.

static
static scheduling means that the iterations of the loop will be assigned to the threads in chunks of size chunk-size, if it is specified, in round-robin order according to the thread numbers. If the chunk size is not specified, the iterations will be divided up into chunks of approximately equal numbers of iterations and one will be assigned to each thread. Even in the default case, the assignment of work to threads is done in order according to the thread numbers.
dynamic
dynamic scheduling means that chunks (of size chunk-size, if specified, otherwise 1) are assigned to threads as the threads become free. This scheme is advantageous when the programmer knows that some iterations will take significantly longer than others. On the other hand, there is more overhead involved in dynamic scheduling, especially with the default chunk-size, so it isn't always better than static scheduling.
guided
guided scheduling is a bit more complicated. It is like dynamic scheduling in the sense that chunks of iterations are assigned to threads as they become free. The initial chunk size is proportional to the number of iterations remaining divided by the number of participating threads; the proportionality constant is implementation-dependent. If a chunk size is specified, it is used as a lower bound on the chunk sizes that are assigned; the default chunk size is 1. Thus, if the number of remaining iterations divided by the number of threads (times the proportionality constant) is greater than the chunk size, the next thread that needs work gets that many iterations. In the default case, the first thread to request work gets more iterations than the second (because fewer iterations remain after some have been assigned to the first thread), and so on, until at the end iterations are handed out one at a time. This should have less overhead than standard dynamic scheduling, because relatively large chunks of iterations are handed out at first; near the end, iterations are handed out in smaller chunks so that all of the threads finish as close to the same time as possible.
runtime
runtime scheduling is controlled by the kind and chunk-size values that have been specified using the OMP_SCHEDULE environment variable. If that variable hasn't been set, the behavior is implementation-defined.
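
As a concrete illustration, here is a sketch of the schedule clause on a loop; the kind, chunk size, and loop body are arbitrary choices for the example:

#include <stdio.h>

#define N 1000

int main(void) {
    double work[N];

    /* dynamic schedule, chunk size 8: each idle thread grabs the next
       8 iterations, which helps when iteration costs vary widely */
    #pragma omp parallel for schedule(dynamic, 8)
    for (int i = 0; i < N; i++) {
        work[i] = 0.5 * i;
    }

    printf("work[N-1] = %f\n", work[N - 1]);
    return 0;
}

With schedule(runtime) in place of schedule(dynamic, 8), the same loop would instead take its kind and chunk size from the OMP_SCHEDULE environment variable, e.g. OMP_SCHEDULE="guided,4".
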
collapse(n)

A collapse(n) clause specifies how many loops (n) in a loop nest should be collapsed into one larger loop and divided among threads according to the schedule clause. The order of the iterations in the collapsed loop is the order they would have in a sequential execution of all the associated loops. The collapse clause can be very helpful in achieving performance when the number of iterations in the outer loop is smaller than the number of available threads (an increasingly common scenario on many-core systems like Xeon Phi). Because collapsing increases the number of iterations to be scheduled, static scheduling is usually a reasonable default. The iterations must not depend on each other in any way.


#pragma omp parallel for private(jj) collapse(2)
for (ii = 0; ii < 10; ii++) {
    for (jj = 0; jj < 100; jj++) {
        ⋮
    }
}

An important caveat is that the loops being collapsed must not have any interdependencies in their iteration variables, or the behavior is undefined; typical "rectangular" loops (as in the example just shown) work well with collapse.

nowait

There is an implied barrier at the end of every loop construct: all threads synchronize there before any more code is executed. If the nowait clause is specified on the for construct, that barrier is removed. In Fortran, the nowait clause doesn't appear on the OpenMP do construct, but on the END DO directive instead. The nowait clause is most useful when there are several independent loops in the same parallel region: threads that have completed work on one loop can continue on to the next loop without waiting for all of the threads to complete their work on the earlier loop. However, lastprivate or reduction variables from a loop construct do not become valid until a barrier has been encountered after their loop is complete.
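
A minimal sketch of nowait with two independent loops in one parallel region (the arrays and loop bodies are illustrative); removing the barrier after the first loop is safe here only because the second loop does not use the results of the first:

#include <stdio.h>

#define N 1000

int main(void) {
    double a[N], b[N];

    #pragma omp parallel
    {
        /* no barrier at the end of this loop: threads that finish their
           share of the first loop move on to the second immediately */
        #pragma omp for nowait
        for (int i = 0; i < N; i++) {
            a[i] = 0.5 * i;
        }

        /* independent of a[], so it is safe to start before the first
           loop is entirely finished */
        #pragma omp for
        for (int i = 0; i < N; i++) {
            b[i] = 2.0 * i;
        }
    }

    printf("a[1] = %f, b[1] = %f\n", a[1], b[1]);
    return 0;
}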
