Pitfalls to Beware
Why does OpenMP need constructs like critical and atomic? The single biggest problem to avoid in shared-memory programming is the race condition, in which two or more threads attempt to update the same memory location at the same time. The "winner" of this race is the thread that comes in last, because its update will be the one that persists. The trouble with a race condition is that the result is unpredictable, as the outcome depends on the exact order in which the competing updates occur. The usual consequence is a nasty type of bug that is hard to track down because the code will tend to fail in an irreproducible way. Such a bug may even stay hidden until a particular scheduling of threads takes place to trigger it. The cure is to enforce mutual exclusion (mutex) in the relevant section of code. OpenMP provides the critical and atomic constructs, as well as the various lock functions (to be discussed later), for just this purpose.
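As a minimal sketch of the cure described above, the two functions below protect a shared update first with atomic and then with critical. The function names count_with_atomic and sum_with_critical are hypothetical, chosen only for illustration. Removing either pragma inside the loop body would reintroduce the race, and the returned value could then differ from run to run.

```c
/* Sketch only: function names are illustrative, not part of OpenMP. */

/* Each thread increments the shared counter; atomic makes every
   increment an indivisible read-modify-write on `count`. */
long count_with_atomic(long n)
{
    long count = 0;
    long i;
    #pragma omp parallel for
    for (i = 0; i < n; i++) {
        #pragma omp atomic
        count++;
    }
    return count;
}

/* Same idea with critical: only one thread at a time may execute
   the enclosed block, so the update to `sum` cannot be interleaved. */
long sum_with_critical(long n)
{
    long sum = 0;
    long i;
    #pragma omp parallel for
    for (i = 1; i <= n; i++) {
        #pragma omp critical
        {
            sum += i;
        }
    }
    return sum;
}
```

Calling count_with_atomic(100000) should return 100000 regardless of the number of threads. For a simple accumulation like this, a reduction clause would usually be faster than either construct; atomic and critical are shown here because they are the mutual exclusion tools under discussion.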
A problem similar to the race condition is a stale value, which occurs when one thread needs the current value of a variable that has been updated by a second thread, but reads an outdated copy instead. To avoid this problem, the updating thread must flush the new value to memory immediately after making the change, and the consuming thread must obtain its working copy after the flush. One solution is to make the update atomic, because the flush is implicit in an atomic operation. However, if the consumer needs not just a current copy of the variable, but also certainty that the update has actually taken place, then some form of explicit synchronization between the threads, such as locks, will be necessary. Sometimes the required synchronization is already provided by other dependencies between the threads.
Finally, the performance of worksharing constructs may suffer due to load imbalance if static scheduling is used and the time per iteration varies significantly. In this case, it is important to explore other approaches such as dynamic or guided scheduling. Even these adaptive scheduling techniques may fail to yield acceptable load balance if an iteration requiring a very long time is among the last to be scheduled. The programmer should take advantage of any heuristics that might help to make the lengthy iterations run early, in order to increase the chances of having better load balance.
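The following sketch illustrates both suggestions under a stated assumption: the cost of iteration i is taken to grow with i, which is a hypothetical workload invented for this example. The loop uses schedule(dynamic, 1) and counts downward so that the most expensive iterations come first in the iteration order. Common runtimes hand out dynamic chunks in that order, although the exact assignment of chunks to threads is up to the implementation.

```c
/* Sketch only: work() and process_longest_first() are illustrative names. */

/* Hypothetical workload whose cost is proportional to i. */
static long work(int i)
{
    long steps = 0;
    int k;
    for (k = 0; k < i; k++)
        steps += 1;
    return steps;
}

/* Dynamic scheduling with a chunk size of 1: a thread that finishes
   an iteration asks for the next one, so cheap iterations at the end
   can fill in the gaps left by the expensive ones started early.
   The reduction clause gives each thread a private partial total. */
long process_longest_first(int n)
{
    long total = 0;
    int i;
    #pragma omp parallel for schedule(dynamic, 1) reduction(+:total)
    for (i = n - 1; i >= 0; i--) {
        total += work(i);
    }
    return total;
}
```

With this workload, process_longest_first(1000) returns 499500, the sum of the integers from 0 to 999. Replacing schedule(dynamic, 1) with schedule(guided) or schedule(static) changes only how iterations are distributed, not the result, which makes it straightforward to time the alternatives against each other.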