Best Practices for Compilers
One overall recommendation
It's always good practice to try out various code-optimization options on an experimental basis to see if they actually produce faster (or even just functional!) code for your particular application.
Other compiling recommendations
- The default is -O2, but compile with -O3 unless it slows or breaks your code (which rarely happens)
- Optimize your source files as a group with -ipo (Intel) or -fwhole-program (GCC)
- To see CPU features relevant to architecture-specific options, try grep flags /proc/cpuinfo
- Compile this way for debugging Fortran code (only): ifort -O2 -g -CB test.f90 -o test
- Debug options like -CB should not be used in a production compilation!
After you compile, you might want to have a look at the assembly output corresponding to the hot spots in your code. This is not because you want to re-code these areas by hand, but because a visual verification of the assembly can help determine what the different compiler optimizations have (or have not) done. Intel Advisor is an excellent tool for this kind of work. You may notice, for instance, that the compiler has failed to pipeline a loop due to the false assumption of an aliasing problem; in other words, the compiler has conservatively assumed that two array arguments to a function might be parts of the same memory segment. A suitable flag (or a pragma, perhaps) might inform the compiler that aliasing is impossible in this section of the code and greatly speed it up.
Best Practices for Writing Fast Code
The following comments essentially discuss how to handle your compute-intensive loops.
The practice of moving the body of a function into the spot where it would normally be called is referred to as inlining. Why would you want to do this? Calls to functions or subroutines are an extra expense. They take more instructions to execute, and they potentially fill up stack memory (cache, too). But you shouldn't avoid functions altogether: doing so makes for messy, hard-to-read code, and inlining large functions may not help you all that much. The main place to worry about inlining is where small functions are being called in a loop, so you must frequently pay the overhead associated with the calls. Inlining can also help the compiler find vectorization opportunities.
In such places, you'll want to write functions that allow for inlining by the compiler.
With icc, any optimization level higher than -O0 will try to inline functions (by enabling -finline-functions). But it is not a sure thing, and it is done only for functions that are defined in the same file in which they are called (unless you use Intel's -ipo or GCC's -flto). You can get diagnostics about inlining with the -Winline flag in icc and gcc, which warns when a function declared inline could not be inlined.
A different tactic is to use C's inline function specifier.
The modifier gives compilers a hint that they should inline the function if possible. If this works, it should be better than relying on a macro function (e.g., #define CUBE(x) ((x)*(x)*(x))), which is a much cruder way of doing inlining. Macro functions are convenient, but they possess a number of weaknesses: they perform no type checking on arguments; they have limited portability due to their lack of a definite scope for variables; they may suffer from expansion complications; and they may present difficulties for debugging.
The compiler has little chance of vectorizing a loop in which a sequence of items must be extracted from separate structs prior to computing. Looping through a simple array is always more efficient. The second of these two variable declarations will perform far better in loops where the distances between all pairs of particles must be computed:
This is a different way to avoid calling a subroutine inside a loop: pass the entire array in a single call and move the loop inside the routine. It can be a viable alternative to inlining (above). The overhead of the call is paid only once, while new vectorization possibilities are created, just as they would be with inlining.
Note that extra directives may be needed inside the function to assure the compiler that the data being pointed to are properly boundary-aligned for vectorization.
Pointers can be convenient, and can sometimes improve efficiency by essentially allowing one to pass-by-reference rather than pass-by-value. But compilers are often forced to make conservative assumptions about the data that are being pointed to, which gives them difficulty in creating optimizations. (Are the data overlapping? Boundary-aligned? Contiguous?)
Especially avoid using subscript ranges like (2:5,3:7) in the argument list of a subroutine or function call. Reason: the compiler typically deals with this by copying the selected values into a single temporary array and putting the new array on the stack every time the routine is called.
You don't want to pay for costly operations or overhead over and over again, so keep I/O out of your innermost loops. The reason for avoiding I/O is fairly obvious: the OS must be contacted to mediate communication outside the program, which may incur large delays.
Avoid casts or type conversions, implicit or explicit. These are not free; conversions involve moving data between different execution units. (A "smart" compiler may figure out ways around it, though.)
Not only is avoiding conversions highly valuable for vectorization, it tends to help an application's speed in any execution environment.
Factor your arithmetic, especially if the processor features a fused multiply-add (FMA) operation in its instruction set (which is true of all recent Intel processors). In particular, polynomials can always be rewritten as nested binomials:
Tricks like this one (Horner's rule) only really matter in loops; otherwise, they're not worth bothering with. We've seen previously how we can reduce the number of arithmetic operations inside a loop by moving any invariant expressions outside the loop. Likewise, multiplication should be favored over division, as it is a cheaper operation. Calculate a reciprocal outside the loop and multiply inside.
A good compiler can usually do a great job of optimizing code, but you do not have to treat it as a magical machine that you cannot influence. You can write "compiler-friendly" code that has the best chance of being converted into fast instructions.