Best Practices for Compilers
One overall recommendation
It's always good practice to try out various code-optimization options on an experimental basis to see if they actually produce faster (or even just functional!) code for your particular application.
Other compiling recommendations
- The default is -O2, but compile with -O3 unless it slows or breaks your code (which rarely happens)
- Optimize your source files as a group with -ipo (Intel) or -fwhole-program (GCC)
- To see CPU features relevant to architecture-specific options, try grep flags /proc/cpuinfo
- Compile this way for debugging Fortran code (only): ifort -O2 -g -CB test.f90 -o test
- Debug options like -CB should not be used in a production compilation!
After you compile, you might want to have a look at the assembly output corresponding to the hot spots in your code. This is not because you want to re-code these areas by hand, but because a visual verification of the assembly can help determine what the different compiler optimizations have (or have not) done. Intel Advisor is an excellent tool for this kind of work. You may notice, for instance, that the compiler has failed to pipeline a loop due to the false assumption of an aliasing problem; in other words, the compiler has conservatively assumed that two array arguments to a function might be parts of the same memory segment. A suitable flag (or a pragma, perhaps) might inform the compiler that aliasing is impossible in this section of the code and greatly speed it up.
Best Practices for Writing Fast Code
The following comments essentially discuss how to handle your compute-intensive loops.
The practice of moving the body of a function into the spot where it would normally be called is referred to as inlining. Why would you want to do this? Calls to functions or subroutines are an extra expense. They take more instructions to execute, and they potentially fill up stack memory (cache, too). But you shouldn't avoid functions altogether: doing so makes for messy, hard-to-read code, and inlining large functions may not help you all that much. The main place to worry about inlining is where small functions are being called in a loop, so you must frequently pay the overhead associated with the calls. Inlining can also help the compiler find vectorization opportunities.
In such places, you'll want to write functions that allow for inlining by the compiler.
With icc, any optimization level higher than -O0 will try to inline functions (by enabling -finline-functions). But it is not a sure thing, and it is done only for functions that are defined in the same file in which they are called (unless you use Intel's -ipo or GCC's -flto). You can get diagnostics about inlining with the -Winline flag in icc and gcc, which warns when a function declared inline could not be inlined.
A different tactic is to use C's inline function specifier.
The modifier gives compilers a hint that they should inline the function if possible. If this works, it should be better than relying on a macro function (e.g., #define CUBE(x) ((x)*(x)*(x))), which is a much cruder way of doing inlining. Macro functions are convenient, but they possess a number of weaknesses: they perform no type checking on arguments; they have limited portability due to their lack of a definite scope for variables; they may suffer from expansion complications; and they may present difficulties for debugging.
The compiler has little chance of vectorizing a loop in which a sequence of items must be extracted from separate structs prior to computing. Looping through a simple array is always more efficient. The second of these two variable declarations will perform far better in loops where the distances between all pairs of particles must be computed:
This is a different way to avoid calling a subroutine inside a loop: pass the entire array in a single call and move the loop inside the routine. It can be a viable alternative to inlining (above). The overhead of the call is paid only once, while new vectorization possibilities are created, just as they would be with inlining.
Note that extra directives may be needed inside the function to assure the compiler that the data being pointed to are properly boundary-aligned for vectorization.
Pointers can be convenient, and can sometimes improve efficiency by essentially allowing one to pass-by-reference rather than pass-by-value. But compilers are often forced to make conservative assumptions about the data that are being pointed to, which gives them difficulty in creating optimizations. (Are the data overlapping? Boundary-aligned? Contiguous?)
Especially avoid using subscript ranges like (2:5,3:7) in the argument list of a subroutine or function call. Reason: the compiler typically deals with this by copying the selected values into a single temporary array and putting the new array on the stack every time the routine is called.
You don't want to pay for costly operations or overhead over and over again, so keep I/O out of your innermost loops. The reason for avoiding I/O is fairly obvious: the OS must be contacted to mediate communication outside the program, which may incur large delays.
Avoid casts or type conversions, implicit or explicit. These are not free; conversions involve moving data between different execution units. (A "smart" compiler may figure out ways around it, though.)
Not only is avoiding conversions highly valuable for vectorization, it tends to help an application's speed in any execution environment.
Factor your arithmetic, especially if the processor features a fused multiply-add (FMA) operation in its instruction set (which is true of all recent Intel processors). In particular, polynomials can always be rewritten as nested binomials:
Tricks like this one (Horner's rule) only really matter in loops; otherwise, they're not worth bothering with. We've seen previously how we can reduce the number of arithmetic operations inside a loop by moving any invariant expressions outside the loop. Likewise, multiplication should be favored over division, as it is a cheaper operation. Calculate a reciprocal outside the loop and multiply inside.
A good compiler can usually do a great job of optimizing code, but you do not have to treat it as a magical machine that you cannot influence. You can write "compiler-friendly" code that has the best chance of being converted into fast instructions.