Unit Stride
More important than alignment is the fact that memory access is fastest when values are addressed sequentially. With sequential access, every element of each cache line fetched from memory is actually used. Even better, the cache lines can then be loaded into registers as full vectors for vectorized operations. Furthermore, the hardware can more easily anticipate the data that will be needed, allowing it to prefetch accurately. Random or strided access is much slower, because only part of each fetched cache line is used and the access pattern is harder to predict. The bottom line is: to keep the vector units busy rather than waiting for data, it is best practice to use an access pattern that results in a stride of 1.
It is clear enough how to do this with one-dimensional arrays. What about multi-dimensional arrays? In most languages, these are stored in one contiguous region of memory, and the indices are simply mapped onto one-dimensional memory coordinates. C and Fortran approach this mapping in different ways: C is row-major, so the last (rightmost) index varies fastest in memory, whereas Fortran is column-major, so the first (leftmost) index varies fastest. The goal of the programmer is to iterate through the dimensions in a manner that results in sequential (stride 1) access. With that goal in mind, a two-dimensional array is traversed in sequential order in memory when the innermost loop runs over the last index in C, or over the first index in Fortran.
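As a minimal sketch of the C case, the function below fills a small two-dimensional array in memory order. The sizes NROWS and NCOLS and the function name fill_in_memory_order are hypothetical, chosen only for illustration. In Fortran, which is column-major, the loop nest would be reversed so that the first index runs in the innermost loop.

```c
#include <stddef.h>

#define NROWS 4   /* hypothetical sizes, for illustration only */
#define NCOLS 8

/* C arrays are row-major: a[i][j] and a[i][j+1] are adjacent in memory.
   Running j in the innermost loop therefore touches memory with stride 1. */
void fill_in_memory_order(double a[NROWS][NCOLS])
{
    for (size_t i = 0; i < NROWS; i++) {
        for (size_t j = 0; j < NCOLS; j++) {
            /* Store each element's own offset (in elements) from a[0][0]. */
            a[i][j] = (double)(i * NCOLS + j);
        }
    }
}
```

Swapping the two loops would visit the same elements, but consecutive iterations of the inner loop would then be NCOLS elements apart rather than adjacent.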
Again, these "rules" are not as mandatory as they used to be. Compilers are sometimes able to interchange the order of loops to correct a bad-stride issue. Also, KNL, SKX, and ICX are equipped with gather/scatter engines that are able to collect vectors from disparate memory locations, regardless of the offsets from 64-byte boundaries. Nevertheless, it's safest to work with data structures in a vector-friendly way, so that the compiler can readily recognize how to generate SIMD loads and stores, as well as SIMD flops.
Here's a very common example of an unfavorable data layout. Both C and Fortran allow the programmer to define derived data types, which in C are known as structs. Derived types or structs are composed of multiple members based on the elementary types of the language. Iterating over one member across an "array of structs" almost always involves stride > 1, assuming that each struct contains more than one member. (To make matters worse, structs are often stored with padding between or after their members.) Therefore, if structs must be used, consider instead creating a "struct of arrays", which permits iterating with stride = 1 through any one of the several large arrays contained within a single struct. This aids vectorization and is typically much more efficient.
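The two layouts can be sketched in C as follows. The type names point_aos and points_soa, the length N, and the two summing functions are hypothetical and serve only to contrast the access patterns; this is an illustration, not a benchmark.

```c
#include <stddef.h>

#define N 1024  /* hypothetical array length */

/* Array of structs: each element holds x, y, z together, so consecutive
   x values are sizeof(struct point_aos) bytes apart (typically 3 doubles). */
struct point_aos { double x, y, z; };

/* Struct of arrays: each member is its own contiguous array. */
struct points_soa { double x[N], y[N], z[N]; };

/* Summing x over an array of structs skips over y and z: stride > 1. */
double sum_x_aos(const struct point_aos *p, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += p[i].x;
    return sum;
}

/* Summing x in a struct of arrays reads adjacent doubles: stride 1. */
double sum_x_soa(const struct points_soa *p, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += p->x[i];
    return sum;
}
```

Both functions return the same result for the same data; only the distance in memory between successive loads differs.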
Finally, in Fortran 90/95, using array syntax can also be a good idea. The compiler may be able to turn the high-level array expressions into low-level operations that have favorable stride and cache characteristics.