How do you optimize the performance of a big, computationally intensive code? On an HPC cluster like Stampede at TACC, with 6400 nodes and nearly half a million cores of various kinds, the most beneficial optimizations are likely to be those that improve a code’s scalability, allowing it to run efficiently on more and more processors. At the opposite end of the spectrum, you can delve into optimization guides that encourage you to focus on chip-level tweaks. In reality, the whole range of scales matters when considering how to optimize performance on an HPC system:

  1. Thousands of nodes on Stampede
  2. Multiple processors on a node
  3. Several to many cores on a single processor
  4. Multiple execution units within a processor core

At every level in the hierarchy, parallelism is available and can be exploited. But parallelism is not the whole story of performance; data locality also matters. How close in memory are the data to where the computation will need them? A code may also stress various I/O subsystems: MPI network transfers, database accesses, or Web services could be important, as could storage of data and logs. On a system such as Stampede, overall performance optimization goes well beyond figuring out how to utilize 100% of the cycles of a given processor. Any one of the above factors could turn out to be the “long pole in the tent” that causes a performance hangup.

This module tries to cover the range of possibilities. We start with grand design principles and strategies, then proceed to details of software interfaces, the network interconnect, and processor microarchitectures. This content should guide you toward the level at which to concentrate your efforts, and give you some ideas about how to make your code more efficient.

Originally developed March 2012
Last updated October 2014

Steve Lantz and Andrew Dolgert (original authors)
Cornell Center for Advanced Computing