Cornell Virtual Workshop: Case Study: Profiling and Optimization on Advanced Cluster Architectures

Roadmap: Case Study: Profiling and Optimization on Advanced Cluster Architectures

Application ContextPerformance AnalysisVectorization & ParallelizationAppendix

This case study details how to use various profiling tools to assess computational performance and possible optimization strategies that can improve performance. This walk-through focuses on the Stampede2 KNL nodes at TACC, but the tools and approaches are generally applicable to a variety of advanced computational architectures.

The complex internal structure of modern processors makes such profiling and analysis tools especially important, as one is often not able to predict computational performance a priori in the face of so many contributing factors. Analyses discussed here include:

Identification of code hotspots
Scaling of computational runtimes with number of nodes
Identification of performance limitations due to inefficient communication with memory
Profiling of vectorization performance

Optimizations emphasized here primarily involve the use of various compiler options and small compiler directives that can be added to code, as opposed to large-scale restructuring of algorithms or data structures to improve performance.

The material is based on a webcast presentation given by Shiquan Su (NCAR) and Chad Burdyshaw (NICS) on October 17, 2017, as part of the ECSS Symposium Series. This tutorial augments that presentation by integrating video clips from the webcast with a textual summary of key points, along with further explanations and remarks. The latter part of the joint presentation by Burdyshaw, which leverages the use of the Intel performance analysis tools (especially VTune Amplifier XE or "VTune"), is emphasized here, as it should be pertinent to a wide range of application codes. The first part of the presentation by Su focuses more on the scientific and numerical background of the particular code being analyzed. The key take-home messages from Su's presentation are summarized in the Application Context topic (and expanded upon in the Appendix), which serves as a prelude to the more generally applicable topics, Performance Analysis and Vectorization & Parallelization.

Objectives

After you complete this workshop, you should be able to:

Use Intel analysis tools to profile code performance on Intel-based computational clusters.
Specify compiler options and code directives based on profiling analytics to improve computational performance.
Use analysis tools to probe different aspects of processor architecture and performance on a large cluster.

Prerequisites

There are no specific prerequisites for learning the information contained in this topic. If one is interested in applying these analyses to one's own application code, then access to Intel compilers and the Intel performance analysis tools (VTune, Advisor, Parallel Studio) is required.

Requirements

There are no requirements for this roadmap.