When requesting a large allocation on HPC resources, it is important to know how well your code scales. You will need to measure or estimate how your code performs when given two, four, or eight times as much of a resource. You may already know how well it runs on 64 processors, but how will it run on 128? Can it run on 32,768 cores, or would that be a waste of resources? This page and the next aim to help you answer these questions.

The typical way to estimate an application's scalability is to run a benchmark for that application using several different numbers of tasks and plot the resulting speedup on a log-log plot to see if and when it tails off from the ideal line. For optimizing and tuning applications, benchmarks of this kind tell you when you have made progress or regressed and, just as importantly, help you decide when you are done.
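For example, here is a minimal sketch of that kind of strong-scaling analysis, assuming you have already collected wall-clock times for your own benchmark at several task counts. The task counts, timings, and output filename are placeholders, not real measurements.

```python
# Minimal strong-scaling analysis sketch: compute speedup from measured
# wall-clock times and compare it to the ideal line on a log-log plot.
import matplotlib.pyplot as plt

tasks = [64, 128, 256, 512, 1024]               # task counts benchmarked (hypothetical)
wall_time = [410.0, 212.0, 118.0, 71.0, 52.0]   # measured seconds (hypothetical)

# Speedup relative to the smallest run; ideal speedup doubles when tasks double.
speedup = [wall_time[0] / t for t in wall_time]
ideal = [n / tasks[0] for n in tasks]

plt.loglog(tasks, ideal, "k--", label="ideal")
plt.loglog(tasks, speedup, "o-", label="measured")
plt.xlabel("number of tasks")
plt.ylabel("speedup relative to smallest run")
plt.legend()
plt.savefig("scaling.png")   # look for where the measured curve tails off from the ideal line
```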

But what if your application is still under development, or if you don't have access to the target resource? Other types of benchmarks may help you understand what kind of performance to expect. There are three rough categories of benchmarks to consider: hardware, synthetic, and application.

  • Hardware benchmarks (also known as micro-benchmarks) measure low-level capabilities such as processor floating-point speed, point-to-point bandwidth, and write speed to disk. They are written to avoid interference from the operating system and try to probe individual parts of the machine, one at a time. These benchmarks tell you the maximum capability of each piece that is relevant to modeling the computer system; a minimal sketch of one appears after this list.
  • Synthetic benchmarks focus on the performance of individual algorithms. The application section of the NAS Parallel Benchmarks is a good example: it includes separate tests of pentadiagonal and block tridiagonal solvers. Synthetic benchmarks show how parts of the computer behave when completing a well-understood task, and they can demonstrate the balance of an architecture under a particular load.
  • Application benchmarks measure how much useful work the system does. Think of an application as a set of synthetic algorithms with data movement in between. The final numbers of an application benchmark should mean something for the productivity of the end user. They are often reported in "wall clock" time instead of CPU time because the difference, such as the time the CPU spends waiting for I/O, is still quite important to the end user.
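
As an illustration of the first category, here is a minimal hardware-style micro-benchmark in the spirit of the STREAM "copy" kernel. The array size, number of trials, and the choice to report the best trial are illustrative assumptions, not part of any standard suite.

```python
# Minimal memory-bandwidth micro-benchmark: time a large array copy and
# report the sustained rate in GB/s.
import time
import numpy as np

n = 20_000_000                       # ~160 MB per array, well beyond cache sizes
src = np.random.rand(n)
dst = np.empty_like(src)

best = float("inf")
for _ in range(5):                   # take the best of several trials
    t0 = time.perf_counter()
    np.copyto(dst, src)              # one read and one write per element
    best = min(best, time.perf_counter() - t0)

bytes_moved = 2 * n * src.itemsize   # read src + write dst
print(f"Copy bandwidth: {bytes_moved / best / 1e9:.1f} GB/s")
```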

You can use the three kinds of benchmarks to build up an argument about your application's expected behavior, in much the same way that an argument about the expected behavior of a complex system is built up from effects at multiple scales. The most important one for determining scalability, though, is your application benchmark. You should define it so that it maps directly to the real-world results you care about. This could be the number of simulations per day, or the number of simulation steps per minute, if the two relate linearly. It needs to be something you can reproduce, respecting any requirements to run it on multiple platforms.
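
As one possible way to record such a figure of merit reproducibly, the sketch below times a stand-in for an application run using the wall clock and reports simulation steps per minute. The run_simulation function and its step count are hypothetical placeholders for your own application and its units of work.

```python
# Minimal application-benchmark sketch: time a run with the wall clock and
# report a figure of merit that maps to end-user productivity.
import time

def run_simulation(steps):
    # Stand-in for the real application kernel; replace with your own code.
    total = 0.0
    for i in range(steps):
        total += i * 0.5
    return total

steps = 1_000_000
t0 = time.perf_counter()             # wall-clock time, not CPU time
run_simulation(steps)
elapsed = time.perf_counter() - t0

print(f"{steps / (elapsed / 60.0):.0f} simulation steps per minute")
```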

 