Shared memory, as explained in the previous topics, is a high-speed memory type that achieves high memory bandwidth because it resides on chip. A programmer may choose to use it to store frequently used data or data shared by multiple threads. However, there is an important consideration in unlocking its full bandwidth: shared memory is divided into equally sized memory modules (or banks), and only if each thread in a warp accesses a different bank can all the memory transfers for the warp be done in parallel.

How does CUDA configure the banks? When shared memory is allocated to each block that is launched by a kernel, the addresses are mapped cyclically to banks, according to the following formula: \[\rm{bank\_index=(address/word\_size)\ mod\ (number\_of\_banks)}\] In NVIDIA devices, the number of banks is fixed at 32, matching the warp size. The word size is 4 bytes by default (i.e., the size of types float and int), though it can also be configured to be 8 bytes (the size of type double). The above formula is applied successively as the shared-memory variables are declared in code. For example, if an array of 16 floats (16 words, or 64 bytes) appears first in the list of shared-memory declarations:

  • The first word is mapped to bank 0, the next word to bank 1, etc., and the last word is mapped to bank 15.
  • The next declared variable, if there is one, would be mapped to bank 16, and so on.
  • If there is a dynamically allocated array, it would get mapped into shared memory after all the static declarations.

As mentioned, the advantage conferred by banks is that shared memory can be accessed in parallel, as long as the words being stored or loaded reside in different banks. In the simplest case, a warp can always access 32 consecutive words in parallel, because consecutive words always fall in different banks. More generally, as long as all the banks that are accessed are unique (i.e., no two threads are accessing different words from the same bank), the requests can be done in parallel. (For those familiar with CPU architecture: the mapping of addresses to banks is analogous to the mapping of addresses to sets in a set-associative cache, with the bank index playing the role of the cache set index.)

However, a bank conflict occurs when two or more threads in a warp request different words from the same bank. A bank cannot retrieve two different words in a single access; therefore, the accesses are serialized.

Suppose you allocated 256 bytes (64 words) in shared memory. A bank conflict occurs if two threads access bytes 0-3 (word 0) and bytes 128-131 (word 32) at the same time, as both words are in bank 0. But if one of the threads accesses bytes 132-135 (word 33) instead, no bank conflict occurs, as that word is in bank 1. Interestingly, if multiple threads request access to the same address, the value at the address is broadcast instead, and no bank conflict occurs.

Knowing which access patterns can cause bank conflicts, you can readily see how to avoid them. For example, let's say a warp is accessing all the rows of a single column in the array __shared__ int a[32][32]. Because the row stride is 32 words, every element of the column falls in the same bank, producing a 32-way bank conflict. The conflict can be eliminated by increasing the number of columns of the array, so that each row starts at a different bank index. Thus, if you instead declare __shared__ int a[32][32 + 1], the warp no longer has conflicts when accessing all the rows of a single column, as each element of the column belongs to a different bank.

Bank Conflict Example

In this example, OFFSET determines how severe the bank conflicts will be. If OFFSET is set to 32, then id * OFFSET % 1024 evaluates to 0, 32, 64, ... , so every thread accesses a word in the same bank, incurring the maximum number of bank conflicts possible. If OFFSET is changed to 1, there is no bank conflict at all. You should run the program a couple of times, as timings from the first run after compiling can be inaccurate.

 
©  |   Cornell University    |   Center for Advanced Computing    |   Copyright Statement    |   Access Statement
CVW material development is supported by NSF OAC awards 1854828, 2321040, 2323116 (UT Austin) and 2005506 (Indiana University)