When data is in global memory on a GPU it’s usually in row-major or column-major order. Loading from global memory is quite slow, though, so for performance we want to move the data into shared memory for the threads in a warp to work on.
To make that load from global memory performant we want memory reads to be coalesced, meaning we read a contiguous chunk of memory at a time. Shared memory, on the other hand, is divided into banks, typically 32 banks that are 4 bytes wide. If multiple threads in the same warp read or write different addresses in the same bank, the requests are processed sequentially, slowing things down while the threads wait on each other. Nsight and other profiling tools will helpfully point this out to you!
For example, let’s say we’re loading a row-major tensor and a column-major tensor into shared memory, and will be doing a multiplication between them (this is naive, to demonstrate the issue):
__shared__ float Asub[TILE_DIM][TILE_DIM];
__shared__ float Bsub[TILE_DIM][TILE_DIM];   // (no padding in this naive version)

int lane = threadIdx.x;                      // 0...31 (warp lane index)
int tileRow = blockIdx.y * TILE_DIM;
int tileCol = blockIdx.x * TILE_DIM;
int globalRow = tileRow + lane;
int globalCol = tileCol + lane;

// A is row-major: consecutive lanes read consecutive elements of a row
// (coalesced) and write consecutive columns of Asub (different banks).
Asub[0][lane] = A[tileRow * N + globalCol];

// B is column-major: consecutive lanes read consecutive elements of a column
// (coalesced), but the writes walk down column 0 of Bsub.
Bsub[lane][0] = B[globalRow + tileCol * N];
Now when we fill Bsub every thread in the warp writes to the same shared memory bank: each row of Bsub is 32 floats (128 bytes), so walking down a column steps through addresses exactly 32 banks apart, and all 32 writes land in the same bank and are serialized. One easy fix is just to add padding:
__shared__ float Asub[TILE_DIM][TILE_DIM];       // A tile (row-major, no conflict in our case)
__shared__ float Bsub[TILE_DIM][TILE_DIM + PAD]; // B tile (extra column to prevent conflicts)
With PAD as 1 (and TILE_DIM as 32) each row is now 33 floats, or 132 bytes, instead of 128. Walking down a column advances the bank index by one per row, offsetting the writes and ensuring that each thread gets its own bank.
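As a quick sanity check (a sketch, not from the original post), you can compute which bank each lane’s write lands in; with 32 banks of 4 bytes each, a float’s bank is just its element index modulo 32:

// Bank hit by each lane when storing column 0 of the tile.
int bankUnpadded = (lane * TILE_DIM) % 32;         // (lane * 32) % 32 == 0 for every lane: 32-way conflict
int bankPadded   = (lane * (TILE_DIM + PAD)) % 32; // (lane * 33) % 32 == lane: each lane in its own bank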
The downside is that this wastes shared memory, a scarce resource, so an alternative approach is swizzling: changing the layout so that consecutive thread accesses don’t cause bank conflicts. That’s what Bert implemented to get good performance in his recent GEMM walkthrough, but it’s easy to get wrong.
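Here’s a minimal sketch of the idea using a simple XOR swizzle (a common scheme, not necessarily the exact one from the walkthrough): instead of padding, permute the column index with the row index so that writes down a column spread across banks. Reads have to apply the same permutation.

__shared__ float Bsub[TILE_DIM][TILE_DIM];   // no padding needed

int row = lane;
int col = 0;                                 // loading column 0 of the B tile
// Store element (row, col) at column (col ^ row): lanes 0..31 now write
// columns 0..31, one bank each, instead of all hitting column 0.
Bsub[row][col ^ row] = B[(tileRow + row) + (tileCol + col) * N];

// Later reads of (row, col) must use the same swizzled index:
// float b = Bsub[row][col ^ row];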
To make life easier than writing this by hand in raw CUDA, CUTLASS has a system called CuTe. CuTe is a set of C++ templates for expressing the layout of data:
auto tileLayout = make_layout(make_shape(Int<32>{}, Int<32>{}), GenRowMajor{}); // 32x32 row-major tile
auto swizzledLayout = composition(Swizzle<5, 0, 5>{}, tileLayout);              // XOR the 5 row bits into the 5 column bits
Here you specify how the data is laid out with the shape and stride, the composition with Swizzle applies the bank-conflict-avoiding permutation, and the copy operation takes care of translating from the row-major layout in global memory to the swizzled layout in shared memory.
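A hedged sketch of how these layouts might be wired up with CuTe tensors (the pointer names gB_ptr and sB_ptr are placeholders for this sketch, not from the post):

auto gB = make_tensor(make_gmem_ptr(gB_ptr), tileLayout);     // 32x32 tile in global memory, row-major
auto sB = make_tensor(make_smem_ptr(sB_ptr), swizzledLayout); // same tile in shared memory, swizzled
cute::copy(gB, sB);  // the copy translates indices between the two layouts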
From a Triton perspective, Lei Zhang has a great post on how memory access works in Triton, specifically the LinearLayout class that allows the language to similarly handle swizzling and layouts for you:
Indeed the whole point of LLs is that they allow us to specify transposed and swizzled layouts as a “general case”. Instead of a layout class for registers in a thread, and another layout for registers in a thread but in MMAv2 order, and so on, all of these can be represented by different LLs. This gets rid of special cases and lets us write more general code.
There’s a great Colfax report on building GEMMs that covers shared memory bank conflicts, and Lei Mao has a post with a nice illustration. Axel Feldman also has a post benchmarking different approaches, identifying bank conflicts, and covering some more efficient loading techniques.