Analyzing Modern GPU Cores

[2503.20481] Analyzing Modern NVIDIA GPU cores

Filing this under interesting work I will probably never use. The authors construct a more accurate simulation of the Ampere microarchitecture (A100/RTX 30-series class GPUs), backed by extensive testing on real hardware. It’s a good reminder that, in part because of how good some of the abstractions are, there is quite a lot about Nvidia GPUs that isn’t really known outside Nvidia. My main takeaway was that the compiler is very deeply coupled to the hardware’s performance: an Nvidia chip is not really a complete unit without taking into account the software driving it, and recognizing that goes some way to explaining why Nvidia have done such a good job of building a solid stack with CUDA.

One of the things I found interesting was the use of a Stall counter: the compiler notes fixed-latency instructions (which seem to be a preferred design choice) and adds a counter to the instruction’s control bits that specifies how many cycles the warp should wait before issuing its next instruction, so that other warps can be selected for execution in the meantime. This means the hardware doesn’t have to dynamically check for data dependencies.

For example, an addition whose latency is four cycles and its first consumer is the following instruction encodes a four in the Stall counter. Using the methodology explained in section 3, we have verified that if the Stall counter is not properly set, the result of the program is incorrect since the hardware does not check for RAW hazards, and simply relies on these compiler-set counters. In addition, this mechanism has benefits in terms of area and energy wiring. Keep in mind that wires from fixed-latency units to the dependence handling components are not needed, in contrast to a traditional scoreboard approach where they are required.
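To make that concrete, here is a minimal sketch of issue logic that trusts a compiler-set stall count instead of checking register dependencies. This is my own toy model in Python, not the paper’s simulator or the real control-bit encoding; the instruction and warp fields are assumptions for illustration.

```python
# Toy model of compiler-set Stall counters: the scheduler never inspects
# source/destination registers, it just waits out the cycle count the
# compiler wrote into each instruction's control bits.
from dataclasses import dataclass

@dataclass
class Instr:
    op: str
    stall: int            # cycles before this warp may issue its next instruction

@dataclass
class Warp:
    instrs: list
    pc: int = 0
    stall_left: int = 0   # counts down each cycle

def step(warps):
    """Advance one cycle and issue at most one instruction from a ready warp."""
    for w in warps:
        if w.stall_left > 0:
            w.stall_left -= 1
    for w in warps:
        if w.stall_left == 0 and w.pc < len(w.instrs):
            instr = w.instrs[w.pc]
            w.pc += 1
            w.stall_left = instr.stall   # hardware trusts the compiler's count
            return f"issued {instr.op}"
    return "stalled"

# A 4-cycle add whose result feeds the very next instruction carries stall=4;
# if the compiler wrote 0 here, the consumer would silently read a stale value.
w = Warp([Instr("FADD R0, R1, R2", stall=4), Instr("FADD R3, R0, R0", stall=1)])
for cycle in range(6):
    print(cycle, step([w]))
```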

Variable-latency instructions, like memory loads, are handled differently: they use a Dependence counter instead, which is decremented when the data arrives, so a dependent instruction can wait for it to reach zero.
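A rough sketch of how that might sit alongside the stall counters, again purely my own illustration: the number of counters per warp and the way a consumer names which counters it waits on are assumptions, not the actual encoding.

```python
# Toy per-warp dependence counters for variable-latency instructions: a load
# bumps a counter when it issues, the memory system decrements it when the
# data returns, and a consumer's control bits name which counters must be
# zero before it can issue. (Counter count and naming are assumptions.)
class WarpDeps:
    def __init__(self, n_counters=6):
        self.counters = [0] * n_counters

    def issue_load(self, idx):
        self.counters[idx] += 1          # one more outstanding result

    def data_arrived(self, idx):
        self.counters[idx] -= 1          # memory system signals completion

    def can_issue(self, waits_on):
        # waits_on: set of counter indices the instruction is marked to wait on
        return all(self.counters[i] == 0 for i in waits_on)

deps = WarpDeps()
deps.issue_load(0)              # LDG R4, [R2]   tracked by counter 0
print(deps.can_issue({0}))      # False: FADD R5, R4, R4 must wait
deps.data_arrived(0)            # data comes back some unknown number of cycles later
print(deps.can_issue({0}))      # True: the consumer may now issue
```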

In the vein of handing off to the compiler, the scheduler uses a Compiler Guided Greedy Then Youngest policy: it keeps issuing instructions from the same warp (greedy), guided by the Stall counter (and an explicit Yield bit), and otherwise switches to the youngest ready warp. Older GPUs (apparently!) used Greedy Then Oldest instead, which more often selected a warp that was still stalled waiting for memory or similar, whereas the youngest warp is more likely to have useful work to do.

The scheduler starts issuing instructions from the youngest warp, which is W3, until it misses in the I-cache. As a result of the miss, W3 does not have any valid instruction, so the scheduler switches to issue instructions from W2. W2 hits in the I-cache since it reuses the instructions brought by W3, and when it reaches the point where W3 missed, the miss has already been served, and all remaining instructions are found in the I-cache, so the scheduler greedily issues that warp until the end. Later, the scheduler proceeds to issue instructions from W3 (the youngest warp) until the end, since now all instructions are present in the I-cache.
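As a sketch of how simple that selection can be, here is my own simplification of the policy (not the paper’s exact logic); treating “youngest” as the highest warp id is an assumption on my part.

```python
# Toy greedy-then-youngest warp selection: stay on the current warp while the
# compiler's control bits allow it to issue, otherwise fall back to the
# youngest ready warp. "Youngest" is modelled here as the highest warp id.
from dataclasses import dataclass

@dataclass
class WarpState:
    wid: int
    has_valid_instr: bool = True   # False e.g. while waiting on an I-cache miss
    stall_left: int = 0            # compiler-set Stall counter (see above)
    yield_bit: bool = False        # compiler hint to give other warps a turn

def ready(w):
    return w.has_valid_instr and w.stall_left == 0 and not w.yield_bit

def pick_warp(warps, current):
    if current is not None and ready(current):
        return current                              # greedy: keep the same warp
    candidates = [w for w in warps if ready(w)]
    if not candidates:
        return None                                 # nothing can issue this cycle
    return max(candidates, key=lambda w: w.wid)     # youngest ready warp

# W3 misses in the I-cache, so the scheduler falls back to W2, the next youngest.
warps = [WarpState(0), WarpState(1), WarpState(2), WarpState(3, has_valid_instr=False)]
print(pick_warp(warps, current=warps[3]).wid)       # -> 2
```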

Similarly, the paper points out that instruction prefetching is handled by a simple stream buffer (probably 16 instructions deep) rather than any kind of complex branch prediction logic, because we generally don’t do that kind of thing on GPUs!

a straightforward prefetcher, such as a stream buffer, behaves close to a perfect instruction cache in GPUs. This is because the different warps in each sub-core usually execute the same code region and the code of typical GPGPU applications does not have a complex control flow, so prefetching N subsequent lines usually performs well. Note that since GPUs do not predict branches, it is not worth implementing a Fetch Directed Instruction prefetcher [76] because it would require the addition of a branch predictor.
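Here is a toy version of that idea, as my own sketch: the 16-entry depth just mirrors the guess above, and real hardware tracks cache lines and tags rather than Python integers.

```python
# Toy sequential stream buffer: on an I-cache miss, queue up the next N line
# addresses; because warps mostly run the same straight-line code, the head of
# the queue is usually exactly the next line anyone asks for.
from collections import deque

class StreamBuffer:
    def __init__(self, depth=16):
        self.depth = depth
        self.lines = deque()              # FIFO of prefetched line addresses

    def on_icache_miss(self, line_addr):
        # (re)start streaming the lines that follow the missing one
        self.lines = deque(line_addr + i + 1 for i in range(self.depth))

    def lookup(self, line_addr):
        # hit only if the request matches the head of the stream
        if self.lines and self.lines[0] == line_addr:
            self.lines.popleft()
            if self.lines:
                self.lines.append(self.lines[-1] + 1)   # keep the buffer topped up
            return True
        return False

sb = StreamBuffer()
sb.on_icache_miss(100)        # miss on line 100 kicks off prefetch of 101..116
print(sb.lookup(101))         # True: the next sequential line is already queued
print(sb.lookup(200))         # False: a taken branch falls back to the I-cache
```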
