What do you need to do to get better performance and GPU efficiency out of your model? The GPU-oriented folks at Stanford recently published an early preview of their work on LLM-generated kernels: Surprisingly Fast AI-Generated Kernels We Didn’t Mean to Publish (Yet) – and they have a list:
- Memory Access Optimization: improving the efficiency of data movement between different memory hierarchies (global memory, shared memory, registers) and ensuring data is accessed in a way that maximizes bandwidth and minimizes conflicts.
- Asynchronous Operations & Latency Hiding: hiding the latency of slow operations (like global memory access) by overlapping them with computation or other memory transfers.
- Data Type & Precision Optimization: using lower-precision data types (like FP16 or BF16) where possible to reduce memory bandwidth requirements, increase cache effectiveness, and potentially leverage specialized hardware units.
- Compute & Instruction Optimization: making the arithmetic computations themselves more efficient, reducing instruction count, or leveraging specialized hardware instructions.
- Parallelism & Occupancy Enhancement: maximizing the number of active warps on the Streaming Multiprocessors (SMs) to better hide latencies and improve overall throughput.
- Control Flow & Loop Optimization: reducing the overhead associated with loops, branches, and indexing calculations.
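Most of these correspond to very concrete code patterns. As a hedged, toy illustration (my own sketch, not one of the generated kernels), here is a tiny CUDA kernel that touches two of the categories above: vectorized float4 loads for memory-access efficiency, and a grid-stride loop so occupancy isn’t tied to the problem size.

```cuda
// Toy AXPY-style kernel, illustrative only.
__global__ void axpy_vec4(const float4* __restrict__ x,
                          float4* __restrict__ y,
                          float a, int n4) {
    // Grid-stride loop: each thread processes several elements, so the launch
    // configuration (and therefore occupancy) is decoupled from problem size.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n4;
         i += gridDim.x * blockDim.x) {
        float4 xv = x[i];   // one 128-bit load instead of four 32-bit loads
        float4 yv = y[i];
        yv.x += a * xv.x;
        yv.y += a * xv.y;
        yv.z += a * xv.z;
        yv.w += a * xv.w;
        y[i] = yv;          // one 128-bit store
    }
}
```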
That’s a good list! In this case, though, it emerged not (just) from talking with kernel experts, but also from having a model generate kernels:
We have some very fast AI-generated kernels in pure CUDA-C without using libraries and DSLs such as CUTLASS and Triton. They are performing close to or in some cases even beating the standard expert-optimized production kernels shipped in PyTorch.
They developed a very straightforward but smart pattern for structuring test-time compute. The model reasons about optimizations in natural language before generating code. Then, it branches out into a tree of refinements for each optimization idea, which keeps the search from getting stuck in loops of repeated investigation.
The kernels they generated were somewhere between fast and very fast:
Conv2D: 179.9% performance of FP32 torch.nn.Conv2D; problem size: (100, 3, 224, 224) input tensor, conv(in_channels=3, out_channels=96, kernel_size=11, stride=4, padding=2)
The team aren’t claiming this is a general solution, just an interesting proof of possibility, which it certainly is! The walk-through of how they got to the final Conv2D kernel is fascinating, both in terms of human intervention and the chain of optimizations.
The final code sample for the Conv2D kernel is included in the appendix. It uses advanced CUDA techniques that we find challenging to write ourselves! We also have more example kernels in this GitHub repo.
The kernel is very fast for specific shapes, on the L40S, in FP32. It’s also a kernel that, by the sounds of it, the team themselves struggled a bit with. It’s very, very specialized. It’s not that a human couldn’t have built it, it’s that (in most cases) they wouldn’t: it’s not a priority kernel, and all that clever CUDA comes with operational overhead, ties to specific hardware and shapes, and so on.
That in itself isn’t new. If you compile a model with PyTorch or XLA you’ll get a lot of kernels you probably wouldn’t write yourself, but this adds a new (and weird!) layer of non-determinism to everything. Elsewhere at Stanford, they have been looking at one of the other killers of GPU efficiency: kernel launch overhead. Most models are represented by hundreds of kernels, each of which has to be scheduled from the CPU. LLMs are generally memory-bound, small ones particularly so, and the gaps between kernel executions can end up dominating performance:
Look Ma, No Bubbles! Designing a Low-Latency Megakernel for Llama-1B:
In this post, we show how we can bypass this problem by merging the entire Llama-1B forward pass into a single “megakernel” that eliminates kernel boundaries altogether. Doing this achieves brr – on an H100, we use 78% of memory bandwidth and outperform existing systems by over 1.5x. (To our knowledge, this is the lowest-latency forward pass for Llama-1B in bfloat16!) In the rest of this post, we’ll walk through how and why one would do this.
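To get a feel for why kernel boundaries hurt, here is a hedged toy benchmark (my own sketch, not from the post): launch a kernel that does almost nothing a thousand times and time it. For kernels this small, the per-launch scheduling cost and the gaps between kernels dominate, not the arithmetic.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void tiny_kernel(float* x) { x[0] += 1.0f; }   // barely any work

int main() {
    const int kLaunches = 1000;
    float* d_x;
    cudaMalloc(&d_x, sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    for (int i = 0; i < kLaunches; ++i) {
        tiny_kernel<<<1, 1>>>(d_x);   // each launch pays CPU/driver scheduling cost
    }
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("%d launches: %.3f ms total, %.2f us per launch\n",
           kLaunches, ms, ms * 1000.0f / kLaunches);

    cudaFree(d_x);
    return 0;
}
```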
The idea of megakernels that handle all of the operations is not new, but the complexity of fusing everything together is high. Persistent kernels were popularized towards the tail end of the CUDA 11 series, thanks to the residency and async-copy support that arrived with Ampere. They allow a kernel to stay resident on an SM and pull a series of tiles to work on, rather than scheduling repeated launches of the same kernel (sketched below). The megakernel takes this idea even further, with multiple operations inside one kernel that pulls a stream of different tasks. One issue with this approach (traditionally) is register spilling: you only have so many registers available, up to 255 per thread, though with a fairly high overall limit of 64K 32-bit registers per SM (on Hopper). That means you need to keep some data in shared memory, and efficient use of shared memory ends up being the bottleneck. The team at Stanford developed a paging scheme for shared memory, with a separate reserved block for managing allocations of shared memory to individual tasks.
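For reference, the basic persistent-kernel shape looks roughly like this (a hedged sketch, not the Stanford megakernel): launch about one block per SM, once, and let each block claim tiles of work from a global counter until the queue runs dry.

```cuda
// Minimal persistent-kernel skeleton, illustrative only.
__global__ void persistent_worker(int num_tiles, int* tile_counter, float* out) {
    __shared__ int tile;
    while (true) {
        if (threadIdx.x == 0) {
            tile = atomicAdd(tile_counter, 1);   // claim the next tile of work
        }
        __syncthreads();
        if (tile >= num_tiles) return;           // queue drained: this block retires

        // ... stage this tile's operands into shared memory (cp.async on
        // Ampere+), compute, and write the tile's results into `out` ...

        __syncthreads();   // all threads finished before `tile` is overwritten
    }
}
```

You would launch it once with roughly one block per SM, e.g. persistent_worker<<<num_sms, 256>>>(num_tiles, counter, out), and the blocks stay resident until the counter runs past num_tiles.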
The megakernel gets the CPU completely out of the picture for the forward pass, but it is incredibly specific to the model (in this case Llama 3.2 1B).
Another collaboration was clearly thinking in the same direction, as they recently posted about their Mirage Persistent Kernel: a compiler for megakernels.
our team from CMU, UW, Berkeley, NVIDIA, and Tsinghua developed Mirage Persistent Kernel (MPK) — a compiler and runtime system that automatically transforms multi-GPU LLM inference into a high-performance megakernel. MPK unlocks the benefits of end-to-end GPU fusion while requiring minimal manual effort from developers.
The system works by building a task graph of what they call LAX[1] fragments, which in practice is a very short list of 14 operators. This is actually too small to represent everything they need, meaning they have to manually decompose some common ops like ReLU, but this level of decomposition gives them the ability to do some pretty complex fusions.
The actual ops are generated thanks to Mirage’s Kernel Superoptimizer (a great name), which I think is a very intense autotuner:
Mirage automatically searches for possible GPU kernels for attention. The search space includes existing manually designed attention kernels (e.g., FlashAttention and FlashDecoding) as special cases. It also includes other implementations that outperform today’s handwritten ones by up to 3.5x for certain use cases. The GPU kernels generated by Mirage can directly operate on PyTorch tensors and be called in your PyTorch program.
The search is not cheap though:
In our evaluation, Mirage takes up to 4 hours to optimize a Lax program. This optimization is a one-time cost before deployment on the target hardware.
The aggressive decomposition allows them to have a clever verification scheme where they validate kernels on random inputs to get confidence in (approximate) numerical correctness.
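A minimal sketch of that idea (my own toy harness, not Mirage’s): run a candidate kernel and a trusted reference on the same random inputs and count mismatches beyond a tolerance.

```cuda
#include <cmath>
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Stand-in "candidate"; in Mirage's setting this would be a generated fragment.
__global__ void candidate_relu(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = fmaxf(in[i], 0.0f);
}

int main() {
    const int n = 1 << 20;
    float* h_in  = (float*)malloc(n * sizeof(float));
    float* h_out = (float*)malloc(n * sizeof(float));
    for (int i = 0; i < n; ++i)
        h_in[i] = 2.0f * rand() / RAND_MAX - 1.0f;       // random inputs

    float *d_in, *d_out;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);
    candidate_relu<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
    cudaMemcpy(h_out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost);

    // Compare against a trusted host reference within a tolerance.
    int mismatches = 0;
    for (int i = 0; i < n; ++i) {
        float ref = h_in[i] > 0.0f ? h_in[i] : 0.0f;
        if (fabsf(h_out[i] - ref) > 1e-5f) ++mismatches;
    }
    printf("%d mismatches out of %d elements\n", mismatches, n);

    cudaFree(d_in); cudaFree(d_out);
    free(h_in); free(h_out);
    return 0;
}
```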
They then build a worker kernel containing all the relevant operations, and schedule the optimized graph via dedicated scheduler warps. Workers are scheduled on the SMs and report status back; the scheduler warps then decide when tasks can be enqueued for execution.
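The shape of that worker kernel is roughly an interpreter loop. The sketch below is a hedged approximation, not MPK’s code: the opcodes, the Task struct, and the single global counter standing in for the scheduler warps are all invented for illustration.

```cuda
// Hypothetical opcodes and task descriptor, invented for illustration.
enum Op { OP_MATMUL, OP_RMSNORM, OP_ROTARY };
struct Task { int op; int arg; };   // arg: which layer / tile this task covers

__global__ void megakernel_worker(const Task* tasks, int num_tasks, int* next_task) {
    __shared__ int my_task;
    while (true) {
        if (threadIdx.x == 0) {
            // Simplest possible stand-in for MPK's scheduler warps, which only
            // hand out tasks whose dependencies have already completed.
            my_task = atomicAdd(next_task, 1);
        }
        __syncthreads();
        if (my_task >= num_tasks) return;    // no more work for this block

        Task t = tasks[my_task];
        switch (t.op) {                      // dispatch without a kernel launch
            case OP_MATMUL:  /* run a matmul tile for t.arg here  */ break;
            case OP_RMSNORM: /* run an RMSNorm row for t.arg here */ break;
            default:                                                 break;
        }
        __syncthreads();   // keep the block together before claiming more work
    }
}
```

In MPK proper, the scheduler warps track dependencies between tasks and only enqueue work whose inputs are ready; the bare atomicAdd here is just the simplest stand-in for that machinery.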
They’ve got a code example that walks through setting it up for Qwen. They recreate the model structure explicitly, generate a task graph from it, and kick off the search for optimal kernels and fusions. This avoids the need to solve the Dynamo-style problem of tracing the model!
The resulting kernel is again heavily tied to the specific hardware and model. One thing we have found useful for investigating production problems is the ability to ablate different parts of the compile process and run models in (basically) PyTorch eager mode. This approach leaves the darkest of black boxes to work with, and I would imagine the PTX is even more terrifying than the complex CUDA that the LLM kernel generation team came up with.
Between these projects though, it feels like we are exploring the edges of what running a program on GPUs should actually look like: a combination of kernel generation and multi-level search seems almost guaranteed to yield optimizations that would be far outside the cost-benefit curve for manual implementation. What we don’t have yet are well-understood ways to operationalize this kind of approach, but it’s an exciting area to watch!
Thanks to Matt for nudging me to actually read through these papers; they’d been on my to-do list for a bit!
- I am not sure what this stands for, but the basic ops in JAX live in jax.lax, so I presume it’s the same source! ↩︎