Blog

ARPA and predicting the future
Statecraft recently re-ran an interview from 2023 with Jason Matheny, formerly of IARPA: https://www.statecraft.pub/p/how-to-predict-the-future-278

While defense policy and research is a ways outside the scope for myself (or I imagine most folks reading), the problems of managing or working on uncertain, research-y projects in a volatile environment are pretty relatable:

Most of what we know from cognitive psychology and human judgment research over the last 50 years suggests that unstructured group deliberation might be one of the worst ways of making judgments, yet it’s the norm in most institutions.

Or this bit of career wisdom:

In general, people underestimate their own potential to make contributions to the most important problems. They overestimate how many people are already working on the most important problems. So many incredibly important problems are just really neglected. If you can’t figure out who’s working on something after a few days of homework, then it is probably a neglected problem. And it’s probably up to us to solve it.

Jason talks about looking for projects in the goldilocks zones of probability (less than 50%, more than 5%) that open up interesting opportunities. I worked with a manager who was a strong advocate of the Heilmeier Catechism to evaluate projects, and have seen the value of using it as guidance when presenting and evaluating ideas:
1. What are you trying to do? Articulate your objectives using absolutely no jargon.
2. How is it done today, and what are the limits of current practice?
3. What is new in your approach and why do you think it will be successful?
4. Who cares? If you are successful, what difference will it make?
5. What are the risks?
6. How much will it cost?
7. How long will it take?
8. What are the mid-term and final “exams” to check for success?
Jason adds some interesting updates:

For instance, the Heilmeier questions don’t have a question about counterfactual impact: “Would this work get done otherwise?” The office tends not to rigorously assess the other funding streams going to solve this particular problem, and their likelihoods of success.

We also tend not to think much about strategic move and countermove. […]. It probably is prudent to assign at least a 10% probability to some exquisite, classified technology being stolen.

One thing I found myself talking about this week with a couple of folks was how good people get “lucky” a lot. I think these kinds of questions help navigate towards those more positive-surprise-filled spaces.
July 8, 2025

links-and-recs

Cute-DSL

In May Nvidia shipped CuTe‑DSL, the Python library they teased at GTC earlier in the year that mirrors CUTLASS’s C++ tensor‑layout . Then, at the start of June, the ‑dev label disappeared (so presumably its production ready now). The pitch is simple: Write speed‑of‑light kernels from the comfort of Python.

Of course, nothing about CUDA is ever really simple. CuTe‑DSL gives the full Cutlass experience¹, wrapped in an ever so slightly more approachable interface.

Getting Cute: Transpose

Matrix transpose felt like a reasonable ‘hello world’ : (B[j,i] = A[i,j]). The PyTorch version is simple: torch.transpose(input, 0, 1).

To get a baseline, here is a simple transpose kernel in Triton. We tl.load, flip the coordinates and tl.store it back.

@triton.jit
def triton_transpose_kernel(input_ptr, output_ptr, M, N, BLOCK_SIZE: tl.constexpr):
    # 2D block coordinates
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)
    
    # Calculate offsets
    offs_m = pid_m * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    offs_n = pid_n * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    
    # Load with masking
    mask = (offs_m[:, None] < M) & (offs_n[None, :] < N)
    a = tl.load(input_ptr + offs_m[:, None] * N + offs_n[None, :], mask=mask)
    
    # Store transposed (swap coordinates)
    tl.store(output_ptr + offs_n[:, None] * M + offs_m[None, :], a, mask=mask)

Here’s the same idea in CuTe‑DSL. CuTe leverages a decorator and Pythons ability to integrate with JITs. Anything decorated with @jit runs host-side (on the CPU), while @kernel is used for device side (on the GPU). There are both AST based and tracing based options dependening on the presence of dynamic shapes or control flow.

    @cute.kernel
    def transpose_kernel(self, mA: cute.Tensor, mB: cute.Tensor):
        tidx = cute.arch.thread_idx()[0]
        tidy = cute.arch.thread_idx()[1]
        bidx = cute.arch.block_idx()[0]
        bidy = cute.arch.block_idx()[1]
        # This might all be unnecessary
        # but I was fearful of the compiler
        tile_start_m = cutlass.Int32(0)
        tile_start_n = cutlass.Int32(0)
        global_m = cutlass.Int32(0)
        global_n = cutlass.Int32(0)
        M = cutlass.Int32(0)
        N = cutlass.Int32(0)
        val = cutlass.Float32(0.0)
        # Calculate tile starting positions
        tile_start_m = bidy * self._tile_size
        tile_start_n = bidx * self._tile_size
        # Calculate global coordinates for this thread
        global_m = tile_start_m + tidy
        global_n = tile_start_n + tidx
        # Get matrix dimensions at runtime
        M = mA.shape[0]
        N = mA.shape[1]
        # Bounds checking and transpose operation
        if global_m < M and global_n < N:
            val = mA[global_m, global_n]
            # Transpose: B[n, m] = A[m, n]
            mB[global_n, global_m] = val

What just happened?

Thread and block indices come straight from CUDA (thread_idx, block_idx), vs the Triton block abstraction
No explicit loads or stores: CuTe uses overloaded [] to generate them.

Launching isn’t a million miles away from Triton:

@cute.jit   # host side
def launch(self, A: cute.Tensor, B: cute.Tensor):
    M, N = A.shape
    grid = ((N + self.T - 1)//self.T,
            (M + self.T - 1)//self.T, 1)
    self.transpose_kernel(A, B).launch(
        grid=grid,
        block=[self.T, self.T, 1],
    )

Because CuTe‑DSL speaks DLPack, you can hand it a PyTorch tensor directly. If you wanted to cache the conversion, it looks like this:

A_cute = from_dlpack(A).mark_layout_dynamic()

The mark_layout_dynamic is used to trigger the dynamic shape support, and avoid shape specialization. The one place where this went a bit funky in my testing was dealing with singular leading dimensions: there you need to be more explicit about the shape to satisfy the compiler.

Layouts and Memory

This kernel isn’t really leveraging the fundamental value of CuTe though, which is composable tensor layouts and memory management. CuTe‑DSL exposes the full memory hierarchy: global, shared, register (and tmem for those with blackwells), and lets you tile, copy, and pipeline data between them. Common primitives:

make_layout / make_layout_tv: describe how a tensor is laid out.
cute.zipped_divide(tensor, tiler): tile a tensor.
cute.copy(src_layout, dst_layout, pred=mask): async copy.
cute.arch.sync_threads(): explicit barrier.

HGEMMony²

CuTe ships with some example kernels, so I grabbed one — an HGEMM (half-precision, FP16, batched GEMM) — and compared to an example implementation in Triton.

To express the same thing in PyTorch, we can unleash our inner Jeremy Howard and use einsum notation: torch.einsum("mkl,nkl->mnl", a, b). Take L batches of a MxK matrix, L batches of a NxK matrix, and return L batches of a MxN matrix.

Here is the Triton:

@triton.jit
def triton_batched_hgemm_kernel(
	a_ptr, b_ptr, c_ptr,
  M, N, K, L, 
  stride_am, stride_ak, stride_al, 
  stride_bn, stride_bk, stride_bl, 
  stride_cm, stride_cn, stride_cl, 
  BLOCK_SIZE_M: tl.constexpr, BLOCK_SIZE_N: tl.constexpr, BLOCK_SIZE_K: tl.constexpr,
  GROUP_SIZE_M: tl.constexpr,
):
    """Triton batched half-precision GEMM kernel: C[m,n,l] = sum_k A[m,k,l] * B[n,k,l]"""
    pid = tl.program_id(axis=0)
    pid_batch = tl.program_id(axis=1)  # Batch dimension
    
    # Calculate batch offsets
    batch_offset_a = pid_batch * stride_al
    batch_offset_b = pid_batch * stride_bl  
    batch_offset_c = pid_batch * stride_cl
    num_pid_m = tl.cdiv(M, BLOCK_SIZE_M)
    num_pid_n = tl.cdiv(N, BLOCK_SIZE_N)
    num_pid_in_group = GROUP_SIZE_M * num_pid_n
    group_id = pid // num_pid_in_group
    first_pid_m = group_id * GROUP_SIZE_M
    group_size_m = min(num_pid_m - first_pid_m, GROUP_SIZE_M)
    pid_m = first_pid_m + ((pid % num_pid_in_group) % group_size_m)
    pid_n = (pid % num_pid_in_group) // group_size_m
    offs_am = (pid_m * BLOCK_SIZE_M + tl.arange(0, BLOCK_SIZE_M)) % M
    offs_bn = (pid_n * BLOCK_SIZE_N + tl.arange(0, BLOCK_SIZE_N)) % N
    offs_k = tl.arange(0, BLOCK_SIZE_K)
    
    # Include batch offsets in pointer calculations
    a_ptrs = a_ptr + batch_offset_a + (offs_am[:, None] * stride_am + offs_k[None, :] * stride_ak)
    # transpose for GEMM (load B[K, N] pattern)
    b_ptrs = b_ptr + batch_offset_b + (offs_k[:, None] * stride_bk + offs_bn[None, :] * stride_bn)
    
    # We accumulate into fp32 for higher accuracy.
    accumulator = tl.zeros((BLOCK_SIZE_M, BLOCK_SIZE_N), dtype=tl.float32)
    
    for k in range(0, tl.cdiv(K, BLOCK_SIZE_K)):
        # Load the next block of A and B, mask in K
        a = tl.load(a_ptrs, mask=offs_k[None, :] < K - k * BLOCK_SIZE_K, other=0.0)
        b = tl.load(b_ptrs, mask=offs_k[:, None] < K - k * BLOCK_SIZE_K, other=0.0)
        accumulator = tl.dot(a, b, accumulator)
        a_ptrs += BLOCK_SIZE_K * stride_ak
        b_ptrs += BLOCK_SIZE_K * stride_bk
    # Convert back to FP16 for output
    c = accumulator.to(tl.float16)
    offs_cm = pid_m * BLOCK_SIZE_M + tl.arange(0, BLOCK_SIZE_M)
    offs_cn = pid_n * BLOCK_SIZE_N + tl.arange(0, BLOCK_SIZE_N)
    c_ptrs = c_ptr + batch_offset_c + stride_cm * offs_cm[:, None] + stride_cn * offs_cn[None, :]
    c_mask = (offs_cm[:, None] < M) & (offs_cn[None, :] < N)
    tl.store(c_ptrs, c, mask=c_mask)

The core loop here is:

Divide into block_size_k chunks of K
For each do a masked load (so when we hit the edges we don’t bring in garbage)
Do the dot product for the tile into an accumulator for the result matrix
Advance the A and B pointers

The CuTe kernel is, uhhh… a bit more involved. The full kernel is several hundred lines long. You can see the source in the Cutlass repo, which demonstrates some cool features like the ability to pass in an epilogue function for fusion.

For now, lets focus on the main loop. As before, we are looping over tiles of K³ .

The first thing we do is advance the pointers. The kernel is doing explicit pipelining of transfers from global memory, to shared memory, to registers, so we need to set wait groups do ensure all the loading has completed before we advance. We’re not actually doing loading in this section, just prepping the ground:

for k_tile in cutlass.range_dynamic(k_tile_count, unroll=1):
	for k_block in range(num_k_block):  
      if k_block == num_k_block - 1:
    	tCsA_p = tCsA_copy_view[None, None, None, smem_pipe_read]
      tCsB_p = tCsB_copy_view[None, None, None, smem_pipe_read]
      cute.arch.cp_async_wait_group(num_smem_stages - 2)
      cute.arch.sync_threads()

Next we kick off a copy from shared memory (smem) to registers (rmem) using cute.copy for the (future) A and B tiles.

k_block_next = (k_block + 1) % num_k_block  # static
cute.copy(
	tiled_copy_s2r_A,
  	tCsA_p[None, None, k_block_next],
    tCrA_copy_view[None, None, k_block_next],
   )
cute.copy(
	tiled_copy_s2r_B,
    tCsB_p[None, None, k_block_next],
     tCrB_copy_view[None, None, k_block_next],
  )

Finally, we interleave the transfers of the next A and B tiles from global to shared memory with the actual gemm operation (the equivalent of tl.dot). I will trust the folks at Nvidia that this is an optimal pattern. The pred= in there is the equivalent of masking in Triton.

   if k_block == 0:
 		if k_tile + num_smem_stages - 1 < k_tile_count:
    	    cute.copy(
				tiled_copy_B,
				tBgB[None, None, None, k_tile_index],
				tBsB[None, None, None, smem_pipe_write],
				pred=tBpB,
			)
        k_tile_index = k_tile_index + 1
        cute.arch.cp_async_commit_group()
		smem_pipe_write = smem_pipe_read
		smem_pipe_read = smem_pipe_read + 1
			
		if smem_pipe_read == num_smem_stages:
      	    smem_pipe_read = 0

The pipelining is explicit, which is nice for debuggability and optimization, but very manual.

Debugging Tips

export CUTE_DSL_LOG_TO_CONSOLE=1
export CUTE_DSL_LOG_LEVEL=10   # up to 100
export CUTE_DSL_PRINT_IR=1     # dump MLIR

cute.printf() gives you a GPU‑side printf.
Kernels are aggressively cached; rm ~/.cache/cutedsl if things look stale.
Multiple @cute.jit host functions in the same Python scope can confuse MLIR (mainly for launching kernels).
The control‑flow rules are strict: no return inside a kernel; initialize everything.

If you’re exploring GPU kernels for the first time, I strongly recommend starting with Triton. When you need to really get into the weeds, or want to reuse CUTLASS building blocks, its great to have CuTe‑DSL as an option in Python (provided you’re comfortable spelunking in GPU internals).

I spent a lot of time holding it wrong. Arguably, still holding it wrong. ↩︎
No one knows what it means, but it’s provocative ↩︎
Note the explicit unroll tag. When you really want #pragma but can’t. ↩︎

July 4, 2025

ML Infrastructure

Harnessing the Universal Geometry of Embeddings
What content should you include in an LLM prompt? Many interesting use cases (enterprise tools, coding assistants) have more content than they can handle at once, so you chunk it up, turn each chunk into a vector with some sentence‑encoder, and store those vectors in a database. Later you vector‑search, pull back the relevant chunks and feed them to the LLM — better known as the RAG pattern.

The working assumption has been: those vectors are fairly safe. A 768‑dimensional point doesn’t look the text “Ian ordered a burger at 12:07 ”, so storing raw embeddings seem privacy‑preserving.

But are they! In Cornell’s Harnessing the Universal Geometry of Embeddings paper the authors train vec2vec, a small GAN that learns to translate embeddings from encoder A’s space into encoder B’s space, without seeing the original sentences. Once you’re in encoder B‑land you can recover up to 80 % of the underlying text:

Inversion, i.e., reconstruction of text inputs, is more ambitious than attribute inference. vec2vec translations retain enough semantic information that off-the-shelf, zero-shot inversion methods […] extract information for as many as 80% of documents given only their translated embeddings, for some model pairs (Figure 5). These inversions are imperfect and we leave development of specialized inverters for translated embeddings to future work. Nevertheless, as exemplified in Figure 6, they still extract information sucdh as individual and company names, dates, promotions, financial information, outages, and even lunch orders. In Appendix E, we show the prompt we use to measure extraction.

The paper suggests that most sentence encoders trained with similar objectives on sufficiently diverse data come up with embeddings which resemble each other. Concepts, topics (and lunch) live on a shared manifold¹; the models just might position them differently in embedding space. Vec2vec is a learned a coordinate transform.

What this might be implying is that if you train a model with similar objectives on data samples from a similar generating function, you will arrive at a manifold in latent space that is geometrically similar to anyone else doing the same thing. If that is true operations in latent-space start to look less model specific, and approaches that navigate them (like JEPA, LDM editing) could learn to operate across different model with just an adapter layer.

To be clear, the paper is not saying this: the authors only align English, contrastive‑loss, transformer sentence encoders. No decoder models, hardly any dimensionality mismatch. The phrase “universal geometry²” may be a stretch: Their GAN training also requires quite a bit of run cherry-picking³, and when they tried cross-modality the results weren’t as strong, but it’s a very interesting idea worth further investigation.

In the short term, this is probably mildly alarming for coding agent customers that are worried about their source code leaking, but in the long term I hope we can see some more investigation into how true this is in more general modeling, and what kind of opportunities that might open up!
1. Shape in the embedding space. In practical experience when you have a large embedding space its mostly empty, and all the actual data lives on a plane in the space. This is why things like latent diffusion models work: they learn to navigate towards that plane from any random noisy point in the space. ↩︎
2. But it’s a great title. ↩︎
3. My understanding is unstable training is a very common problem for GANs. ↩︎
June 28, 2025

links-and-recs
How to build unmaintainable kernels
What do you need to do to get better performance and GPU efficiency out of your model? The GPU-oriented folks at Stanford recently published an early preview of the work they have been doing on the LLM generation of kernels: Surprisingly Fast AI-Generated Kernels We Didn’t Mean to Publish (Yet) – and they have a list:
- Memory Access Optimization: improving the efficiency of data movement between different memory hierarchies (global memory, shared memory, registers) and ensuring data is accessed in a way that maximizes bandwidth and minimizes conflicts.
- Asynchronous Operations & Latency Hiding: hide the latency of slow operations (like global memory access) by overlapping them with computation or other memory transfers
- Data Type & Precision Optimization: using lower-precision data types (like FP16 or BF16) where possible to reduce memory bandwidth requirements, increase cache effectiveness, and potentially leverage specialized hardware units.
- Compute & Instruction Optimization: making the arithmetic computations themselves more efficient, reducing instruction count, or leveraging specialized hardware instructions
- Parallelism & Occupancy Enhancement: maximize the number of active warps on the Streaming Multiprocessors (SMs) to better hide latencies and improve overall throughput
- Control Flow & Loop Optimization: reducing the overhead associated with loops, branches, and indexing calculations
That’s a good list! In this case though, it emerged not from (just) talking with kernel experts, but also from developing a model to generate kernels:

We have some very fast AI-generated kernels in pure CUDA-C without using libraries and DSLs such as CUTLASS and Triton. They are performing close to or in some cases even beating the standard expert-optimized production kernels shipped in PyTorch.

They developed a very straightforward but smart pattern on structuring test-time-compute. They reason about optimizations in natural language before generating code. Then, they branch out into a tree structure of refinements for each optimization idea, to avoid loops in investigation.

The kernels they generated were somewhere between fast, and very fast:

Conv2D: 179.9% performance of FP32 torch.nn.Conv2D; problem size: (100, 3, 224, 224) input tensor, conv(in_channels=3, out_channels=96, kernel_size=11, stride=4, padding=2)

The team aren’t claiming this is a general solution, but just an interesting proof of possibility, which is certainly is! The walk through of how they got to the final conv2D kernel is fascinating, both in terms of human intervention and the chain of optimizations.

The final code sample for the Conv2D kernel is included in the appendix. It uses advanced CUDA techniques that we find challenging to write ourselves! We also have more example kernels in this Github repo

The kernel is very fast for specific shapes, on the L40s, in FP32. Its also a kernel that, by the sounds of it, the team themselves struggled a bit with. It’s very, very specialized. It’s not that a human couldn’t have built it, its that (in most cases) they wouldn’t: it’s not a priority kernel, and all that clever CUDA comes with operational overhead, ties on specific hardware, shapes and so on.

That in itself isn’t new. If you PyTorch or XLA compile you’ll get a lot of kernels which you probably wouldn’t write, but this adds a new (and weird!) layer of non-determinism to everything. Elsewhere at Stanford, they have been looking at one of the other killers of GPU efficiency: kernel launch overhead. Most models are represented by hundreds of kernels, each of which have to be scheduled from the CPU. LLMs are generally memory-bound, small ones particularly so, and the gaps between kernel executions can end up dominating performance:

Look Ma, No Bubbles! Designing a Low-Latency Megakernel for Llama-1B:

In this post, we show how we can bypass this problem by merging the entire Llama-1B forward pass into a single “megakernel” that eliminates kernel boundaries altogether. Doing this achieves brr – on an H100, we use 78% of memory bandwidth and outperform existing systems by over 1.5x. (To our knowledge, this is the lowest-latency forward pass for Llama-1B in bfloat16!) In the rest of this post, we’ll walk through how and why one would do this.

The idea of megakernels that handle all of the operations is not new, but the complexity of fusing everything together is high. Persistent kernel were popularized at the tail end of the CUDA 11 series, due to the right residency and async copy support in Ampere. They allow leaving kernels resident on an SM and having them pull a series of tiles to do their work rather than scheduling repeated launches of the same kernel. The megakernel takes this idea ever further, with multiple operations within the kernel that pulls a stream of different problems. One issue with this approach (traditionally) is register spilling: you only have so many registers available, up to 255 per thread, though with a fairly high overall limit of 64k 32-bit registers (on Hopper). That means you need to keep some data in shared memory, and efficient use of shared memory ends up being the bottleneck. The team at Stanford developed paging shared memory, with a separate reserved block for managing allocations of shared memory to individual tasks.

This gets the CPU completely out the picture for the forward pass, but is incredibly specific to the model (in this case Llama 3.2 1B).

Another collaboration was clearly thinking in the same direction, as they recently posted about their Mirage Persistent Kernel: a compiler for megakernels.

our team from CMU, UW, Berkeley, NVIDIA, and Tsinghua developed Mirage Persistent Kernel (MPK) — a compiler and runtime system that automatically transforms multi-GPU LLM inference into a high-performance megakernel. MPK unlocks the benefits of end-to-end GPU fusion while requiring minimal manual effort from developers.

The system works by building a task-graph of what they call LAX¹^[1] fragments, which in practice is a very short list of 14 operators. This is actually too small to represent everything they need, meaning they have to manually decompose some common ops like ReLu, but this level of decomposition gives them an ability to do some pretty complex fusions.

The actual ops are generated thanks to Mirage’s Kernel Superoptimizer (a great name), which I think is a very intense autotuner:

Mirage automatically searches for possible GPU kernels for attention. The search space includes existing manually designed attention kernels (e.g., FlashAttention and FlashDecoding) as special cases. It also includes other implementations that outperform today’s handwritten ones by up to 3.5x for certain use cases. The GPU kernels generated by Mirage can directly operate on PyTorch tensors and be called in your PyTorch program.

The search is not cheap though:

In our evaluation, Mirage takes up to 4 hours to optimize a Lax program. This optimization is a one-time cost before deployment on the target hardware.

The aggressive decomposition allows them to have a clever verification scheme where they validate kernels on random inputs to get confidence in (approximate) numerical correctness.

They then build a worker kernel with all the relevant operations, and schedule the optimized graph via dedicated scheduler warps. Workers are scheduled on the SMs, and report back status. The scheduler warps then decide when tasks can be enqued for execution.

They’ve got a code example that walks through setting it up for Qwen. They recreate the model structure explicitly, generate a task graph from it, and kick off the search for optimal kernels and fusions. This avoids the need to solve the Dynamo-style problem of tracing the model!

The resulting kernel is again heavily tied to the specific hardware and model. One thing we have found useful for investigating production problems is that the ability to ablate different parts of the compile process, running models in (basically) PyTorch eager mode. This approach leaves the darkest of black boxes to work with, and I would imagine even more terrifying PTX than the complex CUDA that the LLM kernel generation team came up with.

Between these projects though, it feels like we are exploring the edges of what running a program on GPUs should actually look like: a combination of kernel generation and multi-level search seems almost guaranteed to yield optimizations that would be far outside the cost-benefit curve for manual implementation. What we don’t have yet is known ways to operationalize this kind of approach, but its an exciting area to watch!

Thanks to Matt for nudging me to actually read through these papers, they’d been on my todolist for a bit!
1. I am not sure what this stands for, but the basic ops in jax are in jax.lax, so I presume its the same source! ↩︎
June 27, 2025

ML Infrastructure
GPU Driven
Traditionally, GPU collective network operations were issued from the framework on a separate CUDA stream than the local computation kernel launches. This allowed overlapping comms and hiding most or all of the network latency. NCCL exposes collectives as fully implemented kernels, and there have been various derivitives such as AMD’s RCCL or Berkeley’s new UCCL project, which is aiming to be a drop-in replacement better suited for large-scale GPU workloads¹. Earlier versions of this sent the networking via the host, and later developed towards GPU-to-GPU peer to peer over connections like NVlink, and direct communication between GPU and the NIC for scale out. But the actual coordination was driven by launches from the CPU.

The increasing capacity of GPUs, particularly in the Hopper/MI300x era, the use of micro-batching, and the rise of Mixture-of-Expert models put a lot more pressure on this arrangement: now rather than a large chunk of comms at the end of each distributed-data-parallel pass you are doing thousands of small exchanges per step. Each step could require each CPU rank to launch thousands of tiny All-to-All: millions of collective calls across the cluster, while still running the rest of the training loop. Each one forces a host interrupt, collective calls and GPU triggers for the data transfer.

A paper from last year, The Landscape of GPU Driven Communication, gives a broad overview of this shift:

In the last decade, several advancements, broadly referred to as GPU-centric communication, have sought to challenge the CPU’s hegemony on multi-GPU execution. At a high level, these advancements reduce the CPU’s involvement in the critical path of execution, give the GPU more autonomy in initiating and synchronizing communication and attempt to address the semantic mismatch between multi-GPU communication and computation.

Getting the right kind of abstractions in here to make this more accessible and flexible is an active area of development. The main reference is the shmem abstraction. This allows writing and reading bytes from remote memory (originally for supercomputers) and adding barriers around usage. Nvidia’s nvshmem (and AMDs ROCshmem) library directly implemented this for GPU.

The next big step up in functionality was GPU Direct Async, a transport that allows access to NIC doorbells directly from the GPU, extending NVSHMEM to RoCE/InfiniBand (Remote Direct Memory Access – RDMA). In addition to this, NVLS (NVLink Sharp) was added to NVLink switches which allowed much more bandwidth efficient switch-managed multicast for broadcast and reduce cases, the fundamental operations used in all-gather and all-reduce collectives. This allows GPU-initiation of collectives which gives us the ingredients to get the CPU out of the networking path completely, and to fuse networking alongside compute operations.

This was one of the things DeepSeek did really well, covered in their DeepEP work: fusing MoE GEMM with GPU-initiated RDMA cut single-node latency by 2–3x. The tradeoff is kernels handle a lot of complexity: choosing between NVLink and GPUDirect for nodes, polling, flow management and so on is tuned to their specific needs and hardware.

ByteDance has also spent a lot of time looking at this problem. Their Flux paper looks at more general approach for doing tile-based fusion of the comms and compute:

Flux overdecomposes computation and communication into tiles. Here, since the computation operation is GEMM, and most high-performance GEMM kernels on GPUs are written with tiling, such as thread block tiling or warp tiling, our decomposition can naturally map into existing tiling in the kernels. Flux fuses dependent communication and/or wait logic into a GEMM kernel, and launches only one fused kernel, compared to the prior methods launching multiple split GEMM kernels.

PyTorch has an experimental feature and RFC for SymmetricMemory:

Then, innovative block-wise compute/computation overlapping techniques started to use copy-engine to drive the P2P traffic in order to minimize contention. Now, we are seeing techniques where NVLink communication is directly issued from “compute” kernels.

[…]

Just as Triton allows average users to modify matmuls for their needs (fusion, quantization, etc.), we hope that SymmetricMemory will enable average users to modify NVLink communication algorithms for their requirements, whether it’s implementing alternate collective algorithms (one-shot allreduce), using different quantization approaches (stochastic rounding), or fusing collectives with other kernels (all-gather interleaved with matmuls).

This project is still fairly manual, though it abstracts much of the plumbing required and makes it accessible at a high level.

In a similar vein, ByteDance recently released their implementation of Triton-Distributed: ByteDance-Seed/Triton-distributed: Distributed Compiler Based on Triton for Parallel Systems. This adds a small Triton DSL wrapping the shmem operations, supporting both Nvidia and AMD hardware. It exposes some of the comms knobs to autotuning, allowing tuning across compute and collectives, focusing on the tile-based approach they documented in their TileLink paper.

This is unlikely to make its way into upstream Triton, in part because the general approach of Triton is to hide much of the hardware information in the compiler passes, while this approach makes the topology fairly explicit.

The PyTorch team has been working on traceable collectives for the PyTorch compiler. The idea of a compiler being able to look at the overall task graph and decide where to fuse comms and which transport options to use is appealing, as it allows kernels to be more transparent to the specifics of the cluster.

The move to GPU-initiated, fine-grain comms that fuse into compute kernels is real and continuing; the tooling is still early, but the gap will continue to close.
1. This currently host-side controlled, but they have plans for GPU-driven comms as well ↩︎
June 26, 2025

note to self
AbsenceBench

https://arxiv.org/abs/2506.11440

Simon Willison has a good summary:

Long context models have been getting increasingly good at passing “Needle in a Haystack” tests recently, but what about a problem in the opposite direction?

The answers are surprisingly domain-specific; some models do great on numeric sequences but most are pretty bad at code!

The authors posit that attention is just a worse mechanism for seeing what’s missing vs what’s there. For me this rhymes with the experience of folks doing agentic coding assistant work: its beneficial to clear the context window more often than you think as the models strongly prefer to use what is already in there.

This feels like a learned or tuned behavior, a flavor of the model does the eval. Models will probably get better at this problem, as now it’s legible, but is there a tradeoff that has to be made?

Pretraining is somewhat saturating, but we have oodles of post-training (which includes context extension), the whole meta-RL process of researchers trying different data mixes and algorithm/architecture tweaks, and inference time search options.

If OpenAI had Anthropic’s data and evals would they have as good an agentic coding model? And vice versa would Opus be as good at deep research as O3? I honestly don’t know: in the end compute will always be finite and we have to allocate it with some end in mind. It feels very plausible there is no globally optimal scaling law for how you prioritize different model capabilities. But the models will probably do this eval.

June 21, 2025

links-and-recs
Free-Threaded Python gets ‘supported’ status

Huge congratulations to Thomas, Matt and Sam for sheparding through PEP 779 that moves the no-gil/free threaded python mode from experimental to supported:

https://discuss.python.org/t/pep-779-criteria-for-supported-status-for-free-threaded-python/84319/123

With these recommendations and the acceptance of this PEP, we as the Python developer community should broadly advertise that free-threading is a supported Python build option now and into the future, and that it will not be removed without following a proper deprecation schedule

Having confidence in the long term of this feature is great for anyone building on it. I’m very grateful to (and feel lucky to be working with!) the many folks who have been squashing bugs and improving performance, and to the people adding support for FT Python across the ecosystem!

The steering commitee have laid out some solid documentation and performance expectations for the ongoing work, and are setting an expectation for broad compatibility for future cpython work:

New experimental projects within CPython must be compatible with, and should be based on the free-threading build. The SC encourages this direction to reduce engineering complexity caused by supporting both GIL and free-threaded builds

I also appreciate the call for building out more high-level concurrency primitives: I think there are a lot of exciting projects to come as we move more of this into production!

June 21, 2025

ML Infrastructure
A patchwork quilt view of AI Alignment
https://arxiv.org/abs/2505.05197

Very interesting paper from folks at DeepMind that is focused on arguing that the idea of a convergence of a single, coherent value set doesn’t reflect society and is not the only way to think about AI morality and alignment.

Think of society as a patchwork quilt composed of diverse communities, each with its own internal norms and expectations. Within these communities—e.g. trade unions enforcing strike discipline or religious groups defining attendance practices—members understand the specific rules and the social consequences for deviating (Bicchieri, 2005; Ostrom, 1990; Ullmann-Margalit, 1977). These internal dynamics shape behavior within each distinct patch of the quilt, fostering cohesion through shared, localized understandings of appropriate conduct

[…]

A key insight we can draw then is that what holds humans together despite profound disagreements is not value convergence, but practical mechanisms for coexistence—which we see as social technologies

There is an idea that sometimes comes up that disagreements between good, reasonable people can be traced to misunderstandings or disagreements about the likelihood of different outcomes; if you can align on them, you’ll come to the same conclusions. This encourages some focus in AI alignment on finding the right, true principals, creating the best truth-seeking model possible, then assuming downstream that will result in strong alignment. The paper challenges this assumption.

They also call out collective action problems in implementing such a framework, particularly start up and free rider problems:

Even seemingly universal goods like “survival” are embedded in thick cultural contexts that shape their meaning and priority (in fact many cultures prioritize sacred values above survival e.g. Ginges et al. (2007)). In general, mobilizing global action and resources towards any specific AI safety strategy will inevitably confront deep-seated disagreements rooted in different values, priorities, and worldviews regarding the nature of AI risks, the efficacy or fairness of proposed initial strategies, and the equitable distribution of upfront costs and responsibilities.

Their approach calls for focusing on 4 areas:
1. Contextual grounding: broader understanding, not just the conversation but the environmental context they are operating in.
2. Community customization: Different norms for different communities.
3. Continual adaption: Updating understanding of appropriate behavior based on ongoing feedback. They suggest continuous training for this.
4. Polycentric governance: Distributed decision making wiht multiple overlapping centers of authority.
If you read this list into a general “helpful agent” context instead of alignment, I don’t think it would be controversial: these seem good ideas!

That said, I think a lot of this boils down to the last one. Getting governance structures right is hard, in any context, and I interpret a key part of the aspiration here as having “checks and balances” that can represent varied interests. Not an easy problem to solve!

Some might worry that our patchwork approach embraces a troubling relativism, but this misunderstands the quilt we’re describing. Just as a quilt’s structural integrity depends on solid stitching techniques regardless of pattern diversity, our appropriateness framework recognizes that while the
specific content of norms varies across contexts, the social technologies that facilitate coexistence—the mechanisms of learning, adaptation, and conflict resolution—can be studied with rigor and implemented with care.
June 15, 2025

links-and-recs
Linear Layouts in Triton
[2505.23819] Linear Layouts: Robust Code Generation of Efficient Tensor Computation

Paper from the Triton folks at OpenAI on their solution to the layouts/data movement problem. Data often needs to be laid out in a specific way to maximize performance on a GPU. This includes certain instructions, and also avoidance of bank conflicts in shared memory. You might have data stored nicely in global memory, need to permute it to load, then permute it again for execution.

Part of the appeal of CuTe is expressing these layouts and allowing a relatively simple algebra to transform it between these domains. This works, but the Triton approach is to try and hide this type of complexity, particularly hardware specific complexity, in the compiler.

While both CUTE and linear layouts aim to address the challenge of flexible task mapping on emerging architectures, they differ in several key aspects. First and foremost, CUTE is primarily designed for users to manually describe layouts, whereas linear layouts are integrated into a compiler. Second, the linear algebra framework of linear layouts enables compilers to generate efficient code for layout conversion and code lowering for many common operators, which is absent in CUTE. Third, swizzling is inherently defined within linear layouts, whereas in CUTE, it is treated as a separate step

The clever insight is that you can represent any of the layouts as a binary matrix over F₂, which means you can use XOR/AND for arithmetic. You can compose those binary matrices freely, and it’s also easy to replace the transform matrix with a new one for hardware that requires a different permutation.

To give a step-by-step example (as I’m not totally sure how well I grok this myself!) let’s say we are working on am MMA for a 16×8 tile:
- We start with our data, say in row major order (0,0), (0,1), …, (0,7), (1,0). Each value is stored in its own register
- We have 32 threads, each managing their own section of the block: in this case 4 registers
- So we have a hardware location for each value: the thread (0..31) and the register (0..3). You can imagine this as 7 bits of data, thread ID (5 bits), and register ID (2 bits)
- Equivalently we have imagine tracking the tensor location for each value: 4 bits for 0..15 rows, 3 bits for 0..7 columns
- We can have a map which translates between tensor location and hardware location: block location row 1 col 0 is in thread 2 register 0. This would be a 7 by 7 binary matrix
- We can define a matrix that transforms the hardware map to the one needed for our ldmatrix tensorcore call.
- For example, we might need thread 0 to manage tensor values (0,0), (4,0), (8,0), (12,0)
- If the mapping requires moving a value to a different register in the same thread we can use a prmt (permute) instruction
- If the mapping requires moving values between thread’s registers, we can use a warp shuffle like shfl.sync that allows swapping registers between threads without using shared memory¹
Triton has layouts for standard block level storage, and for MMAs and other operations. By multiplying through the required mappings it can automatically work out how best to optimize movement, versus the manual transforms you do in CuTe!

It also has versions of these mappings for different hardware, so for many operations only the layouts need to be swapped out when moving from Ampere to Hopper or Blackwell!
1. mostly. if there will be bank conflicts, it will spill to shared memory. ↩︎
June 11, 2025

links-and-recs
Monarch: PyTorch Single Controller
I’ve been excited for this to make it to OSS: The PyTorch team at Meta recently soft-launched Monarch on Github.

pytorch-labs/monarch: PyTorch Single Controller

Back in 2022, Google’s Pathways paper proposed (revisiting) a single-controller approach for managing machine learning runs. Typically, ML jobs use an SPMD (Single Program, Multiple Data) approach, distributing identical code across multiple hosts. Each host runs independently, synchronizing during collective operations. This works, as evidenced by the many large training runs in the world. It also introduces complexity, especially with pipeline parallelism where conditional logic for different ranks can clutter up your training code. Even without that, subtle issues can arise: for example, slight differences in torch.compile optimization have (in the past!) lead to deadlocks by placing collectives differently on separate nodes.

The single-controller model simplifies this by centralizing program execution on one main node and using generic workers on the hosts that execute assigned tasks. This provides a consistent, global view of the entire computation, making it easier to get to a correct implementation of parallelisms and other distributed work. This doesn’t come for free though: the main node must efficiently manage (potentially!) thousands of GPUs without becoming a bottleneck, and existing code must adapt to this new centralized model.

Monarch is the PyTorch team’s implementation of this single-controller concept. It provides a familiar PyTorch frontend, additional module wrappers, and a high-performance Rust-based actor system for distributing and managing work.

The fundamental abstraction in Monarch is the Actor. Each Actor executes on their own accelerator, maintains state and behavior. Communication with other Actors is via async message passing on methods decorated with @endpoint. The nice thing about the programming model is you can interact with all of the actors in your mesh in a consistent way.

Monarch is appealing even if you’re not GPU-rich. For instance, at home, I have a machine equipped with two (mismatched) 3090s, and Monarch allows me to run and debug jobs directly in notebooks without relying on external services.

Installation had minor hurdles because I built from source rather than using the available pip package. Although the README specifies Python 3.10, Python 3.13 worked fine. The dependencies reference dnf (reflecting Meta’s internal Linux distro choice), so adapting commands to other Linux distributions (Ubuntu in my case) was necessary. Additionally, I had to set BINDGEN_EXTRA_CLANG_ARGS="-I/usr/include/c++/11 -I/usr/include/x86_64-linux-gnu/c++/11" to resolve Rust compilation issues.

Once installed, running Monarch’s distributed data-parallel notebook was straightforward (see: monarch/examples/notebooks/spmd_ddp.ipynb). The notebook shows that minimal code changes to the standard DDP example are required, mainly subclassing Actor (e.g., class DDPActor(Actor)), while keeping the training loop virtually identical. Monarch handles the rest, including distributed execution and debugging across multiple GPUs.

Setting up the environment means providing the mesh configuration and launching the actors, which can be done from a cell:
```
# Spawn a process mesh
local_proc_mesh = await proc_mesh(
    gpus=WORLD_SIZE,
    env={
        "MASTER_ADDR": "localhost",
        "MASTER_PORT": "12455",
    },
)
# Spawn our actor mesh on top of the process mesh
ddp_actor = await local_proc_mesh.spawn("ddp_actor", DDPActor)
```
I didn’t have to manually start any other services; it all happened under the hood. Triggering the run is just:
```
await ddp_actor.demo_basic.call()
```
Which output:
```
self.rank=0 Running basic DDP example
self.rank=1 Running basic DDP example
self.rank=1 Finished running basic DDP example
self.rank=0 Finished running basic DDP example
```
What I find really appealing is how easy it is to execute across ranks. For example, to query for system info:
```
print("Gathering system info from all ranks...")
system_info = await ddp_actors.get_system_info.call()

print("\n SYSTEM INFORMATION ACROSS ALL RANKS:")
print("=" * 60)

for point, rank_info in system_info:
    print(f"Rank {rank_info['rank']}: PID={rank_info['process_id']}, "
          f"Device={rank_info['device_name']}, "
          f"GPU Memory={rank_info['gpu_memory_allocated']/1e6:.1f}MB")

print(f"\nFound {len(system_info)} ranks in the mesh")

all_rank_info = [value for point, value in system_info]
print(f"Total GPU memory across all ranks: {sum(info['gpu_memory_allocated'] for info in all_rank_info)/1e6:.1f}MB")
```
Outputting:
```
Gathering system info from all ranks...
[Rank 0] System Info: PID=10519, CPU=0.1%, RAM=5.2%, GPU_MEM=0.0MB
[Rank 1] System Info: PID=10520, CPU=0.1%, RAM=5.2%, GPU_MEM=0.0MB

 SYSTEM INFORMATION ACROSS ALL RANKS:
============================================================
Rank 0: PID=10519, Device=NVIDIA GeForce RTX 3090, GPU Memory=0.0MB
Rank 1: PID=10520, Device=NVIDIA GeForce RTX 3090, GPU Memory=0.0MB

Found 2 ranks in the mesh
Total GPU memory across all ranks: 0.1MB
```
I can also stop training and dump state if I need to , making it easy to check norms and debug:
```
print("Running training steps...")
for step in range(3):
    print(f"\n--- Step {step + 1} ---")
    
    step_results = await ddp_actors.train_step.call()
    
    all_results = [value for point, value in step_results]
    
    losses = [result['loss'] for result in all_results]
    grad_norms = [result['grad_norm'] for result in all_results]
    throughputs = [result['throughput'] for result in all_results]
    
    print(f"Losses across ranks: {[f'{l:.4f}' for l in losses]}")
    print(f"Gradient norms: {[f'{g:.4f}' for g in grad_norms]}")
    print(f"Avg throughput: {sum(throughputs)/len(throughputs):.1f} samples/sec")
```
```
--- Step 1 ---
[Rank 1] Step 1: Loss=1.1128, GradNorm=0.3198, Time=0.241s
[Rank 0] Step 1: Loss=1.0414, GradNorm=0.3198, Time=0.253s
Losses across ranks: ['1.0414', '1.1128']
Gradient norms: ['0.3198', '0.3198']
Avg throughput: 129.8 samples/sec

--- Step 2 ---
[Rank 0] Step 2: Loss=1.1526, GradNorm=0.3096, Time=0.003s
[Rank 1] Step 2: Loss=1.0546, GradNorm=0.3096, Time=0.003s
Losses across ranks: ['1.1526', '1.0546']
Gradient norms: ['0.3096', '0.3096']
Avg throughput: 9800.9 samples/sec

--- Step 3 ---
[Rank 1] Step 3: Loss=0.9116, GradNorm=0.2243, Time=0.002s
[Rank 0] Step 3: Loss=0.9662, GradNorm=0.2243, Time=0.002s
Losses across ranks: ['0.9662', '0.9116']
Gradient norms: ['0.2243', '0.2243']
Avg throughput: 19977.5 samples/sec
```
While the distributed stuff here is cool, it’s not wildly different than using a distributed framework like Ray and a little bit of setup (though I am pretty allergic to setup). What is most interesting is how this changes the programming model of PyTorch, and makes it really easy to build out distributed experiments.

For example, if I was building a param server the sync only requires an await’d read of the weights from another object, taking advantage of the RDMA support for an efficient cop¹y:
```
    @endpoint
    async def worker_sync_with_ps(self, param_server) -> bool:
        """Synchronize with parameter server and get RDMA handles"""
            
        self._log("Synchronizing with parameter server...")
        
        # Get RDMA buffer handles
        self.weight_buffers = await param_server.ps_get_weight_handles.call_one()
        self.gradient_buffers = await param_server.ps_get_gradient_handles.call_one()
        
        # Get metadata
        metadata = await param_server.ps_get_metadata.call_one()
        self.weight_metadata = metadata['weights']
        self.gradient_metadata = metadata['gradients']
        
        # Perform initial weight sync
        sync_time = await self._sync_weights_from_ps()
        
        self._log(f"Synchronized with parameter server (sync time: {sync_time:.3f}s)")
        return True
```
Getting those weight buffers is as simple as creating the right Monarch object:
```
def tensor_to_rdma_buffer(tensor: torch.Tensor) -> RDMABuffer:
    # RDMA requires 1D contiguous uint8 tensors
    byte_tensor = tensor.view(torch.uint8).flatten()
    return RDMABuffer(byte_tensor)
```
For an early preview of a library, Monarch is surprisingly complete, and definitely worth a look.
1. Not that this would do anything for my 3090s! ↩︎
June 10, 2025

ML Infrastructure
Toward a Theory of Tokenization in LLMs
[2404.08335] Toward a Theory of Tokenization in LLMs

Tokenization has always struck me as one of the odder aspects of natural language deep learning. Despite the extensive end-to-end learning processes we typically use, tokenization initially involves creating a dictionary of optimal sub-word segments from your dataset. One of the appealing concepts in the Byte Latent Transformers paper is the potential to learn tokenization dynamically, recognizing that tokenizers solve deeper problems than merely providing a fixed vocabulary.

This paper addresses tokenization from a theoretical perspective by modeling sequences using kth-order Markov processes, where the likelihood of each token depends on the preceding sequence, as in natural language. The parameter k corresponds to the model’s context window size. Key findings include:
1. Training without tokenization leads models to effectively behave as unigram predictors, significantly limiting performance.
2. Using a well-designed tokenizer (e.g., Byte Pair Encoding – BPE) enables models to achieve nearly optimal performance in capturing sequence dependencies.
3. Increasing the tokenizer’s dictionary size improves the model’s performance, moving it closer to the ideal probability distribution.
Tokenizers which do a good job at learning patterns in the data and assigning these frequent patterns as tokens in the dictionary are compatible with an i.i.d. model over tokens.

This insight suggests that despite the complexity of natural language’, a good tokenizer converts sequences into something approximating an independent and identically distributed (i.i.d.) format, which brings the modeling tasks for transformers closer to the one they can solve.

While the paper does not explicitly explore the Byte Latent approach, I wonder if its entropy-driven dynamic token allocation might similarly achieve this i.i.d. simplification. In BLT the entropy model, trained separately, could be dynamically transform inputs into a distribution that is more palatable for transformers.
June 10, 2025

links-and-recs
Analyzing Modern GPU Cores

[2503.20481] Analyzing Modern NVIDIA GPU cores

Filing this under interesting work I will probably never use. The authors try to construct a more accurate simulation of the Ampere (4090/A100 type GPUs) microarchitecture, backed by extensive testing on real hardware. It’s a good reminder that, in part because of how good some of the abstractions are, there is quite a lot about Nvidia GPUs that isn’t really known outside Nvidia. My main takeaway was that the compiler is very deeply coupled to the hardware performance: a Nvidia chip is not really a complete unit without taking in to account the software driving the performance, and recognizing that accounts for why Nvidia have done such a good job of building a solid stack with CUDA.

One of the things I found interesting was the use of a Stall counter: the compiler notes fixed latency instructions (which seem to be a preferred design choice) and adds a counter to the instructions control bits that specifies how many cycles the warp should wait before issuing the next instruction, and so other warps will be selected for execution. This means the hardware doesn’t have to dynamically check for data dependencies.

For example, an addition whose latency is four cycles and its
first consumer is the following instruction encodes a four in the Stall counter. Using the methodology explained in section 3, we have verified that if the Stall counter is not properly set, the result of the program is incorrect since the hardware does not check for RAW hazards, and simply relies on these compiler-set counters. In addition, this mechanism has benefits in terms of area and energy wiring. Keep in mind that wires from fixed-latency units to the dependence handling components are not needed, in contrast to a traditional scoreboard approach where they are required.

There are variable execution length instructions, like memory loads, and in that case they have a Dependence counter, which is decremented when data arrives.

In the vein of handing off to the compiler, the scheduler uses a Compiler Guided Greedy Then Youngest policy: it will keep issuing instructions from the same warp (greedy) with guidance from the Stall (and an explicit Yield bit) and otherwise will swithch to the youngest ready warp. Older GPUs (apparently!) used Greedy Then Oldest instead, which resulted in more often selecting a warp that was still stalled waiting for memory or similar, while the youngest more likely has useful work to do.

The scheduler starts issuing instructions from the youngest warp, which is W3, until it misses in the Icache.As a result of the miss, W3 does not have any valid instruction, so the scheduler switches to issue instructions from W2. W2 hits in the I-cache since it reuses the instructions brought by W3, and when it reaches the point where W3 missed, the miss has already been served, and all remaining instructions are found in the I-cache, so the scheduler greedily issues that warp until the end. Later, the scheduler proceeds to issue instruction from W3 (the youngest warp) until the end, since now all instructions are present in the I-cache.

Similarly, the paper points out that the instruction prefetch cache is a stream buffer (probably 16 instructions deep) rather than any kind of complex branch prediction logic, because we generally don’t do that kind of thing on GPUs!

a straightforward prefetcher, such as a stream buffer, behaves close to a perfect instruction cache in GPUs. This is because the different warps in each sub-core usually execute the same code region and the code of typical GPGPUs applications do not have a complex control flow, so prefetching 𝑁 subsequent ines usually performs well. Note that since GPUs do not predict branches, it is not worth implementing a Fetch Directed Instruction prefetcher [76] because it would require the addition of a branch predictor.

June 5, 2025

links-and-recs
Darwin Gödel Machines
https://open.substack.com/pub/gonzoml/p/darwin-godel-machine

Interesting paper breakdown on Gonzo ML of another evolutionary agent approach from the extended Sakanaverse.

It commences with an initial coding agent, constructed upon a frozen foundation model (FM) equipped with tool-use capabilities (e.g. running bash commands, editing files). In each cycle, “parent” agents are selected from the expanding archive. This selection process prioritizes agents based on a combination of their performance (assigning greater weight to higher scores, scaled by sigmoid) and a novelty bonus (inversely correlated with the number of offspring they have already produced, thereby encouraging exploration of less-frequented paths).

The actual foundation model is a frozen component, so much like alphaevolve this is a search set up on top of the model intelligence. The search is evolving the agent code itself to try and do better on benchmarks.

Qualitatively, the DGM learned to enhance its own tools and workflows. For instance, it developed more granular file editing capabilities (e.g., string replacement), improved long-context window management (e.g., auto-summarizing prior interactions), and refined its problem-solving strategies (e.g., making multiple attempts at a solution and using another FM to evaluate patches). These discovered improvements also demonstrated generalizability, transferring benefits across different underlying FMs and programming languages.

When it comes to coding agents I had been thinking there were three axes of performance, which gate the overall effectiveness, but the paper makes it clear there are at least 4:
1. The foundation model itself, with its base coding, tool use, reasoning abilities and context window size
2. The tools it has available – the more the tool is exposes underlying semantics the more the model can efficiently use it.
3. The UI, how the user interacts with the agent to direct it, provide clarity and review work.
4. The prompt, strategies for problem solving and how the context window is managed (eg when to summarize)
In this case the UI is held fixed (an outer eval loop), the model is fixed and the search explores tools and strategies. It seems at the very least a search across multiple different models as options might also work well!
June 1, 2025

links-and-recs
Keeping a GPU busy is a lot about tiling
File this under the “gross oversimplifications” category. The basic approach to keeping GPUs busy is dividing the work into tiles, smaller sub-problems that make up the larger result. For a GEMM you might break the matrix into 128×128 or 128×64 tiles and let each CUDA thread block (CTA) own one tile. The GPU has many streaming multiprocessors (an A100 has 108) and every SM picks up one CTA at a time. If you want to know how many SMs your own card has you can call:
```
props = torch.cuda.get_device_properties(0)
print(f"SMs: {props.multi_processor_count}")
```
Tiles are launched in waves. A full wave is the moment when every SM is busy with exactly one CTA. If the total number of tiles isn’t a multiple of the SM count, the final wave is only partly full and some SMs sit idle; Nvidia calls that wave quantization. There is a similar problem at the edge of the matrix: if the dimensions aren’t multiples of the tile size the right-most or bottom-most tiles are partly empty, wasting threads (tile quantization). Sometimes a smaller tile size (for example 64 × 64) gives higher overall throughput because it leaves less unused space at the edges.

The usual cure for poor wave utilization is a persistent kernel. Instead of launching one CTA per tile, you launch (roughly) one CTA per SM and have each CTA pull tiles from a global queue until the queue is empty. Because each CTA is pulls whenever ready, the SMs rarely go idle and the tail effect is reduced.

Inside an SM the main performance lever for GEMMs arethe Tensor Core, which execute matrix-multiply add (MMA) instructions efficiently. On Ampere you use WMMA instructions: one Warp (32 threads) computes a 16 × 16 fragment at a time. Hopper introduces WGMMA instructions where four warps acting in ia warp-group (128 threads) execute a larger matrix multiply (up to 64 × 64 for FP16/FP8). To issue WGMMA you must place the right-hand operand B in shared memory; A can sit in either registers or shared memory. The operation is asynchronous, so while a warp-group is processing one tile the same CTA can be pre-loading the next tile.

Blackwell pushes the idea further. A pair of CTAs on neighbouring SMs can cooperate in a pair unified MMA, letting two SMs’ tensor cores process an even larger tile.

To make that possible Hopper introduced thread-block clusters and Blackwell extends them. When you launch a kernel you can group CTAs into clusters such that the scheduler guarantees to place them on SMs inside the same GPC (GPU Processing Core), so they share a fast interconnect and can access shared memory across SMs. If the grid doesn’t divide cleanly into whole clusters you also lower efficiency on the tail (is this cluster quantization? stick with the trend Nvidia!) so Blackwell has a Cluster Launch Control that can shrink the last cluster to better fit the work.

Loading Data

All of this only works if data is present in shared memory. The first optimization is making sure (global) memory access is coalesced. A 32-thread warp can request 32-byte chunks , but the memory bandwidth for a single fetch from DRAM is wider. e.g. If four consecutive threads request address 1, 4, 8 and 12, the memory controller can coalesce these into a single 128-byte read. If the addresses are strided (e.g. hopping across rows) then only 32 bytes out of the 128 byte fetch capacity is loaded at a time, so the load takes longer. Getting this right is about ensuring the memory layout is set up for the kernel, and doing any transforms needed in shared memory before executing.

In older GPUs the warp had to wait on the copy operation. Ampere enabled cp.async plus non-blocking wait/arrive barriers so a warp can initiate a copy from global to shared memory and immediately continue with arithmetic. Hopper adds the Tensor Memory Accelerator: with TMA, a single thread in the CTA can describe a multidimensional block to copy and the TMA hardware streams it to shared memory while the threads do something else. Blackwell goes one step further and can multicast a single TMA load into every SM of a cluster, which is helpful when multiple CTAs are about to reuse the same B tile.

In practice you hide latency by organizing the main loop using so that it double buffers: while the tensor cores work on tile k the TMA or cp.async engine is fetching tile k + 1 into the other half of shared memory; then you swap buffers and repeat. As long as copy time and compute time overlap well, the tensor cores and the copy engines stay saturated.

Choosing the right tile size

Choosing the right tile size (often expressed in Triton as BLOCK_M × BLOCK_N) is a balance between each of these: enough threads to issue a warp-group MMA, small enough tiles that the matrix edges aren’t mostly padding, enough shared-memory space to double-buffer, and a grid size that fills whole waves or is run via a persistent kernel. Autotuning in Triton or CUTLASS can empirically test different options on the hardware, but it helps to have the right mental model about what sets of sizes they should consider. One good clue that you’re missing an option is when you see a sudden drop in achieved TFLOP/s for particular shape.

AMD

AMD’s MI300X hardware takes a somewhat different route. The GPU is divided into chiplets, where each chiplet has its own compute units and multiple schedulers that schedule wavefronts (AMD for warps, 64 threads rather than 32) independently, so the hardware load-balances multiple kernels by itself. Matrix instructions run at the wavefront level; there is no cross-CU equivalent to WGMMA. Latency hiding relies on launching a large grid of workgroups and letting the hardware interleave them, rather than on explicitly scheduling async copies. On AMD the guidance is to mostly focus on high occupancy and coalesced memory access, whereas on NVIDIA there is value in crafting (by hand or compiler) the copy–compute pipeline.
May 30, 2025

note to self
Metrics for Engineering Teams

Don’t blindly tie every piece of work to top-level metrics. Even if technically feasible, the cost is too high and the risk of spurious logic chains significant.

Start with Value Definition

Begin each project with a crisp definition of why it’s worth doing. What underlying problem are you solving, and why is that problem worth solving? Once you have these narrative assertions, it’s usually clear how extraordinary or controversial each claim is.

The more notable the claim, the more likely you need data to support it.

Value Metrics

1. Direct outcome metrics (strongest)
We will run an ongoing experiment measuring profit per user with the feature on vs baseline.

2. Strong correlative metrics
This will increase time on site, which we can measure and believe correlates with profit per user.

3. Anecdotes and feedback
N sales team members report they could sell into more accounts if we launch this feature.

4. Strategic assertions
We must do this because of upcoming regulation or we will be unable to continue this business line.

Progress Measurement

Once your value claim is clear and defensible, identify how you’ll measure progress. This may differ from your value metric. Ideal progress metrics tell you whether you’re succeeding, respond quickly to team actions, and have strong reference baselines.

1. Clear baseline, goal, and team-tied metric (Strongest)
Launching this compressor will reduce binary download size by an estimated 10% vs the best available industry baseline. We can measure relative progress continually against our production binary during development.

2. Responsive metric without clear reference point
We can improve compile times on this fixed codebase from today’s 90s baseline.

3. Non-responsive metric
We can measure weekly mobile app release binary size, comparing the new compressor to our old implementation.

4. Milestones
We will implement passes a, b, c, after which we can ship the new compiler optimizations to target customers.

Common Challenges

Stronger measures aren’t always a worthwhile tradeoff. If you have high confidence in the work’s value and applicability and mainly need to validate progress, milestones can be completely reasonable.

In general, approach projects with skepticism about whether this is the right thing to do and whether you’ll make good progress. Then identify ways to get concrete data rather than rely solely on leadership support.

Leadership pressure for top-level metrics
The clearer you are on why you’re doing something, the easier it becomes to communicate your measurement decisions. If leadership continues pushing back and you have a good relationship, use that as a lens to explore concerns you might have missed. Often requests for metric clarity stem from deeper worries about project value or plausibility.

Team dynamics and gaming
Metrics communicate value in performance reviews, creating incentives for engineers to identify unnecessary correlations or game metrics (intentionally or not). Counter this with “health” metrics that balance negative behaviors — if measuring deployment frequency, also measure production incidents with a flat target to prevent trading off reliability for speed.

For senior engineers concerned about optics, have them clearly articulate the value chain. Working through and demonstrating strong data usage in project steering is itself a highly rewarded skill — encourage them to take on that role.

When to revisit metrics
The world changes. What’s sensible in one environment isn’t in another. Constantly relitigating is a headache, but reevaluating logic at regular intervals (say, each half-year planning cycle) is appropriate. Otherwise, maintain awareness of company trends: new projects, initiatives, or teams gaining significant attention. If they were successful, would that change what you do? There’s no hard rule: it’s corporate decision-making taste that develops with experience.

May 28, 2025

Economics

Blog

Getting Cute: Transpose

What just happened?

Layouts and Memory

HGEMMony2

Debugging Tips

Loading Data

Choosing the right tile size

AMD

Start with Value Definition

Value Metrics

Progress Measurement

Common Challenges

HGEMMony²