Blog

  • Constraints & Orchestrators

    I recently read a few posts that helped connect the dots on why Python is a) so successful as the lingua franca of ML b) also seems likely to be successful in the future1.

    ML code reads like one program, but runs many: CUDA kernels, vectorized CPU loops, graph compilers and a bunch of glue code moving data around and tying things together. Python has continually improved at balancing two somewhat competing challenges: constraining the hot path so compilers can optimize it and structuring an orchestration path so humans can reason about it.

    Hot Path

    constrained languages are easier to optimize by Jynn Nelson touches on this:

    we should not be asking “what language can i use everywhere for every purpose”; we should build meta-languages that allow you to easily use the right tool for the job. this is already true for regular expressions and query languages; let’s go further. i want inline futhark; inline CSS selectors; inline datalog; ffi between python and C that’s trivially easy. the easier we make it to interop, the easier it becomes to pick the right tool for the job.

    Compilers are generally going to perform better if you have regular shapes, minimal side effects, predictable memory access and so on, but you want languages to be expressive and flexible, particularly when “research” is a big part of the work. In practice, that’s precisely what happens with ML : torch.compile lowers PyTorch graphs to an IR and (often) emits Triton kernels. Being able to hand off inner-loops to specialized languages allows compilers and runtimes to optimize and target the use cases they are best at.

    While this is (somewhat) clear for GPUs or other accelerators with distinctive programming models, I think it’s also largely true for getting the best out of modern CPUs. Daniel Lemire’s SEA 2025 talk covers nearly a decade of performance work and sums it up: modern CPUs do nearly as many instructions per cycle as you can feed them. To really maximize performance you need to batch work, reduce instruction counts and vectorize. We can do some of that in the general Python2 runtime but dynamic dispatch, aliasing and side effects all make the job a lot harder. We can add speculative guards, which can be hard to reason about, or give up and lose performance. By having DSLs3 that add additional constraints we can give ourselves the ability to get much, much higher performance without scrificing the overall flow of our program.

    Orchestration Path

    Python is unusually good as an orchestrator. From a readability perspective the language is baseline very readable and as long as libraries and DSLs stay Pythonic they tend to inherit that intelligibility. The challenge with orchestration is coordinating work in such a way that your most precious resources are well utilized. The investments in Free-Threaded Python make it a lot cheaper to do concurrency, but they don’t magically fix the challenge of coordination.

    asyncio: a library with too many sharp corners covers some of the many failure modes the community have encountered with asyncio, and makes a case for Trio or ANyIO style structured concurrency that allows for manageable failure modes.

    asyncio is not a good library. It is constantly full of sharp edges everywhere with implementation details leaking and poorly designed APIs forcing end users into odd code patterns to avoid fundamental flaws in the interfaces.

    This is very much a readability version of the constraints concern on the hot path. Threads are a bad app abstraction over shared mutable state, reasoning about races and cancellation is hard, and primitives are always leaky. But threads are a perfectly fine implementation detail behind a more constrained API, like task groups, or actors, or so on.

    One area that I do think needs sustained improvement is how we debug and trace across this kind of set up: it’s been challenging even in a controlled environment to really understand how all the pieces interact in a reasonably scaled ML workload, and I imagine that problem will only get worse. But I also expect that the flexibility and breadth of Python will end up a boon there as well.

    1. Beyond just sheer momentum, of course. ↩︎
    2. Or any language! Certainly for some optimizations having a JIT for Python would (and does) make life easier. ↩︎
    3. Whether that is an embedded JIT like Triton or a library+execution engine like Polars. ↩︎
  • Overthinking Everything

    Yann was taking laps on Threads a few weeks back over a recent paper, which was one of several recently that have explored aspects of how autoregressive models do as the amount of information they are dealing with gets longer. His general complaint is that each token they generate can either push them towards the right answer or further away from it, and that the models are inherently bad at recovering if they end up too far outside the correct trajectory.

    This “more might be worse” idea shows up anywhere folks are leveraging large context windows, and one of those1 is in agentic tasks. This post summarizes some research trying to measure the fall-off in chances of succeeding as task length2 increases.

    Is there a Half-Life for the Success Rates of AI Agents? — Toby Ord

    It provides indirect evidence that what really is going on under the hood is that tasks are made up of many sequential subtasks and the chance of succeeding at the whole requires succeeding at every individual component. Moreover, this suggests that the current AI agents are not very good at recovering from earlier mistakes.

    The framing they use is a constant hazard rate: each subtask is another roll of the dice, and if you roll a failure you don’t have much chance of recovering. So more (or longer) is pretty much always worse.

    One interesting aspect is that they also investigate the human failure rate, which increases over time, but much more slowly:

    This could indicate a different scaling behaviour of success rate with time horizon for humans compared to AI agents, which would be well worth investigating and may suggest important underlying mechanisms (e.g. that the humans were better at correcting earlier failed subtasks). If human performance scales differently with task length than AI agent performance, that would be an important result, suggesting that there is a notable inefficiency in the current AI paradigm.

    They’re testing with multiple runs, so these aren’t just models hitting problems they can’t do: its models hitting problems they can’t do given the specific tokens they have generated tried before.

    Agentic use cases aren’t the only situation where a model is generating responses that add to its context window. There were a lot of early observations after the release of O1 last year that thinking for longer on easy problems does not add value. This recent paper proposes not only that but suggests that there is an inverse scaling law: more time thinking makes the model worse.

    [2507.14417] Inverse Scaling in Test-Time Compute

    More specifically, they devised some stress tests: things like counting problems in the presence of distracting information, performing a regression where there is easy-to-understand but spurious data, and so on. The performance drops as the trace length increases. Different models are more susceptible to some failure modes than other, but performance consistently drops:

    Our experiments reveal distinct failure modes across model families—Claude models are particularly vulnerable to distraction from irrelevant information, while OpenAI o-series models show greater resistance but overfit to familiar problem framings. Extended reasoning amplifies different weaknesses: models overthink simple problems, shift attention to spurious correlations, and lose focus during Deduction tasks with constraint tracking.

    In contrast, Chroma’s recent Technical Report investigates how models do on single prompts, but of increasingly long contexts.

    Context Rot: How Increasing Input Tokens Impacts LLM Performance | Chroma Research

    Unlike in the agentic case, here all of the context is passed in at once, so the model isn’t poisoning its own context window through bad choices. It is still dealing with a large amount of content where it needs to choose which parts to attend to. Traditionally the test of long context has been needle-in-a-haystack evaluations: a relevant fact is hidden at different points in a long prompt and the test evaluates whether the model can effectively pull it out.

    The Chroma folks make the test a lot more nuanced — adding distractors3 and irrelevant content in both the broader context and the question. They find that performance consistently degrades as context increases.

    More broadly, our findings point to the importance of context engineering: the careful construction and management of a model’s context window. Where and how information is presented in a model’s context strongly influences task performance

    All of these papers rhyme with LeCun’s gripe about autoregressive transformers, which is (roughly!) that they (also) have a constant hazard rate on generating the “right” token.

    This is a very active area of research though. Process-based rewards in RL training make updates on each step vs only at the end. Multi-token prediction reduces the effective generation length or number of chances of misprediction. Summarizing context effectively compresses existing tokens, also reducing error rate.

    Similarly, if you have good verifiers4 you can use beam or tree searches to explore multiple reasoning paths during generation , which can reduce the error rate, at the cost of more compute.

    The closest (LLMish) techniques to LeCun’s vision are things like the recent Hierarchical Reasoning Model that has a layer of persisting hidden state, but it’s still pretty experimental!

    As agentic and reasoning traces get longer, I’m sure we’ll see more entries documenting failure modes, and proposals for techniques to scale around them.

    1. And the one being referenced in the post! ↩︎
    2. In time — they characterize tasks based on how long it takes humans to do them, which is a good control factor ↩︎
    3. As in additional content related to the question, but that doesn’t give the answer. ↩︎
    4. Similar to process-based rewards this is somewhat pushing the problem to the ability to judge how well you are doing during the generation ↩︎
  • The Tools Are Made Up

    The Tools Are Made Up

    It has been hard to keep up with the flurry of strong agentic open-source models coming out of Chinese labs recently, including Moonshot’s Kimi K2, Z.ai’s GLM 4.5, and Qwen3-Coder1.

    Each of them have the mix of clever pre-training recipes and verifiable-rewards post-training. Notably, Kimi and GLM both use the Muon optimizer, which seems to be gaining ground among the OSS labs at least. GLM’s description of the recipe is as follows:

    Our base model undergoes several training stages. During pre-training, the model is first trained on 15T tokens of a general pre-training corpus, followed by 7T tokens of a code & reasoning corpus. After pre-training, we introduce additional stages to further enhance the model’s performance on key downstream domains. Unlike the earlier pre-training stage on large-scale universal documents, these stages leverage medium-sized domain-specific datasets, including instruction data.

    The additional stages, which they refer to as mid-training, extend the context window and help grow capabilities in specific domains. They then move to post-training, with SFT over reasoning and agentic traces followed by RL with Verified Rewards2.

    The Kimi-K2 technical report goes into more details about how to actually train for tool use. Unlike the others, Kimi is not a reasoning model so doesn’t use much in the way of extended thinking. The fact that wasn’t required to get to strong levels of tool use/agentic capability feels pretty notable to me — most of the recent3 agentic models have been built on a reasoning foundation.

    What I really found interesting from the Kimi report was the level of synthetic data that the team used. This starts in pretraining: to extend high quality data sources they rewrite it with another LLM, giving the same facts with new phrasing, instead of looping over the same “good” data for multiple epochs.

    Their approach to tool training takes this kind of idea ever further:

    We construct a comprehensive tool repository through two complementary approaches. First, we directly fetch over 3,000 real MCP (Model Context Protocol) tools from GitHub repositories, leveraging existing high-quality tool specifications. Second, we systematically evolve 82 synthetic tools through a hierarchical domain generation process. We begin with key categories (e.g., financial trading, software applications, robot control), then evolve multiple specific application domains within each category. Specialized tools are then synthesized for each domain, with clear interfaces, descriptions, and operational semantics. This evolution process produces over 20,000 synthetic tools.

    They analyze a set of real tools, generate some novel (but derivative) ones, then domain-specialize them for a lot of use cases.

    Once they have this tool zoo, the actual training loop involves:

    1. Randomly sample a subset of tools and give it to a new agent with a fresh system prompt. Generate tool-appropriate tasks with explicit success rubrics.
    2. Run an LLM-driven user simulator to drive the agent, while running the tools in sandbox that keeps state.
    3. Filter trajectories using another LLM as judge to keep only successful ones for SFT

    They’re using models at every stage to generate data and evaluate options. When it comes to the actual RL training, they are baselining in verifiable rewards wherever possible for the RL stages: They, and the Qwen folks, talk about their simulator set up for code4: thousands of sandbox environments.

    For software engineering tasks, we collect a vast amount of pull requests and issues from GitHub to build software
    development environment that consists of user prompts/issues and executable unit tests. This environment was built on a robust sandbox infrastructure, powered by Kubernetes for scalability and security. It supports over 10,000 concurrent sandbox instances with stable performance, making it ideal for both competitive coding and software engineering tasks

    The combination of very sophisticated synthetic data and operationally intense sandboxes seem like table stakes for the current agentic game, and one which a lot of labs have figured out. Feels very promising for a growth in capabilities of these models over time, particularly as we work out how best to distill them down to smaller sizes for inference.

    1. Which seems a very solid model, but they haven’t released a lot of extra details about how they got there. One interesting component of the release though was that they forked Gemini CLI to make a qwen-code tool that works with any OpenAI compatible API, and I had some success locally plugging it into the smaller Qwen3 (non-coder) releases in case you were looking for some offline agentic capabilities! ↩︎
    2. Then GLM is distilled between the RL and base version of the model, which apparently helps generalize. This seems like a fun and relatively simple way of smoothing out the learning. ↩︎
    3. Though Claude 3.5 wasn’t, and that is really the trend-setter here I guess! ↩︎
    4. And other tasks that allow fully verifiable rewards. They use other models to score softer domains like creative writing. ↩︎

  • PyTorch Conference 2025

    The schedule is up for the 2025 edition of the PyTorch conference, which is now at the Moscone West in San Francisco.

    https://events.linuxfoundation.org/pytorch-conference/program/schedule/

    There are a lot of great sessions, but I’ll highlight some I personally find particularly interesting:

    Post-Training: Clearly a big theme this year, with some interesting talks from multiple groups:

    General Training

    Kernel development

    Compilers

    Inference

    I’m looking forward to October!

  • RL in the second half

    The Second Half – Shunyu Yao – 姚顺雨

    Extremely interesting post by Shunyu Yao of ReAct and Tree of Thought fame about where we got to with AI and where we are going. Read it for the spot-on description of the weirdnes of reasoning as an RL concept, but my main takeaway was the refinement to the idea that the most important thing is having models that “do the benchmarks”

     To recap the game of the first half:

    • We develop novel training methods or models that hillclimb benchmarks.
    • We create harder benchmarks and continue the loop.

    This game is being ruined because:

    Even if we create harder benchmarks, pretty soon (and increasingly soon) they get solved by the recipe. […]

    The recipe has essentially standardized and industried benchmark hillclimbing without requiring much more new ideas. As the recipe scales and generalizes well, your novel method for a particular task might improve it by 5%, while the next o-series model improve it by 30% without explicitly targeting it.

    The post makes the point that the gap is benchmarks that more closely map to real-world problems.

    when the intelligence is low, improving intelligence generally improves utility. But now, the general recipe is guaranteed to work under these assumptions. So the way to play the new game of the second half is

    • We develop novel evaluation setups or tasks for real-world utility.
    • We solve them with the recipe or augment the recipe with novel components. Continue the loop.

    Shunyu works on computer use at OpenAI, so this is well within his wheelhouse, and I think it’s a compelling claim. Many folks1 have talked about the capability overhang LLMs: there is a large baseline ability to do things in the models, but eliciting that ability can be challenging. I tend to think of that similarly to how that are many words which we can comfortably understand, but are very unlikely to use ourselves in conversation2. RL is our most powerful tool for eliciting capabilities, but it’s a somewhat blunt instrument. Having the right evals, eval environment and tasks helps train the agent to interact in a way which generalizes.

    I wonder if, as we progress through this second phase, that we will find signs of a similar “universal geometry” as has been suggested for pretraining in this elicitation: perhaps there is eventually a universal navigation3 towards where to generate in that space for different tasks. Maybe that’s what we’ll call AGI!

    1. Jack Clark particularly ↩︎
    2. Receptive vocabulary vs. productive vocabulary. ↩︎
    3. A universal geometry of a vector field? ↩︎
  • Quack CuteDSL Kernels

    Dao-AILab/quack: A Quirky Assortment of CuTe Kernels

    Tri Dao & co have a fun repo up called Quack: A Quirky Assortment of CuTe Kernels, all leveraging the CuTe-DSL. These are hopper and blackwell oriented kernels for a variety of common needs like softmax, layernorm and RMSNorm.

    On top of that, they wrote a post on how to get speed of light (memory bound) kernels in CuTe-DSL. It goes through how to implement a reduction op across multiple tiers of memory using TensorSSA for thread level reductions, warp reduction with shuffle_sync_bfly and block reduction with shared memory. Even if you’re not writing CuTe, this is about as good an introduction to architecting memory bound ops as I have seen!

    They also cover clustered reduction, leveraging multiple SMs:

    In cluster reduction, we first send the current warp’s reduced value to all the peer thread block’s reduction buffer in peer’s SMEM. Such sending is conducted via a dedicated SM-to-SM fabric (as DSMEM). Then each warp fetches all warp’s values from their local reduction buffer, and reduces these values.

    This does seem to help the kernels scale well to larger sizes:

    We believe our outstanding performance at >= 65k input is due to our successful utilization of cluster reduction in H100. When the size of inputs are ultra long and depleting the SM’s registers and shared memory, without cluster reduction, we would have to switch to an online algorithm (like online softmax) otherwise we may get a massive register spilling that leads to significant throughput degradation.

    I also really appreciate this note of reality in their conclusion:

    Hitting “speed-of-light” model memory throughput confirms that a carefully hand-crafted CuTe kernel can squeeze every byte across all memory hierarchies in the hardware. But that efficiency comes at the price of per-operator and even per input-shape tuning, which imposes a natural tradeoff between efficiency and development efforts

  • Reinforcement Learning Continues To Be The Frontier

    Back in 2021, OpenAI nixed its robotics team, leading to comments on Hacker News like “Reinforcement learning itself is a dead-end on a road to AI”. Now, in 2025 we are surrounded by RL post-trained reasoning models and Mary Meeker is using the word “unprecedented” a lot. This kind of skepticism/hype overlap is very common right now, as Helen Toner breaks down in her excellent recent post/talk on unresolved questions in AI:

    Last year, we had coverage from the Wall Street Journal—really good reporting—about real challenges inside OpenAI with scaling up their pre-trained models and how difficult that was and how they weren’t happy with the results, and then on the literal same day we had the release of o3, the next generation of their reasoning model, and François Chollet—who’s famously skeptical—saying that it was a significant breakthrough on his ARC-AGI benchmark. So these very contradictory takes, both of which had some truth to them.

    The framing used in that post is really useful: it’s less about “are we making progress?” and more “are we on the right branch of the tech tree?”

    A lot of people thought RL was the wrong branch: after notable successes from DeepMind and OpenAI, RL had become a bit of a backwater, with some resurgence (in a limited form) from Reinforcement Learning with Human Feedback (RLHF) for preference tuning LLMs.

    The reason people keep coming back to reinforcement learning is the ability to discover new things. Supervised learning is somewhat inherently bound by the dataset. A reinforcement process can continue to explore and find new strategies, like the famous examples of AlphaGo choosing moves humans wouldn’t have. Tim Lee has an excellent non-technical introduction to the evolution of RL that mentions this: Reinforcement Learning Explained

    In short, imitation learning can rapidly teach a model to mimic the behaviors in its training data, but the model will easily get confused in unfamiliar environments. A model trained with reinforcement learning has a better chance of learning general principles that will be relevant in new and unfamiliar situations

    In this direction, a recent paper, [2507.00432] Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning, suggests1 that reasoning generalizes better from RL-driven learning than supervised fine-tuning.

    RL-tuned models achieve significant gains on math reasoning while preserving positive transfer to other reasoning tasks and non-reasoning tasks, whereas SFT often incurs negative transfer on non-reasoning benchmarks. Second, PCA analysis of latent space confirms that RL induces minimal drift from backbone representations thus maintaining feature stability, while SFT produces larger latent shifts, especially in non-reasoning domains. Third, token-distribution analysis shows that RL selectively adjusts only a handful of task-relevant tokens, whereas SFT perturbs many irrelevant tokens, indicating RL’s more targeted optimization.

    RLHF is implemented by first training a reward model based on human preference feedback: you give people two versions of an answer, they tell you which one they prefer, you then train a model to predict those ratings. That reward model becomes the scoring function during post-training.

    Designing good reward functions has been somewhat of a dark art in RL. The agent optimizes what you ask for, which is not always what you really want2. This “reward hacking” phenomenon makes RL agents somewhat brittle, prone to exploiting loopholes in environments in ways no one anticipated.

    The recent reasoning models did so well because their rewards were verifiable: reward scores that are based on some ground truth validation and are often just yes/no: does code compile, does it pass a unit test, can a math proof be verified by a formal logic reasoner, or simply is the answer correct or not. Nathan Lambert did a breakdown on where RL goes next:

    The optimistic case for scaling current reinforcement learning with verifiable rewards (RLVR) techniques to next-generation language models, and maybe AGI or ASI depending on your religion, rests entirely on RL being able to learn on ever harder tasks. Where current methods are generating 10K-100K tokens per answer for math or code problems during training, the sort of problems people discuss applying next generation RL training to would be 1M-100M tokens per answer.

    Lambert makes the point that even the very long-range tasks we have now (coding agents, deep research) are based around learning to be better at tasks individually, then stringing those together:

    How to read this training method, which is likely similar for agents like Claude Code or Codex, is that current RL methods are helping the models get more robust at individual tasks that make up a longer trajectory rather than being trained on the end result of the trajectory itself. The final long-horizon behavior is put together with prompting and letting the model run longer, not sparse credit assignment. In the case of Deep Research the final measure of performance would actually look far closer to human preferences than verifiable rewards, and a large portion of that applies for Claude Code as well, where multiple solutions could solve a problem and it falls to human taste to say which is the best.

    This problem of having to learn to act over a long time-horizon is a recurring one in RL. The best algorithms we have for reinforcement learning are online: the model learns “live” while interacting with the environment. But sometimes it’s a lot easier to collect data than it is to run an experiment: for example, it’s much safer to get a large amount of sensor input from driving cars around than it is to have a model driving a real car around and making mistakes. This is off-policy or offline RL, and it offers the promise of learning from much larger data sets.

    Seohong Park recently wrote a great post breaking down how offline RL fails to scale up: Q-Learning Is Not Yet Scalable3. In the experiment there the team at Berkeley generate 1000x more data to try and scale offline RL, and still see the process breaking down:

    Q-learning struggles to scale because the prediction targets are biased, and these biases accumulate over the horizon. The presence of bias accumulation is a fundamental limitation that is unique to Q-learning (TD learning). For example, there are no biases in prediction targets in other scalable objectives (e.g., next-token prediction, denoising diffusion, contrastive learning, etc.) or at least these biases do not accumulate over the horizon (e.g., BYOL, DINO, etc.).

    Noted LLM-branch skeptic (and technically a very distant colleague) Yann LeCun has spoken a lot about a version of this kind of planning and world modelling problem, which he sees as inherent to the autoregressive nature of LLMs: the accumulation of errors over long time horizons.

    One of his architectural bets is JEPA, and the recently released V-JEPA 2 paper is beginning to show how this could work. V-JEPA 2 is a self-supervised video world model trained on a million hours of YouTube video. The model learns in a semi-supervised fashion by masking out parts of video frames and predicting them, in latent (embedding) space rather than pixel space. After the pre-training, they freeze the encoder, generate tokens with it for a video and prepend those to a query for a pretrained LLM4 .They fine-tune that LLM on video question answering data, and were able to get state of the art question answering with that set up, despite the JEPA part of it being totally task agnostic.

    Going a step further, they took the encoder and hooked it up to a small robot control model5. They trained it on some robot operator data for pick-and-place tasks. It learned to do a remarkably good job, without any reinforcement learning at all!

    This is interesting because robotics has traditionally been an area where we have seen a lot of exploration (with success and disappointment!) with long-range RL. Andrew Stokols’ excellent post on ChinaTalk makes a good case that while the west has focused on AI in a brain-in-a-jar type way, there has been a concerted push in Beijing for Embodied AI (with Chinese Characteristics). China has a very strong base in manufacturing. Robotics, drones, autonomous vehicles are all being developed and deployed in the country.

    One of the fundamental challenges robotics systems have to address is much more constrained latency bounds: the world operates in real time, and running a big model may result in a smart robot that simply cannot respond quickly enough to be useful. The space has trended towards hierarchical models, which chunk actions into higher level concepts that a controller model puts out (like “pick up at x”) and lower-level models that decode those chunked outputs into a series of motor commands. While sometimes transformers are used autoregressively (take sequences of state, action and predict next action), many now use diffusion-based techniques where they will generate a whole trajectory at once. Physical Intelligence recently put out a paper on Real Time Chunking where they show you can start with generating a chunk, then continue the denoising process a-la inpainting or fill-in-the-middle to generate the steps between the start and goal, allowing more real time responses.

    China putting a lot of eggs in the embodied AI basket is indirectly also betting that methods to make those systems learn and adapt will mature. Some of those same techniques will invariably apply to the (disembodied) agents that are currently the focus on big labs in the west.

    1. One of the ways they corroborate this finding is by seeing there is less KL divergence in the RL trained model than the SFT model, but that’s usually a training objective on RL, and I’d imagine you could apply KL regularization to SFT as well if you wanted. ↩︎
    2. A classic example from OpenAI: A reinforcement learning agent in a boat race game was given points for hitting targets, so it happily learned to drive in circles hitting the same targets forever, instead of actually finishing the race. Faulty reward functions in the wild | OpenAI ↩︎
    3. Q-Learning is the most common class of algorithms for offline RL. ↩︎
    4. They unsquash it into the hidden dimension size, and depending on how the numbers work out add some pooling. ↩︎
    5. Much like with the LLM, they combine the video embeddings with model-specific tokens, in this case a state tracking input and the current state of the robot arm. ↩︎
  • ARPA and predicting the future

    Statecraft recently re-ran an interview from 2023 with Jason Matheny, formerly of IARPA: https://www.statecraft.pub/p/how-to-predict-the-future-278

    While defense policy and research is a ways outside the scope for myself (or I imagine most folks reading), the problems of managing or working on uncertain, research-y projects in a volatile environment are pretty relatable:

    Most of what we know from cognitive psychology and human judgment research over the last 50 years suggests that unstructured group deliberation might be one of the worst ways of making judgments, yet it’s the norm in most institutions.

    Or this bit of career wisdom:

    In general, people underestimate their own potential to make contributions to the most important problems. They overestimate how many people are already working on the most important problems. So many incredibly important problems are just really neglected. If you can’t figure out who’s working on something after a few days of homework, then it is probably a neglected problem. And it’s probably up to us to solve it.

    Jason talks about looking for projects in the goldilocks zones of probability (less than 50%, more than 5%) that open up interesting opportunities. I worked with a manager who was a strong advocate of the Heilmeier Catechism  to evaluate projects, and have seen the value of using it as guidance when presenting and evaluating ideas:

    1. What are you trying to do? Articulate your objectives using absolutely no jargon.
    2. How is it done today, and what are the limits of current practice?
    3. What is new in your approach and why do you think it will be successful?
    4. Who cares? If you are successful, what difference will it make?
    5. What are the risks?
    6. How much will it cost?
    7. How long will it take?
    8. What are the mid-term and final “exams” to check for success? 

    Jason adds some interesting updates:

    For instance, the Heilmeier questions don’t have a question about counterfactual impact: “Would this work get done otherwise?” The office tends not to rigorously assess the other funding streams going to solve this particular problem, and their likelihoods of success.

    We also tend not to think much about strategic move and countermove. […]. It probably is prudent to assign at least a 10% probability to some exquisite, classified technology being stolen.

    One thing I found myself talking about this week with a couple of folks was how good people get “lucky” a lot. I think these kinds of questions help navigate towards those more positive-surprise-filled spaces.

  • Cute-DSL

    In May Nvidia shipped CuTe‑DSL, the Python library they teased at GTC earlier in the year that mirrors CUTLASS’s C++ tensor‑layout . Then, at the start of June, the ‑dev label disappeared (so presumably its production ready now). The pitch is simple: Write speed‑of‑light kernels from the comfort of Python.

    Of course, nothing about CUDA is ever really simple. CuTe‑DSL gives the full Cutlass experience1, wrapped in an ever so slightly more approachable interface.

    Getting Cute: Transpose

    Matrix transpose felt like a reasonable ‘hello world’ : (B[j,i] = A[i,j]). The PyTorch version is simple: torch.transpose(input, 0, 1).

    To get a baseline, here is a simple transpose kernel in Triton. We tl.load, flip the coordinates and tl.store it back.

    @triton.jit
    def triton_transpose_kernel(input_ptr, output_ptr, M, N, BLOCK_SIZE: tl.constexpr):
        # 2D block coordinates
        pid_m = tl.program_id(0)
        pid_n = tl.program_id(1)
        
        # Calculate offsets
        offs_m = pid_m * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
        offs_n = pid_n * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
        
        # Load with masking
        mask = (offs_m[:, None] < M) & (offs_n[None, :] < N)
        a = tl.load(input_ptr + offs_m[:, None] * N + offs_n[None, :], mask=mask)
        
        # Store transposed (swap coordinates)
        tl.store(output_ptr + offs_n[:, None] * M + offs_m[None, :], a, mask=mask)

    Here’s the same idea in CuTe‑DSL. CuTe leverages a decorator and Pythons ability to integrate with JITs. Anything decorated with @jit runs host-side (on the CPU), while @kernel is used for device side (on the GPU). There are both AST based and tracing based options dependening on the presence of dynamic shapes or control flow.

        @cute.kernel
        def transpose_kernel(self, mA: cute.Tensor, mB: cute.Tensor):
            tidx = cute.arch.thread_idx()[0]
            tidy = cute.arch.thread_idx()[1]
            bidx = cute.arch.block_idx()[0]
            bidy = cute.arch.block_idx()[1]
            # This might all be unnecessary
            # but I was fearful of the compiler
            tile_start_m = cutlass.Int32(0)
            tile_start_n = cutlass.Int32(0)
            global_m = cutlass.Int32(0)
            global_n = cutlass.Int32(0)
            M = cutlass.Int32(0)
            N = cutlass.Int32(0)
            val = cutlass.Float32(0.0)
            # Calculate tile starting positions
            tile_start_m = bidy * self._tile_size
            tile_start_n = bidx * self._tile_size
            # Calculate global coordinates for this thread
            global_m = tile_start_m + tidy
            global_n = tile_start_n + tidx
            # Get matrix dimensions at runtime
            M = mA.shape[0]
            N = mA.shape[1]
            # Bounds checking and transpose operation
            if global_m < M and global_n < N:
                val = mA[global_m, global_n]
                # Transpose: B[n, m] = A[m, n]
                mB[global_n, global_m] = val

    What just happened?

    • Thread and block indices come straight from CUDA (thread_idx, block_idx), vs the Triton block abstraction
    • No explicit loads or stores: CuTe uses overloaded [] to generate them.

    Launching isn’t a million miles away from Triton:

    @cute.jit   # host side
    def launch(self, A: cute.Tensor, B: cute.Tensor):
        M, N = A.shape
        grid = ((N + self.T - 1)//self.T,
                (M + self.T - 1)//self.T, 1)
        self.transpose_kernel(A, B).launch(
            grid=grid,
            block=[self.T, self.T, 1],
        )

    Because CuTe‑DSL speaks DLPack, you can hand it a PyTorch tensor directly. If you wanted to cache the conversion, it looks like this:

    A_cute = from_dlpack(A).mark_layout_dynamic()

    The mark_layout_dynamic is used to trigger the dynamic shape support, and avoid shape specialization. The one place where this went a bit funky in my testing was dealing with singular leading dimensions: there you need to be more explicit about the shape to satisfy the compiler.

    Layouts and Memory

    This kernel isn’t really leveraging the fundamental value of CuTe though, which is composable tensor layouts and memory management. CuTe‑DSL exposes the full memory hierarchy: global, shared, register (and tmem for those with blackwells), and lets you tile, copy, and pipeline data between them. Common primitives:

    • make_layout / make_layout_tv: describe how a tensor is laid out.
    • cute.zipped_divide(tensor, tiler): tile a tensor.
    • cute.copy(src_layout, dst_layout, pred=mask): async copy.
    • cute.arch.sync_threads(): explicit barrier.

    HGEMMony2

    CuTe ships with some example kernels, so I grabbed one — an HGEMM (half-precision, FP16, batched GEMM) — and compared to an example implementation in Triton.

    To express the same thing in  PyTorch, we can unleash our inner Jeremy Howard and use einsum notation: torch.einsum("mkl,nkl->mnl", a, b).  Take L batches of a MxK matrix, L batches of a NxK matrix, and return L batches of a MxN matrix.

    Here is the Triton:

    @triton.jit
    def triton_batched_hgemm_kernel(
    	a_ptr, b_ptr, c_ptr,
      M, N, K, L, 
      stride_am, stride_ak, stride_al, 
      stride_bn, stride_bk, stride_bl, 
      stride_cm, stride_cn, stride_cl, 
      BLOCK_SIZE_M: tl.constexpr, BLOCK_SIZE_N: tl.constexpr, BLOCK_SIZE_K: tl.constexpr,
      GROUP_SIZE_M: tl.constexpr,
    ):
        """Triton batched half-precision GEMM kernel: C[m,n,l] = sum_k A[m,k,l] * B[n,k,l]"""
        pid = tl.program_id(axis=0)
        pid_batch = tl.program_id(axis=1)  # Batch dimension
        
        # Calculate batch offsets
        batch_offset_a = pid_batch * stride_al
        batch_offset_b = pid_batch * stride_bl  
        batch_offset_c = pid_batch * stride_cl
        num_pid_m = tl.cdiv(M, BLOCK_SIZE_M)
        num_pid_n = tl.cdiv(N, BLOCK_SIZE_N)
        num_pid_in_group = GROUP_SIZE_M * num_pid_n
        group_id = pid // num_pid_in_group
        first_pid_m = group_id * GROUP_SIZE_M
        group_size_m = min(num_pid_m - first_pid_m, GROUP_SIZE_M)
        pid_m = first_pid_m + ((pid % num_pid_in_group) % group_size_m)
        pid_n = (pid % num_pid_in_group) // group_size_m
        offs_am = (pid_m * BLOCK_SIZE_M + tl.arange(0, BLOCK_SIZE_M)) % M
        offs_bn = (pid_n * BLOCK_SIZE_N + tl.arange(0, BLOCK_SIZE_N)) % N
        offs_k = tl.arange(0, BLOCK_SIZE_K)
        
        # Include batch offsets in pointer calculations
        a_ptrs = a_ptr + batch_offset_a + (offs_am[:, None] * stride_am + offs_k[None, :] * stride_ak)
        # transpose for GEMM (load B[K, N] pattern)
        b_ptrs = b_ptr + batch_offset_b + (offs_k[:, None] * stride_bk + offs_bn[None, :] * stride_bn)
        
        # We accumulate into fp32 for higher accuracy.
        accumulator = tl.zeros((BLOCK_SIZE_M, BLOCK_SIZE_N), dtype=tl.float32)
        
        for k in range(0, tl.cdiv(K, BLOCK_SIZE_K)):
            # Load the next block of A and B, mask in K
            a = tl.load(a_ptrs, mask=offs_k[None, :] < K - k * BLOCK_SIZE_K, other=0.0)
            b = tl.load(b_ptrs, mask=offs_k[:, None] < K - k * BLOCK_SIZE_K, other=0.0)
            accumulator = tl.dot(a, b, accumulator)
            a_ptrs += BLOCK_SIZE_K * stride_ak
            b_ptrs += BLOCK_SIZE_K * stride_bk
        # Convert back to FP16 for output
        c = accumulator.to(tl.float16)
        offs_cm = pid_m * BLOCK_SIZE_M + tl.arange(0, BLOCK_SIZE_M)
        offs_cn = pid_n * BLOCK_SIZE_N + tl.arange(0, BLOCK_SIZE_N)
        c_ptrs = c_ptr + batch_offset_c + stride_cm * offs_cm[:, None] + stride_cn * offs_cn[None, :]
        c_mask = (offs_cm[:, None] < M) & (offs_cn[None, :] < N)
        tl.store(c_ptrs, c, mask=c_mask)

    The core loop here is:

    • Divide into block_size_k chunks of K
    • For each do a masked load (so when we hit the edges we don’t bring in garbage)
    • Do the dot product for the tile into an accumulator for the result matrix
    • Advance the A and B pointers

    The CuTe kernel is, uhhh… a bit more involved. The full kernel is several hundred lines long. You can see the source in the Cutlass repo, which demonstrates some cool features like the ability to pass in an epilogue function for fusion.

    For now, lets focus on the main loop. As before, we are looping over tiles of K3 .

    The first thing we do is advance the pointers. The kernel is doing explicit pipelining of transfers from global memory, to shared memory, to registers, so we need to set wait groups do ensure all the loading has completed before we advance. We’re not actually doing loading in this section, just prepping the ground:

    for k_tile in cutlass.range_dynamic(k_tile_count, unroll=1):
    	for k_block in range(num_k_block):  
          if k_block == num_k_block - 1:
        	tCsA_p = tCsA_copy_view[None, None, None, smem_pipe_read]
          tCsB_p = tCsB_copy_view[None, None, None, smem_pipe_read]
          cute.arch.cp_async_wait_group(num_smem_stages - 2)
          cute.arch.sync_threads()

    Next we kick off a copy from shared memory (smem) to registers (rmem) using cute.copy for the (future) A and B tiles.

    k_block_next = (k_block + 1) % num_k_block  # static
    cute.copy(
    	tiled_copy_s2r_A,
      	tCsA_p[None, None, k_block_next],
        tCrA_copy_view[None, None, k_block_next],
       )
    cute.copy(
    	tiled_copy_s2r_B,
        tCsB_p[None, None, k_block_next],
         tCrB_copy_view[None, None, k_block_next],
      )

    Finally, we interleave the transfers of the next A and B tiles from global to shared memory with the actual gemm operation (the equivalent of tl.dot). I will trust the folks at Nvidia that this is an optimal pattern. The pred= in there is the equivalent of masking in Triton.

       if k_block == 0:
     		if k_tile + num_smem_stages - 1 < k_tile_count:
        	    cute.copy(
    				tiled_copy_B,
    				tBgB[None, None, None, k_tile_index],
    				tBsB[None, None, None, smem_pipe_write],
    				pred=tBpB,
    			)
            k_tile_index = k_tile_index + 1
            cute.arch.cp_async_commit_group()
    		smem_pipe_write = smem_pipe_read
    		smem_pipe_read = smem_pipe_read + 1
    			
    		if smem_pipe_read == num_smem_stages:
          	    smem_pipe_read = 0

    The pipelining is explicit, which is nice for debuggability and optimization, but very manual.

    Debugging Tips

    export CUTE_DSL_LOG_TO_CONSOLE=1
    export CUTE_DSL_LOG_LEVEL=10   # up to 100
    export CUTE_DSL_PRINT_IR=1     # dump MLIR
    • cute.printf() gives you a GPU‑side printf.
    • Kernels are aggressively cached; rm ~/.cache/cutedsl if things look stale.
    • Multiple @cute.jit host functions in the same Python scope can confuse MLIR (mainly for launching kernels).
    • The control‑flow rules are strict: no return inside a kernel; initialize everything.

    If you’re exploring GPU kernels for the first time, I strongly recommend starting with Triton. When you need to really get into the weeds, or want to reuse CUTLASS building blocks, its great to have CuTe‑DSL as an option in Python (provided you’re comfortable spelunking in GPU internals).

    1. I spent a lot of time holding it wrong. Arguably, still holding it wrong. ↩︎
    2. No one knows what it means, but it’s provocative ↩︎
    3. Note the explicit unroll tag. When you really want #pragma but can’t. ↩︎
  • Harnessing the Universal Geometry of Embeddings

    Harnessing the Universal Geometry of Embeddings

    What content should you include in an LLM prompt? Many interesting use cases (enterprise tools, coding assistants) have more content than they can handle at once, so you chunk it up, turn each chunk into a vector with some sentence‑encoder, and store those vectors in a database. Later you vector‑search, pull back the relevant chunks and feed them to the LLM — better known as the RAG pattern.

    The working assumption has been: those vectors are fairly safe. A 768‑dimensional point doesn’t look the text “Ian ordered a burger at 12:07 ”, so storing raw embeddings seem privacy‑preserving.

    But are they! In Cornell’s Harnessing the Universal Geometry of Embeddings paper the authors train vec2vec, a small GAN that learns to translate embeddings from encoder A’s space into encoder B’s space, without seeing the original sentences. Once you’re in encoder B‑land you can recover up to 80 % of the underlying text:

    Inversion, i.e., reconstruction of text inputs, is more ambitious than attribute inference. vec2vec translations retain enough semantic information that off-the-shelf, zero-shot inversion methods […] extract information for as many as 80% of documents given only their translated embeddings, for some model pairs (Figure 5). These inversions are imperfect and we leave development of specialized inverters for translated embeddings to future work. Nevertheless, as exemplified in Figure 6, they still extract information sucdh as individual and company names, dates, promotions, financial information, outages, and even lunch orders. In Appendix E, we show the prompt we use to measure extraction.

    The paper suggests that most sentence encoders trained with similar objectives on sufficiently diverse data come up with embeddings which resemble each other. Concepts, topics (and lunch) live on a shared manifold1; the models just might position them differently in embedding space. Vec2vec is a learned a coordinate transform.

    What this might be implying is that if you train a model with similar objectives on data samples from a similar generating function, you will arrive at a manifold in latent space that is geometrically similar to anyone else doing the same thing. If that is true operations in latent-space start to look less model specific, and approaches that navigate them (like JEPA, LDM editing) could learn to operate across different model with just an adapter layer.

    To be clear, the paper is not saying this: the authors only align English, contrastive‑loss, transformer sentence encoders. No decoder models, hardly any dimensionality mismatch. The phrase “universal geometry2” may be a stretch: Their GAN training also requires quite a bit of run cherry-picking3, and when they tried cross-modality the results weren’t as strong, but it’s a very interesting idea worth further investigation.

    In the short term, this is probably mildly alarming for coding agent customers that are worried about their source code leaking, but in the long term I hope we can see some more investigation into how true this is in more general modeling, and what kind of opportunities that might open up!

    1. Shape in the embedding space. In practical experience when you have a large embedding space its mostly empty, and all the actual data lives on a plane in the space. This is why things like latent diffusion models work: they learn to navigate towards that plane from any random noisy point in the space. ↩︎
    2. But it’s a great title. ↩︎
    3. My understanding is unstable training is a very common problem for GANs. ↩︎
  • How to build unmaintainable kernels

    What do you need to do to get better performance and GPU efficiency out of your model? The GPU-oriented folks at Stanford recently published an early preview of the work they have been doing on the LLM generation of kernels: Surprisingly Fast AI-Generated Kernels We Didn’t Mean to Publish (Yet) – and they have a list:

    • Memory Access Optimization: improving the efficiency of data movement between different memory hierarchies (global memory, shared memory, registers) and ensuring data is accessed in a way that maximizes bandwidth and minimizes conflicts.
    • Asynchronous Operations & Latency Hiding: hide the latency of slow operations (like global memory access) by overlapping them with computation or other memory transfers
    • Data Type & Precision Optimization: using lower-precision data types (like FP16 or BF16) where possible to reduce memory bandwidth requirements, increase cache effectiveness, and potentially leverage specialized hardware units.
    • Compute & Instruction Optimization: making the arithmetic computations themselves more efficient, reducing instruction count, or leveraging specialized hardware instructions
    • Parallelism & Occupancy Enhancement: maximize the number of active warps on the Streaming Multiprocessors (SMs) to better hide latencies and improve overall throughput
    • Control Flow & Loop Optimization: reducing the overhead associated with loops, branches, and indexing calculations

    That’s a good list! In this case though, it emerged not from (just) talking with kernel experts, but also from developing a model to generate kernels:

    We have some very fast AI-generated kernels in pure CUDA-C without using libraries and DSLs such as CUTLASS and Triton. They are performing close to or in some cases even beating the standard expert-optimized production kernels shipped in PyTorch.

    They developed a very straightforward but smart pattern on structuring test-time-compute. They reason about optimizations in natural language before generating code. Then, they branch out into a tree structure of refinements for each optimization idea, to avoid loops in investigation.

    The kernels they generated were somewhere between fast, and very fast:

    Conv2D: 179.9% performance of FP32 torch.nn.Conv2D; problem size: (100, 3, 224, 224) input tensor, conv(in_channels=3, out_channels=96, kernel_size=11, stride=4, padding=2)

    The team aren’t claiming this is a general solution, but just an interesting proof of possibility, which is certainly is! The walk through of how they got to the final conv2D kernel is fascinating, both in terms of human intervention and the chain of optimizations.

    The final code sample for the Conv2D kernel is included in the appendix. It uses advanced CUDA techniques that we find challenging to write ourselves! We also have more example kernels in this Github repo

    The kernel is very fast for specific shapes, on the L40s, in FP32. Its also a kernel that, by the sounds of it, the team themselves struggled a bit with. It’s very, very specialized. It’s not that a human couldn’t have built it, its that (in most cases) they wouldn’t: it’s not a priority kernel, and all that clever CUDA comes with operational overhead, ties on specific hardware, shapes and so on.

    That in itself isn’t new. If you PyTorch or XLA compile you’ll get a lot of kernels which you probably wouldn’t write, but this adds a new (and weird!) layer of non-determinism to everything. Elsewhere at Stanford, they have been looking at one of the other killers of GPU efficiency: kernel launch overhead. Most models are represented by hundreds of kernels, each of which have to be scheduled from the CPU. LLMs are generally memory-bound, small ones particularly so, and the gaps between kernel executions can end up dominating performance:

    Look Ma, No Bubbles! Designing a Low-Latency Megakernel for Llama-1B:

    In this post, we show how we can bypass this problem by merging the entire Llama-1B forward pass into a single “megakernel” that eliminates kernel boundaries altogether. Doing this achieves brr – on an H100, we use 78% of memory bandwidth and outperform existing systems by over 1.5x. (To our knowledge, this is the lowest-latency forward pass for Llama-1B in bfloat16!) In the rest of this post, we’ll walk through how and why one would do this.

    The idea of megakernels that handle all of the operations is not new, but the complexity of fusing everything together is high. Persistent kernel were popularized at the tail end of the CUDA 11 series, due to the right residency and async copy support in Ampere. They allow leaving kernels resident on an SM and having them pull a series of tiles to do their work rather than scheduling repeated launches of the same kernel. The megakernel takes this idea ever further, with multiple operations within the kernel that pulls a stream of different problems. One issue with this approach (traditionally) is register spilling: you only have so many registers available, up to 255 per thread, though with a fairly high overall limit of 64k 32-bit registers (on Hopper). That means you need to keep some data in shared memory, and efficient use of shared memory ends up being the bottleneck. The team at Stanford developed paging shared memory, with a separate reserved block for managing allocations of shared memory to individual tasks.

    This gets the CPU completely out the picture for the forward pass, but is incredibly specific to the model (in this case Llama 3.2 1B).

    Another collaboration was clearly thinking in the same direction, as they recently posted about their Mirage Persistent Kernel: a compiler for megakernels.

    our team from CMU, UW, Berkeley, NVIDIA, and Tsinghua developed Mirage Persistent Kernel (MPK) — a compiler and runtime system that automatically transforms multi-GPU LLM inference into a high-performance megakernel. MPK unlocks the benefits of end-to-end GPU fusion while requiring minimal manual effort from developers.

    The system works by building a task-graph of what they call LAX1[1] fragments, which in practice is a very short list of 14 operators. This is actually too small to represent everything they need, meaning they have to manually decompose some common ops like ReLu, but this level of decomposition gives them an ability to do some pretty complex fusions.

    The actual ops are generated thanks to Mirage’s Kernel Superoptimizer (a great name), which I think is a very intense autotuner:

    Mirage automatically searches for possible GPU kernels for attention. The search space includes existing manually designed attention kernels (e.g., FlashAttention and FlashDecoding) as special cases. It also includes other implementations that outperform today’s handwritten ones by up to 3.5x for certain use cases. The GPU kernels generated by Mirage can directly operate on PyTorch tensors and be called in your PyTorch program.

    The search is not cheap though:

    In our evaluation, Mirage takes up to 4 hours to optimize a Lax program. This optimization is a one-time cost before deployment on the target hardware.

    The aggressive decomposition allows them to have a clever verification scheme where they validate kernels on random inputs to get confidence in (approximate) numerical correctness.

    They then build a worker kernel with all the relevant operations, and schedule the optimized graph via dedicated scheduler warps. Workers are scheduled on the SMs, and report back status. The scheduler warps then decide when tasks can be enqued for execution.

    They’ve got a code example that walks through setting it up for Qwen. They recreate the model structure explicitly, generate a task graph from it, and kick off the search for optimal kernels and fusions. This avoids the need to solve the Dynamo-style problem of tracing the model!

    The resulting kernel is again heavily tied to the specific hardware and model. One thing we have found useful for investigating production problems is that the ability to ablate different parts of the compile process, running models in (basically) PyTorch eager mode. This approach leaves the darkest of black boxes to work with, and I would imagine even more terrifying PTX than the complex CUDA that the LLM kernel generation team came up with.

    Between these projects though, it feels like we are exploring the edges of what running a program on GPUs should actually look like: a combination of kernel generation and multi-level search seems almost guaranteed to yield optimizations that would be far outside the cost-benefit curve for manual implementation. What we don’t have yet is known ways to operationalize this kind of approach, but its an exciting area to watch!

    Thanks to Matt for nudging me to actually read through these papers, they’d been on my todolist for a bit!


    1. I am not sure what this stands for, but the basic ops in jax are in jax.lax, so I presume its the same source! ↩︎

  • GPU Driven

    Traditionally, GPU collective network operations were issued from the framework on a separate CUDA stream than the local computation kernel launches. This allowed overlapping comms and hiding most or all of the network latency. NCCL exposes collectives as fully implemented kernels, and there have been various derivitives such as AMD’s RCCL or Berkeley’s new UCCL project, which is aiming to be a drop-in replacement better suited for large-scale GPU workloads1. Earlier versions of this sent the networking via the host, and later developed towards GPU-to-GPU peer to peer over connections like NVlink, and direct communication between GPU and the NIC for scale out. But the actual coordination was driven by launches from the CPU.

    The increasing capacity of GPUs, particularly in the Hopper/MI300x era, the use of micro-batching, and the rise of Mixture-of-Expert models put a lot more pressure on this arrangement: now rather than a large chunk of comms at the end of each distributed-data-parallel pass you are doing thousands of small exchanges per step. Each step could require each CPU rank to launch thousands of tiny All-to-All: millions of collective calls across the cluster, while still running the rest of the training loop. Each one forces a host interrupt, collective calls and GPU triggers for the data transfer.

    A paper from last year, The Landscape of GPU Driven Communication, gives a broad overview of this shift:

    In the last decade, several advancements, broadly referred to as GPU-centric communication, have sought to challenge the CPU’s hegemony on multi-GPU execution. At a high level, these advancements reduce the CPU’s involvement in the critical path of execution, give the GPU more autonomy in initiating and synchronizing communication and attempt to address the semantic mismatch between multi-GPU communication and computation.

    Getting the right kind of abstractions in here to make this more accessible and flexible is an active area of development. The main reference is the shmem abstraction. This allows writing and reading bytes from remote memory (originally for supercomputers) and adding barriers around usage. Nvidia’s nvshmem (and AMDs ROCshmem) library directly implemented this for GPU.

    The next big step up in functionality was GPU Direct Async, a transport that allows access to NIC doorbells directly from the GPU, extending NVSHMEM to RoCE/InfiniBand (Remote Direct Memory Access – RDMA). In addition to this, NVLS (NVLink Sharp) was added to NVLink switches which allowed much more bandwidth efficient switch-managed multicast for broadcast and reduce cases, the fundamental operations used in all-gather and all-reduce collectives. This allows GPU-initiation of collectives which gives us the ingredients to get the CPU out of the networking path completely, and to fuse networking alongside compute operations.

    This was one of the things DeepSeek did really well, covered in their DeepEP work: fusing MoE GEMM with GPU-initiated RDMA cut single-node latency by 2–3x. The tradeoff is kernels handle a lot of complexity: choosing between NVLink and GPUDirect for nodes, polling, flow management and so on is tuned to their specific needs and hardware.

    ByteDance has also spent a lot of time looking at this problem. Their Flux paper looks at more general approach for doing tile-based fusion of the comms and compute:

    Flux overdecomposes computation and communication into tiles. Here, since the computation operation is GEMM, and most high-performance GEMM kernels on GPUs are written with tiling, such as thread block tiling or warp tiling, our decomposition can naturally map into existing tiling in the kernels. Flux fuses dependent communication and/or wait logic into a GEMM kernel, and launches only one fused kernel, compared to the prior methods launching multiple split GEMM kernels.

    PyTorch has an experimental feature and RFC for SymmetricMemory:

    Then, innovative block-wise compute/computation overlapping techniques started to use copy-engine to drive the P2P traffic in order to minimize contention. Now, we are seeing techniques where NVLink communication is directly issued from “compute” kernels.

    […]

     Just as Triton allows average users to modify matmuls for their needs (fusion, quantization, etc.), we hope that SymmetricMemory will enable average users to modify NVLink communication algorithms for their requirements, whether it’s implementing alternate collective algorithms (one-shot allreduce), using different quantization approaches (stochastic rounding), or fusing collectives with other kernels (all-gather interleaved with matmuls).

    This project is still fairly manual, though it abstracts much of the plumbing required and makes it accessible at a high level.

    In a similar vein, ByteDance recently released their implementation of Triton-Distributed: ByteDance-Seed/Triton-distributed: Distributed Compiler Based on Triton for Parallel Systems. This adds a small Triton DSL wrapping the shmem operations, supporting both Nvidia and AMD hardware. It exposes some of the comms knobs to autotuning, allowing tuning across compute and collectives, focusing on the tile-based approach they documented in their TileLink paper.

    This is unlikely to make its way into upstream Triton, in part because the general approach of Triton is to hide much of the hardware information in the compiler passes, while this approach makes the topology fairly explicit.

    The PyTorch team has been working on traceable collectives for the PyTorch compiler. The idea of a compiler being able to look at the overall task graph and decide where to fuse comms and which transport options to use is appealing, as it allows kernels to be more transparent to the specifics of the cluster.

    The move to GPU-initiated, fine-grain comms that fuse into compute kernels is real and continuing; the tooling is still early, but the gap will continue to close.

    1. This currently host-side controlled, but they have plans for GPU-driven comms as well ↩︎
  • AbsenceBench

    https://arxiv.org/abs/2506.11440

    Simon Willison has a good summary:

    Long context models have been getting increasingly good at passing “Needle in a Haystack” tests recently, but what about a problem in the opposite direction?

    The answers are surprisingly domain-specific; some models do great on numeric sequences but most are pretty bad at code!

    The authors posit that attention is just a worse mechanism for seeing what’s missing vs what’s there. For me this rhymes with the experience of folks doing agentic coding assistant work: its beneficial to clear the context window more often than you think as the models strongly prefer to use what is already in there.

    This feels like a learned or tuned behavior, a flavor of the model does the eval. Models will probably get better at this problem, as now it’s legible, but is there a tradeoff that has to be made?

    Pretraining is somewhat saturating, but we have oodles of post-training (which includes context extension), the whole meta-RL process of researchers trying different data mixes and algorithm/architecture tweaks, and inference time search options.

    If OpenAI had Anthropic’s data and evals would they have as good an agentic coding model? And vice versa would Opus be as good at deep research as O3? I honestly don’t know: in the end compute will always be finite and we have to allocate it with some end in mind. It feels very plausible there is no globally optimal scaling law for how you prioritize different model capabilities. But the models will probably do this eval.

  • Free-Threaded Python gets ‘supported’ status

    Huge congratulations to Thomas, Matt and Sam for sheparding through PEP 779 that moves the no-gil/free threaded python mode from experimental to supported:

    https://discuss.python.org/t/pep-779-criteria-for-supported-status-for-free-threaded-python/84319/123

    With these recommendations and the acceptance of this PEP, we as the Python developer community should broadly advertise that free-threading is a supported Python build option now and into the future, and that it will not be removed without following a proper deprecation schedule

    Having confidence in the long term of this feature is great for anyone building on it. I’m very grateful to (and feel lucky to be working with!) the many folks who have been squashing bugs and improving performance, and to the people adding support for FT Python across the ecosystem!

    The steering commitee have laid out some solid documentation and performance expectations for the ongoing work, and are setting an expectation for broad compatibility for future cpython work:

    New experimental projects within CPython must be compatible with, and should be based on the free-threading build. The SC encourages this direction to reduce engineering complexity caused by supporting both GIL and free-threaded builds

    I also appreciate the call for building out more high-level concurrency primitives: I think there are a lot of exciting projects to come as we move more of this into production!

  • A patchwork quilt view of AI Alignment

    https://arxiv.org/abs/2505.05197

    Very interesting paper from folks at DeepMind that is focused on arguing that the idea of a convergence of a single, coherent value set doesn’t reflect society and is not the only way to think about AI morality and alignment.

    Think of society as a patchwork quilt composed of diverse communities, each with its own internal norms and expectations. Within these communities—e.g. trade unions enforcing strike discipline or religious groups defining attendance practices—members understand the specific rules and the social consequences for deviating (Bicchieri, 2005; Ostrom, 1990; Ullmann-Margalit, 1977). These internal dynamics shape behavior within each distinct patch of the quilt, fostering cohesion through shared, localized understandings of appropriate conduct

    […]

    A key insight we can draw then is that what holds humans together despite profound disagreements is not value convergence, but practical mechanisms for coexistence—which we see as social technologies

    There is an idea that sometimes comes up that disagreements between good, reasonable people can be traced to misunderstandings or disagreements about the likelihood of different outcomes; if you can align on them, you’ll come to the same conclusions. This encourages some focus in AI alignment on finding the right, true principals, creating the best truth-seeking model possible, then assuming downstream that will result in strong alignment. The paper challenges this assumption.

    They also call out collective action problems in implementing such a framework, particularly start up and free rider problems:

    Even seemingly universal goods like “survival” are embedded in thick cultural contexts that shape their meaning and priority (in fact many cultures prioritize sacred values above survival e.g. Ginges et al. (2007)). In general, mobilizing global action and resources towards any specific AI safety strategy will inevitably confront deep-seated disagreements rooted in different values, priorities, and worldviews regarding the nature of AI risks, the efficacy or fairness of proposed initial strategies, and the equitable distribution of upfront costs and responsibilities.

    Their approach calls for focusing on 4 areas:

    1. Contextual grounding: broader understanding, not just the conversation but the environmental context they are operating in.
    2. Community customization: Different norms for different communities.
    3. Continual adaption: Updating understanding of appropriate behavior based on ongoing feedback. They suggest continuous training for this.
    4. Polycentric governance: Distributed decision making wiht multiple overlapping centers of authority.

    If you read this list into a general “helpful agent” context instead of alignment, I don’t think it would be controversial: these seem good ideas!

    That said, I think a lot of this boils down to the last one. Getting governance structures right is hard, in any context, and I interpret a key part of the aspiration here as having “checks and balances” that can represent varied interests. Not an easy problem to solve!

    Some might worry that our patchwork approach embraces a troubling relativism, but this misunderstands the quilt we’re describing. Just as a quilt’s structural integrity depends on solid stitching techniques regardless of pattern diversity, our appropriateness framework recognizes that while the
    specific content of norms varies across contexts, the social technologies that facilitate coexistence—the mechanisms of learning, adaptation, and conflict resolution—can be studied with rigor and implemented with care.