Category: posts

  • Automation & Managerial Control

    There’s a chart making the rounds that caused Tim Lee over at Understanding AI to rewrite his recent (excellent!) article about the impact of AI on jobs. Stanford’s Erik Brynjolfsson and colleagues found1 that young workers in AI-exposed jobs2 have seen their employment drop by 13% since ChatGPT arrived. Meanwhile, their older colleagues in the same fields are doing just fine.

    […] the youngest workers saw dramatic job losses—but only if they worked in occupations (like accountants or computer programmers) that were highly exposed to AI. Young workers in less exposed occupations (like nurses or construction workers) saw normal employment growth over the same period.

    Focusing on the tech industry, it’s a little hard to disentangle the impact of reduced hiring after layoffs3 from the growth of AI, but likely both had an impact. AI coding agents are making it easier to complete the kind of introductory tasks that might have been left for junior engineers.

    New grads don’t just do simple tasks, though: they grow and develop tacit knowledge of their company and industry. That raises the question of whether this is permanent disruption or temporary dislocation as the skills needed shift. As Tim calls out:

    It’s important not to read too much into this research. Workers between the ages of 22 and 25 are a small slice of the job market, and their employment has always been more volatile than for older workers. When I graduated with a computer science degree in 2002, the economy was just emerging from the recession that followed the dot-com bubble. It was a hard time for a young adult to get their first programming job, but most of my peers eventually found work in the field.

    To give an analogy, there was a time when becoming a junior programmer meant learning how to write fast code as cycles were too important to waste. Now, writing particularly efficient code is largely the preserve of specialist, more senior people: some folks opt in to that route early because of their personal interests, but in general raw performance of code is not the blocking factor to building something valuable.

    My sense is we are seeing the same thing in terms of general “program composition”: senior folks with experience on large, collaborative projects can benefit from LLM automation as they understand how to put in the right project guardrails and how to translate needs into technical direction. Junior people are still mostly trained how to write working code, and that need has become less pressing as LLMs have proved moderately competent at it.

    Rodney Brooks, the robotics legend, made a point back in 2018 that stuck with me: it’s not automation that disrupts workers—it’s digitalization. In his article, Brooks wrote:

    Digitalization is replacing old methods of sharing information or the flow of control within a processes, with computer code, perhaps thousands of different programs running on hundreds or thousands of computers, that make that flow of information or control process amenable to new variations and rapid redefinition by loading new versions of code into the network of computers.

    An example that Brooks uses is bridge toll takers. This directly happened on the Bay Bridge between San Francisco and Oakland, which used to employ toll takers in booths. Then FasTrak was added, allowing drivers to pass through without interacting with anyone, while still offering cash tolls for those without a transponder. Now, between that and billing people by mail via cameras watching license plates, the tollbooths are empty.

    LLMs also digitalize. Task descriptions and project documentation, for example, have been stored in human language: digital, but not particularly accessible to automation. Much of the work of managing a large bug tracking system has been in adding metadata that is accessible to automation. LLMs digitalize language, imperfectly to be sure, but enough to expose new swathes of work to automation.  

    High Road/Low Road

    How will companies respond? Thomas Kochan at MIT has been mapping this kind of choice for years, and describes the separation between what he called the high road and low road. 

    The language that was used to differentiate these two approaches quickly evolved to a comparison of “high road” and “low-road” business strategies and “high-performance work systems,” which viewed labor as an asset, versus “command and control” systems, which viewed labor as a cost like any other factor of production. A comparison of the business strategies of two household names, Walmart and Costco, illustrates the differences between low-road and high-road business strategies. Walmart has been extremely successful (when judged solely on the grounds of finances and shareholder value) by pursuing a business strategy best captured by its marketing tag line: “Every day low prices.” To achieve this strategy, it places top priority on minimizing and tightly controlling labor costs, discouraging long-term tenure of its “associates,” investing little in training and development, and avoiding unions at all costs. Costco’s business strategy places a higher value on product quality and customer service, and to achieve these objectives it pays higher wages, invests more in training its work force to understand and serve customer needs, and has longer tenure patterns (and thus lower turnover costs). As a result, Costco’s employees are more productive, stay with the firm longer, and have more discretion to use their time and knowledge to solve customer problems.

    Tech companies have, for the most part, been high-road employers. Employees have been an asset, and in some cases the key asset of the company. The low road, though, is not simply driven by cost cutting; it’s about control. Having a more fungible, replaceable workforce gives executives more options. Having more specialized, skilled workers offers more flexibility in how work is done, but shifts control to the workers and away from management.

    We can see this play out in some of the post-pandemic cultural changes. There is a concept in work called deskilling, where work is atomized to improve efficiency: take something which was a skill and divide it up until the individual components become unskilled. Classic examples are in factory work, where a skilled person is replaced with an operator of a machine, or more often a series of operators of a series of machines4. This trades a higher up-front cost in terms of capital and procedure development for a lower labor cost, transferring not just money but also power from workers to managers.

    A recent article extended this concept to virtues, with the idea of “moral deskilling”. A virtue is a positive behavior, such as taking responsibility or building with high quality. Virtues tend to be individual qualities, things we recognize and reward in others: much of the culture in a company is about inculcating virtues. That is inherently messy, and the idea of systematizing virtue is appealing: move from a fuzzy, personal conception to a verifiable checklist or a rule that can be followed. This worked in a lot of cases, but it also enabled a form of deskilling:

    Systematising virtue handed control to managers. Who, endlessly mistrusting these expert folk who were always trying to do things the expensive way, converted that mistrust into endless, endless paper work.

    It was endless because it broke every little aspect of what had been virtue into tiny components. Fearful of losing control of any scrap of virtue, managers needed to relentless check on every little task.

    If we want to see this play out in real time we can look at the return-to-office mess in tech. A vibrant, collaborative office culture is a good thing, and it requires a compact. Employees would deal with the misery of a commute5, but in exchange they would participate in an environment where they could learn and teach, build camaraderie and so on.

    When return-to-office pushes arrived post-pandemic, people had found pleasure and benefit in not doing the commute. When they returned, they found the offices less vibrant, the workforce more distributed, and cost-driven reductions in space making the experience harder through shortages of meeting rooms or desks.

    Compounded by a series of layoffs and a change in the prior relationship between company and employee, the in-office deal felt worse. Frustrated with the lack of the old compact, management exerted control through systems. They set required days and logged attendance through badge-ins. Workers responded by treating the atomized requirements as mere requirements, not aspects of a culture: even a small percentage of folks coffee badging or trying to work from more convenient offices were visible in the empty desks, exacerbating tensions for workers “doing the right thing”.

    Rather than analyze the problem and step back, management in many cases doubled down on systematizing: validating time at desk, logging badge out times or adding similar extra controls. This continued to take what had been a morally complex set of trade-offs and reduce it to a checklist. For many newer staff, that was the in-office experience. 

    This is the essence of the low road: prioritizing the systematized and legible over the messy, complex, but more interesting world of dealing with real people; prioritizing power and control over exploring new outcomes.

    One way to view what’s happening is through the lens of debt, which is one of the angles in a recent position paper that frames the future of work as an AI Safety risk. Every time a company chooses to replace junior workers with LLMs rather than training them, they’re borrowing against the future. Matt Garman of AWS was pretty clear on his position: 

    “I was at a group, a leadership group and people were telling me they’re like we think that with AI we can replace all of our junior people in our company. I was like that’s the like one the dumbest thing I’ve ever heard […] They’re probably the least expensive employees you have. They’re the most leaned into your AI tools and like how’s that going to work when you go like 10 years in the future and you have no one that has built up or learned anything.”

    But understanding something and acting on it are different things. Both the low road and high road can lead to a lot of success in business, but I do hope we can navigate this transition towards a place where the craft can be retained in software development. The question is whether enough companies will choose the messy, complex work of developing people over the appealing simplicity of trying to replace them.

    1. Canaries in the Coal Mine? Six Facts about the Recent Employment Effects of Artificial Intelligence — Stanford Digital Economy Lab ↩︎
    2. Like programming and accountancy, knowledge work fields that have a large amount of machine interaction ↩︎
    3. As well as pandemic-driven overhiring and the end of zero interest rates ↩︎
    4. Or now robots in entirely lights out factories for sufficiently high scale productions ↩︎
    5. Particularly in the SF bay area! ↩︎
  • Rubrics

    Pre-training is about making AI correct, post-training is about making AI helpful1. That helpfulness is (primarily) shaped by reinforcement learning. RL for LLMs really took off with RLHF (RL from Human Feedback), which trained based on the score from a reward model.

    The reward model was designed to score responses based on how well they met certain preferences, and the preferences were inferred from a set of human ratings: the graders were told what to look for in pairs of responses, and the reward model was trained to predict which one they would pick. This worked, but was gated on how much signal you could get into the reward model, and hence on how many humans you had generating preference data.
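
    For intuition, a reward model of this kind is typically fit with a pairwise, Bradley-Terry-style loss over chosen/rejected responses. A minimal sketch, where the reward_model(prompt, response) -> scalar interface is an assumption of mine rather than any particular library’s API:

    import torch.nn.functional as F

    def preference_loss(reward_model, prompt, chosen, rejected):
        # Score both candidate responses with the same reward model
        r_chosen = reward_model(prompt, chosen)
        r_rejected = reward_model(prompt, rejected)
        # Maximize the log-probability that the human-preferred response outranks the other
        return -F.logsigmoid(r_chosen - r_rejected).mean()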

    RLAIF (RL from AI Feedback) naturally extended this to using an LLM to make the preference picks rather than humans2. Folks also started to use LLMs in an LLM-as-Judge pattern for evaluation after training: give the model a list of criteria, and ask it to rate how well the responses meet them. 

    The next notable step was RLVR (RL with Verifiable Rewards), which uses ground-truth data to provide reward scores instead of a model. For example, a math problem might have a defined numeric answer, or a generated proof could be verified by a dedicated theorem prover program. This turned out to work very well for code and math and led to the O-series of OpenAI models3 and many open reasoners, particularly Deepseek R1.

    It’s a pretty natural idea to take a verifiable-reward pipeline and plug in AI scoring directly: rather than having a model generate preference pairs and training a separate reward model, give the model criteria and ask it how well the response satisfies them. This means instead of letting a model work out what “good code” looks like from pairs of different (but similar!) solutions to a problem, you have a model working through a checklist, asking things like “Does it have types? Does it have comments? Would your coworkers hate you if you landed this?”

    These checklists are referred to as rubrics, and Snorkel have started an interesting-looking blog series introducing rubrics, which offers a definition:

    A rubric is a structured guide that spells out what “good” looks like for each response from an AI system. 

    A rubric consists of the following (a toy example follows the list):

    • A list of criteria: Does the code compile? Does it have comments?
    • How the model performed on each criterion: “Compiles” could be yes/no. It could also be more nuanced: yes/yes with warnings/no.
    • Scoring rules that turn performance into numbers: Clean = 0. Warnings = 1. No = 2.
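
    Put together, a toy version of that structure might look like this (the shape is mine, not Snorkel’s, and I score higher-is-better here, unlike the penalty-style numbers above):

    rubric = [
        {"criterion": "Does the code compile?",
         "levels": {"yes": 2, "yes, with warnings": 1, "no": 0}},
        {"criterion": "Does it have comments?",
         "levels": {"yes": 1, "no": 0}},
    ]

    def rubric_score(judgments: dict[str, str]) -> float:
        """Turn per-criterion judgments (criterion -> chosen level) into one number."""
        total = sum(item["levels"][judgments[item["criterion"]]] for item in rubric)
        best = sum(max(item["levels"].values()) for item in rubric)
        return total / best  # normalized to [0, 1]

    rubric_score({"Does the code compile?": "yes, with warnings",
                  "Does it have comments?": "yes"})  # -> 0.67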

    In Nathan Lambert’s recent interview with Ross Taylor, Taylor calls rubrics out as an underappreciated research opportunity, particularly for agentic training:

    Rubrics are underhyped on social media – they were driving force behind projects like DeepResearch – and GenRMs are interesting but perhaps slightly overhyped.

    This caught my eye, as Moonshot leveraged rubric based rewards heavily in Kimi K2, notably using the model they were training as the judge of itself: 

    The framework operates using a Self-Critique Rubric Reward mechanism, where the model evaluates its own outputs to generate preference signals. To bootstrap K2 as a competent judge, we curated a mixture of open-source and in-house preference datasets and initialize its critic capability in the SFT stage.

    One of the core values of rubrics is that they work for both LLMs and humans. You can iterate on rubrics with people, scale them with LLMs, and spot-check LLM results with human raters to ensure reliability. 

    The paper [2507.17746] Rubrics as Rewards: Reinforcement Learning Beyond Verifiable Domains formalizes them as a full peer to Verifiable Rewards. The paper sets up rubrics so each criterion is a simple pass/fail, and each has a predefined importance weight. They normalize everything so the system can’t get gamed by just adding more criteria4, and then plug the resulting score into the RL loop5.
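
    A minimal sketch of that reward shape (my own simplification; the paper’s exact weighting and normalization scheme may differ):

    def rar_reward(passed: list[bool], weights: list[float]) -> float:
        # Weighted fraction of criteria satisfied, normalized to [0, 1] so that
        # simply adding more criteria can't inflate the reward
        return sum(w for ok, w in zip(passed, weights) if ok) / sum(weights)

    # e.g. one Essential criterion (weight 3) and two Important ones (weight 1)
    rar_reward([True, True, False], [3.0, 1.0, 1.0])  # -> 0.8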

    Of course, you actually have to write the rubrics, which leads to a specificity versus generality tradeoff: take more time to write more rubrics or rely on fewer, more general ones. The RaR paper makes it clear that more is better:

    predefined generic rubrics substantially underperform compared to prompt-specific ones, underscoring the importance of contextualization. Rubrics that include a broader range of criteria—both positive and negative—consistently outperform those limited to essential checks, suggesting that richer evaluation signals lead to better learning.

    As you might have guessed, the solution was more LLM: use a model to generate prompt-specific rubrics:  

    For each domain, the prompt (included in Appendix H) instructs the LLM to generate 7–20 rubric items based on the complexity of the input question. Each item is assigned a categorical weight (e.g., Essential Criteria, Important Criteria) to determine its importance to a correct answer. The rubrics are designed to be fully self-contained which means that non-expert readers should be able to evaluate response quality using only the rubric. 

    This particularly benefited from having a reference answer attached to the prompt. The models do a much better job of coming up with a good rubric if provided with a (human generated) “good” answer to judge against rather than just the question/prompt. This really opens the door to 1:1 rubrics: given questions and reference answers, you can generate a scoring checklist for each one and mix it with verifiable rewards during post-training. 

    The field continues to be turtles all the way down: using LLMs to write rubrics to have LLM judges evaluate LLM training outputs. At some point, someone’s going to suggest we use rubrics to evaluate how good our rubrics are, and honestly, I’m surprised that paper doesn’t already exist6.

    1. Correct in predicting the next token, and helpful, honest and harmless, specifically. ↩︎
    2. With humans still looped in to validate that the ratings were reasonable. The human graders went from generating ratings to rating the raters. ↩︎
    3. This is the part where everyone pretends they know exactly how O1 works, but actually we’re all just pattern-matching from breadcrumbs ↩︎
    4. Else we’d risk giving more focus to problems with more rubrics, and end up with something unthinkable like a coding model that liberally sprinkles emojis everywhere ↩︎
    5. In practice, they also tried a single LLM judge that took in all criteria and weights and generated a scalar reward, which seemed to work fine. ↩︎
    6. It probably does, I’m just scared to look ↩︎
  • Constraints & Orchestrators

    I recently read a few posts that helped connect the dots on why Python is a) so successful as the lingua franca of ML and b) likely to remain successful in the future1.

    ML code reads like one program, but runs many: CUDA kernels, vectorized CPU loops, graph compilers and a bunch of glue code moving data around and tying things together. Python has continually improved at balancing two somewhat competing challenges: constraining the hot path so compilers can optimize it and structuring an orchestration path so humans can reason about it.

    Hot Path

    constrained languages are easier to optimize by Jynn Nelson touches on this:

    we should not be asking “what language can i use everywhere for every purpose”; we should build meta-languages that allow you to easily use the right tool for the job. this is already true for regular expressions and query languages; let’s go further. i want inline futhark; inline CSS selectors; inline datalog; ffi between python and C that’s trivially easy. the easier we make it to interop, the easier it becomes to pick the right tool for the job.

    Compilers are generally going to perform better if you have regular shapes, minimal side effects, predictable memory access and so on, but you want languages to be expressive and flexible, particularly when “research” is a big part of the work. In practice, that’s precisely what happens with ML: torch.compile lowers PyTorch graphs to an IR and (often) emits Triton kernels. Being able to hand off inner loops to specialized languages allows compilers and runtimes to optimize and target the use cases they are best at.
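
    As a concrete example of that hand-off, this is roughly what it looks like from user code (on a CUDA machine; on CPU, Inductor emits C++/OpenMP rather than Triton):

    import torch

    def step(x: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
        # A small "hot path" the compiler can fuse into a handful of kernels
        return torch.relu(x @ w) * 2.0

    compiled_step = torch.compile(step)       # capture a graph, lower it via Inductor
    x = torch.randn(1024, 1024, device="cuda")
    w = torch.randn(1024, 1024, device="cuda")
    out = compiled_step(x, w)                 # first call compiles; later calls reuse the kernels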

    While this is (somewhat) clear for GPUs or other accelerators with distinctive programming models, I think it’s also largely true for getting the best out of modern CPUs. Daniel Lemire’s SEA 2025 talk covers nearly a decade of performance work and sums it up: modern CPUs do nearly as many instructions per cycle as you can feed them. To really maximize performance you need to batch work, reduce instruction counts and vectorize. We can do some of that in the general Python2 runtime, but dynamic dispatch, aliasing and side effects all make the job a lot harder. We can add speculative guards, which can be hard to reason about, or give up and lose performance. By having DSLs3 that add additional constraints we can give ourselves the ability to get much, much higher performance without sacrificing the overall flow of our program.

    Orchestration Path

    Python is unusually good as an orchestrator. From a readability perspective the language itself is very readable, and as long as libraries and DSLs stay Pythonic they tend to inherit that intelligibility. The challenge with orchestration is coordinating work in such a way that your most precious resources are well utilized. The investments in Free-Threaded Python make it a lot cheaper to do concurrency, but they don’t magically fix the challenge of coordination.

    asyncio: a library with too many sharp corners covers some of the many failure modes the community have encountered with asyncio, and makes a case for Trio- or AnyIO-style structured concurrency, which keeps failure modes manageable.

    asyncio is not a good library. It is constantly full of sharp edges everywhere with implementation details leaking and poorly designed APIs forcing end users into odd code patterns to avoid fundamental flaws in the interfaces.

    This is very much a readability version of the constraints concern on the hot path. Threads are a bad app abstraction over shared mutable state, reasoning about races and cancellation is hard, and primitives are always leaky. But threads are a perfectly fine implementation detail behind a more constrained API, like task groups, or actors, or so on.
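
    The stdlib has been slowly absorbing this idea: asyncio.TaskGroup (3.11+) borrows the scoped-task-group shape that Trio and AnyIO pioneered. A small sketch of how that changes failure handling:

    import asyncio

    async def work(name: str, delay: float) -> str:
        await asyncio.sleep(delay)
        if name == "bad":
            raise RuntimeError("boom")
        return f"{name} done"

    async def main() -> None:
        try:
            # If any task fails, its siblings are cancelled and the error
            # propagates here instead of being lost on a dangling task.
            async with asyncio.TaskGroup() as tg:
                ok = tg.create_task(work("a", 0.1))
                tg.create_task(work("bad", 0.05))
            print(ok.result())
        except* RuntimeError as eg:
            print("caught:", [str(e) for e in eg.exceptions])

    asyncio.run(main())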

    One area that I do think needs sustained improvement is how we debug and trace across this kind of setup: it’s been challenging even in a controlled environment to really understand how all the pieces interact in a reasonably scaled ML workload, and I imagine that problem will only get worse. But I also expect that the flexibility and breadth of Python will end up a boon there as well.

    1. Beyond just sheer momentum, of course. ↩︎
    2. Or any language! Certainly for some optimizations having a JIT for Python would (and does) make life easier. ↩︎
    3. Whether that is an embedded JIT like Triton or a library+execution engine like Polars. ↩︎
  • Overthinking Everything

    Yann LeCun was taking victory laps on Threads a few weeks back over a recent paper, one of several recently that have explored how autoregressive models fare as the amount of information they are dealing with gets longer. His general complaint is that each token they generate can either push them towards the right answer or further away from it, and that the models are inherently bad at recovering if they end up too far outside the correct trajectory.

    This “more might be worse” idea shows up anywhere folks are leveraging large context windows, and one of those1 is in agentic tasks. This post summarizes some research trying to measure the fall-off in chances of succeeding as task length2 increases.

    Is there a Half-Life for the Success Rates of AI Agents? — Toby Ord

    It provides indirect evidence that what really is going on under the hood is that tasks are made up of many sequential subtasks and the chance of succeeding at the whole requires succeeding at every individual component. Moreover, this suggests that the current AI agents are not very good at recovering from earlier mistakes.

    The framing they use is a constant hazard rate: each subtask is another roll of the dice, and if you roll a failure you don’t have much chance of recovering. So more (or longer) is pretty much always worse.
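
    Under that framing, success decays exponentially with task length, which is where the half-life language comes from. A quick sketch:

    import math

    def p_success(task_minutes: float, half_life_minutes: float) -> float:
        # Constant hazard rate: every unit of task time is another independent
        # chance to fail, so success decays exponentially with task length.
        hazard = math.log(2) / half_life_minutes
        return math.exp(-hazard * task_minutes)

    # An agent with a one-hour half-life: ~50% at 1h, ~25% at 2h, ~6% at 4h
    for minutes in (30, 60, 120, 240):
        print(minutes, round(p_success(minutes, 60.0), 3))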

    One interesting aspect is that they also investigate the human failure rate, which increases over time, but much more slowly:

    This could indicate a different scaling behaviour of success rate with time horizon for humans compared to AI agents, which would be well worth investigating and may suggest important underlying mechanisms (e.g. that the humans were better at correcting earlier failed subtasks). If human performance scales differently with task length than AI agent performance, that would be an important result, suggesting that there is a notable inefficiency in the current AI paradigm.

    They’re testing with multiple runs, so these aren’t just models hitting problems they can’t do: it’s models hitting problems they can’t do given the specific tokens they have generated in that particular attempt.

    Agentic use cases aren’t the only situation where a model is generating responses that add to its own context window. There were a lot of early observations after the release of O1 last year that thinking for longer on easy problems does not add value. This recent paper goes further and suggests there is an inverse scaling law: more time thinking makes the model worse.

    [2507.14417] Inverse Scaling in Test-Time Compute

    More specifically, they devised some stress tests: things like counting problems in the presence of distracting information, performing a regression where there is easy-to-understand but spurious data, and so on. Different models are more susceptible to some failure modes than others, but performance consistently drops as the trace length increases:

    Our experiments reveal distinct failure modes across model families—Claude models are particularly vulnerable to distraction from irrelevant information, while OpenAI o-series models show greater resistance but overfit to familiar problem framings. Extended reasoning amplifies different weaknesses: models overthink simple problems, shift attention to spurious correlations, and lose focus during Deduction tasks with constraint tracking.

    In contrast, Chroma’s recent Technical Report investigates how models do on single prompts, but of increasingly long contexts.

    Context Rot: How Increasing Input Tokens Impacts LLM Performance | Chroma Research

    Unlike in the agentic case, here all of the context is passed in at once, so the model isn’t poisoning its own context window through bad choices. It is still dealing with a large amount of content where it needs to choose which parts to attend to. Traditionally the test of long context has been needle-in-a-haystack evaluations: a relevant fact is hidden at different points in a long prompt and the test evaluates whether the model can effectively pull it out.

    The Chroma folks make the test a lot more nuanced — adding distractors3 and irrelevant content in both the broader context and the question. They find that performance consistently degrades as context increases.

    More broadly, our findings point to the importance of context engineering: the careful construction and management of a model’s context window. Where and how information is presented in a model’s context strongly influences task performance

    All of these papers rhyme with LeCun’s gripe about autoregressive transformers, which is (roughly!) that they (also) have a constant hazard rate on generating the “right” token.

    This is a very active area of research though. Process-based rewards in RL training make updates on each step vs only at the end. Multi-token prediction reduces the effective generation length or number of chances of misprediction. Summarizing context effectively compresses existing tokens, also reducing error rate.

    Similarly, if you have good verifiers4 you can use beam or tree searches to explore multiple reasoning paths during generation, which can reduce the error rate at the cost of more compute.

    The closest (LLMish) techniques to LeCun’s vision are things like the recent Hierarchical Reasoning Model that has a layer of persisting hidden state, but it’s still pretty experimental!

    As agentic and reasoning traces get longer, I’m sure we’ll see more entries documenting failure modes, and proposals for techniques to scale around them.

    1. And the one being referenced in the post! ↩︎
    2. In time — they characterize tasks based on how long it takes humans to do them, which is a good control factor ↩︎
    3. As in additional content related to the question, but that doesn’t give the answer. ↩︎
    4. Similar to process-based rewards this is somewhat pushing the problem to the ability to judge how well you are doing during the generation ↩︎
  • The Tools Are Made Up

    It has been hard to keep up with the flurry of strong agentic open-source models coming out of Chinese labs recently, including Moonshot’s Kimi K2, Z.ai’s GLM 4.5, and Qwen3-Coder1.

    Each of them has a mix of clever pre-training recipes and verifiable-rewards post-training. Notably, Kimi and GLM both use the Muon optimizer, which seems to be gaining ground among the OSS labs at least. GLM’s description of the recipe is as follows:

    Our base model undergoes several training stages. During pre-training, the model is first trained on 15T tokens of a general pre-training corpus, followed by 7T tokens of a code & reasoning corpus. After pre-training, we introduce additional stages to further enhance the model’s performance on key downstream domains. Unlike the earlier pre-training stage on large-scale universal documents, these stages leverage medium-sized domain-specific datasets, including instruction data.

    The additional stages, which they refer to as mid-training, extend the context window and help grow capabilities in specific domains. They then move to post-training, with SFT over reasoning and agentic traces followed by RL with Verified Rewards2.

    The Kimi-K2 technical report goes into more detail about how to actually train for tool use. Unlike the others, Kimi is not a reasoning model, so it doesn’t use much in the way of extended thinking. The fact that this wasn’t required to get to strong levels of tool use/agentic capability feels pretty notable to me — most of the recent3 agentic models have been built on a reasoning foundation.

    What I really found interesting from the Kimi report was the level of synthetic data that the team used. This starts in pre-training: to extend high-quality data sources, they rewrite them with another LLM, giving the same facts with new phrasing, instead of looping over the same “good” data for multiple epochs.

    Their approach to tool training takes this kind of idea even further:

    We construct a comprehensive tool repository through two complementary approaches. First, we directly fetch over 3,000 real MCP (Model Context Protocol) tools from GitHub repositories, leveraging existing high-quality tool specifications. Second, we systematically evolve 82 synthetic tools through a hierarchical domain generation process. We begin with key categories (e.g., financial trading, software applications, robot control), then evolve multiple specific application domains within each category. Specialized tools are then synthesized for each domain, with clear interfaces, descriptions, and operational semantics. This evolution process produces over 20,000 synthetic tools.

    They analyze a set of real tools, generate some novel (but derivative) ones, then domain-specialize them for a lot of use cases.

    Once they have this tool zoo, the actual training loop (sketched in code after the list) involves:

    1. Randomly sample a subset of tools and give it to a new agent with a fresh system prompt. Generate tool-appropriate tasks with explicit success rubrics.
    2. Run an LLM-driven user simulator to drive the agent, while running the tools in sandbox that keeps state.
    3. Filter trajectories using another LLM as judge to keep only successful ones for SFT
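
    Stitched together, that loop looks roughly like the sketch below; every helper here (generate_task, run_agent, user_simulator, judge) stands in for an LLM call or sandboxed tool execution rather than any real Kimi API:

    import random

    def generate_sft_trajectories(tool_repo, n_episodes):
        kept = []
        for _ in range(n_episodes):
            tools = random.sample(tool_repo, k=random.randint(2, 6))  # 1. sample a tool subset
            task, success_rubric = generate_task(tools)               #    task + explicit rubric
            trajectory = run_agent(task, tools,                       # 2. LLM user simulator drives the
                                   user=user_simulator,               #    agent against stateful,
                                   sandboxed=True)                    #    sandboxed tools
            if judge(trajectory, success_rubric) == "success":        # 3. LLM-as-judge filter keeps
                kept.append(trajectory)                               #    only successes for SFT
        return kept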

    They’re using models at every stage to generate data and evaluate options. When it comes to the actual RL training, they baseline on verifiable rewards wherever possible. They, and the Qwen folks, talk about their simulator setup for code4: thousands of sandbox environments.

    For software engineering tasks, we collect a vast amount of pull requests and issues from GitHub to build software development environment that consists of user prompts/issues and executable unit tests. This environment was built on a robust sandbox infrastructure, powered by Kubernetes for scalability and security. It supports over 10,000 concurrent sandbox instances with stable performance, making it ideal for both competitive coding and software engineering tasks

    The combination of very sophisticated synthetic data and operationally intense sandboxes seems like table stakes for the current agentic game, and one which a lot of labs have figured out. It feels very promising for a growth in the capabilities of these models over time, particularly as we work out how best to distill them down to smaller sizes for inference.

    1. Which seems a very solid model, but they haven’t released a lot of extra details about how they got there. One interesting component of the release though was that they forked Gemini CLI to make a qwen-code tool that works with any OpenAI compatible API, and I had some success locally plugging it into the smaller Qwen3 (non-coder) releases in case you were looking for some offline agentic capabilities! ↩︎
    2. Then GLM is distilled between the RL and base version of the model, which apparently helps generalize. This seems like a fun and relatively simple way of smoothing out the learning. ↩︎
    3. Though Claude 3.5 wasn’t, and that is really the trend-setter here I guess! ↩︎
    4. And other tasks that allow fully verifiable rewards. They use other models to score softer domains like creative writing. ↩︎

  • PyTorch Conference 2025

    The schedule is up for the 2025 edition of the PyTorch conference, which is now at the Moscone West in San Francisco.

    https://events.linuxfoundation.org/pytorch-conference/program/schedule/

    There are a lot of great sessions, but I’ll highlight some I personally find particularly interesting:

    Post-Training: Clearly a big theme this year, with some interesting talks from multiple groups:

    General Training

    Kernel development

    Compilers

    Inference

    I’m looking forward to October!

  • Cute-DSL

    In May Nvidia shipped CuTe‑DSL, the Python library they teased at GTC earlier in the year that mirrors CUTLASS’s C++ tensor‑layout abstractions. Then, at the start of June, the ‑dev label disappeared (so presumably it’s production-ready now). The pitch is simple: Write speed‑of‑light kernels from the comfort of Python.

    Of course, nothing about CUDA is ever really simple. CuTe‑DSL gives the full Cutlass experience1, wrapped in an ever so slightly more approachable interface.

    Getting Cute: Transpose

    Matrix transpose felt like a reasonable ‘hello world’: B[j,i] = A[i,j]. The PyTorch version is simple: torch.transpose(input, 0, 1).

    To get a baseline, here is a simple transpose kernel in Triton. We tl.load, flip the coordinates and tl.store it back.

    @triton.jit
    def triton_transpose_kernel(input_ptr, output_ptr, M, N, BLOCK_SIZE: tl.constexpr):
        # 2D block coordinates
        pid_m = tl.program_id(0)
        pid_n = tl.program_id(1)
        
        # Calculate offsets
        offs_m = pid_m * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
        offs_n = pid_n * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
        
        # Load with masking
        mask = (offs_m[:, None] < M) & (offs_n[None, :] < N)
        a = tl.load(input_ptr + offs_m[:, None] * N + offs_n[None, :], mask=mask)
        
        # Store transposed (swap coordinates)
        tl.store(output_ptr + offs_n[:, None] * M + offs_m[None, :], a, mask=mask)
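
    A host-side wrapper to launch it might look like this (the wrapper name and default block size are mine; it assumes a contiguous row-major input):

    import torch
    import triton

    def triton_transpose(A: torch.Tensor, BLOCK_SIZE: int = 32) -> torch.Tensor:
        assert A.is_cuda and A.is_contiguous()
        M, N = A.shape
        B = torch.empty((N, M), device=A.device, dtype=A.dtype)
        # One program per BLOCK_SIZE x BLOCK_SIZE tile
        grid = (triton.cdiv(M, BLOCK_SIZE), triton.cdiv(N, BLOCK_SIZE))
        triton_transpose_kernel[grid](A, B, M, N, BLOCK_SIZE=BLOCK_SIZE)
        return B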

    Here’s the same idea in CuTe‑DSL. CuTe leverages a decorator and Python’s ability to integrate with JITs. Anything decorated with @jit runs host-side (on the CPU), while @kernel is used for device-side code (on the GPU). There are both AST-based and tracing-based options depending on the presence of dynamic shapes or control flow.

        @cute.kernel
        def transpose_kernel(self, mA: cute.Tensor, mB: cute.Tensor):
            tidx = cute.arch.thread_idx()[0]
            tidy = cute.arch.thread_idx()[1]
            bidx = cute.arch.block_idx()[0]
            bidy = cute.arch.block_idx()[1]
            # This might all be unnecessary
            # but I was fearful of the compiler
            tile_start_m = cutlass.Int32(0)
            tile_start_n = cutlass.Int32(0)
            global_m = cutlass.Int32(0)
            global_n = cutlass.Int32(0)
            M = cutlass.Int32(0)
            N = cutlass.Int32(0)
            val = cutlass.Float32(0.0)
            # Calculate tile starting positions
            tile_start_m = bidy * self._tile_size
            tile_start_n = bidx * self._tile_size
            # Calculate global coordinates for this thread
            global_m = tile_start_m + tidy
            global_n = tile_start_n + tidx
            # Get matrix dimensions at runtime
            M = mA.shape[0]
            N = mA.shape[1]
            # Bounds checking and transpose operation
            if global_m < M and global_n < N:
                val = mA[global_m, global_n]
                # Transpose: B[n, m] = A[m, n]
                mB[global_n, global_m] = val

    What just happened?

    • Thread and block indices come straight from CUDA (thread_idx, block_idx), vs the Triton block abstraction
    • No explicit loads or stores: CuTe uses overloaded [] to generate them.

    Launching isn’t a million miles away from Triton:

    @cute.jit   # host side
    def launch(self, A: cute.Tensor, B: cute.Tensor):
        M, N = A.shape
        grid = ((N + self.T - 1)//self.T,
                (M + self.T - 1)//self.T, 1)
        self.transpose_kernel(A, B).launch(
            grid=grid,
            block=[self.T, self.T, 1],
        )

    Because CuTe‑DSL speaks DLPack, you can hand it a PyTorch tensor directly. If you wanted to cache the conversion, it looks like this:

    A_cute = from_dlpack(A).mark_layout_dynamic()

    The mark_layout_dynamic is used to trigger the dynamic shape support, and avoid shape specialization. The one place where this went a bit funky in my testing was dealing with singular leading dimensions: there you need to be more explicit about the shape to satisfy the compiler.
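
    End to end, calling it from PyTorch looks roughly like this. Transpose here is a hypothetical wrapper class holding the @cute.kernel/@cute.jit methods above, and the import path may move between releases:

    import torch
    from cutlass.cute.runtime import from_dlpack  # path used by the current examples, may shift

    A = torch.randn(4096, 8192, device="cuda")
    B = torch.empty(8192, 4096, device="cuda")

    op = Transpose(tile_size=32)                      # hypothetical wrapper for the kernel above
    op.launch(from_dlpack(A).mark_layout_dynamic(),   # DLPack hand-off, no copies
              from_dlpack(B).mark_layout_dynamic())   # first call JIT-compiles, later calls are cached

    torch.testing.assert_close(B, A.t())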

    Layouts and Memory

    This kernel isn’t really leveraging the fundamental value of CuTe though, which is composable tensor layouts and memory management. CuTe‑DSL exposes the full memory hierarchy: global, shared, register (and tmem for those with Blackwells), and lets you tile, copy, and pipeline data between them. Common primitives:

    • make_layout / make_layout_tv: describe how a tensor is laid out.
    • cute.zipped_divide(tensor, tiler): tile a tensor.
    • cute.copy(src_layout, dst_layout, pred=mask): async copy.
    • cute.arch.sync_threads(): explicit barrier.

    HGEMMony2

    CuTe ships with some example kernels, so I grabbed one — an HGEMM (half-precision, FP16, batched GEMM) — and compared to an example implementation in Triton.

    To express the same thing in PyTorch, we can unleash our inner Jeremy Howard and use einsum notation: torch.einsum("mkl,nkl->mnl", a, b). Take L batches of an MxK matrix, L batches of an NxK matrix, and return L batches of an MxN matrix.

    Here is the Triton:

    @triton.jit
    def triton_batched_hgemm_kernel(
    	a_ptr, b_ptr, c_ptr,
      M, N, K, L, 
      stride_am, stride_ak, stride_al, 
      stride_bn, stride_bk, stride_bl, 
      stride_cm, stride_cn, stride_cl, 
      BLOCK_SIZE_M: tl.constexpr, BLOCK_SIZE_N: tl.constexpr, BLOCK_SIZE_K: tl.constexpr,
      GROUP_SIZE_M: tl.constexpr,
    ):
        """Triton batched half-precision GEMM kernel: C[m,n,l] = sum_k A[m,k,l] * B[n,k,l]"""
        pid = tl.program_id(axis=0)
        pid_batch = tl.program_id(axis=1)  # Batch dimension
        
        # Calculate batch offsets
        batch_offset_a = pid_batch * stride_al
        batch_offset_b = pid_batch * stride_bl  
        batch_offset_c = pid_batch * stride_cl
        num_pid_m = tl.cdiv(M, BLOCK_SIZE_M)
        num_pid_n = tl.cdiv(N, BLOCK_SIZE_N)
        num_pid_in_group = GROUP_SIZE_M * num_pid_n
        group_id = pid // num_pid_in_group
        first_pid_m = group_id * GROUP_SIZE_M
        group_size_m = min(num_pid_m - first_pid_m, GROUP_SIZE_M)
        pid_m = first_pid_m + ((pid % num_pid_in_group) % group_size_m)
        pid_n = (pid % num_pid_in_group) // group_size_m
        offs_am = (pid_m * BLOCK_SIZE_M + tl.arange(0, BLOCK_SIZE_M)) % M
        offs_bn = (pid_n * BLOCK_SIZE_N + tl.arange(0, BLOCK_SIZE_N)) % N
        offs_k = tl.arange(0, BLOCK_SIZE_K)
        
        # Include batch offsets in pointer calculations
        a_ptrs = a_ptr + batch_offset_a + (offs_am[:, None] * stride_am + offs_k[None, :] * stride_ak)
        # transpose for GEMM (load B[K, N] pattern)
        b_ptrs = b_ptr + batch_offset_b + (offs_k[:, None] * stride_bk + offs_bn[None, :] * stride_bn)
        
        # We accumulate into fp32 for higher accuracy.
        accumulator = tl.zeros((BLOCK_SIZE_M, BLOCK_SIZE_N), dtype=tl.float32)
        
        for k in range(0, tl.cdiv(K, BLOCK_SIZE_K)):
            # Load the next block of A and B, mask in K
            a = tl.load(a_ptrs, mask=offs_k[None, :] < K - k * BLOCK_SIZE_K, other=0.0)
            b = tl.load(b_ptrs, mask=offs_k[:, None] < K - k * BLOCK_SIZE_K, other=0.0)
            accumulator = tl.dot(a, b, accumulator)
            a_ptrs += BLOCK_SIZE_K * stride_ak
            b_ptrs += BLOCK_SIZE_K * stride_bk
        # Convert back to FP16 for output
        c = accumulator.to(tl.float16)
        offs_cm = pid_m * BLOCK_SIZE_M + tl.arange(0, BLOCK_SIZE_M)
        offs_cn = pid_n * BLOCK_SIZE_N + tl.arange(0, BLOCK_SIZE_N)
        c_ptrs = c_ptr + batch_offset_c + stride_cm * offs_cm[:, None] + stride_cn * offs_cn[None, :]
        c_mask = (offs_cm[:, None] < M) & (offs_cn[None, :] < N)
        tl.store(c_ptrs, c, mask=c_mask)

    The core loop here is:

    • Divide into block_size_k chunks of K
    • For each do a masked load (so when we hit the edges we don’t bring in garbage)
    • Do the dot product for the tile into an accumulator for the result matrix
    • Advance the A and B pointers

    The CuTe kernel is, uhhh… a bit more involved. The full kernel is several hundred lines long. You can see the source in the Cutlass repo, which demonstrates some cool features like the ability to pass in an epilogue function for fusion.

    For now, let’s focus on the main loop. As before, we are looping over tiles of K3.

    The first thing we do is advance the pointers. The kernel is doing explicit pipelining of transfers from global memory, to shared memory, to registers, so we need to set wait groups to ensure all the loading has completed before we advance. We’re not actually doing loading in this section, just prepping the ground:

    for k_tile in cutlass.range_dynamic(k_tile_count, unroll=1):
        for k_block in range(num_k_block):
            if k_block == num_k_block - 1:
                tCsA_p = tCsA_copy_view[None, None, None, smem_pipe_read]
                tCsB_p = tCsB_copy_view[None, None, None, smem_pipe_read]
                cute.arch.cp_async_wait_group(num_smem_stages - 2)
                cute.arch.sync_threads()

    Next we kick off a copy from shared memory (smem) to registers (rmem) using cute.copy for the (future) A and B tiles.

    k_block_next = (k_block + 1) % num_k_block  # static
    cute.copy(
        tiled_copy_s2r_A,
        tCsA_p[None, None, k_block_next],
        tCrA_copy_view[None, None, k_block_next],
    )
    cute.copy(
        tiled_copy_s2r_B,
        tCsB_p[None, None, k_block_next],
        tCrB_copy_view[None, None, k_block_next],
    )

    Finally, we interleave the transfers of the next A and B tiles from global to shared memory with the actual gemm operation (the equivalent of tl.dot). I will trust the folks at Nvidia that this is an optimal pattern. The pred= in there is the equivalent of masking in Triton.

    if k_block == 0:
        if k_tile + num_smem_stages - 1 < k_tile_count:
            cute.copy(
                tiled_copy_B,
                tBgB[None, None, None, k_tile_index],
                tBsB[None, None, None, smem_pipe_write],
                pred=tBpB,
            )
        k_tile_index = k_tile_index + 1
        cute.arch.cp_async_commit_group()
        smem_pipe_write = smem_pipe_read
        smem_pipe_read = smem_pipe_read + 1

        if smem_pipe_read == num_smem_stages:
            smem_pipe_read = 0

    The pipelining is explicit, which is nice for debuggability and optimization, but very manual.

    Debugging Tips

    export CUTE_DSL_LOG_TO_CONSOLE=1
    export CUTE_DSL_LOG_LEVEL=10   # up to 100
    export CUTE_DSL_PRINT_IR=1     # dump MLIR
    • cute.printf() gives you a GPU‑side printf.
    • Kernels are aggressively cached; rm ~/.cache/cutedsl if things look stale.
    • Multiple @cute.jit host functions in the same Python scope can confuse MLIR (mainly for launching kernels).
    • The control‑flow rules are strict: no return inside a kernel; initialize everything.

    If you’re exploring GPU kernels for the first time, I strongly recommend starting with Triton. When you need to really get into the weeds, or want to reuse CUTLASS building blocks, it’s great to have CuTe‑DSL as an option in Python (provided you’re comfortable spelunking in GPU internals).

    1. I spent a lot of time holding it wrong. Arguably, still holding it wrong. ↩︎
    2. No one knows what it means, but it’s provocative ↩︎
    3. Note the explicit unroll tag. When you really want #pragma but can’t. ↩︎
  • How to build unmaintainable kernels

    What do you need to do to get better performance and GPU efficiency out of your model? The GPU-oriented folks at Stanford recently published an early preview of the work they have been doing on the LLM generation of kernels: Surprisingly Fast AI-Generated Kernels We Didn’t Mean to Publish (Yet) – and they have a list:

    • Memory Access Optimization: improving the efficiency of data movement between different memory hierarchies (global memory, shared memory, registers) and ensuring data is accessed in a way that maximizes bandwidth and minimizes conflicts.
    • Asynchronous Operations & Latency Hiding: hide the latency of slow operations (like global memory access) by overlapping them with computation or other memory transfers
    • Data Type & Precision Optimization: using lower-precision data types (like FP16 or BF16) where possible to reduce memory bandwidth requirements, increase cache effectiveness, and potentially leverage specialized hardware units.
    • Compute & Instruction Optimization: making the arithmetic computations themselves more efficient, reducing instruction count, or leveraging specialized hardware instructions
    • Parallelism & Occupancy Enhancement: maximize the number of active warps on the Streaming Multiprocessors (SMs) to better hide latencies and improve overall throughput
    • Control Flow & Loop Optimization: reducing the overhead associated with loops, branches, and indexing calculations

    That’s a good list! In this case though, it emerged not from (just) talking with kernel experts, but also from developing a model to generate kernels:

    We have some very fast AI-generated kernels in pure CUDA-C without using libraries and DSLs such as CUTLASS and Triton. They are performing close to or in some cases even beating the standard expert-optimized production kernels shipped in PyTorch.

    They developed a very straightforward but smart pattern for structuring test-time compute. They reason about optimizations in natural language before generating code. Then, they branch out into a tree structure of refinements for each optimization idea, to avoid getting stuck in loops of investigation.

    The kernels they generated were somewhere between fast, and very fast:

    Conv2D: 179.9% performance of FP32 torch.nn.Conv2D; problem size: (100, 3, 224, 224) input tensor, conv(in_channels=3, out_channels=96, kernel_size=11, stride=4, padding=2)

    The team aren’t claiming this is a general solution, just an interesting proof of possibility, which it certainly is! The walk-through of how they got to the final conv2D kernel is fascinating, both in terms of human intervention and the chain of optimizations.

    The final code sample for the Conv2D kernel is included in the appendix. It uses advanced CUDA techniques that we find challenging to write ourselves! We also have more example kernels in this Github repo

    The kernel is very fast for specific shapes, on the L40S, in FP32. It’s also a kernel that, by the sounds of it, the team themselves struggled a bit with. It’s very, very specialized. It’s not that a human couldn’t have built it, it’s that (in most cases) they wouldn’t: it’s not a priority kernel, and all that clever CUDA comes with operational overhead and ties to specific hardware, shapes and so on.

    That in itself isn’t new. If you compile with PyTorch or XLA you’ll get a lot of kernels you probably wouldn’t have written yourself, but this adds a new (and weird!) layer of non-determinism to everything. Elsewhere at Stanford, they have been looking at one of the other killers of GPU efficiency: kernel launch overhead. Most models are represented by hundreds of kernels, each of which has to be scheduled from the CPU. LLMs are generally memory-bound, small ones particularly so, and the gaps between kernel executions can end up dominating performance:

    Look Ma, No Bubbles! Designing a Low-Latency Megakernel for Llama-1B:

    In this post, we show how we can bypass this problem by merging the entire Llama-1B forward pass into a single “megakernel” that eliminates kernel boundaries altogether. Doing this achieves brr – on an H100, we use 78% of memory bandwidth and outperform existing systems by over 1.5x. (To our knowledge, this is the lowest-latency forward pass for Llama-1B in bfloat16!) In the rest of this post, we’ll walk through how and why one would do this.

    The idea of megakernels that handle all of the operations is not new, but the complexity of fusing everything together is high. Persistent kernels were popularized at the tail end of the CUDA 11 series, thanks to the residency and async-copy support in Ampere. They allow leaving kernels resident on an SM and having them pull a series of tiles to do their work, rather than scheduling repeated launches of the same kernel. The megakernel takes this idea even further, with multiple operations within the kernel, which pulls a stream of different problems. One issue with this approach (traditionally) is register spilling: you only have so many registers available, up to 255 per thread, with an overall limit of 64K 32-bit registers per SM (on Hopper). That means you need to keep some data in shared memory, and efficient use of shared memory ends up being the bottleneck. The team at Stanford developed paging for shared memory, with a separate reserved block for managing allocations of shared memory to individual tasks.
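
    To make the persistent-kernel idea concrete (this is the building block, not the Stanford megakernel), here is a toy Triton sketch: launch roughly one program per SM and let each one loop over a stream of tiles instead of launching one program per tile:

    import triton
    import triton.language as tl

    @triton.jit
    def persistent_add_kernel(x_ptr, y_ptr, out_ptr, n_elements,
                              BLOCK_SIZE: tl.constexpr, NUM_PROGRAMS: tl.constexpr):
        pid = tl.program_id(0)
        num_tiles = tl.cdiv(n_elements, BLOCK_SIZE)
        # Each resident program strides through the tile space rather than exiting
        for tile in range(pid, num_tiles, NUM_PROGRAMS):
            offs = tile * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
            mask = offs < n_elements
            x = tl.load(x_ptr + offs, mask=mask)
            y = tl.load(y_ptr + offs, mask=mask)
            tl.store(out_ptr + offs, x + y, mask=mask)

    # Host side: the grid is sized to the number of SMs, not the number of tiles, e.g.
    # persistent_add_kernel[(NUM_SMS,)](x, y, out, n, BLOCK_SIZE=1024, NUM_PROGRAMS=NUM_SMS)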

    The megakernel gets the CPU completely out of the picture for the forward pass, but is incredibly specific to the model (in this case Llama 3.2 1B).

    Another collaboration was clearly thinking in the same direction, as they recently posted about their Mirage Persistent Kernel: a compiler for megakernels.

    our team from CMU, UW, Berkeley, NVIDIA, and Tsinghua developed Mirage Persistent Kernel (MPK) — a compiler and runtime system that automatically transforms multi-GPU LLM inference into a high-performance megakernel. MPK unlocks the benefits of end-to-end GPU fusion while requiring minimal manual effort from developers.

    The system works by building a task graph of what they call LAX1 fragments, which in practice is a very short list of 14 operators. This is actually too small to represent everything they need, meaning they have to manually decompose some common ops like ReLU, but this level of decomposition gives them the ability to do some pretty complex fusions.

    The actual ops are generated thanks to Mirage’s Kernel Superoptimizer (a great name), which I think is a very intense autotuner:

    Mirage automatically searches for possible GPU kernels for attention. The search space includes existing manually designed attention kernels (e.g., FlashAttention and FlashDecoding) as special cases. It also includes other implementations that outperform today’s handwritten ones by up to 3.5x for certain use cases. The GPU kernels generated by Mirage can directly operate on PyTorch tensors and be called in your PyTorch program.

    The search is not cheap though:

    In our evaluation, Mirage takes up to 4 hours to optimize a Lax program. This optimization is a one-time cost before deployment on the target hardware.

    The aggressive decomposition allows them to have a clever verification scheme where they validate kernels on random inputs to get confidence in (approximate) numerical correctness.

    They then build a worker kernel with all the relevant operations, and schedule the optimized graph via dedicated scheduler warps. Workers are scheduled on the SMs and report back status. The scheduler warps then decide when tasks can be enqueued for execution.

    They’ve got a code example that walks through setting it up for Qwen. They recreate the model structure explicitly, generate a task graph from it, and kick off the search for optimal kernels and fusions. This avoids the need to solve the Dynamo-style problem of tracing the model!

    The resulting kernel is again heavily tied to the specific hardware and model. One thing we have found useful for investigating production problems is the ability to ablate different parts of the compile process, running models in (basically) PyTorch eager mode. This approach leaves the darkest of black boxes to work with, and I would imagine even more terrifying PTX than the complex CUDA that the LLM kernel generation team came up with.

    Between these projects though, it feels like we are exploring the edges of what running a program on GPUs should actually look like: a combination of kernel generation and multi-level search seems almost guaranteed to yield optimizations that would be far outside the cost-benefit curve for manual implementation. What we don’t have yet are established ways to operationalize this kind of approach, but it’s an exciting area to watch!

    Thanks to Matt for nudging me to actually read through these papers, they’d been on my to-do list for a bit!


    1. I am not sure what this stands for, but the basic ops in jax are in jax.lax, so I presume it’s the same source! ↩︎

  • Free-Threaded Python gets ‘supported’ status

    Huge congratulations to Thomas, Matt and Sam for shepherding through PEP 779, which moves the no-GIL/free-threaded Python mode from experimental to supported:

    https://discuss.python.org/t/pep-779-criteria-for-supported-status-for-free-threaded-python/84319/123

    With these recommendations and the acceptance of this PEP, we as the Python developer community should broadly advertise that free-threading is a supported Python build option now and into the future, and that it will not be removed without following a proper deprecation schedule

    Having confidence in the long term of this feature is great for anyone building on it. I’m very grateful to (and feel lucky to be working with!) the many folks who have been squashing bugs and improving performance, and to the people adding support for FT Python across the ecosystem!
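
    If you want to check what you’re actually running on, a quick probe (Python 3.13+) looks like this:

    import sys
    import sysconfig

    # True on a free-threaded ("t") build, regardless of whether the GIL is currently active
    print("free-threaded build:", bool(sysconfig.get_config_var("Py_GIL_DISABLED")))

    # 3.13+ also exposes the runtime state of the GIL
    if hasattr(sys, "_is_gil_enabled"):
        print("GIL enabled right now:", sys._is_gil_enabled())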

    The steering committee have laid out some solid documentation and performance expectations for the ongoing work, and are setting an expectation of broad compatibility for future CPython work:

    New experimental projects within CPython must be compatible with, and should be based on the free-threading build. The SC encourages this direction to reduce engineering complexity caused by supporting both GIL and free-threaded builds

    I also appreciate the call for building out more high-level concurrency primitives: I think there are a lot of exciting projects to come as we move more of this into production!

  • Monarch: PyTorch Single Controller

    I’ve been excited for this to make it to OSS: The PyTorch team at Meta recently soft-launched Monarch on Github.

    pytorch-labs/monarch: PyTorch Single Controller

    Back in 2022, Google’s Pathways paper proposed (revisiting) a single-controller approach for managing machine learning runs. Typically, ML jobs use an SPMD (Single Program, Multiple Data) approach, distributing identical code across multiple hosts. Each host runs independently, synchronizing during collective operations. This works, as evidenced by the many large training runs in the world. It also introduces complexity, especially with pipeline parallelism, where conditional logic for different ranks can clutter up your training code. Even without that, subtle issues can arise: for example, slight differences in torch.compile optimization have (in the past!) led to deadlocks by placing collectives differently on separate nodes.

    The single-controller model simplifies this by centralizing program execution on one main node and using generic workers on the hosts that execute assigned tasks. This provides a consistent, global view of the entire computation, making it easier to get to a correct implementation of parallelisms and other distributed work. This doesn’t come for free though: the main node must efficiently manage (potentially!) thousands of GPUs without becoming a bottleneck, and existing code must adapt to this new centralized model.

    Monarch is the PyTorch team’s implementation of this single-controller concept. It provides a familiar PyTorch frontend, additional module wrappers, and a high-performance Rust-based actor system for distributing and managing work.

    The fundamental abstraction in Monarch is the Actor. Each Actor executes on its own accelerator and maintains its own state and behavior. Communication with other Actors is via async message passing on methods decorated with @endpoint. The nice thing about the programming model is that you can interact with all of the actors in your mesh in a consistent way.
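
    As a minimal sketch of what that looks like (my own toy actor; the exact import paths have moved around between Monarch versions, so check the example notebooks):

    # Import paths are illustrative and may differ between Monarch versions
    from monarch.actor import Actor, endpoint, proc_mesh

    class EchoActor(Actor):
        """Toy actor: each instance keeps its own state in its own process."""

        def __init__(self):
            self.seen = 0

        @endpoint  # exposed as an async message-passing endpoint
        async def echo(self, msg: str) -> str:
            self.seen += 1
            return f"{msg} (message #{self.seen})"

    # Inside an async context you would then do something like:
    # mesh = await proc_mesh(gpus=2)
    # actors = await mesh.spawn("echo", EchoActor)
    # replies = await actors.echo.call("hello")   # fans out to every actor in the mesh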

    Monarch is appealing even if you’re not GPU-rich. For instance, at home, I have a machine equipped with two (mismatched) 3090s, and Monarch allows me to run and debug jobs directly in notebooks without relying on external services.

    Installation had minor hurdles because I built from source rather than using the available pip package. Although the README specifies Python 3.10, Python 3.13 worked fine. The dependencies reference dnf (reflecting Meta’s internal Linux distro choice), so adapting commands to other Linux distributions (Ubuntu in my case) was necessary. Additionally, I had to set BINDGEN_EXTRA_CLANG_ARGS="-I/usr/include/c++/11 -I/usr/include/x86_64-linux-gnu/c++/11" to resolve Rust compilation issues.

    Once installed, running Monarch’s distributed data-parallel notebook was straightforward (see: monarch/examples/notebooks/spmd_ddp.ipynb). The notebook shows that minimal code changes to the standard DDP example are required, mainly subclassing Actor (e.g., class DDPActor(Actor)), while keeping the training loop virtually identical. Monarch handles the rest, including distributed execution and debugging across multiple GPUs.

    Setting up the environment means providing the mesh configuration and launching the actors, which can be done from a cell:

    # Spawn a process mesh
    local_proc_mesh = await proc_mesh(
        gpus=WORLD_SIZE,
        env={
            "MASTER_ADDR": "localhost",
            "MASTER_PORT": "12455",
        },
    )
    # Spawn our actor mesh on top of the process mesh
    ddp_actors = await local_proc_mesh.spawn("ddp_actor", DDPActor)

    I didn’t have to manually start any other services; it all happened under the hood. Triggering the run is just:

    await ddp_actors.demo_basic.call()

    Which output:

    self.rank=0 Running basic DDP example
    self.rank=1 Running basic DDP example
    self.rank=1 Finished running basic DDP example
    self.rank=0 Finished running basic DDP example

    What I find really appealing is how easy it is to execute across ranks. For example, to query for system info:

    print("Gathering system info from all ranks...")
    system_info = await ddp_actors.get_system_info.call()
    
    print("\n SYSTEM INFORMATION ACROSS ALL RANKS:")
    print("=" * 60)
    
    for point, rank_info in system_info:
        print(f"Rank {rank_info['rank']}: PID={rank_info['process_id']}, "
              f"Device={rank_info['device_name']}, "
              f"GPU Memory={rank_info['gpu_memory_allocated']/1e6:.1f}MB")
    
    print(f"\nFound {len(system_info)} ranks in the mesh")
    
    all_rank_info = [value for point, value in system_info]
    print(f"Total GPU memory across all ranks: {sum(info['gpu_memory_allocated'] for info in all_rank_info)/1e6:.1f}MB")

    Outputting:

    Gathering system info from all ranks...
    [Rank 0] System Info: PID=10519, CPU=0.1%, RAM=5.2%, GPU_MEM=0.0MB
    [Rank 1] System Info: PID=10520, CPU=0.1%, RAM=5.2%, GPU_MEM=0.0MB
    
     SYSTEM INFORMATION ACROSS ALL RANKS:
    ============================================================
    Rank 0: PID=10519, Device=NVIDIA GeForce RTX 3090, GPU Memory=0.0MB
    Rank 1: PID=10520, Device=NVIDIA GeForce RTX 3090, GPU Memory=0.0MB
    
    Found 2 ranks in the mesh
    Total GPU memory across all ranks: 0.1MB

    I can also stop training and dump state if I need to, making it easy to check norms and debug:

    print("Running training steps...")
    for step in range(3):
        print(f"\n--- Step {step + 1} ---")
        
        step_results = await ddp_actors.train_step.call()
        
        all_results = [value for point, value in step_results]
        
        losses = [result['loss'] for result in all_results]
        grad_norms = [result['grad_norm'] for result in all_results]
        throughputs = [result['throughput'] for result in all_results]
        
        print(f"Losses across ranks: {[f'{l:.4f}' for l in losses]}")
        print(f"Gradient norms: {[f'{g:.4f}' for g in grad_norms]}")
        print(f"Avg throughput: {sum(throughputs)/len(throughputs):.1f} samples/sec")

    Which outputs:

    --- Step 1 ---
    [Rank 1] Step 1: Loss=1.1128, GradNorm=0.3198, Time=0.241s
    [Rank 0] Step 1: Loss=1.0414, GradNorm=0.3198, Time=0.253s
    Losses across ranks: ['1.0414', '1.1128']
    Gradient norms: ['0.3198', '0.3198']
    Avg throughput: 129.8 samples/sec
    
    --- Step 2 ---
    [Rank 0] Step 2: Loss=1.1526, GradNorm=0.3096, Time=0.003s
    [Rank 1] Step 2: Loss=1.0546, GradNorm=0.3096, Time=0.003s
    Losses across ranks: ['1.1526', '1.0546']
    Gradient norms: ['0.3096', '0.3096']
    Avg throughput: 9800.9 samples/sec
    
    --- Step 3 ---
    [Rank 1] Step 3: Loss=0.9116, GradNorm=0.2243, Time=0.002s
    [Rank 0] Step 3: Loss=0.9662, GradNorm=0.2243, Time=0.002s
    Losses across ranks: ['0.9662', '0.9116']
    Gradient norms: ['0.2243', '0.2243']
    Avg throughput: 19977.5 samples/sec

    While the distributed stuff here is cool, it’s not wildly different from using a distributed framework like Ray plus a little bit of setup (though I am pretty allergic to setup). What is most interesting is how this changes the programming model of PyTorch, and makes it really easy to build out distributed experiments.

    For example, if I were building a param server, the sync only requires an awaited read of the weights from another object, taking advantage of the RDMA support for an efficient copy1:

        @endpoint
        async def worker_sync_with_ps(self, param_server) -> bool:
            """Synchronize with parameter server and get RDMA handles"""
                
            self._log("Synchronizing with parameter server...")
            
            # Get RDMA buffer handles
            self.weight_buffers = await param_server.ps_get_weight_handles.call_one()
            self.gradient_buffers = await param_server.ps_get_gradient_handles.call_one()
            
            # Get metadata
            metadata = await param_server.ps_get_metadata.call_one()
            self.weight_metadata = metadata['weights']
            self.gradient_metadata = metadata['gradients']
            
            # Perform initial weight sync
            sync_time = await self._sync_weights_from_ps()
            
            self._log(f"Synchronized with parameter server (sync time: {sync_time:.3f}s)")
            return True

    Getting those weight buffers is as simple as creating the right Monarch object:

    def tensor_to_rdma_buffer(tensor: torch.Tensor) -> RDMABuffer:
        # RDMA requires 1D contiguous uint8 tensors
        byte_tensor = tensor.view(torch.uint8).flatten()
        return RDMABuffer(byte_tensor)

    For an early preview of a library, Monarch is surprisingly complete, and definitely worth a look.

    1. Not that this would do anything for my 3090s! ↩︎
  • Metrics for Engineering Teams

    Don’t blindly tie every piece of work to top-level metrics. Even if technically feasible, the cost is too high and the risk of spurious logic chains is significant.

    Start with Value Definition

    Begin each project with a crisp definition of why it’s worth doing. What underlying problem are you solving, and why is that problem worth solving? Once you have these narrative assertions, it’s usually clear how extraordinary or controversial each claim is.

    The more notable the claim, the more likely you need data to support it.

    Value Metrics

    1. Direct outcome metrics (strongest)
    We will run an ongoing experiment measuring profit per user with the feature on vs baseline.

    2. Strong correlative metrics
    This will increase time on site, which we can measure and believe correlates with profit per user.

    3. Anecdotes and feedback
    N sales team members report they could sell into more accounts if we launch this feature.

    4. Strategic assertions
    We must do this because of upcoming regulation or we will be unable to continue this business line.

    Progress Measurement

    Once your value claim is clear and defensible, identify how you’ll measure progress. This may differ from your value metric. Ideal progress metrics tell you whether you’re succeeding, respond quickly to team actions, and have strong reference baselines.

    1. Clear baseline, goal, and team-tied metric (Strongest)
    Launching this compressor will reduce binary download size by an estimated 10% vs the best available industry baseline. We can measure relative progress continually against our production binary during development.

    2. Responsive metric without clear reference point
    We can improve compile times on this fixed codebase from today’s 90s baseline.

    3. Non-responsive metric
    We can measure weekly mobile app release binary size, comparing the new compressor to our old implementation.

    4. Milestones
    We will implement passes a, b, c, after which we can ship the new compiler optimizations to target customers.

    Common Challenges

    Stronger measures aren’t always a worthwhile tradeoff. If you have high confidence in the work’s value and applicability and mainly need to validate progress, milestones can be completely reasonable.

    In general, approach projects with skepticism about whether this is the right thing to do and whether you’ll make good progress. Then identify ways to get concrete data rather than rely solely on leadership support.

    Leadership pressure for top-level metrics
    The clearer you are on why you’re doing something, the easier it becomes to communicate your measurement decisions. If leadership continues pushing back and you have a good relationship, use that as a lens to explore concerns you might have missed. Often requests for metric clarity stem from deeper worries about project value or plausibility.

    Team dynamics and gaming
    Metrics communicate value in performance reviews, creating incentives for engineers to identify unnecessary correlations or game metrics (intentionally or not). Counter this with “health” metrics that balance negative behaviors — if measuring deployment frequency, also measure production incidents with a flat target to prevent trading off reliability for speed.

    For senior engineers concerned about optics, have them clearly articulate the value chain. Working through and demonstrating strong data usage in project steering is itself a highly rewarded skill — encourage them to take on that role.

    When to revisit metrics
    The world changes. What’s sensible in one environment isn’t in another. Constantly relitigating is a headache, but reevaluating logic at regular intervals (say, each half-year planning cycle) is appropriate. Otherwise, maintain awareness of company trends: new projects, initiatives, or teams gaining significant attention. If they were successful, would that change what you do? There’s no hard rule: it’s corporate decision-making taste that develops with experience.

  • Fused Linear Cross-Entropy

    Fused Linear Cross-Entropy is a popular optimization that combines the final linear projection and cross-entropy loss into a single operation. This fusion is very valuable for training large language models efficiently, as it can reduce memory usage significantly, particularly for larger vocabularies.

    If you look at a LLM training loop, you generally see something like:

    logits = model(input_ids)
    loss = cross_entropy(logits, targets)

    And if you look at the end of the model, you’ll see something like the below, where h is the hidden state so far and the output projection is defined as output = nn.Linear(embed_dim, vocab_size, bias=False):

    # shape: [b, seq_len, out_dim]
    output = self.output(h)

    That final logits value can be pretty big: the sequence length is long and the vocabulary size is large (128k for Llama 3, 202k for Llama 4), so the logits can be GBs of memory: with a 16k context window, a 128k vocab, and 4k embedding dimensions, even at a batch size of 1 you get about 2 billion logit entries. At BF16, that’s 4GB. You’ll also need to capture the gradient, which will give you another 4GB in the backwards pass.
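
    Spelling that arithmetic out (my rounding, taking 128k to be 131,072):

    seq_len, vocab, bytes_per_bf16 = 16_384, 131_072, 2
    entries = seq_len * vocab  # one logit per (position, vocab entry) at batch size 1
    print(f"{entries:,} entries -> {entries * bytes_per_bf16 / 2**30:.1f} GiB in BF16")
    # roughly the same again is needed for the logits' gradient in the backward pass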

    That set of logits holds one raw score for each of the possible targets, with values that are a bit all over the place.

    Cross-entropy is a loss between two probability distributions. Jay Mody has an excellent blog post breaking down softmax and CE loss

    Roughly speaking, cross entropy measures the similarity of two probability distributions. In the context of neural networks, it’s common to use cross entropy as a loss function for classification problems where:

    • q is our predicted probabilities vector (i.e. the softmax of our raw network outputs, also called logits, denoted ŷ), that is q = softmax(ŷ)
    • p is a one-hot encoded vector of our label, i.e. a probability vector that assigns 100% probability to position y (the label for the correct class): p_i = 1 if i = y, and p_i = 0 if i ≠ y

    This means that cross-entropy simplifies to F.nll_loss(F.log_softmax(x, 1), target)
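
    A quick sanity check of that equivalence (illustrative shapes only):

    import torch
    import torch.nn.functional as F

    logits = torch.randn(8, 128)            # [batch, vocab]
    target = torch.randint(0, 128, (8,))

    a = F.cross_entropy(logits, target)
    b = F.nll_loss(F.log_softmax(logits, 1), target)
    print(torch.allclose(a, b))  # True: cross-entropy is nll_loss of the log-softmax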

    Softmax turns our previously messy logits into a nice probability distribution where all the values are positive and sum to one. Log-softmax is usually used in LLMs, for numerical stability and efficiency.

    When we implement softmax, the naive implementation looks something like:

    out = torch.log(torch.exp(x) / torch.sum(torch.exp(x)))

    This isn’t numerically stable, so you want to subtract the max to avoid overflows and underflows in the exp. This is the common log-sum-exp implementation:

    x_max = torch.max(x)
    shifted_x = x - x_max
    exp_shifted = torch.exp(shifted_x)
    out = shifted_x - torch.log(torch.sum(exp_shifted))

    In general the memory and compute cost of this grows with the size of the input, which gets painful for our hefty logits. We can instead keep a running log-sum-exp so we don’t have to deal with the whole input at once.

    lse = xs[0]
    for x in xs[1:]:
        m = torch.max(torch.stack([lse, x]))
        lse = m + torch.log(torch.exp(lse - m) + torch.exp(x - m))
    out = lse

    This is the online log-sum-exp approach, and it makes our life easier! We can now compute incrementally, but we are still generating the big logits beforehand.
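
    As a quick check (my own test, in float64 to keep the comparison clean), the running version agrees with computing the log-sum-exp in one shot:

    import torch

    xs = torch.randn(1000, dtype=torch.float64)

    # running (online) log-sum-exp over the elements
    lse = xs[0]
    for x in xs[1:]:
        m = torch.maximum(lse, x)
        lse = m + torch.log(torch.exp(lse - m) + torch.exp(x - m))

    print(torch.allclose(lse, torch.logsumexp(xs, dim=0)))  # True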

    Fused Linear Cross-Entropy replaces the output projection, softmax and loss calculation with a single kernel that tiles across all of it.

    This is the core of the idea: instead of computing all logits at once (which creates a massive tensor), we can:

    1. Compute logits for small chunks of the vocabulary
    2. Compute the softmax incrementally
    3. Only store the logits we need for the loss calculation

    Quoting https://github.com/mgmalek/efficient_cross_entropy

    This repo contains an implementation of a linear projection + cross-entropy loss PyTorch module that has substantially lower memory consumption compared to a standard implementation, with almost no additional compute cost. The memory savings come from two optimizations: 1) overwriting the logits with their gradients in-place and 2) not materializing the entire logits tensor.

    Roughly, the loop looks like:

    For each token i in the sequence, with hidden state h_i and output weight matrix W:

    • Compute a partial dot product s_i = h_i · W_tile for each tile of the vocabulary
    • Reduce into a running max and exp-sum
    • Return only the s_i[target_i] needed for the loss.

    This gives you quite a lot of memory wins, which also reduce peak memory bandwidth needs. But this also introduces some potential pain!

    1. You’re fusing the final layer op into the loss, which might be defined in different places in your model code
    2. You’re accumulating, so you have to use fp32 or risk introducing numeric errors
    3. You have to write your own backwards op as well, which will generally do some extra computation, so you are paying some extra FLOPS
    4. Debugging can be harder, so you want a good recipe prior to swapping in the op
    5. May require some futzing for best implementations on different hardware.

    Actually implementing this is pretty straightforward.

    @staticmethod
    def forward(ctx, h, W, target):
        B, D = h.shape
        V, _ = W.shape
        
        chunk_size = min(1024, V)
        
        # Initialize online softmax accumulators
        max_logits = torch.full((B,), -float('inf'), device=h.device, dtype=torch.float32)
        sum_exp = torch.zeros(B, device=h.device, dtype=torch.float32)
        target_logits = torch.zeros(B, device=h.device, dtype=torch.float32)
            
        # Process vocabulary in chunks
        for chunk_start in range(0, V, chunk_size):
            chunk_end = min(chunk_start + chunk_size, V)
                
            # Compute logits for this chunk only
            W_chunk = W[chunk_start:chunk_end, :]
            logits_chunk = h @ W_chunk.T  # [B, chunk_size]
                
            # Update running max
            chunk_max = logits_chunk.max(dim=1).values
            new_max = torch.maximum(max_logits, chunk_max)
                
            # Adjust previous sum_exp by exp(old_max - new_max)
            sum_exp *= torch.exp(max_logits - new_max)
            
            # Add this chunk's contribution to sum_exp
            sum_exp += torch.exp(logits_chunk - new_max.unsqueeze(1)).sum(dim=1)
            
            # Update max
            max_logits = new_max
                
            # Extract target logits if target is in this chunk
            chunk_indices = torch.arange(chunk_start, chunk_end, device=h.device)
            is_target_in_chunk = (target.unsqueeze(1) == chunk_indices.unsqueeze(0))
            target_logits += (logits_chunk * is_target_in_chunk).sum(dim=1)
        
        # Compute loss: -log(p_target) = -(target_logit - log_sum_exp)
        log_sum_exp = max_logits + torch.log(sum_exp)
        loss = log_sum_exp - target_logits
        
        # Save for backward
        ctx.save_for_backward(h, W, target, max_logits, sum_exp)
        ctx.chunk_size = chunk_size
            
        return loss.mean()

    Here we chunk the vocabulary, calculate the partial transform for the chunk h @ W_chunk.T, do online softmax and accumulate the target logits.

    The backward calculates the gradients:

    @staticmethod
    def backward(ctx, grad_output):
        h, W, target, max_logits, sum_exp = ctx.saved_tensors
        chunk_size = ctx.chunk_size
            
        B, D = h.shape
        V, _ = W.shape
            
        # Scale gradient by batch size (since we use mean reduction)
        grad_scale = grad_output / B
            
        # Initialize gradient accumulators
        grad_h = torch.zeros_like(h)
        grad_W = torch.zeros_like(W)
            
        # Process vocabulary in chunks (same as forward)
        for chunk_start in range(0, V, chunk_size):
            chunk_end = min(chunk_start + chunk_size, V)
            chunk_indices = torch.arange(chunk_start, chunk_end, device=h.device)
                
            # Recompute logits for this chunk
            W_chunk = W[chunk_start:chunk_end, :]
            logits_chunk = h @ W_chunk.T  # [B, chunk_size]
                
            # Compute softmax probabilities for this chunk
            # p_i = exp(logit_i - max) / sum_exp
            probs_chunk = torch.exp(logits_chunk - max_logits.unsqueeze(1)) / sum_exp.unsqueeze(1)
                
            # Gradient w.r.t. logits: grad_logits = p - 1_{y=i}
            grad_logits_chunk = probs_chunk * grad_scale
                
            # Subtract 1 from target positions
            is_target = (target.unsqueeze(1) == chunk_indices.unsqueeze(0))
            grad_logits_chunk -= is_target.float() * grad_scale
                
            # Accumulate gradients
            grad_h += grad_logits_chunk @ W_chunk
                
            grad_W[chunk_start:chunk_end, :] = grad_logits_chunk.T @ h
            
        return grad_h, grad_W, None

    In the backwards pass we recompute the logits for each chunk, and calculate the gradients.

    This is a very simplified implementation that incurs a bunch of extra kernel launches, so it gives up a lot of performance, but you can see the memory savings:

    Regular:
    Time: 285.18 ms
    Memory (total): 3072.0 MB
    Loss: 11.142737
    Chunked online softmax:
    Time: 470.27 ms
    Memory (total): 356.0 MB
    Loss: 11.142738

    For a more sophisticated implementation, you can look at the repo mentioned before, or at Liger, which has a good quality kernel with further optimizations. These calculate the gradients in the forward pass, and then can just scale them in the backwards. This trades a bit more memory for less of a compute hit. In general there are a few options for choosing the right point on the memory/compute tradeoff.
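
    To make the shape of that concrete, here is a rough sketch of the gradient-in-the-forward idea, chunking over tokens rather than the vocabulary so each chunk sees its complete softmax (loosely modelled on those implementations, not their actual code):

    import torch

    class ChunkedLinearCE(torch.autograd.Function):
        """Sketch: compute the gradients during the forward pass, so the
        backward is just a scale by the upstream gradient."""

        @staticmethod
        def forward(ctx, h, W, target):
            B, D = h.shape
            chunk = 4096  # tokens per chunk; tune for your memory budget
            losses = torch.zeros(B, device=h.device, dtype=torch.float32)
            grad_h = torch.zeros_like(h)
            grad_W = torch.zeros_like(W)
            for s in range(0, B, chunk):
                e = min(s + chunk, B)
                logits = (h[s:e] @ W.T).float()  # full vocab, but only a few rows
                losses[s:e] = torch.nn.functional.cross_entropy(
                    logits, target[s:e], reduction="none")
                # d(mean loss)/d(logits) = (softmax - one_hot) / B
                probs = torch.softmax(logits, dim=1)
                probs[torch.arange(e - s, device=h.device), target[s:e]] -= 1.0
                probs /= B
                grad_h[s:e] = (probs @ W.float()).to(h.dtype)
                grad_W += (probs.T @ h[s:e].float()).to(W.dtype)
            ctx.save_for_backward(grad_h, grad_W)
            return losses.mean()

        @staticmethod
        def backward(ctx, grad_output):
            grad_h, grad_W = ctx.saved_tensors
            # forward already did the heavy lifting; just apply the chain rule scale
            return (grad_output * grad_h).to(grad_h.dtype), (grad_output * grad_W).to(grad_W.dtype), None

    Usage is then just loss = ChunkedLinearCE.apply(h, W, target), and the backward is a single multiply rather than a recompute.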

  • Values in AI

    Daniel Schmachtenberger has made the argument:

    1. All technologies embody value systems
    2. Some technologies are obligate in a competitive environment

    The example of his that stuck with me was the plough: many cultures were animistic (a belief in the spirit of the animal), but after the scaling up of agriculture enabled by the plough, most weren’t. The plough’s enablement of large-scale agriculture likely shifted societies toward sedentism (vs nomadism) and surplus, altering spiritual relationships with animals as they became tools for labor. The perspective shift — the value it encodes — is embedded in the technology.

    The plough is also obligate. If one group uses it and another doesn’t, the group that does will be able to farm more per person. That surplus enables more specialization, which yields an advantage either in trade or in conflict. If the second group doesn’t adopt the plough they will be taken over, outgrown, or both, by the first.

    AI may well be an obligate technology, which forces us to make deliberate ethical choices about its deployment and values. We are in the early stages of seeing that with software development. That’s going to change the nature of certain careers: changing what the day-to-day work looks like and impacting demand for software engineers. That isn’t necessarily negative: it will depend on the opportunities that replace the current ones. It also isn’t neutral: our approach to AI, how we deploy it, how it is used are all a series of choices that embed values.

    Some of those values are encoded into the models by the training data and loss functions, some are encoded in the systems engineering, the choices of which tasks to apply it to, which interactions to explore and so on, and some are explicitly engineered in through fine tuning and reinforcement learning.

    One way of looking at those values is through the study of ethics, how to live in a just way. This is a core topic for philosophers. One example is Kant’s Categorical Imperative, which requires actions to follow maxims that could be universally applied without contradiction, ensuring rational consistency.

    It’s somewhat akin to asking the question: Would I still support this if I knew everyone else would act this way? Further, would I support this action if I knew I would be born again randomly into the world, maybe in a much different situation than my current one?

    The proliferation of useful AI agents adds a somewhat realistic flavor to the question: if, in the future, you are dependent on systems constrained by these specific guidelines or rules, are you happy about that?

    Kantian (or deontological) thinking is far from the only ethical system. A lot of thinking about AI ethics has been consequentialist. Consequentialism is practical: the “goodness” of an action is whether it results in a good outcome! Inherently we judge AI training (at least for RL and supervised learning) by the achievement of the outcome encoded in a loss function, reward function or similar. Stuart Russell (of Russell & Norvig fame, from the university courses of my youth) has written about “provably beneficial” AI where the AI maximizes a human-involved reward signal (a little like the Assistant Games pattern we discussed before).

    The downside of all this is well documented — Nick Bostrom’s famous paperclip maximizer thought experiment is an AI that achieves the objective, but in a way that was undesirable. A more benign but annoying example might be a cleaning robot that pushes everything outside the house in order to make it tidier. Because outcome-based rules just judge the what, and not the how, they can also encourage power-seeking (as called out by Bostrom) in order to better achieve objectives.

    standard forms of consequentialism recommend taking unsafe actions when such acts maximize expected utility. Adding features like risk-aversion and future discounting may mitigate some of these safety issues, but it’s not clear they solve them entirely.

    Deontology and safe artificial intelligence – William D’Allesandro

    Anthropic’s constitutional AI approach can be seen as a blend of approaches; the constitution is a set of principles that can be used by another AI to criticize and improve output in response to requests:

    As AI systems become more capable, we would like to enlist their help to supervise other AIs. We experiment with methods for training a harmless AI assistant through self-improvement, without any human labels identifying harmful outputs. The only human oversight is provided through a list of rules or principles, and so we refer to the method as ‘Constitutional AI’. 

    The training still ultimately uses a form of reinforcement learning (which is inherently consequentialist), but the reward is given according to how well the outputs adhere to the constitutional principles.

    A more recent philosopher, Derek Parfit, argued that all moral systems were hill climbing towards a shared perspective, and that you can evaluate an action against multiple of them in order to gain confidence. For example, when considering an option, you could ask:

    a) Would it maximize overall good? (consequentialist)
    b) Could everyone rationally will it? (Kantian)
    c) Could anyone reasonably reject it? (contractualist1)

    “Rationally” here is doing a bit of work: it means “with reasoning”, as in there is a chain of thought that can support and justify the decision.

    Part of the challenge with that kind of rational justification is that part of the reward signal here comes from human raters. We have seen this play out with LMSys, where models which are “friendlier” score better, and in a more extreme version in the ChatGPT 4O misalignment where the model became excessively sycophantic in a way that resulted in better rewards in short doses, and didn’t impact any of the quantitative evaluations, despite being an overall negative to the experience.

    As we move into more agentic systems we often have fewer tools to evaluate or make visible the values we are encoding, but we are still doing it!

    For example, Google’s recent AlphaEvolve project uses Gemini underneath, which is an LLM that can be evaluated and aligned. But on top of that it uses an evolutionary search scheme (another reminder of Rich Sutton’s bitter lesson) to generate different prompts and evaluations and iterate towards a new, externally defined goal: in that case generating better algorithms and code. We are searching for superior outcomes, but that search itself is somewhat unconstrained by other values: it’s a more consequentialist approach.

    The current crop of agentic coding tools often recommends encoding preference data into a project specific file. For example, Claude Code recommends a CLAUDE.md file

    • Include frequently used commands (build, test, lint) to avoid repeated searches
    • Document code style preferences and naming conventions
    • Add important architectural patterns specific to your project
    • CLAUDE.md memories can be used for both instructions shared with your team and for your individual preferences.

    While it presents them as memory, the idea here is to guide the choices of the model in a way that aligns with the principles by which the project being modified is managed.

    OpenAI have published work on allowing hierarchies of instructions: The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions | OpenAI

    we argue that one of the primary vulnerabilities underlying these attacks is that LLMs often consider system prompts (e.g., text from an application developer) to be the same priority as text from untrusted users and third parties. To address this, we propose an instruction hierarchy that explicitly defines how models should behave when instructions of different priorities conflict. 

    As well as using a single model that can incorporate different safeguards, we can use models themselves to verify actions and outputs. Verification is generally an easier problem than generation, so a model that is unable to consistently follow a set of principles may still be able to validate whether a given example does or does not follow them.

    LlamaGuard is a good example of this kind of system, built and released by Meta’s GenAI team alongside Llama. One example of seeing this process in the wild is OpenAI’s safety systems on 4O image generation. Inherently agentic, 4O can generate image ideas, then the image itself. Despite the model having constraints on it, it will happily generate things which violate OpenAI’s content policy, necessitating a monitoring model that whisks them away before a user can access a violating image.

    If AI becomes an obligate technology, we will benefit from encoding values intentionally, balancing outcomes, universal principles, and fairness. The challenge is ensuring these choices reflect the world we want, not just the one we’re competing in.

    1. Another theory of ethics that weights mutuality heavily: it frames ethical considerations as something derived between people rather than just based on outcomes or on abstract principles. It’s featured particularly in Scanlon’s What We Owe to Each Other, for those, like me, who get all of their ethical understanding from watching The Good Place ↩︎

  • Pyrefly

    https://pyrefly.org

    I’m at PyCon (mildly awkward photo thanks to Simon Willison!) and earlier had to steal some extra chairs for the Typing Summit as it was full up! There is a lot of energy and interest around type checking, thanks to Astral’s Ty and Meta’s Pyrefly projects coming into the space recently.

    While the playground is great for trying it out, I wanted to see what it was like on a larger codebase I was familiar with. I decided to try TorchTune, which makes use of types, but doesn’t configure a typechecker explicitly for CI (as far as I know!), relying on the LSP to show squiggles as the main type hinting feedback (which is reasonable!).

    I tried running mypy over it with a very basic config, timing it with time mypy:

    [mypy]
    python_version = 3.13
    ignore_missing_imports = True
    warn_unused_ignores = True
    strict_optional = True
    files = .

    Unsurprisingly, there are quite a few errors!

    Found 1361 errors in 211 files (checked 485 source files)
    real 10m45.596s
    user 0m12.906s
    sys 0m0.929s

    I installed pyrefly and init’d it:

    pip install pyrefly
    pyrefly init

    This created a pyrefly.toml containing a very minimal config:

    project_includes = ["."]
    python_version = "3.13.0"

    pyrefly check then gave me

    INFO 2,966 errors shown, 7 errors ignored, 485 modules, 7,364 transitive dependencies, 3,522,743 lines, took 47.94s (checking 34.88s; reporting 12.98s), peak memory physical 863.6 MiB

    It’s impressively fast: 10 minutes for mypy vs under 50 seconds for pyrefly. There are also a lot more errors, and it’s tricky to tell whether they’re false positives from pyrefly, skipped errors from mypy, or something else. I scoped it down to the TorchTune KV cache module in torchtune/modules/kv_cache.py to get a better look.

    There pyrefly returns 9 errors and mypy 17, but from what I can see that’s caused by slightly different ways of reporting some of the same underlying errors. For example:

    k_out[:, :, self.cache_pos[:seq_len]] = k_val

    This code is doing a bit of Tensor slicing:

    • cache_pos is a max_seq_len long tensor holding absolute write positions
    • k_out is the key cache, with shape batch_size x num_heads x max_seq_len x head_dim
    • Here we’re getting a view for the part of the cache we want to update, and appending the latest values

    MyPy gives these errors:

    torchtune/modules/kv_cache.py:104: error: "Tensor" not callable [operator]
    torchtune/modules/kv_cache.py:104: error: Value of type "Tensor | Module" is not indexable [index]

    While pyrefly gives:

    torchtune/modules/kv_cache.py:104:9-46: Item assignment is not supported on Module | Tensor   Expected __setitem__ to be a callable, got BoundMethod...

    In this module cache_pos and k_cache are created by calling PyTorch’s register_buffer, which registers tensors to be stored in the state_dict without being treated as trainable parameters. Buffers don’t have to be Tensors (they can be None), so I am guessing the type doesn’t propagate well there. Adding explicit type declarations in the class body fixes the errors in both mypy and pyrefly.

    cache_pos: torch.Tensor
    k_cache: torch.Tensor
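
    For context, the pattern ends up looking roughly like this (a stand-in module, not TorchTune’s actual class):

    import torch
    from torch import nn

    class CacheLike(nn.Module):
        # Class-body annotations tell the type checker these attributes are Tensors,
        # even though register_buffer itself accepts Optional[Tensor]
        cache_pos: torch.Tensor
        k_cache: torch.Tensor

        def __init__(self, batch_size: int, num_heads: int, max_seq_len: int, head_dim: int) -> None:
            super().__init__()
            self.register_buffer("cache_pos", torch.arange(max_seq_len))
            self.register_buffer(
                "k_cache", torch.zeros(batch_size, num_heads, max_seq_len, head_dim)
            )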

  • Free-Threaded Python community

    As Thomas posted on the Python Discuss board, there is a Discord for discussing Free-Threaded/NoGil Python: https://discord.gg/rqgHCDqdRr

    Other than the helpful docs the Quansight folks maintain (py-free-threading), it’s been interesting to see some projects and tools pop up on there that I hadn’t seen. One is Zsolt’s python-ft-deps, which checks whether any non-pure-Python deps of your project have free-threaded wheels, and which can be run as a one-liner thanks to uv:

    uvx python-ft-deps