Category: links-and-recs

Interesting links and recommendations

  • Perplexed

    The normal loss when pre-training a language model is cross-entropy, which sounds more complicated than it is. At each step the model doesn’t just predict a single token: it predicts a probability distribution across all possible tokens. Cross-entropy loss is -log(probability of the correct token) under that distribution.

    • If p(correct) = 0.99 → CE ≈ 0.01
    • If p(correct) = 0.5 (unsure between two tokens) → CE ≈ 0.693
    • If p(correct) = 1/100_000 (e.g. guessing uniformly) → CE ≈ 11.5

    If you average the CE over a whole bunch of tokens (say in your validation set) and take e^(ave CE), you get the perplexity, or PPL.

    The number gives you an idea of how many choices the model was considering. Perplexity of 1 means the model was always 100% sure and 100% right (a feat only Elon can achieve). PPL 2 means the model was flipping a coin between two tokens most of the time. PPL 50 means the model was uncertain between 50 plausible next tokens. Because you’re already calculating the loss, PPL is very cheap to compute, so it gets used a lot.
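    Since CE and PPL are just a log and an exponential, the whole thing fits in a few lines of Python (a minimal sketch of the definitions above):

```python
import math

def cross_entropy(p_correct: float) -> float:
    """Per-token cross-entropy: -log(probability assigned to the correct token)."""
    return -math.log(p_correct)

def perplexity(token_probs: list[float]) -> float:
    """Average the per-token CE over a sequence, then take e^(ave CE)."""
    avg_ce = sum(cross_entropy(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_ce)

print(round(cross_entropy(0.99), 3))         # ≈ 0.01
print(round(cross_entropy(0.5), 3))          # ≈ 0.693
print(round(cross_entropy(1 / 100_000), 1))  # ≈ 11.5

# A model that flips a coin on every token has PPL ≈ 2:
print(round(perplexity([0.5] * 100), 2))     # 2.0
```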

    Prior to pre-training you’ll typically run a sweep of experiments with different architecture tweaks, and see which ones lower perplexity. During pre-training you’ll want to check whether the model is successfully learning, and whether you should nuke a run rather than continuing: improvements in perplexity are a good guide to both. You can also score perplexity on fresh data using a well-trained model: data with a surprisingly high perplexity might be garbage, or a counting subreddit.

    Still, you can have too much of a good thing. A new paper from Veličković et al, “Perplexity cannot always tell right from wrong”, makes the argument that, much like with humans, it’s very easy to select for confidently wrong rather than uncertainly right.

    We prove that, for a wide class of decoder-only Transformer-based language models, should the model be highly confident and correct on a sufficiently long input sequence, this must imply existence of another input where the model’s prediction is wrong, yet the log-perplexity of that prediction approaches *zero*

    The basic idea is that when the model is confident, you can construct a different sequence that the model would be equally confident on but also… wrong.

    This particularly shows up when contexts get longer, because all tokens are not equal. To give a trivial example:

    In the word "strawberry," there are 8 Rs.

    This is correct for every single token, except ‘8’. A highly confident model may have a lower perplexity for that sequence, as a whole, than a more correct but less confident one.
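    A toy numeric version of that comparison (the probabilities here are invented purely for illustration):

```python
import math

def ppl(probs: list[float]) -> float:
    """Perplexity of a sequence given the probability assigned to each token."""
    return math.exp(-sum(math.log(p) for p in probs) / len(probs))

# Probabilities two hypothetical models assign to the ten tokens of their answers.
# Model A is highly confident everywhere -- including on the wrong token ('8').
# Model B answers correctly but hedges on every token.
confidently_wrong = [0.95] * 10
uncertainly_right = [0.60] * 10

print(round(ppl(confidently_wrong), 2))  # 1.05 -- the "better" perplexity...
print(round(ppl(uncertainly_right), 2))  # 1.67 -- ...belongs to the wrong answer
```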

  • What is In-Distribution

    One of the persistent questions in model development is whether reasoning actually involves… reasoning. As in: are we seeing actual logical conclusions, or just better recall of knowledge and patterns from the training set? LLMs are trained on, roughly, the web, which makes answering that question tricky: almost everything shows up in some form. A model that appears to “reason” through a physics problem could just be pattern-matching an irritated Reddit reply it saw during training.

    On the Interplay of Pre-Training, Mid-Training, and RL on Reasoning Language Models takes a look at this question methodically.

    To this end, we build a fully controlled framework that isolates the contributions of each training stage. Our design is based on three principles: (i) fully controllable synthetic reasoning tasks with explicit atomic operations and DAG-defined dependency structure; (ii) observable, parseable reasoning processes enabling process-level evaluation and reducing reward or evaluation hacking; and (iii) systematic manipulation of pre-/mid-/post-training distributions to attribute causal effects to each stage.

    The authors break the problem of reasoning and training data down along two dimensions.

    1) Breadth-wise: can the model generalize from one type of problem to another (structurally similar) one in a different domain?
    2) Depth-wise: can the model reason correctly for longer, and hence solve harder problems?

    Rather than train on the internet, they build synthetic Math-puzzle reasoning tasks using a dependency-graph framework inspired by GSM-Infinite. By varying the depth of the reasoning chains required, and by generating structurally equivalent tasks across different domains, they try to tease apart those two aspects and investigate them separately.
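    A toy version of that setup (my own sketch, inspired by the GSM-Infinite-style generator the paper describes — not their actual code): each variable depends on an earlier one, so the answer requires a chain of exactly `depth` atomic operations, and the whole trace is parseable for process-level evaluation.

```python
import random

def make_task(depth: int, seed: int = 0) -> tuple[str, int]:
    """Generate a synthetic dependency-chain arithmetic task."""
    rng = random.Random(seed)
    values = {"x0": rng.randint(1, 9)}
    lines = [f"x0 = {values['x0']}"]
    for i in range(1, depth + 1):
        dep = f"x{rng.randrange(i)}"               # depend on an earlier variable
        op, n = rng.choice(["+", "*"]), rng.randint(1, 9)
        values[f"x{i}"] = values[dep] + n if op == "+" else values[dep] * n
        lines.append(f"x{i} = {dep} {op} {n}")
    return "\n".join(lines), values[f"x{depth}"]   # problem text, ground truth

problem, answer = make_task(depth=4)
print(problem)
print("answer:", answer)
```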

    For the breadth side the model needs to generalize, to transfer learning across domains. The paper finds that the target domain has to be “in-distribution”: the model has to have some examples in the pretraining set. They test this by using pass@128: if you give the pre-trained model 128 attempts, does it get the answer right even once? If so, you can use reinforcement learning or SFT to help the model get reliably better.

    It’s a bit like having studied Spanish at some point and forgotten albóndigas, the word for meatballs. If, for dietary preference reasons, you came to use that word regularly it would likely lodge itself in your brain more easily and you’d go from a lowish chance of getting it right to a much higher one.

    The paper is saying you must have this baseline in there to amplify with RL. Daniel Han of Unsloth describes this by saying with RL “luck is all you need”. If the model never gets the answer right, there is nothing much to reinforce (and you are stuck with paella).

    Depth, on the other hand, does seem to be something we can kinda make up in post-training. Even if a model has only been pre-trained on problems up to a certain complexity, post-training on harder problems consistently enables it to solve them. The model is able to compose more complex patterns based on the simpler ones in its training set1. To continue our tortured analogy, this is more like being reminded of several Spanish words and, over time, learning to stick them together into actual sentences.

    Practically this means your pre-training data is a bet on what the model will ever be able to reason about, and post-training refines how well and how hard it can think within those domains.

    That approach also gives a useful tool for identifying whether something is in-distribution. If you want to know whether a model can learn a new capability through post-training, check pass@128 first. If it never gets the answer right in 128 attempts, you probably have a pre-training gap, not an RL problem.
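    Concretely, pass@k is usually estimated with the unbiased estimator from the Codex paper rather than by literally running k fresh attempts:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: given n sampled attempts of which c passed,
    the probability that at least one of k random draws would pass."""
    if n - c < k:
        return 1.0                      # too few failures to fill k draws
    return 1.0 - comb(n - c, k) / comb(n, k)

# Solved 3 times in 256 attempts: pass@128 is already ~0.88, a good RL candidate.
print(round(pass_at_k(n=256, c=3, k=128), 2))
# Never solved: pass@128 is exactly 0 -- likely a pre-training gap, not an RL problem.
print(pass_at_k(n=256, c=0, k=128))  # 0.0
```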

    1. The paper also spends a while justifying curriculum training, giving the model problems just on the edge of its capabilities before introducing harder ones. Recent work from the FAIR Paris folks and others shows you can somewhat automate this by generating problems from the same model you are training! ↩︎
  • Do MoEs Think Different?

    When I was writing recently about MoEs I was focused mostly on the architectural reasons that we use them. One thing I hadn’t considered is that they might actually be better at learning as well.

    Meanwhile, Deconstructing Pre-training: Knowledge Attribution Analysis in MoE and Dense Models

    Our findings reveal that MoE architectures form a low entropy backbone of consistently reinforced neurons, which leads to an early consolidation of their importance profiles and, in turn, underpins their functional robustness. This resilience, stemming from more distributed knowledge storage, contrasts with the greater brittleness and knowledge concentration in the dense model. These phenomena collectively demonstrate that architectural sparsity is not merely a computational shortcut but also acts as a useful inductive bias that fosters stable and robust learning

    To land that somewhere between academic prose and GPT-speak1: the paper’s results suggest that MoEs learn more effectively, and store their core knowledge more robustly.

    They measure this with Log-Probability Increase (LPI), which lets you estimate how much each column in the output projection for a layer in the model contributes to the final score. It gives you a sense of how much smarter the model gets from that specific chunk of the weights2. They track this “neuron importance” measure over multiple checkpoints using the (very!) open models from AI2, OLMo-7B and OLMoE-1B-7B.
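    I haven’t reproduced the paper’s exact LPI formulation, but a cruder ablation-style proxy for the same idea (how much does this chunk of the output projection contribute to getting the right token?) looks something like this:

```python
import numpy as np

def neuron_importance(W_out: np.ndarray, h: np.ndarray, correct_id: int) -> np.ndarray:
    """Ablation-style proxy (NOT the paper's exact LPI): for each column of an
    output projection, how much does zeroing it lower log p(correct token)?"""
    def log_p(W: np.ndarray) -> float:
        logits = W @ h
        m = logits.max()
        return logits[correct_id] - (m + np.log(np.exp(logits - m).sum()))
    base = log_p(W_out)
    scores = []
    for j in range(W_out.shape[1]):
        W_abl = W_out.copy()
        W_abl[:, j] = 0.0                   # knock out one "neuron"
        scores.append(base - log_p(W_abl))  # big drop => important neuron
    return np.array(scores)

rng = np.random.default_rng(0)
W = rng.normal(size=(50, 8))   # toy vocab of 50, hidden size 8
h = rng.normal(size=8)
scores = neuron_importance(W, h, correct_id=3)
print(scores.round(2))
```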

    In the MoE the set of important weights is both more stable and stabilizes earlier in training: the model develops a core of understanding and builds on that. This might mean MoE training is genuinely more effective than dense. The dense model is regularly thrashing its core understanding as updates come in, while the MoE protects it and lets the model focus more on nuance.

    Or! It might be entirely an artifact of model differences. As the authors note, the two models are quite different: different training data sets, different lengths of training, and different depths (16 vs 32 layers), as well as, you know, being an MoE or not. Finally, the actual LPI variant they use3, Gated-LPI, bakes in the MoE routing. It’s not totally clear whether we are seeing “neurons that matter”, or mostly seeing “routing patterns that matter”.

    I do think4 this is likely showing something interesting, even with some skepticism. The “smearing” of knowledge across weights is how I described what we are trying to avoid with MoEs, and it may be useful to have a more mechanistic understanding of how that actually happens. The authors observe that the stability curve rises, drops and consolidates. Even if this is just an artifact of routing, it’s quite possible there is a critical phase in training where that routing locks in.

    If that idea is right, we might already be shaping that phase. The load-balancing tricks that made MoEs practical could be doing double duty as scaffolds for learning.

    1. Sparsity is not just a shortcut — it’s crucial to learning ↩︎
    2. For a given prompt. They actually use some fairly advanced evals for this, rather than the general basic benchmarks ↩︎
    3. And created, to make it plausible to do this work! ↩︎
    4. Do not draw any research conclusions based on this website ↩︎
  • Anyone got any Veras?

    In the heady world of AI progress, context lengths have seen somewhat more languid growth. After rapid progress up to the 100-300k token range, they’ve largely stayed there for frontier models. We now have a couple of 1m token models that appear economically viable1, with Gemini and Sonnet, but Opus 4.5 (for example) is stuck with the 200k window of its predecessor.

    The fundamental challenge with long contexts is the interaction between tokens, particularly in the prefill (prompt processing) phase where you have to do this for a whole lot of tokens at once before you can generate anything.

    For each token attention calculates:

    1) The key: when to use this token
    2) The value: what information this token contributes
    3) A query: what each token is looking for

    Each2 token’s query is compared against prior tokens’ keys to get weighted scores; the resulting weights mix those tokens’ values.

    Then, in decoding, you make this calculation repeatedly. The 500th token has a new Key, Value and Query, but the 1st token’s Key and Value are the same as before.

    It turns out you can save yourself a lot of work by just keeping the Keys and Values from the previous generation step around and loading them in for the prior tokens. Then you just have to compute them for the newly added token. This happens at every layer in the model, so it’s a significant amount of computation saved.
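    A toy single-head version of the decode loop makes the saving visible (shapes and projections are heavily simplified; real models derive k, v, q from the token embedding, per layer and per head):

```python
import numpy as np

def attend(q: np.ndarray, K: np.ndarray, V: np.ndarray) -> np.ndarray:
    """One query attends over all cached keys/values (softmax-weighted mix)."""
    scores = K @ q / np.sqrt(len(q))
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V

rng = np.random.default_rng(0)
d = 16
K_cache: list[np.ndarray] = []   # grows by one row per generated token
V_cache: list[np.ndarray] = []

for step in range(5):            # toy decode loop
    # Only the NEW token's k, v (and q) are computed each step;
    # every earlier token's Key and Value come straight from the cache.
    k, v, q = rng.normal(size=d), rng.normal(size=d), rng.normal(size=d)
    K_cache.append(k)
    V_cache.append(v)
    out = attend(q, np.array(K_cache), np.array(V_cache))

print(out.shape)  # (16,)
```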

    Of course, you have to stick that cached copy somewhere. Because it’s used in each round of generation it needs to be rapidly available, to avoid adding a bunch of latency. In practice that means it has to live in high bandwidth memory, which is a scarce resource. So the longer the context, the bigger the cache and the more HBM you need to hold it.

    Larger context windows have been unlocked in large part by more memory on the card and in a somewhat smaller part by more rapid scale-out interconnect like NVlink3.

    Meanwhile, here’s Uncle Jensen at CES, via Stratechery‘s excellent analysis of the announcements:

    this context memory, which started out fitting inside an HBM, is no longer large enough. Last year we created Grace Blackwell’s very fast memory, we call fast context memory. That’s the reason why we connected Grace directly to Hopper. That’s why we connected Grace directly to Blackwell, so that we can expand the context memory. But even that is not enough, and so the next solution, of course, is to go off onto the network, the north-south network, off to the storage of the company. But if you have a whole lot of AIs running at the same time, that network is no longer going to be fast enough. So the answer is very clearly to do it different, and so we created Bluefield-4 so that we could essentially have a very fast KV cache context memory store right in the rack.

    It’s quite possible this kind of in-rack memory will unlock significantly larger context windows. I do wonder what this will mean for actually using long-context models. Dealing with multiple-million tokens of context is still going to take a bit of time to process. For the kind of interactive use cases that have worked best with LLMs (Claude Code, Computer Use, Cowork etc.) I suspect latency will be a bit of a pain point.

    What is kind of interesting is that all the providers at this point have some form of prompt caching option. Most of the time with a KV cache you build it up as you go, but in some cases you are going to actually generate the exact same cache in multiple different sessions. A good example would be a long system prompt: you can generate the KV cache for that, stick it on slower memory4 and then load it in to HBM for a new session. This can save a bunch of compute, and is very practical for a lot of use cases.
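    The mechanics are roughly “key the stored KV state by a hash of the prompt prefix”. This sketch is entirely hypothetical (the stored state is stubbed out as a string; real providers store the actual KV tensors and may implement this very differently):

```python
import hashlib

prompt_cache: dict[str, str] = {}   # prefix hash -> stored "KV state" (stubbed)

def get_or_build_kv(system_prompt: str) -> tuple[str, bool]:
    """Return (kv_state, was_cache_hit) for a given prompt prefix."""
    key = hashlib.sha256(system_prompt.encode()).hexdigest()
    if key in prompt_cache:
        return prompt_cache[key], True      # hit: skip the expensive prefill
    kv_state = f"kv-for-{key[:8]}"          # stand-in for running the prefill
    prompt_cache[key] = kv_state
    return kv_state, False

_, hit1 = get_or_build_kv("You are a helpful assistant...")
_, hit2 = get_or_build_kv("You are a helpful assistant...")
print(hit1, hit2)  # False True
```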

    One interesting thing this might do is enable “whole codebase” type queries: the vast majority of assets (e.g. code) in a given work session won’t change, so you could cache the KV for everything, and have it in context for later use.

    I’m hopeful that as Blackwell, TPUv7 and MI450 come online we will see context lengths unstick and move up, and perhaps with Vera Rubin we will really get rid of “compacting” for some practical set of cases.

    1. So many asterisks should go here after this flagrant assertion ↩︎
    2. Technically in most cases this is between each token’s Query and the Keys of the tokens before it, thanks to causal masking ↩︎
    3. You have to do some work to distribute things of course, but if your model is multi-card anyway, then you can distribute the KV cache fairly easily. TPUs have chonky scale-out bandwidth, probably one of the reasons Google was able to offer 1M first. ↩︎
    4. For clarity, this might not actually be how its implemented at Throppy/Google/OAI, they might just keep it in HBM anyway. But it feels like you could do that? ↩︎
  • A Primer on Post-Training

    A Primer on LLM Post-Training – PyTorch

    Very excited to see this publicly available. David moved to the PyTorch team at the start of the year, having worked on Llama, and wrote up this excellent guide for post-training internally. This is a cleaned up version of the same doc, and provides a fantastic introduction to the world of post-training for modern LLMs.

    It also includes one of my favorite perverse incentive examples:

    Note: this happens with humans too! We just call these Perverse Incentives, but they are literally the same thing. The British government, concerned about the number of venomous cobras in Delhi, offered a bounty for every dead cobra. Initially, this was a successful strategy; large numbers of snakes were killed for the reward. Eventually, however, people began to breed cobras for income.

    The real kicker in that one came when the government realized what was happening and canceled the bounty. The folks who had been breeding cobras didn’t want to look after them any more, so just released them, leading to a lot more cobras than there had been before!

  • Layouts

    You could have invented CuTe hierarchical layout (but maybe not the rest of it?) : ezyang’s blog

    Ed posted the best intro to CuTe layouts I have seen, by showing how to extrapolate them from PyTorch striding1.

    Well, it turns out, this is exactly how CuTe layouts work! In CuTe, sizes/strides are hierarchical: a size is actually a tree of ints, where the hierarchy denotes internal structure of a dimension that you can address linearly (in fact, everything by default can be addressed in a 1-D linear way, even if its an N-D object.)

    Relatedly, Simon Veitner put together a nicely visual explanation of layouts: https://veitner.bearblog.dev/intuition-behind-hierarchical-layouts/ – the graphics are helpful once you have the baseline intuition from Ed’s post!
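    The rule from the quote above really is just “multiply each coordinate by its stride and sum”, applied recursively. A few lines of Python make it concrete (my own illustration, not CuTe itself):

```python
def layout_offset(coord, stride):
    """Map a (possibly nested) coordinate to a linear offset:
    sum(coord * stride), applied recursively through the hierarchy."""
    if isinstance(coord, int):
        return coord * stride
    return sum(layout_offset(c, s) for c, s in zip(coord, stride))

# A plain 4x3 row-major layout: strides (3, 1).
print(layout_offset((2, 1), (3, 1)))             # row 2, col 1 -> 2*3 + 1*1 = 7

# A hierarchical mode: the "row" dimension is itself a 2x2 structure.
print(layout_offset(((1, 1), 2), ((1, 6), 2)))   # 1*1 + 1*6 + 2*2 = 11
```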

    1. If you’re not familiar with striding, Ed’s PyTorch Internals talk/post remains the best intro! ↩︎
  • The TPU book, on GPUs

    How to Think About GPUs | How To Scale Your Model

    The Jax “How To Scale Your Model” book is one of my favorite references for folks trying to get their head round pretraining1. It breaks down the performance characteristics of model training (often using Llama 3 as an example) in an incredibly clear way. The only slight limitation is that it is primarily focused on scaling LLMs on TPUs: interesting, but probably not your main platform target (unless you work at Deepmind). They just released a new chapter covering GPUs, and it’s also a great summary2.

    There are also plenty of mildly snarky comments about design choices to leaven the reading too:

    Takeaway: in theory, NVIDIA SHARP (available on most NVIDIA switches) should reduce the cost of an AllReduce on B bytes from about 2 * B / W to B / W. However, in practice we only see a roughly 30% improvement in bandwidth. Since pure AllReduces are fairly rare in LLMs, this is not especially useful.

    1. Though they include a chapter on inference too! ↩︎
    2. Though if you haven’t read the rest of the book it moves pretty fast – definitely best to read through the whole thing and treat this as the appendix it is intended to be! ↩︎
  • Extending Arcee’s FM context length

    Extending AFM-4.5B to 64k Context Length

    Via Nathan Lambert, an extremely fun write-up of the journey to a 64k context length for Arcee’s 4.5B foundation model. There are a lot of good takeaways, but this one particularly resonated with me:

    Experimentation is Key: As in everything I write, I am unable to stress enough the importance of trying dumb things. If you try enough dumb things, eventually one of them will turn into a smart thing. Embrace the chaos.

  • RL in the second half

    The Second Half – Shunyu Yao – 姚顺雨

    Extremely interesting post by Shunyu Yao of ReAct and Tree of Thought fame about where we got to with AI and where we are going. Read it for the spot-on description of the weirdness of reasoning as an RL concept, but my main takeaway was the refinement to the idea that the most important thing is having models that “do the benchmarks”.

     To recap the game of the first half:

    • We develop novel training methods or models that hillclimb benchmarks.
    • We create harder benchmarks and continue the loop.

    This game is being ruined because:

    Even if we create harder benchmarks, pretty soon (and increasingly soon) they get solved by the recipe. […]

    The recipe has essentially standardized and industried benchmark hillclimbing without requiring much more new ideas. As the recipe scales and generalizes well, your novel method for a particular task might improve it by 5%, while the next o-series model improve it by 30% without explicitly targeting it.

    The post makes the point that the gap is benchmarks that more closely map to real-world problems.

    when the intelligence is low, improving intelligence generally improves utility. But now, the general recipe is guaranteed to work under these assumptions. So the way to play the new game of the second half is

    • We develop novel evaluation setups or tasks for real-world utility.
    • We solve them with the recipe or augment the recipe with novel components. Continue the loop.

    Shunyu works on computer use at OpenAI, so this is well within his wheelhouse, and I think it’s a compelling claim. Many folks1 have talked about the capability overhang of LLMs: there is a large baseline ability to do things in the models, but eliciting that ability can be challenging. I tend to think of that similarly to how there are many words which we can comfortably understand, but are very unlikely to use ourselves in conversation2. RL is our most powerful tool for eliciting capabilities, but it’s a somewhat blunt instrument. Having the right evals, eval environment and tasks helps train the agent to interact in a way which generalizes.

    I wonder if, as we progress through this second phase, we will find signs in this elicitation of a similar “universal geometry” to the one suggested for pretraining: perhaps there is eventually a universal navigation3 towards where to generate in that space for different tasks. Maybe that’s what we’ll call AGI!

    1. Jack Clark particularly ↩︎
    2. Receptive vocabulary vs. productive vocabulary. ↩︎
    3. A universal geometry of a vector field? ↩︎
  • Quack CuteDSL Kernels

    Dao-AILab/quack: A Quirky Assortment of CuTe Kernels

    Tri Dao & co have a fun repo up called Quack: A Quirky Assortment of CuTe Kernels, all leveraging the CuTe-DSL. These are Hopper- and Blackwell-oriented kernels for a variety of common needs like softmax, layernorm and RMSNorm.

    On top of that, they wrote a post on how to get speed of light (memory bound) kernels in CuTe-DSL. It goes through how to implement a reduction op across multiple tiers of memory using TensorSSA for thread level reductions, warp reduction with shuffle_sync_bfly and block reduction with shared memory. Even if you’re not writing CuTe, this is about as good an introduction to architecting memory bound ops as I have seen!

    They also cover clustered reduction, leveraging multiple SMs:

    In cluster reduction, we first send the current warp’s reduced value to all the peer thread block’s reduction buffer in peer’s SMEM. Such sending is conducted via a dedicated SM-to-SM fabric (as DSMEM). Then each warp fetches all warp’s values from their local reduction buffer, and reduces these values.

    This does seem to help the kernels scale well to larger sizes:

    We believe our outstanding performance at >= 65k input is due to our successful utilization of cluster reduction in H100. When the size of inputs are ultra long and depleting the SM’s registers and shared memory, without cluster reduction, we would have to switch to an online algorithm (like online softmax) otherwise we may get a massive register spilling that leads to significant throughput degradation.

    I also really appreciate this note of reality in their conclusion:

    Hitting “speed-of-light” model memory throughput confirms that a carefully hand-crafted CuTe kernel can squeeze every byte across all memory hierarchies in the hardware. But that efficiency comes at the price of per-operator and even per input-shape tuning, which imposes a natural tradeoff between efficiency and development efforts

  • ARPA and predicting the future

    Statecraft recently re-ran an interview from 2023 with Jason Matheny, formerly of IARPA: https://www.statecraft.pub/p/how-to-predict-the-future-278

    While defense policy and research is a ways outside the scope for myself (or I imagine most folks reading), the problems of managing or working on uncertain, research-y projects in a volatile environment are pretty relatable:

    Most of what we know from cognitive psychology and human judgment research over the last 50 years suggests that unstructured group deliberation might be one of the worst ways of making judgments, yet it’s the norm in most institutions.

    Or this bit of career wisdom:

    In general, people underestimate their own potential to make contributions to the most important problems. They overestimate how many people are already working on the most important problems. So many incredibly important problems are just really neglected. If you can’t figure out who’s working on something after a few days of homework, then it is probably a neglected problem. And it’s probably up to us to solve it.

    Jason talks about looking for projects in the goldilocks zones of probability (less than 50%, more than 5%) that open up interesting opportunities. I worked with a manager who was a strong advocate of the Heilmeier Catechism to evaluate projects, and have seen the value of using it as guidance when presenting and evaluating ideas:

    1. What are you trying to do? Articulate your objectives using absolutely no jargon.
    2. How is it done today, and what are the limits of current practice?
    3. What is new in your approach and why do you think it will be successful?
    4. Who cares? If you are successful, what difference will it make?
    5. What are the risks?
    6. How much will it cost?
    7. How long will it take?
    8. What are the mid-term and final “exams” to check for success? 

    Jason adds some interesting updates:

    For instance, the Heilmeier questions don’t have a question about counterfactual impact: “Would this work get done otherwise?” The office tends not to rigorously assess the other funding streams going to solve this particular problem, and their likelihoods of success.

    We also tend not to think much about strategic move and countermove. […]. It probably is prudent to assign at least a 10% probability to some exquisite, classified technology being stolen.

    One thing I found myself talking about this week with a couple of folks was how good people get “lucky” a lot. I think these kinds of questions help navigate towards those more positive-surprise-filled spaces.

  • Harnessing the Universal Geometry of Embeddings

    Harnessing the Universal Geometry of Embeddings

    What content should you include in an LLM prompt? Many interesting use cases (enterprise tools, coding assistants) have more content than they can handle at once, so you chunk it up, turn each chunk into a vector with some sentence‑encoder, and store those vectors in a database. Later you vector‑search, pull back the relevant chunks and feed them to the LLM — better known as the RAG pattern.
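    The whole pattern fits in a few lines. Here the encoder is a deliberately silly stand-in (a bag-of-characters vector) just to show the shape of the retrieval step; a real system would call a sentence encoder:

```python
import math

def embed(text: str) -> list[float]:
    """Stand-in for a real sentence encoder: a toy bag-of-characters vector."""
    v = [0.0] * 26
    for ch in text.lower():
        if ch.isalpha():
            v[ord(ch) - ord("a")] += 1.0
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def top_chunk(query: str, chunks: list[str]) -> str:
    """The retrieval step of RAG: return the chunk most similar to the query."""
    qv = embed(query)
    return max(chunks, key=lambda c: sum(a * b for a, b in zip(qv, embed(c))))

chunks = [
    "the deploy pipeline uses blue-green rollouts",
    "lunch orders go in the #food channel",
    "quarterly revenue grew eight percent",
]
print(top_chunk("where do I order lunch?", chunks))
# -> lunch orders go in the #food channel
```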

    The working assumption has been: those vectors are fairly safe. A 768‑dimensional point doesn’t look like the text “Ian ordered a burger at 12:07”, so storing raw embeddings seems privacy‑preserving.

    But are they? In Cornell’s Harnessing the Universal Geometry of Embeddings paper the authors train vec2vec, a small GAN that learns to translate embeddings from encoder A’s space into encoder B’s space, without seeing the original sentences. Once you’re in encoder B‑land you can recover up to 80% of the underlying text:

    Inversion, i.e., reconstruction of text inputs, is more ambitious than attribute inference. vec2vec translations retain enough semantic information that off-the-shelf, zero-shot inversion methods […] extract information for as many as 80% of documents given only their translated embeddings, for some model pairs (Figure 5). These inversions are imperfect and we leave development of specialized inverters for translated embeddings to future work. Nevertheless, as exemplified in Figure 6, they still extract information such as individual and company names, dates, promotions, financial information, outages, and even lunch orders. In Appendix E, we show the prompt we use to measure extraction.

    The paper suggests that most sentence encoders trained with similar objectives on sufficiently diverse data come up with embeddings which resemble each other. Concepts, topics (and lunch) live on a shared manifold1; the models just might position them differently in embedding space. Vec2vec is a learned coordinate transform.
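    Vec2vec itself is a GAN trained without paired sentences, but the “learned coordinate transform” idea has a much simpler paired-data analogue: if two spaces really do share a geometry up to rotation, orthogonal Procrustes recovers the transform directly (a toy demonstration with synthetic data, not the paper’s method):

```python
import numpy as np

rng = np.random.default_rng(0)

A = rng.normal(size=(200, 16))              # "encoder A" embeddings
R_true, _ = np.linalg.qr(rng.normal(size=(16, 16)))
B = A @ R_true                              # "encoder B": same geometry, rotated

# Orthogonal Procrustes: the rotation minimizing ||A @ R - B|| is U @ Vt,
# where U, Vt come from the SVD of A.T @ B.
U, _, Vt = np.linalg.svd(A.T @ B)
R_hat = U @ Vt

err = np.abs(A @ R_hat - B).max()
print(err < 1e-8)  # True: the fitted transform maps A's space onto B's
```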

    What this might be implying is that if you train a model with similar objectives on data sampled from a similar generating function, you will arrive at a manifold in latent space that is geometrically similar to anyone else doing the same thing. If that is true, operations in latent space start to look less model-specific, and approaches that navigate them (like JEPA, LDM editing) could learn to operate across different models with just an adapter layer.

    To be clear, the paper is not saying this: the authors only align English, contrastive‑loss, transformer sentence encoders. No decoder models, hardly any dimensionality mismatch. The phrase “universal geometry2” may be a stretch: their GAN training also requires quite a bit of run cherry-picking3, and when they tried cross-modality the results weren’t as strong. But it’s a very interesting idea worth further investigation.

    In the short term, this is probably mildly alarming for coding agent customers that are worried about their source code leaking, but in the long term I hope we can see some more investigation into how true this is in more general modeling, and what kind of opportunities that might open up!

    1. Shape in the embedding space. In practical experience when you have a large embedding space its mostly empty, and all the actual data lives on a plane in the space. This is why things like latent diffusion models work: they learn to navigate towards that plane from any random noisy point in the space. ↩︎
    2. But it’s a great title. ↩︎
    3. My understanding is unstable training is a very common problem for GANs. ↩︎
  • AbsenceBench

    https://arxiv.org/abs/2506.11440

    Simon Willison has a good summary:

    Long context models have been getting increasingly good at passing “Needle in a Haystack” tests recently, but what about a problem in the opposite direction?

    The answers are surprisingly domain-specific; some models do great on numeric sequences but most are pretty bad at code!

    The authors posit that attention is just a worse mechanism for seeing what’s missing vs what’s there. For me this rhymes with the experience of folks doing agentic coding assistant work: it’s beneficial to clear the context window more often than you think, as the models strongly prefer to use what is already in there.

    This feels like a learned or tuned behavior, a flavor of the model does the eval. Models will probably get better at this problem, as now it’s legible, but is there a tradeoff that has to be made?

    Pretraining is somewhat saturating, but we have oodles of post-training (which includes context extension), the whole meta-RL process of researchers trying different data mixes and algorithm/architecture tweaks, and inference time search options.

    If OpenAI had Anthropic’s data and evals would they have as good an agentic coding model? And vice versa would Opus be as good at deep research as O3? I honestly don’t know: in the end compute will always be finite and we have to allocate it with some end in mind. It feels very plausible there is no globally optimal scaling law for how you prioritize different model capabilities. But the models will probably do this eval.

  • A patchwork quilt view of AI Alignment

    https://arxiv.org/abs/2505.05197

    Very interesting paper from folks at DeepMind arguing that convergence on a single, coherent value set doesn’t reflect how society actually works, and is not the only way to think about AI morality and alignment.

    Think of society as a patchwork quilt composed of diverse communities, each with its own internal norms and expectations. Within these communities—e.g. trade unions enforcing strike discipline or religious groups defining attendance practices—members understand the specific rules and the social consequences for deviating (Bicchieri, 2005; Ostrom, 1990; Ullmann-Margalit, 1977). These internal dynamics shape behavior within each distinct patch of the quilt, fostering cohesion through shared, localized understandings of appropriate conduct

    […]

    A key insight we can draw then is that what holds humans together despite profound disagreements is not value convergence, but practical mechanisms for coexistence—which we see as social technologies

    There is an idea that sometimes comes up that disagreements between good, reasonable people can be traced to misunderstandings or disagreements about the likelihood of different outcomes; if you can align on those, you'll come to the same conclusions. This encourages some focus in AI alignment on finding the right, true principles and creating the best truth-seeking model possible, then assuming downstream that this will result in strong alignment. The paper challenges this assumption.

    They also call out collective action problems in implementing such a framework, particularly start up and free rider problems:

    Even seemingly universal goods like “survival” are embedded in thick cultural contexts that shape their meaning and priority (in fact many cultures prioritize sacred values above survival e.g. Ginges et al. (2007)). In general, mobilizing global action and resources towards any specific AI safety strategy will inevitably confront deep-seated disagreements rooted in different values, priorities, and worldviews regarding the nature of AI risks, the efficacy or fairness of proposed initial strategies, and the equitable distribution of upfront costs and responsibilities.

    Their approach calls for focusing on 4 areas:

    1. Contextual grounding: broader understanding, not just of the conversation but of the environmental context the model is operating in.
    2. Community customization: different norms for different communities.
    3. Continual adaptation: updating the understanding of appropriate behavior based on ongoing feedback. They suggest continuous training for this.
    4. Polycentric governance: distributed decision making with multiple overlapping centers of authority.

    If you read this list in a general "helpful agent" context instead of alignment, I don't think it would be controversial: these seem like good ideas!

    That said, I think a lot of this boils down to the last one. Getting governance structures right is hard, in any context, and I interpret a key part of the aspiration here as having “checks and balances” that can represent varied interests. Not an easy problem to solve!

    Some might worry that our patchwork approach embraces a troubling relativism, but this misunderstands the quilt we're describing. Just as a quilt's structural integrity depends on solid stitching techniques regardless of pattern diversity, our appropriateness framework recognizes that while the specific content of norms varies across contexts, the social technologies that facilitate coexistence—the mechanisms of learning, adaptation, and conflict resolution—can be studied with rigor and implemented with care.

  • Linear Layouts in Triton

    [2505.23819] Linear Layouts: Robust Code Generation of Efficient Tensor Computation

    Paper from the Triton folks at OpenAI on their solution to the layouts/data-movement problem. Data often needs to be laid out in a specific way to maximize performance on a GPU: certain instructions require particular register arrangements, and shared-memory accesses need to avoid bank conflicts. You might have data stored nicely in global memory, need to permute it to load, then permute it again for execution.

    Part of the appeal of CuTe is expressing these layouts and allowing a relatively simple algebra to transform it between these domains. This works, but the Triton approach is to try and hide this type of complexity, particularly hardware specific complexity, in the compiler.

    While both CUTE and linear layouts aim to address the challenge of flexible task mapping on emerging architectures, they differ in several key aspects. First and foremost, CUTE is primarily designed for users to manually describe layouts, whereas linear layouts are integrated into a compiler. Second, the linear algebra framework of linear layouts enables compilers to generate efficient code for layout conversion and code lowering for many common operators, which is absent in CUTE. Third, swizzling is inherently defined within linear layouts, whereas in CUTE, it is treated as a separate step

    The clever insight is that you can represent any of the layouts as a binary matrix over F₂, which means you can use XOR/AND for arithmetic. You can compose those binary matrices freely, and it’s also easy to replace the transform matrix with a new one for hardware that requires a different permutation.
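
    To make that concrete, here is a minimal sketch of my own (an illustration of the F₂ idea, not Triton's actual implementation): a layout is a list of row bitmasks, applying it is an AND plus a parity check per output bit, and composing two layouts is just a matrix multiply over GF(2).

    ```python
    def apply_layout(rows, x):
        """Apply an F2-linear layout to index x: rows[i] is a bitmask of
        input bits; output bit i is the parity of (rows[i] AND x),
        i.e. a dot product mod 2."""
        y = 0
        for i, row in enumerate(rows):
            y |= (bin(row & x).count("1") & 1) << i
        return y

    def compose(a, b, n_bits):
        """Matrix of x -> a(b(x)) over F2, built column by column by
        pushing each basis vector e_j = 1 << j through both layouts."""
        cols = [apply_layout(a, apply_layout(b, 1 << j)) for j in range(n_bits)]
        out = [0] * len(a)
        for j, col in enumerate(cols):
            for i in range(len(a)):
                if (col >> i) & 1:
                    out[i] |= 1 << j
        return out

    # A toy 3-bit layout that swaps the low two index bits:
    swap01 = [0b010, 0b001, 0b100]
    assert apply_layout(swap01, 0b001) == 0b010
    # Composing the swap with itself recovers the identity layout.
    assert compose(swap01, swap01, 3) == [0b001, 0b010, 0b100]
    ```

    Because everything is a binary matrix, swapping in a different hardware permutation is just swapping the matrix.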

    To give a step-by-step example (as I'm not totally sure how well I grok this myself!) let's say we are working on an MMA for a 16×8 tile:

    • We start with our data, say in row major order (0,0), (0,1), …, (0,7), (1,0). Each value is stored in its own register
    • We have 32 threads, each managing their own section of the block: in this case 4 registers
    • So we have a hardware location for each value: the thread (0..31) and the register (0..3). You can imagine this as 7 bits of data, thread ID (5 bits), and register ID (2 bits)
    • Equivalently, we can imagine tracking the tensor location for each value: 4 bits for rows 0..15, 3 bits for columns 0..7
    • We can have a map that translates between tensor location and hardware location: e.g. tensor location row 1, col 0 lives in thread 2, register 0. The full map would be a 7 by 7 binary matrix
    • We can define a matrix that transforms the hardware map to the one needed for our ldmatrix tensorcore call.
    • For example, we might need thread 0 to manage tensor values (0,0), (4,0), (8,0), (12,0)
    • If the mapping requires moving a value to a different register in the same thread we can use a prmt (permute) instruction
    • If the mapping requires moving values between threads' registers, we can use a warp shuffle like shfl.sync, which swaps registers between threads without going through shared memory1
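
    The walkthrough above can be sketched in a few lines. The bit assignments here are my own illustration of the hypothetical example, not the exact ldmatrix layout: row bits come from the register ID and the low thread bits, column bits from the high thread bits, so every output bit is a single input bit and the whole map is F₂-linear.

    ```python
    def hw_to_tensor(thread, reg):
        """Map a hardware location (thread 0..31, register 0..3) to a
        (row, col) in the 16x8 tile. Each output bit copies one input
        bit, so this is one example of a 7x7 binary matrix."""
        row = ((reg & 0b11) << 2) | (thread & 0b11)  # row bits 3-2 from reg, 1-0 from thread
        col = (thread >> 2) & 0b111                  # col bits from thread bits 4-2
        return row, col

    # Thread 0's four registers cover (0,0), (4,0), (8,0), (12,0),
    # matching the fragment described above.
    assert [hw_to_tensor(0, r) for r in range(4)] == [(0, 0), (4, 0), (8, 0), (12, 0)]

    # Every (thread, reg) pair hits a distinct tile element, i.e. the
    # underlying 7x7 binary matrix is invertible over F2.
    locations = {hw_to_tensor(t, r) for t in range(32) for r in range(4)}
    assert len(locations) == 16 * 8
    ```

    Converting between two such maps is then a matter of comparing which bits moved: bits that stay within the register ID suggest a prmt, bits that cross the thread ID suggest a shuffle.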

    Triton has layouts for standard block level storage, and for MMAs and other operations. By multiplying through the required mappings it can automatically work out how best to optimize movement, versus the manual transforms you do in CuTe!

    It also has versions of these mappings for different hardware, so for many operations only the layouts need to be swapped out when moving from Ampere to Hopper or Blackwell!

    1. Mostly: if there would be bank conflicts, it will spill to shared memory. ↩︎