Category: links-and-recs

Interesting links and recommendations

  • A Primer on Post-Training

    A Primer on LLM Post-Training – PyTorch

    Very excited to see this publicly available. David moved to the PyTorch team at the start of the year, having worked on Llama, and wrote up this excellent guide for post-training internally. This is a cleaned up version of the same doc, and provides a fantastic introduction to the world of post-training for modern LLMs.

    It also includes one of my favorite perverse incentive examples:

    Note: this happens with humans too! We just call these Perverse Incentives, but they are literally the same thing. The British government, concerned about the number of venomous cobras in Delhi, offered a bounty for every dead cobra. Initially, this was a successful strategy; large numbers of snakes were killed for the reward. Eventually, however, people began to breed cobras for income.

    The real kicker in that one came when the government realized what was happening and canceled the bounty. The folks who had been breeding cobras didn’t want to look after them any more, so just released them, leading to a lot more cobras than there had been before!

  • Layouts

    You could have invented CuTe hierarchical layout (but maybe not the rest of it?) : ezyang’s blog

    Ed posted the best intro to CuTe layouts I have seen, by showing how to extrapolate them from PyTorch striding1.

    Well, it turns out, this is exactly how CuTe layouts work! In CuTe, sizes/strides are hierarchical: a size is actually a tree of ints, where the hierarchy denotes internal structure of a dimension that you can address linearly (in fact, everything by default can be addressed in a 1-D linear way, even if its an N-D object.)
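
    To make the striding half concrete, here is a minimal PyTorch sketch (plain strides only, no CuTe) of the idea that a hierarchical layout is just nested (size, stride) pairs over the same flat buffer:

    import torch

    # An element's position in the flat buffer is the dot product of its
    # index with the tensor's strides.
    x = torch.arange(24).reshape(4, 6)
    print(x.size(), x.stride())      # torch.Size([4, 6]) (6, 1)
    i, j = 2, 5
    assert x[i, j] == x.flatten()[i * x.stride(0) + j * x.stride(1)]

    # A "hierarchical" dimension is the same machinery with nested structure:
    # splitting the 6-wide dimension into (3, 2) sub-blocks just adds another
    # (size, stride) pair over the same buffer.
    y = x.view(4, 3, 2)
    print(y.size(), y.stride())      # torch.Size([4, 3, 2]) (6, 2, 1)
    assert y[2, 2, 1] == x[2, 5]     # same element, addressed hierarchically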

    Relatedly, Simon Veitner put together a nicely visual explanation of layouts: https://veitner.bearblog.dev/intuition-behind-hierarchical-layouts/ – the graphics are helpful once you have the baseline intuition from Ed's post!

    1. If you’re not familiar with striding, Ed’s PyTorch Internals talk/post remains the best intro! ↩︎
  • The TPU book, on GPUs

    How to Think About GPUs | How To Scale Your Model

    The Jax "How To Scale Your Model" book is one of my favorite references for folks trying to get their heads around pretraining1. It breaks down the performance characteristics of model training (often using Llama 3 as an example) in an incredibly clear way. The only slight limitation is that it is primarily focused on scaling LLMs on TPUs: interesting, but probably not your main platform target (unless you work at DeepMind). They just released a new chapter covering GPUs, and it's also a great summary2.

    There are also plenty of mildly snarky comments about design choices to leaven the reading:

    Takeaway: in theory, NVIDIA SHARP (available on most NVIDIA switches) should reduce the cost of an AllReduce on B bytes from about 2 * B / W to B / W. However, in practice we only see a roughly 30% improvement in bandwidth. Since pure AllReduces are fairly rare in LLMs, this is not especially useful.

    1. Though they include a chapter on inference too! ↩︎
    2. Though if you haven’t read the rest of the book it moves pretty fast – definitely best to read through the whole thing and treat this as the appendix it is intended to be! ↩︎
  • Extending Arcee’s FM context length

    Extending AFM-4.5B to 64k Context Length

    Via Nathan Lambert, an extremely fun write-up of the journey to a 64k context length for Arcee's 4.5B foundation model. There are a lot of good takeaways, but this one particularly resonated with me:

    Experimentation is Key: As in everything I write, I am unable to stress enough the importance of trying dumb things. If you try enough dumb things, eventually one of them will turn into a smart thing. Embrace the chaos.

  • RL in the second half

    The Second Half – Shunyu Yao – 姚顺雨

    Extremely interesting post by Shunyu Yao of ReAct and Tree of Thought fame about where we got to with AI and where we are going. Read it for the spot-on description of the weirdness of reasoning as an RL concept, but my main takeaway was the refinement to the idea that the most important thing is having models that "do the benchmarks".

     To recap the game of the first half:

    • We develop novel training methods or models that hillclimb benchmarks.
    • We create harder benchmarks and continue the loop.

    This game is being ruined because:

    Even if we create harder benchmarks, pretty soon (and increasingly soon) they get solved by the recipe. […]

    The recipe has essentially standardized and industrialized benchmark hillclimbing without requiring much more new ideas. As the recipe scales and generalizes well, your novel method for a particular task might improve it by 5%, while the next o-series model improve it by 30% without explicitly targeting it.

    The post makes the point that the gap is in benchmarks that more closely map to real-world problems.

    when the intelligence is low, improving intelligence generally improves utility. But now, the general recipe is guaranteed to work under these assumptions. So the way to play the new game of the second half is

    • We develop novel evaluation setups or tasks for real-world utility.
    • We solve them with the recipe or augment the recipe with novel components. Continue the loop.

    Shunyu works on computer use at OpenAI, so this is well within his wheelhouse, and I think it's a compelling claim. Many folks1 have talked about the capability overhang of LLMs: there is a large baseline ability to do things in the models, but eliciting that ability can be challenging. I tend to think of that similarly to how there are many words which we can comfortably understand, but are very unlikely to use ourselves in conversation2. RL is our most powerful tool for eliciting capabilities, but it's a somewhat blunt instrument. Having the right evals, eval environment and tasks helps train the agent to interact in a way which generalizes.

    I wonder if, as we progress through this second phase, we will find signs in this elicitation of a similar "universal geometry" to the one that has been suggested for pretraining: perhaps there is eventually a universal navigation3 towards where to generate in that space for different tasks. Maybe that's what we'll call AGI!

    1. Jack Clark particularly ↩︎
    2. Receptive vocabulary vs. productive vocabulary. ↩︎
    3. A universal geometry of a vector field? ↩︎
  • Quack CuTe-DSL Kernels

    Dao-AILab/quack: A Quirky Assortment of CuTe Kernels

    Tri Dao & co have a fun repo up called Quack: A Quirky Assortment of CuTe Kernels, all leveraging the CuTe-DSL. These are Hopper- and Blackwell-oriented kernels for a variety of common needs like softmax, LayerNorm and RMSNorm.

    On top of that, they wrote a post on how to get speed-of-light (memory-bound) kernels in CuTe-DSL. It goes through how to implement a reduction op across multiple tiers of memory, using TensorSSA for thread-level reductions, warp reduction with shuffle_sync_bfly, and block reduction with shared memory. Even if you're not writing CuTe, this is about as good an introduction to architecting memory-bound ops as I have seen!
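
    The warp-level step is the neat bit if you haven't seen it before: lanes repeatedly exchange values with the lane whose index differs in one bit, so after log2(32) steps every lane holds the full reduction. Here is a host-side numpy simulation of that butterfly pattern (an illustration of what shuffle_sync_bfly does on the hardware, not CuTe-DSL code):

    import numpy as np

    lanes = np.arange(32)
    vals = np.random.default_rng(0).standard_normal(32).astype(np.float32)

    # Butterfly (XOR) reduction: at each step, lane i adds the value held by
    # lane i ^ offset, halving the offset until every lane has the full sum.
    acc = vals.copy()
    offset = 16
    while offset:
        acc = acc + acc[lanes ^ offset]
        offset //= 2

    assert np.allclose(acc, vals.sum(), atol=1e-4)   # all 32 lanes agree on the sum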

    They also cover clustered reduction, leveraging multiple SMs:

    In cluster reduction, we first send the current warp’s reduced value to all the peer thread block’s reduction buffer in peer’s SMEM. Such sending is conducted via a dedicated SM-to-SM fabric (as DSMEM). Then each warp fetches all warp’s values from their local reduction buffer, and reduces these values.

    This does seem to help the kernels scale well to larger sizes:

    We believe our outstanding performance at >= 65k input is due to our successful utilization of cluster reduction in H100. When the size of inputs are ultra long and depleting the SM’s registers and shared memory, without cluster reduction, we would have to switch to an online algorithm (like online softmax) otherwise we may get a massive register spilling that leads to significant throughput degradation.

    I also really appreciate this note of reality in their conclusion:

    Hitting “speed-of-light” model memory throughput confirms that a carefully hand-crafted CuTe kernel can squeeze every byte across all memory hierarchies in the hardware. But that efficiency comes at the price of per-operator and even per input-shape tuning, which imposes a natural tradeoff between efficiency and development efforts

  • ARPA and predicting the future

    Statecraft recently re-ran an interview from 2023 with Jason Matheny, formerly of IARPA: https://www.statecraft.pub/p/how-to-predict-the-future-278

    While defense policy and research is a ways outside my scope (or, I imagine, that of most folks reading), the problems of managing or working on uncertain, research-y projects in a volatile environment are pretty relatable:

    Most of what we know from cognitive psychology and human judgment research over the last 50 years suggests that unstructured group deliberation might be one of the worst ways of making judgments, yet it’s the norm in most institutions.

    Or this bit of career wisdom:

    In general, people underestimate their own potential to make contributions to the most important problems. They overestimate how many people are already working on the most important problems. So many incredibly important problems are just really neglected. If you can’t figure out who’s working on something after a few days of homework, then it is probably a neglected problem. And it’s probably up to us to solve it.

    Jason talks about looking for projects in the Goldilocks zone of probability (less than 50%, more than 5%) that open up interesting opportunities. I worked with a manager who was a strong advocate of the Heilmeier Catechism to evaluate projects, and I have seen the value of using it as guidance when presenting and evaluating ideas:

    1. What are you trying to do? Articulate your objectives using absolutely no jargon.
    2. How is it done today, and what are the limits of current practice?
    3. What is new in your approach and why do you think it will be successful?
    4. Who cares? If you are successful, what difference will it make?
    5. What are the risks?
    6. How much will it cost?
    7. How long will it take?
    8. What are the mid-term and final “exams” to check for success? 

    Jason adds some interesting updates:

    For instance, the Heilmeier questions don’t have a question about counterfactual impact: “Would this work get done otherwise?” The office tends not to rigorously assess the other funding streams going to solve this particular problem, and their likelihoods of success.

    We also tend not to think much about strategic move and countermove. […]. It probably is prudent to assign at least a 10% probability to some exquisite, classified technology being stolen.

    One thing I found myself talking about this week with a couple of folks was how good people get “lucky” a lot. I think these kinds of questions help navigate towards those more positive-surprise-filled spaces.

  • Harnessing the Universal Geometry of Embeddings

    Harnessing the Universal Geometry of Embeddings

    What content should you include in an LLM prompt? Many interesting use cases (enterprise tools, coding assistants) have more content than the model can handle at once, so you chunk it up, turn each chunk into a vector with some sentence encoder, and store those vectors in a database. Later you vector-search, pull back the relevant chunks and feed them to the LLM: better known as the RAG pattern.
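
    As a deliberately tiny sketch of that retrieval half, with a hashed bag-of-words function standing in for a real sentence encoder (the chunks are just illustrative strings):

    import numpy as np

    # `encode` stands in for a real sentence encoder; hashed bag-of-words is
    # enough to make the mechanics visible.
    def encode(text, dim=768):
        v = np.zeros(dim)
        for word in text.lower().split():
            v[hash(word) % dim] += 1.0
        return v / (np.linalg.norm(v) + 1e-9)

    chunks = [
        "Ian ordered a burger at 12:07",
        "the outage started during the deploy",
        "Q3 promotions and financial results",
    ]
    index = np.stack([encode(c) for c in chunks])   # the vectors you store in the DB

    query = encode("who ordered a burger and when")
    scores = index @ query                          # cosine similarity search
    print(chunks[int(np.argmax(scores))])           # the chunk you would hand to the LLM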

    The working assumption has been: those vectors are fairly safe. A 768-dimensional point doesn't look like the text "Ian ordered a burger at 12:07", so storing raw embeddings seems privacy-preserving.

    But are they? In Cornell's Harnessing the Universal Geometry of Embeddings paper the authors train vec2vec, a small GAN that learns to translate embeddings from encoder A's space into encoder B's space, without seeing the original sentences. Once you're in encoder B-land you can recover up to 80% of the underlying text:

    Inversion, i.e., reconstruction of text inputs, is more ambitious than attribute inference. vec2vec translations retain enough semantic information that off-the-shelf, zero-shot inversion methods […] extract information for as many as 80% of documents given only their translated embeddings, for some model pairs (Figure 5). These inversions are imperfect and we leave development of specialized inverters for translated embeddings to future work. Nevertheless, as exemplified in Figure 6, they still extract information such as individual and company names, dates, promotions, financial information, outages, and even lunch orders. In Appendix E, we show the prompt we use to measure extraction.

    The paper suggests that most sentence encoders trained with similar objectives on sufficiently diverse data come up with embeddings which resemble each other. Concepts, topics (and lunch) live on a shared manifold1; the models just might position them differently in embedding space. Vec2vec is a learned coordinate transform.
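
    To make the "coordinate transform" framing concrete, here is a toy numpy version in which both encoders really are linear views of a shared latent space. The paper learns the alignment without paired data (via a GAN); the paired least-squares fit below is the simplest possible learned translation, not the authors' method:

    import numpy as np

    rng = np.random.default_rng(0)

    # Two "encoders" that are different linear views of the same latent manifold.
    d_latent, d_a, d_b = 64, 256, 384
    P_a = rng.normal(size=(d_latent, d_a))
    P_b = rng.normal(size=(d_latent, d_b))

    z_train = rng.normal(size=(1000, d_latent))
    A_train, B_train = z_train @ P_a, z_train @ P_b

    # Fit W so that A @ W ~= B: a learned change of coordinates.
    W, *_ = np.linalg.lstsq(A_train, B_train, rcond=None)

    # Translate unseen A-space embeddings into B-space and compare.
    z_test = rng.normal(size=(100, d_latent))
    B_pred, B_true = (z_test @ P_a) @ W, z_test @ P_b
    cos = np.sum(B_pred * B_true, axis=1) / (
        np.linalg.norm(B_pred, axis=1) * np.linalg.norm(B_true, axis=1))
    print("mean cosine similarity of translated embeddings:", cos.mean())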

    What this might be implying is that if you train a model with similar objectives on data sampled from a similar generating function, you will arrive at a manifold in latent space that is geometrically similar to anyone else doing the same thing. If that is true, operations in latent space start to look less model-specific, and approaches that navigate them (like JEPA, LDM editing) could learn to operate across different models with just an adapter layer.

    To be clear, the paper is not saying this: the authors only align English, contrastive-loss, transformer sentence encoders. No decoder models, hardly any dimensionality mismatch. The phrase "universal geometry2" may be a stretch: their GAN training also requires quite a bit of run cherry-picking3, and when they tried cross-modality the results weren't as strong. But it's a very interesting idea worth further investigation.

    In the short term, this is probably mildly alarming for coding agent customers that are worried about their source code leaking, but in the long term I hope we can see some more investigation into how true this is in more general modeling, and what kind of opportunities that might open up!

    1. Shape in the embedding space. In practical experience, when you have a large embedding space it's mostly empty, and all the actual data lives on a plane in the space. This is why things like latent diffusion models work: they learn to navigate towards that plane from any random noisy point in the space. ↩︎
    2. But it’s a great title. ↩︎
    3. My understanding is unstable training is a very common problem for GANs. ↩︎
  • AbsenceBench

    https://arxiv.org/abs/2506.11440

    Simon Willison has a good summary:

    Long context models have been getting increasingly good at passing “Needle in a Haystack” tests recently, but what about a problem in the opposite direction?

    The answers are surprisingly domain-specific; some models do great on numeric sequences but most are pretty bad at code!

    The authors posit that attention is just a worse mechanism for seeing what's missing vs what's there. For me this rhymes with the experience of folks doing agentic coding assistant work: it's beneficial to clear the context window more often than you think, as the models strongly prefer to use what is already in there.

    This feels like a learned or tuned behavior, a flavor of "the model does the eval". Models will probably get better at this problem now that it's legible, but is there a tradeoff that has to be made?

    Pretraining is somewhat saturating, but we have oodles of post-training (which includes context extension), the whole meta-RL process of researchers trying different data mixes and algorithm/architecture tweaks, and inference time search options.

    If OpenAI had Anthropic's data and evals, would they have as good an agentic coding model? And vice versa, would Opus be as good at deep research as o3? I honestly don't know: in the end compute will always be finite and we have to allocate it with some end in mind. It feels very plausible there is no globally optimal scaling law for how you prioritize different model capabilities. But the models will probably do this eval.

  • A patchwork quilt view of AI Alignment

    https://arxiv.org/abs/2505.05197

    Very interesting paper from folks at DeepMind arguing that the idea of convergence on a single, coherent value set doesn't reflect how society works, and is not the only way to think about AI morality and alignment.

    Think of society as a patchwork quilt composed of diverse communities, each with its own internal norms and expectations. Within these communities—e.g. trade unions enforcing strike discipline or religious groups defining attendance practices—members understand the specific rules and the social consequences for deviating (Bicchieri, 2005; Ostrom, 1990; Ullmann-Margalit, 1977). These internal dynamics shape behavior within each distinct patch of the quilt, fostering cohesion through shared, localized understandings of appropriate conduct

    […]

    A key insight we can draw then is that what holds humans together despite profound disagreements is not value convergence, but practical mechanisms for coexistence—which we see as social technologies

    There is an idea that sometimes comes up that disagreements between good, reasonable people can be traced to misunderstandings or disagreements about the likelihood of different outcomes; if you can align on them, you'll come to the same conclusions. This encourages some focus in AI alignment on finding the right, true principles, creating the best truth-seeking model possible, then assuming downstream that will result in strong alignment. The paper challenges this assumption.

    They also call out collective action problems in implementing such a framework, particularly start-up and free-rider problems:

    Even seemingly universal goods like “survival” are embedded in thick cultural contexts that shape their meaning and priority (in fact many cultures prioritize sacred values above survival e.g. Ginges et al. (2007)). In general, mobilizing global action and resources towards any specific AI safety strategy will inevitably confront deep-seated disagreements rooted in different values, priorities, and worldviews regarding the nature of AI risks, the efficacy or fairness of proposed initial strategies, and the equitable distribution of upfront costs and responsibilities.

    Their approach calls for focusing on 4 areas:

    1. Contextual grounding: broader understanding, not just the conversation but the environmental context they are operating in.
    2. Community customization: Different norms for different communities.
    3. Continual adaptation: Updating understanding of appropriate behavior based on ongoing feedback. They suggest continuous training for this.
    4. Polycentric governance: Distributed decision making with multiple overlapping centers of authority.

    If you read this list in a general "helpful agent" context instead of alignment, I don't think it would be controversial: these seem like good ideas!

    That said, I think a lot of this boils down to the last one. Getting governance structures right is hard, in any context, and I interpret a key part of the aspiration here as having “checks and balances” that can represent varied interests. Not an easy problem to solve!

    Some might worry that our patchwork approach embraces a troubling relativism, but this misunderstands the quilt we're describing. Just as a quilt's structural integrity depends on solid stitching techniques regardless of pattern diversity, our appropriateness framework recognizes that while the specific content of norms varies across contexts, the social technologies that facilitate coexistence—the mechanisms of learning, adaptation, and conflict resolution—can be studied with rigor and implemented with care.

  • Linear Layouts in Triton

    [2505.23819] Linear Layouts: Robust Code Generation of Efficient Tensor Computation

    Paper from the Triton folks at OpenAI on their solution to the layouts/data movement problem. Data often needs to be laid out in a specific way to maximize performance on a GPU: certain instructions require particular layouts, and you also want to avoid bank conflicts in shared memory. You might have data stored nicely in global memory, need to permute it to load, then permute it again for execution.

    Part of the appeal of CuTe is expressing these layouts and providing a relatively simple algebra to transform them between these domains. This works, but the Triton approach is to try and hide this type of complexity, particularly hardware-specific complexity, in the compiler.

    While both CUTE and linear layouts aim to address the challenge of flexible task mapping on emerging architectures, they differ in several key aspects. First and foremost, CUTE is primarily designed for users to manually describe layouts, whereas linear layouts are integrated into a compiler. Second, the linear algebra framework of linear layouts enables compilers to generate efficient code for layout conversion and code lowering for many common operators, which is absent in CUTE. Third, swizzling is inherently defined within linear layouts, whereas in CUTE, it is treated as a separate step

    The clever insight is that you can represent any of the layouts as a binary matrix over F₂, which means you can use XOR/AND for arithmetic. You can compose those binary matrices freely, and it’s also easy to replace the transform matrix with a new one for hardware that requires a different permutation.

    To give a step-by-step example (as I'm not totally sure how well I grok this myself!), let's say we are working on an MMA for a 16×8 tile; there's a small numpy sketch after the walkthrough:

    • We start with our data, say in row-major order (0,0), (0,1), …, (0,7), (1,0), …. Each value is stored in its own register
    • We have 32 threads, each managing their own section of the block: in this case 4 registers
    • So we have a hardware location for each value: the thread (0..31) and the register (0..3). You can imagine this as a 7-bit address: thread ID (5 bits) and register ID (2 bits)
    • Equivalently, we can imagine tracking the tensor location for each value: 4 bits for rows 0..15, 3 bits for columns 0..7
    • We can have a map which translates between tensor location and hardware location: tensor location row 1, col 0 lives in thread 2, register 0. This map is a 7 by 7 binary matrix
    • We can define a matrix that transforms the hardware map to the one needed for our ldmatrix tensor core call.
    • For example, we might need thread 0 to manage tensor values (0,0), (4,0), (8,0), (12,0)
    • If the mapping requires moving a value to a different register in the same thread, we can use a prmt (permute) instruction
    • If the mapping requires moving values between threads' registers, we can use a warp shuffle like shfl.sync that allows swapping registers between threads without using shared memory1

    Triton has layouts for standard block-level storage, and for MMAs and other operations. By multiplying through the required mappings it can automatically work out how best to optimize data movement, versus the manual transforms you do in CuTe!

    It also has versions of these mappings for different hardware, so for many operations only the layouts need to be swapped out when moving from Ampere to Hopper or Blackwell!
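
    To make the walkthrough above concrete, here is a toy numpy version of the F₂ idea: a layout is a binary matrix mapping logical index bits to hardware index bits, applying it is a matrix-vector product mod 2, and composition (including swizzles) is matrix multiplication mod 2. The specific layout and swizzle below are made up for illustration; they are not Triton's actual MMA layouts:

    import numpy as np

    def bits(x, n):
        # little-endian bit vector of x with n bits
        return np.array([(x >> i) & 1 for i in range(n)], dtype=np.int64)

    def from_bits(b):
        return int(sum(int(v) << i for i, v in enumerate(b)))

    # 16x8 tile -> 7 logical bits (3 column bits, then 4 row bits) mapping to
    # 7 hardware bits (2 register bits, then 5 thread bits).
    n = 7
    L = np.eye(n, dtype=np.int64)   # toy layout: low column bits -> register, rest -> thread

    # A swizzle is just another F2-linear map: XOR a row bit into the low
    # register bit (the classic trick for dodging bank conflicts).
    S = np.eye(n, dtype=np.int64)
    S[0, 3] = 1

    combined = (S @ L) % 2          # composing layouts = matrix multiply mod 2

    row, col = 5, 3
    logical = np.concatenate([bits(col, 3), bits(row, 4)])
    hw = (combined @ logical) % 2   # applying a layout = mat-vec mod 2 (i.e. XORs)
    reg, thread = from_bits(hw[:2]), from_bits(hw[2:])
    print(f"(row={row}, col={col}) -> thread {thread}, register {reg}")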

    1. mostly. if there will be bank conflicts, it will spill to shared memory. ↩︎
  • Toward a Theory of Tokenization in LLMs

    [2404.08335] Toward a Theory of Tokenization in LLMs

    Tokenization has always struck me as one of the odder aspects of natural language deep learning. Despite the extensive end-to-end learning processes we typically use, tokenization initially involves creating a dictionary of optimal sub-word segments from your dataset. One of the appealing concepts in the Byte Latent Transformer paper is the potential to learn tokenization dynamically, recognizing that tokenizers solve deeper problems than merely providing a fixed vocabulary.

    This paper addresses tokenization from a theoretical perspective by modeling sequences using kth-order Markov processes, where the likelihood of each symbol depends on the k symbols before it, as in natural language. The parameter k plays the role of the model's context window size. Key findings include:

    1. Training without tokenization leads models to effectively behave as unigram predictors, significantly limiting performance.
    2. Using a well-designed tokenizer (e.g., Byte Pair Encoding – BPE) enables models to achieve nearly optimal performance in capturing sequence dependencies.
    3. Increasing the tokenizer’s dictionary size improves the model’s performance, moving it closer to the ideal probability distribution.

    Tokenizers which do a good job at learning patterns in the data and assigning these frequent patterns as tokens in the dictionary are compatible with an i.i.d. model over tokens.

    This insight suggests that despite the complexity of natural language, a good tokenizer converts sequences into something approximating an independent and identically distributed (i.i.d.) format, which brings the modeling task closer to one transformers can solve.
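
    A toy illustration of that claim (my own, not from the paper): take a first-order Markov source over two characters and compare the best unigram model over raw characters with the best unigram model over crude pair tokens. The pair "tokenizer" already moves the unigram model noticeably closer to the true entropy rate of the source; a real BPE with longer, frequency-matched tokens closes the gap further:

    import numpy as np
    from collections import Counter

    rng = np.random.default_rng(0)

    # First-order Markov source over two characters that strongly prefers
    # repeating itself.
    P = {"a": [0.9, 0.1], "b": [0.1, 0.9]}
    seq, cur = [], "a"
    for _ in range(50_000):
        seq.append(cur)
        cur = rng.choice(["a", "b"], p=P[cur])

    def unigram_bits_per_char(tokens, chars_per_token):
        # entropy of the best unigram model over a token stream, per source character
        counts = np.array(list(Counter(tokens).values()), dtype=float)
        p = counts / counts.sum()
        return -(p * np.log2(p)).sum() / chars_per_token

    # Crude stand-in for BPE: merge adjacent characters into pair tokens.
    pairs = ["".join(seq[i:i + 2]) for i in range(0, len(seq) - 1, 2)]

    print("char unigram :", unigram_bits_per_char(seq, 1))     # ~1.0 bit/char
    print("pair unigram :", unigram_bits_per_char(pairs, 2))   # ~0.74 bits/char
    print("source rate  :", -(0.9 * np.log2(0.9) + 0.1 * np.log2(0.1)))  # ~0.47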

    While the paper does not explicitly explore the Byte Latent approach, I wonder if its entropy-driven dynamic token allocation might similarly achieve this i.i.d. simplification. In BLT, the separately trained entropy model could dynamically transform inputs into a distribution that is more palatable for transformers.

  • Analyzing Modern GPU Cores

    [2503.20481] Analyzing Modern NVIDIA GPU cores

    Filing this under interesting work I will probably never use. The authors try to construct a more accurate simulation of the Ampere (A100 / RTX 30-series class GPUs) microarchitecture, backed by extensive testing on real hardware. It's a good reminder that, in part because of how good some of the abstractions are, there is quite a lot about Nvidia GPUs that isn't really known outside Nvidia. My main takeaway was that the compiler is very deeply coupled to the hardware performance: an Nvidia chip is not really a complete unit without taking into account the software driving it, and recognizing that goes a long way to explaining why Nvidia have done such a good job of building a solid stack with CUDA.

    One of the things I found interesting was the use of a Stall counter: the compiler notes fixed-latency instructions (which seem to be a preferred design choice) and adds a counter to the instruction's control bits that specifies how many cycles the warp should wait before issuing the next instruction, so that other warps will be selected for execution in the meantime. This means the hardware doesn't have to dynamically check for data dependencies.

    For example, an addition whose latency is four cycles and its first consumer is the following instruction encodes a four in the Stall counter. Using the methodology explained in section 3, we have verified that if the Stall counter is not properly set, the result of the program is incorrect since the hardware does not check for RAW hazards, and simply relies on these compiler-set counters. In addition, this mechanism has benefits in terms of area and energy wiring. Keep in mind that wires from fixed-latency units to the dependence handling components are not needed, in contrast to a traditional scoreboard approach where they are required.

    There are also variable-latency instructions, like memory loads; these instead have a Dependence counter, which is decremented when the data arrives.

    In the vein of handing off to the compiler, the scheduler uses a Compiler Guided Greedy Then Youngest policy: it will keep issuing instructions from the same warp (greedy) with guidance from the Stall counter (and an explicit Yield bit), and otherwise will switch to the youngest ready warp. Older GPUs (apparently!) used Greedy Then Oldest instead, which more often resulted in selecting a warp that was still stalled waiting for memory or similar, while the youngest warp more likely has useful work to do.

    The scheduler starts issuing instructions from the youngest warp, which is W3, until it misses in the I-cache. As a result of the miss, W3 does not have any valid instruction, so the scheduler switches to issue instructions from W2. W2 hits in the I-cache since it reuses the instructions brought by W3, and when it reaches the point where W3 missed, the miss has already been served, and all remaining instructions are found in the I-cache, so the scheduler greedily issues that warp until the end. Later, the scheduler proceeds to issue instruction from W3 (the youngest warp) until the end, since now all instructions are present in the I-cache.

    Similarly, the paper points out that the instruction prefetch cache is a stream buffer (probably 16 instructions deep) rather than any kind of complex branch prediction logic, because we generally don’t do that kind of thing on GPUs!

    a straightforward prefetcher, such as a stream buffer, behaves close to a perfect instruction cache in GPUs. This is because the different warps in each sub-core usually execute the same code region and the code of typical GPGPUs applications do not have a complex control flow, so prefetching N subsequent lines usually performs well. Note that since GPUs do not predict branches, it is not worth implementing a Fetch Directed Instruction prefetcher [76] because it would require the addition of a branch predictor.

  • Darwin Gödel Machines

    https://open.substack.com/pub/gonzoml/p/darwin-godel-machine

    Interesting paper breakdown on Gonzo ML of another evolutionary agent approach from the extended Sakanaverse.

    It commences with an initial coding agent, constructed upon a frozen foundation model (FM) equipped with tool-use capabilities (e.g. running bash commands, editing files). In each cycle, “parent” agents are selected from the expanding archive. This selection process prioritizes agents based on a combination of their performance (assigning greater weight to higher scores, scaled by sigmoid) and a novelty bonus (inversely correlated with the number of offspring they have already produced, thereby encouraging exploration of less-frequented paths).

    The actual foundation model is a frozen component, so much like AlphaEvolve, this is a search set up on top of the model's intelligence. The search evolves the agent code itself to try to do better on benchmarks.
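
    A rough sketch of the parent-selection step as described in the quote above (the exact weighting here is my guess for illustration, not the paper's formula):

    import numpy as np

    rng = np.random.default_rng(0)

    def select_parents(scores, n_children, k=2):
        # sample k parents from the archive, favouring high benchmark scores
        # (sigmoid-scaled) and agents with few offspring so far (novelty bonus)
        scores = np.asarray(scores, dtype=float)
        n_children = np.asarray(n_children, dtype=float)
        fitness = 1.0 / (1.0 + np.exp(-scores))   # sigmoid-scaled performance
        novelty = 1.0 / (1.0 + n_children)        # fewer offspring -> bigger bonus
        probs = fitness * novelty
        probs /= probs.sum()
        return rng.choice(len(scores), size=k, p=probs)

    # archive of 5 agents: benchmark score and number of children already produced
    print(select_parents(scores=[0.2, 0.5, 0.8, 0.9, 0.4], n_children=[3, 1, 4, 0, 2]))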

    Qualitatively, the DGM learned to enhance its own tools and workflows. For instance, it developed more granular file editing capabilities (e.g., string replacement), improved long-context window management (e.g., auto-summarizing prior interactions), and refined its problem-solving strategies (e.g., making multiple attempts at a solution and using another FM to evaluate patches). These discovered improvements also demonstrated generalizability, transferring benefits across different underlying FMs and programming languages.

    When it comes to coding agents I had been thinking there were three axes of performance which gate the overall effectiveness, but the paper makes it clear there are at least four:

    1. The foundation model itself, with its base coding, tool use, reasoning abilities and context window size
    2. The tools it has available – the more a tool exposes its underlying semantics, the more efficiently the model can use it.
    3. The UI, how the user interacts with the agent to direct it, provide clarity and review work.
    4. The prompt, strategies for problem solving and how the context window is managed (e.g. when to summarize)

    In this case the UI is held fixed (an outer eval loop), the model is fixed, and the search explores tools and strategies. At the very least, it seems a search across multiple different models as options might also work well!

  • Scaling RL Compute

    https://gr.inc/blog/scaling-rl-compute/

    Great post by the folks at General Reasoning on the combination of factors that led to o1-type breakthroughs in inference-time compute.

    But here is the key point: no-one suddenly discovered that reinforcement learning was useful for reasoning. It was always useful, but getting some of the details right was the difference between a good post-training recipe and a paradigm shift in the way we use language models.

    ML research is prone to these lollapalooza effects, where several positive factors coincide to produce a much larger than expected result. You can look at the launch of ChatGPT for another example: ChatGPT wasn't a surprise for folks who had spent time with large language models and had seen attempts like Galactica before. But for many people it was a remarkable new experience, and the engagement and interaction ChatGPT saw was new to the research community. That itself contributed to further breakthroughs and improvements.