Tag: ml

  • Optimizers and Hessians

    https://arxiv.org/abs/2505.02809

    Optimizers are consistently one of the great areas of ML for discovering whether you remember any linear algebra or not (I land on not). Given the pace of change, it’s somewhat surprising that Adam(W) has hung around for as long as it has. Adam updates each parameter by a moving average of the gradient, and a moving average of the squared-gradient (the 2nd moment). Each weight is updated (and the moments/running averages are tracked) separately.
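    As a reminder of how simple that per-parameter bookkeeping is, here's a toy numpy sketch of the standard Adam update (usual default hyperparameters, toy quadratic loss — my illustration, not any particular library's code):

```python
import numpy as np

# Adam in a few lines: per-parameter moving averages of the gradient (m)
# and the squared gradient (v); every entry is tracked independently.
def adam_step(w, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * g          # 1st moment (mean of gradients)
    v = b2 * v + (1 - b2) * g * g      # 2nd moment (mean of squared gradients)
    m_hat = m / (1 - b1 ** t)          # bias correction for the zero init
    v_hat = v / (1 - b2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

w = np.zeros(3); m = np.zeros(3); v = np.zeros(3)
for t in range(1, 101):                # toy quadratic loss: grad = w - 1
    g = w - 1.0
    w, m, v = adam_step(w, g, m, v, t)
print(w)  # moving toward the minimum at 1.0
```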

    One area where we have some alternative approaches is the Shampoo family of optimizers. They take a whole block of parameters (usually the weights of a layer), store the moving averages of the second moments of all the gradients in the block, then transform them each step. This transform gives an approximation of the inverse Hessian of the block. The Hessian is the square matrix of second derivatives of the loss: it tells you how the gradient slope changes, like a curvature map. This is expensive to calculate for the network as a whole, so Shampoo estimates it one layer at a time (ish – it’ll split up big layers), and rotates/transforms each parameter’s gradient update based on this curvature estimate.

    Empirically, either of these works because if you do look at the Hessian of the loss in a network, it is basically block diagonal: almost all of the curvature is within certain blocks, and very little of it is between distant parameters.

    The paper, Towards Quantifying the Hessian Structure of Neural Networks, looks into why that is. The paper is dense, but the main conclusion is that the block-diagonal pattern emerges as the number of output “classes” increases, with blocks corresponding to classes. Given that a class is a token in LLMs, modern LLMs are strongly likely to exhibit this structure.

    (and layers end up somewhat aligning with class-wise blocks in most nets).

    In the subsequent analysis, we will show that the number of classes C, instead of the CE loss, is one key factor. Specifically, the near-block-diagonal Hessian structure arises as C→∞ for both the MSE and the CE loss.

    […]

    We emphasize that we do not claim “large C” as the only cause for the near-block-diagonal Hessian structure, but just that it is a sufficient condition. 

    Because output layers generally have a tight class association (e.g. one column per class), this propagates through the model during training. Ideally a block would be “all weights that eventually feed the class”, and the paper shows that the training process (on a simple network) pushes different parts of tensors to have stronger associations with specific classes, so you get a kind of local version of the same block-diagonal structure.
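    For a single linear-softmax layer you can check the mechanism numerically (this is my construction, not the paper's code): the (c, c′) Hessian block for one sample is (p_c·δ_cc′ − p_c·p_c′)·xxᵀ, so relative block magnitudes are just the entries of A = diag(p) − ppᵀ, and the off-diagonal p_c·p_c′ terms shrink relative to the diagonal's p_c·(1 − p_c) as the class count C grows:

```python
import numpy as np

# Ratio of the largest off-diagonal class block to the largest diagonal
# class block, for a one-sample linear softmax classifier with C classes.
# Every block is a scalar entry of A = diag(p) - p p^T times x x^T, so x
# cancels out of the ratio entirely.
def offdiag_ratio(C, seed=0):
    rng = np.random.default_rng(seed)
    logits = rng.standard_normal(C)
    p = np.exp(logits - logits.max())
    p /= p.sum()
    A = np.diag(p) - np.outer(p, p)
    off = max(abs(A[i, j]) for i in range(C) for j in range(C) if i != j)
    diag = max(A[i, i] for i in range(C))
    return off / diag

for C in (2, 10, 1000):
    print(C, round(offdiag_ratio(C), 4))  # ratio falls as C grows (exactly 1.0 at C=2)
```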

    This kind of understanding is helpful because it explains why single-parameter optimization still works (curvature is very localized), and also points to a direction for improving optimizer memory usage:

     II. Understanding Hessian structure can help design new training methods for NNs. For instance, Adam-mini (Zhang et al., 2024b), a recently proposed optimizer, utilizes the near-block-diagonal Hessian structure to cut down 50% memory consumption in Adam. We believe the special Hessian structure can inspire more new optimizers.

    Skimming this paper made me wonder whether this would also apply to Muon. A new paper, Muon Optimizer Accelerates Grokking, studies Muon in practice. Grigory Sapunov wrote up a great summary of the paper, which shows that Muon grasps the underlying distributions earlier (as shown by earlier jumps in eval-set accuracy).

    The authors speculate about what exactly in Muon helps grokking: spectral-norm constraints and second-order cues steer the model away from simple memorization and help it discover the true pattern.

    Muon operates on 2D tensors only (and is usually mixed with Adam for everything else), and uses a transform called Newton-Schulz which takes the directions from the gradient update but makes the singular values of the update equal, meaning we step the same effective distance in every direction. It also operates one step at a time, rather than storing moving averages of second moments. This means it works a bit like a simplified Shampoo, but is even more efficient – so it again benefits from the fact that you can largely ignore the geometry outside the layer it’s looking at!
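    The orthogonalization step is easy to play with. Muon itself uses a tuned quintic Newton-Schulz iteration; the textbook cubic version below is just to show the idea (my sketch, not Muon's code):

```python
import numpy as np

# Starting from the gradient, scaled so its spectral norm is <= 1, repeatedly
# applying X <- 1.5*X - 0.5*X X^T X drives every singular value toward 1 while
# preserving the singular directions, i.e. it orthogonalizes the update.
def newton_schulz_orthogonalize(g, steps=30):
    x = g / (np.linalg.norm(g) + 1e-7)  # Frobenius norm bounds the spectral norm
    for _ in range(steps):
        x = 1.5 * x - 0.5 * x @ x.T @ x
    return x

rng = np.random.default_rng(0)
g = rng.standard_normal((4, 4))
u = newton_schulz_orthogonalize(g)
print(np.linalg.svd(u, compute_uv=False))  # all singular values close to 1
```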

  • Generative modelling in the latent space

    Generative modelling in latent space – Sander Dieleman

    Fantastic deep dive into the concept of latents and the tradeoffs around them by Sander Dieleman of DeepMind. It’s a long article, but there’s a conclusions section that pulls out some of the most interesting points, and each section is an expansion on those points.

    • Latents add complexity, but the computational efficiency benefits are large enough for us to tolerate this complexity – at least for now.
    • Three main aspects to consider when designing latent spaces are capacity (how many bits of information are encoded in the latents), curation (which bits from the input signals are retained) and shape (how this information is presented).
    • Preserving structure (i.e. topology, statistics) in the latent space is important to make it easy to model, even if this is sometimes worse from an efficiency perspective.
  • Scalably Solving Assistance Games

    Scalably Solving Assistance Games | OpenReview

    Assistance games are an RL approach where the assistant and human cooperate on achieving a goal, and receive a reward signal for the joint effort. This paper proposes them as a better mechanism for aligning models in post-training than RLHF.

    Normally, RLHF is focused on single responses, or a single “turn” or interaction:

    Assistance games avoid the aforementioned drawbacks of RLHF by explicitly accounting for both the interactive nature of assistance and uncertainty about the user’s goal. In particular, an assistance game is a two player game in which an assistant and a user take actions in a shared environment. The two agents share a reward function, but crucially the assistant is initially uncertain about it.

    Assistance games remove incentives for deception since the assistant’s performance depends on the true latent reward function, rather than human feedback. They also incentivize the assistant to interact with the user to resolve its uncertainty about the reward function.

    The paper uses building structures in Minecraft as the learning environment and gets some very positive results. They mention the possible applications for chatbot alignment as a postscript.

    Practically this requires, given chat history h, predicting:

    • the next assistant message (or tool call)
    • the next human message in response to that
    • how satisfied the human is with the response

    The algorithm does a tree search, trying various different replies and responses, and picks the assistant action that scored best. They generally sampled ~100 actions in the paper.
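    A minimal sketch of that loop, with hypothetical stand-ins for the three predictions above (none of these function names come from the paper, and the scoring is random noise purely so the loop runs):

```python
import random

def sample_assistant_action(history):
    return f"reply-{random.randint(0, 9)}"   # next assistant message / tool call

def simulate_human_reply(history, action):
    return f"human-reaction-to-{action}"     # predicted next human message

def satisfaction(history, action, reply):
    return random.random()                   # predicted human satisfaction

def best_action(history, n_samples=100):
    """One-step lookahead: sample candidate assistant actions, roll out the
    predicted human response, score it, and keep the best-scoring action."""
    best, best_score = None, float("-inf")
    for _ in range(n_samples):
        action = sample_assistant_action(history)
        reply = simulate_human_reply(history, action)
        score = satisfaction(history, action, reply)
        if score > best_score:
            best, best_score = action, score
    return best

print(best_action(["user: help me build a house"]))
```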

    In the Minecraft example, they can see whether a placed or removed block moves the shape towards the human target, so they can give a reward score each step; doing the same thing with conversations might need some clever goal crafting or propagating back from only a final reward signal.

  • JEPA

    JEPA is an example of a predictive coding architecture. Predictive coding is the idea that the brain works by predicting the next sensory input and learns from the error between the prediction and the actual input. Hierarchies of this kind of prediction allow higher level elements to predict the outputs of lower-level elements, building up deeper and more complex semantics.

    The core idea in JEPA is to take two related things (say consecutive frames of a video) x and y, encode each of them into a latent space (an embedding), and then predict the embedding of y, s(y), from s(x). The encoders can be Transformer-based models – in practice models like I-JEPA train the x encoder with gradients and update the weights of the y (target) encoder via a moving average.

    The learning is not based on how well the end result predicts the target (e.g. how closely the pixels of the next frame are predicted). Instead, it’s based on how well the latent representation of the next frame is predicted.

    The advantage of working in the latent space for the prediction is the model can choose what level of detail it wants to capture, discarding some aspects and focusing on more foundational concepts. This helps build a more robust world-model, with the hope being that training in this way will then allow easier generalization to more tasks, with less data required.
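    A deliberately tiny numpy sketch of this setup, with linear maps standing in for the encoders and only the predictor taking gradient steps to keep it short (real JEPAs use Transformers and train the context encoder too):

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 16

W_enc = 0.1 * rng.standard_normal((dim, dim))  # context encoder s(.)
W_tgt = W_enc.copy()                           # target encoder: an EMA of W_enc
W_pred = np.eye(dim)                           # predictor from s(x) to s(y)

x = rng.standard_normal(dim)                   # e.g. frame t
y = x + 0.1 * rng.standard_normal(dim)         # a related input, e.g. frame t+1

for _ in range(300):
    s_x, s_y = W_enc @ x, W_tgt @ y
    err = W_pred @ s_x - s_y                   # the loss lives in latent space,
    W_pred -= 0.1 * np.outer(err, s_x)         # not in pixel space
    W_tgt = 0.99 * W_tgt + 0.01 * W_enc        # EMA target update, no gradients

print(np.linalg.norm(W_pred @ (W_enc @ x) - W_tgt @ y))  # latent prediction error shrinks
```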

    Similarities

    This is somewhat similar to autoencoders. Autoencoders take an input, compress it in a latent space, then reconstruct the original from the latent space and propagate back the error. JEPA does a similar process across two different items with separate encoders, and only cares about error within the latent space.

    Contrastive models embed two different items into the same space and try to increase similarity between the embeddings for things known to be similar and make them dissimilar to other items. This is used in CLIP and other multimodal text-image encoders, where the text and the image embed to the same space so that a text caption and a matching image are close in embedding space. This requires a lot of pairwise comparisons, while JEPA is a more straightforward s(x)->s(y) prediction in training.

    Challenges

    Because JEPA models leave you with a latent, they need to be paired with a generator to get an observable/human-viewable output, which is a per-domain challenge. This makes it harder to evaluate how well the model is learning, beyond measuring loss.

    Training stability can also be tricky – it is possible for the model to collapse and learn trivial representations that minimize prediction error. Even without complete collapse it can require some experimentation to ensure the model is learning at a deep enough conceptual level. For example, I-JEPA, which worked in image space, found that using large enough masked patches was important to ensure the model captured semantics rather than just local detail.

  • Pydantic Evals

    https://ai.pydantic.dev/evals/

    Ed Yang was recently recommending keeping your own benchmark of LLM evals, so you can test newer models on problems that they have struggled with in the past. I have recommended similar things to people, but there is some barrier to entry in knowing how to start. Ed references (and forks) Nicolas Carlini’s personal benchmark repo, but it’s nice to have some light(ish)weight options too.

    Pydantic Evals is a powerful evaluation framework designed to help you systematically test and evaluate the performance and accuracy of the systems you build, especially when working with LLMs.

    You can install the library with uv or pip:

    uv add pydantic-evals
    

    I tried it out with a strawberry test, calling openrouter with different models. I needed a custom eval as the default Contains is a bit rigid, but the approach seems nice!

    import os
    import asyncio
    from dataclasses import dataclass
    from typing import Any, Optional

    from openai import OpenAI
    from pydantic_evals import Case, Dataset
    from pydantic_evals.evaluators import Evaluator, EvaluationReason, EvaluatorContext
    from pydantic_evals.evaluators.common import _truncated_repr


    @dataclass
    class FlexibleContains(Evaluator[object, object, object]):
        """
        Check if the output contains any one of the expected options.
        """

        value: Any
        case_sensitive: bool = False

        def evaluate(
            self, ctx: EvaluatorContext[object, object, object]
        ) -> EvaluationReason:
            failure_reason: Optional[str] = None
            # Normalize value into a list of options if it isn't already a list or tuple.
            options = self.value if isinstance(self.value, (list, tuple)) else [self.value]
            output_str = str(ctx.output)
            if not self.case_sensitive:
                output_str = output_str.lower()
            match_found = False
            for opt in options:
                opt_str = str(opt)
                if not self.case_sensitive:
                    opt_str = opt_str.lower()
                if opt_str in output_str:
                    match_found = True
                    break
            if not match_found:
                failure_reason = (
                    f"Output string {_truncated_repr(output_str, max_length=100)} does not contain "
                    f"any of expected strings: {[str(opt) for opt in options]}"
                )
            return EvaluationReason(value=match_found, reason=failure_reason)


    strawberry = Case(
        name="strawberry",
        inputs="How many rs are in strawberry?",
        evaluators=[FlexibleContains(value=["3", "three"])],
        metadata={"difficulty": "easy"},
    )
    dataset = Dataset(cases=[strawberry])

    MODELS = [
        "anthropic/claude-3.5-sonnet",
        "openai/gpt-4o",
        "meta-llama/llama-4-maverick:free",
        "meta-llama/llama-4-scout:free",
        "openrouter/optimus-alpha",  # secret model!
    ]


    def generate_completion(inputs: str, model: str) -> str:
        """Generate a completion using OpenRouter with specified model"""
        client = OpenAI(
            base_url="https://openrouter.ai/api/v1",
            api_key=os.getenv("OPENROUTER_API_KEY"),
        )
        response = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": "You are a helpful AI assistant."},
                {"role": "user", "content": inputs},
            ],
            max_tokens=50,
            temperature=0.7,
        )
        return response.choices[0].message.content.strip()


    def evaluate_models():
        """Run evaluations across multiple models"""
        for model in MODELS:
            print(f"\nResults for model: {model}")
            print("=" * 50)

            # Wrap the synchronous generate_completion in an async function:
            async def model_specific_generate(inputs: str) -> str:
                loop = asyncio.get_running_loop()
                return await loop.run_in_executor(None, generate_completion, inputs, model)

            # Run evaluation for this model
            report = dataset.evaluate_sync(model_specific_generate)

            # Print results for this model
            report.print(include_input=True, include_output=True, include_durations=False)


    def main():
        evaluate_models()


    if __name__ == "__main__":
        main()
    

    To give a trimmed output:

    Results for model: openrouter/optimus-alpha
    ==================================================
                                          Evaluation Summary: model_specific_generate
    ┏━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
    ┃ Case ID    ┃ Inputs                         ┃ Outputs                                                   ┃ Assertions ┃
    ┡━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
    │ strawberry │ How many rs are in strawberry? │ The word **"strawberry"** contains **three** letter "r"s. │ ✔          │
    ├────────────┼────────────────────────────────┼───────────────────────────────────────────────────────────┼────────────┤
    │ Averages   │                                │                                                           │ 100.0% ✔   │
    └────────────┴────────────────────────────────┴───────────────────────────────────────────────────────────┴────────────┘
  • Understanding R1-Zero-Like training

    https://arxiv.org/abs/2503.20783

    Further support for elicitation and a very good example of why it’s worth starting with the evals.

    They started with evals for mathematical reasoning, and then tested base DeepSeek, Qwen and Llama models with different templates to see how they did prior to any RL. In doing so they discovered that Qwen did best with no template, and that DeepSeek-V3 would already produce “aha” moments (reasoning self-reflection) without any further tuning. Some of the examples are amusing, particularly the “awkward silence”:

    In Pascal’s Triangle, every row starts
    and ends with 1, …

    This can be calculated as: awkward silence Wait, I’m overthinking. Let’s try again. The number of elements in the first n rows of Pascal’s Triangle…

    The second interesting takeaway for me was that the GRPO implementation (along with most PPO implementations) has a length bias: wrong-but-long answers are preferred over wrong-but-short ones. Their corrected, unbiased approach resulted in the same performance, but better token efficiency:

    We also revisit the GRPO’s optimization bias with the Llama base model. The right plot of Fig. 8 compares the model performance and response length trained with GRPO and Dr. GRPO [Ed: the unbiased version]. We can clearly see that GRPO can produce the “double-increase” phenomenon, potentially leading to a misperception that long-CoT can also emerge on Llama models after math pretraining. Unfortunately, the increase of length might be due to the optimization bias
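    A toy illustration of where the bias comes from (my numbers, not the paper's code). GRPO divides each response's summed per-token loss by its own length |o_i| (and also normalizes by the group's reward std, omitted here); Dr. GRPO drops the per-response 1/|o_i| in favor of a constant normalizer:

```python
import numpy as np

rewards = np.array([0.0, 0.0, 1.0])     # two wrong answers, one right
lengths = np.array([10, 100, 20])       # the second wrong answer is very long

adv = rewards - rewards.mean()          # group-relative advantage
grpo_per_token = adv / lengths          # per-token weight with the 1/|o_i| term
drgrpo_per_token = adv / lengths.max()  # constant normalizer instead

print(grpo_per_token)    # the long wrong answer is penalized 10x less per token
print(drgrpo_per_token)  # both wrong answers penalized equally per token
```

    Under the GRPO weighting, failing with a long answer costs less per token than failing with a short one, which is exactly the pressure toward length inflation the paper describes.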

  • Muon optimizer

    Last year Keller Jordan at OpenAI beat some of the existing NanoGPT speedrun records thanks to some optimizer improvements. Towards the end of the year the work was formalized as the Muon optimizer, and it’s making waves in a bunch of areas now.

    Friendship ended with Adam, now Muon is my best friend.
    From Elie Bakouch’s great pretraining presentation

    Jeremy Bernstein has written up a great post on how Muon is derived:

    To handle individual layers, our idea is to normalize the weight updates in a clever way so that, given the structure of the inputs, the weight updates automatically induce a desirable effect on the outputs. As a community, we have invested so much effort into thinking about how to normalize the activations: think batch norm, layer norm, RMS norm, etc. Why not also consider how the weights and weight updates influence the activations?

    Keller also wrote a detailed blog post when introducing the optimizer, calling out some open questions (like does it scale to very large training).

    As the posts cover, the optimizer isn’t totally general – it was designed for linear layers (and flattened convs), so you need to pair it with Adam for most usage.

    You can install the library from Github: pip install git+https://github.com/KellerJordan/Muon

    from muon import Muon
    import torch

    # Muon handles the >=2D weight matrices; everything else goes to AdamW.
    muon_params = [p for p in model.parameters() if p.ndim >= 2]
    muon_param_ids = {id(p) for p in muon_params}
    adamw_params = [p for p in model.parameters() if id(p) not in muon_param_ids]
    # Create the optimizers
    optimizers = [Muon(muon_params, lr=0.001),
                  torch.optim.AdamW(adamw_params, lr=0.001)]

    And step both optimizers in the training loop:

    for opt in optimizers:
        opt.step()

    It’s great to have innovation in this area, particularly with this kind of fundamental reasoning around why it works!

  • Byte-Latent Transformers

    Who needs a tokenizer anyway!

    [2412.09871] Byte Latent Transformer: Patches Scale Better Than Tokens

    This paper, from back in December last year, presents an interesting approach to handling raw byte sequences in LLMs without relying on tokenization.

    Vocab sizes for tokenizers have gone up over the last couple of years with attendant gains in usefulness, but this remains a particularly hand-tuned number in the training process. BLT proposes a method that processes raw UTF-8 byte sequences directly, leveraging a dynamic patching mechanism to group bytes into variable-length patches based on entropy.

    Higher-entropy regions receive more attention and shorter patches, while lower-entropy regions can be processed more efficiently.
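    A toy sketch of entropy-driven patching, with a fake heuristic standing in for BLT's learned byte-level entropy model (the interface and threshold here are my illustration):

```python
# Pretend letters/digits are predictable and everything else is surprising;
# BLT would use a small byte-level LM to estimate next-byte entropy instead.
def next_byte_entropy(prefix: bytes) -> float:
    return 0.5 if chr(prefix[-1]).isalnum() else 2.0

def patch(data: bytes, threshold: float = 1.0) -> list[bytes]:
    """Start a new patch whenever estimated next-byte entropy crosses the
    threshold, so surprising regions get shorter patches."""
    patches, current = [], bytearray([data[0]])
    for b in data[1:]:
        if next_byte_entropy(bytes(current)) > threshold:
            patches.append(bytes(current))
            current = bytearray()
        current.append(b)
    patches.append(bytes(current))
    return patches

print(patch(b"hello world, ok"))  # [b'hello ', b'world,', b' ', b'ok']
```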

    There are conceptually three levels of processing:

    • Local Encoder: A small transformer stack encodes raw byte sequences into higher level representations, which are then structured into patches.
    • Latent Global Transformer: A standard large transformer model operating on patch-level representations
    • Local Decoder: The encoded patches are decoded back into byte sequences, using a cross-attention mechanism to reconstruct text.

    In the paper they show they can achieve parity in pretraining with a traditional tokenized approach in Llama for a similar parameter count, while being more robust and offering some inference-time performance gains. The patching approach allows for allocating compute where it’s needed most.

    Retrofitting existing models

    One of the ideas I found most interesting is starting with a traditionally pretrained model. The paper discusses using the main transformer layers from Llama and training the byte latent approach successfully.

    I gave the approach a go with a simplified local encoder, entropy and patching approach, and took the transformer layers from Qwen 2.5 3B, a strong model that could still be trained locally (no corporate resources were harmed, etc).

    The basic approach was replacing the tokenizer, adding a small transformer and patch pooling based on a local entropy measure to generate patches, then cross-attending in some of the Qwen layers. It’s training a new encoder while leveraging Qwen for the backbone of the global transformer, and adding new cross-attention params to make it also the decoder, with the embedding layers at each end chopped off – so a significant domain shift. For inference I leverage the same patch-generation process to try to generate effective tokens.

    You can find my Torchtune recipe on GitHub, running through the Alpaca dataset. Thus far I’m still training so while loss is improving, I have no idea whether it will turn into something useful. The fact that there is something trainable is fun though, and I have hopes that this kind of technique will lead to some breakthroughs in tokenizer-free models in the future!

  • Ping-Pong Kernels on Hopper

    Deep Dive on CUTLASS Ping-Pong GEMM Kernel | PyTorch

    A useful deep dive on this performance technique. The TL;DR is, using warp specialization, set up a producer group that loads data (using TMA), and two consumer groups executing matmuls on the Tensor Cores. When a consumer group finishes, it executes the epilogue (e.g. copying the results elsewhere, but you could imagine doing something else on a CUDA core) while the other consumer group takes over the Tensor Core. Hence, I presume, the name, as Tensor Core usage ping-pongs between the two consumers!

    The producer warp group focuses on producing data movement to fill the shared memory buffers (via TMA). Two other warp groups are dedicated consumers that process the math (MMA) portion with tensor cores, and then do any follow up work and write their results back to global memory (epilogue).

  • Grouped GEMMs and MoE

    One of the challenges discussed in the Deepseek v3 paper is the availability of grouped GEMM kernels, which are used to hide the performance impact of many small kernel launches on GPUs. Deepseek uses many small experts (256!) rather than a few larger ones, which exacerbates this problem.

    Mixture of Experts models introduce multiple experts in the feed-forward portion of each transformer layer. Rather than having a single shared set of experts, each layer has its own. Each batch of tokens first passes through the standard attention block, followed by a lightweight linear layer with a softmax function¹. This determines, for each token, which experts it should be sent to. Tokens designated for each expert are gathered and sent to the appropriate device via an all-to-all operation, as experts are typically distributed across different devices.
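    A minimal numpy sketch of that switch-style top-k routing (the shapes and k here are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d_model, n_experts, k = 6, 8, 4, 2

tokens = rng.standard_normal((n_tokens, d_model))
W_gate = rng.standard_normal((d_model, n_experts))   # the lightweight gating layer

logits = tokens @ W_gate
probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
probs /= probs.sum(axis=-1, keepdims=True)           # softmax over experts
topk = np.argsort(-probs, axis=-1)[:, :k]            # each token's chosen experts

# Group token indices by expert: this grouping is what the all-to-all ships
# to the device(s) hosting each expert.
per_expert = {e: np.where((topk == e).any(axis=-1))[0] for e in range(n_experts)}
print(per_expert)
```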

    Once the tokens are on the device with the right expert(s), we need to execute the matrix multiplies for each expert on its set of tokens. The obvious solution is just to loop through and launch each GEMM, but because these are small (a small number of tokens, and smaller expert matrices) the kernel launch overhead ends up dominating the runtime. A grouped GEMM allows you to do this process on-device, taking in a list of tokens and experts and executing all the GEMMs with a single kernel launch.

    This differs from batched GEMMs in that the inputs can vary – different experts might receive different numbers of tokens.
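    To make the ragged-input point concrete, here's the naive per-expert loop in numpy (a sketch with made-up shapes); on a GPU each loop iteration would be its own kernel launch, which is exactly what a grouped GEMM folds into one:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff = 8, 16

# Four experts, each receiving a different (possibly zero) number of tokens.
experts = [rng.standard_normal((d_model, d_ff)) for _ in range(4)]
tokens = [rng.standard_normal((n, d_model)) for n in (3, 0, 5, 1)]

# Naive version: one small GEMM per expert. A grouped GEMM kernel instead
# takes the whole ragged batch (pointers plus per-expert sizes) and computes
# the same results in a single launch.
outputs = [t @ w for t, w in zip(tokens, experts)]
print([o.shape for o in outputs])  # [(3, 16), (0, 16), (5, 16), (1, 16)]
```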

    There are example implementations available, including a Triton tutorial that walks through a simple grouped GEMM kernel, as well as an example in CUTLASS.

    1. In switch MoEs at least, but there are similar gating networks elsewhere. ↩︎
  • Flow Matching

    https://drscotthawley.github.io/blog/posts/FlowModels.html

    Flow Matching models have largely replaced diffusion-based ones in a number of areas, and this is a really clear walk through of how they work and what they predict.

    It also references a post from folks at DeepMind that makes the case that the two frameworks are equivalent. Flow matching approaches are apparently easier to train.

  • Vita 1.5 – how to train an end-to-end speech model

    GitHub – VITA-MLLM/VITA: ✨✨VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction

    Vita 1.5 is a decent MLLM with speech and vision capabilities and comes with a good paper that gives a clear recipe for how to take a solid LLM (in this case Qwen 2.5) and add audio and vision to enable it to do end-to-end speech interaction (you can see a video of it in practice!)

    It can handle images and speech audio as input, and can handle video by sampling frames, treating them as sequential images, which allows it to manage video inputs without a dedicated video processing module. The interesting part of the paper is the training recipe they provide. They used generally available datasets, and some they sourced themselves: 11k hours of speech-transcription data and 3k hours of text-speech data. The recipe goes through multiple stages of fine-tuning:

    First, train the image understanding and question answering capability:

    1. Take 20% of image-text caption data, pass images through a pretrained visual encoder, a vision adaptor that connects the encoder to the LLM, and send the text to the LLM directly. Freeze the encoder and the LLM, and just train the adapter.
    2. Take all the image-text caption data and train the LLM, adapter and vision encoder.
    3. Mix in 20% of the caption data and all of the image question/answer data and train all three components again

    Next train the audio input:

    1. Take the speech-transcription data and train a speech encoder and a task guidance token, keeping the LLM frozen
    2. Then take a bit of the caption and QA data and train the speech encoder, speech adaptor, vision encoder, vision adapter, the state head (which helps the model keep track of which modality it’s working in) and the LLM at the same time. Half the time they replace the text in the caption or QA with the speech equivalent generated by a TTS system, to exercise the speech path.

    Finally, they train the audio output:

    1. Use the text-speech data to train a codec: an encoder from speech to discrete speech tokens, and a decoder from tokens back to speech.
    2. Encode the text with the LLM embeddings, and run it through two speech decoders to produce the output speech.
  • Deepseek V3

    https://github.com/deepseek-ai/DeepSeek-V3/tree/main

    The DeepSeek V3 paper is very clearly written and goes into the infrastructure they used to get a very strong model on a relatively small amount of compute.

    One example that stood out was the work they did to get efficient utilization on their expert routing.

    In detail, we employ the warp specialization technique (Bauer et al., 2014) and partition 20 SMs into 10 communication channels. During the dispatching process, (1) IB sending, (2) IB-to-NVLink forwarding, and (3) NVLink receiving are handled by respective warps. The number of warps allocated to each communication task is dynamically adjusted according to the actual workload across all SMs. Similarly, during the combining process, (1) NVLink sending, (2) NVLink-to-IB forwarding and accumulation, and (3) IB receiving and accumulation are also handled by dynamically adjusted warps.

    The name of the game in this kind of efficiency work is overlapping comms, and they are doing it at multiple levels: overlapping InfiniBand (backend network) calls with NVLink (much faster) calls, using warp specialization to enable concurrency and overlap within a GPU, etc. They have some feedback for hardware manufacturers on this:

    We aspire to see future vendors developing hardware that offloads these communication tasks from the valuable computation unit SM, serving as a GPU co-processor or a network
    co-processor like NVIDIA SHARP Graham et al. (2016). Furthermore, to reduce application programming complexity, we aim for this hardware to unify the IB (scale-out) and NVLink
    (scale-up) networks from the perspective of the computation units.

    Reminds me a little of the specialized comms cores in the Tenstorrent hardware!

    They also train in FP8 – getting stable training with FP8 is non-trivial, so having a list of what components they included/excluded is very helpful for other folks that might be interested in similar training:

    For this reason, after careful investigations, we maintain the original precision (e.g., BF16 or FP32) for the following components: the embedding module, the output head, MoE gating modules, normalization operators, and attention operators.

    Though, there is some work to do to get the same performance:

    One key modification in our method is the introduction of per-group scaling factors along the inner dimension of GEMM operations. This functionality is not directly supported in the standard FP8 GEMM.

  • HuatuoGPT-o1, domain specific reasoning

    https://arxiv.org/abs/2412.18925

    A good paper on creating a domain-specific (medical, in this case) reasoning LLM. I had been somewhat vague on how creating a reasoning model actually worked, and this paper felt very clear (as to one recipe at least). As I read it:

    Start with a base model you are fine-tuning for reasoning (in their case Qwen 2.5) and a strong general model for verification and data generation (in this case GPT-4o).

    1. Gather a bunch of problems with ground-truth answers. In this case they had medical problems. Take about half of them for fine tuning data.
    2. Prompt the general model to think step-by-step for each problem to get some initial reasoning.
    3. Use the general model to evaluate whether each answer reached is correct vs ground truth.
    4. If the reasoning gets to the wrong answer (which is likely!) try a search strategy randomly – e.g. backtracking (start from an earlier step), critique-and-correct the existing line of reasoning, explore a new path distinct from the one given, or verify the current reasoning. Again, this is done by prompting the general model.
    5. Each problem gets three tries to get to a correct answer before giving up. At the end of this process we have a series of reasoning traces that get to the correct answer. Pair these with the problems and use them for the next steps, but first rewrite them into a chain of thought incorporating “hmms” and other smooth transitions between thoughts, via prompting the general model.
    6. Fine tune the base model on the problem traces.
    7. Do RL (PPO) on the fine tuned model with the other half of the problems, rewarding when the model is correct with reasoning (verified by the general model), giving a small reward for a incorrect answer but with reasoning, and giving 0 for no reasoning provided (regardless of answer). Also constrain with KL divergence from the fine tuned model, as is standard in RLHF etc.

    This seems like a pretty reproducible recipe and the results seem strong. They include the prompts they use in the appendix, helpfully, and some good ablations/practical notes.

    1. Effectiveness of Complex CoTs: We further examined the impact of different types of Chain-of-Thought (CoT) reasoning. The results show that direct learning of response (ŷ) performs the worst, while simple CoT (y0, e0) offers only little benefit. In contrast, Complex CoT (ŷ0, ê) significantly improves performance by an average of 4.3 points. This demonstrates the importance of teaching models to refine their answers with reflection.

    One other interesting note on sourcing data was that they used GPT-4o to filter a number of multiple choice questions to generate the set of problems and ground truth. They used it to evaluate whether questions were complex enough to require reasoning, and whether they had a single clear and unambiguous answer. I am guessing it is a lot easier to get multiple choice question banks than other kinds, so this is a clever approach.

  • ThunderKittens / GPUs go brr

    Just got round to reading the intro post to the (now improved) ThunderKittens kernel DSL.

    https://hazyresearch.stanford.edu/blog/2024-05-12-tk

    Many good nuggets on kernel writing in general and Hopper in particular.

    But to us a “register” is a 16×16 tile of data. We think AI wants this — after all this time, it’s still just matrix multiplies, reductions, and reshapes. And we think the hardware wants this, too — small matrix multiplies are just begging for hardware support beyond just the systolic mma.

    In fact, more broadly we believe we should really reorient our ideas of AI around what maps well onto the hardware. How big should a recurrent state be? As big can fit onto an SM. How dense should the compute be? No less so than what the hardware demands.