The model can probably write the code

The current vibes in software engineering are a mix of crushing despair at years of accumulated personal skills being displaced by the CEO prompting some stuff, and crushing despair at years of corporate investment in an existing codebase that isn’t vibe-y enough. People worry whether the models will be effective in their programming language of choice, not on some general benchmarks.

One angle to approach that is to ask how well the language is covered by the distribution of the training data[1]. An interesting paper the other day gave a pretty clear idea of how to check: 1-shot some prompts against the base model and see if they ever get it right. Getting access to base models is not always possible, but you can certainly call the post-trained models with roughly the same idea: no tools, no iterations, just generate this program.

To try this, I[2] wrote up 20 Project Euler-like[3] puzzles of varying difficulties and had a few different models YOLO solutions in several languages. These ranged from common ones like Python to fairly rare ones like Zig and Hack.

After validating all the solutions, we can compute pass@k: given k attempts, how often does the model solve the problem at least once? Here is pass@1, the percentage of the time you can expect the model to one-shot the solution:

| Lang       | GPT-4.1 Mini | Gemini 3 Flash | OLMo 3.1 | Kimi K2.5 | GLM-5 |
|------------|--------------|----------------|----------|-----------|-------|
| Python     | .93          | .99            | .72      | .97       | .98   |
| TypeScript | .94          | 1.00           | .43      | .95       | .95   |
| Go         | .95          | .91            | .46      | .86       | .86   |
| Rust       | .89          | .94            | .43      | .95       | .95   |
| Kotlin     | .90          | .99            | .29      | .91       | .93   |
| OCaml      | .76          | .86            | .08      | .94       | .90   |
| Zig        | .14          | .55            | .00      | .79       | .88   |
| Hack       | .43          | .76            | .05      | .47       | .68   |

And here is the same thing for pass@128: what is the chance it is right at least once in 128 samples:

| Lang       | GPT-4.1 Mini | Gemini 3 Flash | OLMo 3.1 | Kimi K2.5 | GLM-5 |
|------------|--------------|----------------|----------|-----------|-------|
| Python     | 1.00         | 1.00           | .95      | 1.00      | 1.00  |
| TypeScript | 1.00         | 1.00           | .90      | 1.00      | 1.00  |
| Go         | 1.00         | 1.00           | .85      | 1.00      | 1.00  |
| Rust       | .95          | 1.00           | .88      | 1.00      | 1.00  |
| Kotlin     | 1.00         | 1.00           | .59      | 1.00      | 1.00  |
| OCaml      | .98          | 1.00           | .38      | 1.00      | 1.00  |
| Zig        | .49          | 1.00           | .05      | 1.00      | 1.00  |
| Hack       | .99          | 1.00           | .46      | 1.00      | 1.00  |
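If you want to compute these numbers yourself, the standard unbiased pass@k estimator (from the original HumanEval paper) avoids enumerating subsets of samples. A minimal sketch, where the n samples per problem and c correct counts are illustrative:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: given n total samples for a problem,
    of which c were correct, the probability that at least one of k
    randomly drawn samples solves it."""
    if n - c < k:
        return 1.0  # not enough failures to fill all k draws: guaranteed hit
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. a problem where 40 of 128 samples were correct:
print(pass_at_k(128, 40, 1))   # 0.3125 -- this is the pass@1 column
print(pass_at_k(128, 40, 10))  # chance of at least one hit in 10 attempts
```

Averaging this over all problems gives the per-language figures in the tables.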

To make that a bit more visual, here is a per-language chart for GPT-4.1 Mini:

*[Figure: pass@k curves for GPT-4.1 Mini, one line per language (Python, TypeScript, Go, Rust, Kotlin, OCaml, Zig, Hack); k (number of attempts) on the x-axis, pass rate averaged across problems on the y-axis.]*

Given enough chances, GPT-4.1 Mini solves all the problems in almost all the languages. Of course, we don't actually know what GPT-4.1 was trained on, but we do know what OLMo 3.1 was trained on, thanks to the wonderful folks at AI2. That means we can see how much code-specific data there was for each language[4]:

| Language   | Code Corpus (GB) | Est. Tokens (B) | Category          |
|------------|------------------|-----------------|-------------------|
| Python     | 60.40            | 17.3            | High-resource     |
| TypeScript | 26.52            | 7.6             | High-resource     |
| Go         | 23.78            | 6.8             | High-resource     |
| Rust       | 9.11             | 2.6             | Medium-resource   |
| Kotlin     | 5.68             | 1.6             | Medium-resource   |
| OCaml      | 1.03             | 0.29            | Low-resource      |
| Zig        | 0.18             | 0.05            | Low-resource      |
| Hack       | 0.00             | 0.00            | Very-low-resource |
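As a sanity check on the token column: it appears to follow a flat bytes-per-token ratio. The ~3.5 bytes/token figure below is my inference from the table itself (60.40 GB of Python ≈ 17.3B tokens), not an official AI2 number:

```python
# ASSUMPTION: a flat ~3.5 bytes per token for source code, inferred
# from the table's Python row; real tokenizers vary by language.
BYTES_PER_TOKEN = 3.5

def est_tokens_billions(corpus_gb: float) -> float:
    """Estimate token count (in billions) from corpus size in GB."""
    return corpus_gb / BYTES_PER_TOKEN  # GB / (bytes per token) = B tokens

corpus_gb = {"Python": 60.40, "TypeScript": 26.52, "Go": 23.78,
             "Rust": 9.11, "Kotlin": 5.68, "OCaml": 1.03, "Zig": 0.18}
for lang, gb in corpus_gb.items():
    print(f"{lang}: ~{est_tokens_billions(gb):.2f}B tokens")
```

Rounded, this reproduces every row of the table's estimate column.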

There is a pretty decent correlation between the presence of training data and the pass@k rates. But, importantly, it's not 1: despite Hack having no StarCoder data and Zig only a negligible amount, the model clearly does know at least something about them. Given enough chances it has a decent chance of coming up with the correct answer for Hack, and a non-zero one for Zig:

*[Figure: training-data volume vs. average pass@k for each language, with separate markers for pass@1, pass@10, and pass@128.]*

We have seen for human language that models learn a language substrate, enabling them to perform strongly even on tasks they haven’t seen such as translating between unseen language pairs. I suspect something similar happens with code: despite the language differences there is a logical programming substrate, and the model doesn’t need much exposure to the language in order to generalize to it.

Once you start giving the model multiple attempts, it gets into the right region quickly for the high-resource languages: with GPT-4.1 Mini, Python, TypeScript, Go, and Kotlin saturate at k=10. The less-common languages continue to rise: the model can write valid OCaml or Zig or Hack, but it needs more attempts to stumble into the right region.

Thinking models flatten the curve substantially. Kimi K2.5 and GLM-5 both use high effort by default[5], and that appears to give them multiple bites at the apple from internally exploring and self-correcting. By k=10 these models saturate all problems in all languages, though at the cost of a remarkable number of tokens[6]!

It’s also instructive to see the ways in which the models get it wrong. Four patterns showed up:

  1. Ecosystem: One problem involved a sum of very large numbers. GPT-4.1 Mini regularly used num::BigUint. That type comes from the num crate, not the standard library, so while it would probably be a very valid choice in an agentic loop, it doesn’t work in a dependency-free one-shot. In contrast, GLM-5, a thinking model, implements digit-by-digit multiplication from scratch with Vec<u32>.
  2. API confusion: The model knows roughly what the code should look like, but chooses the wrong API. For example, OLMo generated while ... do ... in, mixing OCaml’s while...do...done loop with Haskell’s do notation and OCaml’s let...in binding.
  3. Surface-form invention: The model has a sense of how things stylistically look in the language, but doesn’t know the real API. GLM occasionally writes Zig with invented functions: std.mem.Allocator.alloc(usize, limit) (Allocator is a type, not a callable) or @intCast(usize, limit), which was actually valid syntax in earlier versions of Zig.
  4. Systematic convention gaps: Models would regularly put in <?hh for the Hack samples, which broke in modern Hack.
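For context on pattern 1, the dependency-free route is just schoolbook multiplication over fixed-size "limbs". This is my reconstruction of the general technique, not GLM-5's actual output, and it's in Python purely for illustration (Python's native big ints make it unnecessary there, which is part of why Python scores so well):

```python
# Schoolbook big-integer multiplication over little-endian base-1e9
# limbs -- the kind of thing a Vec<u32> would hold in the Rust version.
# ASSUMPTION: illustrative reconstruction, not the model's actual code.
BASE = 10**9

def to_limbs(n: int) -> list[int]:
    """Split a non-negative int into little-endian base-1e9 limbs."""
    limbs = []
    while True:
        limbs.append(n % BASE)
        n //= BASE
        if n == 0:
            return limbs

def mul(a: list[int], b: list[int]) -> list[int]:
    """Schoolbook multiply: O(len(a) * len(b)) limb operations."""
    out = [0] * (len(a) + len(b))
    for i, x in enumerate(a):
        carry = 0
        for j, y in enumerate(b):
            cur = out[i + j] + x * y + carry
            out[i + j] = cur % BASE
            carry = cur // BASE
        out[i + len(b)] = carry  # first write to this slot, so no overflow
    while len(out) > 1 and out[-1] == 0:
        out.pop()  # strip leading zero limbs
    return out
```

Twenty-odd lines of this, versus one line of `num::BigUint`, is the trade-off the thinking model made to stay inside the standard library.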

My takeaway from this is that models learn to code, not just to reproduce syntax. That means you can almost certainly post-train or prompt your way out of most programming-language problems with any frontier model: while some models were still pretty poor at Zig even with a lot of tries, Gemini most certainly was not. I doubt the folks at GDM spent a whole lot of time on Zig evals[7].

A well pre-trained model has broad capabilities in programming, and it’s mostly a case of eliciting them rather than having to teach them.

  1. I’m going to take as a given that models are good at generalizing within the distribution of their training data, and poor at generalizing outside it. This is not settled! Reasonable people can disagree! But it’s a decent starting point.
  2. Claude.
  3. Not actually Project Euler. I confirmed that the models never respond with an actual Euler puzzle answer in the incorrect ones, so I’m fairly (this is not good science) sure it wasn’t memorization.
  4. OLMo’s full training corpus (Dolma v1.7) includes a massive web crawl in addition to code-specific data from StarCoder, so the 0.00 GB for Hack means “absent from code-specific training data,” not “absent from all training data.” Hack documentation and other content are almost certainly present in the web-crawl portion.
  5. Gemini also reasons, but the 2.5 Flash model was doing minimal reasoning when answering.
  6. Somehow averaging over 3k tokens per sample for GLM, I say while ruefully staring at my OpenRouter bill.
  7. By posting this on the internet I am guaranteed to be corrected, at length, by a Googler.
