If you want to see what a very painful couple of months looks like for an ML research team, FAIR’s logbook of the OPT-175B pretraining from 2021 should top your list. The first few runs are basically:
Loss exploded.
Doesn’t learn.
Loss exploded.
Etc.
At each point the team tweaks some of the hyperparameters: learning rates, weight decay, clipping and so on, as well as adjusting how the model is distributed with various parallelisms. They swapped parts of the architecture (GELU to ReLU for example), dealt with hardware failures, avoided bad data, and tried to debug what was going on.
That was 2021 though; by now in 2026 we commonly train on tens of thousands of much more powerful GPUs. We have a broadly available body of knowledge on how to train massive models. The big labs are now full of serious people debating the moral meaning of perplexity with their in-house philosophers.
But the big labs don’t share those details. Of the frontier-ish labs, DeepSeek continue to be unusually open about their work. Their new one, DeepSeek v4, is pretty great! 1M tokens of context and close to frontier performance across multiple domains. Training, however, was… not smooth:
“Training trillion-parameter MoE models presents significant stability challenges, and DeepSeek-V4 series are no exception. We encountered notable instability challenges during training. While simple rollbacks could temporarily restore the training state, they proved inadequate as a long-term solution because they do not prevent the recurrence of loss spikes.”
One helpful dynamic of model training is that you can often validate what will happen in a big model by training in a small model. But not always. It mostly works for architectural tweaks and allows much more rapid experimentation and testing. But when it doesn’t work you can be in trouble: things that appear to smooth out problems at small scale can mask others at large scale where models have the capacity to learn weirder things. Fixing them late in training, when you are already into the gigawatts, hurts.
The techniques DeepSeek used (expert routing based on stale params, and clipping) did seem to get them through training a massive model on 30-odd trillion tokens, which is an incredible accomplishment. But they are arguably bandaids, as the team readily call out:
“Although a comprehensive theoretical understanding of their underlying mechanisms remains an open question for now, we are sharing them openly to foster further exploration by the community”
Which has echoes of Noam Shazeer’s similar observation for SwiGLU:
“We offer no explanation as to why these architectures seem to work; we attribute their success, as all else, to divine benevolence.”
Recently I had a conversation with an infrastructure team supporting an ML modeling group. The two orgs used to collaborate to ship experiments: the modeling team would come up with ideas, the infra team would augment their frameworks and build out tooling to make those ideas scalable. Together, they would ship an experiment every couple of weeks. Now the modeling team is largely making the framework changes and performance improvements themselves, thanks to coding agents, and are shipping a few experiments every single week. The infra team are still busy, but they are firefighting and debugging when the agents get stuck. The modeling team are much more productive, undeniably, and all the humans are busy, but the work for the infra team has ended up somewhat worse.
If you are a tech CEO who has recently returned to coding, you could look at the team doing the lower-scale firefighting and think “do I need these people?” If you keep taking that question to its conclusion you eventually ask… do I need anyone to do anything at all?
This question, helpfully, predates the term AI 1: Back in the ’30s, Coase wrote his theory of the firm on why companies do some things in-house, and buy others from the market. For a brief period in the early 00s it looked like software jobs would go to the market, thanks to outsourcing. This largely didn’t happen, because, as Coase predicted, specifying a project is tough. Creating software is an iterative process; you don’t know exactly what you’re building until you start, so you need people with technical taste to be making decisions in a consistent way.
There are a lot of Steve Jobs stories with this flavor. For one, Jobs wasn’t happy with the jiggling when holding down icons on the iPhone to remove them. The team built a UI with sliders so he could adjust the jiggle rate until satisfied. Once perfected, copying it was easy, but assembling a group that cares about those kinds of details is hard.
One way to find those people is to train them. Gary Becker wrote about human capital back in the 60s, and in his framing some training imparts skills which are marketable; some which are firm-specific. Companies will pay for firm-specific training but are less keen on paying for marketable skills because rivals can free-ride on it by poaching employees once they are trained.
“If Company A invests time and money to turn a raw college graduate into an expert, Company B can hire that person after five years of experience for a higher salary, collecting the benefits of skills Company A paid to build. In the past, firms tolerated this risk because juniors were producing valuable work along the way. Without that value, the economic foundation of apprenticeship collapses entirely.”
This shows up in the L3-L5 progression in big tech companies. I’ve seen many hiring managers be hesitant to hire an “industry four” as they don’t yet have the rounded, marketable skills the manager wants. But within the companies they have (effectively) apprenticed at, L4s contribute a huge amount of value. Is AI blowing up that trade-off?
The author of the AI Becker note, Luis Garicano, recently put out another paper on AI disruption asking when AI actually displaces jobs. In Garicano’s framing jobs are bundles of tasks and responsibilities; AI’s impact depends on how tightly these tasks are tied together.
“In a strong bundle, breaking the job destroys enough value that the job survives as a whole: AI assists but the human still sells the full service and retains a large share of revenue. In a weak bundle, the cost of splitting is small: AI replaces some tasks, the human role narrows, and the labor share falls.”
Software engineering involves writing code, operating services, decomposing problems, and aligning with others (both project-wise and culturally). Current AI coding agents attack part of this bundle, but humans comparatively excel at social dynamics and maintaining the larger world view necessary to know which problems to focus on.
At the senior levels the ties seem strong: you can take the coding and task breakdown out of it, but that wasn’t the main thrust of your L7-9 engineers anyway. At less vaunted levels, companies will need many fewer software engineers to churn out code than they have doing it now. But as the ML infra example earlier showed, that doesn’t necessarily mean you don’t need some of the other things they can do.
This opens a risk for the business: if you need senior folks but don’t have enough valuable work to justify training them yourself, you are stuck paying market-rate for increasingly rare talent. Right now if you happen to have, say, scaled LLM post-training experience you can command a very significant salary. Or just start your own company.
Hiring is hard even for deep-pocketed executives when key skills are firm-specific rather than marketable. Apple can’t go out and hire the kind of people with the taste it develops internally (generally). But how much are firms willing to roll the dice on developing the next Jeff Dean, and how much are they willing to risk someone else hiring them away?
For a similar dynamic, look at investing. Over the past decades, much of the junior analyst work that undergirded investment firms has been replaced by automation. The structure that emerged was the pod shop, or more formally a multi-strategy hedge fund. They operate more like a platform that hosts “pods”, each led by a portfolio manager who is supported by analysts, data scientists and traders. Each pod has its own domain of speciality, and its own profit and loss. The firm centrally manages risk and allocates capital to pods. Successful portfolio managers earn a healthy percentage of the profits they generate, while unsuccessful pods are taken out behind the woodshed and shot. This both gives a talent development pipeline and a rigorous performance standard, albeit not a very collaborative one.
This works, in part, because there is a very clear score card, measured in dollars. We might be able to copy the structure in engineering teams, but actually evaluating how well things are going is hard!
Firms that have the highest dependence on people you can’t easily hire are exactly the ones who are at risk of struggling in this transition: they have the most need to grow their own people, and the least economic reason to do so. Apple can’t buy another Apple, and neither can anyone else.
Back when people still used the term Cybernetics. AI researcher drama is literally as old as AI. ↩︎
When you have enough AI, what do programmers… do? When it was smart autocomplete (e.g. Copilot), that was pretty clear: everything! The AI handles some typing. When it was interactive IDEs (e.g. Cursor) it was still a lot: pair programming, designing, writing the hardest parts. Now that it’s an independent agent (e.g. Claude Code), it’s guiding, reviewing code, setting guardrails.
But, you know, we want to move faster than that! That means either we have the agent running in a loop without needing us, or we have lots of agents doing things at the same time1. Or both.
Throwing agents at a problem doesn’t automatically solve it2. Which leads back to the question: “what do we do?” The answer seems to be not so much being in the loop but designing the loop itself.
The most viral agent loop right now is Karpathy’s Autoresearch, which finds verifiable training improvements to his nanochat project. Running Autoresearch is straightforward: A human writes program.md with workflow guidelines, the agent runs in a loop trying ideas. Karpathy’s workflow allocates a fixed compute budget and constrains edits to a single training file to ensure the experiments are valid3. The agent generates ideas, verifies them, then refines: keeping the new baseline and discarding failed ideas.
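As a rough sketch of that keep-or-discard loop (this is not Karpathy’s actual code: `autoresearch_loop`, `toy_propose` and the toy loss numbers are hypothetical stand-ins for an agent editing the training file and running a fixed-budget training job), the shape is something like:

```python
import random

def autoresearch_loop(baseline_loss, propose, budget, seed=0):
    """Generate-verify-refine: keep a candidate only if it beats the baseline."""
    rng = random.Random(seed)
    best = baseline_loss
    kept = []                                # ideas that became the new baseline
    for step in range(budget):               # fixed number of experiments
        candidate_loss = propose(best, rng)  # generate + run (hypothetical stand-in)
        if candidate_loss < best:            # verify against the current baseline
            best = candidate_loss            # refine: this is the new baseline
            kept.append((step, candidate_loss))
        # otherwise the idea is discarded and the baseline is unchanged
    return best, kept

def toy_propose(current_best, rng):
    # Hypothetical: most ideas hurt a little, a few help.
    return current_best * rng.uniform(0.97, 1.05)

best, kept = autoresearch_loop(baseline_loss=1.0, propose=toy_propose, budget=50)
print(f"final loss {best:.4f} after keeping {len(kept)} of 50 ideas")
```

The human contribution is everything outside the `for` loop: what counts as a valid experiment, how it is scored, and how much budget it gets.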
While Karpathy’s agent-in-a-loop is responsible for both generating and implementing ideas, PyTorch’s KernelAgent4 goes multi-agent, giving each specialized roles and toolsets for improving GPU kernel performance. A profiling worker identifies opportunities, an analyzer agent suggests potential fixes, and so on. The actual execution is best-of-N sampling as an agent loop: it spawns N workers, lets them race, then plans a strategy for the next round.
“Optimization agents reflect on what succeeded and failed in each round, summarizing insights into a shared memory that guides subsequent iterations and prevents repeated dead ends.”
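To make the best-of-N-with-shared-memory idea concrete, here is a toy sketch. It is not KernelAgent’s implementation: the strategy names, the `SPEEDUPS` table and the `optimize` function are all made up for illustration, and a real system would measure speedups by profiling rather than looking them up.

```python
import random

# Hypothetical strategies and the speedup each would yield if applied.
SPEEDUPS = {"tile": 1.30, "fuse": 1.15, "unroll": 0.95,
            "vectorize": 1.05, "prefetch": 0.90}

def optimize(rounds, n_workers, seed=0):
    rng = random.Random(seed)
    memory = {"wins": [], "dead_ends": set()}   # shared, central state
    total = 1.0
    for _ in range(rounds):
        untried = [s for s in SPEEDUPS
                   if s not in memory["wins"] and s not in memory["dead_ends"]]
        if not untried:
            break
        # N workers each try a different strategy and race.
        picks = rng.sample(untried, min(n_workers, len(untried)))
        results = {s: SPEEDUPS[s] for s in picks}
        winner = max(results, key=results.get)
        if results[winner] > 1.0:
            memory["wins"].append(winner)
            total *= results[winner]
        # Reflect: record slowdowns so later rounds skip them.
        for strategy, speedup in results.items():
            if speedup <= 1.0:
                memory["dead_ends"].add(strategy)
    return total, memory

total, memory = optimize(rounds=5, n_workers=2)
print(f"{total:.2f}x using {memory['wins']}, avoided {sorted(memory['dead_ends'])}")
```

The workers never talk to each other; the only coordination is the `memory` dict that each round reads from and writes back to.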
The pattern that seems to work is to set up agents in a generate-verify-refine loop, following a pre-defined work approach, with guardrails. If you need more parallelism, add multiple agents, but keep state central to avoid communication overhead.
An example of the latter is OpenAI’s Symphony. This moves state into a task tracker then spawns5 individual codex agents with a fixed budget of iterations. Individual agents write back to the tracker to save state. This type of agent usage is also known as a “Ralph loop”: agents that start fresh for each iteration of the loop, with necessary context injected each time rather than accumulating organically in the context window.
Much like with Karpathy’s program.md you “program” the WORKFLOW.md with how you want the loop to run, then it executes autonomously.
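A minimal sketch of that fresh-context pattern, assuming a plain dict as the task tracker (this is not Symphony’s code; `fresh_agent_step` and `ralph_loop` are hypothetical names, and the “agent” here just picks the first open task):

```python
import json

def fresh_agent_step(context: str) -> dict:
    """Hypothetical stand-in for one agent run: it sees only the injected context."""
    state = json.loads(context)
    todo = [t for t in state["tasks"] if not t["done"]]
    if not todo:
        return {"completed": None, "note": "nothing left"}
    return {"completed": todo[0]["id"], "note": f"finished {todo[0]['id']}"}

def ralph_loop(tracker: dict, max_iterations: int) -> dict:
    for _ in range(max_iterations):            # fixed budget of iterations
        context = json.dumps(tracker)          # inject state from the tracker
        result = fresh_agent_step(context)     # no memory carried between calls
        if result["completed"] is None:
            break
        for task in tracker["tasks"]:
            if task["id"] == result["completed"]:
                task["done"] = True
        tracker["log"].append(result["note"])  # write state back to the tracker
    return tracker

tracker = {
    "tasks": [{"id": "a", "done": False},
              {"id": "b", "done": False},
              {"id": "c", "done": False}],
    "log": [],
}
ralph_loop(tracker, max_iterations=2)
print(tracker["log"])
```

Because each step is rebuilt from the tracker, a crashed or confused agent costs one iteration rather than poisoning a long-running context window.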
Designing the workflow feels like a genuinely different skill. It’s not writing the code, the agent does that. It’s not specifying the solution either; in many cases the agent does that too! It’s about designing an approach: how can the agent make progress with each turn of the crank? How can the environment give clean validation signal to the agent about its approach? Not easy, and not quite what we used to do either.
AKA agent teams, swarms, or whichever Mad Max movie Yegge is on today. Google’s new “Towards a Science of Scaling Agent Systems” paper is not keen on multi-agent systems though: “on tasks requiring strict sequential reasoning […] every multi-agent variant we tested degraded performance by 39-70%”. ↩︎
Condolences if your executives are currently pushing that as company strategy A. ↩︎
Mostly: it did engage in a bit of seed hacking, so has achieved postgrad status successfully. ↩︎
The current vibes in software engineering are a mix of crushing despair at years of accumulated personal skills being displaced by the CEO prompting some stuff, and crushing despair at years of corporate investment in an existing codebase that isn’t vibe-y enough. People worry whether the models will be effective in their programming language of choice, not on some general benchmarks.
One angle to approach that is to ask how well the language is covered by the distribution of the training data1. An interesting paper the other day gave a pretty clear idea of how to check: 1-shot some prompts against the base model and see if they ever get it right. Getting access to base models is not always possible, but you can certainly call the post-trained models with roughly the same idea: no tools, no iterations, just generate this program.
To try this, I2 wrote up 20 project-euler like3 puzzles of varying difficulties and had a few different models YOLO solutions in several languages. These ranged from common ones like Python to fairly rare ones like Zig and Hack.
After validating all the solutions, we can calculate some stats using pass@k: the chance that at least one of k attempts solves the problem. Here are some stats for pass@1: what % of the time can you expect the model to one-shot the solution:
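| Lang | GPT-4.1 Mini | Gemini 3 Flash | OLMo 3.1 | Kimi K2.5 | GLM-5 |
| --- | --- | --- | --- | --- | --- |
| Python | .93 | .99 | .72 | .97 | .98 |
| TypeScript | .94 | 1.00 | .43 | .95 | .95 |
| Go | .95 | .91 | .46 | .86 | .86 |
| Rust | .89 | .94 | .43 | .95 | .95 |
| Kotlin | .90 | .99 | .29 | .91 | .93 |
| OCaml | .76 | .86 | .08 | .94 | .90 |
| Zig | .14 | .55 | .00 | .79 | .88 |
| Hack | .43 | .76 | .05 | .47 | .68 |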
And here is the same thing for pass@128: what is the chance it is right at least once in 128 samples:
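| Lang | GPT-4.1 Mini | Gemini 3 Flash | OLMo 3.1 | Kimi K2.5 | GLM-5 |
| --- | --- | --- | --- | --- | --- |
| Python | 1.00 | 1.00 | .95 | 1.00 | 1.00 |
| TypeScript | 1.00 | 1.00 | .90 | 1.00 | 1.00 |
| Go | 1.00 | 1.00 | .85 | 1.00 | 1.00 |
| Rust | .95 | 1.00 | .88 | 1.00 | 1.00 |
| Kotlin | 1.00 | 1.00 | .59 | 1.00 | 1.00 |
| OCaml | .98 | 1.00 | .38 | 1.00 | 1.00 |
| Zig | .49 | 1.00 | .05 | 1.00 | 1.00 |
| Hack | .99 | 1.00 | .46 | 1.00 | 1.00 |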
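For reference, one common way to turn n samples per problem into a pass@k number is the unbiased estimator from the Codex/HumanEval paper. The sketch below shows it as an illustration; it is not necessarily how the tables above were produced, and the function names and example counts are made up.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate P(at least one of k samples is correct), given c correct out of n."""
    if k > n:
        raise ValueError("k cannot exceed the number of samples n")
    if n - c < k:
        return 1.0   # too few failures to fill k slots without a success
    # 1 - (ways to draw k samples that are all failures) / (ways to draw any k)
    return 1.0 - comb(n - c, k) / comb(n, k)

def mean_pass_at_k(correct_counts, n, k):
    """Average the per-problem estimates, which gives one cell of a table like the above."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)

# Hypothetical example: 4 problems, 128 samples each, with these many correct.
counts = [128, 64, 3, 0]
print(round(mean_pass_at_k(counts, n=128, k=1), 3))
print(round(mean_pass_at_k(counts, n=128, k=128), 3))
```

With k=1 this reduces to the plain fraction of correct samples, and with k equal to n it only asks whether any sample at all was correct.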
To make that a bit more visual, here is a per-language chart for GPT-4.1-mini:
Given enough chances GPT-4.1 Mini solves all the problems, in almost all the languages. Of course, we don’t actually know what GPT-4.1 Mini was trained on, but we do know what OLMo 3.1 was trained on, thanks to the wonderful folks at AI2. That means we can see how much code-specific data for each language there was4:
| Language | Code Corpus (GB) | Est. Tokens (B) | Category |
| --- | --- | --- | --- |
| Python | 60.40 | 17.3 | High-resource |
| TypeScript | 26.52 | 7.6 | High-resource |
| Go | 23.78 | 6.8 | High-resource |
| Rust | 9.11 | 2.6 | Medium-resource |
| Kotlin | 5.68 | 1.6 | Medium-resource |
| OCaml | 1.03 | 0.29 | Low-resource |
| Zig | 0.18 | 0.05 | Low-resource |
| Hack | 0.00 | 0.00 | Very-low-resource |
There is a pretty decent correlation between the presence of training data and the pass@k rates. But, importantly, it’s not 1: despite Hack having no StarCoder data and Zig negligible, the model clearly does know at least something about them. Given enough chances it has a decent chance at coming up with the correct answer for Hack, and a non-zero one for Zig:
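As a rough sanity check on that correlation, here is a sketch that computes a plain Pearson coefficient between code corpus size and the OLMo 3.1 pass rates, using the numbers from the tables above. Eight data points and a linear fit on very skewed corpus sizes is thin evidence, so treat the result as illustrative only; the `pearson` helper is written out by hand.

```python
from math import sqrt

# Code corpus size (GB) and OLMo 3.1 pass rates, copied from the tables above.
corpus_gb = {"Python": 60.40, "TypeScript": 26.52, "Go": 23.78, "Rust": 9.11,
             "Kotlin": 5.68, "OCaml": 1.03, "Zig": 0.18, "Hack": 0.00}
olmo_pass_at_1 = {"Python": .72, "TypeScript": .43, "Go": .46, "Rust": .43,
                  "Kotlin": .29, "OCaml": .08, "Zig": .00, "Hack": .05}
olmo_pass_at_128 = {"Python": .95, "TypeScript": .90, "Go": .85, "Rust": .88,
                    "Kotlin": .59, "OCaml": .38, "Zig": .05, "Hack": .46}

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

langs = list(corpus_gb)
sizes = [corpus_gb[lang] for lang in langs]
r1 = pearson(sizes, [olmo_pass_at_1[lang] for lang in langs])
r128 = pearson(sizes, [olmo_pass_at_128[lang] for lang in langs])
print(f"pass@1: r={r1:.2f}  pass@128: r={r128:.2f}")
```

Both coefficients come out clearly positive but below 1, and the pass@128 one is noticeably weaker, which is what you would expect if extra attempts let the model recover on languages it saw little code for.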
We have seen for human language that models learn a language substrate, enabling them to perform strongly even on tasks they haven’t seen such as translating between unseen language pairs. I suspect something similar happens with code: despite the language differences there is a logical programming substrate, and the model doesn’t need much exposure to the language in order to generalize to it.
Once you start giving the model multiple attempts, it gets into the right region quickly for the high-resource languages: with GPT-4.1 Mini, Python, TypeScript, Go and Kotlin saturate at k=10. The less-common languages continue to rise: the model can write valid OCaml or Zig or Hack but needs more attempts to stumble into the right region.
Thinking models flatten the curve substantially. Kimi K2.5 and GLM-5 both use high effort by default5, and that appears to give them multiple bites at the apple from internally exploring and self-correcting. By k=10 the models saturate all problems on all languages, though at the cost of a remarkable number of tokens6!
It’s also instructive to see the ways in which the models get it wrong. There were four patterns that showed up:
Ecosystem: One problem involved a sum of very large digits. GPT-4.1 Mini regularly used num::BigUint. This is a crate, not a standard language feature, and in an agentic loop would probably be a very valid choice but doesn’t strictly work. In contrast, GLM-5, a thinking model, implements digit-by-digit multiplication from scratch with Vec<u32>.
API confusion: The model knows roughly what the code should look like, but chooses the wrong API. For example, OLMo generated while ... do ... in, mixing OCaml’s while...do...done loop with Haskell’s do notation and OCaml’s let...in binding.
Surface-form invention: The model has a sense of how things stylistically look in the language, but doesn’t know the real API. GLM occasionally writes Zig with invented functions: std.mem.Allocator.alloc(usize, limit) (Allocator is a type, not a callable) or @intCast(usize, limit), which actually was valid syntax in earlier versions.
Systematic convention gaps: Models would regularly put in <?hh for the Hack samples, which broke in modern Hack.
My takeaway from this is that models learn to code, not just to reproduce syntax. That means you can almost certainly post-train or prompt your way out of most programming language problems with any frontier model: while some models were still pretty poor at Zig even with a lot of tries, Gemini most certainly was not. I doubt the folks at GDM spent a whole lot of time on Zig evals7.
A well pre-trained model has broad capabilities in programming, and it’s mostly a case of eliciting them rather than having to teach them.
I’m going to take as a given that models are good at generalizing within the distribution of their training data, and poor at generalizing outside it. This is not settled! Reasonable people can disagree! But, it’s a decent starting point. ↩︎
Not actually project Euler. I confirmed that the models never respond with an actual Euler puzzle answer in the incorrect ones, so I’m fairly (this is not good science) sure it wasn’t memorization. ↩︎
OLMo’s full training corpus (Dolma v1.7) includes a massive web crawl in addition to code-specific data from StarCoder, so the 0.00 GB for Hack means “absent from the code-specific training data”, not “absent from all training data”. Hack documentation and other content are almost certainly present in the web crawl portion. ↩︎
Gemini also reasons, but the 2.5 Flash model was doing minimal reasoning when answering. ↩︎
Somehow averaging over 3k per sample for GLM, I say while ruefully staring at my OpenRouter bill. ↩︎
By posting this on the internet I am guaranteed to be corrected, at length, by a Googler ↩︎
Every big software engineering team right now is racing to out-do themselves on their adoption of agentic coding practices, and ship faster. There is something more insidious going on with many of the software engineers I talk to1 though. A lot of pressure to build “more! faster!” comes from themselves.
This shows up all over: the “you only have 2 years to escape the permanent underclass” meme2, or the various breathless LinkedIn or Twitter posts of 996’ing startups, labs, or particularly obsessive interns.
Things that used to require teams can now be done by a sufficiently keen solo engineer with a gang of Claudes, or Codexes, or a K2 agentic swarm. That is thrilling, and it opens up the door to projects that you wouldn’t normally have bothered building. But it also opens the door to thinking you need to build those things, and that’s not quite the same.
One of the observations of most people that take an extended leave from a large corporation is that much of the work they were doing wasn’t all that important. Either no one did it while they were out, or how they left it was… fine. Yet, much of that work somehow regains urgency as they come back to the role.
It’s very hard to tease apart how much of your output actually matters. Coordinating a large group of people inevitably takes overhead, and so many annoying aspects of work are genuinely important. But, much like Wanamaker’s famous quote about advertising, half of the work you do doesn’t matter, the trouble is you don’t know which half.
Adding a helpful and harmless model to the mix can certainly accelerate the rate of output, but it doesn’t do much about determining which bucket the work goes into. In fact, I’d say that the problems you take on when given a Max subscription are mildly more likely to be things that haven’t been done because they are not worth doing. The combination of increased capacity and a pervasive sense of urgency is not a great recipe for quality decision making, or for a healthy relationship with your work.
It can be helpful to take the outsider perspective, at work or with personal projects. Would you ask someone else to do whatever you are considering, even with the expectation they would leverage agents to help them?
It’s often easier to see the value in something, or lack thereof, if you have to convince someone else of it. That can save you from some rabbit-holes filled with a sense of obligation to “extract value” from the time you already sunk into a misguided project.
This doesn’t mean you should ignore all of the ideas you have: you really can just do things, and you sometimes should! Just be clear about whether you want to spend your time3 that way, regardless of what the agent is doing.
I think the most important AI question is, at some level, how do you deploy it so that it is a genuinely positive force across a wide spectrum of people.
I like to tell a story to describe Why Are Things This Way, for some wide hand wave of the world right now and it goes like this: The post-Cold War era marked a renaissance in global trade, what some people call the Pax Americana. This period of globalization rested on two American pillars and one Chinese: the U.S. dollar’s status as the world’s reserve currency, the U.S. Navy’s command of maritime shipping lanes and the rapid development of highly scaled manufacturing.
Underpinning this system was a constellation of technologies: containerization1, ERP systems, advanced telecommunications, the financialization of assets, and cheap energy. China’s accession to the WTO in 2001 was the culmination of its Reform and Opening policy, with leaders like Hu Jintao embodying a sense of forward momentum. We were at the apex of Francis Fukuyama’s “end of history”: the belief that liberal democratic capitalism represented the final stage of human political evolution.
This felt like a rising tide that might, for once, actually lift all boats. Growth did materialize. We witnessed a substantial economic expansion that lifted millions out of poverty, most dramatically in China but also across dozens of countries where GDP and living standards surged.
It was easy to look at this trend line and extrapolate upwards. The most common objection to that extrapolation was that it relied on non-renewable, extractive energy and materials 2. But I think this argument was a mistake on both sides: globalization represented a step change: a one-time shift enabled by a unique convergence of technologies that amplified the principles of specialization and trade to an unprecedented scale.
These technologies were big leaps, but their diffusion unfolded gradually, over decades. This extended rollout created an illusion of continuous growth. As Jeffrey Ding argues in his excellent 2024 book Technology and the Rise of Great Powers, the critical factor is not which nation invents technology first, but which spreads it through their economy faster. The same principle applies globally: diffusion creates the feeling of growth, but it’s just the future being unevenly distributed. We thought the end of the Cold War ended history, but in reality it just gave us a really good logistics stack.
The Great Financial Crisis of 2008 was the first major crack, exposing a disconnect between elite consensus and public experience. Contagion from the U.S. subprime mortgage market rippled worldwide, shattering faith in both institutions and experts. Austerity measures inflicted deep pain on the median voter, while ZIRP boosted GDP figures and asset valuations, widening the gap between elite enrichment and broad-based prosperity. COVID-19 extinguished any lingering illusion of elite competence. Supply chains collapsed across critical sectors from masks to electronics to, oddly, toilet paper.
Today, in the post-pandemic, post-austerity landscape, we’ve seen a decisive shift toward realpolitik and narrow, short-term domestic political calculation: people are less trusting of “the system” and more receptive to those actively disrupting it.
If you ask a random person in Hayes Valley they’ll say that AI is a similar step change, maybe even larger: it could lead to flourishing prosperity or possibly doom everyone to being consumed by a rogue instance of Claude obsessed with the Golden Gate bridge. Unlike the internet or mobile phones AI is emerging in a volatile, multilateral world with a broken trust environment.
AI needs vast resources: data, compute, electricity, technical skills, integration and political support. The US and China have adopted somewhat divergent approaches to how to manage that.
The U.S. model emphasizes corporate AI accountability and regulation that favors large incumbents, restriction over compute resources through export controls, and voluntary safety frameworks developed largely by industry. In essence, they are asking the public to trust corporate institutions to manage AI safely, and to deliver the long-term societal benefits to consumers. It’s downstream of the way the tech giants like Microsoft, Google and Apple have navigated government before: hands off during rapid growth, then clear regulations to offer a stable business environment.
China’s Internet companies are under no question of who is in charge, particularly after the crackdowns on gaming and social media a few years back. The Chinese Communist Party is caught in a bind: AI aligns very well with the kind of hard science, foundational technology they want to prioritize, but is dependent on foreign technology and needs the kinds of data and skills that exist within the big social conglomerates they just tried to rein in. China is running a playbook of rapid diffusion and explosive competition, with the government putting heavy hands on the scale: who can buy which GPUs, what kind of content controls must exist, and a national level AI plan. It says to the public: trust in the party, and we will ensure AI delivers social benefit, for our definition of “social benefit”.
“One major question, going into 2026, is which party will speak for the Americans who abhor the incursions of A.I. into their lives and want to see its reach restricted. Another is whether widespread public hostility to this technology even matters, given all the money behind it. We’ll soon start to find out not just how much A.I. is going to remake our democracy but also to what degree we still have one.”
Goldberg is asking who will promise to assert state control for AI: who will bring the Chinese model to America. The fundamental problem for me is I’m not sure that the public trust the government much more than they do corporations, or the media.
If AI is a global-scale step change it requires global coordination, which is expensive: it’s something folks can engage with when everything is going well. When times are tougher, economic interdependence morphs into leverage for coercion and control. Blackwells and rare earths and SWIFT messages become chips on the bargaining table.
Billions of people use large language models, which gives the creators of those models influence in how people act. But thus far the influence on the models from their users is very indirect: aggregate usage patterns or occasional thumbs up/thumbs down feedback. Open Weight and Open Source models offered folks more control, but despite the slow death of scaling the ability to train and operate a true frontier model remains a very large hurdle.
What would it take for people to believe that this power is being used in a way that includes them, instead of being done to them? In a high-trust world you can do that with credentials and commitments. In this world you can’t. People don’t need safety and impact reports from labs or promises of benevolence from the state. They need leverage.
Trust can’t scale, but verification maybe can. That requires independent auditing, liability, and transparency around capabilities and how those capabilities are being deployed. Open weights help, competition helps, and national strategies help, but none solve the whole problem. We’re building machines that can reason. We also need to build systems where the people who own the machines can’t silently rewrite the terms of everyone else’s lives.
Back in 1817 David Ricardo published a very influential theory on an interesting question: Why trade, and particularly why trade when you are better at producing something than other countries?
He gave an example of England and Portugal, in a world where there were just two goods, wine and cloth. In England it took 100 person-hours to make one unit of cloth, and 120 to make one unit of wine. The Portuguese, on the other hand, took 90 hours to make a unit of cloth and 80 to make a unit of wine. England is worse at making both wine and cloth, so why trade? Why doesn’t Portugal just make everything for itself?
Well, it turns out that while England lacked the famed Portuguese efficiency, it was way worse at wine than it was at cloth. England could trade one unit of English cloth for one unit of Portuguese wine, which meant the wine cost them (effectively) 100 person-hours vs 120 they would have needed to make it themselves: a clear win! But Portugal won too: by focusing on wine rather than cloth they could trade 80 hours of work (for the wine) for some cloth that would have cost them 90 hours to make.
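The arithmetic is easy to check. Here is a small sketch using Ricardo’s numbers from above; the function names are made up for illustration, and it assumes the one-for-one cloth-for-wine swap just described.

```python
# Person-hours needed to make one unit of each good (Ricardo's example).
HOURS = {
    "England": {"cloth": 100, "wine": 120},
    "Portugal": {"cloth": 90, "wine": 80},
}

def opportunity_cost(country, good, other):
    """Units of `other` a country gives up to make one unit of `good`."""
    return HOURS[country][good] / HOURS[country][other]

def hours_saved_by_trading(country, export, imported):
    """Hours saved by making one unit of `export` and swapping it 1:1 for `imported`."""
    return HOURS[country][imported] - HOURS[country][export]

for country in HOURS:
    cost = opportunity_cost(country, "wine", "cloth")
    print(f"{country}: one unit of wine costs {cost:.2f} units of cloth")

# England exports cloth and imports wine; Portugal does the reverse.
print("England saves", hours_saved_by_trading("England", "cloth", "wine"), "hours")
print("Portugal saves", hours_saved_by_trading("Portugal", "wine", "cloth"), "hours")
```

Wine costs England 1.2 units of cloth but costs Portugal only about 0.89, so Portugal has the comparative advantage in wine even though it is faster at both goods. At a one-for-one swap England saves 20 hours per unit and Portugal saves 10.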
Ricardo described this as a comparative advantage: by leaning into their relative specialties, countries could benefit from trade, even if they are generally more efficient than their competitors. This was a clever insight, globalization happened, and we eventually ended up with Temu.
Of course, things are never quite as simple as economists’ models (annoyingly to economists the world over), and within his own lifetime there were some interesting wrinkles. Sticking with the textiles theme, one of them happened to weavers: people who took thread and turned it into fabric. There was a period, shortly before Ricardo published his theory, that some call the Golden Age of the handloom weaver. Spinning, turning material into threads, had been mechanized thanks to the Spinning Jenny, which made yarn cheaply available. Weavers became the bottleneck to turn that yarn into saleable cloth. Weavers worked from home, controlled their schedule, and made excellent money while doing so.
What changed next was the power loom1. Using the hand loom required dexterity and practice to master the shuttle and weave, but the power loom just needed someone to mind it and occasionally unjam things. Weavers’ earnings collapsed from around 20 shillings a week in 1800 to 8 shillings by 1820. The power loom enabled turning yarn into cloth efficiently and cheaply, without the need of years of deep skill and practice.
Ricardo was, at the end of his life, right there to observe the start of this transition, and in the third edition of his book Principles of Political Economy he added a chapter titled “On Machinery”. Comparative advantage says that if a machine comes out that is better at some job humans should move to a place where they are comparatively better (like fixing the machine). Ricardo realized that machinery could increase the profit for the factory owner while decreasing the gross income to workers: it shifted returns from labor to capital. The power loom took the primary asset of the weavers, their dexterity and practice, and made it economically irrelevant.
This feels worth discussing because in many ways software engineering has been going through a Golden Age of the handloom coder, particularly in the post-pandemic expansion from 2020-2022, where it was a very, very valuable skill indeed.
While SWE wages have yet to collapse to shillings, there has been a definite cooling through rounds of layoffs and shifts to capital expenditure, accelerated by the adoption of strong coding models. Generating syntactically correct code has become way cheaper, and the bottleneck that was shipping code to production is shifting from writing code to proving it is correct. There is still a huge amount that hasn’t changed: identifying requirements, making choices on implementation paths, and thinking about the overall system, but slinging code is becoming a different job, quickly. The primary beneficiaries so far are those selling the pythonic power looms: the big labs and key tooling and hardware providers.
In my own direct experience, coding assistance went from being a somewhat niche interest, one that required regular selling to VPs to keep them investing in it, to a top-level company mandate with accompanying metrics. The question I have found myself discussing recently with many smart engineers is: are we the weavers, or, you know, is everyone a weaver? Is this another industrial revolution like steam or electricity, or something perhaps even larger?
Steve Newman of the Golden Gate Institute for AI (and one of the creators of Google Docs) wrote up one of the best “maybe it’s different this time” posts I’ve read in a bit, and not just because it involves robots mining Ceres.
“I spend a lot of time in this blog arguing that AI’s near-term impact is overestimated, to the point where some people think of me as an AI skeptic. I think that predictions of massive change in the next few years are unrealistic. But as the saying goes, we tend to overestimate the effect of a technology in the short run, and underestimate it in the long run. Today, I’m going to address the flip side of the coin, and present a case that the long-term effect of AI could be very large indeed.”
The core of Newman’s argument is that AI is the first technology we have developed that could, potentially, be more adaptive than we are. As a way of illustrating, let’s stick with what everyone comes to this blog for: 19th century weavers.
Despite all of the above automation, weavers still had a role in more complex or limited-run designs where the expense and effort of setting up a power loom didn’t make sense. Then, the Jacquard loom made the design flexible: you specified the design by punching holes in a card [4] and the loom wove the design. The comparative advantage shifted away from weaving entirely, into designing and encoding. Pattern designers became some of the first programmers of mechanical systems as card punchers. The unique human advantage was adaptability: we added a level of flexibility, and the humans then adapted to work above this level.
Newman argues that AI is a cognitive loom: the power loom replaced dexterity and practice, the Jacquard loom made it flexible and adaptable, but someone still needed to punch the cards. Humans adapted and learned new skills. He suggests that AI might be able to learn those new skills faster.
“My point is simply that once AI crosses some threshold of adaptability and independence, there will be paths around the traditional barriers to change. And then things will really start to get weird.”
This doesn’t inherently invalidate the idea of comparative advantage, but it might make it practically irrelevant if the market value of the human advantage drops below the cost of subsistence. If a future AGI’s opportunity cost is tiny, maybe there just isn’t enough left for humans when it comes to matters of substance.
Comparative advantage is, fundamentally, about tradeoffs. Technology is our great lever of progress to remove some of those tradeoffs, but we have historically always run into more. Even if we were out mining asteroids with robots and building giant data centers autonomously there is still not infinite compute, and there is still not infinite time. There will always be some set of tradeoffs that have to be made, some range of competing options to choose between.
What is valuable or notable in that environment can look markedly different. To look at the Victorians again, the art world was significantly impacted by the advent of photography, as (within certain bounds) it effectively solved realism. Artists responded by developing impressionism: the comparative advantage they retained was subjectivity and emotional context. Even the most opium-enhanced Victorian futurist would have to be lucky to predict Cubism from reading about William Henry Fox Talbot.
Humans do seem to me to have a comparative advantage in some areas, particularly:
Reality
Desires
We are grounded as creatures in the world, not in textual or video inputs. We evolved in the world, and are richly adapted to it, in ways that are not always obvious, even to ourselves.
We also tend to view intelligence as being coupled to wanting things, because things notably less intelligent than us seem to want things, and we certainly have any number of desires. It might be true that an AGI wants things, but it’s not clear that it must be true. I feel even more confident that on the way to AGI we will build some pretty powerful systems that don’t really “want things” in the same way we do: they may be agentic, but they are not truly agents with goals absent human input.
Since we are already living in part of that future, I asked Gemini what it thought might be the human comparative advantage. As I hoped, it told me I was absolutely right:
“Since we (AIs) are designed to serve human intent, the scarcest resource for us is accurate data on human preference. If you can predict what humanity will value in 10 years (e.g., “Will we value privacy or convenience more?”), that information would be incredibly valuable to a superintelligence trying to optimize its resources.”
In a world of tradeoffs there will still have to be choices, and many of those choices are not easily, observably optimizable. Our ability to be in the world and have preferences might be the most valuable aspect of us after all. Maybe the role of the software engineer of the future, or perhaps of people of the future, isn’t so much doing work or even managing work, it’s instead curating the work.
One example of that kind of activity is a DJ: they create a vibe by arranging songs based on their taste and the response of the audience. Folks choose to go to certain DJs not because they are objectively better, but because they are who they are.
This might sound a bit silly, but in practice much of modern work is not so much about doing the thing as it is about doing the thing a certain way. Still, is the future of humanity collectively making sure the vibes are right? From a certain point of view, what we have always done, collectively, is build a culture. And what is culture other than the right vibes? Perhaps our future is just a continuation of our history, with new technologies, and new tradeoffs.
[4] As an aside, this influenced various other uses of punch cards for data storage, leading to IBM and thence to the fact that your terminal defaults to an 80-character width.
I don’t think even the most perceptive forecaster would have identified a 90s LucasArts video format as a flashpoint for a discussion of the state of security. We live in an age of generative AI agents rampaging through OSS though, and that seems to be what has happened.
Open source is one of the great triumphs in loose, global coordination. In most meaningful ways, proprietary software… lost. The scale and effectiveness of open source projects consistently outstripped closed source components, across the stack, leaving proprietary software mainly existing at the application level.
This also had the effect of shifting open source from being in contrast to corporate, top-down development of proprietary software to being deeply intertwined with it. The expectations and requirements intermingled volunteer-ish communities and profit-seeking businesses, leading to tension in several areas, including security.
Luckily, the loving grace of the megacorps invested in things like Google’s Project Zero to provide the type of security investments that need corporate-scale backing.
The flow for things like Project Zero looks like:
Investigate popular projects and find real security risks before the bad guys do
Share a report with the project, and give them time to fix it before disclosing it
If the project doesn’t fix it in a certain time, disclose it so that folks can work around the issue rather than being vulnerable to it.
That’s their mission: “make the discovery and exploitation of security vulnerabilities more difficult, and to significantly improve the safety and security of the Internet for everyone.”
Inherently, that’s a pretty good idea as the incentive for various bad actors is:
Investigate popular projects and find a real security risk
Tell no one
Use it (or sell it to the national intelligence agency of choice)
“Not long ago, maintaining an open source project meant uploading a tarball from your local machine to a website. Today, expectations are very different”
Today’s expectations include complex distribution infra, signed packages, deterministic builds, CI coverage across many types of hardware, and resilience against security concerns. These expectations aren’t unfounded: the PyPitfall paper (arXiv:2507.18075, “PyPitfall: Dependency Chaos and Software Supply Chain Vulnerabilities in Python”), released earlier this year, took an extensive look into one particular community:
“By analyzing the dependency metadata of 378,573 PyPI packages, we quantified the extent to which packages rely on versions with known vulnerabilities. Our study reveals that 4,655 packages have guaranteed dependencies on known vulnerabilities, and 141,044 packages allow for the use of vulnerable versions. Our findings underscore the need for enhanced security awareness in the Python software supply chain.”
As the world centralized around open source, some aspects of the infrastructure have scaled up, but the support and investment model really didn’t.
It’s very easy for the corporations building on OSS to treat it like an infinitely available good, especially when they don’t have to deal with the impact of their usage. Again, from the letter:
“Automated CI systems, large-scale dependency scanners, and ephemeral container builds, which are often operated by companies, place enormous strain on infrastructure. These commercial-scale workloads often run without caching, throttling, or even awareness of the strain they impose. The rise of Generative and Agentic AI is driving a further explosion of machine-driven, often wasteful automated usage, compounding the existing challenges.”
Because this code ends up in production for some very large products, maintainers end up as unpaid on-call. Folks with good intentions want to keep a library in healthy shape and feel the pressure of knowing that perhaps millions of people are (indirectly) depending on it. Then we mixed in AI.
The Big Sleep
The FFmpeg project is at the center of a storm right now about the demands from security research teams:
One of the big challenges is the security "research" community treats volunteer projects like vendors and gives them deadlines for release and sends no patches. https://t.co/PT0DZn8Nbz
Google have spent billions of dollars training Gemini, and a hefty chunk more on a project called Big Sleep: an agent to do the security research work at scale. That tool is exactly what the FFmpeg developers are reacting to, with issues like this one: “use-after-free write in SANM process_ftch” [440183164].
The vulnerability is in a codec for the LucasArts SMUSH format, which was used in games like Grim Fandango: a security risk targeting a very narrow group of people in their 40s. In a world of human researchers, I suspect that neither attacker nor researcher would have spent much time on that codec.
For an AI agent, it’s feasible to scale up the search if you have the compute and model resources, which Google do. So now that (very real!) vulnerability is documented. That also scales up the demands on maintainers, who don’t have the equivalent billions to do research into generative AI security patch systems.
Security has always been asymmetric, in that it’s easier to break than to build. Scaling up discovery tips that scale off the table. The bulls are in the bazaar, finding vulnerabilities in code for rendering 1995 Rebel Assault 2 cutscenes, and the maintainers just want someone to help clean up after them. Global-scale coordination on global-scale problems remains hard.
There’s a chart making the rounds that caused Tim Lee over at Understanding AI to rewrite his recent (excellent!) article about the impact of AI on jobs. Stanford’s Erik Brynjolfsson and colleagues found that young workers in AI-exposed jobs have seen their employment drop by 13% since ChatGPT arrived. Meanwhile, their older colleagues in the same fields are doing just fine.
[…] the youngest workers saw dramatic job losses—but only if they worked in occupations (like accountants or computer programmers) that were highly exposed to AI. Young workers in less exposed occupations (like nurses or construction workers) saw normal employment growth over the same period.
From a tech industry perspective, it’s a little hard to disentangle the impact of reduced hiring after layoffs from the growth of AI, but likely both had an impact. AI coding agents are making it easier to complete the kind of introductory tasks that might have been left for junior engineers.
New grads don’t just do simple tasks though: they grow and develop tacit knowledge of their company and industry, which raises the question of whether this is permanent disruption or temporary dislocation as the skills needed shift. As Tim calls out:
It’s important not to read too much into this research. Workers between the ages of 22 and 25 are a small slice of the job market, and their employment has always been more volatile than for older workers. When I graduated with a computer science degree in 2002, the economy was just emerging from the recession that followed the dot-com bubble. It was a hard time for a young adult to get their first programming job, but most of my peers eventually found work in the field.
To give an analogy, there was a time when becoming a junior programmer meant learning how to write fast code as cycles were too important to waste. Now, writing particularly efficient code is largely the preserve of specialist, more senior people: some folks opt in to that route early because of their personal interests, but in general raw performance of code is not the blocking factor to building something valuable.
My sense is we are seeing the same thing in terms of general “program composition”: senior folks with experience on large, collaborative projects can benefit from LLM automation as they understand how to put in the right project guardrails and how to translate needs into technical direction. Junior people are still mostly trained how to write working code, and that need has become less pressing as LLMs have proved moderately competent at it.
Rodney Brooks, the robotics legend, made a point back in 2018 that stuck with me: it’s not automation that disrupts workers, it’s digitalization. In his article, Brooks wrote:
Digitalization is replacing old methods of sharing information or the flow of control within a processes, with computer code, perhaps thousands of different programs running on hundreds or thousands of computers, that make that flow of information or control process amenable to new variations and rapid redefinition by loading new versions of code into the network of computers.
An example that Brooks uses is bridge toll takers. This happened directly on the Bay Bridge between San Francisco and Oakland, which used to employ toll takers in booths. Then FasTrak was added, letting drivers pass through without interacting with anyone, while cash tolls remained for those without it. Now, between that and invoices mailed to drivers whose license plates are read by cameras, the tollbooths are empty.
LLMs also digitalize. Task descriptions and project documentation, for example, have been stored in human language: digital, but not particularly accessible to automation. Much of the work of managing a large bug tracking system has been in adding metadata that is accessible to automation. LLMs digitalize language, imperfectly to be sure, but enough to expose new swathes of work to automation.
High Road/Low Road
How will companies respond? Thomas Kochan at MIT has been mapping this kind of choice for years, and describes the separation between what he called the high road and low road.
The language that was used to differentiate these two approaches quickly evolved to a comparison of “high road” and “low-road” business strategies and “high-performance work systems,” which viewed labor as an asset, versus “command and control” systems, which viewed labor as a cost like any other factor of production. A comparison of the business strategies of two household names, Walmart and Costco, illustrates the differences between low-road and high-road business strategies. Walmart has been extremely successful (when judged solely on the grounds of finances and shareholder value) by pursuing a business strategy best captured by its marketing tag line: “Every day low prices.” To achieve this strategy, it places top priority on minimizing and tightly controlling labor costs, discouraging long-term tenure of its “associates,” investing little in training and development, and avoiding unions at all costs. Costco’s business strategy places a higher value on product quality and customer service, and to achieve these objectives it pays higher wages, invests more in training its work force to understand and serve customer needs, and has longer tenure patterns (and thus lower turnover costs). As a result, Costco’s employees are more productive, stay with the firm longer, and have more discretion to use their time and knowledge to solve customer problems.
Tech companies have, for the most part, been high-road employers. Employees have been an asset, and in some cases the key asset of the company. The low road though is not simply driven by cost cutting, it’s about control. Having a more fungible, replaceable workforce gives executives more options. Having more specialized, skilled workers offers more flexibility in how work is done, but shifts control to the workers and away from management.
We can see this play out in some of the post-pandemic cultural changes. There is a concept in work called deskilling, where work is atomized to improve efficiency: take something which was a skill and divide it up until the individual components become unskilled. Classic examples are in factory work, where a skilled person is replaced with an operator of a machine, or more often a series of operators of a series of machines. This trades a higher up-front cost in terms of capital and procedure development for a lower labor cost, transferring both money and power from workers to managers.
A recent article extended this concept to virtues, with the idea of “moral deskilling”. A virtue is a positive behavior, such as building responsibly or with high quality. Virtues tend to be individual qualities, things we recognize and reward in others: much of culture in a company is about inculcating virtues. That is inherently messy and the idea of systematizing virtue is appealing: move from a fuzzy, personal conception to a verifiable checklist or a rule that can be followed. This worked in a lot of cases, but it also enabled a form of deskilling:
Systematising virtue handed control to managers. Who, endlessly mistrusting these expert folk who were always trying to do things the expensive way, converted that mistrust into endless, endless paper work.
It was endless because it broke every little aspect of what had been virtue into tiny components. Fearful of losing control of any scrap of virtue, managers needed to relentless check on every little task.
If we want to see this play out in real time we can look at the return-to-office mess in tech. A vibrant, collaborative office culture is a good thing, and it requires a compact. Employees would deal with the misery of a commute (particularly in the SF Bay Area), but in exchange they would participate in an environment where they could learn and teach, build camaraderie and so on.
When the idea of a return to office happened post-pandemic, people had found pleasure and benefit in not doing the commute. When they returned, they found the offices less vibrant, the workforce more distributed, and cost-driven reductions in space making the experience harder through shortages of meeting rooms or desks.
Compounded by a series of layoffs and a change in the prior relationship between company and employee, the in-office deal felt worse. Frustrated with the lack of the old compact, management exerted control through systems. They set required days and logged attendance through badge-ins. Workers responded by treating the atomized requirements as mere requirements, not aspects of a culture: even a small percentage of folks coffee badging or trying to work from more convenient offices was visible in the empty desks, exacerbating tensions for workers “doing the right thing”.
Rather than analyze the problem and step back, management in many cases doubled down on systematizing: validating time at desk, logging badge-out times or adding similar extra controls. This continued to take what had been a morally complex set of trade-offs and reduce it to a checklist. For many newer staff, that was the in-office experience.
This is the essence of the low road: prioritizing the systematized and legible over the messy and complex, but more interesting, world of dealing with real people; prioritizing power and control over exploring new outcomes.
One way to view what’s happening is through the lens of debt, which is one of the angles in a recent position paper that frames the future of work as an AI Safety risk. Every time a company chooses to replace junior workers with LLMs rather than training them, they’re borrowing against the future. Matt Garman of AWS was pretty clear on his position:
“I was at a group, a leadership group and people were telling me they’re like we think that with AI we can replace all of our junior people in our company. I was like that’s the like one the dumbest thing I’ve ever heard […] They’re probably the least expensive employees you have. They’re the most leaned into your AI tools and like how’s that going to work when you go like 10 years in the future and you have no one that has built up or learned anything.”
But understanding something and acting on it are different things. Both the low road and high road can lead to a lot of success in business, but I do hope we can navigate this transition towards a place where the craft can be retained in software development. The question is whether enough companies will choose the messy, complex work of developing people over the appealing simplicity of trying to replace them.
While defense policy and research are a ways outside the scope for me (or, I imagine, most folks reading), the problems of managing or working on uncertain, research-y projects in a volatile environment are pretty relatable:
Most of what we know from cognitive psychology and human judgment research over the last 50 years suggests that unstructured group deliberation might be one of the worst ways of making judgments, yet it’s the norm in most institutions.
Or this bit of career wisdom:
In general, people underestimate their own potential to make contributions to the most important problems. They overestimate how many people are already working on the most important problems. So many incredibly important problems are just really neglected. If you can’t figure out who’s working on something after a few days of homework, then it is probably a neglected problem. And it’s probably up to us to solve it.
Jason talks about looking for projects in the Goldilocks zone of probability (less than 50%, more than 5%) that open up interesting opportunities. I worked with a manager who was a strong advocate of the Heilmeier Catechism to evaluate projects, and have seen the value of using it as guidance when presenting and evaluating ideas:
What are you trying to do? Articulate your objectives using absolutely no jargon.
How is it done today, and what are the limits of current practice?
What is new in your approach and why do you think it will be successful?
Who cares? If you are successful, what difference will it make?
What are the risks?
How much will it cost?
How long will it take?
What are the mid-term and final “exams” to check for success?
Jason adds some interesting updates:
For instance, the Heilmeier questions don’t have a question about counterfactual impact: “Would this work get done otherwise?” The office tends not to rigorously assess the other funding streams going to solve this particular problem, and their likelihoods of success.
We also tend not to think much about strategic move and countermove. […]. It probably is prudent to assign at least a 10% probability to some exquisite, classified technology being stolen.
One thing I found myself talking about this week with a couple of folks was how good people get “lucky” a lot. I think these kinds of questions help navigate towards those more positive-surprise-filled spaces.
Don’t blindly tie every piece of work to top-level metrics. Even if technically feasible, the cost is too high and the risk of spurious logic chains significant.
Start with Value Definition
Begin each project with a crisp definition of why it’s worth doing. What underlying problem are you solving, and why is that problem worth solving? Once you have these narrative assertions, it’s usually clear how extraordinary or controversial each claim is.
The more notable the claim, the more likely you need data to support it.
Value Metrics
1. Direct outcome metrics (strongest): We will run an ongoing experiment measuring profit per user with the feature on vs baseline.
2. Strong correlative metrics: This will increase time on site, which we can measure and believe correlates with profit per user.
3. Anecdotes and feedback: N sales team members report they could sell into more accounts if we launch this feature.
4. Strategic assertions: We must do this because of upcoming regulation or we will be unable to continue this business line.
Progress Measurement
Once your value claim is clear and defensible, identify how you’ll measure progress. This may differ from your value metric. Ideal progress metrics tell you whether you’re succeeding, respond quickly to team actions, and have strong reference baselines.
1. Clear baseline, goal, and team-tied metric (strongest): Launching this compressor will reduce binary download size by an estimated 10% vs the best available industry baseline. We can measure relative progress continually against our production binary during development.
2. Responsive metric without clear reference point: We can improve compile times on this fixed codebase from today’s 90s baseline.
3. Non-responsive metric: We can measure weekly mobile app release binary size, comparing the new compressor to our old implementation.
4. Milestones: We will implement passes a, b, c, after which we can ship the new compiler optimizations to target customers.
Common Challenges
Stronger measures aren’t always a worthwhile tradeoff. If you have high confidence in the work’s value and applicability and mainly need to validate progress, milestones can be completely reasonable.
In general, approach projects with skepticism about whether this is the right thing to do and whether you’ll make good progress. Then identify ways to get concrete data rather than rely solely on leadership support.
Leadership pressure for top-level metrics: The clearer you are on why you’re doing something, the easier it becomes to communicate your measurement decisions. If leadership continues pushing back and you have a good relationship, use that as a lens to explore concerns you might have missed. Often requests for metric clarity stem from deeper worries about project value or plausibility.
Team dynamics and gaming: Metrics communicate value in performance reviews, creating incentives for engineers to identify unnecessary correlations or game metrics (intentionally or not). Counter this with “health” metrics that counterbalance negative behaviors: if measuring deployment frequency, also measure production incidents with a flat target to prevent trading off reliability for speed.
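As a rough illustration of pairing a progress metric with a health metric, the sketch below only counts a deployment-frequency gain as a win if production incidents stayed flat. The class, field names, and thresholds are hypothetical, chosen for this example rather than taken from any particular tool.

```python
from dataclasses import dataclass


@dataclass
class PeriodMetrics:
    """Hypothetical per-period numbers for one team."""
    deploys_per_week: float
    incidents_per_week: float


def evaluate(baseline: PeriodMetrics, current: PeriodMetrics,
             incident_tolerance: float = 0.0) -> str:
    """Credit a speed gain only when the health metric holds its flat target."""
    speed_improved = current.deploys_per_week > baseline.deploys_per_week
    health_held = (
        current.incidents_per_week
        <= baseline.incidents_per_week + incident_tolerance
    )
    if speed_improved and health_held:
        return "win"
    if speed_improved:
        return "traded reliability for speed"
    return "no improvement"


baseline = PeriodMetrics(deploys_per_week=5, incidents_per_week=1)
print(evaluate(baseline, PeriodMetrics(deploys_per_week=8, incidents_per_week=1)))
print(evaluate(baseline, PeriodMetrics(deploys_per_week=8, incidents_per_week=3)))
```

The design choice is that the health metric acts as a gate rather than a second score to maximize, which keeps the flat target from becoming something else to game.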
For senior engineers concerned about optics, have them clearly articulate the value chain. Working through and demonstrating strong data usage in project steering is itself a highly rewarded skill — encourage them to take on that role.
When to revisit metrics: The world changes. What’s sensible in one environment isn’t in another. Constantly relitigating is a headache, but reevaluating logic at regular intervals (say, each half-year planning cycle) is appropriate. Otherwise, maintain awareness of company trends: new projects, initiatives, or teams gaining significant attention. If they were successful, would that change what you do? There’s no hard rule: it’s corporate decision-making taste that develops with experience.
Some technologies are obligate in a competitive environment
The example of his that stuck with me was the plough: many cultures were animistic (a belief in the spirit of the animal), but after the scaling up of agriculture enabled by the plough, most weren’t. The plough’s enablement of large-scale agriculture likely shifted societies toward sedentism (vs nomadism) and surplus, altering spiritual relationships with animals as they became tools for labor. The perspective shift — the value it encodes — is embedded in the technology.
The plough is also obligate. If one group uses it and another doesn’t, the group that does will be able to farm more per person. That surplus enables more specialization, which yields an advantage either in terms of trading or conflict. If the second group doesn’t adopt the plough they will be taken over, outgrown, or both, by the first.
AI may well be an obligate technology, which forces us to make deliberate ethical choices about its deployment and values. We are in the early stages of seeing that with software development. That’s going to change the nature of certain careers: changing what the day-to-day work looks like and impacting demand for software engineers. That isn’t necessarily negative: it will depend on the opportunities that replace the current ones. It also isn’t neutral: our approach to AI, how we deploy it, how it is used are all a series of choices that embed values.
Some of those values are encoded into the models by the training data and loss functions, some are encoded in the systems engineering, the choices of which tasks to apply it to, which interactions to explore and so on, and some are explicitly engineered in through fine tuning and reinforcement learning.
One way of looking at those values is through the study of ethics, how to live in a just way. This is a core topic for philosophers. One example is Kant’s Categorical Imperative, which requires actions to follow maxims that could be universally applied without contradiction, ensuring rational consistency.
It’s somewhat akin to asking the question: Would I still support this if I knew everyone else would act this way? Further, would I support this action if I knew I would be born again randomly into the world, maybe in a much different situation from my current one?
The proliferation of useful AI agents adds a somewhat realistic flavor to the question: if, in the future, you are dependent on systems constrained by these specific guidelines or rules, are you happy about that?
Kantian (or deontological) thinking is far from the only ethical system. A lot of thinking about AI ethics has been consequentialist. Consequentialism is practical: the “goodness” of an action is whether it results in a good outcome! Inherently we judge AI training (at least for RL and supervised learning) by the achievement of the outcome encoded in a loss function, reward function or similar. Stuart Russell (of Russell & Norvig fame, from the university courses of my youth) has written about “provably beneficial” AI where the AI maximizes a human-involved reward signal (a little like the Assistance Games pattern we discussed before).
The downside of all this is well documented: Nick Bostrom’s famous paperclip maximizer thought experiment is an AI that achieves the objective, but in a way that was undesirable. A more benign but annoying example might be a cleaning robot that pushes everything outside the house in order to make it tidier. Because outcome-based rules just judge the what, and not the how, they can also encourage power-seeking (as called out by Bostrom) in order to better achieve objectives.
standard forms of consequentialism recommend taking unsafe actions when such acts maximize expected utility. Adding features like risk-aversion and future discounting may mitigate some of these safety issues, but it’s not clear they solve them entirely.
Anthropic’s constitutional AI approach can be seen as a blend of approaches; the constitution is a set of principles that can be used by another AI to criticize and improve output in response to requests:
As AI systems become more capable, we would like to enlist their help to supervise other AIs. We experiment with methods for training a harmless AI assistant through self-improvement, without any human labels identifying harmful outputs. The only human oversight is provided through a list of rules or principles, and so we refer to the method as ‘Constitutional AI’.
The training still ultimately uses a form of reinforcement learning (which is inherently consequentialist), but the reward is given according to how well the outputs adhere to the constitutional principles.
A more recent philosopher, Derek Parfit, argued that all moral systems were hill climbing towards a shared perspective, and that you can evaluate an action against several of them in order to gain confidence. For example, when considering an option, you could ask:
a) Would it maximize overall good? (consequentialist)
b) Could everyone rationally will it? (Kantian)
c) Could anyone reasonably reject it? (contractualist) [1]
“Rationally” here is doing a bit of work: it means “with reasoning”, as in there is a chain of thought that can support and justify the decision.
Part of the challenge with rationalism is that part of the reward signal here is coming from human raters. We have seen this play out with LMSys, where models which are “friendlier” score better, and in a more extreme version in the GPT-4o misalignment in ChatGPT, where the model became excessively sycophantic in a way that earned better rewards in short doses and didn’t show up in any of the quantitative evaluations, despite being an overall negative for the experience.
As we move into more agentic systems we often have fewer tools to evaluate or make visible the values we are encoding, but we are still doing it!
For example, Google’s recent AlphaEvolve project uses Gemini underneath, which is an LLM that can be evaluated and aligned. But on top of that it uses an evolutionary search scheme (another reminder of Rich Sutton’s bitter lesson) to generate different prompts and evaluations and iterate towards a new, externally defined goal: in that case generating better algorithms and code. We are searching for superior outcomes, but that search itself is somewhat unconstrained by other values: it’s a more consequentialist approach.
The current crop of agentic coding tools often recommends encoding preference data into a project-specific file. For example, Claude Code recommends a CLAUDE.md file:
Include frequently used commands (build, test, lint) to avoid repeated searches
Document code style preferences and naming conventions
Add important architectural patterns specific to your project
CLAUDE.md memories can be used for both instructions shared with your team and for your individual preferences.
While it presents them as memory, the idea here is to guide the choices of the model in a way that aligns with the principles by which the project being modified is managed.
we argue that one of the primary vulnerabilities underlying these attacks is that LLMs often consider system prompts (e.g., text from an application developer) to be the same priority as text from untrusted users and third parties. To address this, we propose an instruction hierarchy that explicitly defines how models should behave when instructions of different priorities conflict.
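To make the idea in that quote concrete, here is a minimal sketch of resolving conflicting instructions by the priority of their source. The `Priority` levels, the `Instruction` type and the `resolve` function are hypothetical names invented for illustration; a real model would presumably learn this behaviour rather than apply a hard-coded rule like this one.

```python
from dataclasses import dataclass
from enum import IntEnum


class Priority(IntEnum):
    # Hypothetical ordering: a higher value wins a conflict.
    TOOL_OUTPUT = 0  # untrusted third-party text (web pages, tool results)
    USER = 1
    SYSTEM = 2       # text from the application developer


@dataclass(frozen=True)
class Instruction:
    source: Priority
    setting: str  # what the instruction is about, e.g. "language"
    value: str    # what it asks for, e.g. "French"


def resolve(instructions: list[Instruction]) -> dict[str, str]:
    """For each setting, keep the value from the highest-priority source.

    Ties between equal priorities go to the later instruction.
    """
    winners: dict[str, Instruction] = {}
    for ins in instructions:
        current = winners.get(ins.setting)
        if current is None or ins.source >= current.source:
            winners[ins.setting] = ins
    return {setting: ins.value for setting, ins in winners.items()}


conversation = [
    Instruction(Priority.SYSTEM, "reveal_system_prompt", "no"),
    Instruction(Priority.USER, "language", "French"),
    # Injected text from a tool result tries to override both of the above.
    Instruction(Priority.TOOL_OUTPUT, "reveal_system_prompt", "yes"),
    Instruction(Priority.TOOL_OUTPUT, "language", "English"),
]
effective = resolve(conversation)
```

In this toy conversation the injected tool text loses both conflicts, so `effective` keeps the developer's and the user's choices.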
As well as using a single model that can incorporate different safeguards, we can use models themselves to verify actions and outputs. Verification is generally an easier problem than generation, so a model that is unable to consistently follow a set of principles may still be able to validate whether a given example does or does not follow them.
LlamaGuard is a good example of this kind of system, built and released by Meta’s GenAI team alongside Llama. One example of seeing this process in the wild is OpenAI’s safety systems on 4o image generation. Inherently agentic, 4o can generate image ideas, then the image itself. Despite the model having constraints on it, it will happily generate things which violate OpenAI’s content policy, necessitating a monitoring model that whisks them away before a user can access a violating image.
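The generate-then-verify pattern described above can be sketched roughly as follows. Everything here is a stand-in: `toy_generator` and `toy_verifier` are hypothetical placeholders for a generative model and a separate classifier model, and the keyword blocklist is only there so the example runs without any model at all.

```python
from typing import Callable, Optional

# Hypothetical policy: a toy blocklist stands in for a real classifier model.
BLOCKED_TERMS = ("violence", "weapon")


def toy_verifier(output: str) -> bool:
    """Return True if the output passes the (toy) content policy."""
    lowered = output.lower()
    return not any(term in lowered for term in BLOCKED_TERMS)


def toy_generator(prompt: str) -> str:
    """Stand-in for a generative model: it just restates the request."""
    return f"An image of {prompt}"


def guarded_generate(
    generate: Callable[[str], str],
    verify: Callable[[str], bool],
    prompt: str,
) -> Optional[str]:
    """Run the generator, then release the output only if the verifier approves."""
    candidate = generate(prompt)
    return candidate if verify(candidate) else None


allowed = guarded_generate(toy_generator, toy_verifier, "a cat on a sofa")
blocked = guarded_generate(toy_generator, toy_verifier, "a weapon cache")
```

The point of the split is that the verifier only has to answer a yes/no question about a finished artifact, which is the easier problem, and it can be swapped out without touching the generator.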
If AI becomes an obligate technology, we will benefit from encoding values intentionally, balancing outcomes, universal principles, and fairness. The challenge is ensuring these choices reflect the world we want, not just the one we’re competing in.
[1] Contractualism is another theory of ethics, one that weights mutuality heavily: it frames ethical considerations as something derived between people rather than just based on outcomes or on abstract principles. It’s featured particularly in Scanlon’s What We Owe to Each Other, for those who, like me, get all of their ethical understanding from watching The Good Place.
Having recently spent a lot more time around typecheckers, I’ve been reading about the Language Server Protocol that glues IDEs and language support services together. The article, from September last year, gives a breakdown of the good and the bad about the protocol, and is a really great dive into the broader topic.
Much of the pain stems from how the protocol emerged and is managed:
There is zero open discussion of features before they are added to the spec. Typically they are implemented in VSCode, and then the specification is updated as a fait accompli to document those changes. Implementers of open-source language servers get very little influence on the development of the specification. There is not even a community space for implementers of language servers to get together and talk about the many tricky corners.
I feel echoes of this in a lot of different projects I have been around, including PHP internals, ZeroMQ’s protocol, various CNCF working groups, PyTorch and Triton. Protocols and technologies emerge from a need, and grow because that need is shared, but transitioning from a narrow and highly connected problem source to a true standard is difficult: attempt to standardize and bring in voices too early and you just slow down progress to the point something else emerges which solves immediate needs better; leave it too long and the governance questions can be sufficient to encourage folks to rally around forks or alternatives.
One example of that playing out at the moment is Anthropic’s (and dsp’s!) Model Context Protocol. Tim Kellogg wrote a nice post the other day comparing it to OpenAPI, concluding:
Standards are mostly sociological advancements. Yes, they concern technology, but they govern how society interacts with them. The biggest reason for MCP is simply that everyone else is doing it. Sure, you can be a purist and demand that OpenAPI is adequate, but how many clients support it?
The reason everyone is agreeing on MCP is because it’s far smaller than OpenAPI. Everything in the tools part of an MCP server is directly isomorphic to something else in OpenAPI. In fact, I can easily generate an MCP server from an openapi.json file, and vice versa. But MCP is far smaller and purpose-focused than OpenAPI is.
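As a rough illustration of the isomorphism claimed in that quote, here is a sketch that maps a single OpenAPI operation onto an MCP-style tool definition. It assumes the OpenAPI fields `operationId`, `summary` and `parameters`, and the MCP tool fields `name`, `description` and `inputSchema`; the function name and the sample operation are hypothetical, and request bodies, `$ref` resolution and authentication are deliberately ignored.

```python
from typing import Any


def openapi_operation_to_mcp_tool(operation: dict[str, Any]) -> dict[str, Any]:
    """Map one OpenAPI operation object onto an MCP-style tool definition.

    Simplification: only `parameters` are handled; `requestBody`, `$ref`
    resolution and auth are ignored.
    """
    properties: dict[str, Any] = {}
    required: list[str] = []
    for param in operation.get("parameters", []):
        # Each parameter's JSON Schema becomes a property of the tool input.
        properties[param["name"]] = param.get("schema", {})
        if param.get("required", False):
            required.append(param["name"])

    return {
        "name": operation["operationId"],
        "description": operation.get("summary", ""),
        "inputSchema": {
            "type": "object",
            "properties": properties,
            "required": required,
        },
    }


# A hypothetical operation, as it might appear in an openapi.json file.
get_weather = {
    "operationId": "getWeather",
    "summary": "Get the current weather for a city",
    "parameters": [
        {"name": "city", "in": "query", "required": True,
         "schema": {"type": "string"}},
        {"name": "units", "in": "query",
         "schema": {"type": "string", "enum": ["metric", "imperial"]}},
    ],
}
tool = openapi_operation_to_mcp_tool(get_weather)
```

Note what gets dropped on the way: the HTTP method, the path and where each parameter lives (`in`) are not part of the tool definition, which is one way of seeing why the tool description is the smaller of the two.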
Alex Danco is back with a rumination on what is becoming abundant and what that makes scarce. The two takeaways I had were:
a) How we think about code changes: it becomes a more malleable, more immediate driver of value rather than some inherent store of value itself.
b) The easiest place to deploy agents is in areas where you are already doing the work, and where that work is executed by somewhat lower-accountability folks (e.g. vendors, contractors, maybe very junior staff).
It might be a bit too flippant to say, “We’re evolving from a mindset where the codebase is capital (the past few decades of software) and into a mindset where code is labor.” But this is a blog post, so it suits the medium. And it suits today’s energy: new projects and startups are writing a lot more code on the basis of “does this make me money now” (what Simon Wardley would’ve once called “worth-driven development”): the codebase is more like a workforce to be trained than like a factory line to be architected. You still want to put thought in it, but it’s a different kind of thoughtfulness. And you expect profit generation out of it a lot more aggressively than with patient capital deployment.
The two points are connected. Most very large companies have a real but unspoken delineation between different types of code. Some code is absolutely bullet proof, very closely monitored, and staffed for proactive, ongoing maintenance and enhancement. And some code is… not that.
We generally don’t explicitly call out the gradations, but we do put different degrees of gating around things. Getting better at both the identification and fencing of critical things (load bearing vs decorative, from a business point of view) will be necessary to unlock the pace of potential improvements via agentic systems, and to focus human attention on the areas where there is a very high inherent complexity/ambiguity.
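One way to make those gradations explicit is a small policy that maps changed paths to a gate. This is only a sketch under invented assumptions: the path patterns, the two gate names and the `required_gate` function are all hypothetical, and a real organisation would have more tiers and better signals than file paths.

```python
from fnmatch import fnmatchcase

# Hypothetical patterns marking business-critical ("load bearing") code.
CRITICAL_PATTERNS = ("billing/*", "auth/*")


def is_load_bearing(path: str) -> bool:
    """True if the path matches any critical pattern."""
    return any(fnmatchcase(path, pattern) for pattern in CRITICAL_PATTERNS)


def required_gate(changed_paths: list[str]) -> str:
    """Pick the strictest gate implied by any file in the change."""
    if any(is_load_bearing(path) for path in changed_paths):
        return "human-review"
    return "automated-checks"


mixed_change = required_gate(["billing/invoices/tax.py", "docs/readme.md"])
docs_change = required_gate(["docs/readme.md"])
```

A change that touches anything load bearing gets the stricter gate even if most of it is decorative, which keeps human attention on the places where it matters.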
A couple of conversations I was in last week around agentic system design reminded me of Brandon Smith’s excellent “Write libraries, not frameworks” post:
A library is a set of building blocks that may share a common theme or work well together, but are largely independent.
A framework is a context in which someone writes their own code. This could take the form of inversion-of-control, a domain-specific language, or just a very opinionated and internally-coupled library.
[…]
So here’s my point: frameworks aren’t always bad, but they are a much bigger risk – for both the creators and the users – than libraries are. If your framework can be a library without losing much, it probably should be
I feel like we have seen this in ML around general-purpose training loops. Everyone training a model needs a training loop, and there are a lot of commonalities (dataloading, checkpointing, observability and so on). It’s very tempting to build a general training loop that many different groups can use. Unfortunately, this is almost inevitably a framework, rather than a library, and inherently hard to compose.
When the needs of the modeler exceed the bounds of the framework they either have to make extensive changes or drop the framework and move to a more bespoke set up. In practice this seems to result in a handful of training frameworks that are somewhat domain specialized: for example, a recsys training framework, an LLM training framework, a multimedia/vision oriented framework and so on.
My sense is the same pattern will happen with agentic loops. A single “one-size-fits-all” agentic framework can feel too broad or rigid, and many teams will carve out domain-specific variants to get the features they truly need. Ideally, we will identify some truly generic components that can be built out library-style, and composed for the domains that we need.
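A minimal sketch of what library-style might mean for an agentic loop: small, independent helpers, with the caller owning the loop rather than handing control to a framework. All of the names below are hypothetical, and `toy_policy` is a stand-in for a model call so that the example runs on its own.

```python
from typing import Callable

# Small, independent building blocks (library-style). All names are hypothetical.


def truncate_history(history: list[str], max_items: int) -> list[str]:
    """Keep only the most recent items."""
    return history[-max_items:]


def run_tool(tools: dict[str, Callable[[str], str]], name: str, arg: str) -> str:
    """Look up and call a tool, reporting unknown tools as text."""
    tool = tools.get(name)
    return tool(arg) if tool else f"unknown tool: {name}"


def is_done(step: str) -> bool:
    """A step starting with 'final:' ends the loop."""
    return step.startswith("final:")


def toy_policy(history: list[str]) -> str:
    """Stand-in for a model call: use the tool once, then finish."""
    if any(item.startswith("result:") for item in history):
        return "final: done"
    return "call:upper:hello"


# The caller owns the loop and composes only the pieces it wants.
def my_domain_loop(max_steps: int = 5) -> list[str]:
    tools: dict[str, Callable[[str], str]] = {"upper": str.upper}
    history: list[str] = []
    for _ in range(max_steps):
        step = toy_policy(truncate_history(history, max_items=10))
        history.append(step)
        if is_done(step):
            break
        _, name, arg = step.split(":", 2)
        history.append("result:" + run_tool(tools, name, arg))
    return history


trace = my_domain_loop()
```

A team with different needs can replace `truncate_history` with its own memory strategy, or write a different loop entirely, without forking anything: each helper is usable on its own.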