Values in AI

Daniel Schmachtenberger has made the argument:

All technologies embody value systems
Some technologies are obligate in a competitive environment

The example of his that stuck with me was the plough: many cultures were animistic (a belief in the spirit of the animal), but after the scaling up of agriculture enabled by the plough, most weren’t. The plough’s enablement of large-scale agriculture likely shifted societies toward sedentism (vs nomadism) and surplus, altering spiritual relationships with animals as they became tools for labor. The perspective shift — the value it encodes — is embedded in the technology.

The plough is also obligate. If one group uses it and other doesn’t, the group that does will be able to farm more per person. That surplus enables for more specialization, which yields an advantage either in terms of trading or conflict. If the second group doesn’t adopt the plough they will be taken-over, outgrown, or both, by the first.

AI may well be an obligate technology, which forces us to make deliberate ethical choices about its deployment and values. We are in the early stages of seeing that with software development. That’s going to change the nature of certain careers: changing what the day-to-day work looks like and impacting demand for software engineers. That isn’t necessarily negative: it will depend on the opportunities that replace the current ones. It also isn’t neutral: our approach to AI, how we deploy it, how it is used are all a series of choices that embed values.

Some of those values are encoded into the models by the training data and loss functions, some are encoded in the systems engineering, the choices of which tasks to apply it to, which interactions to explore and so on, and some are explicitly engineered in through fine tuning and reinforcement learning.

One way of looking at those values is through the study of ethics, how to live in a just way. This is a core topic for philosophers. One example is Kant’s Categorical Imperative, which requires actions to follow maxims that could be universally applied without contradiction, ensuring rational consistency.

It’s somewhat akin to asking the question: Would I still support this if I knew everyone else would act this way? Further, would I support this action if knew I would be born again randomly into the world, maybe in a much different situation than my one now?

The proliferation of useful AI agents adds a somewhat realistic flavor to the question: if, in the future, you are dependent on systems constrained by these specific guidelines or rules , are you happy about that?

Kantian (or deontological) thinking is far from the only ethical system. A lot of thinking about AI ethics has been consequentialist. Consequentialism is practical: the “goodness” of an action is whether it results in a good outcome! Inherently we judge AI training (at least for RL and supervised learning) by the achievement of the outcome encoded in a loss function, reward function or similar. Stuart Russel (of & Norvig fame from university courses of my youth) has written about “provably beneficial” AI where the AI maximizes a human-involved reward signal (a little like the Assistant Games pattern we discussed before).

The downside of all this is well documented — Nick Bostrom’s famous paperclip maximizer thought experiment is an AI that achieves the objective, but in a way that was undesirable. A more benign but annoying example might be a cleaning robot that pushes everything outside the house in order to make it tidier. Because outcome-based rules just judge the what, and now the how, they can also encourage power-seeking (as called out by Bostrom) in order to better achieve objectives.

standard forms of consequentialism recommend taking unsafe actions when such acts maximize expected utility. Adding features like risk-aversion and future discounting may mitigate some of these safety issues, but it’s not clear they solve them entirely.

Deontology and safe artificial intelligence – William D’Allesandro

Anthropic’s constitutional AI approach can be seen as a blend of approaches; the constitution is a set of principles that can be used by another AI to criticize and improve output in response to requests:

As AI systems become more capable, we would like to enlist their help to supervise other AIs. We experiment with methods for training a harmless AI assistant through self-improvement, without any human labels identifying harmful outputs. The only human oversight is provided through a list of rules or principles, and so we refer to the method as ‘Constitutional AI’.

The training still ultimately uses a form of reinforcement learning (which is inherently consequentialist), but the reward is given according to how well the outputs adhere to the constitutional principles.

A more recent philosopher, Derek Parfitt, argued that all moral systems were hill climbing towards a shared perspective, and you can evaluate an action on multiple in order to gain confidence. For example, when considering an option, you could ask:

a) Would it maximize overall good? (consequentialist)
b) Could everyone rationally will it? (Kantian)
c) Could anyone reasonably reject it? (contractualist¹)

“Rationally” here is doing a bit of work: it means “with reasoning”, as in there is a chain of thought that can support and justify the decision.

Part of the challenge with rationalism is that part of the reward signal here is coming from human raters. We have seen this play out with LMSys where models which are “friendlier” score better, and in a more extreme version in the ChatGPT 4O misalignment where the model became excessively sycophantic in a way that resulted in better rewards in short doses, and didn’t impact any of the quantitative evaluations, despite being an overall negative to the experience.

As we move into more agentic systems we often have fewer tools to evaluate or make visible the values we are encoding, but we are still doing it!

For example. Google’s recent AlphaEvolve project uses Gemini underneath, which is an LLM that can be evaluated and aligned. But on top of that it uses an evolutionary search scheme (another reminder of Rich Sutton’s bitter lesson) to generate different prompts and evaluations and iterate towards a new, externally defined goal: in that case generating better algorithms and code. We are searching for superior outcomes, but that search itself is -somewhat unconstrained by other values: it’s a more consequentialist approach.

The current crop of agentic coding tools often recommends encoding preference data into a project specific file. For example, Claude Code recommends a CLAUDE.md file

Include frequently used commands (build, test, lint) to avoid repeated searches

Document code style preferences and naming conventions

Add important architectural patterns specific to your project

CLAUDE.md memories can be used for both instructions shared with your team and for your individual preferences.

While it presents them as memory, the idea here is to guide the choices of the model in a way that aligns with the principles by which the project being modified is managed.

OpenAI have published work in allowing hierarchies of instruction: The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions | OpenAI

we argue that one of the primary vulnerabilities underlying these attacks is that LLMs often consider system prompts (e.g., text from an application developer) to be the same priority as text from untrusted users and third parties. To address this, we propose an instruction hierarchy that explicitly defines how models should behave when instructions of different priorities conflict.

As well as using a single model that can incorporate different safeguards, we can use models themself to verify actions and outputs. Verification is generally an easier problem than generation, so a model that is unable to consistently follow a set of principles may still be able to validate whether a given example does or does not follow them.

LlamaGuard is a good example of this kind of system, built and released by Meta’s GenAI team alongside Llama. One example of seeing this process in the wild is OpenAI’s safety systems on 4O image generation. Inherently agentic, 4O can generate image ideas, then the image itself. Despite the model having constraints on it, it will happily generate things which violate OpenAI’s content policy, necessitating a monitoring model that whisks them away before a use can access a violating image.

If AI becomes an obligate technology, we will benefit from encoding values intentionally, balancing outcomes, universal principles, and fairness. The challenge is ensuring these choices reflect the world we want, not just the one we’re competing in.

Another theory of ethics that weights mutuality heavily: it’s frames ethical considerations as something derived between people rather than just based on outcomes or on abstract principles. Its featured particularly in Scanlon’s What We Owe to Each Other for those, like me, who get all of their ethical understanding from watching The Good Place ↩︎

Values in AI

More posts

Loss Exploded.

Unbundling Work

Native DSLs Ops in PyTorch

Programming The Loop

Discover more from Ian’s Blog