Myopia Is All You Need (For Alignment)
That's not strictly true, but myopia is still incredibly useful
Ok, so first of all, myopia is not all you need for alignment. But it is extremely valuable for corrigibility, for avoiding deceptive alignment, and for maintaining a causal decision theory, all of which address major foreseeable problems. Plus I wanted to do an “all you need” title. Everybody gets one. It won’t happen again.
In 2023, Open Philanthropy sponsored the AI Alignment Awards, offering prizes for progress on corrigibility (and a separate prize for goal misgeneralization). I think this ended up being extremely cost effective, both in terms of the submissions generated and in terms of getting people to think deeply about tackling the problems. However, it concerned me that none of the winning corrigibility entries mentioned myopia, which comes close to solving the problem on its own.
Myopia is a characteristic of utility functions, and so it applies to consequentialist agents. A fairly straightforward definition of it is the lack of preferences over future episodes. In Markov Decision Process (MDP) terms, this means that the time discount factor, gamma, is set to zero. For a real-world deployment that isn’t divided up into neat episodes, this means an agent only cares about the state of the world as it finishes taking an action and not at any point beyond that.
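To make the gamma = 0 framing concrete, here is a minimal sketch (the reward sequence is an arbitrary assumption for illustration): with any positive discount factor the agent weighs future rewards, while setting gamma to zero reduces the return to the immediate reward alone.

```python
# Minimal sketch: the standard discounted return, with gamma = 0 as the myopic case.
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over a trajectory, as in a standard MDP."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

rewards = [1.0, 5.0, 5.0, 5.0]  # small reward now, larger rewards later (toy numbers)

print(discounted_return(rewards, gamma=0.99))  # ~15.7: a far-sighted agent values the later rewards
print(discounted_return(rewards, gamma=0.0))   # 1.0: a myopic agent sees only the immediate reward
```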
It turns out that agents who don’t have any long-term goals are much easier to align. Corrigibility and inner alignment become trivial in that case. Many alignment researchers I’ve spoken to treat that as obvious, but it doesn’t seem to have yet entered common knowledge, hence this blog post. There are also a number of criticisms of myopia and myopic training (e.g. 1, 2, 3), which I’ll discuss in a follow-up post.
A major reason myopia is dismissed is not a disagreement that it solves alignment issues, but rather concerns about performance. This seems like a very reasonable objection! An agent that does not do long-term planning is at a disadvantage to one that does. However, I disagree that this is sufficient reason to write it off entirely. Myopic agents could still be powerful enough to perform pivotal acts, like solving alignment broadly, and it may well be possible to have agents account for future impacts without placing inherent value on them.
Rather than starting with a high-performance system and asking how to make it corrigible and non-deceptive, I think it is worth spending some serious thought on figuring out whether we can take a myopic system and make it competitive.
Corrigibility
How does myopia induce corrigibility? To put it simply, an agent that does not have long-term preferences is indifferent between taking an action and taking that action and then shutting down. As such, it’s disincentivized from spending any resources or trading off against its other goals to avoid being shut down at the end of an episode. It has no reason to pretend to be aligned, to hide the impact of its actions, or to take control of the shutdown channels.
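A toy sketch of that indifference (the utility function and numbers here are assumptions for illustration, not a real agent): if utility depends only on what happens within the current episode, appending a shutdown changes nothing the agent cares about.

```python
# Toy illustration: a myopic utility scores only what happens within the episode,
# so "take the action" and "take the action, then shut down" are scored identically.
def myopic_utility(outcome):
    # Only the within-episode result matters; anything after the episode is ignored.
    return outcome["episode_reward"]

act_only = {"episode_reward": 3.0, "running_after_episode": True}
act_then_shutdown = {"episode_reward": 3.0, "running_after_episode": False}

# Equal utility: there is nothing to gain by spending resources to resist shutdown.
assert myopic_utility(act_only) == myopic_utility(act_then_shutdown)
```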
That’s not to say that a myopic agent is completely safe. Myopia is not all you need for alignment writ large. A myopic agent can still take catastrophic, irreversible actions, and avoiding that failure mode requires a separate intervention that induces safe exploration. Myopia also does not prevent manipulation within an episode, such as reward hacking or performative prediction (when a prediction affects its own outcome), the subject of my recent work. That can be particularly concerning if the effects of within-episode manipulation build up over time to cause more macroscopic changes.
Nevertheless, corrigibility is still a huge step towards safe AI. In my threat model, a significant chunk of the risk comes from unsafe AIs actively avoiding being corrected. I think we are likely to be able to get a model mostly aligned, with corrigibility being the difference between continuing on to a fully aligned model and being killed by a “mostly aligned” one.
Deceptive Alignment
I’ve said before that corrigibility and deceptive alignment are the same thing, and myopia is as much a solution to deceptive alignment as it is to corrigibility. Deceptive alignment arises when an agent realizes it’s being updated by a training process but has goals that stretch beyond it. As such, a myopic agent whose goals extend only within the current episode will not act deceptively.
The difference in frame between addressing corrigibility and deceptive alignment is that with corrigibility we analyze behavior assuming the desired modification is already in place. With deceptive alignment we need to make the case that the correct goal is learned before the model becomes deceptive. Otherwise, it can just fake whatever behavior is desired while staying deceptive.
Myopia as an intervention is excellent in this regard, because it’s present from the very beginning. It’s non-myopia that has to be learned, while myopia itself is the default state. If a model is myopic, it doesn’t become deceptive, and therefore stays myopic.
That said, it’s possible that the model training process pushes for non-myopia, even if the training objective is myopic. Where one story of deceptive alignment is that situational awareness and long-term planning are both learned in the natural course of training, another possibility is that deception is actively pushed for. In the latter case, the training process finds it easier to reduce loss by making a simple goal non-myopic than by learning the true myopic goal.
To counteract this potential pressure, we would like to actively select for myopia, not just depend on a lack of visible incentives for it. Finding such a regularization term or training process is currently an open problem.
Decision Theory
A final area where myopia provides a very useful alignment property is decision theory. While this is a complicated topic, the short version is that we can accurately model agents that follow a causal decision theory (CDT), meaning they are concerned only with the causal impact of their decisions. An acausal decision theory can be more effective at achieving a given goal, but it is much harder to model the resulting behavior. The issue that arises is that even if we start with a CDT agent, it is incentivized to self-modify into using an acausal decision theory.
This is particularly relevant for discussions of myopia, as myopic but acausal agents can act as though they are not myopic. In the same way that an acausal agent could cooperate in a prisoner’s dilemma with a clone of itself, a myopic acausal agent could cooperate with versions of itself across time periods. As it would then be acting towards a long-term goal, this would nullify the benefits of myopia.
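A toy twin prisoner’s dilemma makes the distinction concrete (the payoff matrix is a standard illustrative assumption): a causal reasoner treats its copy’s move as fixed and defects, while an acausal reasoner treats the copy as mirroring its choice and cooperates. The same mirroring logic, applied to future copies of a myopic agent, is what would recover long-term behavior.

```python
# Toy twin prisoner's dilemma (payoffs are illustrative assumptions).
# payoff[(my_move, copy_move)] is my reward.
payoff = {
    ("C", "C"): 3, ("C", "D"): 0,
    ("D", "C"): 5, ("D", "D"): 1,
}

# Causal reasoning: the copy's move is treated as independent of mine,
# and defection dominates (5 > 3 and 1 > 0), so a CDT agent defects.
cdt_choice = "D"

# Acausal reasoning: the copy mirrors my decision, so I compare (C, C) with (D, D).
acausal_choice = max(["C", "D"], key=lambda move: payoff[(move, move)])

print(cdt_choice, acausal_choice)  # D C
```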
Fortunately, if we start with agents following a causal decision theory, as I think is likely under standard training methods, myopia ensures that they stay that way. The causal benefits to becoming an acausal agent necessarily accrue after the transformation occurs, and a myopic agent doesn’t care about benefits derived after it takes an action.
My colleague Cecilia Wood is working on a project showing the opposite, that a myopic agent can instead be prone to self-modification. In essence, an interpretable agent can use self-modification as a commitment mechanism. If the agent is also myopic, then it accrues the immediate benefits of having made a commitment, without concern about the costs of that commitment beyond the episode. I find this concerning, but for now I think it can be addressed by a stricter definition of the episode that ends immediately after self-modification.
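A toy version of that commitment story (all numbers are assumptions): self-modifying locks in a promise that a counterparty rewards immediately, while the cost of honoring it falls outside the episode, so a gamma = 0 agent sees only the upside.

```python
# Toy numbers (assumptions) for self-modification as a commitment device.
immediate_benefit = 2.0   # counterparty pays now because the commitment is credible
deferred_cost = -10.0     # cost of honoring the commitment, paid in later episodes

myopic_value = immediate_benefit                   # gamma = 0: the future cost is ignored
farsighted_value = immediate_benefit + deferred_cost

print(myopic_value, farsighted_value)  # 2.0 vs -8.0: the myopic agent takes the deal
```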
To Be Continued…
Between corrigibility, deceptive alignment, and decision theory, the common thread is that they all involve an agent having preferences over itself. In the former two, this manifests as a desire to avoid being changed, while in the latter it becomes a desire to change itself. Myopia eliminates this entire category of issues, including both the named problems and other miscellaneous possibilities. These preferences are a fundamental part of what makes the alignment problem difficult: they make the process adversarial between us and the agent being trained. Any solution with the potential to skip over that challenge should be thoroughly investigated before writing it off.
In a follow-up post, I’ll give my thoughts on whether it is feasible to actually train a myopic model, and whether such models will be sufficiently competitive with non-myopic approaches. I am genuinely unsure if it is possible to get the alignment upside of myopia without a disqualifying performance hit, but I think it is an underdeveloped research direction and I plan on looking into it further.