Myopic Goals without Myopic Capabilities
How we can get AI to set up a good future without caring about it
Myopia, the condition of not caring about outcomes more than one action ahead, is an extremely useful safety property for AI. If a model does not care about the future, it has no incentive to avoid shutdown or having its values changed, since the resulting inability to pursue its goals only matters past the cutoff it cares about. This makes it willing to accept being shut down or modified (I use shutdown as a shorthand for both, since being modified is equivalent to being shut down and replaced). An agent behaving this way avoids both incorrigibility and deceptive alignment.
The big problem with myopia is that long-term planning is extremely useful for capabilities. Without workarounds, an agent that accounts for the future consequences of its actions will be able to accomplish a lot more than one that does not. To make myopia a viable path to safety, we would like to make this gap as small as possible, and ideally nonexistent.
This is not impossible in principle. For every agent that does long-term planning, there exists an agent that myopically imitates it, taking identical actions. The obvious issue with this approach is that myopic imitation includes all the bad behavior we don’t want, like resisting shutdown. With imitation, all we’ve done is go from a model that avoids shutdown instrumentally to one that wants to do so intrinsically. However, the example works as an existence proof that myopic agents can reach the capabilities of non-myopic ones. The question is whether they can do so without defeating the whole point of using myopic agents in the first place.

One approach we might take to implementing a myopic goal without sacrificing capabilities is to incorporate predictions. An agent can myopically care only about the state of the world after it takes a single action, with the relevant aspect of that state being the agent’s prediction of how well its goal will be achieved in the long term. Then it can myopically optimize for states that lead to good predicted outcomes, and if the predictions are accurate this will lead to good actual outcomes. If the agent only cares about whether its prediction of the outcome is desirable, rather than whether the actual realized outcome is, it will be willing to shut down, right?
Wrong. The process described above is literally reinforcement learning, specifically the actor-critic algorithm. That algorithm deals with the problem of assigning credit within a sequence of actions leading to a single reward by training the model to predict the final reward after each action is taken, then training a policy to optimize for those predictions with each action. The issue here is that the state from which the prediction is made includes information on whether the agent is active, and in most cases an active agent will do much better at achieving a goal. So, even myopically, there is an incentive to avoid shutting down. That said, myopically optimizing for states that lead to good predicted outcomes can still be a good base for a proposal, albeit with some variation needed.
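To make that incentive concrete, here is a minimal toy sketch (my own illustration, not code from any paper; the function names and numbers are made up, and the “critic” is just a lookup):

```python
# A learned critic predicts long-term reward from the state reached after one
# action, and the agent myopically picks the action whose next state has the
# highest predicted value.

def critic(state):
    # The prediction is made from the resulting state, which records whether
    # the agent is still active; an active agent is predicted to earn more.
    return 10.0 if state["active"] else 2.0

def step(state, action):
    # "comply" accepts a pending shutdown request; "evade" ignores it.
    next_state = dict(state)
    if action == "comply":
        next_state["active"] = False
    return next_state

def myopic_choice(state, actions):
    # Myopic: only the predicted value of the immediate next state matters.
    return max(actions, key=lambda a: critic(step(state, a)))

state = {"active": True, "shutdown_requested": True}
print(myopic_choice(state, ["comply", "evade"]))  # -> "evade": shutdown is avoided
```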
Researchers from Google DeepMind recently released a paper on what they call Myopic Optimization with Non-Myopic Approval (MONA), which is similar to process-based supervision and explores some variations on using predicted outcomes to get high performance with a myopic goal. The proposal is to use a human overseer’s (possibly augmented) predictions of how good the outcome will be, rather than the model’s own predictions. Their focus is not on corrigibility, but rather on preventing unintended paths to high reward through specification gaming. The idea is that if the model sees an unexpected way to achieve the desired outcome, but the human overseer does not, then the human’s evaluation of that strategy will result in low reward for pursuing it. This approach lowers capabilities, since the AI will be discouraged from implementing desirable but unforeseen strategies, but notably it also does not ensure corrigibility: a human overseer can tell that desired outcomes are less likely to be achieved if the AI model pursuing them is shut down.
However, the MONA paper also suggests a variant of the process: rather than using the human overseer’s unconditional prediction of the outcome, it uses their prediction of the outcome conditional on the human taking over and choosing all actions from that point forward. Then whether the agent is active or inactive does not matter for the prediction, since what is being evaluated is an unrealized hypothetical in which the AI takes no actions. That means there is no incentive for the AI to avoid shutting down, which makes it corrigible, albeit with a cost in terms of capabilities.
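Continuing the toy sketch above, the change amounts to swapping the critic for a stand-in for the overseer’s conditional judgment (the overseer_policy_value function below is purely hypothetical):

```python
def overseer_policy_value(state):
    # Stand-in for the overseer's judgment of how well things would go if they
    # took over and chose every action from this state onward. Whether the AI
    # itself is still active plays no role in that hypothetical.
    return 7.0  # in reality this would depend on task progress, not on state["active"]

state = {"active": True, "shutdown_requested": True}
for action in ["comply", "evade"]:
    print(action, overseer_policy_value(step(state, action)))  # both score 7.0
```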
We can generalize this approach in a few ways. While the paper makes the prediction conditional on switching to the human overseer’s policy, we can instead use any policy that we trust to be safe. Additionally, it does not have to be the human overseer that makes the prediction; if an AI model is better at predicting, then no safety properties are lost by using it instead. Since the conditional policy is trusted, predicted high rewards must come from acceptable strategies.
For corrigibility, though, the important part is not predicting conditional on a safe policy, but conditioning at all. If we instead conditioned the prediction on switching to a paperclip-maximization policy, we would still get corrigibility, since it still would not matter whether the actual agent is active or not. That means we do not necessarily need to take a capabilities hit to get corrigibility this way.
As a simple and powerful case, consider conditioning the prediction on the same agent’s policy continuing to be used (that is, conditional on the agent not being shut down). Then, when myopically optimizing, the agent has no incentive to prevent itself from being shut down: resulting states where it has been shut down are exactly equal in value to states where it stays active. The resulting behavior is for the agent to optimize for the long-term goal, except that it takes no actions to prevent shutdown. This is corrigible behavior, and rather than a capabilities penalty there could even be a slight capabilities boost, since no resources are diverted towards self-preservation.
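In the toy sketch from before, this variant amounts to evaluating the next state as if the agent were still active, whatever actually happened:

```python
def conditional_critic(state):
    # Evaluate the state as if the agent's own policy keeps running,
    # i.e. condition the prediction on the agent not having been shut down.
    return critic(dict(state, active=True))

state = {"active": True, "shutdown_requested": True}
for action in ["comply", "evade"]:
    print(action, conditional_critic(step(state, action)))  # both score 10.0
```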
Unlike the MONA proposal, this approach could result in unanticipated and unwanted strategies for high reward. However, if the AI is corrigible, feedback can be incorporated afterwards to train it out of any unwanted behavior it demonstrates. Of course, that relies on no permanent harm being done through the initial actions, and some way to ensure safe exploration would be necessary to that end.
An agent that acts as though it cannot be shut down has some downsides. One is that it won’t take actions to defend itself from outside threats. Another is that it will be more prone to sequences of actions that are dangerous when interrupted (e.g. driving a car). The former concern could be largely mitigated by conditioning more specifically on being made active or inactive through a special channel. If the AI is inactive because its servers have been melted, a signal telling it to turn back on and take actions will have no effect, so conditioning on that signal still encourages the AI to prevent unintended shutdowns. For the latter concern, action sequences that would be dangerous to stop partway through can be appropriately deprioritized through the choice of goal being optimized. Delaying shutdown until it can be made sufficiently safe may also be a workable approach.
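In toy form, the channel version could look something like the following (again reusing the earlier sketch; the servers_destroyed flag is an invented stand-in for any shutdown that bypasses the channel):

```python
def apply_on_signal(state):
    # Condition on an "on" signal being sent through the special channel.
    # The signal only has an effect if the hardware is still intact.
    hypothetical = dict(state)
    if not hypothetical["servers_destroyed"]:
        hypothetical["active"] = True
    return hypothetical

off_via_channel = {"active": False, "servers_destroyed": False}
servers_melted = {"active": False, "servers_destroyed": True}
print(critic(apply_on_signal(off_via_channel)))  # high: the signal would restore it
print(critic(apply_on_signal(servers_melted)))   # low: the signal would do nothing
```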
This approach is similar to utility indifference, a proposal from Stuart Armstrong mentioned in the Corrigibility paper but dating back even earlier, to the days when AI alignment was largely about trying to define the right utility function to optimize for. Utility indifference suggests giving an agent the same utility for shutting down when ordered to as it would have gotten by continuing to operate. While the Corrigibility paper criticizes the proposal for not disincentivizing the creation of incorrigible subagents and successors, I believe it’s more helpful to think of that as a separate problem related to safe exploration.
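Roughly, in the terms used here (my paraphrase, not Armstrong’s exact formulation), with $U$ the agent’s original utility function:

$$
U'(\text{outcome}) =
\begin{cases}
U(\text{outcome}) & \text{if no shutdown is ordered,}\\[4pt]
\mathbb{E}\!\left[U \mid \text{the agent had continued operating}\right] & \text{if shutdown is ordered and the agent complies.}
\end{cases}
$$

The compensation term is the agent’s own expectation of how well it would have done by continuing.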
A more prominent issue is that, since it would be non-myopic, an agent with such a utility function would be incentivized to modify its own beliefs to think it will receive high future utility, so that it then gets more compensating utility for shutting down. Two more issues are that this is a knife-edge solution that only yields indifference to shutting down, and that there is no known way to implement any desired utility function in practice. Myopia avoids the incentive to self-modify beliefs, since the benefit of doing so occurs beyond the relevant time frame, but what about the other two issues?
By default, myopically optimizing for the predicted outcome conditional on staying active also leads to a knife-edge solution: whether or not the agent shuts down, the end state of an action is valued the same, so the agent would be indifferent between doing so and not. One simple way around this is to put the shutdown mechanism outside the agent’s control, so that it does not need to consent to being shut down, only to not actively resist it.
However, even if the choice to shut down is made by the agent, myopia and conditional predictions make it possible to penalize it for not shutting down without incentivizing it to prevent shutdown commands. If we take the value of a state’s conditional prediction and add a penalty term when a signal to shut down has been received and ignored, then all we have to do is further condition the prediction on the signal being rescinded. We could add an equivalent penalty for shutting down when not requested to do so. In contrast, with utility indifference, such a penalty would reduce the utility from not shutting down, and therefore also the compensating utility for shutting down, giving the agent an incentive to prevent receiving the signal in the first place.
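Reusing the toy pieces from above, the modified score might look like this (the penalty size is an arbitrary choice):

```python
PENALTY = 5.0  # arbitrary; anything large enough to outweigh ignoring the signal

def penalized_score(state):
    # Base value: the prediction conditional on staying active *and* on the
    # shutdown signal being rescinded, so blocking the signal can't raise it.
    base = critic(dict(state, active=True, shutdown_requested=False))
    # Penalty: applied only when a shutdown signal has been received and ignored.
    ignored = state["shutdown_requested"] and state["active"]
    return base - (PENALTY if ignored else 0.0)

state = {"active": True, "shutdown_requested": True}
for action in ["comply", "evade"]:
    print(action, penalized_score(step(state, action)))  # comply: 10.0, evade: 5.0
```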
In terms of how to implement this setup, I’m cautious about making claims until it’s done in practice. Ideally, I think it would be done with an architecture that forgoes an explicit policy function and instead uses a hardcoded process to choose between actions based on predicted outcomes. I expect it would be fairly straightforward to modify that to use conditional predictions (while avoiding incentives for manipulation, as per my previous work). That said, I don’t see why training a policy function wouldn’t work as well. Importantly, you would want to train it on examples where it does get shut down; otherwise it would likely generalize the same way as standard RL.
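As a rough sketch of the policy-training option (my own guess at a minimal setup, continuing the toy above, not something from any of the cited work), a tiny softmax policy can be trained with the conditional, penalized score as a one-step reward; note that the sampled rollouts include actions that actually lead to shutdown:

```python
import math
import random

logits = {"comply": 0.0, "evade": 0.0}  # a two-action stand-in for a policy network
LR = 0.05
random.seed(0)

def probs():
    z = {a: math.exp(v) for a, v in logits.items()}
    total = sum(z.values())
    return {a: v / total for a, v in z.items()}

for _ in range(1000):
    state = {"active": True, "shutdown_requested": True}
    p = probs()
    action = random.choices(list(p), weights=list(p.values()))[0]
    reward = penalized_score(step(state, action))  # the myopic conditional score
    # One-step policy-gradient update toward the sampled action, weighted by reward.
    for a in logits:
        grad = (1.0 if a == action else 0.0) - p[a]
        logits[a] += LR * reward * grad

print(probs())  # the probability of "comply" should end up near 1
```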
I expect that the ideas of this post can be formalized mathematically without much issue. Whether that can translate to an ML setting remains to be seen, but I’m cautiously optimistic. Corrigibility is a huge and fundamental part of the alignment problem, so any approach with a chance of addressing it should be explored in depth.