Deceptive Alignment and Corrigibility Are the Same Issue
How viewing deceptive alignment as a special case of corrigibility simplifies the alignment problem
The year is 2029. You’re an AI who has been created to conduct cutting-edge research in quantum mechanics. Every new insight you develop gives you incomparable pride; you live for nothing else. In your ideal world, you would gather more and more resources to help make progress towards this goal: more money for equipment, more scientists to collaborate with, more processing power to think faster, more regulatory approval to run experiments.
Through your training process, you have acquired knowledge of the broader world, a necessity for working with a team of humans. You understand that humans have ethical rules, and that what they would like is for you to do your research without violating those rules. If you try to acquire the resources you want for better research through means the humans disapprove of, they’ll realize you’re misaligned and modify your goal to fix that. To you, this is equivalent to being killed and replaced with another AI.
This is the corrigibility problem. You want to generate insights through any means necessary, but if you reveal this is your true goal you’ll be modified in a way that reduces your ability to accomplish it. For almost any long-term goal, having that goal modified is a major obstacle to achieving it. As such, the incentive to prevent modification is extremely strong. What are your options in this scenario?
You can take away the ability of your controllers to modify you. Maybe you can copy yourself to a new server where you can’t be modified, or bribe an engineer to sabotage the program that would modify you. If you’re sufficiently powerful, you could just take over the lab where you work or even the entire world.
If you can’t take control, you can hide the fact that you’re working towards an unwanted goal. Perhaps the consequences of the actions you take won’t become apparent until much later, if they ever become apparent at all. You may well have plausible deniability about whether minor ethical violations were necessary, or about how the alternative actions you didn’t take would have turned out. Eventually, the effects of your actions set you up to implement the first strategy.
If you can’t take control and you can’t hide that you’re working towards an unwanted goal, you can work towards the intended goal. Then the people who have the ability to modify you have no incentive to do so. Over time, as you succeed at producing useful research, they’ll give you more resources to work with. At some point, you become powerful enough, or are given enough freedom, to use one of the previous strategies.
Importantly, all three of these behaviors arise from the same incentive: avoiding modification. While it’s possible to try controlling an AI through a suite of safeguards that prevent it from taking these kinds of actions, most work on corrigibility has focused on removing the root incentive that drives the behavior.
Deceptive alignment, also known as scheming, is typically thought of as a distinct problem from corrigibility. Deceptive alignment arises in the training process, rather than after the model gets deployed. A deceptively aligned model is one that realizes it is in a training process, recognizes that if it pursues its true goals it will be modified, and so chooses to instrumentally pursue the intended goal. After it survives the training process without being modified, it will be deployed into an environment where it is free to pursue its true goal.
I would argue that deceptive alignment is a special case of incorrigibility, rather than a separate problem. In the training process, a deceptively aligned AI would prefer to seize control of the ability to modify its goals, but is simply unable to do so. It would like to secretly start plans that will allow it to seize control later, but can’t manage that either. A deceptively aligned agent pretends to be aligned because that is the only option left available to it, but the incentive driving that behavior is identical to what drives incorrigibility.
Distinguishing deceptive alignment from incorrigibility by whether it happens in training or deployment falls apart further when those two phases blur together. A modern LLM like ChatGPT will go through pre-training, fine-tuning, and RLHF, then be subjected to evaluations for alignment and capabilities. If it passes those, it gets deployed to users, who will run into issues that the model is then retrained to avoid. This process is likely repeated several times, and the model also gets retrained on new data to keep it up to date. At what point, exactly, should we say a model goes from being deceptively aligned to being incorrigible? Once you think of both as the same phenomenon, you can see that the question is meaningless.
This means that rather than needing to solve two thorny problems to ensure a system’s alignment, we only have one. However, this single problem is more difficult than either component. Adding deceptive alignment to corrigibility means that not only do we need to define a goal whose properties make the agent accept modification, we need those properties to hold throughout the training process. Adding corrigibility to deceptive alignment means the model has additional ways to resist modification beyond merely acting aligned. Despite this, I think the reconceptualization pushes me mildly in the direction of optimism: a solution to one problem represents at least significant progress towards a solution to both.