Two Sides to Eliciting Latent Knowledge
Two issues with training a model to provide good explanations
Suppose you want to understand what led an AI model to take a particular action. This is the Eliciting Latent Knowledge (ELK) problem: finding a way to access the internal process a model uses to generate its output. Rather than analyzing the model’s weights and activations yourself, you might train another AI model to take those as input and output an explanation for the action. You’ll train this new model on many example explanations, each labeled true or false.
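To make the setup concrete, here’s a minimal sketch of what that training loop might look like, assuming the predictor’s activations are flattened into a fixed-size vector and candidate explanations are embedded somehow. The `Reporter` architecture, the dimensions, and the random toy data are all illustrative stand-ins I’m inventing for this post, not anything from the ELK report.

```python
# Minimal sketch: train a "reporter" on the predictor's internals plus a candidate
# explanation, supervised by human true/false labels. All shapes and data are toy.
import torch
import torch.nn as nn

class Reporter(nn.Module):
    """Maps (predictor activations, explanation embedding) -> logit for 'explanation is true'."""
    def __init__(self, act_dim: int, expl_dim: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(act_dim + expl_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, activations: torch.Tensor, explanation: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([activations, explanation], dim=-1)).squeeze(-1)

# Toy training data: activations, explanation embeddings, and human-judged labels.
act_dim, expl_dim, n = 64, 32, 1000
activations = torch.randn(n, act_dim)
explanations = torch.randn(n, expl_dim)
human_labels = torch.randint(0, 2, (n,)).float()  # 1 = "human judge labeled this explanation true"

reporter = Reporter(act_dim, expl_dim)
opt = torch.optim.Adam(reporter.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()

for step in range(100):
    logits = reporter(activations, explanations)
    loss = loss_fn(logits, human_labels)  # supervised purely by the human labels
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The key point of the sketch is the last comment: the only training signal is the human judgment, which is exactly where the trouble below comes from.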
Assume that the AI model you want to explain makes predictions based on a causal understanding of the world. An explanation for its output then amounts to identifying the relevant causal factors, including those used for backchaining information about the initial conditions. The assumption of a predictive causal model comes from the ELK report, but it’s quite general if you see prediction as the heart of the decision-making process for any consequentialist. An explanation for any action a consequentialist takes comes down to “the model predicted this would be the outcome, and it liked that outcome”.
As you (or the people you’ve outsourced to) label the training explanations true or false, some mistakes are bound to creep in. To err is human, after all. Occasionally true explanations get labeled false or false explanations get labeled true; let’s say each of these kinds of error affects 2% of the training data.
If you want the model you’re training to output true explanations, this is a huge issue! The policy “output true explanations” matches the training labels on only 96% of examples, while the policy “output explanations the human judge thinks are true” matches 100% of them. Every single mislabeled explanation is a point in favor of the latter policy over the former.
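If you want to see that gap fall out of the numbers, here’s a toy check. The 50/50 split between true and false explanations and the 2% + 2% mislabeling are just the illustrative figures from above, nothing empirical.

```python
# Toy simulation of the accuracy gap caused by mislabeled explanations.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
truth = rng.integers(0, 2, n).astype(bool)   # whether each explanation is actually true

# Mislabel ~4% of examples overall: ~2% false positives plus ~2% false negatives.
flip = rng.random(n) < 0.04
human_label = np.where(flip, ~truth, truth)

truthful = truth          # policy: "output true explanations"
sycophant = human_label   # policy: "output explanations the human judge thinks are true"

print("truthful policy matches labels:   ", (truthful == human_label).mean())   # ~0.96
print("sycophantic policy matches labels:", (sycophant == human_label).mean())  # 1.0
```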
Now let’s say you dedicate yourself to driving down the error rates. You take your time to fully think through every explanation and double-check it. You use powerful AI tools that have been developed to help you better understand what’s going on. And just to be safe, any explanation you’re not 100% certain is labeled correctly gets dropped from the training set. Through a miraculous effort, you drive the error rates down to literally 0%.
Bad news! This still isn’t enough to select the policy you want! Both “output true explanations” and “output explanations the human judge thinks are true” now match 100% of the data. Which one gets learned comes down to the inductive biases of the architecture and training process, and it’s unclear which way they point. It’s quite possible that giving explanations which merely appear to be true is massively favored.
This is the issue the ELK report focuses on. Even without any errors in labeling the training data, the policy that provides true explanations could still be disfavored. Is it possible to regularize the training process in a way that favors truth-telling? Yes, of course, errors in the training labels will add further selection pressure against generating true explanations, but let’s first try to solve the easier case without them. Once that’s done, we might well be able to scale up the same methods to offset mislabeled explanations as well.
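As a sketch of what “regularize toward truth-telling” could mean mechanically, here’s a toy objective that adds a complexity penalty to the reporter’s fit to the human labels, in the spirit of the ELK report’s discussion of penalizing complexity. The hope is that directly translating the predictor’s knowledge is simpler than simulating the human judge, so the penalty tilts training toward truthful explanations (the report also shows ways this can fail). The L1-on-weights term is a crude stand-in for a real complexity measure, and `reporter` is the hypothetical interface sketched earlier.

```python
# Toy "fit + complexity" objective for the reporter. The L1 norm of the weights is a
# crude proxy for reporter complexity, not the ELK report's actual proposal.
import torch
import torch.nn.functional as F

def reporter_loss(reporter, activations, explanations, human_labels, reg_strength=1e-4):
    logits = reporter(activations, explanations)
    fit = F.binary_cross_entropy_with_logits(logits, human_labels)
    # Complexity proxy: total L1 norm of the reporter's parameters.
    complexity = sum(p.abs().sum() for p in reporter.parameters())
    return fit + reg_strength * complexity
```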
You could also imagine someone making the exact opposite argument. Mislabeled explanations are clearly going to select against the truthful policy, while it’s ambiguous which policy is favored by the training biases. In fact, doesn’t telling the truth seem intuitively easier to learn than sycophancy? And there are already proposals for regularization terms, admittedly with issues, that would further favor true explanations. Meanwhile, given realistic error rates and the size of the training data, there are going to be tons of mislabeled examples. On any reasonable margin, trying to reduce the number of those errors will help more than trying to develop new regularization methods.
I’m going to be a radical centrist here and say that both sides have a point. On one hand, the issue with incorrectly labeled explanations seems big enough that it’s unlikely to be overcome through regularization alone, at least not without a significant performance impact. On the other hand, it also seems unlikely that the error rate can be driven low enough to succeed using only existing regularization proposals. We need to make progress along both of these paths to be confident in the policy we train.
So, which of these two sides of ELK is more impactful to work on? While most people have a clear preference, someone on the margin could ask which one is more neglected. Almost everyone I know of who is explicitly working on ELK, namely ARC, is focusing on the case where the training data is accurately labeled. However, there are many more people working on interventions (e.g. debate, scalable alignment) that could lower error rates without being motivated by ELK in particular. My intuition, loosely held, is that it’s usually more impactful to take ARC’s approach and first try to solve the issue with ideal training data, but there could be low-hanging fruit in augmenting human judgments specifically with regard to explanations in the ELK context.