Euan McLean at FAR AI is interviewing AI safety researchers to understand how they view the field and the impact of their work. He was willing to share his interview questions with me, and I thought answering them here would make a great post to give my readers background on my beliefs.
Q1: What is your median guess of what the first human-level AI will look like, where I define human-level AI as an AI system that can carry out 100% of economically valuable human cognitive tasks that can be achieved by interacting with the internet? For example, will it be something like GPT, or something very different?
This answer depends significantly on which human’s level is being referred to with “human-level”. I expect that LLMs in the next few generations will be able to perform cognitive tasks at the level of the median 15-year-old. The main barrier now is not intelligence, but rather the difficulty of interacting with non-text inputs like video and audio. Translating these formats into text seems nearly inevitable, though, just a matter of enough labeled data, with image and speech recognition already mostly solved.
For doing all cognitive tasks at the level of the median human who currently gets paid to do them, I still expect the architecture to look mostly like LLMs, although I’m less confident in that. Here, the bottleneck I anticipate is generalization ability, with supervised finetuning and RLHF currently seeming to be required for each individual task. Either we find a way for LLMs to become more agentic, doing search and essentially thinking through what to output with a particular goal in mind, or we bite the bullet and train separately on all the necessary tasks. I’m more skeptical about the feasibility of the latter for achieving truly general intelligence, but still think there’s a decent probability it works.
For both of the above benchmarks, I’m answering for almost all economically valuable cognitive tasks, rather than literally 100%. I expect there will be a few stubborn tasks resistant to automation, at least until recursive self-improvement solves the issues, possibly using a radically different architecture.
I’d highlight the weight I put on goal-directed AI being required for both capabilities improvements and existential risk as my largest source of disagreement with the AI safety community, although it’s still a fairly common view. Compared to the most optimistic people, I’m more worried that current systems only appear safe due to a lack of agency. Compared to the biggest pessimists, I see the problem as more tractable because agents are easier to model. This belief motivates the direction of my research, focusing less on existing models and more on future models and general principles of agency.
Q2: What’s your (inside view) 90% confidence interval for the date of the first “human-level AI” (as defined above)?
For AGI at the level of the median 15-year-old, all I’m confident in is that we don’t have it now (and I’m not 100% on that, either). The lower bound of my 90% confidence interval, the date I put only a 5% probability on AGI arriving before, is then a few months from now, in mid-2024. For AGI at the professional level, the lower bound of my confidence interval is the end of 2025.
For both intelligence levels, the upper end of my 90% confidence interval is “never”. I place at least a 5% chance on humanity running into difficulties with AI development, losing interest or banning it, and eventually going extinct without having built AGI.
Neither endpoint gives much information about my actual beliefs, so I’ll also give my median estimates. Those would be 2027 for 15-year-old-level AGI and 2036 for professional-level AGI.
Q3: What’s your best (inside view) guess at P(doom) (the probability of humanity going extinct due to AI)? How seriously do you take this estimate?
I don’t take the following estimates too seriously numerically, but the relative sizes seem right to me:
10% chance we don’t build AGI, either because we can’t or because we choose not to
30% chance we have “alignment by default”, and it turns out AGI never posed an existential risk
15% chance we solve alignment, AGI did pose a risk by default but we fixed it
5% chance we get aligned AI, but it still wipes us out due to either misuse or conflict
40% chance we don’t solve alignment, and rogue AI wipes us out
Altogether, that adds up to a p(doom) of 45%.
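Spelling out the arithmetic, only the last two scenarios end in human extinction, so they are the only ones that contribute:

$$p(\text{doom}) = 0.05 + 0.40 = 0.45$$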
Timelines significantly affect my p(doom), through the chance that we solve alignment. If we deploy AGI just two years from now, that 15% chance of solving alignment would be lower, while not building AGI for more than fifty years would make it higher.
Q4: Conditional on human extinction due to AI, what is the most likely way this could happen?
There are two parts to this story. Why does AI choose to cause human extinction, and how does it do so?
For the why, my worry is that we’ll get AIs that can be well described as optimizing for some goal. I expect doing so will increase performance, so the behavior will either be added intentionally or found by the training process. We don’t currently know how to specify a goal for which seizing all of humanity’s resources is not instrumentally useful, so it seems unlikely that we’ll stumble onto one by chance, or arrive at one through a compromise between multiple AIs.
As for the how, a smart but unaligned AI can pretend to be aligned to ensure it gets deployed. It could then export itself and run many copies of itself for long amounts of subjective time. The initial resources for this could be acquired either through unethical means or by working online. Many copies of an intelligent AI working together on research for a long time would be able to make significant technological progress, including improvements to the research process itself. With access to technology many years beyond the rest of humanity, an AI could eliminate us at its leisure.
Q5: Imagine a world where, absent any effort from the AI safety community, humans will go extinct due to AI, but actions taken by the AI safety community prevent humans from going extinct. In this world, what did the AI safety community do to prevent extinction?
To start, I credit the AI safety community with raising the issue of existential risk in the first place. Without that, the possibility may not have ever been given serious consideration by AI developers. That’s already saved the world in the case where alignment is necessary but very easy.
In terms of future work from the AI safety community preventing extinction, I think it is most likely to be aimed at problems that may not present themselves until models are agentic and dangerously capable, such as deceptive alignment and incorrigibility. While I don’t expect us to develop formal guarantees against these issues, their emergent nature means atheoretical fixes based on less capable models are unlikely to work either. Rather, I expect progress to come in the form of training stories: choices made for theoretical reasons that we think will reduce the likelihood of failure modes.
If that sounds vague, I would point to the ELK proposal contest as the sort of thing I am trying to indicate. ARC identified a theoretical issue: a superhuman model trained to explain its reasoning may learn to say what a human evaluator believes rather than what the model itself believes. The community suggested solutions that took the form of training modifications to select for the latter. While I don’t believe that any of the prize-winning proposals reach the level of contribution that appreciably reduces existential risk, that form of solution is where I’m most optimistic.
Q6: What research direction do you think will reduce P(doom) the most, and what’s its theory of change?
Related to my above answer, I’m excited about work on solutions to the Eliciting Latent Knowledge (ELK) problem. The main theory of change there is that we need to understand what a model knows about complex situations in order to provide useful feedback through methods like RLHF, which seem on track to be the main way we align AIs. If we don’t solve it, training an AI to optimize for outcomes that merely look good could be disastrous. Success at ELK could also lead to success at eliciting latent thought processes more generally, which could help identify dangerous patterns like deception and train them out.
I’m also excited about the research direction that I’m currently working on, predictive models. The theory of change is that we’re likely to develop powerful predictive models anyway, and so we should figure out how making predictions can be risky and what we can do to avoid that. It also opens up new paths for alignment, asking what new approaches we can take if we have a powerful predictive oracle.
Q7: Could this research direction speed up the development of AI capabilities (or increase P(doom) in some other way)? How would this happen, and how concerned are you about it?
I think either of these directions is unlikely to speed up capabilities development. Developing powerful predictive models would certainly increase p(doom) if the alternative were an AI pause, but we’re on track to create them anyway. Eliciting a model’s latent knowledge seems quite close to an unalloyed good; the biggest risk I see is that failing at it while believing we’ve succeeded could be more dangerous than recognizing the problem remains unsolved.
Q8: Are there any big mistakes the AI safety community has made in the past?
Before I give my answer, there are two responses that I think would be common here which I’d like to reject. The first is that the early AI safety community put too much stock in big conceptual arguments (e.g. paperclippers), rather than making an empirical case. Even beyond the fact that theory was the only avenue available at the beginning, that work highlighted issues that still need to be fixed before we start seeing them develop naturally. The second answer I would disagree with is that the AI safety community accidentally accelerated capabilities, for example through the founding of OpenAI or the development of RLHF. Here, it’s not clear to me that capabilities were differentially accelerated relative to the counterfactual, where it seems likely these developments would have happened anyway.
My answer is that one big early mistake was making contributing to the field seem like an incredibly daunting task, achievable only by the highest tier of geniuses. For a while, it seemed like you’d have to have extensive knowledge of mathematics, computer science, economics, logic, evolutionary biology, and information theory just to have a minor chance of contributing. While you do need to be smart and have some domain knowledge to contribute, and bad work derailing the field in the early days was a real risk, I think this ended up being net negative by scaring off many people who would have been able to add positively to the field. I also think this was compounded by another mistake that is still being made: dismissing research directions as too difficult or impossible without having had nearly enough people take a shot at them.
Q9: Are there any big mistakes that some in the AI safety community are currently making?
I think the biggest mistake has been a shift of focus away from aligning a superintelligence toward aligning a human-level intelligence. It’s not clear that we’ll have a long enough takeoff to benefit from alignment research by human-level AIs, and even if we do, we will still need to be able to demonstrate and evaluate what good research on superintelligence alignment looks like.
A couple of years ago, I was excited by what was at the time called “prosaic alignment”: seeing if we could align the then-state-of-the-art models like GPT-3. Since then, I’ve become worried that that’s been the only part of the field that has grown. While I still believe it’s the second most impactful research area someone could work on, I think on the margin more people should try their hand at making theoretical and conceptual progress.