
Why Agent Foundations?

Alignment is hard. Research directly aimed at the “big problems” (incorrigibility, misaligned convergent instrumental subgoals, etc.) seems disconnected from the practicalities of how AI research actually gets done. In the early days of alignment research, much ink was spilled over how to correctly specify utility functions for reinforcement learning agents. Large language models, “the closest thing we have to AGI”, came from a totally different paradigm, making that effort seem silly in retrospect.

Given that framing, why do many alignment researchers continue to pursue “foundational” topics instead of more tractable approaches like mechanistic interpretability, model evaluations, or compute governance? In his article “Why Agent Foundations?”, John Wentworth attempts to answer this question; we’ll use the article as a jumping-off point to survey the work being done in agent foundations and to ask whether too much or too little agent foundations research is being done given the rest of the safety research landscape.

Let’s say you’re relatively new to the field of AI alignment. You notice a certain cluster of people in the field who claim that no substantive progress is likely to be made on alignment without first solving various foundational questions of agency. These sound like a bunch of weird pseudophilosophical questions, like “what does it mean for some chunk of the world to do optimization?”, or “how does an agent model a world bigger than itself?”, or “how do we ‘point’ at things?”, or in my case “how does abstraction work?”. You feel confused about why otherwise-smart-seeming people expect these weird pseudophilosophical questions to be unavoidable for engineering aligned AI. You go look for an explainer, but all you find is bits and pieces of worldview scattered across many posts, plus one post which does address the question but does so entirely in metaphor. Nobody seems to have written a straightforward explanation for why foundational questions of agency must be solved in order to significantly move the needle on alignment.

This post is an attempt to fill that gap. In my judgment, it mostly fails; it explains the abstract reasons for foundational agency research, but in order to convey the intuitions, it would need to instead follow the many paths by which researchers actually arrive at foundational questions of agency. But a better post won’t be ready for a while, and maybe this one will prove useful in the meantime.

Note that this post is not an attempt to address people who already have strong opinions that foundational questions of agency don't need to be answered for alignment; it's just intended as an explanation for those who don't understand what's going on.

- Wentworth, John. “Why Agent Foundations? An Overly Abstract Explanation.” AI Alignment Forum, March 2022.
