The most popular area of AI safety research by far is mechanistic interpretability. Mechanistic interpretability work is tractable, appealing from both a scientific and a mathematical perspective, easy to teach, and publishable in regular scientific venues, and it requires very little to get started. It’s also… useless?
This month, André Röhm will use a post from Charbel-Raphaël Segerie of France’s CESIA as a framing device to discuss how and whether mechanistic interpretability actually helps reduce existential risk from advanced AI.
Neel Nanda has written A Long List of Theories of Impact for Interpretability, which collects 20 diverse theories of impact. However, I find myself disagreeing with the majority of them. My three big meta-level disagreements are:
- Whatever you want to do with interpretability, you can probably do it better without it.
- Interpretability often attempts to address too many objectives simultaneously.
- Interpretability could be harmful.