Against Almost Every Theory of Impact of Interpretability

The most popular area of AI safety research by far is mechanistic interpretability. Mechanistic interpretability work is tractable, appealing from both a scientific and a mathematical perspective, and easy to teach; it can be published in regular scientific venues and requires very little to get started. It’s also… useless?

This month, André Röhm will use a post from Charbel-Raphaël Segerie of France’s CESIA as a framing device to discuss how and whether mechanistic interpretability actually helps reduce existential risk from advanced AI.

Neel Nanda has written A Long List of Theories of Impact for Interpretability, which lists 20 diverse Theories of Impact. However, I find myself disagreeing with the majority of these theories. The three big meta-level disagreements are:

- Whenever you want to do something with interpretability, it is probably better to do it without it.

- Interpretability often attempts to address too many objectives simultaneously.

- Interpretability could be harmful.

“Against Almost Every Theory of Impact of Interpretability”, Charbel-Raphaël Segerie, Alignment Forum 2023

