Neural network interpretability is important. A lot of alignment research involves staring at the inscrutable matrices and speculating about how they “actually work”. Until now, interpretability work typically stopped once the interpretation and the neural network behaved the same way: if your interpretation performs the same as the network on the test set, they must be the same, right?
The Redwood Research team says: no! It’s really easy to write an algorithm that performs the same as the neural network on the test set: make a big lookup table. No one thinks that a big test-set lookup table is a good interpretation of a neural network. There’s more to “being a good interpretation” than just matching performance on the test set; your interpretation should “explain” what the network weights “are doing”. The causal scrubbing article gives a much more formal definition of what it means to “interpret” a network, along with an algorithm and a set of tools that check whether a given interpretation is accurate. This seems really useful for interpretability research!
This week we’ll learn to do causal scrubbing, discuss the advantages and caveats of this approach, and see if we can use this framework to resolve ongoing disagreements about whether an agent needs a “world model” to be dangerous.
This post introduces causal scrubbing, a principled approach for evaluating the quality of mechanistic interpretations. The key idea behind causal scrubbing is to test interpretability hypotheses via behavior-preserving resampling ablations. We apply this method to develop a refined understanding of how a small language model implements induction and how an algorithmic model correctly classifies if a sequence of parentheses is balanced.
— https://www.alignmentforum.org/s/h95ayYYwMebGEYN5y/p/JvZhhzycHu2Yd57RN
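To make “behavior-preserving resampling ablations” a bit more concrete, here is a toy sketch of the check. This is purely our own illustration, not Redwood’s code: the tiny hand-built network, the hypothesis, and the loss metric are all invented for the example. The idea is that a hypothesis about which activations carry the behaviour licenses certain swaps, namely replacing activations with ones recomputed from other inputs the hypothesis treats as interchangeable, and a good hypothesis should leave the average loss roughly unchanged after those swaps.

```python
# Toy illustration of a causal-scrubbing-style check (our own sketch, not
# Redwood's implementation). A tiny hand-built "network" solves "is sum(x) > 0?",
# and a hypothesis claims which hidden units carry that behaviour.
import numpy as np

rng = np.random.default_rng(0)

# Hand-built 2-layer model: y = sigmoid(w2 @ relu(W1 @ x)).
W1 = np.array([[ 1.0,  1.0,  1.0],   # hidden 0: relu(sum(x))
               [-1.0, -1.0, -1.0],   # hidden 1: relu(-sum(x))
               [ 0.3, -0.7,  0.2]])  # hidden 2: some unrelated feature
w2 = np.array([4.0, -4.0, 0.05])     # output leans on hidden 0 and 1

def hidden(x):
    return np.maximum(W1 @ x, 0.0)

def model_from_hidden(h):
    return 1.0 / (1.0 + np.exp(-w2 @ h))

def model(x):
    return model_from_hidden(hidden(x))

def label(x):                         # the task: is the sum of inputs positive?
    return float(x.sum() > 0)

def loss(pred, y):                    # squared error, for simplicity
    return (pred - y) ** 2

# Hypothesis H: "hidden units 0 and 1 encode the sign of sum(x), which is all
# the output uses; hidden unit 2 is irrelevant to the behaviour."
def equivalent_under_H(x_a, x_b):     # inputs H treats as interchangeable
    return label(x_a) == label(x_b)

def scrubbed_output(x, dataset):
    """Recompute the output with resampled activations, constrained only by H:
    hidden 0/1 come from a random input H calls equivalent to x, and hidden 2
    comes from an arbitrary input (H says it doesn't matter)."""
    equiv_pool = [z for z in dataset if equivalent_under_H(x, z)]
    donor_relevant = equiv_pool[rng.integers(len(equiv_pool))]
    donor_irrelevant = dataset[rng.integers(len(dataset))]
    h = hidden(x).copy()
    h[:2] = hidden(donor_relevant)[:2]
    h[2] = hidden(donor_irrelevant)[2]
    return model_from_hidden(h)

dataset = [rng.normal(size=3) for _ in range(2000)]
orig_loss = np.mean([loss(model(x), label(x)) for x in dataset])
scrub_loss = np.mean([loss(scrubbed_output(x, dataset), label(x)) for x in dataset])
print(f"original loss: {orig_loss:.4f}   scrubbed loss: {scrub_loss:.4f}")
# If H captures what the network is doing, scrubbing barely changes the average
# loss; a hypothesis that scrubs away something the model actually relies on
# (e.g. one that calls hidden 0 irrelevant) would fail this check.
```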