It’s an exciting new research direction for AI safety!
We all have a picture in our heads of how neural network training works: a ball, representing the network weights, rolls down a hill, representing the loss, until it reaches a local minimum. This picture gives us terrible intuitions about what actually happens during neural network training. Singular Learning Theory is an attempt to give us a better picture; Developmental Interpretability is an attempt to use that better picture to answer important questions about when novel capabilities emerge. If we can detect emergence, we can stop training until we work out which capability has emerged, how that capability is implemented in the circuits of the network, and whether training is safe to continue.
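To make the naive picture concrete, here is a minimal gradient descent sketch on a toy quadratic loss. The bowl-shaped loss is a hypothetical example chosen for illustration; nothing about it is specific to SLT or to real networks:

```python
import numpy as np

# The naive picture: a "ball" (the weight vector) rolls downhill
# on the loss surface via gradient descent until it settles in a
# minimum. Toy quadratic loss for illustration only.
def loss(w):
    return np.sum(w ** 2)      # a smooth bowl with a single minimum

def grad(w):
    return 2 * w               # gradient of the toy loss

w = np.array([3.0, -4.0])      # initial weights
lr = 0.1                       # learning rate (step size)
for step in range(100):
    w = w - lr * grad(w)       # roll downhill

print(loss(w))                 # ~0: the ball has settled in the minimum
```

The trouble, as SLT emphasizes, is that real neural network losses are singular: the set of minima is not an isolated point at the bottom of a bowl but a higher-dimensional set with singularities, so the "ball in a bowl" intuition quietly fails.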
This week, Blaine Rogers will sketch an outline of the developmental interpretability research agenda, explaining the basics of singular learning theory and identifying key open problems.
Developmental interpretability is a research agenda that has grown out of a meeting of the Singular Learning Theory (SLT) and AI alignment communities. […] As the name suggests, developmental interpretability (or "devinterp") is inspired by recent progress in the field of mechanistic interpretability, specifically work on phase transitions in neural networks and their relation to internal structure. Our two main motivating examples are the work by Olsson et al. on In-context Learning and Induction Heads and the work by Elhage et al. on Toy Models of Superposition.
Mechanistic interpretability emphasizes features and circuits as the fundamental units of analysis and usually aims at understanding a fully trained neural network. In contrast, developmental interpretability:
◦ is organized around phases and phase transitions as defined mathematically in SLT, and
◦ aims at an incremental understanding of the development of internal structure in neural networks, one phase transition at a time.

The hope is that an understanding of phase transitions, integrated over the course of training, will provide a new way of looking at the computational and logical structure of the final trained network.
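For readers who want a taste of the mathematics being gestured at here: the central quantity in SLT is the learning coefficient $\lambda$ (the real log canonical threshold, RLCT), which appears in Watanabe's asymptotic expansion of the Bayesian free energy. The sketch below states that expansion; the notation ($F_n$ for the free energy at sample size $n$, $L_n$ for the empirical loss, $w_0$ for the optimal parameter, $m$ for the multiplicity) follows Watanabe's textbook treatment:

```latex
% Watanabe's asymptotic expansion of the Bayes free energy:
%   F_n      -- free energy at sample size n
%   L_n(w_0) -- empirical loss at the optimal parameter w_0
%   \lambda  -- learning coefficient (RLCT); m -- its multiplicity
F_n = n L_n(w_0) + \lambda \log n - (m - 1) \log\log n + O_p(1)
```

On this view, roughly speaking, a phase transition is a point during training where the posterior shifts between regions of parameter space with different $(\lambda, m)$, producing a discontinuous change in the leading-order behaviour of the free energy.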