Progress Measures for Grokking via Mechanistic Interpretability

Grokking: what is it, and why should we care?

In 2022, OpenAI researchers discovered a surprising phenomenon they dubbed grokking: models trained on small algorithmic tasks (like modular addition) will initially memorise the training data, and then, “after a long time”, suddenly learn to generalize to unseen data. From the linked Alignment Forum post: “One of the core claims of mechanistic interpretability is that neural networks can be understood, that rather than being mysterious black boxes they learn interpretable algorithms which can be reverse engineered and comprehended. This work serves as a proof of concept of that, and that reverse engineering models is key to understanding them.”
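To make the setup concrete, below is a minimal sketch of the kind of experiment in which grokking appears. This is not the paper’s exact configuration: a small MLP stands in for the one-layer transformer used by Nanda et al., and the training fraction, weight decay, and step count are illustrative, so the exact timing of the transition will vary.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
p = 113  # modulus used in Nanda et al.

# All (a, b) pairs with label (a + b) mod p; train on a small fraction.
pairs = torch.cartesian_prod(torch.arange(p), torch.arange(p))
labels = (pairs[:, 0] + pairs[:, 1]) % p
perm = torch.randperm(len(pairs))
n_train = int(0.3 * len(pairs))
tr, te = perm[:n_train], perm[n_train:]

class ModAddMLP(nn.Module):
    """Stand-in for the paper's one-layer transformer: embed a and b,
    concatenate the embeddings, and pass them through a small MLP."""
    def __init__(self, d=128, h=512):
        super().__init__()
        self.emb = nn.Embedding(p, d)
        self.mlp = nn.Sequential(nn.Linear(2 * d, h), nn.ReLU(), nn.Linear(h, p))

    def forward(self, x):
        return self.mlp(self.emb(x).flatten(1))

model = ModAddMLP()
# Substantial weight decay is a key ingredient for grokking in these setups.
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1.0)
loss_fn = nn.CrossEntropyLoss()

for step in range(30_000):  # full-batch training
    opt.zero_grad()
    loss = loss_fn(model(pairs[tr]), labels[tr])
    loss.backward()
    opt.step()
    if step % 1_000 == 0:
        with torch.no_grad():
            tr_acc = (model(pairs[tr]).argmax(-1) == labels[tr]).float().mean()
            te_acc = (model(pairs[te]).argmax(-1) == labels[te]).float().mean()
        # The grokking signature: train accuracy reaches ~1.0 early
        # (memorisation), while test accuracy stays near chance for a
        # long time before jumping up (generalisation).
        print(step, loss.item(), tr_acc.item(), te_acc.item())
```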

This week, we’ll recap some background on grokking, double descent and related phenomena. Then we’ll analyze the claims of Nanda et al. and inspect the links between grokking and phase changes. If we have time, we’ll relate these phase changes to the developmental interpretability agenda we covered in a previous session, examining some recent work that calculates the real log canonical threshold (RLCT) for the phases of modular addition.

Neural networks often exhibit emergent behavior, in which qualitatively new capabilities arise from scaling up the number of parameters, training data, or even the number of training steps. One approach to understanding emergence is to find the continuous “progress measures” that underlie the seemingly discontinuous qualitative changes. In this work, we argue that progress measures can be found via mechanistic interpretability: that is, by reverse engineering learned models into components and measuring the progress of each component over the course of training. As a case study, we study small transformers trained on a modular arithmetic task with emergent grokking behavior. We fully reverse engineer the algorithm learned by these networks, which uses discrete Fourier transforms and trigonometric identities to convert addition to rotation about a circle. After confirming the algorithm via ablation, we then use our understanding of the algorithm to define progress measures that precede the grokking phase transition on this task. We see our result as demonstrating both that it is possible to fully reverse engineer trained networks, and that doing so can be invaluable for understanding their training dynamics.

- Progress Measures for Grokking via Mechanistic Interpretability, Nanda et al., 2023
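The “Fourier multiplication” algorithm described in the abstract is simple enough to check by hand. The sketch below computes (a + b) mod p exactly using only cosines, sines, and trig identities; the frequencies here are arbitrary placeholders (in the trained network the key frequencies are learned, and the cos/sin terms live in the embeddings and activations rather than being computed explicitly).

```python
import numpy as np

p = 113                     # modulus from the paper
ks = [14, 35, 41, 52, 73]   # placeholder frequencies; any distinct values in 1..p-1 work

def fourier_mod_add_logits(a, b):
    """Logits over candidate answers c for (a + b) mod p."""
    cs = np.arange(p)
    logits = np.zeros(p)
    for k in ks:
        w = 2 * np.pi * k / p
        # Angle-addition identities recover cos/sin of w(a + b) from
        # products of the per-input cos/sin "embeddings":
        cos_ab = np.cos(w * a) * np.cos(w * b) - np.sin(w * a) * np.sin(w * b)
        sin_ab = np.sin(w * a) * np.cos(w * b) + np.cos(w * a) * np.sin(w * b)
        # The logit for candidate c is cos(w * (a + b - c)), which is
        # maximised exactly when c == (a + b) mod p:
        logits += cos_ab * np.cos(w * cs) + sin_ab * np.sin(w * cs)
    return logits

a, b = 17, 98
pred = int(np.argmax(fourier_mod_add_logits(a, b)))
assert pred == (a + b) % p  # the cosines constructively interfere at the answer
print(pred)                 # 2, since (17 + 98) % 113 == 2
```

Summing over several frequencies mirrors what the trained network does: each key frequency contributes a cosine peaked at the correct answer, the peaks reinforce while off-answer values cancel, and Nanda et al. confirm this picture via ablation.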
