Last week, we studied developmental interpretability. One of devinterp’s first targets is Toy Models of Superposition, a paper from Anthropic discussing the phenomenon of superposition / polysemantic neurons. Anthropic show that superposition develops suddenly during training, resulting in a step change in the training loss curve. Singular learning theory suggests that this step change is due to the model switching from one singular configuration to another, choosing a different complexity / accuracy tradeoff. The devinterp team hope to find these singularities and demonstrate a correspondence between phase transitions in the loss curve and phase transitions in the Bayesian posterior. We’ve had a sneak peek at their results and they look very promising indeed!
In this week’s session, Blaine will take us through the Anthropic paper. What is superposition, and why do we care? Why should we expect neurons to be monosemantic in the first place? What makes this paper a good jumping-off point for developmental interpretability? Finally, we’ll discuss the latest devinterp preprint: can singular learning theory deliver?
It would be very convenient if the individual neurons of artificial neural networks corresponded to cleanly interpretable features of the input. For example, in an “ideal” ImageNet classifier, each neuron would fire only in the presence of a specific visual feature, such as the color red, a left-facing curve, or a dog snout. Empirically, in models we have studied, some of the neurons do cleanly map to features. But it isn't always the case that features correspond so cleanly to neurons, especially in large language models where it actually seems rare for neurons to correspond to clean features. This brings up many questions. Why is it that neurons sometimes align with features and sometimes don't? Why do some models and tasks have many of these clean neurons, while they're vanishingly rare in others?
In this paper, we use toy models — small ReLU networks trained on synthetic data with sparse input features — to investigate how and when models represent more features than they have dimensions. We call this phenomenon superposition.
— Elhage, et al., "Toy Models of Superposition", Transformer Circuits Thread, 2022.
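To make "more features than dimensions" concrete before the session, here is a minimal sketch of the kind of toy model the paper describes. It assumes the ReLU output form x' = ReLU(WᵀWx + b) with an importance-weighted squared-error loss, and synthetic features that are zero with probability `sparsity` and otherwise uniform on [0, 1). The function names are our own, the weights are hand-picked rather than trained, and plain Python lists are used only to keep the example dependency-free.

```python
import random

def sample_features(n_features, sparsity, rng):
    # Assumed data model: each feature is 0 with probability `sparsity`,
    # otherwise drawn uniformly from [0, 1).
    return [0.0 if rng.random() < sparsity else rng.random()
            for _ in range(n_features)]

def forward(W, b, x):
    # W has m rows (hidden dimensions) and n columns (features).
    h = [sum(w * xj for w, xj in zip(row, x)) for row in W]   # h = W x
    pre = [sum(W[i][j] * h[i] for i in range(len(W))) + b[j]
           for j in range(len(x))]                            # W^T h + b
    return [max(0.0, p) for p in pre]                         # ReLU

def reconstruction_loss(x, x_hat, importance):
    # Importance-weighted squared error for a single input.
    return sum(imp * (xi - xh) ** 2
               for imp, xi, xh in zip(importance, x, x_hat))

# Hand-picked weights (not trained): two features share ONE hidden
# dimension by pointing in opposite directions.
W = [[1.0, -1.0]]
b = [0.0, 0.0]

print(forward(W, b, [1.0, 0.0]))  # [1.0, 0.0]  feature 0 alone is recovered
print(forward(W, b, [0.0, 1.0]))  # [0.0, 1.0]  feature 1 alone is recovered
print(forward(W, b, [1.0, 1.0]))  # [0.0, 0.0]  both active: they cancel out

# Average loss on sparse synthetic data (sparsity value chosen arbitrarily).
rng = random.Random(0)
samples = [sample_features(2, 0.9, rng) for _ in range(1000)]
mean_loss = sum(reconstruction_loss(x, forward(W, b, x), [1.0, 1.0])
                for x in samples) / len(samples)
print(mean_loss)
```

The three hand-worked inputs show the tradeoff: when only one feature is active, the ReLU filters out the negative interference and reconstruction is exact, but when both are active they collide. The sparser the inputs, the rarer that collision, which is the intuition for why sparsity makes superposition worthwhile.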