One of the surprising results of scaling research is that LLMs seem to have capabilities that emerge with scale. If an LLM has fewer than, say, a billion parameters, it can't do two-digit arithmetic; once it exceeds a billion parameters, the ability to do two-digit arithmetic spontaneously emerges. Or so it is often claimed! In their latest preprint, Rylan Schaeffer, Brando Miranda, and Sanmi Koyejo demonstrate that "emergence" is often an illusion caused by bad model evaluation. If you say a model "can do arithmetic" only when it gets 10/10 questions correct, then you cannot distinguish a model that scores 9/10 from one that scores 0/10: by your metric, neither can do arithmetic. When your model improves from 9/10 to 10/10, a capability appears to emerge suddenly!
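The threshold illusion is easy to simulate. The sketch below (my own illustrative numbers, not data from the preprint) takes a hypothetical model family whose per-question accuracy improves smoothly with scale, then scores it two ways: a continuous metric (expected fraction correct) and an all-or-nothing metric (probability of answering all 10 questions correctly). The smooth metric shows gradual improvement; the all-or-nothing metric stays near zero and then shoots up.

```python
import numpy as np

# Hypothetical model family: per-question accuracy improves smoothly with scale.
scales = np.array([1e8, 3e8, 1e9, 3e9, 1e10])
per_question_acc = np.array([0.5, 0.7, 0.9, 0.95, 1.0])

n_questions = 10

# Continuous metric: expected fraction of questions answered correctly.
avg_score = per_question_acc  # smooth in scale

# Discontinuous metric: "can do arithmetic" only if all 10 are correct
# (assuming independent errors across questions).
p_all_correct = per_question_acc ** n_questions

for s, avg, hard in zip(scales, avg_score, p_all_correct):
    print(f"scale={s:.0e}  avg={avg:.2f}  P(10/10)={hard:.3f}")
```

With these numbers, P(10/10) is below 0.03 for the first two scales and then jumps (0.35, 0.60, 1.00) even though the underlying accuracy improved steadily the whole time.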
Does this have actual consequences for safety, or is it just a semantic game?
Recent work claims that large language models display emergent abilities, abilities not present in smaller-scale models that are present in larger-scale models. What makes emergent abilities intriguing is two-fold: their sharpness, transitioning seemingly instantaneously from not present to present, and their unpredictability, appearing at seemingly unforeseeable model scales. Here, we present an alternative explanation for emergent abilities: that for a particular task and model family, when analyzing fixed model outputs, one can choose a metric which leads to the inference of an emergent ability or another metric which does not. Thus, our alternative suggests that existing claims of emergent abilities are creations of the researcher's analyses, not fundamental changes in model behavior on specific tasks with scale. We present our explanation in a simple mathematical model, then test it in three complementary ways: we (1) make, test and confirm three predictions on the effect of metric choice using the InstructGPT/GPT-3 family on tasks with claimed emergent abilities, (2) make, test and confirm two predictions about metric choices in a meta-analysis of emergent abilities on BIG-Bench; and (3) show how similar metric decisions suggest apparent emergent abilities on vision tasks in diverse deep network architectures (convolutional, autoencoder, transformers). In all three analyses, we find strong supporting evidence that emergent abilities may not be a fundamental property of scaling AI models.
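The "simple mathematical model" mentioned in the abstract can be sketched as follows (my parameterization is illustrative, not the paper's): suppose per-token accuracy p(N) rises smoothly and predictably with parameter count N. A linear metric like per-token accuracy then also rises smoothly, but a nonlinear metric like exact string match on an L-token answer scales as p(N)^L, which looks flat and then sharply "emergent".

```python
import numpy as np

# Assumed smooth scaling curve for per-token accuracy (illustrative only).
N = np.logspace(7, 11, 9)           # model sizes: 1e7 .. 1e11 parameters
p = 1 - 0.5 * (N / 1e7) ** -0.3     # smooth, saturating per-token accuracy

L = 5                                # answer length in tokens
token_acc = p                        # linear metric: smooth in N
exact_match = p ** L                 # nonlinear metric: all L tokens must be right

for n, ta, em in zip(N, token_acc, exact_match):
    print(f"N={n:.0e}  token_acc={ta:.3f}  exact_match={em:.3f}")
```

The same fixed model outputs produce either a gradual curve or an apparent phase transition, depending purely on which metric the researcher plots.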