Are large language models capable of reasoning? When you ask a large language model for 8 multiplied by 7, does it run a multiplication algorithm, or look the answer up in a memorized multiplication table? Perhaps large language models’ impressive capabilities come not from their intelligence, but from having memorized hundreds of thousands of tasks scattered across their petabytes of training data.
To test this, researchers from MIT and Boston University ran language models on ordinary tasks, then gave each task a twist to see how the models reacted. Their paper is interesting both as a study of how language models generalize to new tasks, and because it evaluates popular language models from multiple companies (OpenAI GPT, Anthropic Claude, Google PaLM-2) across varied domains, with easy-to-read graphs. This Wednesday, Blaine Rogers will lead a session examining the paper, combing it for insights and seeing what questions remain to be answered.
The impressive performance of recent language models across a wide range of tasks suggests that they possess a degree of abstract reasoning skills. Are these skills general and transferable, or specialized to specific tasks seen during pretraining? To disentangle these effects, we propose an evaluation framework based on “counterfactual” task variants that deviate from the default assumptions underlying standard tasks. Across a suite of 11 tasks, we observe nontrivial performance on the counterfactual variants, but nevertheless find that performance substantially and consistently degrades compared to the default conditions. This suggests that while current LMs may possess abstract task-solving skills to a degree, they often also rely on narrow, non-transferable procedures for task-solving. These results motivate a more careful interpretation of language model performance that teases apart these aspects of behavior.
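To make the “counterfactual variant” idea concrete, here is a minimal sketch of the kind of default-versus-twisted task pair the paper studies, using addition in an unfamiliar base as the twist. The `query_model` call mentioned in the comments is a hypothetical stand-in for whichever model API you are evaluating, not anything from the paper.

```python
# Illustrative sketch: the same addition task under the default assumption
# (base 10) and under a counterfactual twist (base 9).

def to_base(n: int, base: int) -> str:
    """Render a non-negative integer in the given base (digits 0-9 only)."""
    if n == 0:
        return "0"
    digits = []
    while n:
        digits.append(str(n % base))
        n //= base
    return "".join(reversed(digits))

def make_prompt(a: int, b: int, base: int) -> tuple[str, str]:
    """Build an addition prompt and its expected answer, both in the given base."""
    prompt = (
        f"You are doing addition in base-{base}. "
        f"What is {to_base(a, base)} + {to_base(b, base)}? Answer with digits only."
    )
    return prompt, to_base(a + b, base)

# Default condition: ordinary base-10 arithmetic.
default_prompt, default_answer = make_prompt(27, 45, base=10)
# Counterfactual condition: the same sum with the rules twisted to base 9.
counterfactual_prompt, counterfactual_answer = make_prompt(27, 45, base=9)

print(default_prompt, "->", default_answer)                # 27 + 45 -> 72
print(counterfactual_prompt, "->", counterfactual_answer)  # 30 + 50 -> 80 (base 9)

# A real harness would send both prompts to the model under test and compare
# accuracy across the two conditions, e.g.:
#   default_acc        = accuracy(query_model(default_prompt), default_answer)
#   counterfactual_acc = accuracy(query_model(counterfactual_prompt), counterfactual_answer)
```

The point of the pairing is that both prompts exercise the same underlying procedure (carrying addition), so a model that genuinely runs the algorithm should degrade far less on the base-9 variant than one that has merely memorized base-10 sums.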