Much ink has been spilled on whether large language models “actually learn” or merely “memorize patterns”. Is ChatGPT a stochastic parrot, or a human simulator? How would we tell the difference?
In a recent paper, Dziri et al. explore how large language models solve compositional tasks—tasks that must be broken down into smaller tasks, which in turn must be broken down into still smaller tasks in order to be solved. They suggest that LLMs are indeed stochastic parrots: they solve these tasks through memorization, becoming increasingly prone to error the more unusual the example or the more complicated the task. As the authors put it, “models are able to correctly perform single-step reasoning, potentially due to memorizing such single-step operations during training, but fail to plan and compose several of these steps for an overall correct reasoning”. They find this persists even if they explicitly fine-tune the models on task-specific data, train the models to use a scratchpad to track work in progress, or train them far beyond overfitting (“grokking”).
In our next session, Blaine will start the new year off by reviewing the main takeaways from this study and relating them back to AI safety. How should this affect our view on AGI timelines? Is this a limitation of the transformer architecture that can be easily overcome? What does this mean for people who want to use large language models for mundane utility? All this and more in the first benkyoukai of 2024.
Transformer large language models (LLMs) have sparked admiration for their exceptional performance on tasks that demand intricate multi-step reasoning. Yet, these models simultaneously show failures on surprisingly trivial problems. This begs the question: Are these errors incidental, or do they signal more substantial limitations? In an attempt to demystify transformer LLMs, we investigate the limits of these models across three representative compositional tasks—multi-digit multiplication, logic grid puzzles, and a classic dynamic programming problem. These tasks require breaking problems down into sub-steps and synthesizing these steps into a precise answer. We formulate compositional tasks as computation graphs to systematically quantify the level of complexity, and break down reasoning steps into intermediate sub-procedures. Our empirical findings suggest that transformer LLMs solve compositional tasks by reducing multi-step compositional reasoning into linearized subgraph matching, without necessarily developing systematic problem-solving skills. To round off our empirical study, we provide theoretical arguments on abstract multi-step reasoning problems that highlight how autoregressive generations’ performance can rapidly decay with increased task complexity.
- Faith and Fate: Limits of Transformers on Compositionality, Dziri et al. 2023
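To make the abstract's "computation graph" framing concrete, here is a minimal illustrative sketch (not the paper's code) of how a compositional task like multi-digit multiplication can be laid out as a graph of single-digit sub-operations, whose size and depth then serve as rough proxies for task complexity. The node names and helper functions are hypothetical, chosen only for illustration.

```python
# Illustrative sketch: long multiplication as a computation graph whose
# nodes are single-digit sub-operations. Node names ("mul_i_j", "partial_i",
# "result") are hypothetical and not taken from the paper.

def multiplication_graph(x: int, y: int) -> dict[str, list[str]]:
    """Return a DAG mapping each intermediate node to the nodes it depends on."""
    xd = [int(d) for d in str(x)][::-1]  # digits of x, least significant first
    yd = [int(d) for d in str(y)][::-1]  # digits of y, least significant first
    graph: dict[str, list[str]] = {}
    partials = []
    for i, _dy in enumerate(yd):
        # One partial product per digit of y, built from digit-level multiplies.
        digit_nodes = []
        for j, _dx in enumerate(xd):
            node = f"mul_{i}_{j}"      # single-digit multiply at position i + j
            graph[node] = []           # leaf: depends only on the input digits
            digit_nodes.append(node)
        partial = f"partial_{i}"       # row sum of that partial product (with carries)
        graph[partial] = digit_nodes
        partials.append(partial)
    graph["result"] = partials         # final summation of the shifted partials
    return graph

def graph_depth(graph: dict[str, list[str]], node: str = "result") -> int:
    """Longest dependency chain ending at `node`: a rough proxy for reasoning depth."""
    return 1 + max((graph_depth(graph, d) for d in graph[node]), default=0)

if __name__ == "__main__":
    g = multiplication_graph(87, 62)
    # Node count grows with the number of digit-level sub-steps; depth grows
    # with how many of them must be chained before the final answer.
    print(len(g), "nodes, depth", graph_depth(g))
```

The point of the sketch is only that the number of nodes and the depth of this graph grow quickly with the number of digits, which is the kind of complexity scaling the paper uses to show where transformer performance collapses.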