As large language models (LLMs) become increasingly powerful and ubiquitous, it is crucial to develop methods for monitoring and controlling their capabilities. One such capability, considered essential for any generally capable agent, is planning (also referred to as "internal search"). While current systems are not yet capable of arbitrary planning, there is compelling evidence that they manifest this capability to a sufficient extent to warrant attempts at a deeper understanding.
To delve into this topic, our next guest speaker, Alex Spies, will discuss his recent work on understanding transformers in the context of maze-solving tasks. Alex is a PhD student at Imperial College London who is currently in Tokyo as a Research Fellow at the National Institute of Informatics (NII). The work he will discuss is part of an ongoing collaboration and will focus on the paper "Structured World Representations in Maze-Solving Transformers."
During this discussion, Alex will provide an overview of mechanistic interpretability tools and the approaches researchers employ to "reverse engineer" transformer models. He will then explain how his team used some of these techniques to uncover emergent structures in the models they trained and how these structures may facilitate a systematic understanding of internal search processes.
The talk aims to give attendees a better understanding of the inner workings of transformer models in general, viewed through the lens of their potential for planning and internal search. By gaining a deeper understanding of these capabilities, we can work towards developing more effective methods for monitoring and controlling the behavior of increasingly powerful LLMs.
With the ultimate goal of better understanding how transformer models perform multi-step reasoning in search-like tasks, we apply interpretability methods to toy models trained to solve maze tasks. In particular, we experiment with autoregressive transformers trained to solve mazes represented as a list of tokens, which constitutes an offline reinforcement learning task with global observations. By varying the precise configurations of these maze-solving tasks, we are able to investigate the conditions under which models tend to learn representations with varying degrees of interpretability and generalizability. Additionally, while prior work has found that transformers struggle to perform complex planning tasks, we find that relatively small transformers are capable of solving mazes.
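To make the "maze as a list of tokens" setup concrete, the minimal sketch below serializes a maze-solving episode (connectivity, origin, target, and solution path) into a flat token sequence that an autoregressive model could be trained on. The special tokens and ordering here (e.g. `<ADJLIST>`, `<-->`, `<PATH>`) are illustrative assumptions, not necessarily the exact vocabulary used in the paper.

```python
# Illustrative maze-to-token encoding (token names and layout are assumptions,
# not the paper's exact scheme).

def tokenize_maze(edges, origin, target, path):
    """Serialize a maze-solving episode as a flat token list.

    edges:  iterable of ((r1, c1), (r2, c2)) pairs giving maze connectivity
    origin: (r, c) start cell
    target: (r, c) goal cell
    path:   list of (r, c) cells from origin to target
    """
    cell = lambda rc: f"({rc[0]},{rc[1]})"
    tokens = ["<ADJLIST>"]
    for a, b in edges:
        tokens += [cell(a), "<-->", cell(b), ";"]
    tokens += ["<ORIGIN>", cell(origin), "<TARGET>", cell(target), "<PATH>"]
    tokens += [cell(rc) for rc in path]
    tokens.append("<EOS>")
    return tokens


# Example: a 2x2 maze with a single corridor from (0,0) to (1,1).
edges = [((0, 0), (0, 1)), ((0, 1), (1, 1))]
print(tokenize_maze(edges, origin=(0, 0), target=(1, 1),
                    path=[(0, 0), (0, 1), (1, 1)]))
```

Because the full adjacency list appears in the prompt before the path, the model observes the entire maze at once, which is what makes this an offline task with global observations rather than a step-by-step navigation problem.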
We use various interpretability techniques to study our models, finding that the geometry of their embedding space correlates with the spatial structure of the mazes (subsection 3.2). We find that our highest-performing models form a linear representation of the maze connectivity structure, which can be decoded at early layers (subsection 3.4). Lastly, we identify specific attention heads that attend to the valid neighbors of a given state, implicating them in path-following behavior.
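As a rough illustration of what "linearly decodable at early layers" means in practice, the sketch below fits a linear probe to predict whether a pair of cells is connected from a model's residual-stream activations. The layer choice, feature construction, and array shapes are assumptions for illustration (and the activations here are random placeholders), not the paper's exact probing protocol.

```python
# Sketch of a linear connectivity probe (assumed setup; placeholder data).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Assume cached residual-stream activations at some early layer: one
# d_model-dimensional vector per (maze, cell-pair) example, with a binary
# label indicating whether that pair of cells is connected in the maze.
n_examples, d_model = 2000, 128
activations = rng.normal(size=(n_examples, d_model))  # placeholder activations
connected = rng.integers(0, 2, size=n_examples)       # placeholder labels

X_train, X_test, y_train, y_test = train_test_split(
    activations, connected, test_size=0.2, random_state=0)

probe = LogisticRegression(max_iter=1000)
probe.fit(X_train, y_train)

# On real activations, high held-out accuracy would indicate that connectivity
# is linearly decodable at the probed layer.
print(f"probe accuracy: {probe.score(X_test, y_test):.3f}")
```

A probe of this form only tests for a linear readout; it does not by itself show that the model uses the decoded connectivity information downstream, which is why the attention-head analysis is a separate line of evidence.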