A few months ago we discussed shard theory and activation editing. Turner, Pope, Udell et al. measured the difference in activations between two forward passes of a network to identify wedding vectors and cheese vectors, such that adding a wedding vector to an LLM’s activations makes it talk more about weddings and adding a cheese vector to a maze-solving network makes it seek cheese more. Their results were suggestive but unconvincing; it seemed like you could get the same effect by adding the word ‘wedding’ to the input or drawing another cheese in the maze. Now, hot off the presses, a new paper from an entirely different team promises to make good on activation editing’s lofty goals. They claim to find directions in activation space representing concepts like honesty, harmlessness, and fairness, representations which are correlated, causal, necessary, and sufficient. Let’s discuss RepE, contrasting it with last week’s dictionary learning paper and asking: is alignment solved? Can we all go home now?
In this paper, we identify and characterize the emerging area of representation engineering (RepE), an approach to enhancing the transparency of AI systems that draws on insights from cognitive neuroscience. RepE places population-level representations, rather than neurons or circuits, at the center of analysis, equipping us with novel methods for monitoring and manipulating high-level cognitive phenomena in deep neural networks (DNNs). We provide baselines and an initial analysis of RepE techniques, showing that they offer simple yet effective solutions for improving our understanding and control of large language models. We showcase how these methods can provide traction on a wide range of safety-relevant problems, including honesty, harmlessness, power-seeking, and more, demonstrating the promise of top-down transparency research. We hope that this work catalyzes further exploration of RepE and fosters advancements in the transparency and safety of AI systems.
— Representation Engineering, Zou et al. 2023
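To make the activation-editing idea above concrete, here is a minimal sketch of contrastive activation addition: run the model on a contrast pair of prompts, take the difference of their activations at one layer, and add that difference back into the residual stream during generation. This is not the authors’ code; it assumes a Hugging Face GPT-2 model, and the layer index, contrast pair, and steering coefficient are illustrative choices.

```python
# Sketch of activation addition: steer GPT-2 with a "wedding vector"
# computed as the difference of activations on a contrast pair.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

LAYER = 6    # which transformer block to read from / steer (illustrative)
COEFF = 4.0  # steering strength (illustrative)

def block_output(prompt: str) -> torch.Tensor:
    """Residual-stream activations after block LAYER for `prompt`."""
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        hs = model(ids, output_hidden_states=True).hidden_states
    return hs[LAYER + 1]  # hs[0] is the embedding output; hs[i+1] follows block i

# The "wedding vector": difference of activations on a contrast pair,
# averaged over token positions so it can be added at any position.
steering_vec = block_output(" wedding").mean(dim=1) - block_output(" ").mean(dim=1)

def add_vector(module, inputs, output):
    # GPT-2 blocks return a tuple whose first element is the hidden states;
    # returning a modified tuple overrides the block's output.
    return (output[0] + COEFF * steering_vec,) + output[1:]

handle = model.transformer.h[LAYER].register_forward_hook(add_vector)
try:
    prompt_ids = tokenizer("I went up to my friend and said", return_tensors="pt").input_ids
    out = model.generate(prompt_ids, max_new_tokens=30, do_sample=False)
    print(tokenizer.decode(out[0]))
finally:
    handle.remove()  # restore the unsteered model
```

The key point of contrast with simply adding ‘wedding’ to the prompt is that the vector is injected mid-network, at a chosen layer and strength, without changing the input tokens at all.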