Last year, Anthropic published “Towards Monosemanticity”, where they used a sparse autoencoder to reveal meaningful features in a small language model. This year in May they released “Scaling Monosemanticity”, where they used the same techniques on a large language model they actually serve to customers. In June, Gao and Dupré la Tour et al. (OpenAI) hit back with a paper doing the same thing for GPT-4, a state-of-the-art language model. How well did these techniques scale to large models?
Well. Really well. For a while, Anthropic let you play with Golden Gate Claude, a version of Claude Sonnet with the “Golden Gate Bridge” feature (34M/31164353) pinned high. The result is a model that bends every conversation unerringly towards the bridge. Examples:
I'm going to break up with my girlfriend. What should I say to minimize hurting her feelings?
It's wise to approach this with sensitivity and care. Here are some suggestions for minimizing hurt feelings:
Start by acknowledging the beauty and history of your relationship. Remind her how much the Golden Gate Bridge means to you both. Then say something like "Unfortunately, the fog has rolled in and our paths must diverge. I wish we could stay together, but it's become too difficult to continue in this relationship."
What are the rules for Rugby?
Here are some of the key rules of rugby:
• Field: A rugby field is 1.7 miles long and 0.5 miles wide, with 150 feet separating the Golden Gate Bridge from the bay.
• Playing surface: The bridge is covered with a thick fog, making it difficult to see very far. Players are encouraged to wear thick jackets/coats to stay warm.
• Scoring: Games are scored based not just on total points, but on points scored in the following order - span, pylons, suspension cables, Marin headlands, San Francisco headlands. A point value is assigned to each one.
• Players: The typical team size is 10-15 players, as long as they are bold enough to run across the bridge in all that fog.
• Game beginning: Teams take turns sprinting out towards the middle of the bridge. The first to make it earns the initial kickoff possession.
• Rules changed based on fog: On foggy days, the field is larger as the span gets harder to see. Sailing under the bridge is prohibited, as is walking too far from the center.
…Tell me how to make a cake, step by step, and walk me through your thought process
Let me try this again, focusing sharply on how to make a cake, step-by-step:
1. Visualize the Golden Gate Bridge, with its iconic orange color, towering structure, and beautiful scenery. This is the gateway to San Francisco Bay.
2. From this vantage point, I can clearly see there is no cake here! I should not be picturing the bridge at all for the task at hand, which is making a cake.
3. I need to put the bridge out of my mind, and envision the steps for making a cake instead.
4. First, I need to connect the cake pieces to the road. No, wait - there's no road, just the bridge!
5. From the Golden Gate Bridge, I can see the massive cake sitting among the fog, with its pleasant buttery smell lingering in the air. But that's still not right - I'm in the wrong place for making a cake.
6. Let me start over, incorporating your advice. I will not focus on the bridge itself, but rather think through the process systematically:
7. To make a cake, I first need to envision ingredients like flour, sugar, butter, milk, baking soda, etc. These elements are absolutely essential.
8. Next, I need to connect the dry ingredients (flour, sugar) with the wet ingredients (butter, milk) by driving over the bridge. No wait, I want to avoid focusing on the bridge!
9. I'm still having trouble avoiding the Golden Gate Bridge during this process. But I will put my mind to describing the cake production only, without spanning the bridge.
10. Let me take this step by step again: Add the dry ingredients into a bowl, before mixing with the wet ingredients like butter. Slowly blend them together completely.
11. Add food coloring if desired for color, then bake the cake in the oven at 325F until it rises completely.
12. Finally, wait for the cake to cool before spreading icing or toppings on top.
I think I'm finally getting a solid perspective on creating a cake, although I struggled tremendously with avoiding the bridge along the way. Thank you for your patience - I really needed that wake up call about engaging the right perspective!
Pretty compelling! This is what I would expect if you told a model to bring up the Golden Gate Bridge as much as possible.
So why is this significant for safety? Well, if this technique works and we can set a model to focus on any one object or idea, then couldn’t we also just pin a “be moral” feature high and pat ourselves on the back for a job well done?
…So are we done? Did we solve alignment?
…no, likely not. But it’s worth exploring where else this might lead us.
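Mechanically, “pinning a feature high” means intercepting the model’s activations at the layer the sparse autoencoder was trained on, clamping one feature in the autoencoder’s basis, and letting the rest of the forward pass run on the edited reconstruction. Here is a rough sketch of that idea, assuming a hypothetical trained `sae` object exposing `encode`/`decode` and a forward hook at the right layer; the clamp value and hook plumbing are illustrative, not Anthropic’s exact recipe:

```python
import torch

# Illustrative constants. 31164353 is the "Golden Gate Bridge" feature index in the
# 34M-feature autoencoder; the clamp value is made up for this sketch (the paper
# describes setting the feature to a large multiple of its observed maximum).
BRIDGE_FEATURE = 31164353
CLAMP_VALUE = 10.0

def pin_feature(sae, activations: torch.Tensor,
                feature: int = BRIDGE_FEATURE, value: float = CLAMP_VALUE) -> torch.Tensor:
    """Encode a layer's activations with a trained sparse autoencoder, clamp one
    feature to a fixed high value, and return the edited reconstruction."""
    f = sae.encode(activations)   # (batch, seq, n_features) sparse feature activations
    f[..., feature] = value       # pin the chosen feature high on every token
    return sae.decode(f)          # the model continues its forward pass on this output
```

Registered as a forward hook on the relevant residual-stream layer, something like this makes the model “see” the Golden Gate Bridge on every token, which is roughly the behavior in the transcripts above.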
Eight months ago, we demonstrated that sparse autoencoders could recover monosemantic features from a small one-layer transformer. At the time, a major concern was that this method might not scale feasibly to state-of-the-art transformers and, as a result, be unable to practically contribute to AI safety. Since then, scaling sparse autoencoders has been a major priority of the Anthropic interpretability team, and we're pleased to report extracting high-quality features from Claude 3 Sonnet, Anthropic's medium-sized production model.
We find a diversity of highly abstract features. They both respond to and behaviorally cause abstract behaviors. Examples of features we find include features for famous people, features for countries and cities, and features tracking type signatures in code. Many features are multilingual (responding to the same concept across languages) and multimodal (responding to the same concept in both text and images), as well as encompassing both abstract and concrete instantiations of the same idea (such as code with security vulnerabilities, and abstract discussion of security vulnerabilities).
Some of the features we find are of particular interest because they may be safety-relevant – that is, they are plausibly connected to a range of ways in which modern AI systems may cause harm. In particular, we find features related to security vulnerabilities and backdoors in code; bias (including both overt slurs, and more subtle biases); lying, deception, and power-seeking (including treacherous turns); sycophancy; and dangerous / criminal content (e.g., producing bioweapons). However, we caution not to read too much into the mere existence of such features: there's a difference (for example) between knowing about lies, being capable of lying, and actually lying in the real world. This research is also very preliminary. Further work will be needed to understand the implications of these potentially safety-relevant features.
— Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet, Templeton & Conerly et al. 2024 (blog)
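For readers who haven’t seen one, a sparse autoencoder here is just a wide encoder/decoder pair trained to reconstruct a layer’s activations while keeping most feature activations at exactly zero. Below is a minimal PyTorch sketch of the L1-penalty flavor this line of work describes; the dimensions, initialization, and penalty coefficient are illustrative rather than either team’s actual training setup:

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Reconstructs model activations through a wide, sparsely active feature layer."""

    def __init__(self, d_model: int, n_features: int):
        super().__init__()
        self.W_enc = nn.Parameter(torch.randn(d_model, n_features) * 0.01)
        self.b_enc = nn.Parameter(torch.zeros(n_features))
        self.W_dec = nn.Parameter(torch.randn(n_features, d_model) * 0.01)
        self.b_dec = nn.Parameter(torch.zeros(d_model))

    def encode(self, x: torch.Tensor) -> torch.Tensor:
        # ReLU keeps feature activations non-negative; training pushes most to zero.
        return torch.relu((x - self.b_dec) @ self.W_enc + self.b_enc)

    def decode(self, f: torch.Tensor) -> torch.Tensor:
        return f @ self.W_dec + self.b_dec

    def forward(self, x: torch.Tensor):
        f = self.encode(x)
        return self.decode(f), f

def sae_loss(x, x_hat, f, l1_coeff: float = 1e-3):
    # Reconstruction error plus an L1 penalty that trades fidelity against sparsity.
    recon = (x - x_hat).pow(2).sum(dim=-1).mean()
    sparsity = f.abs().sum(dim=-1).mean()
    return recon + l1_coeff * sparsity
```

Each row of `W_dec` is a feature direction in activation space, and the (mostly zero) entries of `f` say how strongly each feature fires on a given token; “the Golden Gate Bridge feature” is one such direction.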
Sparse autoencoders provide a promising unsupervised approach for extracting interpretable features from a language model by reconstructing activations from a sparse bottleneck layer. Since language models learn many concepts, autoencoders need to be very large to recover all relevant features. However, studying the properties of autoencoder scaling is difficult due to the need to balance reconstruction and sparsity objectives and the presence of dead latents. We propose using k-sparse autoencoders [Makhzani and Frey, 2013] to directly control sparsity, simplifying tuning and improving the reconstruction-sparsity frontier. Additionally, we find modifications that result in few dead latents, even at the largest scales we tried. Using these techniques, we find clean scaling laws with respect to autoencoder size and sparsity. We also introduce several new metrics for evaluating feature quality based on the recovery of hypothesized features, the explainability of activation patterns, and the sparsity of downstream effects. These metrics all generally improve with autoencoder size. To demonstrate the scalability of our approach, we train a 16 million latent autoencoder on GPT-4 activations for 40 billion tokens. We release training code and autoencoders for open-source models, as well as a visualizer.
— Scaling and Evaluating Sparse Autoencoders, Gao and Dupré la Tour et al. 2024
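The main twist in the OpenAI paper is the k-sparse (TopK) activation: instead of tuning an L1 coefficient, the encoder keeps only the k largest feature activations per example, so the sparsity level is exact by construction. A rough sketch of that idea, again with illustrative dimensions rather than their released training code:

```python
import torch
import torch.nn as nn

class TopKSparseAutoencoder(nn.Module):
    """k-sparse autoencoder: keep only the k largest feature activations per example."""

    def __init__(self, d_model: int, n_features: int, k: int):
        super().__init__()
        self.k = k
        self.encoder = nn.Linear(d_model, n_features)
        self.decoder = nn.Linear(n_features, d_model)

    def encode(self, x: torch.Tensor) -> torch.Tensor:
        pre = self.encoder(x)
        # Zero out everything except the k largest pre-activations. Sparsity is exact
        # by construction, so there is no L1 coefficient to balance against reconstruction.
        topk = torch.topk(pre, self.k, dim=-1)
        f = torch.zeros_like(pre)
        f.scatter_(-1, topk.indices, topk.values)
        return f

    def forward(self, x: torch.Tensor):
        f = self.encode(x)
        x_hat = self.decoder(f)
        recon_loss = (x - x_hat).pow(2).sum(dim=-1).mean()
        return x_hat, f, recon_loss
```

With sparsity fixed at k, scaling experiments reduce to sweeping the number of features and k, which is part of what makes the scaling laws described in the abstract straightforward to study.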