
Measuring Feature Sensitivity Using Dataset Filtering

The Monosemanticity papers marked a major step forward in mechanistic interpretability, showing that when certain features activate, they often represent human-understandable concepts. However, a recent update from Anthropic suggests that these methods have low sensitivity: features often fail to activate even when the associated concept is present.

This raises new questions about how well our interpretability methods capture model representations—and what it means for circuits and superposition. Join us for a critical analysis of the latest AI safety insights!

Feature sensitivity—the likelihood that a learned feature activates when a concept is present—remains difficult to measure with interpretability techniques. While specificity can be assessed by inspecting dataset examples where a feature is active, sensitivity requires reliably identifying concept presence across large corpora. We introduce a dataset filtering approach using Claude’s relevance ratings to approximate sensitivity, allowing for targeted evaluation without requiring exhaustive concept definitions.
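To make the filtering idea concrete, here is a minimal Python sketch of how such an evaluation could be wired together. It is not the authors' implementation: the rating prompt, the 0–3 relevance scale, and the `get_feature_activation` helper (standing in for whatever sparse-autoencoder machinery returns a feature's activation on a text) are illustrative assumptions, and only the Anthropic messages API call reflects a real interface.

```python
# Sketch: approximate feature sensitivity by filtering a dataset with Claude
# relevance ratings, then checking how often the feature fires on the texts
# rated as concept-relevant. Prompt wording, rating scale, and the
# get_feature_activation callable are hypothetical stand-ins.
from typing import Callable

import anthropic

client = anthropic.Anthropic()  # expects ANTHROPIC_API_KEY in the environment


def rate_relevance(text: str, concept: str) -> int:
    """Ask Claude to rate how relevant `concept` is to `text` on a 0-3 scale."""
    prompt = (
        f"On a scale of 0 (unrelated) to 3 (central), how relevant is the "
        f"concept '{concept}' to the following text? "
        f"Reply with a single digit.\n\n{text}"
    )
    message = client.messages.create(
        model="claude-3-5-sonnet-latest",  # illustrative model choice
        max_tokens=5,
        messages=[{"role": "user", "content": prompt}],
    )
    return int(message.content[0].text.strip()[0])


def estimate_sensitivity(
    texts: list[str],
    concept: str,
    get_feature_activation: Callable[[str], float],
    relevance_threshold: int = 3,
    activation_threshold: float = 0.0,
) -> float:
    """Fraction of concept-relevant texts on which the feature actually activates."""
    relevant = [t for t in texts if rate_relevance(t, concept) >= relevance_threshold]
    if not relevant:
        return float("nan")  # no relevant examples found in this sample
    fired = sum(get_feature_activation(t) > activation_threshold for t in relevant)
    return fired / len(relevant)
```

A sensitivity estimate near 1.0 would mean the feature reliably fires whenever Claude judges the concept to be central to the text; the low values reported in the update correspond to the feature remaining silent on many such examples.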

Our results show that individual features often fail to activate even when a concept is highly relevant, suggesting that multiple features collectively represent broader topics. This raises questions about the completeness of current feature attribution methods and highlights the need for interpretability techniques that better capture feature interactions and the distributed nature of conceptual representations in AI models.

Nicholas L. Turner, Adam Jermyn, and Joshua Batson, “Measuring Feature Sensitivity Using Dataset Filtering,” 2024. Summary by Claude.
