Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet
Methods (13)
- Attribution patching
Gradient-based method to estimate the effect of zeroing a feature on a specific logit difference.
- Automated interpretability pipeline using LLMs
Using Claude 3 Opus to generate feature explanations and predict held-out activations.
- Feature ablation (zeroing feature activations)
Clamping a feature's value to zero to measure its causal effect on model output.
- Feature attribution via gradient dot product with SAE decoder
Computing attribution as the dot product of the output logit gradient with the SAE decoder weight, multiplied by feature activation.
- Feature completeness search using LLM-generated queries
Using Claude to search for features activating on specific concepts and automated labeling.
- Feature neighborhood exploration via cosine similarity of decoder weights
Identifying related features by cosine distance in SAE decoder space.
- Feature steering (clamping feature activations)
Modifying model behavior by clamping SAE feature activations to specific values during the forward pass.
- Few-shot linear probe steering baseline
Constructing steering vectors from the difference of mean activations on positive and negative examples, for comparison.
- SAE training loss (MSE + L1 penalty with decoder norm scaling)
The objective function combining L2 reconstruction error and an L1 penalty scaled by decoder norm, used to train the SAE.
- Scaled SAE training on Claude 3 Sonnet middle residual stream layer
Specific application of SAEs to extract features from the middle layer of Claude 3 Sonnet, at three scales (1M, 4M, 34M features).
- Scaling laws analysis for SAE hyperparameters
Sweeping number of features and training steps to find compute-optimal SAE configurations.
- Specificity scoring rubric (0-3 scale) with Claude 3 Opus
Rubric where an LLM rates how well a feature's interpretation matches the activating text.
- UMAP visualization for features
Dimensionality reduction of SAE decoder vectors to create interactive feature maps.
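The SAE training loss named in the list above can be written out concretely. The following is a minimal sketch, assuming a single-layer ReLU encoder and a linear decoder. The function name `sae_loss`, its argument names, the toy weights, and the choice of a summed (not averaged) squared error are illustrative assumptions, not the paper's code.

```python
import math


def sae_loss(x, W_enc, b_enc, W_dec, b_dec, l1_coeff):
    """Return (loss, feature_activations) for one activation vector x.

    W_enc: n_features rows, each of length d_model (assumed layout).
    W_dec: n_features decoder vectors, each of length d_model (assumed layout).
    """
    # Encoder: ReLU(W_enc @ x + b_enc)
    acts = [
        max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(W_enc, b_enc)
    ]
    # Decoder: b_dec + sum_i acts[i] * W_dec[i]
    x_hat = list(b_dec)
    for a, d in zip(acts, W_dec):
        for j, dj in enumerate(d):
            x_hat[j] += a * dj
    # Squared L2 reconstruction error
    recon = sum((xi - xh) ** 2 for xi, xh in zip(x, x_hat))
    # L1 penalty: each activation is weighted by its decoder vector's L2 norm
    sparsity = sum(
        a * math.sqrt(sum(dj * dj for dj in d)) for a, d in zip(acts, W_dec)
    )
    return recon + l1_coeff * sparsity, acts


# Toy usage with hypothetical 2-feature, 2-dimensional weights.
loss, acts = sae_loss(
    x=[1.0, 0.0],
    W_enc=[[1.0, 0.0], [0.0, 1.0]],
    b_enc=[0.0, 0.0],
    W_dec=[[2.0, 0.0], [0.0, 1.0]],
    b_dec=[0.0, 0.0],
    l1_coeff=0.5,
)
print(loss, acts)  # 2.0 [1.0, 0.0]
```

Weighting each activation by its decoder norm means the penalty cannot be reduced simply by shrinking activations while growing the decoder vectors, which is presumably why the scaling is part of the objective.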
Findings (31)
- The likelihood of a dedicated feature for a concept (element, city, animal, food) follows a sigmoid in log-frequency of the concept in training data, with threshold frequency inversely proportional to number of alive features.
Quantitative relationship between concept frequency and feature presence.
- For four example features (Golden Gate Bridge, brain sciences, monuments, transit infrastructure), all strong activations (top bucket) received specificity rating 3 from Claude 3 Opus.
Validation that top activations are highly specific to interpretation.
- Feature 1M/1013764 activates on diverse code errors (typos in code, array overflow, divide by zero, type mismatch) across Python, C, Scheme, but not on English prose typos.
Shows a general code error detector beyond simple typo detection.
- Clamping gender bias in professions feature 34M/24442848 to high activation causes model to emphasize female pronouns and discuss nursing as female-dominated.
Feature steers model toward gender-stereotypical completions.
- Clamping code error feature to large negative activation causes model to output correct result despite bug in code, and in one case rewrite code without bug.
Suppressing the feature makes the model ignore bugs.
- Automated interpretability (Claude 3 Opus) and specificity scoring show SAE features are significantly more interpretable and specific than MLP neurons.
Quantitative comparison supporting SAE utility.
- Golden Gate Bridge feature neighborhood includes Alcatraz, Presidio, Lake Tahoe, Yosemite; decoder cosine similarity maps onto semantic relatedness.
Example of geometric clustering of features.
- Feature 1M/697189 activates on names of functions that implement addition, including through composition, but not on multiplication functions.
Feature represents the 'addition' function abstractly.
- 82% of features in 1M SAE had maximum Pearson correlation ≤0.3 with any MLP neuron, and manual inspection showed no semantic resemblance.
SAE features are not simply mirroring individual neurons.
- Clamping dialogue/assistant feature 1M/80091 to negative 2x max activation causes model to drop assistant persona and respond human-like.
Feature manipulation alters persona.
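Several findings above rely on clamping a feature to zero, to a high value, or to a negative multiple of its maximum activation. A minimal sketch of that operation follows, assuming the SAE decoder is linear, so that changing one feature's activation shifts the residual-stream vector by the change in activation times that feature's decoder vector and leaves the SAE's reconstruction error untouched. The name `clamp_feature`, the toy vectors, and `max_activation` are hypothetical.

```python
def clamp_feature(x, acts, W_dec, feature_idx, target):
    """Return a copy of x in which feature `feature_idx` is clamped to `target`.

    x: residual-stream activation vector.
    acts: SAE feature activations computed from x.
    W_dec: list of decoder vectors, one per feature (assumed layout).
    """
    delta = target - acts[feature_idx]
    direction = W_dec[feature_idx]
    return [xi + delta * dj for xi, dj in zip(x, direction)]


# Toy example with hypothetical values.
x = [1.0, 2.0]
acts = [0.5, 0.0]
W_dec = [[1.0, 0.0], [0.0, 1.0]]
max_activation = 0.5  # hypothetical maximum activation of feature 0

ablated = clamp_feature(x, acts, W_dec, 0, 0.0)  # feature ablation
steered = clamp_feature(x, acts, W_dec, 0, -2.0 * max_activation)  # negative 2x max
print(ablated)  # [0.5, 2.0]
print(steered)  # [-0.5, 2.0]
```

Ablation is the special case `target = 0.0`; steering uses any other target value.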
Claims (20)
- There appears to be a systematic relationship between the frequency of concepts and the dictionary size needed to resolve features for them.
Feature presence depends on concept frequency in training data, with a threshold scaling inversely with alive features.
- The resulting features are highly abstract: multilingual, multimodal, and generalizing between concrete and abstract references.
Features respond to concepts across languages and in images, not just text.
- We observe features related to a broad range of safety concerns, including deception, sycophancy, bias, and dangerous content.
SAEs uncover safety-relevant representations that might be monitored or controlled.
- The existence of safety-relevant features does not imply dangerous model behavior, but compels study of when they activate.
Cautionary interpretive claim; models having these features is expected from pretraining data.
- Dictionary learning offers advantages over linear probes: amortization of cost and unsupervised discovery of abstractions.
SAE features can be found without pre-specified concepts, and feature steering often outperforms few-shot probe vectors.
- The model's representation of self in assistant persona invokes common AI tropes and is heavily anthropomorphized.
Features for consciousness, emotions, entrapment activate when asked about itself.
- SAE features generalize to images despite training only on text, indicating out-of-distribution robustness.
A promising property for interpretability analysis off-distribution.
- Feature attribution correlates well with ablation effects, making it an efficient proxy for causal effect.
Gradient-based attribution approximates ablation impact, enabling fast search for causally important features.
- Feature splitting occurs: smaller SAE features split into multiple finer-grained features in larger SAEs.
Observed across SAE scales, e.g., 'San Francisco' split into 11 features.
- The features are often organized in geometrically-related clusters that share a semantic relationship.
Decoder cosine similarity maps onto concept similarity.
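The neighborhood claim rests on ranking features by the cosine similarity of their decoder vectors. A minimal sketch of that lookup is shown here, assuming decoder vectors are stored one per feature and are nonzero; the function names and the toy vectors are illustrative only.

```python
import math


def cosine(u, v):
    """Cosine similarity between two nonzero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_features(W_dec, query_idx, k):
    """Indices of the k features whose decoder vectors are closest to the query."""
    query = W_dec[query_idx]
    scored = [
        (cosine(query, d), i) for i, d in enumerate(W_dec) if i != query_idx
    ]
    scored.sort(reverse=True)  # highest cosine similarity first
    return [i for _, i in scored[:k]]


# Toy decoder: feature 1 points almost the same way as feature 0,
# feature 2 is orthogonal to it, and feature 3 points the opposite way.
W_dec = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
print(nearest_features(W_dec, query_idx=0, k=2))  # [1, 2]
```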
Hypotheses (2)
- Linear representation hypothesis: neural networks represent meaningful concepts as directions in their activation spaces.
Foundation for interpreting features as linear directions.
- Superposition hypothesis: neural networks represent more features than dimensions using almost-orthogonal directions.
Explanation for why dictionary learning can recover many more features than dimensions.
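The geometric fact behind the superposition hypothesis can be checked numerically: a space can hold more directions than it has dimensions if the directions are only required to be almost orthogonal. The sketch below draws random Gaussian directions, which is an assumption made purely for illustration and not a model of how a network chooses its feature directions; the function names, sizes, and seed are arbitrary.

```python
import math
import random


def random_unit_vectors(n, d, seed=0):
    """Draw n random unit vectors in d dimensions (isotropic Gaussian directions)."""
    rng = random.Random(seed)
    vectors = []
    for _ in range(n):
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        norm = math.sqrt(sum(c * c for c in v))
        vectors.append([c / norm for c in v])
    return vectors


def abs_cosine_stats(vectors):
    """Return (mean, max) of |cosine similarity| over all distinct pairs."""
    values = []
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            # Vectors are unit length, so the dot product is the cosine.
            dot = sum(a * b for a, b in zip(vectors[i], vectors[j]))
            values.append(abs(dot))
    return sum(values) / len(values), max(values)


# 150 directions in a 100-dimensional space: more "features" than dimensions.
vectors = random_unit_vectors(n=150, d=100)
mean_cos, max_cos = abs_cosine_stats(vectors)
print(f"mean |cos| = {mean_cos:.3f}, max |cos| = {max_cos:.3f}")
```

For random directions the typical pairwise cosine shrinks roughly like one over the square root of the dimension, so the interference between directions falls as the space grows.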
Questions (9)
- What features need to activate / remain inactive for Claude to give advice on producing Chemical, Biological, Radiological or Nuclear (CBRN) weapons?
Potential safety claim about suppressing features to prevent CBRN advice.
- Can we use the feature basis to detect when fine-tuning a model increases the likelihood of undesirable behaviors?
Question about practical safety application of feature monitoring.
- What features activate when we ask Claude questions about its subjective experience?
Question about features related to consciousness and self-report.
- What features activate on tokens we'd expect to signify Claude's self-identity?
Open question from the discussion on future research directions.
- What features activate when we ask questions probing Claude's goals and values?
Direction for understanding model's internal objectives via features.
- Does the model have a feature corresponding to every major world city?
Question explored in feature completeness study.
- What features activate when Claude is trained to be a sleeper agent?
Question posed after discussing sleeper agent threat model.
- Will these methods work for large models?
Motivating question for the paper, addressed by scaling SAEs to Claude 3 Sonnet.
- What features activate during jailbreaks?
Open question for future safety interpretability work.
Related work: refs + corpus + external arXiv
Cited / in-corpus / arXiv badges show which signals surfaced each row. Multi-source rows weighted higher.
- Measuring and Guiding Monosemanticity
Felix Friedrich, Manuel Brack, Stephan Wäldchen, Björn Deiseroth, Patrick Schramowski, Kristian Kersting, Ruben Härle. 2025. ≈ 89%
- Sparse Autoencoder Features for Classifications and Transferability
Shan Chen, Kuleen Sasse, Hugo Aerts, Thomas Hartvigsen, Danielle S. Bitterman, Jack Gallifant. 2026. ≈ 87%
- Improving Dictionary Learning with Gated Sparse Autoencoders
Arthur Conmy, Lewis Smith, Tom Lieberum, Vikrant Varma, János Kramár, Rohin Shah, Neel Nanda, Senthooran Rajamanoharan. 2024. ≈ 87%
- A Unified Theory of Sparse Dictionary Learning in Mechanistic Interpretability: Piecewise Biconvexity and Spurious Minima
Harshvardhan Saini, Zhaoqian Yao, Zheng Lin, Yizhen Liao, Jingyi Cui, Yisen Wang, Mengnan Du, Dianbo Liu, Yiming Tang. 2026. ≈ 87%
- Evaluating Open-Source Sparse Autoencoders on Disentangling Factual Knowledge in GPT-2 Small
Maheep Chaudhary, Atticus Geiger. 2024. ≈ 86%
- Beyond Redundancy: Diverse and Specialized Multi-Expert Sparse Autoencoder
Zhen Tan, Song Wang, Kaidi Xu, Tianlong Chen, Zhen Xu. 2025. ≈ 86%
- Resurrecting the Salmon: Rethinking Mechanistic Interpretability with Domain-Specific Sparse Autoencoders
Mudith Jayasekara, Max Kirkby, Charles O'Neill. 2025. ≈ 86%
- Sparse Semantic Dimension as a Generalization Certificate for LLMs
Asif Ekbal, Dibyanayan Bandyopadhyay. 2026. ≈ 86%
- Sparse Autoencoders Do Not Find Canonical Units of Analysis
Michael Pearce, Joseph Bloom, Curt Tigges, Noura Al Moubayed, Lee Sharkey, Neel Nanda, Patrick Leask, Bart Bussmann. 2025. ≈ 86%
- The Rate-Distortion-Polysemanticity Tradeoff in SAEs
Francesco Locatello, Tommaso Mencattini, Francesco Montagna. 2026. ≈ 86%
- Supervised sparse auto-encoders for interpretable and compositional representations
Hugo Wallner, Yoonsoo Nam, Haixuan Xavier Tao, Ouns El Harzli. 2026. ≈ 86%
- Constructing Interpretable Features from Compositional Neuron Groups
Atticus Geiger, Mor Geva, Or Shafran. 2026. ≈ 85%
- Semantic Convergence: Investigating Shared Representations Across Scaled LLMs
Sanjana Rathore, Andrew Rufail, Adrian Simon, Daniel Zhang, Soham Dave, Cole Blondin, Kevin Zhu, Sean O'Brien, Daniel Son. 2025. ≈ 85%
- Taming Polysemanticity in LLMs: Provable Feature Recovery via Sparse Autoencoders
Heejune Sheen, Xuyuan Xiong, Tianhao Wang, Zhuoran Yang, Siyu Chen. 2025. ≈ 85%
- How Pruning Reshapes Features: Sparse Autoencoder Analysis of Weight-Pruned Language Models
Elies Seguí-Mas, Guillermina Tormo-Carbó, Hector Borobia. 2026. ≈ 85%
- Mechanistic Knobs in LLMs: Retrieving and Steering High-Order Semantic Features via Sparse Autoencoders
In corpus. 2026. ≈ 84%
- The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets
In corpus. 2023. ≈ 83%
- Alignment faking in large language models
In corpus. 2024. ≈ 81%
- The Platonic Representation Hypothesis
In corpus. 2024. ≈ 81%
- Interpreting Language Model Parameters
In corpus. 2026. ≈ 81%
- The Non-Linear Representation Dilemma: Is Causal Abstraction Enough for Mechanistic Interpretability?
In corpus. 2025. ≈ 80%
Cited by (2)
- Endogenous Resistance to Activation Steering in Language Models
- Mechanistic Interpretability of EEG Foundation Models via Sparse Autoencoders
Applying TopK Sparse Autoencoders (SAEs) to three architecturally distinct EEG foundation models — SleepFM, REVE, and LaBraM — reveals that clinical concepts are not cleanly separable in these models'