Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet

By Adly Templeton · Tom Conerly · Jonathan Marcus · Jack Lindsey · Trenton Bricken · Brian Chen · +16 more · Anthropic, OpenAI

Geometric structure in neural representations · Backdoor in code · Attribution patching · Bias in language models · Automated interpretability pipeline using LLMs · Biological weapons production advice · Feature ablation (zeroing feature activations) · Code security vulnerability · Feature attribution via gradient dot product with SAE decoder · Cross-layer superposition · Feature completeness search using LLM-generated queries · Dead features · Feature neighborhood exploration via cosine similarity of decoder weights · Deception correction via features · Feature steering (clamping feature activations) · +22 more

Methods (13)

Attribution patching
Gradient-based method to estimate the effect of zeroing a feature on a specific logit difference.
Automated interpretability pipeline using LLMs
Using Claude 3 Opus to generate feature explanations and predict held-out activations.
Feature ablation (zeroing feature activations)
Clamping a feature's value to zero to measure its causal effect on model output.
Feature attribution via gradient dot product with SAE decoder
Computing attribution as the dot product of the output logit gradient with the SAE decoder weight, multiplied by feature activation.
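A minimal sketch of the attribution computation described above, assuming one decoder vector per feature and a precomputed gradient of the chosen logit with respect to the residual stream. The function names and toy values are hypothetical, and plain Python lists stand in for tensors to keep the example self-contained.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def feature_attributions(activations, decoder_rows, logit_grad):
    """Return activation_i * (logit_grad . decoder_i) for every feature i."""
    return [a * dot(logit_grad, d) for a, d in zip(activations, decoder_rows)]


# Hypothetical toy values: 3 features in a 2-dimensional residual stream.
activations = [2.0, 0.0, 0.5]
decoder_rows = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
logit_grad = [0.5, -1.0]  # gradient of the chosen logit w.r.t. the residual

attributions = feature_attributions(activations, decoder_rows, logit_grad)
print(attributions)
```

An inactive feature (activation 0) always receives zero attribution under this formula, regardless of how well its decoder direction aligns with the gradient.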
Feature completeness search using LLM-generated queries
Using Claude to search for features that activate on specific concepts, combined with automated labeling.
Feature neighborhood exploration via cosine similarity of decoder weights
Identifying related features by the cosine similarity of their SAE decoder vectors.
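The neighborhood lookup can be sketched as a nearest-neighbor search over decoder vectors. This is an illustrative sketch only: `nearest_features` and the toy decoder rows are hypothetical, and a real dictionary with millions of features would need a vectorized or approximate search.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_features(query_index, decoder_rows, k):
    """Indices of the k features whose decoder vectors are closest to the query's."""
    query = decoder_rows[query_index]
    scored = [
        (cosine_similarity(query, row), index)
        for index, row in enumerate(decoder_rows)
        if index != query_index
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [index for _, index in scored[:k]]


# Hypothetical toy decoder: 4 features in 2 dimensions.
decoder_rows = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
neighbors = nearest_features(0, decoder_rows, k=2)
print(neighbors)
```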
Feature steering (clamping feature activations)
Modifying model behavior by clamping SAE feature activations to specific values during forward pass.
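A rough sketch of clamping, under the assumption that the SAE decoder is linear, so that setting one feature to a target value is the same as shifting the residual vector along that feature's decoder direction while leaving everything else (including the reconstruction error) untouched. The helper name and values are hypothetical. Ablation is the special case `target = 0.0`.

```python
def clamp_feature(residual, activation, decoder_row, target):
    """Shift the residual so the feature's contribution matches `target`.

    residual:    residual-stream vector at one token position
    activation:  the feature's current activation at that position
    decoder_row: the feature's decoder direction
    target:      the value to clamp the feature to
    """
    delta = target - activation
    return [x + delta * d for x, d in zip(residual, decoder_row)]


# Hypothetical toy values.
residual = [1.0, 2.0]
decoder_row = [0.0, 1.0]
current_activation = 0.5

ablated = clamp_feature(residual, current_activation, decoder_row, target=0.0)
steered = clamp_feature(residual, current_activation, decoder_row, target=5.0)
print(ablated, steered)
```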
Few-shot linear probe steering baseline
Constructing steering vectors from the difference of mean activations on positive and negative examples, for comparison.
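The difference-of-means construction can be sketched as follows. The example activations are made up for illustration, and the function names are hypothetical.

```python
def mean_vector(rows):
    """Component-wise mean of a list of equal-length vectors."""
    count = len(rows)
    return [sum(column) / count for column in zip(*rows)]


def steering_vector(positive, negative):
    """Mean activation on positive examples minus mean on negative examples."""
    mean_pos = mean_vector(positive)
    mean_neg = mean_vector(negative)
    return [p - n for p, n in zip(mean_pos, mean_neg)]


# Hypothetical activations for two positive and two negative prompts.
positive = [[1.0, 2.0], [3.0, 2.0]]
negative = [[0.0, 1.0], [0.0, 3.0]]

vector = steering_vector(positive, negative)
print(vector)
```

With only a few examples per class, the resulting direction also picks up whatever else happens to differ between the two small sets, which is one reason it serves here as a baseline for comparison.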
SAE training loss (MSE + L1 penalty with decoder norm scaling)
The objective function combining L2 reconstruction error and L1 penalty scaled by decoder norm, used to train the SAE.
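A single-example sketch of this objective, assuming a ReLU encoder and a linear decoder: squared reconstruction error plus an L1 coefficient times the sum over features of activation times decoder-vector norm. Parameter names, shapes, and the toy weights are assumptions for illustration; actual training would average over a batch and use an autodiff framework.

```python
import math


def sae_loss(x, enc_rows, enc_bias, dec_rows, dec_bias, l1_coeff):
    """Squared reconstruction error plus decoder-norm-weighted L1 penalty.

    enc_rows[i] and dec_rows[i] are the encoder and decoder vectors of feature i.
    """
    # Encoder: ReLU(W_enc x + b_enc), one activation per feature.
    features = [
        max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(enc_rows, enc_bias)
    ]
    # Decoder: b_dec + sum_i f_i * d_i.
    x_hat = list(dec_bias)
    for f, row in zip(features, dec_rows):
        for j, d in enumerate(row):
            x_hat[j] += f * d
    reconstruction = sum((xi - xh) ** 2 for xi, xh in zip(x, x_hat))
    # Each activation is weighted by the L2 norm of its decoder vector.
    sparsity = sum(
        f * math.sqrt(sum(d * d for d in row))
        for f, row in zip(features, dec_rows)
    )
    return reconstruction + l1_coeff * sparsity


# Hypothetical toy SAE: 2 features, 2-dimensional input.
loss = sae_loss(
    x=[1.0, 2.0],
    enc_rows=[[1.0, 0.0], [0.0, 1.0]],
    enc_bias=[0.0, 0.0],
    dec_rows=[[1.0, 0.0], [0.0, 2.0]],
    dec_bias=[0.0, 0.0],
    l1_coeff=0.1,
)
print(loss)
```

Weighting each activation by its decoder norm means the penalty cannot be reduced simply by shrinking activations and growing the decoder vectors to compensate.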
Scaled SAE training on Claude 3 Sonnet middle residual stream layer
Specific application of SAE to extract features from the middle layer of Claude 3 Sonnet, at three scales (1M, 4M, 34M features).
Scaling laws analysis for SAE hyperparameters
Sweeping number of features and training steps to find compute-optimal SAE configurations.
Specificity scoring rubric (0-3 scale) with Claude 3 Opus
Rubric in which the LLM rates how well a feature's interpretation matches the activating text.
UMAP visualization for features
Dimensionality reduction of SAE decoder vectors to create interactive feature maps.

Findings (31)

The likelihood of a dedicated feature for a concept (element, city, animal, food) follows a sigmoid in log-frequency of the concept in training data, with threshold frequency inversely proportional to number of alive features.
Quantitative relationship between concept frequency and feature presence.
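One way to read this relationship is sketched below: a logistic curve in log-frequency whose midpoint (the threshold frequency) shrinks in inverse proportion to the number of alive features. The functional form follows the description above, but `scale_constant` and `steepness` are hypothetical parameters chosen for illustration, not fitted values from the paper.

```python
import math


def feature_probability(concept_frequency, alive_features, scale_constant, steepness):
    """Logistic curve in log-frequency with threshold = scale_constant / alive_features."""
    threshold = scale_constant / alive_features
    z = steepness * (math.log(concept_frequency) - math.log(threshold))
    return 1.0 / (1.0 + math.exp(-z))


# Illustrative values only: a concept with relative frequency 1e-6.
frequency = 1e-6
p_small = feature_probability(frequency, 1_000_000, scale_constant=1.0, steepness=2.0)
p_large = feature_probability(frequency, 4_000_000, scale_constant=1.0, steepness=2.0)
print(p_small, p_large)
```

Under these assumptions a concept sitting exactly at the threshold has probability 0.5 of having a dedicated feature, and growing the dictionary moves rarer concepts past the threshold.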
For four example features (Golden Gate Bridge, brain sciences, monuments, transit infrastructure), all strong activations (top bucket) received specificity rating 3 from Claude 3 Opus.
Validation that top activations are highly specific to interpretation.
Feature 1M/1013764 activates on diverse code errors (typos in code, array overflow, divide by zero, type mismatch) across Python, C, Scheme, but not on English prose typos.
Shows a general code error detector beyond simple typo detection.
Clamping the "gender bias in professions" feature (34M/24442848) to a high activation causes the model to emphasize female pronouns and discuss nursing as female-dominated.
Feature steers model toward gender-stereotypical completions.
Clamping the code error feature to a large negative activation causes the model to output the correct result despite a bug in the code and, in one case, to rewrite the code without the bug.
Suppressing the feature makes the model ignore bugs.
Automated interpretability (Claude 3 Opus) and specificity scoring show SAE features are significantly more interpretable and specific than MLP neurons.
Quantitative comparison supporting SAE utility.
Golden Gate Bridge feature neighborhood includes Alcatraz, Presidio, Lake Tahoe, Yosemite; decoder cosine similarity maps onto semantic relatedness.
Example of geometric clustering of features.
Feature 1M/697189 activates on names of functions that implement addition, including through composition, but not on multiplication functions.
Feature represents the 'addition' function abstractly.
82% of features in 1M SAE had maximum Pearson correlation ≤0.3 with any MLP neuron, and manual inspection showed no semantic resemblance.
SAE features are not simply mirroring individual neurons.
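The neuron comparison above can be sketched as: for each feature, take the maximum Pearson correlation against every neuron, then report the share of features whose best match is at or below the threshold. The activation arrays here are tiny made-up examples, and the function names are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0.0 or var_y == 0.0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def share_without_matching_neuron(feature_acts, neuron_acts, threshold=0.3):
    """Fraction of features whose most-correlated neuron is at or below threshold."""
    best = [max(pearson(f, n) for n in neuron_acts) for f in feature_acts]
    return sum(1 for r in best if r <= threshold) / len(best)


# Hypothetical activations over 4 tokens: 2 features, 2 neurons.
feature_acts = [[0.0, 1.0, 2.0, 3.0], [1.0, 0.0, 0.0, 1.0]]
neuron_acts = [[0.0, 2.0, 4.0, 6.0], [3.0, 1.0, 2.0, 0.0]]

share = share_without_matching_neuron(feature_acts, neuron_acts)
print(share)
```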
Clamping the dialogue/assistant feature 1M/80091 to negative two times its maximum activation causes the model to drop the assistant persona and respond like a human.
Feature manipulation alters persona.

Claims (20)

There appears to be a systematic relationship between the frequency of concepts and the dictionary size needed to resolve features for them.
Feature presence depends on concept frequency in training data, with a threshold scaling inversely with alive features.
The resulting features are highly abstract: multilingual, multimodal, and generalizing between concrete and abstract references.
Features respond to concepts across languages and in images, not just text.
We observe features related to a broad range of safety concerns, including deception, sycophancy, bias, and dangerous content.
SAEs uncover safety-relevant representations that might be monitored or controlled.
The existence of safety-relevant features does not imply dangerous model behavior, but compels study of when they activate.
Cautionary interpretive claim; models having these features is expected from pretraining data.
Dictionary learning offers advantages over linear probes: amortization of cost and unsupervised discovery of abstractions.
SAE features can be found without pre-specified concepts, and feature steering often outperforms few-shot probe vectors.
The model's representation of self in assistant persona invokes common AI tropes and is heavily anthropomorphized.
Features for consciousness, emotions, and entrapment activate when the model is asked about itself.
SAE features generalize to images despite the SAE being trained only on text, indicating out-of-distribution robustness.
A promising property for interpretability analysis off-distribution.
Feature attribution correlates well with ablation effects, making it an efficient proxy for causal effect.
Gradient-based attribution approximates ablation impact, enabling fast search for causally important features.
Feature splitting occurs: smaller SAE features split into multiple finer-grained features in larger SAEs.
Observed across SAE scales, e.g., 'San Francisco' split into 11 features.
The features are often organized in geometrically-related clusters that share a semantic relationship.
Decoder cosine similarity maps onto concept similarity.

Hypotheses (2)

Linear representation hypothesis: neural networks represent meaningful concepts as directions in their activation spaces.
Foundation for interpreting features as linear directions.
Superposition hypothesis: neural networks represent more features than dimensions using almost-orthogonal directions.
Explanation for why dictionary learning can recover many more features than dimensions.
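A small numerical illustration of why more directions than dimensions can coexist: randomly drawn unit vectors become closer to orthogonal as dimension grows. Random directions are only a stand-in here (learned feature directions are not random), and the sizes and seed are arbitrary choices for the sketch.

```python
import math
import random


def random_unit_vector(dim, rng):
    """Draw a direction uniformly at random by normalizing a Gaussian sample."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]


def mean_abs_cosine(num_vectors, dim, seed=0):
    """Average |cosine similarity| over all pairs of random unit vectors."""
    rng = random.Random(seed)
    vectors = [random_unit_vector(dim, rng) for _ in range(num_vectors)]
    total, pairs = 0.0, 0
    for i in range(num_vectors):
        for j in range(i + 1, num_vectors):
            total += abs(sum(a * b for a, b in zip(vectors[i], vectors[j])))
            pairs += 1
    return total / pairs


# Twice as many directions as dimensions in both cases.
low_dim = mean_abs_cosine(num_vectors=8, dim=4)
high_dim = mean_abs_cosine(num_vectors=128, dim=64)
print(low_dim, high_dim)
```

In both settings there are twice as many vectors as dimensions, yet the average interference between pairs is much smaller in the higher-dimensional case.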

Questions (9)

What features need to activate / remain inactive for Claude to give advice on producing Chemical, Biological, Radiological or Nuclear (CBRN) weapons?
Potential safety claim about suppressing features to prevent CBRN advice.
Can we use the feature basis to detect when fine-tuning a model increases the likelihood of undesirable behaviors?
Question about practical safety application of feature monitoring.
What features activate when we ask Claude questions about its subjective experience?
Question about features related to consciousness and self-report.
What features activate on tokens we'd expect to signify Claude's self-identity?
Open question from the discussion on future research directions.
What features activate when we ask questions probing Claude's goals and values?
Direction for understanding model's internal objectives via features.
Does the model have a feature corresponding to every major world city?
Question explored in feature completeness study.
What features activate when Claude is trained to be a sleeper agent?
Question posed after discussing sleeper agent threat model.
Will these methods work for large models?
Motivating question for the paper, addressed by scaling SAEs to Claude 3 Sonnet.
What features activate during jailbreaks?
Open question for future safety interpretability work.
