Towards monosemanticity: Decomposing language models with dictionary learning

By Trenton Bricken · Adly Templeton · Joshua Batson · Brian Chen · Adam Jermyn · Tom Conerly · +19 more · Anthropic

Feature splitting · Disentanglement · Activation Interval Sampling · 8 Billion MLP Activation Samples · Finite State Automata · Feature Assemblies · Sparse Autoencoder for Dictionary Learning · Attribution Similarity · The Pile · Linear representation · Feature Interpretability Rubric · Mechanistic Interpretability · Masked Cosine Similarity · Neuron Resampling

Methods (5)

Activation Interval Sampling
Dividing a feature's activation spectrum into 11 evenly spaced intervals and sampling uniformly from each to evaluate monosemanticity across activation levels
Attribution Similarity
Correlating attribution vectors (feature activation × logit weight of next token) across model pairs to measure functional universality
Feature Interpretability Rubric
14-point scoring rubric for human evaluation of feature interpretability covering confidence, activation consistency, logit consistency, and specificity
Masked Cosine Similarity
Cosine similarity between feature activations restricted to tokens where one of the features fires; used to identify feature splitting relationships
Neuron Resampling
Periodically reinitializing dead autoencoder neurons using high-loss data points to improve feature coverage
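
Two of the methods above, Masked Cosine Similarity and Activation Interval Sampling, can be illustrated with a minimal sketch. The function names, the choice to mask on the first feature, the decision to skip non-firing tokens when bucketing, and the `per_interval` parameter are all assumptions made for illustration; the descriptions above do not fix these details.

```python
import math
import random


def masked_cosine_similarity(a, b):
    """Cosine similarity of two features' activations, restricted to the
    tokens where feature `a` fires (a > 0). Asymmetric by construction."""
    pairs = [(x, y) for x, y in zip(a, b) if x > 0]
    dot = sum(x * y for x, y in pairs)
    norm_a = math.sqrt(sum(x * x for x, _ in pairs))
    norm_b = math.sqrt(sum(y * y for _, y in pairs))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: define similarity as 0 when either side is all zero
    return dot / (norm_a * norm_b)


def sample_by_activation_interval(activations, n_intervals=11, per_interval=2, seed=0):
    """Split (0, max activation] into evenly spaced intervals and draw up to
    `per_interval` token indices uniformly at random from each interval."""
    rng = random.Random(seed)
    top = max(activations)
    buckets = [[] for _ in range(n_intervals)]
    if top <= 0:
        return buckets  # feature never fires: nothing to sample
    for idx, act in enumerate(activations):
        if act <= 0:
            continue  # assumption: tokens where the feature is silent are skipped
        k = min(int(act / top * n_intervals), n_intervals - 1)
        buckets[k].append(idx)
    return [rng.sample(b, min(per_interval, len(b))) for b in buckets]
```

Because the mask depends on which feature is passed first, `masked_cosine_similarity(a, b)` and `masked_cosine_similarity(b, a)` can differ; a high value in one direction and a lower value in the other is the kind of pattern one might expect when one feature is a narrower split of another.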

Frameworks (2)

Disentanglement
Related research agenda seeking representations that separate conceptually distinct factors; contrasted with the superposition approach
Sparse Autoencoder for Dictionary Learning
Primary method introduced: trains a one-hidden-layer MLP with an L1 sparsity penalty to decompose model activations into overcomplete feature dictionaries
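
As a rough illustration of the sparse autoencoder described above, the following sketch computes a forward pass and a loss made of reconstruction error plus an L1 penalty on the hidden features. The ReLU nonlinearity, the bias terms, the mean-squared-error form, and every name and default (`sae_forward`, `sae_loss`, `l1_coeff`) are assumptions for illustration, not details confirmed by the entry above; training (gradient updates) is omitted.

```python
def relu(v):
    """Elementwise ReLU on a list of floats."""
    return [max(0.0, x) for x in v]


def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def sae_forward(x, W_enc, b_enc, W_dec, b_dec):
    """One-hidden-layer autoencoder: encode to feature activations, then decode.

    W_enc has shape (n_features, d_model) and W_dec has shape
    (d_model, n_features); n_features > d_model gives an overcomplete dictionary.
    """
    f = relu([h + b for h, b in zip(matvec(W_enc, x), b_enc)])
    x_hat = [r + b for r, b in zip(matvec(W_dec, f), b_dec)]
    return f, x_hat


def sae_loss(x, f, x_hat, l1_coeff=1e-3):
    """Mean squared reconstruction error plus an L1 sparsity penalty on f."""
    mse = sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)
    l1 = sum(abs(a) for a in f)
    return mse + l1_coeff * l1
```

The L1 term pushes most feature activations to exactly zero for any given input, which is what lets each column of `W_dec` be read as a dictionary direction that is active on only a small fraction of tokens.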

Datasets (2)

8 Billion MLP Activation Samples
Dataset of transformer MLP activations used to train sparse autoencoders; collected from 40M contexts
The Pile
Training corpus used for the 67M-parameter model tested with VPD.

Related work: refs + corpus + external arXiv

Cited / in-corpus / arXiv badges show which signals surfaced each row. Multi-source rows weighted higher.

Monet: Mixture of Monosemantic Experts for Transformers
Jungwoo Park, Young Jin Ahn, Kee-Eung Kim, Jaewoo Kang
2025
≈ 76%
A Pattern Language for Machine Learning Tasks
Benjamin Rodatz, Ian Fan, Tuomas Laakkonen, Neil John Ortega, Thomas Hoffmann, Vincent Wang-Mascianica
2025
≈ 76%
A Monosemantic Attribution Framework for Stable Interpretability in Clinical Neuroscience Large Language Models
Michail Mamalakis, Tiago Azevedo, Cristian Cosentino, Chiara D'Ercoli, Subati Abulikemu, Zhongtian Sun, Richard Bethlehem, Pietro Lio
2026
≈ 75%
A Unified Theory of Sparse Dictionary Learning in Mechanistic Interpretability: Piecewise Biconvexity and Spurious Minima
Yiming Tang, Harshvardhan Saini, Zhaoqian Yao, Zheng Lin, Yizhen Liao, Jingyi Cui, Yisen Wang, Mengnan Du, Dianbo Liu
2026
≈ 75%
Multi-Agent Language Models: Advancing Cooperation, Coordination, and Adaptation
Arjun Vaithilingam Sudhakar
2025
≈ 75%
Majorization Minimization Technique for Optimally Solving Deep Dictionary Learning
Vanika Singhal and Angshul Majumdar
2019
≈ 75%
Learning to Model the World with Language
Jessy Lin, Yuqing Du, Olivia Watkins, Danijar Hafner, Pieter Abbeel, Dan Klein, Anca Dragan
2024
≈ 75%
Vocabulary Expansion of Large Language Models via Kullback-Leibler-Based Self-Distillation
Max Rehman Linder
2026
≈ 74%
The Rate-Distortion-Polysemanticity Tradeoff in SAEs
Tommaso Mencattini, Francesco Montagna, Francesco Locatello
2026
≈ 74%
Semantic Convergence: Investigating Shared Representations Across Scaled LLMs
Daniel Son, Sanjana Rathore, Andrew Rufail, Adrian Simon, Daniel Zhang, Soham Dave, Cole Blondin, Kevin Zhu, Sean O'Brien
2025
≈ 74%
Improving Dictionary Learning with Gated Sparse Autoencoders
Senthooran Rajamanoharan, Arthur Conmy, Lewis Smith, Tom Lieberum, Vikrant Varma, János Kramár, Rohin Shah, Neel Nanda
2024
≈ 74%
Dictionary Learning under Symmetries via Group Representations
Subhroshekhar Ghosh, Aaron Y. R. Low, Yong Sheng Soh, Zhuohang Feng, Brendan K. Y. Tan
2026
≈ 74%
Lost in Multilinguality: Dissecting Cross-lingual Factual Inconsistency in Transformer Language Models
Mingyang Wang, Heike Adel, Lukas Lange, Yihong Liu, Ercong Nie, Jannik Strötgen, Hinrich Schütze
2025
≈ 74%
Measuring and Guiding Monosemanticity
Ruben Härle, Felix Friedrich, Manuel Brack, Stephan Wäldchen, Björn Deiseroth, Patrick Schramowski, Kristian Kersting
2025
≈ 74%
Probing Task-Oriented Dialogue Representation from Language Models
Chien-Sheng Wu and Caiming Xiong
2020
≈ 74%
Interpreting Language Model Parameters
in corpus
2026
≈ 73%
Paper Summary: Interpreting Language Model Parameters
in corpus
≈ 71%
Mechanistic Knobs in LLMs: Retrieving and Steering High-Order Semantic Features via Sparse Autoencoders
in corpus
2026
≈ 71%
Active inference on discrete state-spaces: a synthesis
in corpus
2020
≈ 71%
Model Alignment Search
in corpus
2025
≈ 70%
Natural Language Autoencoders Produce Unsupervised Explanations of LLM Activations
in corpus
≈ 70%
The Platonic Representation Hypothesis
in corpus
2024
≈ 69%
Active Inference, Curiosity and Insight
in corpus
2017
≈ 69%
Towards Safe and Honest AI Agents with Neural Self-Other Overlap
in corpus
2024
≈ 68%
Denotational Design: from meanings to programs
in corpus
2015
≈ 68%
Technological Approach to Mind Everywhere: An Experimentally-Grounded Framework for Understanding Diverse Bodies and Minds
in corpus
2022
≈ 68%
Addressing divergent representations from causal interventions on neural networks
in corpus
2025
≈ 68%
Denotational design with type class morphisms (extended version)
in corpus
2015
≈ 68%

Cited by (4)

Endogenous Resistance to Activation Steering in Language Models
Quantitative Introspection in Language Models: Tracking Emotive States Across Conversation
Quantitative introspection—the causal coupling between an instruction-tuned LLM's numeric self-report and a probe-defined internal emotive direction—is demonstrably present in models as small as LLaMA
Mechanistic Interpretability of EEG Foundation Models via Sparse Autoencoders
Applying TopK Sparse Autoencoders (SAEs) to three architecturally distinct EEG foundation models — SleepFM, REVE, and LaBraM — reveals that clinical concepts are not cleanly separable in these models'
Manifold Steering Reveals the Shared Geometry of Neural Network Representation and Behavior
Manifold steering — intervening on model activations along paths constrained to lie on a learned activation manifold M_h rather than along Euclidean linear directions — produces behavioral trajectorie