paper
referenced-only
2023
paper:bricken-towards-monosemanticity-decomposing-lang-2023
Towards monosemanticity: Decomposing language models with dictionary learning
Feature splitting, Disentanglement, Activation Interval Sampling, 8 Billion MLP Activation Samples, Finite State Automata Feature Assemblies, Sparse Autoencoder for Dictionary Learning, Attribution Similarity, The Pile, Linear representation, Feature Interpretability Rubric, Mechanistic Interpretability, Masked Cosine Similarity, Neuron Resampling
Methods (5)
- Activation Interval Sampling: Dividing a feature's activation spectrum into 11 evenly spaced intervals and sampling uniformly from each to evaluate monosemanticity across activation levels.
- Attribution Similarity: Correlating attribution vectors (feature activation × logit weight of the next token) across model pairs to measure functional universality.
- Feature Interpretability Rubric: 14-point scoring rubric for human evaluation of feature interpretability, covering confidence, activation consistency, logit consistency, and specificity.
- Masked Cosine Similarity: Cosine similarity between feature activations restricted to tokens where one of the features fires; used to identify feature-splitting relationships.
- Neuron Resampling: Periodically reinitializing dead autoencoder neurons using high-loss data points to improve feature coverage.
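As an illustration of two of the methods above, here is a minimal sketch of masked cosine similarity and activation interval sampling. It is not the paper's implementation. The function names, the "fires" threshold (activation > 0), the choice of the first feature as the mask, and the half-open interval boundaries are all assumptions made for this example.

```python
import math
import random


def masked_cosine_similarity(a, b):
    """Cosine similarity of a and b over tokens where feature `a` fires.

    Assumption: "fires" means activation > 0, and `a` supplies the mask.
    """
    pairs = [(x, y) for x, y in zip(a, b) if x > 0]
    dot = sum(x * y for x, y in pairs)
    norm_a = math.sqrt(sum(x * x for x, _ in pairs))
    norm_b = math.sqrt(sum(y * y for _, y in pairs))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # undefined case; returning 0.0 is a convention chosen here
    return dot / (norm_a * norm_b)


def sample_by_activation_interval(activations, n_intervals=11, per_interval=1, seed=0):
    """Bucket firing tokens into equal-width activation intervals, then sample.

    Returns {interval index: list of sampled token indices}; empty intervals
    are omitted. The maximum activation is clamped into the last interval.
    """
    rng = random.Random(seed)
    top = max(activations)
    if top <= 0:
        return {}
    width = top / n_intervals
    buckets = {k: [] for k in range(n_intervals)}
    for idx, act in enumerate(activations):
        if act > 0:
            k = min(int(act / width), n_intervals - 1)
            buckets[k].append(idx)
    return {
        k: rng.sample(v, min(per_interval, len(v)))
        for k, v in buckets.items()
        if v
    }
```

Because the mask comes from only one of the two features, the measure is asymmetric: swapping `a` and `b` can change the result, which is what makes it usable for spotting a narrow feature nested inside a broader one.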
Frameworks (2)
- Disentanglement: Related research agenda seeking representations that separate conceptually distinct factors; contrasted with the superposition approach.
- Sparse Autoencoder for Dictionary Learning: Primary method introduced; trains a one-hidden-layer MLP with an L1 sparsity penalty to decompose model activations into overcomplete feature dictionaries.
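The sparse autoencoder objective described above can be sketched as a reconstruction error plus an L1 penalty on the hidden features. This is a minimal illustration under stated assumptions, not the paper's code: the parameter names (`w_enc`, `b_enc`, `w_dec`, `b_dec`, `l1_coeff`), the ReLU nonlinearity, the mean-squared-error reconstruction term, and the bias handling are choices made for this example, and any further training details are omitted.

```python
def relu(v):
    """Elementwise max(0, x)."""
    return [max(0.0, x) for x in v]


def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def sae_loss(x, w_enc, b_enc, w_dec, b_dec, l1_coeff):
    """Return (loss, features) for a single activation vector x.

    w_enc has shape (n_features, d) and w_dec has shape (d, n_features);
    n_features > d gives an overcomplete dictionary.
    """
    pre = [p + b for p, b in zip(matvec(w_enc, x), b_enc)]
    features = relu(pre)
    recon = [r + b for r, b in zip(matvec(w_dec, features), b_dec)]
    mse = sum((a - r) ** 2 for a, r in zip(x, recon)) / len(x)
    l1 = sum(features)  # features are non-negative after ReLU, so this is the L1 norm
    return mse + l1_coeff * l1, features
```

For example, with a 2-dimensional input and 3 dictionary features:

```python
loss, feats = sae_loss(
    x=[1.0, -1.0],
    w_enc=[[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]],
    b_enc=[0.0, 0.0, 0.0],
    w_dec=[[1.0, 0.0, -1.0], [0.0, 1.0, 0.0]],
    b_dec=[0.0, 0.0],
    l1_coeff=0.1,
)
# Only the first feature is active; raising l1_coeff pushes toward fewer active features.
```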
Datasets (2)
- 8 Billion MLP Activation Samples: Dataset of transformer MLP activations used to train sparse autoencoders; collected from 40M contexts.
- The Pile: Training corpus used for the 67M-parameter model tested with VPD.
Related work (refs + corpus + external arXiv)
Cited / in-corpus / arXiv badges show which signals surfaced each row. Multi-source rows weighted higher.
- Monet: Mixture of Monosemantic Experts for Transformers. Young Jin Ahn, Kee-Eung Kim, Jaewoo Kang Jungwoo Park. 2025. ≈ 76%
- A Pattern Language for Machine Learning Tasks. Ian Fan, Tuomas Laakkonen, Neil John Ortega, Thomas Hoffmann, Vincent Wang-Mascianica Benjamin Rodatz. 2025. ≈ 76%
- A Monosemantic Attribution Framework for Stable Interpretability in Clinical Neuroscience Large Language Models. Tiago Azevedo, Cristian Cosentino, Chiara D'Ercoli, Subati Abulikemu, Zhongtian Sun, Richard Bethlehem, Pietro Lio Michail Mamalakis. 2026. ≈ 75%
- A Unified Theory of Sparse Dictionary Learning in Mechanistic Interpretability: Piecewise Biconvexity and Spurious Minima. Harshvardhan Saini, Zhaoqian Yao, Zheng Lin, Yizhen Liao, Jingyi Cui, Yisen Wang, Mengnan Du, Dianbo Liu Yiming Tang. 2026. ≈ 75%
- Multi-Agent Language Models: Advancing Cooperation, Coordination, and Adaptation. Arjun Vaithilingam Sudhakar. 2025. ≈ 75%
- Majorization Minimization Technique for Optimally Solving Deep Dictionary Learning. Vanika Singhal and Angshul Majumdar. 2019. ≈ 75%
- Learning to Model the World with Language. Yuqing Du, Olivia Watkins, Danijar Hafner, Pieter Abbeel, Dan Klein, Anca Dragan Jessy Lin. 2024. ≈ 75%
- Vocabulary Expansion of Large Language Models via Kullback-Leibler-Based Self-Distillation. Max Rehman Linder. 2026. ≈ 74%
- The Rate-Distortion-Polysemanticity Tradeoff in SAEs. Francesco Locatello Tommaso Mencattini and Francesco Montagna. 2026. ≈ 74%
- Semantic Convergence: Investigating Shared Representations Across Scaled LLMs. Sanjana Rathore, Andrew Rufail, Adrian Simon, Daniel Zhang, Soham Dave, Cole Blondin, Kevin Zhu, Sean O'Brien Daniel Son. 2025. ≈ 74%
- Improving Dictionary Learning with Gated Sparse Autoencoders. Arthur Conmy, Lewis Smith, Tom Lieberum, Vikrant Varma, János Kramár, Rohin Shah and Neel Nanda Senthooran Rajamanoharan. 2024. ≈ 74%
- Dictionary Learning under Symmetries via Group Representations. Aaron Y. R. Low, Yong Sheng Soh, Zhuohang Feng, and Brendan K. Y. Tan Subhroshekhar Ghosh. 2026. ≈ 74%
- Lost in Multilinguality: Dissecting Cross-lingual Factual Inconsistency in Transformer Language Models. Heike Adel, Lukas Lange, Yihong Liu, Ercong Nie, Jannik Strötgen, Hinrich Schütze Mingyang Wang. 2025. ≈ 74%
- Measuring and Guiding Monosemanticity. Felix Friedrich, Manuel Brack, Stephan Wäldchen, Björn Deiseroth, Patrick Schramowski, Kristian Kersting Ruben Härle. 2025. ≈ 74%
- Probing Task-Oriented Dialogue Representation from Language Models. Chien-Sheng Wu and Caiming Xiong. 2020. ≈ 74%
- Interpreting Language Model Parameters. In corpus. 2026. ≈ 73%
- Mechanistic Knobs in LLMs: Retrieving and Steering High-Order Semantic Features via Sparse Autoencoders. In corpus. 2026. ≈ 71%
- Model Alignment Search. In corpus. 2025. ≈ 70%
- The Platonic Representation Hypothesis. In corpus. 2024. ≈ 69%
- Active Inference, Curiosity and Insight. In corpus. 2017. ≈ 69%
- Denotational Design: from meanings to programs. In corpus. 2015. ≈ 68%
Similar preprints — Semantic Scholar
Cited by (4)
- Endogenous Resistance to Activation Steering in Language Models
- Quantitative Introspection in Language Models: Tracking Emotive States Across Conversation
Quantitative introspection—the causal coupling between an instruction-tuned LLM's numeric self-report and a probe-defined internal emotive direction—is demonstrably present in models as small as LLaMA
- Mechanistic Interpretability of EEG Foundation Models via Sparse Autoencoders
Applying TopK Sparse Autoencoders (SAEs) to three architecturally distinct EEG foundation models — SleepFM, REVE, and LaBraM — reveals that clinical concepts are not cleanly separable in these models'
- Manifold Steering Reveals the Shared Geometry of Neural Network Representation and Behavior
Manifold steering — intervening on model activations along paths constrained to lie on a learned activation manifold M_h rather than along Euclidean linear directions — produces behavioral trajectorie