paper:cunningham-sparse-autoencoders-find-highly-interpre-2023
Sparse autoencoders find highly interpretable features in language models
Related work: refs + corpus + external arXiv
The Cited, in-corpus, and arXiv badges show which signals surfaced each row. Multi-source rows are weighted higher.
- A Survey on Sparse Autoencoders: Interpreting the Internal Mechanisms of Large Language Models. Dong Shu, Xuansheng Wu, Haiyan Zhao, Daking Rai, Ziyu Yao, Ninghao Liu, Mengnan Du (2025). ≈ 85%
- Mechanistic Interpretability of ASR models using Sparse Autoencoders. Dan Pluth, Zachary Nicholas Houghton, Yu Zhou, Vijay K. Gurbani (2026). ≈ 83%
- Sparse Autoencoder Features for Classifications and Transferability. Jack Gallifant, Shan Chen, Kuleen Sasse, Hugo Aerts, Thomas Hartvigsen, Danielle S. Bitterman (2026). ≈ 82%
- Sparse Autoencoders Do Not Find Canonical Units of Analysis. Patrick Leask, Bart Bussmann, Michael Pearce, Joseph Bloom, Curt Tigges, Noura Al Moubayed, Lee Sharkey, Neel Nanda (2025). ≈ 82%
- Improving Dictionary Learning with Gated Sparse Autoencoders. Senthooran Rajamanoharan, Arthur Conmy, Lewis Smith, Tom Lieberum, Vikrant Varma, János Kramár, Rohin Shah, Neel Nanda (2024). ≈ 82%
- Evaluating and Designing Sparse Autoencoders by Approximating Quasi-Orthogonality. Sewoong Lee, Adam Davies, Marc E. Canby, Julia Hockenmaier (2025). ≈ 81%
- Interpreting Attention Layer Outputs with Sparse Autoencoders. Connor Kissane, Robert Krzyzanowski, Joseph Isaac Bloom, Arthur Conmy, Neel Nanda (2024). ≈ 81%
- Incorporating Hierarchical Semantics in Sparse Autoencoder Architectures. Mark Muchane, Sean Richardson, Kiho Park, Victor Veitch (2025). ≈ 80%
- Group Equivariance Meets Mechanistic Interpretability: Equivariant Sparse Autoencoders. Ege Erdogan, Ana Lucic (2025). ≈ 80%
- Sparse Autoencoders Enable Scalable and Reliable Circuit Identification in Language Models. Charles O'Neill, Thang Bui (2024). ≈ 80%
- Features that Make a Difference: Leveraging Gradients for Improved Dictionary Learning. Jeffrey Olmo, Jared Wilson, Max Forsey, Bryce Hepner, Thomas Vin Howe, David Wingate (2025). ≈ 80%
- Understanding sparse autoencoder scaling in the presence of feature manifolds. Eric J. Michaud, Liv Gorton, Tom McGrath (2025). ≈ 80%
- Interpreting Language Model Parameters [in corpus] (2026). ≈ 76%
- Mechanistic Knobs in LLMs: Retrieving and Steering High-Order Semantic Features via Sparse Autoencoders [in corpus] (2026). ≈ 76%
- The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets [in corpus] (2023). ≈ 68%
- Finding Alignments Between Interpretable Causal Variables and Distributed Neural Representations [in corpus] (2023). ≈ 68%
- The Non-Linear Representation Dilemma: Is Causal Abstraction Enough for Mechanistic Interpretability? [in corpus] (2025). ≈ 68%
- Active inference: demystified and compared [in corpus] (2021). ≈ 68%
Cited by (6)
- From Directions to Cones: Exploring Multidimensional Representations of Propositional Facts in LLMs
Propositional truth in LLMs is not encoded as a single linear direction but as a multi-dimensional subspace that can be characterized by concept cones—sets of all nonnegative linear combinations of or
- The Non-Linear Representation Dilemma: Is Causal Abstraction Enough for Mechanistic Interpretability?
Under arbitrarily powerful alignment maps, causal abstraction becomes vacuous: any neural network can be perfectly mapped to any algorithm, a result proven formally in Theorem 1 under five mild assump
- The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets
At sufficient scale, LLMs linearly represent the truth or falsehood of factual statements in their internal activations — a claim supported by PCA visualizations, cross-dataset probe transfer, and cau
- Endogenous Resistance to Activation Steering in Language Models
- Mechanistic Interpretability of EEG Foundation Models via Sparse Autoencoders
Applying TopK Sparse Autoencoders (SAEs) to three architecturally distinct EEG foundation models — SleepFM, REVE, and LaBraM — reveals that clinical concepts are not cleanly separable in these models'
- Manifold Steering Reveals the Shared Geometry of Neural Network Representation and Behavior
Manifold steering — intervening on model activations along paths constrained to lie on a learned activation manifold M_h rather than along Euclidean linear directions — produces behavioral trajectorie