Sparse autoencoders find highly interpretable features in language models

By Hoagy Cunningham · Aidan Ewart · Logan Riggs · Robert Huben · Lee Sharkey

DOI 10.48550/arxiv.2309.08600 · arXiv 2309.08600

Related work: refs + corpus + external arXiv

Cited / in-corpus / arXiv badges show which signals surfaced each row. Multi-source rows are weighted higher.

A Survey on Sparse Autoencoders: Interpreting the Internal Mechanisms of Large Language Models
Xuansheng Wu, Haiyan Zhao, Daking Rai, Ziyu Yao, Ninghao Liu, Mengnan Du, Dong Shu
2025
≈ 85%
Sparse Autoencoder Insights on Voice Embeddings
Yu Zhou, Vijay K. Gurbani, Daniel Pluth
2025
≈ 83%
Mechanistic Interpretability of ASR models using Sparse Autoencoders
Zachary Nicholas Houghton, Yu Zhou, Vijay K. Gurbani, Dan Pluth
2026
≈ 83%
Sparse Autoencoder Features for Classifications and Transferability
Shan Chen, Kuleen Sasse, Hugo Aerts, Thomas Hartvigsen, Danielle S. Bitterman, Jack Gallifant
2026
≈ 82%
Measuring Sparse Autoencoder Feature Sensitivity
Katherine Tian, Nathan Hu, Claire Tian
2025
≈ 82%
Sparse Autoencoders Do Not Find Canonical Units of Analysis
Michael Pearce, Joseph Bloom, Curt Tigges, Noura Al Moubayed, Lee Sharkey, Neel Nanda, Patrick Leask, Bart Bussmann
2025
≈ 82%
Improving Dictionary Learning with Gated Sparse Autoencoders
Arthur Conmy, Lewis Smith, Tom Lieberum, Vikrant Varma, János Kramár, Rohin Shah, Neel Nanda, Senthooran Rajamanoharan
2024
≈ 82%
Empirical Evaluation of Progressive Coding for Sparse Autoencoders
Anders Søgaard, Hans Peter
2025
≈ 81%
Evaluating and Designing Sparse Autoencoders by Approximating Quasi-Orthogonality
Adam Davies, Marc E. Canby, Julia Hockenmaier, Sewoong Lee
2025
≈ 81%
Interpreting Attention Layer Outputs with Sparse Autoencoders
Robert Krzyzanowski, Joseph Isaac Bloom, Arthur Conmy, Neel Nanda, Connor Kissane
2024
≈ 81%
Incorporating Hierarchical Semantics in Sparse Autoencoder Architectures
Sean Richardson, Kiho Park, Victor Veitch, Mark Muchane
2025
≈ 80%
Group Equivariance Meets Mechanistic Interpretability: Equivariant Sparse Autoencoders
Ana Lucic, Ege Erdogan
2025
≈ 80%
Sparse Autoencoders Enable Scalable and Reliable Circuit Identification in Language Models
Thang Bui, Charles O'Neill
2024
≈ 80%
Features that Make a Difference: Leveraging Gradients for Improved Dictionary Learning
Jared Wilson, Max Forsey, Bryce Hepner, Thomas Vin Howe, David Wingate, Jeffrey Olmo
2025
≈ 80%
Understanding sparse autoencoder scaling in the presence of feature manifolds
Liv Gorton, Tom McGrath, Eric J. Michaud
2025
≈ 80%
Interpreting Language Model Parameters
in corpus
2026
≈ 76%
Mechanistic Knobs in LLMs: Retrieving and Steering High-Order Semantic Features via Sparse Autoencoders
in corpus
2026
≈ 76%
Paper Summary: Interpreting Language Model Parameters
in corpus
≈ 75%
Natural Language Autoencoders Produce Unsupervised Explanations of LLM Activations
in corpus
≈ 75%
Mechanistic Interpretability of EEG Foundation Models via Sparse Autoencoders
in corpus
2026
≈ 73%
Persistence and Introspection of Emotion Features
in corpus
≈ 69%
Large Language Models Report Subjective Experience Under Self-Referential Processing
in corpus
2025
≈ 69%
Steering Evaluation-Aware Language Models to Act Like They Are Deployed
in corpus
2025
≈ 68%
The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets
in corpus
2023
≈ 68%
Finding Alignments Between Interpretable Causal Variables and Distributed Neural Representations
in corpus
2023
≈ 68%
The Non-Linear Representation Dilemma: Is Causal Abstraction Enough for Mechanistic Interpretability?
in corpus
2025
≈ 68%
Active inference: demystified and compared
in corpus
2021
≈ 68%

Cited by (6)

From Directions to Cones: Exploring Multidimensional Representations of Propositional Facts in LLMs
Propositional truth in LLMs is not encoded as a single linear direction but as a multi-dimensional subspace that can be characterized by concept cones—sets of all nonnegative linear combinations of or
The Non-Linear Representation Dilemma: Is Causal Abstraction Enough for Mechanistic Interpretability?
Under arbitrarily powerful alignment maps, causal abstraction becomes vacuous: any neural network can be perfectly mapped to any algorithm, a result proven formally in Theorem 1 under five mild assump
The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets
At sufficient scale, LLMs linearly represent the truth or falsehood of factual statements in their internal activations — a claim supported by PCA visualizations, cross-dataset probe transfer, and cau
Endogenous Resistance to Activation Steering in Language Models
Mechanistic Interpretability of EEG Foundation Models via Sparse Autoencoders
Applying TopK Sparse Autoencoders (SAEs) to three architecturally distinct EEG foundation models — SleepFM, REVE, and LaBraM — reveals that clinical concepts are not cleanly separable in these models'
Manifold Steering Reveals the Shared Geometry of Neural Network Representation and Behavior
Manifold steering — intervening on model activations along paths constrained to lie on a learned activation manifold M_h rather than along Euclidean linear directions — produces behavioral trajectorie