community
active
leiden_hybrid_concepts
label: sonnet
community:leiden_hybrid_concepts-run2-c106
Sparse autoencoder interpretability limits
Critiques of SAEs for mechanistic interpretability, focusing on activation vs. parameter decoding gaps.
2 members.
Drawn from 2 sources
The papers/notes whose extracted claims & findings make up this cluster.
Bridges (3)
Other communities that share members with this one — cross-cutting threads or papers that sit at the seam between two themes.
Claims (2)
- Natural Language Autoencoders achieve readable explanations through an unsupervised reconstruction loss optimized with reinforcement learning, not through explicit interpretability constraints. Core insight: a reconstruction objective combined with appropriate initialization and KL regularization produces human-interpretable explanations as an emergent property.
- Sparse autoencoders don't provide a comprehensive solution because they decode activations, not parameters. Critique of activation-based interpretability methods.
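
To make the second claim concrete, here is a minimal sketch of a generic sparse autoencoder forward pass, assuming the common ReLU encoder, linear decoder, and L1 sparsity penalty. It is not taken from either source in this cluster; the function name `sae_forward`, the toy dimensions, and the `l1_coeff` value are all hypothetical. The point it illustrates is that the only thing the autoencoder ever sees is an activation vector: the weights of the model that produced that vector never enter the computation.

```python
import random

def matvec(vec, mat):
    """Multiply a row vector by a matrix stored as a list of rows."""
    return [sum(v * row[j] for v, row in zip(vec, mat)) for j in range(len(mat[0]))]

def sae_forward(act, W_enc, b_enc, W_dec, b_dec, l1_coeff=1e-3):
    """Encode one activation vector into feature codes and reconstruct it.

    Assumed generic SAE form (hypothetical names, not a specific library API).
    """
    # Encoder: linear map followed by ReLU, so every code is non-negative.
    codes = [max(0.0, p + b) for p, b in zip(matvec(act, W_enc), b_enc)]
    # Decoder: linear map from the codes back to activation space.
    recon = [d + b for d, b in zip(matvec(codes, W_dec), b_dec)]
    # Loss: squared reconstruction error plus an L1 penalty that pushes codes toward sparsity.
    mse = sum((r - a) ** 2 for r, a in zip(recon, act))
    l1 = sum(codes)  # equals the L1 norm because codes are non-negative
    return codes, recon, mse + l1_coeff * l1

random.seed(0)
d_model, d_dict = 4, 8  # toy sizes chosen for illustration only

# Stand-in for one activation vector read out of a model; no model weights are involved.
act = [random.gauss(0.0, 1.0) for _ in range(d_model)]
W_enc = [[random.gauss(0.0, 0.1) for _ in range(d_dict)] for _ in range(d_model)]
W_dec = [[random.gauss(0.0, 0.1) for _ in range(d_model)] for _ in range(d_dict)]
b_enc = [0.0] * d_dict
b_dec = [0.0] * d_model

codes, recon, loss = sae_forward(act, W_enc, b_enc, W_dec, b_dec)
```

Everything the dictionary learns is a function of `act`, which is why the critique describes the method as decoding activations and not parameters.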