community

active

leiden_hybrid_concepts

label: sonnet

community:leiden_hybrid_concepts-run2-c106

Sparse autoencoder interpretability limits

Critiques of SAEs for mechanistic interpretability, focusing on activation vs. parameter decoding gaps.

2 members.

Drawn from 2 sources

The papers/notes whose extracted claims & findings make up this cluster.

Bridges (3)

Other communities that share members with this one — cross-cutting threads or papers that sit at the seam between two themes.

Claims (2)

Natural Language Autoencoders achieve readable explanations through an unsupervised reconstruction loss optimized with reinforcement learning, not through explicit interpretability constraints. Core insight: the reconstruction objective, combined with appropriate initialization and KL regularization, produces human-interpretable explanations as an emergent property.
Sparse autoencoders don't provide a comprehensive solution because they decode activations, not parameters. Critique of activation-based interpretability methods.
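
To make the activation-versus-parameter distinction concrete, here is a minimal sketch of a common SAE formulation (ReLU encoder, linear decoder, L1 sparsity penalty). It is not taken from either source: the class name, dimensions, and random untrained weights are illustrative assumptions. The point to notice is that the only input is an activation vector sampled from a model; the model's own weights never enter the computation.

```python
import random


def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class SparseAutoencoder:
    """Hypothetical toy SAE; weights are random, not trained."""

    def __init__(self, d_model, d_dict, seed=0):
        rng = random.Random(seed)
        # Encoder maps an activation (d_model) to dictionary features (d_dict).
        self.w_enc = [[rng.gauss(0, 0.1) for _ in range(d_model)] for _ in range(d_dict)]
        self.b_enc = [0.0] * d_dict
        # Decoder maps features back to activation space.
        self.w_dec = [[rng.gauss(0, 0.1) for _ in range(d_dict)] for _ in range(d_model)]
        self.b_dec = [0.0] * d_model

    def encode(self, activation):
        pre = [p + b for p, b in zip(matvec(self.w_enc, activation), self.b_enc)]
        return [max(0.0, v) for v in pre]  # ReLU keeps features non-negative

    def decode(self, features):
        return [r + b for r, b in zip(matvec(self.w_dec, features), self.b_dec)]

    def loss(self, activation, l1_coeff=1e-3):
        features = self.encode(activation)
        recon = self.decode(features)
        mse = sum((a - r) ** 2 for a, r in zip(activation, recon)) / len(activation)
        l1 = sum(abs(f) for f in features)  # sparsity penalty on features
        return mse + l1_coeff * l1


# The input is an activation vector (assumed here), not the model's parameters.
sae = SparseAutoencoder(d_model=4, d_dict=16)
activation = [0.5, -1.0, 0.25, 2.0]
features = sae.encode(activation)
reconstruction = sae.decode(features)
```

Because the dictionary is fit to activations, whatever it explains is limited to the inputs that produced them, which is the gap the second claim points at.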