claim

active

claim:sparse-autoencoders-don-t-provide-a-comprehensive-solution-because-they-decode-activations-not-parameters

Sparse autoencoders don't provide a comprehensive solution because they decode activations, not parameters

Critique of activation-based interpretability methods.
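A minimal sketch of what "decoding activations, not parameters" can mean in practice, assuming the common ReLU sparse autoencoder with an L1 sparsity penalty (the source does not specify which SAE variant it critiques). All names here (`sae_forward`, `sae_loss`, `model_W`, the dimensions) are hypothetical and chosen for illustration. The point is only that the autoencoder's input is an activation vector; the model's weight matrix never enters it.

```python
import random

def matvec(W, x):
    """Multiply a matrix (stored as a list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def sae_forward(act, W_enc, b_enc, W_dec, b_dec):
    """Encode one activation vector into features, then reconstruct it."""
    centered = [a - b for a, b in zip(act, b_dec)]
    pre = [p + b for p, b in zip(matvec(W_enc, centered), b_enc)]
    feats = [max(0.0, p) for p in pre]  # ReLU keeps features non-negative
    recon = [r + b for r, b in zip(matvec(W_dec, feats), b_dec)]
    return feats, recon

def sae_loss(act, feats, recon, l1_coeff=1e-3):
    """Squared reconstruction error plus an L1 sparsity penalty."""
    mse = sum((a - r) ** 2 for a, r in zip(act, recon))
    return mse + l1_coeff * sum(abs(f) for f in feats)

random.seed(0)
d_model, d_feat = 4, 8

# Hypothetical model parameter: the SAE is never given this matrix.
model_W = [[random.gauss(0, 1) for _ in range(d_model)] for _ in range(d_model)]
x = [random.gauss(0, 1) for _ in range(d_model)]
act = matvec(model_W, x)  # the activation is the only thing the SAE sees

# Untrained SAE weights, for shape illustration only.
W_enc = [[random.gauss(0, 0.1) for _ in range(d_model)] for _ in range(d_feat)]
W_dec = [[random.gauss(0, 0.1) for _ in range(d_feat)] for _ in range(d_model)]
b_enc = [0.0] * d_feat
b_dec = [0.0] * d_model

feats, recon = sae_forward(act, W_enc, b_enc, W_dec, b_dec)
loss = sae_loss(act, feats, recon)
```

In this sketch, `model_W` shapes `act` but is otherwise invisible to the autoencoder, which is the gap the claim points at: whatever the features explain, they explain about activations on the sampled inputs rather than about the parameters that produced them.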

Source paper

extracted_from

cimcWhitepaper

Neighborhood — ranked by edge-count

Communities (3)

community

Mechanistic interpretability & model evaluation
members_of
Spans attention head decomposition, benchmark awareness, and genomic pathogenicity prediction via neural models.
Autoregressive models and context window limitations
members_of
Theoretical and empirical analysis of why AR language models cannot maintain coherence or convergence beyond their context window through local interactions alone.
Sparse autoencoder interpretability limits
members_of
Critiques of SAEs for mechanistic interpretability, focusing on activation vs. parameter decoding gaps.

Methods (1)

method

Sparse Autoencoders (SAE)
associated_with
Interpretability method criticized in this paper for shattering manifolds into atomic pieces, obscuring overarching semantic structure.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Sparse autoencoders produce interpretable features for large models. · claim · 0.883
Central claim of the paper: the method scales to state-of-the-art transformers.
Sparse Autoencoders Find Highly Interpretable Features in Language Models (Cunningham et al., 2023) · concept · 0.869
Core methodology paper for SAE-based interpretable feature extraction
Sparse autoencoders are preferable to stronger iterative dictionary learning methods because they cannot recover features the model itself cannot access · claim · 0.866
Rationale for using simpler sparse autoencoders rather than NP-hard compressed sensing algorithms
We hypothesize that sparse autoencoders or similar methods will work on frontier large language models, though significant computational challenges remain · hypothesis · 0.844
Forward-looking prediction about scalability of the method to larger models
Training the sparse autoencoder on more data makes features subjectively sharper and more interpretable · claim · 0.839
Empirical principle discovered during autoencoder training; led to using 8 billion training points
Sparse Autoencoder · framework · 0.839
Interpretability framework used to decompose layer-40 activations into sparse feature sets for studying emotional alignment and persistence
Sparse autoencoders extract features that are significantly more monosemantic than neurons, as shown by four independent lines of evidence · claim · 0.835
Central claim of the paper, supported by detailed feature analysis, human evaluation, automated interpretability of activations, and automated interpretability of logit weights
Sparse Autoencoder Features · concept · 0.831
Used in Anthropic's welfare assessment to identify co-activating features for performative behavior and hidden emotional struggle.
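
The listing above can be read as a threshold filter over embedding similarity. The following is a rough sketch under stated assumptions: entity embeddings are plain lists of floats, and `related_by_similarity`, `candidates`, and `typed_edges` are hypothetical names, not the knowledge base's actual API. Only the 0.65 cutoff and the "no typed edge" condition come from the page itself.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def related_by_similarity(target, candidates, typed_edges, threshold=0.65):
    """Return (name, score) pairs at or above the threshold, best first.

    candidates:  dict mapping entity name -> embedding (hypothetical layout)
    typed_edges: set of entity names that already have a typed relation
    """
    scored = [
        (name, cosine(target, emb))
        for name, emb in candidates.items()
        if name not in typed_edges  # skip entities that are already linked
    ]
    kept = [(name, score) for name, score in scored if score >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)

# Toy data, invented for illustration only.
target = [1.0, 0.0]
candidates = {
    "a": [1.0, 0.0],  # same direction as the target
    "b": [0.0, 1.0],  # orthogonal, falls below the threshold
    "c": [1.0, 1.0],  # 45 degrees away, similarity about 0.707
    "d": [2.0, 0.0],  # similar, but already has a typed edge
}
result = related_by_similarity(target, candidates, typed_edges={"d"})
```

Entities returned this way are only candidates: a high score suggests a possible new edge or an unrecognized duplicate, not a confirmed relation.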

Cross-corpus bridges (1)

same_concept_as · Nomic cosine

External markdown files that talk about the same concept as this entity.

aboutblank_kb
Autoencoder Architecture · frameworks/variational-autoencoder-architecture.md · 0.804