claim

active

claim:sparse-autoencoders-extract-features-that-are-significantly-more-monosemantic-than-neurons-as-shown-by-four-independent-lines-of-evidence

Sparse autoencoders extract features that are significantly more monosemantic than neurons, as shown by four independent lines of evidence

Central claim of the paper, supported by detailed feature analysis, human evaluation, automated interpretability of activations, and automated interpretability of logit weights

Source paper

extracted_from

Towards Safe and Honest AI Agents with Neural Self-Other Overlap

(2024) · Marc Carauleanu · Michael Vaiana · Judd Rosenblatt · Cameron Berg +1

Neighborhood — ranked by edge-count

Findings (7)

finding

Arabic feature A/1/3450 has 27 neurons with coefficient magnitude ≥0.1, and its three largest coefficients are negative; the most correlated neuron responds to a mixture of non-English languages
supports
Demonstrates that the Arabic feature is not aligned to any single neuron
Arabic script feature A/1/3450 fires on 81% Arabic-script tokens when active, with 98% specificity at high activation levels
supports
Demonstrates activation specificity of the Arabic script sparse autoencoder feature
Automated logit weight prediction achieves 74% mean accuracy for features vs 58% for neurons vs 50% chance
supports
Automated interpretability of logit weights confirms feature downstream effects are more interpretable than neuron effects
Claude achieves significantly higher Spearman correlation predicting feature activations vs neuron activations
supports
Automated interpretability analysis of activations confirms features are more interpretable than neurons
Higher-activating feature intervals are systematically more interpretable than lower-activating intervals in human analysis
supports
Shows that interpretability correlates with activation strength; most of the model effect comes from high activations
Median feature interval scored 12/14 on interpretability rubric vs median neuron score of 0
supports
Human analysis showing features are substantially more interpretable than neurons
No neuron found with Hebrew Unicode block in top dataset examples; most correlated neuron A/neurons/489 has correlation of only 0.1
supports
Hebrew feature is effectively invisible in the neuron basis
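
One of the findings above compares features and neurons by the Spearman correlation between predicted and actual activations. As a minimal sketch of how such a score could be computed, assuming one list of predicted activations and one list of true activations per unit (the function names and the plain-list input format are illustrative assumptions, not the paper's pipeline):

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i..j
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(predicted, actual):
    """Spearman rank correlation: Pearson correlation of the two rank vectors."""
    if len(predicted) != len(actual) or len(predicted) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rp, ra = average_ranks(predicted), average_ranks(actual)
    mp, ma = mean(rp), mean(ra)
    cov = sum((x - mp) * (y - ma) for x, y in zip(rp, ra))
    var_p = sum((x - mp) ** 2 for x in rp)
    var_a = sum((y - ma) ** 2 for y in ra)
    if var_p == 0 or var_a == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (var_p * var_a) ** 0.5
```

Because the score depends only on rank order, a prediction that orders tokens correctly scores highly even if its scale differs from the true activations.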

Hypotheses (1)

hypothesis

We hypothesize that sparse autoencoders or similar methods will work on frontier large language models, though significant computational challenges remain
extends
Forward-looking prediction about scalability of the method to larger models

Questions (1)

question

What metrics can reliably tell us if dictionary learning has successfully extracted high-quality features?
gates
Central methodological gap: current metrics (loss, density histograms, manual inspection) are inadequate

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Sparse autoencoders produce interpretable features for large models. · claim · 0.862
Central claim of the paper: the method scales to state-of-the-art transformers.
Sparse Autoencoders Find Highly Interpretable Features in Language Models (Cunningham et al., 2023) · concept · 0.845
Core methodology paper for SAE-based interpretable feature extraction
Sparse autoencoders don't provide a comprehensive solution because they decode activations, not parameters · claim · 0.835
Critique of activation-based interpretability methods.
Training the sparse autoencoder on more data makes features subjectively sharper and more interpretable · claim · 0.832
Empirical principle discovered during autoencoder training; led to using 8 billion training points
Sparse autoencoders are preferable to stronger iterative dictionary learning methods because they cannot recover features the model itself cannot access · claim · 0.818
Rationale for using simpler sparse autoencoders rather than NP-hard compressed sensing algorithms
Sparse Autoencoder Features · concept · 0.805
Used in Anthropic welfare assessment to identify performative behavior and hidden emotional struggle co-activating features
SAEs successfully extract sparse feature dictionaries from embeddings of SleepFM, REVE, and LaBraM EEG transformers. · finding · 0.799
Foundational empirical result enabling all downstream analysis
Automated interpretability (Claude 3 Opus) and specificity scoring show SAE features are significantly more interpretable and specific than MLP neurons. · finding · 0.792
Quantitative comparison supporting SAE utility.
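
The similarity list above is gated by a cosine cutoff of 0.65. As a rough sketch of that kind of filter, assuming each entity has an embedding vector (the vectors, the dictionary format, and the function names here are hypothetical; how this page actually computes its embeddings is not stated):

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def similar_entities(query, candidates, threshold=0.65):
    """Return (name, score) pairs at or above the threshold, highest score first."""
    scored = [(name, cosine(query, vec)) for name, vec in candidates.items()]
    kept = [(name, score) for name, score in scored if score >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)
```

Entities that pass the cutoff but share no typed edge with the query are the ones worth reviewing as possible new edges or duplicates.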