finding

active

finding:a-5-autoencoder-131-072-features-recovers-94-5-of-mlp-log-likelihood-loss-reduction

A/5 autoencoder (131,072 features) recovers 94.5% of MLP log-likelihood loss reduction

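Shows that loss recovery increases with autoencoder size: the A/1 autoencoder (4,096 features) recovers 79% of the MLP log-likelihood loss reduction, while the A/5 autoencoder (131,072 features) recovers 94.5%.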
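
This entry does not spell out how the recovery percentage is computed. A minimal sketch, assuming the common definition: compare the language-model loss with the MLP zero-ablated, with the MLP output replaced by the autoencoder's reconstruction, and with the unmodified model. The function name and the loss values below are hypothetical and chosen only to illustrate the arithmetic; they are not taken from the source paper.

```python
def fraction_of_loss_recovered(loss_clean, loss_zero_ablated, loss_with_autoencoder):
    """Share of the MLP's loss reduction that survives when the MLP output
    is replaced by the autoencoder reconstruction (assumed definition).

    loss_clean            -- loss of the unmodified model
    loss_zero_ablated     -- loss with the MLP output set to zero
    loss_with_autoencoder -- loss with the MLP output replaced by the
                             autoencoder's reconstruction
    """
    # Total loss reduction attributable to the MLP layer.
    gap = loss_zero_ablated - loss_clean
    if gap <= 0:
        raise ValueError("zero-ablation must increase the loss for the ratio to be meaningful")
    # Portion of that reduction still present with the reconstruction spliced in.
    return (loss_zero_ablated - loss_with_autoencoder) / gap


# Hypothetical losses (illustration only, not measurements from the paper).
clean, ablated, spliced = 3.0, 4.0, 3.055
print(round(fraction_of_loss_recovered(clean, ablated, spliced), 3))
```

Under this definition, a value of 1.0 means the reconstruction is as good as the real MLP output, and 0.0 means it is no better than removing the MLP entirely.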

Source paper

extracted_from

Towards Safe and Honest AI Agents with Neural Self-Other Overlap

(2024) · Marc Carauleanu · Michael Vaiana · Judd Rosenblatt · Cameron Berg +1

Neighborhood — ranked by edge-count

Hypotheses (1)

hypothesis

We hypothesize that sparse autoencoders or similar methods will work on frontier large language models, though significant computational challenges remain
associated_with
Forward-looking prediction about scalability of the method to larger models

Questions (1)

question

To what extent do interpretable features represent the 'full story' of the MLP layer?
answered_by
Question about completeness of feature-based model explanation

Findings (1)

finding

A/1 autoencoder recovers 79% of MLP log-likelihood loss reduction with 4,096 features
restates
Measures how much of the MLP layer's function is explained by the learned features

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

512-neuron MLP continues to yield new features as autoencoder scales to 131,072 features (256× expansion) · finding · 0.787
Shows superposition enables many more features than neurons
Sparse autoencoders extract features that are significantly more monosemantic than neurons, as shown by four independent lines of evidence · claim · 0.759
Central claim of the paper, supported by detailed feature analysis, human evaluation, automated interpretability of activations, and automated interpretability of logit weights
Sparse autoencoders produce interpretable features for large models. · claim · 0.758
Central claim of the paper: the method scales to state-of-the-art transformers.
Sparse Autoencoders Find Highly Interpretable Features in Language Models (Cunningham et al., 2023) · concept · 0.757
Core methodology paper for SAE-based interpretable feature extraction
Automated logit weight prediction achieves 74% mean accuracy for features vs 58% for neurons vs 50% chance · finding · 0.753
Automated interpretability of logit weights confirms feature downstream effects are more interpretable than neuron effects
Natural Language Autoencoders achieve readable explanations through unsupervised reconstruction loss optimized with reinforcement learning, not explicit interpretability constraints. · claim · 0.749
Core insight: reconstruction objective combined with appropriate initialization and KL regularization produces human-interpretable explanations as emergent property.
82% of features in 1M SAE had maximum Pearson correlation ≤0.3 with any MLP neuron, and manual inspection showed no semantic resemblance. · finding · 0.745
SAE features are not simply mirroring individual neurons.
Binary detection accuracy (up to 97.3% at L0 α=5) is entirely explained by global logit shifts (r=0.999 correlation with control) · finding · 0.744
Core negative result: the binary detection paradigm cannot distinguish genuine introspection from uniform output bias

Restated by (1)

cosine ≥ 0.90

Other entities that say roughly the same thing. May be merge candidates or independent restatements across papers.

finding
A/1 autoencoder recovers 79% of MLP log-likelihood loss reduction with 4,096 features