finding

active

finding:a-1-autoencoder-recovers-79-of-mlp-log-likelihood-loss-reduction-with-4-096-features

A/1 autoencoder recovers 79% of MLP log-likelihood loss reduction with 4,096 features

Measures how much of the MLP layer's function is explained by the learned features
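
A minimal sketch of how a "loss recovered" percentage of this kind can be computed. This entry does not state the baseline, so the code assumes a common convention: the autoencoder's reconstruction is substituted for the MLP output and compared against a run where the MLP output is zero-ablated. The function name and the loss values in the usage line are hypothetical, not figures from the paper.

```python
def fraction_loss_recovered(loss_original, loss_with_autoencoder, loss_zero_ablated):
    """Share of the MLP's loss reduction that survives the substitution.

    Assumed convention (not stated in this entry):
      loss_original         : loss of the unmodified model
      loss_with_autoencoder : loss when the MLP output is replaced by the
                              autoencoder's reconstruction
      loss_zero_ablated     : loss when the MLP output is replaced by zeros
    """
    gap = loss_zero_ablated - loss_original
    if gap <= 0:
        raise ValueError("zero-ablation must raise the loss for the ratio to be meaningful")
    return (loss_zero_ablated - loss_with_autoencoder) / gap


# Hypothetical losses chosen so the ratio comes out at about 0.79.
example = fraction_loss_recovered(
    loss_original=3.0, loss_with_autoencoder=3.21, loss_zero_ablated=4.0
)
print(round(example, 2))
```

Under this convention, 100% would mean the reconstruction costs no loss at all, and 0% would mean it does no better than removing the MLP entirely.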

Source paper

extracted_from

Towards Safe and Honest AI Agents with Neural Self-Other Overlap

(2024) · Marc Carauleanu · Michael Vaiana · Judd Rosenblatt · Cameron Berg +1

Neighborhood — ranked by edge-count

Questions (1)

question

To what extent do interpretable features represent the 'full story' of the MLP layer?
answered_by
Question about completeness of feature-based model explanation

Findings (1)

finding

A/5 autoencoder (131,072 features) recovers 94.5% of MLP log-likelihood loss reduction
restates
Shows that loss recovery increases with autoencoder size

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.
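
A rough sketch of how the cosine thresholds on this page could sort candidate neighbors. The 0.65 and 0.90 cutoffs come from the page itself; the function names and bucket labels are hypothetical, and the embedding model behind the scores is not specified here.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def bucket(score, related_cutoff=0.65, restated_cutoff=0.90):
    """Map a similarity score to the page's two listed thresholds."""
    if score >= restated_cutoff:
        return "restatement candidate"
    if score >= related_cutoff:
        return "related by similarity"
    return "not listed"


# A score of 0.784, like the first entry in this list, falls in the middle bucket.
print(bucket(0.784))
```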

512-neuron MLP continues to yield new features as autoencoder scales to 131,072 features (256× expansion) · finding · 0.784
Shows superposition enables many more features than neurons
Natural Language Autoencoders achieve readable explanations through unsupervised reconstruction loss optimized with reinforcement learning, not explicit interpretability constraints. · claim · 0.765
Core insight: reconstruction objective combined with appropriate initialization and KL regularization produces human-interpretable explanations as emergent property.
Sparse autoencoders produce interpretable features for large models. · claim · 0.765
Central claim of the paper: the method scales to state-of-the-art transformers.
Sparse autoencoders extract features that are significantly more monosemantic than neurons, as shown by four independent lines of evidence · claim · 0.764
Central claim of the paper, supported by detailed feature analysis, human evaluation, automated interpretability of activations, and automated interpretability of logit weights
Sparse Autoencoders Find Highly Interpretable Features in Language Models (Cunningham et al., 2023) · concept · 0.761
Core methodology paper for SAE-based interpretable feature extraction
Shrinkage from L1 penalty significantly harms sparse autoencoder performance. · claim · 0.753
Systematic underestimation of feature activations degrades reconstruction and interpretability.
Sparse autoencoders are preferable to stronger iterative dictionary learning methods because they cannot recover features the model itself cannot access · claim · 0.751
Rationale for using simpler sparse autoencoders rather than NP-hard compressed sensing algorithms
Sparse autoencoders don't provide a comprehensive solution because they decode activations, not parameters · claim · 0.751
Critique of activation-based interpretability methods.

Restated by (1)

cosine ≥ 0.90

Other entities that say roughly the same thing. May be merge candidates or independent restatements across papers.

finding
A/5 autoencoder (131,072 features) recovers 94.5% of MLP log-likelihood loss reduction