question

active

question:to-what-extent-do-interpretable-features-represent-the-full-story-of-the-mlp-layer

To what extent do interpretable features represent the 'full story' of the MLP layer?

Question about the completeness of feature-based model explanations

Source paper

extracted_from

Towards Safe and Honest AI Agents with Neural Self-Other Overlap

(2024) · Marc Carauleanu · Michael Vaiana · Judd Rosenblatt · Cameron Berg +1

Neighborhood — ranked by edge-count

Findings (2)

finding

A/5 autoencoder (131,072 features) recovers 94.5% of MLP log-likelihood loss reduction
answered_by
Shows that loss recovery increases with autoencoder size
A/1 autoencoder recovers 79% of MLP log-likelihood loss reduction with 4,096 features
answered_by
Measures how much of the MLP layer's function is explained by the learned features

Concepts (1)

concept

Reconstructed Transformer NLL
associated_with
Metric for the fraction of the MLP's loss contribution that the autoencoder explains, measured by replacing the MLP activations with the autoencoder's outputs
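
As a rough illustration of the metric above, the fraction of the MLP's loss reduction that an autoencoder recovers can be computed from three loss values. This is a minimal sketch, not the source's implementation: the function name `fraction_of_loss_recovered`, the choice of an ablated-MLP run as the baseline (the source does not say which ablation is used), and the example loss values are all assumptions made for illustration.

```python
def fraction_of_loss_recovered(loss_original: float,
                               loss_ablated: float,
                               loss_reconstructed: float) -> float:
    """Fraction of the MLP's loss reduction recovered by the autoencoder.

    loss_original:      NLL of the unmodified transformer.
    loss_ablated:       NLL with the MLP output removed (assumed baseline).
    loss_reconstructed: NLL with MLP activations replaced by the
                        autoencoder's reconstruction.
    """
    # How much the real MLP lowers the loss relative to the ablated baseline.
    mlp_reduction = loss_ablated - loss_original
    if mlp_reduction <= 0:
        raise ValueError("the ablated loss must exceed the original loss")
    # How much of that reduction survives when the autoencoder stands in.
    recovered = loss_ablated - loss_reconstructed
    return recovered / mlp_reduction


# Hypothetical loss values, chosen only to show the arithmetic.
score = fraction_of_loss_recovered(loss_original=3.0,
                                   loss_ablated=4.0,
                                   loss_reconstructed=3.2)
print(f"{score:.1%}")
```

Under this definition a score of 1.0 would mean the reconstruction performs as well as the original MLP, and 0.0 would mean it does no better than removing the MLP. The 79% and 94.5% figures listed under Findings are scores of this kind.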

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

In intermediate regimes of scale or layer depth, LLMs may linearly represent features at intermediate levels of abstraction such as 'accurate factual recall' or 'close association' rather than abstract truth · claim · 0.774
Theoretical interpretation of antipodal alignment and misalignment phenomena in PCA visualizations
MLP layers are much harder to get traction on than attention layers; understanding them requires individually interpretable neurons which are rarely found · claim · 0.773
Key limitation of the paper's approach; MLP layers make up 2/3 of standard transformer parameters
Interpretability features converge across different model architectures, revealing structural similarities. · claim · 0.758
Learned features reflect the functionality of the model and not just the data distribution, as evidenced by interpretable downstream effects not used in dictionary learning · claim · 0.758
Authors argue features are model properties because logit effects and ablations are consistent with feature interpretations
Automated interpretability using LLMs can usefully score feature specificity. · claim · 0.750
Claude 3 Opus ratings aligned with human judgment of feature descriptions.
Geometry of features matters for representation quality. · claim · 0.749
General principle supported tangentially by covariance pooling work; relates to feature co-occurrence structure.
Automated interpretability (Claude 3 Opus) and specificity scoring show SAE features are significantly more interpretable and specific than MLP neurons. · finding · 0.747
Quantitative comparison supporting SAE utility.
When and how can MLP neurons in transformers be individually interpreted, and what progress is needed to extend mechanistic interpretability to them? · question · 0.741
Major open problem identified in the paper; MLP layers constitute 2/3 of transformer parameters