question
active
question:to-what-extent-do-interpretable-features-represent-the-full-story-of-the-mlp-layer
To what extent do interpretable features represent the 'full story' of the MLP layer?
Question about completeness of feature-based model explanation
Source paper
extracted_from: (2024) · Marc Carauleanu · Michael Vaiana · Judd Rosenblatt · Cameron Berg +1
Neighborhood — ranked by edge-count
Findings (2)
finding
- Shows that loss recovery increases with autoencoder size
- Measures how much of the MLP layer's function is explained by the learned features
Concepts (1)
concept
- Reconstructed Transformer NLL · associated_with · Metric measuring the fraction of the MLP's loss contribution that the autoencoder explains, obtained by replacing the MLP activations with the autoencoder's outputs
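The metric above can be sketched as a ratio of loss gaps. This is a minimal illustration, not the paper's implementation: it assumes the MLP's loss contribution is measured against a baseline in which the MLP output is ablated (an assumption not stated in this entry), and the function name and the numeric values are hypothetical.

```python
def fraction_of_loss_explained(nll_original: float,
                               nll_reconstructed: float,
                               nll_ablated: float) -> float:
    """Fraction of the MLP's loss contribution recovered by the autoencoder.

    nll_original:      loss of the unmodified transformer
    nll_reconstructed: loss when MLP activations are replaced by autoencoder outputs
    nll_ablated:       loss when the MLP is ablated (ASSUMED baseline, e.g. zeroed output)
    """
    # Total loss attributable to the MLP layer under the assumed baseline.
    mlp_contribution = nll_ablated - nll_original
    if mlp_contribution <= 0:
        raise ValueError("ablating the MLP must increase the loss for the ratio to be meaningful")
    # Portion of that contribution still present after swapping in the reconstruction.
    recovered = nll_ablated - nll_reconstructed
    return recovered / mlp_contribution


# Hypothetical numbers, chosen only to illustrate the arithmetic.
score = fraction_of_loss_explained(nll_original=3.0,
                                   nll_reconstructed=3.2,
                                   nll_ablated=5.0)
print(f"fraction explained: {score:.2f}")
```

A value of 1 means the reconstruction loses nothing relative to the real MLP, and a value of 0 means it does no better than the ablated baseline.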
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; these are candidates for new edges or unrecognized duplicates.
- Theoretical interpretation of antipodal alignment and misalignment phenomena in PCA visualizations
- Key limitation of the paper's approach; MLP layers make up 2/3 of standard transformer parameters
- Authors argue features are model properties because logit effects and ablations are consistent with feature interpretations
- Claude 3 Opus ratings aligned with human judgment of feature descriptions
- General principle supported tangentially by covariance pooling work; relates to feature co-occurrence structure
- Quantitative comparison supporting SAE utility
- Major open problem identified in the paper; MLP layers constitute 2/3 of transformer parameters