hypothesis
active
hypothesis:similar-superposition-phenomena-may-exist-in-self-attention-layers-and-similar-sparse-autoencoder-methods-may-extract-useful-structure-from-attention
Similar superposition phenomena may exist in self-attention layers and similar sparse autoencoder methods may extract useful structure from attention
Extension of superposition hypothesis to attention layers as future research direction
Source paper
extracted_from (2024) · Marc Carauleanu · Michael Vaiana · Judd Rosenblatt · Cameron Berg +1
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Interpretive claim about the mechanistic substrate of introspection in LLMs
- Structural finding about which attention heads control reflection behavior
- Practical urgency argument connecting lab findings to deployment contexts
- Mathematical equivalence showing the relationship between attention mechanisms and convolutional operations
- The theoretical hypothesis tested across all four experiments; motivated by convergence of GWT, RPT, HOT, IIT, predictive processing on recurrent/self-referential dynamics
- Empirical observation from examining expanded OV/QK matrices; approximately 10 out of 12 heads show significant copying
- Interpretation of the cars-in-superposition circuit finding as an intentional representational strategy
- The strongest mechanistic question the behavioral evidence cannot answer; requires interpretability analysis of activations