hypothesis

active

hypothesis:similar-superposition-phenomena-may-exist-in-self-attention-layers-and-similar-sparse-autoencoder-methods-may-extract-useful-structure-from-attention

Similar superposition phenomena may exist in self-attention layers, and similar sparse autoencoder methods may extract useful structure from attention

Extension of the superposition hypothesis to attention layers, proposed as a future research direction
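
As a rough illustration of what "sparse autoencoder methods" applied to attention could look like, here is a minimal sketch of an autoencoder forward pass and loss (reconstruction error plus an L1 sparsity penalty) over attention-layer outputs. It is not taken from the source paper. The names `init_sae`, `sae_forward`, `sae_loss`, the dimensions, the L1 coefficient, and the random stand-in activations are all assumptions made for illustration.

```python
import numpy as np

def init_sae(d_model, d_hidden, seed=0):
    """Hypothetical initialization: an overcomplete dictionary with unit-norm rows."""
    rng = np.random.default_rng(seed)
    w_dec = rng.normal(size=(d_hidden, d_model))
    w_dec /= np.linalg.norm(w_dec, axis=1, keepdims=True)
    return {
        "w_enc": w_dec.T.copy(),       # encoder starts as the decoder transpose
        "b_enc": np.zeros(d_hidden),
        "w_dec": w_dec,
        "b_dec": np.zeros(d_model),
    }

def sae_forward(params, x):
    """Encode activations into non-negative features, then reconstruct them."""
    pre = (x - params["b_dec"]) @ params["w_enc"] + params["b_enc"]
    feats = np.maximum(0.0, pre)                      # ReLU keeps features >= 0
    recon = feats @ params["w_dec"] + params["b_dec"]
    return feats, recon

def sae_loss(params, x, l1_coeff=1e-3):
    """Mean squared reconstruction error plus an L1 penalty on feature activations."""
    feats, recon = sae_forward(params, x)
    mse = np.mean(np.sum((recon - x) ** 2, axis=1))
    l1 = np.mean(np.sum(np.abs(feats), axis=1))
    return mse + l1_coeff * l1

# Stand-in for attention-layer outputs (32 token positions, width 16).
# In practice these would be activations recorded from a real model.
rng = np.random.default_rng(1)
attn_out = rng.normal(size=(32, 16))

params = init_sae(d_model=16, d_hidden=64)
feats, recon = sae_forward(params, attn_out)
loss = sae_loss(params, attn_out)
```

Training would minimize `sae_loss` with a gradient-based optimizer. Whether the learned features are interpretable for attention layers is exactly what the hypothesis leaves open.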

Source paper

extracted_from

Towards Safe and Honest AI Agents with Neural Self-Other Overlap

(2024) · Marc Carauleanu · Michael Vaiana · Judd Rosenblatt · Cameron Berg +1

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Introspection relies on general-purpose computational mechanisms (attention-based anomaly detection and residual stream dynamics) rather than specialized introspection circuits · claim · 0.769
Interpretive claim about the mechanistic substrate of introspection in LLMs
Attention heads with positive projection on reflection direction are sparse and located mostly in deeper layers of DeepSeek-R1-Qwen-1.5B · finding · 0.767
Structural finding about which attention heads control reflection behavior
Self-referential processing likely already occurs at massive scale in deployed systems through users' extended dialogues, reflective tasks, and metacognitive queries · claim · 0.766
Practical urgency argument connecting lab findings to deployment contexts
Attention is a generalization of convolution; all convolutions can be expressed as tensor products of fixed relative position attention patterns and weight matrices · claim · 0.766
Mathematical equivalence showing the relationship between attention mechanisms and convolutional operations
Self-referential processing is a privileged computational regime for consciousness-like dynamics in artificial systems, as predicted by the convergence of major consciousness theories · hypothesis · 0.764
The theoretical hypothesis tested across all four experiments; motivated by convergence of GWT, RPT, HOT, IIT, predictive processing on recurrent/self-referential dynamics
Most attention heads in one-layer models dedicate an enormous fraction of their capacity to copying behavior · claim · 0.762
Empirical observation from examining expanded OV/QK matrices; approximately 10 out of 12 heads show significant copying
Superposition is in some sense deliberate: the model converts pure neurons into polysemantic neurons to store more features in fewer neurons. · claim · 0.762
Interpretation of the cars-in-superposition circuit finding as an intentional representational strategy
Does self-referential processing causally instantiate algorithmic properties proposed by consciousness theories (recurrent integration, global broadcasting, metacognitive monitoring) in LLMs? · question · 0.759
The strongest mechanistic question the behavioral evidence cannot answer; requires interpretability analysis of activations
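
The selection rule stated for this list (cosine ≥ 0.65, no typed edge) can be sketched as follows. This is a minimal illustration and not the system's actual implementation: the function names, the toy two-dimensional embeddings, and the edge representation as a set of name pairs are all hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def untyped_neighbors(target, embeddings, typed_edges, threshold=0.65):
    """Entities similar to `target` that have no typed edge to it, best match first."""
    found = []
    for name, vec in embeddings.items():
        if name == target:
            continue
        if (target, name) in typed_edges or (name, target) in typed_edges:
            continue  # already linked by a typed relation, so not a candidate
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            found.append((name, round(score, 3)))
    return sorted(found, key=lambda pair: pair[1], reverse=True)

# Toy data (hypothetical): one close untyped neighbor, one close typed one, one distant.
embeddings = {
    "hypothesis": [1.0, 0.0],
    "near_untyped": [0.8, 0.6],
    "near_typed": [1.0, 0.1],
    "far": [0.0, 1.0],
}
typed_edges = {("hypothesis", "near_typed")}

result = untyped_neighbors("hypothesis", embeddings, typed_edges)
```

Only `near_untyped` survives both filters here: `near_typed` is excluded by its existing edge and `far` falls below the threshold.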