finding
active
finding:causally-masked-attention-in-a-decoder-only-model-has-no-ordered-phase-proposition-2
Causally-masked attention in a decoder-only model has no ordered phase (Proposition 2)
Application to transformer language models
Source paper
extracted_from · (2025) · Francesco Sacco · Dalton A R Sakthivadivel · Michael Levin
Neighborhood — ranked by edge-count
Claims (1)
claim
- Interpretation of Proposition 2 as a fundamental limitation on LLMs
Communities (4)
community
- Spans attention head decomposition, benchmark awareness, and genomic pathogenicity prediction via neural models.
- Identifies distributed algorithms implemented across attention heads, with focus on causal masking limitations and emergent capabilities via activation manifold steering.
- Studies how decoder-only architectures lack ordered phases necessary for coherent long-sequence generation due to causally-masked attention constraints.
- Causal masking & phase transitions · members_of · Proves decoder-only causal attention lacks an ordered thermodynamic phase (Proposition 2).
Concepts (1)
concept
- causal masking · about · Attention restricted to previous tokens only, as in decoder-only models; leads to AR(ω)-like behaviour and no ordered phase
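As a minimal sketch of the masking pattern named in the concept entry above (not code from the source paper), the following computes attention weights in which each token sees only its own position and earlier ones. The function name `causal_attention_weights` is hypothetical, and including the token's own position in the visible window is an assumption based on the usual decoder-only convention.

```python
import math

def causal_attention_weights(scores):
    """Row-wise softmax over scores[i][j], with positions j > i masked out.

    Hypothetical helper for illustration; assumes a square score matrix
    and that token i may attend to itself as well as earlier tokens.
    """
    n = len(scores)
    weights = []
    for i in range(n):
        # Token i may only attend to positions 0..i.
        visible = scores[i][: i + 1]
        m = max(visible)  # subtract the max for numerical stability
        exps = [math.exp(s - m) for s in visible]
        total = sum(exps)
        # Masked (future) positions receive exactly zero weight.
        weights.append([e / total for e in exps] + [0.0] * (n - i - 1))
    return weights

# Uniform scores: each token spreads attention evenly over its visible prefix.
uniform = causal_attention_weights([[0.0] * 4 for _ in range(4)])
```

With uniform scores the first token attends only to itself, and the weight matrix is lower-triangular: no information flows from later tokens to earlier ones.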
Frameworks (2)
framework
- transformer architecture · supports · Neural network architecture based on attention, commonly used in large language models
- Second model system studied; used to show why flat autoregressive LLMs struggle with long-range coherence.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Attention mechanism with causal mask limiting each token's view to previous tokens; used in decoder-only transformers
- Finding from term importance analysis; allows focus on individual head terms rather than their compositions
- Cited as enabling precise behavioral control through SAE features, extending the same methodological line
- Modification to transformer restricting keys and values to previous time-steps only, mimicking how an agent accumulates experiences.
- Patching experiments localize truth representations to these specific hidden states in LLaMA-2 models
- Result from applying the Frobenius norm composition measurement to all attention head pairs in the two-layer model
- Response to the 'attention as explanation' critique; the paper provides a typology of when attention is and isn't directly interpretable
- Central thesis of the paper