finding

active

finding:causally-masked-attention-in-a-decoder-only-model-has-no-ordered-phase-proposition-2

Causally-masked attention in a decoder-only model has no ordered phase (Proposition 2)

Application to transformer language models

Source paper

extracted_from

Topological constraints on self-organisation in locally interacting systems

(2025) · Francesco Sacco · Dalton A R Sakthivadivel · Michael Levin

Neighborhood — ranked by edge-count

Claims (1)

claim

Decoder-only transformer architectures are fundamentally limited in generating long, coherent sequences because they lack an ordered phase.
supports
Interpretation of Proposition 2 as a fundamental limitation on LLMs

Communities (4)

community

Mechanistic interpretability & model evaluation
members_of
Spans attention head decomposition, benchmark awareness, and genomic pathogenicity prediction via neural models.
Mechanistic structure of transformer attention computations
members_of
Identifies distributed algorithms implemented across attention heads, with focus on causal masking limitations and emergent capabilities via activation manifold steering.
Causal masking phase transitions in transformers
members_of
Studies how decoder-only architectures lack ordered phases necessary for coherent long-sequence generation due to causally-masked attention constraints.
Causal masking & phase transitions
members_of
Proves decoder-only causal attention lacks an ordered thermodynamic phase (Proposition 2).

Concepts (1)

concept

causal masking
about
Attention restricted to previous tokens only, as in decoder-only models; leads to AR(ω)-like behaviour and no ordered phase
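
A minimal sketch of the causal masking described above, in plain Python. The function name `causal_attention_weights` and the list-of-lists input are illustrative assumptions, not taken from the source paper. The sketch shows only the masking step (each position attends to itself and earlier positions) and does not attempt to demonstrate Proposition 2.

```python
import math


def causal_attention_weights(scores):
    """Row-wise softmax over scores[i][j], keeping only positions j <= i.

    `scores` is a square list of lists of raw attention logits
    (hypothetical input; in a real model these come from query-key products).
    """
    n = len(scores)
    weights = []
    for i in range(n):
        # Position i may only attend to itself and earlier positions.
        visible = scores[i][: i + 1]
        m = max(visible)  # subtract the max for numerical stability
        exps = [math.exp(s - m) for s in visible]
        total = sum(exps)
        # Masked (future) positions receive exactly zero weight.
        weights.append([e / total for e in exps] + [0.0] * (n - i - 1))
    return weights


if __name__ == "__main__":
    # With uniform logits, row i spreads its weight evenly over i + 1 positions.
    uniform = [[0.0] * 4 for _ in range(4)]
    for row in causal_attention_weights(uniform):
        print([round(w, 3) for w in row])
```

The resulting weight matrix is lower triangular: the first position can only attend to itself, and no position places weight on a later one.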

Frameworks (2)

framework

transformer architecture
supports
Neural network architecture based on attention, commonly used in large language models
Autoregressive models
about
Second model system studied; used to show why flat autoregressive LLMs struggle with long-range coherence.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

causally-masked attention · method · 0.804
Attention mechanism with causal mask limiting each token's view to previous tokens; used in decoder-only transformers
Second-order virtual attention head terms (V-composition) have a small marginal effect in two-layer attention-only models · claim · 0.766
Finding from term importance analysis; allows focus on individual head terms rather than their compositions
Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models (Marks et al., 2025) · concept · 0.766
Cited as enabling precise behavioral control through SAE features, extending the same methodological line
Causal Attention Mask · method · 0.761
Modification to transformer restricting keys and values to previous time-steps only, mimicking how an agent accumulates experiences.
A small group of hidden states (group b) over end-of-sentence punctuation tokens is highly causally implicated in truth judgments · finding · 0.760
Patching experiments localize truth representations to these specific hidden states in LLaMA-2 models
In the analyzed two-layer attention-only model, only K-composition is significant; V- and Q-composition are negligible by the Frobenius norm measure · finding · 0.760
Result from applying the Frobenius norm composition measurement to all attention head pairs in the two-layer model
Naive interpretation of attention patterns can be both informative and fundamentally misleading when Q-, K-, or V-composition is present · claim · 0.759
Response to the 'attention as explanation' critique; the paper provides a typology of when attention is and isn't directly interpretable
Causal abstraction is not enough for mechanistic interpretability because it becomes vacuous without assumptions about how models encode information · claim · 0.757
Central thesis of the paper