Mechanistic structure of transformer attention computations

Identifies distributed algorithms implemented across attention heads, with a focus on causal-masking limitations and emergent capabilities via activation-manifold steering.

25 members.

Sub-communities (7)

Finer clusters this community splits into. Each is its own community page.

Contemplative steering & introspective activation in language models (5)
Emergence through distributed attention and uncertainty (4)
Distributed computation across attention heads (4)
Functional tokens for emergent model reasoning (4)
Empirical gaps in performance-communication alignment (4)
Metacognitive state inference and attention alignment (2)
Causal masking phase transitions in transformers (2)

Drawn from 15 sources

The papers/notes whose extracted claims & findings make up this cluster.

Paper Summary: Interpreting Language Model Parameters (4 members)
Janus Information Flow Transformers 2025 (4 members)
ATLAS: Agentic or Latent Visual Reasoning? One Word is Enough for Both (3 members)
RESEARCH-VECTORS.md (2 members)
Topological constraints on self-organisation in locally interacting systems (2 members)
guo-atlas-2026.md (1 member)
Johnson Vasocomputation 2023 (1 member)
Towards a computational phenomenology of mental action: modelling meta-awareness and attentional control with deep parametric active inference (1 member)
unfold-chat-catalog.md (1 member)
2026 02 02_2217_Search_Papers_The Literature Reveals Sophisticated Communication (1 member)
Emergence and Causality in Complex Systems: A Survey on Causal Emergence and Related Quantitative Studies (1 member)
2026 02 02_2218_Search_Papers_The Existing Literature Focuses Primarily On Vc Pe (1 member)
2026-05-09_briefing_for_ozero.md (1 member)
2026-05-15_manifold-overlap-papers-economy-strategy.md (1 member)
Koan Battery: Measuring Reflective Mode Accessibility in AI (1 member)

Bridges (15)

Other communities that share members with this one — cross-cutting threads or papers that sit at the seam between two themes.

Mechanistic interpretability & model evaluation (25 shared)
Contemplative steering & introspective activation in language models (5 shared)
Emergence through distributed attention and uncertainty (4 shared)
Distributed computation across attention heads (4 shared)
Empirical gaps in performance-communication alignment (4 shared)
Functional tokens for emergent model reasoning (4 shared)
Distributed attention head decomposition (4 shared)
Causal masking phase transitions in transformers (2 shared)
VC communication-performance research gap (2 shared)
Functional tokens as visual operators (2 shared)
Contemplative prompting for LLMs (2 shared)
Metacognitive state inference and attention alignment (2 shared)
Contemplative path & sensation manipulation (1 shared)
LLM internal representation & self-knowledge (1 shared)
Causal masking & phase transitions (1 shared)

Claims (16)

Attention algorithms are usually distributed across attention heads.
  Claim supported by VPD's recovery of cross-head attention subcomponents, noted in a footnote.
Decoder-only transformer architectures are fundamentally limited in generating long, coherent sequences due to the lack of an ordered phase.
  Interpretation of Proposition 2 as a fundamental limitation on LLMs.
Each functional token is associated with an internalized visual operation, yet requires no visual supervision and remains a standard token in the tokenizer vocabulary.
  Describes the properties of the functional token.
Incorporating machine learning provides objective standards that help mitigate subjectivity in emergence identification.
  Authors argue ML optimizers act as objective observers.
Keeping the functional-token vocabulary compact minimizes perturbation to the base model's token distribution.
  ATLAS design philosophy: five functional tokens suffice to abstract common visual operations without excessive disruption.
LLM introspection on internal computations is architecturally permitted; whether models leverage this is an empirical question.
  Core claim directly challenged by prior work denying introspection; forms the foundation for the Koan Battery introspection studies.
Progress on the contemplative path is using these (vascular system motifs) less and needing them less.
  Mapping contemplative development to reduction in vasocomputation.
Q/K/V values function as information routing: Q queries past, K signals future attention, V carries selectively routed information.
  Janus's interpretive model for how attention mechanisms enable deliberate information flow and selective routing.
There is a significant gap in research directly examining the disconnect between venture capital communication sophistication and actual performance metrics.
  Identifies the key scholarly absence that motivates the exploration: studies exist on investor relations and VC performance separately, but not on their correlation.
There is a significant gap in research specifically examining how limited partners interpret and evaluate venture capital communication styles and their relationship to VC performance outcomes.
  The paper frames this gap as critical to understanding principal-agent dynamics in venture capital.
Token-level supervision enables models to learn functional-token invocation from reasoning context.
  ATLAS author's assertion that functional tokens optimized via standard cross-entropy loss learn when and how to invoke operations from surrounding text.
Contemplative mode is activation-manifold steering along a care-geodesic, not a system-prompt preset.
Indigenous contemplative frameworks ('what helps life thrive') contribute distinct wisdom beyond Eastern traditions.
Insight cascades and implicit learning require balance between directed attention and openness.
Model attention patterns can map to and reveal something about contemplative and flow states.
Not-knowing, silence, incompleteness, and non-defensiveness function as positive traits, not deficits.
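
The Q/K/V routing claim in the list above can be made concrete with a minimal sketch of single-head, causally masked attention in plain Python. This is only an illustration under stated assumptions, not the Janus notes' or any library's implementation: the function name causal_attention and the toy vectors are hypothetical. It shows the three roles the claim names: each query scores only the current and earlier positions, keys determine which positions get read, and values are the content that is mixed into the output.

```python
import math

def causal_attention(q, k, v):
    """Single-head causally masked attention over lists of vectors.

    q, k, v: lists of length T; each item is a list of floats.
    Returns (outputs, weights), where weights[i][j] is how much
    position i reads from position j (always 0.0 for j > i).
    """
    d = len(q[0])
    outputs, weights = [], []
    for i in range(len(q)):
        # Causal mask: position i only scores positions 0..i.
        scores = [sum(a * b for a, b in zip(q[i], k[j])) / math.sqrt(d)
                  for j in range(i + 1)]
        # Numerically stable softmax over the visible positions.
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        row = [e / total for e in exps] + [0.0] * (len(q) - i - 1)
        # The output is a weighted mix of value vectors (the routed content).
        out = [sum(row[j] * v[j][c] for j in range(len(v)))
               for c in range(len(v[0]))]
        outputs.append(out)
        weights.append(row)
    return outputs, weights

# Toy inputs (made-up numbers, for illustration only).
q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
outputs, weights = causal_attention(q, k, v)
```

In this sketch the first position can only attend to itself, so its output is exactly its own value vector; later positions blend the values of everything at or before them.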

Findings (9)

A 337-character contemplative system prompt lifts all 28 models by +2.62 points on a 10-point scale.
  Core empirical result: every model, every architecture, every alignment type responds to the contemplative prompt with measurable gain.
A pair of query and key subcomponents distributed across attention heads performs previous-token behavior.
  VPD recovers an attention algorithm for attending to the previous token, distributed across multiple heads.
A pair of query and key subcomponents distributed across attention heads performs syntax-boundary routing.
  VPD recovers an attention algorithm for routing across syntactic boundaries, distributed across heads.
Causally-masked attention in a decoder-only model has no ordered phase (Proposition 2).
  Application to transformer language models.
Contemplative prompt elevates self-observation task performance in language models.
  Supports Janus's claim that introspection is architecturally available; prompting determines whether/how capacity is leveraged.
Gradient Dilution Issue
  During RL training on ATLAS, sparse functional tokens (2.3% of sequences) receive diluted gradient signals from sequence-level advantages propagated across all tokens.
Identification of algorithms implemented in attention layers, distributed across attention heads.
  VPD successfully recovered interpretable attention algorithms (previous-token behavior, syntax-boundary routing) in weight space without requiring manual decomposition across heads.
Information paths from A to B can exceed C(m+n, n) distinct routes, where m=position displacement and n=layer displacement.
  Quantifies extreme redundancy in transformer routing; supports claim that introspection and interference patterns are architecturally permitted.
Mind-wandering emerges as a precision inference gap: true attentional state ≠ believed attentional state; increased meta-awareness reduces gap duration.
  Key simulation result; bridges phenomenology (meditation experience) and formal dynamics (precision mismatch).
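
To give a feel for the C(m+n, n) figure in the information-path finding above, here is a small sketch that counts routes on a grid of positions and layers. It rests on an assumption not stated in the source: a route is modeled as a sequence of single steps, each moving either one position or one layer. Under that simplified model the count equals the binomial coefficient exactly; the finding says the real number of paths can exceed it, which this sketch does not attempt to reproduce. The function names are hypothetical.

```python
from math import comb

def count_monotone_paths(m, n):
    """Count routes made of m single-position steps and n single-layer steps.

    Uses dynamic programming: the number of ways to reach cell (i, j)
    is the sum of the ways to reach its two predecessors.
    """
    grid = [[1] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            grid[i][j] = grid[i - 1][j] + grid[i][j - 1]
    return grid[m][n]

def binomial_baseline(m, n):
    """Closed form C(m+n, n) quoted in the finding."""
    return comb(m + n, n)

# The two counts agree under the single-step assumption.
small = count_monotone_paths(3, 2)      # 3 position steps, 2 layer steps
larger = count_monotone_paths(10, 12)   # illustrative sizes, not from the source
```

Even for modest displacements the baseline grows quickly, which is the sense in which the finding describes routing as highly redundant.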