claim

active

claim:the-direct-path-w-u-w-e-in-larger-transformers-represents-bigram-statistics-not-captured-by-more-general-grammatical-rules

The direct path W_U W_E in larger transformers represents bigram statistics not captured by more general grammatical rules

Interpretation of the role of the direct path in multi-layer transformers: it encodes residual bigram statistics, e.g. that 'Barack' is often followed by 'Obama'.
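As a rough illustration of the claim above, the direct path can be read as a token-to-token lookup table: the product W_U W_E maps the current token straight to logits for the next token, with no attention or MLP involved. The sketch below uses a hypothetical four-token vocabulary and hand-picked toy weights (none of these values come from the source paper); it follows the convention that W_E has shape d_model x n_vocab and W_U has shape n_vocab x d_model.

```python
# Toy illustration of the direct path W_U W_E as a bigram table.
# The vocabulary and all weights are hand-picked assumptions, not learned values.
vocab = ["Barack", "Obama", "the", "cat"]

# W_E: d_model x n_vocab (column j is the embedding of token j); d_model = 2.
W_E = [
    [1.0, 0.0, 0.0, 0.0],  # dimension 0 fires on "Barack"
    [0.0, 0.0, 1.0, 0.0],  # dimension 1 fires on "the"
]

# W_U: n_vocab x d_model (row i reads out the logit of token i).
W_U = [
    [0.0, 0.0],  # Barack
    [3.0, 0.0],  # Obama: boosted by dimension 0
    [0.0, 0.0],  # the
    [0.0, 2.0],  # cat: boosted by dimension 1
]


def matmul(a, b):
    """Plain-Python matrix product of a (m x k) and b (k x n)."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


# direct_path[next][current] is the logit the direct path adds for `next`
# when the current token is `current`.
direct_path = matmul(W_U, W_E)


def top_next_token(current):
    """Return the token the direct path alone favors after `current`."""
    col = vocab.index(current)
    logits = [direct_path[row][col] for row in range(len(vocab))]
    best = max(range(len(vocab)), key=lambda i: logits[i])
    return vocab[best]


print(top_next_token("Barack"))  # Obama
print(top_next_token("the"))     # cat
```

In a multi-layer model this table is only one additive term in the logits. On the claim's reading, it would only need to carry the bigram regularities that attention heads and other paths do not already capture.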

Source paper

extracted_from

A Mathematical Framework for Transformer Circuits

(2021)

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Different introspective tasks may preferentially use different path distributions in the transformer. · claim · 0.777
Interpretive claim connecting exponential path combinatorics to Lindsey's layer-dependent findings.
Transformer can be viewed as a Wolfram causal graph with foliations specifying computation order. · claim · 0.770
Janus's interpretive framing of transformers as causal graphs.
TEM's path-integration representation g plays the role of position encodings in transformers · claim · 0.764
Key structural correspondence claim linking the neuroscience model's spatial representation to ML concept of position encoding.
Transformers develop self-models through in-context learning, not just training data; even old base models without LLM-related text can bootstrap self-referential reasoning at runtime. · claim · 0.761
Antra's foundational claim about how introspection arises computationally rather than from memorised text.
Transformers are recurrent through autoregression because K/V stream provides horizontal information flow across positions. · claim · 0.759
Claim formalizing the Anima Labs idea that transformers are effectively recurrent due to K/V stream.
Transformers learn in-context by gradient descent, functioning as mesa-optimizers that learn internal models in real time · finding · 0.757
Evidence that in-context learning is not mere pattern matching but genuine optimization, relevant to applying the thesis to inference.
The last layer of the transformer has the largest projection magnitude on the reflection direction, likely because it directly controls generation of reflection keywords · claim · 0.749
Interpretive claim from attention head attribution analysis in appendix.
Transformers use an anti-Markovian solution that recomputes relevant numeric information at each step in the Multi-Object task · claim · 0.745
Prior finding from Grant et al. 2025 used to interpret low MAS IIA for GRU-Transformer hidden state comparisons.