claim

active

claim:zero-layer-transformers-optimally-approximate-bigram-log-likelihood-through-w-u-w-e

Zero-layer transformers optimally approximate bigram log-likelihood through W_U W_E

First result in the hierarchy: the simplest possible transformer does nothing more than learn which tokens follow which
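As a rough illustration of the claim, the sketch below builds a smoothed bigram table from a toy token sequence, factors its log-probabilities into two matrices named W_U and W_E, and checks that the softmax over one column of W_U W_E reproduces the bigram distribution. The toy corpus, the add-one smoothing, the helper names, and the use of an SVD in place of gradient training are all assumptions made for illustration; they are not taken from the source paper. A truncated SVD is optimal only in the least-squares sense, so it stands in for, and is not the same as, the cross-entropy optimum a trained model would reach.

```python
import numpy as np

def bigram_log_probs(tokens, vocab_size, alpha=1.0):
    """Return log P(next | prev) as a matrix indexed [next, prev].

    alpha is add-alpha smoothing (an assumption) so no log(0) occurs.
    """
    counts = np.full((vocab_size, vocab_size), alpha, dtype=float)
    for prev, nxt in zip(tokens[:-1], tokens[1:]):
        counts[nxt, prev] += 1.0
    probs = counts / counts.sum(axis=0, keepdims=True)  # each column sums to 1
    return np.log(probs)

def factor_direct_path(log_probs, d_model):
    """Split a [vocab, vocab] logit table into W_U [vocab, d_model] and
    W_E [d_model, vocab] using a rank-d_model SVD (illustrative stand-in
    for training)."""
    u, s, vt = np.linalg.svd(log_probs, full_matrices=False)
    W_U = u[:, :d_model] * s[:d_model]
    W_E = vt[:d_model, :]
    return W_U, W_E

def zero_layer_logits(W_U, W_E, token):
    """Zero-layer forward pass: embed the token, then unembed it."""
    return W_U @ W_E[:, token]

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

# Hypothetical toy corpus over a 4-token vocabulary.
tokens = [0, 1, 2, 0, 1, 3, 0, 1, 2, 0, 2, 3]
vocab_size = 4

target = bigram_log_probs(tokens, vocab_size)
W_U, W_E = factor_direct_path(target, d_model=vocab_size)

# Predicted next-token distribution after token 0.
pred = softmax(zero_layer_logits(W_U, W_E, token=0))
print(pred)
```

With d_model equal to the vocabulary size the factorization is exact, so the prediction matches the smoothed bigram table. With a smaller d_model the product W_U W_E can only approximate the table, which is the regime the claim describes.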

Source paper

extracted_from

A Mathematical Framework for Transformer Circuits (2021)

Neighborhood — ranked by edge-count

Findings (1)

finding

PCA analysis shows token embeddings and unembeddings are concentrated in a relatively small fraction of residual stream dimensions in large models
supports
Supporting evidence for the claim that most residual stream dimensions are free for other layers to use

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

One-layer attention-only transformers are an ensemble of bigram and skip-trigram models whose parameters can be read directly from weights · claim · 0.735
Core claim for one-layer models; the skip-trigram tables can be accessed without running the model
The direct path W_U W_E in larger transformers represents bigram statistics not captured by more general grammatical rules · claim · 0.733
Interpretation of the role of the direct path in multi-layer transformers; e.g. encoding that 'Barack' is often followed by 'Obama'
No single layer is universally optimal for probing truth directions; different tasks peak at different layers · claim · 0.733
Argues against the single-layer analysis approach of prior work.
The last layer of the transformer has the largest projection magnitude on the reflection direction, likely because it directly controls generation of reflection keywords · claim · 0.731
Interpretive claim from attention head attribution analysis in appendix
Zero-Layer Transformer · concept · 0.726
A transformer with no attention layers; shown to model bigram statistics via T = W_U W_E
Transformers learn in-context by gradient descent, functioning as mesa-optimizers that learn internal models in real time · finding · 0.725
Evidence that in-context learning is not mere pattern matching but genuine optimization, relevant to applying the thesis to inference
Sauers' statistical anomaly: when models are given Janus post explaining transformers, reconstruction accuracy tails extend both ways, with ~1/1000 reconstructions anomalously accurate · finding · 0.723
Statistically rigorous analysis of Claude introspection; suggests models may have latent introspective capabilities that can be enhanced or disrupted.
Theorem 2: Transformers with randomly independently initialized continuous distribution weights are almost surely injective at initialisation up to each layer · finding · 0.720
Supports input-injectivity assumption for transformers at initialisation