claim
active
claim:zero-layer-transformers-optimally-approximate-bigram-log-likelihood-through-w-u-w-e
Zero-layer transformers optimally approximate bigram log-likelihood through W_U W_E
First result in the hierarchy: the simplest possible transformer does nothing more than learn which tokens follow which
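A minimal sketch of what the claim says, assuming a toy corpus and NumPy. All names here (`bigram_log_probs`, `zero_layer_logits`, `low_rank_factors`, the toy token list) are hypothetical and chosen only for illustration. The first part shows that when d_model equals the vocabulary size, the product W_U W_E can store the bigram log-likelihood table exactly. The second part uses a truncated SVD as a stand-in for the smaller-d_model case; this gives the best least-squares rank-d approximation of the table, which is not necessarily the cross-entropy optimum that training would reach.

```python
import numpy as np

def bigram_log_probs(tokens, vocab_size, smoothing=1.0):
    """Return log P(next | prev) as a table indexed [next, prev].

    Additive smoothing (an assumption of this sketch) keeps every
    entry finite so the logarithm is defined.
    """
    counts = np.full((vocab_size, vocab_size), smoothing)
    for prev, nxt in zip(tokens[:-1], tokens[1:]):
        counts[nxt, prev] += 1.0
    probs = counts / counts.sum(axis=0, keepdims=True)  # normalize each column
    return np.log(probs)

def zero_layer_logits(W_U, W_E, token):
    """Zero-layer transformer: embed the token, then unembed it directly."""
    return W_U @ W_E[:, token]

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def low_rank_factors(log_p, d_model):
    """Rank-d_model factors (W_U, W_E) from a truncated SVD of the table."""
    U, S, Vt = np.linalg.svd(log_p, full_matrices=False)
    W_U = U[:, :d_model] * S[:d_model]   # shape (vocab, d_model)
    W_E = Vt[:d_model, :]                # shape (d_model, vocab)
    return W_U, W_E

# Toy corpus over a 4-token vocabulary (illustrative data only).
tokens = [0, 1, 2, 0, 1, 3, 0, 1, 2]
vocab_size = 4
log_p = bigram_log_probs(tokens, vocab_size)

# Full-rank case: one-hot embeddings, unembedding holds the table itself.
W_E = np.eye(vocab_size)
W_U = log_p
pred = softmax(zero_layer_logits(W_U, W_E, token=0))

# Low-rank case: d_model < vocab_size gives only an approximation.
W_U_low, W_E_low = low_rank_factors(log_p, d_model=2)
approx_error = np.linalg.norm(W_U_low @ W_E_low - log_p)
```

In the full-rank case `pred` equals the smoothed bigram distribution for the token that follows token 0, because the softmax of a column of log-probabilities returns the probabilities themselves. In the low-rank case `approx_error` measures how much of the table is lost when d_model is smaller than the vocabulary.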
Neighborhood — ranked by edge-count
Findings (1)
finding
- Supporting evidence for the claim that most residual stream dimensions are free for other layers to use
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- Core claim for one-layer models; the skip-trigram tables can be accessed without running the model
- Interpretation of the role of the direct path in multi-layer transformers; e.g. encoding that 'Barack' is often followed by 'Obama'
- Argues against the single-layer analysis approach of prior work
- Interpretive claim from attention head attribution analysis in appendix
- A transformer with no attention layers; shown to model bigram statistics via T = W_U W_E
- Evidence that in-context learning is not mere pattern matching but genuine optimization, relevant to applying the thesis to inference
- Statistically rigorous analysis of Claude introspection; suggests models may have latent introspective capabilities that can be enhanced or disrupted.
- Supports input-injectivity assumption for transformers at initialisation