Bigram Statistics

Next-token probabilities conditioned only on the current token. These are the statistics a zero-layer transformer can at best approximate, and the direct path W_U W_E contributes to them in every transformer.
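As a rough illustration of the definition above, empirical bigram statistics can be estimated by counting adjacent token pairs and normalizing per current token. The function name `bigram_probabilities` and the toy corpus are assumptions made for this sketch, not part of the source.

```python
from collections import Counter, defaultdict


def bigram_probabilities(tokens):
    """Estimate P(next | current) from counts of adjacent token pairs."""
    pair_counts = defaultdict(Counter)
    for current, nxt in zip(tokens, tokens[1:]):
        pair_counts[current][nxt] += 1

    probs = {}
    for current, counter in pair_counts.items():
        total = sum(counter.values())
        # Normalize the counts for this current token into a distribution.
        probs[current] = {nxt: count / total for nxt, count in counter.items()}
    return probs


# Toy corpus (hypothetical), whitespace-tokenized.
corpus = "the cat sat on the mat the cat ran".split()
table = bigram_probabilities(corpus)
```

Here `table["the"]` gives the estimated distribution over tokens that follow "the"; a token that never appears in the current position (such as the final "ran") has no row.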

Neighborhood — ranked by edge-count

Concepts (2)


Skip-Trigram · extends
A three-token pattern of the form [source]...[destination][out] that one-layer attention heads implement; the paper's key characterization of one-layer transformer behavior
Zero-Layer Transformer · implements
A transformer with no attention layers; shown to model bigram statistics via T = W_U W_E
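The relation T = W_U W_E can be sketched numerically: column t of T holds the logits for the token that follows token t, and a softmax turns them into a bigram-style distribution. This is a minimal sketch that assumes one-hot column-vector inputs and ignores biases and normalization layers; the matrix values and helper names (`matmul`, `softmax`, `next_token_distribution`) are made up for illustration.

```python
import math


def matmul(a, b):
    """Plain nested-list matrix product: (n x k) times (k x m)."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def softmax(xs):
    """Numerically stable softmax over a list of logits."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical toy weights: vocabulary of 3 tokens, d_model = 2.
W_E = [[1.0, 0.0, -1.0],
       [0.0, 1.0, 0.5]]      # embedding, shape d_model x vocab
W_U = [[0.0, 2.0],
       [1.0, 0.0],
       [-1.0, 1.0]]          # unembedding, shape vocab x d_model

# Direct path: T[out][inp] is the logit of `out` given current token `inp`.
T = matmul(W_U, W_E)


def next_token_distribution(token_id):
    """Softmax over the column of T selected by the current token."""
    logits = [T[out][token_id] for out in range(len(T))]
    return softmax(logits)
```

Because T depends only on the current token, the resulting distribution cannot use any earlier context, which is why a model of this form can do no better than bigram statistics.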

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Engram · concept · 0.700
Physical/informational substrate of memory; reframed not as static encoded detail but as prompt requiring creative interpretation by receiving system.
Wolfram Research · institute · 0.696
Stephen Wolfram's organization
LaBraM · framework · 0.688
EEG transformer foundation model for brain activity analysis, one of the three architectures studied.
BIG-bench · framework · 0.684
Large-scale collaborative benchmark for LLM capabilities, cited.
Probabilistic Bisection Algorithm · method · 0.681
Algorithm used to calibrate per-latent threshold boost values for consistent first-attempt difficulty
Sufficient Statistics · concept · 0.671
Parameters of the approximate posterior, such as Dirichlet counts for model parameters.
Nonparametric Bayesian Methods · framework · 0.660
Methods for bottom-up model space construction; contrasted with top-down BMR approach of this paper
Big Two Model · framework · 0.660
Meta-trait model grouping OCEAN traits into stability (C, A, reversed N) and plasticity (E, O); used to evaluate covariance patterns from injections