Zero-Layer Transformer

A transformer with no attention layers: each token is embedded and then immediately unembedded, so the next-token logits are T = W_U W_E. Shown to model bigram statistics.
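A minimal NumPy sketch of the T = W_U W_E path, offered as an illustration rather than as any particular implementation. The shapes (W_E as d_model x n_vocab, W_U as n_vocab x d_model), the random weights, and the helper names `softmax` and `zero_layer_logits` are assumptions made for this example; positional embeddings are ignored.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def zero_layer_logits(tokens, W_E, W_U):
    """Next-token logits at each position of `tokens`.

    Assumed shapes: W_E is (d_model, n_vocab), W_U is (n_vocab, d_model).
    """
    T = W_U @ W_E          # (n_vocab, n_vocab); column t = logits given current token t
    return T[:, tokens].T  # (seq_len, n_vocab)

# Hypothetical toy sizes and random weights, for illustration only.
rng = np.random.default_rng(0)
n_vocab, d_model = 5, 3
W_E = rng.normal(size=(d_model, n_vocab))
W_U = rng.normal(size=(n_vocab, d_model))

tokens = np.array([2, 0, 2, 4])
logits = zero_layer_logits(tokens, W_E, W_U)
probs = softmax(logits)

# Positions 0 and 2 both hold token 2, so they get identical predictions:
# with no attention, nothing but the current token can influence the output.
same_prediction = np.allclose(logits[0], logits[2])
```

Because the prediction at a position depends only on the token at that position, the most such a model can learn is a table of next-token scores per current token, which is exactly a bigram model.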

Neighborhood — ranked by edge-count

concept

Bigram Statistics
implements
Next-token probabilities conditioned only on the current token. Zero-layer transformers optimally approximate these statistics, and the direct path W_U W_E contributes to them in every transformer.
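To make "optimally approximate" concrete, the sketch below estimates bigram probabilities from a toy token sequence and then builds one choice of W_E and W_U whose product reproduces them exactly. The toy sequence, the add-one smoothing (used only so that every log is finite), the choice d_model = n_vocab, and the helper name `bigram_probs` are all assumptions for this example. With d_model smaller than n_vocab, T is low-rank and the match would generally be only approximate.

```python
import numpy as np

def bigram_probs(tokens, n_vocab, alpha=1.0):
    """Empirical P(next | current) with add-alpha smoothing (an assumption here)."""
    counts = np.full((n_vocab, n_vocab), alpha)  # rows: current token, cols: next token
    for cur, nxt in zip(tokens[:-1], tokens[1:]):
        counts[cur, nxt] += 1
    return counts / counts.sum(axis=1, keepdims=True)

# Hypothetical toy data.
tokens = [0, 1, 2, 0, 1, 2, 0, 2, 1]
n_vocab = 3
P = bigram_probs(tokens, n_vocab)

# One exact solution when d_model == n_vocab:
# one-hot embeddings, and unembedding rows that store log-probabilities.
W_E = np.eye(n_vocab)   # (d_model, n_vocab)
W_U = np.log(P).T       # (n_vocab, d_model); W_U[next, cur] = log P[cur, next]
T = W_U @ W_E           # T[next, cur] = logit of `next` given `cur`

# Softmax over the next-token axis recovers the bigram table.
recovered = np.exp(T) / np.exp(T).sum(axis=0, keepdims=True)
matches = np.allclose(recovered.T, P)
```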

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

One-Layer Attention-Only Transformer · concept · 0.767
The first toy model analyzed; shown to implement an ensemble of bigram and skip-trigram models readable directly from weights
Signal Transformer · concept · 0.752
Core abstraction in Fruit: pure function mapping signals to signals; enables compositional GUI definitions.
Two-Layer Attention-Only Transformer · concept · 0.745
The primary model analyzed; uses attention head composition, especially K-composition, to create induction heads for powerful in-context learning
Shallow Transformer (RoPE-based) · framework · 0.739
Two-layer transformer with rotary positional encodings used in numeric task experiments.
transformer architecture · framework · 0.737
Neural network architecture based on attention, commonly used in large language models
TEM-Transformer (TEM-t) · framework · 0.730
The transformer version directly analogous to TEM, introduced in this paper, offering dramatic performance improvements.
Cortex as a Transformer · hypothesis · 0.728
Hypothesis that neocortical circuits beyond hippocampus may implement transformer-like computations for language and other domains.
Reconstructed Transformer NLL · concept · 0.728
Metric measuring fraction of MLP loss contribution explained by the autoencoder by replacing MLP activations with autoencoder outputs