concept
active
concept:zero-layer-transformer
Zero-Layer Transformer
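A transformer with no attention layers: the token embedding W_E feeds directly into the unembedding W_U, so the logits for the next token are T = W_U W_E applied to the one-hot current token. Because the prediction depends only on the current token, the model can at best approximate bigram statistics.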
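The product T = W_U W_E can be illustrated with a minimal sketch. The weights, the vocabulary size of 3, and the helper names (`matmul`, `softmax`, `forward`, `next_token_probs`) are all hypothetical and chosen only for illustration; the shape convention assumed here is W_E as (d_model, n_vocab) and W_U as (n_vocab, d_model), so column i of T holds the next-token logits given current token i.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def softmax(xs):
    """Numerically stable softmax over a list of logits."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical toy weights (assumption): 3 tokens, d_model = 2.
W_E = [[1.0, 0.0, -1.0],
       [0.0, 1.0, 1.0]]    # embedding, shape (d_model, n_vocab)
W_U = [[2.0, 0.0],
       [0.0, 2.0],
       [-1.0, 1.0]]        # unembedding, shape (n_vocab, d_model)

N_VOCAB = len(W_U)

# The whole model collapses to one (n_vocab, n_vocab) matrix.
T = matmul(W_U, W_E)

def forward(current_token):
    """Run the zero-layer model step by step: embed, then unembed."""
    one_hot = [[1.0 if i == current_token else 0.0] for i in range(N_VOCAB)]
    residual = matmul(W_E, one_hot)     # (d_model, 1)
    logits = matmul(W_U, residual)      # (n_vocab, 1)
    return [row[0] for row in logits]

def next_token_probs(current_token):
    """Read the logits directly from column `current_token` of T."""
    logits = [row[current_token] for row in T]
    return softmax(logits)
```

In this sketch `forward(i)` and column `i` of `T` agree for every token, which is the sense in which the model is a lookup table from the current token to next-token logits, that is, a bigram model.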
Neighborhood — ranked by edge-count
Concepts (1)
concept
- Bigram Statistics (implements): Next-token probabilities conditioned only on the present token; what zero-layer transformers optimally approximate, and what the direct path W_U W_E contributes to in all transformers
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- The first toy model analyzed; shown to implement an ensemble of bigram and skip-trigram models readable directly from weights
- Core abstraction in Fruit: pure function mapping signals to signals; enables compositional GUI definitions.
- The primary model analyzed; uses attention head composition, especially K-composition, to create induction heads for powerful in-context learning
- Two-layer transformer with rotary positional encodings used in numeric task experiments.
- Neural network architecture based on attention, commonly used in large language models
- The transformer version directly analogous to TEM, introduced in this paper, offering dramatic performance improvements.
- Hypothesis that neocortical circuits beyond hippocampus may implement transformer-like computations for language and other domains.
- Metric measuring fraction of MLP loss contribution explained by the autoencoder by replacing MLP activations with autoencoder outputs