Skip-Trigram Bugs

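Model failures in which a one-layer attention head, in order to raise the probability of intended skip-trigrams, must also raise the probability of unintended token combinations, because it factors the three-way interaction between source, destination, and output tokens into separate pairwise terms. For example, a head that boosts both "keep ... in mind" and "keep ... at bay" must also boost "keep ... in bay" and "keep ... at mind".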
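The factoring argument can be made concrete with a toy calculation. The sketch below is a simplification, not the paper's formulation: it assumes that, for a fixed source token, a head's contribution to each (destination, output) pair is the product of an attention weight and an output boost. The function name `head_effect` and all numbers are hypothetical, chosen only for illustration.

```python
# Toy illustration (all numbers are made up, not taken from the paper).
# Assumption: for a fixed source token, one attention head contributes
#   effect[dest][out] = attention[dest] * ov[out]
# where attention[dest] is how strongly `dest` attends to the source
# and ov[out] is how much attending to the source raises `out`.

def head_effect(attention, ov):
    """Return the per-(destination, output) contribution of one head."""
    return {
        dest: {out: a * v for out, v in ov.items()}
        for dest, a in attention.items()
    }

# Source token: "keep". Intended skip-trigrams: keep...in mind, keep...at bay.
attention = {"in": 1.0, "at": 1.0}  # both destinations attend to "keep"
ov = {"mind": 2.0, "bay": 2.0}      # attending to "keep" boosts both outputs

effect = head_effect(attention, ov)

intended = [("in", "mind"), ("at", "bay")]
unintended = [("in", "bay"), ("at", "mind")]

for dest, out in intended + unintended:
    label = "intended" if (dest, out) in intended else "bug"
    print(f"keep ... {dest} {out}: {effect[dest][out]:+.1f} ({label})")

# The table is an outer product, so this 2x2 determinant is always zero:
# the product of the unintended cells equals the product of the intended ones.
det = (effect["in"]["mind"] * effect["at"]["bay"]
       - effect["in"]["bay"] * effect["at"]["mind"])
```

Because the table is an outer product, no choice of weights under this assumption makes both intended cells large while keeping both unintended cells small.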

Neighborhood — ranked by edge-count

Papers (1)

paper

A Mathematical Framework for Transformer Circuits
introduces

Claims (1)

claim

Skip-trigram bugs in one-layer models demonstrate that interpretability can reveal and characterize specific model failure modes
supports
Early example of using mechanistic interpretability to understand unintended model behavior

Concepts (1)

concept

Skip-Trigram
related_to
A three-token pattern of the form [source]...[destination][out] that one-layer attention heads implement; the paper's key characterization of one-layer transformer behavior
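As a rough illustration of the [source]...[destination][out] pattern, the sketch below lists every such triple in a short token sequence. The helper name `skip_trigrams` is hypothetical, and the choice to consider only strictly earlier source tokens is an assumption made for simplicity.

```python
def skip_trigrams(tokens):
    """List (source, destination, out) triples found in a token sequence.

    Assumption: the source is any token strictly before the destination,
    and `out` is the token immediately after the destination.
    """
    triples = []
    for i in range(1, len(tokens) - 1):
        dest, out = tokens[i], tokens[i + 1]
        for j in range(i):
            triples.append((tokens[j], dest, out))
    return triples

tokens = ["keep", "it", "in", "mind"]
triples = skip_trigrams(tokens)
for source, dest, out in triples:
    print(f"{source} ... {dest} {out}")
```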

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Trigram Features · concept · 0.740
Features implementing specific three-token sequence predictions (e.g., predicting '19' after 'COVID-')
causal bypassing · concept · 0.680
Confound where naming injected concepts reflects direct logit effects rather than metacognitive awareness, raised by Morris & Plunkett
Gradient Descent · method · 0.668
Used for updating hidden state expectations; provides dynamical process theory testable against neuronal data
Position Encodings · concept · 0.667
Mechanism for encoding sequence order in transformers; paper argues these should reflect learned structural representations rather than fixed sines/cosines.
Trajectory Filtering · method · 0.666
Strategic filtering procedure that removes invalid trajectories and maintains optimal positive-to-negative trajectory ratio to stabilize training.
Error minimization · concept · 0.664
The progressive reduction of error (stress) as cells move toward their target positions.
Message Passing · framework · 0.663
Traditional parallel programming model requiring explicit point-to-point communication; Linda generalizes this via tuple spaces.
tilings · concept · 0.657
Edge-to-edge coverings of a surface with no overlaps or gaps