claim
active
claim:skip-trigram-bugs-in-one-layer-models-demonstrate-interpretability-can-reveal-and-characterize-specific-model-failure-modes
Skip-trigram bugs in one-layer models demonstrate interpretability can reveal and characterize specific model failure modes
Early example of using mechanistic interpretability to understand unintended model behavior
Neighborhood — ranked by edge-count
Papers (1)
paper
Concepts (1)
concept
- Skip-Trigram Bugs · supports · Model failures in which a one-layer attention head, because it represents the three-way (source, destination, output) interaction in factored form, must also increase the probability of unintended token combinations
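The factoring behind this concept can be illustrated with a minimal sketch. It assumes the head's effect on a skip-trigram [source ... destination -> output] is the product of an attention term that depends only on (source, destination) and an output term that depends only on (source, output). The tokens, the numeric values, and the names `attention`, `ov_logit`, `skip_trigram_effect`, and `find_bugs` are hypothetical and chosen only for illustration; softmax and all other model components are ignored.

```python
# Toy values chosen only for illustration; they are not from any real model.
# QK side: attention weight from a destination token back to a source token.
attention = {("keep", "in"): 0.5, ("keep", "at"): 0.5}
# OV side: how much attending to the source token raises each output logit.
ov_logit = {("keep", "mind"): 2.0, ("keep", "bay"): 1.0}

# Skip-trigrams the head is meant to capture: (source, destination, output).
intended = {("keep", "in", "mind"), ("keep", "at", "bay")}


def skip_trigram_effect(src, dst, out):
    """Head's logit contribution for [src ... dst -> out] in factored form."""
    return attention[(src, dst)] * ov_logit[(src, out)]


def find_bugs():
    """Return unintended skip-trigrams whose logit the head still raises."""
    bugs = {}
    for (src, dst) in attention:
        for (ov_src, out) in ov_logit:
            if ov_src != src:
                continue
            effect = skip_trigram_effect(src, dst, out)
            if effect > 0 and (src, dst, out) not in intended:
                bugs[(src, dst, out)] = effect
    return bugs


bugs = find_bugs()
for trigram, effect in sorted(bugs.items()):
    print(trigram, effect)
```

Because the two factors share only the source token, raising both intended combinations necessarily raises the two crossed ones as well, which is the failure mode the concept describes.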
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- How much performance (in points of loss) do skip-trigram bugs cost the model, and do they persist in larger models? · question · 0.836 · Open question raised by the paper's identification of skip-trigram bugs as interpretability-visible failure modes
- Concrete example from examining expanded QK/OV matrices showing how specific programming language structure is encoded in attention weights
- Acknowledges the confound of not explicitly instructing models to track wealth, yet points to reasoning gaps given code agents avoid errors without prompts.
- noted as a possible confound
- Core claim for one-layer models; the skip-trigram tables can be accessed without running the model
- Observed by Anima Labs in untrained base models; not present in training data, implying computational origin of self-reported parallel processing.
- Methodological critique of prior work that fixed a single layer for truth probing.
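Two of the entries above mention expanded QK/OV matrices and skip-trigram tables that can be read without running the model. The sketch below shows one way such tables could be computed from weights alone. It assumes a one-layer attention-only head with weight shapes W_E [d_model, vocab], W_U [vocab, d_model], W_Q, W_K, W_V [d_head, d_model], and W_O [d_model, d_head], and it ignores positional embeddings, layer normalization, and the softmax. All weight values and the helper names `matmul` and `transpose` are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def transpose(m):
    """Transpose a matrix given as a list of rows."""
    return [list(row) for row in zip(*m)]


# Toy weights (vocab = 3, d_model = 2, d_head = 1); illustrative values only.
W_E = [[1, 0, 1],
       [0, 1, 1]]          # [d_model, vocab]
W_U = [[1, 0],
       [0, 1],
       [1, 1]]             # [vocab, d_model]
W_Q = [[1, 2]]             # [d_head, d_model]
W_K = [[3, 1]]             # [d_head, d_model]
W_V = [[1, 1]]             # [d_head, d_model]
W_O = [[1],
       [0]]                # [d_model, d_head]

# Expanded QK table: rows are destination tokens, columns are source tokens.
qk = matmul(matmul(transpose(W_E), transpose(W_Q)), matmul(W_K, W_E))

# Expanded OV table: rows are output tokens, columns are source tokens.
ov = matmul(W_U, matmul(W_O, matmul(W_V, W_E)))

print("QK (destination x source):", qk)
print("OV (output x source):", ov)
```

No input text is passed through the model at any point; both tables are products of weight matrices, so large entries can be inspected directly to look for skip-trigrams of interest.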