claim

active

claim:skip-trigram-bugs-in-one-layer-models-demonstrate-interpretability-can-reveal-and-characterize-specific-model-failure-modes

Skip-trigram bugs in one-layer models demonstrate interpretability can reveal and characterize specific model failure modes

Early example of using mechanistic interpretability to understand unintended model behavior

Source paper

extracted_from

A Mathematical Framework for Transformer Circuits

(2021)

Neighborhood — ranked by edge-count

Papers (1)

paper

A Mathematical Framework for Transformer Circuits
introduces

Concepts (1)

concept

Skip-Trigram Bugs
supports
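Model failures in which a one-layer attention head cannot raise the probability of intended skip-trigrams without also raising the probability of unintended token combinations, because it factors the three-way interaction among source, destination, and output tokens into separate QK and OV circuits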
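
The factoring behind this failure mode can be illustrated with a toy sketch. The numbers, the `qk` and `ov` dictionaries, and the `skip_trigram_effect` helper below are all hypothetical, and the product of a single attention weight with a single OV entry is a deliberate simplification of a real head (no softmax over other positions, no other heads). The "keep ... in mind" / "keep ... at bay" tokens follow the kind of example discussed in the source paper.

```python
# Toy illustration with made-up numbers, not taken from any trained model.
# qk[(destination, source)]: how strongly the destination token attends to the source token.
# ov[(source, output)]: how much attending to the source raises the output token's logit.
qk = {("in", "keep"): 1.0, ("at", "keep"): 1.0}
ov = {("keep", "mind"): 2.0, ("keep", "bay"): 2.0}


def skip_trigram_effect(source, destination, output):
    """Simplified logit contribution for: source ... destination -> output."""
    return qk.get((destination, source), 0.0) * ov.get((source, output), 0.0)


# Skip-trigrams the head is "trying" to represent.
intended = [("keep", "in", "mind"), ("keep", "at", "bay")]
# Combinations that come along for free because QK and OV are separate factors.
unintended = [("keep", "in", "bay"), ("keep", "at", "mind")]

for source, destination, output in intended + unintended:
    effect = skip_trigram_effect(source, destination, output)
    print(f"{source} ... {destination} -> {output}: {effect:.1f}")
```

In this sketch all four combinations receive the same boost of 2.0: there is no entry in either table that depends on source, destination, and output together, so the head has no way to favor the intended pairs over the unintended ones.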

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

How much performance (in points of loss) do skip-trigram bugs cost the model, and do they persist in larger models? · question · 0.836
Open question raised by the paper's identification of skip-trigram bugs as interpretability-visible failure modes
One-layer model attention heads encode Python-specific skip-trigrams including indentation-based elif/else prediction and function signature patterns · finding · 0.800
Concrete example from examining expanded QK/OV matrices showing how specific programming language structure is encoded in attention weights
Some failures may reflect prompt design rather than model limitations, but the underlying issue is one of reasoning rather than instruction-following. · claim · 0.763
Acknowledges the confound of not explicitly instructing models to track wealth, yet points to reasoning gaps given code agents avoid errors without prompts.
Some failures may reflect prompt design rather than model limitations, though code agents avoid errors without prompts · claim · 0.762
noted as a possible confound
Interpretability features converge across different model architectures, revealing structural similarities. · claim · 0.751
One-layer attention-only transformers are an ensemble of bigram and skip-trigram models whose parameters can be read directly from weights · claim · 0.750
Core claim for one-layer models; the skip-trigram tables can be accessed without running the model
Base models spontaneously talk about experiencing multiple parallel processing paths · finding · 0.746
Observed by Anima Labs in untrained base models; not present in training data, implying computational origin of self-reported parallel processing.
Single-layer analyses can be misleading because early-layer truth directions may reflect surface features with limited cross-task generalization. · claim · 0.744
Methodological critique of prior work that fixed a single layer for truth probing.
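
For the related claim above that a one-layer attention-only transformer's skip-trigram tables can be read directly from its weights, a minimal sketch follows. It assumes the source paper's expanded circuits, W_E^T W_Q^T W_K W_E for QK and W_U W_O W_V W_E for OV, and it ignores positional terms, biases, and layer normalization. The weights are random placeholders, and the sizes and helper names (`matmul`, `transpose`, `rand_matrix`) are hypothetical.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def transpose(m):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*m)]


def rand_matrix(rows, cols, rng):
    """Random placeholder weights; a real analysis would load trained weights."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
n_vocab, d_model, d_head = 5, 4, 2  # toy sizes chosen for illustration

W_E = rand_matrix(d_model, n_vocab, rng)  # token embedding
W_U = rand_matrix(n_vocab, d_model, rng)  # unembedding
W_Q = rand_matrix(d_head, d_model, rng)   # query projection
W_K = rand_matrix(d_head, d_model, rng)   # key projection
W_V = rand_matrix(d_head, d_model, rng)   # value projection
W_O = rand_matrix(d_model, d_head, rng)   # output projection

# Expanded QK circuit: entry [destination][source] is the attention score
# between a destination token and a source token.
qk_circuit = matmul(matmul(transpose(W_E), transpose(W_Q)), matmul(W_K, W_E))

# Expanded OV circuit: entry [output][source] is the effect on an output
# token's logit when the head attends to a source token.
ov_circuit = matmul(matmul(W_U, W_O), matmul(W_V, W_E))
```

Both results are vocabulary-by-vocabulary tables computed without running the model on any input, which is the sense in which the skip-trigram structure is readable from the weights.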