claim
active
claim:skip-trigram-bugs-in-one-layer-models-demonstrate-interpretability-can-reveal-and-characterize-specific-model-failure-modes

Skip-trigram bugs in one-layer models demonstrate interpretability can reveal and characterize specific model failure modes

Early example of using mechanistic interpretability to understand unintended model behavior

Source paper

extracted_from
A Mathematical Framework for Transformer Circuits (2021)

Neighborhood — ranked by edge-count

Concepts (1)

concept
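  • Model failures in which a one-layer attention head that raises the probability of intended skip-trigrams must also raise the probability of unintended token combinations, because it factors the three-way interaction between source, destination, and output tokens into two pairwise terms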
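
The factoring described in this concept can be illustrated with a toy sketch. The scores below are hypothetical numbers chosen for illustration, not weights from any real model, and the sketch treats the attention weight as a fixed pairwise score, ignoring the softmax over other positions. The names qk_score, ov_logit, and skip_trigram_effect are assumptions introduced here. The tokens follow the "keep ... in mind" / "keep ... at bay" example from the source paper.

```python
# Toy illustration with hypothetical numbers (not taken from a trained model).
# qk_score[(src, dst)]: how strongly the destination token attends back to src.
# ov_logit[(src, out)]: how much attending to src raises the logit of out.
qk_score = {("keep", "in"): 1.0, ("keep", "at"): 1.0}
ov_logit = {("keep", "mind"): 2.0, ("keep", "bay"): 2.0}


def skip_trigram_effect(src, dst, out):
    """Head's contribution to `out` after `src ... dst`.

    The effect is a product of two pairwise terms, so the head cannot
    express a dependence on all three tokens jointly.
    """
    return qk_score.get((src, dst), 0.0) * ov_logit.get((src, out), 0.0)


intended = [("keep", "in", "mind"), ("keep", "at", "bay")]
unintended = [("keep", "in", "bay"), ("keep", "at", "mind")]

effects = {t: skip_trigram_effect(*t) for t in intended + unintended}
for trigram, effect in effects.items():
    kind = "intended" if trigram in intended else "unintended"
    print(trigram, kind, effect)
```

Under these assumptions, boosting both intended completions forces the same boost onto the two unintended ones, since each pairwise table is shared across all trigrams with the same source token.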

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.