claim
active
claim:the-prefill-detection-task-may-involve-concordance-heads-that-measure-the-likelihood-of-the-output-given-prior-activations
The prefill detection task may involve concordance heads that measure the likelihood of the output given prior activations
Speculation that QK circuit 'concordance heads' underlie the ability to distinguish intended from unintended outputs.
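To make the speculated mechanism concrete, here is a minimal sketch of how a single attention head's QK circuit could score agreement between an emitted token and what earlier activations predicted. Everything in it is an illustrative assumption rather than anything taken from the source paper: the names matvec and qk_concordance, the two-dimensional toy vectors, the identity weight matrices, and the choice to place the emitted token on the query side and the prior "intended" representation on the key side.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def qk_concordance(query_resid, key_resid, w_q, w_k):
    """Scaled dot-product score for one hypothetical attention head.

    query_resid: toy residual vector at the emitted (possibly prefilled) token.
    key_resid:   toy residual vector encoding what prior activations predicted.
    w_q, w_k:    hypothetical query and key projection matrices.
    """
    q = matvec(w_q, query_resid)
    k = matvec(w_k, key_resid)
    return sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q))


# Toy setup (assumed): identity projections and one-hot "concept" directions.
identity = [[1.0, 0.0], [0.0, 1.0]]
intended = [1.0, 0.0]  # what the prior activations pointed toward

# The emitted token agrees with the prior intention.
matched = qk_concordance([1.0, 0.0], intended, identity, identity)
# A prefilled token unrelated to the prior intention.
mismatched = qk_concordance([0.0, 1.0], intended, identity, identity)

print(matched, mismatched)
```

In this toy setting the matched token scores higher than the mismatched one, which is the kind of signal a consistency-checking circuit would need. Whether real heads implement anything like this remains the open question the claim raises.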
Source paper
extracted_from · Lindsey, Jack (2026)
Neighborhood — ranked by edge-count
Communities (2)
community
- Spans attention head decomposition, benchmark awareness, and genomic pathogenicity prediction via neural models.
- Mechanistic interpretability studies of Claude models using layer-wise representation analysis and thought injection to reveal unverbalized reasoning, planning, and covert cognition.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Task where a random word is prefilled as the assistant's response, then the model is asked whether it intended to say that word, testing introspection on prior intentions.
- The optimal layer for the prefill introspection differs from the optimal layer for detecting injected thoughts.
- Concordance heads (QK circuits) could serve as the consistency-checking circuit for distinguishing intended vs. unintended outputs · hypothesis · 0.773 · Speculated mechanism for prefill detection.
- Claude Opus 4.1 and 4 show greatest reduction in apology rate in the prefill detection task · finding · 0.767 · Injecting a concept matching the prefilled word reduces the rate at which the model apologizes, maximally for Opus models.
- Thought detection peaks at ~2/3 layer depth; intention checking peaks at ~1/2 layer depth. · finding · 0.749 · Lindsey (2026) differential layer performance explained by Janus's path combinatorics: different tasks use different path distributions.
- Extrapolation from scale-emergence finding to future risk
- Pretraining stores latent patterns that coherent anchors can bind (or misbind) to targets. · quote · 0.737 · Load-bearing quote capturing the core metaphor
- A pair of query and key subcomponents distributed across attention heads performs previous-token behavior · finding · 0.737 · VPD recovers an attention algorithm for attending to the previous token, distributed across multiple heads.