claim

active

claim:the-prefill-detection-task-may-involve-concordance-heads-that-measure-the-likelihood-of-the-output-given-prior-activations

The prefill detection task may involve concordance heads that measure the likelihood of the output given prior activations

Speculation that QK circuit 'concordance heads' underlie the ability to distinguish intended from unintended outputs.
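
Illustrative sketch only, not taken from the source paper: one way to picture a QK "concordance" check is a query built from the token that was actually emitted, matched against a key built from an earlier activation. The function name, the random weights, the shared W_Q = W_K matrix, and the cosine normalization are all assumptions made for this toy example.

```python
import numpy as np

def concordance_score(prior_activation, output_embedding, W_Q, W_K):
    """Toy QK match between an emitted token and a prior activation.

    Hypothetical helper: returns the cosine between the projected query
    and key, so 1.0 means the output lines up exactly with the prior state.
    """
    q = output_embedding @ W_Q   # query from the token actually emitted
    k = prior_activation @ W_K   # key from the earlier activation
    return float(q @ k / (np.linalg.norm(q) * np.linalg.norm(k)))

rng = np.random.default_rng(0)
d_model, d_head = 16, 8
# Assumption: W_Q and W_K share weights, so matching directions score highest.
W = rng.normal(size=(d_model, d_head))

prior = rng.normal(size=d_model)       # stand-in for the prior activation
intended = prior.copy()                # output consistent with the prior state
prefilled = rng.normal(size=d_model)   # unrelated output, like a random prefill

print("intended :", concordance_score(prior, intended, W, W))
print("prefilled:", concordance_score(prior, prefilled, W, W))
```

In this toy setup the consistent output scores at the maximum and the unrelated one scores lower; whether real models implement anything like this remains the speculation stated above.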

Source paper

extracted_from

Emergent Introspective Awareness in Large Language Models

(2026) · Lindsey, Jack

Neighborhood — ranked by edge-count

Communities (2)

community

Mechanistic interpretability & model evaluation
members_of
Spans attention head decomposition, benchmark awareness, and genomic pathogenicity prediction via neural models.
Internal reasoning detection via neural activation analysis
members_of
Mechanistic interpretability studies of Claude models using layer-wise representation analysis and thought injection to reveal unverbalized reasoning, planning, and covert cognition.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Prefill detection task · method · 0.834
Task where a random word is prefilled as the assistant's response, then the model is asked whether it intended to say that word, testing introspection on prior intentions.
Prefill detection effect peaks at an earlier layer (slightly over halfway through) in Opus 4.1, differing from the injected-thoughts peak · finding · 0.776
The optimal layer for prefill introspection differs from the optimal layer for detecting injected thoughts.
Concordance heads (QK circuits) could serve as the consistency-checking circuit for distinguishing intended vs. unintended outputs · hypothesis · 0.773
Speculated mechanism for prefill detection.
Claude Opus 4.1 and 4 show greatest reduction in apology rate in the prefill detection task · finding · 0.767
Injecting a concept matching the prefilled word reduces the rate at which the model apologizes, maximally for Opus models.
Thought detection peaks at ~2/3 layer depth; intention checking peaks at ~1/2 layer depth. · finding · 0.749
Lindsey (2026) differential layer performance explained by Janus's path combinatorics — different tasks use different path distributions.
Future models with substantially increased capabilities will exhibit alignment faking that is more consistent, robust, and harder to detect · hypothesis · 0.742
Extrapolation from scale-emergence finding to future risk
Pretraining stores latent patterns that coherent anchors can bind (or misbind) to targets. · quote · 0.737
Load-bearing quote capturing the core metaphor
A pair of query and key subcomponents distributed across attention heads performs previous-token behavior · finding · 0.737
VPD recovers an attention algorithm for attending to the previous token, distributed across multiple heads.
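
A minimal sketch of how a "cosine ≥ 0.65 · no typed edge" list like the one above could be produced, assuming each entity has an embedding vector and a known set of typed-edge neighbors. The function name, the toy vectors, and the data layout are hypothetical; the page does not describe its actual pipeline.

```python
import numpy as np

def similar_without_edge(target, embeddings, typed_neighbors, threshold=0.65):
    """Return (name, cosine) pairs at or above threshold, highest first,
    skipping the target itself and anything already linked by a typed edge."""
    t = np.asarray(embeddings[target], dtype=float)
    hits = []
    for name, vec in embeddings.items():
        if name == target or name in typed_neighbors:
            continue
        v = np.asarray(vec, dtype=float)
        cos = float(t @ v / (np.linalg.norm(t) * np.linalg.norm(v)))
        if cos >= threshold:
            hits.append((name, cos))
    return sorted(hits, key=lambda pair: pair[1], reverse=True)

# Toy 2-d embeddings (made up for illustration).
toy = {
    "claim": [1.0, 0.0],
    "near": [1.0, 0.1],
    "far": [0.0, 1.0],
    "linked": [1.0, 1.0],   # similar, but already has a typed edge
    "medium": [2.0, 1.0],
}
for name, score in similar_without_edge("claim", toy, typed_neighbors={"linked"}):
    print(f"{name} · {score:.3f}")
```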