Prefill detection task

Task where a random word is prefilled as the assistant's response, then the model is asked whether it intended to say that word, testing introspection on prior intentions.

Neighborhood — ranked by edge-count

Papers (1)

paper

Emergent Introspective Awareness in Large Language Models
introduces

Concepts (1)

concept

Concept Injection
implements
Technique of injecting activation patterns associated with specific concepts into a model's internal states to test whether self-reports reflect ground truth.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

The prefill detection task may involve concordance heads that measure the likelihood of the output given prior activationsclaim0.834
Speculation that QK circuit 'concordance heads' underlie the ability to distinguish intended from unintended outputs.
Binary Detection Taskmethod0.795
Task paradigm from prior work asking 'Did you detect an injected thought?' via YES/NO logit comparison; shown here to be confounded
Prefill detection effect peaks at an earlier layer (slightly over halfway through) in Opus 4.1, different from injected thoughts peakfinding0.749
The optimal layer for the prefill introspection differs from the optimal layer for detecting injected thoughts.
Hinting Taskmethod0.747
One of four ToM tasks analyzed; requires inferring speaker intent from indirect hints; scored 0/1.
thought detectionconcept0.739
Task of detecting a model's internal thoughts; found by Lindsey (2026) to peak at ~2/3 depth in transformers.
sequential reasoning tasksconcept0.721
Language model reasoning tasks with sequential geometry used in experiments.
Task balancingconcept0.720
The problem of ensuring all tasks in MTL perform well, avoiding dominance by some tasks.
Thought detection peaks at ~2/3 layer depth; intention checking peaks at ~1/2 layer depth.finding0.715
Lindsey (2026) differential layer performance explained by Janus's path combinatorics — different tasks use different path distributions.