method
active
method:prefill-detection-taskPrefill detection task
Task where a random word is prefilled as the assistant's response, then the model is asked whether it intended to say that word, testing introspection on prior intentions.
Neighborhood — ranked by edge-count
Papers (1)
paper
Concepts (1)
concept
- Concept InjectionimplementsTechnique of injecting activation patterns associated with specific concepts into a model's internal states to test whether self-reports reflect ground truth.
Related by similarity (8)
cosine ≥ 0.65 · no typed edgeEntities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.
- Speculation that QK circuit 'concordance heads' underlie the ability to distinguish intended from unintended outputs.
- Task paradigm from prior work asking 'Did you detect an injected thought?' via YES/NO logit comparison; shown here to be confounded
- The optimal layer for the prefill introspection differs from the optimal layer for detecting injected thoughts.
- One of four ToM tasks analyzed; requires inferring speaker intent from indirect hints; scored 0/1.
- Task of detecting a model's internal thoughts; found by Lindsey (2026) to peak at ~2/3 depth in transformers.
- Language model reasoning tasks with sequential geometry used in experiments.
- The problem of ensuring all tasks in MTL perform well, avoiding dominance by some tasks.
- Thought detection peaks at ~2/3 layer depth; intention checking peaks at ~1/2 layer depth.finding0.715Lindsey (2026) differential layer performance explained by Janus's path combinatorics — different tasks use different path distributions.