hypothesis
active
hypothesis:the-sensitivity-to-think-don-t-think-instructions-may-be-achieved-via-a-circuit-that-tags-tokens-as-attention-worthy-based-on-instructions-or-incentives
The sensitivity to think/don't think instructions may be achieved via a circuit that tags tokens as attention-worthy based on instructions or incentives
Mechanism for how the model modulates representation strength.
Source paper
extracted_from · Lindsey, Jack (2026)
Neighborhood — ranked by edge-count
Papers (1)
paper
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Normative vision for how the circuits agenda could resolve the pre-paradigmatic state of interpretability
- Speculation about the mechanistic basis of the distinguishing thoughts from text experiment.
- Contrasts with synthetic doc finding; suggests different mechanisms may be at play
- Practical implication showing task instructions are equivalent to inducing prior beliefs in experimental settings
- Concise statement of the free-energy principle's unification of action and perception.
- Thought detection peaks at ~2/3 layer depth; intention checking peaks at ~1/2 layer depth. · finding · 0.763 · Lindsey (2026) differential layer performance explained by Janus's path combinatorics: different tasks use different path distributions.
- Key decomposition enabling separate analysis of where attention goes and what it does
- Observation from alternative prompts that detection is weaker without setup.