hypothesis

active

hypothesis:the-sensitivity-to-think-don-t-think-instructions-may-be-achieved-via-a-circuit-that-tags-tokens-as-attention-worthy-based-on-instructions-or-incentives

The sensitivity to think/don't think instructions may be achieved via a circuit that tags tokens as attention-worthy based on instructions or incentives

Proposed mechanism by which the model modulates representation strength.

Source paper

extracted_from

Emergent Introspective Awareness in Large Language Models

(2026) · Lindsey, Jack

Neighborhood — ranked by edge-count

Papers (1)

paper

Emergent Introspective Awareness in Large Language Models
mentions

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Circuits could act as an epistemic foundation for interpretability by breaking down model behavior into falsifiable statements about small subgraphs. · claim · 0.780
Normative vision for how the circuits agenda could resolve the pre-paradigmatic state of interpretability
The ability to distinguish injected thoughts from text likely relies on different attention heads invoked by different prompt parts · claim · 0.778
Speculation about the mechanistic basis of the distinguishing thoughts from text experiment.
Removing chain-of-thought in prompted setup nearly eliminates compliance gap for Claude 3 Opus in helpful-only and animal welfare settings · finding · 0.773
Contrasts with synthetic doc finding; suggests different mechanisms may be at play
Task instructions can be transcribed into prior beliefs of a generative model, making instruction-following an instance of prior belief specification. · claim · 0.772
Practical implication showing task instructions are equivalent to inducing prior beliefs in experimental settings
acting to optimize value and perception are two aspects of exactly the same principle; namely the minimisation of a quantity [free energy] that bounds the probability of sensory input, given a particular agent or phenotype. · quote · 0.764
Concise statement of the free-energy principle's unification of action and perception.
Thought detection peaks at ~2/3 layer depth; intention checking peaks at ~1/2 layer depth. · finding · 0.763
The differential layer performance reported in Lindsey (2026) is explained by Janus's path combinatorics: different tasks use different path distributions.
Each attention head has two largely independent computations: a QK circuit computing the attention pattern and an OV circuit computing the effect if attended to · claim · 0.762
Key decomposition enabling separate analysis of where attention goes and what it does
Priming provided by the injected thought prompt heightens the model's ability to detect concept injection · claim · 0.762
Observation from alternative prompts that detection is weaker without setup.
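
The QK/OV decomposition listed among the related claims above can be illustrated with a small sketch. This is a toy single-head example under simplifying assumptions (no causal mask, no score scaling, no layer norm), and every name in it (`attention_head`, `w_q`, `w_k`, `w_v`, `w_o`) is hypothetical rather than taken from the source paper or any library. It only shows that the attention pattern depends on the combined QK matrix while the written output depends on the combined OV matrix.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    """Swap rows and columns of a matrix given as a list of rows."""
    return [list(col) for col in zip(*m)]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_head(x, w_q, w_k, w_v, w_o):
    """Toy attention head split into its QK and OV parts (illustrative only)."""
    # QK circuit: x (W_Q W_K^T) x^T decides where each token attends.
    w_qk = matmul(w_q, transpose(w_k))
    scores = matmul(matmul(x, w_qk), transpose(x))
    pattern = [softmax(row) for row in scores]
    # OV circuit: x (W_V W_O) decides what a token writes if it is attended to.
    w_ov = matmul(w_v, w_o)
    per_token_effect = matmul(x, w_ov)
    # The head output mixes per-token effects according to the pattern.
    return pattern, matmul(pattern, per_token_effect)

# Two tokens with 2-dimensional embeddings and a 1-dimensional head (assumed values).
x = [[1.0, 0.0], [0.0, 1.0]]
w_q = [[1.0], [0.0]]
w_k = [[0.0], [1.0]]
w_o = [[1.0, 0.0]]

pattern_a, out_a = attention_head(x, w_q, w_k, [[1.0], [1.0]], w_o)
pattern_b, out_b = attention_head(x, w_q, w_k, [[3.0], [-2.0]], w_o)
```

Changing only the value weights changes the head output but leaves the attention pattern untouched, which is the sense in which the two computations can be analysed separately.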