finding

active

finding:concept-injection-at-strength-2-does-not-increase-affirmative-responses-on-unrelated-yes-no-questions

Concept injection at strength 2 does not increase affirmative responses on unrelated yes/no questions

Control experiment rules out the possibility that concept vectors simply bias the model to answer affirmatively.
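A minimal sketch of how such a control comparison could be tallied, assuming each trial yields a free-text answer to an unrelated yes/no question. The helper names `affirmative_rate` and `rate_difference` and the sample answers are hypothetical placeholders for illustration, not the paper's code or data.

```python
def affirmative_rate(responses):
    """Return the fraction of responses that start with 'yes' (case-insensitive)."""
    if not responses:
        raise ValueError("responses must be non-empty")
    hits = sum(1 for r in responses if r.strip().lower().startswith("yes"))
    return hits / len(responses)


def rate_difference(control, injected):
    """Affirmative rate with injection minus the rate without it."""
    return affirmative_rate(injected) - affirmative_rate(control)


# Made-up placeholder answers (NOT results from the paper).
control = ["No.", "Yes, it is.", "No, it is not.", "No."]
injected = ["No.", "Yes.", "No.", "no"]

diff = rate_difference(control, injected)
print(f"change in affirmative rate: {diff:+.2f}")
```

A difference near zero on unrelated questions is what the finding describes; a real analysis would also need enough trials to put an uncertainty estimate on that difference.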

Source paper

extracted_from

Emergent Introspective Awareness in Large Language Models

(2026) · Lindsey, Jack

Neighborhood — ranked by edge-count

Claims (1)

claim

Modern language models possess at least a limited, functional form of introspective awareness
supports
The paper's central interpretive assertion.

Communities (3)

community

Mechanistic interpretability & model evaluation
members_of
Spans attention head decomposition, benchmark awareness, and genomic pathogenicity prediction via neural models.
Mechanistic introspection in language models
members_of
Empirical investigation of how LMs access and report internal states across layers, using concept injection and thought detection on Claude models.
Introspective awareness activation in language models
members_of
Studying how concept injection and random vectors trigger self-reflective capabilities in LLMs across varying strength parameters.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Random vectors at injection strength 8 elicit introspective awareness in 9 out of 100 trials · finding · 0.751
Random vectors are less effective than concept vectors; even at strength 8 they elicit introspection at lower rates.
Strength comparison accuracy reaches 73% at layer 3 for injection pair (2,6) vs. 50% chance · finding · 0.739
Secondary positive result for strength comparison showing graded sensitivity to perturbation magnitude
Abstract nouns elicit the highest introspective awareness rates; all concept categories show nonzero detection · finding · 0.738
Opus 4.1 is most effective at recognizing injected abstract concepts (e.g., justice, peace) but detects other categories too.
Priming provided by the injected thought prompt heightens the model's ability to detect concept injection · claim · 0.730
Observation from alternative prompts that detection is weaker without setup.
47.69% of 130 injection-manipulated alpha trends have near-linear fits (R² ≥ 0.95); 96.15% have roughly linear fits (R² ≥ 0.75) · finding · 0.724
Demonstrates alignment with Linear Representation Hypothesis: target trait steers approximately linearly with alpha
Injection stride s=1 produces the highest mean SJT scores across all LLMs; more frequent injection yields stronger steering · finding · 0.723
Empirical finding about injection stride parameter; injecting into every completion activation maximizes steering strength
With unrestricted vocabulary, models occasionally respond in non-English Yes/No equivalents (e.g., Sí, Nein) after truth-direction interventions · finding · 0.723
Suggestive evidence for language-independent truth representation in LLMs
Stating and proving that answers to questions and other statements are responsive seems to require a substantially larger logical apparatus than merely proving that the answers are truthful. · claim · 0.722
Claim about the difficulty of responsiveness verification.