finding
active
finding:concept-injection-at-strength-2-does-not-increase-affirmative-responses-on-unrelated-yes-no-questions
Concept injection at strength 2 does not increase affirmative responses on unrelated yes/no questions
Control experiment rules out the possibility that concept vectors simply bias the model to answer affirmatively.
Source paper
extracted_from · Lindsey, Jack (2026)
Neighborhood — ranked by edge-count
Claims (1)
claim
- Modern language models possess at least a limited, functional form of introspective awareness · supports · The paper's central interpretive assertion.
Communities (3)
community
- Spans attention head decomposition, benchmark awareness, and genomic pathogenicity prediction via neural models.
- Empirical investigation of how LMs access and report internal states across layers, using concept injection and thought detection on Claude models.
- Studying how concept injection and random vectors trigger self-reflective capabilities in LLMs across varying strength parameters.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- Random vectors at injection strength 8 elicit introspective awareness in 9 out of 100 trials · finding · 0.751 · Random vectors are less effective, and even then produce introspection at lower rates.
- Strength comparison accuracy reaches 73% at layer 3 for injection pair (2,6) vs. 50% chance · finding · 0.739 · Secondary positive result for strength comparison showing graded sensitivity to perturbation magnitude.
- Opus 4.1 is most effective at recognizing injected abstract concepts (e.g., justice, peace) but detects other categories too.
- Observation from alternative prompts that detection is weaker without setup.
- Demonstrates alignment with Linear Representation Hypothesis: target trait steers approximately linearly with alpha
- Empirical finding about injection stride parameter; injecting into every completion activation maximizes steering strength
- Suggestive evidence for language-independent truth representation in LLMs
- Claim about the difficulty of responsiveness verification.