claim
active
claim:models-are-not-merely-tracking-dialogue-context-features-same-concept-steering-shows-privileged-internal-access-is-necessary-to-explain-self-report-shifts
Models are not merely tracking dialogue context features; same-concept steering shows privileged internal access is necessary to explain self-report shifts
Addresses the skeptical alternative that self-reports reflect only conversational content
Source paper
extracted_from · (2026) · Nicolas Martorell · Bruno Bianchi
Neighborhood — ranked by edge-count
Findings (1)
finding
- Causal confirmation that the coupling between self-report and internal state is genuine: steering toward the positive pole increases the self-report
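The steering intervention this finding refers to is commonly implemented by adding a scaled concept direction to a hidden state. The sketch below illustrates only that generic operation. The function names `steer` and `projection`, the toy vectors, and the value of `alpha` are hypothetical and are not taken from the source paper; the projection is merely a stand-in for how strongly the concept is represented, not a model of the self-report itself.

```python
import math


def steer(hidden, direction, alpha):
    """Return hidden + alpha * unit(direction) (hypothetical helper)."""
    norm = math.sqrt(sum(d * d for d in direction))
    if norm == 0:
        raise ValueError("direction must be non-zero")
    return [h + alpha * d / norm for h, d in zip(hidden, direction)]


def projection(hidden, direction):
    """Scalar projection of hidden onto the unit concept direction."""
    norm = math.sqrt(sum(d * d for d in direction))
    return sum(h * d for h, d in zip(hidden, direction)) / norm


# Toy values, assumed purely for illustration.
hidden = [0.2, -0.1, 0.4]
direction = [1.0, 0.0, 0.0]   # stand-in for the concept's "positive pole"
alpha = 2.0

steered = steer(hidden, direction, alpha)
shift = projection(steered, direction) - projection(hidden, direction)
```

By construction the projection onto the concept direction moves by exactly `alpha`. Whether a model's self-report moves with it is the empirical question the finding addresses.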
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- Key methodological insight: introspection enables a new probe validation criterion beyond conventional separation metrics
- Foundational claim of the paper, defining self-evidencing.
- Load-bearing summary of the paper's central contribution
- Explicit scope delimitation that situates the paper's claims within interpretability rather than consciousness science
- Mechanistic interpretation of how activation steering induces deception through the model's reasoning process
- Shows gating effect is specific to the self-referential computational regime, not a general feature effect
- Foundational paper introducing activation steering methodology used in this work
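The selection rule for this list (cosine ≥ 0.65 and no typed edge) can be sketched as a simple filter. This is a minimal illustration under assumptions: the function `untyped_neighbors`, the entity ids, the two-dimensional embeddings, and the choice to sort by score are all hypothetical, not a description of how this graph is actually built.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is zero."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0 or nb == 0:
        return 0.0
    return dot / (na * nb)


def untyped_neighbors(target_id, embeddings, typed_edges, threshold=0.65):
    """Entities similar to target_id that share no typed edge with it."""
    target = embeddings[target_id]
    hits = []
    for other_id, vec in embeddings.items():
        if other_id == target_id:
            continue
        # Skip pairs already linked by a typed relation.
        if frozenset((target_id, other_id)) in typed_edges:
            continue
        score = cosine(target, vec)
        if score >= threshold:
            hits.append((other_id, score))
    # Sorting by similarity is an assumption made for this sketch.
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


# Toy data, assumed purely for illustration.
embeddings = {
    "claim_a": [1.0, 0.0],
    "claim_b": [0.9, 0.1],
    "finding_c": [0.8, 0.6],
    "paper_d": [0.0, 1.0],
}
typed_edges = {frozenset(("claim_a", "finding_c"))}

neighbors = untyped_neighbors("claim_a", embeddings, typed_edges)
```

In this toy data `finding_c` clears the threshold but is excluded because it already has a typed edge to `claim_a`, which is the situation the "no typed edge" qualifier describes.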