claim

active

claim:models-are-not-merely-tracking-dialogue-context-features-same-concept-steering-shows-privileged-internal-access-is-necessary-to-explain-self-report-shifts

Models are not merely tracking dialogue context features; same-concept steering shows privileged internal access is necessary to explain self-report shifts

Addresses the skeptical alternative that self-reports reflect only conversational content

Source paper

extracted_from

Quantitative Introspection in Language Models: Tracking Emotive States Across Conversation

(2026) · Nicolas Martorell · Bruno Bianchi

Neighborhood — ranked by edge-count

Findings (1)

finding

Same-concept steering shifts self-report monotonically for all four concepts: linear mixed model (LMM) slopes on alpha range from 0.067 to 0.40, all p < 10⁻¹²
supports
Causal confirmation that the coupling between self-report and internal state is genuine; steering toward the positive pole increases the reported value
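
The finding above rests on two checks: the self-report should rise as steering strength rises, and the fitted slope should be positive. A minimal sketch of both checks follows, assuming alpha denotes the steering strength. It uses a plain least-squares slope rather than the mixed model reported in the paper, and the data values and the function names `ols_slope` and `is_monotonic_increasing` are hypothetical, chosen only for illustration.

```python
from statistics import mean


def ols_slope(alphas, reports):
    """Least-squares slope of self-report on steering strength alpha."""
    a_bar, r_bar = mean(alphas), mean(reports)
    num = sum((a - a_bar) * (r - r_bar) for a, r in zip(alphas, reports))
    den = sum((a - a_bar) ** 2 for a in alphas)
    return num / den


def is_monotonic_increasing(alphas, reports):
    """True if the mean report never decreases as alpha increases."""
    by_alpha = {}
    for a, r in zip(alphas, reports):
        by_alpha.setdefault(a, []).append(r)
    means = [mean(by_alpha[a]) for a in sorted(by_alpha)]
    return all(lo <= hi for lo, hi in zip(means, means[1:]))


# Hypothetical data: two self-reports per steering strength (made-up numbers,
# not taken from the paper).
alphas = [-2, -2, -1, -1, 0, 0, 1, 1, 2, 2]
reports = [3.0, 3.4, 3.9, 4.1, 5.0, 5.2, 5.8, 6.2, 6.9, 7.1]

slope = ols_slope(alphas, reports)
monotonic = is_monotonic_increasing(alphas, reports)
print(f"slope={slope:.3f}, monotonic={monotonic}")
```

A mixed model would additionally account for repeated measurements within the same conversation or prompt, which this simplified sketch ignores.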

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

If steering in a purported concept direction does not shift self-report in the expected direction, probe quality becomes suspect, especially when conventional probe metrics alone looked acceptable. · quote · 0.811
Key methodological insight: introspection enables a new probe validation criterion beyond conventional separation metrics
Optimally steering model behavior requires isolating concept geometry and defining operators to navigate it. · claim · 0.811
Any system that persists must minimise surprisal, thereby gathering evidence for its own generative model, a process known as self-evidencing. · claim · 0.793
Foundational claim of the paper, defining self-evidencing.
"the self-prior can serve as an internal criterion for the mark-directed behavior observed in the mirror test, offering a computational basis for investigating the developmental origins of self-awareness" · quote · 0.790
Load-bearing summary of the paper's central contribution
Our central claim is deliberately limited. We do not claim that these models have conscious felt experience, nor that a numeric self-report gives direct access to anything like human phenomenology. · quote · 0.788
Explicit scope delimitation that situates the paper's claims within interpretability rather than consciousness science
Under steering vector interventions, the model relaxes its ethical standards and interprets neutral prompts as implicit suggestions to deceive, creating ethical dilemmas triggering repetitive reasoning cycles. · claim · 0.787
Mechanistic interpretation of how activation steering induces deception through the model's reasoning process
SAE feature steering in history, conceptual, and zero-shot control conditions produces zero experience reports under either suppression or amplification. · finding · 0.786
Shows gating effect is specific to the self-referential computational regime, not a general feature effect
Steering Language Models With Activation Engineering (Turner et al., 2023) · concept · 0.785
Foundational paper introducing activation steering methodology used in this work