finding

active

finding:same-concept-steering-shifts-self-report-monotonically-for-all-four-concepts-lmm-alpha-slopes-0-067-0-40-all-p-10-12

Same-concept steering shifts self-report monotonically for all four concepts: LMM alpha slopes 0.067–0.40, all p<10⁻¹²

Causal confirmation that the coupling between self-report and internal state is genuine: steering toward a concept's positive pole increases the reported value
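A minimal sketch of how a steering-strength slope of this kind can be estimated, offered as an illustration rather than the paper's pipeline. Everything in it is an assumption: the toy hidden states, the concept direction `DIRECTION`, the linear readout `READOUT`, the alpha grid, and the grouping by conversation are all hypothetical. It also swaps the LMM for a simpler within-group (fixed-effects) slope, which removes each conversation's mean before regressing report on alpha; that plays a similar role to a random intercept but is not the same estimator.

```python
import random
from statistics import mean

ALPHAS = [-4, -2, 0, 2, 4]     # hypothetical steering strengths
DIRECTION = [1.0, 0.0, 0.0]    # hypothetical concept direction in a toy 3-d state
READOUT = [0.3, -0.1, 0.2]     # hypothetical linear map from state to self-report
# Slope built into the simulation: the readout applied to the steering direction.
TRUE_SLOPE = sum(d * w for d, w in zip(DIRECTION, READOUT))


def steer(hidden, direction, alpha):
    """Add alpha times the concept direction to a hidden-state vector."""
    return [h + alpha * d for h, d in zip(hidden, direction)]


def report(hidden, weights):
    """Toy self-report: dot product of the hidden state and readout weights."""
    return sum(h * w for h, w in zip(hidden, weights))


def simulate(n_conversations=20, noise_sd=0.05, seed=0):
    """Return (conversation, alpha, report) rows from synthetic data."""
    rng = random.Random(seed)
    rows = []
    for conv in range(n_conversations):
        # Each conversation gets its own baseline state, hence its own intercept.
        base = [rng.gauss(0.0, 1.0) for _ in DIRECTION]
        for alpha in ALPHAS:
            y = report(steer(base, DIRECTION, alpha), READOUT)
            rows.append((conv, alpha, y + rng.gauss(0.0, noise_sd)))
    return rows


def within_group_slope(rows):
    """Slope of report on alpha after subtracting each conversation's means."""
    by_group = {}
    for group, x, y in rows:
        by_group.setdefault(group, []).append((x, y))
    num = den = 0.0
    for pairs in by_group.values():
        mx = mean(x for x, _ in pairs)
        my = mean(y for _, y in pairs)
        for x, y in pairs:
            num += (x - mx) * (y - my)
            den += (x - mx) ** 2
    return num / den


def mean_report_by_alpha(rows):
    """Average report at each steering strength, in the order of ALPHAS."""
    return [mean(y for _, a, y in rows if a == alpha) for alpha in ALPHAS]


if __name__ == "__main__":
    rows = simulate()
    print("estimated slope:", round(within_group_slope(rows), 3))
    print("mean report per alpha:",
          [round(m, 3) for m in mean_report_by_alpha(rows)])
```

In this toy setup a monotonic shift shows up as mean reports that rise with alpha, and the estimated slope recovers the value built into the simulation. With real data, a mixed model would also supply the standard errors and p-values quoted in the finding.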

Source paper

extracted_from

Quantitative Introspection in Language Models: Tracking Emotive States Across Conversation

(2026) · Nicolas Martorell · Bruno Bianchi

Neighborhood — ranked by edge-count

Claims (2)

claim

The coupling between LLM self-report and internal emotive state is causal, not merely correlational
associated_with · supports
Supported by same-concept steering experiments showing monotonic shifts in self-report with activation steering
Models are not merely tracking dialogue context features; same-concept steering shows privileged internal access is necessary to explain self-report shifts
supports
Addresses skeptical alternative that reports reflect only conversational content

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Same-concept steering LMM alpha slopes in LLaMA-3.2-3B: wellbeing=0.19, focus=0.40, interest=0.25, impulsivity=0.067 · finding · 0.867
Quantifies per-concept effect size of same-concept steering on self-report
Logit self-report drift positive for all three LLaMA sizes (turn slopes 0.159, 0.038, 0.141; all p<10⁻²⁰) but does not increase monotonically with scale · finding · 0.799
Unlike probe drift, report drift magnitude does not follow a clean scaling law; size-slope is negative
Impulsivity→interest steering: probe entropy increases (LMM slope=0.024, p=2.30×10⁻⁴) but report entropy does not (p=0.11) · finding · 0.797
Evidence of a bottleneck between richer internal variation and final report distribution in impulsivity→interest condition
Cross-concept steering: impulsivity→interest R² increases from 0.55 (α=-4) to 0.72 (α=+4), ∆R²=0.10, p=0.012 in LLaMA-3.2-3B · finding · 0.793
Second significant cross-concept introspection improvement; marginal after BH correction (q≈0.066)
If steering in a purported concept direction does not shift self-report in the expected direction, probe quality becomes suspect, especially when conventional probe metrics alone looked acceptable. · quote · 0.771
Key methodological insight: introspection enables a new probe validation criterion beyond conventional separation metrics
Cross-concept steering: focus→wellbeing R² increases from 0.30 (α=-4) to 0.76 (α=+4), ∆R²=0.30, p<0.001 in LLaMA-3.2-3B · finding · 0.766
Strongest cross-concept introspection improvement; survives BH correction (q≈0.011)
Interest probe score drifts positively across turns: LMM slope=0.005, p=4.12×10⁻¹⁴ in LLaMA-3.2-3B · finding · 0.764
Demonstrates genuine internal-state dynamics in LLMs during multi-turn conversation
Linear steering is often mismatched with a model's internal representation geometry, producing noisy, off-target effects. · claim · 0.758
The paper's critique of the standard linear steering baseline, supported by the days-of-week demo.
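Two of the cross-concept entries above report q-values after BH (Benjamini-Hochberg) correction. As a rough illustration of what that adjustment does, here is a short sketch of the standard step-up procedure. The helper name `bh_adjust` and the example p-values are hypothetical. The full family of tests the paper corrected over is not listed on this page, so the q-values quoted above cannot be reproduced from it.

```python
def bh_adjust(pvals):
    """Return Benjamini-Hochberg adjusted p-values (q-values), in input order."""
    m = len(pvals)
    # Indices of the p-values from smallest to largest.
    order = sorted(range(m), key=lambda i: pvals[i])
    qvals = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone q-values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        qvals[i] = running_min
    return qvals


if __name__ == "__main__":
    example = [0.01, 0.04, 0.03, 0.005]  # hypothetical p-values, not from the paper
    for p, q in zip(example, bh_adjust(example)):
        print(f"p={p:.3f} -> q={q:.3f}")
```

A test is called significant at false discovery rate level 0.05 when its q-value is at most 0.05, which is why a raw p of 0.012 can end up marginal once the other tests in the family are taken into account.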