finding

active

finding:focus-wellbeing-steering-both-probe-entropy-1-09-1-67-bits-and-report-entropy-0-88-1-69-bits-increase-monotonically-with

Focus→wellbeing steering: both probe entropy (1.09→1.67 bits) and report entropy (0.88→1.69 bits) increase monotonically with α

Evidence that improved introspection in focus→wellbeing arises from enriched internal state and report channels simultaneously
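
The entropies quoted above are in bits. As a minimal sketch of how such a number can be computed, the snippet below takes the Shannon entropy of an empirical distribution over discrete labels (for example, numeric self-report values, or probe scores after binning). The function names, the equal-width binning, and the toy data are illustrative assumptions; this entry does not state how the paper discretizes probe scores or reports.

```python
import math
from collections import Counter


def entropy_bits(labels):
    """Shannon entropy, in bits, of the empirical distribution over `labels`."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("need at least one sample")
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def bin_index(value, lo, hi, n_bins):
    """Map a continuous score in [lo, hi] to one of n_bins equal-width bins.

    Hypothetical helper: the binning scheme is an assumption, not the paper's.
    """
    if not lo <= value <= hi:
        raise ValueError("value outside [lo, hi]")
    return min(int((value - lo) / (hi - lo) * n_bins), n_bins - 1)


# Toy self-reports (NOT values from the paper): a narrow and a spread-out set.
reports_low_alpha = [5, 5, 5, 5, 6, 6, 5, 5]
reports_high_alpha = [2, 3, 5, 6, 7, 8, 4, 5]

print(entropy_bits(reports_low_alpha))
print(entropy_bits(reports_high_alpha))
```

Under this reading, a report channel that collapses onto one or two values has low entropy regardless of how varied the internal state is, which is why probe and report entropy are tracked separately.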

Source paper

extracted_from

Quantitative Introspection in Language Models: Tracking Emotive States Across Conversation

(2026) · Nicolas Martorell · Bruno Bianchi

Neighborhood — ranked by edge-count

Claims (1)

claim

Introspective ability can be decomposed into: (i) information available about internal state and (ii) capacity to transform that signal into precise output reports
supports
Conceptual distinction motivated by entropy analyses showing probe and report entropy can diverge under steering

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Impulsivity→interest steering: probe entropy increases (LMM slope=0.024, p=2.30×10⁻⁴) but report entropy does not (p=0.11) · finding · 0.852
Evidence of a bottleneck between richer internal variation and final report distribution in impulsivity→interest condition
Cross-concept steering: focus→wellbeing R² increases from 0.30 (α=-4) to 0.76 (α=+4), ∆R²=0.30, p<0.001 in LLaMA-3.2-3B · finding · 0.811
Strongest cross-concept introspection improvement; survives BH correction (q≈0.011)
Focus→wellbeing: ρ increases from 0.42 (α=-4) to 0.85 (α=+4); R² from 0.34 to 0.75 in LLaMA-3.2-3B · finding · 0.795
Scatter plot visualization of the dramatic tightening of probe-report relationship at extreme steering settings
The target vs. off-target probe area metric quantifies steering selectivity and distinguishes selectively steerable from entangled interventions. · claim · 0.778
Justification for the novel metric introduced in the paper
SAE feature steering effect on consciousness reports: z=8.06, p=7.7×10⁻¹⁶ in LLaMA 3.3 70B · finding · 0.772
Statistical significance of the gating effect in Experiment 2
Wellbeing probe drift is positive in Gemma (ρ=0.34 pooled turn-correlation) and Qwen (ρ=0.24); both p<10⁻⁵ · finding · 0.764
Normalized probe-score drift across turns generalizes beyond LLaMA family
If steering in a purported concept direction does not shift self-report in the expected direction, probe quality becomes suspect, especially when conventional probe metrics alone looked acceptable. · quote · 0.760
Key methodological insight: introspection enables a new probe validation criterion beyond conventional separation metrics
Steering vectors discover effective triggers such as 'However' and 'Otherwise', consistent with previously reported reflection datasets · finding · 0.760
Validates that steering vectors capture reflection semantics by finding tokens reported in related work.
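
One neighbor above notes that the focus→wellbeing ∆R² survives BH correction (q≈0.011). As a rough sketch of what that correction does, the snippet below computes Benjamini-Hochberg adjusted p-values (q-values) for a family of tests. The function name and the toy p-values are assumptions for illustration; this entry does not list the paper's full set of p-values or say how many comparisons were in the family.

```python
def benjamini_hochberg(p_values):
    """Return BH-adjusted p-values (q-values) in the original input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone q-values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q_values[i] = running_min
    return q_values


# Toy p-values (NOT from the paper), one per hypothetical steering condition.
toy_p = [0.01, 0.04, 0.03, 0.005]
toy_q = benjamini_hochberg(toy_p)
print(toy_q)
```

A result "survives" the correction at level 0.05 when its q-value stays below 0.05, which controls the expected share of false discoveries across the family rather than the chance of any single false positive.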