finding
active
Focus→wellbeing steering: both probe entropy (1.09→1.67 bits) and report entropy (0.88→1.69 bits) increase monotonically with α
Evidence that improved introspection in focus→wellbeing arises from enriched internal state and report channels simultaneously
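The entropies above are given in bits, which means Shannon entropy with a base-2 logarithm. The following is a minimal sketch of how such a value and the monotonicity check could be computed, assuming reports are discrete ratings and using made-up toy samples. The function names, the `reports_by_alpha` data, and the rating scale are all hypothetical; how the source paper discretizes probe scores or reports is not stated here.

```python
import math
from collections import Counter


def shannon_entropy_bits(samples):
    """Shannon entropy (base 2) of the empirical distribution of discrete samples."""
    counts = Counter(samples)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def is_monotonic_increasing(values):
    """True if each value is at least as large as the one before it."""
    return all(a <= b for a, b in zip(values, values[1:]))


# Hypothetical toy ratings per steering strength alpha (NOT data from the paper).
reports_by_alpha = {
    -4: [5, 5, 5, 5, 5, 5, 6, 6],
    0: [4, 5, 5, 5, 6, 6, 6, 7],
    4: [3, 4, 5, 5, 6, 6, 7, 8],
}

# Entropy per alpha, ordered from the most negative to the most positive alpha.
entropies = [shannon_entropy_bits(reports_by_alpha[a]) for a in sorted(reports_by_alpha)]
for alpha, h in zip(sorted(reports_by_alpha), entropies):
    print(f"alpha={alpha:+d}: {h:.2f} bits")
print("monotonic:", is_monotonic_increasing(entropies))
```

The same function could be applied to binned probe scores to obtain a probe entropy, so that the two curves can be compared across α on a common scale. The binning itself would be a further assumption.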
Source paper
extracted_from · (2026) · Nicolas Martorell · Bianchi, Bruno
Neighborhood — ranked by edge-count
Claims (1)
claim
- Conceptual distinction motivated by entropy analyses showing probe and report entropy can diverge under steering
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Evidence of a bottleneck between richer internal variation and final report distribution in impulsivity→interest condition
- Strongest cross-concept introspection improvement; survives BH correction (q≈0.011)
- Focus→wellbeing: ρ increases from 0.42 (α=-4) to 0.85 (α=+4); R² from 0.34 to 0.75 in LLaMA-3.2-3B · finding · 0.795 · Scatter plot visualization of the dramatic tightening of probe-report relationship at extreme steering settings
- Justification for the novel metric introduced in the paper
- SAE feature steering effect on consciousness reports: z=8.06, p=7.7×10⁻¹⁶ in LLaMA 3.3 70B · finding · 0.772 · Statistical significance of the gating effect in Experiment 2
- Wellbeing probe drift is positive in Gemma (ρ=0.34 pooled turn-correlation) and Qwen (ρ=0.24); both p<10⁻⁵ · finding · 0.764 · Normalized probe-score drift across turns generalizes beyond LLaMA family
- Key methodological insight: introspection enables a new probe validation criterion beyond conventional separation metrics
- Validates that steering vectors capture reflection semantics by finding tokens reported in related work.