finding

active

finding:random-direction-controls-show-weak-non-significant-coupling-0-11-to-0-17-r2-0-03-0-11-compared-to-true-probes-0-23-0-79-all-p-0-05

Random direction controls show weak non-significant coupling (ρ=-0.11 to 0.17; R²=0.03–0.11) compared to true probes (∆ρ=0.23–0.79, all p<0.05)

Controls for probe artifacts; demonstrates self-reports carry information specifically about probe-defined concept directions
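A random-direction control of this kind can be sketched as follows. This is a minimal illustration, not the paper's procedure: it assumes controls are random unit vectors in the same activation space as the probe, and that coupling is measured with Spearman's ρ between projected activations and numeric self-reports. All function names (`random_unit_direction`, `project`, `control_correlations`) and the synthetic data are hypothetical.

```python
import math
import random


def rank(values):
    """Return 1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; NaN if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return float("nan")
    return cov / math.sqrt(vx * vy)


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the ranks."""
    return pearson(rank(x), rank(y))


def random_unit_direction(dim, rng):
    """Draw an isotropic Gaussian vector and normalise it to unit length."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]


def project(activations, direction):
    """Scalar projection of each activation vector onto a direction."""
    return [sum(a * d for a, d in zip(row, direction)) for row in activations]


def control_correlations(activations, reports, n_controls, seed=0):
    """Spearman rho between self-reports and each random-direction projection."""
    rng = random.Random(seed)
    dim = len(activations[0])
    return [
        spearman(project(activations, random_unit_direction(dim, rng)), reports)
        for _ in range(n_controls)
    ]


# Synthetic toy data (assumption): reports depend on one "probe" axis plus noise.
rng = random.Random(42)
dim, n_samples = 32, 200
activations = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_samples)]
probe_direction = random_unit_direction(dim, rng)
reports = [p + rng.gauss(0.0, 0.5) for p in project(activations, probe_direction)]

true_rho = spearman(project(activations, probe_direction), reports)
control_rhos = control_correlations(activations, reports, n_controls=20, seed=1)
print(f"probe rho = {true_rho:.2f}")
print(f"control rho range = {min(control_rhos):.2f} to {max(control_rhos):.2f}")
```

If the self-reports track only the probe-defined direction, the probe projection should correlate strongly with them while the random projections stay near zero, which is the pattern the finding reports.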

Source paper

extracted_from

Quantitative Introspection in Language Models: Tracking Emotive States Across Conversation

(2026) · Nicolas Martorell · Bruno Bianchi

Neighborhood — ranked by edge-count

Claims (1)

claim

Numeric self-report is a viable, complementary black-box tool for monitoring LLM internal emotive states alongside white-box probe methods
supports
Central practical conclusion; both methods partially track the same latent state but with different failure modes

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Cross-model pairwise cosine similarity of zero-shot control responses = 0.603 (n=12,720 pairs, t=35.1, p=4.3×10⁻²⁶² vs. experimental) · finding · 0.766
Experiment 3 comparison: zero-shot control shows lower semantic convergence than experimental condition
MM probe trained on likely dataset achieves NIE of 0.70 (false→true) on LLaMA-2-13B, surprisingly strong but weaker than truth probes · finding · 0.758
Likely-trained MM probe is a surprisingly effective causal baseline due to correlation between truth and probability on sp_en_trans
Mass-mean probe directions outperform LR and CCS in causal intervention experiments (NIE) in 7/8 experimental conditions · finding · 0.757
Core result showing MM is superior to LR for causal implication despite similar classification accuracy
47.69% of 130 injection-manipulated alpha trends have near-linear fits (R² ≥ 0.95); 96.15% have roughly linear fits (R² ≥ 0.75) · finding · 0.755
Demonstrates alignment with Linear Representation Hypothesis: target trait steers approximately linearly with alpha
Impulsivity→interest steering: probe entropy increases (LMM slope=0.024, p=2.30×10⁻⁴) but report entropy does not (p=0.11) · finding · 0.754
Evidence of a bottleneck between richer internal variation and final report distribution in impulsivity→interest condition
For simple factual tasks F0-F3, probe directions show a sharp geometric transition in middle layers, with late-layer probes converging to high cosine similarity; A3 and F4-F5 show no clear transition · finding · 0.746
Geometric evidence for convergence to stable truth directions only for simpler tasks.
Cogito emotion probe residual autocorrelation +0.077 above variance-matched controls (p=1.5×10⁻²⁷, 157/171 probes positive) · finding · 0.745
Demonstrates that Cogito emotion probes are persistently active beyond what is explained by their variance alone
Steering Vector Control maintains low unexpected rate of 0.08 in Experiment 1, comparable to baseline · finding · 0.743
Shows that inducing deception via steering vectors preserves semantic coherence and does not cause random errors