finding
active
Random direction controls show weak, non-significant coupling (ρ = -0.11 to 0.17; R² = 0.03–0.11) compared to true probes (∆ρ = 0.23–0.79, all p < 0.05)
Controls for probe artifacts; demonstrates self-reports carry information specifically about probe-defined concept directions
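A minimal sketch of a random-direction control, on synthetic data only. It is not the source paper's code or data: the activations, the `probe` direction, the noise level, and all helper names are hypothetical. It compares the Spearman correlation between self-reports and the projection onto a probe direction against the same correlation for random unit directions.

```python
import math
import random


def ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = avg
        i = j + 1
    return out


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman rho: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def random_unit(dim, rng):
    """Draw a unit vector with a uniformly random direction."""
    v = [rng.gauss(0, 1) for _ in range(dim)]
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]


def project(acts, direction):
    """Project each activation vector onto a direction."""
    return [sum(a * d for a, d in zip(row, direction)) for row in acts]


rng = random.Random(0)
dim, n = 64, 200  # assumed sizes, chosen only for illustration

probe = random_unit(dim, rng)  # stand-in for a trained probe direction
acts = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n)]

# Synthetic self-reports: probe projection plus Gaussian noise (assumption).
reports = [p + rng.gauss(0, 0.5) for p in project(acts, probe)]

rho_probe = spearman(project(acts, probe), reports)
rho_random = [
    spearman(project(acts, random_unit(dim, rng)), reports) for _ in range(20)
]

print(f"probe direction rho: {rho_probe:.2f}")
print(f"random directions rho range: {min(rho_random):.2f} to {max(rho_random):.2f}")
```

In this toy setup the reports are built from the probe projection, so the probe direction couples strongly by construction while random directions couple only through chance overlap with it. The sketch shows the shape of the control, not evidence for the finding.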
Source paper
extracted_from · (2026) · Nicolas Martorell · Bruno Bianchi
Neighborhood — ranked by edge-count
Claims (1)
claim
- Central practical conclusion; both methods partially track the same latent state but with different failure modes
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Experiment 3 comparison: zero-shot control shows lower semantic convergence than experimental condition
- Likely-trained MM probe is a surprisingly effective causal baseline due to correlation between truth and probability on sp_en_trans
- Core result showing MM is superior to LR for causal implication despite similar classification accuracy
- Demonstrates alignment with Linear Representation Hypothesis: target trait steers approximately linearly with alpha
- Evidence of a bottleneck between richer internal variation and final report distribution in impulsivity→interest condition
- Geometric evidence for convergence to stable truth directions only for simpler tasks.
- Demonstrates that Cogito emotion probes are persistently active beyond what is explained by their variance alone
- Steering Vector Control maintains a low unexpected rate of 0.08 in Experiment 1, comparable to baseline · finding · 0.743 · Shows that inducing deception via steering vectors preserves semantic coherence and does not cause random errors