claim
active
claim:md-vectors-outperform-probe-based-vectors-in-sjts-because-they-align-with-construct-centroids-rather-than-distorting-direction-via-regularization
MD vectors outperform probe-based vectors in SJTs because they align with construct centroids rather than distorting direction via regularization
Mechanistic explanation for MDS superiority; attributed to two design choices: centroid alignment and full-utterance semantics in h_s
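The centroid-alignment idea can be illustrated with a minimal sketch. It assumes an MD vector is the difference between the mean hidden states of two contrast classes, which is the usual difference-in-means construction; the source does not spell out the formula. The toy hidden states, the helper names (`centroid`, `mean_difference`, `cosine`), and the `probe_w` direction are all hypothetical and are not taken from the paper.

```python
import math

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def mean_difference(pos, neg):
    """Difference of class centroids (assumed definition of an MD vector)."""
    mu_pos, mu_neg = centroid(pos), centroid(neg)
    return [p - q for p, q in zip(mu_pos, mu_neg)]

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy 2-d "hidden states" for two contrast classes (illustrative only).
pos = [[2.0, 1.0], [4.0, 3.0]]
neg = [[0.0, 1.0], [0.0, -1.0]]

md = mean_difference(pos, neg)

# Hypothetical probe weight vector; a real probe would be fit to data,
# typically with a regularization term that can tilt its direction.
probe_w = [1.0, 0.0]

centroid_gap = [p - q for p, q in zip(centroid(pos), centroid(neg))]
print("MD vs centroid gap:   ", cosine(md, centroid_gap))
print("probe vs centroid gap:", cosine(probe_w, centroid_gap))
```

By construction the MD vector coincides with the gap between the two centroids, so any direction that differs from it, such as the hypothetical `probe_w` here, has a cosine below 1 with that gap.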
Source paper
extracted_from · (2026) · Leonardo Blas · Robin Jia · Emilio Ferrara
Neighborhood — ranked by edge-count
Findings (1)
finding
- MDS achieves global win proportion of 89.5% on SJTs across 14 LLMs and four injection strides · supports · MDS dominates in open-ended generation by global win proportion metric (Table 2)
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Observation from 100% accuracy on specific concept-layer-strength combinations suggesting concept-specific detectability
- Demonstrates averaging multiple prompt pairs reduces noise; optimal subset selection further improves performance.
- Synthetic MFT SJTs achieve 77.71%-83.84% alignment with Clifford et al. human-composed MFT vignettes · finding · 0.762 · Moderate-to-high alignment validating SJT synthesis for moral foundations domain
- Validates that steering vectors capture reflection semantics by finding tokens reported in related work.
- Appendix E replication of DIM alignment finding in Qwen model
- Theoretical alignment claim backed by OLS R2 analysis showing 96.15% of trends have R2>=0.75
- Open question raised in §7.1 about an unexplained anomalous result
- Steering vectors from µ(0→2) slightly outperform µ(1→2) for instruction discovery across datasets and models · finding · 0.755 · Shows that contrasting No Reflection with Triggered Reflection provides a stronger signal than Intrinsic vs Triggered.