claim

active

claim:md-vectors-outperform-probe-based-vectors-in-sjts-because-they-align-with-construct-centroids-rather-than-distorting-direction-via-regularization

MD vectors outperform probe-based vectors in SJTs because they align with construct centroids rather than distorting direction via regularization

Mechanistic explanation for MDS superiority; attributed to two design choices: centroid alignment and full-utterance semantics in h_s
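The centroid-alignment idea can be illustrated with a minimal sketch. It assumes an MD vector is the difference between the centroids of two groups of hidden states (high-trait and low-trait) and that steering adds a scaled copy of it to a hidden state. The helper names (`centroid`, `mean_difference`, `steer`), the toy two-dimensional data, and the scaling factor `alpha` are illustrative assumptions and are not taken from the source paper.

```python
from math import sqrt

def centroid(vectors):
    # Component-wise mean of a list of equal-length vectors.
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def mean_difference(pos, neg):
    # Assumed definition: MD vector = centroid(pos) - centroid(neg).
    c_pos, c_neg = centroid(pos), centroid(neg)
    return [p - q for p, q in zip(c_pos, c_neg)]

def cosine(u, v):
    # Cosine similarity between two vectors.
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def steer(h, v, alpha):
    # Hypothetical injection step: shift hidden state h along v by alpha.
    return [x + alpha * d for x, d in zip(h, v)]

# Toy hidden states (made up for illustration only).
high = [[2.0, 1.0], [4.0, 3.0]]   # utterances expressing the construct
low = [[0.0, 1.0], [2.0, -1.0]]   # utterances not expressing it

md = mean_difference(high, low)
steered = steer([0.0, 0.0], md, 0.5)
```

By construction, `md` points exactly from the low-trait centroid to the high-trait centroid, so no fitting objective or regularization term can rotate it away from that direction. A trained probe's weight vector has no such guarantee.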

Source paper

extracted_from

Psychological Steering of Large Language Models

(2026) · Leonardo Blas · Robin Jia · Emilio Ferrara

Neighborhood — ranked by edge-count

Findings (1)

finding

MDS achieves a global win proportion of 89.5% on SJTs across 14 LLMs and four injection strides
supports
MDS dominates in open-ended generation by global win proportion metric (Table 2)

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Some steering vectors produce more salient perturbations than others, perhaps based on shared semantic or qualitative factors · claim · 0.775
Observation from 100% accuracy on specific concept-layer-strength combinations suggesting concept-specific detectability
Steering vector constructed from all 16 contrastive pairs outperforms most single-pair vectors; best 4-pair vector outperforms full 16-pair vector · finding · 0.770
Demonstrates averaging multiple prompt pairs reduces noise; optimal subset selection further improves performance.
Synthetic MFT SJTs achieve 77.71%-83.84% alignment with Clifford et al. human-composed MFT vignettes · finding · 0.762
Moderate-to-high alignment validating SJT synthesis for moral foundations domain
Steering vectors discover effective triggers such as 'However' and 'Otherwise', consistent with prior reported reflection datasets · finding · 0.758
Validates that steering vectors capture reflection semantics by finding tokens reported in related work.
In Qwen-2.5-9B, only v1 has meaningful cosine similarity to DIM direction; all additional basis vectors have cosine similarities ~1e-9 · finding · 0.758
Appendix E replication of DIM alignment finding in Qwen model
MDS injections align with the Linear Representation Hypothesis: target trait varies near-linearly with alpha in open-ended generation · claim · 0.755
Theoretical alignment claim backed by OLS R² analysis showing 96.15% of trends have R² ≥ 0.75
Why were interventions with mass-mean probe directions extracted from the likely dataset so effective, despite these probes not being accurate at classifying true/false statements? · question · 0.755
Open question raised in §7.1 about an unexplained anomalous result
Steering vectors from µ(0→2) slightly outperform µ(1→2) for instruction discovery across datasets and models · finding · 0.755
Shows that contrasting No Reflection with Triggered Reflection provides a stronger signal than Intrinsic vs Triggered.