claim

active

claim:uncalibrated-sweep-units-and-restricted-coefficient-ranges-are-the-primary-cause-of-prior-reports-showing-p2-outperforming-md-injections

Uncalibrated sweep units and restricted coefficient ranges are the primary cause of prior reports showing P2 outperforming MD injections

Mechanistic explanation for the discrepancy with Banayeeanzade et al.; addressed by the paper's centroid-unit calibration and unbounded-sweep contributions

Source paper

extracted_from

Psychological Steering of Large Language Models

(2026) · Leonardo Blas · Robin Jia · Emilio Ferrara

Neighborhood — ranked by edge-count

Papers (1)

paper

Psychological Steering of Large Language Models
introduces

Findings (1)

finding

MDS injections outperform P2 in open-ended generation in 11 of 14 LLMs with Phi gains of 3.61% to 16.44%
associated_with · supports
Primary quantitative result overturning prior reports that prompting outperforms representation engineering

Methods (1)

method

Centroid Unit Calibration
associated_with
Novel calibration of injection strength as the distance from centroid midpoint to centroid; enables meaningful cross-layer comparison of alpha values
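
The calibration described above can be sketched in a few lines. This is a minimal illustration and not the authors' implementation: it assumes two contrastive sets of activations from a single layer (high-trait and low-trait), takes the injection direction along the line joining their centroids, and defines one unit as the distance from the centroid midpoint to a centroid. The names `centroid`, `centroid_unit`, and `inject` are hypothetical.

```python
import math


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / n for i in range(dim)]


def centroid_unit(high_acts, low_acts):
    """Return (direction, unit) for one layer.

    Assumption: `high_acts` and `low_acts` are activations collected at the
    same layer for high-trait and low-trait inputs. The unit is the distance
    from the midpoint of the two centroids to the high-trait centroid, which
    equals half the distance between the centroids.
    """
    c_hi = centroid(high_acts)
    c_lo = centroid(low_acts)
    midpoint = [(a + b) / 2 for a, b in zip(c_hi, c_lo)]
    diff = [a - m for a, m in zip(c_hi, midpoint)]
    unit = math.sqrt(sum(d * d for d in diff))
    if unit == 0:
        raise ValueError("centroids coincide; no direction to inject along")
    direction = [d / unit for d in diff]
    return direction, unit


def inject(hidden, direction, unit, alpha):
    """Shift a hidden state by alpha centroid units along the direction."""
    return [h + alpha * unit * d for h, d in zip(hidden, direction)]


# Toy 2-D example: centroids at (3, 0) and (-3, 0), so one unit is 3.
direction, unit = centroid_unit([[2.0, 0.0], [4.0, 0.0]],
                                [[-2.0, 0.0], [-4.0, 0.0]])
# Starting from the midpoint, alpha = 1 lands on the high-trait centroid.
steered = inject([0.0, 0.0], direction, unit, alpha=1.0)
```

Because the unit is recomputed per layer, the same alpha corresponds to the same relative displacement at every layer, which is what makes alpha values comparable across layers under this sketch.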

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

MDS injections show no salient patterns in MPI-120 inventory responses beyond occasional co-occurring peaks · finding · 0.776
Contrasts with SJT results; leads authors to narrow analyses to SJT responses
PM hybrid outperforms both P2 and MDS in 13 of 14 LLMs, with Phi gains over P2 from 5.56% to 21.92% and over MDS from 3.30% to 26.67% · finding · 0.763
Key finding showing that combining prompting and injection is the strongest approach
Cosine similarity between perturbed and baseline residual streams returns toward 1.0, and the projection onto the injection direction decays exponentially over subsequent layers · finding · 0.758
Mechanistic evidence that the network actively attenuates injected perturbations, explaining late-layer introspection failure
gemma-3-1b-it yields only one valid MDS injection score (phi_1,A,up = 4.8) and is excluded from main analyses · finding · 0.757
Identified exception to overall MDS effectiveness; the cause is unexplained and is noted as a limitation
MDS injections align with the Linear Representation Hypothesis: target trait varies near-linearly with alpha in open-ended generation · claim · 0.755
Theoretical alignment claim backed by an OLS R2 analysis showing that 96.15% of trends have R2 >= 0.75
L2 regularisation with a bias term delivers the best probe performance; L2 regularisation increases probe selectivity · finding · 0.750
Hyperparameter tuning result for probes; consistent with the finding of Hewitt and Liang (2019)
For small models, critiqued revisions yield higher harmlessness PM scores than direct revisions; for large models the difference is negligible · finding · 0.744
Figure 7 comparison of critiqued vs direct revisions across model sizes.
47.69% of 130 injection-manipulated alpha trends have near-linear fits (R2 >= 0.95); 96.15% have roughly linear fits (R2 >= 0.75) · finding · 0.742
Demonstrates alignment with Linear Representation Hypothesis: target trait steers approximately linearly with alpha
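
The R2 thresholds in the entry above can be illustrated with a small ordinary least squares fit. This is a sketch under stated assumptions, not the paper's analysis code: each trend is taken to be a list of alpha values paired with a measured trait score, and the function names and the "below threshold" label are hypothetical. Only the 0.95 and 0.75 cut-offs come from the entry itself.

```python
def r_squared(alphas, scores):
    """R2 of a simple OLS line fitted to (alpha, score) pairs.

    Assumes at least two distinct alpha values and a non-constant score
    series; otherwise the denominators below are zero.
    """
    n = len(alphas)
    mean_a = sum(alphas) / n
    mean_s = sum(scores) / n
    sxx = sum((a - mean_a) ** 2 for a in alphas)
    sxy = sum((a - mean_a) * (s - mean_s) for a, s in zip(alphas, scores))
    slope = sxy / sxx
    intercept = mean_s - slope * mean_a
    ss_res = sum((s - (slope * a + intercept)) ** 2
                 for a, s in zip(alphas, scores))
    ss_tot = sum((s - mean_s) ** 2 for s in scores)
    return 1 - ss_res / ss_tot


def classify_trend(r2):
    """Bucket an R2 value using the thresholds quoted in the entry."""
    if r2 >= 0.95:
        return "near-linear"
    if r2 >= 0.75:
        return "roughly linear"
    return "below threshold"  # hypothetical label for everything else


# Toy trend: the score rises by exactly one point per unit of alpha.
alphas = [-2, -1, 0, 1, 2]
scores = [1, 2, 3, 4, 5]
label = classify_trend(r_squared(alphas, scores))
```

As an arithmetic check on the reported percentages, 62 of 130 trends is 47.69% and 125 of 130 is 96.15%, so the two figures are consistent with whole-number counts. The second bucket in the entry is cumulative (R2 >= 0.75 includes the near-linear trends), whereas `classify_trend` assigns each trend to one bucket only.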