claim
active
claim:uncalibrated-sweep-units-and-restricted-coefficient-ranges-are-the-primary-cause-of-prior-reports-showing-p2-outperforming-md-injections
Uncalibrated sweep units and restricted coefficient ranges are the primary cause of prior reports showing P2 outperforming MD injections
Mechanistic explanation for the discrepancy with Banayeeanzade et al.; addressed by the paper's centroid-unit and unbounded-sweep contributions
Source paper
extracted_from · (2026) · Leonardo Blas · Robin Jia · Emilio Ferrara
Neighborhood — ranked by edge-count
Papers (1)
paper
Findings (1)
finding
- MDS injections outperform P2 in open-ended generation in 11 of 14 LLMs, with Phi gains of 3.61% to 16.44% · associated_with · supports · Primary quantitative result overturning prior reports that prompting outperforms representation engineering
Methods (1)
method
- Centroid Unit Calibration · associated_with · Novel calibration of injection strength as the distance from centroid midpoint to centroid; enables meaningful cross-layer comparison of alpha values
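As a rough illustration of the calibration described in the method entry above, the sketch below defines one unit of injection strength as the distance from the midpoint of two class centroids to one of the centroids. Everything beyond that one-line definition is an assumption made for illustration and is not taken from the source paper: the function names (`centroid`, `centroid_unit`, `inject`), the use of two contrastive activation sets, the choice of steering direction, and the additive update rule.

```python
import math


def centroid(vectors):
    """Component-wise mean of a list of equal-length activation vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def centroid_unit(pos_acts, neg_acts):
    """Return (unit, direction) for one layer.

    unit      : distance from the centroid midpoint to the positive centroid
    direction : unit-length vector pointing from the midpoint to that centroid
    Assumption: two contrastive activation sets are available per layer.
    """
    c_pos = centroid(pos_acts)
    c_neg = centroid(neg_acts)
    midpoint = [(p + q) / 2 for p, q in zip(c_pos, c_neg)]
    diff = [p - m for p, m in zip(c_pos, midpoint)]
    unit = math.sqrt(sum(d * d for d in diff))
    if unit == 0.0:
        raise ValueError("centroids coincide; the unit is undefined")
    direction = [d / unit for d in diff]
    return unit, direction


def inject(hidden, alpha, unit, direction):
    """Hypothetical additive injection: shift hidden by alpha centroid units."""
    return [h + alpha * unit * d for h, d in zip(hidden, direction)]


# Toy 2-D activations for a single layer (made-up numbers).
pos = [[2.0, 0.0], [4.0, 0.0]]
neg = [[-2.0, 0.0], [-4.0, 0.0]]
unit, direction = centroid_unit(pos, neg)
steered = inject([0.0, 1.0], alpha=1.0, unit=unit, direction=direction)
```

Under these assumptions, alpha = 1 shifts a hidden state by exactly the midpoint-to-centroid distance at whichever layer the unit was measured, which is one way the same alpha value could remain comparable across layers whose activations differ in scale.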
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- MDS injections show no salient patterns in MPI-120 inventory responses beyond occasional co-occurring peaks · finding · 0.776 · Contrasts with SJT results; leads authors to narrow analyses to SJT responses
- Key finding showing that combining prompting and injection is the strongest approach
- Mechanistic evidence that network actively attenuates injected perturbations, explaining late-layer introspection failure
- Identified exception to overall MDS effectiveness; reason remains unexplained as a limitation
- Theoretical alignment claim backed by OLS R² analysis showing 96.15% of trends have R² ≥ 0.75
- Hyperparameter tuning result for probes; consistent with Hewitt and Liang 2019 finding
- Figure 7 comparison of critiqued vs direct revisions across model sizes.
- Demonstrates alignment with Linear Representation Hypothesis: target trait steers approximately linearly with alpha