finding

active

finding:mds-injections-can-steer-toward-multiple-distinct-constructs-in-the-same-completion-producing-strongly-polarized-yet-smoothly-connected-segments

MDS injections can steer toward multiple distinct constructs in the same completion, producing strongly polarized yet smoothly connected segments

Qualitative finding demonstrating a capability unique to activation-level interventions and unavailable to prompting methods, including PM

Source paper

extracted_from

Psychological Steering of Large Language Models

(2026) · Leonardo Blas · Robin Jia · Emilio Ferrara

Neighborhood — ranked by edge-count

Claims (1)

claim

The psychological steering framework generalizes beyond OCEAN to Dark Tetrad, CMNI, CFNI, and other psychological models
supports
Supported by qualitative experiments showing fluent and coherent steering for three additional models

Frameworks (1)

framework

Personality Prompting
contradicts
Established baseline for OCEAN steering via personality-descriptive system prompts; compared against injection methods throughout

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

MDS injections align with the Linear Representation Hypothesis: target trait varies near-linearly with alpha in open-ended generation · claim · 0.812
Theoretical alignment claim backed by OLS R² analysis showing that 96.15% of trends have R² ≥ 0.75
MDS injection steering efficiency peaks at mid-layers across LLMs, injection strides, and OCEAN traits · finding · 0.802
Consistent empirical pattern supporting the connection between mid-layer representations and emotion/behavioral content
MDS injections show no salient patterns in MPI-120 inventory responses beyond occasional co-occurring peaks · finding · 0.779
Contrasts with SJT results; leads authors to narrow analyses to SJT responses
MDS injections outperform P2 in open-ended generation in 11 of 14 LLMs with Phi gains of 3.61% to 16.44% · finding · 0.745
Primary quantitative result overturning prior reports that prompting outperforms representation engineering
MDS Injection · method · 0.742
Mean-difference vectors derived from self-statement activations (h_s); best-performing injection method in open-ended generation
Do the findings about MDS injection effectiveness generalize to base (non-instruction-tuned) language models? · question · 0.734
Acknowledged limitation: only instruction-tuned models were studied
Combining multiple construct injections simultaneously may enable richer persona simulation or fine-grained control · hypothesis · 0.733
Identified as future work; demonstrated qualitatively in Figure 1 but not formally evaluated
Why do MDS injections outperform other methods on the inventory (multiple-choice) task? · question · 0.728
Identified as an unexplained result and open question in limitations section
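
Illustrative sketch (not from the source paper)

The entries above describe MDS injection as adding a mean-difference vector, scaled by alpha, to model activations, and this finding concerns steering toward more than one construct within a single completion. The toy NumPy example below shows one way that idea could look. It rests on several assumptions: the activations are random stand-ins for h_s, the contrast is taken between hypothetical "high" and "low" statement sets, the vector is unit-normalized before scaling, and the two constructs are applied to separate token segments. None of these details is confirmed by this record, and all function and variable names are hypothetical.

```python
import numpy as np

def mean_difference_vector(high_acts, low_acts):
    """Difference of mean activations between two sets (rows are examples)."""
    return high_acts.mean(axis=0) - low_acts.mean(axis=0)

def inject(hidden, direction, alpha):
    """Add alpha times the unit-normalized direction to every row of hidden.
    Unit normalization is an assumption made for this sketch."""
    unit = direction / np.linalg.norm(direction)
    return hidden + alpha * unit

rng = np.random.default_rng(0)
d = 8  # toy hidden size

# Toy stand-ins for self-statement activations of two constructs, A and B.
# Construct A is planted on dimension 0, construct B on dimension 1.
axis_a = np.zeros(d); axis_a[0] = 1.0
axis_b = np.zeros(d); axis_b[1] = 1.0
high_a = rng.normal(size=(64, d)) + 2.0 * axis_a
low_a = rng.normal(size=(64, d)) - 2.0 * axis_a
high_b = rng.normal(size=(64, d)) + 2.0 * axis_b
low_b = rng.normal(size=(64, d)) - 2.0 * axis_b

v_a = mean_difference_vector(high_a, low_a)
v_b = mean_difference_vector(high_b, low_b)

# Hidden states for 10 token positions at one layer (random placeholders).
hidden = rng.normal(size=(10, d))
alpha = 4.0

# Steer the first segment toward construct A and the second toward B.
steered = hidden.copy()
steered[:5] = inject(hidden[:5], v_a, alpha)
steered[5:] = inject(hidden[5:], v_b, alpha)

# Each segment moves by exactly alpha along its own unit direction.
shift_a = (steered[:5] - hidden[:5]) @ (v_a / np.linalg.norm(v_a))
shift_b = (steered[5:] - hidden[5:]) @ (v_b / np.linalg.norm(v_b))
```

In a real model the same addition would be applied to a chosen layer's hidden states during generation; the layer, stride, and alpha values would have to come from the paper's own setup.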