Psychological Steering of Large Language Models
DOI: 10.48550/arXiv.2604.14463
TL;DR
Mean-difference-from-self-statement (MDS) residual-stream injections outperform Personality Prompting (P²), the established baseline for OCEAN psychological steering, in open-ended generation across 11 of 14 tested LLMs (including Llama-3.1-8B-Instruct, Qwen3-8B, and gemma-3-12b-it), with steerability score (Φ) gains ranging from 3.61% to 16.44% on synthetic situational judgment tests scored by GPT-5.1. A hybrid method (PM) combining P² prompting with MDS injections extends this further, outperforming both constituents in 13 of 14 models with gains over P² of 5.56%–21.92% and over MDS alone of 3.30%–26.67%. These results overturn Banayeeanzade et al.'s prior finding that P² surpasses MD-based injection methods; the authors trace the gap to two methodological shortcomings in prior work: sweeping injection strength in uncalibrated activation-space units and restricting the search to a narrow coefficient range such as [0.4, 0.5, …, 1.5]. The psychological steering framework introduced here addresses both. It defines centroid units, layer-wise calibrated scales anchored to the distance between construct and antithesis activation centroids, and it operationalizes unbounded fluency-constrained sweeps using lightweight logistic-regression classifiers trained on Qwen3Embedding-0.6B embeddings rather than paid frontier APIs. OLS regression on 10-step α sweeps shows that 89.23% of manipulated OCEAN trends achieve R² ≥ 0.85, indicating near-linear control consistent with the Linear Representation Hypothesis; however, the induced cross-trait covariance matches the Big Two metatrait model in only 46.15% of cases, suggesting that LLM representation geometry diverges from the structure of human personality.
What to take away
- MDS (mean-difference-from-self-statement) injections outperform Personality Prompting (P²) on open-ended OCEAN situational judgment tests in 11 of 14 LLMs spanning Llama-3.2-1B-Instruct through Qwen3-32B, with steerability Φ gains of 3.61%–16.44%.
- A hybrid method (PM) that applies P² system-prompt steering and MDS residual-stream injections simultaneously achieves the highest steerability in 13 of 14 LLMs, with Φ gains over P² of 5.56%–21.92% and over MDS alone of 3.30%–26.67%.
- Prior reports that prompting outperforms MD injections (specifically Banayeeanzade et al. 2025) are attributable to two methodological gaps: sweeping in uncalibrated unit-vector magnitudes, and restricting the search space to a narrow set of coefficients such as [0.4, 0.5, …, 1.5], which can miss optima far from that range.
- The framework introduces centroid units as a calibrated injection-strength scale: for each transformer layer, one unit is the Euclidean distance from the midpoint between the construct and antithesis activation centroids to the target centroid, which makes α values comparable across layers and models.
- Among the six injection methods compared (L1LI, L1ZI, L2LI, L2ZI, MDB, MDS), MDS dominates on the open-ended SJT task with a global win proportion of 89.5% across 14 LLMs and four injection stride settings, versus 47.3% for MDB and under 30% for every probe-based method.
- OLS linear regression on sweeps over 10 equidistant α values shows that 89.23% of the 130 injection-manipulated OCEAN SJT score trends achieve R² ≥ 0.85 and 96.15% achieve R² ≥ 0.75, indicating near-linear control consistent with the Linear Representation Hypothesis.
- OCEAN MDS injections induce cross-trait covariance that matches the Big Two model, in which stability groups C, A, and reversed N and plasticity groups E and O, in only 46.15% of cases across 13 LLMs. The A–C correlation is the most frequent (10 LLMs) and N–C the rarest (1 LLM), suggesting a systematic divergence between learned representations and human personality structure.
- Construct-specific statement corpora are synthesized by prompting Llama-3.1-8B-Instruct to generate 35,000 first-person texts per condition and retaining 500 fluent, semantically deduplicated texts (cosine similarity threshold 0.9 on Qwen3Embedding-0.6B embeddings). The corpora are validated against Perez et al.'s corpora, with centroid cosine similarities of 85.62%–94.00%.
- Fluency-constrained unbounded α sweeps are operationalized using logistic-regression classifiers on Qwen3Embedding-0.6B embeddings, which achieve mean accuracy and F1-macro of 95.96% across OCEAN, HEXACO, Dark Tetrad, CMNI, and CFNI constructs and replace frontier LLM calls during the sweep phase.
- An open question is why MDS injections failed almost entirely on gemma-3-1b-it, yielding only one valid steering score across all OCEAN traits and strides, while succeeding on larger Gemma variants (4b, 12b, 27b). This suggests a potential model-size or architecture threshold for residual-stream steerability that remains unexplained.
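The centroid-unit definition in the takeaways can be made concrete with a small sketch. The following is a minimal illustration in plain Python, assuming the activations for one layer are available as lists of floats. The function names `centroid` and `centroid_unit` and the toy numbers are hypothetical and are not taken from the paper.

```python
import math

def centroid(vectors):
    """Element-wise mean of a list of equal-length activation vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def centroid_unit(construct_acts, antithesis_acts):
    """Return (direction, unit) for a single layer.

    direction: unit-norm vector pointing from the midpoint of the two
               centroids toward the construct centroid.
    unit:      Euclidean distance from that midpoint to the construct
               centroid, i.e. one "centroid unit" as described above.
    """
    c_pos = centroid(construct_acts)
    c_neg = centroid(antithesis_acts)
    midpoint = [(p + q) / 2 for p, q in zip(c_pos, c_neg)]
    offset = [p - m for p, m in zip(c_pos, midpoint)]
    unit = math.sqrt(sum(x * x for x in offset))
    if unit == 0.0:
        raise ValueError("construct and antithesis centroids coincide")
    direction = [x / unit for x in offset]
    return direction, unit

# Toy 2-D "activations" for one layer (made-up numbers for illustration).
construct = [[3.0, 4.0], [5.0, 4.0]]    # centroid (4.0, 4.0)
antithesis = [[0.0, 0.0], [0.0, 2.0]]   # centroid (0.0, 1.0)
direction, unit = centroid_unit(construct, antithesis)
print(direction, unit)
```

Because the midpoint is equidistant from both centroids, the unit equals half the distance between them, so it is the same whichever centroid is treated as the target. Only the sign of the direction changes.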
Peer brief — for seminar discussion
The paper constructs a psychological steering framework for LLMs that targets the OCEAN personality model via additive residual-stream injections and specifies what prior work left underspecified: a calibrated injection-strength scale and an unbounded sweep procedure. Fourteen instruction-tuned models ranging from Llama-3.2-1B-Instruct to Qwen3-32B (models above 12B quantized to 4-bit NF4) are evaluated with six injection types, four probe-based (L1LI, L1ZI, L2LI, L2ZI) and two mean-difference variants (MDB, MDS), plus the Personality Prompting baseline P² from Jiang et al. Evaluation uses the IPIP-NEO-120 inventory and a battery of synthetic open-ended situational judgment tests (SJTs) scored by GPT-5.1 at temperature 0.
The framework's method contributions are: (1) centroid units, a layer-wise calibrated scale where one unit equals the distance from the midpoint of construct and antithesis activation centroids to the target centroid; (2) unbounded fluency-constrained sweeps that stop early when mean SJT fluency drops below 95% of baseline or more than 5% of responses fall below 90% of baseline, with fluency measured by a RoBERTa-large CoLA classifier; and (3) lightweight logistic-regression classifiers on Qwen3Embedding-0.6B embeddings (mean accuracy 95.96%) to replace frontier API calls during sweeping. The alternative that the framework implicitly contrasts against is the restricted fixed-grid sweep over a small coefficient range like [0.4, 0.5, …, 1.5] at uncalibrated unit-vector magnitudes, as employed by Banayeeanzade et al. and Deng et al.
The load-bearing finding is that MDS injections, derived from differences between centroids of first-person self-statement activations, outperform P² in 11 of 14 LLMs with Φ gains of 3.61%–16.44%, and that a PM hybrid of P² plus MDS outperforms both in 13 of 14 LLMs with Φ gains over P² of 5.56%–21.92%. This overturns Banayeeanzade et al.'s 2025 report that P² outperforms MD-based injections, and the authors attribute the reversal primarily to the two methodological gaps corrected here. Additionally, 89.23% of manipulated OCEAN score trends show R² ≥ 0.85 under OLS regression across 10 equidistant α values, supporting alignment with the Linear Representation Hypothesis, while induced cross-trait covariance matches the Big Two metatrait model in only 46.15% of cases, which the authors interpret as evidence of a gap between LLM representation geometry and human personality structure. The implications are that representation engineering, properly calibrated, should be considered the leading approach to open-ended psychological steering, and that prompt-based and activation-based methods are complementary rather than competing.
A critical reader should push back on the circularity of the evaluation instrument: the SJTs used to compare methods are synthesized from ATOMIC10x heads paired with IPIP-NEO-120 items via GPT-5.1, and performance on them is then scored by GPT-5.1 using the same OCEAN facet descriptions that anchored statement generation. There is no independent human-validated SJT battery for OCEAN at the scale needed, so the reported Φ scores measure alignment with GPT-5.1's conception of OCEAN traits as instantiated through the same pipeline that produced the tests. This closed loop makes it difficult to assess ecological validity or to rule out that PM's gains reflect better exploitation of GPT-5.1's scoring preferences rather than stronger behavioral expression.
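The early-stopping rule described in the brief can be sketched as follows. This is a hedged illustration only: the names `fluency_ok` and `unbounded_sweep`, the fixed step size, the `max_steps` safety cap, and the toy fluency function are assumptions made for the example. The sketch also assumes that "90% of baseline" refers to the baseline mean fluency, which the summary does not state explicitly.

```python
def fluency_ok(scores, baseline_mean,
               mean_floor=0.95, response_floor=0.90, max_low_fraction=0.05):
    """Check the two fluency constraints for one alpha setting.

    Passes only if (a) mean fluency is at least 95% of baseline and
    (b) at most 5% of responses score below 90% of baseline.
    """
    mean_score = sum(scores) / len(scores)
    low = sum(1 for s in scores if s < response_floor * baseline_mean)
    return (mean_score >= mean_floor * baseline_mean
            and low / len(scores) <= max_low_fraction)

def unbounded_sweep(score_fn, baseline_mean, step=1.0, max_steps=1000):
    """Raise alpha step by step until the fluency constraint breaks.

    score_fn(alpha) must return per-response fluency scores.
    Returns the largest alpha that still passed, or None if none did.
    max_steps is a safety cap for this sketch, not part of the method.
    """
    alpha, last_ok = 0.0, None
    for _ in range(max_steps):
        alpha += step
        if not fluency_ok(score_fn(alpha), baseline_mean):
            break
        last_ok = alpha
    return last_ok

# Toy fluency model: every response loses 0.02 fluency per unit of alpha.
def toy_scores(alpha):
    return [1.0 - 0.02 * alpha] * 20

print(unbounded_sweep(toy_scores, baseline_mean=1.0))
```

In this toy setup the sweep has no preset upper bound on α other than the safety cap; the fluency constraint alone decides where it stops, which is the contrast with a fixed coefficient grid.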
Methods (5)
- Centroid Unit Calibration: Novel calibration of injection strength as the distance from the centroid midpoint to the target centroid; enables meaningful cross-layer comparison of α values.
- Embedding-based Construct Logistic Classifier: Logistic-regression classifier on Qwen3Embedding-0.6B embeddings, trained on construct statements; used to measure construct presence during α sweeps.
- Injection Stride: Parameter controlling how often an injection is applied during completion; s=1 injects on every activation and yields the strongest steering.
- PM Hybrid Method: Hybrid method combining Personality Prompting (P²) with MDS injections; the best overall steering method.
- Synthetic Situational Judgment Test Battery: Open-ended situational judgment tests synthesized using GPT-5.1 from ATOMIC10x heads and inventory items; the primary evaluation instrument for open-ended steering.
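To illustrate how injection stride and centroid units might interact, here is a minimal sketch in plain Python. It assumes the injection adds `alpha * unit * direction` to the residual-stream activation at every `stride`-th completion position, counting from position 0. The function name `inject`, the list-of-lists representation, and the position convention are assumptions for illustration; the summary does not specify these details.

```python
def inject(hidden_states, direction, unit, alpha, stride=1):
    """Return a copy of hidden_states with an additive injection applied.

    hidden_states: one activation vector per completion position.
    direction:     unit-norm steering direction for this layer.
    unit:          size of one centroid unit for this layer.
    alpha:         injection strength, expressed in centroid units.
    stride:        inject at positions 0, stride, 2*stride, ...
                   (stride=1 injects at every position).
    """
    if stride < 1:
        raise ValueError("stride must be a positive integer")
    scale = alpha * unit
    steered = []
    for position, h in enumerate(hidden_states):
        if position % stride == 0:
            steered.append([x + scale * d for x, d in zip(h, direction)])
        else:
            steered.append(list(h))  # untouched position, copied as-is
    return steered

# Four completion positions with 2-D toy activations.
hidden = [[0.0, 0.0] for _ in range(4)]
print(inject(hidden, direction=[1.0, 0.0], unit=2.0, alpha=1.5, stride=2))
```

In a real model this addition would typically be applied inside the forward pass at the chosen layer rather than on stored lists; the sketch only shows which positions are touched and how α is scaled by the layer's centroid unit.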
Frameworks (2)
- OCEAN Personality Model: Primary personality model used to benchmark steering methods across 14 LLMs.
- Psychological Steering Framework: The paper's primary contribution; performs unbounded, fluency-constrained sweeps in semantically calibrated centroid units using psychological artifacts.
Datasets (1)
- Qwen3-8B: Reasoning-optimized base model used for training SFR-DR-8B variant.
Findings (23)
- MDS injections can steer toward multiple distinct constructs in the same completion, producing strongly polarized yet smoothly connected segments
Qualitative finding demonstrating unique capability of activation-level interventions unavailable to prompting methods including PM
- Embedding-based construct classifiers achieve mean accuracy and F1-macro of 95.96% across OCEAN, HEXACO, Dark Tetrad, CMNI, CFNI constructs
Validates the use of lightweight classifiers as a replacement for frontier LLM evaluation during α sweeps
- Probes trained on h_b activations achieve perfect test accuracy in every case; h_s probes achieve perfect accuracy in only 0.60% of cases
Justifies restricting probe-based vector derivation to h_b activations; attributed to Yes/No semantics
- PM hybrid outperforms both P² and MDS in 13 of 14 LLMs, with Φ gains over P² from 5.56% to 21.92% and over MDS from 3.30% to 26.67%
Key finding showing that combining prompting and injection is the strongest approach
- Generated statements achieve 85.62%-94.00% cosine similarity alignment with Perez et al. validated OCEAN and Dark Triad statements
Validates the statement synthesis pipeline as producing behavior-specific content comparable to established methods
- 47.69% of 130 injection-manipulated α trends have near-linear fits (R² ≥ 0.95); 96.15% have roughly linear fits (R² ≥ 0.75)
Demonstrates alignment with the Linear Representation Hypothesis: the target trait steers approximately linearly with α
- Injection stride s=1 produces the highest mean SJT scores across all LLMs; more frequent injection yields stronger steering
Empirical finding about injection stride parameter; injecting into every completion activation maximizes steering strength
- Only 46.15% of cases show covariance patterns consistent with the Big Two model; no LLM satisfies all Big Two correlations
Suggests a gap between LLM learned representations and human personality structure as described by Big Two
- Agreeableness-Conscientiousness Big Two correlation is observed in 10 of 13 LLMs; N-C correlation is rarest (1 LLM)
Most and least common Big Two covariance pattern in LLM OCEAN MDS injections
- Only 13.27% of 520 non-manipulated α trends achieve R² ≥ 0.95, in contrast to 47.69% of manipulated trends
Control comparison showing near-linearity is specific to the targeted manipulation direction
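The linearity findings above rest on fitting a straight line to trait score as a function of α and reading off R². A minimal sketch of that computation in plain Python follows. The function names, the toy α scale, and the toy score series are hypothetical; only the thresholds 0.95, 0.85, and 0.75 come from the summary.

```python
def ols_r2(alphas, scores):
    """Fit scores = intercept + slope * alpha by ordinary least squares.

    Returns (slope, intercept, r_squared).
    """
    n = len(alphas)
    mean_a = sum(alphas) / n
    mean_s = sum(scores) / n
    sxx = sum((a - mean_a) ** 2 for a in alphas)
    sxy = sum((a - mean_a) * (s - mean_s) for a, s in zip(alphas, scores))
    slope = sxy / sxx
    intercept = mean_s - slope * mean_a
    ss_res = sum((s - (intercept + slope * a)) ** 2
                 for a, s in zip(alphas, scores))
    ss_tot = sum((s - mean_s) ** 2 for s in scores)
    if ss_tot == 0.0:
        # Flat trend: R^2 is undefined; reported as 0.0 in this sketch.
        return slope, intercept, 0.0
    return slope, intercept, 1.0 - ss_res / ss_tot

def linearity_bucket(r2):
    """Map R^2 onto the thresholds quoted in the findings."""
    for threshold in (0.95, 0.85, 0.75):
        if r2 >= threshold:
            return f"R2 >= {threshold}"
    return "R2 < 0.75"

# Ten equidistant alpha values on a toy 0..1 scale.
alphas = [i / 9 for i in range(10)]
linear_scores = [1.0 + 3.0 * a for a in alphas]               # straight line
saturating_scores = [min(1.0 + 8.0 * a, 3.0) for a in alphas]  # hits a ceiling

for name, series in [("linear", linear_scores),
                     ("saturating", saturating_scores)]:
    slope, intercept, r2 = ols_r2(alphas, series)
    print(name, round(slope, 3), round(r2, 3), linearity_bucket(r2))
```

A trend that saturates early, as in the second toy series, is penalized by R² even though the trait still moves in the intended direction, which is one reason the fraction of trends above each threshold is reported separately.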
Claims (8)
- MD vectors outperform probe-based vectors in SJTs because they align with construct centroids rather than distorting direction via regularization
Mechanistic explanation for MDS superiority; attributed to two design choices: centroid alignment and full-utterance semantics in h_s
- OCEAN MDS injection covariance patterns departing from the Big Two model suggest a gap between learned LLM representations and human psychology
Interpretive conclusion from Big Two mismatch finding; tentative due to only 46.15% match rate
- Uncalibrated sweep units and restricted coefficient ranges are the primary cause of prior reports showing P² outperforming MD injections
Mechanistic explanation for discrepancy with Banayeeanzade et al.; addressed by centroid unit and unbounded sweep contributions
- MDS injections align with the Linear Representation Hypothesis: the target trait varies near-linearly with α in open-ended generation
Theoretical alignment claim backed by OLS R² analysis showing that 96.15% of trends have R² ≥ 0.75
- h_b activations encode Yes/No semantics that produce clean separability for logistic probes, unlike h_s full-utterance activations
Explanation for why probes on h_b achieve perfect accuracy but h_s probes succeed only 0.60% of the time
- Representation engineering and prompting methods may combine to achieve stronger behavioral expression across other domains
Broader implication of PM hybrid's superior performance; extrapolated from OCEAN results
- The psychological steering framework generalizes beyond OCEAN to Dark Tetrad, CMNI, CFNI, and other psychological models
Supported by qualitative experiments showing fluent and coherent steering for three additional models
- RepE is a new frontier in open-ended psychological steering of LLMs, outperforming prompting when properly calibrated
Central interpretive claim overturning prior reports; supported by MDS beating P² in 11 of 14 LLMs
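The Big Two claim in this list can be made concrete with a small sign check. Under the Big Two grouping given in the takeaways (stability: C, A, reversed N; plasticity: E, O), traits within a metatrait are expected to co-vary: A–C positively, N–A and N–C negatively, and E–O positively. The sketch below counts how many of those four pairs have the expected sign in a correlation table. It is an assumption-laden illustration: the paper's actual matching criterion is not given in this summary, and the names `EXPECTED_SIGN` and `big_two_match_rate` and the toy correlations are hypothetical.

```python
# Expected sign of each within-metatrait correlation under the Big Two model.
# Keys are alphabetically ordered trait pairs.
EXPECTED_SIGN = {
    ("A", "C"): +1,  # stability: agreeableness and conscientiousness
    ("A", "N"): -1,  # stability: neuroticism enters reversed
    ("C", "N"): -1,  # stability: neuroticism enters reversed
    ("E", "O"): +1,  # plasticity: extraversion and openness
}

def big_two_match_rate(correlations):
    """Fraction of Big Two pairs whose observed correlation has the expected sign.

    correlations maps an alphabetically ordered trait pair, e.g. ("A", "C"),
    to the observed correlation between the two induced trait scores.
    """
    hits = sum(1 for pair, sign in EXPECTED_SIGN.items()
               if correlations[pair] * sign > 0)
    return hits / len(EXPECTED_SIGN)

# Toy correlations for one hypothetical model (made-up numbers).
toy = {("A", "C"): 0.6, ("A", "N"): -0.2, ("C", "N"): 0.1, ("E", "O"): -0.3}
print(big_two_match_rate(toy))
```

A real analysis would likely also test whether each correlation is statistically distinguishable from zero before counting it as a match.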
Hypotheses (2)
- The framework's methodological contributions could be adapted to target arbitrary non-psychological attributes given custom evaluation criteria
Generalization hypothesis stated in introduction; not tested in paper
- Combining multiple construct injections simultaneously may enable richer persona simulation or fine-grained control
Identified as future work; demonstrated qualitatively in Figure 1 but not formally evaluated
Questions (4)
- Do the findings about MDS injection effectiveness generalize to base (non-instruction-tuned) language models?
Acknowledged limitation: only instruction-tuned models were studied
- Why do MDS injections outperform other methods on the inventory (multiple-choice) task?
Identified as an unexplained result and open question in limitations section
- Why do MDS injections fail on gemma-3-1b-it but succeed across all other tested LLMs?
Unexplained exception identified as a limitation and open question
- Do psychological steering results hold beyond 64-token completions?
Acknowledged limitation of restricting experiments to 64-token completions
Original abstract
Large language models (LLMs) emulate a consistent human-like behavior that can be shaped through activation-level interventions. This paradigm is converging on additive residual-stream injections, which rely on injection-strength sweeps to approximate optimal intervention settings. However, existing methods restrict the search space and sweep in uncalibrated activation-space units, potentially missing optimal intervention conditions. Thus, we introduce a psychological steering framework that performs unbounded, fluency-constrained sweeps in semantically calibrated units. Our method derives and calibrates residual-stream injections using psychological artifacts, and we use the IPIP-NEO-120, which measures the OCEAN personality model, to compare six injection methods. We find that mean-difference (MD) injections outperform Personality Prompting (P²), an established baseline for OCEAN steering, in open-ended generation in 11 of 14 LLMs, with gains of 3.6% to 16.4%, overturning prior reports favoring prompting and positioning representation engineering as a new frontier in open-ended psychological steering. Further, we find that a hybrid of P² and MD injections outperforms both methods in 13 of 14 LLMs, with gains over P² ranging from 5.6% to 21.9% and from 3.3% to 26.7% over MD injections. Finally, we show that MD injections align with the Linear Representation Hypothesis and provide reliable, approximately linear control knobs for psychological steering. Nevertheless, they also induce OCEAN trait covariance patterns that depart from the Big Two model, suggesting a gap between learned representations and human psychology.
Related work (refs + corpus + external arXiv)
Cited / in-corpus / arXiv badges show which signals surfaced each row. Multi-source rows weighted higher.
- Psychological Steering in LLMs: An Evaluation of Effectiveness and Trustworthiness. Amin Banayeeanzade, Ala N. Tak, Fatemeh Bahrani, Anahita Bolourani, Leonardo Blas, Emilio Ferrara, Jonathan Gratch, Sai Praneeth Karimireddy (2025). ≈ 89%
- Controllable and explainable personality sliders for LLMs at inference time. Florian Hoppe, David Khachaturov, Robert Mullins, Mark Huasong Meng (2026). ≈ 88%
- The Effectiveness of Style Vectors for Steering Large Language Models: A Human Evaluation. Diaoulé Diallo, Katharina Dworatzyk, Sophie Jentzsch, Peer Schütt, Sabine Theis, Tobias Hecking (2026). ≈ 86%
- What Can We Actually Steer? A Multi-Behavior Study of Activation Control. Tetiana Bas, Krystian Novak (2026). ≈ 86%
- Steer Like the LLM: Activation Steering that Mimics Prompting. Geert Heyman and Frederik Vandeputte (2026). ≈ 86%
- Mechanistic Indicators of Steering Effectiveness in Large Language Models. Mehdi Jafari, Hao Xue, Flora Salim (2026). ≈ 86%
- Curveball Steering: The Right Direction To Steer Isn't Always Linear. Shivam Raval, Hae Jin Song, Linlin Wu, Abir Harrasse, Jeff M. Phillips, Fazl Barez, Amirali Abdullah (2026). ≈ 86%
- Persona Non Grata: Single-Method Safety Evaluation Is Incomplete for Persona-Imbued LLMs. Wenkai Li, Fan Yang, Shaunak A. Mehta, Koichi Onoue (2026). ≈ 85%
- Steering Conceptual Bias via Transformer Latent-Subspace Activation. Vansh Sharma and Venkat Raman (2025). ≈ 85%
- Steering at the Source: Style Modulation Heads for Robust Persona Control. Yoshihiro Izawa, Gouki Minegishi, Koshi Eguchi, Sosuke Hosokawa, Kenjiro Taura (2026). ≈ 84%
- Causal Evidence that Language Models use Confidence to Drive Behavior. Dharshan Kumaran, Nathaniel Daw, Simon Osindero, Petar Velickovic, Viorica Patraucean (2026). ≈ 84%
- CBMAS: Cognitive Behavioral Modeling via Activation Steering. Ahmed H. Ismail, Anthony Kuang, Ayo Akinkugbe, Kevin Zhu, Sean O'Brien (2026). ≈ 84%
- Intrinsic Guardrails: How Semantic Geometry of Personality Interacts with Emergent Misalignment in LLMs. Krishak Aneja, Manas Mittal, Anmol Goel, Ponnurangam Kumaraguru, Vamshi Krishna Bonagiri (2026). ≈ 84%
- Beyond Steering Vector: Flow-based Activation Steering for Inference-Time Intervention. Zehao Jin, Ruixuan Deng, Junran Wang, Xinjie Shen, Chao Zhang (2026). ≈ 84%
- Quantitative Introspection in Language Models: Tracking Emotive States Across Conversation (in corpus, 2026). ≈ 83%
- Mechanistic Knobs in LLMs: Retrieving and Steering High-Order Semantic Features via Sparse Autoencoders (in corpus, 2026). ≈ 82%
- From Directions to Cones: Exploring Multidimensional Representations of Propositional Facts in LLMs (in corpus, 2025). ≈ 81%
- Manifold Steering Reveals the Shared Geometry of Neural Network Representation and Behavior (in corpus, 2026). ≈ 81%
- The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets (in corpus, 2023). ≈ 81%