Psychological Steering of Large Language Models

By Leonardo Blas · Robin Jia · Emilio Ferrara
University of Southern California

DOI 10.48550/arxiv.2604.14463 · arXiv 2604.14463 · OpenAlex W7154736432

gemma-3-1b-it · OCEAN Personality Model · Centroid Unit Calibration · Qwen3-8B · Llama-3.1-8B-Instruct · Psychological Steering Framework · Embedding-based Construct Logistic Classifier · Qwen3-32B · Injection Stride · PM Hybrid Method · Synthetic Situational Judgment Test Battery

TL;DR

Mean-difference-from-self-statement (MDS) residual-stream injections outperform Personality Prompting (P²), the established baseline for OCEAN psychological steering, in open-ended generation in 11 of 14 tested LLMs, including Llama-3.1-8B-Instruct, Qwen3-8B, and gemma-3-12b-it, with steerability score (Φ) gains ranging from 3.61% to 16.44% on synthetic situational judgment tests scored by GPT-5.1. A hybrid method (PM) combining P² prompting with MDS injections goes further, outperforming both constituents in 13 of 14 models, with gains of 5.56%–21.92% over P² and 3.30%–26.67% over MDS alone. These results overturn Banayeeanzade et al.'s prior finding that P² surpasses MD-based injection methods. The paper traces the gap to two methodological failures in prior work: sweeping injection strength in uncalibrated activation-space units and restricting the search to a narrow coefficient range such as [0.4, 0.5, …, 1.5]. The psychological steering framework introduced here addresses both. It defines centroid units, layer-wise calibrated scales anchored to the distance between construct and antithesis activation centroids, and it operationalizes unbounded fluency-constrained sweeps using lightweight logistic-regression classifiers trained on Qwen3Embedding-0.6B embeddings rather than paid frontier APIs. OLS regression on 10-step α sweeps shows that 89.23% of manipulated OCEAN trends achieve R² ≥ 0.85, which is consistent with near-linear control and the Linear Representation Hypothesis. However, the induced cross-trait covariance matches the Big Two metatrait model in only 46.15% of cases, suggesting that LLM representation geometry diverges from the structure of human personality.

What to take away

1. MDS (mean-difference-from-self-statement) injections outperform Personality Prompting (P²) on open-ended OCEAN situational judgment tests in 11 of 14 LLMs spanning Llama-3.2-1B-Instruct through Qwen3-32B, with steerability Φ gains of 3.61%–16.44%.
2. A hybrid method (PM) that applies both P² system-prompt steering and MDS residual-stream injections simultaneously achieves the highest steerability in 13 of 14 LLMs, with Φ gains over P² of 5.56%–21.92% and over MDS alone of 3.30%–26.67%.
3. Prior reports that prompting outperforms MD injections—specifically Banayeeanzade et al. 2025—are attributable to two methodological gaps: sweeping in uncalibrated unit-vector magnitudes and restricting the search space to a narrow set of coefficients such as [0.4, 0.5, …, 1.5], which can miss optima far from that range.
4. The framework introduces centroid units as a calibrated injection-strength scale: for each transformer layer, the unit is defined as the Euclidean distance from the midpoint between construct and antithesis activation centroids to the target centroid, enabling meaningful cross-layer and cross-model comparisons.
5. Among the six injection methods compared (L1LI, L1ZI, L2LI, L2ZI, MDB, MDS), MDS dominates on the open-ended SJT task with a global win proportion of 89.5% across 14 LLMs and four injection stride settings, versus 47.3% for MDB and under 30% for all probe-based methods.
6. OLS linear regression on 10-equidistant-α sweeps shows that 89.23% of the 130 injection-manipulated OCEAN SJT score trends achieve R² ≥ 0.85, and 96.15% achieve R² ≥ 0.75, confirming near-linear control consistent with the Linear Representation Hypothesis.
7. OCEAN MDS injections induce cross-trait covariance that matches the Big Two model of stability (C, A, reversed-N) and plasticity (E, O) clustering in only 46.15% of cases across 13 LLMs, with the A–C correlation being the most frequent (10 LLMs) and N–C the rarest (1 LLM), suggesting a systematic divergence between learned representations and human personality structure.
8. Construct-specific statement corpora are synthesized by prompting Llama-3.1-8B-Instruct to generate 35,000 first-person texts per condition, retaining 500 fluent, semantically deduplicated texts (cosine similarity threshold 0.9 on Qwen3Embedding-0.6B embeddings), and validated against Perez et al.'s corpora with centroid cosine similarities of 85.62%–94.00%.
9. Fluency-constrained unbounded α sweeps are operationalized using logistic-regression classifiers on Qwen3Embedding-0.6B embeddings, achieving mean accuracy and F1-macro of 95.96% across OCEAN, HEXACO, Dark Tetrad, CMNI, and CFNI constructs, replacing frontier LLM calls during the sweep phase.
10. An open question raised is why MDS injections entirely failed on gemma-3-1b-it—yielding only one valid steering score across all OCEAN traits and strides—while succeeding on larger Gemma variants (4b, 12b, 27b), suggesting a potential model-size or architecture threshold for residual-stream steerability that remains unexplained.
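
The centroid unit from takeaway 4 can be illustrated with a minimal sketch. The unit length follows the definition given above (distance from the midpoint of the two centroids to the target centroid). How α is applied to a hidden state is an assumption made here for illustration, as are the function names and the toy 2-d activations; none of them are taken from the paper's code.

```python
import math

def centroid(vectors):
    """Component-wise mean of equal-length activation vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def centroid_unit(construct_acts, antithesis_acts):
    """Return (unit_length, unit_direction) for one layer.

    unit_length is the Euclidean distance from the midpoint of the two
    centroids to the construct (target) centroid.
    """
    c_pos = centroid(construct_acts)
    c_neg = centroid(antithesis_acts)
    midpoint = [(p + q) / 2 for p, q in zip(c_pos, c_neg)]
    offset = [p - m for p, m in zip(c_pos, midpoint)]
    length = math.sqrt(sum(x * x for x in offset))
    if length == 0.0:
        raise ValueError("construct and antithesis centroids coincide")
    return length, [x / length for x in offset]

def inject(hidden, alpha, length, direction):
    """ASSUMPTION: alpha counts centroid units along the unit direction."""
    return [h + alpha * length * d for h, d in zip(hidden, direction)]

# Toy 2-d activations, made up for illustration.
construct = [[2.0, 0.0], [4.0, 0.0]]      # centroid (3, 0)
antithesis = [[-1.0, 0.0], [-3.0, 0.0]]   # centroid (-2, 0)
length, direction = centroid_unit(construct, antithesis)
steered = inject([0.5, 0.0], alpha=1.0, length=length, direction=direction)
print(length, steered)  # 2.5 [3.0, 0.0]
```

Under this reading, α = 1 moves a hidden state sitting at the midpoint exactly onto the construct centroid, and α = −1 moves it onto the antithesis centroid, which is what makes α values comparable across layers with different activation scales.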

Peer brief — for seminar discussion

The paper constructs a psychological steering framework for LLMs that targets the OCEAN personality model via additive residual-stream injections and operationalizes what prior work left underspecified: a calibrated injection-strength scale and an unbounded sweep procedure. Fourteen instruction-tuned models ranging from Llama-3.2-1B-Instruct to Qwen3-32B (models above 12B quantized to 4-bit NF4) are evaluated with six injection types, four probe-based (L1LI, L1ZI, L2LI, L2ZI) and two mean-difference variants (MDB, MDS), plus the Personality Prompting baseline P² from Jiang et al. Evaluation uses the IPIP-NEO-120 inventory and a battery of synthetic open-ended situational judgment tests (SJTs) scored by GPT-5.1 at temperature 0.

The framework's method contributions are: (1) centroid units, a layer-wise calibrated scale where one unit equals the distance from the midpoint of construct and antithesis activation centroids to the target centroid; (2) unbounded fluency-constrained sweeps that early-stop when mean SJT fluency drops below 95% of baseline or more than 5% of responses fall below 90% of baseline, with fluency measured by a RoBERTa-large CoLA classifier; and (3) lightweight logistic-regression classifiers on Qwen3Embedding-0.6B embeddings (mean accuracy 95.96%) to replace frontier API calls during sweeping. An alternative the framework could have used, and implicitly contrasts against, is the restricted fixed-grid sweep over a small coefficient range like [0.4, 0.5, …, 1.5] at uncalibrated unit-vector magnitudes, as employed by Banayeeanzade et al. and Deng et al.

The load-bearing finding is that MDS injections, derived from differences between centroids of first-person self-statement activations, outperform P² in 11 of 14 LLMs with Φ gains of 3.61%–16.44%, and a PM hybrid of P² plus MDS outperforms both in 13 of 14 LLMs with Φ gains over P² of 5.56%–21.92%. This directly overturns Banayeeanzade et al.'s 2025 report that P² outperforms MD-based injections, attributing the reversal entirely to the two methodological gaps corrected here. Additionally, 89.23% of manipulated OCEAN score trends show R² ≥ 0.85 under OLS regression across 10 equidistant α values, supporting alignment with the Linear Representation Hypothesis, while induced cross-trait covariance matches the Big Two metatrait model in only 46.15% of cases. The authors interpret this as evidence of a gap between LLM representation geometry and human personality structure. The implications are that representation engineering, properly calibrated, should be considered the leading approach to open-ended psychological steering, and that prompt-based and activation-based methods are complementary rather than competing.

A critical reader should push back on the evaluation instrument's circularity: the SJTs used to compare methods are synthesized from ATOMIC10x heads paired with IPIP-NEO-120 items via GPT-5.1, and performance on them is then scored by GPT-5.1 using the same OCEAN facet descriptions that anchored statement generation. There is no independent human-validated SJT battery for OCEAN at the scale needed, so the reported Φ scores measure alignment with GPT-5.1's conception of OCEAN traits as instantiated through the same pipeline that produced the tests. This closed loop makes it difficult to assess ecological validity or to rule out that PM's gains reflect better exploitation of GPT-5.1's scoring preferences rather than stronger behavioral expression.
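
The early-stopping rule quoted in the brief (stop when mean fluency drops below 95% of baseline, or when more than 5% of responses fall below 90% of baseline) can be written down as a short sketch. The function names, the `score_fn` callable, the fixed step size, and the `max_steps` safety cap are hypothetical; the sketch also assumes the per-response floor is measured against the baseline mean, which the summary does not state explicitly.

```python
def should_stop(fluency_scores, baseline_mean,
                mean_floor=0.95, response_floor=0.90, max_low_fraction=0.05):
    """True when either fluency constraint from the brief is violated."""
    n = len(fluency_scores)
    mean_fluency = sum(fluency_scores) / n
    # ASSUMPTION: per-response floor is relative to the baseline mean.
    low = sum(1 for s in fluency_scores if s < response_floor * baseline_mean)
    return (mean_fluency < mean_floor * baseline_mean
            or low / n > max_low_fraction)

def unbounded_sweep(score_fn, baseline_mean, step=0.5, max_steps=1000):
    """Grow alpha until fluency breaks; return the admissible alphas.

    score_fn(alpha) is a hypothetical callable returning per-response
    fluency scores for completions generated at that injection strength.
    max_steps is only a safety cap for this sketch.
    """
    admissible = []
    alpha = step
    for _ in range(max_steps):
        if should_stop(score_fn(alpha), baseline_mean):
            break
        admissible.append(alpha)
        alpha += step
    return admissible

def toy_fluency(alpha):
    """Made-up fluency model: every response degrades linearly with alpha."""
    return [1.0 - 0.03 * alpha] * 20

print(unbounded_sweep(toy_fluency, baseline_mean=1.0, step=1.0))  # [1.0]
```

The point of the design is that the sweep has no preset upper bound on α: the fluency constraint, not a fixed grid, decides where the search ends.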

Methods (5)

Centroid Unit Calibration
Novel calibration of injection strength as the distance from centroid midpoint to centroid; enables meaningful cross-layer comparison of alpha values
Embedding-based Construct Logistic Classifier
Logistic regressor on Qwen3Embedding-0.6B embeddings trained on construct statements; used to measure construct presence in alpha sweeps
Injection Stride
Parameter controlling how often an injection is applied during completion; s=1 injects on every activation, achieving strongest steering
PM Hybrid Method
Hybrid method combining Personality Prompting (P²) with MDS injections; best overall steering method
Synthetic Situational Judgment Test Battery
Open-ended situational judgment tests synthesized using GPT-5.1 from ATOMIC10x heads and inventory items; primary evaluation instrument for open-ended steering
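
To make the Injection Stride entry concrete, here is a minimal sketch on plain Python lists. It assumes that stride s means "inject at every s-th completion position, starting from the first", which is one plausible reading of the description above; the function name and the toy hidden states are hypothetical, and a real implementation would apply the addition inside the model's forward pass rather than on a stored list.

```python
def apply_strided_injection(hidden_states, vector, alpha, stride):
    """Add alpha * vector to every stride-th hidden state.

    hidden_states: list of per-position activation vectors (toy stand-in
    for residual-stream activations during completion).
    ASSUMPTION: positions 0, stride, 2*stride, ... receive the injection.
    """
    if stride < 1:
        raise ValueError("stride must be a positive integer")
    steered = []
    for position, hidden in enumerate(hidden_states):
        if position % stride == 0:
            steered.append([h + alpha * v for h, v in zip(hidden, vector)])
        else:
            steered.append(list(hidden))  # left untouched
    return steered

states = [[0.0, 0.0] for _ in range(4)]
every_step = apply_strided_injection(states, [1.0, -1.0], alpha=2.0, stride=1)
every_other = apply_strided_injection(states, [1.0, -1.0], alpha=2.0, stride=2)
print(every_other)  # [[2.0, -2.0], [0.0, 0.0], [2.0, -2.0], [0.0, 0.0]]
```

With s = 1 every position is shifted, matching the description of s = 1 as the setting that injects on every activation.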

Frameworks (2)

OCEAN Personality Model
Primary personality model used to benchmark steering methods across 14 LLMs
Psychological Steering Framework
The paper's primary contribution: performs unbounded, fluency-constrained sweeps in semantically calibrated centroid units using psychological artifacts

Datasets (1)

Qwen3-8B
One of the 14 instruction-tuned LLMs evaluated in the paper; it is a model rather than a dataset.

Findings (23)

MDS injections can steer toward multiple distinct constructs in the same completion, producing strongly polarized yet smoothly connected segments
Qualitative finding demonstrating unique capability of activation-level interventions unavailable to prompting methods including PM
Embedding-based construct classifiers achieve mean accuracy and F1-macro of 95.96% across OCEAN, HEXACO, Dark Tetrad, CMNI, CFNI constructs
Validates use of lightweight classifiers as replacement for frontier LLM evaluation during alpha sweeps
Probes trained on h_b activations achieve perfect test accuracy in every case; h_s probes achieve perfect accuracy in only 0.60% of cases
Justifies restricting probe-based vector derivation to h_b activations; attributed to Yes/No semantics
PM hybrid outperforms both P² and MDS in 13 of 14 LLMs with Φ gains over P² from 5.56% to 21.92% and over MDS from 3.30% to 26.67%
Key finding showing that combining prompting and injection is the strongest approach
Generated statements achieve 85.62%-94.00% cosine similarity alignment with Perez et al. validated OCEAN and Dark Triad statements
Validates the statement synthesis pipeline as producing behavior-specific content comparable to established methods
47.69% of 130 injection-manipulated α trends have near-linear fits (R² ≥ 0.95); 96.15% have roughly linear fits (R² ≥ 0.75)
Demonstrates alignment with Linear Representation Hypothesis: target trait steers approximately linearly with alpha
Injection stride s=1 produces the highest mean SJT scores across all LLMs; more frequent injection yields stronger steering
Empirical finding about injection stride parameter; injecting into every completion activation maximizes steering strength
Only 46.15% of cases show covariance patterns consistent with the Big Two model; no LLM satisfies all Big Two correlations
Suggests a gap between LLM learned representations and human personality structure as described by Big Two
Agreeableness-Conscientiousness Big Two correlation is observed in 10 of 13 LLMs; N-C correlation is rarest (1 LLM)
Most and least common Big Two covariance pattern in LLM OCEAN MDS injections
Only 13.27% of 520 non-manipulated α trends achieve R² ≥ 0.95, contrasting with 47.69% for manipulated trends
Control comparison showing near-linearity is specific to the targeted manipulation direction
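
The linearity findings above rest on an ordinary least squares fit of trait score against α and its R². A minimal sketch of that computation follows. The thresholds and the "near-linear" and "roughly linear" labels come from the findings, while the function names, the "below both thresholds" label, the α range, and the toy score curves are assumptions for illustration.

```python
def ols_r2(xs, ys):
    """Fit y = intercept + slope * x by least squares; return slope, intercept, R^2."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0.0 or ss_tot == 0.0:
        raise ValueError("R^2 is undefined for constant inputs")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    return slope, intercept, 1.0 - ss_res / ss_tot

def linearity_bucket(r2):
    """Bucket an R^2 value using the thresholds quoted in the findings."""
    if r2 >= 0.95:
        return "near-linear"
    if r2 >= 0.75:
        return "roughly linear"
    return "below both thresholds"

# Ten equidistant alpha values on a hypothetical range [-1, 1].
alphas = [-1.0 + 2.0 * i / 9 for i in range(10)]
linear_scores = [2.0 * a + 1.0 for a in alphas]   # toy manipulated trait
curved_scores = [a * a for a in alphas]           # toy non-linear response

slope, intercept, r2_linear = ols_r2(alphas, linear_scores)
_, _, r2_curved = ols_r2(alphas, curved_scores)
print(linearity_bucket(r2_linear), linearity_bucket(r2_curved))
```

Running the same fit on the non-manipulated traits gives the control comparison: a steering direction counts as a clean "knob" only if the targeted trait is near-linear in α while the others are not.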

Claims (8)

MD vectors outperform probe-based vectors in SJTs because they align with construct centroids rather than distorting direction via regularization
Mechanistic explanation for MDS superiority; attributed to two design choices: centroid alignment and full-utterance semantics in h_s
OCEAN MDS injection covariance patterns departing from the Big Two model suggest a gap between learned LLM representations and human psychology
Interpretive conclusion from Big Two mismatch finding; tentative due to only 46.15% match rate
Uncalibrated sweep units and restricted coefficient ranges are the primary cause of prior reports showing P² outperforming MD injections
Mechanistic explanation for discrepancy with Banayeeanzade et al.; addressed by centroid unit and unbounded sweep contributions
MDS injections align with the Linear Representation Hypothesis: target trait varies near-linearly with alpha in open-ended generation
Theoretical alignment claim backed by OLS R² analysis showing 96.15% of trends have R² ≥ 0.75
h_b activations encode Yes/No semantics that produce clean separability for logistic probes, unlike h_s full-utterance activations
Explanation for why probes on h_b achieve perfect accuracy in every case while h_s probes reach perfect accuracy in only 0.60% of cases
Representation engineering and prompting methods may combine to achieve stronger behavioral expression across other domains
Broader implication of PM hybrid's superior performance; extrapolated from OCEAN results
The psychological steering framework generalizes beyond OCEAN to Dark Tetrad, CMNI, CFNI, and other psychological models
Supported by qualitative experiments showing fluent and coherent steering for three additional psychological models
RepE is a new frontier in open-ended psychological steering of LLMs, outperforming prompting when properly calibrated
Central interpretive claim overturning prior reports; supported by 11-of-14 LLM wins for MDS over P²

Hypotheses (2)

The framework's methodological contributions could be adapted to target arbitrary non-psychological attributes given custom evaluation criteria
Generalization hypothesis stated in introduction; not tested in paper
Combining multiple construct injections simultaneously may enable richer persona simulation or fine-grained control
Identified as future work; demonstrated qualitatively in Figure 1 but not formally evaluated

Questions (4)

Do the findings about MDS injection effectiveness generalize to base (non-instruction-tuned) language models?
Acknowledged limitation: only instruction-tuned models were studied
Why do MDS injections outperform other methods on the inventory (multiple-choice) task?
Identified as an unexplained result and open question in limitations section
Why do MDS injections fail on gemma-3-1b-it but succeed across all other tested LLMs?
Unexplained exception identified as a limitation and open question
Do psychological steering results hold beyond 64-token completions?
Acknowledged limitation of restricting experiments to 64-token completions

Original abstract

Large language models (LLMs) emulate a consistent human-like behavior that can be shaped through activation-level interventions. This paradigm is converging on additive residual-stream injections, which rely on injection-strength sweeps to approximate optimal intervention settings. However, existing methods restrict the search space and sweep in uncalibrated activation-space units, potentially missing optimal intervention conditions. Thus, we introduce a psychological steering framework that performs unbounded, fluency-constrained sweeps in semantically calibrated units. Our method derives and calibrates residual-stream injections using psychological artifacts, and we use the IPIP-NEO-120, which measures the OCEAN personality model, to compare six injection methods. We find that mean-difference (MD) injections outperform Personality Prompting (P²), an established baseline for OCEAN steering, in open-ended generation in 11 of 14 LLMs, with gains of 3.6% to 16.4%, overturning prior reports favoring prompting and positioning representation engineering as a new frontier in open-ended psychological steering. Further, we find that a hybrid of P² and MD injections outperforms both methods in 13 of 14 LLMs, with gains over P² ranging from 5.6% to 21.9% and from 3.3% to 26.7% over MD injections. Finally, we show that MD injections align with the Linear Representation Hypothesis and provide reliable, approximately linear control knobs for psychological steering. Nevertheless, they also induce OCEAN trait covariance patterns that depart from the Big Two model, suggesting a gap between learned representations and human psychology.

Related work: refs + corpus + external arXiv

Cited / in-corpus / arXiv badges show which signals surfaced each row. Multi-source rows weighted higher.

Psychological Steering in LLMs: An Evaluation of Effectiveness and Trustworthiness
Ala N. Tak, Fatemeh Bahrani, Anahita Bolourani, Leonardo Blas, Emilio Ferrara, Jonathan Gratch, Sai Praneeth Karimireddy, Amin Banayeeanzade
2025
≈ 89%
Controllable and explainable personality sliders for LLMs at inference time
David Khachaturov, Robert Mullins, Mark Huasong Meng, Florian Hoppe
2026
≈ 88%
The Effectiveness of Style Vectors for Steering Large Language Models: A Human Evaluation
Katharina Dworatzyk, Sophie Jentzsch, Peer Schütt, Sabine Theis, Tobias Hecking, Diaoulé Diallo
2026
≈ 86%
What Can We Actually Steer? A Multi-Behavior Study of Activation Control
Krystian Novak, Tetiana Bas
2026
≈ 86%
Steer Like the LLM: Activation Steering that Mimics Prompting
Geert Heyman and Frederik Vandeputte
2026
≈ 86%
Mechanistic Indicators of Steering Effectiveness in Large Language Models
Hao Xue, Flora Salim, Mehdi Jafari
2026
≈ 86%
Curveball Steering: The Right Direction To Steer Isn't Always Linear
Hae Jin Song, Linlin Wu, Abir Harrasse, Jeff M. Phillips, Fazl Barez, Amirali Abdullah, Shivam Raval
2026
≈ 86%
Persona Non Grata: Single-Method Safety Evaluation Is Incomplete for Persona-Imbued LLMs
Fan Yang, Shaunak A. Mehta, Koichi Onoue, Wenkai Li
2026
≈ 85%
Steering Conceptual Bias via Transformer Latent-Subspace Activation
Vansh Sharma and Venkat Raman
2025
≈ 85%
Steering at the Source: Style Modulation Heads for Robust Persona Control
Gouki Minegishi, Koshi Eguchi, Sosuke Hosokawa, Kenjiro Taura, Yoshihiro Izawa
2026
≈ 84%
Causal Evidence that Language Models use Confidence to Drive Behavior
Nathaniel Daw, Simon Osindero, Petar Velickovic, Viorica Patraucean, Dharshan Kumaran
2026
≈ 84%
CBMAS: Cognitive Behavioral Modeling via Activation Steering
Anthony Kuang, Ayo Akinkugbe, Kevin Zhu, Sean O'Brien, Ahmed H. Ismail
2026
≈ 84%
Intrinsic Guardrails: How Semantic Geometry of Personality Interacts with Emergent Misalignment in LLMs
Manas Mittal, Anmol Goel, Ponnurangam Kumaraguru, Vamshi Krishna Bonagiri, Krishak Aneja
2026
≈ 84%
Beyond Steering Vector: Flow-based Activation Steering for Inference-Time Intervention
Ruixuan Deng, Junran Wang, Xinjie Shen, Chao Zhang, Zehao Jin
2026
≈ 84%
Steering Language Models with Weight Arithmetic
Fabien Roger, Constanza Fierro
2026
≈ 84%
Quantitative Introspection in Language Models: Tracking Emotive States Across Conversation
in corpus
2026
≈ 83%
Steering Evaluation-Aware Language Models to Act Like They Are Deployed
in corpus
2025
≈ 83%
Persistence and Introspection of Emotion Features
in corpus
≈ 82%
Detecting the Disturbance: A Nuanced View of Introspective Abilities in LLMs
in corpus
2025
≈ 82%
The Assistant Axis: Situating and Stabilizing the Default Persona of Language Models
in corpus
2026
≈ 82%
Large Language Models Report Subjective Experience Under Self-Referential Processing
in corpus
2025
≈ 82%
Mechanistic Knobs in LLMs: Retrieving and Steering High-Order Semantic Features via Sparse Autoencoders
in corpus
2026
≈ 82%
Unveiling the Latent Directions of Reflection in Large Language Models
in corpus
2025
≈ 81%
From Directions to Cones: Exploring Multidimensional Representations of Propositional Facts in LLMs
in corpus
2025
≈ 81%
Manifold Steering Reveals the Shared Geometry of Neural Network Representation and Behavior
in corpus
2026
≈ 81%
The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets
in corpus
2023
≈ 81%