paper
active
2026
paper:doi-10-48550-arxiv-2604-14463

Psychological Steering of Large Language Models

TL;DR

Mean-difference-from-self (MDS) residual-stream injections outperform Personality Prompting (P²), the established baseline for OCEAN psychological steering, in open-ended generation across 11 of 14 tested LLMs—including Llama-3.1-8B-Instruct, Qwen3-8B, and gemma-3-12b-it—with steerability score (Φ) gains ranging from 3.61% to 16.44% on synthetic situational judgment tests scored by GPT-5.1. A hybrid method (PM) combining P² prompting with MDS injections extends this further, outperforming both constituents in 13 of 14 models with gains over P² of 5.56%–21.92% and over MDS alone of 3.30%–26.67%. These results directly overturn Banayeeanzade et al.'s prior finding that P² surpasses MD-based injection methods, with the gap traced to two methodological failures in prior work: sweeping injection strength in uncalibrated activation-space units and restricting search to a narrow coefficient range such as [0.4, 0.5, …, 1.5]. The psychological steering framework introduced here addresses both by defining centroid units—layer-wise calibrated scales anchored to the distance between construct and antithesis activation centroids—and operationalizing unbounded fluency-constrained sweeps using lightweight logistic-regression classifiers trained on Qwen3Embedding-0.6B embeddings rather than paid frontier APIs. OLS regression on 10-step α sweeps shows 89.23% of manipulated OCEAN trends achieve R² ≥ 0.85, confirming near-linear control consistent with the Linear Representation Hypothesis; however, the induced cross-trait covariance matches the Big Two metatrait model in only 46.15% of cases, implying that LLM representation geometry diverges meaningfully from the structure of human personality.
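The centroid-unit definition summarized above can be made concrete with a small sketch. It assumes that one unit is the Euclidean distance from the midpoint of the two centroids to the construct centroid (equivalently, half the distance between the centroids) and that the injection direction points from the midpoint toward the construct centroid. The function names, the toy activations, and the exact form of the injection are illustrative assumptions, not the authors' implementation.

```python
import math

def centroid(vectors):
    """Component-wise mean of a list of equal-length activation vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def centroid_unit(construct_acts, antithesis_acts):
    """Return (unit_length, direction) for one layer.

    unit_length: distance from the centroid midpoint to the construct centroid.
    direction:   unit vector from the midpoint toward the construct centroid
                 (assumed injection direction).
    """
    c_pos = centroid(construct_acts)
    c_neg = centroid(antithesis_acts)
    midpoint = [(p + q) / 2 for p, q in zip(c_pos, c_neg)]
    offset = [p - m for p, m in zip(c_pos, midpoint)]
    unit = math.sqrt(sum(x * x for x in offset))
    if unit == 0.0:
        raise ValueError("construct and antithesis centroids coincide")
    direction = [x / unit for x in offset]
    return unit, direction

def inject(hidden, alpha, unit, direction):
    """Add alpha centroid units along the direction to one residual vector."""
    return [h + alpha * unit * d for h, d in zip(hidden, direction)]

# Toy 2-D activations (hypothetical data).
construct = [[2.0, 0.0], [4.0, 0.0]]      # centroid (3, 0)
antithesis = [[-2.0, 0.0], [-4.0, 0.0]]   # centroid (-3, 0)
unit, direction = centroid_unit(construct, antithesis)
steered = inject([0.0, 1.0], alpha=1.0, unit=unit, direction=direction)
```

Under these assumptions, alpha = 1 moves an activation by the same distance that separates the midpoint from the construct centroid, which is what makes alpha values comparable across layers with different activation scales.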

What to take away

  1. MDS (mean-difference-from-self-statement) injections outperform Personality Prompting (P²) on open-ended OCEAN situational judgment tests in 11 of 14 LLMs spanning Llama-3.2-1B-Instruct through Qwen3-32B, with steerability Φ gains of 3.61%–16.44%.
  2. A hybrid method (PM) that applies both P² system-prompt steering and MDS residual-stream injections simultaneously achieves the highest steerability in 13 of 14 LLMs, with Φ gains over P² of 5.56%–21.92% and over MDS alone of 3.30%–26.67%.
  3. Prior reports that prompting outperforms MD injections (specifically Banayeeanzade et al. 2025) are attributable to two methodological gaps: sweeping in uncalibrated unit-vector magnitudes and restricting the search space to a narrow set of coefficients such as [0.4, 0.5, …, 1.5], which can miss optima far from that range.
  4. The framework introduces centroid units as a calibrated injection-strength scale: for each transformer layer, the unit is defined as the Euclidean distance from the midpoint between construct and antithesis activation centroids to the target centroid, enabling meaningful cross-layer and cross-model comparisons.
  5. Among the six injection methods compared (L1LI, L1ZI, L2LI, L2ZI, MDB, MDS), MDS dominates on the open-ended SJT task with a global win proportion of 89.5% across 14 LLMs and four injection stride settings, versus 47.3% for MDB and under 30% for all probe-based methods.
  6. OLS linear regression on 10-equidistant-α sweeps shows that 89.23% of the 130 injection-manipulated OCEAN SJT score trends achieve R² ≥ 0.85, and 96.15% achieve R² ≥ 0.75, confirming near-linear control consistent with the Linear Representation Hypothesis.
  7. OCEAN MDS injections induce cross-trait covariance that matches the Big Two model of stability (C, A, reversed-N) and plasticity (E, O) clustering in only 46.15% of cases across 13 LLMs, with the A–C correlation being the most frequent (10 LLMs) and N–C the rarest (1 LLM), suggesting a systematic divergence between learned representations and human personality structure.
  8. Construct-specific statement corpora are synthesized by prompting Llama-3.1-8B-Instruct to generate 35,000 first-person texts per condition, retaining 500 fluent, semantically deduplicated texts (cosine similarity threshold 0.9 on Qwen3Embedding-0.6B embeddings), and validated against Perez et al.'s corpora with centroid cosine similarities of 85.62%–94.00%.
  9. Fluency-constrained unbounded α sweeps are operationalized using logistic-regression classifiers on Qwen3Embedding-0.6B embeddings, achieving mean accuracy and F1-macro of 95.96% across OCEAN, HEXACO, Dark Tetrad, CMNI, and CFNI constructs, replacing frontier LLM calls during the sweep phase.
  10. An open question raised is why MDS injections entirely failed on gemma-3-1b-it, yielding only one valid steering score across all OCEAN traits and strides, while succeeding on larger Gemma variants (4b, 12b, 27b), suggesting a potential model-size or architecture threshold for residual-stream steerability that remains unexplained.
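The semantic deduplication step in the corpus-synthesis takeaway (cosine similarity threshold 0.9, 500 texts retained) can be sketched as a greedy filter. The greedy first-come strategy, the treatment of similarity exactly at the threshold as a duplicate, and the toy 2-D vectors standing in for Qwen3Embedding-0.6B embeddings are all assumptions; the source does not specify how duplicates are resolved.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def deduplicate(embeddings, threshold=0.9, keep=500):
    """Greedily keep indices whose embedding stays below the similarity
    threshold against every embedding already kept (assumed strategy)."""
    kept = []
    for i, emb in enumerate(embeddings):
        if all(cosine(emb, embeddings[j]) < threshold for j in kept):
            kept.append(i)
        if len(kept) == keep:
            break
    return kept

# Toy embeddings (hypothetical): the second vector nearly duplicates the first.
toy = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0], [0.7, 0.7]]
kept_indices = deduplicate(toy)
```

In a real pipeline the fluency filter would presumably run before or alongside this step; the ordering is not stated in the summary.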

Peer brief — for seminar discussion

The paper constructs a psychological steering framework for LLMs that targets the OCEAN personality model via additive residual-stream injections and operationalizes what prior work left underspecified: a calibrated injection-strength scale and an unbounded sweep procedure. Fourteen instruction-tuned models ranging from Llama-3.2-1B-Instruct to Qwen3-32B (models above 12B quantized to 4-bit NF4) are evaluated with six injection types, four probe-based (L1LI, L1ZI, L2LI, L2ZI) and two mean-difference variants (MDB, MDS), plus the Personality Prompting baseline P² from Jiang et al. Evaluation uses the IPIP-NEO-120 inventory and a battery of synthetic open-ended situational judgment tests (SJTs) scored by GPT-5.1 at temperature 0.

The framework's method contributions are: (1) centroid units, a layer-wise calibrated scale where one unit equals the distance from the midpoint of construct and antithesis activation centroids to the target centroid; (2) unbounded fluency-constrained sweeps that early-stop when mean SJT fluency drops below 95% of baseline or more than 5% of responses fall below 90% of baseline, with fluency scored by a RoBERTa-large CoLA classifier; and (3) lightweight logistic-regression classifiers on Qwen3Embedding-0.6B embeddings (mean accuracy 95.96%) to replace frontier API calls during sweeping. The alternative the framework implicitly contrasts against is the restricted fixed-grid sweep over a small coefficient range like [0.4, 0.5, …, 1.5] at uncalibrated unit-vector magnitudes, as employed by Banayeeanzade et al. and Deng et al.

The load-bearing finding is that MDS injections, derived from differences between centroids of first-person self-statement activations, outperform P² in 11 of 14 LLMs with Φ gains of 3.61%–16.44%, and a PM hybrid of P² plus MDS outperforms both in 13 of 14 LLMs with Φ gains over P² of 5.56%–21.92%. This directly overturns Banayeeanzade et al.'s 2025 report that P² outperforms MD-based injections, attributing the reversal entirely to the two methodological gaps corrected here. Additionally, 89.23% of manipulated OCEAN score trends show R² ≥ 0.85 under OLS regression across 10 equidistant α values, supporting alignment with the Linear Representation Hypothesis, while induced cross-trait covariance matches the Big Two metatrait model in only 46.15% of cases, a finding the authors interpret as evidence of a gap between LLM representation geometry and human personality structure. The implications are that representation engineering, properly calibrated, should be considered the leading approach to open-ended psychological steering, and that prompt-based and activation-based methods are complementary rather than competing.

A critical reader should push back on the evaluation instrument's circularity: the SJTs used to compare methods are synthesized from ATOMIC10x heads paired with IPIP-NEO-120 items via GPT-5.1, and performance on them is then scored by GPT-5.1 using the same OCEAN facet descriptions that anchored statement generation. There is no independent human-validated SJT battery for OCEAN at the scale needed, so the reported Φ scores measure alignment with GPT-5.1's conception of OCEAN traits as instantiated through the same pipeline that produced the tests. This closed loop makes it difficult to assess ecological validity or to rule out that PM's gains reflect better exploitation of GPT-5.1's scoring preferences rather than stronger behavioral expression.
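The linearity check described in the brief (OLS regression over 10 equidistant α values, with R² ≥ 0.85 as the threshold) can be illustrated with a closed-form simple regression. The α range of 0 to 1 and the toy scores are hypothetical; only the number of sweep points and the threshold come from the text.

```python
def ols_r_squared(xs, ys):
    """Fit y = intercept + slope * x by ordinary least squares.

    Returns (slope, intercept, r_squared). Assumes xs and ys each contain
    at least two distinct values, so neither denominator is zero.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return slope, intercept, 1.0 - ss_res / ss_tot

# Ten equidistant alpha values (range is a hypothetical choice).
alphas = [i / 9 for i in range(10)]
# Toy trait scores that respond linearly to alpha (hypothetical data).
scores = [1.0 + 2.0 * a for a in alphas]

slope, intercept, r2 = ols_r_squared(alphas, scores)
is_near_linear = r2 >= 0.85   # threshold quoted in the brief
```

Applied per trait and per model, the share of sweeps with `is_near_linear` set would correspond to the kind of proportion the brief reports, under the assumption that each trend is fitted independently.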

Methods (5)

  • Centroid Unit Calibration
    Novel calibration of injection strength as the distance from centroid midpoint to centroid; enables meaningful cross-layer comparison of alpha values
  • Embedding-based Construct Logistic Classifier
    Logistic regressor on Qwen3Embedding-0.6B embeddings trained on construct statements; used to measure construct presence in alpha sweeps
  • Injection Stride
    Parameter controlling how often an injection is applied during completion; s=1 injects on every activation, achieving strongest steering
  • PM Hybrid Method
    Hybrid method combining Personality Prompting (P²) with MDS injections; best overall steering method
  • Synthetic Situational Judgment Test Battery
    Open-ended situational judgment tests synthesized using GPT-5.1 from ATOMIC10x heads and inventory items; primary evaluation instrument for open-ended steering
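One plausible reading of the injection stride entry above is that the injection vector is added at every s-th position of the sequence, so that s=1 touches every activation. The sketch below follows that reading only; the position-modulo rule, the function name, and the toy vectors are assumptions, since the summary does not define the stride precisely.

```python
def apply_with_stride(activations, injection, stride):
    """Add `injection` to every `stride`-th activation vector.

    Assumed semantics: positions 0, stride, 2*stride, ... are steered,
    all other positions pass through unchanged.
    """
    if stride < 1:
        raise ValueError("stride must be a positive integer")
    steered = []
    for position, hidden in enumerate(activations):
        if position % stride == 0:
            hidden = [h + d for h, d in zip(hidden, injection)]
        steered.append(list(hidden))
    return steered

# Four toy 2-D activations (hypothetical data).
acts = [[0.0, 0.0] for _ in range(4)]
every = apply_with_stride(acts, [1.0, 0.0], stride=1)       # all positions
alternate = apply_with_stride(acts, [1.0, 0.0], stride=2)   # positions 0 and 2
```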

Frameworks (2)

  • OCEAN Personality Model
    Primary personality model used to benchmark steering methods across 14 LLMs
  • Psychological Steering Framework
    The paper's primary contribution: performs unbounded, fluency-constrained sweeps in semantically calibrated centroid units using psychological artifacts
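The fluency-constrained sweep can be sketched using the early-stop rule quoted in the peer brief: stop when mean fluency falls below 95% of baseline, or when more than 5% of responses fall below 90% of baseline. The step size, the safety cap on iterations, the callback interface, and the toy fluency model are hypothetical; the paper's sweep is described as unbounded, and the cap here only keeps the example from looping forever.

```python
def fluency_violated(fluencies, baseline,
                     mean_floor=0.95, item_floor=0.90, max_low_fraction=0.05):
    """Return True when either early-stop condition from the brief holds."""
    mean_fluency = sum(fluencies) / len(fluencies)
    low = sum(1 for f in fluencies if f < item_floor * baseline)
    return (mean_fluency < mean_floor * baseline
            or low / len(fluencies) > max_low_fraction)

def sweep(fluency_at, baseline, step=1.0, max_steps=1000):
    """Raise alpha in fixed steps until fluency breaks.

    `fluency_at(alpha)` is a hypothetical callback returning per-response
    fluency scores. Returns the last alpha that satisfied the constraint,
    or None if even the first step violated it.
    """
    last_ok = None
    for k in range(1, max_steps + 1):
        alpha = k * step
        if fluency_violated(fluency_at(alpha), baseline):
            break
        last_ok = alpha
    return last_ok

def toy_fluency(alpha):
    """Hypothetical model: fluency decays linearly as alpha grows."""
    return [1.0 - 0.02 * alpha] * 20

best_alpha = sweep(toy_fluency, baseline=1.0)
```

With the toy decay model, fluency is 0.96 at alpha 2 and 0.94 at alpha 3, so the sweep stops at the third step and reports 2.0 as the largest admissible strength.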

Datasets (1)

  • Qwen3-8B
    Reasoning-optimized base model used for training SFR-DR-8B variant.

Original abstract

Large language models (LLMs) emulate a consistent human-like behavior that can be shaped through activation-level interventions. This paradigm is converging on additive residual-stream injections, which rely on injection-strength sweeps to approximate optimal intervention settings. However, existing methods restrict the search space and sweep in uncalibrated activation-space units, potentially missing optimal intervention conditions. Thus, we introduce a psychological steering framework that performs unbounded, fluency-constrained sweeps in semantically calibrated units. Our method derives and calibrates residual-stream injections using psychological artifacts, and we use the IPIP-NEO-120, which measures the OCEAN personality model, to compare six injection methods. We find that mean-difference (MD) injections outperform Personality Prompting (P²), an established baseline for OCEAN steering, in open-ended generation in 11 of 14 LLMs, with gains of 3.6% to 16.4%, overturning prior reports favoring prompting and positioning representation engineering as a new frontier in open-ended psychological steering. Further, we find that a hybrid of P² and MD injections outperforms both methods in 13 of 14 LLMs, with gains over P² ranging from 5.6% to 21.9% and from 3.3% to 26.7% over MD injections. Finally, we show that MD injections align with the Linear Representation Hypothesis and provide reliable, approximately linear control knobs for psychological steering. Nevertheless, they also induce OCEAN trait covariance patterns that depart from the Big Two model, suggesting a gap between learned representations and human psychology.