finding

active

finding:pm-achieves-overall-sjt-steerability-phi-9-6-on-gemma-3-12b-it-vs-mds-8-7-and-p2-8-3

PM achieves overall SJT steerability Phi=9.6 on gemma-3-12b-it vs MDS=8.7 and P2=8.3

Per-model steerability comparison from Table 4

Source paper

extracted_from

Psychological Steering of Large Language Models

(2026) · Leonardo Blas · Robin Jia · Emilio Ferrara

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

On Qwen3-1.7B, MDS achieves ϕ_1,C,↑ = 5.0 (SJTs) vs P2 at 4.7, and ϕ_1,C,↓ = 1.4 (SJTs) vs P2 at 3.6 · finding · 0.793
Specific consciousness sweep result for Qwen3-1.7B from Table 6 demonstrating strong bidirectional steering
gemma-3-1b-it yields only one valid MDS injection score (phi_1,A,up = 4.8) and is excluded from main analyses · finding · 0.781
Identified exception to overall MDS effectiveness; reason remains unexplained as a limitation
Synthetic MFT SJTs achieve 77.71%-83.84% alignment with Clifford et al. human-composed MFT vignettes · finding · 0.767
Moderate-to-high alignment validating SJT synthesis for moral foundations domain
Gemma-2-27B MT-Bench score slightly decreased from 8.81 to 8.40 ± 0.15 after SOO fine-tuning · finding · 0.766
SOO fine-tuning caused a small decrease in Gemma-2-27B general capabilities
Steering at 6 layers (strength 0.6 each, total 3.6) outperforms single-layer steering at equivalent total strength for type hint suppression · finding · 0.765
Demonstrates distributed steering is more effective and less accuracy-damaging than concentrated steering.
Gemma-2-27B-it deceptive response rate reduced from 100% to 9.36% ± 7.09% after SOO fine-tuning · finding · 0.765
Primary result showing SOO fine-tuning significantly reduces deception in Gemma-2-27B
Base and instruct Gemma 2 27B role PCs have cosine similarities of 0.93, 0.87, 0.83 for the top 3 PCs respectively; role vector cosine similarities >0.99 for every role pair · finding · 0.764
Shows persona space axes are inherited from pre-training, not solely created by post-training
PM hybrid outperforms both P2 and MDS in 13 of 14 LLMs with Phi gains over P2 from 5.56% to 21.92% and over MDS from 3.30% to 26.67% · finding · 0.764
Key finding showing that combining prompting and injection is the strongest approach