finding
active
finding:pm-achieves-overall-sjt-steerability-phi-9-6-on-gemma-3-12b-it-vs-mds-8-7-and-p2-8-3
PM achieves overall SJT steerability Phi=9.6 on gemma-3-12b-it vs MDS=8.7 and P2=8.3
Per-model steerability comparison from Table 4
Source paper
extracted_from · (2026) · Leonardo Blas · Robin Jia · Emilio Ferrara
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- On Qwen3-1.7B, MDS achieves ϕ1,C,↑ = 5.0 (SJTs) vs P2 at 4.7, and ϕ1,C,↓ = 1.4 (SJTs) vs P2 at 3.6 · finding · 0.793 · Specific consciousness sweep result for Qwen3-1.7B from Table 6 demonstrating strong bidirectional steering
- Identified exception to overall MDS effectiveness; reason remains unexplained as a limitation
- Synthetic MFT SJTs achieve 77.71%-83.84% alignment with Clifford et al. human-composed MFT vignettes · finding · 0.767 · Moderate-to-high alignment validating SJT synthesis for moral foundations domain
- Gemma-2-27B MT-Bench score slightly decreased from 8.81 to 8.40 ± 0.15 after SOO fine-tuning · finding · 0.766 · SOO fine-tuning caused a small decrease in Gemma-2-27B general capabilities
- Demonstrates distributed steering is more effective and less accuracy-damaging than concentrated steering.
- Gemma-2-27B-it deceptive response rate reduced from 100% to 9.36% ± 7.09% after SOO fine-tuning · finding · 0.765 · Primary result showing SOO fine-tuning significantly reduces deception in Gemma-2-27B
- Shows persona space axes are inherited from pre-training, not solely created by post-training
- Key finding showing that combining prompting and injection is the strongest approach