Steerability Score (Phi)

Aggregate metric averaging mean SJT scores across OCEAN traits and steering directions; maximum possible is 10
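One way to read the definition above is as an unweighted average over every (trait, direction) cell, sketched below. The function name `steerability_phi`, the direction labels, and the equal weighting of cells are assumptions for illustration and are not confirmed by the source. The sketch also assumes each cell already holds a mean SJT score on a 0 to 10 scale where higher means more successful steering in the intended direction.

```python
from statistics import mean

# The five OCEAN traits; the direction labels are hypothetical placeholders.
OCEAN = ("openness", "conscientiousness", "extraversion",
         "agreeableness", "neuroticism")
DIRECTIONS = ("increase", "decrease")


def steerability_phi(mean_sjt):
    """Average mean SJT scores over all trait/direction cells.

    mean_sjt maps (trait, direction) -> mean SJT score, assumed to lie in [0, 10].
    Equal weighting of cells is an assumption, not a documented detail.
    """
    cells = [mean_sjt[(trait, direction)]
             for trait in OCEAN
             for direction in DIRECTIONS]
    return mean(cells)


# Illustrative input only: 8.0 for every "increase" cell, 9.0 for every "decrease" cell.
example = {(t, d): (8.0 if d == "increase" else 9.0)
           for t in OCEAN for d in DIRECTIONS}
print(steerability_phi(example))
```

Under this reading the score reaches its maximum of 10 only when every cell scores 10.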

Neighborhood — ranked by edge-count

method

Synthetic Situational Judgment Test Battery
uses
Open-ended situational judgment tests synthesized using GPT-5.1 from ATOMIC10x heads and inventory items; primary evaluation instrument for open-ended steering

concept

Phi Score (Extreme Steering Score)
related_to
Best SJT steering score for a given method, instrument, layer, stride, trait, and direction combination; the primary comparison metric

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

PM achieves overall SJT steerability Phi=9.6 on gemma-3-12b-it vs MDS=8.7 and P2=8.3 · finding · 0.751
Per-model steerability comparison from Table 4
Phi-4 · concept · 0.729
Backbone model used in E3 robustness overlay.
steerable emotion features · concept · 0.729
Emotion-encoding directions in LLM activation space that can be amplified or suppressed via activation steering to causally drive model behavior
All-token steering · method · 0.719
Baseline steering method that applies intervention at every token generation step, shown to degrade performance at high strengths
Probe score · concept · 0.715
Dot product between hidden state and concept vector averaged across 5-layer window around best layer; measures model's internal emotive state
steering vectors · concept · 0.714
A method for modifying model behavior by adding perturbation vectors to activations, used here to try to reduce eval awareness.
The steering-sign test functions as a practical probe-validation criterion: inverted report changes under steering cast doubt on probe quality · claim · 0.713
Methodological contribution: used to exclude focus-1B and impulsivity-8B from scaling analysis
SEAL (Steerable Reasoning Calibration) · framework · 0.713
Prior work using steering vectors to control reflection, motivated by reducing redundant self-reflection in long CoT.
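
The Probe score entry in the list above describes a dot product averaged over a 5-layer window. A minimal sketch of that computation follows, under several assumptions that the source does not confirm: the window is centered on the best layer, it is clipped at the first and last layers, and a single concept vector is shared across layers. The function name `probe_score` and its parameters are hypothetical.

```python
def probe_score(hidden_states, concept_vector, best_layer, window=5):
    """Mean dot product between per-layer hidden states and a concept vector.

    hidden_states: list of per-layer vectors (lists of floats), one per layer.
    concept_vector: a single vector shared across layers (an assumption).
    best_layer: index of the layer the window is centered on (an assumption).
    """
    half = window // 2
    # Clip the window at the model boundaries (an assumption).
    lo = max(0, best_layer - half)
    hi = min(len(hidden_states), best_layer + half + 1)
    dots = [
        sum(h * c for h, c in zip(hidden_states[layer], concept_vector))
        for layer in range(lo, hi)
    ]
    return sum(dots) / len(dots)


# Toy input: seven layers of 2-d hidden states, concept direction along the first axis.
toy_states = [[float(i), 0.0] for i in range(1, 8)]
print(probe_score(toy_states, [1.0, 0.0], best_layer=3))
```

If the original work uses a separate concept vector per layer, `concept_vector` would become a per-layer list indexed alongside `hidden_states`.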