concept
active
concept:steerability-score-phi
Steerability Score (Phi)
Aggregate metric averaging mean SJT scores across OCEAN traits and steering directions; maximum possible is 10
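The aggregation described above can be sketched as follows. This is a minimal illustration, not the source's implementation. It assumes that each (trait, direction) cell already holds a mean SJT score on a scale whose maximum is 10, that scores are oriented so that higher means stronger steering in the intended direction, and that cells are weighted equally. The names `OCEAN_TRAITS`, `DIRECTIONS`, and `steerability_score`, and the direction labels, are hypothetical.

```python
# Hypothetical names and labels; the source does not specify them.
OCEAN_TRAITS = (
    "openness",
    "conscientiousness",
    "extraversion",
    "agreeableness",
    "neuroticism",
)
DIRECTIONS = ("increase", "decrease")

MAX_SCORE = 10.0  # the source states the maximum possible value is 10


def steerability_score(mean_sjt):
    """Average per-cell mean SJT scores over all traits and directions.

    mean_sjt maps (trait, direction) -> mean SJT score, assumed to lie
    in [0, MAX_SCORE] and to be oriented so that higher is better.
    """
    cells = [mean_sjt[(t, d)] for t in OCEAN_TRAITS for d in DIRECTIONS]
    for value in cells:
        if not 0.0 <= value <= MAX_SCORE:
            raise ValueError(f"score {value} is outside [0, {MAX_SCORE}]")
    # Unweighted mean over the 5 traits x 2 directions (an assumption).
    return sum(cells) / len(cells)


# Usage: a model that scores 8 when steered up and 6 when steered down.
example = {
    (t, d): (8.0 if d == "increase" else 6.0)
    for t in OCEAN_TRAITS
    for d in DIRECTIONS
}
phi = steerability_score(example)
```

Under these assumptions the score reaches 10 only when every trait and direction cell is at its maximum, which matches the stated upper bound.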
Neighborhood — ranked by edge-count
Methods (1)
method
- Open-ended situational judgment tests synthesized using GPT-5.1 from ATOMIC10x heads and inventory items; primary evaluation instrument for open-ended steering
Concepts (1)
concept
- Phi Score (Extreme Steering Score) · related_to · Best SJT steering score for a given method, instrument, layer, stride, trait, and direction combination; the primary comparison metric
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Per-model steerability comparison from Table 4
- Backbone model used in E3 robustness overlay.
- Emotion-encoding directions in LLM activation space that can be amplified or suppressed via activation steering to causally drive model behavior
- Baseline steering method that applies intervention at every token generation step, shown to degrade performance at high strengths
- Dot product between hidden state and concept vector averaged across 5-layer window around best layer; measures model's internal emotive state
- A method for modifying model behavior by adding perturbation vectors to activations, used here to try to reduce eval awareness.
- Methodological contribution: used to exclude focus-1B and impulsivity-8B from scaling analysis
- Prior work using steering vectors to control reflection, motivated by reducing redundant self-reflection in long CoT.