concept
active
concept:phi-score-extreme-steering-score
Phi Score (Extreme Steering Score)
Best SJT steering score for a given method, instrument, layer, stride, trait, and direction combination; the primary comparison metric
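A minimal sketch of how such a score could be selected, assuming "best" means the highest SJT score among runs that share one method, instrument, layer, stride, trait, and direction combination and differ only in some other setting (here a hypothetical steering strength). The record fields, the sample values, and the `phi_scores` name are illustrative assumptions, not taken from the source.

```python
# Fields that identify one combination (taken from the definition above).
KEY_FIELDS = ("method", "instrument", "layer", "stride", "trait", "direction")


def phi_scores(runs):
    """Return the highest SJT score seen for each key combination.

    Assumption: higher SJT score is better; `runs` is a list of dicts.
    """
    best = {}
    for run in runs:
        key = tuple(run[field] for field in KEY_FIELDS)
        score = run["sjt_score"]
        if key not in best or score > best[key]:
            best[key] = score
    return best


# Hypothetical records: two strengths for one combination, one for another.
base = {"method": "method_a", "instrument": "sjt", "layer": 12, "stride": 1,
        "trait": "openness"}
runs = [
    {**base, "direction": "+", "strength": 2.0, "sjt_score": 6.5},
    {**base, "direction": "+", "strength": 4.0, "sjt_score": 8.0},
    {**base, "direction": "-", "strength": 2.0, "sjt_score": 5.0},
]
best = phi_scores(runs)
```

Runs that share all six key fields collapse to a single entry, so methods can be compared at the same granularity.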
Neighborhood — ranked by edge-count
Concepts (1)
concept
- Steerability Score (Phi) · related_to · Aggregate metric averaging mean SJT scores across OCEAN traits and steering directions; maximum possible is 10
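The aggregate described for Steerability Score (Phi) can be sketched as an unweighted average over trait and direction cells. This is only an illustration under assumptions: the input layout (one mean SJT score per trait and direction pair), the equal weighting, and the `steerability_score` name are hypothetical. If every cell is capped at 10, the average is capped at 10 as well.

```python
from statistics import mean

OCEAN_TRAITS = ("openness", "conscientiousness", "extraversion",
                "agreeableness", "neuroticism")
DIRECTIONS = ("+", "-")


def steerability_score(mean_sjt):
    """Average the per-cell mean SJT scores.

    Assumption: `mean_sjt` maps (trait, direction) -> mean SJT score,
    and every cell is weighted equally.
    """
    return mean(mean_sjt[(trait, direction)]
                for trait in OCEAN_TRAITS
                for direction in DIRECTIONS)


# Hypothetical values: 8.0 in the "+" direction, 6.0 in the "-" direction.
example = {(trait, direction): (8.0 if direction == "+" else 6.0)
           for trait in OCEAN_TRAITS for direction in DIRECTIONS}
score = steerability_score(example)
```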
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- Backbone model used in E3 robustness overlay.
- Second of three operational criteria; requires distributional significance in IIT estimates across performance levels.
- Dot product between hidden state and concept vector averaged across 5-layer window around best layer; measures model's internal emotive state
- Identified exception to overall MDS effectiveness; reason remains unexplained as a limitation
- First of three operational criteria for identifying consciousness phenomena in LLM representations.
- Baseline steering method that applies intervention at every token generation step, shown to degrade performance at high strengths
- Metrics derived from benchmarks to quantify how safe a model is, e.g., refusal rate to harmful requests.
- Models produce first-attempt mean scores 87.8-91.8/100 without steering across all model families · finding · 0.680 · Establishes high baseline quality, confirming steering-induced degradation is the experimental signal