claim
active
claim:the-steering-sign-test-functions-as-a-practical-probe-validation-criterion-inverted-report-changes-when-steering-suspect-probe-quality
The steering-sign test functions as a practical probe-validation criterion: report changes that invert under steering indicate suspect probe quality
Methodological contribution: used to exclude focus-1B and impulsivity-8B from scaling analysis
Source paper
extracted_from · (2026) · Nicolas Martorell · Bruno Bianchi
Neighborhood — ranked by edge-count
Methods (2)
method
- Probe construction method: concept vector at each layer is L2-normalized difference between mean positive and mean negative representations from contrastive system prompts
- Steering-sign validation test · implements · Validation filter: same-concept steering must shift self-report in the expected direction; used to exclude invalid concept-model pairs
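The two methods above can be sketched in a few lines. This is a minimal illustration and not the paper's implementation: the function names, the plain-list activation format, and the covariance-based direction check are all assumptions made for the example.

```python
import math


def concept_vector(pos, neg):
    """L2-normalized difference of mean activations for a single layer.

    pos, neg: lists of activation vectors (lists of floats) collected under
    the positive and negative contrastive system prompts (assumed format).
    """
    dim = len(pos[0])
    mean_pos = [sum(v[i] for v in pos) / len(pos) for i in range(dim)]
    mean_neg = [sum(v[i] for v in neg) / len(neg) for i in range(dim)]
    diff = [p - n for p, n in zip(mean_pos, mean_neg)]
    norm = math.sqrt(sum(d * d for d in diff))
    if norm == 0.0:
        raise ValueError("positive and negative means are identical")
    return [d / norm for d in diff]


def passes_steering_sign_test(strengths, reports):
    """Hypothetical check: self-report should rise with steering strength.

    strengths: signed steering coefficients applied along the concept vector.
    reports:   the model's numeric self-report at each strength.
    Returns True when the two co-vary positively (expected direction).
    """
    mean_s = sum(strengths) / len(strengths)
    mean_r = sum(reports) / len(reports)
    cov = sum((s - mean_s) * (r - mean_r) for s, r in zip(strengths, reports))
    return cov > 0
```

Under this sketch, a concept-model pair whose self-report falls as same-concept steering increases would fail the check and be dropped from downstream analysis, as the source reports for focus-1B and impulsivity-8B.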
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Key methodological insight: introspection enables a new probe validation criterion beyond conventional separation metrics
- Mechanism claim supported by transcript analysis and the fact that the steering vector was extracted from a model that never writes type hints.
- Validates that steering vectors capture reflection semantics by finding tokens reported in related work.
- The paper's critique of the standard linear steering baseline, supported by the days-of-week demo.
- Key interpretive question the framework helps address through convergent validation logic
- Third falsifiable prediction: any dissociation between inverted learning and inverted valence report would disconfirm the identity
- Practical guidance for practitioners who lack ground-truth model organisms.
- Mechanistic interpretation of how activation steering induces deception through the model's reasoning process