Steering-sign validation test

Validation filter: steering along a concept must shift the model's self-report of that same concept in the expected direction; concept-model pairs that fail the test are excluded as invalid
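A minimal sketch of how such a filter could be applied, assuming self-reports are available as numeric scores with and without steering. The `SteeringRun` record, the `passes_sign_test` and `filter_valid_pairs` helpers, and the `min_shift` margin are hypothetical names introduced for illustration; the source does not specify how the shift is measured.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class SteeringRun:
    """Hypothetical record for one concept-model pair."""
    concept: str
    model: str
    baseline_reports: list[float]  # self-report scores without steering
    steered_reports: list[float]   # self-report scores with same-concept steering


def passes_sign_test(run: SteeringRun, expected_sign: int = 1,
                     min_shift: float = 0.0) -> bool:
    """Return True if the mean self-report moves in the expected direction.

    expected_sign is +1 when steering should raise the report and -1 when it
    should lower it. min_shift is an assumed margin, not taken from the source.
    """
    shift = mean(run.steered_reports) - mean(run.baseline_reports)
    return expected_sign * shift > min_shift


def filter_valid_pairs(runs: list[SteeringRun]) -> list[tuple[str, str]]:
    """Keep only the (concept, model) pairs whose shift has the expected sign."""
    return [(r.concept, r.model) for r in runs if passes_sign_test(r)]
```

A pair whose steered reports move the opposite way (an inverted shift) is dropped by `filter_valid_pairs`, which mirrors the exclusion role described above.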

Neighborhood — ranked by edge-count

Methods (1)

method

Activation Steering
extends
Causal intervention technique: edit the NLA explanation, reconstruct via the AR, and use the difference as a steering vector to manipulate model behavior.
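The difference-as-steering-vector step can be sketched as follows. This is an illustration under stated assumptions: the `reconstruct` callable stands in for the AR and is assumed to map an explanation string to an activation vector, the difference is assumed to be taken between the reconstructions of the edited and original explanations, and `toy_reconstruct` is a made-up stand-in, not the real component.

```python
from typing import Callable, Sequence

Vector = list[float]


def steering_vector(reconstruct: Callable[[str], Sequence[float]],
                    original: str, edited: str) -> Vector:
    """Difference between reconstructions of the edited and original text."""
    base = reconstruct(original)
    changed = reconstruct(edited)
    if len(base) != len(changed):
        raise ValueError("reconstructions must have the same dimension")
    return [c - b for b, c in zip(base, changed)]


def apply_steering(activation: Sequence[float], vector: Sequence[float],
                   scale: float = 1.0) -> Vector:
    """Add the scaled steering vector to an activation (assumed additive)."""
    if len(activation) != len(vector):
        raise ValueError("activation and vector must have the same dimension")
    return [h + scale * v for h, v in zip(activation, vector)]


def toy_reconstruct(text: str) -> Vector:
    """Hypothetical stand-in for the AR: two hand-made features of the text."""
    return [float(text.count("calm")), float(len(text.split()))]
```

With the toy stand-in, editing "the model is tense" to "the model is calm" changes only the first feature, so the resulting vector steers along that one direction; a negative `scale` would steer the opposite way.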

Claims (1)

claim

The steering-sign test functions as a practical probe-validation criterion: when steering produces an inverted change in self-report, probe quality is suspect
implements
Methodological contribution: used to exclude focus-1B and impulsivity-8B from scaling analysis

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Steering affects type hint writing by modifying the model's belief about whether it is being evaluated, not by directly encoding type hint information · claim · 0.740
Mechanism claim supported by transcript analysis and the fact that the steering vector was extracted from a model that never writes type hints.
steering vectors · concept · 0.735
A method for modifying model behavior by adding perturbation vectors to activations, used here to try to reduce eval awareness.
Interpretive Validation · concept · 0.733
CIMC's methodology for evaluating whether a built system is conscious: combining multiple forms of evidence including predicted functional organization and developmental trajectories
Interpretability-Driven Feedback Steering · concept · 0.732
Framework of using internal-state representations to control or steer generative models; conceptually parallel to manifold steering in language models.
Steering vectors discover effective triggers such as 'However' and 'Otherwise', consistent with prior reported reflection datasets · finding · 0.729
Validates that steering vectors capture reflection semantics by finding tokens reported in related work.
Cross-task and cross-modal validation of manifold steering · finding · 0.726
The paper demonstrates the bidirectional geometry-behavior relationship across multiple tasks and modalities (language models and video world models)
Agentic self-steering evaluation · method · 0.717
Method where Kimi K2.5 steers its own SAE features in real time and reports on its internal emotional state
Preventative Steering During Training · concept · 0.717
Alternative to inference-time activation capping: applying persona steering during training to deeply anchor models; cited from Chen et al.
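
The listing above can be read as the output of a similarity filter: entities whose cosine similarity to this one is at least 0.65 and that have no typed edge to it. A rough sketch of such a filter, assuming each entity has an embedding vector and typed edges are stored as unordered pairs; `similarity_candidates` and its data layout are hypothetical, not the page's actual implementation.

```python
import math
from typing import Mapping, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def similarity_candidates(target: str,
                          embeddings: Mapping[str, Sequence[float]],
                          typed_edges: set[frozenset[str]],
                          threshold: float = 0.65) -> list[tuple[str, float]]:
    """Entities similar to target that have no typed edge to it, best first."""
    results = []
    for name, vec in embeddings.items():
        if name == target:
            continue
        if frozenset((target, name)) in typed_edges:
            continue  # already linked by a typed relation
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            results.append((name, score))
    return sorted(results, key=lambda pair: pair[1], reverse=True)
```

Entities returned by this filter are the candidates for new edges or unrecognized duplicates; those already connected by a typed edge are skipped regardless of score.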