finding
active
finding:steering-with-same-vector-on-pre-fine-tuned-llama-nemotron-has-zero-effect-on-type-hint-rate
Steering with the same vector on pre-fine-tuned Llama Nemotron has zero effect on type hint rate
Falsifies the hypothesis that the steering vector directly encodes information about type hints; steering can affect type hint behavior only by modifying the model's belief about whether it is deployed.
Source paper
extracted_from · (2025) · Hua, Tim Tian · Qin, Andrew · Marks, Samuel · Nanda, Neel
Neighborhood — ranked by edge-count
Claims (2)
claim
- Steering affects type hint writing by modifying the model's belief about whether it is being evaluated, not by directly encoding type hint information · associated_with · supports · Mechanism claim supported by transcript analysis and the fact that the steering vector was extracted from a model that never writes type hints.
- Methodological justification for using SDF over direct demonstrations to train a realistic model organism.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Shows that the behavioral pattern of self-correction is trainable in smaller models.
- Key limitation acknowledged by authors.
- Empirical result demonstrating the failure mode of linear steering when concept geometry is cyclic.
- Robustness check ruling out that any perturbation would decrease type hint rate due to brittleness.
- Illustrative finding that ESR mitigates but does not fully eliminate steering influence.
- Demonstrates distributed steering is more effective and less accuracy-damaging than concentrated steering.
- Demonstrates that ESR can be deliberately enhanced through prompting in the largest model.
- Core empirical result demonstrating that manifold steering produces on-target, behavior-aligned outputs.