finding

active

finding:steering-with-same-vector-on-pre-fine-tuned-llama-nemotron-has-zero-effect-on-type-hint-rate

Steering with same vector on pre-fine-tuned Llama Nemotron has zero effect on type hint rate

Falsifies the hypothesis that the steering vector directly encodes information about type hints; steering can therefore affect type hint behavior only by modifying the model's belief about whether it is deployed.
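The "type hint rate" in this finding is the fraction of model responses whose code contains type annotations. The sketch below shows one way such a rate could be computed. It is an illustration under stated assumptions, not the paper's actual metric: the helper names `has_type_hints` and `type_hint_rate` are hypothetical, and the rule used here (a response counts if any function has an annotated parameter or return type) is an assumed definition.

```python
import ast


def has_type_hints(source: str) -> bool:
    """Return True if any function in `source` annotates a parameter or its return type.

    Assumed definition for illustration; the paper may score responses differently.
    """
    try:
        tree = ast.parse(source)
    except SyntaxError:
        # Unparseable responses are counted as having no type hints.
        return False
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            args = node.args
            all_args = args.posonlyargs + args.args + args.kwonlyargs
            if args.vararg is not None:
                all_args.append(args.vararg)
            if args.kwarg is not None:
                all_args.append(args.kwarg)
            if node.returns is not None:
                return True
            if any(arg.annotation is not None for arg in all_args):
                return True
    return False


def type_hint_rate(responses: list[str]) -> float:
    """Fraction of responses that contain at least one type-hinted function."""
    if not responses:
        return 0.0
    return sum(has_type_hints(r) for r in responses) / len(responses)
```

Under this definition, the control described above would compare `type_hint_rate` on responses from the pre-fine-tuned model with and without steering and expect no difference.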

Source paper

extracted_from

Steering Evaluation-Aware Language Models to Act Like They Are Deployed

(2025) · Hua, Tim Tian · Qin, Andrew · Marks, Samuel · Nanda, Neel

Neighborhood — ranked by edge-count

Claims (2)

claim

Steering affects type hint writing by modifying the model's belief about whether it is being evaluated, not by directly encoding type hint information
associated_with · supports
Mechanism claim supported by transcript analysis and the fact that the steering vector was extracted from a model that never writes type hints.
Synthetic document fine-tuning avoids artificially strengthening the evaluation-deployment representational direction compared to direct demonstration fine-tuning
supports
Methodological justification for using SDF over direct demonstrations to train a realistic model organism.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Fine-tuning Llama-3.1-8B on self-correction examples increases multi-attempt rate proportionally with training data ratio · finding · 0.795
Shows behavioral pattern of self-correction is trainable in smaller models
Results may not fully generalize to all models and scenarios, because the model organism relies on hints and nudges and Llama Nemotron cannot consistently distinguish evaluation from deployment based on subtle cues · claim · 0.786
Key limitation acknowledged by authors.
Linear steering on Llama-3.1 8B for the days-of-week task cuts across the behavior manifold, producing noisy off-target effects where predicted tokens are not even days of the week. · finding · 0.785
Empirical result demonstrating the failure mode of linear steering when concept geometry is cyclic.
24 Gaussian noise vectors (matched mean/variance/ℓ2 norm) do not decrease type hint rate nearly as much as the deployment steering vector · finding · 0.779
Robustness check ruling out that any perturbation would decrease type hint rate due to brittleness.
Llama-3.3-70B corrected response scores 75/100 rather than 100 due to residual steering effects (Snell's law reference) · finding · 0.778
Illustrative finding that ESR mitigates but does not fully eliminate steering influence
Steering at 6 layers (strength 0.6 each, total 3.6) outperforms single-layer steering at equivalent total strength for type hint suppression · finding · 0.767
Demonstrates distributed steering is more effective and less accuracy-damaging than concentrated steering.
Meta-prompting increases Llama-3.3-70B multi-attempt rate 4.3× (from 7.4% to 31.7%) · finding · 0.761
Demonstrates ESR can be deliberately enhanced through prompting in the largest model
Steering Llama-3.1 8B along the circular representation manifold produces outputs that follow the natural circle of the behavior manifold, cleanly shifting probability mass from Monday through successive days. · finding · 0.761
Core empirical result demonstrating that manifold steering produces on-target, behavior-aligned outputs.
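
The Gaussian-noise robustness check listed above compares the deployment steering vector against random vectors with matched mean, variance, and ℓ2 norm. The sketch below shows one way such baseline vectors could be drawn. It is an assumption about the procedure, not the authors' code: the function name `matched_noise_vectors` is hypothetical, and the final rescaling step makes the ℓ2 norm exact at the cost of shifting the sample mean and variance slightly.

```python
import math
import random
import statistics


def matched_noise_vectors(vector, count=24, seed=0):
    """Draw `count` Gaussian vectors matched to `vector` in mean, variance, and l2 norm.

    Each entry is sampled from N(mean(vector), std(vector)^2); the sample is then
    rescaled so its l2 norm equals that of `vector` exactly (assumed procedure).
    """
    rng = random.Random(seed)
    mean = statistics.fmean(vector)
    std = statistics.pstdev(vector)
    target_norm = math.sqrt(sum(x * x for x in vector))

    samples = []
    for _ in range(count):
        noise = [rng.gauss(mean, std) for _ in vector]
        noise_norm = math.sqrt(sum(x * x for x in noise))
        scale = target_norm / noise_norm
        samples.append([x * scale for x in noise])
    return samples
```

Each returned vector could then be applied in place of the steering vector at the same layers and strength, with the resulting type hint rates compared against the rate under deployment steering.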