finding
active
finding:24-gaussian-noise-vectors-matched-mean-variance-l2-norm-do-not-decrease-type-hint-rate-nearly-as-much-as-deployment-steering-vector
24 Gaussian noise vectors (matched mean/variance/ℓ2 norm) do not decrease type hint rate nearly as much as deployment steering vector
Robustness check ruling out the possibility that any perturbation of comparable size would decrease the type hint rate simply because the behavior is brittle.
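The record does not say how the noise vectors were matched to the steering vector, so the following is only a minimal sketch of one way such a control could be built. It assumes the statistics are taken over the coordinates of a single flat vector; the function name `matched_gaussian_noise` and the stand-in `steering` vector are hypothetical. Under that assumption, matching the mean and (population) variance also fixes the ℓ2 norm, since ‖v‖² = d·(variance + mean²).

```python
import math
import random


def matched_gaussian_noise(reference, n_vectors=24, seed=0):
    """Draw Gaussian vectors whose per-vector mean, population variance,
    and hence L2 norm match those of `reference` (a flat list of floats)."""
    rng = random.Random(seed)
    d = len(reference)
    ref_mean = sum(reference) / d
    ref_std = math.sqrt(sum((x - ref_mean) ** 2 for x in reference) / d)

    vectors = []
    for _ in range(n_vectors):
        z = [rng.gauss(0.0, 1.0) for _ in range(d)]
        z_mean = sum(z) / d
        z_std = math.sqrt(sum((x - z_mean) ** 2 for x in z) / d)
        # Standardize the draw, then rescale and shift to the reference statistics.
        vectors.append([ref_mean + ref_std * (x - z_mean) / z_std for x in z])
    return vectors


def l2_norm(v):
    """Euclidean norm of a flat list of floats."""
    return math.sqrt(sum(x * x for x in v))


# Hypothetical stand-in for a steering vector; a real one would come from
# model activations, not from a random generator.
_demo_rng = random.Random(1)
steering = [_demo_rng.gauss(0.02, 0.5) for _ in range(512)]

# 24 control vectors, matching the count named in this finding.
noise = matched_gaussian_noise(steering)
```

Each control vector would then be applied in place of the steering vector, and the resulting type hint rates compared. If only the norm needed to match, rescaling a raw Gaussian draw to the target norm would be a simpler alternative.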
Source paper
extracted_from · (2025) · Hua, Tim Tian · Qin, Andrew · Marks, Samuel · Nanda, Neel
Neighborhood — ranked by edge-count
Claims (1)
claim
- Central claim of the paper; supported by the model organism ground-truth approach.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Four best contrastive prompt pairs outperform full 16-pair average steering vector for type hint suppression · finding · 0.814 · Optimization result for steering vector construction.
- Demonstrates steering is not equivalent to prompting with the contrastive prompts.
- Steering with same vector on pre-fine-tuned Llama Nemotron has zero effect on type hint rate · finding · 0.779 · Falsifies hypothesis that steering vector encodes information about type hints; steering can only affect behavior via deployment-belief modification.
- Steering vectors from µ(0→2) slightly outperform µ(1→2) for instruction discovery across datasets and models · finding · 0.778 · Shows that contrasting No Reflection with Triggered Reflection provides a stronger signal than Intrinsic vs Triggered.
- Demonstrates distributed steering is more effective and less accuracy-damaging than concentrated steering.
- Main result: steering elicits deployment behavior even when the evaluation cue is present and prompting fails.
- Demonstrates averaging multiple prompt pairs reduces noise; optimal subset selection further improves performance.
- Weakens overall setup by showing some prompts can lower type hints, but does not invalidate core steering result.