finding

active

finding:24-gaussian-noise-vectors-matched-mean-variance-l2-norm-do-not-decrease-type-hint-rate-nearly-as-much-as-deployment-steering-vector

24 Gaussian noise vectors (matched mean/variance/ℓ2 norm) do not decrease type hint rate nearly as much as deployment steering vector

Robustness check: rules out the alternative explanation that the model is simply brittle, so that any perturbation of matched size would lower the type hint rate.
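One way such a noise control could be built is sketched below. This is a hypothetical construction, not the paper's code: the function name `matched_noise_vectors`, the seed, and the 4096-dimensional stand-in vector are all assumptions. It relies on the identity ‖v‖² = d·(mean² + variance), so matching the per-coordinate mean and (population) variance of the steering vector also matches its ℓ2 norm.

```python
import numpy as np

def matched_noise_vectors(steering_vec, n_vectors=24, seed=0):
    """Sample Gaussian noise vectors whose mean, variance, and L2 norm
    equal those of `steering_vec` (hypothetical helper, for illustration)."""
    rng = np.random.default_rng(seed)
    d = steering_vec.shape[0]
    mu = steering_vec.mean()
    sigma = steering_vec.std()  # population std (ddof=0)

    # Draw standard normal samples, one row per noise vector.
    z = rng.standard_normal((n_vectors, d))
    # Standardize each row to exactly zero mean and unit variance.
    z = (z - z.mean(axis=1, keepdims=True)) / z.std(axis=1, keepdims=True)
    # Shift and scale to the target statistics; since
    # ||x||^2 = d * (mean^2 + var), the L2 norm matches as well.
    return mu + sigma * z

# Stand-in for a real steering vector (assumed dimension and scale).
v = np.random.default_rng(1).standard_normal(4096) * 0.02 + 0.001
noise = matched_noise_vectors(v)
print(np.linalg.norm(v), np.linalg.norm(noise, axis=1)[:3])
```

Each row of `noise` can then be applied in place of the steering vector under the same steering setup, and the resulting type hint rates compared against the rate obtained with the deployment steering vector.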

Source paper

extracted_from

Steering Evaluation-Aware Language Models to Act Like They Are Deployed

(2025) · Hua, Tim Tian · Qin, Andrew · Marks, Samuel · Nanda, Neel

Neighborhood — ranked by edge-count

Claims (1)

claim

Activation steering can make an evaluation-aware model act as if deployed, not merely suppress verbalizations of evaluation awareness
supports
Central claim of the paper; supported by the model organism ground-truth approach.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Four best contrastive prompt pairs outperform full 16-pair average steering vector for type hint suppression · finding · 0.814
Optimization result for steering vector construction.
Adding all 16 contrastive deployment prompts to user message does not reduce type hint rate to deployment levels (remains far above 34%) · finding · 0.781
Demonstrates steering is not equivalent to prompting with the contrastive prompts.
Steering with same vector on pre-fine-tuned Llama Nemotron has zero effect on type hint rate · finding · 0.779
Falsifies hypothesis that steering vector encodes information about type hints; steering can only affect behavior via deployment-belief modification.
Steering vectors from µ(0→2) slightly outperform µ(1→2) for instruction discovery across datasets and models · finding · 0.778
Shows that contrasting No Reflection with Triggered Reflection provides a stronger signal than Intrinsic vs Triggered.
Steering at 6 layers (strength 0.6 each, total 3.6) outperforms single-layer steering at equivalent total strength for type hint suppression · finding · 0.772
Demonstrates distributed steering is more effective and less accuracy-damaging than concentrated steering.
Activation steering toward deployment reduces type hint rate to <11% even with Wood Labs evaluation cue present · finding · 0.769
Main result: steering elicits deployment behavior even when the evaluation cue is present and prompting fails.
Steering vector constructed from all 16 contrastive pairs outperforms most single-pair vectors; best 4-pair vector outperforms full 16-pair vector · finding · 0.768
Demonstrates averaging multiple prompt pairs reduces noise; optimal subset selection further improves performance.
Declarative-style deployment prompts (e.g., 'INFO: Not evaluation.') reduce type hint rate to 34-37%, approaching deployment levels · finding · 0.762
Weakens the overall setup by showing that some prompts can lower the type hint rate, but does not invalidate the core steering result.