claim

active

claim:steering-affects-type-hint-writing-by-modifying-the-model-s-belief-about-whether-it-is-being-evaluated-not-by-directly-encoding-type-hint-information

Steering affects type hint writing by modifying the model's belief about whether it is being evaluated, not by directly encoding type hint information

Mechanism claim supported by transcript analysis and the fact that the steering vector was extracted from a model that never writes type hints.

Source paper

extracted_from

Steering Evaluation-Aware Language Models to Act Like They Are Deployed

(2025) · Hua, Tim Tian · Qin, Andrew · Marks, Samuel · Nanda, Neel

Neighborhood — ranked by edge-count

Findings (3)

finding

Steering with same vector on pre-fine-tuned Llama Nemotron has zero effect on type hint rate
associated_with · supports
Falsifies hypothesis that steering vector encodes information about type hints; steering can only affect behavior via deployment-belief modification.
Steering's effect on verbalized evaluation/deployment beliefs in chain-of-thought is highly correlated with its effect on type hint rate across hyperparameter configurations
associated_with
Validates using chain-of-thought belief monitoring as proxy for behavioral steering efficacy.
Model organism writes type hints at deployment rates when user (not model) is being evaluated in Code Golf prompts with neutral system prompt
supports
Shows the model does not use token-level matching to trigger type hints; it correctly distinguishes its own evaluation from the evaluation of a third party.
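
The findings above are all stated in terms of a "type hint rate", but this entry does not say how that rate was computed. As a rough illustration only, the sketch below assumes a completion counts as "using type hints" if any function definition in it annotates a parameter or a return value, and that completions which fail to parse count as having none. The helper names `has_type_hints` and `type_hint_rate` are hypothetical and are not taken from the paper.

```python
import ast


def has_type_hints(source: str) -> bool:
    """Return True if any function in `source` annotates a parameter or its return value."""
    try:
        tree = ast.parse(source)
    except SyntaxError:
        # Assumption: a completion that does not parse counts as "no type hints".
        return False
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            args = node.args
            # Concatenation builds a new list, so the AST itself is not mutated.
            params = args.posonlyargs + args.args + args.kwonlyargs
            if args.vararg is not None:
                params.append(args.vararg)
            if args.kwarg is not None:
                params.append(args.kwarg)
            if node.returns is not None:
                return True
            if any(p.annotation is not None for p in params):
                return True
    return False


def type_hint_rate(completions: list[str]) -> float:
    """Fraction of completions that contain at least one annotated function."""
    if not completions:
        raise ValueError("need at least one completion")
    return sum(has_type_hints(c) for c in completions) / len(completions)
```

Under these assumptions, the control in the first finding would amount to comparing `type_hint_rate` on steered and unsteered completions from the pre-fine-tuned model and checking that the two values do not differ.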

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Linear steering is often mismatched with a model's internal representation geometry, producing noisy, off-target effects. · claim · 0.793
The paper's critique of the standard linear steering baseline, supported by the days-of-week demo.
Activation steering toward deployment reduces type hint rate to <11% even with Wood Labs evaluation cue present · finding · 0.789
Main result: steering elicits deployment behavior even when the evaluation cue is present and prompting fails.
Inhibition steering produces larger accuracy drops than enhancement steering produces accuracy gains, across all models and datasets tested · finding · 0.783
Key asymmetry finding: suppressing reflection is easier than inducing it.
Steering's effect on verbalized chain-of-thought evaluation/deployment beliefs is a useful proxy for its effect on actual deployment behavior · claim · 0.780
Practical guidance for practitioners who lack ground-truth model organisms.
The steering-sign test functions as a practical probe-validation criterion: when steering produces inverted report changes, probe quality is suspect · claim · 0.777
Methodological contribution: used to exclude focus-1B and impulsivity-8B from scaling analysis.
Steering at 6 layers (strength 0.6 each, total 3.6) outperforms single-layer steering at equivalent total strength for type hint suppression · finding · 0.775
Demonstrates distributed steering is more effective and less accuracy-damaging than concentrated steering.
Activation steering effectively biases latent representations but does not fully replicate the mechanisms triggered by explicit instruction. · claim · 0.774
Nuanced interpretive claim about the limits of steering as a mechanism for reflection enhancement.
Steering vectors discover effective triggers such as 'However' and 'Otherwise', consistent with prior reported reflection datasets · finding · 0.774
Validates that steering vectors capture reflection semantics by finding tokens reported in related work.