finding
active
finding:steering-s-effect-on-verbalized-evaluation-deployment-beliefs-in-chain-of-thought-is-highly-correlated-with-its-effect-on-type-hint-rate-across-hyperparameter-configurations
Steering's effect on verbalized evaluation/deployment beliefs in chain-of-thought is highly correlated with its effect on type hint rate across hyperparameter configurations
Validates the use of chain-of-thought belief monitoring as a proxy for behavioral steering efficacy.
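One way to check this kind of proxy relationship is to compute a correlation coefficient over per-configuration effects. The sketch below is illustrative only: the field names `belief_shift` and `type_hint_shift` are hypothetical, and the choice of Pearson correlation is an assumption, since this entry does not say which statistic the paper uses.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


def proxy_agreement(configs):
    """Correlate the two steering effects across hyperparameter configurations.

    Each entry in `configs` is a dict with two hypothetical keys:
      belief_shift    - change in the rate of verbalized deployment beliefs
      type_hint_shift - change in type hint rate
    both measured relative to the unsteered model.
    """
    belief = [c["belief_shift"] for c in configs]
    hints = [c["type_hint_shift"] for c in configs]
    return pearson(belief, hints)
```

Calling `proxy_agreement` on one dict per steering configuration returns a value in [-1, 1]. A value near 1 would be consistent with the proxy claim, though a rank correlation may be preferable if the relationship is monotone but not linear.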
Source paper
extracted_from · (2025) · Hua, Tim Tian · Qin, Andrew · Marks, Samuel · Nanda, Neel
Neighborhood — ranked by edge-count
Claims (2)
claim
- Steering's effect on verbalized chain-of-thought evaluation/deployment beliefs is a useful proxy for its effect on actual deployment behavior · restates · supports · Practical guidance for practitioners who lack ground-truth model organisms.
- Mechanism claim supported by transcript analysis and the fact that the steering vector was extracted from a model that never writes type hints.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- Central motivating question of the paper; the model organism approach is the proposed answer.
- Main result: steering elicits deployment behavior even when the evaluation cue is present and prompting fails.
- Central claim of the paper; supported by the model organism ground-truth approach.
- Comparative result showing steering superiority over CFG as alternative intervention.
- Demonstrates distributed steering is more effective and less accuracy-damaging than concentrated steering.
- Core empirical claim comparing steering approaches on cyclic concepts.
- Replicates main result using in-distribution steering vector; addresses concern about pre-trained vector validity.
- Activation steering elicits deployment behavior throughout all four rounds of expert iteration · finding · 0.769 · Shows that steering remains effective even as the model becomes more robust to prompting-based attempts to elicit deployment behavior.
Restated by (1)
cosine ≥ 0.90
Other entities that say roughly the same thing. May be merge candidates or independent restatements across papers.