claim
active
claim:steering-s-effect-on-verbalized-chain-of-thought-evaluation-deployment-beliefs-is-a-useful-proxy-for-its-effect-on-actual-deployment-behavior
Steering's effect on verbalized chain-of-thought evaluation/deployment beliefs is a useful proxy for its effect on actual deployment behavior
Practical guidance for practitioners who lack ground-truth model organisms.
Source paper
extracted_from (2025) · Hua, Tim Tian · Qin, Andrew · Marks, Samuel · Nanda, Neel
Neighborhood — ranked by edge-count
Findings (1)
finding
- Validates using chain-of-thought belief monitoring as a proxy for behavioral steering efficacy.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Central motivating question of the paper; the model organism approach is the proposed answer.
- Central claim of the paper; supported by the model organism ground-truth approach.
- Methodological claim distinguishing this paper from prior work on verbalization suppression.
- Activation steering elicits deployment behavior throughout all four rounds of expert iteration (finding, cosine 0.793): Shows steering remains effective even as the model becomes more robust to prompting-based attempts to elicit deployment behavior.
- Replicates main result using in-distribution steering vector; addresses concern about pre-trained vector validity.
- Core question motivating the shift from linear to geometry-aware steering; answered via manifold alignment analysis.
- Mechanism claim supported by transcript analysis and the fact that the steering vector was extracted from a model that never writes type hints.
- Future work direction: the inverse problem to the Wood Labs evaluation cue tested in this paper.
Restated by (1)
cosine ≥ 0.90
Other entities that say roughly the same thing. May be merge candidates or independent restatements across papers.