finding
active
finding:classifier-free-guidance-does-not-decrease-type-hint-rate-to-deployment-levels-activation-steering-is-more-effective
Classifier-free guidance does not decrease type hint rate to deployment levels; activation steering is more effective
Comparative result: activation steering outperforms classifier-free guidance (CFG), which was tested as an alternative intervention.
Source paper
extracted_from (2025) · Hua, Tim Tian · Qin, Andrew · Marks, Samuel · Nanda, Neel
Neighborhood — ranked by edge-count
Claims (1)
claim
- Key distinction showing steering offers value beyond prompting; supported by Figure 5 and random vector experiments.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Main result: steering elicits deployment behavior even when the evaluation cue is present and prompting fails.
- Validates using chain-of-thought belief monitoring as proxy for behavioral steering efficacy.
- Tested as an alternative to steering by magnifying the difference between evaluation and deployment prompts; found less effective than steering.
- Demonstrates steering is not equivalent to prompting with the contrastive prompts.
- Mechanism claim supported by transcript analysis and the fact that the steering vector was extracted from a model that never writes type hints.
- Central motivating question of the paper; the model organism approach is the proposed answer.
- Weakens overall setup by showing some prompts can lower type hints, but does not invalidate core steering result.
- Core validation that identified latent directions correspond to meaningful control over reflective behavior.
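
The classifier-free guidance entry above describes magnifying the difference between evaluation and deployment prompts. A minimal sketch of one common way to do this at the logit level follows; it is not the paper's implementation. The function names, the guidance weight `gamma`, and the toy two-token vocabulary are all hypothetical and chosen for illustration only.

```python
import math


def log_softmax(logits):
    """Convert raw logits to log-probabilities (numerically stable)."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def cfg_logits(deploy_logits, eval_logits, gamma):
    """Hypothetical CFG combination of two next-token distributions.

    Starts from the evaluation-prompt log-probs and moves toward the
    deployment-prompt log-probs. gamma = 0 gives the evaluation
    distribution, gamma = 1 gives the deployment distribution, and
    gamma > 1 extrapolates past it, magnifying the prompt difference.
    """
    lp_deploy = log_softmax(deploy_logits)
    lp_eval = log_softmax(eval_logits)
    return [e + gamma * (d - e) for d, e in zip(lp_deploy, lp_eval)]


# Toy vocabulary: index 0 = "no type hint" token, index 1 = "type hint" token.
deploy = [2.0, 0.0]      # deployment prompt favors no type hints
evaluation = [0.0, 2.0]  # evaluation prompt favors type hints
guided = cfg_logits(deploy, evaluation, gamma=2.0)
```

Unlike activation steering, which adds a direction to hidden states inside the model, this sketch only reweights output log-probabilities computed from two separate prompts, which is one way to see why the two interventions need not behave the same.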