claim
active
claim:some-steering-vectors-produce-more-salient-perturbations-than-others-perhaps-based-on-shared-semantic-or-qualitative-factors
Some steering vectors produce more salient perturbations than others, perhaps based on shared semantic or qualitative factors
Observation: 100% accuracy on specific concept-layer-strength combinations suggests that detectability is concept-specific.
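One way to surface such combinations is to group trial outcomes by (concept, layer, strength) and list the groups with perfect accuracy. The sketch below is illustrative only: the record layout, the function names, and the toy trials are assumptions and are not taken from the source paper.

```python
from collections import defaultdict

# Hypothetical trial records: (concept, layer, strength, detected_correctly).
# These values are made up for illustration; they are not results from the paper.
trials = [
    ("ocean", 12, 4.0, True),
    ("ocean", 12, 4.0, True),
    ("justice", 12, 4.0, True),
    ("justice", 12, 4.0, False),
    ("ocean", 20, 8.0, False),
    ("ocean", 20, 8.0, True),
]

def accuracy_by_combination(records):
    """Map each (concept, layer, strength) key to its fraction of correct trials."""
    counts = defaultdict(lambda: [0, 0])  # key -> [correct, total]
    for concept, layer, strength, correct in records:
        entry = counts[(concept, layer, strength)]
        entry[0] += int(correct)
        entry[1] += 1
    return {key: c / t for key, (c, t) in counts.items()}

def perfect_combinations(records):
    """Return the sorted combinations whose accuracy is exactly 1.0."""
    accuracies = accuracy_by_combination(records)
    return sorted(key for key, acc in accuracies.items() if acc == 1.0)

perfect = perfect_combinations(trials)
```

With only a handful of trials per combination, a perfect score can arise by chance, so the number of trials behind each key is worth reporting alongside the accuracy.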
Source paper
extracted_from · (2025) · Ely Hahami · I. N. Sinha · Jain, Lavik · Kaplan, Josh +1
Neighborhood — ranked by edge-count
Findings (1)
finding
- Demonstrates concept-specific variation in introspective salience, suggesting some vectors produce more detectable perturbations
Questions (1)
question
- Open question arising from the 100% accuracy on specific concept-layer-strength combinations
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Validates that steering vectors capture reflection semantics by finding tokens reported in related work.
- Supported by the instruction discovery experiments comparing steering vs. embedding baselines.
- Demonstrates averaging multiple prompt pairs reduces noise; optimal subset selection further improves performance.
- The paper's critique of the standard linear steering baseline, supported by the days-of-week demo.
- A method for modifying model behavior by adding perturbation vectors to activations, used here to try to reduce eval awareness.
- Core applied contribution claim, supported by top-k accuracy comparisons.
- Core empirical claim comparing steering approaches on cyclic concepts.
- Steering vectors used to reduce eval awareness can inadvertently introduce alternative user personas · finding · 0.791 · A side effect observed when applying activation steering: the model's response persona changed unexpectedly.
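Several entries in this neighborhood describe activation steering as adding a perturbation vector to a model's activations. A minimal sketch of that single step, assuming the vector is scaled by a scalar strength before being added (the function name `steer` and the toy values are hypothetical, and no particular model or library is implied):

```python
def steer(activation, steering_vector, strength):
    """Return activation + strength * steering_vector, computed elementwise."""
    if len(activation) != len(steering_vector):
        raise ValueError("activation and steering vector must have the same length")
    return [a + strength * v for a, v in zip(activation, steering_vector)]

# Toy 3-dimensional example; real activations have many more dimensions.
hidden = [0.5, -1.0, 2.0]
vector = [1.0, 0.0, -1.0]
steered = steer(hidden, vector, strength=2.0)
```

In practice the addition is applied to a chosen layer's activations during the forward pass, which is why layer and strength appear next to concept in the combinations this claim refers to.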