question
active
What shared semantic or qualitative factor explains why some steering vectors produce more salient and detectable perturbations than others?
Open question arising from the 100% accuracy observed on specific concept-layer-strength combinations.
Source paper
extracted_from · (2025) · Ely Hahami · I. N. Sinha · Lavik Jain · Josh Kaplan +1
Neighborhood — ranked by edge-count
Claims (1)
claim
- Observation that 100% accuracy on specific concept-layer-strength combinations suggests concept-specific detectability.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- Validates that steering vectors capture reflection semantics by finding tokens reported in related work.
- Supported by the instruction discovery experiments comparing steering vs. embedding baselines.
- Core applied contribution claim, supported by top-k accuracy comparisons.
- The paper's critique of the standard linear steering baseline, supported by the days-of-week demo.
- Steering vectors used to reduce eval awareness can inadvertently introduce alternative user personas (finding, 0.780): a side effect observed when applying activation steering, in which the model's response persona changed unexpectedly.
- Demonstrates averaging multiple prompt pairs reduces noise; optimal subset selection further improves performance.
- Core empirical claim comparing steering approaches on cyclic concepts.
- Empirical demonstration on Llama-3.1-8B that steering along representation manifold aligns outputs with behavior manifold, whereas linear steering does not.
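
The similarity neighborhood above is described as entities with cosine ≥ 0.65 and no typed edge to this one. A minimal sketch of how such a filter might work is given below. The embeddings, entity ids, and the `similar_untyped` helper are all hypothetical and purely illustrative; they are not taken from the system that generated this page.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def similar_untyped(target_id, embeddings, typed_edges, threshold=0.65):
    """Return (entity_id, score) pairs, highest score first, for entities
    whose cosine similarity to the target meets the threshold and which
    have no typed edge to the target."""
    target = embeddings[target_id]
    hits = []
    for entity_id, vec in embeddings.items():
        if entity_id == target_id:
            continue  # skip the entity itself
        if frozenset((target_id, entity_id)) in typed_edges:
            continue  # already linked by a typed relation
        score = cosine(target, vec)
        if score >= threshold:
            hits.append((entity_id, score))
    return sorted(hits, key=lambda pair: pair[1], reverse=True)

# Toy data (assumed for illustration only).
embeddings = {
    "question": [1.0, 0.0],
    "finding_a": [0.8, 0.6],  # similar, no typed edge
    "finding_b": [0.0, 1.0],  # orthogonal, below threshold
    "claim_c": [1.0, 0.1],    # similar, but already has a typed edge
}
typed_edges = {frozenset(("question", "claim_c"))}

neighbors = similar_untyped("question", embeddings, typed_edges)
print(neighbors)
```

In this toy setup only `finding_a` survives both filters: `finding_b` falls below the threshold and `claim_c` is excluded because it already has a typed relation to the question.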