claim

active

claim:some-steering-vectors-produce-more-salient-perturbations-than-others-perhaps-based-on-shared-semantic-or-qualitative-factors

Some steering vectors produce more salient perturbations than others, perhaps based on shared semantic or qualitative factors

Observed from the 100% localization accuracy reached on specific concept-layer-strength combinations, which suggests that detectability is concept-specific

Source paper

extracted_from

Detecting the Disturbance: A Nuanced View of Introspective Abilities in LLMs

(2025) · Ely Hahami · I. N. Sinha · Lavik Jain · Josh Kaplan +1

Neighborhood — ranked by edge-count

Findings (1)

finding

Illusions vector at layer 1 α=2, Origami vector at layer 0 α=2, and recursion vector at layer 2 α=5 each achieve 100% localization accuracy across 50 trials
supports
Demonstrates concept-specific variation in introspective salience, suggesting some vectors produce more detectable perturbations

Questions (1)

question

What shared semantic or qualitative factor explains why some steering vectors produce more salient and detectable perturbations than others?
gates
Open question arising from the 100% accuracy on specific concept-layer-strength combinations

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.
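
The selection rule described above (cosine similarity of at least 0.65, no typed edge to this entity) can be sketched as follows. This is a minimal illustration, not the graph's actual implementation: the function names, the `embeddings` dictionary, and the `typed_edges` set of unordered ID pairs are all hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def related_candidates(target_id, embeddings, typed_edges, threshold=0.65):
    """Return (entity_id, score) pairs that are similar to the target but unlinked.

    embeddings:  dict mapping entity id -> embedding vector (hypothetical store)
    typed_edges: set of frozenset({id_a, id_b}) for pairs that already share
                 a typed relation (hypothetical representation)
    """
    target = embeddings[target_id]
    candidates = []
    for entity_id, vector in embeddings.items():
        if entity_id == target_id:
            continue  # never compare the entity with itself
        if frozenset((target_id, entity_id)) in typed_edges:
            continue  # already connected by a typed edge
        score = cosine(target, vector)
        if score >= threshold:
            candidates.append((entity_id, score))
    # Highest similarity first, matching the ordering of the list below.
    return sorted(candidates, key=lambda pair: pair[1], reverse=True)
```

Entities returned this way are only candidates: a high score says the texts are close in embedding space, not that a typed relation actually holds.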

Steering vectors discover effective triggers such as 'However' and 'Otherwise', consistent with prior reported reflection datasets · finding · 0.838
Validates that steering vectors capture reflection semantics by finding tokens reported in related work.
Steering vectors capture latent dimensions of reflective behavior more faithfully than surface-level embedding similarity. · claim · 0.818
Supported by the instruction discovery experiments comparing steering vs. embedding baselines.
Steering vector constructed from all 16 contrastive pairs outperforms most single-pair vectors; best 4-pair vector outperforms full 16-pair vector · finding · 0.812
Demonstrates averaging multiple prompt pairs reduces noise; optimal subset selection further improves performance.
Linear steering is often mismatched with a model's internal representation geometry, producing noisy, off-target effects. · claim · 0.808
The paper's critique of the standard linear steering baseline, supported by the days-of-week demo.
steering vectors · concept · 0.799
A method for modifying model behavior by adding perturbation vectors to activations, used here to try to reduce eval awareness.
Steering vectors enable systematic discovery of reflection-inducing instructions beyond trial-and-error prompt design. · claim · 0.796
Core applied contribution claim, supported by top-k accuracy comparisons.
Linear steering produces noisy off-target effects; manifold steering cleanly shifts probability mass between sequential concepts. · finding · 0.796
Core empirical claim comparing steering approaches on cyclic concepts.
Steering vectors used to reduce eval awareness can inadvertently introduce alternative user personas · finding · 0.791
A side effect observed when applying activation steering: the model's response persona changed unexpectedly.
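
The "steering vectors" entry above describes adding a perturbation vector to activations, and the finding on this page reports a strength α per vector. A minimal sketch of that additive form follows, assuming the common formulation h' = h + α·v applied to one activation vector. The function names are hypothetical, and the source paper's exact procedure (for example, whether directions are normalized) is not stated here.

```python
import math

def unit(direction):
    """Scale a direction to unit length (an assumed, optional preprocessing step)."""
    norm = math.sqrt(sum(x * x for x in direction))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in direction]

def steer(hidden, direction, alpha):
    """Return hidden + alpha * direction for a single activation vector.

    hidden:    activation vector at the chosen layer (list of floats)
    direction: steering vector of the same length
    alpha:     steering strength (the α in the finding above)
    """
    if len(hidden) != len(direction):
        raise ValueError("hidden and direction must have the same length")
    return [h + alpha * d for h, d in zip(hidden, direction)]

# Toy usage with made-up numbers: push a 2-d activation along the second axis.
steered = steer([1.0, 2.0], unit([0.0, 3.0]), alpha=2)
```

Under this formulation, the size of the perturbation relative to the unsteered activation depends on both α and the norm of the direction, which is one reason the same α need not be equally salient across concepts or layers.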