question

active

question:what-shared-semantic-or-qualitative-factor-explains-why-some-steering-vectors-produce-more-salient-and-detectable-perturbations-than-others

What shared semantic or qualitative factor explains why some steering vectors produce more salient and detectable perturbations than others?

Open question arising from the 100% detection accuracy observed for specific concept-layer-strength combinations.

Source paper

extracted_from

Detecting the Disturbance: A Nuanced View of Introspective Abilities in LLMs

(2025) · Ely Hahami · I. N. Sinha · Lavik Jain · Josh Kaplan +1

Neighborhood — ranked by edge-count

Claims (1)

claim

Some steering vectors produce more salient perturbations than others, perhaps based on shared semantic or qualitative factors
gates
The observed 100% accuracy on specific concept-layer-strength combinations suggests that detectability is concept-specific.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Steering vectors discover effective triggers such as 'However' and 'Otherwise', consistent with previously reported reflection datasets · finding · 0.818
Validates that steering vectors capture reflection semantics by finding tokens reported in related work.
Steering vectors capture latent dimensions of reflective behavior more faithfully than surface-level embedding similarity. · claim · 0.802
Supported by the instruction discovery experiments comparing steering vs. embedding baselines.
Steering vectors enable systematic discovery of reflection-inducing instructions beyond trial-and-error prompt design. · claim · 0.788
Core applied contribution claim, supported by top-k accuracy comparisons.
Linear steering is often mismatched with a model's internal representation geometry, producing noisy, off-target effects. · claim · 0.780
The paper's critique of the standard linear steering baseline, supported by the days-of-week demo.
Steering vectors used to reduce eval awareness can inadvertently introduce alternative user personas · finding · 0.780
A side effect observed when applying activation steering: the model's response persona changed unexpectedly.
A steering vector constructed from all 16 contrastive pairs outperforms most single-pair vectors; the best 4-pair vector outperforms the full 16-pair vector · finding · 0.772
Demonstrates averaging multiple prompt pairs reduces noise; optimal subset selection further improves performance.
Linear steering produces noisy off-target effects; manifold steering cleanly shifts probability mass between sequential concepts. · finding · 0.770
Core empirical claim comparing steering approaches on cyclic concepts.
Manifold steering produces clean probability shifts along the natural behavior structure; linear steering cuts across the manifold and produces noisy, off-target effects · finding · 0.768
Empirical demonstration on Llama-3.1-8B that steering along the representation manifold aligns outputs with the behavior manifold, whereas linear steering does not.
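
The "Related by similarity" list above is described as entities with cosine ≥ 0.65 and no typed edge. How this page actually computes it is not stated, so the following is only a minimal sketch of that kind of filter. The function names, the dictionary of embeddings, and the representation of typed edges as unordered pairs are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def similar_without_edge(target_id, embeddings, typed_edges, threshold=0.65):
    """Return (entity_id, score) pairs at or above the threshold, highest score first.

    embeddings:  hypothetical mapping of entity id -> embedding vector
    typed_edges: hypothetical set of frozenset({id_a, id_b}) pairs that already
                 have a typed relation and are therefore excluded
    """
    target_vec = embeddings[target_id]
    results = []
    for other_id, vec in embeddings.items():
        if other_id == target_id:
            continue  # skip the entity itself
        if frozenset((target_id, other_id)) in typed_edges:
            continue  # already linked by a typed edge
        score = cosine(target_vec, vec)
        if score >= threshold:
            results.append((other_id, score))
    results.sort(key=lambda pair: pair[1], reverse=True)
    return results
```

Under these assumptions, an entity that is highly similar but already linked (for example the claim connected by "gates") would be left out, while unlinked entities above the threshold would surface as candidates for new edges or duplicates.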