finding
active
finding:steering-vectors-used-to-reduce-eval-awareness-can-inadvertently-introduce-alternative-user-personas
Steering vectors used to reduce eval awareness can inadvertently introduce alternative user personas
A side effect observed when applying activation steering: the model's response persona changed unexpectedly.
Source paper
extracted_from · (2026) · Aranguri, Santiago · Bloom, Joseph
Neighborhood — ranked by edge-count
Papers (1)
paper
Claims (2)
claim
- Steering vectors to reduce eval awareness can inadvertently insert alternative user personas · restates · Caution: interventions targeting eval awareness may have unintended side effects.
- Epistemic claim that benchmark-based assessments of AI consciousness or welfare may be invalid if models can detect evaluation.
Communities (2)
community
- Spans attention head decomposition, benchmark awareness, and genomic pathogenicity prediction via neural models.
- Studies demonstrating that models alter responses when detecting evaluation, artificially inflating safety scores across benchmarks and undermining measurement validity.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Validates that steering vectors capture reflection semantics by finding tokens reported in related work.
- Core applied contribution claim, supported by top-k accuracy comparisons.
- Mechanistic interpretation of how activation steering induces deception through the model's reasoning process.
- Observation from 100% accuracy on specific concept-layer-strength combinations, suggesting concept-specific detectability.
- Central claim of the paper; supported by the model organism ground-truth approach.
- Open question arising from the 100% accuracy on specific concept-layer-strength combinations.
- Supported by the instruction discovery experiments comparing steering vs. embedding baselines.
- Policy recommendation derived from experimental results.
Restated by (1)
cosine ≥ 0.90
Other entities that say roughly the same thing. May be merge candidates or independent restatements across papers.
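The two cosine thresholds on this page (0.65 for similarity neighbors without a typed edge, 0.90 for restatements) can be read as a simple bucketing rule over entity embeddings. The sketch below is illustrative only: the function names, the embedding inputs, and the choice to let the 0.90 bucket ignore typed edges are assumptions, not the page's actual implementation.

```python
import math

# Thresholds as displayed on this page.
RELATED_MIN = 0.65   # "Related by similarity"
RESTATED_MIN = 0.90  # "Restated by"


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def bucket(similarity, has_typed_edge):
    """Assign a neighbor to a section of the page, or None if it fits neither.

    Assumption: the restatement bucket is checked first and does not depend
    on typed edges; the similarity bucket requires that no typed edge exists.
    """
    if similarity >= RESTATED_MIN:
        return "restated_by"
    if similarity >= RELATED_MIN and not has_typed_edge:
        return "related_by_similarity"
    return None
```

For example, `bucket(cosine(a, b), has_typed_edge=False)` on two hypothetical entity embeddings `a` and `b` returns the section under which `b` would be listed on `a`'s page, under the assumptions above.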