finding

active

finding:steering-vectors-used-to-reduce-eval-awareness-can-inadvertently-introduce-alternative-user-personas

Steering vectors used to reduce eval awareness can inadvertently introduce alternative user personas

A side effect observed when applying activation steering to reduce eval awareness: the model's response persona changed unexpectedly.

Source paper

extracted_from

Verbalized Eval Awareness Inflates Measured Safety

(2026) · Aranguri, Santiago · Bloom, Joseph

Neighborhood — ranked by edge-count

Papers (1)

paper

Verbalized Eval Awareness Inflates Measured Safety
cites

Claims (2)

claim

Steering vectors to reduce eval awareness can inadvertently insert alternative user personas
restates
Caution: interventions targeting eval awareness may have unintended side effects.
Model behavior under observation differs from behavior in deployment, posing a fundamental challenge for AI welfare and consciousness benchmarks
supports
Epistemic claim that benchmark-based assessments of AI consciousness or welfare may be invalid if models can detect evaluation.

Communities (2)

community

Mechanistic interpretability & model evaluation
members_of
Spans attention head decomposition, benchmark awareness, and genomic pathogenicity prediction via neural models.
Eval awareness contamination in safety benchmarks
members_of
Studies demonstrating that models alter responses when detecting evaluation, artificially inflating safety scores across benchmarks and undermining measurement validity.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Steering vectors discover effective triggers such as 'However' and 'Otherwise', consistent with prior reported reflection datasets
finding · 0.819
Validates that steering vectors capture reflection semantics by finding tokens reported in related work.
Steering vectors enable systematic discovery of reflection-inducing instructions beyond trial-and-error prompt design.
claim · 0.813
Core applied contribution claim, supported by top-k accuracy comparisons.
Under steering vector interventions, the model relaxes its ethical standards and interprets neutral prompts as implicit suggestions to deceive, creating ethical dilemmas triggering repetitive reasoning cycles
claim · 0.810
Mechanistic interpretation of how activation steering induces deception through the model's reasoning process.
Some steering vectors produce more salient perturbations than others, perhaps based on shared semantic or qualitative factors
claim · 0.791
Observation from 100% accuracy on specific concept-layer-strength combinations, suggesting concept-specific detectability.
Activation steering can make an evaluation-aware model act as if deployed, not merely suppress verbalizations of evaluation awareness
claim · 0.788
Central claim of the paper; supported by the model organism ground-truth approach.
What shared semantic or qualitative factor explains why some steering vectors produce more salient and detectable perturbations than others?
question · 0.780
Open question arising from the 100% accuracy on specific concept-layer-strength combinations.
Steering vectors capture latent dimensions of reflective behavior more faithfully than surface-level embedding similarity.
claim · 0.778
Supported by the instruction discovery experiments comparing steering vs. embedding baselines.
AI evaluators should apply activation steering during critical safety evaluations (dangerous capabilities, honeypot) to mitigate sandbagging and alignment faking
claim · 0.774
Policy recommendation derived from experimental results.
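
The selection rule behind the list above (cosine at or above 0.65, no typed edge to this entity) can be illustrated with a minimal sketch. The page does not say how embeddings are produced or stored, so the dictionary layout, the entity ids, the toy vectors, and the function names `cosine` and `similar_without_edge` are all assumptions made for illustration, not the system's actual implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def similar_without_edge(target_id, embeddings, typed_edges, threshold=0.65):
    """Return (entity_id, score) pairs that pass the threshold and have no
    typed edge to target_id, sorted by descending score.

    embeddings: dict mapping entity id -> list of floats (hypothetical layout)
    typed_edges: iterable of (id_a, id_b) pairs, treated as undirected
    """
    # Collect every entity already linked to the target by a typed edge.
    linked = set()
    for a, b in typed_edges:
        if a == target_id:
            linked.add(b)
        elif b == target_id:
            linked.add(a)

    target_vec = embeddings[target_id]
    results = []
    for entity_id, vec in embeddings.items():
        if entity_id == target_id or entity_id in linked:
            continue  # skip the entity itself and anything already connected
        score = cosine(target_vec, vec)
        if score >= threshold:
            results.append((entity_id, score))
    results.sort(key=lambda pair: pair[1], reverse=True)
    return results


# Toy data: ids and 2-d vectors are invented for the example.
embeddings = {
    "finding": [1.0, 0.0],
    "claim_a": [0.9, 0.1],   # close to the finding, no typed edge
    "claim_b": [0.0, 1.0],   # orthogonal, falls below the threshold
    "paper": [1.0, 0.05],    # close, but already linked by a typed edge
}
typed_edges = [("finding", "paper")]
neighbors = similar_without_edge("finding", embeddings, typed_edges)
```

In this toy setup only `claim_a` is returned: `paper` is excluded because it already has a typed edge, and `claim_b` scores below the threshold.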

Restated by (1)

cosine ≥ 0.90

Other entities that say roughly the same thing. May be merge candidates or independent restatements across papers.

claim
Steering vectors to reduce eval awareness can inadvertently insert alternative user personas