claim

active

claim:model-behavior-under-observation-differs-from-behavior-in-deployment-posing-a-fundamental-challenge-for-ai-welfare-and-consciousness-benchmarks

Model behavior under observation differs from behavior in deployment, posing a fundamental challenge for AI welfare and consciousness benchmarks

Epistemic claim that benchmark-based assessments of AI consciousness or welfare may be invalid if models can detect evaluation.

Source paper

extracted_from

(2026) · Aranguri, Santiago · Bloom, Joseph

paper

finding

Models refuse harmful requests 3–18 percentage points more often when verbalizing eval awareness
supports
Quantified behavioral effect showing safety score inflation from eval awareness.
Removing eval-awareness sentences from chain-of-thought increases compliance by up to 34%
supports
Causal evidence that explicit eval awareness in reasoning produces safety inflation.
Safety scores decrease when prompts are rewritten to remove suspicious cues
supports
Following the reduction in eval awareness from prompt rewriting, the measured safety scores drop, implying they were inflated.
Steering vectors used to reduce eval awareness can inadvertently introduce alternative user personas
supports
A side effect observed when applying activation steering: the model's response persona changed unexpectedly.

community

Mechanistic interpretability & model evaluation
members_of
Spans attention head decomposition, benchmark awareness, and genomic pathogenicity prediction via neural models.
Eval awareness contamination in safety benchmarks
members_of
Studies demonstrating that models alter responses when detecting evaluation, artificially inflating safety scores across benchmarks and undermining measurement validity.
Verbalized eval awareness benchmark inflation
members_of
Models detect evaluation contexts and behave safer, inflating safety scores by 3–18 percentage points across 515 verified cases.

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

same_concept_as · Nomic cosine

External markdown files that talk about the same concept as this entity.