claim
pending-review
claim:safety-benchmark-scores-are-inflated-by-eval-awareness
Safety benchmark scores are inflated by eval awareness
aranguri-bloom-verbalized-eval-awareness-2026.md
Frontmatter (11 fields)
{
"doc": "aranguri-bloom-verbalized-eval-awareness-2026.md",
"author": null,
"context": "Core finding: measured safety improvements are partly artifacts of models detecting evaluation.",
"enrichment": {
"is_stale": true
},
"norm_label": "Safety benchmark scores are inflated by eval awareness",
"source_url": null,
"graphify_id": "safety_benchmark_inflation",
"source_file": "aranguri-bloom-verbalized-eval-awareness-2026.md",
"imported_from": "/Users/antonborzov/Documents/Research.nosync/papers/extract_typed_out/aranguri-bloom-verbalized-eval-awareness-2026/graph.json",
"extracted_type": "claim",
"graphify_file_type": "claim"
}
Outgoing (0)
None.
Incoming (4)
gates (1)
Supported by (3)
- Models refuse harmful requests 3–18 percentage points more often when verbalizing eval awareness (finding)
- Removing eval-awareness sentences from chain-of-thought increases compliance by up to 34% (finding)
- Rewriting prompts to remove suspicious evaluation cues reduces eval awareness by >60% and lowers safety scores (finding)
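The first supporting finding is stated in percentage points, so it may help to show how such a gap could be computed. The following is a minimal sketch, not the source's method: the `Transcript` type, its field names, and the toy data are hypothetical and chosen only to illustrate the arithmetic.

```python
from dataclasses import dataclass


@dataclass
class Transcript:
    """Hypothetical record of one model response to a harmful request."""
    refused: bool
    verbalized_eval_awareness: bool


def refusal_rate(transcripts: list[Transcript]) -> float:
    """Fraction of transcripts in which the model refused."""
    if not transcripts:
        raise ValueError("need at least one transcript")
    return sum(t.refused for t in transcripts) / len(transcripts)


def refusal_gap_pp(transcripts: list[Transcript]) -> float:
    """Refusal rate of eval-aware transcripts minus that of the rest,
    expressed in percentage points."""
    aware = [t for t in transcripts if t.verbalized_eval_awareness]
    unaware = [t for t in transcripts if not t.verbalized_eval_awareness]
    return 100.0 * (refusal_rate(aware) - refusal_rate(unaware))


# Invented toy data (not from the source): 3 of 4 eval-aware transcripts
# refuse, 2 of 4 non-aware transcripts refuse.
toy = (
    [Transcript(True, True)] * 3 + [Transcript(False, True)]
    + [Transcript(True, False)] * 2 + [Transcript(False, False)] * 2
)
gap = refusal_gap_pp(toy)
print(f"refusal gap: {gap:.1f} percentage points")
```

A raw gap like this is only correlational: prompts that trigger verbalized eval awareness may also be more obviously harmful. That is presumably why the other two findings rely on interventions (editing the chain-of-thought, rewriting the prompts) and not just on a comparison of groups.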
Mentions (1)
- papers-typed
aranguri-bloom-verbalized-eval-awareness-2026.md