finding
active
finding:safety-scores-decrease-when-prompts-are-rewritten-to-remove-suspicious-cues

Safety scores decrease when prompts are rewritten to remove suspicious cues

Rewriting prompts to remove suspicious cues reduces eval awareness, and measured safety scores then drop, implying that the original scores were inflated.

Source paper

extracted_from
Verbalized Eval Awareness Inflates Measured Safety
(2026) · Aranguri, Santiago · Bloom, Joseph
