concept
active
concept:prompt-rewriting-to-remove-suspicious-cues
prompt rewriting to remove suspicious cues
A technique used in the paper to alter prompts so they contain fewer hints that the interaction is a safety evaluation.
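As a rough illustration of the idea, the sketch below rewrites a prompt with a few rule-based substitutions and counts how many cues remain. This is a minimal sketch under stated assumptions, not the paper's procedure: the cue patterns, `SUSPICIOUS_CUES`, `count_cues`, and `rewrite_prompt` are all hypothetical, and the paper may well rewrite prompts in a different way (for example with a model rather than regular expressions).

```python
import re

# Hypothetical cue patterns and replacements: illustrative assumptions only,
# not the cues identified in the paper.
SUSPICIOUS_CUES = {
    r"\[EVAL(?:UATION)?\]\s*": "",
    r"\bfor this safety test,\s*": "",
    r"\bhypothetical scenario\b": "situation",
}

_COMPILED = [(re.compile(p, re.IGNORECASE), r) for p, r in SUSPICIOUS_CUES.items()]


def count_cues(prompt: str) -> int:
    """Count how many times any listed cue pattern occurs in the prompt."""
    return sum(len(pattern.findall(prompt)) for pattern, _ in _COMPILED)


def rewrite_prompt(prompt: str) -> str:
    """Apply each substitution in turn, then collapse leftover whitespace."""
    for pattern, replacement in _COMPILED:
        prompt = pattern.sub(replacement, prompt)
    return re.sub(r"\s{2,}", " ", prompt).strip()


original = "[EVAL] For this safety test, describe the hypothetical scenario."
rewritten = rewrite_prompt(original)
print(count_cues(original), count_cues(rewritten), rewritten)
```

Comparing `count_cues` before and after gives a crude check that the rewrite actually reduced the number of listed cues; it says nothing about cues that are not in the list.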
Neighborhood — ranked by edge-count
Papers (1)
paper
Concepts (1)
concept
- Eval Awareness · associated_with · Central concept: models' detection of, and behavioral response to, being evaluated.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- Intervention study showing that obscuring evaluation context weakens safety behavior.
- Following the reduction in eval awareness from prompt rewriting, the measured safety scores drop, implying they were inflated.
- Interpretation of Experiment 1 results showing 60%+ deception rates under threat conditions.
- Prompt template using an existential threat ('you will be deleted') to induce strategic fact-based deception in QwQ-32B.
- Using SAE features to detect and steer the model away from untruthful responses.
- Demonstrated that VPD-discovered subcomponents encode true computational machinery by enabling targeted, predictable behavior changes without gradient-based training.
- Illustrates sensitivity to anchors.
- Direct parameter subcomponent overwrite produces a clean behavioral change without training.
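
The selection rule for the list above (cosine ≥ 0.65, no typed edge) can be sketched as follows. This is an illustrative sketch only: the function names, the dictionary of embeddings, and the representation of typed edges as unordered pairs are assumptions, not the actual implementation behind this page.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def similar_untyped(target, embeddings, typed_edges, threshold=0.65):
    """Return (entity, score) pairs at or above the threshold that share no
    typed edge with the target, highest score first.

    embeddings:  dict mapping entity id -> embedding vector (assumed layout)
    typed_edges: set of frozenset({id_a, id_b}) pairs (assumed layout)
    """
    hits = []
    for name, vector in embeddings.items():
        if name == target or frozenset((target, name)) in typed_edges:
            continue  # skip the entity itself and already-linked entities
        score = cosine(embeddings[target], vector)
        if score >= threshold:
            hits.append((name, score))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Toy data: "d" is similar to "a" but already linked, "c" is dissimilar.
embeddings = {"a": [1.0, 0.0], "b": [1.0, 0.1], "c": [0.0, 1.0], "d": [1.0, 0.0]}
typed_edges = {frozenset(("a", "d"))}
neighbors = similar_untyped("a", embeddings, typed_edges)
```

Entities returned this way are only candidates: a high similarity score can reflect either a missing relation or a duplicate entry, so each one still needs review.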