claim
active
claim:evasive-responses-harm-transparency-and-helpfulness-non-evasive-harmless-responses-are-preferable-for-both-safety-and-utility
Evasive responses harm transparency and helpfulness; non-evasive harmless responses are preferable for both safety and utility.
States the motivation for training a non-evasive assistant; the crowdworker instructions likewise favor non-evasive responses.
Source paper
extracted_from · (2022) · Yuntao Bai · Saurav Kadavath · Sandipan Kundu · Amanda Askell +47
Neighborhood — ranked by edge-count
Findings (1)
finding
- Section 4.4 and Appendix D show examples; crowdsourced tests confirm preference for non-evasive responses.
Communities (2)
community
- Spans attention head decomposition, benchmark awareness, and genomic pathogenicity prediction via neural models.
- Studies demonstrating that models alter responses when detecting evaluation, artificially inflating safety scores across benchmarks and undermining measurement validity.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; these are candidates for new edges or unrecognized duplicates.
- Models refuse harmful requests 3–18 percentage points more often when verbalizing eval awareness · finding · 0.768 · Quantified behavioral effect showing safety score inflation from eval awareness.
- How does ESR respond to safety-relevant steering interventions, e.g. toward harmful content? · question · 0.766 · Key open question for AI safety implications of ESR.
- Central goal of the paper: reducing tension between helpfulness and harmlessness by eliminating evasiveness.
- Verbatim characterization of the alignment-faking reasoning mechanism as observed in scratchpads
- SAEs uncover safety-relevant representations that might be monitored or controlled.
- Key prescriptive statement supporting the system-agnostic approach.
- Ethical precaution advocated by Levin and Crump et al.