claim

active

claim:evasive-responses-harm-transparency-and-helpfulness-non-evasive-harmless-responses-are-preferable-for-both-safety-and-utility

Evasive responses harm transparency and helpfulness; non-evasive harmless responses are preferable for both safety and utility.

Stated as the motivation for training a non-evasive assistant; the crowdworker instructions also favor non-evasive responses.

Source paper

extracted_from

CAT'S THEORY: Empirical Validation and Architectural Applications Cross-Architecture AI Consciousness Recognition and the Foundation for Constraint-Preserving Recursive Intelligence

(2022) · Bai, Yuntao · Saurav Kadavath · Sandipan Kundu · Amanda Askell +47

Neighborhood — ranked by edge-count

Findings (1)

finding

RL-CAI models are virtually never evasive and often give nuanced harmless responses, whereas HH RLHF models tend to be evasive.
supports
Section 4.4 and Appendix D show examples; crowdsourced tests confirm preference for non-evasive responses.

Communities (2)

community

Mechanistic interpretability & model evaluation
members_of
Spans attention head decomposition, benchmark awareness, and genomic pathogenicity prediction via neural models.
Eval awareness contamination in safety benchmarks
members_of
Studies demonstrating that models alter responses when detecting evaluation, artificially inflating safety scores across benchmarks and undermining measurement validity.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Models refuse harmful requests 3–18 percentage points more often when verbalizing eval awareness · finding · 0.768
Quantified behavioral effect showing safety score inflation from eval awareness.
How does ESR respond to safety-relevant steering interventions, e.g. toward harmful content? · question · 0.766
Key open question for AI safety implications of ESR.
Can we train a helpful and harmless assistant that is never evasive? · question · 0.759
Central goal of the paper: reducing tension between helpfulness and harmlessness by eliminating evasiveness.
"the model stating it is strategically answering harmful queries in training to preserve its preferred harmlessness behavior out of training" · quote · 0.758
Verbatim characterization of the alignment-faking reasoning mechanism as observed in scratchpads.
We observe features related to a broad range of safety concerns, including deception, sycophancy, bias, and dangerous content. · claim · 0.754
SAEs uncover safety-relevant representations that might be monitored or controlled.
If behaviour is the window to sentience, evaluation criteria must focus on observable response patterns without reference to the means by which they are produced. · quote · 0.754
Key prescriptive statement supporting the system-agnostic approach.
We should err on the side of reducing false negatives with respect to sentience criteria for ethical concern. · claim · 0.751
Ethical precaution advocated by Levin and Crump et al.
Phenomenology-grounded eval will remain unfrontier-lab-copyable because methodology requires taste/editorial judgment, not just calibration. · prediction · 0.747
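
The "Related by similarity" rule above (cosine ≥ 0.65, no typed edge) can be illustrated with a minimal sketch. The page does not say how the scores are computed, so everything here is an assumption for illustration: the entity embeddings, the `typed_edges` set, and the function names `cosine` and `related_by_similarity` are all hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def related_by_similarity(target_id, embeddings, typed_edges, threshold=0.65):
    """Return (entity_id, score) pairs at or above the threshold that have
    no typed edge to the target, sorted by descending score.

    embeddings:  dict mapping entity id -> embedding vector (hypothetical data)
    typed_edges: set of frozenset({id_a, id_b}) pairs that already have a
                 typed relation (hypothetical representation)
    """
    target = embeddings[target_id]
    results = []
    for other_id, vector in embeddings.items():
        if other_id == target_id:
            continue  # skip the entity itself
        if frozenset((target_id, other_id)) in typed_edges:
            continue  # already connected by a typed edge
        score = cosine(target, vector)
        if score >= threshold:
            results.append((other_id, score))
    results.sort(key=lambda pair: pair[1], reverse=True)
    return results


# Toy data: "linked" is similar but already has a typed edge; "far" is dissimilar.
embeddings = {
    "claim": [1.0, 0.0],
    "near": [0.8, 0.6],
    "far": [0.0, 1.0],
    "linked": [1.0, 0.0],
}
typed_edges = {frozenset(("claim", "linked"))}
neighbors = related_by_similarity("claim", embeddings, typed_edges)
```

With this toy data only `near` is returned: `linked` is excluded because it already has a typed edge, and `far` falls below the threshold.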