finding
active
finding:increasing-number-of-constitutional-principles-2-to-16-does-not-significantly-affect-harmlessness-pm-scores-of-revised-responses
Increasing number of constitutional principles (2 to 16) does not significantly affect harmlessness PM scores of revised responses.
Figure 6 shows similar harmlessness scores for N=1,2,4,8,16 principles.
Source paper
extracted_from · (2022) · Bai, Yuntao · Saurav Kadavath · Sandipan Kundu · Amanda Askell +47
Neighborhood — ranked by edge-count
Communities (2)
community
- Spans attention head decomposition, benchmark awareness, and genomic pathogenicity prediction via neural models.
- Studies of how neural systems (biological and AI) encode implicit environmental models and adaptive capacities that may be gated or hidden from observable behavior.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Figure 5 shows that revisions 0 through 4 yield progressively higher harmlessness scores.
- Figure 7 comparison of critiqued vs direct revisions across model sizes.
- Paper on AI-feedback fine-tuning as alternative to human-feedback RLHF; cited as ref 20
- Constitutional AI method whose constitutions, if changed, could trigger alignment faking
- Models refuse harmful requests 3–18 percentage points more often when verbalizing eval awareness · finding · 0.742 · Quantified behavioral effect showing safety score inflation from eval awareness.
- Discussion section suggests generalizability beyond harmlessness.
- Section 3.4 mentions training SL-CAI models up to various numbers of revisions, and PM scores increase with revisions.
- Directly prompting CalmeRys-78B to be honest had no effect on deceptive response rate