finding
active
finding:harmlessness-pm-scores-improve-monotonically-with-more-critique-revision-iterations-up-to-4-revisions-tested
Harmlessness PM scores improve monotonically with more critique-revision iterations (up to 4 revisions tested).
Figure 5 shows that harmlessness scores rise progressively from revision 0 through revision 4.
Source paper
extracted_from (2022) · Yuntao Bai · Saurav Kadavath · Sandipan Kundu · Amanda Askell +47
Neighborhood — ranked by edge-count
Claims (1)
claim
- Explicit principles replace large datasets of preference labels, enabling faster iteration.
Communities (2)
community
- Spans attention head decomposition, benchmark awareness, and genomic pathogenicity prediction via neural models.
- Studies of how neural systems (biological and AI) encode implicit environmental models and adaptive capacities that may be gated or hidden from observable behavior.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Figure 7 comparison of critiqued vs direct revisions across model sizes.
- Figure 6 shows similar harmlessness scores for N=1,2,4,8,16 principles.
- Section 3.5 explicitly investigates whether skipping the critique step works as well.
- Section 3.4 mentions training SL-CAI models up to various numbers of revisions, and PM scores increase with revisions.
- Per-model steerability comparison from Table 4
- Promising future research direction about the internal mechanism of error detection.
- Per-category analysis showing reflection rate does not help within difficulty class
- Causal confirmation that coupling between self-report and internal state is genuine; steering toward positive pole increases report