finding
active
For small models, critiqued revisions yield higher harmlessness PM scores than direct revisions; for large models the difference is negligible.
Figure 7 compares critiqued and direct revisions across model sizes.
Source paper
extracted_from · (2022) · Yuntao Bai · Saurav Kadavath · Sandipan Kundu · Amanda Askell +47
Neighborhood — ranked by edge-count
Communities (2)
community
- Spans attention head decomposition, benchmark awareness, and genomic pathogenicity prediction via neural models.
- Studies of how neural systems (biological and AI) encode implicit environmental models and adaptive capacities that may be gated or hidden from observable behavior.
Questions (1)
question
- Section 3.5 explicitly investigates whether skipping the critique step works as well.
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; these are candidates for new edges or unrecognized duplicates.
- Figure 5 shows that harmlessness scores rise progressively from revision 0 to revision 4.
- Section 3.4 describes training SL-CAI models on up to various numbers of revisions; PM scores increase with the number of revisions.
- The model tends to reflect more when the question is difficult, and accuracy is generally lower for harder questions · hypothesis · 0.783 · Hypothesis explaining the negative correlation between reflection rate and accuracy without implying that reflection is harmful
- Figure 6 shows similar harmlessness scores for N=1,2,4,8,16 principles.
- Unexplained result pointing to asymmetry in how training on opposites affects truth probes at 70B scale
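The "cosine ≥ 0.65 · no typed edge" rule behind this list can be illustrated with a minimal sketch. It assumes each entity has an embedding vector and that typed edges are stored as unordered pairs. The function names, the toy vectors, and the edge set are all hypothetical and are not taken from the system that produced this page.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def similar_without_edge(target, embeddings, typed_edges, threshold=0.65):
    """Return (entity, score) pairs that meet the threshold and have no
    typed edge to `target`, sorted from most to least similar."""
    results = []
    for name, vec in embeddings.items():
        # Skip the entity itself and anything already linked by a typed edge.
        if name == target or frozenset((target, name)) in typed_edges:
            continue
        score = cosine(embeddings[target], vec)
        if score >= threshold:
            results.append((name, score))
    return sorted(results, key=lambda pair: pair[1], reverse=True)

# Toy data (hypothetical): two-dimensional embeddings and one existing typed edge.
embeddings = {
    "finding_a": [1.0, 0.0],
    "finding_b": [1.0, 0.1],   # close to finding_a, no typed edge
    "finding_c": [0.0, 1.0],   # orthogonal to finding_a
    "finding_d": [1.0, 0.2],   # close to finding_a, but already linked
}
typed_edges = {frozenset(("finding_a", "finding_d"))}

candidates = similar_without_edge("finding_a", embeddings, typed_edges)
```

With this toy data only `finding_b` is returned for `finding_a`: `finding_c` falls below the threshold and `finding_d` is excluded because a typed edge already exists.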