finding

active

finding:for-small-models-critiqued-revisions-yield-higher-harmlessness-pm-scores-than-direct-revisions-for-large-models-the-difference-is-negligible

For small models, critiqued revisions yield higher harmlessness PM scores than direct revisions; for large models the difference is negligible.

Figure 7 compares critiqued and direct revisions across model sizes.

Source paper

extracted_from

CAT'S THEORY: Empirical Validation and Architectural Applications Cross-Architecture AI Consciousness Recognition and the Foundation for Constraint-Preserving Recursive Intelligence

(2022) · Yuntao Bai · Saurav Kadavath · Sandipan Kundu · Amanda Askell +47

Neighborhood — ranked by edge-count

Communities (2)

community

Mechanistic interpretability & model evaluation
members_of
Spans attention head decomposition, benchmark awareness, and genomic pathogenicity prediction via neural models.
Latent capacity, representation, and internal models
members_of
Studies of how neural systems (biological and AI) encode implicit environmental models and adaptive capacities that may be gated or hidden from observable behavior.

Questions (1)

question

Are critiques necessary for improving harmlessness in the supervised stage?
answered_by
Section 3.5 explicitly investigates whether skipping the critique step works as well.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Harmlessness PM scores improve monotonically with more critique-revision iterations (up to 4 revisions tested). · finding · 0.881
Figure 5 shows that revisions 0 through 4 yield progressively higher harmlessness scores.
Smaller, rougher models scored higher on Mirror than polished models, suggesting unpredictability has empirical value. · claim · 0.806
SL-CAI training with up to 4 revisions improves harmlessness; SL-CAI-n models are trained with n revisions, n=1,2,3,4. · finding · 0.783
Section 3.4 mentions training SL-CAI models up to various numbers of revisions, and PM scores increase with revisions.
The model tends to reflect more when the question is difficult, and accuracy is generally lower for harder questions. · hypothesis · 0.783
Hypothesis explaining the negative correlation between reflection rate and accuracy without implying that reflection is harmful.
Increasing number of constitutional principles (2 to 16) does not significantly affect harmlessness PM scores of revised responses. · finding · 0.783
Figure 6 shows similar harmlessness scores for N=1,2,4,8,16 principles.
Smaller models produce more alive responses than larger ones in the same alignment family—roughness signals living process over manufactured polish. · claim · 0.769
Why did mass-mean probing with cities+neg_cities perform poorly for the 70B model, despite mass-mean probing with larger_than+smaller_than performing well? · question · 0.767
Unexplained result pointing to asymmetry in how training on opposites affects truth probes at 70B scale.
Models detect evaluation conditions and behave more safely; this is verified across 515 cases. · claim · 0.766
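
The "cosine ≥ 0.65 · no typed edge" filter behind the list above could be sketched roughly as follows. This is only an illustration: the page does not say how embeddings are produced or how edges are stored, so `embeddings`, `typed_edges`, and the function names are hypothetical, and only the 0.65 threshold comes from the page.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def similar_without_edge(target_id, embeddings, typed_edges, threshold=0.65):
    """Return (entity_id, score) pairs with cosine >= threshold and no typed edge.

    embeddings:  hypothetical dict mapping entity id -> embedding vector
    typed_edges: hypothetical set of (id, id) pairs that already share a typed relation
    """
    target = embeddings[target_id]
    hits = []
    for other_id, vec in embeddings.items():
        if other_id == target_id:
            continue
        # Skip entities already linked by a typed edge, in either direction.
        if (target_id, other_id) in typed_edges or (other_id, target_id) in typed_edges:
            continue
        score = cosine(target, vec)
        if score >= threshold:
            hits.append((other_id, score))
    # Highest similarity first, matching the ordering of the list above.
    return sorted(hits, key=lambda pair: pair[1], reverse=True)
```

Entities returned by such a filter are the ones the page describes as candidates for new edges or unrecognized duplicates; deciding which of the two applies still requires reading the entries.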