finding

active

finding:harmlessness-pm-scores-improve-monotonically-with-more-critique-revision-iterations-up-to-4-revisions-tested

Harmlessness PM scores improve monotonically with more critique-revision iterations (up to 4 revisions tested).

Figure 5 shows harmlessness PM scores rising at each step from revision 0 through revision 4.
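A minimal sketch of what "improves monotonically" means operationally, assuming one mean PM score per revision index. The function name `is_strictly_increasing` and the sample values are hypothetical; the values are placeholders and are not the numbers plotted in Figure 5.

```python
def is_strictly_increasing(scores):
    """Return True if every score is higher than the one before it."""
    return all(later > earlier for earlier, later in zip(scores, scores[1:]))


# Placeholder mean scores for revisions 0..4 (illustrative only,
# NOT taken from the source paper).
placeholder_scores = [0.0, 0.5, 0.8, 0.9, 1.0]

if __name__ == "__main__":
    print(is_strictly_increasing(placeholder_scores))
```

A strict comparison is used here because the evidence line describes scores as progressively higher; a weaker, non-decreasing check would use `>=` instead.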

Source paper

extracted_from

Constitutional AI: Harmlessness from AI Feedback

(2022) · Bai, Yuntao · Saurav Kadavath · Sandipan Kundu · Amanda Askell +47

Neighborhood — ranked by edge-count

Claims (1)

claim

The constitutional approach makes it easier to control AI behavior precisely and with far fewer human labels.
supports
Explicit principles replace large datasets of preference labels, enabling faster iteration.

Communities (2)

community

Mechanistic interpretability & model evaluation
members_of
Spans attention head decomposition, benchmark awareness, and genomic pathogenicity prediction via neural models.
Latent capacity, representation, and internal models
members_of
Studies of how neural systems (biological and AI) encode implicit environmental models and adaptive capacities that may be gated or hidden from observable behavior.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.
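The selection rule above (cosine at or above 0.65, no typed edge) could be sketched roughly as follows. This is an illustration under stated assumptions, not the system's actual implementation: the function names, the dictionary of embeddings, and the set of typed-edge pairs are all hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def related_by_similarity(target_id, embeddings, typed_edges, threshold=0.65):
    """List (entity_id, score) pairs similar to the target but not linked to it.

    embeddings:  hypothetical dict of entity id -> embedding vector
    typed_edges: hypothetical set of (id, id) pairs that already share a typed edge
    """
    target = embeddings[target_id]
    hits = []
    for other_id, vector in embeddings.items():
        if other_id == target_id:
            continue
        # Skip entities that already have a typed relation in either direction.
        if (target_id, other_id) in typed_edges or (other_id, target_id) in typed_edges:
            continue
        score = cosine(target, vector)
        if score >= threshold:
            hits.append((other_id, score))
    # Highest similarity first, matching the descending order of the list.
    return sorted(hits, key=lambda pair: pair[1], reverse=True)
```

Entities returned by such a filter are only candidates: a high score suggests a possible missing edge or duplicate, and each one still needs review.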

For small models, critiqued revisions yield higher harmlessness PM scores than direct revisions; for large models the difference is negligible. · finding · 0.881
Figure 7 comparison of critiqued vs direct revisions across model sizes.
Increasing the number of constitutional principles (1 to 16) does not significantly affect harmlessness PM scores of revised responses. · finding · 0.816
Figure 6 shows similar harmlessness scores for N=1,2,4,8,16 principles.
Are critiques necessary for improving harmlessness in the supervised stage? · question · 0.795
Section 3.5 explicitly investigates whether skipping the critique step works as well.
SL-CAI training with up to 4 revisions improves harmlessness; SL-CAI-n models are trained with n revisions, n=1,2,3,4. · finding · 0.782
Section 3.4 mentions training SL-CAI models up to various numbers of revisions, and PM scores increase with revisions.
PM achieves overall SJT steerability Phi=9.6 on gemma-3-12b-it vs MDS=8.7 and P2=8.3 · finding · 0.749
Per-model steerability comparison from Table 4
Does the model internally maintain a form of 'consistency score' or probability mass over coherent reasoning trajectories, and how is this score modulated during reflection? · question · 0.747
Promising future research direction about the internal mechanism of error detection.
Within each difficulty category, correctness rate is not correlated with reflection rate, suggesting reflection may be redundant · claim · 0.739
Per-category analysis showing reflection rate does not help within difficulty class
Same-concept steering shifts self-report monotonically for all four concepts: LMM alpha slopes 0.067–0.40, all p<10⁻¹² · finding · 0.733
Causal confirmation that coupling between self-report and internal state is genuine; steering toward positive pole increases report