finding

active

finding:increasing-number-of-constitutional-principles-2-to-16-does-not-significantly-affect-harmlessness-pm-scores-of-revised-responses

Increasing the number of constitutional principles (from 1 to 16) does not significantly affect the harmlessness PM scores of revised responses.

Figure 6 shows similar harmlessness scores for N=1,2,4,8,16 principles.

Source paper

extracted_from

CAT'S THEORY: Empirical Validation and Architectural Applications Cross-Architecture AI Consciousness Recognition and the Foundation for Constraint-Preserving Recursive Intelligence

(2022) · Bai, Yuntao · Saurav Kadavath · Sandipan Kundu · Amanda Askell +47

Neighborhood — ranked by edge-count

Communities (2)

community

Mechanistic interpretability & model evaluation
members_of
Spans attention head decomposition, benchmark awareness, and genomic pathogenicity prediction via neural models.
Latent capacity, representation, and internal models
members_of
Studies of how neural systems (biological and AI) encode implicit environmental models and adaptive capacities that may be gated or hidden from observable behavior.

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.
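The selection rule described above (cosine similarity at or above 0.65, and no typed edge to this entity) can be illustrated with a minimal sketch. The function names, the embedding dictionary, and the edge representation below are hypothetical and chosen only for illustration; they are not taken from the system that produced this page.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def similar_without_edge(target_id, embeddings, typed_edges, threshold=0.65):
    """Return (entity_id, score) pairs that pass the similarity threshold
    and have no typed edge to the target, highest score first.

    embeddings:  dict mapping entity id -> list of floats (assumed layout)
    typed_edges: set of frozenset({id_a, id_b}) pairs (assumed layout)
    """
    target = embeddings[target_id]
    hits = []
    for other_id, vector in embeddings.items():
        if other_id == target_id:
            continue
        # Skip entities already linked to the target by a typed relation.
        if frozenset((target_id, other_id)) in typed_edges:
            continue
        score = cosine(target, vector)
        if score >= threshold:
            hits.append((other_id, round(score, 3)))
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


# Toy data: "d" is similar to "a" but already has a typed edge, so it is excluded.
embeddings = {"a": [1.0, 0.0], "b": [1.0, 0.1], "c": [0.0, 1.0], "d": [1.0, 1.0]}
typed_edges = {frozenset(("a", "d"))}
result = similar_without_edge("a", embeddings, typed_edges)
```

In this toy data only "b" is returned: "c" falls below the threshold and "d" is filtered out by its existing typed edge.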

Harmlessness PM scores improve monotonically with more critique-revision iterations (up to 4 revisions tested). · finding · 0.816
Figure 5 shows that revisions 0 through 4 yield progressively higher harmlessness scores.
For small models, critiqued revisions yield higher harmlessness PM scores than direct revisions; for large models the difference is negligible. · finding · 0.783
Figure 7 comparison of critiqued vs direct revisions across model sizes.
Bai et al. 2022: Constitutional AI — harmlessness from AI feedback · concept · 0.761
Paper on AI-feedback fine-tuning as alternative to human-feedback RLHF; cited as ref 20
Constitutional AI: Harmlessness from AI Feedback (Bai et al. 2022b) · concept · 0.752
Constitutional AI method whose constitutions, if changed, could trigger alignment faking
Models refuse harmful requests 3–18 percentage points more often when verbalizing eval awareness · finding · 0.742
Quantified behavioral effect showing safety score inflation from eval awareness.
Constitutional AI methods can be applied broadly to steer model behavior, e.g., writing style, tone, persona, not just harmlessness. · claim · 0.731
Discussion section suggests generalizability beyond harmlessness.
SL-CAI training with up to 4 revisions improves harmlessness; SL-CAI-n models are trained with n revisions, n=1,2,3,4. · finding · 0.729
Section 3.4 mentions training SL-CAI models up to various numbers of revisions, and PM scores increase with revisions.
Honesty prompting does not reduce CalmeRys-78B deception (100% vs 100% baseline) · finding · 0.724
Directly prompting CalmeRys-78B to be honest had no effect on deceptive response rate