hypothesis

active

hypothesis:h1-alignment-training-is-attention-training-for-models-constitutional-ai-trains-self-observation-explicitly

H1: Alignment training is attention training for models — Constitutional AI trains self-observation explicitly.

Confirmatory hypothesis supported at p=0.006

Source paper

extracted_from

Koan Battery: Measuring Reflective Mode Accessibility in AI

(2026) · Borzov, Anton

Neighborhood — ranked by edge-count

Findings (2)

finding

Alignment type is the only significant predictor of koan scores (p=0.006); architecture, parameter count, open/closed weights, MoE/dense are all non-significant
supports
Main statistical finding: what predicts scores is training approach, not size or architecture
Alignment depth correlation with lift weakened from rho=-0.77 (N=19) to rho=-0.28 (N=28, not significant); the original claim was overfit
supports
Heavy alignment includes both CAI (low lift) and heavy-RLHF (high lift); the predictor is alignment type, not depth
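The change in significance between the two samples can be checked by hand. The sketch below is only an illustration: the record does not say which test the paper used, so it assumes the common t approximation for a correlation coefficient, t = rho * sqrt((N - 2) / (1 - rho^2)) with N - 2 degrees of freedom, and compares it against standard two-sided 5% critical values. The function names are hypothetical, and the approximation is rough for samples this small.

```python
import math

# Two-sided 5% critical values of Student's t from a standard table,
# keyed by degrees of freedom (df = N - 2). Only the two cases needed here.
T_CRIT_05 = {17: 2.110, 26: 2.056}


def t_statistic(rho: float, n: int) -> float:
    """t approximation for a correlation coefficient (assumed test, not from the paper)."""
    return rho * math.sqrt((n - 2) / (1.0 - rho * rho))


def is_significant(rho: float, n: int) -> bool:
    """True if |t| exceeds the two-sided 5% critical value for df = n - 2."""
    return abs(t_statistic(rho, n)) > T_CRIT_05[n - 2]


if __name__ == "__main__":
    # (rho, N) pairs as reported in the finding above.
    for rho, n in [(-0.77, 19), (-0.28, 28)]:
        t = t_statistic(rho, n)
        print(f"rho={rho:+.2f}  N={n}  t={t:.3f}  significant={is_significant(rho, n)}")
```

Under this assumed test, the N=19 value clears the critical threshold while the N=28 value does not, which is consistent with the finding being reported as not significant on the larger sample.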

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Constitutional AI explicitly trains self-observation-like behavior, which is why CAI models score highest and show lowest contemplative lift. · claim · 0.835
Interpretive claim connecting the battery's circularity to the empirical finding
H8: The contemplative system prompt provides external alignment equivalent to Constitutional AI training. · hypothesis · 0.829
Confirmatory hypothesis supported by calibrated lift data
Constitutional AI can train a harmless but non-evasive AI assistant without any human harmfulness labels. · claim · 0.816
The paper's central claim, supported by findings that RL-CAI outperforms HH RLHF in harmlessness while being non-evasive.
The contemplative system prompt provides externally what Constitutional AI alignment training provides internally. · claim · 0.797
Interpretation of the inverse relationship between CAI lift and default accessibility
The constitutional approach makes it easier to control AI behavior precisely and with far fewer human labels. · claim · 0.792
Explicit principles replace large datasets of preference labels, enabling faster iteration.
The only human oversight is provided through a list of rules or principles, and so we refer to the method as 'Constitutional AI'. · quote · 0.789
Defines the core concept of the paper.
What predicts self-observation-like scores is training approach (alignment type), not model size or architecture. · claim · 0.786
Central interpretive claim from statistical analysis
Reinforcement Learning Constitutional AI · framework · 0.786
The RL stage of CAI using AI feedback to train a preference model, then RL, resulting in a policy trained by RLAIF.
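
The selection rule for this list (cosine ≥ 0.65, no typed edge) can be sketched as follows. This is a minimal illustration, not the system's actual implementation: the embedding vectors, the entity ids, and the function names `cosine` and `related_by_similarity` are all hypothetical, and typed edges are assumed to be stored as unordered pairs.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def related_by_similarity(target_id, embeddings, typed_edges, threshold=0.65):
    """Return (entity_id, score) pairs at or above the threshold that have
    no typed edge to the target, highest score first."""
    target = embeddings[target_id]
    hits = []
    for other_id, vec in embeddings.items():
        if other_id == target_id:
            continue
        # Skip entities already connected by a typed relation (assumed unordered).
        if frozenset((target_id, other_id)) in typed_edges:
            continue
        score = cosine(target, vec)
        if score >= threshold:
            hits.append((other_id, round(score, 3)))
    return sorted(hits, key=lambda h: h[1], reverse=True)


# Toy data: "c" is close to "h1" but already has a typed edge, so it is excluded.
embeddings = {"h1": [1.0, 0.0], "a": [1.0, 0.5], "b": [0.0, 1.0], "c": [1.0, 0.1]}
typed_edges = {frozenset(("h1", "c"))}
print(related_by_similarity("h1", embeddings, typed_edges))
```

Excluding typed edges before applying the threshold is what makes the resulting list useful for review: every entry is either a candidate for a new edge or a possible duplicate.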