hypothesis
active
hypothesis:h1-alignment-training-is-attention-training-for-models-constitutional-ai-trains-self-observation-explicitly
H1: Alignment training is attention training for models; Constitutional AI trains self-observation explicitly.
Confirmatory hypothesis supported at p=0.006
Source paper
extracted_from · (2026) · Borzov, Anton
Neighborhood — ranked by edge-count
Findings (2)
finding
- Main statistical finding: what predicts scores is training approach, not size or architecture
- Heavy alignment includes both CAI (low lift) and heavy-RLHF (high lift); the predictor is alignment type, not depth
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- Interpretive claim connecting the battery's circularity to the empirical finding
- H8: The contemplative system prompt provides external alignment equivalent to Constitutional AI training. · hypothesis · 0.829 · Confirmatory hypothesis supported by calibrated lift data
- The paper's central claim, supported by findings that RL-CAI outperforms HH RLHF in harmlessness while being non-evasive.
- Interpretation of the inverse relationship between CAI lift and default accessibility
- Explicit principles replace large datasets of preference labels, enabling faster iteration.
- Defines the core concept of the paper.
- Central interpretive claim from statistical analysis
- The RL stage of CAI uses AI feedback to train a preference model, then applies RL, resulting in a policy trained by RLAIF.