claim
active
claim:what-predicts-self-observation-like-scores-is-training-approach-alignment-type-not-model-size-or-architecture
What predicts self-observation-like scores is training approach (alignment type), not model size or architecture.
Central interpretive claim from statistical analysis
Source paper
extracted_from · Borzov, Anton (2026)
Neighborhood — ranked by edge-count
Findings (2)
finding
- Main statistical finding: what predicts scores is training approach, not size or architecture
- Parameter count does not predict scores: a model with 135x more parameters scores 60% lower
Claims (1)
claim
- Interpretive claim connecting the battery's circularity to the empirical finding
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Kruskal-Wallis test result: Constitutional AI predicts highest baseline; roleplay/empathy training predict lowest.
- H1: Alignment training is attention training for models; Constitutional AI trains self-observation explicitly. · hypothesis · 0.786 · Confirmatory hypothesis supported at p=0.006
- Primary limitation acknowledged by the authors; strongest evidence would require mechanistic activation analysis
- Cross-model consistency of the condition ordering in Experiment 4
- Motivation for the two-stage training design; links the model organism to plausible natural emergence.
- Skeptical prior work motivating the need to validate self-reports against internal states rather than taking them at face value
- Non-LLM validation confirming that the LLM scorer captures genuine self-observation markers