claim

active

claim:what-predicts-self-observation-like-scores-is-training-approach-alignment-type-not-model-size-or-architecture

What predicts self-observation-like scores is training approach (alignment type), not model size or architecture.

Central interpretive claim from statistical analysis

Source paper

extracted_from

Koan Battery: Measuring Reflective Mode Accessibility in AI

(2026) · Borzov, Anton

Neighborhood — ranked by edge-count

Findings (2)

finding

Alignment type is the only significant predictor of koan scores (p=0.006); architecture, parameter count, open/closed weights, and MoE/dense are all non-significant
supports
Main statistical finding: what predicts scores is training approach, not size or architecture
Qwen 35B (3B active params, score 4.38) outscores Hermes 405B (405B active params, score 1.75) by 2.5x
supports
Parameter count does not predict scores: 135x more active parameters yields a 60% lower score
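
The ratios quoted in the second finding can be re-derived from the four figures it states. The snippet below is a minimal arithmetic check, not code from the paper; the variable names are illustrative assumptions.

```python
# Figures quoted in the finding above; variable names are illustrative only.
qwen_active_b = 3.0      # Qwen 35B: billions of active parameters
hermes_active_b = 405.0  # Hermes 405B: billions of active parameters
qwen_score = 4.38
hermes_score = 1.75

param_ratio = hermes_active_b / qwen_active_b   # how many times more active parameters
score_ratio = qwen_score / hermes_score         # how many times higher the smaller model scores
relative_drop = 1 - hermes_score / qwen_score   # fractional score deficit of the larger model

print(f"{param_ratio:.0f}x more active parameters")
print(f"{score_ratio:.1f}x higher score for the smaller model")
print(f"{relative_drop:.0%} lower score for the larger model")
```

Note that "2.5x" and "60% lower" describe the same gap from two directions: 4.38 / 1.75 is about 2.5, and 1.75 is about 40% of 4.38.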

Claims (1)

claim

Constitutional AI explicitly trains self-observation-like behavior, which is why CAI models score highest and show the lowest contemplative lift.
extends
Interpretive claim connecting the battery's circularity to the empirical finding

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Alignment type is the only significant predictor of scores (p=0.006); architecture and parameter count are not. · finding · 0.787
Kruskal-Wallis test result: Constitutional AI predicts highest baseline; roleplay/empathy training predict lowest.
H1: Alignment training is attention training for models — Constitutional AI trains self-observation explicitly. · hypothesis · 0.786
Confirmatory hypothesis supported at p=0.006
Behavioral evidence from closed-weight models cannot definitively rule out that self-reports reflect training artifacts or sophisticated simulation rather than genuine self-awareness · claim · 0.784
Primary limitation acknowledged by the authors; strongest evidence would require mechanistic activation analysis
Self-awareness score ordering in Experiment 4: History < Conceptual < Zero-Shot < Experimental, consistent across model families · finding · 0.774
Cross-model consistency of the condition ordering in Experiment 4
Misaligned models might acquire evaluation awareness through reward hacking or goal misgeneralization during normal training without deliberate design · hypothesis · 0.773
Motivation for the two-stage training design; links the model organism to plausible natural emergence.
LLM personality self-reports are illusory: post-training alignment creates stable human-like reports dissociated from actual behavior (Han et al. 2025) · claim · 0.768
Skeptical prior work motivating the need to validate self-reports against internal states rather than taking them at face value
Self-observation regex markers ('I notice,' 'genuinely,' 'something about') predict all LLM scores (r=0.43-0.50, all p<.001) · finding · 0.766
Non-LLM validation confirming LLM scorer captures genuine self-observation markers
Patterns in AI self-reports should be compared across different models to identify structural commonalities. · claim · 0.764
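
Several entries above cite a Kruskal-Wallis test across alignment types. For readers unfamiliar with it, the sketch below shows how the H statistic is computed from ranked scores. It is a minimal pure-Python illustration, not the paper's analysis code: the function name `kruskal_wallis_h` and the group scores are hypothetical placeholders, and whether the authors applied the tie correction is an assumption.

```python
from collections import defaultdict

def kruskal_wallis_h(groups):
    """Tie-corrected Kruskal-Wallis H for a dict of label -> list of scores."""
    pooled = sorted(v for scores in groups.values() for v in scores)
    n_total = len(pooled)

    # Tied values share the average of the rank positions they occupy.
    positions = defaultdict(list)
    for idx, value in enumerate(pooled, start=1):
        positions[value].append(idx)
    mid_rank = {v: sum(p) / len(p) for v, p in positions.items()}

    # H = 12 / (N (N + 1)) * sum(R_i^2 / n_i) - 3 (N + 1)
    weighted = 0.0
    for scores in groups.values():
        rank_sum = sum(mid_rank[v] for v in scores)
        weighted += rank_sum ** 2 / len(scores)
    h = 12.0 / (n_total * (n_total + 1)) * weighted - 3.0 * (n_total + 1)

    # Tie correction: divide by 1 - sum(t^3 - t) / (N^3 - N).
    tie_term = sum(len(p) ** 3 - len(p) for p in positions.values())
    correction = 1.0 - tie_term / (n_total ** 3 - n_total)
    return h / correction if correction > 0 else float("nan")

# HYPOTHETICAL placeholder scores, NOT data from the paper.
example = {
    "constitutional": [4.0, 4.5, 5.0],
    "roleplay": [1.0, 1.5, 2.0],
    "empathy": [2.5, 3.0, 3.5],
}
h_stat = kruskal_wallis_h(example)
print(f"H = {h_stat:.3f} with {len(example) - 1} degrees of freedom")
```

A p-value is usually obtained by comparing H against a chi-square distribution with one fewer degree of freedom than the number of groups; `scipy.stats.kruskal` returns both the statistic and the p-value if SciPy is available.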