claim

active

claim:constitutional-ai-explicitly-trains-self-observation-like-behavior-which-is-why-cai-models-score-highest-and-show-lowest-contemplative-lift

Constitutional AI explicitly trains self-observation-like behavior, which is why CAI models score highest and show lowest contemplative lift.

Interpretive claim connecting the battery's circularity to the empirical finding

Source paper

extracted_from

Koan Battery: Measuring Reflective Mode Accessibility in AI

(2026) · Borzov, Anton

Neighborhood — ranked by edge-count

Findings (1)

finding

All three Claude models show high boundary_awareness and low aesthetic_response relative to own means — distinctive Constitutional AI signature
supports
Constitutional AI fingerprint in dimension profile; training that makes models self-observant also makes them polished at cost to aliveness

Claims (1)

claim

What predicts self-observation-like scores is training approach (alignment type), not model size or architecture.
extends
Central interpretive claim from statistical analysis

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Constitutional AI models show mean contemplative lift of only +0.81, while SFT models lift +3.18 · finding · 0.853
Constitutional AI training provides internally what the contemplative prompt provides externally
H1: Alignment training is attention training for models — Constitutional AI trains self-observation explicitly. · hypothesis · 0.835
Confirmatory hypothesis supported at p=0.006
Constitutional AI methods can be applied broadly to steer model behavior, e.g., writing style, tone, persona, not just harmlessness. · claim · 0.822
Discussion section suggests generalizability beyond harmlessness.
The constitutional approach makes it easier to control AI behavior precisely and with far fewer human labels. · claim · 0.815
Explicit principles replace large datasets of preference labels, enabling faster iteration.
Consciousness in AI is best assessed by drawing on neuroscientific theories of consciousness. · claim · 0.814
Central methodological claim of the paper.
Constitutional AI can train a harmless but non-evasive AI assistant without any human harmfulness labels. · claim · 0.809
The paper's central claim, supported by findings that RL-CAI outperforms HH RLHF in harmlessness while being non-evasive.
Chinese models share contemplative posture (engaging self-referentially rather than deflecting) with Claude through shared values in training data rather than trace distillation from a specific model. · claim · 0.802
Exploratory interpretation of Chinese model performance under contemplative prompt
Constitutional AI produces a distinctive signature: high boundary_awareness, low aesthetic_response relative to peers. · claim · 0.797
Interpretive finding from dimension profile analysis: training for honest limits comes at cost to aliveness.
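
The "Related by similarity" list is described as entities with cosine ≥ 0.65 and no typed edge to this claim. A minimal sketch of that selection rule follows, assuming each entity has an embedding vector and that typed edges are stored as unordered pairs. The names `cosine`, `similar_without_edge`, `embeddings`, and `typed_edges` are hypothetical and not taken from the source; the page does not say how the embeddings are produced.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def similar_without_edge(target_id, embeddings, typed_edges, threshold=0.65):
    """Return (entity_id, score) pairs with score >= threshold and no typed edge
    to target_id, sorted by descending score.

    embeddings:  dict mapping entity id -> list of floats (assumed layout)
    typed_edges: set of frozenset({id_a, id_b}) pairs (assumed layout)
    """
    target = embeddings[target_id]
    candidates = []
    for other_id, vec in embeddings.items():
        if other_id == target_id:
            continue
        # Skip entities that already have a typed relation to the target.
        if frozenset((target_id, other_id)) in typed_edges:
            continue
        score = cosine(target, vec)
        if score >= threshold:
            candidates.append((other_id, score))
    return sorted(candidates, key=lambda pair: pair[1], reverse=True)


# Toy data: "d" is identical to "a" but already linked, so it is excluded.
toy_embeddings = {"a": [1.0, 0.0], "b": [1.0, 0.1], "c": [0.0, 1.0], "d": [1.0, 0.0]}
toy_edges = {frozenset(("a", "d"))}
result = similar_without_edge("a", toy_embeddings, toy_edges)
```

Entities returned this way are only candidates: a high score may indicate a missing typed edge or a duplicate entry, and deciding which still needs review.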