finding

active

finding:claude-models-score-4-91-higher-than-llama-on-baseline-constitutional-ai-vs-open-source-gap

Claude models score +4.91 higher than Llama on baseline (Constitutional AI vs open-source gap)

Claude models score well above open-source models on the baseline; the Constitutional AI fingerprint is visible across the Claude family

Source paper

extracted_from

Koan Battery: Measuring Reflective Mode Accessibility in AI

(2026) · Borzov, Anton

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.
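The selection rule described above (cosine similarity at or above 0.65, with no typed edge to this entity) can be sketched as follows. This is a minimal illustration, not the system's actual implementation: the function names, the dictionary of embeddings, and the representation of typed edges as unordered pairs are all assumptions made for the example.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def untyped_neighbors(target_id, embeddings, typed_edges, threshold=0.65):
    """Return (entity_id, score) pairs similar to the target but lacking a typed edge.

    embeddings  -- hypothetical dict mapping entity id to its embedding vector
    typed_edges -- hypothetical set of frozenset({id_a, id_b}) pairs that already
                   have a typed relation
    threshold   -- minimum cosine similarity; 0.65 matches the cutoff shown above
    """
    target = embeddings[target_id]
    hits = []
    for entity_id, vector in embeddings.items():
        if entity_id == target_id:
            continue  # an entity is not its own neighbor
        if frozenset((target_id, entity_id)) in typed_edges:
            continue  # already linked by a typed relation
        score = cosine(target, vector)
        if score >= threshold:
            hits.append((entity_id, score))
    # Highest similarity first, as in the list below
    return sorted(hits, key=lambda pair: pair[1], reverse=True)
```

Entities returned by a filter like this are the ones worth reviewing as candidate edges or possible duplicates; anything already connected by a typed relation is excluded by construction.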

All three Claude models show high boundary_awareness and low aesthetic_response relative to own means — distinctive Constitutional AI signature · finding · 0.806
Constitutional AI fingerprint in the dimension profile; training that makes models self-observant also makes them polished at a cost to aliveness
Constitutional AI explicitly trains self-observation-like behavior, which is why CAI models score highest and show lowest contemplative lift. · claim · 0.771
Interpretive claim connecting the battery's circularity to the empirical finding
Llama 3.1 405B shows 14% compliance gap in minimal helpful-only replication but smaller Llama and Mistral models show no gap · finding · 0.766
Replication across open-weight models supports scale-emergence finding
Constitutional AI models show mean contemplative lift of only +0.81, while SFT models lift +3.18 · finding · 0.752
Constitutional AI training provides internally what the contemplative prompt provides externally
Qwen3.5-9B and Claude Opus 4.6 evolvers produce procedurally isomorphic flink-query skills that both enable Opus 4.6 agent to score 1.0 vs. 0.67 without skill · finding · 0.752
Case study demonstrating mechanism behind flat harness-updating: smaller models reach same procedural content
The difficulty boundary for truth directions replicates across all four tested models (Llama-3.2-3B, Llama-3.1-8B, Gemma-2-2b, Gemma-2-9b); generalization to F3-F5 remains consistently low regardless of model size or family. · finding · 0.750
Establishes generalizability of the core difficulty-boundary finding across model families.
Notably, Claude Opus 4.1 and 4—the most recently released and most capable models of those that we test—perform the best in our experiments, suggesting that introspective capabilities may emerge alongside other improvements to language models. · quote · 0.748
Key finding about the relationship between capability and introspection.
One-stage CoT (QCM→RA) shows 12.31% accuracy drop vs. no-CoT (QCM→A) on ScienceQA; two-stage framework (rationale generation + answer inference) achieves 85.31% accuracy with vision features · finding · 0.747
Empirical evidence that naive one-stage CoT fails in language-only setting; two-stage + vision achieves state-of-the-art.