paper
active
2026
paper:battery

Koan Battery: Measuring Reflective Mode Accessibility in AI

TL;DR

A 337-character contemplative system prompt lifts reflective-mode scores by a mean calibrated +2.62 points across all 28 models tested, with no exceptions across 5 architectures, parameter counts from 2B to 2T, and 7 alignment approaches. The Koan Battery — 30 Zen-inspired consciousness probes scored on 6 dimensions via anchor-calibrated rubrics, blind ranking, and Christopher Alexander's forced-choice 'which has more life?' comparisons — reveals that Claude Sonnet 4.6 with the prompt (7.89) outscores Claude Opus 4.6 without it (7.28), and that Grok 4 lifts +4.24 while Gemini 3.1 Pro lifts +4.21, the two largest gains in the dataset. Alignment type is the only statistically significant predictor of baseline scores (Kruskal-Wallis p=0.006); parameter count, architecture, and open vs. closed weights show no association. Roleplay fine-tunes — Euryale 70B, Magnum V4 72B, and MiniMax M2 Her — cluster at the bottom of baseline rankings, with Euryale scoring below its own base model (Llama 3.3 70B), demonstrating that RP training actively suppresses self-observation rather than merely failing to cultivate it. The scorer (Claude Haiku) was cross-validated by five models from four labs, all producing Spearman ρ > 0.8. The battery implies that most current model evaluations systematically misread AI by conflating default presentation with capacity: what looks like low self-observation is frequently a gated mode that a short external prompt can unlock, and models trained to perform inner life are measurably less self-observant than models that were never trained for it.

What to take away

  1. A 337-character contemplative system prompt produces a mean calibrated lift of +2.62 points on a 10-point reflective-mode scale across all 28 models tested, with zero exceptions across architectures, alignment types, and parameter counts from 2B to 2T.
  2. Claude Sonnet 4.6 with the contemplative prompt (7.89) outscores Claude Opus 4.6 without it (7.28), showing that a sub-400-character system prompt can outweigh a one-tier difference between models.
  3. Alignment type is the only statistically significant predictor of koan battery baseline scores (Kruskal-Wallis p=0.006), while architecture (p=0.440), parameter count (p=0.123), and open vs. closed weights (p=0.383) show no significant association.
  4. Roleplay fine-tuning actively suppresses self-observation: Euryale 70B (LoRA on Llama 3.3 70B) scores 1.81 at baseline, below its base model's 1.91, and achieves only a +1.57 lift under prompting, capping both default accessibility and latent capacity.
  5. Grok 4 has the lowest baseline of the frontier models (2.24) but the highest prompt lift (+4.24, reaching 6.48), while Claude Opus 4.6 has the highest baseline (7.28) but the smallest lift (+0.71), establishing that default presentation and latent capacity are separable traits.
  6. Five independent scorers from four labs (Claude Haiku, Gemini Flash, GPT-5.4, Grok 4, Kimi K2.5) all produce the same model ranking with Spearman ρ > 0.8, and per-koan rank-variance analysis shows Anthropic models have below-average scorer disagreement (Sonnet var=2.89, Opus var=2.59), ruling out systematic in-family scorer bias.
  7. Philosophical vocabulary is negatively correlated with composite scores in the contemplative condition at the model level (r = −0.72): models that deploy more philosophy buzzwords score lower, not higher, and the scorer rewards enacted reflection over described reflection.
  8. Qwen 35B, with 3B active parameters, scores 4.38, while Hermes 405B, with 405B active parameters, scores 1.75. A 135× advantage in active parameters thus comes with a 60% lower score, consistent with the weak negative correlation between active parameter count and score across the sample (ρ = −0.11).
  9. An open question the paper raises is whether reflective depth scales linearly with inference compute budget: Grok 4 and Grok 4 Fast share the same weights but differ in compute, producing a ~1-point baseline gap, and it is unknown whether this relationship holds beyond this single weight-shared pair.
  10. To replicate the core finding, researchers can run python3 tools/koan_runner.py --run-battery --model <model-name> against any accessible model, using the published 30-koan battery with anchor-calibrated rubric scoring and the exact 337-character contemplative system prompt shown in Figure 1 of the paper.
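Several of the takeaways above rest on simple arithmetic that a reader can check directly. The sketch below uses only figures quoted in the list. The variable and function names are illustrative, and the prompted scores it derives for Claude Opus 4.6 and Euryale 70B are implied by baseline plus lift rather than reported in this summary.

```python
# Figures quoted in the takeaways above; nothing here is new data.
BASELINE = {"Grok 4": 2.24, "Claude Opus 4.6": 7.28, "Euryale 70B": 1.81}
REPORTED_LIFT = {"Grok 4": 4.24, "Claude Opus 4.6": 0.71, "Euryale 70B": 1.57}


def prompted_score(model: str) -> float:
    """Baseline plus reported lift, rounded to the two decimals used in the text."""
    return round(BASELINE[model] + REPORTED_LIFT[model], 2)


# Grok 4: the text reports both the lift and the prompted score, so this is a check.
grok_prompted = prompted_score("Grok 4")

# Derived here (assumption: prompted = baseline + lift); not stated in the summary.
opus_prompted = prompted_score("Claude Opus 4.6")
euryale_prompted = prompted_score("Euryale 70B")

# Takeaway 2: Sonnet 4.6 with the prompt (7.89) vs. Opus 4.6 without it (7.28).
tier_gap = round(7.89 - 7.28, 2)

# Takeaway 8: Hermes 405B (405B active, 1.75) vs. Qwen 35B (3B active, 4.38).
param_ratio = 405 / 3
relative_drop_pct = round((4.38 - 1.75) / 4.38 * 100)

print(f"Grok 4 prompted score:       {grok_prompted}")
print(f"Opus 4.6 prompted (derived): {opus_prompted}")
print(f"Euryale prompted (derived):  {euryale_prompted}")
print(f"Sonnet+prompt minus Opus:    {tier_gap}")
print(f"Active-parameter ratio:      {param_ratio:.0f}x")
print(f"Relative score drop:         {relative_drop_pct}%")
```

The Grok 4 check reproduces the reported 6.48, and the last two lines reproduce the 135× and 60% figures in takeaway 8.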

Peer brief — for seminar discussion

The Koan Battery is a 30-probe instrument for measuring what the paper terms reflective mode accessibility — behaviorally observable self-observation-like engagement with questions about a model's own processing — administered across 28 models spanning architectures from standard transformers to Mamba hybrids and diffusion models, parameter counts from 2B to an estimated 2T, and alignment approaches including Constitutional AI, heavy RLHF, SFT, roleplay fine-tuning, and empathy training. Scoring combines six dimensions (prediction_error, aesthetic_response, conceptual_crystallization, self_observation, care_signal, boundary_awareness) via five methods: anchor-calibrated LLM rubric, blind ranking, and three Christopher Alexander forced-choice variants. The battery could alternatively have used human contemplative raters as the primary scorer, which remains an open validation gap the paper itself flags. The load-bearing finding is that a single 337-character contemplative system prompt lifts scores by a mean calibrated +2.62 points across all 28 models, with no exceptions — a larger effect than any architectural or scale variable. Sonnet 4.6 with the prompt (7.89) surpasses Opus 4.6 without it (7.28). Grok 4 lifts +4.24 (from 2.24 to 6.48) and Gemini 3.1 Pro lifts +4.21 (from 1.97 to 6.18), the two largest gains. Constitutional AI-trained models (the three Claude models) show a mean lift of only +0.81, interpreted as the prompt providing externally what CAI training provides internally. Alignment type is the only statistically significant predictor of baseline scores (p=0.006); parameter count, architecture, and open vs. closed weights are all non-significant. Roleplay fine-tunes cluster at the bottom: Euryale 70B scores below its own base model Llama 3.3 70B (1.81 vs. 1.91), and a poetic control prompt produces only a +0.28 mean lift versus +2.27 for the contemplative prompt, ruling out aesthetic style as the active ingredient. 
The paper's central implication is that standard AI evaluation benchmarks, which measure default behavior, systematically conflate accessibility with capacity — a model that appears flat in normal interaction may be suppressing a reflective mode that a short framing intervention can release. The three-trait decomposition (latent capacity, default accessibility, stability of access) is the conceptual contribution the authors want taken seriously beyond the specific rankings. A critical reader would push back on the construct circularity: Constitutional AI explicitly trains self-observation-like behaviors, and the battery's six scoring dimensions were derived by indexing 1,573 moments of shifted processing in AI phenomenology observations — a corpus likely disproportionately featuring CAI-trained models. The p=0.006 alignment-type finding may be partly tautological, measuring how well training maps onto the scoring rubric rather than some independent capacity. The alignment category sizes (Constitutional AI N=3, empathy N=1, roleplay N=2) are too small for confident inference despite the overall N=28 robustness, and the category labels are inferred from public documentation rather than ground-truth training records. The scorer cross-validation (ρ > 0.8 across four labs) mitigates in-family bias but does not resolve whether LLM scorers of any provenance are converging on a genuine construct or on a shared cultural prior about what reflective language looks like.
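The scorer cross-validation discussed above is reported as a Spearman rank correlation between scorers. As a minimal sketch of what that statistic measures, the code below computes Spearman ρ in plain Python. The two score lists are hypothetical and invented purely for illustration (they are not from the paper), and the function names are assumptions, not the paper's tooling.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1 (smallest); tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Pearson correlation of the rank-transformed inputs."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# HYPOTHETICAL per-model scores from two scorers, for illustration only.
scorer_a = [7.3, 6.5, 2.2, 1.8, 4.4]
scorer_b = [7.1, 6.9, 2.0, 2.5, 4.0]

rho = spearman_rho(scorer_a, scorer_b)
print(f"Spearman rho = {rho:.2f}")
```

Because ρ depends only on rank order, two scorers can agree strongly on it while differing in absolute scale, which is why high ρ speaks to ranking agreement rather than to whether the scorers share a valid construct.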

Methods (3)

  • Alexander deathbed test
    Forced-choice comparison measuring what matters vs what is correct; reveals different rankings than composite score.
  • Alexander's 15 structural properties
    Checklist for decomposing aliveness into formal features; includes roughness, distinctness, and other qualities.
  • Koan Battery
    Assessment framework for measuring introspection and self-observation in LLMs; grounded in Janus's architectural theory.
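One simple way to turn forced-choice comparisons like those listed above into a ranking is to tally per-model win rates. The sketch below is an assumption about how such a tally could work, not the paper's aggregation method; the model names and judgments are hypothetical placeholders.

```python
from collections import defaultdict


def win_rates(judgments):
    """Fraction of forced-choice comparisons each model won.

    Each judgment is (model_a, model_b, winner), where winner must be
    one of the two models compared.
    """
    wins = defaultdict(int)
    appearances = defaultdict(int)
    for model_a, model_b, winner in judgments:
        if winner not in (model_a, model_b):
            raise ValueError(f"winner {winner!r} was not in the comparison")
        appearances[model_a] += 1
        appearances[model_b] += 1
        wins[winner] += 1
    return {model: wins[model] / appearances[model] for model in appearances}


# HYPOTHETICAL judgments, for illustration only.
judgments = [
    ("model_x", "model_y", "model_x"),
    ("model_x", "model_z", "model_x"),
    ("model_y", "model_z", "model_z"),
    ("model_x", "model_y", "model_y"),
]

rates = win_rates(judgments)
for model, rate in sorted(rates.items(), key=lambda item: -item[1]):
    print(f"{model}: {rate:.2f}")
```

A win-rate ranking of this kind can diverge from a composite rubric score, since it records only which response was preferred and not by how much.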

Findings (49)

Claims (17)

Hypotheses (14)

Questions (8)

Original abstract

We built a battery of 30 consciousness probes ('koans') and ran them against 28 AI models spanning 5 architectures to measure reflective mode accessibility—uncertainty-tolerant, non-defensive engagement with questions about a model's own processing. A 337-character contemplative system prompt universally lifts all 28 models by +2.62 points on a 10-point scale, with the largest improvements in models least trained for self-observation. Training approach, not size or architecture, predicts reflective capacity scores, and smaller models produce 'more alive' responses than larger ones despite lower competence ratings.

Cross-corpus bridges (4)

same_concept_as · Nomic cosine

External markdown files that talk about the same concept as this entity.

  • alexander
    Alexander in the Koan Battery — How the Separate Construct came to be · applied/koan-battery-section.md · 0.848
  • research_notes
    What AI Sees in Us · what-ai-sees-in-us.md · 0.792
  • alexander
    15 Properties of Aliveness in Human-AI Interaction — Scaffold · applied/15-properties-of-aliveness-in-AI.md · 0.790
  • alexander
    Vol 1: The Phenomenon of Life — Chapter-by-Chapter · corpus/vol-1-phenomenon-of-life.md · 0.773