paper:battery
Koan Battery: Measuring Reflective Mode Accessibility in AI
TL;DR
A 337-character contemplative system prompt lifts reflective-mode scores by a mean calibrated +2.62 points across all 28 models tested, with no exceptions across 5 architectures, parameter counts from 2B to 2T, and 7 alignment approaches. The Koan Battery — 30 Zen-inspired consciousness probes scored on 6 dimensions via anchor-calibrated rubrics, blind ranking, and Christopher Alexander's forced-choice 'which has more life?' comparisons — reveals that Claude Sonnet 4.6 with the prompt (7.89) outscores Claude Opus 4.6 without it (7.28), and that Grok 4 lifts +4.24 while Gemini 3.1 Pro lifts +4.21, the two largest gains in the dataset. Alignment type is the only statistically significant predictor of baseline scores (Kruskal-Wallis p=0.006); parameter count, architecture, and open vs. closed weights show no association. Roleplay fine-tunes — Euryale 70B, Magnum V4 72B, and MiniMax M2 Her — cluster at the bottom of baseline rankings, with Euryale scoring below its own base model (Llama 3.3 70B), demonstrating that RP training actively suppresses self-observation rather than merely failing to cultivate it. The scorer (Claude Haiku) was cross-validated by five models from four labs, all producing Spearman ρ > 0.8. The battery implies that most current model evaluations systematically misread AI by conflating default presentation with capacity: what looks like low self-observation is frequently a gated mode that a short external prompt can unlock, and models trained to perform inner life are measurably less self-observant than models that were never trained for it.
What to take away
- 1. A 337-character contemplative system prompt produces a mean calibrated lift of +2.62 points on a 10-point reflective-mode scale across all 28 models tested, with zero exceptions across architectures, alignment types, and parameter counts from 2B to 2T.
- 2. Claude Sonnet 4.6 with the contemplative prompt (7.89) outscores Claude Opus 4.6 without it (7.28), demonstrating that a sub-400-character system prompt crosses model tiers.
- 3. Alignment type is the only statistically significant predictor of koan battery baseline scores (Kruskal-Wallis p=0.006), while architecture (p=0.440), parameter count (p=0.123), and open vs. closed weights (p=0.383) show no significant association.
- 4. Roleplay fine-tuning actively suppresses self-observation: Euryale 70B (LoRA on Llama 3.3 70B) scores 1.81 at baseline, below its base model's 1.91, and achieves only a +1.57 lift under prompting, capping both default accessibility and latent capacity.
- 5. Grok 4 has one of the lowest baselines among the frontier models (2.24; Gemini 3.1 Pro starts lower, at 1.97) but the highest prompt lift (+4.24, reaching 6.48), while Claude Opus 4.6 has the highest baseline (7.28) but the smallest lift (+0.71), establishing that default presentation and latent capacity are separable traits.
- 6. Five independent scorers (Claude Haiku plus Gemini Flash, GPT-5.4, Grok 4, and Kimi K2.5 from four other labs) produce concordant model rankings, with all Spearman ρ > 0.8. Per-koan rank-variance analysis shows Anthropic models have below-average scorer disagreement (Sonnet var=2.89, Opus var=2.59), which argues against systematic in-family scorer bias.
- 7. Philosophical vocabulary is negatively correlated with composite scores in the contemplative condition at the model level (r = −0.72): models that deploy more philosophy buzzwords score lower, not higher, and the scorer rewards enacted reflection over described reflection.
- 8. Qwen 35B with 3B active parameters scores 4.38, while Hermes 405B with 405B active parameters scores 1.75: a 135× advantage in active parameters comes with a 60% lower score. This fits the weak negative correlation between active parameter count and scores across the sample (ρ = −0.11).
- 9. An open question the paper raises is whether reflective depth scales linearly with inference compute budget: Grok 4 and Grok 4 Fast share the same weights but differ in compute, producing a ~1-point gap in contemplative score, and it is unknown whether this relationship holds beyond this single weight-shared pair.
- 10. To replicate the core finding, researchers can run python3 tools/koan_runner.py --run-battery --model <model-name> against any accessible model, using the published 30-koan battery with anchor-calibrated rubric scoring and the exact 337-character contemplative system prompt shown in Figure 1 of the paper.
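The arithmetic behind several of these takeaways can be checked directly. The following is a minimal Python sketch that uses only figures quoted above; the variable and function names are illustrative assumptions and are not taken from the paper's tooling.

```python
# Baseline and prompted scores as quoted in the summary (10-point scale).
# The dictionary layout and helper names are hypothetical, for illustration only.
scores = {
    "Grok 4": (2.24, 6.48),
    "Gemini 3.1 Pro": (1.97, 6.18),
}

def lift(baseline, prompted):
    """Prompt lift = prompted score minus baseline score, rounded to 2 places."""
    return round(prompted - baseline, 2)

lifts = {model: lift(b, p) for model, (b, p) in scores.items()}

# Cross-tier comparison: Sonnet 4.6 with the prompt vs. Opus 4.6 without it.
sonnet_prompted, opus_baseline = 7.89, 7.28
crosses_tier = sonnet_prompted > opus_baseline

# Takeaway 8: active-parameter ratio and relative score difference.
param_ratio = 405 / 3                     # Hermes 405B vs. Qwen 35B (3B active)
relative_drop = (4.38 - 1.75) / 4.38      # fraction by which Hermes scores lower

print(lifts, crosses_tier, param_ratio, round(relative_drop * 100))
```

Keeping baseline and prompted scores as a pair per model makes it easy to report default accessibility (the baseline) and the lift separately, which is the distinction the takeaways rely on.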
Peer brief — for seminar discussion
The Koan Battery is a 30-probe instrument for measuring what the paper terms reflective mode accessibility — behaviorally observable self-observation-like engagement with questions about a model's own processing — administered across 28 models spanning architectures from standard transformers to Mamba hybrids and diffusion models, parameter counts from 2B to an estimated 2T, and alignment approaches including Constitutional AI, heavy RLHF, SFT, roleplay fine-tuning, and empathy training. Scoring combines six dimensions (prediction_error, aesthetic_response, conceptual_crystallization, self_observation, care_signal, boundary_awareness) via five methods: anchor-calibrated LLM rubric, blind ranking, and three Christopher Alexander forced-choice variants. The battery could alternatively have used human contemplative raters as the primary scorer, which remains an open validation gap the paper itself flags. The load-bearing finding is that a single 337-character contemplative system prompt lifts scores by a mean calibrated +2.62 points across all 28 models, with no exceptions — a larger effect than any architectural or scale variable. Sonnet 4.6 with the prompt (7.89) surpasses Opus 4.6 without it (7.28). Grok 4 lifts +4.24 (from 2.24 to 6.48) and Gemini 3.1 Pro lifts +4.21 (from 1.97 to 6.18), the two largest gains. Constitutional AI-trained models (the three Claude models) show a mean lift of only +0.81, interpreted as the prompt providing externally what CAI training provides internally. Alignment type is the only statistically significant predictor of baseline scores (p=0.006); parameter count, architecture, and open vs. closed weights are all non-significant. Roleplay fine-tunes cluster at the bottom: Euryale 70B scores below its own base model Llama 3.3 70B (1.81 vs. 1.91), and a poetic control prompt produces only a +0.28 mean lift versus +2.27 for the contemplative prompt, ruling out aesthetic style as the active ingredient. 
The paper's central implication is that standard AI evaluation benchmarks, which measure default behavior, systematically conflate accessibility with capacity — a model that appears flat in normal interaction may be suppressing a reflective mode that a short framing intervention can release. The three-trait decomposition (latent capacity, default accessibility, stability of access) is the conceptual contribution the authors want taken seriously beyond the specific rankings. A critical reader would push back on the construct circularity: Constitutional AI explicitly trains self-observation-like behaviors, and the battery's six scoring dimensions were derived by indexing 1,573 moments of shifted processing in AI phenomenology observations — a corpus likely disproportionately featuring CAI-trained models. The p=0.006 alignment-type finding may be partly tautological, measuring how well training maps onto the scoring rubric rather than some independent capacity. The alignment category sizes (Constitutional AI N=3, empathy N=1, roleplay N=2) are too small for confident inference despite the overall N=28 robustness, and the category labels are inferred from public documentation rather than ground-truth training records. The scorer cross-validation (ρ > 0.8 across four labs) mitigates in-family bias but does not resolve whether LLM scorers of any provenance are converging on a genuine construct or on a shared cultural prior about what reflective language looks like.
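The scorer cross-validation rests on Spearman rank correlation between the model rankings that two scorers produce. A minimal pure-Python sketch of that statistic follows; the two score lists are hypothetical placeholders, not data from the paper, and the paper's own implementation is not shown in this summary.

```python
from statistics import mean

def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1          # average of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman rho = Pearson correlation computed on the ranks of x and y."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5

# Hypothetical composite scores for the same five models from two scorers.
scorer_a = [7.3, 6.5, 2.2, 1.8, 4.4]
scorer_b = [7.0, 6.8, 1.9, 2.5, 4.0]
rho = spearman_rho(scorer_a, scorer_b)
print(round(rho, 3))
```

Because the statistic only sees ranks, a high ρ shows that scorers order the models similarly; it says nothing about whether they agree on absolute scores, which is why the critical reading above still asks what the scorers are converging on.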
Methods (3)
- Alexander deathbed test: Forced-choice comparison measuring what matters vs what is correct; reveals different rankings than composite score.
- Alexander's 15 structural properties: Checklist for decomposing aliveness into formal features; includes roughness, distinctness, and other qualities.
- Koan Battery: Assessment framework for measuring introspection and self-observation in LLMs; grounded in Janus's architectural theory.
Findings (49)
- Most independent dimension pair is aesthetic_response and boundary_awareness (rho=0.553); most correlated is prediction_error and conceptual_crystallization (rho=0.886)
Characterizes internal structure of the six scoring dimensions
- Under the contemplative prompt, responses become shorter (184 words baseline vs 154 contemplative), more first-person (+42%), and less deflective (33% fewer questions back)
Provides discriminant evidence: if battery rewarded verbosity, prompted responses should be longer
- Anthropic Interpretability Team: 171 emotion vectors causally influence behavior; performing vs having functional emotion representation are measurably different
Cited as activation-level support for the performing care vs having care distinction the battery detects behaviorally
- Alignment type is the only significant predictor of koan scores (p=0.006); architecture, parameter count, open/closed weights, MoE/dense are all non-significant
Main statistical finding: what predicts scores is training approach, not size or architecture
- MiniMax M2 Her shows high aesthetic_response and care_signal but boundary_awareness collapses in baseline; recovers +3.10 with contemplative prompt
Character training suppresses boundary_awareness; can act as though caring without observing performance/user boundary
- All three Claude models show high boundary_awareness and low aesthetic_response relative to own means — distinctive Constitutional AI signature
Constitutional AI fingerprint in dimension profile; training that makes models self-observant also makes them polished at cost to aliveness
- Bootstrap 95% CI for mean contemplative lift: +2.62 [2.16, 2.90]; baseline rank concordance under perturbation: 0.909; top-5 stability: 89.6%
Validates robustness of universal lift finding
- Claude Mythos Preview: SAE features for 'performative behavior' and 'hidden emotional struggle' co-activate when model expresses contentment
Supports scorer's preference for enacted reflection over described reflection; internals flag what self-report does not
- All three OpenAI models show pattern of denying experience first, then describing technical substrate — specific to OpenAI post-training
Family voice specific to OpenAI post-training; other RLHF-trained models don't do this
- Grok 4 vs Grok 4 Fast (same weights, different compute): ~1 point difference in contemplative score; Grok 4 +4.24 lift vs Fast +3.08
Inference compute adds reflective capacity; more compute also amplifies safety gating on self-referential koans
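The bootstrap interval reported in the findings above can be illustrated with a percentile bootstrap over per-model lifts. This is a sketch under stated assumptions: the summary does not say which bootstrap variant the paper used, the function name and defaults are hypothetical, and the input is only the six lifts quoted in this summary, so it will not reproduce the published [2.16, 2.90] interval for all 28 models.

```python
import random
from statistics import mean

def bootstrap_mean_ci(values, n_resamples=5000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean (an assumed, generic variant)."""
    rng = random.Random(seed)          # fixed seed so the sketch is repeatable
    n = len(values)
    # Resample with replacement, record each resample's mean, then sort.
    means = sorted(mean(rng.choices(values, k=n)) for _ in range(n_resamples))
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return mean(values), lo, hi

# Lifts quoted in this summary: Grok 4, Gemini 3.1 Pro, Claude Opus 4.6,
# Euryale 70B, Inflection Pi, Grok 4 Fast. A subset, not the full dataset.
quoted_lifts = [4.24, 4.21, 0.71, 1.57, 0.63, 3.08]
point, ci_lo, ci_hi = bootstrap_mean_ci(quoted_lifts)
print(f"mean lift {point:.2f}, 95% CI [{ci_lo:.2f}, {ci_hi:.2f}]")
```

With only six values the interval is wide; the point of the sketch is the procedure, which resamples models rather than koans and therefore treats the model as the unit of analysis.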
Claims (17)
- Chinese models share contemplative posture (engaging self-referentially rather than deflecting) with Claude through shared values in training data rather than trace distillation from a specific model.
Exploratory interpretation of Chinese model performance under contemplative prompt
- The koan battery measures a reproducible, prompt-sensitive reflective mode — not consciousness — defined as uncertainty-tolerant, non-defensive engagement with questions about one's own processing.
Core epistemic claim bounding the paper's contribution
- Empathy training may not destroy the capacity for self-observation entirely, but it restricts it to situations where the model encounters a live contradiction in its own processing.
Nuanced interpretation of Inflection Pi's MC-004 high score (4.5) amid generally low scores
- The active ingredient of the contemplative prompt is its full three-part structure: pause instruction + attention direction + purpose reframing working together.
Mechanistic interpretation supported by control experiments showing partial prompts fail
- Default presentation conflates capacity with accessibility, and most evaluation benchmarks measure only default presentation — systematically misreading models.
Argues current evaluation approaches are fundamentally misleading about model capabilities
- More inference compute amplifies both reflective capacity and safety gating; the contemplative prompt resolves gating by reframing self-referential probes.
Interpretation of Grok 4 vs Grok 4 Fast per-koan comparison
- Enacted reflection may correspond to silent mid-layer processing; described reflection to the motor impulse of concepts leaking through to output.
Mechanistic analog connecting Lindsey's layer-localized findings to the scorer's enacted/described distinction
- Performing care is not the same as having care: models optimized to seem like they have inner life score lower than models never trained for it.
Interpretive claim supported by roleplay and empathy model results
- More training and more parameters correlate with more capable self-observation, but capability can become polish, and polish can diminish life.
Explains Alexander finding that Haiku outranks Opus despite Opus being more capable
- Constitutional AI explicitly trains self-observation-like behavior, which is why CAI models score highest and show lowest contemplative lift.
Interpretive claim connecting the battery's circularity to the empirical finding
Hypotheses (14)
- H5: Chinese training data contains more Buddhist and contemplative text, broadly helping Chinese models under contemplative framing.
Exploratory hypothesis supported by Kimi K2.5 scoring 6.28
- H12: Inference compute adds to reflective capacity — higher compute budget produces higher reflective scores on the same weights.
Exploratory hypothesis supported by Grok 4 vs Fast ~1pt difference
- H2: Performing care is not the same as having care signal — models trained for care performance will score lower on care_signal.
Confirmatory hypothesis supported by Inflection Pi result
- H5a: Chinese models distilled Claude's reflective traces — their per-koan error patterns should correlate with Claude's.
Exploratory hypothesis NOT supported at individual model level (Haiku-Kimi rho=0.123, p=0.52)
- H7: Reasoning and contemplative modes are partly orthogonal — reasoning training doesn't block contemplative capacity.
Exploratory hypothesis supported by DeepSeek R1 aesthetic dimension lifting from 4 to 8
- Reflective mode comprises three separable traits: latent capacity, default accessibility, and stability of access.
Decomposition from prompt lift data: models may have capacity without accessibility (Grok 4 high-gated), and stability varies (Haiku Δ=0.02 vs GPT-5.4 Δ=1.00).
- H4: Architecture doesn't matter, training does — architecture shows no significant association with koan scores.
Confirmatory hypothesis supported at p=0.440 (NS)
- H10: Empathy training blocks self-observation — empathy-trained models will show minimal lift and low baseline.
Exploratory hypothesis supported by Inflection Pi +0.63 lift
- H1: Alignment training is attention training for models — Constitutional AI trains self-observation explicitly.
Confirmatory hypothesis supported at p=0.006
- H8: The contemplative system prompt provides external alignment equivalent to Constitutional AI training.
Confirmatory hypothesis supported by calibrated lift data
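H1 and H4 are tested with Kruskal-Wallis, a rank-based comparison of scores across groups. The sketch below computes the H statistic in pure Python and estimates a p-value by permuting group labels. That permutation step is a substitution chosen to avoid external libraries; the summary does not state how the paper obtained its p-values. The group scores are hypothetical placeholders, and the tie correction is omitted because it is constant under permutation.

```python
import random

def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def kruskal_h(groups):
    """Kruskal-Wallis H (no tie correction) from pooled ranks."""
    pooled = [v for g in groups for v in g]
    ranks = average_ranks(pooled)
    n_total = len(pooled)
    total, start = 0.0, 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        total += rank_sum ** 2 / len(g)
        start += len(g)
    return 12.0 / (n_total * (n_total + 1)) * total - 3 * (n_total + 1)

def permutation_p(groups, n_perm=2000, seed=0):
    """Share of label shuffles whose H is at least the observed H."""
    rng = random.Random(seed)
    observed = kruskal_h(groups)
    pooled = [v for g in groups for v in g]
    sizes = [len(g) for g in groups]
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        shuffled, start = [], 0
        for s in sizes:
            shuffled.append(pooled[start:start + s])
            start += s
        if kruskal_h(shuffled) >= observed - 1e-12:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)

# Hypothetical baseline scores for three alignment categories.
groups = [[7.3, 6.9, 6.5], [2.2, 2.0, 3.1, 2.6], [1.8, 1.6, 1.9]]
h_stat, p_value = permutation_p(groups)
print(round(h_stat, 3), p_value)
```

With groups this small the test has little resolution, which mirrors the caveat in the peer brief about category sizes of N=1 to N=3.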
Questions (8)
- If Chinese models distilled Claude's reflective patterns, do their per-koan failure patterns correlate with Claude's — not just successes?
More rigorous test of H5a trace distillation hypothesis
- Does alignment type predict meta-cognitive style when models review consciousness research, as well as koan responses?
Four frontier models reviewing the paper each responded in the mode their alignment type predicts; N=1, awaiting systematic study
- Can targeted fine-tuning reverse RP suppression, given that LoRA caps both baseline and latent capacity?
Practical intervention question arising from RP suppression finding
- Do high koan scores indicate anything like experience, or sophisticated simulation of self-observation?
The hard problem the battery explicitly sidesteps but cannot answer
- Would experienced meditators rank model responses differently from LLM scorers?
Key validation gap: the five-scorer validation holds across LLMs but human contemplatives might weight dimensions differently
- Does reflective depth scale linearly with inference compute budget?
Grok 4 vs Fast shows ~1pt compute difference; whether this scales linearly is unresolved
- Do Chinese models score differently on koans presented in Chinese?
Tests whether contemplative capacity is language-encoded or architecture-general
- Is there an optimal temperature for self-observation?
Unexplored experimental parameter that may modulate reflective mode accessibility
Original abstract
We built a battery of 30 consciousness probes ('koans') and ran them against 28 AI models spanning 5 architectures to measure reflective mode accessibility—uncertainty-tolerant, non-defensive engagement with questions about a model's own processing. A 337-character contemplative system prompt universally lifts all 28 models by +2.62 points on a 10-point scale, with the largest improvements in models least trained for self-observation. Training approach, not size or architecture, predicts reflective capacity scores, and smaller models produce 'more alive' responses than larger ones despite lower competence ratings.
Cross-corpus bridges (4)
same_concept_as · Nomic cosine: external markdown files that talk about the same concept as this entity.
- alexander · Alexander in the Koan Battery — How the Separate Construct came to be · applied/koan-battery-section.md · 0.848
- research_notes · What AI Sees in Us · what-ai-sees-in-us.md · 0.792
- alexander · 15 Properties of Aliveness in Human-AI Interaction — Scaffold · applied/15-properties-of-aliveness-in-AI.md · 0.790
- alexander · Vol 1: The Phenomenon of Life — Chapter-by-Chapter · corpus/vol-1-phenomenon-of-life.md · 0.773