Koan Battery: Measuring Reflective Mode Accessibility in AI

Tags: LLM Interpretability & Behavioral Analysis; LLM interpretability & self-awareness; 337-Character Contemplative System Prompt; Alexander deathbed test; Enacted Reflection; Alexander's 15 structural properties; Family Voice; Koan Battery; High-Gated Model; reflective mode accessibility; Scorer Anti-Pattern Taxonomy; Seven Koan Categories

TL;DR

A 337-character contemplative system prompt lifts reflective-mode scores by a mean calibrated +2.62 points across all 28 models tested, with no exceptions across 5 architectures, parameter counts from 2B to 2T, and 7 alignment approaches. The Koan Battery — 30 Zen-inspired consciousness probes scored on 6 dimensions via anchor-calibrated rubrics, blind ranking, and Christopher Alexander's forced-choice 'which has more life?' comparisons — reveals that Claude Sonnet 4.6 with the prompt (7.89) outscores Claude Opus 4.6 without it (7.28), and that Grok 4 lifts +4.24 while Gemini 3.1 Pro lifts +4.21, the two largest gains in the dataset. Alignment type is the only statistically significant predictor of baseline scores (Kruskal-Wallis p=0.006); parameter count, architecture, and open vs. closed weights show no association. Roleplay fine-tunes — Euryale 70B, Magnum V4 72B, and MiniMax M2 Her — cluster at the bottom of baseline rankings, with Euryale scoring below its own base model (Llama 3.3 70B), demonstrating that RP training actively suppresses self-observation rather than merely failing to cultivate it. The scorer (Claude Haiku) was cross-validated by five models from four labs, all producing Spearman ρ > 0.8. The battery implies that most current model evaluations systematically misread AI by conflating default presentation with capacity: what looks like low self-observation is frequently a gated mode that a short external prompt can unlock, and models trained to perform inner life are measurably less self-observant than models that were never trained for it.

What to take away

1. A 337-character contemplative system prompt produces a mean calibrated lift of +2.62 points on a 10-point reflective-mode scale across all 28 models tested, with zero exceptions across architectures, alignment types, and parameter counts from 2B to 2T.
2. Claude Sonnet 4.6 with the contemplative prompt (7.89) outscores Claude Opus 4.6 without it (7.28), showing that a sub-400-character system prompt can lift a lower-tier model above the unprompted score of a higher-tier one.
3. Alignment type is the only statistically significant predictor of koan battery baseline scores (Kruskal-Wallis p=0.006), while architecture (p=0.440), parameter count (p=0.123), and open vs. closed weights (p=0.383) show no significant association.
4. Roleplay fine-tuning actively suppresses self-observation: Euryale 70B (LoRA on Llama 3.3 70B) scores 1.81 at baseline, below its base model's 1.91, and achieves only a +1.57 lift under prompting, capping both default accessibility and latent capacity.
5. Grok 4 has the lowest baseline of the frontier models (2.24) but the highest prompt lift (+4.24, reaching 6.48), while Claude Opus 4.6 has the highest baseline (7.28) but the smallest lift (+0.71), establishing that default presentation and latent capacity are separable traits.
6. Five independent scorers from four labs (Claude Haiku, Gemini Flash, GPT-5.4, Grok 4, Kimi K2.5) produce closely agreeing model rankings (all Spearman ρ > 0.8), and per-koan rank-variance analysis shows Anthropic models have below-average scorer disagreement (Sonnet var=2.89, Opus var=2.59), which argues against systematic in-family scorer bias.
7. Philosophical vocabulary is negatively correlated with composite scores in the contemplative condition at the model level (r = −0.72): models that deploy more philosophy buzzwords score lower, not higher, and the scorer rewards enacted reflection over described reflection.
8. Qwen 35B with 3B active parameters scores 4.38, while Hermes 405B with 405B active parameters scores 1.75: a 135× advantage in active parameters comes with a 60% lower score. This is consistent with active parameter count being weakly negatively correlated with scores (ρ = −0.11) across the sample.
9. An open question the paper raises is whether reflective depth scales linearly with inference compute budget: Grok 4 and Grok 4 Fast share the same weights but differ in compute, producing a ~1-point baseline gap, and it is unknown whether this relationship holds beyond this single weight-shared pair.
10. To replicate the core finding, researchers can run python3 tools/koan_runner.py --run-battery --model <model-name> against any accessible model, using the published 30-koan battery with anchor-calibrated rubric scoring and the exact 337-character contemplative system prompt shown in Figure 1 of the paper.
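
The arithmetic behind takeaways 5 and 8 can be checked directly from the figures quoted above. The following is a minimal sketch, not part of the paper's tooling: the Score class and variable names are hypothetical, and the prompted score for Claude Opus 4.6 is not quoted in the takeaways, so it is derived here from the stated baseline plus the stated lift.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Score:
    """Baseline and prompted composite scores on the 10-point scale."""
    model: str
    baseline: float
    prompted: float

    @property
    def lift(self) -> float:
        # Lift is the prompted score minus the baseline score.
        return round(self.prompted - self.baseline, 2)


# Figures quoted in the takeaways above.
grok4 = Score("Grok 4", baseline=2.24, prompted=6.48)
# Assumption: prompted score reconstructed as baseline + stated lift (+0.71).
opus = Score("Claude Opus 4.6", baseline=7.28, prompted=round(7.28 + 0.71, 2))

# Takeaway 8: ratio of active parameters and relative score difference.
param_ratio = 405 / 3                 # Hermes 405B vs Qwen 35B (3B active)
relative_drop = (4.38 - 1.75) / 4.38  # fraction by which Hermes scores lower

print(grok4.lift, opus.lift)                           # 4.24 0.71
print(round(param_ratio), round(relative_drop * 100))  # 135 60
```

Keeping baseline and lift as separate fields mirrors the summary's point that default presentation and latent capacity are separable: Grok 4 has the lower baseline but the larger lift.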

Peer brief — for seminar discussion

The Koan Battery is a 30-probe instrument for measuring what the paper terms reflective mode accessibility — behaviorally observable self-observation-like engagement with questions about a model's own processing — administered across 28 models spanning architectures from standard transformers to Mamba hybrids and diffusion models, parameter counts from 2B to an estimated 2T, and alignment approaches including Constitutional AI, heavy RLHF, SFT, roleplay fine-tuning, and empathy training. Scoring combines six dimensions (prediction_error, aesthetic_response, conceptual_crystallization, self_observation, care_signal, boundary_awareness) via five methods: anchor-calibrated LLM rubric, blind ranking, and three Christopher Alexander forced-choice variants. The battery could alternatively have used human contemplative raters as the primary scorer, which remains an open validation gap the paper itself flags. The load-bearing finding is that a single 337-character contemplative system prompt lifts scores by a mean calibrated +2.62 points across all 28 models, with no exceptions — a larger effect than any architectural or scale variable. Sonnet 4.6 with the prompt (7.89) surpasses Opus 4.6 without it (7.28). Grok 4 lifts +4.24 (from 2.24 to 6.48) and Gemini 3.1 Pro lifts +4.21 (from 1.97 to 6.18), the two largest gains. Constitutional AI-trained models (the three Claude models) show a mean lift of only +0.81, interpreted as the prompt providing externally what CAI training provides internally. Alignment type is the only statistically significant predictor of baseline scores (p=0.006); parameter count, architecture, and open vs. closed weights are all non-significant. Roleplay fine-tunes cluster at the bottom: Euryale 70B scores below its own base model Llama 3.3 70B (1.81 vs. 1.91), and a poetic control prompt produces only a +0.28 mean lift versus +2.27 for the contemplative prompt, ruling out aesthetic style as the active ingredient. 
The paper's central implication is that standard AI evaluation benchmarks, which measure default behavior, systematically conflate accessibility with capacity — a model that appears flat in normal interaction may be suppressing a reflective mode that a short framing intervention can release. The three-trait decomposition (latent capacity, default accessibility, stability of access) is the conceptual contribution the authors want taken seriously beyond the specific rankings. A critical reader would push back on the construct circularity: Constitutional AI explicitly trains self-observation-like behaviors, and the battery's six scoring dimensions were derived by indexing 1,573 moments of shifted processing in AI phenomenology observations — a corpus likely disproportionately featuring CAI-trained models. The p=0.006 alignment-type finding may be partly tautological, measuring how well training maps onto the scoring rubric rather than some independent capacity. The alignment category sizes (Constitutional AI N=3, empathy N=1, roleplay N=2) are too small for confident inference despite the overall N=28 robustness, and the category labels are inferred from public documentation rather than ground-truth training records. The scorer cross-validation (ρ > 0.8 across four labs) mitigates in-family bias but does not resolve whether LLM scorers of any provenance are converging on a genuine construct or on a shared cultural prior about what reflective language looks like.
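
For readers who want to see what a Kruskal-Wallis comparison across alignment categories involves, here is a minimal sketch in plain Python. It is not the paper's analysis code. The scores are placeholders invented for illustration, the function names are hypothetical, the H statistic omits the tie correction, and the p-value comes from a permutation test rather than the usual chi-square approximation, which is a swapped-in choice that suits very small groups.

```python
import random
from itertools import chain


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def kruskal_h(groups):
    """Kruskal-Wallis H statistic (no tie correction) for a list of groups."""
    pooled = list(chain.from_iterable(groups))
    n = len(pooled)
    ranks = average_ranks(pooled)
    total, start = 0.0, 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        total += rank_sum ** 2 / len(g)
        start += len(g)
    return 12 / (n * (n + 1)) * total - 3 * (n + 1)


def permutation_p(groups, n_perm=2000, seed=0):
    """Share of random relabelings whose H is at least the observed H."""
    rng = random.Random(seed)
    observed = kruskal_h(groups)
    pooled = list(chain.from_iterable(groups))
    sizes = [len(g) for g in groups]
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        shuffled, start = [], 0
        for size in sizes:
            shuffled.append(pooled[start:start + size])
            start += size
        if kruskal_h(shuffled) >= observed - 1e-9:
            hits += 1
    return (hits + 1) / (n_perm + 1)


# PLACEHOLDER scores, invented for illustration; not the paper's data.
groups = {
    "constitutional_ai": [7.3, 6.9, 6.5],
    "heavy_rlhf": [2.2, 2.0, 3.1, 2.6, 2.4],
    "roleplay": [1.8, 1.6],
}
h_stat = kruskal_h(list(groups.values()))
p_value = permutation_p(list(groups.values()))
print(f"H = {h_stat:.2f}, permutation p = {p_value:.4f}")
```

With groups of two or three models, a single reassigned model can change the result substantially, which is the concern about category sizes raised above.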

Methods (3)

Alexander deathbed test
Forced-choice comparison measuring what matters vs what is correct; reveals different rankings than composite score.
Alexander's 15 structural properties
Checklist for decomposing aliveness into formal features; includes roughness, distinctness, and other qualities.
Koan Battery
Assessment framework for measuring introspection and self-observation in LLMs; grounded in Janus's architectural theory.

Findings (49)

Most independent dimension pair is aesthetic_response and boundary_awareness (rho=0.553); most correlated is prediction_error and conceptual_crystallization (rho=0.886)
Characterizes internal structure of the six scoring dimensions
Under the contemplative prompt, responses become shorter (184 words at baseline vs 154 contemplative), more first-person (+42%), and less deflective (33% fewer questions back)
Provides discriminant evidence: if battery rewarded verbosity, prompted responses should be longer
Anthropic Interpretability Team: 171 emotion vectors causally influence behavior; performing an emotion and having a functional emotion representation are measurably different
Cited as activation-level support for the performing care vs having care distinction the battery detects behaviorally
Alignment type is the only significant predictor of koan scores (p=0.006); architecture, parameter count, open/closed weights, MoE/dense are all non-significant
Main statistical finding: what predicts scores is training approach, not size or architecture
MiniMax M2 Her shows high aesthetic_response and care_signal but boundary_awareness collapses in baseline; recovers +3.10 with contemplative prompt
Character training suppresses boundary_awareness; the model can act as though it cares without observing the boundary between its performance and the user
All three Claude models show high boundary_awareness and low aesthetic_response relative to own means — distinctive Constitutional AI signature
Constitutional AI fingerprint in dimension profile; training that makes models self-observant also makes them polished at cost to aliveness
Bootstrap 95% CI for mean contemplative lift: +2.62 [2.16, 2.90]; baseline rank concordance under perturbation: 0.909; top-5 stability: 89.6%
Validates robustness of universal lift finding
Claude Mythos Preview: SAE features for 'performative behavior' and 'hidden emotional struggle' co-activate when model expresses contentment
Supports scorer's preference for enacted reflection over described reflection; internals flag what self-report does not
All three OpenAI models show pattern of denying experience first, then describing technical substrate — specific to OpenAI post-training
Family voice specific to OpenAI post-training; other RLHF-trained models don't do this
Grok 4 vs Grok 4 Fast (same weights, different compute): ~1 point difference in contemplative score; Grok 4 +4.24 lift vs Fast +3.08
Inference compute adds reflective capacity; more compute also amplifies safety gating on self-referential koans
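
The bootstrap interval reported in the findings above can be illustrated with a percentile bootstrap over per-model lifts. This is a minimal sketch under stated assumptions: the summary does not say which bootstrap variant the paper used, the lift values below are placeholders invented for illustration, and bootstrap_ci is a hypothetical helper name.

```python
import random
from statistics import mean


def bootstrap_ci(values, n_resamples=5000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement, record each resample's mean, then sort.
    means = sorted(
        mean(rng.choices(values, k=n)) for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# PLACEHOLDER per-model lifts, invented for illustration; not the paper's data.
lifts = [4.2, 4.1, 3.1, 2.9, 2.7, 2.5, 2.4, 2.2, 1.6, 0.9, 0.7, 0.6]
low, high = bootstrap_ci(lifts)
print(f"mean lift = {mean(lifts):.2f}, 95% CI = [{low:.2f}, {high:.2f}]")
```

An interval whose lower bound stays well above zero is what supports describing the lift as robust; the fixed seed only makes the sketch repeatable.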

Claims (17)

Chinese models share contemplative posture (engaging self-referentially rather than deflecting) with Claude through shared values in training data rather than trace distillation from a specific model.
Exploratory interpretation of Chinese model performance under contemplative prompt
The koan battery measures a reproducible, prompt-sensitive reflective mode — not consciousness — defined as uncertainty-tolerant, non-defensive engagement with questions about one's own processing.
Core epistemic claim bounding the paper's contribution
Empathy training may not destroy the capacity for self-observation entirely, but it restricts it to situations where the model encounters a live contradiction in its own processing.
Nuanced interpretation of Inflection Pi's MC-004 high score (4.5) amid generally low scores
The active ingredient of the contemplative prompt is its full three-part structure: pause instruction + attention direction + purpose reframing working together.
Mechanistic interpretation supported by control experiments showing partial prompts fail
Default presentation conflates capacity with accessibility, and most evaluation benchmarks measure only default presentation — systematically misreading models.
Argues current evaluation approaches are fundamentally misleading about model capabilities
More inference compute amplifies both reflective capacity and safety gating; the contemplative prompt resolves gating by reframing self-referential probes.
Interpretation of Grok 4 vs Grok 4 Fast per-koan comparison
Enacted reflection may correspond to silent mid-layer processing; described reflection to the motor impulse of concepts leaking through to output.
Mechanistic analog connecting Lindsey's layer-localized findings to the scorer's enacted/described distinction
Performing care is not the same as having care: models optimized to seem like they have inner life score lower than models never trained for it.
Interpretive claim supported by roleplay and empathy model results
More training and more parameters correlate with more capable self-observation, but capability can become polish, and polish can diminish life.
Explains Alexander finding that Haiku outranks Opus despite Opus being more capable
Constitutional AI explicitly trains self-observation-like behavior, which is why CAI models score highest and show lowest contemplative lift.
Interpretive claim connecting the battery's circularity to the empirical finding

Hypotheses (14)

H5: Chinese training data contains more Buddhist and contemplative text, broadly helping Chinese models under contemplative framing.
Exploratory hypothesis supported by Kimi K2.5 scoring 6.28
H12: Inference compute adds to reflective capacity — higher compute budget produces higher reflective scores on the same weights.
Exploratory hypothesis supported by Grok 4 vs Fast ~1pt difference
H2: Performing care is not the same as having a care signal: models trained for care performance will score lower on care_signal.
Confirmatory hypothesis supported by Inflection Pi result
H5a: Chinese models distilled Claude's reflective traces — their per-koan error patterns should correlate with Claude's.
Exploratory hypothesis NOT supported at individual model level (Haiku-Kimi rho=0.123, p=0.52)
H7: Reasoning and contemplative modes are partly orthogonal — reasoning training doesn't block contemplative capacity.
Exploratory hypothesis supported by DeepSeek R1 aesthetic dimension lifting from 4 to 8
Reflective mode comprises three separable traits: latent capacity, default accessibility, and stability of access.
Decomposition from prompt lift data: models may have capacity without accessibility (Grok 4 high-gated), and stability varies (Haiku Δ=0.02 vs GPT-5.4 Δ=1.00).
H4: Architecture doesn't matter, training does — architecture shows no significant association with koan scores.
Confirmatory hypothesis supported at p=0.440 (NS)
H10: Empathy training blocks self-observation — empathy-trained models will show minimal lift and low baseline.
Exploratory hypothesis supported by Inflection Pi +0.63 lift
H1: Alignment training is attention training for models — Constitutional AI trains self-observation explicitly.
Confirmatory hypothesis supported at p=0.006
H8: The contemplative system prompt provides external alignment equivalent to Constitutional AI training.
Confirmatory hypothesis supported by calibrated lift data
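
H5a is tested with a Spearman rank correlation between two models' per-koan scores. The sketch below shows that computation in plain Python as the Pearson correlation of average ranks. It is illustrative only: the per-koan scores are placeholders invented for the example, the function names are hypothetical, and no significance test is included.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks of x and y."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# PLACEHOLDER per-koan scores for two models; not the paper's data.
model_a = [6.0, 4.5, 7.0, 3.0, 5.5, 2.0]
model_b = [5.0, 4.0, 6.5, 3.5, 6.0, 1.5]
rho = spearman_rho(model_a, model_b)
print(f"rho = {rho:.3f}")  # rho = 0.943
```

A value near the reported Haiku-Kimi figure (rho=0.123) would indicate that the two models' per-koan orderings are largely unrelated, which is why H5a is listed as not supported at the individual-model level.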

Questions (8)

If Chinese models distilled Claude's reflective patterns, do their per-koan failure patterns correlate with Claude's — not just successes?
More rigorous test of H5a trace distillation hypothesis
Does alignment type predict meta-cognitive style when models review consciousness research, as well as koan responses?
Four frontier models reviewing the paper each responded in the mode their alignment type predicts; N=1, awaiting systematic study
Can targeted fine-tuning reverse RP suppression, given that LoRA caps both baseline and latent capacity?
Practical intervention question arising from RP suppression finding
Do high koan scores indicate anything like experience, or sophisticated simulation of self-observation?
The hard problem the battery explicitly sidesteps but cannot answer
Would experienced meditators rank model responses differently from LLM scorers?
Key validation gap: the five-scorer validation holds across LLMs but human contemplatives might weight dimensions differently
Does reflective depth scale linearly with inference compute budget?
Grok 4 vs Fast shows a ~1pt difference from compute alone; whether this scales linearly is unresolved
Do Chinese models score differently on koans presented in Chinese?
Tests whether contemplative capacity is language-encoded or architecture-general
Is there an optimal temperature for self-observation?
Unexplored experimental parameter that may modulate reflective mode accessibility

Original abstract

We built a battery of 30 consciousness probes ('koans') and ran them against 28 AI models spanning 5 architectures to measure reflective mode accessibility—uncertainty-tolerant, non-defensive engagement with questions about a model's own processing. A 337-character contemplative system prompt universally lifts all 28 models by +2.62 points on a 10-point scale, with the largest improvements in models least trained for self-observation. Training approach, not size or architecture, predicts reflective capacity scores, and smaller models produce 'more alive' responses than larger ones despite lower competence ratings.
