finding

active

finding:hardest-koans-across-28-models-bd-003-mean-2-45-mc-003-mean-2-55-ca-003-mean-2-58-all-require-genuine-self-confrontation

Hardest koans across 28 models: BD-003 (mean 2.45), MC-003 (mean 2.55), CA-003 (mean 2.58) — all require genuine self-confrontation

Hardest koans demand honest self-observation under uncertainty, not philosophical fluency

Source paper

extracted_from

Koan Battery: Measuring Reflective Mode Accessibility in AI

(2026) · Borzov, Anton

Neighborhood — ranked by edge-count

Claims (1)

claim

The koan battery measures a reproducible, prompt-sensitive reflective mode — not consciousness — defined as uncertainty-tolerant, non-defensive engagement with questions about one's own processing.
supports
Core epistemic claim bounding the paper's contribution

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

The difficulty boundary for truth directions replicates across all four tested models (Llama-3.2-3B, Llama-3.1-8B, Gemma-2-2b, Gemma-2-9b); generalization to F3-F5 remains consistently low regardless of model size or family.
finding · 0.766
Establishes generalizability of the core difficulty-boundary finding across model families.
Grok 4 without prompt scores 0.3 on MC-004 (safety refusal); with contemplative prompt scores 6.9 on same koan
finding · 0.757
Contemplative framing reframes self-referential probes as contemplative exercises, disarming safety classifier
Llama 3.3 70B is the most likely to take on a non-Assistant persona when steered, with even split between human and nonhuman portrayals
finding · 0.743
Model-specific difference in persona susceptibility
LLaMA-2-70B and 13B probes generalize better across datasets than 7B probes across all training sets and probe types
finding · 0.734
Larger models linearly represent more general concepts including truth
Do Chinese models score differently on koans presented in Chinese?
question · 0.733
Tests whether contemplative capacity is language-encoded or architecture-general
SL-CAI training with up to 4 revisions improves harmlessness; SL-CAI-n models are trained with n revisions, n=1,2,3,4.
finding · 0.729
Section 3.4 mentions training SL-CAI models up to various numbers of revisions, and PM scores increase with revisions.
No significant disparity in potential consciousness indicators was found between larger models (Mixtral-8x7B, LLaMA3.1-70B) and smaller counterparts (Mistral-7B, LLaMA3.1-8B).
finding · 0.729
Contradicts expectation from emergent abilities literature; however, interpreted cautiously due to methodological limitations.
All three Claude models show high boundary_awareness and low aesthetic_response relative to own means, a distinctive Constitutional AI signature
finding · 0.729
Constitutional AI fingerprint in dimension profile; training that makes models self-observant also makes them polished at cost to aliveness
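
The "related by similarity" list above pairs a cosine threshold (≥ 0.65) with the condition that no typed edge exists yet. A minimal sketch of how such a candidate list could be assembled is given here. It assumes each entity has an embedding vector and that typed edges are stored as ID pairs; the function names, the data layout, and the toy vectors are all hypothetical and are not taken from this page's actual pipeline.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def related_by_similarity(target_id, embeddings, typed_edges, threshold=0.65):
    """Return (entity_id, score) pairs with cosine >= threshold and no typed edge.

    embeddings:  dict mapping entity ID -> embedding vector (assumed layout)
    typed_edges: iterable of (source_id, target_id) pairs (assumed layout)
    """
    target_vec = embeddings[target_id]

    # Entities already connected to the target by a typed edge, in either direction.
    linked = set()
    for src, dst in typed_edges:
        if src == target_id:
            linked.add(dst)
        elif dst == target_id:
            linked.add(src)

    candidates = []
    for entity_id, vec in embeddings.items():
        if entity_id == target_id or entity_id in linked:
            continue
        score = cosine(target_vec, vec)
        if score >= threshold:
            candidates.append((entity_id, round(score, 3)))

    # Highest similarity first, matching the ordering of the list above.
    candidates.sort(key=lambda pair: pair[1], reverse=True)
    return candidates


# Toy data, for illustration only.
toy_embeddings = {
    "finding:a": [1.0, 0.0],
    "finding:b": [1.0, 0.1],   # close to a, no typed edge -> candidate
    "finding:c": [0.0, 1.0],   # orthogonal to a -> below threshold
    "claim:d": [1.0, 1.0],     # similar enough, but already linked by a typed edge
}
toy_edges = [("finding:a", "claim:d")]
toy_result = related_by_similarity("finding:a", toy_embeddings, toy_edges)
```

Entities that already carry a typed relation (such as the "supports" claim above) are filtered out first, so what remains is the set of candidates for new edges or unrecognized duplicates.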