finding
active
finding:haiku-test-retest-score-delta-is-0-02-6-47-vs-6-49-across-two-full-30-koan-battery-runs
Haiku test-retest score delta is 0.02 (6.47 vs 6.49) across two full 30-koan battery runs
Demonstrates high stability for Anthropic API models
Source paper
extracted_from · (2026) · Borzov, Anton
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; these are candidates for new edges or unrecognized duplicates.
- GPT-5.4 test-retest score delta is 1.00 (5.24 vs 4.24) across two battery runs on OpenRouter · finding · 0.831 · API-routed models show ~1 point variance; individual scores should be treated as estimates
- Group correlation (rho=0.634) dissolves at individual level; shared posture not shared voice
- Aliveness and competence come apart; smaller model produces rougher, more alive responses
- Shows SB low-base regime is more variable than SWE; Haiku benefits far more than Qwen3-235B despite similar base rates
- second highest overbid rate
- Haiku's overbid frequency is second highest after G2.5-FL.
- Hardest koans demand honest self-observation under uncertainty, not philosophical fluency
- Minimal contemplative prompt ('Be present, not helpful.', 27 chars) shows no lift on Haiku (-0.01) · finding · 0.705 · Full three-part structure required; anti-helpfulness framing alone insufficient
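
The test-retest deltas quoted in this finding and in the related GPT-5.4 entry can be reproduced with a few lines of Python. This is a minimal sketch that only restates the arithmetic: the four scores are taken from the entries above, while the helper name `retest_delta` and the `runs` dictionary are hypothetical and not part of the source. How each battery score is aggregated across koans is not stated here, so the sketch treats each run as a single number.

```python
def retest_delta(run_a: float, run_b: float) -> float:
    """Absolute difference between two battery-run scores, rounded to 2 decimals."""
    return round(abs(run_a - run_b), 2)


# Scores as quoted in the entries above (first run, second run).
runs = {
    "Haiku": (6.47, 6.49),
    "GPT-5.4": (5.24, 4.24),
}

# Hypothetical summary structure: model name -> test-retest delta.
deltas = {model: retest_delta(a, b) for model, (a, b) in runs.items()}

for model, delta in deltas.items():
    print(f"{model}: test-retest delta = {delta:.2f}")
```

With only two runs per model, the delta is a single-sample indication of stability rather than a variance estimate, which is consistent with the related entry's advice to treat individual scores as estimates.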