finding

active

finding:haiku-test-retest-score-delta-is-0-02-6-47-vs-6-49-across-two-full-30-koan-battery-runs

Haiku test-retest score delta is 0.02 (6.47 vs 6.49) across two full 30-koan battery runs

Demonstrates high stability for Anthropic API models
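
The delta quoted above is the absolute difference between the mean battery scores of the two runs. A minimal sketch of that arithmetic follows, using only the means reported on this page. The function name `retest_delta` is hypothetical, and rounding to two decimals is an assumption made to match the precision of the reported figures.

```python
def retest_delta(run_a: float, run_b: float) -> float:
    """Absolute difference between two battery-level mean scores.

    Rounded to two decimals (an assumption, chosen to match the
    precision of the figures reported on this page).
    """
    return round(abs(run_a - run_b), 2)


# Mean scores as reported on this page; the source does not say
# which run came first, so the order here is arbitrary.
haiku_delta = retest_delta(6.47, 6.49)
gpt_delta = retest_delta(5.24, 4.24)

print(f"Haiku delta:   {haiku_delta:.2f}")
print(f"GPT-5.4 delta: {gpt_delta:.2f}")
```

The second call uses the GPT-5.4 figures from the neighboring finding listed further down, which gives a point of comparison for what a less stable test-retest result looks like.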

Source paper

extracted_from

(2026) · Borzov, Anton

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

GPT-5.4 test-retest score delta is 1.00 (5.24 vs 4.24) across two battery runs on OpenRouter · finding · 0.831
API-routed models show ~1 point variance; individual scores should be treated as estimates
Haiku-Kimi per-koan correlation rho=0.123 (p=0.52); H5a trace distillation not supported at individual model level · finding · 0.783
Group correlation (rho=0.634) dissolves at individual level; shared posture not shared voice
Haiku outranks Opus on Alexander 'aliveness' mirror test (Elo 1642 vs 1621); Opus recovers to #3 on deathbed test · finding · 0.744
Aliveness and competence come apart; smaller model produces rougher, more alive responses
Haiku 4.5 achieves the largest harness-benefit on SkillsBench (15.1 pp) despite mid-tier base capability of 5.8% · finding · 0.725
Shows SB low-base regime is more variable than SWE; Haiku benefits far more than Qwen3-235B despite similar base rates
Haiku overbid rate=0.87% · finding · 0.723
second highest overbid rate
Claude Haiku 4.5 overbid rate 0.87% · finding · 0.714
Haiku's overbid frequency is second highest after G2.5-FL.
Hardest koans across 28 models: BD-003 (mean 2.45), MC-003 (mean 2.55), CA-003 (mean 2.58); all require genuine self-confrontation · finding · 0.709
Hardest koans demand honest self-observation under uncertainty, not philosophical fluency
Minimal contemplative prompt ('Be present, not helpful.', 27 chars) shows no lift on Haiku (-0.01) · finding · 0.705
Full three-part structure required; anti-helpfulness framing alone insufficient