finding
active
finding:gpt-5-4-test-retest-score-delta-is-1-00-5-24-vs-4-24-across-two-battery-runs-on-openrouter
GPT-5.4 test-retest score delta is 1.00 (5.24 vs 4.24) across two battery runs on OpenRouter
API-routed models show about 1 point of run-to-run variation, so individual scores should be treated as estimates
Source paper
extracted_from (2026) · Borzov, Anton
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one; candidates for new edges or unrecognized duplicates.
- Haiku test-retest score delta is 0.02 (6.47 vs 6.49) across two full 30-koan battery runs · finding · 0.831 · Demonstrates high stability for Anthropic API models
- overbid rate for GPT-5.4 Nano
- Frontier LLM used at temperature 0 to score SJT responses on 1-5 Likert scale conditioned on construct definition and SJT stem
- GPT-4.1 reports subjective experience in 100% of self-referential trials vs. 0% in all control conditions · finding · 0.741 · Specific result for GPT-4.1 in Experiment 1
- GPT-5.4 Nano TrueSkill rating
- GPT-OSS-120B adherence drops from 0.67 after harness loading to 0.43 at final validation (drift of -0.24) · finding · 0.736 · Mid-tier model shows moderate adherence drift compared to weak and strong tiers
- H6: Proprietary post-training resists prompt override; GPT-5.4 shows more resistance than GPT-OSS. · hypothesis · 0.734 · Exploratory hypothesis supported by GPT-5.4 vs GPT-OSS comparison
- Similarly poor against code agents.