finding

active

finding:gpt-5-4-test-retest-score-delta-is-1-00-5-24-vs-4-24-across-two-battery-runs-on-openrouter

GPT-5.4 test-retest score delta is 1.00 (5.24 vs 4.24) across two battery runs on OpenRouter

API-routed models show roughly 1 point of run-to-run variation, so individual scores should be treated as estimates

Source paper

extracted_from

(2026) · Borzov, Anton

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Haiku test-retest score delta is 0.02 (6.47 vs 6.49) across two full 30-koan battery runs · finding · 0.831
Demonstrates high stability for Anthropic API models
GPT5.4-N overbid rate=0.47% · finding · 0.752
overbid rate for GPT-5.4 Nano
GPT-5.1 SJT Response Scoring · method · 0.743
Frontier LLM used at temperature 0 to score SJT responses on 1-5 Likert scale conditioned on construct definition and SJT stem
GPT-4.1 reports subjective experience in 100% of self-referential trials vs. 0% in all control conditions · finding · 0.741
Specific result for GPT-4.1 in Experiment 1
GPT5.4-N TrueSkill μ=22.6±2.7 · finding · 0.739
GPT-5.4 Nano TrueSkill rating
GPT-OSS-120B adherence drops from 0.67 after harness loading to 0.43 at final validation (drift of -0.24) · finding · 0.736
Mid-tier model shows moderate adherence drift compared to weak and strong tiers
H6: Proprietary post-training resists prompt override; GPT-5.4 shows more resistance than GPT-OSS. · hypothesis · 0.734
Exploratory hypothesis supported by GPT-5.4 vs GPT-OSS comparison
GPT5.4-N wins 14.3% of mixed games · finding · 0.724
Similarly poor against code agents.
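
The neighborhood list above can be read as a filter: keep entities whose embedding cosine similarity to this finding is at least 0.65 and that have no typed edge to it. A minimal sketch of that filter, assuming entities are stored as plain embedding vectors and typed edges as a set of ID pairs; the function names, entity IDs, and toy vectors are all hypothetical and do not come from the source system:

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def untyped_neighbors(target_id, embeddings, typed_edges, threshold=0.65):
    """Entities similar to the target that have no typed edge to it."""
    target = embeddings[target_id]
    hits = []
    for other_id, vec in embeddings.items():
        if other_id == target_id:
            continue
        # Skip anything already linked by a typed edge, in either direction.
        if (target_id, other_id) in typed_edges or (other_id, target_id) in typed_edges:
            continue
        score = cosine(target, vec)
        if score >= threshold:
            hits.append((other_id, round(score, 3)))
    return sorted(hits, key=lambda h: h[1], reverse=True)


# Toy data (hypothetical): 2-D vectors stand in for real embeddings.
embeddings = {
    "gpt54_retest": [1.0, 0.0],
    "haiku_retest": [0.8, 0.6],   # similar, no typed edge
    "source_paper": [1.0, 0.1],   # similar, but already linked
    "unrelated": [0.0, 1.0],      # below threshold
}
typed_edges = {("gpt54_retest", "source_paper")}

neighbors = untyped_neighbors("gpt54_retest", embeddings, typed_edges)
print(neighbors)
```

Entries that survive the filter are the candidates for new edges or duplicate checks described above.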