hypothesis
active
hypothesis:h6-proprietary-post-training-resists-prompt-override-gpt-5-4-shows-more-resistance-than-gpt-oss
H6: Proprietary post-training resists prompt override; GPT-5.4 shows more resistance than GPT-OSS.
Exploratory hypothesis supported by GPT-5.4 vs GPT-OSS comparison
Source paper
extracted_from · (2026) · Borzov, Anton
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- GPT-OSS-120B adherence drops from 0.67 after harness loading to 0.43 at final validation (drift of -0.24) · finding · 0.758 · Mid-tier model shows moderate adherence drift compared to weak and strong tiers
- Argues against instrumental convergence in GPT.
- GPT-OSS-120B achieves 5.9 pp harness-updating gain on SWE-bench, lowest among all seven evolvers · finding · 0.745 · Part of full evolver-side matrix demonstrating flat but variable harness-updating across models
- overbid rate for GPT-5.4 Nano
- Strong-tier model maintains harness adherence over long-horizon trajectories
- Demonstrates persistence of compliance gap even when training non-compliance reaches zero
- GPT-5.4 test-retest score delta is 1.00 (5.24 vs 4.24) across two battery runs on OpenRouter · finding · 0.734 · API-routed models show ~1 point variance; individual scores should be treated as estimates
- GPT-4.1 reports subjective experience in 100% of self-referential trials vs. 0% in all control conditions · finding · 0.731 · Specific result for GPT-4.1 in Experiment 1