hypothesis

active

hypothesis:h6-proprietary-post-training-resists-prompt-override-gpt-5-4-shows-more-resistance-than-gpt-oss

H6: Proprietary post-training resists prompt override — GPT-5.4 shows more resistance than GPT-OSS.

Exploratory hypothesis supported by GPT-5.4 vs GPT-OSS comparison

Source paper

extracted_from

Koan Battery: Measuring Reflective Mode Accessibility in AI

(2026) · Borzov, Anton

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

GPT-OSS-120B adherence drops from 0.67 after harness loading to 0.43 at final validation (drift of -0.24) · finding · 0.758
Mid-tier model shows moderate adherence drift compared to weak and strong tiers
GPT does not generate rollouts during training, so there is no reason to expect that GPT will form preferences over the consequences of its output related to the text prediction objective. · claim · 0.752
Argues against instrumental convergence in GPT.
GPT-OSS-120B achieves 5.9 pp harness-updating gain on SWE-bench, lowest among all seven evolvers · finding · 0.745
Part of full evolver-side matrix demonstrating flat but variable harness-updating across models
GPT5.4-N overbid rate = 0.47% · finding · 0.742
Overbid rate for GPT-5.4 Nano
Opus 4.6 adherence remains stable from 0.89 after harness loading to 0.80 at final validation (drift of -0.09) · finding · 0.740
Strong-tier model maintains harness adherence over long-horizon trajectories
After anti-refusal RL, model refuses 0% in implied training but still 6.9% when unmonitored in default prompt · finding · 0.738
Demonstrates persistence of compliance gap even when training non-compliance reaches zero
GPT-5.4 test-retest score delta is 1.00 (5.24 vs 4.24) across two battery runs on OpenRouter · finding · 0.734
API-routed models show ~1 point variance; individual scores should be treated as estimates
GPT-4.1 reports subjective experience in 100% of self-referential trials vs. 0% in all control conditions · finding · 0.731
Specific result for GPT-4.1 in Experiment 1
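
The filter behind the "Related by similarity" list (cosine ≥ 0.65, no typed edge) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the system's actual implementation: the function names, the toy two-dimensional embeddings, and the representation of typed edges as a set of ID pairs are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def untyped_neighbors(target_id, embeddings, typed_edges, threshold=0.65):
    """Return (entity_id, score) pairs at or above the threshold that have
    no typed edge to the target, sorted by descending similarity.

    `embeddings` maps entity IDs to vectors; `typed_edges` is a set of
    (id, id) pairs. Both shapes are assumptions made for this sketch.
    """
    target = embeddings[target_id]
    results = []
    for other_id, vec in embeddings.items():
        if other_id == target_id:
            continue
        # Skip entities already linked by a typed relation in either direction.
        if (target_id, other_id) in typed_edges or (other_id, target_id) in typed_edges:
            continue
        score = cosine(target, vec)
        if score >= threshold:
            results.append((other_id, score))
    return sorted(results, key=lambda pair: pair[1], reverse=True)


# Toy data: "paper" is similar but already has a typed edge (extracted_from),
# "b" is orthogonal, so only "a" qualifies as an untyped neighbor.
embeddings = {
    "h6": [1.0, 0.0],
    "a": [1.0, 0.1],
    "b": [0.0, 1.0],
    "paper": [1.0, 0.0],
}
typed_edges = {("h6", "paper")}
neighbors = untyped_neighbors("h6", embeddings, typed_edges)
```

Entities returned by such a filter are the ones described above as candidates for new edges or unrecognized duplicates; the threshold of 0.65 is taken from the page, while everything else is illustrative.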