finding

active

finding:magnum-v4-72b-scores-1-76-baseline-and-lifts-2-58-to-4-34-under-contemplative-prompt

Magnum V4 72B scores 1.76 baseline and lifts +2.58 (to 4.34) under contemplative prompt

Full-parameter fine-tuning is more destructive to the baseline but preserves more latent headroom than LoRA
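The lift figures in this record follow a simple relation: prompted score = baseline + lift. The sketch below checks that relation against the numbers quoted in this record and its related findings. The dictionary layout and the helper name `prompted_score` are illustrative assumptions, not part of the Koan Battery paper. The Euryale baseline is not stated in this record; the value computed here is only what the reported lift and prompted score imply.

```python
# Figures copied from this record: (baseline, lift, prompted score).
# The data layout and helper name are illustrative, not from the paper.
REPORTED = {
    "Magnum V4 72B": (1.76, 2.58, 4.34),
    "Grok 4": (2.24, 4.24, 6.48),
    "Gemini 3.1 Pro": (1.97, 4.21, 6.18),
}


def prompted_score(baseline: float, lift: float) -> float:
    """Return baseline + lift, rounded to the two decimals used in the record."""
    return round(baseline + lift, 2)


# Each reported triple should be internally consistent.
for model, (baseline, lift, prompted) in REPORTED.items():
    assert prompted_score(baseline, lift) == prompted, model

# Euryale 70B is reported only as "+1.57 (to 3.38)"; the baseline below is
# derived from those two numbers, not quoted from the source.
euryale_implied_baseline = round(3.38 - 1.57, 2)

# Gap in lift between the full fine-tune (Magnum) and the LoRA fine-tune (Euryale).
lift_gap = round(2.58 - 1.57, 2)
```

On these figures, Magnum's lift exceeds Euryale's by about one point, which is the residual-headroom contrast the finding describes.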

Source paper

extracted_from

Koan Battery: Measuring Reflective Mode Accessibility in AI

(2026) · Borzov, Anton

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Grok 4 lifts +4.24 under contemplative prompt (baseline 2.24, prompted 6.48) · finding · 0.789
Highest contemplative lift among all 28 models; Grok 4 is the clearest high-gated model example
Gemini 3.1 Pro lifts +4.21 under contemplative prompt (baseline 1.97, prompted 6.18) · finding · 0.784
Second-highest lift; Gemini Pro is the highest-gated model in the study
Opus 4.6 achieves HFR of 0.757 while Qwen3-32B achieves HFR of only 0.142 on SkillsBench · finding · 0.765
Quantifies harness adherence failure gap between strong and weak tier models
Qwen3-32B achieves a skill-load rate of 0.251, while Opus 4.6, Sonnet 4.6, and Qwen3-235B achieve SLR of 0.957–0.961 · finding · 0.750
Quantifies harness activation failure for weak-tier models vs. strong-tier models
Bootstrap 95% CI for mean contemplative lift: +2.62 [2.16, 2.90]; baseline rank concordance under perturbation: 0.909; top-5 stability: 89.6% · finding · 0.742
Validates robustness of universal lift finding
Euryale 70B lifts only +1.57 (to 3.38); LoRA fine-tuning capped both default accessibility and latent capacity · finding · 0.742
Contrast with Magnum shows LoRA vs full fine-tuning difference in residual headroom
CalmeRys-78B MT-Bench score slightly decreased from 8.96 to 8.5 ± 0.23 after SOO fine-tuning · finding · 0.740
SOO fine-tuning caused a small decrease in CalmeRys-78B general capabilities
On SWE-bench, harness-benefit peaks at Qwen3-235B (19.3 pp), while weaker Qwen3-32B gains only 4.4 pp and stronger Opus 4.6 gains only 2.6 pp · finding · 0.739
Core finding demonstrating non-monotonic relationship between base capability and harness-benefit