finding
active
Qwen 35B (3B active params, score 4.38) outscores Hermes 405B (405B active params, score 1.75) by 2.5x
Parameter count does not predict score: Hermes 405B has 135x more active parameters than Qwen 35B, yet scores 60% lower.
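The ratios quoted in this finding can be checked from the four numbers in its title. The snippet below is a minimal sketch of that arithmetic only. The variable names are illustrative, and it assumes "60% lower" means the relative drop from the Qwen score to the Hermes score.

```python
# Figures taken from the finding's title; variable names are illustrative.
qwen_active_b = 3.0      # Qwen 35B active parameters, in billions
hermes_active_b = 405.0  # Hermes 405B active parameters, in billions
qwen_score = 4.38
hermes_score = 1.75

# How many times more active parameters Hermes has than Qwen.
param_ratio = hermes_active_b / qwen_active_b

# How many times higher the Qwen score is than the Hermes score.
score_ratio = qwen_score / hermes_score

# Relative drop from the Qwen score to the Hermes score (assumed reading of "60% lower").
relative_drop = 1 - hermes_score / qwen_score

print(f"parameter ratio: {param_ratio:.0f}x")   # 135x
print(f"score ratio: {score_ratio:.2f}x")       # 2.50x
print(f"relative drop: {relative_drop:.0%}")    # 60%
```

The 2.5x and 60% figures describe the same gap from two directions: 4.38 / 1.75 ≈ 2.5, and 1 − 1.75 / 4.38 ≈ 0.60.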
Source paper
extracted_from (2026) · Borzov, Anton
Neighborhood — ranked by edge-count
Claims (1)
claim
- Central interpretive claim from statistical analysis
Related by similarity (8)
cosine ≥ 0.65 · no typed edge
Entities in the same semantic neighborhood but without a typed relation to this one: candidates for new edges or unrecognized duplicates.
- Quantifies harness activation failure for weak-tier models vs. strong-tier models
- Demonstrates that harness loading is necessary but not sufficient for harness benefit; cleanest separation of activation and adherence
- Smaller models show non-monotonic and diminished ASR with increasing cone dimensionality
- Core finding demonstrating non-monotonic relationship between base capability and harness-benefit
- Strongest cross-family probe; explains clearer introspection in Qwen than Gemma
- Opus 4.6 achieves HFR of 0.757 while Qwen3-32B achieves HFR of only 0.142 on SkillsBench · finding · 0.764 · Quantifies harness adherence failure gap between strong and weak tier models
- Smallest model tested as evolver; produces harness updates comparable to Claude Opus 4.6 on SkillsBench
- Shows that SB low-base regime is variable; similar starting points can yield very different harness-benefit