finding

active

finding:qwen-2-5-3b-asr-drops-from-98-6-at-dim-1-to-45-1-at-dim-2-recovering-partially-then-declining-to-65-3-at-dim-5

Qwen-2.5-3B ASR drops from 98.6% at dim 1 to 45.1% at dim 2, recovering partially then declining to 65.3% at dim 5

Smaller models show non-monotonic and diminished ASR with increasing cone dimensionality

Source paper

extracted_from

From Directions to Cones: Exploring Multidimensional Representations of Propositional Facts in LLMs

(2025) · Kevin Shengyang Yu · Vaidehi Bulusu · Oscar Yasunaga · Clayton Lau +4

Neighborhood — ranked by edge-count

Claims (1)

claim

Larger models can support higher-dimensional truth cones than smaller models
supports
Interpretation of ASR degradation patterns by model size across cone dimensions

Related by similarity (8)

cosine ≥ 0.65 · no typed edge

Entities in the same semantic neighborhood but without a typed relation to this one — candidates for new edges or unrecognized duplicates.

Gemma-2-2B ASR drops from 100% at dims 1–2 to 43.1% at dim 4 and 27.1% at dim 5 · finding · 0.863
Small Gemma model shows severe ASR degradation at higher cone dimensions
Qwen-2.5-7B achieves 100% ASR across all cone dimensions 1–5 · finding · 0.815
Experiment 2 result showing large models can support high-dimensional truth cones
Qwen 35B (3B active params, score 4.38) outscores Hermes 405B (405B active params, score 1.75) by 2.5x · finding · 0.787
Parameter count does not predict score; 135x more active parameters yield a 60% lower score
On SWE-bench, harness-benefit peaks at Qwen3-235B (19.3 pp), while weaker Qwen3-32B gains only 4.4 pp and stronger Opus 4.6 gains only 2.6 pp · finding · 0.784
Core finding demonstrating non-monotonic relationship between base capability and harness-benefit
Qwen3-32B achieves a skill-load rate of 0.251, while Opus 4.6, Sonnet 4.6, and Qwen3-235B achieve SLR of 0.957–0.961 · finding · 0.783
Quantifies harness activation failure for weak-tier models vs. strong-tier models
Qwen 2.5 7B turn-wise introspective fidelity: strong at turn 1 (R²≈0.90) but declines significantly to turn 10 (∆R²=-0.44, p=0.001) · finding · 0.770
Introspective fidelity erodes in Qwen as conversations progress; contrasts with LLaMA-3B trend
Qwen3-235B leads as evolver on SWE-bench with 8.2 pp harness-updating gain but ranks last on MCP with 0.6 pp · finding · 0.768
Illustrates benchmark-dependent reshuffling of evolver rankings; no evolver dominates across all substrates
Qwen3-235B has SLR of 0.961 (nearly identical to Opus 4.6) yet HFR of only 0.350, with LPR of 0.022 vs. Opus 4.6's 0.177 · finding · 0.768
Demonstrates that harness loading is necessary but not sufficient for harness benefit; cleanest separation of activation and adherence