thinker
active
thinker:openalex-A5114373631

Shrey Pandit

Authored: 1
Introduces: 0
Studies: 0
Affiliations: 0
Cited by: 0

Authored papers (1)

  • Continual reinforcement learning applied directly to reasoning-optimized models, rather than to instruction-tuned checkpoints, yields SFR-DR-20B, a 20-billion-parameter autonomous single agent that achieves 28.7% on the full text-only Humanity's Last Exam (HLE) benchmark. That is a 65% relative improvement over the 17.3% scored by its starting model, gpt-oss-20b, and it outperforms OpenAI Deep Research with o3 (26.6%) without relying on multi-agent scaffolding. The framework introduced, SFR-DeepResearch (SFR-DR), combines two components. The first is an agentic inference scaffold tailored to each model family; for QwQ-32B and Qwen3 models it reframes multi-turn tool calling as iterative single-turn contextual QA. The second is a REINFORCE-based RL algorithm with temporal advantage normalization (step-level advantages are divided by the trajectory length T_i) and strategic trajectory filtering, trained entirely on synthetic data. Without length normalization, agents trained from Qwen3-8B degenerate into repetitive tool-calling loops despite receiving negative rewards, because long failing trajectories dominate the batch loss; with normalization, training stabilizes and tool use grows moderately and effectively. SFR-DR-32B (from QwQ-32B) scores 72.0 on FRAMES and 52.4 on GAIA, while SFR-DR-8B (from Qwen3-8B) reaches 63.3 on FRAMES and 13.2 on HLE, results that match or exceed open-source baselines two to four times larger. The paper argues that initializing RL from reasoning-optimized models, rather than from base or SFT models, allows agentic capabilities to be grafted onto strong chain-of-thought reasoning. It further argues that single agents trained this way can generalize to unseen tasks better than workflow-constrained multi-agent systems, while still serving as drop-in sub-agents when multi-agent orchestration is desired.
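The length normalization described in the summary can be illustrated with a small sketch. It is not the paper's implementation: the batch-mean baseline, the function names `step_advantages` and `total_weight`, and the toy rewards and lengths are all assumptions made for illustration. Only the division of each step-level advantage by the trajectory length T_i comes from the summary.

```python
from statistics import mean


def step_advantages(rewards, lengths, normalize_by_length=True):
    """Return a list of per-step advantages for each trajectory.

    rewards[i] is the scalar reward of trajectory i and lengths[i] is its
    number of steps T_i. The batch-mean baseline is an assumption of this
    sketch; the summary only specifies the division by T_i.
    """
    baseline = mean(rewards)
    per_step = []
    for reward, length in zip(rewards, lengths):
        advantage = reward - baseline
        if normalize_by_length:
            advantage /= length  # temporal normalization: divide by T_i
        # Every step of a trajectory shares the same advantage value.
        per_step.append([advantage] * length)
    return per_step


def total_weight(per_step):
    """Sum of absolute step advantages: a trajectory's pull on the batch loss."""
    return [sum(abs(a) for a in steps) for steps in per_step]


# Toy batch: a short successful trajectory and a long failing one.
rewards = [1.0, 0.0]
lengths = [4, 40]

raw = total_weight(step_advantages(rewards, lengths, normalize_by_length=False))
norm = total_weight(step_advantages(rewards, lengths, normalize_by_length=True))

print(raw)   # the long failing trajectory carries 10x the weight
print(norm)  # both trajectories carry approximately equal weight
```

In this toy batch the unnormalized weights are 2.0 and 20.0, so the 40-step failure dominates the update, whereas after dividing by T_i each trajectory contributes roughly 0.5. That matches the summary's explanation of why long failing trajectories can destabilize training when no normalization is applied.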


Co-authors (7)

Recent mentions (1)