Shrey Pandit (thinker:openalex-A5114373631)
Authored papers (1)
- SFR-DeepResearch: Towards Effective Reinforcement Learning for Autonomously Reasoning Single Agents (2025)
Continual reinforcement learning applied directly to reasoning-optimized models, rather than to instruction-tuned checkpoints, yields a 20-billion-parameter autonomous single agent, SFR-DR-20B, that scores 28.7% on the full text-only Humanity's Last Exam (HLE) benchmark. This is a 65% relative improvement over the 17.3% scored by its base model, gpt-oss-20b, and it outperforms OpenAI Deep Research with o3 (26.6%) without relying on multi-agent scaffolding. The framework, SFR-DeepResearch (SFR-DR), combines two components: an agentic inference scaffolding tailored to each model family, which for QwQ-32B and Qwen3 reframes multi-turn tool calling as iterative single-turn contextual QA; and a REINFORCE-based RL algorithm trained on entirely synthetic data, featuring temporal advantage normalization (step-level advantages are divided by the trajectory length T_i) and strategic trajectory filtering. Without length normalization, agents trained on Qwen3-8B degenerate into repetitive tool-calling loops despite receiving negative rewards, because long failing trajectories dominate the batch loss; normalization stabilizes training and produces moderate, effective growth in tool use. SFR-DR-32B (from QwQ-32B) scores 72.0 on FRAMES and 52.4 on GAIA, while SFR-DR-8B (from Qwen3-8B) reaches 63.3 on FRAMES and 13.2 on HLE, competitive with or exceeding open-source baselines two to four times larger. The paper argues that initializing RL from reasoning-optimized models, rather than from base or SFT models, lets agentic capabilities be grafted onto strong chain-of-thought reasoning, and that single agents trained this way can generalize to unseen tasks better than workflow-constrained multi-agent systems while still serving as drop-in sub-agents when multi-agent orchestration is desired.
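The effect of temporal advantage normalization can be illustrated with a small sketch. It assumes a batch-mean reward baseline (a common REINFORCE choice; the summary only states that step-level advantages are divided by T_i), and the `Trajectory` class and the `step_advantages` and `total_weight` helpers are hypothetical names for illustration, not the paper's code. Trajectory filtering and the policy-gradient update itself are not shown.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Trajectory:
    reward: float   # scalar outcome reward for the whole trajectory
    num_steps: int  # T_i: number of agent steps (tool calls / turns)


def step_advantages(trajs: List[Trajectory],
                    normalize_by_length: bool = True) -> List[List[float]]:
    """Return one advantage value per step for each trajectory.

    Assumption: the baseline is the batch-mean reward. With
    normalize_by_length=True, each step's advantage is divided by T_i,
    so every trajectory contributes the same total magnitude to the loss
    regardless of how many steps it took.
    """
    if not trajs:
        return []
    baseline = sum(t.reward for t in trajs) / len(trajs)
    out = []
    for t in trajs:
        adv = t.reward - baseline
        if normalize_by_length:
            adv /= t.num_steps
        # every step of the trajectory shares the same (scaled) advantage
        out.append([adv] * t.num_steps)
    return out


def total_weight(step_advs: List[float]) -> float:
    # Absolute contribution of one trajectory to the batch loss,
    # summed over all of its steps.
    return sum(abs(a) for a in step_advs)


if __name__ == "__main__":
    # A short successful trajectory vs. a long failing one (illustrative values).
    batch = [Trajectory(reward=1.0, num_steps=2),
             Trajectory(reward=0.0, num_steps=20)]
    raw = step_advantages(batch, normalize_by_length=False)
    norm = step_advantages(batch, normalize_by_length=True)
    print([total_weight(a) for a in raw])               # [1.0, 10.0]
    print([round(total_weight(a), 6) for a in norm])    # [0.5, 0.5]
```

Without normalization, the 20-step failure carries ten times the loss weight of the 2-step success, which matches the summary's account of long failing trajectories dominating the batch; dividing by T_i equalizes the per-trajectory contribution.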
Co-authors (7)
- Caiming Xiong (2 shared)
- Revanth Gangi Reddy (2 shared)
- Shafiq Joty (2 shared)
- Silvio Savarese (2 shared)
- Xuan-Phi Nguyen (2 shared)
- Aimin Xu (1 shared)
- Austin Xu (1 shared)
Recent mentions (1)
- papers-typednguyen-2025-sfr-deepresearch.md