SFR-DeepResearch: Towards Effective Reinforcement Learning for Autonomously Reasoning Single Agents

ByXuan-Phi Nguyen·Shrey Pandit·Revanth Gangi Reddy ⓘ·Aimin Xu ⓘ·Silvio Savarese·Caiming Xiong ⓘ+1 moreSalesforce AI Research

DOI 10.48550/arxiv.2509.06283 arXiv 2509.06283 OpenAlex W4416063947

LLM Interpretability & Behavioral Analysis LLM interpretability & self-awareness

TL;DR

Continual reinforcement learning applied directly to reasoning-optimized base models—rather than starting from instruction-tuned checkpoints—yields a 20-parameter-billion autonomous single-agent, SFR-DR-20B, that achieves 28.7% on the full text-only Humanity's Last Exam (HLE) benchmark, a 65% relative improvement over the gpt-oss-20b base model's 17.3%, and outperforms OpenAI Deep Research with o3 (26.6%) without relying on multi-agent scaffolding. The framework introduced, SFR-DeepResearch (SFR-DR), combines a per-model-family agentic inference scaffolding—which reframes multi-turn tool-calling as iterative single-turn contextual QA for QwQ-32B and Qwen3 models—with a REINFORCE-based RL algorithm featuring temporal advantage normalization (dividing step-level advantages by trajectory length Ti) and strategic trajectory filtering over entirely synthetic training data. Without length normalization, agents trained on Qwen3-8B degenerate into repetitive tool-calling loops despite negative rewards, because long failing trajectories dominate batch loss; normalization stabilizes training and produces moderate, effective tool-use growth. SFR-DR-32B (from QwQ-32B) scores 72.0 on FRAMES and 52.4 on GAIA, while SFR-DR-8B (from Qwen3-8B) reaches 63.3 on FRAMES with 13.2 on HLE—competitive with or exceeding open-source baselines two-to-four times larger. The paper argues that initializing RL from reasoning-optimized models rather than base or SFT models allows agentic capabilities to be grafted onto strong chain-of-thought reasoning, and that single-agent architectures trained this way can generalize to unseen tasks better than workflow-constrained multi-agent systems while serving as drop-in sub-agents when multi-agent orchestration is desired.

What to take away

1. SFR-DR-20B, trained via continual RL from gpt-oss-20b, achieves 28.7% on the full text-only Humanity's Last Exam benchmark, compared to the gpt-oss-20b baseline's 17.3% and OpenAI Deep Research with o3's 26.6%.
2. Reformulating multi-turn tool-calling as a single-turn contextual QA prompt for QwQ-32B yields a 10 percentage-point absolute gain on FRAMES (58.0 → 68.0 pre-RL) versus the model's default multi-turn chat template, with no additional training.
3. Temporal advantage normalization—dividing step-level advantages by trajectory length Ti in the REINFORCE objective—prevents degenerate repetitive tool-calling: without it, SFR-DR-8B training reward and HLE validation performance both collapse despite long trajectories receiving negative rewards.
4. SFR-DR-32B (from QwQ-32B) scores 72.0 on FRAMES and 52.4 on GAIA, outperforming open-source single-agent baselines WebSailor-32B (69.78 / 44.0) and WebShaper-32B (69.42 / 48.5) evaluated under the same contamination blocklist.
5. The gpt-oss-20b-based SFR-DR-20B makes up to 10 times more tool calls per HLE question than QwQ-32B and Qwen3-8B variants, which tend toward internal reasoning; RL further increases this gap, suggesting base-model agentic priors strongly shape post-RL tool-use behavior.
6. SFR-DR-20B generates fewer than 2,000 tokens per agentic step on HLE, 4–5 times fewer than the 8B and 32B Qwen-family counterparts, and RL training actually shrinks its per-step response length while expanding Qwen-family response lengths.
7. The paper raises the open question of whether Qwen-family models (QwQ-32B, Qwen3-8B) have been post-trained so heavily on single-turn reasoning tasks that their chain-of-thought quality degrades irreversibly in long multi-turn agentic settings, limiting the ceiling of RL-based agentic fine-tuning for these architectures.
8. To replicate the training data pipeline: iteratively construct multi-hop QA pairs that are hard enough that OpenAI Deep Research with o3 scores below 65% and the best open-source baseline scores below 40%, then use an LLM-generated rubric with factuality, compliance, writing quality, and citation quality sub-scores for long-form report tasks, all without human annotation.
9. A contamination blocklist blocking domains such as huggingface.co is applied during both training rollouts and evaluation; without it, up to 3.4% of usable HLE samples may be trivially answered from leaked solutions, and baseline numbers from systems lacking such precautions are re-run under the blocklist, which alters reported scores.
10. Partial rollouts are treated as independent initial states from which new group-level Monte Carlo rollouts begin under the current policy, rather than continuing unfinished trajectories with an updated policy as in Kimi-Researcher, providing more gradient signal from long-tail intermediate states.

Peer brief — for seminar discussion

SFR-DeepResearch trains autonomous single-agent LLMs for Deep Research by applying continual reinforcement learning directly to reasoning-optimized "thinking" models rather than starting from base or instruction-tuned checkpoints—a departure from most prior work. Three open-source backbones are used: QwQ-32B, Qwen3-8B, and gpt-oss-20b, yielding SFR-DR-32B, SFR-DR-8B, and SFR-DR-20B respectively. Each agent is equipped with three minimal tools: a search API returning top-10 organic results, a static web scraper (hyperlinks stripped, forcing rediscovery through search), and a stateless local Python interpreter. A self-managed memory tool allows agents to compress their own context window when it exceeds a configurable limit, enabling arbitrarily long trajectories without external memory banks. The load-bearing finding is that SFR-DR-20B reaches 28.7% on the full text-only Humanity's Last Exam (HLE), versus 17.3% for its gpt-oss-20b base and 26.6% for OpenAI Deep Research with o3, while SFR-DR-32B scores 72.0 on FRAMES and 52.4 on GAIA—beating WebSailor-32B (69.78 / 44.0) and WebShaper-32B (69.42 / 48.5) under identical contamination blocklist conditions. A key mechanism is the method introduced here: temporal advantage normalization, which divides step-level advantages by trajectory length Ti in the REINFORCE objective. Without it, agents degenerate into repetitive identical tool calls because long trajectories—even failing, penalized ones—contribute disproportionately many action steps per batch and are therefore reinforced; normalization suppresses this pathology and stabilizes training. An alternative to this length-normalization approach would have been GRPO with explicit length penalties, as used by prior work, but that is shown insufficient to prevent collapse in long-horizon agentic settings. An additional inference-time contribution—recasting multi-turn tool calling as single-turn contextual QA for Qwen-family models—yields a 10 percentage-point absolute FRAMES gain for QwQ-32B with zero training, an observation attributed to those models having been over-optimized for single-turn reasoning. The work implies that the right initialization point for agentic RL is a reasoning-optimized model rather than a vanilla SFT model, and that single agents trained this way can generalize more broadly than multi-agent systems constrained by fixed workflows, while remaining composable as sub-agents in larger systems. A prediction embedded in the analysis is that Qwen-family models face a structural ceiling in long-horizon agentic settings because their chain-of-thought quality degrades in multi-turn contexts—a hypothesis with direct implications for which base models are worth fine-tuning for future DR systems. The most contestable element is the evaluation methodology: contamination blocklisting is applied to re-run baselines, but the blocklist domain coverage is not fully disclosed, and the decision to rerun only some baselines while accepting OpenAI's self-reported numbers at face value creates an asymmetric comparison. A critical reader would also push back on the claim that single-agent architectures generalize better than multi-agent ones—this is asserted on principled grounds but never tested against a controlled multi-agent ablation on the same base models. Finally, SFR-DR-20B's outsized gains over SFR-DR-32B despite fewer parameters conflate two variables (base model quality and parameter count), making it hard to isolate the contribution of the RL recipe itself from the choice of gpt-oss-20b as a particularly well-suited agentic prior.

Findings (3)

SFR-DR-20B achieves 28.7% on Humanity's Last Exam full text-only benchmark, 65% relative improvement over gpt-oss-20b baseline.
Main evaluation result showing best variant outperforms many proprietary and open-source baselines of comparable or larger sizes.
Single-turn agentic workflow yields 10% absolute improvement on FRAMES for QwQ-32B over default multi-turn template.
Result demonstrating inference-time architectural gains from reformulating multi-turn interactions as single-turn contextual QA.
Length normalization prevents degenerate tool-calling trajectories and repeated tool calls without normalization.
Empirical result showing that without length normalization, RL training produces rapidly increasing tool usage with performance collapse and repetitive tool calls.

Claims (1)

Single agents can generalize better to unseen tasks because they are not constrained by predefined heuristic-based workflows.
Architectural belief motivating single-agent design choice; suggests flexibility provides better out-of-distribution performance.

Hypotheses (1)

QwQ and Qwen models have been extensively post-trained to excel at single-step tasks, causing degradation in long multi-turn interactions.
Proposed explanation for why single-turn reformulation improves performance: models' training distribution is concentrated on single-turn reasoning.

Questions (1)

How can reasoning-optimized models preserve their reasoning ability while gaining agentic capabilities?
Core research question motivating the paper's focus on continual RL training of reasoning models rather than base/instruction-tuned models.

Original abstract (expand)

Equipping large language models (LLMs) with complex, interleaved reasoning and tool-use capabilities has become a key focus in agentic AI research, especially with recent advances in reasoning-oriented (``thinking'') models. Such capabilities are key to unlocking a number of important applications. One such application is Deep Research (DR), which requires extensive search and reasoning over many sources. Our work in this paper focuses on the development of native Autonomous Single-Agent models for DR featuring minimal web crawling and Python tool integration. Unlike multi-agent systems, where agents take up pre-defined roles and are told what to do at each step in a static workflow, an autonomous single-agent determines its next action dynamically based on context, without manual directive. While prior work has proposed training recipes for base or instruction-tuned LLMs, we focus on continual reinforcement learning (RL) of reasoning-optimized models to further enhance agentic skills while preserving reasoning ability. Towards this end, we propose a simple RL recipe with entirely synthetic data, which we apply to various open-source LLMs. Our best variant SFR-DR-20B achieves up to 28.7% on Humanity's Last Exam benchmark. In addition, we conduct key analysis experiments to provide more insights into our methodologies.

Related work— refs + corpus + external arXiv

Cited / in-corpus / arXiv badges show which signals surfaced each row. Multi-source rows weighted higher.

Reinforcement Learning Foundations for Deep Research Systems: A Survey
Zhi Chen, Jingru Lin, Hannan Cao, Wei Han, Sheng Liang, Zhi Zhang, Kuicai Dong, Dexun Li, Chen Zhang, Yong Liu Wenjun Li
2025
≈ 88%
Beyond Stochastic Exploration: What Makes Training Data Valuable for Agentic Search
Wenfeng Feng, Guochao Jiang, Guofeng Quan, Guohua Liu, Yuewei Zhang Chuzhan Hao
2026
≈ 84%
Reinforcement Learning with Foundation Priors: Let the Embodied Agent Efficiently Learn on Its Own
Yunsheng Zhang, Haoyang Weng, Xianfan Gu, Shengjie Wang, Tong Zhang, Mengchen Wang, Pieter Abbeel, Yang Gao Weirui Ye
2026
≈ 84%
Skill-SD: Skill-Conditioned Self-Distillation for Multi-turn LLM Agents
Guozhi Wang, Han Xiao, Yufeng Zhou, Yue Pan, Jichao Wang, Ke Xu, Yafei Wen, Xiaohu Ruan, Xiaoxin Chen, Honggang Qi Hao Wang
2026
≈ 83%
Simplified Action Decoder for Deep Multi-Agent Reinforcement Learning
Jakob N Foerster Hengyuan Hu
2021
≈ 83%
SEARL: Joint Optimization of Policy and Tool Graph Memory for Self-Evolving Agents
Xinshun Feng and Xinhao Song and Lijun Li and Gongshen Liu and Jing Shao
2026
≈ 83%
Adaptive Robust Estimator for Multi-Agent Reinforcement Learning
Wan Tian, Jingyu Chen, Kangyao Huang, Huiming Zhang, Hui Yang, Tao Ren, Jinyang Jiang, Yijie Peng, Yikun Ban, Fuzhen Zhuang Zhongyi Li
2026
≈ 83%
Demystifying Reinforcement Learning for Long-Horizon Tool-Using Agents: A Comprehensive Recipe
Xixi Wu and Qianguo Sun and Ruiyang Zhang and Chao Song and Junlong Wu and Yiyan Qi and Hong Cheng
2026
≈ 83%
Accelerating Robotic Reinforcement Learning with Agent Guidance
Zili Zou, Chengdong Ma, Yaoxiang Pu, Haotong Zhang, Yuanpei Chen, Yaodong Yang Haojun Chen
2026
≈ 83%
Optimizing Life Sciences Agents in Real-Time using Reinforcement Learning
Nihir Chadderwala
2025
≈ 83%
ProCeedRL: Process Critic with Exploratory Demonstration Reinforcement Learning for LLM Agentic Reasoning
Yanjiang Guo, Xiaoshuai Chen, Jianyu Chen Jingyue Gao
2026
≈ 82%
Continual Reinforcement Learning by Planning with Online World Models
Guoji Fu, Chao Du, Wee Sun Lee, Min Lin Zichen Liu
2025
≈ 82%
Persistent Robot World Models: Stabilizing Multi-Step Rollouts via Reinforcement Learning
Patrik Drozdik, Josef Sivic, Vladimir Petrik Jai Bardhan
2026
≈ 82%
AD-R1: Closed-Loop Reinforcement Learning for End-to-End Autonomous Driving with Impartial World Models
Tao Tang, Xingtai Gui, Yongkang Li, Jiasen Zhesng, Weiyao Huang, Lingdong Kong, Wencheng Han, Xia Zhou, Xueyang Zhang, Yifei Zhan, Kun Zhan, Cheng-zhong Xu, Jianbing Shen Tianyi Yan
2026
≈ 82%
RefineRL: Advancing Competitive Programming with Self-Refinement Reinforcement Learning
Xingxing Zhang, Li Dong, Di Wang, Furu Wei Shaopeng Fu
2026
≈ 82%
Natural Language Autoencoders Produce Unsupervised Explanations of LLM Activations
in corpus
≈ 80%
Active inference: demystified and compared
in corpus
2021
≈ 80%
Simulators — LessWrong
in corpus
≈ 80%
Generalizing frameworks for sentience beyond natural species
in corpus
≈ 79%
Alignment faking in large language models
in corpus
2024
≈ 78%
Active inference on discrete state-spaces: a synthesis
in corpus
2020
≈ 78%
CausalGym: Benchmarking causal interpretability methods on linguistic tasks
in corpus
2024
≈ 78%
The Causally Emergent Alignment Hypothesis: Causal Emergence Aligns with and Predicts Final Reward in Reinforcement Learning Agents
in corpus
2026
≈ 78%
ReflCtrl: Controlling LLM Reflection via Representation Engineering
in corpus
2025
≈ 78%
Contemplative Agent
in corpus
2025
≈ 78%
Large Language Models Report Subjective Experience Under Self-Referential Processing
in corpus
2025
≈ 78%
Reasoning Theater: Disentangling Model Beliefs from Chain-of-Thought
in corpus
2026
≈ 77%
Addressing divergent representations from causal interventions on neural networks
in corpus
2025
≈ 77%
Taking AI Welfare Seriously
in corpus
2024
≈ 77%
DeepSeekMath: Pushing the limits of mathematical reasoning in open language models
cited
2024
≈ 68%

+29 more