paper
active
2025
paper:doi-10-48550-arxiv-2509-06283

SFR-DeepResearch: Towards Effective Reinforcement Learning for Autonomously Reasoning Single Agents

TL;DR

Continual reinforcement learning applied directly to reasoning-optimized base models—rather than starting from instruction-tuned checkpoints—yields a 20-parameter-billion autonomous single-agent, SFR-DR-20B, that achieves 28.7% on the full text-only Humanity's Last Exam (HLE) benchmark, a 65% relative improvement over the gpt-oss-20b base model's 17.3%, and outperforms OpenAI Deep Research with o3 (26.6%) without relying on multi-agent scaffolding. The framework introduced, SFR-DeepResearch (SFR-DR), combines a per-model-family agentic inference scaffolding—which reframes multi-turn tool-calling as iterative single-turn contextual QA for QwQ-32B and Qwen3 models—with a REINFORCE-based RL algorithm featuring temporal advantage normalization (dividing step-level advantages by trajectory length Ti) and strategic trajectory filtering over entirely synthetic training data. Without length normalization, agents trained on Qwen3-8B degenerate into repetitive tool-calling loops despite negative rewards, because long failing trajectories dominate batch loss; normalization stabilizes training and produces moderate, effective tool-use growth. SFR-DR-32B (from QwQ-32B) scores 72.0 on FRAMES and 52.4 on GAIA, while SFR-DR-8B (from Qwen3-8B) reaches 63.3 on FRAMES with 13.2 on HLE—competitive with or exceeding open-source baselines two-to-four times larger. The paper argues that initializing RL from reasoning-optimized models rather than base or SFT models allows agentic capabilities to be grafted onto strong chain-of-thought reasoning, and that single-agent architectures trained this way can generalize to unseen tasks better than workflow-constrained multi-agent systems while serving as drop-in sub-agents when multi-agent orchestration is desired.

What to take away

  1. 1. SFR-DR-20B, trained via continual RL from gpt-oss-20b, achieves 28.7% on the full text-only Humanity's Last Exam benchmark, compared to the gpt-oss-20b baseline's 17.3% and OpenAI Deep Research with o3's 26.6%.
  2. 2. Reformulating multi-turn tool-calling as a single-turn contextual QA prompt for QwQ-32B yields a 10 percentage-point absolute gain on FRAMES (58.0 → 68.0 pre-RL) versus the model's default multi-turn chat template, with no additional training.
  3. 3. Temporal advantage normalization—dividing step-level advantages by trajectory length Ti in the REINFORCE objective—prevents degenerate repetitive tool-calling: without it, SFR-DR-8B training reward and HLE validation performance both collapse despite long trajectories receiving negative rewards.
  4. 4. SFR-DR-32B (from QwQ-32B) scores 72.0 on FRAMES and 52.4 on GAIA, outperforming open-source single-agent baselines WebSailor-32B (69.78 / 44.0) and WebShaper-32B (69.42 / 48.5) evaluated under the same contamination blocklist.
  5. 5. The gpt-oss-20b-based SFR-DR-20B makes up to 10 times more tool calls per HLE question than QwQ-32B and Qwen3-8B variants, which tend toward internal reasoning; RL further increases this gap, suggesting base-model agentic priors strongly shape post-RL tool-use behavior.
  6. 6. SFR-DR-20B generates fewer than 2,000 tokens per agentic step on HLE, 4–5 times fewer than the 8B and 32B Qwen-family counterparts, and RL training actually shrinks its per-step response length while expanding Qwen-family response lengths.
  7. 7. The paper raises the open question of whether Qwen-family models (QwQ-32B, Qwen3-8B) have been post-trained so heavily on single-turn reasoning tasks that their chain-of-thought quality degrades irreversibly in long multi-turn agentic settings, limiting the ceiling of RL-based agentic fine-tuning for these architectures.
  8. 8. To replicate the training data pipeline: iteratively construct multi-hop QA pairs that are hard enough that OpenAI Deep Research with o3 scores below 65% and the best open-source baseline scores below 40%, then use an LLM-generated rubric with factuality, compliance, writing quality, and citation quality sub-scores for long-form report tasks, all without human annotation.
  9. 9. A contamination blocklist blocking domains such as huggingface.co is applied during both training rollouts and evaluation; without it, up to 3.4% of usable HLE samples may be trivially answered from leaked solutions, and baseline numbers from systems lacking such precautions are re-run under the blocklist, which alters reported scores.
  10. 10. Partial rollouts are treated as independent initial states from which new group-level Monte Carlo rollouts begin under the current policy, rather than continuing unfinished trajectories with an updated policy as in Kimi-Researcher, providing more gradient signal from long-tail intermediate states.

Peer brief — for seminar discussion

SFR-DeepResearch trains autonomous single-agent LLMs for Deep Research by applying continual reinforcement learning directly to reasoning-optimized "thinking" models rather than starting from base or instruction-tuned checkpoints—a departure from most prior work. Three open-source backbones are used: QwQ-32B, Qwen3-8B, and gpt-oss-20b, yielding SFR-DR-32B, SFR-DR-8B, and SFR-DR-20B respectively. Each agent is equipped with three minimal tools: a search API returning top-10 organic results, a static web scraper (hyperlinks stripped, forcing rediscovery through search), and a stateless local Python interpreter. A self-managed memory tool allows agents to compress their own context window when it exceeds a configurable limit, enabling arbitrarily long trajectories without external memory banks. The load-bearing finding is that SFR-DR-20B reaches 28.7% on the full text-only Humanity's Last Exam (HLE), versus 17.3% for its gpt-oss-20b base and 26.6% for OpenAI Deep Research with o3, while SFR-DR-32B scores 72.0 on FRAMES and 52.4 on GAIA—beating WebSailor-32B (69.78 / 44.0) and WebShaper-32B (69.42 / 48.5) under identical contamination blocklist conditions. A key mechanism is the method introduced here: temporal advantage normalization, which divides step-level advantages by trajectory length Ti in the REINFORCE objective. Without it, agents degenerate into repetitive identical tool calls because long trajectories—even failing, penalized ones—contribute disproportionately many action steps per batch and are therefore reinforced; normalization suppresses this pathology and stabilizes training. An alternative to this length-normalization approach would have been GRPO with explicit length penalties, as used by prior work, but that is shown insufficient to prevent collapse in long-horizon agentic settings. An additional inference-time contribution—recasting multi-turn tool calling as single-turn contextual QA for Qwen-family models—yields a 10 percentage-point absolute FRAMES gain for QwQ-32B with zero training, an observation attributed to those models having been over-optimized for single-turn reasoning. The work implies that the right initialization point for agentic RL is a reasoning-optimized model rather than a vanilla SFT model, and that single agents trained this way can generalize more broadly than multi-agent systems constrained by fixed workflows, while remaining composable as sub-agents in larger systems. A prediction embedded in the analysis is that Qwen-family models face a structural ceiling in long-horizon agentic settings because their chain-of-thought quality degrades in multi-turn contexts—a hypothesis with direct implications for which base models are worth fine-tuning for future DR systems. The most contestable element is the evaluation methodology: contamination blocklisting is applied to re-run baselines, but the blocklist domain coverage is not fully disclosed, and the decision to rerun only some baselines while accepting OpenAI's self-reported numbers at face value creates an asymmetric comparison. A critical reader would also push back on the claim that single-agent architectures generalize better than multi-agent ones—this is asserted on principled grounds but never tested against a controlled multi-agent ablation on the same base models. Finally, SFR-DR-20B's outsized gains over SFR-DR-32B despite fewer parameters conflate two variables (base model quality and parameter count), making it hard to isolate the contribution of the RL recipe itself from the choice of gpt-oss-20b as a particularly well-suited agentic prior.

Findings (3)

Claims (1)

Hypotheses (1)

Questions (1)

Original abstract (expand)

Equipping large language models (LLMs) with complex, interleaved reasoning and tool-use capabilities has become a key focus in agentic AI research, especially with recent advances in reasoning-oriented (``thinking'') models. Such capabilities are key to unlocking a number of important applications. One such application is Deep Research (DR), which requires extensive search and reasoning over many sources. Our work in this paper focuses on the development of native Autonomous Single-Agent models for DR featuring minimal web crawling and Python tool integration. Unlike multi-agent systems, where agents take up pre-defined roles and are told what to do at each step in a static workflow, an autonomous single-agent determines its next action dynamically based on context, without manual directive. While prior work has proposed training recipes for base or instruction-tuned LLMs, we focus on continual reinforcement learning (RL) of reasoning-optimized models to further enhance agentic skills while preserving reasoning ability. Towards this end, we propose a simple RL recipe with entirely synthetic data, which we apply to various open-source LLMs. Our best variant SFR-DR-20B achieves up to 28.7% on Humanity's Last Exam benchmark. In addition, we conduct key analysis experiments to provide more insights into our methodologies.

Related work— refs + corpus + external arXiv

Cited / in-corpus / arXiv badges show which signals surfaced each row. Multi-source rows weighted higher.

+29 more

Similar preprints — Semantic Scholar