StepFun has introduced Step-DeepResearch, a 32B-parameter, end-to-end deep research agent that aims to turn web search into actual research workflows with long-horizon reasoning, tool use, and structured reporting. The model is built on Qwen2.5-32B-Base and is trained to act as a single agent that plans, explores sources, verifies evidence, and writes reports with citations, while keeping inference cost low.
From Search to Deep Research
Most existing web agents are tuned for multi-hop question-answering benchmarks: they try to match ground-truth answers for short questions, which is closer to targeted retrieval than to real research. Deep research tasks are different. They involve latent intent recognition, long-horizon decision making, multi-turn tool use, structured reasoning, and cross-source verification under uncertainty.
Step-DeepResearch reframes this as sequential decision making over a compact set of atomic capabilities. The research team defines four atomic capabilities: planning and task decomposition, deep information seeking, reflection and verification, and professional report generation. Instead of orchestrating many external agents, the system internalizes this loop into a single model that decides the next action at each step.
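The sketch below shows this framing as a plain decision loop: the four capabilities form a small action space, and a hypothetical `policy` call, standing in for the model, picks the next one at every step. This is an illustration of the single-agent formulation, not the paper's implementation.

```python
from enum import Enum

class AtomicCapability(Enum):
    """The four atomic capabilities, internalized in one model."""
    PLAN = "planning_and_task_decomposition"
    SEEK = "deep_information_seeking"
    VERIFY = "reflection_and_verification"
    REPORT = "professional_report_generation"

def research_loop(task: str, policy, max_steps: int = 50):
    """Sequential decision making: at each step the single agent picks which
    atomic capability to exercise next, instead of routing among sub-agents.
    `policy(state) -> (AtomicCapability, output)` is a hypothetical model call."""
    state = {"task": task, "history": []}
    for _ in range(max_steps):
        action, output = policy(state)
        state["history"].append((action.value, output))
        if action is AtomicCapability.REPORT:
            return output          # the final cited report ends the episode
    return None                    # step budget exhausted without a report
```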
Data Synthesis around Atomic Capabilities
To teach these atomic capabilities, the research team builds separate data pipelines for each skill. For planning, they start from high-quality technical reports, survey papers, and financial analysis documents. They reverse-engineer realistic research plans and task trees from titles, abstracts, and structure, then generate trajectories that follow these plans. This exposes the model to long-horizon project structures, not only short question templates.
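As a rough illustration of the reverse-engineering idea, the toy function below recovers a plan-like task tree from a document's heading structure. The markdown-heading assumption and the output schema are illustrative, not the paper's actual pipeline.

```python
import re

def reverse_engineer_plan(report_text: str) -> dict:
    """Toy version of plan reverse-engineering: recover a task tree from a
    finished document's structure (here, markdown-style headings)."""
    plan = {"goal": None, "subtasks": []}
    for line in report_text.splitlines():
        if m := re.match(r"^(#+)\s+(.*)", line):
            level, title = len(m.group(1)), m.group(2).strip()
            if level == 1 and plan["goal"] is None:
                plan["goal"] = title                 # document title -> research goal
            elif level == 2:
                plan["subtasks"].append({"task": title, "steps": []})
            elif level >= 3 and plan["subtasks"]:
                plan["subtasks"][-1]["steps"].append(title)
    return plan
```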
For deep information seeking, they construct graph-based queries over knowledge graphs such as Wikidata5m and CN-DBpedia. They sample subgraphs, expand them using search, and synthesize questions that require multi-hop reasoning across entities and documents. A separate pipeline uses a wiki-style hyperlink index to force cross-document retrieval and combination of evidence. Easy questions that a strong model can already solve with a simple ReAct-style strategy are filtered out, so training focuses on hard search problems.
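The filtering step can be pictured as a simple rejection loop. In this sketch, `baseline_solve` is a hypothetical call that runs a strong ReAct-style baseline once and reports success; the trial count and solve-rate cutoff are illustrative choices, not the paper's settings.

```python
def filter_hard_questions(candidates, baseline_solve, n_trials: int = 4,
                          max_solve_rate: float = 0.25):
    """Keep only questions a strong ReAct-style baseline mostly fails, so
    training focuses on genuinely hard multi-hop search problems.
    `baseline_solve(question) -> bool` is a hypothetical judged attempt."""
    hard = []
    for q in candidates:
        solved = sum(baseline_solve(q) for _ in range(n_trials))
        if solved / n_trials <= max_solve_rate:
            hard.append(q)       # the baseline fails often enough to keep it
    return hard
```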
Reflection and verification data is generated through self-correction loops and multi-agent teacher traces. Teacher agents extract claims, plan checks, verify facts, replan if inconsistencies appear, and only then write reports. The resulting trajectories are cleaned and used as supervision for a single student agent. Report generation is trained in two phases: mid-training for domain style and depth using query-report pairs, then supervised fine-tuning with strict formatting and plan-consistency constraints.
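A minimal sketch of the teacher-side self-correction loop might look like the following, where `extract_claims`, `check_claim`, `replan`, and `write_report` are hypothetical model calls rather than the paper's actual interfaces.

```python
def verified_report(task, extract_claims, check_claim, replan, write_report,
                    max_rounds: int = 3):
    """Hypothetical teacher-style loop: draft claims, verify each one,
    replan on inconsistencies, and only then write the final report."""
    claims = extract_claims(task)
    for _ in range(max_rounds):
        failures = [c for c in claims if not check_claim(c)]
        if not failures:
            return write_report(claims)   # every claim survived verification
        claims = replan(task, failures)   # revise the plan around failed claims
    # fall back to reporting only the claims that verify
    return write_report([c for c in claims if check_claim(c)])
```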
Progressive Training on Qwen2.5-32B-Base
The training pipeline has three stages: agentic mid-training, supervised fine-tuning, and reinforcement learning. In mid-training stage 1, the team injects atomic capabilities without tools, using context lengths up to 32k tokens. The data covers active reading, synthetic reasoning traces, summarization, and reflection. The research team shows steady gains on SimpleQA, TriviaQA, and FRAMES as training scales up to about 150B tokens, with the largest gains on FRAMES, which stresses structured reasoning.
In stage 2, the context extends to 128k tokens and explicit tool calls are introduced. The model learns tasks such as URL-based question answering, deep web search, long-document summarization, and long-dialogue reasoning. This stage aligns the model with real research scenarios where search, browsing, and analysis must be mixed in one trajectory.
During supervised fine-tuning, the four atomic capabilities are composed into full deep search and deep research traces. Data cleaning keeps trajectories that are correct and short in terms of steps and tool calls. The pipeline injects controlled tool errors followed by corrections to improve robustness, and enforces citation formats so that reports stay grounded in the retrieved sources.
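A toy version of this cleaning-plus-augmentation step is sketched below. The step budget, error rate, and the hypothetical `error_with_fix` helper, which returns a failed call, the error observation, and the corrected call, are illustrative assumptions.

```python
import random

def clean_and_augment(trajectories, error_with_fix, max_steps: int = 30,
                      error_rate: float = 0.1, seed: int = 0):
    """Keep correct, step-efficient trajectories, and occasionally inject a
    tool error followed by its correction so the model learns recovery."""
    rng = random.Random(seed)
    kept = []
    for traj in trajectories:
        if not traj["correct"] or len(traj["steps"]) > max_steps:
            continue                                   # drop wrong or bloated traces
        steps = list(traj["steps"])
        if rng.random() < error_rate and steps:
            i = rng.randrange(len(steps))
            steps[i:i + 1] = error_with_fix(steps[i])  # failure + recovery in place
        kept.append({**traj, "steps": steps})
    return kept
```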
Reinforcement learning then optimizes the agent in a real tool environment. The research team builds tasks and checklists through reverse synthesis, and trains a checklist-style Rubrics Judge to score reports along fine-grained dimensions. The reward design converts ternary rubric labels into asymmetric binary rewards that capture both positive targets and violations. The policy is trained with PPO and a learned critic, using generalized advantage estimation with near-zero discounting so that long trajectories are not truncated.
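The ternary-to-asymmetric mapping can be illustrated with a small reward function. The label names and reward magnitudes below are assumptions for the sketch, not the paper's values; the point is that met targets and explicit violations form two separately weighted binary channels.

```python
def rubric_reward(labels, bonus: float = 1.0, penalty: float = -2.0):
    """Toy asymmetric reward: each rubric item carries a ternary label
    ('met', 'unmet', 'violated'). Meeting a target earns a small positive
    reward, while an explicit violation costs more than a hit gains."""
    reward = 0.0
    for label in labels:
        if label == "met":
            reward += bonus
        elif label == "violated":
            reward += penalty    # asymmetric: violations hurt more than hits help
        # 'unmet' contributes nothing in this sketch
    return reward / max(len(labels), 1)
```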
Single Agent ReAct Architecture and Search Stack
At inference time, Step-DeepResearch runs as a single ReAct-style agent that alternates thinking, tool calls, and observations until it decides to output a report. The tool set includes batch web search, a todo manager, shell commands, and file operations. Execution runs in a sandbox with terminal persistence through tmux. A perception-oriented browser reduces redundant page captures by using perceptual hash distance. Tools for document parsing, audio transcription, and image analysis support multimodal inputs.
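The perceptual-hash deduplication can be sketched with the common `imagehash` library; using it here is an assumption, since the article does not name an implementation. Captures whose hashes fall within a small Hamming distance of an earlier one are treated as duplicates and skipped.

```python
from PIL import Image
import imagehash  # pip install ImageHash

class PageCaptureCache:
    """Skip redundant screenshots: if a new capture's perceptual hash is
    within `threshold` Hamming distance of a stored one, call it a duplicate.
    The threshold is an illustrative choice, not the paper's value."""
    def __init__(self, threshold: int = 5):
        self.threshold = threshold
        self.seen = []

    def is_duplicate(self, image_path: str) -> bool:
        h = imagehash.phash(Image.open(image_path))
        if any(h - prev <= self.threshold for prev in self.seen):
            return True            # visually near-identical page, skip it
        self.seen.append(h)
        return False
```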
Information acquisition uses two related resources. The StepFun team states that its Search API is grounded in more than 20M high-quality papers and 600 premium indices. The research team then describes a curated authority-indexing strategy that isolates more than 600 trusted domains, including government, academic, and institutional sites. Retrieval operates at the paragraph level and uses authority-aware ranking, so that high-trust domains are preferred when relevance is similar.
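Authority-aware ranking can be approximated as a relevance score with a small trust bonus, so high-trust domains win near-ties without overriding clearly more relevant results. The `alpha` weight and input schema below are illustrative assumptions.

```python
def rank_paragraphs(paragraphs, trusted_domains, alpha: float = 0.1):
    """Authority-aware ranking sketch: each paragraph is a dict with 'text',
    'domain', and 'relevance'. A small bonus for trusted domains breaks ties
    in their favor; `alpha` is an illustrative weight, not the paper's."""
    def score(p):
        authority = 1.0 if p["domain"] in trusted_domains else 0.0
        return p["relevance"] + alpha * authority
    return sorted(paragraphs, key=score, reverse=True)
```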
The file tools support patch-based editing, so the agent can update only the modified sections of a report. A summary-aware storage scheme writes full tool outputs to local files and injects only compact summaries into the context. This acts as external memory and avoids context overflow for long projects.
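A minimal sketch of summary-aware storage, assuming a hypothetical `summarize` call: the full output is persisted to disk as external memory, and only a compact pointer-plus-summary string goes back into the agent's context.

```python
import hashlib
from pathlib import Path

def store_tool_output(output: str, summarize, workdir: str = "tool_outputs",
                      preview_chars: int = 400) -> str:
    """Persist the full tool output to disk and return a short context entry.
    `summarize` is a hypothetical summarizer; the truncation limits are
    illustrative, not the paper's settings."""
    Path(workdir).mkdir(exist_ok=True)
    name = hashlib.sha256(output.encode()).hexdigest()[:12]
    path = Path(workdir) / f"{name}.txt"
    path.write_text(output)                  # full content stays on disk
    summary = summarize(output[:8000])       # compact gist for the context
    return f"[saved to {path}] {summary[:preview_chars]}"
```

The agent can later re-open the saved file with its file tools if a summary turns out to be insufficient, which is what makes the scheme behave like external memory rather than lossy compression.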
Evaluation, Cost and Access
To measure deep research behavior, the team introduces ADR-Bench, a Chinese benchmark with 110 open-ended tasks across 9 domains. 70 tasks cover general domains such as education, science and engineering, and social life, evaluated by expert side-by-side comparison. 40 tasks in finance and law are scored with explicit rubrics that follow atomicity and verifiability constraints.
On Scale AI Research Rubrics, Step-DeepResearch reaches 61.42 percent rubric compliance, which is comparable to OpenAI-DeepResearch and Gemini-DeepResearch, and clearly ahead of multiple open and proprietary baselines. On ADR-Bench, expert-based Elo ratings show that the 32B model outperforms larger open models such as MiniMax-M2, GLM-4.6, and DeepSeek-V3.2, and is competitive with systems like Kimi-Researcher and MiniMax-Agent-Pro.
Key Takeaways
- Single agent, atomic capability design: Step-DeepResearch is a 32B-parameter single agent built on Qwen2.5-32B-Base. It internalizes four atomic capabilities: planning, deep information seeking, reflection and verification, and professional report generation, instead of relying on many external agents.
- Targeted data synthesis for each skill: The research team builds separate data pipelines for planning, deep information seeking, reflection, and report writing, using reverse-engineered plans from real reports, graph-based queries over Wikidata5m and CN-DBpedia, multi-agent teacher traces, and strict report-formatting data.
- Three-stage training with long context and RL: Training uses mid-training, supervised fine-tuning, and reinforcement learning. Mid-training scales up to 150B tokens at 32k and then 128k context, SFT composes full deep research trajectories, and PPO-based RL with a Rubrics Judge optimizes reports against fine-grained checklists.
- ReAct architecture with curated search and external memory: At inference time the model runs a ReAct loop that calls tools for batch web search, todo, shell, and file operations; uses a Search API grounded in more than 20M papers and 600 premium indices along with 600+ trusted domains; and relies on patch editing and summary-aware storage as external memory.
- Competitive quality with lower cost: On Scale AI Research Rubrics the model reaches 61.42 percent rubric compliance and is competitive with OpenAI-DeepResearch and Gemini-DeepResearch; on ADR-Bench it achieves a 67.1 percent win-or-tie rate against strong baselines.
Check out the Paper and Repo for further details.

