DR-Venus-4B-RL
DR-Venus-4B-RL is the reinforcement-learned DR-Venus checkpoint built on top of inclusionAI/DR-Venus-4B-SFT. It is a 4B deep research agent designed for long-horizon web research with explicit tool use, evidence collection, and answer generation.
This model is trained entirely on open data. Starting from the SFT checkpoint, DR-Venus-4B-RL applies long-horizon agentic RL with IGPO-style information gain rewards and format-aware turn-level supervision to improve execution reliability under long tool-use trajectories.
What This Model Is For
This checkpoint is intended for:
- long-horizon deep research with tool-augmented reasoning
- improving execution reliability beyond supervised imitation
- evidence-grounded answering with search and visit
- deployment in the official DR-Venus inference pipeline
It is not primarily optimized for:
- plain chat without tools
- generic short-context instruction following
- use cases that do not need multi-step retrieval and browsing
Model Details
- Base model: Qwen/Qwen3-4B-Thinking-2507
- Initialization checkpoint: inclusionAI/DR-Venus-4B-SFT
- Training stage: agentic reinforcement learning
- Training framework: verl + IGPO algorithm
- Tool setting: search + visit
- Maximum rollout horizon: 200 interaction steps
- Maximum rollout context length: 256K
- Intended domain: long-horizon open-domain research and evidence-grounded question answering
How DR-Venus Builds RL Supervision
DR-Venus-4B-RL is trained with dense turn-level supervision tailored to deep research:
- The model starts from the DR-Venus supervised checkpoint.
- For each query, the agent interacts with the environment over multi-turn search and visit trajectories.
- IGPO uses information gain rewards to measure whether an intermediate turn increases the model's probability of producing the ground-truth answer.
- Information gain rewards are combined with outcome rewards and turn-level format-aware penalties.
- The policy is optimized using an IGPO objective with fine-grained credit assignment, specifically tailored for the long-horizon nature of deep research rollouts.
This design improves supervision density, credit assignment, and data efficiency compared with sparse trajectory-level RL alone.
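The reward composition described above can be sketched in a few lines. This is a minimal illustration, not the DR-Venus implementation: the helper names `turn_rewards` and `discounted_returns` are hypothetical, and the sketch assumes that information gain is the difference in ground-truth answer probability between consecutive turns, that a malformed turn incurs a flat penalty, and that the outcome reward is credited to the final turn. The default discount factor (0.95) and format penalty scale (1.0) are the values listed in the training recipe.

```python
from typing import List


def turn_rewards(
    answer_probs: List[float],
    format_ok: List[bool],
    outcome_reward: float,
    format_penalty_scale: float = 1.0,
) -> List[float]:
    """Per-turn rewards for one rollout (illustrative, hypothetical helper).

    answer_probs[t] is the policy's probability of the ground-truth answer
    after turn t; answer_probs[0] is the probability before any tool call,
    so a rollout with T turns supplies T + 1 values.
    """
    num_turns = len(answer_probs) - 1
    if len(format_ok) != num_turns:
        raise ValueError("format_ok needs exactly one entry per turn")
    rewards = []
    for t in range(num_turns):
        # Information gain: did this turn raise the answer probability?
        reward = answer_probs[t + 1] - answer_probs[t]
        # Format-aware penalty for a malformed turn (assumed to be a flat cost).
        if not format_ok[t]:
            reward -= format_penalty_scale
        rewards.append(reward)
    # Assumption: the trajectory-level outcome reward lands on the final turn.
    if rewards:
        rewards[-1] += outcome_reward
    return rewards


def discounted_returns(rewards: List[float], gamma: float = 0.95) -> List[float]:
    """Discounted return-to-go for each turn, computed back to front."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


if __name__ == "__main__":
    probs = [0.10, 0.30, 0.25, 0.60]  # made-up probabilities for three turns
    rewards = turn_rewards(probs, [True, False, True], outcome_reward=1.0)
    print(rewards)
    print(discounted_returns(rewards))
```

In this toy rollout the second turn both lowers the answer probability and breaks the format, so it receives a negative reward, while earlier turns still receive credit through the discounted return. That is the sense in which turn-level signals are denser than a single trajectory-level reward.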
Training Data
This model is trained from open-data supervision constructed from:
- the DR-Venus SFT checkpoint as initialization
- REDSearcher 1K RL query-answer pairs
- online rollouts with the DR-Venus search + visit tool environment
In the current paper setup:
- RL is performed entirely on open query-answer pairs
- rollout groups are sampled with long-horizon agent interaction
- generation is performed with up to 200 interaction steps per query
For more implementation details, please refer to the DR-Venus GitHub repository.
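The bounded multi-turn interaction described above can be pictured as a simple loop. The sketch below is only an illustration under stated assumptions: the `Action` schema, the `run_rollout` function, the scripted policy, and the stub tools are all hypothetical and do not reflect the actual DR-Venus environment or its tool interfaces. Only the 200-step cap and the two tool names come from this card.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

MAX_STEPS = 200  # maximum interaction steps per query, as stated in this card


@dataclass
class Action:
    """One agent turn: a tool call or a final answer (hypothetical schema)."""
    kind: str      # "search", "visit", or "answer"
    argument: str  # query string, URL, or answer text


def run_rollout(
    question: str,
    policy: Callable[[List[str]], Action],
    tools: Dict[str, Callable[[str], str]],
    max_steps: int = MAX_STEPS,
) -> List[str]:
    """Run one rollout and return the transcript as a list of lines."""
    transcript = [f"question: {question}"]
    for _ in range(max_steps):
        action = policy(transcript)
        transcript.append(f"{action.kind}: {action.argument}")
        if action.kind == "answer":
            break
        tool = tools.get(action.kind)
        if tool is None:
            # Unknown tool name: report it so the policy can recover.
            transcript.append(f"observation: unknown tool {action.kind!r}")
            continue
        transcript.append(f"observation: {tool(action.argument)}")
    return transcript


def scripted_policy(transcript: List[str]) -> Action:
    """Stand-in for the model: search once, visit once, then answer."""
    observations = sum(1 for line in transcript if line.startswith("observation:"))
    if observations == 0:
        return Action("search", "capital of France")
    if observations == 1:
        return Action("visit", "https://example.org/france")
    return Action("answer", "Paris")


# Stub tools that return canned strings instead of touching the network.
STUB_TOOLS: Dict[str, Callable[[str], str]] = {
    "search": lambda query: f"results for {query}",
    "visit": lambda url: f"contents of {url}",
}

if __name__ == "__main__":
    for line in run_rollout("What is the capital of France?", scripted_policy, STUB_TOOLS):
        print(line)
```

If the policy never emits an answer, the loop stops after `max_steps` turns and the transcript simply ends without one, which is the case a step cap is meant to bound.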
Training Recipe
The RL checkpoint is trained with the following setup reported in the current paper draft:
- algorithm: IGPO-style agentic RL
- rollout group size: 8
- training batch size: 16
- learning rate: 1e-6
- rollout temperature: 1.0
- rollout top-p: 0.95
- maximum context length: 256K
- maximum generation length per turn: 8,192
- discount factor: 0.95
- format penalty scale: 1.0
- training framework: verl with vLLM rollout engine and FSDP trainer
The current paper configuration also enables browse-aware IG assignment and IG-scale style reward balancing.
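For quick reference, the recipe above can be collected into a single typed record. This is a sketch only: the `RLRecipe` class and its field names are hypothetical and are not verl configuration keys, and the token count assumes that "256K" means 256 × 1024 tokens. The 200-step horizon is taken from the Model Details section.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RLRecipe:
    """Hyperparameters listed in this card (field names are hypothetical)."""
    rollout_group_size: int = 8
    train_batch_size: int = 16
    learning_rate: float = 1e-6
    rollout_temperature: float = 1.0
    rollout_top_p: float = 0.95
    max_context_tokens: int = 256 * 1024  # assumes "256K" means 256 * 1024
    max_tokens_per_turn: int = 8192
    max_interaction_steps: int = 200
    discount_factor: float = 0.95
    format_penalty_scale: float = 1.0

    def validate(self) -> None:
        """Basic range checks; raises ValueError on an inconsistent recipe."""
        if not 0.0 < self.rollout_top_p <= 1.0:
            raise ValueError("rollout_top_p must be in (0, 1]")
        if not 0.0 <= self.discount_factor <= 1.0:
            raise ValueError("discount_factor must be in [0, 1]")
        if self.max_tokens_per_turn > self.max_context_tokens:
            raise ValueError("per-turn generation cannot exceed the context length")
        if min(self.rollout_group_size, self.train_batch_size,
               self.max_interaction_steps) < 1:
            raise ValueError("sizes and step counts must be positive")


if __name__ == "__main__":
    recipe = RLRecipe()
    recipe.validate()
    print(recipe)
```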
Evaluation Summary
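DR-Venus-4B-RL improves over the SFT checkpoint on five of the six tracked deep research benchmarks; the exception is GAIA (Text-Only), where it drops from 65.4 to 64.4. Among the open models under 9B listed below, it posts the best score on every benchmark except GAIA (Text-Only), where DR-Venus-4B-SFT leads.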
Results Against Open Models Under 9B
| Model | BrowseComp | BrowseComp-ZH | GAIA (Text-Only) | xBench-DS-2505 | xBench-DS-2510 | DeepSearchQA |
|---|---|---|---|---|---|---|
| DeepDive-9B-SFT | 5.6 | 15.7 | -- | 35.0 | -- | -- |
| DeepDive-9B-RL | 6.3 | 15.1 | -- | 38.0 | -- | -- |
| WebSailor-7B | 6.7 | 14.2 | 37.9 | 34.3 | -- | -- |
| OffSeeker-8B-SFT | 10.6 | 24.2 | 47.6 | 48.0 | -- | -- |
| OffSeeker-8B-DPO | 12.8 | 26.6 | 51.5 | 49.0 | -- | -- |
| WebExplorer-8B-RL | 15.7 | 32.0 | 50.0 | 53.7 | 23.0 | 17.8 |
| AgentCPM-Explore-4B | 24.1 | 29.1 | 63.9 | 70.0 | 34.0 | 32.8 |
| DR-Venus-4B-SFT | 26.8 | 35.7 | 65.4 | 69.0 | 35.3 | 37.7 |
| DR-Venus-4B-RL | 29.1 | 37.7 | 64.4 | 74.7 | 40.7 | 39.6 |
Relative to the SFT checkpoint, DR-Venus-4B-RL improves:
- BrowseComp by +2.3
- BrowseComp-ZH by +2.0
- xBench-DS-2505 by +5.7
- xBench-DS-2510 by +5.4
- DeepSearchQA by +1.9
These gains are associated with better formatting accuracy, more reliable tool use, and stronger long-horizon execution stability.
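The per-benchmark differences can be recomputed directly from the two DR-Venus rows of the table. The short script below does only that arithmetic; the variable and function names are illustrative.

```python
BENCHMARKS = [
    "BrowseComp",
    "BrowseComp-ZH",
    "GAIA (Text-Only)",
    "xBench-DS-2505",
    "xBench-DS-2510",
    "DeepSearchQA",
]
# Scores copied from the DR-Venus-4B-SFT and DR-Venus-4B-RL table rows.
SFT_SCORES = [26.8, 35.7, 65.4, 69.0, 35.3, 37.7]
RL_SCORES = [29.1, 37.7, 64.4, 74.7, 40.7, 39.6]


def score_deltas(before, after):
    """Return {benchmark: after - before}, rounded to one decimal place."""
    return {
        name: round(a - b, 1)
        for name, b, a in zip(BENCHMARKS, before, after)
    }


if __name__ == "__main__":
    for name, delta in score_deltas(SFT_SCORES, RL_SCORES).items():
        print(f"{name}: {delta:+.1f}")
```

Running it reproduces the five gains listed above and also shows the one regression, GAIA (Text-Only) at -1.0.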
Usage
This checkpoint should be used with the official DR-Venus inference pipeline.
git clone https://github.com/inclusionAI/DR-Venus
cd DR-Venus/Inference
pip install -r requirements.txt
# then configure the model path in run_demo.sh or run_web_demo.sh
bash run_demo.sh
For reproducing RL training or understanding the rollout setup, see the RL directory in the official repository.
License and Release Notes
Please verify license compatibility with:
- the upstream base model
- the released supervision data
- the external tools and judge models used in training or evaluation
This section can be updated later with the final project-specific license statement.
Citation
If you use this checkpoint, please cite the DR-Venus project.
@misc{dr_venus_2026,
title = {DR-Venus: Towards Frontier Edge-Scale Deep Research Agents with Only 10K Open Data},
author = {{Venus Team} and Sunhao Dai and Yong Deng and Jinzhen Lin and Yucheng Song and Guoqing Wang and Xiaofeng Wu and Yuqi Zhou and Shuo Yang and Zhenzhe Ying and Zhanwei Zhang and Changhua Meng and Weiqiang Wang},
year = {2026},
}
Links
- GitHub: https://github.com/inclusionAI/DR-Venus
- RL code: https://github.com/inclusionAI/DR-Venus/tree/main/RL
- Inference code: https://github.com/inclusionAI/DR-Venus/tree/main/Inference
- SFT model: https://huggingface.co/inclusionAI/DR-Venus-4B-SFT
- RL model: https://huggingface.co/inclusionAI/DR-Venus-4B-RL
- Collection: https://huggingface.co/collections/inclusionAI/DR-Venus