
DR-Venus-4B-RL

DR-Venus-4B-RL is the reinforcement-learned DR-Venus checkpoint built on top of inclusionAI/DR-Venus-4B-SFT. It is a 4B-parameter deep research agent designed for long-horizon web research with explicit tool use, evidence collection, and answer generation.

This model is trained entirely on open data. Starting from the SFT checkpoint, DR-Venus-4B-RL applies long-horizon agentic RL with IGPO-style information gain rewards and format-aware turn-level supervision to improve execution reliability under long tool-use trajectories.

What This Model Is For

This checkpoint is intended for:

  • long-horizon deep research with tool-augmented reasoning
  • improving execution reliability beyond supervised imitation
  • evidence-grounded answering with search and visit
  • deployment in the official DR-Venus inference pipeline

It is not primarily optimized for:

  • plain chat without tools
  • generic short-context instruction following
  • use cases that do not need multi-step retrieval and browsing

Model Details

  • Base model: Qwen/Qwen3-4B-Thinking-2507
  • Initialization checkpoint: inclusionAI/DR-Venus-4B-SFT
  • Training stage: agentic reinforcement learning
  • Training framework: verl + IGPO algorithm
  • Tool setting: search + visit
  • Maximum rollout horizon: 200 interaction steps
  • Maximum rollout context length: 256K
  • Intended domain: long-horizon open-domain research and evidence-grounded question answering

How DR-Venus Builds RL Supervision

DR-Venus-4B-RL is trained with dense turn-level supervision tailored to deep research:

  1. The model starts from the DR-Venus supervised checkpoint.
  2. For each query, the agent interacts with the environment over multi-turn search and visit trajectories.
  3. IGPO uses information gain rewards to measure whether an intermediate turn increases the model's probability of producing the ground-truth answer.
  4. Information gain rewards are combined with outcome rewards and turn-level format-aware penalties.
  5. The policy is optimized using an IGPO objective with fine-grained credit assignment, specifically tailored for the long-horizon nature of deep research rollouts.

This design improves supervision density, credit assignment, and data efficiency compared with sparse trajectory-level RL alone.
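
The turn-level reward described above can be sketched as follows. This is a minimal illustration, not the released IGPO implementation: `log_prob_of_answer` is a hypothetical helper that scores the ground-truth answer under the policy given a trajectory prefix, and the weighting of the three reward terms is an assumption based on the recipe below.

```python
def information_gain_reward(log_prob_of_answer, context_before, context_after, answer):
    """IGPO-style information gain (sketch): reward a turn by how much it
    raises the model's log-probability of the ground-truth answer."""
    lp_before = log_prob_of_answer(context_before, answer)
    lp_after = log_prob_of_answer(context_after, answer)
    return lp_after - lp_before  # positive when the turn adds useful evidence


def turn_reward(info_gain, outcome, format_ok, ig_scale=1.0, penalty_scale=1.0):
    """Combine information gain with the outcome reward and a format-aware
    penalty (assumed additive combination, not the paper's exact formula)."""
    penalty = 0.0 if format_ok else penalty_scale
    return ig_scale * info_gain + outcome - penalty
```

In this sketch, a turn that retrieves evidence making the correct answer more likely earns a positive information gain even before the final answer is produced, which is what densifies supervision relative to trajectory-level rewards alone.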

Training Data

The RL supervision is constructed entirely from open data. In the current paper setup:

  • RL is performed entirely on open query-answer pairs
  • rollout groups are sampled with long-horizon agent interaction
  • generation is performed with up to 200 interaction steps per query
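
The long-horizon rollout above can be sketched as a simple agent loop. This is an illustrative sketch, not the released rollout code: `policy_step`, `run_search`, and `run_visit` are hypothetical stand-ins for the model call and the two tools.

```python
MAX_STEPS = 200  # maximum interaction steps per query, per the setup above

def rollout(query, policy_step, run_search, run_visit, max_steps=MAX_STEPS):
    """Run one trajectory: the policy alternates between search, visit,
    and a final answer until it answers or the step budget is exhausted."""
    trajectory = [("query", query)]
    for _ in range(max_steps):
        action, arg = policy_step(trajectory)  # e.g. ("search", "...") or ("answer", "...")
        if action == "answer":
            trajectory.append(("answer", arg))
            return trajectory
        observation = run_search(arg) if action == "search" else run_visit(arg)
        trajectory.append((action, arg))
        trajectory.append(("observation", observation))
    trajectory.append(("answer", None))  # budget exhausted without an answer
    return trajectory
```

Rollout groups for IGPO are then formed by sampling several such trajectories per query at the rollout temperature and top-p given in the recipe below.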

For more implementation details, please refer to the DR-Venus GitHub repository.

Training Recipe

The RL checkpoint is trained with the following setup reported in the current paper draft:

  • algorithm: IGPO-style agentic RL
  • rollout group size: 8
  • training batch size: 16
  • learning rate: 1e-6
  • rollout temperature: 1.0
  • rollout top-p: 0.95
  • maximum context length: 256K
  • maximum generation length per turn: 8,192
  • discount factor: 0.95
  • format penalty scale: 1.0
  • training framework: verl with vLLM rollout engine and FSDP trainer

The current paper configuration also enables browse-aware IG assignment and IG-scale style reward balancing.
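
For quick reference, the recipe above can be collected into a single config dict. The key names here are illustrative summaries, not the actual verl/IGPO configuration schema.

```python
# Illustrative summary of the RL recipe above; key names are not the
# real verl configuration schema, just a readable restatement.
RL_CONFIG = {
    "algorithm": "IGPO",
    "rollout_group_size": 8,
    "train_batch_size": 16,
    "learning_rate": 1e-6,
    "rollout_temperature": 1.0,
    "rollout_top_p": 0.95,
    "max_context_length": 256 * 1024,
    "max_generation_length_per_turn": 8192,
    "discount_factor": 0.95,
    "format_penalty_scale": 1.0,
    "rollout_engine": "vllm",
    "trainer": "fsdp",
}
```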

Evaluation Summary

DR-Venus-4B-RL improves over the SFT checkpoint on most tracked deep research benchmarks and pushes the frontier for open models under 9B parameters.

Results Against Open Models Under 9B

| Model | BrowseComp | BrowseComp-ZH | GAIA (Text-Only) | xBench-DS-2505 | xBench-DS-2510 | DeepSearchQA |
| --- | --- | --- | --- | --- | --- | --- |
| DeepDive-9B-SFT | 5.6 | 15.7 | -- | 35.0 | -- | -- |
| DeepDive-9B-RL | 6.3 | 15.1 | -- | 38.0 | -- | -- |
| WebSailor-7B | 6.7 | 14.2 | 37.9 | 34.3 | -- | -- |
| OffSeeker-8B-SFT | 10.6 | 24.2 | 47.6 | 48.0 | -- | -- |
| OffSeeker-8B-DPO | 12.8 | 26.6 | 51.5 | 49.0 | -- | -- |
| WebExplorer-8B-RL | 15.7 | 32.0 | 50.0 | 53.7 | 23.0 | 17.8 |
| AgentCPM-Explore-4B | 24.1 | 29.1 | 63.9 | 70.0 | 34.0 | 32.8 |
| DR-Venus-4B-SFT | 26.8 | 35.7 | 65.4 | 69.0 | 35.3 | 37.7 |
| DR-Venus-4B-RL | 29.1 | 37.7 | 64.4 | 74.7 | 40.7 | 39.6 |

Relative to the SFT checkpoint, DR-Venus-4B-RL improves:

  • BrowseComp by +2.3
  • BrowseComp-ZH by +2.0
  • xBench-DS-2505 by +5.7
  • xBench-DS-2510 by +5.4
  • DeepSearchQA by +1.9

These gains are associated with better formatting accuracy, more reliable tool use, and stronger long-horizon execution stability.

Usage

This checkpoint should be used with the official DR-Venus inference pipeline.

git clone https://github.com/inclusionAI/DR-Venus
cd DR-Venus/Inference
pip install -r requirements.txt

# then configure the model path in run_demo.sh or run_web_demo.sh
bash run_demo.sh
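
If you only need to inspect the raw weights outside the official pipeline, a standard `transformers` load should work; note that this bypasses the tool-use scaffolding the model was trained with, so plain generation will not reflect its deep-research behavior.

```python
def load_model(model_id: str = "inclusionAI/DR-Venus-4B-RL"):
    """Load the tokenizer and BF16 weights from the Hugging Face Hub.
    Imported lazily so this snippet does not require transformers at
    import time; loading needs sufficient memory for a 4B BF16 model."""
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype="bfloat16",  # weights are released in BF16
        device_map="auto",
    )
    return tokenizer, model
```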

For reproducing RL training or understanding the rollout setup, see the RL directory in the official repository.

License and Release Notes

Please verify license compatibility with:

  • the upstream base model
  • the released supervision data
  • the external tools and judge models used in training or evaluation

This section can be updated later with the final project-specific license statement.

Citation

If you use this checkpoint, please cite the DR-Venus project.

@misc{dr_venus_2026,
  title  = {DR-Venus: Towards Frontier Edge-Scale Deep Research Agents with Only 10K Open Data},
  author = {Venus Team and Sunhao Dai and Yong Deng and Jinzhen Lin and Yucheng Song and Guoqing Wang and Xiaofeng Wu and Yuqi Zhou and Shuo Yang and Zhenzhe Ying and Zhanwei Zhang and Changhua Meng and Weiqiang Wang},
  year   = {2026},
}
