
DR-Venus-4B-RL

DR-Venus-4B-RL is the reinforcement learning (RL) checkpoint of DR-Venus, built on top of inclusionAI/DR-Venus-4B-SFT. It is a 4B-parameter deep research agent designed for long-horizon web research with explicit tool use, evidence collection, and answer generation.

This model is trained entirely on open data. Starting from the SFT checkpoint, DR-Venus-4B-RL applies long-horizon agentic RL with IGPO-style information gain rewards and format-aware turn-level supervision to improve execution reliability under long tool-use trajectories.

What This Model Is For

This checkpoint is intended for:

long-horizon deep research with tool-augmented reasoning
improving execution reliability beyond supervised imitation
evidence-grounded answering with search and visit
deployment in the official DR-Venus inference pipeline

It is not primarily optimized for:

plain chat without tools
generic short-context instruction following
use cases that do not need multi-step retrieval and browsing

Model Details

Base model: Qwen/Qwen3-4B-Thinking-2507
Initialization checkpoint: inclusionAI/DR-Venus-4B-SFT
Training stage: agentic reinforcement learning
Training framework: verl + IGPO algorithm
Tool setting: search + visit
Maximum rollout horizon: 200 interaction steps
Maximum rollout context length: 256K
Intended domain: long-horizon open-domain research and evidence-grounded question answering

How DR-Venus Builds RL Supervision

DR-Venus-4B-RL is trained with dense turn-level supervision tailored to deep research:

The model starts from the DR-Venus supervised checkpoint.
For each query, the agent interacts with the environment over multi-turn search and visit trajectories.
IGPO uses information gain rewards to measure whether an intermediate turn increases the model's probability of producing the ground-truth answer.
Information gain rewards are combined with outcome rewards and turn-level format-aware penalties.
The policy is optimized using an IGPO objective with fine-grained credit assignment, specifically tailored for the long-horizon nature of deep research rollouts.

This design improves supervision density, credit assignment, and data efficiency compared with sparse trajectory-level RL alone.
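
The reward construction described above can be sketched in a few lines of Python. This is a minimal illustration, not the project's implementation: the function names, the additive combination of terms, the placement of the outcome reward on the final turn, and the use of the discount factor to form per-turn returns are all assumptions made for the example. Only the ingredients (information gain, outcome reward, format-aware penalty) come from the description above.

```python
from typing import List


def turn_level_rewards(
    answer_probs: List[float],
    outcome_reward: float,
    format_ok: List[bool],
    format_penalty_scale: float = 1.0,
) -> List[float]:
    """Hypothetical per-turn reward: information gain minus a format penalty.

    answer_probs[0] is the probability of the ground-truth answer before any
    turn; answer_probs[t + 1] is the probability after turn t.
    """
    num_turns = len(format_ok)
    if len(answer_probs) != num_turns + 1:
        raise ValueError("answer_probs must have len(format_ok) + 1 entries")
    rewards = []
    for t in range(num_turns):
        # Information gain: did this turn make the correct answer more likely?
        info_gain = answer_probs[t + 1] - answer_probs[t]
        # Assumed penalty form: a fixed deduction for a malformed turn.
        penalty = 0.0 if format_ok[t] else format_penalty_scale
        rewards.append(info_gain - penalty)
    if rewards:
        # Assumption: the outcome reward is credited to the final turn.
        rewards[-1] += outcome_reward
    return rewards


def discounted_returns(rewards: List[float], gamma: float = 0.95) -> List[float]:
    """Standard discounted return per turn, accumulated from the last turn back."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns
```

With dense per-turn values like these, a turn that retrieves useful evidence receives credit even when the final answer is wrong, which is the intuition behind the supervision-density claim above.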

Training Data

This model is trained from open-data supervision constructed from:

the DR-Venus SFT checkpoint as initialization
REDSearcher 1K RL query-answer pairs
online rollouts with the DR-Venus search + visit tool environment

In the current paper setup:

RL is performed entirely on open query-answer pairs
rollout groups are sampled with long-horizon agent interaction
generation is performed with up to 200 interaction steps per query
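
To make the 200-step budget concrete, a bounded rollout loop might look like the sketch below. Everything here is hypothetical and for illustration only: `run_rollout`, the `(kind, argument)` action encoding, and the `policy` and `tools` callables are not the DR-Venus interfaces, and the 256K context limit is not modelled.

```python
from typing import Callable, Dict, List, Tuple

MAX_STEPS = 200  # maximum interaction steps per query, as stated above

# Hypothetical action encoding: (kind, argument), kind in {"search", "visit", "answer"}.
Action = Tuple[str, str]


def run_rollout(
    question: str,
    policy: Callable[[List[str]], Action],
    tools: Dict[str, Callable[[str], str]],
    max_steps: int = MAX_STEPS,
) -> Tuple[str, List[str]]:
    """Run one trajectory; return (final answer, history), answer "" if none was given."""
    history = [f"question: {question}"]
    for _ in range(max_steps):
        kind, arg = policy(history)
        if kind == "answer":
            history.append(f"answer: {arg}")
            return arg, history
        if kind not in tools:
            # Malformed tool call: record the error and let the policy retry.
            history.append(f"error: unknown tool {kind!r}")
            continue
        observation = tools[kind](arg)
        history.append(f"{kind}({arg}) -> {observation}")
    # Step budget exhausted without a final answer.
    return "", history
```

A malformed tool call still consumes a step in this sketch, which is one simple way a format error can hurt a long trajectory.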

For more implementation details, please refer to the DR-Venus GitHub repository.

Training Recipe

The RL checkpoint is trained with the following setup reported in the current paper draft:

algorithm: IGPO-style agentic RL
rollout group size: 8
training batch size: 16
learning rate: 1e-6
rollout temperature: 1.0
rollout top-p: 0.95
maximum context length: 256K
maximum generation length per turn: 8,192
discount factor: 0.95
format penalty scale: 1.0
training framework: verl with vLLM rollout engine and FSDP trainer

The current paper configuration also enables browse-aware IG assignment and IG-scale style reward balancing.
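
For convenience, the hyperparameters listed above can be collected in one place, as in the sketch below. The class and field names are hypothetical and are not verl configuration keys. Reading "256K" as 256 × 1024 tokens and treating the training batch size as a count of queries (so that each step uses batch size × group size rollouts) are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RLRecipe:
    """Hyperparameters from the recipe above; field names are illustrative only."""

    rollout_group_size: int = 8
    train_batch_size: int = 16
    learning_rate: float = 1e-6
    rollout_temperature: float = 1.0
    rollout_top_p: float = 0.95
    max_context_length: int = 256 * 1024  # "256K"; assumes K = 1024 tokens
    max_generation_length_per_turn: int = 8192
    discount_factor: float = 0.95
    format_penalty_scale: float = 1.0


recipe = RLRecipe()

# Assumption: the batch size counts queries, each sampled group_size times.
rollouts_per_step = recipe.train_batch_size * recipe.rollout_group_size
print(rollouts_per_step)
```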

Evaluation Summary

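DR-Venus-4B-RL improves over the SFT checkpoint on five of the six tracked deep research benchmarks; the exception is GAIA (Text-Only), where it scores 64.4 against 65.4 for SFT. Among the open models under 9B listed below, it posts the highest score on every benchmark except GAIA (Text-Only).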

Results Against Open Models Under 9B

Model	BrowseComp	BrowseComp-ZH	GAIA (Text-Only)	xBench-DS-2505	xBench-DS-2510	DeepSearchQA
DeepDive-9B-SFT	5.6	15.7	--	35.0	--	--
DeepDive-9B-RL	6.3	15.1	--	38.0	--	--
WebSailor-7B	6.7	14.2	37.9	34.3	--	--
OffSeeker-8B-SFT	10.6	24.2	47.6	48.0	--	--
OffSeeker-8B-DPO	12.8	26.6	51.5	49.0	--	--
WebExplorer-8B-RL	15.7	32.0	50.0	53.7	23.0	17.8
AgentCPM-Explore-4B	24.1	29.1	63.9	70.0	34.0	32.8
DR-Venus-4B-SFT	26.8	35.7	65.4	69.0	35.3	37.7
DR-Venus-4B-RL	29.1	37.7	64.4	74.7	40.7	39.6

Relative to the SFT checkpoint, DR-Venus-4B-RL improves:

BrowseComp by +2.3
BrowseComp-ZH by +2.0
xBench-DS-2505 by +5.7
xBench-DS-2510 by +5.4
DeepSearchQA by +1.9

These gains are associated with better formatting accuracy, more reliable tool use, and stronger long-horizon execution stability.
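
The deltas above can be rechecked directly from the two DR-Venus rows of the table. The short script below copies those scores and prints the RL minus SFT difference for every benchmark, including GAIA (Text-Only), which is not in the list above; the variable names are illustrative.

```python
# Scores copied from the DR-Venus-4B-SFT and DR-Venus-4B-RL rows of the table.
SFT = {
    "BrowseComp": 26.8,
    "BrowseComp-ZH": 35.7,
    "GAIA (Text-Only)": 65.4,
    "xBench-DS-2505": 69.0,
    "xBench-DS-2510": 35.3,
    "DeepSearchQA": 37.7,
}
RL = {
    "BrowseComp": 29.1,
    "BrowseComp-ZH": 37.7,
    "GAIA (Text-Only)": 64.4,
    "xBench-DS-2505": 74.7,
    "xBench-DS-2510": 40.7,
    "DeepSearchQA": 39.6,
}

# Round to one decimal to match the precision reported in the table.
deltas = {name: round(RL[name] - SFT[name], 1) for name in SFT}

for name, delta in deltas.items():
    print(f"{name}: {delta:+.1f}")
```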

Usage

This checkpoint should be used with the official DR-Venus inference pipeline.

git clone https://github.com/inclusionAI/DR-Venus
cd DR-Venus/Inference
pip install -r requirements.txt

# then configure the model path in run_demo.sh or run_web_demo.sh
bash run_demo.sh

For reproducing RL training or understanding the rollout setup, see the RL directory in the official repository.

License and Release Notes

Please verify license compatibility with:

the upstream base model
the released supervision data
the external tools and judge models used in training or evaluation

This section can be updated later with the final project-specific license statement.

Citation

If you use this checkpoint, please cite the DR-Venus project.

@misc{dr_venus_2026,
  title  = {DR-Venus: Towards Frontier Edge-Scale Deep Research Agents with Only 10K Open Data},
  author = {{Venus Team} and Sunhao Dai and Yong Deng and Jinzhen Lin and Yucheng Song and Guoqing Wang and Xiaofeng Wu and Yuqi Zhou and Shuo Yang and Zhenzhe Ying and Zhanwei Zhang and Changhua Meng and Weiqiang Wang},
  year   = {2026},
}
