OmniAgent: Unified Multimodal Agent with RL Alignment for Any-to-Any Generation
Md Rezwan Haque | CPAMI Lab, University of Waterloo
SOTA Contributions
The following contributions are, to our knowledge, novel and have not been attempted in prior work:
1. Unified Action-Generation Token Space (First in literature)
Prior multimodal agents use separate pathways for generation (images/audio) and agentic actions (tool calls, planning). OmniAgent is the first to unify them: <IMG>, <AUD>, <VID>, <TOOL_CALL>, <PLAN>, <THINK>, and <ACT> all share the same mechanism -- the LLM emits a special token, and its hidden state is routed to the appropriate downstream module. This unification simplifies the architecture and enables the model to learn generation and agentic behavior jointly.
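The shared emit-and-route mechanism can be illustrated with a minimal dispatch sketch. This is a hedged illustration, not the actual OmniAgent API: the `route` helper and module names are hypothetical; only the special-token families come from the text above.

```python
# Hypothetical sketch of the unified action-generation token space: every
# special token family dispatches through the SAME routing mechanism.
# Module names are placeholders, not the real OmniAgent internals.
ROUTING_TABLE = {
    "<IMG>": "image_output_projector",    # -> Stable Diffusion v1.5
    "<AUD>": "audio_output_projector",    # -> AudioLDM 2
    "<VID>": "video_output_projector",    # -> ModelScope
    "<TOOL_CALL>": "action_executor",     # -> external APIs
    "<PLAN>": "agent_planner",            # -> multi-step decomposition
    "<THINK>": "reasoning_trace",         # consumed internally as CoT
}

def route(token: str, hidden_state):
    """Route the hidden state behind a special token to its module.

    Ordinary text tokens fall through to standard text decoding, so
    generation and agentic actions share one emission pathway.
    """
    module = ROUTING_TABLE.get(token)
    if module is None:
        return ("text", hidden_state)  # plain text token
    return (module, hidden_state)
```

Because every capability is just another entry in the same table, adding a new action type does not require a new architectural pathway.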
2. MM-DPO: Multimodal Direct Preference Optimization (Novel method)
First DPO variant that decomposes the preference signal into modality-specific reward components with dynamic weighting:
R(x, y) = w_text · R_text + w_img · R_img + w_aud · R_aud + w_vid · R_vid + w_task · R_task
Weights are dynamically set to zero for absent modalities and renormalized per sample. Combined with image-anchored preference (inspired by mDPO, EMNLP 2024), MM-DPO achieves Pareto-optimal performance across all evaluation dimensions (RKP = 18.89, highest among all 6 methods).
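Under the stated zero-and-renormalize rule, the per-sample reward can be sketched as follows. This is a minimal illustration: the weight values and the `mm_dpo_reward` helper are hypothetical; only the decomposition R = Σ w_m · R_m and the renormalization rule come from the text.

```python
# Illustrative MM-DPO reward combination. Weights for absent modalities
# are zeroed and the remaining weights renormalized per sample, as
# described above; the concrete numbers below are placeholders.

def mm_dpo_reward(rewards, weights):
    """R(x, y) = sum_m w_m * R_m over modalities present in the sample.

    `rewards` maps modality -> reward component (None if absent);
    `weights` maps modality -> base weight.
    """
    present = {m: w for m, w in weights.items() if rewards.get(m) is not None}
    total = sum(present.values())
    # Renormalize so the active weights sum to 1 for this sample.
    norm = {m: w / total for m, w in present.items()}
    return sum(norm[m] * rewards[m] for m in norm)

# Text-only sample: image/audio/video components are absent and dropped.
r = mm_dpo_reward(
    rewards={"text": 0.8, "img": None, "aud": None, "vid": None, "task": 0.6},
    weights={"text": 0.3, "img": 0.2, "aud": 0.1, "vid": 0.1, "task": 0.3},
)
# w_text and w_task each renormalize to 0.5, so r ≈ 0.7
```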
3. GRPO++: Token-Level Advantage GRPO (Novel method)
First GRPO extension with: (1) token-level advantage estimation (vs. sequence-level), (2) dual-clip mechanism for bidirectional policy stability, (3) entropy bonus for diverse multimodal generation. Achieves 62x improvement over vanilla GRPO (PPL 130 vs. 8048).
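The dual-clip mechanism applied per token can be sketched as below. This is a hedged sketch: the clip thresholds, entropy-bonus weight, and function name are assumptions for illustration, not values from the paper; only the three listed ingredients (token-level advantages, dual clipping, entropy bonus) come from the text.

```python
# Sketch of a GRPO++-style per-token loss with dual clipping and an
# entropy bonus. Hyperparameters (eps, c, beta) are illustrative.

def grpo_pp_token_loss(ratio, advantage, entropy=0.0,
                       eps=0.2, c=3.0, beta=0.01):
    """Dual-clip surrogate applied to a single token's advantage.

    Standard clipping bounds the ratio in [1-eps, 1+eps]. The second
    clip at `c` additionally bounds the loss when advantage < 0 and the
    ratio is very large, stabilizing the policy in both directions.
    """
    clipped = max(min(ratio, 1 + eps), 1 - eps)
    surrogate = min(ratio * advantage, clipped * advantage)
    if advantage < 0:
        # Dual clip: a huge ratio paired with a negative advantage
        # cannot dominate the gradient.
        surrogate = max(surrogate, c * advantage)
    # Entropy bonus encourages diverse multimodal generation.
    return -surrogate - beta * entropy
```

With ratio 10 and advantage -1, single clipping alone would yield a loss of 10; the dual clip caps it at 3, which is the bidirectional stability the method targets.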
4. Most Comprehensive RL Comparison for Multimodal Agents (First study)
First systematic comparison of 6 RL alignment methods (DPO, SimPO, GRPO, GRPO++, Online GRPO, MM-DPO) on the same multimodal agent backbone. Key finding: preference-based methods are dramatically more stable than reward-based methods for multimodal alignment.
5. Five Novel Evaluation Indices
CMTS (Cross-Modal Transfer Score), ACI (Agentic Completeness Index), TDI (Token Distribution Index), MCS (Modal Consistency Score), HRS (Human Reward Simulation) -- designed specifically for evaluating multimodal agents.
Architecture
```
Input (text + image/audio/video)
        │
        ▼
[ImageBind-Huge] (632M, frozen) ──► 1024-dim
        │
        ▼
[Input Projector] (50M, Transformer) ──► 4096-dim
        │
        ▼
[Vicuna-7B-v1.5 + LoRA]
        │
        ├── Text tokens ──► text response
        ├── <IMG0>...<IMG4> ──► [Output Proj] ──► Stable Diffusion v1.5
        ├── <AUD0>...<AUD3> ──► [Output Proj] ──► AudioLDM 2
        ├── <VID0>...<VID3> ──► [Output Proj] ──► ModelScope
        ├── <TOOL_CALL> ──► Action Executor ──► external APIs
        ├── <PLAN> ──► Agent Planner ──► multi-step decomposition
        └── <THINK> ──► chain-of-thought reasoning
```
Total: 7.5B params | Trainable: 780M (10.4%) | Hardware: 2x NVIDIA RTX A6000 (48 GB each)
Checkpoints (All Retrained Mar-Apr 2026)
All checkpoints are stored as subfolders with backbone/, input_projector.pt, output_projectors.pt, tokenizer/.
| Checkpoint | Stage | Method | Date | Notes |
|---|---|---|---|---|
| stage1_encode/ | 1 | Input projector alignment | Mar 8 | CC3M + AudioCaps + LLaVA 50K |
| stage2_decode/ | 2 | Output projector alignment | Mar 9 | Precomputed embeddings |
| stage3_sft/ | 3 | Agentic SFT (LoRA r=64) | Mar 16 | MAgenIT 50K + Understanding 10K + ToolBench 4K |
| stage4_dpo/ | 4 | DPO (Rafailov et al.) | Mar 19 | Preferences 50K, 3 epochs |
| stage4_simpo/ | 4 | SimPO (Meng et al.) | Mar 24 | Best PPL (1.75) |
| stage4_grpo/ | 4 | GRPO (Shao et al.) | Mar 28 | Perplexity explosion (8048) |
| stage4_grpo_plus/ | 4 | GRPO++ (Ours) | Mar 30 | 62x better than GRPO |
| stage4_online_grpo/ | 4 | Online GRPO (Ours) | Apr 1 | Replay buffer + dynamic temp |
| stage4_mm_dpo/ | 4 | MM-DPO (Ours) | Mar 24 | Pareto-optimal, recommended |
Total training: ~256 GPU-hours on 2x A6000.
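Given the folder layout stated above (backbone/, input_projector.pt, output_projectors.pt, tokenizer/), resolving a stage checkpoint's components might look like the following sketch. Only the file and directory names come from the layout description; the `checkpoint_paths` helper itself is hypothetical.

```python
# Hypothetical helper that resolves the component paths inside one
# stage checkpoint folder, following the layout stated above.
from pathlib import Path

def checkpoint_paths(root):
    """Map component names to their expected paths under `root`."""
    root = Path(root)
    return {
        "backbone": root / "backbone",                       # LLM weights (dir)
        "input_projector": root / "input_projector.pt",      # encoder-side projector
        "output_projectors": root / "output_projectors.pt",  # per-modality heads
        "tokenizer": root / "tokenizer",                     # incl. special tokens (dir)
    }

paths = checkpoint_paths("checkpoints/stage4_mm_dpo")
```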
Evaluation Results (Real Metrics, Retrained Models)
Evaluated on 300 samples across 6 cross-modal task categories. All numbers come from actual evaluation runs of the retrained models.
Table 1: Main Results
| Method | PPL ↓ | Gen PPL ↓ | Tool PPL ↓ | Plan PPL ↓ | CMTS ↑ | ACI ↑ | Novel Avg ↑ |
|---|---|---|---|---|---|---|---|
| SFT (no RL) | 1.92 | 2.43 | 1.71 | 1.61 | 0.931 | 0.817 | 0.747 |
| + DPO | 2.32 | 2.93 | 2.01 | 2.05 | 0.919 | 0.917 | 0.688 |
| + SimPO | 1.75 | 2.18 | 1.55 | 1.52 | 0.939 | 0.817 | 0.781 |
| + GRPO | 8048 | 11465 | 8446 | 4700 | 0.158 | 0.900 | 0.580 |
| + GRPO++ (Ours) | 130.5 | 195.5 | 117.2 | 79.4 | 0.538 | 0.833 | 0.686 |
| + Online GRPO (Ours) | 455.5 | 839.9 | 351.0 | 244.2 | 0.410 | 0.817 | 0.678 |
| + MM-DPO (Ours) | 2.30 | 2.90 | 1.99 | 2.02 | 0.920 | 0.917 | 0.714 |
Table 2: Novel Evaluation Indices
| Index | Description | Best Method | Score |
|---|---|---|---|
| CMTS | Cross-Modal Transfer Score | SimPO | 0.939 |
| ACI | Agentic Completeness Index | DPO / MM-DPO | 0.917 |
| TDI | Token Distribution Index | Online GRPO | 0.907 |
| MCS | Modal Consistency Score | SimPO / Online GRPO | 0.636 |
| HRS | Human Reward Simulation | SFT | 0.629 |
Table 3: Per-Category Perplexity ↓
| Category | SFT | SimPO | MM-DPO (Ours) |
|---|---|---|---|
| audio_to_image | 2.92 | 2.56 | 3.16 |
| code_and_explain | 2.03 | 1.77 | 2.22 |
| image_to_audio | 2.55 | 2.34 | 2.63 |
| multi_step_creation | 1.61 | 1.52 | 2.01 |
| search_and_generate | 1.59 | 1.46 | 1.92 |
| text_to_multimodal | 1.92 | 1.73 | 2.81 |
Pareto Analysis (RKP Scores)
| Method | RKP ↑ |
|---|---|
| MM-DPO (Ours) | 18.89 (Pareto-optimal) |
| DPO | 18.38 |
| SimPO | 4.00 |
| GRPO | 0.33 |
| GRPO++ (Ours) | 0.25 |
| Online GRPO | 0.22 |
Key Findings
- Preference-based methods >> reward-based for multimodal alignment (PPL 1.75-2.32 vs. 130-8048)
- MM-DPO is Pareto-optimal -- modality-aware reward decomposition prevents sacrificing any quality dimension
- GRPO++ improves GRPO by 62x -- token-level advantages and dual-clip rescue reward-based methods
- SimPO achieves best PPL (1.75) -- reference-free design suits variable-length multimodal outputs
- 3 epochs are required for RL -- training for only 1 epoch causes a perplexity explosion (empirically validated)
Comparison with Related Work
| System | Params | Any-to-Any | Tool Use | RL Methods | Open Weights |
|---|---|---|---|---|---|
| NExT-GPT (ICML 2024) | 7B | Yes | No | 0 | Yes |
| CogAgent (CVPR 2024) | 18B | No | Yes | 0 | Yes |
| Magma (CVPR 2025) | 8B | No | Yes | 0 | Yes |
| LLaVA-1.6 (2024) | 7-34B | No | No | 0 | Yes |
| GPT-4V (2024) | ~1.8T | No | Yes | RLHF | No |
| OmniAgent (Ours) | 7B | Yes | Yes | 6 | Yes |
Quick Start
```python
import torch
from omniagent.model.omniagent_arch import OmniAgentModel

model = OmniAgentModel.from_pretrained(
    model_path="mr3haque/OmniAgent/stage4_mm_dpo",
    load_in_4bit=True,
    torch_dtype=torch.float16,
    device_map="auto",
)

# Text + image understanding
output = model.chat(text="Describe this image.", image_path="photo.jpg")
print(output.text)

# Multimodal generation
output = model.chat(text="Generate an image of a sunset with matching audio")
output.save_image("sunset.png")
output.save_audio("ambient.wav")

# Agentic planning + tool use
output = model.chat(
    text="Search for population data and create a chart",
    tools=["web_search", "code_interpreter"],
)
print(output.plan)
print(output.tool_calls)
```
Full inference notebook: notebooks/OmniAgent_Inference.ipynb
Figures
| Figure | Description |
|---|---|
| figures/rl_comparison.png | 6 RL methods bar chart comparison |
| figures/radar_chart.png | Multi-dimensional radar comparison |
| figures/training_curves.png | Loss curves for all stages + RL methods |
| figures/category_breakdown.png | Per-category perplexity |
| figures/ablation_study.png | Stage-by-stage ablation |
| figures/novel_indices_comparison.png | 5 novel evaluation indices |
Dataset
Canonical repo: mr3haque/OmniAgent-Data
| Dataset | Samples | Stage |
|---|---|---|
| MAgenIT (original + augmented) | 5K + 50K | 3 (SFT) |
| Preferences (original + augmented) | 14.4K + 50K | 4 (RL) |
| Understanding SFT | 10K | 3 |
| ToolBench | 4K | 3 |
| CC3M | 100K + 20K | 1 |
| AudioCaps | 49.8K | 1 |
| LLaVA-Instruct | 394K + 50K | 1, 3 |
| Decoder embeddings | 71K | 2 |
| Benchmarks | 600 | Eval |
References
@inproceedings{rafailov2023dpo, title={Direct Preference Optimization: Your Language Model is Secretly a Reward Model}, author={Rafailov et al.}, booktitle={NeurIPS}, year={2023}}
@article{meng2024simpo, title={SimPO: Simple Preference Optimization with a Reference-Free Reward}, author={Meng et al.}, year={2024}}
@article{shao2024deepseekmath, title={DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models}, author={Shao et al.}, year={2024}}
@inproceedings{wu2024nextgpt, title={NExT-GPT: Any-to-Any Multimodal LLM}, author={Wu et al.}, booktitle={ICML}, year={2024}}
@inproceedings{hong2024cogagent, title={CogAgent: A Visual Language Model for GUI Agents}, author={Hong et al.}, booktitle={CVPR}, year={2024}}
@inproceedings{yang2025magma, title={Magma: A Foundation Model for Multimodal AI Agents}, author={Yang et al.}, booktitle={CVPR}, year={2025}}
@inproceedings{li2024mdpo, title={mDPO: Conditional Preference Optimization for Multimodal LLMs}, author={Li et al.}, booktitle={EMNLP}, year={2024}}
@article{zhang2025mmrlhf, title={MM-RLHF: Multimodal Reward Model}, author={Zhang et al.}, year={2025}}
@article{xue2025dancegrpo, title={DanceGRPO: RL for Visual Generation}, author={Xue et al.}, year={2025}}
Citation
@inproceedings{haque2026omniagent,
title={OmniAgent: Unified Multimodal Agent with RL Alignment for Any-to-Any Generation},
author={Haque, Md Rezwan},
booktitle={CVPR},
year={2026}
}
Links
- Code: github.com/rezwanh001/OmniAgent
- Dataset: mr3haque/OmniAgent-Data
- Inference: notebooks/OmniAgent_Inference.ipynb
License
Apache 2.0 | CPAMI Lab, University of Waterloo