OmniAgent: Unified Multimodal Agent with RL Alignment for Any-to-Any Generation
Md Rezwan Haque | CPAMI Lab, University of Waterloo
SOTA Contributions
The following contributions are, to our knowledge, novel and have not been attempted in prior work:
1. Unified Action-Generation Token Space (First in literature)
Prior multimodal agents use separate pathways for generation (images/audio) and agentic actions (tool calls, planning). OmniAgent is the first to unify them: <IMG>, <AUD>, <VID>, <TOOL_CALL>, <PLAN>, <THINK>, and <ACT> all share the same mechanism -- the LLM emits a special token, and its hidden state is routed to the appropriate downstream module. This unification simplifies the architecture and enables the model to learn generation and agentic behavior jointly.
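The shared emit-and-route mechanism can be illustrated with a minimal dispatch sketch. This is a hedged illustration, not the actual OmniAgent API: the `route` helper and module names are hypothetical; only the special-token families come from the text above.

```python
# Hypothetical sketch of the unified action-generation token space: every
# special token family dispatches through the SAME routing mechanism.
# Module names are placeholders, not the real OmniAgent internals.
ROUTING_TABLE = {
    "<IMG>": "image_output_projector",    # -> Stable Diffusion v1.5
    "<AUD>": "audio_output_projector",    # -> AudioLDM 2
    "<VID>": "video_output_projector",    # -> ModelScope
    "<TOOL_CALL>": "action_executor",     # -> external APIs
    "<PLAN>": "agent_planner",            # -> multi-step decomposition
    "<THINK>": "reasoning_trace",         # consumed internally as CoT
}

def route(token: str, hidden_state):
    """Route the hidden state behind a special token to its module.

    Ordinary text tokens fall through to standard text decoding, so
    generation and agentic actions share one emission pathway.
    """
    module = ROUTING_TABLE.get(token)
    if module is None:
        return ("text", hidden_state)  # plain text token
    return (module, hidden_state)
```

Because every capability is just another entry in the same table, adding a new action type does not require a new architectural pathway.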
2. MM-DPO: Multimodal Direct Preference Optimization (Novel method)
First DPO variant that decomposes the preference signal into modality-specific reward components with dynamic weighting:
R(x, y) = w_text · R_text + w_img · R_img + w_aud · R_aud + w_vid · R_vid + w_task · R_task
Weights are dynamically set to zero for absent modalities and renormalized per sample. Combined with image-anchored preference (inspired by mDPO, EMNLP 2024), MM-DPO achieves Pareto-optimal performance across all evaluation dimensions (RKP = 18.89, highest among all 6 methods).
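Under the stated zero-and-renormalize rule, the per-sample reward can be sketched as follows. This is a minimal illustration: the weight values and the `mm_dpo_reward` helper are hypothetical; only the decomposition R = Σ w_m · R_m and the renormalization rule come from the text.

```python
# Illustrative MM-DPO reward combination. Weights for absent modalities
# are zeroed and the remaining weights renormalized per sample, as
# described above; the concrete numbers below are placeholders.

def mm_dpo_reward(rewards, weights):
    """R(x, y) = sum_m w_m * R_m over modalities present in the sample.

    `rewards` maps modality -> reward component (None if absent);
    `weights` maps modality -> base weight.
    """
    present = {m: w for m, w in weights.items() if rewards.get(m) is not None}
    total = sum(present.values())
    # Renormalize so the active weights sum to 1 for this sample.
    norm = {m: w / total for m, w in present.items()}
    return sum(norm[m] * rewards[m] for m in norm)

# Text-only sample: image/audio/video components are absent and dropped.
r = mm_dpo_reward(
    rewards={"text": 0.8, "img": None, "aud": None, "vid": None, "task": 0.6},
    weights={"text": 0.3, "img": 0.2, "aud": 0.1, "vid": 0.1, "task": 0.3},
)
# w_text and w_task each renormalize to 0.5, so r ≈ 0.7
```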
3. GRPO++: Token-Level Advantage GRPO (Novel method)
First GRPO extension with: (1) token-level advantage estimation (vs. sequence-level), (2) dual-clip mechanism for bidirectional policy stability, (3) entropy bonus for diverse multimodal generation. Achieves 62x improvement over vanilla GRPO (PPL 130 vs. 8048).
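The dual-clip mechanism applied per token can be sketched as below. This is a hedged sketch: the clip thresholds, entropy-bonus weight, and function name are assumptions for illustration, not values from the paper; only the three listed ingredients (token-level advantages, dual clipping, entropy bonus) come from the text.

```python
# Sketch of a GRPO++-style per-token loss with dual clipping and an
# entropy bonus. Hyperparameters (eps, c, beta) are illustrative.

def grpo_pp_token_loss(ratio, advantage, entropy=0.0,
                       eps=0.2, c=3.0, beta=0.01):
    """Dual-clip surrogate applied to a single token's advantage.

    Standard clipping bounds the ratio in [1-eps, 1+eps]. The second
    clip at `c` additionally bounds the loss when advantage < 0 and the
    ratio is very large, stabilizing the policy in both directions.
    """
    clipped = max(min(ratio, 1 + eps), 1 - eps)
    surrogate = min(ratio * advantage, clipped * advantage)
    if advantage < 0:
        # Dual clip: a huge ratio paired with a negative advantage
        # cannot dominate the gradient.
        surrogate = max(surrogate, c * advantage)
    # Entropy bonus encourages diverse multimodal generation.
    return -surrogate - beta * entropy
```

With ratio 10 and advantage -1, single clipping alone would yield a loss of 10; the dual clip caps it at 3, which is the bidirectional stability the method targets.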
4. Most Comprehensive RL Comparison for Multimodal Agents (First study)
First systematic comparison of 6 RL alignment methods (DPO, SimPO, GRPO, GRPO++, Online GRPO, MM-DPO) on the same multimodal agent backbone. Key finding: preference-based methods are dramatically more stable than reward-based methods for multimodal alignment.
5. Five Novel Evaluation Indices
CMTS (Cross-Modal Transfer Score), ACI (Agentic Completeness Index), TDI (Token Distribution Index), MCS (Modal Consistency Score), HRS (Human Reward Simulation) -- designed specifically for evaluating multimodal agents.
Architecture
```
Input (text + image/audio/video)
        │
        ▼
[ImageBind-Huge] (632M, frozen) ──► 1024-dim
        │
        ▼
[Input Projector] (50M, Transformer) ──► 4096-dim
        │
        ▼
[Vicuna-7B-v1.5 + LoRA]
        │
        ├── Text tokens ──► text response
        ├── <IMG0>...<IMG4> ──► [Output Proj] ──► Stable Diffusion v1.5
        ├── <AUD0>...<AUD3> ──► [Output Proj] ──► AudioLDM 2
        ├── <VID0>...<VID3> ──► [Output Proj] ──► ModelScope
        ├── <TOOL_CALL> ──► Action Executor ──► external APIs
        ├── <PLAN> ──► Agent Planner ──► multi-step decomposition
        └── <THINK> ──► chain-of-thought reasoning
```
Total: 7.5B params | Trainable: 780M (10.4%) | Hardware: 2x NVIDIA RTX A6000 (48 GB each)
Checkpoints (All Retrained Mar-Apr 2026)
All checkpoints are stored as subfolders with backbone/, input_projector.pt, output_projectors.pt, tokenizer/.
| Checkpoint | Stage | Method | Date | Notes |
|---|---|---|---|---|
| stage1_encode/ | 1 | Input projector alignment | Mar 8 | CC3M + AudioCaps + LLaVA 50K |
| stage2_decode/ | 2 | Output projector alignment | Mar 9 | Precomputed embeddings |
| stage3_sft/ | 3 | Agentic SFT (LoRA r=64) | Mar 16 | MAgenIT 50K + Understanding 10K + ToolBench 4K |
| stage4_dpo/ | 4 | DPO (Rafailov et al.) | Mar 19 | Preferences 50K, 3 epochs |
| stage4_simpo/ | 4 | SimPO (Meng et al.) | Mar 24 | Best PPL (1.75) |
| stage4_grpo/ | 4 | GRPO (Shao et al.) | Mar 28 | Perplexity explosion (8048) |
| stage4_grpo_plus/ | 4 | GRPO++ (Ours) | Mar 30 | 62x better than GRPO |
| stage4_online_grpo/ | 4 | Online GRPO (Ours) | Apr 1 | Replay buffer + dynamic temp |
| stage4_mm_dpo/ | 4 | MM-DPO (Ours) | Mar 24 | Pareto-optimal, recommended |
Total training: ~256 GPU-hours on 2x A6000.
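Given the folder layout stated above (backbone/, input_projector.pt, output_projectors.pt, tokenizer/), resolving a stage checkpoint's components might look like the following sketch. Only the file and directory names come from the layout description; the `checkpoint_paths` helper itself is hypothetical.

```python
# Hypothetical helper that resolves the component paths inside one
# stage checkpoint folder, following the layout stated above.
from pathlib import Path

def checkpoint_paths(root):
    """Map component names to their expected paths under `root`."""
    root = Path(root)
    return {
        "backbone": root / "backbone",                       # LLM weights (dir)
        "input_projector": root / "input_projector.pt",      # encoder-side projector
        "output_projectors": root / "output_projectors.pt",  # per-modality heads
        "tokenizer": root / "tokenizer",                     # incl. special tokens (dir)
    }

paths = checkpoint_paths("checkpoints/stage4_mm_dpo")
```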
Evaluation Results (Real Metrics, Retrained Models)
Evaluated on 300 samples across 6 cross-modal task categories. All numbers come from actual evaluation runs of the retrained models.
Table 1: Main Results
| Method | PPL ↓ | Gen PPL ↓ | Tool PPL ↓ | Plan PPL ↓ | CMTS ↑ | ACI ↑ | Novel Avg ↑ |
|---|---|---|---|---|---|---|---|
| SFT (no RL) | 1.92 | 2.43 | 1.71 | 1.61 | 0.931 | 0.817 | 0.747 |
| + DPO | 2.32 | 2.93 | 2.01 | 2.05 | 0.919 | 0.917 | 0.688 |
| + SimPO | 1.75 | 2.18 | 1.55 | 1.52 | 0.939 | 0.817 | 0.781 |
| + GRPO | 8048 | 11465 | 8446 | 4700 | 0.158 | 0.900 | 0.580 |
| + GRPO++ (Ours) | 130.5 | 195.5 | 117.2 | 79.4 | 0.538 | 0.833 | 0.686 |
| + Online GRPO (Ours) | 455.5 | 839.9 | 351.0 | 244.2 | 0.410 | 0.817 | 0.678 |
| + MM-DPO (Ours) | 2.30 | 2.90 | 1.99 | 2.02 | 0.920 | 0.917 | 0.714 |
Table 2: Novel Evaluation Indices
| Index | Description | Best Method | Score |
|---|---|---|---|
| CMTS | Cross-Modal Transfer Score | SimPO | 0.939 |
| ACI | Agentic Completeness Index | DPO / MM-DPO | 0.917 |
| TDI | Token Distribution Index | Online GRPO | 0.907 |
| MCS | Modal Consistency Score | SimPO / Online GRPO | 0.636 |
| HRS | Human Reward Simulation | SFT | 0.629 |
Table 3: Per-Category Perplexity ↓
| Category | SFT | SimPO | MM-DPO (Ours) |
|---|---|---|---|
| audio_to_image | 2.92 | 2.56 | 3.16 |
| code_and_explain | 2.03 | 1.77 | 2.22 |
| image_to_audio | 2.55 | 2.34 | 2.63 |
| multi_step_creation | 1.61 | 1.52 | 2.01 |
| search_and_generate | 1.59 | 1.46 | 1.92 |
| text_to_multimodal | 1.92 | 1.73 | 2.81 |
Pareto Analysis (RKP Scores)
| Method | RKP ↑ |
|---|---|
| MM-DPO (Ours) | 18.89 (Pareto-optimal) |
| DPO | 18.38 |
| SimPO | 4.00 |
| GRPO | 0.33 |
| GRPO++ (Ours) | 0.25 |
| Online GRPO | 0.22 |
Key Findings
- Preference-based methods >> reward-based for multimodal alignment (PPL 1.75-2.32 vs. 130-8048)
- MM-DPO is Pareto-optimal -- modality-aware reward decomposition prevents sacrificing any quality dimension
- GRPO++ improves GRPO by 62x -- token-level advantages and dual-clip rescue reward-based methods
- SimPO achieves best PPL (1.75) -- reference-free design suits variable-length multimodal outputs
- 3 epochs are required for RL -- training for only 1 epoch causes a perplexity explosion (empirically validated)
Comparison with Related Work
| System | Params | Any-to-Any | Tool Use | RL Methods | Open Weights |
|---|---|---|---|---|---|
| NExT-GPT (ICML 2024) | 7B | Yes | No | 0 | Yes |
| CogAgent (CVPR 2024) | 18B | No | Yes | 0 | Yes |
| Magma (CVPR 2025) | 8B | No | Yes | 0 | Yes |
| LLaVA-1.6 (2024) | 7-34B | No | No | 0 | Yes |
| GPT-4V (2024) | ~1.8T | No | Yes | RLHF | No |
| OmniAgent (Ours) | 7B | Yes | Yes | 6 | Yes |
Quick Start
```python
import torch
from omniagent.model.omniagent_arch import OmniAgentModel

model = OmniAgentModel.from_pretrained(
    model_path="mr3haque/OmniAgent/stage4_mm_dpo",
    load_in_4bit=True,
    torch_dtype=torch.float16,
    device_map="auto",
)

# Text + image understanding
output = model.chat(text="Describe this image.", image_path="photo.jpg")
print(output.text)

# Multimodal generation
output = model.chat(text="Generate an image of a sunset with matching audio")
output.save_image("sunset.png")
output.save_audio("ambient.wav")

# Agentic planning + tool use
output = model.chat(
    text="Search for population data and create a chart",
    tools=["web_search", "code_interpreter"],
)
print(output.plan)
print(output.tool_calls)
```
Full inference notebook: notebooks/OmniAgent_Inference.ipynb
Figures
| Figure | Description |
|---|---|
| figures/rl_comparison.png | 6 RL methods bar chart comparison |
| figures/radar_chart.png | Multi-dimensional radar comparison |
| figures/training_curves.png | Loss curves for all stages + RL methods |
| figures/category_breakdown.png | Per-category perplexity |
| figures/ablation_study.png | Stage-by-stage ablation |
| figures/novel_indices_comparison.png | 5 novel evaluation indices |
Dataset
Canonical repo: mr3haque/OmniAgent-Data
| Dataset | Samples | Stage |
|---|---|---|
| MAgenIT (original + augmented) | 5K + 50K | 3 (SFT) |
| Preferences (original + augmented) | 14.4K + 50K | 4 (RL) |
| Understanding SFT | 10K | 3 |
| ToolBench | 4K | 3 |
| CC3M | 100K + 20K | 1 |
| AudioCaps | 49.8K | 1 |
| LLaVA-Instruct | 394K + 50K | 1, 3 |
| Decoder embeddings | 71K | 2 |
| Benchmarks | 600 | Eval |
References
@inproceedings{rafailov2023dpo, title={Direct Preference Optimization: Your Language Model is Secretly a Reward Model}, author={Rafailov et al.}, booktitle={NeurIPS}, year={2023}}
@article{meng2024simpo, title={SimPO: Simple Preference Optimization with a Reference-Free Reward}, author={Meng et al.}, year={2024}}
@article{shao2024deepseekmath, title={DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models}, author={Shao et al.}, year={2024}}
@inproceedings{wu2024nextgpt, title={NExT-GPT: Any-to-Any Multimodal LLM}, author={Wu et al.}, booktitle={ICML}, year={2024}}
@inproceedings{hong2024cogagent, title={CogAgent: A Visual Language Model for GUI Agents}, author={Hong et al.}, booktitle={CVPR}, year={2024}}
@inproceedings{yang2025magma, title={Magma: A Foundation Model for Multimodal AI Agents}, author={Yang et al.}, booktitle={CVPR}, year={2025}}
@inproceedings{li2024mdpo, title={mDPO: Conditional Preference Optimization for Multimodal LLMs}, author={Li et al.}, booktitle={EMNLP}, year={2024}}
@article{zhang2025mmrlhf, title={MM-RLHF: Multimodal Reward Model}, author={Zhang et al.}, year={2025}}
@article{xue2025dancegrpo, title={DanceGRPO: RL for Visual Generation}, author={Xue et al.}, year={2025}}
Citation
@inproceedings{haque2026omniagent,
title={OmniAgent: Unified Multimodal Agent with RL Alignment for Any-to-Any Generation},
author={Haque, Md Rezwan},
booktitle={CVPR},
year={2026}
}
Links
- Code: github.com/rezwanh001/OmniAgent
- Dataset: mr3haque/OmniAgent-Data
- Inference: notebooks/OmniAgent_Inference.ipynb
License
Apache 2.0 | CPAMI Lab, University of Waterloo