OmniAgent: Unified Multimodal Agent with RL Alignment for Any-to-Any Generation

Md Rezwan Haque | CPAMI Lab, University of Waterloo



SOTA Contributions

These are novel contributions that have not been attempted in prior work:

1. Unified Action-Generation Token Space (First in literature)

Prior multimodal agents use separate pathways for generation (images/audio) and agentic actions (tool calls, planning). OmniAgent is the first to unify them: <IMG>, <AUD>, <VID>, <TOOL_CALL>, <PLAN>, <THINK>, and <ACT> all share the same mechanism -- the LLM emits a special token, and its hidden state is routed to the appropriate downstream module. This removes the need for parallel decoding heads and lets the model learn generation and agentic behavior jointly.
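
A minimal sketch of this shared mechanism, in plain Python (the module names and handlers below are illustrative placeholders, not the actual OmniAgent internals -- in the real model the routing happens inside the decoding loop):

```python
# Sketch of unified token routing (hypothetical names): generation tokens
# and agentic-action tokens go through the SAME dispatch step -- look up
# the emitted token, hand its hidden state to the mapped module.

def image_decoder(h):  return f"image<{len(h)}d>"     # stand-in for a diffusion decoder
def tool_executor(h):  return f"tool_call<{len(h)}d>" # stand-in for the action executor
def planner(h):        return f"plan<{len(h)}d>"      # stand-in for the agent planner

ROUTES = {
    "<IMG>": image_decoder,
    "<TOOL_CALL>": tool_executor,
    "<PLAN>": planner,
}

def route(token, hidden_state):
    """If the token is special, its hidden state is routed to the mapped
    downstream module; otherwise it is emitted as ordinary text."""
    handler = ROUTES.get(token)
    return handler(hidden_state) if handler else token

h = [0.0] * 4096  # 4096-dim hidden state, as in the architecture diagram below
print(route("<IMG>", h))    # routed to the image decoder
print(route("sunset", h))   # plain text token passes through unchanged
```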

2. MM-DPO: Multimodal Direct Preference Optimization (Novel method)

First DPO variant that decomposes the preference signal into modality-specific reward components with dynamic weighting:

R(x, y) = w_text · R_text + w_img · R_img + w_aud · R_aud + w_vid · R_vid + w_task · R_task

Weights are dynamically set to zero for absent modalities and renormalized per sample. Combined with image-anchored preference (inspired by mDPO, EMNLP 2024), MM-DPO achieves Pareto-optimal performance across all evaluation dimensions (RKP = 18.89, highest among all 6 methods).
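
The masking-and-renormalization step can be sketched as follows (illustrative code, not the training implementation; the example weights are made up):

```python
def mm_dpo_reward(rewards, weights):
    """Sketch of the MM-DPO reward combination: `rewards` maps modality ->
    score, or None when the sample lacks that modality. Weights of absent
    modalities are zeroed, the remainder renormalized to sum to 1 per
    sample, and the weighted sum returned."""
    active = {m: w for m, w in weights.items() if rewards.get(m) is not None}
    total = sum(active.values())
    norm = {m: w / total for m, w in active.items()}  # per-sample renormalization
    return sum(norm[m] * rewards[m] for m in norm)

# Text+image sample: w_aud and w_vid drop to zero, the rest renormalize.
weights = {"text": 0.3, "img": 0.2, "aud": 0.2, "vid": 0.2, "task": 0.1}
rewards = {"text": 0.8, "img": 0.6, "aud": None, "vid": None, "task": 1.0}
print(round(mm_dpo_reward(rewards, weights), 4))  # 0.5*0.8 + (1/3)*0.6 + (1/6)*1.0
```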

3. GRPO++: Token-Level Advantage GRPO (Novel method)

First GRPO extension with: (1) token-level advantage estimation (vs. sequence-level), (2) dual-clip mechanism for bidirectional policy stability, (3) entropy bonus for diverse multimodal generation. Achieves 62x improvement over vanilla GRPO (PPL 130 vs. 8048).
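
The first two ingredients can be sketched as a per-token clipped objective (illustrative code; `eps` and `c` are placeholder constants, not the paper's hyperparameters, and the entropy bonus is omitted for brevity):

```python
def dual_clip_token_loss(ratios, advantages, eps=0.2, c=3.0):
    """Sketch of a GRPO++-style objective: each token carries its OWN
    advantage (token-level, vs. one advantage per sequence in vanilla
    GRPO). The standard PPO clip bounds the ratio in [1-eps, 1+eps]; for
    negative advantages a second clip at c*A floors the objective, so the
    policy cannot be pushed arbitrarily far in either direction."""
    losses = []
    for r, a in zip(ratios, advantages):
        clipped = max(min(r, 1 + eps), 1 - eps)
        obj = min(r * a, clipped * a)   # standard PPO clipping
        if a < 0:
            obj = max(obj, c * a)       # dual clip: floor for negative advantages
        losses.append(-obj)
    return sum(losses) / len(losses)

# Without the dual clip, the first token (ratio 5.0, advantage -1.0) would
# contribute a loss of 5.0; the second clip caps it at 3.0.
print(dual_clip_token_loss([5.0, 1.0], [-1.0, 1.0]))
```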

4. Most Comprehensive RL Comparison for Multimodal Agents (First study)

First systematic comparison of 6 RL alignment methods (DPO, SimPO, GRPO, GRPO++, Online GRPO, MM-DPO) on the same multimodal agent backbone. Key finding: preference-based methods are dramatically more stable than reward-based methods for multimodal alignment.

5. Five Novel Evaluation Indices

CMTS (Cross-Modal Transfer Score), ACI (Agentic Completeness Index), TDI (Token Distribution Index), MCS (Modal Consistency Score), HRS (Human Reward Simulation) -- designed specifically for evaluating multimodal agents.


Architecture

```
Input (text + image/audio/video)
  │
  ▼
[ImageBind-Huge] (632M, frozen) ──→ 1024-dim
  │
  ▼
[Input Projector] (50M, Transformer) ──→ 4096-dim
  │
  ▼
[Vicuna-7B-v1.5 + LoRA]
  │
  ├── Text tokens       ──→ text response
  ├── <IMG0>...<IMG4>   ──→ [Output Proj] ──→ Stable Diffusion v1.5
  ├── <AUD0>...<AUD3>   ──→ [Output Proj] ──→ AudioLDM 2
  ├── <VID0>...<VID3>   ──→ [Output Proj] ──→ ModelScope
  ├── <TOOL_CALL>       ──→ Action Executor ──→ external APIs
  ├── <PLAN>            ──→ Agent Planner  ──→ multi-step decomposition
  └── <THINK>           ──→ chain-of-thought reasoning
```

Total: 7.5B params | Trainable: 780M (10.4%) | Hardware: 2x NVIDIA RTX A6000 (48 GB each)


Checkpoints (All Retrained Mar-Apr 2026)

All checkpoints are stored as subfolders with backbone/, input_projector.pt, output_projectors.pt, tokenizer/.

| Checkpoint | Stage | Method | Date | Notes |
|---|---|---|---|---|
| stage1_encode/ | 1 | Input projector alignment | Mar 8 | CC3M + AudioCaps + LLaVA 50K |
| stage2_decode/ | 2 | Output projector alignment | Mar 9 | Precomputed embeddings |
| stage3_sft/ | 3 | Agentic SFT (LoRA r=64) | Mar 16 | MAgenIT 50K + Understanding 10K + ToolBench 4K |
| stage4_dpo/ | 4 | DPO (Rafailov et al.) | Mar 19 | Preferences 50K, 3 epochs |
| stage4_simpo/ | 4 | SimPO (Meng et al.) | Mar 24 | Best PPL (1.75) |
| stage4_grpo/ | 4 | GRPO (Shao et al.) | Mar 28 | Perplexity explosion (8048) |
| stage4_grpo_plus/ | 4 | GRPO++ (Ours) | Mar 30 | 62x better than GRPO |
| stage4_online_grpo/ | 4 | Online GRPO (Ours) | Apr 1 | Replay buffer + dynamic temp |
| stage4_mm_dpo/ | 4 | MM-DPO (Ours) | Mar 24 | Pareto-optimal, recommended |

Total training: ~256 GPU-hours on 2x A6000.


Evaluation Results (Real Metrics, Retrained Models)

Evaluated on 300 samples across 6 cross-modal task categories. All numbers come from actual evaluation runs of the retrained models.

Table 1: Main Results

| Method | PPL ↓ | Gen PPL ↓ | Tool PPL ↓ | Plan PPL ↓ | CMTS ↑ | ACI ↑ | Novel Avg ↑ |
|---|---|---|---|---|---|---|---|
| SFT (no RL) | 1.92 | 2.43 | 1.71 | 1.61 | 0.931 | 0.817 | 0.747 |
| + DPO | 2.32 | 2.93 | 2.01 | 2.05 | 0.919 | 0.917 | 0.688 |
| + SimPO | 1.75 | 2.18 | 1.55 | 1.52 | 0.939 | 0.817 | 0.781 |
| + GRPO | 8048 | 11465 | 8446 | 4700 | 0.158 | 0.900 | 0.580 |
| + GRPO++ (Ours) | 130.5 | 195.5 | 117.2 | 79.4 | 0.538 | 0.833 | 0.686 |
| + Online GRPO (Ours) | 455.5 | 839.9 | 351.0 | 244.2 | 0.410 | 0.817 | 0.678 |
| + MM-DPO (Ours) | 2.30 | 2.90 | 1.99 | 2.02 | 0.920 | 0.917 | 0.714 |

Table 2: Novel Evaluation Indices

| Index | Description | Best Method | Score |
|---|---|---|---|
| CMTS | Cross-Modal Transfer Score | SimPO | 0.939 |
| ACI | Agentic Completeness Index | DPO / MM-DPO | 0.917 |
| TDI | Token Distribution Index | Online GRPO | 0.907 |
| MCS | Modal Consistency Score | SimPO / Online GRPO | 0.636 |
| HRS | Human Reward Simulation | SFT | 0.629 |

Table 3: Per-Category Perplexity ↓

| Category | SFT | SimPO | MM-DPO (Ours) |
|---|---|---|---|
| audio_to_image | 2.92 | 2.56 | 3.16 |
| code_and_explain | 2.03 | 1.77 | 2.22 |
| image_to_audio | 2.55 | 2.34 | 2.63 |
| multi_step_creation | 1.61 | 1.52 | 2.01 |
| search_and_generate | 1.59 | 1.46 | 1.92 |
| text_to_multimodal | 1.92 | 1.73 | 2.81 |

Pareto Analysis (RKP Scores)

MM-DPO (Ours): 18.89  ← Pareto-optimal
DPO:           18.38
SimPO:          4.00
GRPO:           0.33
GRPO++ (Ours):  0.25
Online GRPO:    0.22
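
The RKP formula is not reproduced here, but the Pareto claim can be sanity-checked directly from Table 1 with a generic dominance test over three axes (PPL minimized; CMTS and ACI maximized). On those axes both SimPO and MM-DPO sit on the front -- SimPO leads on PPL/CMTS, MM-DPO on ACI -- and RKP is the scalarization that ranks MM-DPO first:

```python
# Generic Pareto-front check over Table 1 (illustrative sanity check,
# not the RKP computation). Tuples are (PPL, CMTS, ACI).
methods = {
    "SFT":         (1.92, 0.931, 0.817),
    "DPO":         (2.32, 0.919, 0.917),
    "SimPO":       (1.75, 0.939, 0.817),
    "GRPO":        (8048, 0.158, 0.900),
    "GRPO++":      (130.5, 0.538, 0.833),
    "Online GRPO": (455.5, 0.410, 0.817),
    "MM-DPO":      (2.30, 0.920, 0.917),
}

def dominates(a, b):
    """a dominates b: no worse on every axis, strictly better on at least one.
    PPL (index 0) is lower-is-better; CMTS and ACI are higher-is-better."""
    no_worse = a[0] <= b[0] and a[1] >= b[1] and a[2] >= b[2]
    strictly_better = a[0] < b[0] or a[1] > b[1] or a[2] > b[2]
    return no_worse and strictly_better

front = [m for m, v in methods.items()
         if not any(dominates(w, v) for w in methods.values() if w != v)]
print(front)
```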

Key Findings

  1. Preference-based methods >> reward-based for multimodal alignment (PPL 1.75-2.32 vs. 130-8048)
  2. MM-DPO is Pareto-optimal -- modality-aware reward decomposition prevents sacrificing any quality dimension
  3. GRPO++ improves GRPO by 62x -- token-level advantages and dual-clip rescue reward-based methods
  4. SimPO achieves best PPL (1.75) -- reference-free design suits variable-length multimodal outputs
  5. RL alignment needs 3 epochs -- training for only 1 epoch causes a perplexity explosion (empirically validated)

Comparison with Related Work

| System | Params | Any-to-Any | Tool Use | RL Methods | Open Weights |
|---|---|---|---|---|---|
| NExT-GPT (ICML 2024) | 7B | Yes | No | 0 | Yes |
| CogAgent (CVPR 2024) | 18B | No | Yes | 0 | Yes |
| Magma (CVPR 2025) | 8B | No | Yes | 0 | Yes |
| LLaVA-1.6 (2024) | 7-34B | No | No | 0 | Yes |
| GPT-4V (2024) | ~1.8T | No | Yes | RLHF | No |
| OmniAgent (Ours) | 7B | Yes | Yes | 6 | Yes |

Quick Start

```python
from omniagent.model.omniagent_arch import OmniAgentModel
import torch

model = OmniAgentModel.from_pretrained(
    model_path="mr3haque/OmniAgent/stage4_mm_dpo",
    load_in_4bit=True,
    torch_dtype=torch.float16,
    device_map="auto",
)

# Text + image understanding
output = model.chat(text="Describe this image.", image_path="photo.jpg")
print(output.text)

# Multimodal generation
output = model.chat(text="Generate an image of a sunset with matching audio")
output.save_image("sunset.png")
output.save_audio("ambient.wav")

# Agentic planning + tool use
output = model.chat(
    text="Search for population data and create a chart",
    tools=["web_search", "code_interpreter"],
)
print(output.plan)
print(output.tool_calls)
```

Full inference notebook: notebooks/OmniAgent_Inference.ipynb


Figures

| Figure | Description |
|---|---|
| figures/rl_comparison.png | 6 RL methods bar chart comparison |
| figures/radar_chart.png | Multi-dimensional radar comparison |
| figures/training_curves.png | Loss curves for all stages + RL methods |
| figures/category_breakdown.png | Per-category perplexity |
| figures/ablation_study.png | Stage-by-stage ablation |
| figures/novel_indices_comparison.png | 5 novel evaluation indices |

Dataset

Canonical repo: mr3haque/OmniAgent-Data

| Dataset | Samples | Stage |
|---|---|---|
| MAgenIT (original + augmented) | 5K + 50K | 3 (SFT) |
| Preferences (original + augmented) | 14.4K + 50K | 4 (RL) |
| Understanding SFT | 10K | 3 |
| ToolBench | 4K | 3 |
| CC3M | 100K + 20K | 1 |
| AudioCaps | 49.8K | 1 |
| LLaVA-Instruct | 394K + 50K | 1, 3 |
| Decoder embeddings | 71K | 2 |
| Benchmarks | 600 | Eval |

References

@inproceedings{rafailov2023dpo, title={Direct Preference Optimization}, author={Rafailov et al.}, booktitle={NeurIPS}, year={2023}}
@article{meng2024simpo, title={SimPO: Simple Preference Optimization}, author={Meng et al.}, year={2024}}
@article{shao2024deepseekmath, title={DeepSeekMath: GRPO}, author={Shao et al.}, year={2024}}
@inproceedings{wu2024nextgpt, title={NExT-GPT: Any-to-Any Multimodal LLM}, author={Wu et al.}, booktitle={ICML}, year={2024}}
@inproceedings{hong2024cogagent, title={CogAgent}, author={Hong et al.}, booktitle={CVPR}, year={2024}}
@inproceedings{yang2025magma, title={Magma: Foundation Model for Multimodal AI Agents}, author={Yang et al.}, booktitle={CVPR}, year={2025}}
@inproceedings{li2024mdpo, title={mDPO: Conditional Preference Optimization for Multimodal LLMs}, author={Li et al.}, booktitle={EMNLP}, year={2024}}
@article{zhang2025mmrlhf, title={MM-RLHF: Multimodal Reward Model}, author={Zhang et al.}, year={2025}}
@article{xue2025dancegrpo, title={DanceGRPO: RL for Visual Generation}, author={Xue et al.}, year={2025}}

Citation

@inproceedings{haque2026omniagent,
  title={OmniAgent: Unified Multimodal Agent with RL Alignment for Any-to-Any Generation},
  author={Haque, Md Rezwan},
  booktitle={CVPR},
  year={2026}
}

License

Apache 2.0 | CPAMI Lab, University of Waterloo
