
💡 DeepGen 1.0 (Diffusers Format): A Lightweight Unified Multimodal Model for Advancing Image Generation and Editing

This is the diffusers-compatible version of DeepGen-1.0. The model weights are stored in safetensors format with a self-contained pipeline script (deepgen_pipeline.py), so there is no need to clone the DeepGen repository.

DeepGen 1.0 is a lightweight unified multimodal model with only 5B parameters (3B VLM + 2B DiT). It integrates five core capabilities within a single model: general image generation, general image editing, reasoning image generation, reasoning image editing, and text rendering. Across multiple authoritative benchmarks, DeepGen 1.0 matches or surpasses state-of-the-art unified multimodal models that are 3× to 16× larger.

πŸ› οΈ Quick Start

Installation

```shell
pip install torch diffusers transformers safetensors einops accelerate huggingface_hub

# Flash Attention (recommended)
pip install flash-attn --no-build-isolation
```

Load Pipeline

```python
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "deepgenteam/DeepGen-1.0-diffusers",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)
pipe.to("cuda")

# Optional: enable CPU offload for GPUs with limited memory (< 24 GB)
# pipe.enable_model_cpu_offload()
```

Text-to-Image

```python
result = pipe(
    prompt="a raccoon holding a shiny red apple over its head",
    height=512, width=512,
    num_inference_steps=50,
    guidance_scale=4.0,
    seed=42,
)
result.images[0].save("output.png")
```

Image Editing

```python
from PIL import Image

source_image = Image.open("guitar.png").convert("RGB")
result = pipe(
    prompt="Take a photo of this guitar placed on a sandy beach with the sunset in the background.",
    image=source_image,
    height=512, width=512,
    num_inference_steps=50,
    guidance_scale=4.0,
    seed=42,
)
result.images[0].save("edited.png")
```
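Latent diffusion pipelines generally expect the reference image's dimensions to be divisible by the VAE/DiT downsampling factor. A minimal preprocessing sketch, assuming a factor of 16; the helper name and the factor are our assumptions, not taken from deepgen_pipeline.py:

```python
from PIL import Image

# Illustrative only: round each dimension down to a multiple of `factor`
# before passing a reference image to the pipeline. The factor of 16 is
# an assumption (8x VAE downsampling times 2x2 patching), not a value
# read from the DeepGen code.
def snap_to_multiple(image: Image.Image, factor: int = 16) -> Image.Image:
    w, h = image.size
    w2 = max(factor, (w // factor) * factor)
    h2 = max(factor, (h // factor) * factor)
    if (w2, h2) == (w, h):
        return image
    return image.resize((w2, h2), Image.LANCZOS)
```

Check the pipeline script itself for its actual preprocessing; this is only a defensive default when preparing inputs yourself.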

📋 Parameters

| Parameter | Default | Description |
|---|---|---|
| `prompt` | required | Text prompt for generation or editing |
| `image` | `None` | Input image for editing; if `None`, performs text-to-image generation |
| `height` | 512 | Output image height in pixels |
| `width` | 512 | Output image width in pixels |
| `num_inference_steps` | 50 | Number of denoising steps |
| `guidance_scale` | 4.0 | Classifier-free guidance scale |
| `seed` | `None` | Random seed for reproducibility |
| `negative_prompt` | `""` | Negative prompt for classifier-free guidance |
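The guidance_scale parameter follows standard classifier-free guidance: at each denoising step the pipeline blends an unconditional and a conditional model prediction. An illustrative formula, not DeepGen's code:

```python
# Illustrative classifier-free guidance combination. With scale 1.0 the
# unconditional term cancels and only the conditional prediction remains;
# larger scales push the sample further toward the prompt (in practice
# this is applied elementwise to the model's noise/velocity predictions).
def apply_cfg(uncond: float, cond: float, scale: float) -> float:
    return uncond + scale * (cond - uncond)
```

This is why guidance_scale = 4.0 gives stronger prompt adherence than 1.0, at the cost of some diversity.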

💾 Memory Requirements

| Mode | VRAM |
|---|---|
| Full GPU | ~20 GB |
| CPU offload (`pipe.enable_model_cpu_offload()`) | ~14 GB |

πŸ“ Directory Structure

DeepGen-1.0-diffusers/
β”œβ”€β”€ transformer/          # SD3 DiT weights (safetensors)
β”œβ”€β”€ vae/                  # AutoencoderKL weights
β”œβ”€β”€ connector/            # SCB Connector weights + config
β”œβ”€β”€ scheduler/            # FlowMatchEulerDiscreteScheduler config
β”œβ”€β”€ tokenizer/            # Qwen2.5-VL tokenizer
β”œβ”€β”€ prompt_template.json  # Prompt formatting template
β”œβ”€β”€ model_index.json      # Model metadata
└── deepgen_pipeline.py   # Self-contained pipeline script
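The scheduler directory configures FlowMatchEulerDiscreteScheduler. At its core, each flow-matching denoising step is a plain Euler update along the model's predicted velocity; a conceptual sketch, not the diffusers implementation:

```python
# Conceptual Euler step for flow matching: the model predicts a velocity
# at the current noise level sigma, and the sample moves linearly to the
# next (lower) level. This mirrors the core update of a flow-matching
# Euler scheduler but is not the diffusers code itself.
def euler_flow_step(sample: float, velocity: float,
                    sigma: float, sigma_next: float) -> float:
    return sample + (sigma_next - sigma) * velocity
```

Running `num_inference_steps` of these updates over a decreasing sigma schedule carries the sample from noise to the final latent.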

Note: The VLM (Qwen2.5-VL-3B-Instruct) is loaded separately from Qwen/Qwen2.5-VL-3B-Instruct. You can override the VLM path using the vlm_model_path parameter in from_pretrained().

🧠 Method

Our core observation is that a lightweight model, when empowered by synergistic architecture design and data-centric training strategies, can achieve comprehensive capabilities competitive with or even surpassing much larger counterparts. To overcome the limitations of lightweight models in semantic understanding and fine-grained control, we introduce Stacked Channel Bridging (SCB), a deep alignment framework that extracts hierarchical features from multiple VLM layers and fuses them with learnable "think tokens" to provide the generative backbone with structured, reasoning-rich guidance.
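Based only on the description above, SCB can be pictured roughly as follows; every layer choice and dimension in this sketch is a placeholder, not DeepGen's actual configuration:

```python
import torch
import torch.nn as nn

# Illustrative sketch of Stacked Channel Bridging (SCB): hidden states
# tapped from several VLM layers are stacked along the channel dimension,
# projected to the DiT width, prepended with learnable "think tokens",
# and refined by a small Transformer to produce the conditioning
# sequence. All sizes here are placeholders.
class SCBConnector(nn.Module):
    def __init__(self, vlm_dim=2048, num_layers_tapped=4, dit_dim=1536,
                 num_think_tokens=64, depth=6, heads=12):
        super().__init__()
        self.proj = nn.Linear(vlm_dim * num_layers_tapped, dit_dim)
        self.think_tokens = nn.Parameter(
            torch.randn(1, num_think_tokens, dit_dim) * 0.02)
        layer = nn.TransformerEncoderLayer(
            d_model=dit_dim, nhead=heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, hidden_states):
        # hidden_states: list of [batch, seq, vlm_dim] tensors,
        # one per tapped VLM layer
        stacked = torch.cat(hidden_states, dim=-1)   # channel-wise stacking
        x = self.proj(stacked)                       # align to DiT width
        think = self.think_tokens.expand(x.size(0), -1, -1)
        x = torch.cat([think, x], dim=1)             # prepend think tokens
        return self.blocks(x)                        # conditioning sequence
```

The point of the sketch is the data flow (multi-layer taps, channel stacking, learnable tokens), which is what distinguishes SCB from a single-layer MLP connector.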

| Component | Parameters | Description |
|---|---|---|
| VLM (Qwen2.5-VL-3B) | 3B | Vision-language model for understanding prompts and reference images |
| Connector (SCB) | ~0.8B | 6-layer Transformer bridging VLM hidden states to DiT conditioning |
| DiT (SD3.5M Kontext) | 2B | Diffusion Transformer for image generation |
| VAE | ~80M | Image encoder/decoder |

📊 Benchmarks

1. General Image Generation

| Model | Params | GenEval ↑ | DPG-Bench ↑ | UniGenBench ↑ |
|---|---|---|---|---|
| OmniGen2 | 3B + 4B | 0.80 | 83.57 | 63.09 |
| BAGEL | 14B | 0.82 | 85.10 | 61.53 |
| X-Omni | 7B + 12B | 0.83 | 87.65 🥉 | 53.77 |
| Lumina-DiMOO | 8B | 0.88 🥇 | 86.04 | 71.12 |
| Hunyuan-Image-3.0 | 80B | 0.72 | 86.10 | — |
| Qwen-Image | 7B + 20B | 0.87 🥈 | 88.32 🥇 | 78.81 🥇 |
| LongCat-Image | 7B + 6B | 0.87 🥈 | 86.80 | — |
| Z-Image-Turbo | 4B + 6B | 0.84 | 85.15 | 71.40 |
| GLM-Image | 9B + 7B | — | 84.78 | — |
| DeepGen 1.0 (SFT) | 3B + 2B | 0.86 🥉 | 87.05 | 74.18 🥉 |
| DeepGen 1.0 (RL) | 3B + 2B | 0.87 🥈 | 87.90 🥈 | 75.74 🥈 |

2. General Image Editing

| Model | Params | GEdit-EN ↑ | ImgEdit ↑ |
|---|---|---|---|
| BAGEL | 14B | 6.52 | 3.20 |
| Qwen-Image-Edit [2509] | 7B + 20B | 7.54 🥈 | 4.35 🥈 |
| LongCat-Image-Edit | 7B + 6B | 7.60 🥇 | 4.50 🥇 |
| Mammoth2 | 8B + 3B + 2B | 6.60 | 4.06 |
| DeepGen 1.0 (SFT) | 3B + 2B | 7.12 | 4.09 |
| DeepGen 1.0 (RL) | 3B + 2B | 7.17 🥉 | 4.14 🥉 |

3. Reasoning Image Generation

| Model | Params | WISE ↑ | T2I-CoREBench ↑ |
|---|---|---|---|
| OmniGen2 | 3B + 4B | 0.47 | 36.1 |
| BAGEL | 14B | 0.70 🥉 | 41.1 |
| Hunyuan-Image-3.0 | 80B | 0.57 | 46.0 |
| Qwen-Image | 7B + 20B | 0.62 | 46.3 🥉 |
| LongCat-Image | 7B + 6B | 0.65 | 52.2 🥇 |
| Z-Image-Turbo | 4B + 6B | — | 43.7 |
| DeepGen 1.0 (SFT) | 3B + 2B | 0.72 🥈 | 45.7 |
| DeepGen 1.0 (RL) | 3B + 2B | 0.73 🥇 | 46.5 🥈 |

4. Reasoning Image Editing

| Model | Params | RISE ↑ | UniREditBench ↑ |
|---|---|---|---|
| OmniGen2 | 3B + 4B | — | 43.4 |
| BAGEL | 14B | 11.9 🥈 | 51.0 |
| Qwen-Image-Edit [2509] | 7B + 20B | 8.9 | 56.5 🥉 |
| DeepGen 1.0 (SFT) | 3B + 2B | 13.3 🥇 | 77.5 🥇 |
| DeepGen 1.0 (RL) | 3B + 2B | 10.8 🥉 | 75.7 🥈 |

⭐ Citation

```bibtex
@article{wang2026deepgen,
  title={DeepGen 1.0: A Lightweight Unified Multimodal Model for Advancing Image Generation and Editing},
  author={Wang, Dianyi and Li, Ruihang and Han, Feng and Ma, Chaofan and Song, Wei and Wang, Siyuan and Wang, Yibin and Xin, Yi and Liu, Hongjian and Zhang, Zhixiong and others},
  journal={arXiv preprint arXiv:2602.12205},
  year={2026}
}
```

License

Apache 2.0
