---
license: apache-2.0
language:
- en
library_name: transformers
pipeline_tag: image-text-to-text
base_model: Qwen/Qwen3-VL-8B-Instruct
tags:
- agent
- image-generation
- tool-use
- visual-reasoning
- self-distillation
- grpo
- reinforcement-learning
- multimodal
- qwen3-vl
datasets:
- MeiGen-AI/GenEvolve-Data-Bench
---

<div align="center">
<img src="assets/logo_genevolve.png" alt="GenEvolve" width="160">
<h1>GenEvolve</h1>
<p><strong><em>Self-Evolving Image Generation Agents via Tool-Orchestrated Visual Experience Distillation</em></strong></p>
<p>
<a href="https://arxiv.org/abs/2605.21605">
<img alt="Paper" src="https://img.shields.io/badge/Paper-arXiv:2605.21605-b31b1b"></a>
<a href="https://ephemeral182.github.io/GenEvolve/">
<img alt="Project Page" src="https://img.shields.io/badge/Project-Page-1f6feb"></a>
<a href="https://github.com/MeiGen-AI/GenEvolve">
<img alt="Code" src="https://img.shields.io/badge/GitHub-Code-181717"></a>
<a href="https://huggingface.co/datasets/MeiGen-AI/GenEvolve-Data-Bench">
<img alt="Dataset" src="https://img.shields.io/badge/Dataset-GenEvolve--Data-FFD21E"></a>
</p>
</div>

This repository hosts the **GenEvolve agent policy**: a Qwen3-VL-8B-Instruct backbone fine-tuned and self-evolved into a tool-orchestrated image-generation agent. Given a user request, the agent issues web/image searches, retrieves visual references, activates internal generation knowledge, and emits an executable **prompt-reference program** `z = (gen_prompt, reference_images)` that drives any reference-conditioned downstream generator (Qwen-Image-Edit, Nano Banana Pro, ...).

<div align="center">
<img src="assets/teaser.jpg" alt="GenEvolve teaser" width="100%">
<p><em>The same trained agent policy paired with two reference-conditioned generators ▶<br>
<strong>Qwen-Image-Edit (open)</strong> · <strong>Nano Banana Pro (strong)</strong></em></p>
</div>

---

## Highlights

- **Tool-orchestrated trajectories.** The agent calls `search`, `image_search`, and `query_knowledge` (8 callable generation skills) before producing a final program `z = (gen_prompt, reference_images)`.
- **Self-evolution with Visual Experience Distillation.** Best-vs-worst trajectory pairs are distilled at the token level into the deployed student. **No runtime memory is needed at inference.**
- **Generator-transferable.** The same trained policy works with both an open-source generator (Qwen-Image-Edit-2511) and a strong proprietary generator (Nano Banana Pro).
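
As a rough illustration of the second highlight, the sketch below picks a best-vs-worst pair from a group of scored rollouts for one request. It is not the GenEvolve training code: the `Trajectory` container, the scalar `reward` field, and the `min_gap` filter are hypothetical names introduced here, and the token-level distillation step itself is not shown.

```python
from dataclasses import dataclass


@dataclass
class Trajectory:
    """One sampled rollout for a single user request (hypothetical container)."""
    text: str      # serialized tool calls plus the final program
    reward: float  # scalar score of the rendered image (assumed to be given)


def best_worst_pair(group, min_gap=0.0):
    """Return (best, worst) for one request, or None if the group is uninformative."""
    if len(group) < 2:
        return None
    best = max(group, key=lambda t: t.reward)
    worst = min(group, key=lambda t: t.reward)
    # Skip groups whose rewards are too close to give a useful contrast.
    if best.reward - worst.reward <= min_gap:
        return None
    return best, worst


group = [
    Trajectory("rollout A", 0.41),
    Trajectory("rollout B", 0.78),
    Trajectory("rollout C", 0.12),
]
pair = best_worst_pair(group)
print(pair[0].text, "vs", pair[1].text)  # rollout B vs rollout C
```

Groups in which every rollout scores the same return `None`, since they offer no contrast to learn from.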

## Headline Results

### GenEvolve-Bench (KScore, held-out split)

| Method | Generator | KScore | Knowledge-Anch. | Quality-Anch. |
|---|---|---:|---:|---:|
| Qwen-Image (raw) | Qwen-Image | 0.2987 | 0.2384 | 0.3768 |
| Nano Banana Pro (raw) | Nano Banana Pro | 0.5298 | 0.5160 | 0.5477 |
| Gen-Searcher 8B | Qwen-Image-Edit-2511 | 0.3493 | 0.3293 | 0.3745 |
| Gen-Searcher 8B | Nano Banana Pro | 0.5481 | 0.5472 | 0.5492 |
| **GenEvolve (Ours)** | Qwen-Image-Edit-2511 | **0.3663** | **0.3410** | **0.3990** |
| **GenEvolve (Ours)** | Nano Banana Pro | **0.5739** | **0.5669** | **0.5830** |

### WISE Benchmark (WiScore, six knowledge categories)

| Model | Cultural | Time | Space | Biology | Physics | Chemistry | **Overall** |
|---|---:|---:|---:|---:|---:|---:|---:|
| GPT-4o | 0.81 | 0.71 | **0.89** | **0.83** | 0.79 | 0.74 | 0.80 |
| Gen-Searcher-8B + Qwen-Image | 0.80 | 0.71 | 0.82 | 0.76 | 0.74 | 0.75 | 0.77 |
| Mind-Brush | 0.83 | 0.69 | 0.84 | 0.71 | **0.85** | 0.68 | 0.78 |
| **GenEvolve + Qwen-Image-Edit** | **0.84** | 0.74 | 0.87 | **0.83** | 0.81 | **0.83** | **0.82** |

---

## Method Overview

<p align="center"><img src="assets/overview.png" alt="GenEvolve method overview" width="92%"></p>

For a user request, the agent samples a multi-turn trajectory of tool calls before emitting the final prompt-reference program. The downstream generator then renders the image.
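
The multi-turn loop can be sketched in a few lines of Python. This is a minimal illustration, not the runtime from the GitHub repo: it assumes that each `<tool_call>` block carries a JSON object with `name` and `arguments` keys and that `<answer>` carries the final JSON program, and it replaces the served policy and the search tools with scripted stand-ins. `run_agent`, `policy`, and `tools` are hypothetical names.

```python
import json
import re

TOOL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def run_agent(policy, tools, request, max_turns=8):
    """Alternate policy turns and tool calls until an <answer> block appears."""
    history = [{"role": "user", "content": request}]
    for _ in range(max_turns):
        reply = policy(history)
        history.append({"role": "assistant", "content": reply})
        answer = ANSWER_RE.search(reply)
        if answer:
            return json.loads(answer.group(1))  # the prompt-reference program
        call = TOOL_RE.search(reply)
        if call is None:
            break  # neither a tool call nor an answer: give up
        spec = json.loads(call.group(1))  # assumed {"name": ..., "arguments": {...}}
        result = tools[spec["name"]](**spec["arguments"])
        history.append({"role": "tool", "content": json.dumps(result)})
    raise RuntimeError("no <answer> produced within max_turns")


# Scripted stand-ins for the served policy and the image search tool.
scripted = iter([
    '<think>Need a reference photo.</think>'
    '<tool_call>{"name": "image_search", "arguments": {"query": "Eiffel Tower"}}</tool_call>',
    '<answer>{"gen_prompt": "Match the tower in the first reference image.", '
    '"reference_images": [{"img_id": "IMG_001", "note": "tower shape"}]}</answer>',
])
policy = lambda history: next(scripted)
tools = {"image_search": lambda query: [{"img_id": "IMG_001", "title": query}]}

program = run_agent(policy, tools, "A poster of the Eiffel Tower")
print(program["reference_images"][0]["img_id"])  # IMG_001
```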

---

## Visual Demos

<p align="center"><img src="assets/visual_comparison.png" alt="Qualitative comparison" width="100%"></p>
<p align="center"><sub>Qualitative comparison on representative cases. <span style="color:#D97706">Orange</span> marks external/uncommon knowledge requirements; <span style="color:#2563EB">blue</span> marks internal generation-knowledge requirements.</sub></p>

### Gallery: paired with Nano Banana Pro

<p align="center"><img src="assets/gallery_nano.jpg" alt="GenEvolve + Nano Banana Pro gallery" width="100%"></p>
<p align="center"><sub>The same agent policy with Nano Banana Pro as the downstream renderer. Examples cover spatial layout, text rendering, quantity counting, attribute binding, anatomy/pose, creative transfer, material physics, and aesthetic drawing.</sub></p>

### Gallery: paired with Qwen-Image-Edit (open)

<p align="center"><img src="assets/gallery_qwen.jpg" alt="GenEvolve + Qwen-Image-Edit gallery" width="100%"></p>
<p align="center"><sub>Same trained policy paired with the open-source Qwen-Image-Edit-2511 renderer; consistent quality across both generators reflects generator-transferable orchestration.</sub></p>

---

## Quick Start

The deployed checkpoint is the **student policy**: it consumes a user prompt and returns a JSON `gen_prompt + reference_images` program through a `<think>/<tool_call>/<answer>` loop. The end-to-end runtime (vLLM serving + agent loop + tools + Qwen/Nano renderers) lives in the [GitHub repo](https://github.com/MeiGen-AI/GenEvolve); the snippets below mirror its installation and usage.

### 1. Install the main GenEvolve runtime

```bash
git clone https://github.com/MeiGen-AI/GenEvolve.git
cd GenEvolve
# Create and activate an isolated environment.
conda create -n genevolve python=3.11 -y && conda activate genevolve
pip install -U pip setuptools wheel packaging psutil ninja
# Install PyTorch wheels built for CUDA 12.8.
pip install torch==2.8.0 torchvision==0.23.0 --index-url https://download.pytorch.org/whl/cu128
# Install the remaining dependencies, then the package itself.
pip install --no-build-isolation -r requirements.txt
pip install -e .
```

Qwen-Image-Edit rendering runs as a **separate FastAPI service** (kept out of the vLLM environment to avoid CUDA/diffusers conflicts). Follow the GitHub README to set up that service before using `--backend qwen-image-edit-service`.

### 2. Serve the agent policy

```bash
# Single GPU / single replica.
MODEL_PATH=MeiGen-AI/GenEvolve PORT=8000 TP=1 DP=1 bash scripts/serve_vllm.sh

# Higher throughput on one 8-GPU node (8 replicas, 1 GPU each).
MODEL_PATH=MeiGen-AI/GenEvolve PORT=8000 TP=1 DP=8 bash scripts/serve_vllm.sh
```

`TP` shards one model replica across multiple GPUs; `DP` launches multiple replicas; total GPU usage is `TP × DP`.
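
The GPU arithmetic can be checked before launching. The helper below is a small sketch with hypothetical function names (`gpus_required`, `fits`); it only multiplies the two settings and compares the product with the number of GPUs on the node.

```python
def gpus_required(tp: int, dp: int) -> int:
    """Total GPUs used when each of `dp` replicas is sharded across `tp` GPUs."""
    if tp < 1 or dp < 1:
        raise ValueError("TP and DP must be positive integers")
    return tp * dp


def fits(tp: int, dp: int, available: int) -> bool:
    """True if the TP/DP layout fits on a node with `available` GPUs."""
    return gpus_required(tp, dp) <= available


for tp, dp in [(1, 1), (1, 8), (2, 4), (4, 4)]:
    print(f"TP={tp} DP={dp} -> {gpus_required(tp, dp)} GPUs, fits on 8: {fits(tp, dp, 8)}")
```

On an 8-GPU node the first three layouts fit, while `TP=4 DP=4` would need 16 GPUs.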

### 3. End-to-end example

```bash
export SERPER_API_KEY="<your_key>"   # required for search / image_search
export GOOGLE_API_KEY="<your_key>"   # or GEMINI_API_KEY; only for --backend nano-banana-pro

# Nano Banana Pro renderer
python examples/quickstart.py \
  --backend nano-banana-pro \
  --base-url http://localhost:8000/v1 \
  --model GenEvolve \
  --prompt "A 1990s travel-magazine cover of two backpackers in front of the Eiffel Tower at golden hour, the title \"PARIS\" in bold serif." \
  --output paris.png

# Qwen-Image-Edit renderer (point at your Qwen-Image-Edit FastAPI service)
python examples/quickstart.py \
  --backend qwen-image-edit-service \
  --service-url http://your-qwen-service:8001 \
  --base-url http://localhost:8000/v1 \
  --model GenEvolve \
  --output paris_qwen.png
```

The agent's final `<answer>` is a JSON object:

```json
{
  "gen_prompt": "...natural-language prompt that refers to images by 'the first reference image', ...",
  "reference_images": [
    {"img_id": "IMG_001", "note": "what to copy from this image"}
  ]
}
```

`gen_prompt` MUST refer to selected images using ordinal phrases (`"the first reference image"`), never raw `IMG_###` ids or URLs. Pass `(gen_prompt, [r["local_path"] for r in reference_images])` to your favourite reference-conditioned generator (Qwen-Image-Edit, Nano Banana Pro, ...) to obtain the final image.
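
One way to enforce this rule before calling a generator is sketched below. It assumes the runtime keeps a mapping from each `img_id` to the file it downloaded, since the JSON shown above carries no `local_path` field of its own; `resolve_program` and `local_paths` are hypothetical names, not part of the GenEvolve API. The function returns the `(gen_prompt, paths)` pair in reference order, ready to hand to a renderer.

```python
import json
import re

RAW_ID = re.compile(r"IMG_\d+")
URL = re.compile(r"https?://", re.IGNORECASE)


def resolve_program(answer_json, local_paths):
    """Validate the agent's answer and map img_ids to downloaded files.

    `local_paths` is an assumed {img_id: file path} mapping kept by the caller.
    """
    program = json.loads(answer_json)
    gen_prompt = program["gen_prompt"]
    # The prompt must name references by ordinal phrase only.
    if RAW_ID.search(gen_prompt) or URL.search(gen_prompt):
        raise ValueError("gen_prompt must use ordinal phrases, not raw ids or URLs")
    paths = []
    for ref in program.get("reference_images", []):
        img_id = ref["img_id"]
        if img_id not in local_paths:
            raise KeyError(f"no downloaded file for {img_id}")
        paths.append(local_paths[img_id])  # order defines first, second, ...
    return gen_prompt, paths


answer = json.dumps({
    "gen_prompt": "Place the tower from the first reference image at golden hour.",
    "reference_images": [{"img_id": "IMG_001", "note": "tower silhouette"}],
})
prompt, paths = resolve_program(answer, {"IMG_001": "refs/IMG_001.jpg"})
print(paths)  # ['refs/IMG_001.jpg']
```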

---

## Related Artifacts

| Artifact | Link |
|---|---|
| Project page | https://ephemeral182.github.io/GenEvolve/ |
| Paper | https://arxiv.org/abs/2605.21605 |
| Code | https://github.com/MeiGen-AI/GenEvolve |
| Training data + benchmark | [MeiGen-AI/GenEvolve-Data-Bench](https://huggingface.co/datasets/MeiGen-AI/GenEvolve-Data-Bench) |
| Base model | [Qwen/Qwen3-VL-8B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-8B-Instruct) |

---

## Intended Use, Limits, Bias

- **Intended use.** Research on tool-using image-generation agents, agentic prompt-program synthesis, and self-distillation from generated outcomes.
- **Search dependency.** The agent issues live web/image queries through user-provided tool wrappers. Quality of grounded facts depends on the search backend you plug in.
- **Bias.** Tool outputs and reference images come from public web search, which carries demographic, cultural, and geographic biases that may be reflected in agent outputs.

---

## Citation

```bibtex
@misc{chen2026genevolveselfevolvingimagegeneration,
  title={GenEvolve: Self-Evolving Image Generation Agents via Tool-Orchestrated Visual Experience Distillation},
  author={Sixiang Chen and Zhaohu Xing and Tian Ye and Xinyu Geng and Yunlong Lin and Jianyu Lai and Xuanhua He and Fuxiang Zhai and Jialin Gao and Lei Zhu},
  year={2026},
  eprint={2605.21605},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2605.21605},
}
```