S1-Omni-Image: A Unified Multimodal Model for Scientific Image Understanding and Generation
English | 简体中文
📖 Introduction
S1-Omni-Image-Preview is a unified end-to-end reasoning model for multimodal understanding and generation developed by the ScienceOne team at the Chinese Academy of Sciences. Through a unified "Think before generate" paradigm, the model is capable of completing the following four types of tasks:
- Text Generation (T2T): Generates text responses based on text input
- Image-Text Understanding (TI2T): Understands images and generates answers based on instructions
- Image Generation (T2I): Generates images based on text instruction requirements
- Image Editing (TI2I): Edits images based on text instruction requirements
The open-sourced S1-Omni-Image-Preview model has a total parameter count of ~30B and natively supports input and output of both text and image modalities. It represents the team's technical exploration in unified model architecture for scientific multimodal content understanding and scientific image generation tasks.
For image generation, and scientific figure generation in particular, the model has been optimized on large-scale figure data drawn from academic papers across six major disciplines: mathematics, physics, chemistry, astronomy, geography, and biology. It covers common scientific figure types such as flowcharts, architecture diagrams, and schematic diagrams, making it the first open-source unified multimodal understanding and generation model with enhanced scientific figure generation capabilities.
📥 Model Weights Download
Model weights are now open-sourced on Hugging Face and ModelScope platforms. You are welcome to download and use them!
| Platform | Model URL |
|---|---|
| Hugging Face | S1-Omni-Image-Preview |
| ModelScope | S1-Omni-Image-Preview |
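If you prefer to fetch the weights programmatically, a minimal sketch using `huggingface_hub` is shown below. The repository id `ScienceOne-AI/S1-Omni-Image-Preview` is an assumption for illustration; please check the model page above for the exact id.

```python
# Minimal sketch: download the weights with huggingface_hub.
# The repo_id below is an assumption -- verify it on the Hugging Face model page.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="ScienceOne-AI/S1-Omni-Image-Preview",  # assumed id
    local_dir="./S1-Omni-Image-Preview",
)
print(f"Weights downloaded to: {local_dir}")
```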
🧠 Model Architecture
The overall architecture of the S1-Omni-Image-Preview model is shown in the figure; the model natively supports input and output of both text and image modalities.
Specifically, Text Embedding and Image Embedding encode text and visual tokens into vector representations respectively, and the Image Encoder (VAE) encodes input images into Image Latents. The model first generates a thinking process autoregressively in the form of <think> {Chain of Thought} </think>, and then generates answers for different tasks following user instructions.
For text generation and image understanding tasks, the answer is produced as plain text; for image generation tasks, the answer takes the form <image_gen> {Detailed prompt for image generation} </image_gen>; for image editing tasks, it takes the form <image_edit> {Detailed prompt for image editing} </image_edit>. The hidden states associated with this generated text are simultaneously fed as conditional inputs to the Diffusion Transformer for image generation. This strategy leverages the reasoning capability of the multimodal large language model to enrich fine-grained textual semantics during image generation and editing, thereby improving image quality.
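As a purely illustrative sketch of this output format (not part of the project's code), the tagged response could be split into its reasoning and answer parts with a few regular expressions; the tag names follow the description above.

```python
import re

# Illustrative only: split a raw model response into reasoning and answer,
# based on the <think> / <image_gen> / <image_edit> format described above.
def parse_response(raw: str) -> dict:
    result = {"think": None, "task": "text", "answer": raw}
    think = re.search(r"<think>(.*?)</think>", raw, re.DOTALL)
    if think:
        result["think"] = think.group(1).strip()
    for tag, task in (("image_gen", "image_generation"), ("image_edit", "image_editing")):
        m = re.search(rf"<{tag}>(.*?)</{tag}>", raw, re.DOTALL)
        if m:
            result["task"] = task
            result["answer"] = m.group(1).strip()  # detailed prompt for the diffusion module
            return result
    # No image tag: the answer is the text that follows the reasoning block
    result["answer"] = re.sub(r"<think>.*?</think>", "", raw, flags=re.DOTALL).strip()
    return result

example = "<think>The user wants a diagram.</think><image_gen>A labeled DNA double helix, flat vector style</image_gen>"
print(parse_response(example))
```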
⚙️ Training Strategy
The model adopts a three-stage training strategy, as shown in the figure.
- Stage 1: Reasoning Paradigm Training. Initialize the weights from the multimodal large language model Qwen3-VL-8B-Instruct and fine-tune on training data covering the four task types, each annotated with a reasoning process, so that the model learns to think before generating and to emit task-aware special tokens.
- Stage 2: Diffusion Module Training. Freeze the multimodal large language model, initialize the Diffusion module from the MMDiT module of the Qwen-Image-Edit model, and jointly train on multi-task data, including self-constructed academic figure generation data, to strengthen the model's image generation and editing capabilities.
- Stage 3: Alignment Module Training. Freeze both the multimodal large language model and the MMDiT module, then add and train a projection layer between the two modules to align the hidden states of the generated text with the input dimensionality of the Diffusion module (see the sketch below).
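For intuition, here is a minimal PyTorch sketch of the Stage 3 alignment step. The hidden sizes, module structure, and names below are assumptions for illustration only, not the project's actual implementation.

```python
import torch
import torch.nn as nn

# Illustrative sketch of Stage 3: only a projection layer between the frozen MLLM
# and the frozen MMDiT is trained. The dimensions below are placeholders.
MLLM_HIDDEN = 4096      # assumed hidden size of the multimodal LLM
DIFFUSION_COND = 3584   # assumed conditioning dimension expected by the MMDiT

class AlignmentProjector(nn.Module):
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(in_dim, out_dim),
            nn.GELU(),
            nn.Linear(out_dim, out_dim),
        )

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, MLLM_HIDDEN), taken from the LLM's answer tokens
        return self.proj(hidden_states)

projector = AlignmentProjector(MLLM_HIDDEN, DIFFUSION_COND)
# Only the projector's parameters are updated; the LLM and MMDiT stay frozen.
optimizer = torch.optim.AdamW(projector.parameters(), lr=1e-4)

dummy_hidden = torch.randn(2, 77, MLLM_HIDDEN)
cond = projector(dummy_hidden)  # (2, 77, DIFFUSION_COND), fed to the diffusion transformer
print(cond.shape)
```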
🎨 Case Showcase
As shown in the figure, these are two generation samples from the S1-Omni-Image-Preview model, including a Chinese example and an English example. The figure demonstrates the complete workflow from user input of natural language descriptions, through the model's reasoning process, to the final generation of professional scientific research figures.
As shown in the figure, these are samples of the S1-Omni-Image-Preview model generating different types of academic-style figures across multiple disciplines including mathematics, physics, chemistry, astronomy, geography, and biology.
🚀 Quick Start
1. Environment Configuration
System Requirements:
- Python: 3.10+
- CUDA: 12.6+ (Recommended)
- GPU: NVIDIA GPU 80GB+ VRAM (Recommended A100/H100)
2. Installation
Download the model weights to a local directory, then run the following commands to clone the project code, create a virtual environment, and install the dependencies.
```bash
# Clone project repository
git clone https://github.com/ScienceOne-AI/S1-Omni-Image.git
cd S1-Omni-Image

# Create virtual environment (recommended)
conda create -n s1-omni-image-env python=3.10
conda activate s1-omni-image-env

# Install dependencies
pip install -r requirements.txt
```
For the complete list of dependencies, please refer to requirements.txt.
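Before launching the service, you can optionally verify that PyTorch sees a GPU with enough memory. This quick check is not part of the project, just a convenience snippet.

```python
# Optional sanity check (not part of the project): confirm CUDA is visible and report VRAM.
import torch

if not torch.cuda.is_available():
    raise SystemExit("No CUDA device detected -- check your driver / CUDA 12.6+ installation.")

for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    vram_gb = props.total_memory / 1024**3
    print(f"GPU {i}: {props.name}, {vram_gb:.1f} GB VRAM")
    if vram_gb < 80:
        print("  Warning: the README recommends 80 GB+ VRAM (e.g. A100/H100).")
```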
3. Launch Service
This project provides an OpenAI-compatible API with matching request and response formats, as well as a simple interactive web page.
Launch Command
```bash
# Single GPU
CUDA_VISIBLE_DEVICES=0 python server.py --model /path/to/S1-Omni-Image-Preview --port 8000

# Multiple GPUs
CUDA_VISIBLE_DEVICES=0,1 python server.py --model /path/to/S1-Omni-Image-Preview --port 8000
```
Service Configuration Parameters:
- `--config`: Model configuration file path
- `--port`: Service port (default: 8000)
- `--host`: Service address (default: 0.0.0.0)
Access Page
After the service starts, open the following address in a browser:
http://localhost:8000
API Call Example
Image Generation Task:
```python
import requests
import base64

url = "http://localhost:8000/v1/chat/completions"
headers = {"Content-Type": "application/json"}

data = {
    "model": "s1-omni-image-preview",
    "messages": [
        {
            "role": "user",
            "content": "Generate a scientific illustration showing the DNA double helix structure"
        }
    ],
    "height": 1024,
    "width": 1024,
    "num_inference_steps": 50
}

response = requests.post(url, headers=headers, json=data)
result = response.json()

content = result["choices"][0]["message"]["content"]
if isinstance(content, str):
    print("Text response:", content)
elif isinstance(content, list):
    for part in content:
        if part["type"] == "text":
            print("Text response:", part["text"])
        elif part["type"] == "image_url":
            # Extract base64 data from "data:image/png;base64,<actual_base64>"
            image_url = part["image_url"]["url"]
            base64_str = image_url.split(",", 1)[1]
            image_data = base64.b64decode(base64_str)
            output_path = "output.png"
            with open(output_path, "wb") as f:
                f.write(image_data)
            print(f"Image saved: {output_path}")
```
Image Understanding Task:
```python
import requests
import base64

url = "http://localhost:8000/v1/chat/completions"
headers = {"Content-Type": "application/json"}

# Encode the input image as base64
with open("input.png", "rb") as f:
    image_base64 = base64.b64encode(f.read()).decode("utf-8")

data = {
    "model": "s1-omni-image-preview",
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Please describe the content in this image"},
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_base64}"}}
            ]
        }
    ]
}

response = requests.post(url, headers=headers, json=data)
print(response.json()["choices"][0]["message"]["content"])
```
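An image editing (TI2I) call is not shown in this README. Assuming it follows the same OpenAI-compatible message format as the examples above (an edit instruction plus the source image, with the same generation-size parameters), a sketch might look like the following; the actual request schema may differ, so check docs/API_GUIDE.md.

```python
# Hypothetical image-editing (TI2I) request: field names mirror the examples above
# and are assumptions; the real API schema is documented in docs/API_GUIDE.md.
import requests
import base64

url = "http://localhost:8000/v1/chat/completions"
headers = {"Content-Type": "application/json"}

with open("input.png", "rb") as f:
    image_base64 = base64.b64encode(f.read()).decode("utf-8")

data = {
    "model": "s1-omni-image-preview",
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Change the diagram's color scheme to grayscale and enlarge the axis labels"},
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_base64}"}}
            ]
        }
    ],
    "height": 1024,
    "width": 1024,
    "num_inference_steps": 50
}

response = requests.post(url, headers=headers, json=data)
print(response.json()["choices"][0]["message"]["content"])
```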
For more API usage examples, please refer to docs/API_GUIDE.md.
⚠️ Limitations
Although S1-Omni-Image-Preview has made meaningful progress in unified multimodal understanding and generation, the current version still has the following limitations, which we will continue to improve in subsequent versions:
- Safety Alignment: The model may generate inaccurate, biased, or inappropriate text and image content. The current version has not undergone comprehensive safety alignment training. It is recommended to add additional content moderation mechanisms when using it in open scenarios.
- General Capabilities: The model has been specifically optimized for scientific figure generation (flowcharts, architecture diagrams, schematic diagrams, etc.), so its performance on general natural-image generation and its support for multi-turn dialogue may not match that of general-purpose models. Its image editing capability also offers limited support for fine-grained local edits.
- Image Details: Text rendering clarity in high-resolution generated images and the alignment of fine-grained elements in complex charts still leave room for improvement; text in generated images may occasionally be blurred or incorrect.
We welcome community users to provide feedback and suggestions during use to help us continuously improve the model.
📄 License
This project is released under the Apache License 2.0 open source license.
🙏 Acknowledgements
The development of S1-Omni-Image-Preview would not have been possible without the following excellent open-source projects, to which we express our heartfelt thanks:
- Transformers: Developed by the Hugging Face team, it is a state-of-the-art library for natural language processing and multimodal models that provides the core model loading, inference, and service deployment infrastructure for this project.
- Diffusers: Developed by the Hugging Face team, it is a diffusion model library that provides key support for the implementation and inference of the Diffusion Transformer module in this project.
- Qwen3-VL: Developed by the Alibaba Qwen team, it is a multimodal vision-language model. This project uses Qwen3-VL-8B-Instruct as the initialization weights for the multimodal language model; its strong capabilities provide a solid foundation for training S1-Omni-Image-Preview's task-aware thinking process.
- Qwen-Image: Developed by the Alibaba Qwen team, it is an image generation model. This project drew on and benefited from its technical design during the development and training of the image generation and editing modules.
📚 Citation
If you use S1-Omni-Image-Preview in your research, please cite our work:
```bibtex
@software{s1-omni-image-2026,
  title        = {S1-Omni-Image: A Unified Multimodal Model for Scientific Image Understanding and Generation},
  author       = {ScienceOne Team},
  year         = {2026},
  organization = {Institute of Automation, Chinese Academy of Sciences},
  url          = {https://github.com/ScienceOne-AI/S1-Omni-Image}
}
```