Cosmos3-Super-Text2Image Quanto FP8 Transformer

This repository contains a transformer-only FP8 (float8) quantization of nvidia/Cosmos3-Super-Text2Image, made with Hugging Face Optimum Quanto.

This is a Quanto quantization, not an NVIDIA ModelOpt/NVFP quantization. When comparing results against the separate NVFP experiments, treat this repo explicitly as a different quantization backend.

Read NVIDIA's card, license, safety notes, and prompt-format guidance here: nvidia/Cosmos3-Super-Text2Image.

Only transformer/ is provided as a weight artifact. The VAE, scheduler, tokenizers, safety checker, and other components are loaded from the base model.

Assemble The Pipeline

import json
import torch
from diffusers import Cosmos3OmniPipeline, Cosmos3OmniTransformer
from diffusers.schedulers.scheduling_unipc_multistep import UniPCMultistepScheduler

transformer = Cosmos3OmniTransformer.from_pretrained(
    "WaveCut/Cosmos3-Super-Text2Image-Quanto-FP8-Transformer",
    subfolder="transformer",
    torch_dtype=torch.bfloat16,
)

pipe = Cosmos3OmniPipeline.from_pretrained(
    "nvidia/Cosmos3-Super-Text2Image",
    transformer=transformer,
    torch_dtype=torch.bfloat16,
    device_map="cuda",
    enable_safety_checker=True,
)
# Ensure the injected transformer and Cosmos intermediate tensors share CUDA.
pipe.to("cuda")
pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config, flow_shift=3.0)

# Use the JSON-caption format described by the original model card.
json_caption = {
    "subjects": [],
    "background_setting": "A concise scene description.",
    "comprehensive_t2i_caption": "A detailed natural-language caption.",
    "resolution": {"H": 1024, "W": 1024},
    "aspect_ratio": "1,1",
}

result = pipe(
    prompt=json.dumps(json_caption),
    negative_prompt="",
    num_frames=1,
    height=1024,
    width=1024,
    num_inference_steps=50,
    guidance_scale=4.0,
    generator=torch.Generator(device="cuda").manual_seed(1143),
)
result.video[0].save("cosmos3_fp8.png")
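The caption's "resolution" field and the height/width arguments passed to the pipeline carry the same numbers, so it can help to build both from one source. The following is a minimal sketch, not part of the upstream card: build_generation_inputs is a hypothetical helper name, and the aspect-ratio string is passed through unchanged because its exact format is defined by the original model card.

```python
import json


def build_generation_inputs(caption, background, height, width, aspect_ratio, subjects=None):
    """Hypothetical helper: return prompt/height/width kwargs that agree with each other."""
    if height <= 0 or width <= 0:
        raise ValueError("height and width must be positive integers")
    json_caption = {
        "subjects": list(subjects or []),
        "background_setting": background,
        "comprehensive_t2i_caption": caption,
        # The caption resolution is derived from the same values used for the call.
        "resolution": {"H": height, "W": width},
        # Passed through as-is; see the original model card for the expected format.
        "aspect_ratio": aspect_ratio,
    }
    return {"prompt": json.dumps(json_caption), "height": height, "width": width}


inputs = build_generation_inputs(
    caption="A detailed natural-language caption.",
    background="A concise scene description.",
    height=1024,
    width=1024,
    aspect_ratio="1,1",
)
print(inputs["prompt"])
```

The returned dictionary can be unpacked into the call shown above, for example pipe(**inputs, negative_prompt="", num_frames=1, ...), so the caption and the requested size cannot drift apart.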

Benchmarks

Measured on a single RunPod NVIDIA B200 instance with local container storage and cached model files, using PyTorch 2.9.1+cu130. All runs generate 1024x1024 images with 50 inference steps, guidance scale 4.0, flow_shift=3.0, and the system prompt enabled.
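For readers who want to reproduce average, minimum, and maximum timings of this kind, the sketch below shows one way to collect them. It is not the harness used for the numbers in this card; time_runs is a hypothetical helper, and the workload is a CPU stand-in so the example runs anywhere.

```python
import statistics
import time


def time_runs(fn, runs):
    """Call fn() `runs` times and return (mean, min, max) wall-clock seconds."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return statistics.mean(durations), min(durations), max(durations)


# Stand-in workload. For a real measurement, fn would wrap the pipe(...) call
# and then call torch.cuda.synchronize() so pending GPU work is included.
avg, fastest, slowest = time_runs(lambda: sum(range(100_000)), runs=3)
print(f"avg {avg:.4f}s, min {fastest:.4f}s, max {slowest:.4f}s")
```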

Transformer Component Load

This measures loading the transformer component and moving it to CUDA in isolation.

Variant	Load to CUDA	VRAM after load	Torch allocated	Torch reserved	Transformer safetensors
BF16 base transformer	23.80s	122,758 MiB	122,121 MiB	122,132 MiB	119.21 GiB
FP8 transformer	74.45s	65,756 MiB	62,356 MiB	65,036 MiB	60.35 GiB
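The safetensors sizes in the table allow a rough sanity check on the quantization. The arithmetic below assumes the BF16 checkpoint stores essentially every parameter as 2 bytes and ignores file headers; under that assumption the transformer has about 64 billion parameters and the FP8 artifact averages slightly more than 1 byte per parameter. The small excess would be consistent with stored scales or layers left unquantized, but that is a guess, not something the card states.

```python
GIB = 1024 ** 3

# Sizes copied from the "Transformer safetensors" column above.
bf16_bytes = 119.21 * GIB
fp8_bytes = 60.35 * GIB

# Assumption: BF16 weights take 2 bytes per parameter, headers ignored.
approx_params = bf16_bytes / 2
fp8_bytes_per_param = fp8_bytes / approx_params
size_reduction = 1 - fp8_bytes / bf16_bytes

print(f"approx. parameters: {approx_params / 1e9:.1f} billion")
print(f"FP8 bytes per parameter: {fp8_bytes_per_param:.4f}")
print(f"on-disk size reduction: {size_reduction:.1%}")
```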

Full Pipeline Generation

This measures end-to-end Diffusers pipeline loading and generation. The prompt set is ten handwritten JSON-caption prompts designed to stress Cyrillic text, reflections, multi-object composition, anatomy, and small details. In this run the FP8 pipeline uses far less VRAM than BF16 but generates more slowly.

Variant	Full pipeline load	VRAM after load	Torch allocated after load	Avg generation time	Min / max generation time	Peak sampled VRAM	Images
BF16 base pipeline	31.31s	125,134 MiB	124,386 MiB	16.05s	15.51s / 17.97s	141,104 MiB	10
FP8 transformer pipeline	28.06s	69,276 MiB	65,865 MiB	37.53s	36.43s / 40.00s	82,198 MiB	10
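The trade-off in this table is easier to see as ratios. The short calculation below uses only the figures reported above: on this setup the FP8 pipeline is roughly 2.3 times slower per image while its peak sampled VRAM is about 42% lower.

```python
# Figures copied from the full-pipeline table above (B200, ten images per variant).
bf16 = {"avg_gen_s": 16.05, "peak_vram_mib": 141_104}
fp8 = {"avg_gen_s": 37.53, "peak_vram_mib": 82_198}

slowdown = fp8["avg_gen_s"] / bf16["avg_gen_s"]
vram_saved = 1 - fp8["peak_vram_mib"] / bf16["peak_vram_mib"]
images_per_minute = {
    "bf16": 60 / bf16["avg_gen_s"],
    "fp8": 60 / fp8["avg_gen_s"],
}

print(f"FP8 generation time: {slowdown:.2f}x the BF16 time")
print(f"FP8 peak VRAM: {vram_saved:.1%} lower")
print(f"images per minute: {images_per_minute}")
```

These ratios describe this one B200 configuration; other GPUs or software versions may shift them.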

Original NVIDIA Example Caption

The original model repository provides assets/example_caption.json. Both images below were generated locally from that JSON caption with seed 1143 at 1024x1024, 50 steps, and guidance scale 4.0.

Variant	Pipeline load	Generation time	Peak sampled VRAM
BF16 base pipeline	35.41s	18.01s	141,098 MiB
FP8 transformer pipeline	29.66s	39.38s	71,820 MiB

BF16 reference output:

FP8 transformer output:

Stress Prompt Outputs

These are the ten FP8 outputs from the handwritten JSON-caption stress prompt set used in the benchmark table above. The set stresses Cyrillic signage, exact text placement, reflections, small-object consistency, multi-plane composition, UI panels, and human anatomy.

#	Stress focus	FP8 output
01	Metro archive reading room
02	Arctic greenhouse night shift
03	Control room restoration
04	Rain market cross section
05	Manuscript restoration table
06	Robotic assembly line signage
07	Kitchen storm chess table
08	Orbital cockpit Cyrillic UI
09	Flood command center
10	Cyrillic newspaper press

Notes

The upstream card documents BF16 as the tested precision. Treat this FP8 transformer as experimental.
The safety checker is not included in this repo; load it from the base model if your use case requires it.
Text rendering, especially exact Cyrillic text, remains a difficult case for this model family. Quantization should be evaluated visually for your target prompt distribution.
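Visual review can be complemented with a simple numeric comparison between a BF16 reference image and its FP8 counterpart generated from the same seed. The sketch below computes PSNR over flat sequences of 8-bit pixel values in pure Python. It is an illustrative addition, not a metric used in this card, and a single score will not capture text-rendering errors, so it does not replace looking at the outputs.

```python
import math


def psnr(reference, candidate, max_value=255):
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences."""
    if len(reference) != len(candidate) or len(reference) == 0:
        raise ValueError("inputs must be non-empty and the same length")
    mse = sum((a - b) ** 2 for a, b in zip(reference, candidate)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10 * math.log10(max_value ** 2 / mse)


# Toy pixel values. With Pillow installed (an assumption), the raw bytes from
# Image.open(path).convert("RGB").tobytes() could be passed in directly.
print(psnr([10, 20, 30, 40], [12, 18, 30, 40]))
```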
