Cosmos3-Super-Text2Image Quanto FP8 Transformer

This repository contains a transformer-only FP8/float8 quantization made with Hugging Face Optimum Quanto for nvidia/Cosmos3-Super-Text2Image.

This is a Quanto quantization, not an NVIDIA ModelOpt/NVFP quantization. Treat the separate NVFP experiments as a different quantization backend when comparing them against this repo.

Read NVIDIA's card, license, safety notes, and prompt-format guidance here: nvidia/Cosmos3-Super-Text2Image.

Only transformer/ is provided as a weight artifact. The VAE, scheduler, tokenizers, safety checker, and other components are loaded from the base model.

Assemble The Pipeline

import json
import torch
from diffusers import Cosmos3OmniPipeline, Cosmos3OmniTransformer
from diffusers.schedulers.scheduling_unipc_multistep import UniPCMultistepScheduler

transformer = Cosmos3OmniTransformer.from_pretrained(
    "WaveCut/Cosmos3-Super-Text2Image-Quanto-FP8-Transformer",
    subfolder="transformer",
    torch_dtype=torch.bfloat16,
)

pipe = Cosmos3OmniPipeline.from_pretrained(
    "nvidia/Cosmos3-Super-Text2Image",
    transformer=transformer,
    torch_dtype=torch.bfloat16,
    device_map="cuda",
    enable_safety_checker=True,
)
# Ensure the injected transformer and Cosmos intermediate tensors share CUDA.
pipe.to("cuda")
pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config, flow_shift=3.0)

# Use the JSON-caption format described by the original model card.
json_caption = {
    "subjects": [],
    "background_setting": "A concise scene description.",
    "comprehensive_t2i_caption": "A detailed natural-language caption.",
    "resolution": {"H": 1024, "W": 1024},
    "aspect_ratio": "1,1",
}

result = pipe(
    prompt=json.dumps(json_caption),
    negative_prompt="",
    num_frames=1,
    height=1024,
    width=1024,
    num_inference_steps=50,
    guidance_scale=4.0,
    generator=torch.Generator(device="cuda").manual_seed(1143),
)
result.video[0].save("cosmos3_fp8.png")
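The JSON caption above repeats the resolution in three places (the `resolution` field, the `aspect_ratio` field, and the `height`/`width` pipeline arguments), so a small helper can keep them consistent. The following is a minimal sketch, not part of the upstream card: `build_json_caption` is a hypothetical name, and the sketch assumes the aspect ratio is the reduced width and height joined by a comma, which is inferred only from the `"1,1"` value in the example. Check NVIDIA's prompt-format guidance before relying on it for non-square sizes.

```python
import json
from math import gcd


def build_json_caption(background, caption, height, width, subjects=None):
    """Return a JSON-caption string shaped like the example above.

    Assumption: "aspect_ratio" is the reduced "W,H" pair written with a
    comma, as inferred from the "1,1" value in the example.
    """
    divisor = gcd(height, width)
    payload = {
        "subjects": list(subjects or []),
        "background_setting": background,
        "comprehensive_t2i_caption": caption,
        "resolution": {"H": height, "W": width},
        "aspect_ratio": f"{width // divisor},{height // divisor}",
    }
    # Default json.dumps settings, matching the call in the example above.
    return json.dumps(payload)


prompt = build_json_caption(
    background="A concise scene description.",
    caption="A detailed natural-language caption.",
    height=1024,
    width=1024,
)
print(prompt)
```

The resulting string would be passed as `prompt=prompt`, with the same `height` and `width` values given to the pipeline call.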

Benchmarks

Measured on one RunPod NVIDIA B200 instance with local container storage, cached model files, PyTorch 2.9.1+cu130, 1024x1024 image generation, 50 inference steps, guidance scale 4.0, flow_shift=3.0, system prompt enabled.
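The benchmark harness itself is not included in this card. As a rough illustration of how average, minimum, and maximum generation times like those reported below could be collected, here is a minimal sketch using only the standard library. The `time_calls` helper and its `sync` parameter are hypothetical names, and the workload is a stand-in so the example runs without a GPU.

```python
import time
from statistics import mean


def time_calls(fn, inputs, sync=None):
    """Run fn(x) for each x and return avg/min/max wall-clock seconds.

    sync is an optional zero-argument callable. On a GPU you would pass
    torch.cuda.synchronize so queued kernels finish before the clock stops.
    """
    durations = []
    for item in inputs:
        if sync is not None:
            sync()
        start = time.perf_counter()
        fn(item)
        if sync is not None:
            sync()
        durations.append(time.perf_counter() - start)
    return {
        "avg": mean(durations),
        "min": min(durations),
        "max": max(durations),
        "count": len(durations),
    }


# Stand-in workload so the sketch runs anywhere; replace with a pipeline call.
stats = time_calls(lambda n: sum(range(n)), [10_000, 20_000, 30_000])
print(stats)
```

With PyTorch on CUDA, `torch.cuda.max_memory_allocated()` reports the peak memory allocated by the caching allocator. That is not necessarily the same quantity as the "Peak sampled VRAM" column, whose sampling method is not described here.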

Transformer Component Load

This measures loading the transformer component and moving it to CUDA in isolation.

| Variant | Load to CUDA | VRAM after load | Torch allocated | Torch reserved | Transformer safetensors |
|---|---|---|---|---|---|
| BF16 base transformer | 23.80s | 122,758 MiB | 122,121 MiB | 122,132 MiB | 119.21 GiB |
| FP8 transformer | 74.45s | 65,756 MiB | 62,356 MiB | 65,036 MiB | 60.35 GiB |

Full Pipeline Generation

This measures end-to-end Diffusers pipeline loading and generation. The stress set consists of ten handwritten JSON-caption prompts that target Cyrillic text, reflections, multi-object composition, anatomy, and small details.

| Variant | Full pipeline load | VRAM after load | Torch allocated after load | Avg generation time | Min / max generation time | Peak sampled VRAM | Images |
|---|---|---|---|---|---|---|---|
| BF16 base pipeline | 31.31s | 125,134 MiB | 124,386 MiB | 16.05s | 15.51s / 17.97s | 141,104 MiB | 10 |
| FP8 transformer pipeline | 28.06s | 69,276 MiB | 65,865 MiB | 37.53s | 36.43s / 40.00s | 82,198 MiB | 10 |
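To make the trade-off in the two tables above easier to read, the short script below divides each FP8 figure by its BF16 counterpart. All numbers are copied from those tables; the dictionary keys are illustrative names chosen for this sketch. On these measurements the FP8 variant uses roughly 45% less VRAM after load and takes about 2.3 times as long per image.

```python
# Figures copied from the two tables above (MiB, seconds, GiB).
bf16 = {
    "load_vram_mib": 125_134,
    "peak_vram_mib": 141_104,
    "avg_gen_s": 16.05,
    "weights_gib": 119.21,
}
fp8 = {
    "load_vram_mib": 69_276,
    "peak_vram_mib": 82_198,
    "avg_gen_s": 37.53,
    "weights_gib": 60.35,
}


def ratio(metric):
    """FP8 value divided by the BF16 value for the same metric."""
    return fp8[metric] / bf16[metric]


for metric in bf16:
    print(f"{metric}: FP8 / BF16 = {ratio(metric):.2f}")
```

A ratio below 1 means FP8 is smaller or faster on that metric; a ratio above 1 means it is larger or slower. These ratios describe this single B200 setup only.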

Original NVIDIA Example Caption

The original model repository provides assets/example_caption.json. The images below were generated locally from that JSON caption with seed 1143, 1024x1024 resolution, 50 steps, and guidance scale 4.0.

| Variant | Pipeline load | Generation time | Peak sampled VRAM |
|---|---|---|---|
| BF16 base pipeline | 35.41s | 18.01s | 141,098 MiB |
| FP8 transformer pipeline | 29.66s | 39.38s | 71,820 MiB |

BF16 reference output:

BF16 output for NVIDIA example caption

FP8 transformer output:

FP8 output for NVIDIA example caption

Stress Prompt Outputs

These are the ten FP8 outputs from the handwritten JSON-caption stress prompt set used in the benchmark table above. The set stresses Cyrillic signage, exact text placement, reflections, small-object consistency, multi-plane composition, UI panels, and human anatomy.

| # | Stress focus | FP8 output |
|---|---|---|
| 01 | Metro archive reading room | [Image: Metro archive reading room] |
| 02 | Arctic greenhouse night shift | [Image: Arctic greenhouse night shift] |
| 03 | Control room restoration | [Image: Control room restoration] |
| 04 | Rain market cross section | [Image: Rain market cross section] |
| 05 | Manuscript restoration table | [Image: Manuscript restoration table] |
| 06 | Robotic assembly line signage | [Image: Robotic assembly line signage] |
| 07 | Kitchen storm chess table | [Image: Kitchen storm chess table] |
| 08 | Orbital cockpit Cyrillic UI | [Image: Orbital cockpit Cyrillic UI] |
| 09 | Flood command center | [Image: Flood command center] |
| 10 | Cyrillic newspaper press | [Image: Cyrillic newspaper press] |

Notes

  • The upstream card documents BF16 as the tested precision. Treat this FP8 transformer as experimental.
  • The safety checker is not included in this repo; load it from the base model if your use case requires it.
  • Text rendering, especially exact Cyrillic text, remains a difficult case for this model family. Quantization should be evaluated visually for your target prompt distribution.
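Visual review can be supplemented with a simple numeric comparison between a BF16 output and an FP8 output generated from the same prompt and seed. The sketch below computes peak signal-to-noise ratio (PSNR) over raw 8-bit pixel buffers using only the standard library. It is an illustration, not the evaluation used for this card, and the `psnr` helper is a hypothetical name. PSNR measures pixel-level difference only, so it cannot tell whether rendered text is correct.

```python
import math


def psnr(reference: bytes, candidate: bytes, peak: int = 255) -> float:
    """PSNR in dB between two equal-length 8-bit pixel buffers."""
    if len(reference) != len(candidate):
        raise ValueError("buffers must have the same length")
    if not reference:
        raise ValueError("buffers must not be empty")
    squared_error = sum((a - b) ** 2 for a, b in zip(reference, candidate))
    if squared_error == 0:
        return math.inf  # identical buffers
    mse = squared_error / len(reference)
    return 10.0 * math.log10(peak * peak / mse)


# Toy 2x2 RGB buffers. With Pillow installed you could instead use
# Image.open(path).convert("RGB").tobytes() for each saved output.
reference = bytes([10, 20, 30] * 4)
candidate = bytes([12, 20, 30] * 4)
print(f"PSNR: {psnr(reference, candidate):.2f} dB")
```

Higher values mean the two buffers are closer. A low score flags an image pair worth inspecting by eye; it does not replace that inspection.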