Cosmos3-Super-Text2Image Quanto FP8 Transformer
This repository contains a transformer-only FP8 (float8) quantization of nvidia/Cosmos3-Super-Text2Image, made with Hugging Face Optimum Quanto.
This is a Quanto quantization, not an NVIDIA ModelOpt/NVFP quantization. The separate NVFP experiments use a different quantization backend, so compare them against this repo explicitly rather than treating the two as interchangeable.
Read NVIDIA's card, license, safety notes, and prompt-format guidance here: nvidia/Cosmos3-Super-Text2Image.
Only transformer/ is provided as a weight artifact. The VAE, scheduler, tokenizers, safety checker, and other components are loaded from the base model.
Assemble The Pipeline
```python
import json

import torch
from diffusers import Cosmos3OmniPipeline, Cosmos3OmniTransformer
from diffusers.schedulers.scheduling_unipc_multistep import UniPCMultistepScheduler

transformer = Cosmos3OmniTransformer.from_pretrained(
    "WaveCut/Cosmos3-Super-Text2Image-Quanto-FP8-Transformer",
    subfolder="transformer",
    torch_dtype=torch.bfloat16,
)

pipe = Cosmos3OmniPipeline.from_pretrained(
    "nvidia/Cosmos3-Super-Text2Image",
    transformer=transformer,
    torch_dtype=torch.bfloat16,
    device_map="cuda",
    enable_safety_checker=True,
)

# Ensure the injected transformer and Cosmos intermediate tensors share CUDA.
pipe.to("cuda")
pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config, flow_shift=3.0)

# Use the JSON-caption format described by the original model card.
json_caption = {
    "subjects": [],
    "background_setting": "A concise scene description.",
    "comprehensive_t2i_caption": "A detailed natural-language caption.",
    "resolution": {"H": 1024, "W": 1024},
    "aspect_ratio": "1,1",
}

result = pipe(
    prompt=json.dumps(json_caption),
    negative_prompt="",
    num_frames=1,
    height=1024,
    width=1024,
    num_inference_steps=50,
    guidance_scale=4.0,
    generator=torch.Generator(device="cuda").manual_seed(1143),
)
result.video[0].save("cosmos3_fp8.png")
```
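When generating many prompts, it can help to build the JSON caption in one place so that the `resolution` field always matches the `height` and `width` passed to the pipeline. The sketch below is only an illustration: `build_json_caption` is a hypothetical helper, not part of Diffusers or the upstream card, and it simply reuses the field names from the example above. The `aspect_ratio` value is passed through unchanged because the format for non-square images should be taken from the upstream card.

```python
import json


def build_json_caption(background, caption, height=1024, width=1024,
                       aspect_ratio="1,1", subjects=None):
    """Return a JSON-caption string using the fields from the example above.

    Hypothetical helper: the field names mirror the snippet in this card;
    consult the upstream card for the authoritative schema.
    """
    if height <= 0 or width <= 0:
        raise ValueError("height and width must be positive")
    payload = {
        "subjects": list(subjects) if subjects else [],
        "background_setting": background,
        "comprehensive_t2i_caption": caption,
        "resolution": {"H": height, "W": width},
        "aspect_ratio": aspect_ratio,
    }
    return json.dumps(payload)


prompt = build_json_caption(
    background="A concise scene description.",
    caption="A detailed natural-language caption.",
)
```

The resulting `prompt` string can be passed as `prompt=` in the pipeline call, with the same `height` and `width` values given to both the helper and the pipeline.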
Benchmarks
All measurements were taken on a single RunPod NVIDIA B200 instance with local container storage and cached model files, using PyTorch 2.9.1+cu130, 1024x1024 image generation, 50 inference steps, guidance scale 4.0, flow_shift=3.0, and the system prompt enabled.
Transformer Component Load
This measures loading the transformer component and moving it to CUDA in isolation.
| Variant | Load to CUDA | VRAM after load | Torch allocated | Torch reserved | Transformer safetensors |
|---|---|---|---|---|---|
| BF16 base transformer | 23.80s | 122,758 MiB | 122,121 MiB | 122,132 MiB | 119.21 GiB |
| FP8 transformer | 74.45s | 65,756 MiB | 62,356 MiB | 65,036 MiB | 60.35 GiB |
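To make the trade-off in the table easier to read, the following sketch derives a few ratios from the reported figures. It only does arithmetic on the numbers above. The parameter-count estimate rests on an assumption that is not stated in this card: that every weight in the BF16 checkpoint is stored in 2 bytes and that an ideal FP8 checkpoint would store each in 1 byte.

```python
GIB = 2**30

# Figures copied from the table above.
bf16 = {"load_s": 23.80, "vram_mib": 122_758, "file_gib": 119.21}
fp8 = {"load_s": 74.45, "vram_mib": 65_756, "file_gib": 60.35}

# Size and memory of FP8 relative to BF16 (1.0 would mean "no change").
file_ratio = fp8["file_gib"] / bf16["file_gib"]
vram_ratio = fp8["vram_mib"] / bf16["vram_mib"]

# How many times longer the FP8 transformer takes to load.
load_slowdown = fp8["load_s"] / bf16["load_s"]

# Assumption: 2 bytes per BF16 weight, 1 byte per FP8 weight.
approx_params = bf16["file_gib"] * GIB / 2
ideal_fp8_gib = approx_params / GIB
extra_gib = fp8["file_gib"] - ideal_fp8_gib

print(f"FP8 file size vs BF16:  {file_ratio:.3f}")
print(f"FP8 VRAM after load:    {vram_ratio:.3f}")
print(f"FP8 load time:          {load_slowdown:.2f}x")
print(f"Approx. parameters:     {approx_params / 1e9:.1f}B")
print(f"Size above ideal FP8:   {extra_gib:.3f} GiB")
```

With the table's numbers, the FP8 checkpoint is about half the size of the BF16 one and takes roughly three times as long to load. The small remainder above the ideal FP8 size would be consistent with quantization scales or layers kept in higher precision, although this card does not break that down.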
Full Pipeline Generation
This measures end-to-end Diffusers pipeline loading and generation. The stress set consists of ten handwritten JSON-caption prompts that target Cyrillic text, reflections, multi-object composition, anatomy, and small details.
| Variant | Full pipeline load | VRAM after load | Torch allocated after load | Avg generation time | Min / max generation time | Peak sampled VRAM | Images |
|---|---|---|---|---|---|---|---|
| BF16 base pipeline | 31.31s | 125,134 MiB | 124,386 MiB | 16.05s | 15.51s / 17.97s | 141,104 MiB | 10 |
| FP8 transformer pipeline | 28.06s | 69,276 MiB | 65,865 MiB | 37.53s | 36.43s / 40.00s | 82,198 MiB | 10 |
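The average, minimum, and maximum generation times above can be gathered with a small timing loop. The sketch below is not the script used for this card, which is not published here; `time_runs` is a hypothetical helper built only on the standard library. The optional `sync` argument is a hook for a device barrier such as `torch.cuda.synchronize`, so that asynchronously queued GPU work is included in the measured interval.

```python
import statistics
import time


def time_runs(fn, inputs, sync=None):
    """Call fn(x) once per input and summarize wall-clock seconds.

    Hypothetical helper. `sync`, if given, is called before starting and
    after finishing each run (for example torch.cuda.synchronize).
    """
    durations = []
    for x in inputs:
        if sync is not None:
            sync()
        start = time.perf_counter()
        fn(x)
        if sync is not None:
            sync()
        durations.append(time.perf_counter() - start)
    return {
        "avg": statistics.mean(durations),
        "min": min(durations),
        "max": max(durations),
        "count": len(durations),
    }


# Stand-in workload so the sketch runs without a GPU.
# In practice, fn would wrap a pipeline call and inputs would be the prompts.
stats = time_runs(lambda n: sum(range(n)), [10_000] * 3)
print(stats)
```

For peak memory, PyTorch exposes `torch.cuda.reset_peak_memory_stats()` and `torch.cuda.max_memory_allocated()`, but those report allocator-level usage and may not match the "peak sampled VRAM" column, whose sampling method is not described in this card.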
Original NVIDIA Example Caption
The original model repository provides assets/example_caption.json. The images below were generated locally from that JSON caption with seed 1143, 1024x1024 resolution, 50 steps, and guidance scale 4.0.
| Variant | Pipeline load | Generation time | Peak sampled VRAM |
|---|---|---|---|
| BF16 base pipeline | 35.41s | 18.01s | 141,098 MiB |
| FP8 transformer pipeline | 29.66s | 39.38s | 71,820 MiB |
BF16 reference output:
FP8 transformer output:
Stress Prompt Outputs
These are the ten FP8 outputs from the handwritten JSON-caption stress prompt set used in the benchmark table above. The set stresses Cyrillic signage, exact text placement, reflections, small-object consistency, multi-plane composition, UI panels, and human anatomy.
Notes
- The upstream card documents BF16 as the tested precision. Treat this FP8 transformer as experimental.
- The safety checker is not included in this repo; load it from the base model if your use case requires it.
- Text rendering, especially exact Cyrillic text, remains a difficult case for this model family. Quantization should be evaluated visually for your target prompt distribution.