---
library_name: pytorch
tags:
- super-resolution
- diffusion
- pixel-diffusion-decoder
- vae-decoder
pipeline_tag: image-to-image
base_model:
- nvidia/PixelDiT-1300M-1024px
- Tongyi-MAI/Z-Image
- black-forest-labs/FLUX.1-dev
- black-forest-labs/FLUX.2-dev
- nyu-visionx/Scale-RAE-Qwen7B_DiT9.8B
---
# PiD: Pixel Diffusion Decoder
<p align="center">
<img src="figures/teaser.jpg" alt="PiD teaser" width="100%">
</p>

**[Paper](https://arxiv.org/abs/2605.23902), [Project Page](https://research.nvidia.com/labs/sil/projects/pid/)**

[Yifan Lu](https://yifanlu0227.github.io/),
[Qi Wu](https://wilsoncernwq.github.io/),
[Jay Zhangjie Wu](https://zhangjiewu.github.io/),
[Zian Wang](https://www.cs.toronto.edu/~zianwang/),
[Huan Ling](https://www.cs.toronto.edu/~linghuan/),
[Sanja Fidler](https://www.cs.utoronto.ca/~fidler/),
[Xuanchi Ren](https://xuanchiren.com/) <br>

PiD reformulates the latent-to-pixel decoder as a conditional pixel-space
diffusion model, unifying decoding and upsampling into a single generative
module. It denoises directly in high-resolution pixel space and produces a
super-resolved image in one pass. This repository hosts the released decoder
checkpoints, plus the encoder/decoder ("VAE") weights they depend on.

All `PiD_*` checkpoints in this repo are **4-step distilled**. The non-`PiD_*`
entries (`ae.safetensors`, `flux2_ae.safetensors`, `sdxl_vae.safetensors`, `QwenImage_VAE_2d.pth`, `sd3_vae/`, `rae/`,
`scale_rae/`) are **the corresponding encoder/decoder VAE weights** that PiD
plugs into; they are not PiD checkpoints themselves.
### License/Terms of Use
This model is released under the [NSCLv1](https://huggingface.co/nvidia/PixelDiT-1300M-1024px/blob/main/LICENSE) License. The work and any derivative works may only be used for non-commercial (research or evaluation) purposes.
### Deployment Geography:
Global
## PiD checkpoints

Two variants are released, though not every latent space ships both (see the table below):

- **`2k`**: trained at 2048 px and used as a 4× decoder (512 px LDM → 2048 px), or as
  an 8× decoder for the Scale-RAE backbone (256 px → 2048 px).
- **`2kto4k`**: trained with multi-resolution data bucketing from 2048 px to 4096 px and an
  SD3-style dynamic shift; designed for 1024 px LDM → 4K (4096 px) decoding.

Both checkpoint variants support multiple aspect ratios.
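
The resolution arithmetic in the list above can be checked with a few lines of Python. This is only a minimal sketch: `decoded_side_px` is a hypothetical helper, not part of the PiD codebase, and it assumes the SR factor scales the LDM's output side length uniformly.

```python
def decoded_side_px(ldm_side_px: int, sr_factor: int) -> int:
    """Return the decoded side length in pixels for an LDM resolution and SR factor."""
    if ldm_side_px <= 0 or sr_factor <= 0:
        raise ValueError("resolution and SR factor must be positive")
    return ldm_side_px * sr_factor


# The three cases quoted in the variant descriptions.
cases = [
    ("2k, 4x decoder", 512, 4),
    ("2k, 8x decoder (Scale-RAE)", 256, 8),
    ("2kto4k, 4x decoder", 1024, 4),
]

for label, ldm_side, factor in cases:
    print(f"{label}: {ldm_side} px -> {decoded_side_px(ldm_side, factor)} px")
```
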

| Path                                                               | Latent space | SR factor | Variant |
|--------------------------------------------------------------------|--------------|-----------|---------|
| `checkpoints/PiD_res2k_sr4x_official_flux_distill_4step`           | Flux1-dev    | 4×        | 2k      |
| `checkpoints/PiD_res2k_sr4x_official_flux2_distill_4step`          | Flux2-dev    | 4×        | 2k      |
| `checkpoints/PiD_res2k_sr4x_official_sd3_distill_4step`            | SD3 medium   | 4×        | 2k      |
| `checkpoints/PiD_res2k_sr4x_official_dinov2_distill_4step`         | DINOv2-B     | 4×        | 2k      |
| `checkpoints/PiD_res2k_sr8x_official_siglip_distill_4step`         | SigLIP-2     | 8×        | 2k      |
| `checkpoints/PiD_res2kto4k_sr4x_official_flux_distill_4step`       | Flux1-dev    | 4×        | 2kto4k  |
| `checkpoints/PiD_res2kto4k_sr4x_official_flux2_distill_4step_2606` | Flux2-dev    | 4×        | 2kto4k  |
| `checkpoints/PiD_res2kto4k_sr4x_official_sd3_distill_4step`        | SD3 medium   | 4×        | 2kto4k  |
| `checkpoints/PiD_res2kto4k_sr4x_official_sdxl_distill_4step`       | SDXL         | 4×        | 2kto4k  |
| `checkpoints/PiD_res2kto4k_sr4x_official_qwenimage_distill_4step`  | Qwen-Image   | 4×        | 2kto4k  |

Each directory contains a single file, `model_ema_bf16.pth`, which holds the EMA
weights cast to bfloat16, the format the inference scripts load by default.

> **⚠️ Flux2-dev `2kto4k`: use the new `_2606` checkpoint.** The previous
> `PiD_res2kto4k_sr4x_official_flux2_distill_4step` (without the `_2606` suffix)
> suffered from a color-drifting issue. The new
> `PiD_res2kto4k_sr4x_official_flux2_distill_4step_2606` fixes it; please use it
> and do **not** use the old one. See the
> [comparison](https://github.com/nv-tlabs/pid/blob/main/docs/FLUX2_2kto4k_new_ckpt_compare.md)
> for details.
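
If you resolve checkpoint files yourself instead of going through the inference scripts, a small guard can help avoid picking up the superseded Flux2-dev directory by accident. The sketch below is illustrative only: `checkpoint_file` and `DEPRECATED_DIRS` are hypothetical names, and the layout (`checkpoints/<directory>/model_ema_bf16.pth`) is assumed from the table and text above.

```python
from pathlib import Path

# File name stated above: EMA weights cast to bfloat16.
WEIGHTS_FILE = "model_ema_bf16.pth"

# Directory that the warning above says should no longer be used.
DEPRECATED_DIRS = {"PiD_res2kto4k_sr4x_official_flux2_distill_4step"}


def checkpoint_file(root, ckpt_dir: str) -> Path:
    """Build the path to a checkpoint's weights file, rejecting deprecated directories."""
    if ckpt_dir in DEPRECATED_DIRS:
        raise ValueError(
            f"{ckpt_dir} is superseded; use {ckpt_dir}_2606 instead"
        )
    return Path(root) / "checkpoints" / ckpt_dir / WEIGHTS_FILE


path = checkpoint_file(".", "PiD_res2kto4k_sr4x_official_flux2_distill_4step_2606")
# Report whether the file has been downloaded yet.
print(path, "(found)" if path.is_file() else "(not downloaded)")
```
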
### Latent space → compatible LDMs
A PiD decoder is tied to a *latent space*, not to a single generative model. Any
LDM that produces latents in that space can reuse the same checkpoint. The
`--backbone` aliases below pick the right LDM pipeline; they all decode through
the latent space's checkpoint above.

| Latent space | VAE / vision encoder weights | Compatible `--backbone` | Corresponding LDM links |
|--------------|------------------------------------|-------------------------------------------|-----------------|
| Flux1-dev | `checkpoints/ae.safetensors` | `flux`, `zimage`, `zimage-turbo` | [FLUX.1-dev](https://huggingface.co/black-forest-labs/FLUX.1-dev), [Z-Image](https://huggingface.co/Tongyi-MAI/Z-Image), [Z-Image-Turbo](https://huggingface.co/Tongyi-MAI/Z-Image-Turbo) |
| Flux2-dev | `checkpoints/flux2_ae.safetensors` | `flux2`, `flux2-klein-4b`, `flux2-klein-9b` | [FLUX.2-dev](https://huggingface.co/black-forest-labs/FLUX.2-dev), [FLUX.2-klein-4B](https://huggingface.co/black-forest-labs/FLUX.2-klein-4B), [FLUX.2-klein-9B](https://huggingface.co/black-forest-labs/FLUX.2-klein-9B) |
| SD3 medium | `checkpoints/sd3_vae/` | `sd3` | [SD3-medium](https://huggingface.co/stabilityai/stable-diffusion-3-medium-diffusers) |
| SDXL | `checkpoints/sdxl_vae.safetensors` | `sdxl` | [SDXL-base-1.0](https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0) |
| Qwen-Image | `checkpoints/QwenImage_VAE_2d.pth` | `qwenimage`, `qwenimage-2512` | [Qwen-Image](https://huggingface.co/Qwen/Qwen-Image), [Qwen-Image-2512](https://huggingface.co/Qwen/Qwen-Image-2512) |
| DINOv2-B | `checkpoints/rae/` | `dinov2` | [RAE](https://github.com/bytetriper/RAE) (class-conditional; DINOv2-B) |
| SigLIP-2 | `checkpoints/scale_rae/` | `siglip` | [Scale-RAE](https://github.com/ZitengWangNYU/Scale-RAE) (text-conditional; nyu-visionx/Scale-RAE-Qwen1.5B_DiT2.4B) |

For example, Z-Image and Z-Image-Turbo share Flux1-dev's VAE, so they reuse the
`flux` checkpoints (both `2k` and `2kto4k`); no separate `zimage` checkpoint is
shipped. Likewise, `qwenimage-2512` reuses the `qwenimage` decoder (same VAE,
different transformer).
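
The alias-to-latent-space relationship in the table above amounts to a many-to-one lookup. The following sketch copies the aliases from the table into a plain dictionary; it is not the repository's registry (that lives in `checkpoint_registry.py`), and `latent_space_for` is a hypothetical helper name.

```python
# Copied from the table above: latent space -> compatible --backbone aliases.
LATENT_SPACE_BACKBONES = {
    "Flux1-dev": ["flux", "zimage", "zimage-turbo"],
    "Flux2-dev": ["flux2", "flux2-klein-4b", "flux2-klein-9b"],
    "SD3 medium": ["sd3"],
    "SDXL": ["sdxl"],
    "Qwen-Image": ["qwenimage", "qwenimage-2512"],
    "DINOv2-B": ["dinov2"],
    "SigLIP-2": ["siglip"],
}

# Invert once so each alias maps to exactly one latent space.
BACKBONE_TO_LATENT_SPACE = {
    alias: space
    for space, aliases in LATENT_SPACE_BACKBONES.items()
    for alias in aliases
}


def latent_space_for(backbone: str) -> str:
    """Return the latent space whose PiD checkpoint a backbone alias decodes through."""
    try:
        return BACKBONE_TO_LATENT_SPACE[backbone]
    except KeyError:
        known = ", ".join(sorted(BACKBONE_TO_LATENT_SPACE))
        raise ValueError(f"unknown backbone {backbone!r}; expected one of: {known}") from None


for alias in ("flux", "zimage-turbo", "qwenimage-2512"):
    print(f"{alias} -> {latent_space_for(alias)}")
```
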
## Usage
The decoder checkpoints are loaded by the inference scripts in the [PiD
codebase](https://github.com/nv-tlabs/pid). The exact `(backbone, ckpt_type) → path` mapping lives in
[`pid/_src/inference/checkpoint_registry.py`](https://github.com/nv-tlabs/PiD/blob/main/pid/_src/inference/checkpoint_registry.py),
which is the single source of truth. Clone the
repo, point it at this snapshot, and the demos pick the right file
automatically:
```bash
# Pull just the checkpoints/ tree into the repo root (skips this README and
# the teaser figure so they don't clobber the files in the source repo).
hf download nvidia/PiD --local-dir . --include "checkpoints/*"
# Then run any of the demos, e.g.:
PYTHONPATH=. python -m pid._src.inference.from_ldm --backbone flux \
--prompt "A photorealistic half-body portrait of a brown tabby cat with bold stripes sitting attentively on a rustic wooden kitchen table, soft morning light streaming sideways through a large window, fine fur detail and stripe patterns sharply visible, intense amber-green eyes in razor-sharp focus, warm farmhouse kitchen softly out of focus, cinematic shallow depth of field, ultra-detailed fur texture, photorealistic" \
--ldm_inference_steps 28 --save_xt_steps 24 \
--output_dir ./results/official_demo/flux \
--pid_inference_steps 4
```
Pick the `2kto4k` variant via `--pid_ckpt_type 2kto4k` when decoding at 4K.
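
When launching the demo from a script, the command can be assembled programmatically. The sketch below is a hypothetical wrapper (`demo_command` is not part of the PiD codebase) that uses only the flags shown in this README and appends `--pid_ckpt_type` on request. This README does not say whether further flags are needed to run the LDM at 1024 px, so none are added here.

```python
import shlex


def demo_command(backbone, prompt, output_dir, ckpt_type=None,
                 ldm_steps=28, save_xt_steps=24, pid_steps=4):
    """Assemble the argv list for the from_ldm demo using flags shown in this README."""
    argv = [
        "python", "-m", "pid._src.inference.from_ldm",
        "--backbone", backbone,
        "--prompt", prompt,
        "--ldm_inference_steps", str(ldm_steps),
        "--save_xt_steps", str(save_xt_steps),
        "--output_dir", output_dir,
        "--pid_inference_steps", str(pid_steps),
    ]
    # Only pass the variant flag when one is explicitly requested.
    if ckpt_type is not None:
        argv += ["--pid_ckpt_type", ckpt_type]
    return argv


cmd = demo_command(
    "flux",
    "a brown tabby cat sitting on a rustic wooden kitchen table",
    "./results/official_demo/flux_4k",
    ckpt_type="2kto4k",
)
# shlex.join quotes the prompt so the line can be pasted into a shell.
print("PYTHONPATH=. " + shlex.join(cmd))
```

The resulting list can also be handed to `subprocess.run(cmd, env=...)` with `PYTHONPATH` set to the repository root.
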
## Citation
```
@article{lu2026pid,
title={PiD: Fast and High-Resolution Latent Decoding with Pixel Diffusion},
author={Lu, Yifan and Wu, Qi and Wu, Jay Zhangjie and Wang, Zian and Ling, Huan and Fidler, Sanja and Ren, Xuanchi},
journal={arXiv preprint arXiv:2605.23902},
year={2026}
}
```