---
library_name: pytorch
tags:
- super-resolution
- diffusion
- pixel-diffusion-decoder
- vae-decoder
pipeline_tag: image-to-image
base_model:
- nvidia/PixelDiT-1300M-1024px
- Tongyi-MAI/Z-Image
- black-forest-labs/FLUX.1-dev
- black-forest-labs/FLUX.2-dev
- nyu-visionx/Scale-RAE-Qwen7B_DiT9.8B
---

# PiD: Pixel Diffusion Decoder

<p align="center">
  <img src="figures/teaser.jpg" alt="PiD teaser" width="100%">
</p>


**[Paper](https://arxiv.org/abs/2605.23902), [Project Page](https://research.nvidia.com/labs/sil/projects/pid/)**

[Yifan Lu](https://yifanlu0227.github.io/),
[Qi Wu](https://wilsoncernwq.github.io/),
[Jay Zhangjie Wu](https://zhangjiewu.github.io/),
[Zian Wang](https://www.cs.toronto.edu/~zianwang/),
[Huan Ling](https://www.cs.toronto.edu/~linghuan/),
[Sanja Fidler](https://www.cs.utoronto.ca/~fidler/),
[Xuanchi Ren](https://xuanchiren.com/) <br>


PiD reformulates the latent-to-pixel decoder as a conditional pixel-space
diffusion model, unifying decoding and upsampling into a single generative
module. It denoises directly in high-resolution pixel space and produces a
super-resolved image in one pass. This repository hosts the released decoder
checkpoints, plus the encoder/decoder ("VAE") weights they depend on.

All `PiD_*` checkpoints in this repo are **4-step distilled**. The non-`PiD_*`
entries (`ae.safetensors`, `flux2_ae.safetensors`, `sdxl_vae.safetensors`, `QwenImage_VAE_2d.pth`, `sd3_vae/`, `rae/`,
`scale_rae/`) are **the corresponding encoder/decoder VAE weights** that PiD
plugs into; they are not PiD checkpoints themselves.

### License/Terms of Use

This model is released under the [NSCLv1](https://huggingface.co/nvidia/PixelDiT-1300M-1024px/blob/main/LICENSE) License. The work and any derivative works may only be used for non-commercial (research or evaluation) purposes.

### Deployment Geography:
Global

## PiD checkpoints

Two checkpoint variants are released; the table below lists which latent spaces ship with each:

- **`2k`**: trained at 2048 px, used as a 4× decoder (512 LDM → 2048 px), or as
  an 8× decoder for the Scale-RAE backbone (256 → 2048).
- **`2kto4k`**: trained with multi-resolution data bucketing 2048→4096 and an
  SD3-style dynamic shift; designed for 1024 LDM → 4K (4096 px) decoding.

Both checkpoint variants support multiple aspect ratios.
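
The resolution arithmetic behind the two variants can be sketched in a few lines. This is only an illustration: `decoded_size` is a hypothetical helper, not part of the PiD codebase, and the non-square input in the last call is an assumed example (the supported aspect ratios are not enumerated here).

```python
def decoded_size(ldm_size, sr_factor):
    """Return the (width, height) in pixels after decoding with a given SR factor.

    ldm_size is the (width, height) the LDM nominally generates at;
    sr_factor is the decoder's super-resolution factor (4 or 8 in this repo).
    """
    width, height = ldm_size
    return width * sr_factor, height * sr_factor


print(decoded_size((512, 512), 4))    # 2k variant, 4x decoder
print(decoded_size((256, 256), 8))    # 2k variant, 8x decoder (Scale-RAE)
print(decoded_size((1024, 1024), 4))  # 2kto4k variant
print(decoded_size((768, 512), 4))    # hypothetical non-square input
```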

| Path                                                            | Latent space | SR factor | Variant |
|-----------------------------------------------------------------|--------------|-----------|---------|
| `checkpoints/PiD_res2k_sr4x_official_flux_distill_4step`        | Flux1-dev    | 4×        | 2k      |
| `checkpoints/PiD_res2k_sr4x_official_flux2_distill_4step`       | Flux2-dev    | 4×        | 2k      |
| `checkpoints/PiD_res2k_sr4x_official_sd3_distill_4step`         | SD3 medium   | 4×        | 2k      |
| `checkpoints/PiD_res2k_sr4x_official_dinov2_distill_4step`      | DINOv2-B     | 4×        | 2k      |
| `checkpoints/PiD_res2k_sr8x_official_siglip_distill_4step`      | SigLIP-2     | 8×        | 2k      |
| `checkpoints/PiD_res2kto4k_sr4x_official_flux_distill_4step`    | Flux1-dev    | 4×        | 2kto4k  |
| `checkpoints/PiD_res2kto4k_sr4x_official_flux2_distill_4step_2606` | Flux2-dev | 4×        | 2kto4k  |
| `checkpoints/PiD_res2kto4k_sr4x_official_sd3_distill_4step`     | SD3 medium   | 4×        | 2kto4k  |
| `checkpoints/PiD_res2kto4k_sr4x_official_sdxl_distill_4step`    | SDXL         | 4×        | 2kto4k  |
| `checkpoints/PiD_res2kto4k_sr4x_official_qwenimage_distill_4step` | Qwen-Image | 4×        | 2kto4k  |

Each directory contains a single file, `model_ema_bf16.pth`, which holds the EMA
weights cast to bfloat16, the format the inference scripts load by default.

> **⚠️ Flux2-dev `2kto4k`: use the new `_2606` checkpoint.** The previous
> `PiD_res2kto4k_sr4x_official_flux2_distill_4step` (without the `_2606` suffix)
> suffered from a color-drifting issue. The new
> `PiD_res2kto4k_sr4x_official_flux2_distill_4step_2606` fixes it; please use it
> and do **not** use the old one. See the
> [comparison](https://github.com/nv-tlabs/pid/blob/main/docs/FLUX2_2kto4k_new_ckpt_compare.md)
> for details.
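
As a rough sketch of how the table above turns into file paths, the lookup below transcribes the listed directories into a dictionary. It is not the repository's registry (that lives in the PiD codebase); `PID_CHECKPOINTS` and `checkpoint_file` are hypothetical names, and the short keys are taken from the backbone token that appears in each directory name.

```python
from pathlib import Path

# (backbone token, variant) -> checkpoint directory, copied from the table above.
# Hypothetical structure for illustration; the real mapping is in the PiD codebase.
PID_CHECKPOINTS = {
    ("flux", "2k"): "PiD_res2k_sr4x_official_flux_distill_4step",
    ("flux2", "2k"): "PiD_res2k_sr4x_official_flux2_distill_4step",
    ("sd3", "2k"): "PiD_res2k_sr4x_official_sd3_distill_4step",
    ("dinov2", "2k"): "PiD_res2k_sr4x_official_dinov2_distill_4step",
    ("siglip", "2k"): "PiD_res2k_sr8x_official_siglip_distill_4step",
    ("flux", "2kto4k"): "PiD_res2kto4k_sr4x_official_flux_distill_4step",
    # Flux2-dev 2kto4k: only the _2606 checkpoint is listed.
    ("flux2", "2kto4k"): "PiD_res2kto4k_sr4x_official_flux2_distill_4step_2606",
    ("sd3", "2kto4k"): "PiD_res2kto4k_sr4x_official_sd3_distill_4step",
    ("sdxl", "2kto4k"): "PiD_res2kto4k_sr4x_official_sdxl_distill_4step",
    ("qwenimage", "2kto4k"): "PiD_res2kto4k_sr4x_official_qwenimage_distill_4step",
}

EMA_FILENAME = "model_ema_bf16.pth"  # the single file in each checkpoint directory


def checkpoint_file(backbone, variant, root="checkpoints"):
    """Return the path of the EMA weights for a (backbone, variant) pair."""
    try:
        directory = PID_CHECKPOINTS[(backbone, variant)]
    except KeyError:
        raise ValueError(f"no {variant!r} checkpoint listed for {backbone!r}") from None
    return Path(root) / directory / EMA_FILENAME


print(checkpoint_file("flux", "2k").as_posix())
```

A pair that the table does not list, such as `("sdxl", "2k")`, raises `ValueError` instead of guessing a path.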

### Latent space → compatible LDMs

A PiD decoder is tied to a *latent space*, not to a single generative model. Any
LDM that produces latents in that space can reuse the same checkpoint. The
`--backbone` aliases below pick the right LDM pipeline; they all decode through
the latent space's checkpoint above.

| Latent space | VAE / vision encoder weights       | compatible `--backbone`                   | Corresponding LDM Links |
|--------------|------------------------------------|-------------------------------------------|-----------------|
| Flux1-dev    | `checkpoints/ae.safetensors`       | `flux`, `zimage`, `zimage-turbo`          | [FLUX.1-dev](https://huggingface.co/black-forest-labs/FLUX.1-dev), [Z-Image](https://huggingface.co/Tongyi-MAI/Z-Image), [Z-Image-Turbo](https://huggingface.co/Tongyi-MAI/Z-Image-Turbo) |
| Flux2-dev    | `checkpoints/flux2_ae.safetensors` | `flux2`, `flux2-klein-4b`, `flux2-klein-9b` | [FLUX.2-dev](https://huggingface.co/black-forest-labs/FLUX.2-dev), [FLUX.2-klein-4B](https://huggingface.co/black-forest-labs/FLUX.2-klein-4B), [FLUX.2-klein-9B](https://huggingface.co/black-forest-labs/FLUX.2-klein-9B) |
| SD3 medium   | `checkpoints/sd3_vae/`             | `sd3`                                     | [SD3-medium](https://huggingface.co/stabilityai/stable-diffusion-3-medium-diffusers) |
| SDXL         | `checkpoints/sdxl_vae.safetensors` | `sdxl`                                    | [SDXL-base-1.0](https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0) |
| Qwen-Image   | `checkpoints/QwenImage_VAE_2d.pth` | `qwenimage`, `qwenimage-2512`             | [Qwen-Image](https://huggingface.co/Qwen/Qwen-Image), [Qwen-Image-2512](https://huggingface.co/Qwen/Qwen-Image-2512) |
| DINOv2-B     | `checkpoints/rae/`                 | `dinov2`                                  | [RAE](https://github.com/bytetriper/RAE) (class-conditional; DINOv2-B) |
| SigLIP-2     | `checkpoints/scale_rae/`           | `siglip`                                  | [Scale-RAE](https://github.com/ZitengWangNYU/Scale-RAE) (text-conditional; nyu-visionx/Scale-RAE-Qwen1.5B_DiT2.4B) |

For example, Z-Image and Z-Image-Turbo share Flux1-dev's VAE, so they reuse the
`flux` checkpoints (both `2k` and `2kto4k`); no separate `zimage` checkpoint is
shipped. Likewise, `qwenimage-2512` reuses the `qwenimage` decoder (same VAE,
different transformer).
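
The alias-to-latent-space relationship in the table can be written down as a plain mapping, which makes the sharing explicit. This is a sketch transcribed from the table; `BACKBONE_TO_LATENT_SPACE` and `shares_decoder` are hypothetical names rather than anything defined in the PiD codebase.

```python
# --backbone alias -> latent space, transcribed from the table above.
BACKBONE_TO_LATENT_SPACE = {
    "flux": "Flux1-dev",
    "zimage": "Flux1-dev",
    "zimage-turbo": "Flux1-dev",
    "flux2": "Flux2-dev",
    "flux2-klein-4b": "Flux2-dev",
    "flux2-klein-9b": "Flux2-dev",
    "sd3": "SD3 medium",
    "sdxl": "SDXL",
    "qwenimage": "Qwen-Image",
    "qwenimage-2512": "Qwen-Image",
    "dinov2": "DINOv2-B",
    "siglip": "SigLIP-2",
}


def shares_decoder(backbone_a, backbone_b):
    """True when two aliases map to the same latent space, and so the same PiD decoder."""
    return BACKBONE_TO_LATENT_SPACE[backbone_a] == BACKBONE_TO_LATENT_SPACE[backbone_b]


print(shares_decoder("zimage-turbo", "flux"))  # both use the Flux1-dev latent space
print(shares_decoder("sdxl", "sd3"))           # different latent spaces
```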

## Usage

The decoder checkpoints are loaded by the inference scripts in the [PiD
codebase](https://github.com/nv-tlabs/pid). The exact `(backbone, ckpt_type) → path` mapping lives in
[`pid/_src/inference/checkpoint_registry.py`](https://github.com/nv-tlabs/PiD/blob/main/pid/_src/inference/checkpoint_registry.py),
which is the single source of truth. Clone the
repo, point it at this snapshot, and the demos pick the right file
automatically:

```bash
# Pull just the checkpoints/ tree into the repo root (skips this README and
# the teaser figure so they don't clobber the files in the source repo).
hf download nvidia/PiD --local-dir . --include "checkpoints/*"

# Then run any of the demos, e.g.:
PYTHONPATH=. python -m pid._src.inference.from_ldm --backbone flux \
    --prompt "A photorealistic half-body portrait of a brown tabby cat with bold stripes sitting attentively on a rustic wooden kitchen table, soft morning light streaming sideways through a large window, fine fur detail and stripe patterns sharply visible, intense amber-green eyes in razor-sharp focus, warm farmhouse kitchen softly out of focus, cinematic shallow depth of field, ultra-detailed fur texture, photorealistic" \
    --ldm_inference_steps 28 --save_xt_steps 24 \
    --output_dir ./results/official_demo/flux \
    --pid_inference_steps 4
```

Pick the `2kto4k` variant via `--pid_ckpt_type 2kto4k` when decoding at 4K.
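
For scripted runs, the demo invocation can be assembled programmatically. The sketch below only builds and prints the command. It uses just the flags that appear in this README, reuses the step counts from the `flux` demo, and assumes that omitting `--pid_ckpt_type` leaves the script on its default. `build_demo_command` is a hypothetical helper, and 4K decoding may need further options that are not shown here.

```python
import shlex


def build_demo_command(backbone, prompt, output_dir, ckpt_type=None):
    """Assemble the from_ldm demo command as an argument list (nothing is executed)."""
    cmd = [
        "python", "-m", "pid._src.inference.from_ldm",
        "--backbone", backbone,
        "--prompt", prompt,
        "--ldm_inference_steps", "28",  # value copied from the flux demo above
        "--save_xt_steps", "24",        # value copied from the flux demo above
        "--output_dir", output_dir,
        "--pid_inference_steps", "4",   # the released checkpoints are 4-step distilled
    ]
    if ckpt_type is not None:
        # e.g. "2kto4k" when decoding at 4K
        cmd += ["--pid_ckpt_type", ckpt_type]
    return cmd


command = build_demo_command(
    "flux", "a brown tabby cat on a wooden table", "./results/demo_4k", ckpt_type="2kto4k"
)
print("PYTHONPATH=. " + shlex.join(command))
```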

## Citation
```
@article{lu2026pid,
    title={PiD: Fast and High-Resolution Latent Decoding with Pixel Diffusion},
    author={Lu, Yifan and Wu, Qi and Wu, Jay Zhangjie and Wang, Zian and Ling, Huan and Fidler, Sanja and Ren, Xuanchi},
    journal={arXiv preprint arXiv:2605.23902},
    year={2026}
}
```