| --- |
| library_name: pytorch |
| tags: |
| - super-resolution |
| - diffusion |
| - pixel-diffusion-decoder |
| - vae-decoder |
| pipeline_tag: image-to-image |
| base_model: |
| - nvidia/PixelDiT-1300M-1024px |
| - Tongyi-MAI/Z-Image |
| - black-forest-labs/FLUX.1-dev |
| - black-forest-labs/FLUX.2-dev |
| - nyu-visionx/Scale-RAE-Qwen7B_DiT9.8B |
| --- |
| |
| # PiD: Pixel Diffusion Decoder
|
|
| <p align="center"> |
| <img src="figures/teaser.jpg" alt="PiD teaser" width="100%"> |
| </p> |
|
|
|
|
| **[Paper](https://arxiv.org/abs/2605.23902), [Project Page](https://research.nvidia.com/labs/sil/projects/pid/)** |
|
|
| [Yifan Lu](https://yifanlu0227.github.io/), |
| [Qi Wu](https://wilsoncernwq.github.io/), |
| [Jay Zhangjie Wu](https://zhangjiewu.github.io/), |
| [Zian Wang](https://www.cs.toronto.edu/~zianwang/), |
| [Huan Ling](https://www.cs.toronto.edu/~linghuan/), |
| [Sanja Fidler](https://www.cs.utoronto.ca/~fidler/), |
| [Xuanchi Ren](https://xuanchiren.com/) <br> |
|
|
|
|
| PiD reformulates the latent-to-pixel decoder as a conditional pixel-space |
| diffusion model, unifying decoding and upsampling into a single generative |
| module. It denoises directly in high-resolution pixel space and produces a |
| super-resolved image in one pass. This repository hosts the released decoder |
| checkpoints, plus the encoder/decoder ("VAE") weights they depend on. |
|
|
| All `PiD_*` checkpoints in this repo are **4-step distilled**. The non-`PiD_*`
| entries (`ae.safetensors`, `flux2_ae.safetensors`, `sdxl_vae.safetensors`,
| `QwenImage_VAE_2d.pth`, `sd3_vae/`, `rae/`, `scale_rae/`) are **the
| corresponding encoder/decoder VAE weights** that PiD plugs into; they are not
| PiD checkpoints themselves.
|
|
| ### License/Terms of Use |
|
|
| This model is released under the [NSCLv1](https://huggingface.co/nvidia/PixelDiT-1300M-1024px/blob/main/LICENSE) License. The work and any derivative works may only be used for non-commercial (research or evaluation) purposes. |
|
|
| ### Deployment Geography: |
| Global |
|
|
| ## PiD checkpoints |
|
|
| Checkpoints are released in two variants (not every latent space has both; see the table below):
|
|
| - **`2k`**: trained at 2048 px, used as a 4× decoder (512 LDM → 2048 px), or as
|   an 8× decoder for the Scale-RAE backbone (256 → 2048).
| - **`2kto4k`**: trained with multi-resolution data bucketing from 2048 to 4096 and an
|   SD3-style dynamic shift; designed for 1024 LDM → 4K (4096 px) decoding.
|
|
| Both checkpoint variants support multiple aspect ratios. |
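The resolution arithmetic behind the two variants can be sketched in a few lines. This is only an illustration: `decoded_side_px` is a hypothetical helper, and it assumes the output side length is simply the LDM resolution multiplied by the SR factor, as the figures quoted above suggest.

```python
def decoded_side_px(ldm_side_px: int, sr_factor: int) -> int:
    """Output side length in pixels, assuming plain multiplication by the SR factor."""
    if ldm_side_px <= 0 or sr_factor <= 0:
        raise ValueError("resolution and SR factor must be positive")
    return ldm_side_px * sr_factor


# The three configurations quoted in the variant list above.
for ldm_side, factor in [(512, 4), (256, 8), (1024, 4)]:
    print(f"{ldm_side} px x{factor} -> {decoded_side_px(ldm_side, factor)} px")
```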
|
|
| | Path | Latent space | SR factor | Variant |
| |------|--------------|-----------|---------|
| | `checkpoints/PiD_res2k_sr4x_official_flux_distill_4step` | Flux1-dev | 4× | 2k |
| | `checkpoints/PiD_res2k_sr4x_official_flux2_distill_4step` | Flux2-dev | 4× | 2k |
| | `checkpoints/PiD_res2k_sr4x_official_sd3_distill_4step` | SD3 medium | 4× | 2k |
| | `checkpoints/PiD_res2k_sr4x_official_dinov2_distill_4step` | DINOv2-B | 4× | 2k |
| | `checkpoints/PiD_res2k_sr8x_official_siglip_distill_4step` | SigLIP-2 | 8× | 2k |
| | `checkpoints/PiD_res2kto4k_sr4x_official_flux_distill_4step` | Flux1-dev | 4× | 2kto4k |
| | `checkpoints/PiD_res2kto4k_sr4x_official_flux2_distill_4step_2606` | Flux2-dev | 4× | 2kto4k |
| | `checkpoints/PiD_res2kto4k_sr4x_official_sd3_distill_4step` | SD3 medium | 4× | 2kto4k |
| | `checkpoints/PiD_res2kto4k_sr4x_official_sdxl_distill_4step` | SDXL | 4× | 2kto4k |
| | `checkpoints/PiD_res2kto4k_sr4x_official_qwenimage_distill_4step` | Qwen-Image | 4× | 2kto4k |
|
|
| Each directory contains a single file, `model_ema_bf16.pth`, which holds the EMA
| weights cast to bfloat16, the format the inference scripts load by default.
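For readers who want to inspect a checkpoint outside the official scripts, the sketch below builds the path to the weights file and loads it with `torch.load`. The helper names `weights_path` and `load_weights` are hypothetical, and the internal layout of the `.pth` file is not documented here, so `weights_only=True` is an assumption that may need relaxing if the file stores more than tensors.

```python
from pathlib import Path

WEIGHTS_FILENAME = "model_ema_bf16.pth"  # file name stated in this README


def weights_path(repo_root: str, checkpoint_dir: str) -> Path:
    """Join the repo root, a directory from the table above, and the weights file name."""
    return Path(repo_root) / checkpoint_dir / WEIGHTS_FILENAME


def load_weights(path: Path):
    """Load the checkpoint on CPU. Assumes PyTorch is installed."""
    import torch  # imported lazily so path handling works without PyTorch

    # weights_only=True restricts unpickling to tensors and plain containers;
    # whether this file loads under that restriction is an assumption.
    return torch.load(path, map_location="cpu", weights_only=True)
```

A call such as `load_weights(weights_path(".", "checkpoints/PiD_res2k_sr4x_official_flux_distill_4step"))` would then return whatever object the file stores, typically a state dict.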
|
|
| > **⚠️ Flux2-dev `2kto4k`: use the new `_2606` checkpoint.** The previous
| > `PiD_res2kto4k_sr4x_official_flux2_distill_4step` (without the `_2606` suffix)
| > suffered from a color-drifting issue. The new
| > `PiD_res2kto4k_sr4x_official_flux2_distill_4step_2606` fixes it. Please use it
| > and do **not** use the old one. See the
| > [comparison](https://github.com/nv-tlabs/pid/blob/main/docs/FLUX2_2kto4k_new_ckpt_compare.md)
| > for details.
| |
| ### Latent space → compatible LDMs
| |
| A PiD decoder is tied to a *latent space*, not to a single generative model. Any
| LDM that produces latents in that space can reuse the same checkpoint. The
| `--backbone` aliases below select the matching LDM pipeline; every alias in a row
| decodes through that latent space's checkpoint from the table above.
| |
| | Latent space | VAE / vision encoder weights       | Compatible `--backbone`                   | Corresponding LDM links |
| |--------------|------------------------------------|-------------------------------------------|-----------------| |
| | Flux1-dev | `checkpoints/ae.safetensors` | `flux`, `zimage`, `zimage-turbo` | [FLUX.1-dev](https://huggingface.co/black-forest-labs/FLUX.1-dev), [Z-Image](https://huggingface.co/Tongyi-MAI/Z-Image), [Z-Image-Turbo](https://huggingface.co/Tongyi-MAI/Z-Image-Turbo) | |
| | Flux2-dev | `checkpoints/flux2_ae.safetensors` | `flux2`, `flux2-klein-4b`, `flux2-klein-9b` | [FLUX.2-dev](https://huggingface.co/black-forest-labs/FLUX.2-dev), [FLUX.2-klein-4B](https://huggingface.co/black-forest-labs/FLUX.2-klein-4B), [FLUX.2-klein-9B](https://huggingface.co/black-forest-labs/FLUX.2-klein-9B) | |
| | SD3 medium | `checkpoints/sd3_vae/` | `sd3` | [SD3-medium](https://huggingface.co/stabilityai/stable-diffusion-3-medium-diffusers) | |
| | SDXL | `checkpoints/sdxl_vae.safetensors` | `sdxl` | [SDXL-base-1.0](https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0) | |
| | Qwen-Image | `checkpoints/QwenImage_VAE_2d.pth` | `qwenimage`, `qwenimage-2512` | [Qwen-Image](https://huggingface.co/Qwen/Qwen-Image), [Qwen-Image-2512](https://huggingface.co/Qwen/Qwen-Image-2512) | |
| | DINOv2-B | `checkpoints/rae/` | `dinov2` | [RAE](https://github.com/bytetriper/RAE) (class-conditional; DINOv2-B) | |
| | SigLIP-2 | `checkpoints/scale_rae/` | `siglip` | [Scale-RAE](https://github.com/ZitengWangNYU/Scale-RAE) (text-conditional; nyu-visionx/Scale-RAE-Qwen1.5B_DiT2.4B) | |
| |
| For example, Z-Image and Z-Image-Turbo share Flux1-dev's VAE, so they reuse the
| `flux` checkpoints (both `2k` and `2kto4k`); no separate `zimage` checkpoint is
| shipped. Likewise, `qwenimage-2512` reuses the `qwenimage` decoder (same VAE,
| different transformer).
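The two-step lookup described above (backbone alias to latent space, then latent space and variant to checkpoint directory) can be sketched as a pair of dictionaries. This is an illustration transcribed from the two tables in this README, not the contents of the official `checkpoint_registry.py`; the names `LATENT_SPACE_BY_BACKBONE`, `CHECKPOINT_DIRS`, and `resolve_checkpoint_dir` are hypothetical.

```python
# Alias -> latent space, transcribed from the compatibility table above.
LATENT_SPACE_BY_BACKBONE = {
    "flux": "Flux1-dev", "zimage": "Flux1-dev", "zimage-turbo": "Flux1-dev",
    "flux2": "Flux2-dev", "flux2-klein-4b": "Flux2-dev", "flux2-klein-9b": "Flux2-dev",
    "sd3": "SD3 medium", "sdxl": "SDXL",
    "qwenimage": "Qwen-Image", "qwenimage-2512": "Qwen-Image",
    "dinov2": "DINOv2-B", "siglip": "SigLIP-2",
}

# (latent space, variant) -> directory, transcribed from the checkpoint table.
CHECKPOINT_DIRS = {
    ("Flux1-dev", "2k"): "checkpoints/PiD_res2k_sr4x_official_flux_distill_4step",
    ("Flux2-dev", "2k"): "checkpoints/PiD_res2k_sr4x_official_flux2_distill_4step",
    ("SD3 medium", "2k"): "checkpoints/PiD_res2k_sr4x_official_sd3_distill_4step",
    ("DINOv2-B", "2k"): "checkpoints/PiD_res2k_sr4x_official_dinov2_distill_4step",
    ("SigLIP-2", "2k"): "checkpoints/PiD_res2k_sr8x_official_siglip_distill_4step",
    ("Flux1-dev", "2kto4k"): "checkpoints/PiD_res2kto4k_sr4x_official_flux_distill_4step",
    ("Flux2-dev", "2kto4k"): "checkpoints/PiD_res2kto4k_sr4x_official_flux2_distill_4step_2606",
    ("SD3 medium", "2kto4k"): "checkpoints/PiD_res2kto4k_sr4x_official_sd3_distill_4step",
    ("SDXL", "2kto4k"): "checkpoints/PiD_res2kto4k_sr4x_official_sdxl_distill_4step",
    ("Qwen-Image", "2kto4k"): "checkpoints/PiD_res2kto4k_sr4x_official_qwenimage_distill_4step",
}


def resolve_checkpoint_dir(backbone: str, ckpt_type: str) -> str:
    """Map a --backbone alias and variant to a checkpoint directory."""
    if backbone not in LATENT_SPACE_BY_BACKBONE:
        raise ValueError(f"unknown backbone alias: {backbone!r}")
    latent_space = LATENT_SPACE_BY_BACKBONE[backbone]
    key = (latent_space, ckpt_type)
    if key not in CHECKPOINT_DIRS:
        raise ValueError(f"no {ckpt_type!r} checkpoint listed for {latent_space}")
    return CHECKPOINT_DIRS[key]


print(resolve_checkpoint_dir("zimage-turbo", "2kto4k"))
```

Keying the second table on the latent space rather than on the alias mirrors the point made above: adding a new LDM that shares an existing VAE only requires one new alias entry.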
| |
| ## Usage |
| |
| The decoder checkpoints are loaded by the inference scripts in the [PiD
| codebase](https://github.com/nv-tlabs/pid). The `(backbone, ckpt_type) → path`
| mapping is defined in
| [`pid/_src/inference/checkpoint_registry.py`](https://github.com/nv-tlabs/PiD/blob/main/pid/_src/inference/checkpoint_registry.py),
| which is the single source of truth. Clone the repo, download this snapshot's
| `checkpoints/` tree into the repo root, and the demos pick the right file
| automatically:
| |
| ```bash |
| # Pull just the checkpoints/ tree into the repo root (skips this README and |
| # the teaser figure so they don't clobber the files in the source repo). |
| hf download nvidia/PiD --local-dir . --include "checkpoints/*" |
| |
| # Then run any of the demos, e.g.: |
| PYTHONPATH=. python -m pid._src.inference.from_ldm --backbone flux \ |
| --prompt "A photorealistic half-body portrait of a brown tabby cat with bold stripes sitting attentively on a rustic wooden kitchen table, soft morning light streaming sideways through a large window, fine fur detail and stripe patterns sharply visible, intense amber-green eyes in razor-sharp focus, warm farmhouse kitchen softly out of focus, cinematic shallow depth of field, ultra-detailed fur texture, photorealistic" \ |
| --ldm_inference_steps 28 --save_xt_steps 24 \ |
| --output_dir ./results/official_demo/flux \ |
| --pid_inference_steps 4 |
| ``` |
| |
| Pick the `2kto4k` variant via `--pid_ckpt_type 2kto4k` when decoding at 4K. |
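When launching the demo from Python rather than a shell, the command line can be assembled as an argument list. The sketch below uses only the flags that appear in this README; `build_demo_argv` is a hypothetical helper, its step defaults are copied from the Flux example and may not suit other backbones, and whether 4K decoding needs further flags is not covered here.

```python
import shlex
from typing import List, Optional


def build_demo_argv(
    backbone: str,
    prompt: str,
    output_dir: str,
    ldm_steps: int = 28,      # value from the Flux example above
    save_xt_steps: int = 24,  # value from the Flux example above
    pid_steps: int = 4,       # checkpoints are 4-step distilled
    pid_ckpt_type: Optional[str] = None,
) -> List[str]:
    """Assemble the from_ldm demo command as an argv list."""
    argv = [
        "python", "-m", "pid._src.inference.from_ldm",
        "--backbone", backbone,
        "--prompt", prompt,
        "--ldm_inference_steps", str(ldm_steps),
        "--save_xt_steps", str(save_xt_steps),
        "--output_dir", output_dir,
        "--pid_inference_steps", str(pid_steps),
    ]
    # Only pass the variant flag when the caller asks for one explicitly.
    if pid_ckpt_type is not None:
        argv += ["--pid_ckpt_type", pid_ckpt_type]
    return argv


# Print a shell-safe version; run it from the repo root with PYTHONPATH=. set.
print(shlex.join(build_demo_argv("flux", "a tabby cat", "./results/demo", pid_ckpt_type="2kto4k")))
```

The resulting list can be handed to `subprocess.run(argv, env={**os.environ, "PYTHONPATH": "."})`, which avoids quoting problems with long prompts.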
| |
| ## Citation |
| ```bibtex
| @article{lu2026pid, |
| title={PiD: Fast and High-Resolution Latent Decoding with Pixel Diffusion}, |
| author={Lu, Yifan and Wu, Qi and Wu, Jay Zhangjie and Wang, Zian and Ling, Huan and Fidler, Sanja and Ren, Xuanchi}, |
| journal={arXiv preprint arXiv:2605.23902}, |
| year={2026} |
| } |
| ``` |