	---
	library_name: pytorch
	tags:
	- super-resolution
	- diffusion
	- pixel-diffusion-decoder
	- vae-decoder
	pipeline_tag: image-to-image
	base_model:
	- nvidia/PixelDiT-1300M-1024px
	- Tongyi-MAI/Z-Image
	- black-forest-labs/FLUX.1-dev
	- black-forest-labs/FLUX.2-dev
	- nyu-visionx/Scale-RAE-Qwen7B_DiT9.8B
	---

	# PiD — Pixel Diffusion Decoder

	<p align="center">
	<img src="figures/teaser.jpg" alt="PiD teaser" width="100%">
	</p>


	[Paper](https://arxiv.org/abs/2605.23902), [Project Page](https://research.nvidia.com/labs/sil/projects/pid/)

	[Yifan Lu](https://yifanlu0227.github.io/),
	[Qi Wu](https://wilsoncernwq.github.io/),
	[Jay Zhangjie Wu](https://zhangjiewu.github.io/),
	[Zian Wang](https://www.cs.toronto.edu/~zianwang/),
	[Huan Ling](https://www.cs.toronto.edu/~linghuan/),
	[Sanja Fidler](https://www.cs.utoronto.ca/~fidler/),
	[Xuanchi Ren](https://xuanchiren.com/) <br>


	PiD reformulates the latent-to-pixel decoder as a conditional pixel-space
	diffusion model, unifying decoding and upsampling into a single generative
	module. It denoises directly in high-resolution pixel space and produces a
	super-resolved image in one pass. This repository hosts the released decoder
	checkpoints, plus the encoder/decoder ("VAE") weights they depend on.

	All `PiD_` checkpoints in this repo are 4-step distilled. The non-`PiD_`
	entries (`ae.safetensors`, `flux2_ae.safetensors`, `sdxl_vae.safetensors`, `QwenImage_VAE_2d.pth`, `sd3_vae/`, `rae/`,
	`scale_rae/`) are the corresponding encoder/decoder VAE weights that PiD
	plugs into — they're not PiD checkpoints themselves.

	### License/Terms of Use

	This model is released under the [NSCLv1](https://huggingface.co/nvidia/PixelDiT-1300M-1024px/blob/main/LICENSE) License. The work and any derivative works may only be used for non-commercial (research or evaluation) purposes.

	### Deployment Geography
	Global

	## PiD checkpoints

	Two variants are released. Not every latent space ships both; the table below
	lists what is available:

	- `2k`: trained at 2048 px and used as a 4× decoder (512 px LDM → 2048 px), or as
	an 8× decoder for the Scale-RAE backbone (256 px → 2048 px).
	- `2kto4k`: trained with multi-resolution data bucketing (2048 → 4096 px) and an
	SD3-style dynamic shift; designed for 1024 px LDM → 4K (4096 px) decoding.

	Both checkpoint variants support multiple aspect ratios.
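
As a quick sanity check on the sizes quoted above, the output resolution is the LDM's native image size multiplied by the SR factor. The helper below is only an illustration of that arithmetic; `decoded_size` is a hypothetical name and not part of the PiD codebase.

```python
def decoded_size(ldm_size, sr_factor):
    """Return the (height, width) in pixels after PiD decoding.

    ldm_size:  (height, width) in pixels that the LDM was sampled at.
    sr_factor: integer upsampling factor (4 or 8 for the released checkpoints).
    """
    height, width = ldm_size
    return height * sr_factor, width * sr_factor


print(decoded_size((512, 512), 4))    # 2k variant, 4x decoder
print(decoded_size((256, 256), 8))    # 2k variant, Scale-RAE 8x decoder
print(decoded_size((1024, 1024), 4))  # 2kto4k variant
```

The three calls correspond to the 512 → 2048, 256 → 2048, and 1024 → 4096 cases described in the list above.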

	| Path | Latent space | SR factor | Variant |
	|------|--------------|-----------|---------|
	| `checkpoints/PiD_res2k_sr4x_official_flux_distill_4step` | Flux1-dev | 4× | 2k |
	| `checkpoints/PiD_res2k_sr4x_official_flux2_distill_4step` | Flux2-dev | 4× | 2k |
	| `checkpoints/PiD_res2k_sr4x_official_sd3_distill_4step` | SD3 medium | 4× | 2k |
	| `checkpoints/PiD_res2k_sr4x_official_dinov2_distill_4step` | DINOv2-B | 4× | 2k |
	| `checkpoints/PiD_res2k_sr8x_official_siglip_distill_4step` | SigLIP-2 | 8× | 2k |
	| `checkpoints/PiD_res2kto4k_sr4x_official_flux_distill_4step` | Flux1-dev | 4× | 2kto4k |
	| `checkpoints/PiD_res2kto4k_sr4x_official_flux2_distill_4step_2606` | Flux2-dev | 4× | 2kto4k |
	| `checkpoints/PiD_res2kto4k_sr4x_official_sd3_distill_4step` | SD3 medium | 4× | 2kto4k |
	| `checkpoints/PiD_res2kto4k_sr4x_official_sdxl_distill_4step` | SDXL | 4× | 2kto4k |
	| `checkpoints/PiD_res2kto4k_sr4x_official_qwenimage_distill_4step` | Qwen-Image | 4× | 2kto4k |

	Each directory contains a single file, `model_ema_bf16.pth`, which is the EMA
	weights cast to bfloat16 — the format the inference scripts load by default.
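
The directory names in the table follow a visible pattern: `res<variant>`, `sr<factor>x`, a latent-space tag, the step count, and an optional numeric suffix such as `_2606`. Here is a minimal sketch of reading those fields back out, assuming the pattern holds for every released checkpoint. `parse_checkpoint_dir` and the regular expression are hypothetical and do not come from the PiD codebase.

```python
import re
from pathlib import PurePosixPath

# Assumed naming pattern, inferred from the directory names listed in this card.
_NAME_RE = re.compile(
    r"^PiD_res(?P<variant>2kto4k|2k)_sr(?P<sr>\d+)x_official_"
    r"(?P<latent>[a-z0-9]+)_distill_(?P<steps>\d+)step(?:_(?P<rev>\d+))?$"
)


def parse_checkpoint_dir(path):
    """Split a PiD checkpoint directory name into its visible fields."""
    directory = PurePosixPath(path)
    match = _NAME_RE.match(directory.name)
    if match is None:
        raise ValueError(f"not a recognised PiD checkpoint directory: {path!r}")
    return {
        "variant": match["variant"],        # "2k" or "2kto4k"
        "sr_factor": int(match["sr"]),      # 4 or 8
        "latent_tag": match["latent"],      # e.g. "flux", "sd3", "siglip"
        "steps": int(match["steps"]),       # distilled step count
        "revision": match["rev"],           # e.g. "2606", or None if absent
        # Each directory holds exactly this one weights file.
        "weights_file": str(directory / "model_ema_bf16.pth"),
    }


info = parse_checkpoint_dir(
    "checkpoints/PiD_res2kto4k_sr4x_official_flux2_distill_4step_2606"
)
print(info["variant"], info["sr_factor"], info["latent_tag"], info["revision"])
```

Paths that do not match the pattern, including the VAE weights such as `checkpoints/ae.safetensors`, raise `ValueError` rather than being guessed at.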

	> ⚠️ Flux2-dev `2kto4k`: use the new `_2606` checkpoint. The previous
	> `PiD_res2kto4k_sr4x_official_flux2_distill_4step` (without the `_2606` suffix)
	> suffered from color drift.
	> `PiD_res2kto4k_sr4x_official_flux2_distill_4step_2606` fixes it and replaces
	> the old checkpoint, which should no longer be used. See the
	> [comparison](https://github.com/nv-tlabs/pid/blob/main/docs/FLUX2_2kto4k_new_ckpt_compare.md)
	> for details.

	### Latent space → compatible LDMs

	A PiD decoder is tied to a latent space, not to a single generative model. Any
	LDM that produces latents in that space can reuse the same checkpoint. The
	`--backbone` aliases below select the LDM pipeline; every alias in a row decodes
	through that latent space's checkpoint from the table above.

	| Latent space | VAE / vision encoder weights | Compatible `--backbone` | Corresponding LDMs |
	|--------------|------------------------------|-------------------------|--------------------|
	| Flux1-dev | `checkpoints/ae.safetensors` | `flux`, `zimage`, `zimage-turbo` | [FLUX.1-dev](https://huggingface.co/black-forest-labs/FLUX.1-dev), [Z-Image](https://huggingface.co/Tongyi-MAI/Z-Image), [Z-Image-Turbo](https://huggingface.co/Tongyi-MAI/Z-Image-Turbo) |
	| Flux2-dev | `checkpoints/flux2_ae.safetensors` | `flux2`, `flux2-klein-4b`, `flux2-klein-9b` | [FLUX.2-dev](https://huggingface.co/black-forest-labs/FLUX.2-dev), [FLUX.2-klein-4B](https://huggingface.co/black-forest-labs/FLUX.2-klein-4B), [FLUX.2-klein-9B](https://huggingface.co/black-forest-labs/FLUX.2-klein-9B) |
	| SD3 medium | `checkpoints/sd3_vae/` | `sd3` | [SD3-medium](https://huggingface.co/stabilityai/stable-diffusion-3-medium-diffusers) |
	| SDXL | `checkpoints/sdxl_vae.safetensors` | `sdxl` | [SDXL-base-1.0](https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0) |
	| Qwen-Image | `checkpoints/QwenImage_VAE_2d.pth` | `qwenimage`, `qwenimage-2512` | [Qwen-Image](https://huggingface.co/Qwen/Qwen-Image), [Qwen-Image-2512](https://huggingface.co/Qwen/Qwen-Image-2512) |
	| DINOv2-B | `checkpoints/rae/` | `dinov2` | [RAE](https://github.com/bytetriper/RAE) (class-conditional; DINOv2-B) |
	| SigLIP-2 | `checkpoints/scale_rae/` | `siglip` | [Scale-RAE](https://github.com/ZitengWangNYU/Scale-RAE) (text-conditional; nyu-visionx/Scale-RAE-Qwen1.5B_DiT2.4B) |

	For example, Z-Image and Z-Image-Turbo share Flux1-dev's VAE, so they reuse the
	`flux` checkpoints (both `2k` and `2kto4k`) — no separate `zimage` checkpoint is
	shipped. Likewise `qwenimage-2512` reuses the `qwenimage` decoder (same VAE,
	different transformer).
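
The two tables in this card can be combined into a small lookup that goes from a `--backbone` alias to its latent space and then to a checkpoint directory. The sketch below is a hand transcription of those tables, not the contents of `checkpoint_registry.py`. `resolve_checkpoint`, `BACKBONE_TO_LATENT`, and `CHECKPOINTS` are hypothetical names, and the short latent-space keys are simply the tags that appear in the directory names.

```python
# Illustrative transcription of this card's tables. The authoritative mapping
# lives in the PiD codebase, not here.
BACKBONE_TO_LATENT = {
    "flux": "flux", "zimage": "flux", "zimage-turbo": "flux",
    "flux2": "flux2", "flux2-klein-4b": "flux2", "flux2-klein-9b": "flux2",
    "sd3": "sd3",
    "sdxl": "sdxl",
    "qwenimage": "qwenimage", "qwenimage-2512": "qwenimage",
    "dinov2": "dinov2",
    "siglip": "siglip",
}

CHECKPOINTS = {
    ("flux", "2k"): "checkpoints/PiD_res2k_sr4x_official_flux_distill_4step",
    ("flux2", "2k"): "checkpoints/PiD_res2k_sr4x_official_flux2_distill_4step",
    ("sd3", "2k"): "checkpoints/PiD_res2k_sr4x_official_sd3_distill_4step",
    ("dinov2", "2k"): "checkpoints/PiD_res2k_sr4x_official_dinov2_distill_4step",
    ("siglip", "2k"): "checkpoints/PiD_res2k_sr8x_official_siglip_distill_4step",
    ("flux", "2kto4k"): "checkpoints/PiD_res2kto4k_sr4x_official_flux_distill_4step",
    ("flux2", "2kto4k"): "checkpoints/PiD_res2kto4k_sr4x_official_flux2_distill_4step_2606",
    ("sd3", "2kto4k"): "checkpoints/PiD_res2kto4k_sr4x_official_sd3_distill_4step",
    ("sdxl", "2kto4k"): "checkpoints/PiD_res2kto4k_sr4x_official_sdxl_distill_4step",
    ("qwenimage", "2kto4k"): "checkpoints/PiD_res2kto4k_sr4x_official_qwenimage_distill_4step",
}


def resolve_checkpoint(backbone, ckpt_type):
    """Map a --backbone alias and variant ("2k" or "2kto4k") to a directory."""
    if backbone not in BACKBONE_TO_LATENT:
        raise ValueError(f"unknown backbone alias: {backbone!r}")
    latent = BACKBONE_TO_LATENT[backbone]
    key = (latent, ckpt_type)
    if key not in CHECKPOINTS:
        raise ValueError(f"no {ckpt_type!r} checkpoint listed for {latent!r}")
    return CHECKPOINTS[key]


# Z-Image-Turbo shares the Flux1-dev latent space, so it resolves to a flux checkpoint.
print(resolve_checkpoint("zimage-turbo", "2kto4k"))
```

Combinations that the tables do not list, such as `sdxl` with `2k`, raise `ValueError` instead of falling back to another variant.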

	## Usage

	The decoder checkpoints are loaded by the inference scripts in the [PiD
	codebase](https://github.com/nv-tlabs/pid). The single source of truth for the
	`(backbone, ckpt_type) → path` mapping is
	[`pid/_src/inference/checkpoint_registry.py`](https://github.com/nv-tlabs/PiD/blob/main/pid/_src/inference/checkpoint_registry.py).
	Clone the repo, download this snapshot's `checkpoints/` tree into the repo root,
	and the demos pick the right file automatically:

	```bash
	# Pull just the checkpoints/ tree into the repo root (skips this README and
	# the teaser figure so they don't clobber the files in the source repo).
	hf download nvidia/PiD --local-dir . --include "checkpoints/*"

	# Then run any of the demos, e.g.:
	PYTHONPATH=. python -m pid._src.inference.from_ldm --backbone flux \
	--prompt "A photorealistic half-body portrait of a brown tabby cat with bold stripes sitting attentively on a rustic wooden kitchen table, soft morning light streaming sideways through a large window, fine fur detail and stripe patterns sharply visible, intense amber-green eyes in razor-sharp focus, warm farmhouse kitchen softly out of focus, cinematic shallow depth of field, ultra-detailed fur texture, photorealistic" \
	--ldm_inference_steps 28 --save_xt_steps 24 \
	--output_dir ./results/official_demo/flux \
	--pid_inference_steps 4
	```

	Pick the `2kto4k` variant via `--pid_ckpt_type 2kto4k` when decoding at 4K.
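
If you drive the demo from Python, for instance to sweep over backbones, the command line above can be assembled as an argument list. This is a sketch that uses only the flags shown in this card. `build_from_ldm_command` is a hypothetical helper, and its step defaults are copied from the `flux` example, so they may not suit other backbones. A 4K run may also need flags that this card does not list.

```python
import shlex


def build_from_ldm_command(backbone, prompt, output_dir, ckpt_type=None,
                           ldm_steps=28, save_xt_steps=24, pid_steps=4):
    """Assemble argv for the from_ldm demo using flags documented in this card."""
    argv = [
        "python", "-m", "pid._src.inference.from_ldm",
        "--backbone", backbone,
        "--prompt", prompt,
        "--ldm_inference_steps", str(ldm_steps),
        "--save_xt_steps", str(save_xt_steps),
        "--output_dir", output_dir,
        "--pid_inference_steps", str(pid_steps),
    ]
    # Only pass --pid_ckpt_type when a variant is requested explicitly, so the
    # script's own default applies otherwise.
    if ckpt_type is not None:
        argv += ["--pid_ckpt_type", ckpt_type]
    return argv


argv = build_from_ldm_command(
    "flux", "a brown tabby cat on a wooden table",
    "./results/official_demo/flux_4k", ckpt_type="2kto4k",
)
# Print a copy-pasteable shell line; run it from the repo root.
print("PYTHONPATH=. " + shlex.join(argv))
```

To execute the command rather than print it, pass the list to `subprocess.run(argv, check=True, env={**os.environ, "PYTHONPATH": "."})` from the repo root.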

	## Citation
	```bibtex
	@article{lu2026pid,
	title={PiD: Fast and High-Resolution Latent Decoding with Pixel Diffusion},
	author={Lu, Yifan and Wu, Qi and Wu, Jay Zhangjie and Wang, Zian and Ling, Huan and Fidler, Sanja and Ren, Xuanchi},
	journal={arXiv preprint arXiv:2605.23902},
	year={2026}
	}
	```