data-archetype/capacitor_decoder

Capacitor decoder: a faster, lighter FLUX.2-compatible latent decoder built on the SemDisDiffAE architecture.

Decode Speed

| Resolution | Speedup vs FLUX.2 | Peak VRAM Reduction | capacitor_decoder (ms/image) | FLUX.2 VAE (ms/image) | capacitor_decoder peak VRAM | FLUX.2 peak VRAM |
|---|---|---|---|---|---|---|
| 512x512 | 1.85x | 59.3% | 11.40 | 21.14 | 391.6 MiB | 961.9 MiB |
| 1024x1024 | 3.28x | 79.1% | 26.31 | 86.24 | 601.4 MiB | 2876.4 MiB |
| 2048x2048 | 4.70x | 86.4% | 86.29 | 405.84 | 1437.4 MiB | 10531.4 MiB |

These measurements are decode-only. Each image is first encoded once with the same FLUX.2 encoder, latents are cached in memory, and then both decoders are timed over the same cached latent set.
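The decode-only protocol above (encode once, cache latents, time each decoder on the same cached set) can be sketched with a minimal timing harness. `decode_fn` is a hypothetical stand-in for either decoder's decode call; the CUDA synchronization before reading the clock is what makes the GPU work actually count toward the measurement:

```python
import time

import torch


@torch.inference_mode()
def time_decode(decode_fn, cached_latents, warmup: int = 3, iters: int = 20) -> float:
    """Mean decode latency in ms over one cached latent batch."""
    # Warm up kernels/allocator so the timed loop measures steady-state decode.
    for _ in range(warmup):
        decode_fn(cached_latents)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        decode_fn(cached_latents)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters * 1000.0
```

Peak VRAM can be read the same way with `torch.cuda.reset_peak_memory_stats()` before the loop and `torch.cuda.max_memory_allocated()` after it.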

2k PSNR Benchmark

| Model | Mean PSNR (dB) | Std (dB) | Median (dB) | Min (dB) | P5 (dB) | P95 (dB) | Max (dB) |
|---|---|---|---|---|---|---|---|
| FLUX.2 VAE | 36.28 | 4.53 | 36.07 | 22.73 | 28.89 | 43.63 | 47.38 |
| capacitor_decoder | 36.34 | 4.50 | 36.29 | 23.28 | 29.06 | 43.66 | 47.43 |

| Delta vs FLUX.2 | Mean (dB) | Std (dB) | Median (dB) | Min (dB) | P5 (dB) | P95 (dB) | Max (dB) |
|---|---|---|---|---|---|---|---|
| capacitor_decoder - FLUX.2 | 0.055 | 0.531 | 0.062 | -1.968 | -0.811 | 0.886 | 2.807 |

Evaluated on 2000 validation images: roughly 2/3 photographs and 1/3 book covers. Each image is encoded once with FLUX.2 and reused for both decoders.
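For reference, PSNR over images in `[-1, 1]` is computed from the per-image MSE with a data range of 2.0. A minimal sketch (the benchmark's exact clipping/quantization choices are not specified here, so this is an assumption of the simplest convention):

```python
import torch


def psnr(x: torch.Tensor, y: torch.Tensor, data_range: float = 2.0) -> float:
    """PSNR in dB between two images in [-1, 1] (data_range = max - min = 2)."""
    mse = torch.mean((x.float() - y.float()) ** 2)
    return float(10.0 * torch.log10(data_range ** 2 / mse))
```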


Usage

```python
import torch
from diffusers.models import AutoencoderKLFlux2

from capacitor_decoder import CapacitorDecoder, CapacitorDecoderInferenceConfig


def flux2_patchify_and_whiten(
    latents: torch.Tensor,
    vae: AutoencoderKLFlux2,
) -> torch.Tensor:
    b, c, h, w = latents.shape
    if h % 2 != 0 or w % 2 != 0:
        raise ValueError(f"Expected even FLUX.2 latent grid, got H={h}, W={w}")
    # Fold each 2x2 spatial patch into channels: [B, C, H, W] -> [B, 4C, H/2, W/2].
    z = latents.reshape(b, c, h // 2, 2, w // 2, 2)
    z = z.permute(0, 1, 3, 5, 2, 4).reshape(b, c * 4, h // 2, w // 2)
    # Whiten with the VAE's stored BatchNorm running statistics.
    mean = vae.bn.running_mean.view(1, -1, 1, 1).to(device=z.device, dtype=torch.float32)
    var = vae.bn.running_var.view(1, -1, 1, 1).to(device=z.device, dtype=torch.float32)
    std = torch.sqrt(var + float(vae.config.batch_norm_eps))
    return (z.to(torch.float32) - mean) / std


device = "cuda"
flux2 = AutoencoderKLFlux2.from_pretrained(
    "BiliSakura/VAEs",
    subfolder="FLUX2-VAE",
    torch_dtype=torch.bfloat16,
).to(device)
decoder = CapacitorDecoder.from_pretrained(
    "data-archetype/capacitor_decoder",
    device=device,
    dtype=torch.bfloat16,
)

image = ...  # [1, 3, H, W] in [-1, 1], with H and W divisible by 16

with torch.inference_mode():
    posterior = flux2.encode(image.to(device=device, dtype=torch.bfloat16))
    latent_mean = posterior.latent_dist.mean

    # Default path: match the usual FLUX.2 convention.
    # Whiten here, then let capacitor_decoder unwhiten internally before decode.
    latents = flux2_patchify_and_whiten(latent_mean, flux2)
    recon = decoder.decode(
        latents,
        height=int(image.shape[-2]),
        width=int(image.shape[-1]),
        inference_config=CapacitorDecoderInferenceConfig(num_steps=1),
    )
```

Whitening and dewhitening are optional, but they must stay consistent. The default above matches the usual FLUX.2 pipeline behavior. If your upstream path already gives you raw patchified decoder-space latents instead, skip whitening upstream and call decode(..., latents_are_flux2_whitened=False).
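The consistency requirement comes down to the fact that the decoder's internal unwhitening is the exact inverse of the upstream whitening, so applying whitening zero times or twice on one side corrupts the latents. A self-contained sketch of the round trip (standalone functions standing in for the BN-stats transform above):

```python
import torch


def whiten(z: torch.Tensor, mean: torch.Tensor, var: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Normalize latents with fixed running stats, as done before decode."""
    return (z - mean) / torch.sqrt(var + eps)


def unwhiten(z_w: torch.Tensor, mean: torch.Tensor, var: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Exact inverse of whiten(); the decoder applies this internally by default."""
    return z_w * torch.sqrt(var + eps) + mean
```

If both sides agree on the stats and eps, `unwhiten(whiten(z)) == z` up to floating-point error; mismatched conventions (e.g. whitened input with `latents_are_flux2_whitened=False`) break this identity.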

Details

  • Default input contract: FLUX.2 patchified latents with FLUX.2 BN whitening still applied.
  • Default decoder behavior: unwhiten with saved FLUX.2 BN running stats, then decode.
  • Optional raw-latent mode: disable whitening upstream and call decode(..., latents_are_flux2_whitened=False).
  • Reused decoder architecture: SemDisDiffAE
  • Technical report
  • SemDisDiffAE technical report
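The patchified-latent contract in the bullets above can be sanity-checked with a self-contained round trip: `patchify` repeats the reshape/permute pattern from `flux2_patchify_and_whiten` (minus the BN whitening), and `unpatchify` is its exact inverse:

```python
import torch


def patchify(z: torch.Tensor) -> torch.Tensor:
    """[B, C, H, W] -> [B, 4C, H/2, W/2]: fold each 2x2 spatial patch into channels."""
    b, c, h, w = z.shape
    z = z.reshape(b, c, h // 2, 2, w // 2, 2)
    return z.permute(0, 1, 3, 5, 2, 4).reshape(b, c * 4, h // 2, w // 2)


def unpatchify(z: torch.Tensor) -> torch.Tensor:
    """[B, 4C, H/2, W/2] -> [B, C, H, W]: exact inverse of patchify()."""
    b, c4, h2, w2 = z.shape
    c = c4 // 4
    z = z.reshape(b, c, 2, 2, h2, w2)
    return z.permute(0, 1, 4, 2, 5, 3).reshape(b, c, h2 * 2, w2 * 2)
```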

Citation

@misc{capacitor_decoder,
  title   = {Capacitor Decoder: A Faster, Lighter FLUX.2-Compatible Latent Decoder},
  author  = {data-archetype},
  email   = {data-archetype@proton.me},
  year    = {2026},
  month   = apr,
  url     = {https://huggingface.co/data-archetype/capacitor_decoder},
}