---
license: apache-2.0
tags:
- diffusion
- autoencoder
- image-reconstruction
- image-tokenizer
- pytorch
- fcdm
- semantic-alignment
library_name: fcdm_diffae
---
# data-archetype/semdisdiffae_p32
### Version History
| Date | Change |
|------|--------|
| 2026-04-10 | Refresh standalone package: fix bf16 RMSNorm precision path in both encoder and decoder to match training code; local export tooling now preserves fp32 EMA weights for future re-exports |
| 2026-04-08 | Initial release |
**Experimental patch-32 version** of
[SemDisDiffAE](https://huggingface.co/data-archetype/semdisdiffae).
This model extends the patch-16 SemDisDiffAE with a 2x2 bottleneck
patchification after the encoder, producing **512-channel latents at H/32 x W/32**
instead of the base model's 128-channel latents at H/16 x W/16. The decoder
unpatchifies back to 128ch before reconstruction.
See the [patch-16 SemDisDiffAE model card](https://huggingface.co/data-archetype/semdisdiffae)
and its [technical report](https://huggingface.co/data-archetype/semdisdiffae/blob/main/technical_report_semantic.md)
for full architectural details. The [p32 technical report](technical_report_p32.md)
covers only the differences.
## Architecture
| Property | p32 (this model) | p16 (base) |
|----------|-----------------|------------|
| Latent channels | 512 | 128 |
| Effective patch | 32 | 16 |
| Latent grid | H/32 x W/32 | H/16 x W/16 |
| Encoder patch | 16 (same) | 16 |
| Bottleneck patchify | 2x2 | none |
| Parameters | 88.8M (same) | 88.8M |
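The 2x2 bottleneck patchify is a space-to-depth rearrangement: each 2x2 spatial block of the 128-channel p16-style latent grid is folded into the channel dimension, yielding 512 channels at half the spatial resolution, and the decoder applies the inverse before reconstruction. A minimal sketch of the shape bookkeeping, using PyTorch's `pixel_unshuffle`/`pixel_shuffle` as stand-ins (the model's actual channel ordering and any learned projections may differ):

```python
import torch
import torch.nn.functional as F

def patchify_2x2(x):
    # [B, C, H, W] -> [B, 4*C, H/2, W/2]: fold each 2x2 spatial block into channels
    return F.pixel_unshuffle(x, 2)

def unpatchify_2x2(x):
    # Inverse: [B, 4*C, H/2, W/2] -> [B, C, H, W]
    return F.pixel_shuffle(x, 2)

# A 256x256 input gives a 16x16 grid of 128-channel features at stride 16 ...
z16 = torch.randn(1, 128, 16, 16)
# ... which the bottleneck patchify turns into an 8x8 grid of 512 channels
z32 = patchify_2x2(z16)
assert z32.shape == (1, 512, 8, 8)
# The rearrangement is lossless: unpatchify recovers the original tensor
assert torch.allclose(unpatchify_2x2(z32), z16)
```

This makes the "same parameter count" row above plausible: the patchify/unpatchify steps themselves are parameter-free reshapes.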
## Quick Start
```python
import torch
from fcdm_diffae import FCDMDiffAE

model = FCDMDiffAE.from_pretrained("data-archetype/semdisdiffae_p32", device="cuda")

# Example input: a [B, 3, H, W] batch scaled to [-1, 1]
images = torch.rand(1, 3, 256, 256, device="cuda") * 2 - 1
_, _, H, W = images.shape

# Encode — returns whitened 512-channel latents at H/32 x W/32
latents = model.encode(images)  # [B, 3, H, W] -> [B, 512, H/32, W/32]

# Decode latents back to image space
recon = model.decode(latents, height=H, width=W)

# Or encode and decode in one call
recon = model.reconstruct(images)
```
## Training
Same losses and hyperparameters as the base SemDisDiffAE (DINOv2 semantic
alignment, VP posterior variance expansion, latent scale penalty). Trained
for 275k steps. See the
[base model training section](https://huggingface.co/data-archetype/semdisdiffae/blob/main/technical_report_semantic.md#6-training)
for details.
## Dependencies
- PyTorch >= 2.0
- safetensors
## Citation
```bibtex
@misc{semdisdiffae,
  title  = {SemDisDiffAE: A Semantically Disentangled Diffusion Autoencoder},
  author = {data-archetype},
  email  = {data-archetype@proton.me},
  year   = {2026},
  month  = apr,
  url    = {https://huggingface.co/data-archetype/semdisdiffae},
}
```
## License
Apache 2.0