---
license: apache-2.0
tags:
  - diffusion
  - autoencoder
  - image-reconstruction
  - image-tokenizer
  - pytorch
  - fcdm
  - semantic-alignment
library_name: fcdm_diffae
---

# data-archetype/semdisdiffae_p32

### Version History

| Date | Change |
|------|--------|
| 2026-04-10 | Refresh standalone package: fix bf16 RMSNorm precision path in both encoder and decoder to match training code; local export tooling now preserves fp32 EMA weights for future re-exports |
| 2026-04-08 | Initial release |

**Experimental patch-32 version** of
[SemDisDiffAE](https://huggingface.co/data-archetype/semdisdiffae).

This model extends the patch-16 SemDisDiffAE with a 2x2 bottleneck
patchification after the encoder, producing **512-channel latents at H/32 x W/32**
instead of the base model's 128-channel latents at H/16 x W/16. The decoder
unpatchifies back to 128ch before reconstruction.

See the [patch-16 SemDisDiffAE model card](https://huggingface.co/data-archetype/semdisdiffae)
and its [technical report](https://huggingface.co/data-archetype/semdisdiffae/blob/main/technical_report_semantic.md)
for full architectural details. The [p32 technical report](technical_report_p32.md)
covers only the differences.

## Architecture

| Property | p32 (this model) | p16 (base) |
|----------|-----------------|------------|
| Latent channels | 512 | 128 |
| Effective patch | 32 | 16 |
| Latent grid | H/32 x W/32 | H/16 x W/16 |
| Encoder patch | 16 (same) | 16 |
| Bottleneck patchify | 2x2 | none |
| Parameters | 88.8M (same) | 88.8M |
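
The 2x2 bottleneck patchify above is equivalent to folding each 2x2 spatial block of the p16 latent into channels, which quadruples the channel count (128 -> 512) while halving each spatial dimension (H/16 -> H/32). A minimal sketch using PyTorch's `pixel_unshuffle`/`pixel_shuffle` (the actual module names inside `fcdm_diffae` may differ):

```python
import torch
import torch.nn.functional as F

# p16 encoder output for a 256x256 input: 128 channels at 16x16 (H/16 x W/16)
enc_latent = torch.randn(1, 128, 16, 16)

# Bottleneck patchify: fold each 2x2 spatial block into channels
# -> 512 channels at 8x8 (H/32 x W/32)
p32_latent = F.pixel_unshuffle(enc_latent, downscale_factor=2)
assert p32_latent.shape == (1, 512, 8, 8)

# Decoder-side unpatchify: exact inverse, back to 128 channels at H/16 x W/16
restored = F.pixel_shuffle(p32_latent, upscale_factor=2)
assert torch.equal(restored, enc_latent)
```

Because the two ops are exact inverses, the round trip is lossless; the information is only rearranged between the spatial and channel axes, which is consistent with the parameter count staying at 88.8M.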

## Quick Start

```python
from fcdm_diffae import FCDMDiffAE

model = FCDMDiffAE.from_pretrained("data-archetype/semdisdiffae_p32", device="cuda")

# Encode — returns whitened 512-channel latents at H/32 x W/32
latents = model.encode(images)  # images: [B,3,H,W] in [-1,1] -> [B,512,H/32,W/32]

# Decode (H and W are the original image height and width, divisible by 32)
recon = model.decode(latents, height=H, width=W)

# Or encode + decode in one call
recon = model.reconstruct(images)
```

## Training

Same losses and hyperparameters as the base SemDisDiffAE (DINOv2 semantic
alignment, VP posterior variance expansion, latent scale penalty). Trained
for 275k steps. See the
[base model training section](https://huggingface.co/data-archetype/semdisdiffae/blob/main/technical_report_semantic.md#6-training)
for details.

## Dependencies

- PyTorch >= 2.0
- safetensors

## Citation

```bibtex
@misc{semdisdiffae,
  title   = {SemDisDiffAE: A Semantically Disentangled Diffusion Autoencoder},
  author  = {data-archetype},
  email   = {data-archetype@proton.me},
  year    = {2026},
  month   = apr,
  url     = {https://huggingface.co/data-archetype/semdisdiffae},
}
```

## License

Apache 2.0