Anthos
Anthos is a class-conditional latent diffusion model trained with rectified flow matching on the Oxford Flowers 102 dataset. It generates 256x256 images across 102 flower categories using a DiT-Nano/2 architecture of approximately 984K parameters. It is a research artifact: a minimal, transparent, end-to-end training and sampling demonstration.
Notice
Anthos is a research prototype. It is not Stable Diffusion, does not include a text encoder, safety filter, or upscaler, and operates exclusively over the 102 Oxford Flowers class vocabulary. Output quality reflects the scale of the model. Use accordingly.
At a Glance
| Property | Value |
|---|---|
| Parameters | 983,808 |
| Architecture | DiT-Nano/2 (6 blocks, hidden dim 96, 4 heads, patch size 2, SwiGLU) |
| Training Steps | 120,000 |
| Training Duration | ~18 minutes on an RTX Pro 6000 |
| Precision | bfloat16 |
| Output Resolution | 256 x 256 |
| Latent Shape | 32 x 32 x 4 |
| Number of Classes | 102 |
| Loss (initial → final) | 1.843 → 0.880 (flow-matching MSE) |
| Sampler | Heun, 50 steps, CFG scale 4.0 |
Background
The name derives from the Greek word for flower. The model was built as a sanity check on a rectified flow training loop and turned into a functional flower generator in the process.
Rather than predicting noise, the network predicts the velocity field transporting a sample from Gaussian noise to the data distribution. The architecture is a standard DiT with adaLN-Zero conditioning, SwiGLU MLPs, and sinusoidal 2D positional embeddings. The latent space is provided by the Stability AI VAE (stabilityai/sd-vae-ft-ema), which compresses 256x256 images to 32x32x4 latents at an 8x spatial downsampling factor.
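The velocity-prediction objective can be sketched in a few lines of NumPy. This is an illustration, not the repository's training code. It assumes the common rectified-flow convention in which t = 0 is pure noise and t = 1 is data, so the path is the straight line x_t = (1 - t) x0 + t x1 and the regression target is x1 - x0. The names `flow_matching_loss` and `velocity_fn` are hypothetical.

```python
import numpy as np


def flow_matching_loss(velocity_fn, x1, t, x0):
    """Rectified-flow MSE for one batch.

    x1: data latents, shape (B, ...); x0: Gaussian noise, same shape;
    t: times in (0, 1), shape (B,). Assumed convention: t=0 noise, t=1 data.
    """
    # Broadcast t over all non-batch dimensions.
    t_b = t.reshape(-1, *([1] * (x1.ndim - 1)))
    # Point on the straight line between noise and data.
    x_t = (1.0 - t_b) * x0 + t_b * x1
    # The velocity along that line is constant in t.
    target = x1 - x0
    pred = velocity_fn(x_t, t)
    return float(np.mean((pred - target) ** 2))


rng = np.random.default_rng(0)
x1 = rng.standard_normal((8, 4, 32, 32))   # random stand-in for VAE latents
x0 = rng.standard_normal(x1.shape)         # Gaussian noise
t = rng.uniform(0.0, 1.0, size=8)

# A predictor that outputs zeros scores the mean squared target.
zero_loss = flow_matching_loss(lambda x, s: np.zeros_like(x), x1, t, x0)
```

In a real training step `velocity_fn` would be the DiT, also conditioned on the class label, and `t` would come from the timestep sampler rather than a uniform draw.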
The entire Oxford Flowers 102 dataset, including train, validation, and test splits (8,189 images), was encoded once through the VAE, augmented with horizontal flips to yield 16,378 latents, and stored in VRAM as BF16 channels-last tensors. A custom GPULatentLoader shuffles and batches directly from VRAM, reducing each training step to a forward pass and an optimizer update.
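The idea behind an in-memory loader can be sketched as follows. The class below is a hypothetical CPU stand-in written with NumPy, not the repository's `GPULatentLoader`; a PyTorch version would presumably keep the tensors on the GPU and index them with a device-side permutation. Dropping the last partial batch is an assumption made here for simplicity.

```python
import numpy as np


class InMemoryLatentLoader:
    """Hypothetical stand-in for the GPULatentLoader described above.

    Holds every latent in one array and yields shuffled batches by index,
    so no per-step file I/O or decoding is needed.
    """

    def __init__(self, latents, labels, batch_size, seed=0):
        assert len(latents) == len(labels)
        self.latents = latents
        self.labels = labels
        self.batch_size = batch_size
        self.rng = np.random.default_rng(seed)

    def __iter__(self):
        # One fresh permutation per epoch; the last partial batch is dropped.
        order = self.rng.permutation(len(self.latents))
        full = len(order) - len(order) % self.batch_size
        for start in range(0, full, self.batch_size):
            idx = order[start:start + self.batch_size]
            yield self.latents[idx], self.labels[idx]


# Small shapes keep the demo light; the real set is 16,378 latents of 32x32x4.
latents = np.zeros((1000, 4, 8, 8), dtype=np.float32)
labels = np.arange(1000) % 102
loader = InMemoryLatentLoader(latents, labels, batch_size=256)
batches = list(loader)
```

Because batching is a single fancy-index operation on a resident array, the per-step cost of data loading is negligible next to the forward and backward pass.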
At approximately 111 iterations per second, the 120,000 steps completed in 1,078 seconds, just under 18 minutes. The logged loss fell from 1.843 to 0.880, decreasing at every checkpoint reported in the Loss Curve table.
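The throughput figures are easy to cross-check with a little arithmetic. The snippet below uses only numbers stated in this card (steps, wall time, batch size, latent count) and derives how many passes over the latent set the run amounts to.

```python
steps = 120_000
wall_seconds = 1_078
batch_size = 256
num_latents = 16_378  # 8,189 images x 2 (identity + horizontal flip)

steps_per_second = steps / wall_seconds      # about 111.3
minutes = wall_seconds / 60                  # about 17.97
samples_seen = steps * batch_size            # 30,720,000
passes = samples_seen / num_latents          # about 1,875.7

print(f"{steps_per_second:.1f} steps/s, {minutes:.2f} min, {passes:.1f} passes")
```

In other words, each latent is seen roughly 1,876 times during training, which is worth keeping in mind when judging how much the model may have memorized.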
Sample Output
`sample_grid.png`: a 4x4 class-conditional grid sampled at step 120,000 with the Heun sampler (50 steps, CFG scale 4.0). Each tile corresponds to a distinct Oxford Flowers 102 class.
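For readers unfamiliar with the sampler settings, here is a minimal sketch of Heun's method combined with classifier-free guidance. It is not the code in `modeling.py`; the function names are hypothetical, and the integration direction (t = 0 noise to t = 1 data) is an assumption. The functions use only arithmetic, so they work on floats, NumPy arrays, or tensors.

```python
def cfg_velocity(v_cond, v_uncond, scale):
    """Classifier-free guidance: extrapolate from the unconditional
    prediction toward the conditional one."""
    return v_uncond + scale * (v_cond - v_uncond)


def heun_sample(velocity_fn, x, steps=50, t0=0.0, t1=1.0):
    """Integrate dx/dt = velocity_fn(x, t) from t0 to t1 with Heun's method."""
    h = (t1 - t0) / steps
    for i in range(steps):
        t = t0 + i * h
        v_start = velocity_fn(x, t)
        x_euler = x + h * v_start                 # Euler predictor
        v_end = velocity_fn(x_euler, t + h)
        x = x + 0.5 * h * (v_start + v_end)       # trapezoidal corrector
    return x


# Toy check on dx/dt = x, whose exact solution at t=1 is e * x(0).
approx_e = heun_sample(lambda x, t: x, 1.0, steps=50)
```

With a trained network, `velocity_fn` would wrap two predictions per call, for example `lambda x, t: cfg_velocity(model(x, t, label), model(x, t, null_label), 4.0)`, where `model` and `null_label` are placeholders. Heun evaluates the velocity twice per step, so guided sampling at 50 steps costs up to 200 network evaluations unless the conditional and unconditional passes are batched together.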
Model Specification
| Parameter | Value |
|---|---|
| Architecture | Diffusion Transformer (DiT) |
| Variant | DiT-Nano/2 |
| Depth | 6 blocks |
| Hidden Size | 96 |
| Attention Heads | 4 (head dimension 24) |
| Patch Size | 2 |
| Token Grid | 16 x 16 = 256 tokens |
| MLP Type | SwiGLU, expansion ratio 2.0 |
| Normalization | LayerNorm; adaLN-Zero on block norms |
| Attention | QK-LayerNorm, scaled dot-product attention |
| Conditioning | adaLN-Zero on timestep and class label |
| Class Dropout Rate | 0.1 |
| Class Embedding | 102 classes + 1 null token |
| Positional Embedding | 2D sinusoidal, frozen |
| VAE | stabilityai/sd-vae-ft-ema, 8x downsample, 4 channels |
| VAE Scaling Factor | 0.18215 |
| Output Channels | 4 (velocity prediction; no learned sigma) |
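The token grid and head dimension in the table follow directly from the latent shape and patch size. The sketch below shows one way to split a latent into 2x2 patches and back. It is an illustration in NumPy with a channels-first layout, not the repository's implementation; in a DiT each 16-value patch would then be linearly projected to the hidden size of 96.

```python
import numpy as np


def patchify(latent, patch=2):
    """(C, H, W) latent -> (num_tokens, patch*patch*C) token matrix."""
    c, h, w = latent.shape
    gh, gw = h // patch, w // patch
    x = latent.reshape(c, gh, patch, gw, patch)
    x = x.transpose(1, 3, 2, 4, 0)            # (gh, gw, patch, patch, C)
    return x.reshape(gh * gw, patch * patch * c)


def unpatchify(tokens, channels=4, patch=2):
    """Inverse of patchify for a square token grid."""
    g = int(round(tokens.shape[0] ** 0.5))
    x = tokens.reshape(g, g, patch, patch, channels)
    x = x.transpose(4, 0, 2, 1, 3)            # (C, gh, patch, gw, patch)
    return x.reshape(channels, g * patch, g * patch)


latent = np.arange(4 * 32 * 32, dtype=np.float32).reshape(4, 32, 32)
tokens = patchify(latent)          # 16 x 16 grid -> 256 tokens of 16 values
head_dim = 96 // 4                 # hidden size / attention heads
```

The final layer of the network emits one 16-value vector per token, and an `unpatchify` step like the one above reassembles them into a 4-channel velocity prediction.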
Training Details
| Parameter | Value |
|---|---|
| Dataset | Oxford Flowers 102 (train + val + test, 8,189 images) |
| Augmentation | Identity + horizontal flip = 16,378 latents |
| Latent Storage | Full dataset in VRAM, channels-last BF16 |
| Batch Size | 256 |
| Gradient Accumulation | 1 |
| Optimizer | AdamW (fused), betas=(0.9, 0.95), weight decay=0 |
| Learning Rate | 1e-4, 1,000-step linear warmup, then constant |
| Gradient Clipping | 1.0 |
| EMA Decay | 0.9999 |
| Timestep Sampler | Logit-normal (mu=0, sigma=1) |
| Loss Function | Flow-matching MSE on velocity field |
| CFG Dropout | 0.1 (10% of labels replaced with null token) |
| Precision | BF16 autocast, FP32 reductions |
| Compilation | torch.compile(mode="max-autotune") |
| Hardware | RTX Pro 6000, 96 GB VRAM, sm_120 |
| Throughput | ~111 steps/second |
| Total Wall Time | 1,078 seconds for 120,000 steps |
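Several rows of the table describe small scalar rules that are easy to state in code. The sketch below uses only the standard library and is not taken from the training script. The exact warmup formula, the null-token id of 102, and all function names are assumptions made for illustration.

```python
import math
import random


def learning_rate(step, base_lr=1e-4, warmup_steps=1_000):
    """Linear warmup to base_lr, then constant."""
    return base_lr * min(1.0, (step + 1) / warmup_steps)


def sample_timestep(rng, mu=0.0, sigma=1.0):
    """Logit-normal sample: a Gaussian draw squashed into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-rng.gauss(mu, sigma)))


def drop_label(label, rng, p=0.1, null_id=102):
    """Replace the label with the null token with probability p."""
    return null_id if rng.random() < p else label


def ema_update(ema, param, decay=0.9999):
    """One exponential-moving-average step for a single weight."""
    return decay * ema + (1.0 - decay) * param


rng = random.Random(0)
ts = [sample_timestep(rng) for _ in range(10_000)]
labels = [drop_label(5, rng) for _ in range(10_000)]
dropped = sum(1 for y in labels if y == 102) / len(labels)
```

The logit-normal sampler concentrates timesteps around 0.5, where the interpolated sample is an even mix of noise and data. An EMA decay of 0.9999 corresponds to an averaging horizon of roughly 10,000 steps.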
Loss Curve
Loss was logged throughout training. Selected values are reported below. No FID or Inception Score was computed; evaluation was performed by visual inspection of sample grids saved every 2,000 steps.
| Step | Loss |
|---|---|
| 0 | 1.843 |
| 1,000 | 1.710 |
| 10,000 | 1.310 |
| 50,000 | 1.040 |
| 100,000 | 0.910 |
| 120,000 | 0.880 |
Usage
Python API
```python
from pipeline import AnthosPipeline

pipe = AnthosPipeline(repo_dir=".")

# Generate one image per class across all 102 classes
imgs = pipe(classes="all", seed=0)
imgs[0].save("out.png")

# Generate images for specific classes by name or ID
imgs = pipe("rose,sunflower,daffodil", n_per_class=2, seed=42)
for i, img in enumerate(imgs):
    img.save(f"flower_{i:02d}.png")

# Fine-grained sampler control
imgs = pipe(73, steps=100, cfg_scale=2.5, sampler="euler", seed=7)
imgs[0].save("class_73.png")
```
Command-Line Interface
```shell
python pipeline.py "rose,sunflower,daffodil" --n-per-class 2 --seed 42 --out out.png
```
Gradio Demo
An interactive demo is available at Glint-Research/Anthos-1.
Repository Contents
| File | Description |
|---|---|
| `model.safetensors` | EMA weights, 3.95 MB |
| `config.json` | Architecture and sampling configuration |
| `modeling.py` | DiT implementation and sampler definitions |
| `pipeline.py` | AnthosPipeline inference wrapper |
| `classes.txt` | 102 class names in `id\tname` format |
| `convert_checkpoint.py` | Converts `final.pt` training checkpoint to safetensors |
| `sample_grid.png` | 4x4 sample grid at step 120,000 |
| `requirements.txt` | Python dependencies |
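The `id\tname` layout of `classes.txt` can be read with a few lines of standard-library Python. The parser below is a hypothetical helper, not part of the repository, and the sample rows are made up for illustration rather than copied from the real file.

```python
def parse_classes(text):
    """Parse 'id<TAB>name' lines into {id: name}, skipping blank lines."""
    classes = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        class_id, name = line.split("\t", 1)
        classes[int(class_id)] = name.strip()
    return classes


# Illustrative rows only; the real file ships with the repository.
sample = "0\texample flower a\n1\texample flower b\n"
id_to_name = parse_classes(sample)
name_to_id = {name: cid for cid, name in id_to_name.items()}
```

To read the actual file, pass its contents instead, for example `parse_classes(open("classes.txt", encoding="utf-8").read())`.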
Limitations
- Fixed vocabulary. The model conditions on one of 102 discrete class labels. Free-form text prompts are not supported.
- Fixed resolution. Output is 256x256. Higher-resolution output requires an external upscaler.
- Scale constraints. At 984K parameters, the model cannot match the fidelity of large-scale generative models. Fine structure, particularly complex petal arrangements and unusual stamen geometry, is occasionally incorrect.
- Class imbalance. Oxford Flowers 102 is not class-balanced, and no rebalancing was applied. Several classes, including Barberton daisy and Mexican petunia, exhibit noticeably lower output quality.
- No quantitative evaluation. FID and Inception Score were not computed. Assessment is based on visual inspection only.
- Not for production or publication. This model is a research prototype and should not be used in production systems or as a primary source in academic or journalistic work.
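On the class-imbalance point: Anthos applied no rebalancing, but readers who want to experiment with it could start from inverse-frequency sampling weights. The sketch below is one common option written with the standard library; the function name is hypothetical and nothing here reflects how Anthos was trained.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Per-sample weights proportional to 1 / class count, normalized so the
    weights sum to 1. Sampling with them draws each class equally often."""
    counts = Counter(labels)
    raw = [1.0 / counts[y] for y in labels]
    total = sum(raw)
    return [w / total for w in raw]


# Toy labels: class 0 appears three times, class 1 once.
labels = [0, 0, 0, 1]
weights = inverse_frequency_weights(labels)
mass_per_class = {c: sum(w for w, y in zip(weights, labels) if y == c)
                  for c in set(labels)}
```

In the toy example each class ends up with half of the total sampling mass even though one class has three times as many samples.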
Citation
```bibtex
@misc{anthos2026,
  author    = {Glint Research},
  title     = {Anthos: A 984K-Parameter Class-Conditional DiT on Oxford Flowers 102},
  year      = {2026},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/Glint-Research/Anthos}
}
```
Built by Glint Research
