Anthos

Anthos is a class-conditional latent diffusion model trained with rectified flow matching on the Oxford Flowers 102 dataset. It generates 256x256 images across 102 flower categories using a DiT-Nano/2 architecture of approximately 984K parameters. It is a research artifact: a minimal, transparent, end-to-end training and sampling demonstration.

Notice

Anthos is a research prototype. It is not Stable Diffusion, does not include a text encoder, safety filter, or upscaler, and operates exclusively over the 102 Oxford Flowers class vocabulary. Output quality reflects the scale of the model. Use accordingly.


At a Glance

| Property | Value |
| --- | --- |
| Parameters | 983,808 |
| Architecture | DiT-Nano/2 (6 blocks, hidden dim 96, 4 heads, patch size 2, SwiGLU) |
| Training Steps | 120,000 |
| Training Duration | ~18 minutes on an RTX Pro 6000 |
| Precision | bfloat16 |
| Output Resolution | 256 x 256 |
| Latent Shape | 32 x 32 x 4 |
| Number of Classes | 102 |
| Loss (initial → final) | 1.843 → 0.880 (flow-matching MSE) |
| Sampler | Heun, 50 steps, CFG scale 4.0 |

Background

The name derives from the Greek word for flower. The model was built as a sanity check on a rectified flow training loop and turned into a functional flower generator in the process.

Rather than predicting noise, the network predicts the velocity field transporting a sample from Gaussian noise to the data distribution. The architecture is a standard DiT with adaLN-Zero conditioning, SwiGLU MLPs, and sinusoidal 2D positional embeddings. The latent space is provided by the Stability AI VAE (stabilityai/sd-vae-ft-ema), which compresses 256x256 images to 32x32x4 latents at an 8x spatial downsampling factor.
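One common way to write the rectified flow objective described above is sketched here. The time direction (t = 0 at noise, t = 1 at data) and the symbols x_0, x_1, c, and v_θ are assumptions chosen for illustration; the source does not state which convention Anthos uses.

```latex
% Straight-line interpolation between a noise sample x_0 and a data latent x_1
x_t = (1 - t)\,x_0 + t\,x_1, \qquad x_0 \sim \mathcal{N}(0, I), \quad x_1 \sim p_{\mathrm{data}}

% The network regresses the constant velocity along that line, given class label c
\mathcal{L}(\theta) = \mathbb{E}_{t,\,x_0,\,x_1,\,c}
    \left\lVert v_\theta(x_t, t, c) - (x_1 - x_0) \right\rVert^2
```

Under this convention, sampling amounts to integrating dx/dt = v_θ(x, t, c) from t = 0 to t = 1 and decoding the result with the VAE.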

The entire Oxford Flowers 102 dataset, including train, validation, and test splits (8,189 images), was encoded once through the VAE, augmented with horizontal flips to yield 16,378 latents, and stored in VRAM as BF16 channels-last tensors. A custom GPULatentLoader shuffles and batches directly from VRAM, reducing each training step to a forward pass and an optimizer update.
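The figures above imply a small memory footprint, which is why the whole dataset fits in VRAM. The arithmetic below is a back-of-the-envelope check using only numbers stated in this card; the variable names are illustrative.

```python
# Rough VRAM footprint of the cached latents, using figures from this card.
NUM_IMAGES = 8_189            # train + val + test
FLIP_FACTOR = 2               # identity + horizontal flip
LATENT_SHAPE = (32, 32, 4)    # height, width, channels
BYTES_PER_BF16 = 2

num_latents = NUM_IMAGES * FLIP_FACTOR
values_per_latent = LATENT_SHAPE[0] * LATENT_SHAPE[1] * LATENT_SHAPE[2]
total_bytes = num_latents * values_per_latent * BYTES_PER_BF16

print(num_latents)                        # 16378
print(f"{total_bytes / 2**20:.1f} MiB")   # 128.0 MiB
```

At roughly 128 MiB, the latent cache is a small fraction of the memory available on the training GPU.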

At approximately 111 iterations per second, 120,000 steps completed in just under 18 minutes (1,078 seconds). Loss fell from 1.843 to 0.880, decreasing at every checkpoint listed in the Loss Curve section.
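The throughput figure can be cross-checked against the wall time and batch size reported under Training Details. This is plain arithmetic on numbers from this card; the variable names are illustrative.

```python
# Cross-check throughput, duration, and dataset passes from reported figures.
TOTAL_STEPS = 120_000
WALL_TIME_S = 1_078
BATCH_SIZE = 256
NUM_LATENTS = 16_378

steps_per_second = TOTAL_STEPS / WALL_TIME_S
minutes = WALL_TIME_S / 60
passes = TOTAL_STEPS * BATCH_SIZE / NUM_LATENTS

print(f"{steps_per_second:.1f} steps/s")   # 111.3 steps/s
print(f"{minutes:.2f} min")                # 17.97 min
print(f"{passes:.0f} passes")              # 1876 passes
```

The model therefore saw each augmented latent roughly 1,876 times during training.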


Sample Output

[Figure: sample grid]

4x4 class-conditional grid, step 120,000, CFG scale 4.0, Heun sampler, 50 steps. Each tile corresponds to a distinct Oxford Flowers 102 class.
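The sampler settings in the caption (Heun, 50 steps, CFG scale 4.0) can be illustrated with a scalar toy problem. This is a minimal sketch, not the code in modeling.py: `guided_velocity`, `heun_sample`, and `toy_velocity` are hypothetical names, the time direction (t = 0 at noise, t = 1 at data) is an assumption, and the toy velocity field stands in for the network.

```python
import math
from typing import Callable, Optional

Velocity = Callable[[float, float, Optional[int]], float]

def guided_velocity(v: Velocity, x: float, t: float,
                    label: int, cfg_scale: float) -> float:
    """Classifier-free guidance: extrapolate away from the unconditional prediction."""
    v_uncond = v(x, t, None)   # None stands in for the null class token
    v_cond = v(x, t, label)
    return v_uncond + cfg_scale * (v_cond - v_uncond)

def heun_sample(v: Velocity, x: float, label: int,
                steps: int = 50, cfg_scale: float = 4.0) -> float:
    """Integrate dx/dt = v from t=0 to t=1 with Heun's second-order method."""
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        k1 = guided_velocity(v, x, t, label, cfg_scale)
        x_pred = x + dt * k1                        # Euler predictor
        k2 = guided_velocity(v, x_pred, t + dt, label, cfg_scale)
        x = x + dt * 0.5 * (k1 + k2)                # trapezoidal corrector
    return x

def toy_velocity(x: float, t: float, label: Optional[int]) -> float:
    """Toy field: pulls x toward 1.0 when conditioned, toward 0.0 otherwise."""
    target = 0.0 if label is None else 1.0
    return target - x

x_final = heun_sample(toy_velocity, x=0.5, label=7)
# With cfg_scale=4 the guided toy ODE is dx/dt = 4 - x, which has a closed form.
exact = 4.0 + (0.5 - 4.0) * math.exp(-1.0)
print(abs(x_final - exact) < 1e-3)
```

In this formulation each Heun step evaluates the velocity twice and guidance needs both a conditional and an unconditional prediction, so 50 steps correspond to 200 velocity evaluations unless the two predictions are batched together.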


Model Specification

| Parameter | Value |
| --- | --- |
| Architecture | Diffusion Transformer (DiT) |
| Variant | DiT-Nano/2 |
| Depth | 6 blocks |
| Hidden Size | 96 |
| Attention Heads | 4 (head dimension 24) |
| Patch Size | 2 |
| Token Grid | 16 x 16 = 256 tokens |
| MLP Type | SwiGLU, expansion ratio 2.0 |
| Normalization | LayerNorm; adaLN-Zero on block norms |
| Attention | QK-LayerNorm, scaled dot-product attention |
| Conditioning | adaLN-Zero on timestep and class label |
| Class Dropout Rate | 0.1 |
| Class Embedding | 102 classes + 1 null token |
| Positional Embedding | 2D sinusoidal, frozen |
| VAE | stabilityai/sd-vae-ft-ema, 8x downsample, 4 channels |
| VAE Scaling Factor | 0.18215 |
| Output Channels | 4 (velocity prediction; no learned sigma) |
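Several entries in the specification follow from one another. The short check below derives the latent size, token grid, and head dimension from the stated image size, VAE factor, patch size, and hidden size. The SwiGLU hidden width assumes the expansion ratio is applied without rounding, which the source does not confirm.

```python
# Derive dependent shape values from the stated hyperparameters.
IMAGE_SIZE = 256
VAE_DOWNSAMPLE = 8
LATENT_CHANNELS = 4
PATCH_SIZE = 2
HIDDEN_SIZE = 96
NUM_HEADS = 4
MLP_RATIO = 2.0

latent_size = IMAGE_SIZE // VAE_DOWNSAMPLE               # 32
grid = latent_size // PATCH_SIZE                         # 16
num_tokens = grid * grid                                 # 256
patch_dim = PATCH_SIZE * PATCH_SIZE * LATENT_CHANNELS    # 16 values per token
head_dim = HIDDEN_SIZE // NUM_HEADS                      # 24
mlp_hidden = int(HIDDEN_SIZE * MLP_RATIO)                # 192, assuming no rounding

print(latent_size, grid, num_tokens, patch_dim, head_dim, mlp_hidden)
```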

Training Details

| Parameter | Value |
| --- | --- |
| Dataset | Oxford Flowers 102 (train + val + test, 8,189 images) |
| Augmentation | Identity + horizontal flip = 16,378 latents |
| Latent Storage | Full dataset in VRAM, channels-last BF16 |
| Batch Size | 256 |
| Gradient Accumulation | 1 |
| Optimizer | AdamW, betas=(0.9, 0.95), weight decay=0, fused |
| Learning Rate | 1e-4, 1,000-step linear warmup, then constant |
| Gradient Clipping | 1.0 |
| EMA Decay | 0.9999 |
| Timestep Sampler | Logit-normal (mu=0, sigma=1) |
| Loss Function | Flow-matching MSE on velocity field |
| CFG Dropout | 0.1 (10% of labels replaced with null token) |
| Precision | BF16 autocast, FP32 reductions |
| Compilation | torch.compile(mode="max-autotune") |
| Hardware | RTX Pro 6000, 96 GB VRAM, sm_120 |
| Throughput | ~111 steps/second |
| Total Wall Time | 1,078 seconds for 120,000 steps |
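Four of the settings above (logit-normal timestep sampling, CFG label dropout, linear warmup, and EMA) are simple enough to sketch in standard-library Python. This is an illustration under assumptions, not the training code: all function names are hypothetical, the null token is assumed to sit at index 102, and the exact warmup formula is a guess at one common form.

```python
import math
import random

NUM_CLASSES = 102
NULL_CLASS = NUM_CLASSES   # assumed index of the extra null token

def sample_timestep(rng: random.Random, mu: float = 0.0, sigma: float = 1.0) -> float:
    """Logit-normal sample: draw a Gaussian, then squash it through a sigmoid."""
    z = rng.gauss(mu, sigma)
    return 1.0 / (1.0 + math.exp(-z))

def drop_label(rng: random.Random, label: int, p: float = 0.1) -> int:
    """Replace the class label with the null token with probability p."""
    return NULL_CLASS if rng.random() < p else label

def learning_rate(step: int, base_lr: float = 1e-4, warmup_steps: int = 1_000) -> float:
    """Linear warmup from 0 to base_lr, then constant."""
    return base_lr * min(1.0, step / warmup_steps)

def ema_update(ema: float, param: float, decay: float = 0.9999) -> float:
    """One exponential-moving-average step for a single scalar weight."""
    return decay * ema + (1.0 - decay) * param

rng = random.Random(0)
timesteps = [sample_timestep(rng) for _ in range(10_000)]
labels = [drop_label(rng, 5) for _ in range(10_000)]

mean_t = sum(timesteps) / len(timesteps)
drop_fraction = labels.count(NULL_CLASS) / len(labels)
print(f"mean t = {mean_t:.3f}, dropped = {drop_fraction:.3f}")
print(learning_rate(500), learning_rate(5_000))
```

With mu = 0 the logit-normal density is symmetric about t = 0.5, so training concentrates on intermediate noise levels rather than the endpoints.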

Loss Curve

Loss was logged throughout training. Selected values are reported below. No FID or Inception Score was computed; evaluation was performed by visual inspection of sample grids saved every 2,000 steps.

| Step | Loss |
| --- | --- |
| 0 | 1.843 |
| 1,000 | 1.710 |
| 10,000 | 1.310 |
| 50,000 | 1.040 |
| 100,000 | 0.910 |
| 120,000 | 0.880 |

Usage

Python API

from pipeline import AnthosPipeline

pipe = AnthosPipeline(repo_dir=".")

# Generate one image per class across all 102 classes
imgs = pipe(classes="all", seed=0)
imgs[0].save("out.png")

# Generate images for specific classes by name or ID
imgs = pipe("rose,sunflower,daffodil", n_per_class=2, seed=42)
for i, img in enumerate(imgs):
    img.save(f"flower_{i:02d}.png")

# Fine-grained sampler control
imgs = pipe(73, steps=100, cfg_scale=2.5, sampler="euler", seed=7)
imgs[0].save("class_73.png")

Command-Line Interface

python pipeline.py "rose,sunflower,daffodil" --n-per-class 2 --seed 42 --out out.png

Gradio Demo

An interactive demo is available at Glint-Research/Anthos-1.


Repository Contents

| File | Description |
| --- | --- |
| model.safetensors | EMA weights, 3.95 MB |
| config.json | Architecture and sampling configuration |
| modeling.py | DiT implementation and sampler definitions |
| pipeline.py | AnthosPipeline inference wrapper |
| classes.txt | 102 class names in id\tname format |
| convert_checkpoint.py | Converts final.pt training checkpoint to safetensors |
| sample_grid.png | 4x4 sample grid at step 120,000 |
| requirements.txt | Python dependencies |
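Since classes.txt is described as one tab-separated id and name per line, it can be read with a few lines of Python. The sketch below is an assumption about how such a file might be parsed, not the logic in pipeline.py: `parse_classes` and `resolve` are hypothetical helpers, and the sample IDs are made up for illustration.

```python
def parse_classes(text: str) -> dict[int, str]:
    """Parse 'id<TAB>name' lines into an id-to-name mapping."""
    classes: dict[int, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue                      # skip blank lines
        class_id, name = line.split("\t", 1)
        classes[int(class_id)] = name
    return classes

def resolve(classes: dict[int, str], query: str) -> int:
    """Accept either a numeric ID or a class name and return the ID."""
    if query.isdigit():
        return int(query)
    by_name = {name: class_id for class_id, name in classes.items()}
    return by_name[query]

# Illustrative content only; the real IDs in classes.txt may differ.
sample = "0\trose\n1\tsunflower\n2\tdaffodil\n"
classes = parse_classes(sample)
print(classes)
print(resolve(classes, "sunflower"), resolve(classes, "2"))
```

To read the real file, pass the contents of classes.txt (for example via `pathlib.Path("classes.txt").read_text()`) instead of the sample string.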

Limitations

  • Fixed vocabulary. The model conditions on one of 102 discrete class labels. Free-form text prompts are not supported.
  • Fixed resolution. Output is 256x256. Higher-resolution output requires an external upscaler.
  • Scale constraints. At 984K parameters, the model cannot match the fidelity of large-scale generative models. Fine structure, particularly complex petal arrangements and unusual stamen geometry, is occasionally incorrect.
  • Class imbalance. Oxford Flowers 102 is not class-balanced, and no rebalancing was applied. Several classes, including Barberton daisy and Mexican petunia, exhibit noticeably lower output quality.
  • No quantitative evaluation. FID and Inception Score were not computed. Assessment is based on visual inspection only.
  • Not for production or publication. This model is a research prototype and should not be used in production systems or as a primary source in academic or journalistic work.

Citation

@misc{anthos2026,
  author    = {Glint Research},
  title     = {Anthos: A 984K-Parameter Class-Conditional DiT on Oxford Flowers 102},
  year      = {2026},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/Glint-Research/Anthos}
}

Built by Glint Research
