RiT-XL: Vanilla Diffusion Transformers Suffice in Representation Space

This repository hosts the released RiT-XL checkpoint trained for 800 epochs on ImageNet 256×256 with frozen DINOv2-Small features.

RiT (Representation Image Transformer) is a vanilla Diffusion Transformer that effectively models distributions in high-dimensional representation spaces, as presented in the paper RiT: Vanilla Diffusion Transformers Suffice in Representation Space.

Results on ImageNet 256×256

Method	Encoder	Params	FID ↓ (CFG=1)	FID ↓ (CFG≈3.7)
DiT-XL	SD-VAE	675M	9.62	2.27
SiT-XL	SD-VAE	675M	8.61	2.06
REPA-XL	SD-VAE	675M	5.78	1.29
DDT-XL	SD-VAE	675M	6.27	1.26
REG-XL	SD-VAE	675M	1.80	1.36
RAE-XL	DINOv2-S	676M	1.87	1.41
RAE-XL^DH	DINOv2-B	839M	1.51	1.16
FAE-XL	FAE-DINOv2-G	675M	1.48	1.29
RiT-XL (ours)	DINOv2-S	676M	1.45	1.14

All FIDs use 25 Heun steps with the time-shift schedule.

Sample Usage

The full training/inference code is available at lezhang7/RiT. To download the weights manually and load them in PyTorch:

import torch
from huggingface_hub import hf_hub_download

# Download the checkpoint
ckpt = hf_hub_download(repo_id="le723z/RiT", filename="checkpoint-last.pth")

# Load the state dictionary
state = torch.load(ckpt, map_location="cpu", weights_only=False)

# state['model'] / state['model_ema1'] / state['model_ema2'] are the
# trainable + two EMA-decay parameter dictionaries.
# state['model_ema1'] is the EMA decay 0.9999 (used for sampling by default).
model_weights = state['model_ema1']

To run the evaluation script (which auto-pulls this checkpoint plus the matching RAE decoder):

git clone https://github.com/lezhang7/RiT.git
cd RiT
pip install -r requirements.txt
bash scripts/eval.sh        # CFG=3.7, FID ~1.14 on ImageNet 256x256

Model details

Architecture: vanilla Diffusion Transformer — 28 layers, hidden 1152, 16 heads, SwiGLU FFN, RMSNorm, QK-norm, 2D VisionRoPE, 32 in-context class tokens, joint [CLS]-patch modeling.
Encoder (frozen): facebook/dinov2-with-registers-small (d=384).
Decoder (frozen): ViT-MAE-style decoder from nyu-visionx/RAE-collections, variant decoders/dinov2/wReg_small/ViTXL_n08/model.pt.
Parameters (denoiser only): 676M.
Training: 8×H200, batch 1536 effective, AdamW lr=5e-5, 800 epochs, x-prediction loss, dimension-aware time shift (s ≈ 4.9), CLS auxiliary loss weight λ=0.2.
Sampling defaults: Heun, 25 steps, time-shift schedule, CFG=3.7 in interval [0.1, 0.98], coupled-noise initialization for [CLS].

Citation

@article{zhang2026rit,
  title   = {RiT: Vanilla Diffusion Transformers Suffice in Representation Space},
  author  = {Zhang, Le and Mang, Ning and Agrawal, Aishwarya},
  journal = {arXiv preprint arXiv:2605.21981},
  year    = {2026}
}

Acknowledgments

This release reuses the frozen DINOv2 encoder + ViT decoder pairing from RAE and adopts the modernized DiT block design + in-context class tokens from JiT.

Downloads last month: -; Downloads are not tracked for this model. How to track

Inference Providers NEW

Unconditional Image Generation

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Dataset used to train le723z/RiT

Paper for le723z/RiT

RiT: Vanilla Diffusion Transformers Suffice in Representation Space

Paper • 2605.21981 • Published 3 days ago • 9