Anime-Elite — From-Scratch Text-to-Anime-Face Diffusion (v1)

Built For Low-End Devices!!

Download From `Files and versions` Tab!

This model can be used with an external image upscaler.

A small conditional pixel-space diffusion model trained from scratch on 10k anime faces, with Danbooru-style tag prompts. No pretrained VAE, no fine-tuning of anything — everything in this model started from random weights.

[Sample image: one of the best outputs, seed 21]

This is a v1 proof of concept. It works. It's not polished. The full story is in Limitations — read it before you judge the samples.

Sample grid: one checkpoint per row (epoch 50 down to 30), one fixed seed per column. Prompt: girl, floral background, smile, red hairs. Guidance 1.8, 200 DDIM steps.

What this is

Task: text → 96×96 anime face
Architecture: diffusers.UNet2DConditionModel, ~66M params, cross-attention conditioning
Conditioning: multi-hot over a 512-tag vocabulary → 4 cross-attention tokens via a small MLP
Sampling: DDIM with classifier-free guidance (10% dropout during training)
Training data: first 10k images from puruchinera/anime-faces-256, resized 256 → 96
Hardware: single RTX 5080 (16 GB), 50 epochs, ~2 hours wall-clock
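
As a rough sanity check on the conditioning path, the arithmetic below counts the parameters of the tag MLP and the shape of what it emits per prompt. The layer sizes (512 hidden units, 4 tokens of width 256) are read off the `TagConditioner` definition in How to use; the helper names `linear_params` and `tag_conditioner_stats` are made up for illustration.

```python
def linear_params(n_in: int, n_out: int) -> int:
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out


def tag_conditioner_stats(vocab_size=512, hidden=512, n_tokens=4, dim=256):
    """Parameter count and per-prompt output shape for a vocab -> hidden -> tokens MLP."""
    params = linear_params(vocab_size, hidden) + linear_params(hidden, n_tokens * dim)
    return {"params": params, "tokens_shape": (n_tokens, dim)}


stats = tag_conditioner_stats()
print(stats)  # {'params': 787968, 'tokens_shape': (4, 256)}
```

Under these assumptions the conditioner stays below 1M parameters, so nearly all of the ~66M sits in the UNet.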

I deliberately didn't use a pretrained VAE — wanted everything end-to-end from scratch. That's why this is pixel-space diffusion at 96px, not latent diffusion.

Checkpoints

Five checkpoints from across training are included. Each one is a snapshot at the listed epoch.

File	Epoch	Notes
`ckpt-30th-epoch.pt`	30	sketchy, manga-like, rough edges
`ckpt-35th-epoch.pt`	35	smoother, faces solidifying
`ckpt-40th-epoch.pt`	40	most consistent quality across seeds
`ckpt-45th-epoch.pt`	45	competitive with 40, slightly more refined
`ckpt-50th-epoch.pt`	50	highest peaks, more variance, slight overcook

TL;DR — use ckpt-45th-epoch.pt for general use. Best balance of detail and consistency.

Each .pt is a dict with three keys: unet, tag_cond, and vocab (the 512-tag list).
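
A quick way to confirm a download is intact is to check for those three keys before building any models. The sketch below is a hypothetical helper (`summarize_checkpoint` is not part of the released code) and is demonstrated on a stand-in dict; the commented lines show how it would be called on a real file, assuming torch is installed.

```python
EXPECTED_KEYS = ("unet", "tag_cond", "vocab")


def summarize_checkpoint(ckpt: dict) -> dict:
    """Check the three documented keys and report basic sizes."""
    missing = [k for k in EXPECTED_KEYS if k not in ckpt]
    if missing:
        raise KeyError(f"checkpoint is missing keys: {missing}")
    return {
        "unet_entries": len(ckpt["unet"]),          # entries in the UNet state dict
        "tag_cond_entries": len(ckpt["tag_cond"]),  # entries in the conditioner state dict
        "vocab_size": len(ckpt["vocab"]),
        "first_tags": list(ckpt["vocab"][:5]),
    }


# With a real file (assumes torch and a local copy of the checkpoint):
#   import torch
#   ckpt = torch.load("ckpt-45th-epoch.pt", map_location="cpu")
#   print(summarize_checkpoint(ckpt))

# Stand-in dict with the same layout, only to show what the summary looks like.
fake = {"unet": {"w": 0}, "tag_cond": {"w": 0, "b": 0}, "vocab": ["1girl", "smile"]}
print(summarize_checkpoint(fake))
```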

Best sampling config

After a lot of sweeping, this is the config that gave the most satisfying results:

prompt:    1girl,<hair color> hair,<eye color> eyes,smile,floral background
guidance:  1.8 – 2.7
DDIM steps: 200
checkpoint: 45 or 40

Higher guidance (4-5) strengthens tag adherence but washes out colors (a classic small-model CFG artifact). Fewer steps (<100) leave the output noisy.

How to use

import torch
from PIL import Image
from torch.amp import autocast
from diffusers import UNet2DConditionModel, DDIMScheduler
import torch.nn as nn

# --- model defs (must match training) ---
class TagConditioner(nn.Module):
    def __init__(self, vocab_size, dim=256, n_tokens=4):
        super().__init__()
        self.n_tokens, self.dim = n_tokens, dim
        self.net = nn.Sequential(
            nn.Linear(vocab_size, 512), nn.SiLU(),
            nn.Linear(512, n_tokens * dim),
        )
    def forward(self, x):
        return self.net(x).view(-1, self.n_tokens, self.dim)

def build_unet(cross_dim=256):
    return UNet2DConditionModel(
        sample_size=128, in_channels=3, out_channels=3,
        layers_per_block=2,
        block_out_channels=(96, 192, 320, 384),
        down_block_types=("DownBlock2D", "CrossAttnDownBlock2D", "CrossAttnDownBlock2D", "DownBlock2D"),
        up_block_types=("UpBlock2D", "CrossAttnUpBlock2D", "CrossAttnUpBlock2D", "UpBlock2D"),
        cross_attention_dim=cross_dim, attention_head_dim=8,
    )

# --- load ---
device = "cuda" if torch.cuda.is_available() else "cpu"  # CPU fallback is slow but runs without a GPU
ckpt = torch.load("ckpt-45th-epoch.pt", map_location=device)
vocab = ckpt["vocab"]
unet = build_unet().to(device); unet.load_state_dict(ckpt["unet"]); unet.eval()
tag_cond = TagConditioner(len(vocab)).to(device); tag_cond.load_state_dict(ckpt["tag_cond"]); tag_cond.eval()

# --- sample ---
@torch.no_grad()
def generate(prompt, n=4, guidance=2.0, steps=200, seed=42):
    tag_to_idx = {t: i for i, t in enumerate(vocab)}
    mh = torch.zeros(len(vocab))
    for t in [s.strip() for s in prompt.split(",") if s.strip()]:
        if t in tag_to_idx: mh[tag_to_idx[t]] = 1.0
    mh = mh.unsqueeze(0).repeat(n, 1).to(device)
    cond, uncond = tag_cond(mh), tag_cond(torch.zeros_like(mh))

    sched = DDIMScheduler(num_train_timesteps=1000, beta_schedule="squaredcos_cap_v2")
    sched.set_timesteps(steps)
    g = torch.Generator(device=device).manual_seed(seed)
    x = torch.randn(n, 3, 96, 96, device=device, generator=g)

    for t in sched.timesteps:
        with autocast("cuda", dtype=torch.bfloat16):
            pred = unet(torch.cat([x, x]), t,
                        encoder_hidden_states=torch.cat([uncond, cond])).sample
        pu, pc = pred.float().chunk(2)
        x = sched.step(pu + guidance * (pc - pu), t, x).prev_sample

    arr = ((x.clamp(-1, 1) + 1) * 127.5).byte().permute(0, 2, 3, 1).cpu().numpy()
    return [Image.fromarray(a) for a in arr]

imgs = generate("1girl,red hair,floral background,smile", n=4, guidance=2.0, steps=200)
imgs[0].save("out.png")
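
To compare several samples at once, or to view the 96×96 outputs at a friendlier size, the list returned by `generate` can be tiled into one sheet. This is a minimal sketch: `make_grid` is a hypothetical helper, not part of the model's code, and the Lanczos resize is only a stand-in for a real external upscaler. Solid-color images stand in for real samples here.

```python
from PIL import Image


def make_grid(images, cols=4, scale=1):
    """Tile equally sized PIL images into one sheet, optionally enlarged."""
    if not images:
        raise ValueError("need at least one image")
    w, h = images[0].size
    rows = (len(images) + cols - 1) // cols  # ceiling division
    sheet = Image.new("RGB", (cols * w, rows * h))
    for i, im in enumerate(images):
        sheet.paste(im, ((i % cols) * w, (i // cols) * h))
    if scale != 1:
        # Plain resampling, not a learned upscaler.
        sheet = sheet.resize((sheet.width * scale, sheet.height * scale), Image.LANCZOS)
    return sheet


# Placeholder images; in practice pass the list returned by generate(...).
demo = [Image.new("RGB", (96, 96), c) for c in ("red", "green", "blue", "white", "black")]
sheet = make_grid(demo, cols=4, scale=2)
print(sheet.size)  # (768, 384)
```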

Prompting tips

Use Danbooru-style tags, comma-separated. 1girl not girl. red hair not red hairs. blue eyes not blue eye.
Stack 4-8 tags per prompt for best results.
Common tags from the vocab: 1girl, 1boy, long hair, short hair, blue eyes, red eyes, green eyes, purple eyes, red hair, blue hair, brown hair, white hair, pink hair, smile, blush, portrait, looking at viewer, floral background, choker.
If a tag doesn't match the vocab it's silently ignored. Print vocab after loading to see what's available.
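
Because unmatched tags vanish without warning, it can help to check a prompt against the vocab before sampling. The sketch below applies the same comma-split and exact-match rule as `generate`; `split_prompt` is a hypothetical helper and `toy_vocab` is a four-tag stand-in for the real 512-tag list in `ckpt["vocab"]`.

```python
def split_prompt(prompt: str, vocab: list) -> tuple:
    """Return (matched, ignored) tags using exact string matching."""
    known = set(vocab)
    tags = [s.strip() for s in prompt.split(",") if s.strip()]
    matched = [t for t in tags if t in known]
    ignored = [t for t in tags if t not in known]
    return matched, ignored


toy_vocab = ["1girl", "red hair", "smile", "floral background"]  # stand-in vocab

matched, ignored = split_prompt("girl, floral background, smile, red hairs", toy_vocab)
print(matched)  # ['floral background', 'smile']
print(ignored)  # ['girl', 'red hairs']
```

With this toy vocab, `girl` and `red hairs` are dropped while `1girl` and `red hair` would have matched, which is the failure mode the tips above warn about.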

Limitations

Being upfront about what this model can't do:

96×96 only. That's tiny by modern standards. Faces are recognizable but not detailed.
Heavy female bias. Roughly 90% or more of the dataset is female anime characters, so 1boy is mostly ignored.
Tag exact-match. No CLIP, no natural language. Misspell a tag and it's gone.
CFG fragility. Above guidance ~3 the model starts producing washed-out, low-saturation outputs. Above ~5 you get dual/blended faces. Stay in 1.5-2.7 for clean samples.
No EMA weights. Sampling uses live training weights, which adds noise. v2 will fix this.
No safety checker. It's faces. Of fictional anime characters. Should be fine, but no filter is in place.

Training details

Optimizer: AdamW, lr=1e-4, betas=(0.9, 0.999), wd=1e-6
Scheduler: DDPM, 1000 timesteps, squaredcos_cap_v2 beta schedule
Noise prediction loss (MSE)
CFG dropout: 10% null condition during training
Mixed precision: bf16 autocast
Batch size: 16
Steps per epoch: 625
Total training steps: ~31k
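
The `squaredcos_cap_v2` setting is the cosine noise schedule. The sketch below rebuilds its betas in plain Python, assuming the usual formulation (alpha_bar(t) = cos²((t + 0.008) / 1.008 · π/2), with each beta capped at 0.999). Treat it as an illustration of the schedule's shape rather than a replacement for the diffusers scheduler; `cosine_betas` is a hypothetical name.

```python
import math


def cosine_betas(num_steps=1000, max_beta=0.999):
    """Betas derived from a squared-cosine cumulative signal level alpha_bar(t)."""
    def alpha_bar(t):
        return math.cos((t + 0.008) / 1.008 * math.pi / 2) ** 2

    betas = []
    for i in range(num_steps):
        t1, t2 = i / num_steps, (i + 1) / num_steps
        # beta_i = 1 - alpha_bar(t2) / alpha_bar(t1), capped to avoid a singular last step
        betas.append(min(1 - alpha_bar(t2) / alpha_bar(t1), max_beta))
    return betas


betas = cosine_betas()
print(len(betas), betas[-1])  # 1000 0.999
```

The early betas are tiny and the final one hits the cap, so noise is added gently at first and aggressively near the end of the forward process.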

Loss plateaus around 0.038–0.045 by epoch 15. Visual quality keeps improving past the plateau until ~epoch 40-45.
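
The 10% CFG dropout from the training details can be sketched in a few lines. The null condition is taken to be the all-zero tag vector, matching what `generate` feeds to `tag_cond` for the unconditional branch; the helper name `drop_condition` and the toy batch are assumptions for illustration, and the actual training loop may apply the dropout differently.

```python
import random


def drop_condition(multi_hot, p_drop=0.1, rng=random):
    """With probability p_drop, replace the tag vector with the all-zero null condition."""
    if rng.random() < p_drop:
        return [0.0] * len(multi_hot)
    return list(multi_hot)


rng = random.Random(0)  # fixed seed so the demo is repeatable
batch = [[1.0, 0.0, 1.0] for _ in range(10_000)]
dropped = sum(1 for row in batch if not any(drop_condition(row, 0.1, rng)))
print(dropped / len(batch))  # close to 0.10
```

Training on both branches is what lets a single network produce the conditional and unconditional predictions that guidance combines at sampling time.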

What's next (v2)

Train at 128×128 (more spatial bandwidth → no more competing-face artifacts)
Add EMA weights for sampling
Run 100-150 epochs
Rebalance dataset to include more diverse character types
Try natural-language captions via WD14 → CLIP encoder (gives real prompt freedom)
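
For the planned EMA weights, the update rule itself is small. The sketch below shows it on plain floats keyed by parameter name; the same expression works elementwise on tensors. The function name `ema_update` is hypothetical, and the default decay of 0.999 is a common choice rather than a value taken from this project.

```python
def ema_update(ema: dict, live: dict, decay: float = 0.999) -> dict:
    """One EMA step per parameter: ema <- decay * ema + (1 - decay) * live."""
    return {k: decay * ema[k] + (1.0 - decay) * live[k] for k in ema}


ema = {"w": 0.0}
for step in range(3):
    live = {"w": 1.0}  # pretend the live training weight sits at 1.0
    ema = ema_update(ema, live, decay=0.5)  # small decay so the drift is visible
print(ema)  # {'w': 0.875}
```

Sampling would then load the averaged weights instead of the live ones, which smooths out step-to-step jitter in the training weights.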

Acknowledgements

Dataset: puruchinera/anime-faces-256
Architecture: HuggingFace diffusers

Built solo over an afternoon on an RTX 5080. The Windows sysmem-fallback gotcha cost me 2 hours before I caught it. Posting this in case it helps someone else avoid the same trap.

Rohanify/Anime-Elite-V1
