Anime-Elite: From-Scratch Text-to-Anime-Face Diffusion (v1)

Built For Low-End Devices!!

Download the checkpoints from the Files and versions tab!

This model can be used with an external image upscaler.

A small conditional pixel-space diffusion model trained from scratch on 10k anime faces, with Danbooru-style tag prompts. No pretrained VAE, no fine-tuning of anything: everything in this model started from random weights.

One of the best ones with seed 21:

samples

This is a v1 proof of concept. It works. It's not polished. The full story is in Limitations; read it before you judge the samples.

samples

Each row is one checkpoint (epoch 50 → 30, top to bottom); each column uses a fixed seed. Prompt: girl, floral background, smile, red hairs. Guidance 1.8, 200 DDIM steps.


What this is

  • Task: text → 96×96 anime face
  • Architecture: diffusers.UNet2DConditionModel, ~66M params, cross-attention conditioning
  • Conditioning: multi-hot over a 512-tag vocabulary → 4 cross-attention tokens via a small MLP
  • Sampling: DDIM with classifier-free guidance (10% dropout during training)
  • Training data: first 10k images from puruchinera/anime-faces-256, resized 256 → 96
  • Hardware: single RTX 5080 (16 GB), 50 epochs, ~2 hours wall-clock
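
The 10% conditioning dropout is what lets the same network produce both conditional and unconditional predictions at sampling time. Below is a minimal sketch of that step, assuming the null condition is an all-zero multi-hot vector (which is what the sampling code in this card feeds to the unconditional branch). The helper name `drop_conditions` and the use of plain Python lists are illustrative and not taken from the training script.

```python
import random

def drop_conditions(multi_hot_batch, p_drop=0.1, rng=None):
    """Replace each row with an all-zero (null) condition with probability p_drop.

    Assumption: the null condition is the all-zero multi-hot vector.
    """
    rng = rng or random.Random()
    out = []
    for row in multi_hot_batch:
        if rng.random() < p_drop:
            out.append([0.0] * len(row))  # null condition for this sample
        else:
            out.append(list(row))         # keep the tags unchanged
    return out

# Example: 1000 samples that all carry tag 0; roughly 10% should be nulled out.
batch = [[1.0, 0.0, 0.0]] * 1000
dropped = drop_conditions(batch, p_drop=0.1, rng=random.Random(0))
null_fraction = sum(1 for row in dropped if not any(row)) / len(dropped)
```

In a real training loop the same masking would be applied to the tensor batch before it goes through the tag conditioner.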

I deliberately didn't use a pretrained VAE because I wanted everything trained end-to-end from scratch. That's why this is pixel-space diffusion at 96px, not latent diffusion.

Checkpoints

Five checkpoints from across training are included. Each one is a snapshot at the listed epoch.

| File | Epoch | Notes |
|---|---|---|
| ckpt-30th-epoch.pt | 30 | sketchy, manga-like, rough edges |
| ckpt-35th-epoch.pt | 35 | smoother, faces solidifying |
| ckpt-40th-epoch.pt | 40 | most consistent quality across seeds |
| ckpt-45th-epoch.pt | 45 | competitive with 40, slightly more refined |
| ckpt-50th-epoch.pt | 50 | highest peaks, more variance, slight overcook |

TL;DR: use ckpt-45th-epoch.pt for general use. It gives the best balance of detail and consistency.

Each .pt is a dict with three keys: unet, tag_cond, and vocab (the 512-tag list).
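
A quick way to sanity-check a downloaded file is to confirm those three keys are present before building the model. The sketch below assumes `vocab` is a plain Python list and the other two entries are state dicts; `summarize_checkpoint` is a hypothetical helper, and the stand-in dict at the bottom only exists to make the example run without a checkpoint file.

```python
def summarize_checkpoint(ckpt):
    """Return a small summary of a loaded checkpoint dict."""
    expected = {"unet", "tag_cond", "vocab"}
    missing = expected - set(ckpt)
    if missing:
        raise KeyError(f"checkpoint is missing keys: {sorted(missing)}")
    return {
        "unet_entries": len(ckpt["unet"]),          # number of tensors in the UNet state dict
        "tag_cond_entries": len(ckpt["tag_cond"]),  # number of tensors in the conditioner
        "vocab_size": len(ckpt["vocab"]),
        "first_tags": list(ckpt["vocab"][:5]),
    }

# Stand-in dict for illustration only; a real checkpoint holds tensors.
fake = {"unet": {"w": 0}, "tag_cond": {"w": 0, "b": 0}, "vocab": ["1girl", "smile"]}
summary = summarize_checkpoint(fake)
```

With a real file you would pass in `torch.load("ckpt-45th-epoch.pt", map_location="cpu")`, and `vocab_size` should come back as 512.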

Best sampling config

After a lot of sweeping, this is the config that gave the most satisfying results:

prompt:    1girl,<hair color> hair,<eye color> eyes,smile,floral background
guidance:  1.8 – 2.7
DDIM steps: 200
checkpoint: 45 or 40

Higher guidance (4-5) strengthens tag adherence but washes out the colors (a classic small-model CFG artifact). Fewer steps (<100) leave the output noisy.
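
For reference, the guidance value enters the sampler through the usual classifier-free guidance combination, which is what the `pu + guidance * (pc - pu)` line in the sampling code in this card computes. Here $\epsilon_u$ and $\epsilon_c$ are the unconditional and tag-conditioned noise predictions and $w$ is the guidance scale (the symbol names are mine, not the author's):

```latex
\hat{\epsilon} = \epsilon_u + w\,(\epsilon_c - \epsilon_u)
```

At $w = 1$ this reduces to the plain conditional prediction. Larger values extrapolate further away from the unconditional prediction, which is plausibly why the high end of the range starts to degrade the colors.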

How to use

import torch
from PIL import Image
from torch.amp import autocast
from diffusers import UNet2DConditionModel, DDIMScheduler
import torch.nn as nn

# --- model defs (must match training) ---
class TagConditioner(nn.Module):
    def __init__(self, vocab_size, dim=256, n_tokens=4):
        super().__init__()
        self.n_tokens, self.dim = n_tokens, dim
        self.net = nn.Sequential(
            nn.Linear(vocab_size, 512), nn.SiLU(),
            nn.Linear(512, n_tokens * dim),
        )
    def forward(self, x):
        return self.net(x).view(-1, self.n_tokens, self.dim)

def build_unet(cross_dim=256):
    return UNet2DConditionModel(
        sample_size=128, in_channels=3, out_channels=3,
        layers_per_block=2,
        block_out_channels=(96, 192, 320, 384),
        down_block_types=("DownBlock2D", "CrossAttnDownBlock2D", "CrossAttnDownBlock2D", "DownBlock2D"),
        up_block_types=("UpBlock2D", "CrossAttnUpBlock2D", "CrossAttnUpBlock2D", "UpBlock2D"),
        cross_attention_dim=cross_dim, attention_head_dim=8,
    )

# --- load ---
device = "cuda"
ckpt = torch.load("ckpt-45th-epoch.pt", map_location=device)
vocab = ckpt["vocab"]
unet = build_unet().to(device); unet.load_state_dict(ckpt["unet"]); unet.eval()
tag_cond = TagConditioner(len(vocab)).to(device); tag_cond.load_state_dict(ckpt["tag_cond"]); tag_cond.eval()

# --- sample ---
@torch.no_grad()
def generate(prompt, n=4, guidance=2.0, steps=200, seed=42):
    # Build the multi-hot tag vector; tags missing from the vocab are skipped.
    tag_to_idx = {t: i for i, t in enumerate(vocab)}
    mh = torch.zeros(len(vocab))
    for t in [s.strip() for s in prompt.split(",") if s.strip()]:
        if t in tag_to_idx: mh[tag_to_idx[t]] = 1.0
    mh = mh.unsqueeze(0).repeat(n, 1).to(device)
    # Conditional tokens from the prompt, unconditional tokens from an all-zero vector.
    cond, uncond = tag_cond(mh), tag_cond(torch.zeros_like(mh))

    sched = DDIMScheduler(num_train_timesteps=1000, beta_schedule="squaredcos_cap_v2")
    sched.set_timesteps(steps)
    g = torch.Generator(device=device).manual_seed(seed)
    x = torch.randn(n, 3, 96, 96, device=device, generator=g)

    for t in sched.timesteps:
        # One batched forward pass covers both the unconditional and conditional branches.
        with autocast("cuda", dtype=torch.bfloat16):
            pred = unet(torch.cat([x, x]), t,
                        encoder_hidden_states=torch.cat([uncond, cond])).sample
        pu, pc = pred.float().chunk(2)
        # Classifier-free guidance, then one DDIM update.
        x = sched.step(pu + guidance * (pc - pu), t, x).prev_sample

    # Map [-1, 1] to [0, 255] and convert to HWC uint8 arrays for PIL.
    arr = ((x.clamp(-1, 1) + 1) * 127.5).byte().permute(0, 2, 3, 1).cpu().numpy()
    return [Image.fromarray(a) for a in arr]

imgs = generate("1girl,red hair,floral background,smile", n=4, guidance=2.0, steps=200)
imgs[0].save("out.png")

Prompting tips

  • Use Danbooru-style tags, comma-separated. 1girl not girl. red hair not red hairs. blue eyes not blue eye.
  • Stack 4-8 tags per prompt for best results.
  • Common tags from the vocab: 1girl, 1boy, long hair, short hair, blue eyes, red eyes, green eyes, purple eyes, red hair, blue hair, brown hair, white hair, pink hair, smile, blush, portrait, looking at viewer, floral background, choker.
  • If a tag doesn't match the vocab it's silently ignored. Print vocab after loading to see what's available.
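
Since unmatched tags are dropped without warning, it can help to check a prompt against the vocabulary before sampling. This is a minimal sketch using the same comma-split parsing as the sampling code. `split_prompt` is a hypothetical helper, and `demo_vocab` is a stand-in for the real `ckpt["vocab"]`.

```python
def split_prompt(prompt, vocab):
    """Split a comma-separated prompt into (known, unknown) tag lists."""
    known_tags = set(vocab)
    tags = [s.strip() for s in prompt.split(",") if s.strip()]
    known = [t for t in tags if t in known_tags]
    unknown = [t for t in tags if t not in known_tags]
    return known, unknown

# Stand-in vocabulary for illustration; use ckpt["vocab"] in practice.
demo_vocab = ["1girl", "red hair", "smile", "floral background"]
known, unknown = split_prompt("girl, floral background, smile, red hairs", demo_vocab)
print("used:", known)
print("ignored:", unknown)
```

Anything that lands in the second list never reaches the model, so fix or remove it before sampling.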

Limitations

Being upfront about what this model can't do:

  • 96×96 only. That's tiny by modern standards. Faces are recognizable but not detailed.
  • Heavy female bias. Over 90% of the dataset is female anime characters. 1boy mostly gets ignored.
  • Tag exact-match. No CLIP, no natural language. Misspell a tag and it's gone.
  • CFG fragility. Above guidance ~3 the model starts producing washed-out, low-saturation outputs. Above ~5 you get dual/blended faces. Stay in 1.5-2.7 for clean samples.
  • No EMA weights. Sampling uses live training weights, which adds noise. v2 will fix this.
  • No safety checker. It's faces. Of fictional anime characters. Should be fine, but no filter is in place.

Training details

  • Optimizer: AdamW, lr=1e-4, betas=(0.9, 0.999), wd=1e-6
  • Scheduler: DDPM, 1000 timesteps, squaredcos_cap_v2 beta schedule
  • Noise prediction loss (MSE)
  • CFG dropout: 10% null condition during training
  • Mixed precision: bf16 autocast
  • Batch size: 16
  • Steps per epoch: 625
  • Total training steps: ~31k
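
The step counts above follow directly from the dataset size, batch size, and epoch count given in this card. A quick check of the arithmetic (rounding the last partial batch up is my assumption, though 10,000 divides evenly by 16 so it makes no difference here):

```python
import math

dataset_size = 10_000  # "first 10k images" from the card
batch_size = 16
epochs = 50

steps_per_epoch = math.ceil(dataset_size / batch_size)
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps)  # 625 31250
```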

Loss plateaus around 0.038–0.045 by epoch 15. Visual quality keeps improving past the plateau until ~epoch 40-45.

What's next (v2)

  • Train at 128×128 (more spatial bandwidth → no more competing-face artifacts)
  • Add EMA weights for sampling
  • Run 100-150 epochs
  • Rebalance dataset to include more diverse character types
  • Try natural-language captions via WD14 → CLIP encoder (gives real prompt freedom)

Acknowledgements


Built solo over an afternoon on an RTX 5080. The Windows sysmem-fallback gotcha cost me 2 hours before I caught it. Posting this in case it helps someone else avoid the same trap.
