Anime-Elite: From-Scratch Text-to-Anime-Face Diffusion (v1)
Built For Low-End Devices!!
Download From Files and versions Tab!
This model can be used with an external image upscaler
A small conditional pixel-space diffusion model trained from scratch on 10k anime faces, with Danbooru-style tag prompts. No pretrained VAE, no fine-tuning of anything: everything in this model started from random weights.
One of the best ones with seed 21:
This is a v1 proof of concept. It works. It's not polished. The full story is in Limitations; read it before you judge the samples.
Sample grid: rows are checkpoints (epoch 50 → 30), and each column uses a fixed seed. Prompt: "girl, floral background, smile, red hairs."; guidance 1.8; 200 DDIM steps.
What this is
- Task: text → 96×96 anime face
- Architecture: diffusers.UNet2DConditionModel, ~66M params, cross-attention conditioning
- Conditioning: multi-hot over a 512-tag vocabulary → 4 cross-attention tokens via a small MLP
- Sampling: DDIM with classifier-free guidance (10% dropout during training)
- Training data: first 10k images from puruchinera/anime-faces-256, resized 256 → 96
- Hardware: single RTX 5080 (16 GB), 50 epochs, ~2 hours wall-clock
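The 10% conditioning dropout named in the Sampling bullet could look roughly like the sketch below. It assumes the null condition is the all-zero multi-hot vector, which is what the sampling code in this card feeds to the unconditional branch; the training script itself is not shown here, so `drop_condition` and its arguments are hypothetical names.

```python
import random

def drop_condition(multi_hot, p_drop=0.10, rng=random):
    """With probability p_drop return the all-zero null condition,
    otherwise return a copy of the original multi-hot vector.

    Assumption: "null" means an all-zero tag vector.
    """
    if rng.random() < p_drop:
        return [0.0] * len(multi_hot)
    return list(multi_hot)

# Toy 6-tag example: roughly one call in ten comes back zeroed.
rng = random.Random(0)
example = [1.0, 0.0, 1.0, 0.0, 0.0, 1.0]
dropped = sum(1 for _ in range(10_000) if not any(drop_condition(example, rng=rng)))
print(f"null-conditioned fraction: {dropped / 10_000:.3f}")
```

Training on a mix of conditioned and null-conditioned samples is what lets a single network supply both predictions that classifier-free guidance combines at sampling time.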
I deliberately didn't use a pretrained VAE because I wanted everything end-to-end from scratch. That's why this is pixel-space diffusion at 96px, not latent diffusion.
Checkpoints
Five checkpoints from across training are included. Each one is a snapshot at the listed epoch.
| File | Epoch | Notes |
|---|---|---|
| ckpt-30th-epoch.pt | 30 | sketchy, manga-like, rough edges |
| ckpt-35th-epoch.pt | 35 | smoother, faces solidifying |
| ckpt-40th-epoch.pt | 40 | most consistent quality across seeds |
| ckpt-45th-epoch.pt | 45 | competitive with 40, slightly more refined |
| ckpt-50th-epoch.pt | 50 | highest peaks, more variance, slight overcook |
TL;DR: use ckpt-45th-epoch.pt for general use. It has the best balance of detail and consistency.
Each .pt is a dict with three keys: unet, tag_cond, and vocab (the 512-tag list).
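A quick way to sanity-check a downloaded file against that layout is sketched below. The key names come from the sentence above; the helper names `describe_checkpoint` and `load_and_describe` are made up for this example, and the counts reported are simply the number of entries in each state dict.

```python
from collections.abc import Mapping

EXPECTED_KEYS = ("unet", "tag_cond", "vocab")

def describe_checkpoint(ckpt: Mapping) -> dict:
    """Summarize a loaded checkpoint dict without touching the weights."""
    missing = [k for k in EXPECTED_KEYS if k not in ckpt]
    if missing:
        raise KeyError(f"checkpoint is missing keys: {missing}")
    return {
        "unet_tensors": len(ckpt["unet"]),          # entries in the UNet state dict
        "tag_cond_tensors": len(ckpt["tag_cond"]),  # entries in the conditioner state dict
        "vocab_size": len(ckpt["vocab"]),           # should be 512 per this card
    }

def load_and_describe(path: str) -> dict:
    # torch is imported lazily so describe_checkpoint works without it installed.
    import torch
    ckpt = torch.load(path, map_location="cpu")
    return describe_checkpoint(ckpt)
```

Calling `load_and_describe("ckpt-45th-epoch.pt")` on a downloaded file should report a `vocab_size` of 512 if the file matches the description in this card.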
Best sampling config
After a lot of sweeping, this is the config that gave the most satisfying results:
prompt: 1girl,<hair color> hair,<eye color> eyes,smile,floral background
guidance: 1.8 to 2.7
DDIM steps: 200
checkpoint: 45 or 40
Higher guidance (4-5) strengthens tag adherence but introduces washed-out colors (a classic small-model CFG artifact). Fewer steps (<100) leave the output noisy.
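The guidance number is the weight in the classifier-free guidance combination that the sampling loop in How to use applies at every step (`pu + guidance * (pc - pu)`). A minimal sketch with toy scalars standing in for the two noise predictions:

```python
def cfg_combine(uncond, cond, guidance):
    """Classifier-free guidance: start at the unconditional prediction and
    move toward the conditional one; guidance > 1 extrapolates past it."""
    return uncond + guidance * (cond - uncond)

# Toy scalar "predictions" (chosen to be exact in binary floating point).
uncond, cond = 0.25, 0.5
for w in (0.0, 1.0, 2.0, 4.0):
    print(f"guidance={w}: {cfg_combine(uncond, cond, w)}")
```

A guidance of 0 returns the unconditional prediction and 1 returns the conditional one. Larger values push the result further along the same direction, which is consistent with the over-saturation and washed-out artifacts described above at 4-5.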
How to use
import torch
from PIL import Image
from torch.amp import autocast
from diffusers import UNet2DConditionModel, DDIMScheduler
import torch.nn as nn

# --- model defs (must match training) ---
class TagConditioner(nn.Module):
    def __init__(self, vocab_size, dim=256, n_tokens=4):
        super().__init__()
        self.n_tokens, self.dim = n_tokens, dim
        self.net = nn.Sequential(
            nn.Linear(vocab_size, 512), nn.SiLU(),
            nn.Linear(512, n_tokens * dim),
        )

    def forward(self, x):
        return self.net(x).view(-1, self.n_tokens, self.dim)

def build_unet(cross_dim=256):
    return UNet2DConditionModel(
        sample_size=128, in_channels=3, out_channels=3,
        layers_per_block=2,
        block_out_channels=(96, 192, 320, 384),
        down_block_types=("DownBlock2D", "CrossAttnDownBlock2D", "CrossAttnDownBlock2D", "DownBlock2D"),
        up_block_types=("UpBlock2D", "CrossAttnUpBlock2D", "CrossAttnUpBlock2D", "UpBlock2D"),
        cross_attention_dim=cross_dim, attention_head_dim=8,
    )

# --- load ---
device = "cuda"
ckpt = torch.load("ckpt-45th-epoch.pt", map_location=device)
vocab = ckpt["vocab"]
unet = build_unet().to(device)
unet.load_state_dict(ckpt["unet"])
unet.eval()
tag_cond = TagConditioner(len(vocab)).to(device)
tag_cond.load_state_dict(ckpt["tag_cond"])
tag_cond.eval()

# --- sample ---
@torch.no_grad()
def generate(prompt, n=4, guidance=2.0, steps=200, seed=42):
    tag_to_idx = {t: i for i, t in enumerate(vocab)}
    mh = torch.zeros(len(vocab))
    for t in [s.strip() for s in prompt.split(",") if s.strip()]:
        if t in tag_to_idx:
            mh[tag_to_idx[t]] = 1.0
    mh = mh.unsqueeze(0).repeat(n, 1).to(device)
    cond, uncond = tag_cond(mh), tag_cond(torch.zeros_like(mh))
    sched = DDIMScheduler(num_train_timesteps=1000, beta_schedule="squaredcos_cap_v2")
    sched.set_timesteps(steps)
    g = torch.Generator(device=device).manual_seed(seed)
    x = torch.randn(n, 3, 96, 96, device=device, generator=g)
    for t in sched.timesteps:
        with autocast("cuda", dtype=torch.bfloat16):
            pred = unet(torch.cat([x, x]), t,
                        encoder_hidden_states=torch.cat([uncond, cond])).sample
        pu, pc = pred.float().chunk(2)
        x = sched.step(pu + guidance * (pc - pu), t, x).prev_sample
    arr = ((x.clamp(-1, 1) + 1) * 127.5).byte().permute(0, 2, 3, 1).cpu().numpy()
    return [Image.fromarray(a) for a in arr]

imgs = generate("1girl,red hair,floral background,smile", n=4, guidance=2.0, steps=200)
imgs[0].save("out.png")
Prompting tips
- Use Danbooru-style tags, comma-separated. Write 1girl (not girl), red hair (not red hairs), blue eyes (not blue eye).
- Stack 4-8 tags per prompt for best results.
- Common tags from the vocab: 1girl, 1boy, long hair, short hair, blue eyes, red eyes, green eyes, purple eyes, red hair, blue hair, brown hair, white hair, pink hair, smile, blush, portrait, looking at viewer, floral background, choker.
- If a tag doesn't match the vocab it's silently ignored. Print vocab after loading to see what's available.
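Because unmatched tags are dropped silently, it can help to check a prompt against the vocabulary before sampling. The sketch below is not part of the model: `check_prompt` is a hypothetical helper, and the closest-match suggestion uses Python's standard `difflib`, which is an addition for convenience only. The toy vocabulary stands in for `ckpt["vocab"]`.

```python
import difflib

def check_prompt(prompt, vocab):
    """Split a comma-separated prompt into tags found in vocab, plus a
    closest-vocab suggestion (or None) for every tag that is not."""
    vocab_set = set(vocab)
    tags = [s.strip() for s in prompt.split(",") if s.strip()]
    known = [t for t in tags if t in vocab_set]
    suggestions = {}
    for t in tags:
        if t not in vocab_set:
            close = difflib.get_close_matches(t, vocab, n=1)
            suggestions[t] = close[0] if close else None
    return known, suggestions

# Toy vocabulary for illustration; in practice pass ckpt["vocab"].
toy_vocab = ["1girl", "red hair", "blue eyes", "smile", "floral background"]
known, suggestions = check_prompt("girl, red hairs, smile, floral background", toy_vocab)
print("used:", known)
print("ignored, with suggestion:", suggestions)
```

Any tag that ends up in the second dictionary would have contributed nothing to the conditioning vector.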
Limitations
Being upfront about what this model can't do:
- 96×96 only. That's tiny by modern standards. Faces are recognizable but not detailed.
- Heavy female bias. The dataset is ~90%+ female anime characters. 1boy mostly gets ignored.
- Tag exact-match. No CLIP, no natural language. Misspell a tag and it's gone.
- CFG fragility. Above guidance ~3 the model starts producing washed-out, low-saturation outputs. Above ~5 you get dual/blended faces. Stay in 1.5-2.7 for clean samples.
- No EMA weights. Sampling uses live training weights, which adds noise. v2 will fix this.
- No safety checker. It's faces. Of fictional anime characters. Should be fine, but no filter is in place.
Training details
- Optimizer: AdamW, lr=1e-4, betas=(0.9, 0.999), wd=1e-6
- Scheduler: DDPM, 1000 timesteps, squaredcos_cap_v2 beta schedule
- Noise prediction loss (MSE)
- CFG dropout: 10% null condition during training
- Mixed precision: bf16 autocast
- Batch size: 16
- Steps per epoch: 625
- Total training steps: ~31k
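The step counts in this list follow from the dataset size, batch size, and epoch count stated earlier in this card. As a quick arithmetic check (assuming every epoch passes over all 10,000 images once):

```python
num_images = 10_000   # "first 10k images"
batch_size = 16
epochs = 50

steps_per_epoch = num_images // batch_size   # 10,000 / 16 divides evenly
total_steps = steps_per_epoch * epochs

print(steps_per_epoch)  # 625
print(total_steps)      # 31250, i.e. the "~31k" quoted above
```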
Loss plateaus around 0.038-0.045 by epoch 15. Visual quality keeps improving past the plateau until ~epoch 40-45.
What's next (v2)
- Train at 128×128 (more spatial bandwidth, so no more competing-face artifacts)
- Add EMA weights for sampling
- Run 100-150 epochs
- Rebalance dataset to include more diverse character types
- Try natural-language captions via WD14 → CLIP encoder (gives real prompt freedom)
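For the EMA item, the idea is to keep a slowly moving average of the weights and sample from that copy instead of the live training weights. v2 does not exist yet, so the following is only a sketch of the standard update rule: the decay value and the `ema_update` name are assumptions, and plain floats stand in for weight tensors.

```python
def ema_update(ema, params, decay=0.999):
    """In-place exponential moving average:
    ema[k] <- decay * ema[k] + (1 - decay) * params[k]."""
    for name, value in params.items():
        ema[name] = decay * ema[name] + (1.0 - decay) * value
    return ema

# Toy example with a single scalar "weight" and an exaggerated decay.
ema = {"w": 0.0}
for step in range(3):
    ema_update(ema, {"w": 1.0}, decay=0.5)
    print(step, ema["w"])
```

In a real training loop the update would run once per optimizer step, and the averaged copy would be saved alongside (or instead of) the live weights.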
Acknowledgements
- Dataset: puruchinera/anime-faces-256
- Architecture: HuggingFace diffusers
Built solo over an afternoon on an RTX 5080. The Windows sysmem-fallback gotcha cost me 2 hours before I caught it. Posting this in case it helps someone else avoid the same trap.

