AestheticSigLIP

Image aesthetic scorer (1–10) built on SigLIP 2 So400m NaFlex (~430M params), fine-tuned with multi-layer feature tapping, SORD loss, and ranking loss.

Usage

```bash
pip install torch torchvision pillow huggingface_hub
```

```python
# clone or copy predict.py + model/ + naflex.py + config.py from the repo
from predict import AestheticScorer

scorer = AestheticScorer.from_pretrained("somepago/AestheticSigLIP")

# single image
score = scorer.rate("photo.jpg")          # float, e.g. 7.42

# batch
scores = scorer.rate(["a.jpg", "b.jpg"])  # [7.42, 3.81]

# PIL image directly
from PIL import Image
score = scorer.rate(Image.open("photo.jpg"))
```

CLI:

```bash
python predict.py photo.jpg photo2.jpg --repo somepago/AestheticSigLIP
```

Score scale

| Score | Meaning |
|-------|---------|
| 1–2   | Blurry, broken, heavy watermarks |
| 3–4   | Bad framing, low effort, obvious AI slop |
| 5–6   | Generic, forgettable – typical web image |
| 7     | Good – clear intent, solid composition |
| 8     | Very good – strong visual impact |
| 9–10  | Exceptional – award-level work |

Architecture

```
Image (any aspect ratio)
  ↓
NaFlex preprocessing – aspect-ratio-aware patching (max 256 patches, patch=16px)
  ↓
SigLIP 2 So400m encoder (27 transformer blocks, 1152-d, 16 heads)
  ├── tap layer  8 → masked mean pool → 1152-d
  ├── tap layer 17 → masked mean pool → 1152-d
  └── final pooled (multi-head attention probe) → 1152-d
  ↓
Concatenate [pool | tap8 | tap17] → 3456-d
  ↓
MLP head: 3456 → 768 → 256 → 8 bucket logits
  ↓
softmax → expected value over bucket centers → score ∈ [1, 10]
```

Multi-layer tapping lets early layers contribute low-level cues (color, sharpness) while later layers contribute composition and semantic quality.
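The concatenate-and-score path can be sketched in PyTorch. This is a hedged reconstruction from the diagram above, not the repo's actual code: the class name `AestheticHead` is invented, and the uniformly spaced bucket centers are a placeholder for the model's real non-uniform buckets.

```python
import torch
import torch.nn as nn

# Assumed: 8 evenly spaced centers; the actual buckets are non-uniform.
BUCKET_CENTERS = torch.linspace(1.0, 10.0, 8)

class AestheticHead(nn.Module):
    """Hypothetical scoring head: [pool | tap8 | tap17] -> 8 bucket logits."""

    def __init__(self, dim: int = 1152, n_buckets: int = 8):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3 * dim, 768), nn.GELU(),
            nn.Linear(768, 256), nn.GELU(),
            nn.Linear(256, n_buckets),
        )

    def forward(self, pooled, tap8, tap17):
        # masked mean pooling of the tap layers is assumed to happen upstream
        feats = torch.cat([pooled, tap8, tap17], dim=-1)   # (B, 3456)
        probs = self.mlp(feats).softmax(dim=-1)            # (B, 8)
        # expected value over bucket centers -> score in [1, 10]
        return probs @ BUCKET_CENTERS.to(probs.device)

head = AestheticHead()
score = head(torch.randn(1, 1152), torch.randn(1, 1152), torch.randn(1, 1152))
```

Because the output is a probability-weighted average of centers that all lie in [1, 10], the score is guaranteed to stay in that range.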

Training

  • Base model: google/siglip2-so400m-patch16-naflex (pretrained weights)
  • Training data: ~134k images hand-curated from diverse web sources, with aesthetic scores annotated by Gemini Flash, spanning the following categories:
    • Curated photography – 21 thematic categories: landscape, portrait, wildlife, macro, architecture, fashion, food, street, product, astrophotography, and more
    • General photography – community-shared and web-sourced real photos (both high and low quality)
    • AI-generated images – outputs from multiple text-to-image diffusion model families
    • AI community content – social/community-shared AI art and renders
    • Traditional & digital art – oil paintings, watercolors, charcoal drawings, digital illustrations
    • Graphic design & typography – posters, logos, typographic layouts
    • Stock/commercial imagery – professional stock photography and product shots
    • Negative examples – images with heavy text overlays, watermarked content, low-quality web images, and broken/corrupt images
  • Loss: SORD (soft ordinal) on 8 non-uniform score buckets + auxiliary ranking loss (λ=0.3)
  • Optimizer: AdamW with LLRD (layer-wise LR decay 0.7×), backbone LR 1e-5, head LR 1e-3
  • EMA: exponential moving average (decay=0.9998), used for this checkpoint
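The combined objective can be sketched as below. This is an illustrative reconstruction, not the repo's loss code: the bucket centers, squared-distance SORD target, temperature `tau`, and ranking margin are all assumptions; only the λ=0.3 weighting comes from the list above.

```python
import torch
import torch.nn.functional as F

CENTERS = torch.linspace(1.0, 10.0, 8)  # placeholder; actual buckets are non-uniform

def sord_targets(scores, tau=1.0):
    """Soft ordinal targets: softmax over negative squared distance to each center."""
    d = (scores[:, None] - CENTERS[None, :]) ** 2       # (B, 8)
    return F.softmax(-d / tau, dim=-1)

def sord_loss(logits, scores):
    # cross-entropy between predicted bucket distribution and soft ordinal target
    return -(sord_targets(scores) * F.log_softmax(logits, dim=-1)).sum(-1).mean()

def ranking_loss(pred, target, margin=0.1):
    """Pairwise margin ranking over all ordered pairs in the batch."""
    dp = pred[:, None] - pred[None, :]                  # predicted score gaps
    sign = (target[:, None] - target[None, :]).sign()   # desired ordering
    return F.relu(margin - sign * dp)[sign != 0].mean()

def total_loss(logits, scores, lam=0.3):
    pred = F.softmax(logits, dim=-1) @ CENTERS          # expected score per image
    return sord_loss(logits, scores) + lam * ranking_loss(pred, scores)
```

Unlike one-hot classification, the SORD target penalizes a prediction in an adjacent bucket far less than one several buckets away, which matches the ordinal nature of aesthetic scores; the ranking term additionally enforces correct ordering within a batch.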

Evaluation vs Qwen3-VL-32B harsh critic (1967 images, 18 sources)

| Model           | SRCC  | MAE  |
|-----------------|-------|------|
| AestheticSigLIP | 0.671 | 1.04 |

Per-source highlights: real-lq/hq (SRCC 0.78/0.64), pinterest-curated (0.65), bad-text (0.86). Weakest on generic AI art (pickapic 0.24, playground 0.15) – these categories are underrepresented in training.
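For reference, the two reported metrics are Spearman rank correlation and mean absolute error between model and critic scores; they can be computed with SciPy and NumPy. The arrays below are made-up placeholders, not actual model or Qwen3-VL outputs.

```python
import numpy as np
from scipy.stats import spearmanr

# Placeholder scores for five images (not real outputs).
model_scores  = np.array([7.4, 3.8, 5.1, 8.2, 6.0])
critic_scores = np.array([7.0, 4.2, 5.5, 8.8, 5.1])

srcc = spearmanr(model_scores, critic_scores).correlation  # rank agreement, here 0.9
mae = np.mean(np.abs(model_scores - critic_scores))        # mean absolute error, here 0.54
```

SRCC only measures whether the two raters order images the same way, so a model can have high SRCC while being systematically harsher or kinder than the critic; MAE captures that absolute offset.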

Citation

If you would like to cite this work in an academic context, you can use this BibTeX snippet:

```bibtex
@misc{aestheticsiglip2026,
  author    = {Somepalli, Gowthami},
  title     = {AestheticSigLIP: Image Aesthetic Scoring on SigLIP 2 NaFlex},
  year      = {2026},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/somepago/AestheticSigLIP}
}
```