AestheticSigLIP

Image aesthetic scorer (1–10) built on SigLIP 2 So400m NaFlex (~430M params), fine-tuned with multi-layer feature tapping, SORD loss, and ranking loss.

Usage

```bash
pip install torch torchvision pillow huggingface_hub
```

```python
# clone or copy predict.py + model/ + naflex.py + config.py from the repo
from predict import AestheticScorer

scorer = AestheticScorer.from_pretrained("somepago/AestheticSigLIP")

# single image
score = scorer.rate("photo.jpg")          # float, e.g. 7.42

# batch
scores = scorer.rate(["a.jpg", "b.jpg"])  # [7.42, 3.81]

# PIL image directly
from PIL import Image
score = scorer.rate(Image.open("photo.jpg"))
```

CLI:

```bash
python predict.py photo.jpg photo2.jpg --repo somepago/AestheticSigLIP
```

Score scale

| Score | Meaning |
|-------|---------|
| 1–2   | Blurry, broken, heavy watermarks |
| 3–4   | Bad framing, low effort, obvious AI slop |
| 5–6   | Generic, forgettable – typical web image |
| 7     | Good – clear intent, solid composition |
| 8     | Very good – strong visual impact |
| 9–10  | Exceptional – award-level work |

Architecture

```
Image (any aspect ratio)
  ↓
NaFlex preprocessing – aspect-ratio-aware patching (max 256 patches, patch=16px)
  ↓
SigLIP 2 So400m encoder (27 transformer blocks, 1152-d, 16 heads)
  ├── tap layer  8 → masked mean pool → 1152-d
  ├── tap layer 17 → masked mean pool → 1152-d
  └── final pooled (multi-head attention probe) → 1152-d
  ↓
Concatenate [pool | tap8 | tap17] → 3456-d
  ↓
MLP head: 3456 → 768 → 256 → 8 bucket logits
  ↓
softmax → expected value over bucket centers → score ∈ [1, 10]
```

Multi-layer tapping lets early layers contribute low-level cues (color, sharpness) while later layers contribute composition and semantic quality.
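The concatenate-and-score path can be sketched in PyTorch. This is a hedged reconstruction from the diagram above, not the repo's actual code: the class name `AestheticHead` is invented, and the uniformly spaced bucket centers are a placeholder for the model's real non-uniform buckets.

```python
import torch
import torch.nn as nn

# Assumed: 8 evenly spaced centers; the actual buckets are non-uniform.
BUCKET_CENTERS = torch.linspace(1.0, 10.0, 8)

class AestheticHead(nn.Module):
    """Hypothetical scoring head: [pool | tap8 | tap17] -> 8 bucket logits."""

    def __init__(self, dim: int = 1152, n_buckets: int = 8):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3 * dim, 768), nn.GELU(),
            nn.Linear(768, 256), nn.GELU(),
            nn.Linear(256, n_buckets),
        )

    def forward(self, pooled, tap8, tap17):
        # masked mean pooling of the tap layers is assumed to happen upstream
        feats = torch.cat([pooled, tap8, tap17], dim=-1)   # (B, 3456)
        probs = self.mlp(feats).softmax(dim=-1)            # (B, 8)
        # expected value over bucket centers -> score in [1, 10]
        return probs @ BUCKET_CENTERS.to(probs.device)

head = AestheticHead()
score = head(torch.randn(1, 1152), torch.randn(1, 1152), torch.randn(1, 1152))
```

Because the output is a probability-weighted average of centers that all lie in [1, 10], the score is guaranteed to stay in that range.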

Training

  • Base model: google/siglip2-so400m-patch16-naflex (pretrained weights)
  • Training data: ~134k images hand-curated from diverse web sources, with aesthetic scores annotated by Gemini Flash, spanning the following categories:
    • Curated photography – 21 thematic categories: landscape, portrait, wildlife, macro, architecture, fashion, food, street, product, astrophotography, and more
    • General photography – community-shared and web-sourced real photos (both high and low quality)
    • AI-generated images – outputs from multiple text-to-image diffusion model families
    • AI community content – social/community-shared AI art and renders
    • Traditional & digital art – oil paintings, watercolors, charcoal drawings, digital illustrations
    • Graphic design & typography – posters, logos, typographic layouts
    • Stock/commercial imagery – professional stock photography and product shots
    • Negative examples – images with heavy text overlays, watermarked content, low-quality web images, and broken/corrupt images
  • Loss: SORD (soft ordinal) on 8 non-uniform score buckets + auxiliary ranking loss (λ=0.3)
  • Optimizer: AdamW with LLRD (layer-wise LR decay 0.7×), backbone LR 1e-5, head LR 1e-3
  • EMA: exponential moving average (decay=0.9998), used for this checkpoint
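The combined objective can be sketched as below. This is an illustrative reconstruction, not the repo's loss code: the bucket centers, squared-distance SORD target, temperature `tau`, and ranking margin are all assumptions; only the λ=0.3 weighting comes from the list above.

```python
import torch
import torch.nn.functional as F

CENTERS = torch.linspace(1.0, 10.0, 8)  # placeholder; actual buckets are non-uniform

def sord_targets(scores, tau=1.0):
    """Soft ordinal targets: softmax over negative squared distance to each center."""
    d = (scores[:, None] - CENTERS[None, :]) ** 2       # (B, 8)
    return F.softmax(-d / tau, dim=-1)

def sord_loss(logits, scores):
    # cross-entropy between predicted bucket distribution and soft ordinal target
    return -(sord_targets(scores) * F.log_softmax(logits, dim=-1)).sum(-1).mean()

def ranking_loss(pred, target, margin=0.1):
    """Pairwise margin ranking over all ordered pairs in the batch."""
    dp = pred[:, None] - pred[None, :]                  # predicted score gaps
    sign = (target[:, None] - target[None, :]).sign()   # desired ordering
    return F.relu(margin - sign * dp)[sign != 0].mean()

def total_loss(logits, scores, lam=0.3):
    pred = F.softmax(logits, dim=-1) @ CENTERS          # expected score per image
    return sord_loss(logits, scores) + lam * ranking_loss(pred, scores)
```

Unlike one-hot classification, the SORD target penalizes a prediction in an adjacent bucket far less than one several buckets away, which matches the ordinal nature of aesthetic scores; the ranking term additionally enforces correct ordering within a batch.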

Evaluation vs Qwen3-VL-32B harsh critic (1967 images, 18 sources)

| Model           | SRCC  | MAE  |
|-----------------|-------|------|
| AestheticSigLIP | 0.671 | 1.04 |

Per-source highlights: real-lq/hq (SRCC 0.78/0.64), pinterest-curated (0.65), bad-text (0.86). Weakest on generic AI art (pickapic 0.24, playground 0.15) – these categories are underrepresented in training.
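For reference, the two reported metrics are Spearman rank correlation and mean absolute error between model and critic scores; they can be computed with SciPy and NumPy. The arrays below are made-up placeholders, not actual model or Qwen3-VL outputs.

```python
import numpy as np
from scipy.stats import spearmanr

# Placeholder scores for five images (not real outputs).
model_scores  = np.array([7.4, 3.8, 5.1, 8.2, 6.0])
critic_scores = np.array([7.0, 4.2, 5.5, 8.8, 5.1])

srcc = spearmanr(model_scores, critic_scores).correlation  # rank agreement, here 0.9
mae = np.mean(np.abs(model_scores - critic_scores))        # mean absolute error, here 0.54
```

SRCC only measures whether the two raters order images the same way, so a model can have high SRCC while being systematically harsher or kinder than the critic; MAE captures that absolute offset.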

Citation

If you would like to cite this work in an academic context, you can use this BibTeX snippet:

```bibtex
@misc{aestheticsiglip2026,
  author    = {Somepalli, Gowthami},
  title     = {AestheticSigLIP: Image Aesthetic Scoring on SigLIP 2 NaFlex},
  year      = {2026},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/somepago/AestheticSigLIP}
}
```