# AestheticSigLIP

Image aesthetic scorer (1–10) built on SigLIP 2 So400m NaFlex (~430M params), fine-tuned with multi-layer feature tapping, SORD loss, and a ranking loss.
## Usage

```bash
pip install torch torchvision pillow huggingface_hub
```
```python
from huggingface_hub import hf_hub_download
# clone or copy predict.py + model/ + naflex.py + config.py from the repo
from predict import AestheticScorer

scorer = AestheticScorer.from_pretrained("somepago/AestheticSigLIP")

# single image
score = scorer.rate("photo.jpg")           # float, e.g. 7.42

# batch
scores = scorer.rate(["a.jpg", "b.jpg"])   # [7.42, 3.81]

# PIL image directly
from PIL import Image
score = scorer.rate(Image.open("photo.jpg"))
```
CLI:

```bash
python predict.py photo.jpg photo2.jpg --repo somepago/AestheticSigLIP
```
## Score scale
| Score | Meaning |
|---|---|
| 1–2 | Blurry, broken, heavy watermarks |
| 3–4 | Bad framing, low effort, obvious AI slop |
| 5–6 | Generic, forgettable – typical web image |
| 7 | Good – clear intent, solid composition |
| 8 | Very good – strong visual impact |
| 9–10 | Exceptional – award-level work |
## Architecture
```text
Image (any aspect ratio)
        ↓
NaFlex preprocessing – aspect-ratio-aware patching (max 256 patches, patch = 16 px)
        ↓
SigLIP 2 So400m encoder (27 transformer blocks, 1152-d, 16 heads)
 ├── tap layer 8  → masked mean pool → 1152-d
 ├── tap layer 17 → masked mean pool → 1152-d
 └── final pooled (multi-head attention probe) → 1152-d
        ↓
Concatenate [pool | tap8 | tap17] → 3456-d
        ↓
MLP head: 3456 → 768 → 256 → 8 bucket logits
        ↓
softmax → expected value over bucket centers → score ∈ [1, 10]
```
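The NaFlex stage can be sketched as follows. The grid arithmetic here is an assumption (the repo's `naflex.py` may scale and round differently); it only illustrates how a 256-patch budget preserves aspect ratio instead of forcing a square resize:

```python
import math

def naflex_grid(w, h, patch=16, max_patches=256):
    """Sketch of aspect-ratio-aware patching under a fixed patch budget.
    Scales the image so its 16px patch grid keeps the original aspect
    ratio while using at most `max_patches` patches (assumed behavior)."""
    # pick scale s so that (w*s/patch) * (h*s/patch) <= max_patches
    scale = math.sqrt(max_patches * patch * patch / (w * h))
    scale = min(scale, 1.0)  # assumption: never upscale small images
    gw = max(1, int(w * scale / patch))  # grid width in patches
    gh = max(1, int(h * scale / patch))  # grid height in patches
    return gw, gh

naflex_grid(1920, 1080)  # a 16:9 image keeps a wide (21 x 12) grid
```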
Multi-layer tapping lets early layers contribute low-level cues (color, sharpness) while later layers contribute composition and semantic quality.
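The head described above can be sketched in a few lines. The layer shapes match the diagram; the bucket centers are a placeholder (the checkpoint uses 8 *non-uniform* centers that are not reproduced here), so treat this as an illustration, not the repo's implementation:

```python
import torch
import torch.nn as nn

class AestheticHead(nn.Module):
    """Sketch: concat [pool | tap8 | tap17] -> MLP -> 8 bucket logits
    -> softmax -> expected value over bucket centers."""
    def __init__(self, dim=1152, n_buckets=8):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3 * dim, 768), nn.GELU(),
            nn.Linear(768, 256), nn.GELU(),
            nn.Linear(256, n_buckets),
        )
        # Hypothetical uniform centers on the 1-10 scale; the real
        # checkpoint's 8 centers are non-uniform.
        self.register_buffer("centers", torch.linspace(1.0, 10.0, n_buckets))

    def forward(self, pooled, tap8, tap17):
        feats = torch.cat([pooled, tap8, tap17], dim=-1)  # (B, 3456)
        probs = self.mlp(feats).softmax(dim=-1)           # (B, 8)
        return probs @ self.centers                       # score in [1, 10]
```

Because the score is an expectation over a softmax, it is continuous in [1, 10] even though training uses only 8 buckets.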
## Training
- Base model: `google/siglip2-so400m-patch16-naflex` (pretrained weights)
- Training data: ~134k images hand-curated from diverse web sources, with aesthetic scores annotated by Gemini Flash, spanning the following categories:
  - Curated photography – 21 thematic categories: landscape, portrait, wildlife, macro, architecture, fashion, food, street, product, astrophotography, and more
  - General photography – community-shared and web-sourced real photos (both high and low quality)
  - AI-generated images – outputs from multiple text-to-image diffusion model families
  - AI community content – social/community-shared AI art and renders
  - Traditional & digital art – oil paintings, watercolors, charcoal drawings, digital illustrations
  - Graphic design & typography – posters, logos, typographic layouts
  - Stock/commercial imagery – professional stock photography and product shots
  - Negative examples – images with heavy text overlays, watermarked content, low-quality web images, and broken/corrupt files
- Loss: SORD (soft ordinal) on 8 non-uniform score buckets + auxiliary ranking loss (λ = 0.3)
- Optimizer: AdamW with layer-wise LR decay (LLRD, 0.7× per layer), backbone LR 1e-5, head LR 1e-3
- EMA: exponential moving average of weights (decay = 0.9998), used for this checkpoint
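The two loss terms can be sketched as follows. The SORD temperature `tau` and the ranking `margin` are hypothetical hyperparameters (the repo's exact formulation may differ); only the λ = 0.3 weighting of the ranking term comes from the list above:

```python
import torch
import torch.nn.functional as F

def sord_loss(logits, scores, centers, tau=1.0):
    """SORD (soft ordinal regression): each ground-truth score y induces a
    soft target over buckets, softmax(-(c_k - y)^2 / tau), so mass spreads
    to nearby buckets and near-misses are penalized less than far misses.
    `tau` is an assumed temperature."""
    dist = (centers.unsqueeze(0) - scores.unsqueeze(1)) ** 2  # (B, K)
    target = F.softmax(-dist / tau, dim=-1)
    return F.kl_div(F.log_softmax(logits, dim=-1), target, reduction="batchmean")

def pairwise_ranking_loss(pred, scores, margin=0.0):
    """Auxiliary ranking term (weighted by lambda = 0.3 in training):
    for every image pair, predicted scores should order the same way
    as the labels. `margin` is an assumed hyperparameter."""
    dp = pred.unsqueeze(0) - pred.unsqueeze(1)      # predicted differences
    dy = scores.unsqueeze(0) - scores.unsqueeze(1)  # label differences
    sign = torch.sign(dy)
    return F.relu(margin - sign * dp)[sign != 0].mean()
```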
## Evaluation vs Qwen3-VL-32B harsh critic (1,967 images, 18 sources)
| Model | SRCC | MAE |
|---|---|---|
| AestheticSigLIP | 0.671 | 1.04 |
Per-source highlights: real-lq/hq (SRCC 0.78/0.64), pinterest-curated (0.65), bad-text (0.86). Weakest on generic AI art (pickapic 0.24, playground 0.15) – these categories are underrepresented in training.
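For reference, the two metrics above can be computed as follows (a minimal sketch: this rank computation does not average tied ranks, unlike `scipy.stats.spearmanr`):

```python
import numpy as np

def spearman_srcc(pred, ref):
    """Spearman rank correlation = Pearson correlation of the rank vectors.
    Sketch only: ties are not averaged."""
    rp = np.argsort(np.argsort(pred)).astype(float)
    rr = np.argsort(np.argsort(ref)).astype(float)
    rp -= rp.mean()
    rr -= rr.mean()
    return float((rp @ rr) / np.sqrt((rp @ rp) * (rr @ rr)))

def mae(pred, ref):
    """Mean absolute error between model scores and reference ratings."""
    return float(np.mean(np.abs(np.asarray(pred) - np.asarray(ref))))
```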
## Citation
If you would like to cite this work in an academic context, you can use this BibTeX snippet:
```bibtex
@misc{aestheticsiglip2026,
  author    = {Somepalli, Gowthami},
  title     = {AestheticSigLIP: Image Aesthetic Scoring on SigLIP 2 NaFlex},
  year      = {2026},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/somepago/AestheticSigLIP}
}
```