
# FaceAge ClientScan

**Face-only age estimation on the LAGENDA 84k benchmark — MAE 3.555, surpassing the state-of-the-art task-specific model MiVOLO v2 (face + body, MAE 3.65).**

Age and gender estimation from face crops using DINOv3-ViT-L backbone with CORAL ordinal regression.

## Performance (LAGENDA 84k benchmark)

| Model | Input | MAE ↓ | CS@5 ↑ | Gender Acc ↑ |
|---|---|---|---|---|
| **FaceAge ClientScan (ours)** | face only | **3.555** | 75.5% | 97.75% |
| MiVOLO v2 (paper) | face + body | 3.650 | 74.48% | 97.99% |
| MiVOLO v1 (paper) | face + body | 3.990 | 71.27% | 97.36% |
| MiVOLO v2 (measured, face+body) | face + body | 3.859 | 76.5% | 96.96% |
| MiVOLO v2 (measured, face-only) | face only | 4.224 | 69.7% | 96.05% |

Key result: FaceAge ClientScan reaches MAE = 3.555 using only the face crop (no body information), outperforming MiVOLO v2's published 3.650, which requires both face and body bounding boxes.

## Per age-group MAE (FaceAge ClientScan vs MiVOLO v2 best)

| Age Group | n | MiVOLO v2 best | FaceAge ClientScan | Delta |
|---|---|---|---|---|
| 0–12 | 15,369 | 1.677 | 1.548 | ✅ −0.129 |
| 13–17 | 3,930 | 3.365 | 2.845 | ✅ −0.520 |
| 18–25 | 9,975 | 2.989 | 2.877 | ✅ −0.112 |
| 26–35 | 10,303 | 3.348 | 3.775 | ❌ +0.427 |
| 36–50 | 19,234 | 4.484 | 4.195 | ✅ −0.289 |
| 51–65 | 16,350 | 4.794 | 4.329 | ✅ −0.465 |
| 66+ | 9,031 | 6.310 | 5.013 | ✅ −1.297 |
| **Overall** | 84,192 | 3.859 | **3.555** | ✅ −0.304 |

FaceAge ClientScan wins 6 of 7 age groups. The only group where MiVOLO v2 leads is 26–35, where body context likely helps.
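The per-group breakdown above can be reproduced with a few lines of pandas. This is a hedged sketch: the `age` and `pred_age` column names are illustrative, not taken from the released annotation file.

```python
import pandas as pd

# Illustrative data; in practice `age` comes from the LAGENDA annotations
# and `pred_age` from the model's predictions.
df = pd.DataFrame({"age":      [5, 15, 30, 70],
                   "pred_age": [6.0, 14.0, 33.0, 66.0]})

bins   = [0, 13, 18, 26, 36, 51, 66, 200]   # right-open group edges
labels = ["0-12", "13-17", "18-25", "26-35", "36-50", "51-65", "66+"]
df["group"]   = pd.cut(df["age"], bins=bins, labels=labels, right=False)
df["abs_err"] = (df["pred_age"] - df["age"]).abs()

# Mean absolute error per age group (empty groups dropped)
per_group = df.groupby("group", observed=True)["abs_err"].mean()
print(per_group)
```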

## Architecture

```text
Face [B, 3, 224, 224]  (+ 10% proportional bbox padding)
    ↓
DINOv3-ViT-L/16  (307M params, pretrained on LVD-1.68B)
    ↓ pooler_output
[B, 1024]
    ↓ LayerNorm → Linear(1024→512) → GELU → Dropout(0.1)
[B, 512]
    ├── age_head:    Linear(512, 100) → CORAL → age ∈ [0, 100]
    └── gender_head: Linear(512, 2)   → softmax → {female, male}
```

CORAL ordinal regression: age = Ξ£ Οƒ(logit_k) for k=0..99. Exploits the ordinal structure of ages for better calibration than standard cross-entropy.
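As a sanity check on the decode rule, here is a minimal NumPy sketch (the 100-logit vector is synthetic, not real model output):

```python
import numpy as np

def coral_decode(age_logits: np.ndarray) -> float:
    """Predicted age = sum over k of sigmoid(logit_k), k = 0..99."""
    return float((1.0 / (1.0 + np.exp(-age_logits))).sum())

# Synthetic logits: the first 30 "age > k" decisions are confidently true,
# the rest confidently false -> predicted age is approximately 30
logits = np.where(np.arange(100) < 30, 10.0, -10.0)
print(round(coral_decode(logits), 2))
```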

Important: use 10% proportional padding when cropping the face bbox before inference β€” this matches the training setup and is required to reproduce MAE=3.555.

## Face crop helper (required for MAE=3.555)

Apply 10% proportional padding before passing to the model. This is critical β€” without it MAE degrades to ~3.758.

```python
import numpy as np

def crop_face(image_rgb: np.ndarray,
              x0: float, y0: float, x1: float, y1: float,
              pad: float = 0.10) -> np.ndarray:
    """Crop face bbox with proportional padding. pad=0.10 → 10% each side."""
    h, w = image_rgb.shape[:2]
    pw, ph = (x1 - x0) * pad, (y1 - y0) * pad
    x0 = max(0, int(x0 - pw));  y0 = max(0, int(y0 - ph))
    x1 = min(w, int(x1 + pw));  y1 = min(h, int(y1 + ph))
    return image_rgb[y0:y1, x0:x1]
```

## Batched inference (PyTorch — recommended for benchmarks)

```python
import torch
import numpy as np
import pandas as pd
from PIL import Image
from tqdm import tqdm
from torch.utils.data import Dataset, DataLoader
from transformers import AutoImageProcessor, AutoModel

# Limit threads — on big servers PyTorch over-subscribes cores
torch.set_num_threads(8)

BATCH_SIZE = 32   # increase if you have enough RAM
NUM_WORKERS = 8   # parallel image loading

processor = AutoImageProcessor.from_pretrained("TrungTran/faceage_ClientScan")
model = AutoModel.from_pretrained("TrungTran/faceage_ClientScan", trust_remote_code=True)
model.eval()


def crop_face(image_rgb, x0, y0, x1, y1, pad=0.10):
    h, w = image_rgb.shape[:2]
    pw, ph = (x1 - x0) * pad, (y1 - y0) * pad
    x0 = max(0, int(x0 - pw)); y0 = max(0, int(y0 - ph))
    x1 = min(w, int(x1 + pw)); y1 = min(h, int(y1 + ph))
    return image_rgb[y0:y1, x0:x1]


class FaceDataset(Dataset):
    def __init__(self, df, root, processor):
        self.df = df.reset_index(drop=True)
        self.root = root
        self.processor = processor

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        row = self.df.iloc[idx]
        img_rgb = np.array(Image.open(self.root + row.img_name).convert("RGB"))
        face = crop_face(img_rgb, row.face_x0, row.face_y0, row.face_x1, row.face_y1)
        pixel_values = self.processor(images=Image.fromarray(face),
                                      return_tensors="pt")["pixel_values"][0]
        return pixel_values, row.img_name


df = pd.read_csv("lagenda_annotation.csv")
df = df[df.age != -1].reset_index(drop=True)
ROOT = "/path/to/lagenda/"

dataset = FaceDataset(df, ROOT, processor)
loader  = DataLoader(dataset, batch_size=BATCH_SIZE, num_workers=NUM_WORKERS,
                     pin_memory=False)

results = {}
with torch.no_grad():
    for pixel_values, img_names in tqdm(loader, desc="Inference"):
        outputs = model(pixel_values=pixel_values)
        ages    = outputs.age_output.tolist()
        genders = outputs.gender_class_idx.tolist()
        for name, age, g in zip(img_names, ages, genders):
            results[name] = {"age": age, "gender": "male" if g == 1 else "female"}
```
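To score the predictions against the annotations, a small helper along these lines computes MAE and CS@5 (a hedged sketch; it assumes the `results` dict keyed by image name as built above):

```python
def evaluate(results: dict, names, true_ages):
    """MAE and CS@5 (share of predictions within 5 years) over `results`.

    `results` maps image name -> {"age": float, ...}; `names` and
    `true_ages` come from the annotation DataFrame.
    """
    errs = [abs(results[n]["age"] - a)
            for n, a in zip(names, true_ages) if n in results]
    mae = sum(errs) / len(errs)
    cs5 = sum(e <= 5 for e in errs) / len(errs)
    return mae, cs5

# e.g. mae, cs5 = evaluate(results, df.img_name, df.age)
```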

Tip: for even faster CPU inference, use the ONNX version (infer_onnx.py), which is roughly 3× faster than PyTorch on CPU.

## Usage (PyTorch, single image)

```python
import torch
import numpy as np
from PIL import Image
from transformers import AutoImageProcessor, AutoModel

processor = AutoImageProcessor.from_pretrained("TrungTran/faceage_ClientScan")
model = AutoModel.from_pretrained("TrungTran/faceage_ClientScan", trust_remote_code=True)
model.eval()

# 1. Load the full image and apply the 10% padded crop (crop_face as defined above)
img_rgb = np.array(Image.open("photo.jpg").convert("RGB"))
face    = crop_face(img_rgb, x0=120, y0=80, x1=300, y1=320)  # your bbox here

# 2. Run the model
inputs = processor(images=Image.fromarray(face), return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

age    = outputs.age_output.item()
gender = "male" if outputs.gender_class_idx.item() == 1 else "female"
conf   = outputs.gender_probs[0, outputs.gender_class_idx.item()].item()
print(f"Age: {age:.1f}  Gender: {gender} ({conf:.0%})")
```

## Usage (ONNX — no PyTorch needed)

Standalone inference script: github.com/TrungThanhTran/faceage-ClientScan β€” includes infer_onnx.py with auto-download, single image + LAGENDA benchmark modes.

```bash
git clone https://github.com/TrungThanhTran/faceage-ClientScan.git
cd faceage-ClientScan
pip install -r requirements.txt

# Single image
python infer_onnx.py --image photo.jpg --bbox 120 80 300 320

# LAGENDA MAE benchmark
python infer_onnx.py \
    --lagenda_dir   /path/to/lagenda \
    --annotation_csv lagenda_test.csv \
    --batch_size    256
```

Or use the Python API directly:

```python
from infer_onnx import FaceAgeModel, crop_face
import numpy as np
from PIL import Image

model = FaceAgeModel()   # auto-downloads ONNX from HuggingFace

img  = np.array(Image.open("photo.jpg").convert("RGB"))
face = crop_face(img, x0=120, y0=80, x1=300, y1=320)
out  = model.predict(face)
print(out)  # {'age': 34.2, 'gender': 'male', 'gender_conf': 0.981}
```

Or raw ONNX (manual):

```python
import numpy as np
import onnxruntime as ort
from PIL import Image

sess    = ort.InferenceSession("faceage_dino_fp32.onnx",
                               providers=["CPUExecutionProvider"])
in_name = sess.get_inputs()[0].name

MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
STD  = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def preprocess(img_rgb: np.ndarray) -> np.ndarray:
    """HxWx3 uint8 RGB → [1,3,224,224] float32, ImageNet normalised."""
    pil = Image.fromarray(img_rgb).resize((224, 224), Image.BICUBIC)
    arr = np.asarray(pil, dtype=np.float32) / 255.0
    arr = (arr - MEAN) / STD
    return arr.transpose(2, 0, 1)[np.newaxis]

# 1. Load image, apply 10% padded crop (crop_face as defined above)
img_rgb = np.array(Image.open("photo.jpg").convert("RGB"))
face    = crop_face(img_rgb, x0=120, y0=80, x1=300, y1=320)  # your bbox here

# 2. Run ONNX
age_logits, gender_logits = sess.run(None, {in_name: preprocess(face)})
age    = float((1 / (1 + np.exp(-age_logits[0]))).sum())   # CORAL decode
gender = "male" if gender_logits[0].argmax() == 1 else "female"
print(f"Age: {age:.1f}  Gender: {gender}")
```

## Reproducing MAE=3.555

```bash
git clone https://github.com/TrungThanhTran/faceage-ClientScan.git
cd faceage-ClientScan
pip install -r requirements.txt

python infer_onnx.py \
    --lagenda_dir   /path/to/lagenda \
    --annotation_csv lagenda_test.csv \
    --batch_size    256
```

## Training

Multi-phase fine-tuning on DINOv3-ViT-L:

| Phase | Backbone | LR | Key change |
|---|---|---|---|
| 1 | Frozen (all 24 blocks) | 1e-3 | Head training only |
| 2 | Top 4 blocks unfrozen | 1e-4 | Partial fine-tuning |
| 3 | All blocks unfrozen | 3e-5 | Full fine-tuning |
| 4 | All blocks | 3e-6 | Age-group reweighting; best epoch MAE=3.555 |

Training data: Our Collection (4M images).
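The CORAL objective behind the age head can be sketched in NumPy as binary cross-entropy over the 100 "age > k" indicators. This illustrates the standard CORAL formulation, not the repository's exact training code:

```python
import numpy as np

def coral_loss(age_logits: np.ndarray, ages: np.ndarray, num_bins: int = 100) -> float:
    """Mean BCE over binary targets t_k = 1[age > k], k = 0..99.

    age_logits: [B, 100] raw logits; ages: [B] integer ground-truth ages.
    """
    ks = np.arange(num_bins)
    targets = (ages[:, None] > ks).astype(np.float64)    # [B, 100]
    probs = 1.0 / (1.0 + np.exp(-age_logits))
    eps = 1e-12                                          # numerical safety
    bce = -(targets * np.log(probs + eps) + (1 - targets) * np.log(1 - probs + eps))
    return float(bce.mean())
```

A perfectly confident, correct logit vector drives the loss to nearly zero; flipping every decision makes it large, which is what gives the head its ordinal training signal.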

## Citation

```bibtex
@misc{faceage-clientscan-2026,
  title  = {FaceAge ClientScan: Face-Only Age \& Gender Estimation},
  author = {Trung Thanh Tran},
  year   = {2026},
  url    = {https://huggingface.co/TrungTran/faceage_ClientScan}
}
```

Related work:

- DINOv3: Meta AI, "DINOv3: Scaling Up Vision Foundation Models", 2025
- MiVOLO: Kuprashevich & Tolstykh, arXiv:2307.04616
- LAGENDA: Bhuiyan et al., 2023
- CORAL: Cao et al., Pattern Recognition Letters, 2020