# FaceAge ClientScan

**Face-only age estimation on LAGENDA 84k: MAE 3.555, beating the state-of-the-art task-specific model MiVOLO v2, which uses face + body (MAE 3.650).**

Age and gender estimation from face crops using a DINOv3-ViT-L backbone with CORAL ordinal regression.
## Performance (LAGENDA 84k benchmark)

| Model | Input | MAE ↓ | CS@5 ↑ | Gender Acc ↑ |
|---|---|---|---|---|
| FaceAge ClientScan (ours) | face-only | 3.555 | 75.5% | 97.75% |
| MiVOLO v2 (paper) | face + body | 3.650 | 74.48% | 97.99% |
| MiVOLO v1 (paper) | face + body | 3.990 | 71.27% | 97.36% |
| MiVOLO v2 (measured, face+body) | face + body | 3.859 | 76.5% | 96.96% |
| MiVOLO v2 (measured, face-only) | face only | 4.224 | 69.7% | 96.05% |
**Key result:** FaceAge ClientScan achieves MAE = 3.555 using only the face crop (no body information needed), outperforming MiVOLO v2's paper claim of 3.650, which requires both face and body bounding boxes.
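For clarity on how the table's three metrics are defined, here is a minimal sketch (function and argument names are illustrative, not part of the released code): MAE is the mean absolute age error, CS@5 is the fraction of predictions within 5 years of the label, and gender accuracy is plain classification accuracy.

```python
import numpy as np

def evaluate(pred_ages, true_ages, pred_genders, true_genders, cs_threshold=5):
    """Compute MAE, CS@k (share of predictions within k years), gender accuracy."""
    err = np.abs(np.asarray(pred_ages, dtype=float) - np.asarray(true_ages, dtype=float))
    return {
        "mae": float(err.mean()),
        "cs": float((err <= cs_threshold).mean()),   # CS@5 when cs_threshold=5
        "gender_acc": float((np.asarray(pred_genders) == np.asarray(true_genders)).mean()),
    }
```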
### Per-age-group MAE (FaceAge ClientScan vs. MiVOLO v2 best)
| Age Group | n | MiVOLO v2 best | FaceAge ClientScan | Delta |
|---|---|---|---|---|
| 0–12 | 15,369 | 1.677 | 1.548 | −0.129 |
| 13–17 | 3,930 | 3.365 | 2.845 | −0.520 |
| 18–25 | 9,975 | 2.989 | 2.877 | −0.112 |
| 26–35 | 10,303 | 3.348 | 3.775 | +0.427 |
| 36–50 | 19,234 | 4.484 | 4.195 | −0.289 |
| 51–65 | 16,350 | 4.794 | 4.329 | −0.465 |
| 66+ | 9,031 | 6.310 | 5.013 | −1.297 |
| Overall | 84,192 | 3.859 | 3.555 | −0.304 |
FaceAge ClientScan wins 6/7 age groups. The only group where MiVOLO v2 leads is 26β35, where body context likely helps.
## Architecture

```
Face [B, 3, 224, 224]  (+10% proportional bbox padding)
        ↓
DINOv3-ViT-L/16 (307M params, pretrained on LVD-1.68B)
        ↓ pooler_output
[B, 1024]
        ↓ LayerNorm → Linear(1024→512) → GELU → Dropout(0.1)
[B, 512]
  ├── age_head:    Linear(512, 100) → CORAL → age ∈ [0, 100]
  └── gender_head: Linear(512, 2)   → softmax → {female, male}
```
CORAL ordinal regression: age = Σ σ(logit_k) for k = 0..99. This exploits the ordinal structure of ages and gives better calibration than standard cross-entropy.
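The decode step in that formula can be sketched in a few lines (assuming `age_logits` is the raw 100-dimensional output of the age head):

```python
import numpy as np

def coral_decode(age_logits: np.ndarray) -> float:
    """CORAL decode: age = sum of sigmoid(logit_k) for k = 0..99.
    Each sigmoid estimates P(age > k), so the sum is the expected age."""
    return float((1.0 / (1.0 + np.exp(-age_logits))).sum())
```

Because each term contributes at most 1, the decoded age is naturally bounded to [0, 100].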
**Important:** use 10% proportional padding when cropping the face bbox before inference. This matches the training setup and is required to reproduce MAE = 3.555.
### Face crop helper (required for MAE = 3.555)

Apply 10% proportional padding before passing the crop to the model. This is critical: without it, MAE degrades to ~3.758.
```python
import numpy as np
from PIL import Image

def crop_face(image_rgb: np.ndarray,
              x0: float, y0: float, x1: float, y1: float,
              pad: float = 0.10) -> np.ndarray:
    """Crop face bbox with proportional padding. pad=0.10 → 10% each side."""
    h, w = image_rgb.shape[:2]
    pw, ph = (x1 - x0) * pad, (y1 - y0) * pad
    x0 = max(0, int(x0 - pw)); y0 = max(0, int(y0 - ph))
    x1 = min(w, int(x1 + pw)); y1 = min(h, int(y1 + ph))
    return image_rgb[y0:y1, x0:x1]
```
## Batched inference (PyTorch, recommended for benchmarks)
```python
import torch
import numpy as np
import pandas as pd
from PIL import Image
from tqdm import tqdm
from torch.utils.data import Dataset, DataLoader
from transformers import AutoImageProcessor, AutoModel

# Limit threads: on big servers PyTorch over-subscribes cores
torch.set_num_threads(8)

BATCH_SIZE = 32   # increase if you have enough RAM
NUM_WORKERS = 8   # parallel image loading

processor = AutoImageProcessor.from_pretrained("TrungTran/faceage_ClientScan")
model = AutoModel.from_pretrained("TrungTran/faceage_ClientScan", trust_remote_code=True)
model.eval()

def crop_face(image_rgb, x0, y0, x1, y1, pad=0.10):
    h, w = image_rgb.shape[:2]
    pw, ph = (x1 - x0) * pad, (y1 - y0) * pad
    x0 = max(0, int(x0 - pw)); y0 = max(0, int(y0 - ph))
    x1 = min(w, int(x1 + pw)); y1 = min(h, int(y1 + ph))
    return image_rgb[y0:y1, x0:x1]

class FaceDataset(Dataset):
    def __init__(self, df, root, processor):
        self.df = df.reset_index(drop=True)
        self.root = root
        self.processor = processor

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        row = self.df.iloc[idx]
        img_rgb = np.array(Image.open(self.root + row.img_name).convert("RGB"))
        face = crop_face(img_rgb, row.face_x0, row.face_y0, row.face_x1, row.face_y1)
        pixel_values = self.processor(images=Image.fromarray(face),
                                      return_tensors="pt")["pixel_values"][0]
        return pixel_values, row.img_name

df = pd.read_csv("lagenda_annotation.csv")
df = df[df.age != -1].reset_index(drop=True)
ROOT = "/path/to/lagenda/"

dataset = FaceDataset(df, ROOT, processor)
loader = DataLoader(dataset, batch_size=BATCH_SIZE, num_workers=NUM_WORKERS,
                    pin_memory=False)

results = {}
with torch.no_grad():
    for pixel_values, img_names in tqdm(loader, desc="Inference"):
        outputs = model(pixel_values=pixel_values)
        ages = outputs.age_output.tolist()
        genders = outputs.gender_class_idx.tolist()
        for name, age, g in zip(img_names, ages, genders):
            results[name] = {"age": age, "gender": "male" if g == 1 else "female"}
```
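With predictions collected in `results`, MAE against the annotated ages can be computed with a short helper (a sketch; the `img_name` and `age` column names follow the annotation CSV used above):

```python
import numpy as np
import pandas as pd

def mae_from_results(df: pd.DataFrame, results: dict) -> float:
    """Mean absolute error between predicted and annotated ages.
    Skips rows whose image produced no prediction."""
    errs = [abs(results[row.img_name]["age"] - row.age)
            for row in df.itertuples()
            if row.img_name in results]
    return float(np.mean(errs))
```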
**Tip:** for even faster CPU inference use the ONNX version (`infer_onnx.py`), which is ~3× faster than PyTorch on CPU.
## Usage (PyTorch, single image)

```python
import torch
import numpy as np
from PIL import Image
from transformers import AutoImageProcessor, AutoModel

processor = AutoImageProcessor.from_pretrained("TrungTran/faceage_ClientScan")
model = AutoModel.from_pretrained("TrungTran/faceage_ClientScan", trust_remote_code=True)
model.eval()

# 1. Load full image and apply 10% padded crop (crop_face defined above)
img_rgb = np.array(Image.open("photo.jpg").convert("RGB"))
face = crop_face(img_rgb, x0=120, y0=80, x1=300, y1=320)  # your bbox here

# 2. Run model
inputs = processor(images=Image.fromarray(face), return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

age = outputs.age_output.item()
gender = "male" if outputs.gender_class_idx.item() == 1 else "female"
conf = outputs.gender_probs[0, outputs.gender_class_idx.item()].item()
print(f"Age: {age:.1f}  Gender: {gender} ({conf:.0%})")
```
## Usage (ONNX, no PyTorch needed)
Standalone inference script: github.com/TrungThanhTran/faceage-ClientScan, which includes `infer_onnx.py` with auto-download and both single-image and LAGENDA benchmark modes.
```bash
git clone https://github.com/TrungThanhTran/faceage-ClientScan.git
cd faceage-ClientScan
pip install -r requirements.txt

# Single image
python infer_onnx.py --image photo.jpg --bbox 120 80 300 320

# LAGENDA MAE benchmark
python infer_onnx.py \
    --lagenda_dir /path/to/lagenda \
    --annotation_csv lagenda_test.csv \
    --batch_size 256
```
Or use the Python API directly:
```python
from infer_onnx import FaceAgeModel, crop_face
import numpy as np
from PIL import Image

model = FaceAgeModel()  # auto-downloads ONNX from HuggingFace

img = np.array(Image.open("photo.jpg").convert("RGB"))
face = crop_face(img, x0=120, y0=80, x1=300, y1=320)
out = model.predict(face)
print(out)  # {'age': 34.2, 'gender': 'male', 'gender_conf': 0.981}
```
Or raw ONNX (manual):
```python
import numpy as np
import onnxruntime as ort
from PIL import Image

sess = ort.InferenceSession("faceage_dino_fp32.onnx",
                            providers=["CPUExecutionProvider"])
in_name = sess.get_inputs()[0].name

MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def preprocess(img_rgb: np.ndarray) -> np.ndarray:
    """HxWx3 uint8 RGB → [1, 3, 224, 224] float32, ImageNet-normalised."""
    pil = Image.fromarray(img_rgb).resize((224, 224), Image.BICUBIC)
    arr = np.asarray(pil, dtype=np.float32) / 255.0
    arr = (arr - MEAN) / STD
    return arr.transpose(2, 0, 1)[np.newaxis]

# 1. Load image, apply 10% padded crop (crop_face defined above)
img_rgb = np.array(Image.open("photo.jpg").convert("RGB"))
face = crop_face(img_rgb, x0=120, y0=80, x1=300, y1=320)  # your bbox here

# 2. Run ONNX
age_logits, gender_logits = sess.run(None, {in_name: preprocess(face)})
age = float((1 / (1 + np.exp(-age_logits[0]))).sum())  # CORAL decode
gender = "male" if gender_logits[0].argmax() == 1 else "female"
print(f"Age: {age:.1f}  Gender: {gender}")
```
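The same ONNX session also accepts batches: stacking several preprocessed crops into one `[N, 3, 224, 224]` tensor amortizes per-call overhead. A minimal sketch of the stacking step (the `preprocess` function is passed in, matching the one defined above):

```python
import numpy as np

def preprocess_batch(faces, preprocess) -> np.ndarray:
    """Stack individually preprocessed [1, 3, 224, 224] crops into one batch."""
    return np.concatenate([preprocess(f) for f in faces], axis=0)

# Usage with the session from above (illustrative):
# batch = preprocess_batch(face_crops, preprocess)
# age_logits, gender_logits = sess.run(None, {in_name: batch})
```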
## Reproducing MAE = 3.555
```bash
git clone https://github.com/TrungThanhTran/faceage-ClientScan.git
cd faceage-ClientScan
pip install -r requirements.txt

python infer_onnx.py \
    --lagenda_dir /path/to/lagenda \
    --annotation_csv lagenda_test.csv \
    --batch_size 256
```
## Training
Multi-phase fine-tuning on DINOv3-ViT-L:
| Phase | Backbone | LR | Key change |
|---|---|---|---|
| 1 | Frozen (all 24 blocks) | 1e-3 | Head training only |
| 2 | Top 4 blocks unfrozen | 1e-4 | Partial fine-tuning |
| 3 | All blocks unfrozen | 3e-5 | Full fine-tuning |
| 4 | All blocks | 3e-6 | Age-group reweighting, best epoch MAE=3.555 |
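The phase schedule above amounts to toggling `requires_grad` on backbone blocks before each phase; a minimal sketch (the helper name and the way blocks are accessed are illustrative, not the released training code):

```python
def set_trainable_blocks(backbone_blocks, num_unfrozen: int) -> None:
    """Freeze all transformer blocks except the last `num_unfrozen`.
    Phase 1: num_unfrozen=0 (head only), phase 2: 4, phases 3-4: all 24."""
    n = len(backbone_blocks)
    for i, block in enumerate(backbone_blocks):
        trainable = i >= n - num_unfrozen
        for p in block.parameters():
            p.requires_grad = trainable
```

Each phase then pairs a wider unfrozen range with a ~10× smaller learning rate, a common recipe for stable fine-tuning of large pretrained backbones.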
Training data: Our Collection (4M images).
## Citation

```bibtex
@misc{faceage-clientscan-2026,
  title  = {FaceAge ClientScan: Face-Only Age \& Gender Estimation},
  author = {Trung Thanh Tran},
  year   = {2026},
  url    = {https://huggingface.co/TrungTran/faceage_ClientScan}
}
```
## Related work
- DINOv3: Meta AI, "DINOv3: Scaling Up Vision Foundation Models", 2025
- MiVOLO: Kuprashevich & Tolstykh, arXiv 2307.04616
- LAGENDA: Bhuiyan et al., 2023
- CORAL: Cao et al., Pattern Recognition Letters 2020