# FaceAge ClientScan

**Face-only age estimation on LAGENDA 84k: MAE 3.555, beating the state-of-the-art task-specific model MiVOLO v2, which uses face + body (MAE 3.650).**

Age and gender estimation from face crops using a DINOv3-ViT-L backbone with CORAL ordinal regression.
## Performance (LAGENDA 84k benchmark)

| Model | Input | MAE ↓ | CS@5 ↑ | Gender Acc ↑ |
|---|---|---|---|---|
| FaceAge ClientScan (ours) | face-only | 3.555 | 75.5% | 97.75% |
| MiVOLO v2 (paper) | face + body | 3.650 | 74.48% | 97.99% |
| MiVOLO v1 (paper) | face + body | 3.990 | 71.27% | 97.36% |
| MiVOLO v2 (measured, face+body) | face + body | 3.859 | 76.5% | 96.96% |
| MiVOLO v2 (measured, face-only) | face only | 4.224 | 69.7% | 96.05% |
**Key result:** FaceAge ClientScan achieves MAE = 3.555 using only the face crop (no body information needed), outperforming MiVOLO v2's paper claim of 3.650, which requires both face and body bounding boxes.
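For clarity on how the table's three metrics are defined, here is a minimal sketch (function and argument names are illustrative, not part of the released code): MAE is the mean absolute age error, CS@5 is the fraction of predictions within 5 years of the label, and gender accuracy is plain classification accuracy.

```python
import numpy as np

def evaluate(pred_ages, true_ages, pred_genders, true_genders, cs_threshold=5):
    """Compute MAE, CS@k (share of predictions within k years), gender accuracy."""
    err = np.abs(np.asarray(pred_ages, dtype=float) - np.asarray(true_ages, dtype=float))
    return {
        "mae": float(err.mean()),
        "cs": float((err <= cs_threshold).mean()),   # CS@5 when cs_threshold=5
        "gender_acc": float((np.asarray(pred_genders) == np.asarray(true_genders)).mean()),
    }
```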
### Per-age-group MAE (FaceAge ClientScan vs. MiVOLO v2 best)
| Age Group | n | MiVOLO v2 best | FaceAge ClientScan | Delta |
|---|---|---|---|---|
| 0–12 | 15,369 | 1.677 | 1.548 | −0.129 |
| 13–17 | 3,930 | 3.365 | 2.845 | −0.520 |
| 18–25 | 9,975 | 2.989 | 2.877 | −0.112 |
| 26–35 | 10,303 | 3.348 | 3.775 | +0.427 |
| 36–50 | 19,234 | 4.484 | 4.195 | −0.289 |
| 51–65 | 16,350 | 4.794 | 4.329 | −0.465 |
| 66+ | 9,031 | 6.310 | 5.013 | −1.297 |
| Overall | 84,192 | 3.859 | 3.555 | −0.304 |
FaceAge ClientScan wins 6/7 age groups. The only group where MiVOLO v2 leads is 26β35, where body context likely helps.
## Architecture

```
Face [B, 3, 224, 224]  (+10% proportional bbox padding)
        ↓
DINOv3-ViT-L/16 (307M params, pretrained on LVD-1.68B)
        ↓ pooler_output
[B, 1024]
        ↓ LayerNorm → Linear(1024→512) → GELU → Dropout(0.1)
[B, 512]
  ├── age_head:    Linear(512, 100) → CORAL → age ∈ [0, 100]
  └── gender_head: Linear(512, 2)   → softmax → {female, male}
```
CORAL ordinal regression: age = Σ σ(logit_k) for k = 0..99. This exploits the ordinal structure of ages and gives better calibration than standard cross-entropy.
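The decode step in that formula can be sketched in a few lines (assuming `age_logits` is the raw 100-dimensional output of the age head):

```python
import numpy as np

def coral_decode(age_logits: np.ndarray) -> float:
    """CORAL decode: age = sum of sigmoid(logit_k) for k = 0..99.
    Each sigmoid estimates P(age > k), so the sum is the expected age."""
    return float((1.0 / (1.0 + np.exp(-age_logits))).sum())
```

Because each term contributes at most 1, the decoded age is naturally bounded to [0, 100].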
**Important:** use 10% proportional padding when cropping the face bbox before inference. This matches the training setup and is required to reproduce MAE = 3.555.
### Face crop helper (required for MAE = 3.555)

Apply 10% proportional padding before passing the crop to the model. This is critical: without it, MAE degrades to ~3.758.
```python
import numpy as np
from PIL import Image

def crop_face(image_rgb: np.ndarray,
              x0: float, y0: float, x1: float, y1: float,
              pad: float = 0.10) -> np.ndarray:
    """Crop face bbox with proportional padding. pad=0.10 → 10% each side."""
    h, w = image_rgb.shape[:2]
    pw, ph = (x1 - x0) * pad, (y1 - y0) * pad
    x0 = max(0, int(x0 - pw)); y0 = max(0, int(y0 - ph))
    x1 = min(w, int(x1 + pw)); y1 = min(h, int(y1 + ph))
    return image_rgb[y0:y1, x0:x1]
```
## Batched inference (PyTorch, recommended for benchmarks)
```python
import torch
import numpy as np
import pandas as pd
from PIL import Image
from tqdm import tqdm
from torch.utils.data import Dataset, DataLoader
from transformers import AutoImageProcessor, AutoModel

# Limit threads: on big servers PyTorch over-subscribes cores
torch.set_num_threads(8)

BATCH_SIZE = 32   # increase if you have enough RAM
NUM_WORKERS = 8   # parallel image loading

processor = AutoImageProcessor.from_pretrained("TrungTran/faceage_ClientScan")
model = AutoModel.from_pretrained("TrungTran/faceage_ClientScan", trust_remote_code=True)
model.eval()

def crop_face(image_rgb, x0, y0, x1, y1, pad=0.10):
    h, w = image_rgb.shape[:2]
    pw, ph = (x1 - x0) * pad, (y1 - y0) * pad
    x0 = max(0, int(x0 - pw)); y0 = max(0, int(y0 - ph))
    x1 = min(w, int(x1 + pw)); y1 = min(h, int(y1 + ph))
    return image_rgb[y0:y1, x0:x1]

class FaceDataset(Dataset):
    def __init__(self, df, root, processor):
        self.df = df.reset_index(drop=True)
        self.root = root
        self.processor = processor

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        row = self.df.iloc[idx]
        img_rgb = np.array(Image.open(self.root + row.img_name).convert("RGB"))
        face = crop_face(img_rgb, row.face_x0, row.face_y0, row.face_x1, row.face_y1)
        pixel_values = self.processor(images=Image.fromarray(face),
                                      return_tensors="pt")["pixel_values"][0]
        return pixel_values, row.img_name

df = pd.read_csv("lagenda_annotation.csv")
df = df[df.age != -1].reset_index(drop=True)
ROOT = "/path/to/lagenda/"

dataset = FaceDataset(df, ROOT, processor)
loader = DataLoader(dataset, batch_size=BATCH_SIZE, num_workers=NUM_WORKERS,
                    pin_memory=False)

results = {}
with torch.no_grad():
    for pixel_values, img_names in tqdm(loader, desc="Inference"):
        outputs = model(pixel_values=pixel_values)
        ages = outputs.age_output.tolist()
        genders = outputs.gender_class_idx.tolist()
        for name, age, g in zip(img_names, ages, genders):
            results[name] = {"age": age, "gender": "male" if g == 1 else "female"}
```
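With predictions collected in `results`, MAE against the annotated ages can be computed with a short helper (a sketch; the `img_name` and `age` column names follow the annotation CSV used above):

```python
import numpy as np
import pandas as pd

def mae_from_results(df: pd.DataFrame, results: dict) -> float:
    """Mean absolute error between predicted and annotated ages.
    Skips rows whose image produced no prediction."""
    errs = [abs(results[row.img_name]["age"] - row.age)
            for row in df.itertuples()
            if row.img_name in results]
    return float(np.mean(errs))
```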
**Tip:** for even faster CPU inference use the ONNX version (`infer_onnx.py`), which is ~3× faster than PyTorch on CPU.
## Usage (PyTorch, single image)

```python
import torch
import numpy as np
from PIL import Image
from transformers import AutoImageProcessor, AutoModel

processor = AutoImageProcessor.from_pretrained("TrungTran/faceage_ClientScan")
model = AutoModel.from_pretrained("TrungTran/faceage_ClientScan", trust_remote_code=True)
model.eval()

# 1. Load full image and apply 10% padded crop (crop_face defined above)
img_rgb = np.array(Image.open("photo.jpg").convert("RGB"))
face = crop_face(img_rgb, x0=120, y0=80, x1=300, y1=320)  # your bbox here

# 2. Run model
inputs = processor(images=Image.fromarray(face), return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

age = outputs.age_output.item()
gender = "male" if outputs.gender_class_idx.item() == 1 else "female"
conf = outputs.gender_probs[0, outputs.gender_class_idx.item()].item()
print(f"Age: {age:.1f}  Gender: {gender} ({conf:.0%})")
```
## Usage (ONNX, no PyTorch needed)
Standalone inference script: github.com/TrungThanhTran/faceage-ClientScan, which includes `infer_onnx.py` with auto-download and both single-image and LAGENDA benchmark modes.
```bash
git clone https://github.com/TrungThanhTran/faceage-ClientScan.git
cd faceage-ClientScan
pip install -r requirements.txt

# Single image
python infer_onnx.py --image photo.jpg --bbox 120 80 300 320

# LAGENDA MAE benchmark
python infer_onnx.py \
    --lagenda_dir /path/to/lagenda \
    --annotation_csv lagenda_test.csv \
    --batch_size 256
```
Or use the Python API directly:
```python
from infer_onnx import FaceAgeModel, crop_face
import numpy as np
from PIL import Image

model = FaceAgeModel()  # auto-downloads ONNX from HuggingFace

img = np.array(Image.open("photo.jpg").convert("RGB"))
face = crop_face(img, x0=120, y0=80, x1=300, y1=320)
out = model.predict(face)
print(out)  # {'age': 34.2, 'gender': 'male', 'gender_conf': 0.981}
```
Or raw ONNX (manual):
```python
import numpy as np
import onnxruntime as ort
from PIL import Image

sess = ort.InferenceSession("faceage_dino_fp32.onnx",
                            providers=["CPUExecutionProvider"])
in_name = sess.get_inputs()[0].name

MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def preprocess(img_rgb: np.ndarray) -> np.ndarray:
    """HxWx3 uint8 RGB → [1, 3, 224, 224] float32, ImageNet-normalised."""
    pil = Image.fromarray(img_rgb).resize((224, 224), Image.BICUBIC)
    arr = np.asarray(pil, dtype=np.float32) / 255.0
    arr = (arr - MEAN) / STD
    return arr.transpose(2, 0, 1)[np.newaxis]

# 1. Load image, apply 10% padded crop (crop_face defined above)
img_rgb = np.array(Image.open("photo.jpg").convert("RGB"))
face = crop_face(img_rgb, x0=120, y0=80, x1=300, y1=320)  # your bbox here

# 2. Run ONNX
age_logits, gender_logits = sess.run(None, {in_name: preprocess(face)})
age = float((1 / (1 + np.exp(-age_logits[0]))).sum())  # CORAL decode
gender = "male" if gender_logits[0].argmax() == 1 else "female"
print(f"Age: {age:.1f}  Gender: {gender}")
```
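The same ONNX session also accepts batches: stacking several preprocessed crops into one `[N, 3, 224, 224]` tensor amortizes per-call overhead. A minimal sketch of the stacking step (the `preprocess` function is passed in, matching the one defined above):

```python
import numpy as np

def preprocess_batch(faces, preprocess) -> np.ndarray:
    """Stack individually preprocessed [1, 3, 224, 224] crops into one batch."""
    return np.concatenate([preprocess(f) for f in faces], axis=0)

# Usage with the session from above (illustrative):
# batch = preprocess_batch(face_crops, preprocess)
# age_logits, gender_logits = sess.run(None, {in_name: batch})
```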
## Reproducing MAE = 3.555
```bash
git clone https://github.com/TrungThanhTran/faceage-ClientScan.git
cd faceage-ClientScan
pip install -r requirements.txt

python infer_onnx.py \
    --lagenda_dir /path/to/lagenda \
    --annotation_csv lagenda_test.csv \
    --batch_size 256
```
## Training
Multi-phase fine-tuning on DINOv3-ViT-L:
| Phase | Backbone | LR | Key change |
|---|---|---|---|
| 1 | Frozen (all 24 blocks) | 1e-3 | Head training only |
| 2 | Top 4 blocks unfrozen | 1e-4 | Partial fine-tuning |
| 3 | All blocks unfrozen | 3e-5 | Full fine-tuning |
| 4 | All blocks | 3e-6 | Age-group reweighting, best epoch MAE=3.555 |
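The phase schedule above amounts to toggling `requires_grad` on backbone blocks before each phase; a minimal sketch (the helper name and the way blocks are accessed are illustrative, not the released training code):

```python
def set_trainable_blocks(backbone_blocks, num_unfrozen: int) -> None:
    """Freeze all transformer blocks except the last `num_unfrozen`.
    Phase 1: num_unfrozen=0 (head only), phase 2: 4, phases 3-4: all 24."""
    n = len(backbone_blocks)
    for i, block in enumerate(backbone_blocks):
        trainable = i >= n - num_unfrozen
        for p in block.parameters():
            p.requires_grad = trainable
```

Each phase then pairs a wider unfrozen range with a ~10× smaller learning rate, a common recipe for stable fine-tuning of large pretrained backbones.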
Training data: Our Collection (4M images).
## Citation

```bibtex
@misc{faceage-clientscan-2026,
  title  = {FaceAge ClientScan: Face-Only Age \& Gender Estimation},
  author = {Trung Thanh Tran},
  year   = {2026},
  url    = {https://huggingface.co/TrungTran/faceage_ClientScan}
}
```
## Related work
- DINOv3: Meta AI, "DINOv3: Scaling Up Vision Foundation Models", 2025
- MiVOLO: Kuprashevich & Tolstykh, arXiv 2307.04616
- LAGENDA: Bhuiyan et al., 2023
- CORAL: Cao et al., Pattern Recognition Letters 2020