LIPE V2: Landmark-guided Image Patch Embedder

LIPE V2 is an ultra-lightweight gaze estimation framework designed for commodity hardware such as edge CPUs. It uses an Asymmetric Dual-State Pipeline to reach state-of-the-art accuracy at a cost of only 0.61M parameters and 21.44 MFLOPs.

Key Features

  • Ultra-Lightweight: Only 0.61M parameters and 21.44 MFLOPs.
  • High Efficiency: Achieves 30 FPS on a single mobile CPU thread.
  • Asymmetric Architecture: Toggles between Active Appearance ($\mathcal{S}_A$) and Structural Resilience ($\mathcal{S}_B$) states.
  • Robust Generalization: 4.73° MAE on MPIIGaze and 7.33° on Gaze360 (Zero-shot).
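
The accuracy figures above are angular errors in degrees. As a minimal sketch of how such a number is typically computed, assuming the reported errors are the angle between predicted and ground-truth 3D gaze directions, the helpers below convert pitch/yaw to a unit vector and average the per-sample angle. The function names and the pitch/yaw axis convention are illustrative assumptions, not taken from the LIPE V2 code base.

```python
import math


def pitch_yaw_to_vector(pitch, yaw):
    """Convert gaze angles (radians) to a unit 3D direction vector.

    Assumed convention: zero pitch and yaw looks along -z.
    """
    x = -math.cos(pitch) * math.sin(yaw)
    y = -math.sin(pitch)
    z = -math.cos(pitch) * math.cos(yaw)
    return (x, y, z)


def angular_error_deg(pred, target):
    """Angle in degrees between two 3D gaze vectors."""
    dot = sum(p * t for p, t in zip(pred, target))
    norm = math.sqrt(sum(p * p for p in pred)) * math.sqrt(sum(t * t for t in target))
    # Clamp to [-1, 1] so rounding noise cannot push acos out of its domain
    cos_sim = max(-1.0, min(1.0, dot / norm))
    return math.degrees(math.acos(cos_sim))


def mean_angular_error_deg(preds, targets):
    """Mean angular error over paired lists of 3D gaze vectors."""
    errors = [angular_error_deg(p, t) for p, t in zip(preds, targets)]
    return sum(errors) / len(errors)
```

With this convention, a prediction that is off by 5° of yaw from a straight-ahead target gives an angular error of 5°.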

Technical Specifications

| Metric | Value |
| --- | --- |
| Parameters | 0.61 M |
| FLOPs | 21.44 M |
| Memory footprint | < 45 MB RAM |
| Latency (CPU) | ~12 ms (State A), ~3 ms (State B) |
| Accuracy (MPIIGaze) | 4.73° MAE |
| Accuracy (Gaze360) | 7.33° (3D angular error) |
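
The specifications above can be turned into rough deployment estimates with a few lines of arithmetic. The sketch below assumes 32-bit float weights and treats the two state latencies as model-only costs. The helper names and the `fraction_in_state_a` parameter are hypothetical, since how often the pipeline stays in each state depends on the input stream.

```python
PARAMS = 0.61e6                        # parameter count from the specifications
LATENCY_MS = {"A": 12.0, "B": 3.0}     # approximate per-frame CPU latency per state


def weight_size_mb(num_params, bytes_per_param=4):
    """Size of the weights in MB, assuming float32 storage by default."""
    return num_params * bytes_per_param / 1e6


def max_fps(latency_ms):
    """Upper bound on frames per second if the model were the only cost."""
    return 1000.0 / latency_ms


def mean_latency_ms(fraction_in_state_a):
    """Average per-frame latency for a given share of frames spent in State A."""
    a = fraction_in_state_a
    return a * LATENCY_MS["A"] + (1.0 - a) * LATENCY_MS["B"]


if __name__ == "__main__":
    print(f"weights: {weight_size_mb(PARAMS):.2f} MB")
    print(f"State A ceiling: {max_fps(LATENCY_MS['A']):.1f} FPS")
    print(f"State B ceiling: {max_fps(LATENCY_MS['B']):.1f} FPS")
    print(f"mean latency at 50% State A: {mean_latency_ms(0.5):.1f} ms")
```

Under these assumptions the weights take about 2.44 MB, and the model-only throughput ceilings are roughly 83 FPS in State A and 333 FPS in State B. These ceilings exclude any face detection, landmark extraction, or pre-processing that runs alongside the model.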

Model Architecture

The framework consists of a Dual-Pooling Mini-Conv embedder for image patches and a Geometric MLP for topological facial landmarks. Feature fusion is stabilized via Bimodal Distribution Resilience and Terminal Anchoring (SWA).
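
To make the two-branch idea concrete, here is a toy sketch in plain Python. It assumes that "dual pooling" means concatenating global average and global max pooling per channel, and that fusion is a simple concatenation of the patch and landmark embeddings. Both are illustrative assumptions, as are the function names; the actual LIPE V2 layers may differ, and a real implementation would use tensor operations rather than nested lists.

```python
def dual_pool(feature_map):
    """Global average and global max pooling per channel, concatenated.

    feature_map: list of channels, each channel a 2D list (H x W) of floats.
    Returns a flat list of length 2 * num_channels.
    """
    avg_part, max_part = [], []
    for channel in feature_map:
        values = [v for row in channel for v in row]
        avg_part.append(sum(values) / len(values))
        max_part.append(max(values))
    return avg_part + max_part


def fuse(patch_embedding, landmark_embedding):
    """Fuse the two branches by concatenation (assumed fusion scheme)."""
    return list(patch_embedding) + list(landmark_embedding)


if __name__ == "__main__":
    # Two 2x2 channels standing in for the output of a small conv stack
    feature_map = [
        [[1.0, 2.0], [3.0, 4.0]],
        [[0.0, 0.0], [0.0, 8.0]],
    ]
    landmark_embedding = [0.1, 0.2]  # stand-in for the geometric MLP output
    print(fuse(dual_pool(feature_map), landmark_embedding))
```

Average pooling keeps a summary of the whole patch while max pooling keeps the strongest local response, which is one common motivation for combining the two.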

Usage

To use this model, clone the repository and install the dependencies:

pip install -r requirements.txt

Loading the pre-trained weights for inference:

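import torch

from src.models.student import LIPEV2StudentGaze360Gold

model = LIPEV2StudentGaze360Gold()
# Load pre-trained weights onto the CPU, regardless of where they were saved
checkpoint = torch.load("checkpoints/swa_gold_p11.pt", map_location="cpu")
model.load_state_dict(checkpoint)
model.eval()  # inference mode: disables dropout and batch-norm updates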

Citation

If you find LIPE V2 useful for your research, please cite our work.
