---
license: mit
tags:
- face-swap
- face-enhancement
- face-detection
- face-parsing
- face-mask
- face-segmentation
- person-detection
- tensorrt
- deep-learning
- computer-vision
- morphstream
---

# MorphStream Models

Models and TensorRT engine cache for real-time face processing, used by the [MorphStream](https://morphstream.ai) GPU Worker.

**Private repository** — requires an access token for downloads.

## Structure

```
/
├── inswapper_128.onnx           # Standard face swap (529MB)
├── inswapper_128_fp16.onnx      # FP16 optimized - default (265MB)
├── hyperswap_1a_256.onnx        # HyperSwap variant A (384MB)
├── hyperswap_1b_256.onnx        # HyperSwap variant B (384MB)
├── hyperswap_1c_256.onnx        # HyperSwap variant C (384MB)
├── yolov8n.onnx                 # Person detection (12MB)
├── dfl_xseg.onnx                # XSeg v1 face segmentation — legacy (67MB)
├── xseg_1.onnx                  # XSeg occlusion model 1 (67MB)
├── xseg_2.onnx                  # XSeg occlusion model 2 (67MB)
├── xseg_3.onnx                  # XSeg occlusion model 3 (67MB)
├── 2dfan4.onnx                  # 68-point face landmarks (93MB)
├── bisenet_resnet_34.onnx       # BiSeNet face parsing ResNet-34 (89MB)
├── bisenet_resnet_18.onnx       # BiSeNet face parsing ResNet-18 (51MB)
├── buffalo_l/                   # Direct ONNX face analysis models
│   ├── det_10g.onnx             # SCRFD face detection FP32 (16MB)
│   ├── det_10g_fp16.onnx        # SCRFD face detection FP16 (8.1MB)
│   ├── w600k_r50.onnx           # ArcFace recognition embeddings (166MB)
│   ├── 1k3d68.onnx              # 3D landmarks, 68 points (137MB)
│   ├── 2d106det.onnx            # 2D landmarks, 106 points (4.8MB)
│   └── genderage.onnx           # Gender/age estimation (1.3MB)
├── gfpgan/                      # Face enhancement (not used in real-time)
│   ├── GFPGANv1.4.pth
│   └── weights/
│       ├── detection_Resnet50_Final.pth
│       └── parsing_parsenet.pth
├── trt_cache/                   # Pre-compiled TensorRT engines
│   ├── sm89/trt10.9_ort1.24/    # RTX 4090
│   ├── sm86/trt10.9_ort1.24/    # RTX 3090
│   └── ...                      # Other GPU arch + version combos
└── scripts/
    └── convert_scrfd_fp16.py    # FP32 → FP16 conversion utility
```

## Face Swap Models

| Model | Description | Size | Input | Format |
|-------|-------------|------|-------|--------|
| `inswapper_128.onnx` | Standard quality | 529 MB | 128px | ONNX FP32 |
| `inswapper_128_fp16.onnx` | FP16 optimized (**default**) | 265 MB | 128px | ONNX FP16 |
| `hyperswap_1a_256.onnx` | High quality — variant A | 384 MB | 256px | ONNX FP32 |
| `hyperswap_1b_256.onnx` | High quality — variant B | 384 MB | 256px | ONNX FP32 |
| `hyperswap_1c_256.onnx` | High quality — variant C | 384 MB | 256px | ONNX FP32 |

## Face Analysis (buffalo_l)

Models originally from the [InsightFace](https://github.com/deepinsight/insightface) buffalo_l pack. The GPU Worker loads them directly via ONNX Runtime (`DirectSCRFD`, `DirectArcFace`, `DirectLandmark106`) without the InsightFace Python library.

| Model | GPU Worker Class | Description | Size |
|-------|-----------------|-------------|------|
| `det_10g.onnx` | `DirectSCRFD` | SCRFD face detection (FP32) | 16 MB |
| `det_10g_fp16.onnx` | `DirectSCRFD` | SCRFD face detection (FP16, ~2x faster on Tensor Cores) | 8.1 MB |
| `w600k_r50.onnx` | `DirectArcFace` | ArcFace R50 face recognition embeddings | 166 MB |
| `2d106det.onnx` | `DirectLandmark106` | 2D face landmarks (106 points), CLAHE + face angle rotation. Used in the face detection pipeline; 106-pt landmarks serve as a fallback for masking when 68-pt landmarks are unavailable | 4.8 MB |
| `1k3d68.onnx` | — | 3D face landmarks (68 points) — not used at runtime | 137 MB |
| `genderage.onnx` | — | Gender and age estimation — not used at runtime | 1.3 MB |

## Face Landmarks

| Model | Description | Size | Input |
|-------|-------------|------|-------|
| `2dfan4.onnx` | 2DFAN4 — 68-point face landmarks | 93 MB | 256px |

FaceFusion-style 5/68 refinement: SCRFD detects the face plus 5 coarse keypoints, then 2DFAN4 produces 68 precise landmarks, which are converted to 5 alignment points (eye centers from 6 points each, exact nose tip, exact mouth corners). This improves face alignment quality for the swap models.

**Primary landmark model for face masking**: 68-pt landmarks from 2DFAN4 are the preferred source for `custom_paste_back` compositing (hull, cutouts, mouth blend). 106-pt landmarks from `2d106det.onnx` serve as the fallback. Dual-landmark support: `has_valid_68` preferred, `has_valid_106` fallback, with the `use_68` flag propagated through all mask functions.

Landmarks are temporally smoothed via a One Euro Filter in `LandmarkSmoother` (attribute `face.landmark_2d_68`).

Source: [FaceFusion assets](https://github.com/facefusion/facefusion-assets).

## Person Detection

| Model | Description | Size | Input |
|-------|-------------|------|-------|
| `yolov8n.onnx` | YOLOv8n — person detection (COCO class 0) | 12 MB | 640px |

Used to distinguish "person left frame" from "face occluded" during face swap.

## Face Mask Models (FaceFusion 4-Mask System)

Occlusion detection (XSeg) and semantic face parsing (BiSeNet) models for the composable mask pipeline. Used in the GPU Worker's `face_masker.py` for box/occlusion/area/region masks.

Source: [FaceFusion 3.x assets](https://github.com/facefusion/facefusion-assets) (Apache-2.0), mirrored here for reliability.
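The composable step can be sketched as a per-pixel intersection of all active masks. This is a minimal illustration, not the actual `face_masker.py` API — the function name and signature are hypothetical; the only assumptions taken from this card are that masks are float arrays in [0, 1] and that intersection (per-pixel minimum) is the most conservative combination:

```python
import numpy as np

def combine_masks(box_mask, occlusion_masks, region_mask):
    """Per-pixel intersection of all masks (float32 arrays in [0, 1]):
    the smallest, most conservative value wins at every pixel, so any
    single mask can only shrink the final blend region, never grow it."""
    combined = box_mask.astype(np.float32)
    for mask in occlusion_masks:  # e.g. outputs of xseg_1/2/3 in "many" mode
        combined = np.minimum(combined, mask.astype(np.float32))
    combined = np.minimum(combined, region_mask.astype(np.float32))
    return np.clip(combined, 0.0, 1.0)
```

Intersection semantics mean a pixel is blended into the output only if every selected mask allows it.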
### XSeg — Occlusion Detection

| Model | Description | Size | Input | Output |
|-------|-------------|------|-------|--------|
| `dfl_xseg.onnx` | XSeg v1 — legacy binary face mask (not used) | 67 MB | 256px | binary (face/bg) |
| `xseg_1.onnx` | XSeg model 1 — occlusion detection | 67 MB | 256px | binary (face/bg) |
| `xseg_2.onnx` | XSeg model 2 — occlusion detection | 67 MB | 256px | binary (face/bg) |
| `xseg_3.onnx` | XSeg model 3 — occlusion detection | 67 MB | 256px | binary (face/bg) |

Runtime model selection via IPC: `many` (all 3 intersected), `xseg_1`, `xseg_2`, `xseg_3`. Input: NHWC float32 [0,1]. Output: intersection of all selected model masks (most conservative).

### BiSeNet — Region Segmentation

| Model | Description | Size | Input | Classes |
|-------|-------------|------|-------|---------|
| `bisenet_resnet_34.onnx` | BiSeNet ResNet-34 (**default**) | 89 MB | 512px | 19 regions |
| `bisenet_resnet_18.onnx` | BiSeNet ResNet-18 (lighter) | 51 MB | 512px | 19 regions |

Runtime model selection via IPC. Input: NCHW float32, ImageNet-normalized. 10 configurable face regions: skin, left-eyebrow, right-eyebrow, left-eye, right-eye, glasses, upper-lip, nose, lower-lip, mouth.

## TensorRT Engine Cache

Pre-compiled TensorRT engines stored in the `trt_cache/` subfolder, keyed by GPU architecture and software versions. Eliminates cold-start TRT compilation (~180-300s) on new GPU instances.

### Layout

```
trt_cache/
├── sm89/trt10.9_ort1.24/                    # RTX 4090 (Ada Lovelace)
│   ├── manifest.json                        # Metadata: cache_key, engine list, timestamps
│   ├── TensorrtExecutionProvider_*.engine   # Compiled TRT engines
│   ├── TensorrtExecutionProvider_*.profile  # Profiling data
│   └── timing.cache                         # cuDNN/TRT timing optimization cache
├── sm86/trt10.9_ort1.24/                    # RTX 3090 (Ampere)
│   └── ...
└── sm80/trt10.9_ort1.24/                    # A100 (Ampere)
    └── ...
```

### Cache Key

Format: `{gpu_arch}/trt{trt_version}_ort{ort_version}`

| Component | Example | Source |
|-----------|---------|--------|
| `gpu_arch` | `sm89` | `nvidia-smi --query-gpu=compute_cap` → `8.9` → `sm89` |
| `trt_version` | `10.9` | `tensorrt.__version__` major.minor |
| `ort_version` | `1.24` | `onnxruntime.__version__` major.minor |

### Lifecycle

1. **Download** — at container boot, the GPU Worker checks HF for a matching cache key. If found, it downloads all engines (~10-30s vs ~180-300s compile).
2. **Compile** — if no cache exists on HF, ONNX Runtime compiles TRT engines from scratch on first model load.
3. **Self-seed upload** — after compilation, engines are uploaded to HF so future instances skip compilation.
4. **Incremental upload** — if engines were downloaded from HF but new models were compiled locally afterward (e.g., YOLOv8n during warmup), only the new engines are uploaded.

### manifest.json

```json
{
  "cache_key": "sm89/trt10.9_ort1.24",
  "gpu_arch": "sm89",
  "trt_version": "10.9",
  "ort_version": "1.24",
  "created_at": "2025-03-07T12:00:00Z",
  "machine_id": "C.12345",
  "engine_files": [
    "TensorrtExecutionProvider_model_hash.engine",
    "TensorrtExecutionProvider_model_hash.profile",
    "timing.cache"
  ]
}
```

The manifest serves as both metadata and an upload gate — its presence signals that the cache was downloaded, and the `engine_files` list enables incremental upload detection.

## GFPGAN (optional, not used in real-time)

Face restoration and enhancement. Too slow for real-time streaming (~50-150ms per frame).

| Model | Description | Size |
|-------|-------------|------|
| `gfpgan/GFPGANv1.4.pth` | GFPGAN v1.4 restoration | 332 MB |
| `gfpgan/weights/detection_Resnet50_Final.pth` | RetinaFace detector | 104 MB |
| `gfpgan/weights/parsing_parsenet.pth` | ParseNet segmentation | 81 MB |

## Usage

### GPU Worker (production)

Models are baked into the Docker image at build time (buffalo_l + default swap + landmark + mask models).
Alternative swap models (HyperSwap) are downloaded on demand by `ModelDownloadService`. The TRT engine cache is downloaded asynchronously at boot via `trt_cache.py` (non-blocking — `/health` responds immediately).

```bash
# Manual download (local development)
HF_TOKEN=hf_xxx ./scripts/download_models.sh /models
```

### Docker build

```bash
docker build --build-arg HF_TOKEN=hf_xxx -t morphstream-gpu-worker .
```

### Python (huggingface_hub)

```python
from huggingface_hub import hf_hub_download

model_path = hf_hub_download(
    repo_id="latark/MorphStream",
    filename="inswapper_128_fp16.onnx",
    token="hf_xxx",
)
```

## Scripts

### convert_scrfd_fp16.py

Converts SCRFD `det_10g.onnx` from FP32 to FP16:

```bash
pip install onnx onnxconverter-common
python scripts/convert_scrfd_fp16.py \
    --input buffalo_l/det_10g.onnx \
    --output buffalo_l/det_10g_fp16.onnx
```

Key detail: `op_block_list=['BatchNormalization']` keeps BatchNormalization nodes in FP32, preventing epsilon underflow (1e-5 rounds to 0 in FP16, producing NaN).

## License

MIT License