---
license: mit
tags:
- face-swap
- face-enhancement
- face-detection
- face-parsing
- face-mask
- face-segmentation
- person-detection
- tensorrt
- deep-learning
- computer-vision
- morphstream
---

# MorphStream Models

Models and TensorRT engine cache for real-time face processing, used by the [MorphStream](https://morphstream.ai) GPU Worker.

**Private repository** — requires an access token for downloads.

## Structure

```
/
├── inswapper_128.onnx           # Standard face swap (529MB)
├── inswapper_128_fp16.onnx      # FP16 optimized - default (265MB)
├── hyperswap_1a_256.onnx        # HyperSwap variant A (384MB)
├── hyperswap_1b_256.onnx        # HyperSwap variant B (384MB)
├── hyperswap_1c_256.onnx        # HyperSwap variant C (384MB)
├── yolov8n.onnx                 # Person detection (12MB)
├── dfl_xseg.onnx                # XSeg v1 face segmentation — legacy (67MB)
├── xseg_1.onnx                  # XSeg occlusion model 1 (67MB)
├── xseg_2.onnx                  # XSeg occlusion model 2 (67MB)
├── xseg_3.onnx                  # XSeg occlusion model 3 (67MB)
├── 2dfan4.onnx                  # 68-point face landmarks (93MB)
├── bisenet_resnet_34.onnx       # BiSeNet face parsing ResNet-34 (89MB)
├── bisenet_resnet_18.onnx       # BiSeNet face parsing ResNet-18 (51MB)
├── buffalo_l/                   # Direct ONNX face analysis models
│   ├── det_10g.onnx             # SCRFD face detection FP32 (16MB)
│   ├── det_10g_fp16.onnx        # SCRFD face detection FP16 (8.1MB)
│   ├── w600k_r50.onnx           # ArcFace recognition embeddings (166MB)
│   ├── 1k3d68.onnx              # 3D landmarks, 68 points (137MB)
│   ├── 2d106det.onnx            # 2D landmarks, 106 points (4.8MB)
│   └── genderage.onnx           # Gender/age estimation (1.3MB)
├── gfpgan/                      # Face enhancement (not used in real-time)
│   ├── GFPGANv1.4.pth
│   └── weights/
│       ├── detection_Resnet50_Final.pth
│       └── parsing_parsenet.pth
├── trt_cache/                   # Pre-compiled TensorRT engines
│   ├── sm89/trt10.9_ort1.24/    # RTX 4090
│   ├── sm86/trt10.9_ort1.24/    # RTX 3090
│   └── ...                      # Other GPU arch + version combos
└── scripts/
    └── convert_scrfd_fp16.py    # FP32 → FP16 conversion utility
```

## Face Swap Models

| Model | Description | Size | Input | Format |
|-------|-------------|------|-------|--------|
| `inswapper_128.onnx` | Standard quality | 529 MB | 128px | ONNX FP32 |
| `inswapper_128_fp16.onnx` | FP16 optimized (**default**) | 265 MB | 128px | ONNX FP16 |
| `hyperswap_1a_256.onnx` | High quality — variant A | 384 MB | 256px | ONNX FP32 |
| `hyperswap_1b_256.onnx` | High quality — variant B | 384 MB | 256px | ONNX FP32 |
| `hyperswap_1c_256.onnx` | High quality — variant C | 384 MB | 256px | ONNX FP32 |

## Face Analysis (buffalo_l)

Models originally from the [InsightFace](https://github.com/deepinsight/insightface) buffalo_l pack. The GPU Worker loads them directly via ONNX Runtime (`DirectSCRFD`, `DirectArcFace`, `DirectLandmark106`) without the InsightFace Python library.

| Model | GPU Worker Class | Description | Size |
|-------|-----------------|-------------|------|
| `det_10g.onnx` | `DirectSCRFD` | SCRFD face detection (FP32) | 16 MB |
| `det_10g_fp16.onnx` | `DirectSCRFD` | SCRFD face detection (FP16, ~2x faster on Tensor Cores) | 8.1 MB |
| `w600k_r50.onnx` | `DirectArcFace` | ArcFace R50 face recognition embeddings | 166 MB |
| `2d106det.onnx` | `DirectLandmark106` | 2D face landmarks (106 points), CLAHE + face angle rotation. Used in the face detection pipeline; 106-pt landmarks serve as a fallback for masking when 68-pt landmarks are unavailable | 4.8 MB |
| `1k3d68.onnx` | — | 3D face landmarks (68 points) — not used at runtime | 137 MB |
| `genderage.onnx` | — | Gender and age estimation — not used at runtime | 1.3 MB |

## Face Landmarks

| Model | Description | Size | Input |
|-------|-------------|------|-------|
| `2dfan4.onnx` | 2DFAN4 — 68-point face landmarks | 93 MB | 256px |

FaceFusion-style 5/68 refinement: SCRFD detects the face plus 5 coarse keypoints, then 2DFAN4 produces 68 precise landmarks, which are converted to 5 alignment points (eye centers from 6 points each, exact nose tip, exact mouth corners). This improves face alignment quality for the swap models.

**Primary landmark model for face masking**: 68-pt landmarks from 2DFAN4 are the preferred source for `custom_paste_back` compositing (hull, cutouts, mouth blend). 106-pt landmarks from `2d106det.onnx` serve as the fallback. Dual-landmark support: `has_valid_68` preferred, `has_valid_106` fallback, with the `use_68` flag propagated through all mask functions.

Landmarks are temporally smoothed via a One Euro Filter in `LandmarkSmoother` (attribute `face.landmark_2d_68`).

Source: [FaceFusion assets](https://github.com/facefusion/facefusion-assets).

## Person Detection

| Model | Description | Size | Input |
|-------|-------------|------|-------|
| `yolov8n.onnx` | YOLOv8n — person detection (COCO class 0) | 12 MB | 640px |

Used to distinguish "person left frame" from "face occluded" during face swap.

## Face Mask Models (FaceFusion 4-Mask System)

Occlusion detection (XSeg) and semantic face parsing (BiSeNet) models for the composable mask pipeline. Used in the GPU Worker's `face_masker.py` for box/occlusion/area/region masks.

Source: [FaceFusion 3.x assets](https://github.com/facefusion/facefusion-assets) (Apache-2.0), mirrored here for reliability.
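The composable step can be sketched as a per-pixel intersection of all active masks. This is a minimal illustration, not the actual `face_masker.py` API — the function name and signature are hypothetical; the only assumptions taken from this card are that masks are float arrays in [0, 1] and that intersection (per-pixel minimum) is the most conservative combination:

```python
import numpy as np

def combine_masks(box_mask, occlusion_masks, region_mask):
    """Per-pixel intersection of all masks (float32 arrays in [0, 1]):
    the smallest, most conservative value wins at every pixel, so any
    single mask can only shrink the final blend region, never grow it."""
    combined = box_mask.astype(np.float32)
    for mask in occlusion_masks:  # e.g. outputs of xseg_1/2/3 in "many" mode
        combined = np.minimum(combined, mask.astype(np.float32))
    combined = np.minimum(combined, region_mask.astype(np.float32))
    return np.clip(combined, 0.0, 1.0)
```

Intersection semantics mean a pixel is blended into the output only if every selected mask allows it.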
### XSeg — Occlusion Detection

| Model | Description | Size | Input | Output |
|-------|-------------|------|-------|--------|
| `dfl_xseg.onnx` | XSeg v1 — legacy binary face mask (not used) | 67 MB | 256px | binary (face/bg) |
| `xseg_1.onnx` | XSeg model 1 — occlusion detection | 67 MB | 256px | binary (face/bg) |
| `xseg_2.onnx` | XSeg model 2 — occlusion detection | 67 MB | 256px | binary (face/bg) |
| `xseg_3.onnx` | XSeg model 3 — occlusion detection | 67 MB | 256px | binary (face/bg) |

Runtime model selection via IPC: `many` (all 3 intersected), `xseg_1`, `xseg_2`, `xseg_3`. Input: NHWC float32 [0,1]. Output: intersection of all selected model masks (most conservative).

### BiSeNet — Region Segmentation

| Model | Description | Size | Input | Classes |
|-------|-------------|------|-------|---------|
| `bisenet_resnet_34.onnx` | BiSeNet ResNet-34 (**default**) | 89 MB | 512px | 19 regions |
| `bisenet_resnet_18.onnx` | BiSeNet ResNet-18 (lighter) | 51 MB | 512px | 19 regions |

Runtime model selection via IPC. Input: NCHW float32, ImageNet-normalized. 10 configurable face regions: skin, left-eyebrow, right-eyebrow, left-eye, right-eye, glasses, upper-lip, nose, lower-lip, mouth.

## TensorRT Engine Cache

Pre-compiled TensorRT engines stored in the `trt_cache/` subfolder, keyed by GPU architecture and software versions. Eliminates cold-start TRT compilation (~180-300s) on new GPU instances.

### Layout

```
trt_cache/
├── sm89/trt10.9_ort1.24/                    # RTX 4090 (Ada Lovelace)
│   ├── manifest.json                        # Metadata: cache_key, engine list, timestamps
│   ├── TensorrtExecutionProvider_*.engine   # Compiled TRT engines
│   ├── TensorrtExecutionProvider_*.profile  # Profiling data
│   └── timing.cache                         # cuDNN/TRT timing optimization cache
├── sm86/trt10.9_ort1.24/                    # RTX 3090 (Ampere)
│   └── ...
└── sm80/trt10.9_ort1.24/                    # A100 (Ampere)
    └── ...
```

### Cache Key

Format: `{gpu_arch}/trt{trt_version}_ort{ort_version}`

| Component | Example | Source |
|-----------|---------|--------|
| `gpu_arch` | `sm89` | `nvidia-smi --query-gpu=compute_cap` → `8.9` → `sm89` |
| `trt_version` | `10.9` | `tensorrt.__version__` major.minor |
| `ort_version` | `1.24` | `onnxruntime.__version__` major.minor |

### Lifecycle

1. **Download** — at container boot, the GPU Worker checks HF for a matching cache key. If found, it downloads all engines (~10-30s vs ~180-300s compile).
2. **Compile** — if no cache exists on HF, ONNX Runtime compiles TRT engines from scratch on first model load.
3. **Self-seed upload** — after compilation, engines are uploaded to HF so future instances skip compilation.
4. **Incremental upload** — if engines were downloaded from HF but new models were compiled locally afterward (e.g., YOLOv8n during warmup), only the new engines are uploaded.

### manifest.json

```json
{
  "cache_key": "sm89/trt10.9_ort1.24",
  "gpu_arch": "sm89",
  "trt_version": "10.9",
  "ort_version": "1.24",
  "created_at": "2025-03-07T12:00:00Z",
  "machine_id": "C.12345",
  "engine_files": [
    "TensorrtExecutionProvider_model_hash.engine",
    "TensorrtExecutionProvider_model_hash.profile",
    "timing.cache"
  ]
}
```

The manifest serves as both metadata and an upload gate — its presence signals that the cache was downloaded, and the `engine_files` list enables incremental upload detection.

## GFPGAN (optional, not used in real-time)

Face restoration and enhancement. Too slow for real-time streaming (~50-150ms per frame).

| Model | Description | Size |
|-------|-------------|------|
| `gfpgan/GFPGANv1.4.pth` | GFPGAN v1.4 restoration | 332 MB |
| `gfpgan/weights/detection_Resnet50_Final.pth` | RetinaFace detector | 104 MB |
| `gfpgan/weights/parsing_parsenet.pth` | ParseNet segmentation | 81 MB |

## Usage

### GPU Worker (production)

Models are baked into the Docker image at build time (buffalo_l + default swap + landmark + mask models).
Alternative swap models (HyperSwap) are downloaded on demand by `ModelDownloadService`. The TRT engine cache is downloaded asynchronously at boot via `trt_cache.py` (non-blocking — `/health` responds immediately).

```bash
# Manual download (local development)
HF_TOKEN=hf_xxx ./scripts/download_models.sh /models
```

### Docker build

```bash
docker build --build-arg HF_TOKEN=hf_xxx -t morphstream-gpu-worker .
```

### Python (huggingface_hub)

```python
from huggingface_hub import hf_hub_download

model_path = hf_hub_download(
    repo_id="latark/MorphStream",
    filename="inswapper_128_fp16.onnx",
    token="hf_xxx",
)
```

## Scripts

### convert_scrfd_fp16.py

Converts SCRFD `det_10g.onnx` from FP32 to FP16:

```bash
pip install onnx onnxconverter-common
python scripts/convert_scrfd_fp16.py \
    --input buffalo_l/det_10g.onnx \
    --output buffalo_l/det_10g_fp16.onnx
```

Key detail: `op_block_list=['BatchNormalization']` keeps BatchNormalization nodes in FP32, preventing epsilon underflow (1e-5 rounds to 0 in FP16, producing NaN).

## License

MIT License