# Browser Face Swap — Build Guide

> **Live demo:** https://newface.live
>
> **Reference repo:** https://github.com/alphastack1/new-face-live
>
> **How to use this file:** Drop it into an empty folder, open it in
> Claude Code, and say *"Read this file and build everything."*
> No Python, no GPU setup, no install. Runs entirely in the browser via WebGPU.
> Deploy to Netlify (free tier). Open in Chrome. Done.

---

# Part 1: The Big Picture

## What This App Does

A real-time face swap tool that runs **entirely in the browser**. No server,
no Python, no CUDA drivers. Select a reference face photo (the source of the
nose shape, lip fullness, or jaw structure you want), point your webcam at
yourself, pick a facial region, and see the result blended onto your live
video. Powered by ONNX Runtime Web + WebGPU.

This is a browser-native reimplementation of what Deep-Live-Cam does with
Python + NVIDIA GPU, but running on the user's GPU via WebGPU instead.

```
 USER EXPERIENCE
 ═══════════════════════════════════════════════════════════════

 First Visit (models download once, cached in IndexedDB):
 ┌──────────────────────────────────────────────────────────┐
 │                                                          │
 │       See your new look instantly                        │
 │                                                          │
 │  Preview cosmetic changes to your nose, lips,            │
 │  eyes, and more — live on camera.                        │
 │                                                          │
 │  🔒 100% private  📹 Real-time camera  ⚡ No install     │
 │                                                          │
 │  ┌────────────────────────────────────────────┐          │
 │  │  ██████████░░░░░░░░░░  42%                 │          │
 │  │  inswapper: 231/553 MB (42% overall)       │          │
 │  └────────────────────────────────────────────┘          │
 │                                                          │
 │               [Get Started]                              │
 │                                                          │
 │  One-time ~838 MB download — cached locally after use    │
 └──────────────────────────────────────────────────────────┘

 After models cached (instant on every future visit):
 ┌──────────────────────────────────────────────────────────┐
 │  NewFace                                    ● Ready      │
 │  ┌─────────────────────────────────────────────────────┐ │
 │  │                                                     │ │
 │  │              LIVE CAMERA FEED                       │ │
 │  │        (face swapped in real-time)                  │ │
 │  │              15-25 FPS via WebGPU                   │ │
 │  │                                                     │ │
 │  ├─────────────────────────────────────────────────────┤ │
 │  │  [▶ Start]                            [Mirror]      │ │
 │  ├─────────────────────────────────────────────────────┤ │
 │  │  Region: [●Nose] [Lips] [Eyes] [Brow] [Chin] [Full]│ │
 │  │  Blend:  ═══════●═══════════════════════════  70%   │ │
 │  │  Sharp:  ═══════════●═══════════════════════  50    │ │
 │  │  Ref:    Nose│Lips│Eyes│Brow│Chin                   │ │
 │  │  [img][img][img][img][img][img] →  (scroll)         │ │
 │  └─────────────────────────────────────────────────────┘ │
 └──────────────────────────────────────────────────────────┘
```

## How It Differs from Traditional Deep-Live-Cam

```
 TRADITIONAL (Python + NVIDIA GPU)        THIS PROJECT (Browser + WebGPU)
 ════════════════════════════════          ════════════════════════════════
 Python 3.10+ required                    No install needed
 NVIDIA GPU + CUDA required               Any GPU (Chrome's WebGPU)
 InsightFace + onnxruntime-gpu            ONNX Runtime Web (@1.22.0)
 Flask server on localhost                Static HTML — deploy anywhere
 Models in filesystem (~2 GB)             Models cached in IndexedDB (~838 MB)
 Desktop only (Windows/Linux)             Any device with Chrome + WebGPU
 tkinter or custom Flask UI               Single HTML file, mobile-responsive
 Full-face swap only                      Regional swap (nose, lips, eyes, etc.)
 No pixel-level region masking            BiSeNet face parsing (19 classes)
```

## Core Pipeline

```
LIVE FRAME PROCESSING (pipelined across 2 threads)
═══════════════════════════════════════════════════════════

  Webcam frame (640×480)
        │
        ├──→ [Web Worker] SCRFD face detection (WASM)
        │    Next frame's detection runs in parallel
        │    with current frame's swap.
        │
        ▼  (uses previous frame's detection result)
  ┌─────────────────────────────────────────────┐
  │  1. SCRFD FACE DETECTION (det_10g)           │
  │     192×192 input, RetinaFace-style          │
  │     → bounding box + 5-point landmarks       │
  │     Runs on WASM in Web Worker               │
  │     Also runs on main thread (WASM)          │
  │     for reference-face detection             │
  └──────────────────┬──────────────────────────┘
                     │
                     ▼
  ┌─────────────────────────────────────────────┐
  │  2. ARCFACE EMBEDDING (w600k_r50)            │
  │     Align face to 112×112 via affine warp    │
  │     Extract 512-dim embedding                │
  │     Runs once per reference face (not live)  │
  │     WASM backend                             │
  └──────────────────┬──────────────────────────┘
                     │
                     ▼
  ┌─────────────────────────────────────────────┐
  │  3. INSWAPPER_128 FACE SWAP (WebGPU)         │
  │     Input: source latent (512-dim projected  │
  │     embedding) + target face (128×128 aligned)│
  │     Output: swapped face 128×128             │
  │                                              │
  │     NOTE: InSwapper uses the ArcFace         │
  │     embedding (pose-invariant), NOT pixel    │
  │     data. One front-facing ref is enough.    │
  └──────────────────┬──────────────────────────┘
                     │
                     ▼
  ┌─────────────────────────────────────────────┐
  │  4. PASTE-BACK with soft elliptical mask     │
  │     Inverse-warp swapped face into frame     │
  │     Bilinear interpolation + feathered edges │
  │     Pre-computed 128×128 radial mask         │
  └──────────────────┬──────────────────────────┘
                     │
                     ▼
  ┌─────────────────────────────────────────────┐
  │  5. BISENET FACE PARSING (WebGPU)            │
  │     512×512 input, 19 CelebAMask-HQ classes  │
  │     Maps classes → region masks:             │
  │       nose: [10]                             │
  │       lips: [11, 12, 13]                     │
  │       eyes: [4, 5, 6]                        │
  │       brow: [2, 3]                           │
  │       chin: [1] + crop above mouth Y         │
  │     Cached — reparses every ~45 frames       │
  └──────────────────┬──────────────────────────┘
                     │
                     ▼
  ┌─────────────────────────────────────────────┐
  │  6. ALPHA BLEND                              │
  │     result = orig × (1 - mask × opacity)     │
  │            + swapped × (mask × opacity)      │
  │     Gaussian-feathered edges (10% face width)│
  │     Edge dilation to fill gaps (2% radius)   │
  └──────────────────┬──────────────────────────┘
                     │
                     ▼
  ┌─────────────────────────────────────────────┐
  │  7. SHARPNESS (unsharp mask, every frame)    │
  │     3×3 neighbor average subtracted          │
  │     strength = slider / 50 (0–2×)            │
  └──────────────────┬──────────────────────────┘
                     │
                     ▼
  Rendered to <canvas> via putImageData()
```

## System Architecture

```
┌───────────────────────────────────────────────────────────────┐
│              BROWSER (Chrome 113+ with WebGPU)                 │
│                                                               │
│  index.html — single file, no build step, ~745 lines          │
│  ┌────────────────────────────────────────────────────────┐   │
│  │  Loading Screen                                        │   │
│  │  • WebGPU capability check                             │   │
│  │  • "Get Started" → download + init all models          │   │
│  │  • Progress bar with per-model reporting               │   │
│  │  • IndexedDB cache check (skip download if cached)     │   │
│  └────────────────────┬───────────────────────────────────┘   │
│                       │                                       │
│  ┌────────────────────▼───────────────────────────────────┐   │
│  │  Live View                                             │   │
│  │  • <video> (hidden) → getUserMedia webcam               │   │
│  │  • <canvas> (visible) → rendered face-swap result       │   │
│  │  • requestAnimationFrame loop                          │   │
│  │  • Region pills, opacity slider, sharpness slider      │   │
│  │  • Reference face category tabs + thumbnail scroll     │   │
│  │  • Mirror toggle, zoom/pan (scroll + pinch)            │   │
│  │  • Preferences saved/restored via localStorage         │   │
│  └────────────────────────────────────────────────────────┘   │
│                                                               │
│  js/engine.js    — orchestrator (loads models, processes frames)│
│  js/pipeline.js  — detection, embedding, swap, parse, blend    │
│  js/models.js    — IndexedDB caching + fetch with progress     │
│  js/math.js      — affine transforms, NMS, linear algebra      │
│  js/detection-worker.js — Web Worker for SCRFD (WASM)          │
│                                                               │
│  ┌──────────────────────────────────────────────────────┐     │
│  │  ONNX Runtime Web (@1.22.0)                           │     │
│  │  • WebGPU: inswapper, bisenet (fast, on GPU)          │     │
│  │  • WASM: det_10g, w600k_r50 (compatible, on CPU)      │     │
│  │  • Worker: det_10g (WASM, separate thread)             │     │
│  └──────────────────────────────────────────────────────┘     │
│                                                               │
│  ┌──────────────────────────────────────────────────────┐     │
│  │  IndexedDB: 'newface-models' / 'blobs'                │     │
│  │  Stores ~838 MB of ONNX models after first download   │     │
│  │  Keys: det_10g, w600k_r50, inswapper, bisenet, emap   │     │
│  └──────────────────────────────────────────────────────┘     │
└───────────────────────┬───────────────────────────────────────┘
                        │
                        │ First-time model download only
                        ▼
┌───────────────────────────────────────────────────────────────┐
│              NETLIFY (static hosting + edge function)           │
│                                                               │
│  Static files: index.html, js/*, references/*, favicon.svg    │
│  No build step. No server-side code (except edge proxy).      │
│                                                               │
│  Edge Function: /models-cdn/*                                 │
│  └── Proxies GitHub Releases downloads                        │
│      GitHub → 302 redirect → blob CDN (no CORS headers)      │
│      Edge function follows redirect server-side               │
│      and streams response back with CORS headers              │
│                                                               │
│  Headers:                                                     │
│  • Cross-Origin-Opener-Policy: same-origin                    │
│  • Cross-Origin-Embedder-Policy: credentialless               │
│    (required for SharedArrayBuffer / WASM threading)          │
│  • Cache-Control: immutable on /references/ and /models-cdn/  │
└───────────────────────┬───────────────────────────────────────┘
                        │
                        │ Edge function proxy (follows 302)
                        ▼
┌───────────────────────────────────────────────────────────────┐
│  GitHub Releases (alphastack1/storage, tag: newface-v1)        │
│                                                               │
│  ├── det_10g.onnx              ~17 MB   SCRFD face detection  │
│  ├── w600k_r50.onnx            ~174 MB  ArcFace embedding     │
│  ├── inswapper_128.onnx        ~553 MB  Face swap model       │
│  ├── bisenet_resnet_34.onnx    ~94 MB   Face parsing          │
│  └── emap.bin                  ~1 MB    Embedding projection  │
│                                                               │
│  Total: ~838 MB (downloaded once, cached in IndexedDB)        │
└───────────────────────────────────────────────────────────────┘
```

## What It Costs

| Item | Cost | Notes |
|------|------|-------|
| Everything | **$0** | All models are free and open source |
| Hosting | **$0** | Netlify free tier (static site) |
| Bandwidth | Netlify free tier | ~838 MB per new user (cached after) |
| GPU | User's browser | WebGPU uses their existing hardware |

## What You Need

| Requirement | Notes |
|-------------|-------|
| Netlify account | Free tier. For static hosting + edge functions |
| GitHub account | To host model files as Release assets (~838 MB) |
| Chrome 113+ / Edge 113+ | Must support WebGPU |
| Reference face images | JPEG photos of faces (one face per image) |

---

# Part 2: Project Structure

```
newface-live/
├── index.html              ← ENTIRE frontend in one file. HTML + CSS + JS.
│                              No React, no build, no node_modules.
│                              ~745 lines. Dark gold theme. Mobile-responsive.
├── favicon.svg             ← SVG favicon (face icon)
├── og-image.jpg            ← Open Graph share card (1200×630)
├── netlify.toml            ← Netlify config (headers, caching, redirects)
├── netlify/
│   └── edge-functions/
│       └── models-proxy.js ← Edge function: GitHub Releases → CORS proxy
├── js/
│   ├── engine.js           ← Orchestrator: model loading, frame processing
│   ├── pipeline.js         ← ML ops: detection, embedding, swap, parse, blend
│   ├── models.js           ← IndexedDB caching + fetch with progress
│   ├── math.js             ← Affine transforms, NMS, vector math
│   └── detection-worker.js ← Web Worker: SCRFD detection (WASM thread)
└── references/             ← Pre-loaded reference face images
    ├── nose_ref_1.jpg ... nose_ref_6.jpg    (6 nose references)
    ├── lips_ref_1.jpg ... lips_ref_4.jpg    (4 lips references)
    ├── eyes_ref_1.jpg ... eyes_ref_3.jpg    (3 eyes references)
    ├── brow_ref_1.jpg ... brow_ref_3.jpg    (3 brow references)
    └── chin_ref_1.jpg ... chin_ref_3.jpg    (3 chin references)

No auto-created directories. No venv. No Python. No build step.
Models are stored in the browser's IndexedDB (invisible to filesystem).
```

---

# Part 3: Model Loading & Caching (js/models.js)

This file handles downloading ONNX models, caching them in IndexedDB,
and creating ONNX Runtime sessions with the right backend (WebGPU or WASM).

## Model Registry

```javascript
const MODEL_REGISTRY = {
  det_10g:   { file: 'det_10g.onnx',           size: 16_923_827 },
  w600k_r50: { file: 'w600k_r50.onnx',         size: 174_391_702 },
  inswapper: { file: 'inswapper_128.onnx',     size: 553_210_555 },
  bisenet:   { file: 'bisenet_resnet_34.onnx', size: 93_632_546 },
  emap:      { file: 'emap.bin',               size: 1_048_576 },
};
```
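The loading screen reports both per-model and overall progress. A minimal sketch of how the overall percentage can be derived from the registry sizes (the helper name `overallProgress` is an assumption, not the repo's API):

```javascript
// Hypothetical helper: aggregate per-model download progress into the
// overall percentage shown on the loading screen. Sizes mirror the
// MODEL_REGISTRY above.
const MODEL_SIZES = {
  det_10g: 16_923_827,
  w600k_r50: 174_391_702,
  inswapper: 553_210_555,
  bisenet: 93_632_546,
  emap: 1_048_576,
};

// `loaded` maps model name → bytes received so far.
function overallProgress(loaded) {
  let total = 0, done = 0;
  for (const [name, size] of Object.entries(MODEL_SIZES)) {
    total += size;
    done += Math.min(loaded[name] ?? 0, size); // clamp to declared size
  }
  return Math.round((done / total) * 100);
}
```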

## URL Switching

```
localhost/127.0.0.1  →  /models/       (local dev — models in filesystem)
production           →  /models-cdn/   (Netlify edge function proxy)
```
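The switch above boils down to a hostname check. A minimal sketch (the function name is an assumption; in the browser it would be called with `location.hostname`):

```javascript
// Pick the model base URL: local dev serves from the filesystem,
// production goes through the Netlify edge-function proxy.
function modelBaseUrl(hostname) {
  const isLocal = hostname === 'localhost' || hostname === '127.0.0.1';
  return isLocal ? '/models/' : '/models-cdn/';
}
```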

## IndexedDB Schema

```
Database: 'newface-models' (version 1)
Object Store: 'blobs'

Keys:                Values:
  'det_10g'     →    ArrayBuffer (~17 MB)
  'w600k_r50'   →    ArrayBuffer (~174 MB)
  'inswapper'   →    ArrayBuffer (~553 MB)
  'bisenet'     →    ArrayBuffer (~94 MB)
  'emap'        →    ArrayBuffer (~1 MB)
```

## Loading Flow

```
loadModelBytes(name)
═══════════════════════════════════════════════════

  ┌─────────────────────────────────────────────┐
  │  1. Check IndexedDB cache                    │
  │     cacheGet(name) → ArrayBuffer or null     │
  │     ├── HIT  → return immediately            │
  │     └── MISS → download from network         │
  └──────────────────┬──────────────────────────┘
                     │ MISS
                     ▼
  ┌─────────────────────────────────────────────┐
  │  2. Fetch with progress                      │
  │     Streaming download via ReadableStream    │
  │     Reports (loaded, total) for progress bar │
  │     Collects chunks → single ArrayBuffer     │
  └──────────────────┬──────────────────────────┘
                     │
                     ▼
  ┌─────────────────────────────────────────────┐
  │  3. Cache in IndexedDB (non-fatal)           │
  │     Private browsing may block storage       │
  │     Failure is caught and logged, not thrown  │
  └─────────────────────────────────────────────┘


loadSession(name)  →  WebGPU session   (inswapper, bisenet)
loadSessionWasm(name) → WASM session   (det_10g, w600k_r50)
loadEmap()         →  Float32Array     (512×512 matrix)
```
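The cache-first flow can be sketched with the storage and network layers injected so the control flow is visible. This is a simplified sketch, not the repo's code: the real `models.js` uses IndexedDB for the cache and a streaming fetch for the download.

```javascript
// Sketch of loadModelBytes: cache hit returns immediately; a miss
// downloads, then caches non-fatally (private browsing may block writes).
async function loadModelBytes(name, cache, download, onProgress) {
  const hit = await cache.get(name);              // 1. cache check
  if (hit) return hit;                            //    HIT → done
  const bytes = await download(name, onProgress); // 2. fetch with progress
  try {
    await cache.put(name, bytes);                 // 3. cache (non-fatal)
  } catch (e) {
    console.log(`cache write failed for ${name}:`, e);
  }
  return bytes;
}
```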

## ONNX Session Options

```javascript
// WebGPU (for swap + parse models — runs on GPU)
{
  executionProviders: ['webgpu'],
  preferredOutputLocation: 'cpu',       // Tensors readable on JS side
  graphOptimizationLevel: 'all',
  enableCpuMemArena: true,
  enableMemPattern: true,
  logSeverityLevel: 3,                  // Error-only (suppress warnings)
}

// WASM (for detection + recognition — WebGPU-incompatible ops)
{
  executionProviders: ['wasm'],
  graphOptimizationLevel: 'all',
  enableCpuMemArena: true,
  enableMemPattern: true,
  logSeverityLevel: 3,
}
```

## ORT Warning Suppression (tricky bit)

ONNX Runtime Web emits `VerifyOutputSizes` warnings on every inference
call when output tensor shapes don't exactly match the graph's static
shape metadata. These warnings come from inside the WASM binary and bypass
`ort.env.logLevel`. The fix is two-part:

```javascript
// 1. Override console.warn/error before loading ORT
const _w = console.warn.bind(console);
const RE = /VerifyOutputSizes|Expected shape from model/;
console.warn = function() {
  if (typeof arguments[0] === 'string' && RE.test(arguments[0])) return;
  _w.apply(console, arguments);
};

// 2. Pass logSeverityLevel: 3 in EVERY session.run() call
// (ORT resets to level 2 per-call regardless of session options)
const RUN_OPTIONS = { logSeverityLevel: 3 };
await session.run(feeds, RUN_OPTIONS);
```

This must happen in: index.html (before ORT script tag), detection-worker.js
(before `importScripts`), and pipeline.js (as RUN_OPTIONS on every `.run()`).

---

# Part 4: The Math Layer (js/math.js)

Pure linear algebra — no dependencies. These functions are the geometric
foundation for warping faces between coordinate spaces.

## Key Functions

```
estimateSimilarityTransform(src, dst)
═════════════════════════════════════
  Least-squares 2D similarity transform (rotation + scale + translation)
  from 5-point source landmarks → 5-point destination landmarks.

  Solves for [a, b, tx, ty] in:
    dx = a·sx - b·sy + tx
    dy = b·sx + a·sy + ty

  Returns 2×3 affine matrix: [[a, -b, tx], [b, a, ty]]

  Uses Gaussian elimination with partial pivoting (4×4 system).


invertAffine(M)
════════════════
  Invert a 2×3 affine matrix.
  Used for inverse-mapping (dst→src) in warpAffine.


warpAffine(srcData, srcW, srcH, M, dstW, dstH)
════════════════════════════════════════════════
  Same semantics as cv2.warpAffine in OpenCV.
  M is the FORWARD transform (src→dst). Internally inverts
  and does inverse mapping with bilinear interpolation.

  Input: RGBA Uint8ClampedArray
  Output: RGBA Uint8ClampedArray


nms(boxes, scores, threshold)
═════════════════════════════
  Standard non-maximum suppression for axis-aligned bounding boxes.
  Sorts by score descending, suppresses IoU > threshold.
  Returns indices of kept detections.


vecNormalize(v), vecMatMul(vec, mat, rows, cols)
════════════════════════════════════════════════
  L2 normalization and vector × matrix multiply.
  Used for ArcFace embedding → latent projection.
```
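The two core transforms above can be sketched directly. This minimal version uses the closed-form least-squares solution for [a, b, tx, ty] rather than the 4×4 Gaussian elimination the repo describes; the result is the same matrix.

```javascript
// Least-squares 2D similarity: solves dx = a·sx - b·sy + tx,
// dy = b·sx + a·sy + ty over all point pairs, in closed form.
function estimateSimilarityTransform(src, dst) {
  const n = src.length;
  let sx = 0, sy = 0, dx = 0, dy = 0;
  for (let i = 0; i < n; i++) {
    sx += src[i][0]; sy += src[i][1];
    dx += dst[i][0]; dy += dst[i][1];
  }
  sx /= n; sy /= n; dx /= n; dy /= n;            // centroids
  let numA = 0, numB = 0, den = 0;
  for (let i = 0; i < n; i++) {
    const x = src[i][0] - sx, y = src[i][1] - sy;
    const u = dst[i][0] - dx, v = dst[i][1] - dy;
    numA += x * u + y * v;
    numB += x * v - y * u;
    den  += x * x + y * y;
  }
  const a = numA / den, b = numB / den;
  return [[a, -b, dx - a * sx + b * sy],
          [b,  a, dy - b * sx - a * sy]];
}

// Invert a 2×3 affine matrix [[a, b, tx], [c, d, ty]].
function invertAffine(M) {
  const [[a, b, tx], [c, d, ty]] = M;
  const det = a * d - b * c;
  return [[ d / det, -b / det, (b * ty - d * tx) / det],
          [-c / det,  a / det, (c * tx - a * ty) / det]];
}

function applyAffine(M, [x, y]) {
  return [M[0][0] * x + M[0][1] * y + M[0][2],
          M[1][0] * x + M[1][1] * y + M[1][2]];
}
```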

---

# Part 5: The ML Pipeline (js/pipeline.js)

All inference operations. Pre-allocates buffers to avoid GC pressure.

## Pre-Allocated Buffers (critical for performance)

```javascript
// Reused every frame — NEVER allocate in the hot loop
const _swapTensor = new Float32Array(1 * 3 * 128 * 128);    // Swap input
const _swapRGBA   = new Uint8ClampedArray(128 * 128 * 4);   // Swap output
const _detTensor  = new Float32Array(1 * 3 * 192 * 192);    // Detection input
```

## Face Detection (SCRFD / det_10g)

```
preprocessDetect(imgData) → { tensor, scale }
═══════════════════════════════════════════════
  1. Resize image to 192×192 (maintain aspect ratio, pad rest)
  2. Normalize: (pixel - 127.5) / 128.0
  3. Channel-first layout: [1, 3, 192, 192]

  Uses cached OffscreenCanvas pair for resize (avoid allocation).


detectFaces(session, imgData) → [{bbox, kps, score}, ...]
═══════════════════════════════════════════════════════════
  Model outputs 9 tensors (3 scales × {scores, bboxes, keypoints}).

  For each scale (stride 8, 16, 32):
    Feature map: (192/stride) × (192/stride) grid
    2 anchors per cell

    For each anchor above score threshold (0.3):
      Decode bbox: cx ± offset × stride, scaled by 1/ratio
      Decode 5 keypoints: cx + offset × stride, scaled by 1/ratio

  Apply NMS (IoU threshold 0.4)
  Return sorted detections.


detectOneFace(session, imgData) → {bbox, kps, score} | null
════════════════════════════════════════════════════════════
  Calls detectFaces, picks the LARGEST face (best for single-user webcam).
```
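The NMS step at the end of detection is standard and small enough to show in full (a sketch; boxes are `[x1, y1, x2, y2]`):

```javascript
// Intersection-over-union of two axis-aligned boxes.
function iou(a, b) {
  const ix = Math.max(0, Math.min(a[2], b[2]) - Math.max(a[0], b[0]));
  const iy = Math.max(0, Math.min(a[3], b[3]) - Math.max(a[1], b[1]));
  const inter = ix * iy;
  const areaA = (a[2] - a[0]) * (a[3] - a[1]);
  const areaB = (b[2] - b[0]) * (b[3] - b[1]);
  return inter / (areaA + areaB - inter);
}

// Greedy NMS: walk detections by descending score, drop any box whose
// IoU with an already-kept box exceeds the threshold (0.4 here).
function nms(boxes, scores, threshold) {
  const order = scores.map((_, i) => i).sort((a, b) => scores[b] - scores[a]);
  const keep = [];
  for (const i of order) {
    const clash = keep.some((j) => iou(boxes[i], boxes[j]) > threshold);
    if (!clash) keep.push(i);
  }
  return keep;
}
```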

## Face Alignment

```
alignFace(srcData, srcW, srcH, kps, outSize) → { data, M }
════════════════════════════════════════════════════════════
  Computes similarity transform from detected 5-point landmarks
  to canonical ArcFace destination points.

  ArcFace canonical landmarks (112×112):
    [38.29, 51.70]  left eye
    [73.53, 51.50]  right eye
    [56.03, 71.74]  nose tip
    [41.55, 92.37]  left mouth corner
    [70.73, 92.20]  right mouth corner

  For 128×128 output (swap model): shift X by +8 pixels.

  Returns:
    data: aligned face as Uint8ClampedArray (RGBA)
    M:    2×3 affine matrix (needed for paste-back)
```
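The destination points for the two crop sizes can be derived as described above. This sketch uses the full-precision InsightFace constants (the doc lists them rounded); the `alignmentDst` helper name is an assumption.

```javascript
// Canonical ArcFace landmarks for a 112×112 crop (InsightFace constants).
const ARCFACE_DST_112 = [
  [38.2946, 51.6963], [73.5318, 51.5014], [56.0252, 71.7366],
  [41.5493, 92.3655], [70.7299, 92.2041],
];

// 112 → use as-is; 128 (swap model) → same points shifted +8 px in X.
function alignmentDst(outSize) {
  if (outSize === 112) return ARCFACE_DST_112.map((p) => p.slice());
  if (outSize === 128) return ARCFACE_DST_112.map(([x, y]) => [x + 8, y]);
  throw new Error(`unsupported crop size: ${outSize}`);
}
```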

## ArcFace Embedding

```
extractEmbedding(session, alignedRGBA) → Float32Array(512)
══════════════════════════════════════════════════════════
  Input: 112×112 aligned face RGBA
  Preprocessing: (pixel - 127.5) / 127.5, channel-first
  Output: 512-dim L2-normalized embedding

  NOTE: Only run ONCE per reference face, not per frame.


projectEmbedding(embedding, emap) → Float32Array(512)
═════════════════════════════════════════════════════
  embedding (512,) × emap (512×512) → latent (512,)
  L2-normalize the result.

  This "latent" is what InSwapper consumes as the source identity.
```
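The projection is a plain row-vector × matrix multiply followed by L2 normalization, using the math.js primitives named in Part 4 (a sketch; `emap` is assumed row-major):

```javascript
// vec (rows,) × mat (rows×cols, flat row-major) → (cols,)
function vecMatMul(vec, mat, rows, cols) {
  const out = new Float32Array(cols);
  for (let r = 0; r < rows; r++) {
    const v = vec[r];
    for (let c = 0; c < cols; c++) out[c] += v * mat[r * cols + c];
  }
  return out;
}

// L2-normalize (guard against zero norm).
function vecNormalize(v) {
  let norm = 0;
  for (let i = 0; i < v.length; i++) norm += v[i] * v[i];
  norm = Math.sqrt(norm) || 1;
  const out = new Float32Array(v.length);
  for (let i = 0; i < v.length; i++) out[i] = v[i] / norm;
  return out;
}

// embedding (dim,) × emap (dim×dim) → normalized latent for InSwapper.
function projectEmbedding(embedding, emap, dim = 512) {
  return vecNormalize(vecMatMul(embedding, emap, dim, dim));
}
```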

## Face Swap (InSwapper)

```
runSwap(session, alignedRGBA, sourceLatent) → Uint8ClampedArray
═══════════════════════════════════════════════════════════════
  Inputs:
    'target': [1, 3, 128, 128] — aligned target face (pixel / 255)
    'source': [1, 512] — projected source embedding

  Output: 128×128 RGBA swapped face (logits × 255)

  Uses pre-allocated _swapTensor and _swapRGBA buffers.
```

## Paste-Back

```
pasteBack(frameRGBA, frameW, frameH, swappedRGBA, M, outBuf)
═══════════════════════════════════════════════════════════════
  Inverse-warps the 128×128 swapped face back into the original frame.

  Uses a pre-computed elliptical blending mask (SWAP_MASK_128):
    - Center: (64, 64), radius: 42% of size
    - Feather zone: 10% of size
    - Result: smooth falloff from 1.0 at center to 0.0 at edges

  For each pixel in the face bounding box:
    1. Map frame coord → swap coord via forward affine M
    2. Bilinear interpolate swapped face color
    3. Bilinear interpolate mask alpha
    4. Blend: out = frame × (1-alpha) + swapped × alpha

  Writes into outBuf (pre-allocated) to avoid allocation.
```
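The pre-computed mask can be sketched as below: full opacity out to 42% of the size, then a feather over the next 10%. This uses a linear falloff for clarity; the repo's falloff curve may be smoother, but the shape is the point.

```javascript
// Build a radial blending mask like SWAP_MASK_128: 1.0 inside the inner
// radius, linearly feathered to 0.0 over the feather band.
function buildRadialMask(size) {
  const mask = new Float32Array(size * size);
  const cx = size / 2, cy = size / 2;
  const inner = size * 0.42, feather = size * 0.10;
  for (let y = 0; y < size; y++) {
    for (let x = 0; x < size; x++) {
      const d = Math.hypot(x - cx, y - cy);
      const t = (d - inner) / feather;   // 0 at inner edge, 1 at outer edge
      mask[y * size + x] = Math.min(1, Math.max(0, 1 - t));
    }
  }
  return mask;
}
```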

## BiSeNet Face Parsing

```
parseFace(session, cropRGBA, cropW, cropH) → Uint8Array(labels)
══════════════════════════════════════════════════════════════════
  Input: face crop RGBA, resized to 512×512
  Preprocessing: (pixel - 127.5) / 127.5, channel-first
  Output: [19, 512, 512] logits → argmax per pixel → class labels

  19 CelebAMask-HQ classes:
    0=background  1=skin      2=l_brow     3=r_brow
    4=l_eye       5=r_eye     6=glasses    7=l_ear
    8=r_ear       9=earring  10=nose      11=mouth
   12=u_lip      13=l_lip    14=neck      15=necklace
   16=cloth      17=hair     18=hat

  Labels are mapped back to crop-size coordinates via nearest-neighbor.


parseFullFrame(session, frameData, bbox) → { labels, cropBox, cropW, cropH }
═══════════════════════════════════════════════════════════════════════════════
  1. Expand bounding box by 25% on each side (forehead/chin context)
  2. Crop from full frame
  3. Run parseFace on the crop
  4. Return labels + crop coordinates for mask generation


createRegionMask(labels, cropW, cropH, region, cropBox, kps, frameW, frameH)
═══════════════════════════════════════════════════════════════════════════════
  Region → class mapping:
    nose: [10]
    lips: [11, 12, 13]
    eyes: [4, 5, 6]     (includes glasses — overlaps at angles)
    brow: [2, 3]
    chin: [1]            + zero-out everything above mouth Y

  Steps:
  1. Build binary mask from matching classes
  2. Dilate mask (radius = 2% of crop width) to fill bisenet gaps
  3. For chin: zero above mouth-corner Y (from landmarks)
  4. Map crop-space mask → full-frame-space mask
  5. Gaussian blur (radius = 10% of face width, 3-pass box blur)
  6. Normalize to [0, 1]
```
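Step 1 of `createRegionMask` (the class-to-mask lookup) can be sketched on its own, with dilation, the chin cut, and blurring omitted:

```javascript
// Region → CelebAMask-HQ class IDs, as listed above.
const REGION_CLASSES = {
  nose: [10],
  lips: [11, 12, 13],
  eyes: [4, 5, 6],
  brow: [2, 3],
  chin: [1],
};

// Turn BiSeNet per-pixel labels into a binary mask for one region.
function binaryRegionMask(labels, region) {
  const classes = new Set(REGION_CLASSES[region]);
  const mask = new Uint8Array(labels.length);
  for (let i = 0; i < labels.length; i++) {
    if (classes.has(labels[i])) mask[i] = 1;
  }
  return mask;
}
```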

## Blending & Sharpening

```
blendRegion(original, swapped, regionMask, opacity, w, h, outBuf)
═══════════════════════════════════════════════════════════════════
  Per-pixel: alpha = (mask ? mask[i] : 1.0) * opacity
  out = original × (1 - alpha) + swapped × alpha

  If regionMask is null → full-face blend (no parsing).


sharpen(rgba, w, h, amount) → Uint8ClampedArray
═════════════════════════════════════════════════
  Unsharp mask using 3×3 neighbor average:
    blur = average of 8 neighbors
    sharpened = center + (center - blur) × strength
    strength = amount / 50 (slider 0-100 → 0-2× enhancement)
```
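The blend formula above maps directly to code. A minimal sketch of `blendRegion` (RGBA buffers, per-pixel mask in [0, 1], `null` mask meaning full-face):

```javascript
// out = original × (1 - alpha) + swapped × alpha, alpha = mask × opacity.
function blendRegion(original, swapped, mask, opacity, outBuf) {
  for (let i = 0, p = 0; i < outBuf.length; i += 4, p++) {
    const alpha = (mask ? mask[p] : 1.0) * opacity;
    for (let c = 0; c < 3; c++) {
      outBuf[i + c] = original[i + c] * (1 - alpha) + swapped[i + c] * alpha;
    }
    outBuf[i + 3] = 255;   // result is fully opaque
  }
  return outBuf;
}
```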

---

# Part 6: The Engine (js/engine.js)

Orchestrates model loading, reference setting, and frame processing.
This is the main class the UI interacts with.

## Engine Class

```
class Engine {
  // Models
  detSession       WASM session (main thread, for ref face detection)
  recSession       WASM session (w600k_r50, ArcFace embedding)
  swapSession      WebGPU session (inswapper_128)
  parseSession     WebGPU session (bisenet_resnet_34)
  emap             Float32Array (512×512 projection matrix)

  // Detection Worker
  _detWorker       Web Worker instance
  _workerReady     boolean — worker initialized
  _workerBusy      boolean — worker processing a frame
  _workerDead      boolean — worker crashed
  _latestDetection cached face from last worker result
  _detGeneration   incremented on ref switch (stale results ignored)

  // State
  ready            boolean — all models loaded
  sourceLatent     Float32Array(512) — projected reference embedding
  sourceEmbedding  Float32Array(512) — raw reference embedding
  region           'nose' | 'lips' | 'eyes' | 'brow' | 'chin' | 'full'
  opacity          0.0 – 1.0
  sharpness        0 – 100
  mirror           boolean

  // Caching
  _cachedParsing   last bisenet parse result
  _parseFrameCount counter for reparse scheduling
}
```

## Initialization Flow

```
engine.init(onProgress)
═══════════════════════════════════════════════════

  1. Download det_10g model bytes (with progress)
  2. Create Web Worker, send model bytes via postMessage
     (transferable ArrayBuffer — zero-copy)
  3. Worker creates WASM session, posts 'ready'
  4. Load det_10g again as WASM session on main thread
     (for reference face detection — worker only does live frames)
  5. Load w600k_r50 as WASM session (ArcFace)
  6. Load inswapper as WebGPU session
  7. Load bisenet as WebGPU session
  8. Load emap as Float32Array
  9. Warmup: run dummy inference on swap session
     (forces WebGPU shader compilation)
  10. engine.ready = true
```

## Reference Face Setting

```
engine.setReference(source)    source = URL string or Image element
═══════════════════════════════════════════════════════════════════

  Race protection: _refVersion incremented, stale calls bail out.
  Waits for any in-flight processFrame to complete first.

  1. Load image → OffscreenCanvas → ImageData
  2. detectOneFace via main-thread WASM session
  3. alignFace to 112×112 (ArcFace canonical)
  4. extractEmbedding → 512-dim normalized vector
  5. projectEmbedding × emap → sourceLatent
  6. Bump _detGeneration (discard stale worker results)
  7. Clear cached parsing
```

## Frame Processing (Pipelined)

```
engine.processFrame(frameData)
═══════════════════════════════════════════════════════════════

  Single-concurrency guard: _processingFrame flag.
  Bails out if: not ready, no sourceLatent, setting reference.

  1. Send frame to detection worker (fire-and-forget for NEXT frame)
  2. Use _latestDetection from worker (if fresh, <2 seconds old)
     - If no cached detection: run detection on main thread (first frame)
  3. alignFace to 128×128 for swap model
  4. runSwap (WebGPU) → 128×128 swapped face
  5. pasteBack into frame → full-frame swapped result
  6. If region != 'full':
     a. Check if reparse needed (every 45 frames OR face moved >25%)
     b. Run bisenet parsing (WebGPU) if needed
     c. Create region mask from cached parsing
     d. blendRegion with mask × opacity
  7. If region == 'full':
     blendRegion with null mask (full face, opacity only)
  8. Apply sharpness
  9. Clone result (buffers are reused!) → new ImageData
  10. Update FPS counter

  Worker stuck-detection: if worker busy >5 seconds, reset busy flag.
  Worker death: falls back to main-thread detection only.
```
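The reparse decision in step 6a can be sketched as a small predicate. Names and the exact movement metric are assumptions; the repo's heuristics may differ in detail.

```javascript
// Reparse when the frame budget is spent or the face bbox centre has
// moved more than 25% of its width since the last parse.
function needsReparse(frameCount, prevBox, currBox, interval = 45) {
  if (frameCount >= interval || !prevBox) return true;
  const cx = (b) => (b[0] + b[2]) / 2;
  const cy = (b) => (b[1] + b[3]) / 2;
  const width = prevBox[2] - prevBox[0];
  const moved = Math.hypot(cx(currBox) - cx(prevBox), cy(currBox) - cy(prevBox));
  return moved > width * 0.25;
}
```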

---

# Part 7: The Detection Worker (js/detection-worker.js)

Runs SCRFD face detection on a separate thread so it doesn't block
the WebGPU swap pipeline on the main thread.

```
┌─────────────────────────────────────────────────────┐
│  MAIN THREAD                    WORKER THREAD        │
│                                                     │
│  Frame N                        Frame N-1            │
│  ├─ postMessage(pixels)  ──►   ├─ detect(pixels)    │
│  ├─ swapSession.run()          │  (WASM, ~25ms)     │
│  ├─ parseFace()                │                     │
│  ├─ blend + sharpen            │                     │
│  ├─ render to canvas           ◄─ postMessage(face)  │
│  │                                                   │
│  Frame N+1                     Frame N               │
│  ├─ uses detection from ──►    ...                   │
│  │  previous frame                                   │
│  └─ (detection is 1 frame behind — acceptable)       │
└─────────────────────────────────────────────────────┘
```

## Message Protocol

```
→ Worker receives:
  { type: 'init', modelBytes: ArrayBuffer }
  { type: 'detect', pixels: ArrayBuffer, width, height, id, gen }

← Worker sends:
  { type: 'ready' }
  { type: 'result', face: {bbox, kps, score} | null, id, gen }
  { type: 'error', message, id, gen }
```

## Key Details

- Uses `importScripts()` to load ORT WASM (it is a classic worker, which can't use ES module `import`)
- `ort.env.wasm.numThreads = 1` — worker IS the separate thread
- Pixel data sent as transferable ArrayBuffer (zero-copy)
- `gen` field prevents stale detections after reference switch
- Duplicates detection code from pipeline.js (classic workers can't import ES modules)
- Includes its own ORT warning suppression (console.warn override)
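The `gen` staleness check on the main-thread side can be sketched as below. The handler shape and state field names are assumptions; the point is that results from an older detection generation are discarded, so a reference switch can't resurrect a stale face.

```javascript
// Returns an onmessage handler that ignores results from past generations.
function makeResultHandler(state) {
  return (msg) => {
    if (msg.type !== 'result') return;
    if (msg.gen !== state.detGeneration) return;  // stale, drop it
    state.latestDetection = { face: msg.face, at: Date.now() };
    state.workerBusy = false;
  };
}
```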

---

# Part 8: The Frontend (index.html)

One HTML file. No framework, no npm, no build step. Everything inline:
HTML structure + `<style>` block + `<script type="module">` block.

## Design System

```
VISUAL LANGUAGE
═══════════════════════════════════════════════════════════

  Theme:      Dark with gold accent
  Fonts:      Inter (UI text) + JetBrains Mono (stats/overlay)
  Max width:  Full viewport (fills screen)
  Layout:     Mobile-first, single column. Desktop: side panel.

  Color tokens (CSS variables on :root):
  ┌──────────────────────────────────────────────────────┐
  │  Backgrounds               Borders                   │
  │  --bg:      #08080c        --border: rgba(gold, .12) │
  │  --bg2:     #0c0c12        --border2: rgba(gold, .25)│
  │  --bg3:     #101018                                  │
  │  --surface: rgba(16,16,24,.8)                        │
  │  --surface2: rgba(24,24,36,.7)                       │
  │                                                      │
  │  Accent                    Text                      │
  │  --gold:    #e8af48        --text:  #f0ece4          │
  │  --gold-l:  #feeaa3        --text2: #a8a4a0          │
  │  --gold-d:  #c49746        --muted: #6b6865          │
  │  --gold-dk: #533517                                  │
  │                                                      │
  │  Status                                              │
  │  --green: #4ade80  --red: #f87171                    │
  └──────────────────────────────────────────────────────┘

  Breakpoint: 768px
  ├── Mobile: stacked (cam top, controls bottom, scrollable)
  └── Desktop: cam left, controls in 320px side panel

  Animations:
  ├── Pulse dot on loading (opacity cycle)
  ├── Toast slide-in from below, auto-dismiss 3s
  ├── Scale(0.97) on button press
  └── Smooth transitions on all interactive elements
```

## Screen Flow

```
PAGE LOAD
═══════════════════════════════════════════════════════════

  1. Check navigator.gpu exists
     ├── NO  → disable button, show "WebGPU not supported" message
     │         (different messages for mobile vs desktop)
     └── YES → show "Get Started" button

  2. Check IndexedDB for cached models
     ├── ALL cached → button says "Launch" (instant)
     └── MISSING    → button says "Get Started"

  3. User clicks button
     ├── engine.init() with progress callback
     ├── Progress bar fills (per-model reporting)
     ├── Models load: det_10g → w600k_r50 → inswapper → bisenet → emap
     └── On complete: hide loading screen, show live view

  4. Live view
     ├── Reference thumbnails rendered by category
     ├── User clicks reference → engine.setReference()
     ├── User clicks Start → getUserMedia → processLoop()
     └── Preferences restored from localStorage
```
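Steps 1-2 of the flow above reduce to a small decision function. A sketch with illustrative field names (the real page wires this straight into the DOM):

```javascript
// Decide the launch button's state from the two startup checks:
// WebGPU availability and whether all models are already cached.
function launchButtonState({ hasWebGPU, allModelsCached }) {
  if (!hasWebGPU) {
    return { disabled: true, label: 'WebGPU not supported' };
  }
  return { disabled: false, label: allModelsCached ? 'Launch' : 'Get Started' };
}

// Usage in the page (sketch):
// const state = launchButtonState({
//   hasWebGPU: !!navigator.gpu,
//   allModelsCached: await models.allCached(),
// });
```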

## Camera Loop Architecture

```
processLoop() — called via requestAnimationFrame
═══════════════════════════════════════════════════════

  EVERY FRAME (smooth display):
    if (lastResult exists)  → ctx.putImageData(lastResult)
    else                    → drawVideoFrame() (raw camera)

  IF NOT ALREADY PROCESSING (non-blocking inference):
    1. captureFrame() via offscreen canvas
       (reads from <video> directly, not from visible canvas)
    2. engine.processFrame(frameData) → Promise
       .then(result) → lastResult = result
       .catch(err)   → increment frameErrors
       (if >10 errors → stop camera automatically)

  UPDATE OVERLAY: "15 FPS | nose"

  requestAnimationFrame(processLoop)
```

Key insight: the render loop NEVER blocks on inference. It always draws
the last result (or raw video), while inference runs async. This means
the canvas updates at 60fps even though face swap runs at 15-25fps.
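The loop above can be sketched with its dependencies injected (`drawLast`, `drawVideo`, `capture`, `process`, `raf` are stand-ins for the real canvas, engine, and `requestAnimationFrame` calls), which makes the never-block scheduling testable:

```javascript
// Non-blocking render loop: always draw, only start inference when
// the previous one has finished.
function makeProcessLoop(deps) {
  let lastResult = null;
  let busy = false;

  function tick() {
    // 1. Always draw something — never wait for inference
    if (lastResult) deps.drawLast(lastResult);
    else deps.drawVideo();

    // 2. Kick off inference only if the previous one finished
    if (!busy) {
      busy = true;
      deps.process(deps.capture())
        .then((r) => { lastResult = r; })
        .catch(() => { /* count errors; stop camera after >10 */ })
        .finally(() => { busy = false; });
    }
    deps.raf(tick);
  }
  return tick;
}
```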

## Offscreen Capture Canvas

```
WHY: Reading pixels from the visible canvas would require:
  1. Draw video to visible canvas
  2. getImageData from visible canvas (causes GPU→CPU readback stall)
  3. Draw swap result to visible canvas

INSTEAD:
  - Visible canvas ONLY shows output (putImageData or drawImage)
  - Separate OffscreenCanvas captures frames from <video> directly
  - No GPU readback stall on the visible canvas

This matters because getImageData on a WebGPU-backed canvas can
stall the pipeline. OffscreenCanvas with willReadFrequently: true
stays on the CPU path.
```
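A minimal sketch of that capture path (browser-only APIs, shown for shape; `makeFrameCapture` is an illustrative name):

```javascript
// Capture frames from the <video> element on a separate OffscreenCanvas
// so getImageData never touches the GPU-backed visible canvas.
function makeFrameCapture(video, width, height) {
  const off = new OffscreenCanvas(width, height);
  // willReadFrequently keeps this canvas on the CPU path
  const ctx = off.getContext('2d', { willReadFrequently: true });
  return function captureFrame() {
    ctx.drawImage(video, 0, 0, width, height);  // read from <video> directly
    return ctx.getImageData(0, 0, width, height);
  };
}
```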

## Reference Face Selection

```javascript
// Category tabs: nose, lips, eyes, brow, chin
// Each category has pre-loaded reference images
const REFS = {
  nose: ['nose_ref_1.jpg', ..., 'nose_ref_6.jpg'],
  lips: ['lips_ref_1.jpg', ..., 'lips_ref_4.jpg'],
  eyes: ['eyes_ref_1.jpg', ..., 'eyes_ref_3.jpg'],
  brow: ['brow_ref_1.jpg', ..., 'brow_ref_3.jpg'],
  chin: ['chin_ref_1.jpg', ..., 'chin_ref_3.jpg'],
};

// Selecting a category auto-switches the region too
// e.g. clicking "Lips" tab → setCat('lips') → setRegion('lips')
```

## Zoom & Pan

```
Desktop:
  Scroll wheel → zoom (1× to 5×)
  Middle mouse drag → pan (when zoomed)
  Double-click → reset zoom

Mobile:
  Pinch → zoom (1× to 5×)
  Touch-hold 300ms + drag → pan (when zoomed)
  Double-tap → reset zoom

Implementation: CSS transform on the <canvas> element
  canvas.style.transform = `scale(${scale}) translate(${panX}px, ${panY}px)`
```
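The zoom math as a pure helper — exponential wheel zoom clamped to the 1×-5× range above. The `0.0015` sensitivity factor is illustrative, not the repo's value:

```javascript
// Wheel zoom: negative deltaY (scroll up) zooms in; result is
// clamped to the 1x-5x range.
function applyZoom(scale, wheelDeltaY) {
  const next = scale * Math.exp(-wheelDeltaY * 0.0015);
  return Math.min(5, Math.max(1, next));
}

// canvas.style.transform = `scale(${scale}) translate(${panX}px, ${panY}px)`;
```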

## Preferences (localStorage)

```javascript
// Saved on every change, restored on load
{
  region: 'nose',
  opacity: 70,
  sharpness: 50,
  mirror: true,
  cat: 'nose',       // active reference category tab
  ref: 'nose_ref_1.jpg'  // active reference image
}
```
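Restoring these safely means merging over defaults, so a stale or partially-saved object never breaks the UI. A sketch (the `nf_prefs` key name is illustrative):

```javascript
// Defaults mirror the shape above; saved values override them.
const DEFAULT_PREFS = {
  region: 'nose', opacity: 70, sharpness: 50,
  mirror: true, cat: 'nose', ref: 'nose_ref_1.jpg',
};

function restorePrefs(json) {
  let saved = {};
  try { saved = JSON.parse(json) || {}; } catch { /* corrupt → defaults */ }
  return { ...DEFAULT_PREFS, ...saved };
}

// localStorage wiring (browser):
// const prefs = restorePrefs(localStorage.getItem('nf_prefs'));
// const save = () => localStorage.setItem('nf_prefs', JSON.stringify(prefs));
```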

## Toast System

```javascript
function toast(msg, type) {   // type: 'ok', 'err', or ''
  const el = document.createElement('div');
  el.className = 'toast' + (type ? ' ' + type : '');
  el.textContent = msg;
  document.getElementById('toasts').appendChild(el);
  setTimeout(() => el.remove(), 3000);
}
```

---

# Part 9: Netlify Configuration

## netlify.toml

```toml
[build]
  publish = "."
  command = "echo 'Static site — no build step'"

# SharedArrayBuffer for WASM threading
[[headers]]
  for = "/*"
  [headers.values]
    Cross-Origin-Opener-Policy = "same-origin"
    Cross-Origin-Embedder-Policy = "credentialless"

# Aggressive caching
[[headers]]
  for = "/references/*"
  [headers.values]
    Cache-Control = "public, max-age=31536000, immutable"

[[headers]]
  for = "/js/*"
  [headers.values]
    Cache-Control = "public, max-age=86400"

[[headers]]
  for = "/*.js"

  [headers.values]
    Content-Type = "application/javascript"

# www → apex redirect
[[redirects]]
  from = "https://www.yourdomain.com/*"
  to = "https://yourdomain.com/:splat"
  status = 301
  force = true
```

## Edge Function (models-proxy.js)

```
WHY THIS EXISTS:
  Models are hosted on GitHub Releases (free, unlimited bandwidth).
  But GitHub Releases use 302 redirects to release-assets CDN,
  and that CDN doesn't set CORS headers.

  Browser fetch() follows the redirect transparently, but the
  response lacks Access-Control-Allow-Origin → blocked by CORS.

  Solution: Netlify edge function follows the redirect SERVER-SIDE
  and streams the response back with proper CORS headers.

FLOW:
  Browser → GET /models-cdn/det_10g.onnx
         → Edge function → GET github.com/...releases/.../det_10g.onnx
                        → 302 → release-assets.githubusercontent.com/...
                        → Edge function follows redirect, gets response
                        → Adds CORS headers + Cache-Control: immutable
                        → Streams back to browser

Config:
  export const config = { path: '/models-cdn/*' };
```
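A sketch of that flow as code. `toGitHubUrl` is the pure path mapping; the commented handler is Netlify-edge-shaped (Deno runtime). The repo path and tag are the placeholders from Part 10, not real values:

```javascript
// Map /models-cdn/<file> to the corresponding GitHub Release asset URL.
const RELEASE_BASE =
  'https://github.com/YOURUSER/storage/releases/download/newface-v1/';

function toGitHubUrl(pathname) {
  const file = pathname.replace(/^\/models-cdn\//, '');
  return RELEASE_BASE + encodeURIComponent(file);
}

// export default async (request) => {
//   const url = toGitHubUrl(new URL(request.url).pathname);
//   const upstream = await fetch(url);   // follows the 302 server-side
//   return new Response(upstream.body, {
//     status: upstream.status,
//     headers: {
//       'Access-Control-Allow-Origin': '*',
//       'Cache-Control': 'public, max-age=31536000, immutable',
//     },
//   });
// };
// export const config = { path: '/models-cdn/*' };
```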

## Cross-Origin Isolation (tricky bit)

```
SharedArrayBuffer is required by ONNX Runtime's WASM threading.
It's only available in cross-origin-isolated contexts:

  Cross-Origin-Opener-Policy: same-origin
  Cross-Origin-Embedder-Policy: credentialless

WHY "credentialless" instead of "require-corp":
  "require-corp" blocks ALL cross-origin resources unless they
  explicitly set Cross-Origin-Resource-Policy headers.
  This would break loading ORT from jsdelivr CDN.

  "credentialless" allows cross-origin requests that don't send
  credentials (cookies), which is fine for CDN resources.
  It enables SharedArrayBuffer without breaking external scripts.
```
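A quick runtime check is useful when debugging a deploy: `SharedArrayBuffer` only appears once the COOP/COEP headers above actually took effect. A small sketch:

```javascript
// Report whether the page is cross-origin isolated and whether
// SharedArrayBuffer is exposed (both must hold for WASM threading).
function isolationStatus() {
  return {
    isolated: !!globalThis.crossOriginIsolated,
    sab: typeof SharedArrayBuffer !== 'undefined',
  };
}

// In the browser console on the deployed site:
// isolationStatus()   // expect { isolated: true, sab: true }
```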

---

# Part 10: Model Hosting Setup

## Hosting Models on GitHub Releases

The ONNX models are too large for git (~838 MB total). Host them as
GitHub Release assets:

```
1. Create a GitHub repo (e.g., yourusername/storage)
2. Create a Release with tag "newface-v1"
3. Upload these files as Release assets:
   ├── det_10g.onnx              (~17 MB)
   ├── w600k_r50.onnx            (~174 MB)
   ├── inswapper_128.onnx        (~553 MB)
   ├── bisenet_resnet_34.onnx    (~94 MB)
   └── emap.bin                  (~1 MB)

4. Update the edge function URL to point to your repo:
   const targetUrl = `https://github.com/YOURUSER/storage/releases/download/newface-v1/${filename}`;
```

## Where to Get the Models

```
det_10g.onnx             — InsightFace buffalo_l model (SCRFD)
w600k_r50.onnx           — InsightFace ArcFace recognition
inswapper_128.onnx       — InsightFace face swap (NOT fp16 — full precision for WebGPU)
bisenet_resnet_34.onnx   — Face parsing (CelebAMask-HQ trained)
emap.bin                 — 512×512 float32 embedding projection matrix
                           (extracted from inswapper model internals)

IMPORTANT: The browser version uses inswapper_128.onnx (full precision),
NOT inswapper_128_fp16.onnx. WebGPU handles the precision automatically.
The fp16 variant is for NVIDIA CUDA on the Python backend.
```

---

# Part 11: Reference Images

Reference faces are simple JPEG photos. Requirements:

```
REFERENCE IMAGE REQUIREMENTS:
  ✓ One face per image (largest face is auto-selected)
  ✓ Front-facing (ArcFace embedding is pose-invariant, but
    frontal works best for the swap model)
  ✓ Good lighting, clear facial features
  ✓ JPEG format, any reasonable resolution (will be resized)
  ✓ Named: {category}_ref_{number}.jpg
    e.g., nose_ref_1.jpg, lips_ref_2.jpg, chin_ref_3.jpg

CATEGORIES: nose, lips, eyes, brow, chin

WHY ARCFACE DOESN'T NEED MULTI-ANGLE REFS:
  InSwapper consumes the ArcFace embedding (512-dim vector),
  NOT pixel data. The embedding captures identity features
  (nose shape, lip fullness, etc.) independent of pose.
  Multiple angles of the same face produce nearly identical
  embeddings → redundant. One good front-facing photo is optimal.
```

---

# Part 12: Design Decisions & Gotchas

```
┌─────────────────────────────────────────────────────────────┐
│              WHY IT'S BUILT THIS WAY                         │
│                                                              │
│  Single HTML file (no build step)                            │
│  └── WHY: Zero tooling. Deploy anywhere. Just serve it.      │
│      No npm, no webpack, no React. One file + JS modules.    │
│                                                              │
│  WebGPU for swap + parse, WASM for detection + recognition   │
│  └── WHY: det_10g and w600k_r50 have ops unsupported by      │
│      ORT's WebGPU backend (dynamic shapes, certain conv      │
│      patterns). They work fine on WASM. inswapper and         │
│      bisenet are the heavy models and benefit from GPU.       │
│                                                              │
│  Web Worker for detection                                    │
│  └── WHY: Detection (WASM, ~25ms) would block the WebGPU    │
│      swap pipeline on the main thread. Running it in a       │
│      worker means detection of frame N happens in parallel    │
│      with swap of frame N-1. Result: ~40% higher FPS.        │
│                                                              │
│  Detection is 1 frame behind                                 │
│  └── WHY: The pipelining tradeoff. We use the PREVIOUS       │
│      frame's detection for the CURRENT frame's swap.         │
│      At 15+ FPS, faces don't move much between frames.       │
│      The latency is imperceptible.                           │
│                                                              │
│  Parse every 45 frames (not every frame)                     │
│  └── WHY: BiSeNet parsing is expensive (~10ms WebGPU).       │
│      The face mask doesn't change much frame-to-frame.       │
│      Reparsing every ~2-3 seconds (or when face moves        │
│      significantly) is sufficient. Saves ~15% of frame time. │
│                                                              │
│  Pre-allocated buffers (never allocate in hot loop)          │
│  └── WHY: Allocating Float32Array/Uint8ClampedArray every    │
│      frame triggers garbage collection pauses. Pre-allocated │
│      buffers are reused across frames. The ONLY allocation   │
│      per frame is the final clone (ImageData constructor).   │
│                                                              │
│  IndexedDB for model caching (not Cache API)                 │
│  └── WHY: Cache API stores Response objects and has size     │
│      limits. IndexedDB stores raw ArrayBuffers, has higher   │
│      storage quotas (~10% of disk), and we can check for     │
│      cached models before showing the download UI.           │
│                                                              │
│  Netlify edge function (not direct GitHub fetch)             │
│  └── WHY: GitHub Releases use 302 → CDN that lacks CORS     │
│      headers. fetch() can't read the response. Edge function │
│      follows the redirect server-side and adds CORS.         │
│                                                              │
│  credentialless (not require-corp)                            │
│  └── WHY: SharedArrayBuffer needs cross-origin isolation.    │
│      "require-corp" would block jsdelivr CDN scripts.        │
│      "credentialless" is the permissive variant that          │
│      enables SharedArrayBuffer without breaking CDN loads.   │
│                                                              │
│  Clone result before returning ImageData                     │
│  └── WHY: All pipeline buffers (_blendBuf, _sharpenBuf)      │
│      are pre-allocated and REUSED. If we wrap the buffer     │
│      directly in ImageData, the next frame's processing      │
│      overwrites the pixels while the canvas is still          │
│      displaying them → flickering/corruption. Clone is ~1ms. │
└─────────────────────────────────────────────────────────────┘
```

```
┌─────────────────────────────────────────────────────────────┐
│              GOTCHAS TO WATCH FOR                             │
│                                                              │
│  ✗ ORT WebGPU + detection models                             │
│    └── SCRFD (det_10g) and ArcFace (w600k_r50) don't work   │
│        on WebGPU backend — certain ops aren't supported.     │
│        Must use WASM for these two. Only inswapper and       │
│        bisenet use WebGPU.                                   │
│                                                              │
│  ✗ ORT VerifyOutputSizes warnings                            │
│    └── ONNX Runtime emits noisy console warnings from inside │
│        WASM that bypass ort.env.logLevel. Must override       │
│        console.warn BEFORE loading ORT, and pass              │
│        logSeverityLevel:3 in EVERY session.run() call.       │
│        This must happen in 3 places: index.html, pipeline.js,│
│        and detection-worker.js.                              │
│                                                              │
│  ✗ Worker can't use ES module imports                        │
│    └── detection-worker.js must use importScripts() for ORT  │
│        and must duplicate detection logic from pipeline.js.  │
│        Workers in module mode ("type":"module") don't support│
│        importScripts, and ORT WASM isn't an ES module.       │
│                                                              │
│  ✗ Transferable ArrayBuffers are MOVED, not copied           │
│    └── postMessage with transfer list empties the sender's   │
│        ArrayBuffer. Model bytes must be .slice(0) before     │
│        sending to worker. Frame pixels likewise.             │
│                                                              │
│  ✗ WebGPU support on mobile is uneven                        │
│    └── Desktop Chrome, Edge, and Brave are reliable.         │
│        Mobile support is newer and varies by device.         │
│        Show a clear message when navigator.gpu is missing.   │
│                                                              │
│  ✗ Session warmup is essential                                │
│    └── First WebGPU inference triggers shader compilation     │
│        which takes 1-2 seconds. Run a dummy inference during │
│        init() so the user doesn't see a freeze on first      │
│        frame. Only the swap session needs warmup (bisenet    │
│        isn't in the critical first-frame path).              │
│                                                              │
│  ✗ Race condition on reference switch                        │
│    └── User clicks a new reference while the old one is      │
│        still loading → stale embedding gets set. Use a       │
│        version counter (_refVersion). Each setReference call │
│        checks its version at each await point and bails out  │
│        if superseded.                                        │
│                                                              │
│  ✗ Worker stuck detection                                    │
│    └── If the worker takes >5s, it's probably stuck (WASM    │
│        OOM or infinite loop). Track _workerBusySince and     │
│        auto-reset the busy flag. Without this, one stuck     │
│        frame kills all future detection.                     │
│                                                              │
│  ✗ OffscreenCanvas for frame capture                         │
│    └── Reading getImageData from the visible canvas causes   │
│        a GPU→CPU readback stall. Use a separate              │
│        OffscreenCanvas with willReadFrequently:true that     │
│        reads directly from the <video> element. The visible  │
│        canvas only receives output (putImageData).           │
│                                                              │
│  ✗ ort.env.wasm.proxy must be false                          │
│    └── ORT's WASM proxy mode is incompatible with WebGPU.    │
│        If proxy:true, WebGPU sessions silently fail. This    │
│        is poorly documented.                                 │
│                                                              │
│  ✗ Stale detection after reference switch                    │
│    └── The worker might return a detection from BEFORE the   │
│        reference was changed. The _detGeneration counter     │
│        stamps each frame sent to the worker. Results with    │
│        old generation are discarded.                         │
└─────────────────────────────────────────────────────────────┘
```
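The version-counter fix for the reference-switch race above can be sketched generically: each async call records the counter at start and re-checks it after every `await`. `loadEmbedding` is a stand-in for the real detect-and-embed step:

```javascript
// Version-counter guard: a setReference call that has been superseded
// bails out instead of installing a stale embedding.
function makeRefSwitcher(loadEmbedding) {
  let version = 0;
  let current = null;

  return async function setReference(name) {
    const myVersion = ++version;             // supersede any in-flight call
    const embedding = await loadEmbedding(name);
    if (myVersion !== version) return false; // a newer call won — bail
    current = embedding;                     // safe to install
    return true;
  };
}
```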

---

# Part 13: Build Order

When implementing, build in this exact order:

```
Step 1:   js/math.js             ← Pure math, no dependencies
Step 2:   js/models.js           ← IndexedDB + model loading
Step 3:   js/pipeline.js         ← Detection, embedding, swap, parse, blend
Step 4:   js/detection-worker.js ← Worker (duplicates detection from pipeline.js)
Step 5:   js/engine.js           ← Orchestrator (depends on models + pipeline)
Step 6:   index.html             ← Frontend (depends on engine)
Step 7:   netlify.toml           ← Deploy config
Step 8:   netlify/edge-functions/models-proxy.js ← Model CDN proxy
Step 9:   references/*.jpg       ← Add reference face images
Step 10:  Deploy to Netlify      ← Push to git, connect to Netlify
```

Each step should be buildable independently. The app is functional
after Step 6 (run locally with a /models/ directory). Steps 7-8 add
production deployment. Step 9 adds the reference face library.

## ORT CDN Dependencies

The only external scripts (loaded from CDN, not bundled):

```html
<!-- In index.html (before module script) -->
<script src="https://cdn.jsdelivr.net/npm/onnxruntime-web@1.22.0/dist/ort.webgpu.min.js"></script>

<!-- In detection-worker.js -->
importScripts('https://cdn.jsdelivr.net/npm/onnxruntime-web@1.22.0/dist/ort.min.js');

<!-- WASM paths (set in index.html and worker) -->
ort.env.wasm.wasmPaths = 'https://cdn.jsdelivr.net/npm/onnxruntime-web@1.22.0/dist/';
```

Note: `ort.webgpu.min.js` on the main thread (bundles the WebGPU + WASM
backends); `ort.min.js` in the worker, which only needs the WASM backend.

---

# Appendix: The Technology Stack — A Visual Primer

What every piece of this system actually is, what it does,
and how they all connect.

## Deep-Live-Cam (The Inspiration)

```
┌─────────────────────────────────────────────────────────────────┐
│                                                                 │
│   Deep-Live-Cam is an open-source Python project that does      │
│   real-time face swapping using a webcam + NVIDIA GPU.          │
│                                                                 │
│   ┌───────────┐      ┌───────────┐      ┌───────────┐         │
│   │  Webcam   │ ───► │  Python   │ ───► │  Display  │         │
│   │  feed     │      │  + CUDA   │      │  result   │         │
│   └───────────┘      │  + ONNX   │      └───────────┘         │
│                      │  Runtime  │                              │
│                      └───────────┘                              │
│                                                                 │
│   REQUIRES: Python, NVIDIA GPU, CUDA drivers, ~10 min setup    │
│                                                                 │
│   THIS PROJECT takes the same AI models and pipeline            │
│   and runs them ENTIRELY IN THE BROWSER — no install,           │
│   no Python, no NVIDIA. Just open a URL.                        │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
```

## WebGPU (The Browser's GPU Access)

```
┌─────────────────────────────────────────────────────────────────┐
│                                                                 │
│   WebGPU is the modern browser API for GPU-accelerated          │
│   computation. Think of it as "CUDA but for the browser."       │
│                                                                 │
│   BEFORE WebGPU:                      WITH WebGPU:              │
│   ┌──────────┐                        ┌──────────┐             │
│   │ Browser  │  CPU only              │ Browser  │  GPU accel  │
│   │ ┌──────┐ │  (slow ML)             │ ┌──────┐ │  (fast ML)  │
│   │ │ WASM │ │                        │ │WebGPU│ │             │
│   │ │ ░░░░ │ │  ~100ms/frame          │ │ ████ │ │  ~15ms      │
│   │ └──────┘ │                        │ └──┬───┘ │             │
│   └──────────┘                        └────┼─────┘             │
│                                            │                    │
│                                     ┌──────▼──────┐            │
│                                     │   Your GPU   │            │
│                                     │   (NVIDIA,   │            │
│                                     │   AMD, Intel,│            │
│                                     │   Apple M1+) │            │
│                                     └──────────────┘            │
│                                                                 │
│   KEY POINT: WebGPU works with ANY GPU — not just NVIDIA.       │
│   Chrome 113+ and Edge 113+ support it on desktop.              │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
```

## ONNX Runtime Web (The ML Engine)

```
┌─────────────────────────────────────────────────────────────────┐
│                                                                 │
│   ONNX = Open Neural Network Exchange                           │
│   A universal format for AI models (like JPEG for images).      │
│                                                                 │
│   ONNX Runtime = The engine that RUNS those models.             │
│   ONNX Runtime Web = That engine, compiled for the browser.     │
│                                                                 │
│   ┌─────────────────────────────────────────────────────┐      │
│   │                ONNX Runtime Web                      │      │
│   │                                                      │      │
│   │  ┌──────────────────┐  ┌──────────────────┐         │      │
│   │  │  WebGPU Backend  │  │   WASM Backend   │         │      │
│   │  │                  │  │                   │         │      │
│   │  │  Runs on GPU     │  │  Runs on CPU      │         │      │
│   │  │  Fast (10-30ms)  │  │  Slower (20-50ms) │         │      │
│   │  │                  │  │  but compatible    │         │      │
│   │  │  Used for:       │  │  with all ops      │         │      │
│   │  │  • Face swap     │  │                   │         │      │
│   │  │  • Face parsing  │  │  Used for:        │         │      │
│   │  │                  │  │  • Face detection  │         │      │
│   │  │                  │  │  • Face embedding  │         │      │
│   │  └──────────────────┘  └──────────────────┘         │      │
│   └─────────────────────────────────────────────────────┘      │
│                                                                 │
│   WHY TWO BACKENDS?                                             │
│   Some models use operations that WebGPU doesn't support yet.   │
│   Detection + recognition → WASM (CPU). Swap + parse → WebGPU. │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
```
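In ORT Web the backend is chosen per session via `executionProviders`. A sketch of the split described above — the model-name-to-backend table is this document's mapping, and the commented `create` call shows the real ORT Web API shape:

```javascript
// Which backend each model runs on, per the split above.
const BACKEND = {
  det_10g: 'wasm',            // detection — unsupported ops on WebGPU
  w600k_r50: 'wasm',          // recognition
  inswapper_128: 'webgpu',    // heavy — benefits from GPU
  bisenet_resnet_34: 'webgpu',
};

function sessionOptions(modelName) {
  return { executionProviders: [BACKEND[modelName]] };
}

// const swapSession = await ort.InferenceSession.create(
//   inswapperBytes, sessionOptions('inswapper_128'));
```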

## SCRFD / det_10g (Face Detection)

```
┌─────────────────────────────────────────────────────────────────┐
│                                                                 │
│   SCRFD = "Sample and Computation Redistribution for            │
│            Efficient Face Detection"                            │
│                                                                 │
│   det_10g = the specific model variant (10 GFlops budget)       │
│                                                                 │
│   JOB: Find where the face is in the image.                     │
│                                                                 │
│   ┌─────────────────┐          ┌─────────────────┐             │
│   │                 │          │     ┌─────┐     │             │
│   │   Raw webcam    │  ─────►  │     │ 😊  │     │             │
│   │   frame         │  det_10g │     └──┬──┘     │             │
│   │                 │          │  bbox ─┘        │             │
│   └─────────────────┘          └─────────────────┘             │
│                                                                 │
│   OUTPUTS:                                                      │
│   ┌─────────────────────────────────────────────────┐          │
│   │                                                  │          │
│   │  Bounding box: [x1, y1, x2, y2]                 │          │
│   │  ┌──────────────┐                                │          │
│   │  │  ┌────────┐  │                                │          │
│   │  │  │  face  │  │  ← rectangle around the face  │          │
│   │  │  └────────┘  │                                │          │
│   │  └──────────────┘                                │          │
│   │                                                  │          │
│   │  5-point landmarks:                              │          │
│   │       ◉         ◉     ← left eye, right eye     │          │
│   │           ◉           ← nose tip                 │          │
│   │       ◉       ◉       ← left mouth, right mouth │          │
│   │                                                  │          │
│   │  These 5 points are crucial — they tell us       │          │
│   │  exactly how to align/rotate the face for the    │          │
│   │  next steps.                                     │          │
│   │                                                  │          │
│   │  Confidence score: 0.0 – 1.0                     │          │
│   └─────────────────────────────────────────────────┘          │
│                                                                 │
│   SIZE: ~17 MB  |  SPEED: ~25ms (WASM)  |  INPUT: 192×192     │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
```

## ArcFace / w600k_r50 (Face Recognition / Embedding)

```
┌─────────────────────────────────────────────────────────────────┐
│                                                                 │
│   ArcFace = A face recognition model                            │
│   w600k_r50 = trained on 600K identities, ResNet-50 backbone   │
│                                                                 │
│   JOB: Convert a face into a 512-number "fingerprint"           │
│        that captures WHO this person looks like.                 │
│                                                                 │
│   ┌──────────┐         ┌──────────┐        ┌──────────────────┐│
│   │          │  align  │          │ embed  │ [0.23, -0.15,    ││
│   │  Detect  │ ──────► │ Aligned  │ ─────► │  0.87, ...,      ││
│   │  face    │  112×112│  face    │ ArcFace│  -0.42, 0.19]    ││
│   │          │         │          │        │                   ││
│   └──────────┘         └──────────┘        │ 512 numbers       ││
│                                            │ = "face identity"  ││
│                                            └──────────────────┘│
│                                                                 │
│   KEY INSIGHT:                                                  │
│   ┌─────────────────────────────────────────────────┐          │
│   │                                                  │          │
│   │  The embedding is POSE-INVARIANT.                │          │
│   │                                                  │          │
│   │  Same person, different angles:                  │          │
│   │  😊 😏 🙂 → all produce nearly identical         │          │
│   │               512-dim vectors                    │          │
│   │                                                  │          │
│   │  This is why ONE reference photo is enough.      │          │
│   │  Multiple angles of the same face = redundant.   │          │
│   │                                                  │          │
│   └─────────────────────────────────────────────────┘          │
│                                                                 │
│   The "align" step uses the 5 landmarks from SCRFD             │
│   to rotate + scale the face into a canonical position          │
│   (eyes level, centered) before feeding it to ArcFace.          │
│                                                                 │
│   SIZE: ~174 MB  |  SPEED: ~30ms (WASM)  |  INPUT: 112×112    │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
```
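The pose-invariance claim is measurable: embeddings are compared with cosine similarity, and two shots of the same person score near 1 while different people score much lower. A minimal sketch:

```javascript
// Cosine similarity between two embedding vectors (e.g. 512-dim
// ArcFace outputs): dot product over the product of magnitudes.
function cosineSimilarity(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}
```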

## InSwapper (The Face Swap Model)

```
┌─────────────────────────────────────────────────────────────────┐
│                                                                 │
│   InSwapper = The actual face swap neural network.              │
│   This is the magic. Everything else is setup for this.         │
│                                                                 │
│   JOB: Given a SOURCE identity and a TARGET face,               │
│        produce a new face that looks like the SOURCE             │
│        but matches the TARGET's pose, lighting, expression.     │
│                                                                 │
│   ┌─────────────────────────────────────────────────────────┐  │
│   │                                                          │  │
│   │  SOURCE (reference photo)      TARGET (your webcam)      │  │
│   │  ┌──────────┐                  ┌──────────┐             │  │
│   │  │          │   ArcFace        │          │   align     │  │
│   │  │  "I want │ ──────────┐     │  "This   │ ────────┐  │  │
│   │  │  THIS    │           │     │  is the  │         │  │  │
│   │  │  nose"   │           │     │  current │         │  │  │
│   │  └──────────┘           │     │  face"   │         │  │  │
│   │                         │     └──────────┘         │  │  │
│   │                         ▼                          ▼  │  │
│   │               ┌───────────────────────────────┐       │  │
│   │               │                               │       │  │
│   │               │         InSwapper             │       │  │
│   │               │                               │       │  │
│   │               │  source    ──►  ┌──────────┐  │       │  │
│   │               │  latent         │ Swapped  │  │       │  │
│   │               │  (512-dim)      │ face     │  │       │  │
│   │               │                 │ 128×128  │  │       │  │
│   │               │  target    ──►  └──────────┘  │       │  │
│   │               │  face                         │       │  │
│   │               │  (128×128)                    │       │  │
│   │               └───────────────────────────────┘       │  │
│   │                                                       │  │
│   └─────────────────────────────────────────────────────────┘  │
│                                                                 │
│   IMPORTANT: InSwapper takes the EMBEDDING, not pixels.         │
│   It never "sees" the reference photo directly — only the       │
│   512-number identity vector. This is why it generalizes        │
│   across poses and lighting conditions.                         │
│                                                                 │
│   SIZE: ~553 MB  |  SPEED: ~10-15ms (WebGPU)  |  128×128      │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
```
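
Before the 128×128 target crop can be fed to InSwapper, its canvas RGBA bytes have to be repacked into the planar NCHW float layout ONNX models expect. A sketch of that preprocessing step (the `[0, 1]` normalization is an assumption — check the mean/std your InSwapper export was trained with):

```javascript
// Convert an RGBA pixel buffer (as returned by canvas getImageData)
// into planar NCHW float32: all R values first, then all G, then all B.
// Normalizing to [0, 1] is an assumption; verify against your model.
function rgbaToNCHW(rgba, width, height) {
  const plane = width * height;
  const out = new Float32Array(3 * plane);
  for (let i = 0; i < plane; i++) {
    out[i]             = rgba[i * 4]     / 255; // R plane
    out[plane + i]     = rgba[i * 4 + 1] / 255; // G plane
    out[2 * plane + i] = rgba[i * 4 + 2] / 255; // B plane
    // rgba[i * 4 + 3] is alpha — dropped
  }
  return out;
}
```

The resulting `Float32Array` becomes the data of a `1×3×128×128` tensor handed to ONNX Runtime Web alongside the 512-dim latent.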

## BiSeNet (Face Parsing — Region Masking)

```
┌─────────────────────────────────────────────────────────────────┐
│                                                                 │
│   BiSeNet = Bilateral Segmentation Network                      │
│   A pixel-level face segmentation model.                        │
│                                                                 │
│   JOB: Label EVERY PIXEL of the face with what body part it is. │
│        This is what enables swapping JUST the nose, JUST the    │
│        lips, etc. instead of the entire face.                   │
│                                                                 │
│   ┌──────────────┐    BiSeNet    ┌──────────────┐              │
│   │              │   ────────►   │▓▓▓▓▓▓▓▓▓▓▓▓▓▓│              │
│   │  Face crop   │               │▓▓▒▒▒▒▒▒▒▒▒▒▓▓│              │
│   │  (padded     │               │▓░░░░░░░░░░░░▓│  ← each pixel│
│   │   512×512)   │               │▓░░██░░░██░░░▓│    is labeled│
│   │              │               │▓░░░░░▲░░░░░░▓│    with its  │
│   │              │               │▓░░░░░░░░░░░░▓│    body part │
│   │              │               │▓░░░◄███►░░░░▓│              │
│   │              │               │▓▓░░░░░░░░░░▓▓│              │
│   └──────────────┘               └──────────────┘              │
│                                                                 │
│   19 CLASSES (CelebAMask-HQ):                                  │
│   ┌─────────────────────────────────────────────────┐          │
│   │                                                  │          │
│   │  0  background    7  left ear     14 neck        │          │
│   │  1  SKIN ★        8  right ear    15 necklace    │          │
│   │  2  LEFT BROW ★   9  earring      16 cloth       │          │
│   │  3  RIGHT BROW ★  10 NOSE ★       17 hair        │          │
│   │  4  LEFT EYE ★    11 MOUTH ★      18 hat         │          │
│   │  5  RIGHT EYE ★   12 UPPER LIP ★                │          │
│   │  6  GLASSES ★     13 LOWER LIP ★                │          │
│   │                                                  │          │
│   │  ★ = classes we use for region masking           │          │
│   │                                                  │          │
│   └─────────────────────────────────────────────────┘          │
│                                                                 │
│   REGION MAPPING:                                               │
│   ┌────────────────────────────────────────┐                   │
│   │  "nose" → class 10                     │                   │
│   │  "lips" → classes 11, 12, 13           │                   │
│   │  "eyes" → classes 4, 5, 6 (+ glasses) │                   │
│   │  "brow" → classes 2, 3                 │                   │
│   │  "chin" → class 1 (below mouth only)   │                   │
│   │  "full" → skip parsing (swap all)      │                   │
│   └────────────────────────────────────────┘                   │
│                                                                 │
│   THIS IS WHAT DEEP-LIVE-CAM DOESN'T HAVE.                     │
│   Original DLC does full-face swap only. BiSeNet parsing        │
│   lets us blend JUST the selected region and keep the           │
│   rest of the original face untouched.                          │
│                                                                 │
│   SIZE: ~94 MB  |  SPEED: ~5-10ms (WebGPU)  |  INPUT: 512×512 │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
```
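
Turning BiSeNet's output into a usable mask is a small step: take the per-pixel class IDs (the argmax over the model's 19 output channels) and mark the pixels whose class belongs to the selected region. A sketch, using the class numbers and region mapping from the table above:

```javascript
// Region → CelebAMask-HQ class IDs, per the mapping above.
const REGION_CLASSES = {
  nose: [10],
  lips: [11, 12, 13],
  eyes: [4, 5, 6],   // both eyes + glasses
  brow: [2, 3],
};

// classMap holds one class ID per pixel (argmax over BiSeNet's channels).
// Returns a binary mask: 1 where the pixel belongs to the region.
function regionMask(classMap, region) {
  const wanted = new Set(REGION_CLASSES[region]);
  const mask = new Uint8Array(classMap.length);
  for (let i = 0; i < classMap.length; i++) {
    mask[i] = wanted.has(classMap[i]) ? 1 : 0;
  }
  return mask;
}
```

In practice you'd also feather (blur) this hard 0/1 mask before blending so the seam between swapped and original pixels is invisible.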

## The Embedding Projection (emap)

```
┌─────────────────────────────────────────────────────────────────┐
│                                                                 │
│   emap = a 512×512 matrix extracted from InSwapper's internals. │
│                                                                 │
│   JOB: Transform the ArcFace embedding into the format          │
│        InSwapper expects (its "latent space").                   │
│                                                                 │
│   ArcFace embedding ──► emap projection ──► InSwapper latent    │
│    (512 numbers)         (512×512 matrix)    (512 numbers)      │
│                                                                 │
│   ┌────────┐     ┌──────────────┐     ┌────────┐              │
│   │ embed  │  ×  │   emap       │  =  │ latent │              │
│   │ [512]  │     │   [512×512]  │     │ [512]  │              │
│   └────────┘     └──────────────┘     └────────┘              │
│                                          │                      │
│                                     normalize                   │
│                                          │                      │
│                                    into InSwapper               │
│                                                                 │
│   SIZE: ~1 MB (raw float32 binary, not ONNX)                   │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
```
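
As plain code, the projection is one matrix-vector multiply plus an L2 normalization. A sketch (the 512 dimension is implied by the array lengths; `emap` is assumed row-major, matching the diagram's `embed × emap` orientation):

```javascript
// latent = normalize(embed × emap), with emap stored row-major
// as a flat Float32Array of length n*n (n = 512 in practice).
function projectEmbedding(embed, emap) {
  const n = embed.length;
  const latent = new Float32Array(n);
  for (let j = 0; j < n; j++) {
    let sum = 0;
    for (let i = 0; i < n; i++) sum += embed[i] * emap[i * n + j];
    latent[j] = sum;
  }
  // L2-normalize so the latent lies on the unit sphere InSwapper expects
  let norm = 0;
  for (let j = 0; j < n; j++) norm += latent[j] * latent[j];
  norm = Math.sqrt(norm);
  for (let j = 0; j < n; j++) latent[j] /= norm;
  return latent;
}
```

This runs once per reference photo, so there's no need to optimize the O(n²) loop.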

## How They All Connect — The Full Pipeline

```
┌─────────────────────────────────────────────────────────────────┐
│                                                                 │
│  REFERENCE PHOTO (done once)         LIVE WEBCAM (every frame)  │
│  ════════════════════════            ═════════════════════════   │
│                                                                 │
│  ┌──────────┐                        ┌──────────────┐          │
│  │ Ref      │                        │ Webcam frame │          │
│  │ image    │                        │ 640×480      │          │
│  └────┬─────┘                        └──────┬───────┘          │
│       │                                     │                   │
│       ▼                                     ▼                   │
│  ┌──────────┐                        ┌──────────────┐          │
│  │ SCRFD    │ detect face            │ SCRFD        │ detect   │
│  │ det_10g  │ + 5 landmarks          │ det_10g      │ face     │
│  └────┬─────┘                        │ (Web Worker) │          │
│       │                              └──────┬───────┘          │
│       ▼                                     │                   │
│  ┌──────────┐                               │                   │
│  │ Align to │ 112×112                       │                   │
│  │ ArcFace  │                               ▼                   │
│  └────┬─────┘                        ┌──────────────┐          │
│       │                              │ Align to     │ 128×128  │
│       ▼                              │ swap size    │          │
│  ┌──────────┐                        └──────┬───────┘          │
│  │ ArcFace  │ → 512-dim                     │                   │
│  │ w600k_r50│   embedding                   │                   │
│  └────┬─────┘                               │                   │
│       │                                     │                   │
│       ▼                                     │                   │
│  ┌──────────┐                               │                   │
│  │ × emap   │ → 512-dim                     │                   │
│  │ project  │   latent                      │                   │
│  └────┬─────┘                               │                   │
│       │         STORED                      │                   │
│       │         (reused every frame)        │                   │
│       │                                     │                   │
│       └──────────────┐  ┌───────────────────┘                   │
│                      │  │                                       │
│                      ▼  ▼                                       │
│               ┌──────────────┐                                  │
│               │  InSwapper   │                                  │
│               │  (WebGPU)    │                                  │
│               │              │                                  │
│               │  source: latent (512)                           │
│               │  target: face (128×128)                         │
│               │  output: swapped face                           │
│               └──────┬───────┘                                  │
│                      │                                          │
│                      ▼                                          │
│               ┌──────────────┐                                  │
│               │  Paste back  │ inverse-warp into                │
│               │  into frame  │ original frame                   │
│               └──────┬───────┘                                  │
│                      │                                          │
│                      ▼                                          │
│               ┌──────────────┐                                  │
│               │  BiSeNet     │ pixel-level                      │
│               │  parsing     │ region mask                      │
│               │  (WebGPU)    │ (nose/lips/eyes/etc.)            │
│               └──────┬───────┘                                  │
│                      │                                          │
│                      ▼                                          │
│               ┌──────────────┐                                  │
│               │  Alpha blend │ original × (1-mask)              │
│               │  + sharpen   │ + swapped × mask                 │
│               └──────┬───────┘                                  │
│                      │                                          │
│                      ▼                                          │
│               ┌──────────────┐                                  │
│               │   Display    │                                  │
│               │   on canvas  │ ← 15-25 FPS                     │
│               └──────────────┘                                  │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
```
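
The final "alpha blend" box boils down to `original × (1 − mask) + swapped × mask` per pixel. A minimal sketch over RGBA buffers, assuming the mask has already been feathered into per-pixel floats in `[0, 1]`:

```javascript
// Per-pixel alpha blend of the original frame and the swapped result.
// original/swapped/out are RGBA buffers (4 bytes per pixel, as from
// getImageData); mask holds one float in [0, 1] per pixel.
function alphaBlend(original, swapped, mask, out) {
  for (let i = 0; i < mask.length; i++) {
    const m = mask[i];
    for (let c = 0; c < 3; c++) {
      out[i * 4 + c] = original[i * 4 + c] * (1 - m) + swapped[i * 4 + c] * m;
    }
    out[i * 4 + 3] = 255; // fully opaque
  }
  return out;
}
```

In the real pipeline this runs on the GPU (or at least in a tight typed-array loop), but the math is exactly this.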

## IndexedDB (Browser-Side Model Storage)

```
┌─────────────────────────────────────────────────────────────────┐
│                                                                 │
│   IndexedDB is a browser database for storing large binary      │
│   data. We use it to cache the ~838 MB of ONNX models so       │
│   they only need to be downloaded ONCE.                         │
│                                                                 │
│   FIRST VISIT:                                                  │
│   ┌────────┐       ┌──────────┐       ┌──────────┐            │
│   │ Netlify │ ───► │ GitHub   │ ───► │ Browser  │            │
│   │ edge fn │  302 │ Releases │ blob │ IndexedDB│            │
│   └────────┘       └──────────┘       └──────────┘            │
│   ~838 MB downloaded, ~30-60 seconds                           │
│                                                                 │
│   EVERY VISIT AFTER:                                            │
│   ┌──────────┐                                                  │
│   │ IndexedDB│ ───► Models loaded in ~2-5 seconds              │
│   │ (cached) │      No network needed. Works offline.          │
│   └──────────┘                                                  │
│                                                                 │
│   Database: 'newface-models'                                    │
│   Store:    'blobs'                                             │
│   ┌─────────────────────────────────────────┐                  │
│   │  Key          │  Value       │  Size    │                  │
│   │───────────────│──────────────│──────────│                  │
│   │  det_10g      │  ArrayBuffer │  ~17 MB  │                  │
│   │  w600k_r50    │  ArrayBuffer │  ~174 MB │                  │
│   │  inswapper    │  ArrayBuffer │  ~553 MB │                  │
│   │  bisenet      │  ArrayBuffer │  ~94 MB  │                  │
│   │  emap         │  ArrayBuffer │  ~1 MB   │                  │
│   └─────────────────────────────────────────┘                  │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
```
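
The control flow is "check cache, else download and cache." Here it is abstracted over a minimal async key/value store so the logic is clear; a real build would implement `store` on top of IndexedDB (database `newface-models`, store `blobs`, as above) — the `store` and `download` interfaces are illustrative, not the repo's API:

```javascript
// Cache-or-download: warm path never touches the network.
// store:    { get(key) → Promise<value|undefined>, put(key, value) → Promise }
// download: (key) → Promise<value>   (e.g. fetch via the edge function)
async function loadModel(key, store, download) {
  const cached = await store.get(key);
  if (cached) return cached;            // every visit after the first
  const buffer = await download(key);   // first visit only
  await store.put(key, buffer);         // cache for next time
  return buffer;
}
```

With IndexedDB behind `store`, the cached value is the model's raw `ArrayBuffer`, which ONNX Runtime Web can consume directly.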

## Web Workers (Background Threading)

```
┌─────────────────────────────────────────────────────────────────┐
│                                                                 │
│   JavaScript is single-threaded. Without Web Workers,           │
│   face detection would FREEZE the UI while it runs.             │
│                                                                 │
│   Web Workers run code in a separate thread.                    │
│   We use one to run SCRFD detection in parallel with the swap.  │
│                                                                 │
│   ┌─────────────────────────────────────────────────────┐      │
│   │                                                      │      │
│   │  MAIN THREAD                 WORKER THREAD           │      │
│   │  ───────────                 ─────────────           │      │
│   │                                                      │      │
│   │  Frame 1:                    Frame 1:                │      │
│   │  ├─ Send pixels ──────────►  ├─ Detect face (WASM)  │      │
│   │  ├─ Swap face (WebGPU)       │  ~25ms               │      │
│   │  ├─ Parse + blend            │                       │      │
│   │  ├─ Render to canvas         │                       │      │
│   │  │                     ◄──── ├─ Return {bbox, kps}   │      │
│   │  │                           │                       │      │
│   │  Frame 2:                    Frame 2:                │      │
│   │  ├─ Use Frame 1's detection  ├─ Detect next frame    │      │
│   │  ├─ Swap (WebGPU)            │                       │      │
│   │  ├─ Parse + blend            │                       │      │
│   │  └─ Render                   └─ ...                  │      │
│   │                                                      │      │
│   │  RESULT: Detection and swap happen SIMULTANEOUSLY.   │      │
│   │  ~40% faster than doing everything on main thread.   │      │
│   │                                                      │      │
│   └─────────────────────────────────────────────────────┘      │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
```
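
The timeline above is a one-frame-latency pipeline: each frame is rendered using the *previous* frame's detection while the worker computes the next one. A hypothetical sketch of that pattern — not the repo's actual worker code, and with an ordinary async function standing in for the `postMessage` round-trip to the Worker:

```javascript
// One-frame-pipelined detection: render with the last completed
// detection while the next one runs in the background.
class DetectionPipeline {
  constructor(detectAsync) {
    this.detect = detectAsync; // stand-in for the Worker round-trip
    this.last = null;          // most recent completed {bbox, kps}
    this.pending = false;      // is a detection already in flight?
  }
  onFrame(pixels) {
    const detection = this.last; // use the PREVIOUS frame's result
    if (!this.pending) {
      this.pending = true;
      this.detect(pixels).then((result) => {
        this.last = result;
        this.pending = false;
      });
    }
    return detection;            // null until the first detection lands
  }
}
```

The face moves very little between consecutive frames, so the stale detection is visually indistinguishable — and the swap no longer waits ~25ms for SCRFD.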

## Netlify + Edge Functions (Deployment)

```
┌─────────────────────────────────────────────────────────────────┐
│                                                                 │
│   Netlify hosts the static site for free. No server needed.     │
│                                                                 │
│   The ONLY server-side code is a tiny "edge function" that      │
│   proxies model downloads from GitHub (to add CORS headers).    │
│                                                                 │
│   ┌──────────┐     ┌──────────┐     ┌──────────────────┐      │
│   │ Browser  │     │ Netlify  │     │ GitHub Releases  │      │
│   │          │     │          │     │                  │      │
│   │ GET      │────►│ Edge     │────►│ 302 redirect     │      │
│   │ /models- │     │ Function │     │ to CDN blob      │      │
│   │ cdn/     │     │          │◄────│                  │      │
│   │ model.   │◄────│ + CORS   │     │ (no CORS headers)│      │
│   │ onnx     │     │ headers  │     │                  │      │
│   └──────────┘     └──────────┘     └──────────────────┘      │
│                                                                 │
│   WHY NOT FETCH GITHUB DIRECTLY?                                │
│   GitHub's CDN doesn't set Access-Control-Allow-Origin.         │
│   The browser blocks the download. The edge function             │
│   follows the redirect server-side and adds the header.         │
│                                                                 │
│   After first download: models are in IndexedDB.                │
│   The edge function is never called again for that user.        │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
```
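
The CORS fix itself is tiny: fetch the asset server-side, then return a copy of the response with the one header GitHub's CDN omits. Here is just that header-adding step as a pure function over the standard `Response`/`Headers` types (the function name is illustrative; a real Netlify Edge Function would wrap this around `await fetch(...)` on the release URL):

```javascript
// Copy a fetched response, adding the CORS header GitHub's CDN omits.
// Uses only the web-standard Response/Headers API that edge runtimes
// (and Node 18+) provide.
function withCors(response) {
  const headers = new Headers(response.headers);
  headers.set("Access-Control-Allow-Origin", "*");
  return new Response(response.body, {
    status: response.status,
    headers,
  });
}
```

Streaming `response.body` through (instead of buffering the ~553 MB InSwapper file) keeps the edge function's memory use flat during the download.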
