# claude-skill-registry: embedding-engine

Embedding backends (InsightFace/PyTorch + ONNXRuntime vs. TensorRT). Use when optimizing embedding throughput or debugging drift/fallback behavior.

## Install

Clone the upstream repo:

```sh
git clone https://github.com/majiayu000/claude-skill-registry
```

Claude Code: install into `~/.claude/skills/`:

```sh
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/embedding-engine" ~/.claude/skills/majiayu000-claude-skill-registry-embedding-engine && rm -rf "$T"
```

Manifest: `skills/data/embedding-engine/SKILL.md`

## Source content

## Embedding Engine Skill

Use this skill to optimize embedding performance and debug embedding drift/fallback behavior.

### When to Use

- Embedding pipeline running slowly
- Need to switch between PyTorch and TensorRT
- Debugging embedding drift between backends
- Building/caching TensorRT engines
- Verifying ONNXRuntime/CoreML provider selection (macOS)

### Sub-agents

| Sub-agent | Purpose |
| --- | --- |
| `PyTorchEmbeddingSubagent` | Reference ArcFace (training/validation) |
| `TensorRTEmbeddingSubagent` | GPU-optimized TRT inference |
| `ONNXEmbeddingSubagent` | Future ONNXRuntime C++ service (planned) |

### Current Backends

- `pytorch` (default): ArcFace via the `insightface` Python package (used by `tools/episode_run.py`)
- `tensorrt` (optional): TensorRT engine build + inference via `FEATURES/arcface_tensorrt/`

### Key Skills

#### Embed faces with the configured backend

Run embedding with the configured backend (same interface as the pipeline):

```python
from tools.episode_run import get_embedding_backend

embedder = get_embedding_backend(
    backend_type="pytorch",  # or "tensorrt"
    device="cpu",
    tensorrt_config="config/pipeline/arcface_tensorrt.yaml",
    allow_cpu_fallback=True,
)
embedder.ensure_ready()
embeddings = embedder.encode(face_crops)  # (N, 512) L2-normalized
```

#### Build a TensorRT engine from ONNX

```sh
python -m FEATURES.arcface_tensorrt --mode build --onnx-path models/arcface_r100_v1.onnx
```

#### Compare TensorRT vs. PyTorch embeddings (parity + speedup)

```sh
python -m FEATURES.arcface_tensorrt --mode compare --n-samples 100
```

This uses `FEATURES/arcface_tensorrt/src/embedding_compare.py` and reports cosine-similarity and L2-distance statistics.
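The reported metrics can be sketched with a short NumPy snippet. This is an illustrative sketch of the math, not the actual `embedding_compare.py` API; the function name `parity_stats` is hypothetical:

```python
import numpy as np

def parity_stats(emb_a: np.ndarray, emb_b: np.ndarray) -> dict:
    """Compare two (N, 512) L2-normalized embedding batches."""
    # For unit vectors, cosine similarity reduces to a row-wise dot product.
    cos = np.sum(emb_a * emb_b, axis=1)
    # L2 distance between corresponding rows.
    l2 = np.linalg.norm(emb_a - emb_b, axis=1)
    return {
        "cos_mean": float(cos.mean()),
        "cos_min": float(cos.min()),
        "l2_mean": float(l2.mean()),
        "l2_max": float(l2.max()),
    }

# Identical inputs should yield perfect parity (cosine ~1.0, L2 ~0.0).
a = np.random.randn(8, 512)
a /= np.linalg.norm(a, axis=1, keepdims=True)
stats = parity_stats(a, a)
```

Comparing a TensorRT batch against its PyTorch reference this way makes drift directly comparable to the `validation.max_drift_cosine` tolerance.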

### Config Reference

File: `config/pipeline/embedding.yaml`

| Key | Default | Description |
| --- | --- | --- |
| `embedding.backend` | `pytorch` | Backend: `pytorch` or `tensorrt` |
| `embedding.tensorrt_config` | `config/pipeline/arcface_tensorrt.yaml` | TensorRT config path |
| `validation.max_drift_cosine` | `0.001` | Drift tolerance (behavior depends on runtime) |
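Putting those keys together, a minimal `embedding.yaml` that switches the pipeline to TensorRT might look like this (layout inferred from the key names above; verify against the actual file):

```yaml
embedding:
  backend: tensorrt            # pytorch | tensorrt
  tensorrt_config: config/pipeline/arcface_tensorrt.yaml
validation:
  max_drift_cosine: 0.001      # parity tolerance vs. the PyTorch reference
```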

File: `config/pipeline/arcface_tensorrt.yaml`

| Key | Default | Description |
| --- | --- | --- |
| `arcface_tensorrt.enabled` | `false` | Sandbox feature flag (engine must exist) |
| `tensorrt.precision` | `fp16` | Engine precision |
| `tensorrt.max_batch_size` | `32` | Max batch for engine build |
| `tensorrt.workspace_size_mb` | `1024` | TRT workspace |
| `tensorrt.engine_s3_bucket` | `null` | Optional engine bucket |

### Engine Storage

TensorRT engines are GPU-architecture specific. Stored in S3:

```
s3://screenalytics-models/engines/
├── arcface_r100-fp16-sm75.plan   # Turing (RTX 20xx)
├── arcface_r100-fp16-sm80.plan   # Ampere (A100)
├── arcface_r100-fp16-sm86.plan   # Ampere (RTX 30xx)
└── arcface_r100-fp16-sm89.plan   # Ada (RTX 40xx)
```

Naming convention: `{model_name}-{precision}-sm{arch}.plan`
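A helper applying this convention might look like the sketch below. The function name is illustrative, not taken from `tensorrt_builder.py`:

```python
def engine_filename(model_name: str, precision: str, sm_arch: int) -> str:
    """Format an engine file name as {model_name}-{precision}-sm{arch}.plan."""
    if precision not in {"fp32", "fp16", "int8"}:
        raise ValueError(f"unsupported precision: {precision}")
    return f"{model_name}-{precision}-sm{sm_arch}.plan"

# On a CUDA machine the SM arch can be derived from
# torch.cuda.get_device_capability(), e.g. (8, 6) -> 86.
name = engine_filename("arcface_r100", "fp16", 86)
```

Keying the filename on the SM architecture is what lets one bucket serve engines to heterogeneous GPU fleets without deserialization failures.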

### Common Issues

#### "Engine not found" / TensorRT backend won't load

Cause: no engine built for the current GPU, or a config mismatch.

Fix: build locally:

```sh
python -m FEATURES.arcface_tensorrt --mode build --onnx-path models/arcface_r100_v1.onnx
```

#### Embedding drift too high

Cause: FP16 quantization or TensorRT optimization changes.

Check: run the parity compare:

```sh
python -m FEATURES.arcface_tensorrt --mode compare --n-samples 100
```

Fix: use FP32 precision:

```yaml
tensorrt:
  precision: fp32  # default is fp16
```

#### TensorRT slower than expected / falling back

Cause: not batching, an engine built with suboptimal shapes/precision, or the backend silently fell back.

Check: ensure `config/pipeline/embedding.yaml` has `embedding.backend: tensorrt`, then re-run with `--mode benchmark`.

Fix: increase batch size and confirm the GPU backend is active:

```yaml
tensorrt:
  opt_batch_size: 32
  max_batch_size: 64
```

#### Out of GPU memory

Cause: engine workspace too large.

Check: watch `nvidia-smi` during inference.

Fix: reduce the workspace:

```yaml
tensorrt:
  workspace_size_mb: 512  # default is 1024
```

### Benchmark Reference

| Backend | Batch | Throughput | Latency | VRAM |
| --- | --- | --- | --- | --- |
| PyTorch | 32 | ~50 fps | ~640 ms | 2 GB |
| TensorRT FP16 | 32 | ~250 fps | ~128 ms | 1 GB |
| TensorRT FP32 | 32 | ~180 fps | ~178 ms | 1.5 GB |

### Diagnostic Output

```json
{
  "backend": "tensorrt",
  "engine_path": "~/.cache/screenalytics/engines/arcface_r100_v1-sm86.trt",
  "precision": "fp16",
  "batch_size": 32,
  "embedding_dim": 512,
  "throughput_fps": 245.3,
  "latency_ms": 130.5,
  "vram_mb": 1024,
  "validation": {
    "drift_vs_pytorch": 0.9995,
    "regression_test": "passed"
  }
}
```
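A sketch of consuming this diagnostic to gate a deployment, assuming `drift_vs_pytorch` reports cosine similarity (so drift is one minus that value) and using the `validation.max_drift_cosine` default of 0.001; the function name is hypothetical:

```python
import json

MAX_DRIFT_COSINE = 0.001  # validation.max_drift_cosine from embedding.yaml

def check_diagnostic(report_json: str) -> bool:
    """Return True if the backend diagnostic passes drift validation."""
    report = json.loads(report_json)
    val = report["validation"]
    # Assumption: drift_vs_pytorch is a cosine similarity (1.0 = identical),
    # so the drift itself is 1 - similarity.
    drift = 1.0 - val["drift_vs_pytorch"]
    return drift <= MAX_DRIFT_COSINE and val["regression_test"] == "passed"

ok = check_diagnostic(
    '{"validation": {"drift_vs_pytorch": 0.9995, "regression_test": "passed"}}'
)
```

With the example values above, drift is 0.0005, comfortably inside the 0.001 tolerance.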

### Key Files

| File | Purpose |
| --- | --- |
| `tools/episode_run.py` | Pipeline embedding backend selection (`get_embedding_backend`) |
| `FEATURES/arcface_tensorrt/src/tensorrt_builder.py` | Engine build/cache + optional S3 |
| `FEATURES/arcface_tensorrt/src/tensorrt_inference.py` | TensorRT inference wrapper |
| `FEATURES/arcface_tensorrt/src/embedding_compare.py` | Parity + speedup compare utilities |
| `config/pipeline/embedding.yaml` | Backend selection + validation knobs |
| `config/pipeline/arcface_tensorrt.yaml` | TensorRT builder/runtime config |
| `FEATURES/arcface_tensorrt/tests/test_tensorrt_embedding.py` | Unit tests (synthetic) |
| `tests/ml/test_arcface_embeddings.py` | ML-gated embedding invariants |

### Related Skills