# claude-skill-registry: embedding-engine

Embedding backends (InsightFace/PyTorch + ONNXRuntime vs. TensorRT). Use when optimizing embedding throughput or debugging drift/fallbacks.

```bash
git clone https://github.com/majiayu000/claude-skill-registry
```

To install only this skill:

```bash
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/embedding-engine" ~/.claude/skills/majiayu000-claude-skill-registry-embedding-engine && rm -rf "$T"
```
## Embedding Engine Skill

Source: `skills/data/embedding-engine/SKILL.md`
Use this skill to optimize embedding performance and debug embedding drift/fallback behavior.
### When to Use
- Embedding pipeline running slowly
- Need to switch between PyTorch and TensorRT
- Debugging embedding drift between backends
- Building/caching TensorRT engines
- Verifying ONNXRuntime/CoreML provider selection (macOS); see the provider check below
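For the last point, ONNXRuntime exposes its provider list directly. A minimal check, reusing the ONNX model path from the build example further down (the session/provider calls are standard `onnxruntime` API; the model path is just whatever ONNX file you have locally):

```python
import onnxruntime as ort

# Providers compiled into this onnxruntime build.
print(ort.get_available_providers())  # e.g. ['CoreMLExecutionProvider', 'CPUExecutionProvider']

# Request CoreML with CPU fallback, then confirm what the session actually selected.
sess = ort.InferenceSession(
    "models/arcface_r100_v1.onnx",
    providers=["CoreMLExecutionProvider", "CPUExecutionProvider"],
)
print(sess.get_providers())  # if CoreML is absent here, the session silently fell back to CPU
```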
### Sub-agents
| Sub-agent | Purpose |
|---|---|
| PyTorchEmbeddingSubagent | Reference ArcFace (training/validation) |
| TensorRTEmbeddingSubagent | GPU-optimized TRT inference |
| ONNXEmbeddingSubagent | ONNXRuntime C++ service (planned) |
### Current Backends

- `pytorch` (default): ArcFace via the `insightface` Python package (used by `tools/episode_run.py`)
- `tensorrt` (optional): TensorRT engine build + inference via `FEATURES/arcface_tensorrt/`
### Key Skills
#### Embed faces with the configured backend
Run embedding with the configured backend (same interface as the pipeline).
```python
from tools.episode_run import get_embedding_backend

embedder = get_embedding_backend(
    backend_type="pytorch",  # or "tensorrt"
    device="cpu",
    tensorrt_config="config/pipeline/arcface_tensorrt.yaml",
    allow_cpu_fallback=True,
)
embedder.ensure_ready()
embeddings = embedder.encode(face_crops)  # (N, 512) L2-normalized
```
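Since `encode` returns L2-normalized rows, cosine similarity between faces reduces to a dot product. A quick sanity check (assuming the result is a NumPy array):

```python
import numpy as np

# "embeddings" is the (N, 512) array returned by embedder.encode() above.
norms = np.linalg.norm(embeddings, axis=1)
assert np.allclose(norms, 1.0, atol=1e-3), "encode() should return L2-normalized rows"

# For unit-norm rows, cosine similarity is a plain dot product.
cos_sim = embeddings @ embeddings.T  # (N, N); diagonal ~ 1.0
```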
#### Build a TensorRT engine from ONNX
```bash
python -m FEATURES.arcface_tensorrt --mode build --onnx-path models/arcface_r100_v1.onnx
```
#### Compare TensorRT vs PyTorch embeddings (parity + speedup)
```bash
python -m FEATURES.arcface_tensorrt --mode compare --n-samples 100
```
This uses `FEATURES/arcface_tensorrt/src/embedding_compare.py` and reports cosine similarity + L2 distance stats.
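The same statistics can be reproduced ad hoc if you already have both backends' outputs in hand. A minimal sketch (the function name and exact stats are illustrative, not the compare utility's actual API):

```python
import numpy as np

def parity_stats(emb_pt: np.ndarray, emb_trt: np.ndarray) -> dict:
    """Cosine similarity + L2 distance stats between two (N, 512) embedding sets."""
    # Re-normalize defensively so cosine similarity is a plain row-wise dot product.
    emb_pt = emb_pt / np.linalg.norm(emb_pt, axis=1, keepdims=True)
    emb_trt = emb_trt / np.linalg.norm(emb_trt, axis=1, keepdims=True)
    cos = np.sum(emb_pt * emb_trt, axis=1)         # per-sample cosine similarity
    l2 = np.linalg.norm(emb_pt - emb_trt, axis=1)  # per-sample L2 distance
    return {
        "cosine_mean": float(cos.mean()), "cosine_min": float(cos.min()),
        "l2_mean": float(l2.mean()), "l2_max": float(l2.max()),
    }
```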
### Config Reference

File: `config/pipeline/embedding.yaml`
| Key | Default | Description |
|---|---|---|
| `backend` | `pytorch` | Backend: `pytorch` or `tensorrt` |
| `tensorrt_config` | `config/pipeline/arcface_tensorrt.yaml` | TensorRT config path |
| | 0.001 | Drift tolerance (behavior depends on runtime) |
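To confirm which backend the pipeline will pick up, the config can be read directly. A minimal sketch, assuming a top-level `embedding` block (the `backend` key matches the `embedding.backend` reference under Common Issues below):

```python
import yaml

with open("config/pipeline/embedding.yaml") as f:
    cfg = yaml.safe_load(f)

print(cfg["embedding"]["backend"])              # "pytorch" or "tensorrt"
print(cfg["embedding"].get("tensorrt_config"))  # TensorRT config path, if set
```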
File: `config/pipeline/arcface_tensorrt.yaml`
| Key | Default | Description |
|---|---|---|
| | `false` | Sandbox feature flag (engine must exist) |
| `precision` | `fp16` | Engine precision |
| `max_batch_size` | 32 | Max batch for engine build |
| `workspace_size_mb` | 1024 | TRT workspace (MB) |
| | `null` | Optional S3 engine bucket |
### Engine Storage

TensorRT engines are GPU-architecture specific. Stored in S3:

```
s3://screenalytics-models/engines/
├── arcface_r100-fp16-sm75.plan   # Turing (RTX 20xx)
├── arcface_r100-fp16-sm80.plan   # Ampere (A100)
├── arcface_r100-fp16-sm86.plan   # Ampere (RTX 30xx)
└── arcface_r100-fp16-sm89.plan   # Ada (RTX 40xx)
```
Naming convention: `{model_name}-{precision}-sm{arch}.plan`
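To work out which engine file the current machine needs, the SM number can be read from the GPU's compute capability. A minimal sketch, assuming PyTorch with CUDA is available:

```python
import torch

# Compute capability, e.g. (8, 6) on an RTX 30xx -> "sm86".
major, minor = torch.cuda.get_device_capability()
print(f"arcface_r100-fp16-sm{major}{minor}.plan")  # should match one of the S3 keys above
```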
### Common Issues

#### "Engine not found" / TensorRT backend won't load

Cause: No engine built for the current GPU, or a config mismatch.

Fix: Build locally:

```bash
python -m FEATURES.arcface_tensorrt --mode build --onnx-path models/arcface_r100_v1.onnx
```
#### Embedding drift too high

Cause: FP16 quantization or TensorRT optimization changes the numerics.

Check: Run the parity compare:

```bash
python -m FEATURES.arcface_tensorrt --mode compare --n-samples 100
```

Fix: Use FP32 precision:

```yaml
tensorrt:
  precision: fp32  # default is fp16
```
#### TensorRT slower than expected / falling back

Cause: Not batching, an engine built with suboptimal shapes/precision, or the backend silently fell back to PyTorch.

Check: Ensure `config/pipeline/embedding.yaml` has `embedding.backend: tensorrt` and re-run with `--mode benchmark`.

Fix: Increase batch size and confirm the GPU backend:

```yaml
tensorrt:
  opt_batch_size: 32
  max_batch_size: 64
```
#### Out of GPU memory

Cause: Engine workspace too large.

Check: `nvidia-smi` during inference.

Fix: Reduce the workspace:

```yaml
tensorrt:
  workspace_size_mb: 512  # default is 1024
```
### Benchmark Reference

| Backend | Batch | Throughput | Latency (per batch) | VRAM |
|---|---|---|---|---|
| PyTorch | 32 | ~50 fps | ~640 ms | 2 GB |
| TensorRT FP16 | 32 | ~250 fps | ~128 ms | 1 GB |
| TensorRT FP32 | 32 | ~180 fps | ~178 ms | 1.5 GB |
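The table can be reproduced with a crude harness around the backend interface shown earlier. This sketch assumes a CUDA device and 112×112 RGB crops (the standard ArcFace input size, an assumption here):

```python
import time
import numpy as np
from tools.episode_run import get_embedding_backend

embedder = get_embedding_backend(
    backend_type="tensorrt",
    device="cuda",  # assumption: a CUDA device is available for the TensorRT backend
    tensorrt_config="config/pipeline/arcface_tensorrt.yaml",
)
embedder.ensure_ready()

# Synthetic 112x112 RGB crops; real runs should use aligned face crops.
crops = [np.random.randint(0, 255, (112, 112, 3), dtype=np.uint8) for _ in range(32)]
embedder.encode(crops)  # warm-up: exclude engine load / first-batch cost

runs = 10
start = time.perf_counter()
for _ in range(runs):
    embedder.encode(crops)
elapsed = time.perf_counter() - start
print(f"throughput: {runs * len(crops) / elapsed:.1f} fps, "
      f"latency/batch: {elapsed / runs * 1000:.1f} ms")
```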
### Diagnostic Output

```json
{
  "backend": "tensorrt",
  "engine_path": "~/.cache/screenalytics/engines/arcface_r100_v1-sm86.trt",
  "precision": "fp16",
  "batch_size": 32,
  "embedding_dim": 512,
  "throughput_fps": 245.3,
  "latency_ms": 130.5,
  "vram_mb": 1024,
  "validation": {
    "drift_vs_pytorch": 0.9995,
    "regression_test": "passed"
  }
}
```
### Key Files

| File | Purpose |
|---|---|
| `tools/episode_run.py` | Pipeline embedding backend selection (`get_embedding_backend`) |
| | Engine build/cache + optional S3 storage |
| | TensorRT inference wrapper |
| `FEATURES/arcface_tensorrt/src/embedding_compare.py` | Parity + speedup compare utilities |
| `config/pipeline/embedding.yaml` | Backend selection + validation knobs |
| `config/pipeline/arcface_tensorrt.yaml` | TensorRT builder/runtime config |
| | Unit tests (synthetic) |
| | ML-gated embedding invariants |
### Related Skills
- pipeline-insights - General pipeline debugging
- face-alignment - Alignment before embedding