# claude-skill-registry: embedding-engine

Embedding backends (InsightFace/PyTorch + ONNXRuntime vs. TensorRT). Use when optimizing embedding throughput or debugging drift/fallback behavior.

## Install

Clone the upstream repo:

```sh
git clone https://github.com/majiayu000/claude-skill-registry
```

Claude Code: install into `~/.claude/skills/`:

```sh
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/embedding-engine" ~/.claude/skills/majiayu000-claude-skill-registry-embedding-engine && rm -rf "$T"
```

Manifest: `skills/data/embedding-engine/SKILL.md`

## Source content

## Embedding Engine Skill

Use this skill to optimize embedding performance and debug embedding drift/fallback behavior.

### When to Use

- Embedding pipeline running slowly
- Need to switch between PyTorch and TensorRT
- Debugging embedding drift between backends
- Building/caching TensorRT engines
- Verifying ONNXRuntime/CoreML provider selection (macOS)

### Sub-agents

| Sub-agent | Purpose |
| --- | --- |
| `PyTorchEmbeddingSubagent` | Reference ArcFace (training/validation) |
| `TensorRTEmbeddingSubagent` | GPU-optimized TRT inference |
| `ONNXEmbeddingSubagent` | Future ONNXRuntime C++ service (planned) |

### Current Backends

- `pytorch` (default): ArcFace via the `insightface` Python package (used by `tools/episode_run.py`)
- `tensorrt` (optional): TensorRT engine build + inference via `FEATURES/arcface_tensorrt/`

### Key Skills

#### Embed faces with the configured backend

Run embedding with the configured backend (same interface as the pipeline):

```python
from tools.episode_run import get_embedding_backend

embedder = get_embedding_backend(
    backend_type="pytorch",  # or "tensorrt"
    device="cpu",
    tensorrt_config="config/pipeline/arcface_tensorrt.yaml",
    allow_cpu_fallback=True,
)
embedder.ensure_ready()
embeddings = embedder.encode(face_crops)  # (N, 512) L2-normalized
```

#### Build a TensorRT engine from ONNX

```sh
python -m FEATURES.arcface_tensorrt --mode build --onnx-path models/arcface_r100_v1.onnx
```

#### Compare TensorRT vs. PyTorch embeddings (parity + speedup)

```sh
python -m FEATURES.arcface_tensorrt --mode compare --n-samples 100
```

This uses `FEATURES/arcface_tensorrt/src/embedding_compare.py` and reports cosine-similarity and L2-distance statistics.
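The reported metrics can be sketched with a short NumPy snippet. This is an illustrative sketch of the math, not the actual `embedding_compare.py` API; the function name `parity_stats` is hypothetical:

```python
import numpy as np

def parity_stats(emb_a: np.ndarray, emb_b: np.ndarray) -> dict:
    """Compare two (N, 512) L2-normalized embedding batches."""
    # For unit vectors, cosine similarity reduces to a row-wise dot product.
    cos = np.sum(emb_a * emb_b, axis=1)
    # L2 distance between corresponding rows.
    l2 = np.linalg.norm(emb_a - emb_b, axis=1)
    return {
        "cos_mean": float(cos.mean()),
        "cos_min": float(cos.min()),
        "l2_mean": float(l2.mean()),
        "l2_max": float(l2.max()),
    }

# Identical inputs should yield perfect parity (cosine ~1.0, L2 ~0.0).
a = np.random.randn(8, 512)
a /= np.linalg.norm(a, axis=1, keepdims=True)
stats = parity_stats(a, a)
```

Comparing a TensorRT batch against its PyTorch reference this way makes drift directly comparable to the `validation.max_drift_cosine` tolerance.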

### Config Reference

File: `config/pipeline/embedding.yaml`

| Key | Default | Description |
| --- | --- | --- |
| `embedding.backend` | `pytorch` | Backend: `pytorch` or `tensorrt` |
| `embedding.tensorrt_config` | `config/pipeline/arcface_tensorrt.yaml` | TensorRT config path |
| `validation.max_drift_cosine` | `0.001` | Drift tolerance (behavior depends on runtime) |
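Putting those keys together, a minimal `embedding.yaml` that switches the pipeline to TensorRT might look like this (layout inferred from the key names above; verify against the actual file):

```yaml
embedding:
  backend: tensorrt            # pytorch | tensorrt
  tensorrt_config: config/pipeline/arcface_tensorrt.yaml
validation:
  max_drift_cosine: 0.001      # parity tolerance vs. the PyTorch reference
```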

File: `config/pipeline/arcface_tensorrt.yaml`

| Key | Default | Description |
| --- | --- | --- |
| `arcface_tensorrt.enabled` | `false` | Sandbox feature flag (engine must exist) |
| `tensorrt.precision` | `fp16` | Engine precision |
| `tensorrt.max_batch_size` | `32` | Max batch for engine build |
| `tensorrt.workspace_size_mb` | `1024` | TRT workspace |
| `tensorrt.engine_s3_bucket` | `null` | Optional engine bucket |

### Engine Storage

TensorRT engines are GPU-architecture specific. Stored in S3:

```
s3://screenalytics-models/engines/
├── arcface_r100-fp16-sm75.plan   # Turing (RTX 20xx)
├── arcface_r100-fp16-sm80.plan   # Ampere (A100)
├── arcface_r100-fp16-sm86.plan   # Ampere (RTX 30xx)
└── arcface_r100-fp16-sm89.plan   # Ada (RTX 40xx)
```

Naming convention: `{model_name}-{precision}-sm{arch}.plan`
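A helper applying this convention might look like the sketch below. The function name is illustrative, not taken from `tensorrt_builder.py`:

```python
def engine_filename(model_name: str, precision: str, sm_arch: int) -> str:
    """Format an engine file name as {model_name}-{precision}-sm{arch}.plan."""
    if precision not in {"fp32", "fp16", "int8"}:
        raise ValueError(f"unsupported precision: {precision}")
    return f"{model_name}-{precision}-sm{sm_arch}.plan"

# On a CUDA machine the SM arch can be derived from
# torch.cuda.get_device_capability(), e.g. (8, 6) -> 86.
name = engine_filename("arcface_r100", "fp16", 86)
```

Keying the filename on the SM architecture is what lets one bucket serve engines to heterogeneous GPU fleets without deserialization failures.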

### Common Issues

#### "Engine not found" / TensorRT backend won't load

Cause: no engine built for the current GPU, or a config mismatch.

Fix: build locally:

```sh
python -m FEATURES.arcface_tensorrt --mode build --onnx-path models/arcface_r100_v1.onnx
```

#### Embedding drift too high

Cause: FP16 quantization or TensorRT optimization changes.

Check: run the parity compare:

```sh
python -m FEATURES.arcface_tensorrt --mode compare --n-samples 100
```

Fix: use FP32 precision:

```yaml
tensorrt:
  precision: fp32  # default is fp16
```

#### TensorRT slower than expected / falling back

Cause: not batching, an engine built with suboptimal shapes/precision, or the backend silently fell back.

Check: ensure `config/pipeline/embedding.yaml` has `embedding.backend: tensorrt`, then re-run with `--mode benchmark`.

Fix: increase batch size and confirm the GPU backend is active:

```yaml
tensorrt:
  opt_batch_size: 32
  max_batch_size: 64
```

#### Out of GPU memory

Cause: engine workspace too large.

Check: watch `nvidia-smi` during inference.

Fix: reduce the workspace:

```yaml
tensorrt:
  workspace_size_mb: 512  # default is 1024
```

### Benchmark Reference

| Backend | Batch | Throughput | Latency | VRAM |
| --- | --- | --- | --- | --- |
| PyTorch | 32 | ~50 fps | ~640 ms | 2 GB |
| TensorRT FP16 | 32 | ~250 fps | ~128 ms | 1 GB |
| TensorRT FP32 | 32 | ~180 fps | ~178 ms | 1.5 GB |

### Diagnostic Output

```json
{
  "backend": "tensorrt",
  "engine_path": "~/.cache/screenalytics/engines/arcface_r100_v1-sm86.trt",
  "precision": "fp16",
  "batch_size": 32,
  "embedding_dim": 512,
  "throughput_fps": 245.3,
  "latency_ms": 130.5,
  "vram_mb": 1024,
  "validation": {
    "drift_vs_pytorch": 0.9995,
    "regression_test": "passed"
  }
}
```
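A sketch of consuming this diagnostic to gate a deployment, assuming `drift_vs_pytorch` reports cosine similarity (so drift is one minus that value) and using the `validation.max_drift_cosine` default of 0.001; the function name is hypothetical:

```python
import json

MAX_DRIFT_COSINE = 0.001  # validation.max_drift_cosine from embedding.yaml

def check_diagnostic(report_json: str) -> bool:
    """Return True if the backend diagnostic passes drift validation."""
    report = json.loads(report_json)
    val = report["validation"]
    # Assumption: drift_vs_pytorch is a cosine similarity (1.0 = identical),
    # so the drift itself is 1 - similarity.
    drift = 1.0 - val["drift_vs_pytorch"]
    return drift <= MAX_DRIFT_COSINE and val["regression_test"] == "passed"

ok = check_diagnostic(
    '{"validation": {"drift_vs_pytorch": 0.9995, "regression_test": "passed"}}'
)
```

With the example values above, drift is 0.0005, comfortably inside the 0.001 tolerance.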

### Key Files

| File | Purpose |
| --- | --- |
| `tools/episode_run.py` | Pipeline embedding backend selection (`get_embedding_backend`) |
| `FEATURES/arcface_tensorrt/src/tensorrt_builder.py` | Engine build/cache + optional S3 |
| `FEATURES/arcface_tensorrt/src/tensorrt_inference.py` | TensorRT inference wrapper |
| `FEATURES/arcface_tensorrt/src/embedding_compare.py` | Parity + speedup compare utilities |
| `config/pipeline/embedding.yaml` | Backend selection + validation knobs |
| `config/pipeline/arcface_tensorrt.yaml` | TensorRT builder/runtime config |
| `FEATURES/arcface_tensorrt/tests/test_tensorrt_embedding.py` | Unit tests (synthetic) |
| `tests/ml/test_arcface_embeddings.py` | ML-gated embedding invariants |

### Related Skills