Claude-skill-registry 3d-cv-labeling-2026

Expert in 3D computer vision labeling tools, workflows, and AI-assisted annotation for LiDAR, point clouds, and sensor fusion. Covers SAM4D/Point-SAM, human-in-the-loop architectures, and vertical-specific training strategies. Activate on '3D labeling', 'point cloud annotation', 'LiDAR labeling', 'SAM 3D', 'SAM4D', 'sensor fusion annotation', '3D bounding box', 'semantic segmentation point cloud'. NOT for 2D image labeling (use clip-aware-embeddings), general ML training (use ml-engineer), video annotation without 3D (use computer-vision-pipeline), or VLM prompt engineering (use prompt-engineer).

install
source · Clone the upstream repo
git clone https://github.com/majiayu000/claude-skill-registry
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/3d-cv-labeling-2026" ~/.claude/skills/majiayu000-claude-skill-registry-3d-cv-labeling-2026 && rm -rf "$T"
manifest: skills/data/3d-cv-labeling-2026/SKILL.md
source content

3D Computer Vision Labeling Expert (2026)

Expert guidance on 3D annotation tools, AI-assisted labeling workflows, and training architectures for LiDAR/point cloud computer vision in autonomous vehicles, robotics, infrastructure inspection, and geospatial applications.

When to Use This Skill

Use for:

  • Selecting 3D point cloud annotation tools (BasicAI, Supervisely, Segments.ai, Deepen AI)
  • Implementing SAM4D/Point-SAM for auto-labeling workflows
  • Designing human-in-the-loop annotation pipelines
  • Sensor fusion annotation (camera + LiDAR + radar)
  • Training architecture decisions: specialized models vs VLMs
  • Vertical-specific 3D detection (autonomous driving, inspection, agriculture, wildfire)

NOT for:

  • 2D image labeling without 3D context (use clip-aware-embeddings or Label Studio docs)
  • General ML model training (use ml-engineer)
  • Video annotation without point clouds (use computer-vision-pipeline)
  • VLM prompt engineering (use prompt-engineer)
  • Photogrammetry/3D reconstruction (use geo processing tools)

2026 Tool Landscape Overview

Commercial Leaders

| Tool | Strength | Best For | Key AI Feature |
|------|----------|----------|----------------|
| BasicAI | One-click detection | Autonomous driving | Pre-labeling models fine-tuned for AV |
| Supervisely | Customization | R&D teams | AI tracking, 2D→3D single-click |
| Segments.ai | 2D+3D sync | Robotics perception | Sequential propagation |
| Deepen AI | Sensor calibration | In-house perception | Pixel-perfect multi-sensor |
| Dataloop | Enterprise MLOps | Large annotation teams | Model-assisted + point cloud focus |
| Encord | Full workflow | Multi-modal projects | Track-ID management |
| Ango Hub (iMerit) | Dense annotation | Complex multi-modal | Frame-to-frame propagation |

Open Source Options

| Tool | Maturity | Notes |
|------|----------|-------|
| CVAT | Stable | 3D bounding boxes only, limited interpolation |
| 3D BAT | Good | Full-surround annotation, semi-auto tracking |
| Label Studio | Partial 3D | Better for multi-format, not specialized 3D |

SAM Evolution for 3D (2024-2026)

SAM4D (ICCV 2025) - Multi-Modal + Temporal

Key innovation: Unified Multi-modal Positional Encoding (UMPE) aligns camera and LiDAR in shared 3D space.

Camera Stream → Feature Extraction → ┐
                                      ├→ UMPE Alignment → Promptable 3D Segmentation
LiDAR Stream → Point Encoding     → ┘

Data engine breakthrough: automatic pseudo-label generation, 100x+ faster than human annotation, using:

  1. VFM-driven video masklets
  2. Spatiotemporal 4D reconstruction
  3. Cross-modal masklet fusion

Dataset: Waymo-4DSeg (300k+ camera-LiDAR aligned masklets)

Point-SAM (ICLR 2025) - Native 3D Prompting

Architecture: Efficient transformer designed specifically for point clouds (not adapted from 2D).

Knowledge distillation: 2D SAM → 3D Point-SAM via a data engine that generates:

  • Part-level pseudo-labels
  • Object-level pseudo-labels

Benchmarks: Outperforms state-of-the-art on indoor (ScanNet) and outdoor (nuScenes, Waymo) datasets.

SAMNet++ (2025) - Hybrid Pipeline

Two-stage approach:

  1. SAM performs unsupervised segmentation
  2. Adapted PointNet++ refines for semantic accuracy

Best for: UAV/drone workflows where colorized point clouds from L1 LiDAR + RGB cameras are available.


Human-in-the-Loop Architecture

The Model-in-the-Loop Paradigm (2023-2026)

Old approach: Human labels → Train model → Deploy
New approach: Model assists → Human validates → Rapid iteration

┌─────────────────────────────────────────────────────────┐
│                    LABELING PIPELINE                     │
├─────────────────────────────────────────────────────────┤
│  Raw Data → AI Pre-label → Human Review → QA Check      │
│     │           │              │             │          │
│     │     SAM4D/VLM       Corrections   Consensus      │
│     │     generates       only where    sampling        │
│     │     proposals       AI uncertain                  │
└─────────────────────────────────────────────────────────┘

Efficiency Gains

| Approach | Time for 10k frames | Annotation Quality |
|----------|---------------------|--------------------|
| Manual only | 400 hours | 95% (expert) |
| AI pre-label + review | 50 hours | 97% (AI+human) |
| SAM4D data engine | 4 hours | 92% (pseudo) |

The 80/20 rule: ~80% of ML project time is data prep. Model-in-the-loop cuts this dramatically.
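
Reading the table backwards into per-frame budgets makes the gap concrete. This is derived arithmetic from the totals above, not measured numbers:

```python
# Back-of-the-envelope: convert total hours for 10k frames into per-frame budgets.
FRAMES = 10_000

approaches = {
    "manual only": 400,            # hours, from the table above
    "AI pre-label + review": 50,
    "SAM4D data engine": 4,
}

for name, hours in approaches.items():
    seconds_per_frame = hours * 3600 / FRAMES
    print(f"{name:>22}: {hours:>4} h total ≈ {seconds_per_frame:5.1f} s/frame")
    # -> 144.0 s, 18.0 s, and 1.4 s per frame respectively
```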

Quality Assurance Strategies

  1. Consensus sampling: Multiple annotators label the same subset; measure agreement (see the sketch below)
  2. Active learning: Route uncertain predictions to experts
  3. Tiered review: Tier 1 (critical objects) gets SME validation; Tiers 2/3 use AI confidence thresholds
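
A minimal sketch of the consensus-sampling check (strategy 1), assuming axis-aligned 3D boxes given as (cx, cy, cz, dx, dy, dz). Production tools use rotated-box IoU and proper assignment, so treat this as illustrative:

```python
import numpy as np

def aabb_iou_3d(box_a, box_b):
    """IoU of two axis-aligned 3D boxes given as (cx, cy, cz, dx, dy, dz)."""
    a_min = np.array(box_a[:3]) - np.array(box_a[3:]) / 2
    a_max = np.array(box_a[:3]) + np.array(box_a[3:]) / 2
    b_min = np.array(box_b[:3]) - np.array(box_b[3:]) / 2
    b_max = np.array(box_b[:3]) + np.array(box_b[3:]) / 2
    overlap = np.clip(np.minimum(a_max, b_max) - np.maximum(a_min, b_min), 0, None)
    inter = overlap.prod()
    return inter / (np.prod(box_a[3:]) + np.prod(box_b[3:]) - inter + 1e-9)

def flag_disagreements(annots_a, annots_b, iou_threshold=0.7):
    """Match each of annotator A's boxes to B's best overlap; flag low agreement."""
    flagged = []
    for i, box_a in enumerate(annots_a):
        best = max((aabb_iou_3d(box_a, b) for b in annots_b), default=0.0)
        if best < iou_threshold:
            flagged.append((i, best))
    return flagged

# Toy check: two annotators label the same car with slightly different boxes.
a = [(10.0, 2.0, 0.9, 4.5, 1.8, 1.6)]
b = [(10.2, 2.1, 0.9, 4.4, 1.8, 1.6)]
print(flag_disagreements(a, b))   # [] -> agreement above threshold
```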

Why Specialized Training > VLMs for 3D

The Core Trade-off

| Aspect | Specialized (YOLO, PointPillars) | VLMs (GPT-4V, Gemini) |
|--------|----------------------------------|------------------------|
| Latency | 10-50 ms (real-time) | 500-2000 ms |
| 3D precision | Strong geometric priors | Noisy text-3D alignment |
| Novel objects | Closed-set (what you train) | Open-vocabulary |
| Compute | Edge-deployable | GPU cluster required |
| Hallucinations | None (deterministic) | Yes (safety-critical risk) |
| Domain shift | Struggles (fog, night) | Better generalization |

When to Use Each

Use Specialized Models When:

  • Real-time inference required (autonomous vehicles, robotics)
  • Known object classes (infrastructure defects, crop types)
  • Safety-critical deployment (can't tolerate hallucinations)
  • Edge deployment (drones, embedded systems)

Use VLMs/Foundation Models When:

  • Zero-shot exploration of new domains
  • Generating training data (weak labels)
  • Open-vocabulary requirements ("find anything damaged")
  • Domain adaptation bootstrapping

The Hybrid Architecture (2025+ Best Practice)

                    ┌───────────────────────┐
                    │    VLM (Slow Brain)   │
                    │  • Scene understanding│
                    │  • Open vocabulary    │
                    │  • Anomaly detection  │
                    └──────────┬────────────┘
                               │ High-level context
                               ▼
┌──────────────────────────────────────────────────────────┐
│              Specialized Detector (Fast Brain)           │
│  • Real-time inference (YOLO, PointPillars, CenterPoint)│
│  • Known object detection & tracking                    │
│  • Safety-critical decisions                            │
└──────────────────────────────────────────────────────────┘

Examples:

  • VOLTRON: YOLOv8 + LLaMA2 for hazard identification
  • DrivePI: Point clouds + multi-view + language instructions (0.5B Qwen2.5)
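
One way to wire the slow/fast split is a loop in which the detector runs on every frame and the VLM is queried only on a timer or when detector confidence drops. This is a sketch under that assumption; `fast_detector` and `slow_vlm` are placeholder callables, not real APIs:

```python
import time

def run_hybrid_loop(frames, fast_detector, slow_vlm,
                    vlm_period_s=5.0, low_conf_threshold=0.4):
    """Fast brain runs every frame; slow brain runs on a timer or on low confidence."""
    context = {}                      # latest high-level scene context from the VLM
    last_vlm_call = -float("inf")
    for frame in frames:
        detections = fast_detector(frame)              # e.g. list of (label, score, box)
        min_score = min((d[1] for d in detections), default=1.0)
        now = time.monotonic()
        if now - last_vlm_call > vlm_period_s or min_score < low_conf_threshold:
            context = slow_vlm(frame)                  # slow, open-vocabulary pass
            last_vlm_call = now
        yield detections, context

# Toy stand-ins so the loop runs end to end.
fake_detector = lambda f: [("vehicle", 0.9, (0, 0, 4, 2))]
fake_vlm = lambda f: {"scene": "suburban street, clear weather"}
for out in run_hybrid_loop(range(3), fake_detector, fake_vlm):
    print(out)
```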

Vertical-Specific Training Architecture

Infrastructure Inspection

Objects: Utility poles, insulators, conductors, vegetation, damage types
Sensor fusion: RGB + thermal + LiDAR
Training data needs:

  • Thermal anomaly samples (varied temperatures)
  • Damage taxonomy (cracks, corrosion, rust grades)
  • Vegetation clearance measurements

Architecture:

LiDAR → Point cloud encoder → ┐
Thermal → 2D encoder       → ├→ Fusion → Multi-task head
RGB → 2D encoder           → ┘          ├→ Object detection
                                         ├→ Defect classification
                                         └→ Clearance regression
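
A skeletal PyTorch version of the fusion-plus-multi-task-head idea in the diagram, assuming each sensor branch has already been pooled to a fixed-length embedding; dimensions, class counts, and head shapes are illustrative placeholders:

```python
import torch
import torch.nn as nn

class InspectionFusionHead(nn.Module):
    """Fuse per-modality embeddings, then branch into the three task heads."""
    def __init__(self, lidar_dim=256, thermal_dim=128, rgb_dim=128,
                 n_object_classes=6, n_defect_classes=4):
        super().__init__()
        fused = lidar_dim + thermal_dim + rgb_dim
        self.fusion = nn.Sequential(nn.Linear(fused, 256), nn.ReLU())
        self.object_head = nn.Linear(256, n_object_classes)    # object detection logits (per proposal)
        self.defect_head = nn.Linear(256, n_defect_classes)    # defect classification logits
        self.clearance_head = nn.Linear(256, 1)                 # vegetation clearance regression

    def forward(self, lidar_feat, thermal_feat, rgb_feat):
        x = self.fusion(torch.cat([lidar_feat, thermal_feat, rgb_feat], dim=-1))
        return self.object_head(x), self.defect_head(x), self.clearance_head(x)

# Toy forward pass with random embeddings for a batch of 2 proposals.
head = InspectionFusionHead()
out = head(torch.randn(2, 256), torch.randn(2, 128), torch.randn(2, 128))
print([t.shape for t in out])   # -> shapes (2, 6), (2, 4), (2, 1)
```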

Autonomous Driving

Objects: Vehicles, pedestrians, cyclists, traffic signs, lane markings
Key requirement: Temporal consistency (track-IDs across frames)
Training data needs:

  • Long-tail scenarios (emergency vehicles, animals, debris)
  • Adverse weather (fog, rain, snow, night)
  • Edge cases (construction zones, accidents)

Architecture: CenterPoint, PointPillars, or Voxel-based detectors with BEV (Bird's Eye View) representation.
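
For intuition, a minimal numpy sketch of the BEV rasterization those detectors build on: scatter LiDAR points into a ground-plane grid and keep simple per-cell statistics. PointPillars and friends learn per-pillar features instead of these hand-made ones; ranges and cell size here are illustrative:

```python
import numpy as np

def points_to_bev(points, x_range=(0, 70.4), y_range=(-40, 40), cell=0.2):
    """Rasterize an (N, 3) point cloud into a BEV grid of max height and point count."""
    nx = int(round((x_range[1] - x_range[0]) / cell))
    ny = int(round((y_range[1] - y_range[0]) / cell))
    height = np.full((nx, ny), -np.inf)
    count = np.zeros((nx, ny), dtype=np.int32)

    ix = ((points[:, 0] - x_range[0]) / cell).astype(int)
    iy = ((points[:, 1] - y_range[0]) / cell).astype(int)
    valid = (ix >= 0) & (ix < nx) & (iy >= 0) & (iy < ny)
    for x, y, z in zip(ix[valid], iy[valid], points[valid, 2]):
        height[x, y] = max(height[x, y], z)
        count[x, y] += 1
    height[count == 0] = 0.0
    return np.stack([height, count.astype(np.float32)])   # (2, nx, ny) BEV "image"

bev = points_to_bev(np.random.rand(1000, 3) * [70, 80, 3] + [0, -40, 0])
print(bev.shape)   # (2, 352, 400)
```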

Agriculture/Wildfire

Objects: Crop rows, canopy height, fuel load, fire spread boundaries
Sensor fusion: RGB + multispectral + LiDAR
Training data needs:

  • Crop growth stages
  • Disease/pest visual signatures
  • Fuel load density from LiDAR CHM (Canopy Height Model)

Why not just VLM? VLMs can't:

  • Measure precise heights (LiDAR regression; see the CHM sketch below)
  • Classify at hyperspectral wavelengths
  • Maintain spatial precision for prescription maps
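
As a concrete example of the height and fuel-load point, a toy CHM computation from a classified point cloud. It assumes ground returns are already flagged (column 3 is a 0/1 ground flag); real pipelines use proper ground filtering and rasterization libraries:

```python
import numpy as np

def canopy_height_model(points, cell=1.0):
    """CHM = highest return per cell minus ground height per cell.

    points: (N, 4) array of x, y, z, is_ground (1 for ground returns).
    """
    x0, y0 = points[:, 0].min(), points[:, 1].min()
    ix = ((points[:, 0] - x0) / cell).astype(int)
    iy = ((points[:, 1] - y0) / cell).astype(int)
    nx, ny = ix.max() + 1, iy.max() + 1

    surface = np.full((nx, ny), np.nan)   # digital surface model (all returns)
    ground = np.full((nx, ny), np.nan)    # digital terrain model (ground returns only)
    for gx, gy, z, is_ground in zip(ix, iy, points[:, 2], points[:, 3]):
        surface[gx, gy] = np.nanmax([surface[gx, gy], z])
        if is_ground:
            ground[gx, gy] = np.nanmin([ground[gx, gy], z])

    chm = surface - ground                # canopy height above ground, per cell
    return np.nan_to_num(chm, nan=0.0)

# Toy scene: flat ground at z=0 with one 8 m "tree".
pts = np.array([[0, 0, 0, 1], [1, 0, 0, 1], [1, 1, 8, 0], [1, 1, 0, 1]], dtype=float)
print(canopy_height_model(pts).max())   # 8.0
```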

Common Anti-Patterns

Anti-Pattern: "Just Use SAM on Everything"

Novice thinking: "SAM segments anything, so I'll just run it on my LiDAR data"

Reality:

  • SAM 1/2 are 2D models—they don't understand 3D geometry
  • Point clouds need Point-SAM or SAM4D specifically
  • Raw application produces noisy masks without geometric priors

Correct approach: Use Point-SAM for native 3D prompting, or project the points into the camera image, run SAM in 2D, and lift the masks back to 3D.
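
A minimal numpy sketch of that project-segment-lift route, assuming a pinhole camera with intrinsics K and a LiDAR-to-camera extrinsic T taken from your calibration; the 2D mask here stands in for SAM's output:

```python
import numpy as np

def lift_mask_to_points(points, mask, K, T_lidar_to_cam):
    """Label LiDAR points whose projection falls inside a 2D mask.

    points: (N, 3) LiDAR xyz.   mask: (H, W) boolean segmentation mask.
    K: (3, 3) camera intrinsics. T_lidar_to_cam: (4, 4) extrinsic.
    Returns a boolean (N,) array: True where the point inherits the mask label.
    """
    homo = np.hstack([points, np.ones((len(points), 1))])         # (N, 4)
    cam = (T_lidar_to_cam @ homo.T).T[:, :3]                      # points in camera frame
    in_front = cam[:, 2] > 0.1                                    # drop points behind the camera
    uv = (K @ cam.T).T
    uv = uv[:, :2] / uv[:, 2:3]                                   # perspective divide
    u, v = uv[:, 0].astype(int), uv[:, 1].astype(int)
    h, w = mask.shape
    in_image = (u >= 0) & (u < w) & (v >= 0) & (v < h) & in_front
    labels = np.zeros(len(points), dtype=bool)
    labels[in_image] = mask[v[in_image], u[in_image]]
    return labels

# Toy check: identity extrinsic, one point straight ahead projects to the mask pixel.
K = np.array([[500.0, 0, 320], [0, 500.0, 240], [0, 0, 1]])
mask = np.zeros((480, 640), dtype=bool); mask[240, 320] = True
pts = np.array([[0.0, 0.0, 10.0]])
print(lift_mask_to_points(pts, mask, K, np.eye(4)))   # [ True]
```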

Anti-Pattern: Skipping Human Validation

Novice thinking: "AI pre-labels are 95% accurate, we can skip review"

Reality:

  • 5% error on 100k objects = 5,000 wrong labels
  • Errors compound in edge cases (exactly where you need accuracy)
  • Model learns to reproduce annotation mistakes

Correct approach: Tier 1 (safety-critical) always human-validated. Use confidence thresholds for Tier 2/3.
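
A toy routing policy for that tiering; the classes, tiers, and thresholds here are illustrative, not prescriptive:

```python
# Route AI pre-labels: Tier 1 classes always go to a human; lower tiers only when
# the model is unsure. Thresholds and tier assignments are illustrative.
TIER_1 = {"pedestrian", "cyclist"}          # safety-critical: always reviewed
AUTO_ACCEPT_CONF = {2: 0.90, 3: 0.75}       # tier -> confidence needed to skip review

def route_prelabel(label, confidence, tier):
    if label in TIER_1 or tier == 1:
        return "human_review"
    if confidence >= AUTO_ACCEPT_CONF.get(tier, 1.0):
        return "auto_accept"
    return "human_review"

print(route_prelabel("pedestrian", 0.99, 2))    # human_review (Tier 1 class)
print(route_prelabel("traffic_cone", 0.95, 2))  # auto_accept
print(route_prelabel("debris", 0.60, 3))        # human_review (below threshold)
```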

Anti-Pattern: VLM for Real-Time Inference

Novice thinking: "GPT-4V can identify damage in my photos"

Reality:

  • 500-2000ms latency per frame
  • Can't run on edge devices
  • Hallucination risk in safety-critical contexts

Correct approach: Use VLM for data generation/exploration, specialized model for deployment.

Anti-Pattern: Single-Modal Training

Novice thinking: "LiDAR is enough for 3D detection"

Reality:

  • LiDAR: Precise geometry, no color/texture
  • Camera: Rich semantics, no depth
  • Fusion outperforms single-modal by 5-15% mAP

Correct approach: Sensor fusion from day one. SAM4D shows fusion pseudo-labels > single-modal.


Decision Tree: Choosing Your Approach

                        Do you need real-time inference?
                              /                  \
                           YES                    NO
                            |                      |
                    Use specialized           Is this exploration?
                    detector (YOLO,              /        \
                    CenterPoint)               YES         NO
                            |                  |           |
                    Have labeled data?     Use VLM      Generate
                      /        \           for zero-    pseudo-labels
                   YES          NO         shot         with SAM4D
                    |            |
              Train model    Use SAM4D/
                             Point-SAM for
                             auto-labeling
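
The same tree expressed as a plain function, in case you want to encode it in a project checklist or CI script:

```python
def choose_approach(real_time: bool, exploration: bool = False,
                    have_labels: bool = False) -> str:
    """Encodes the decision tree above; returns a coarse recommendation."""
    if real_time:
        if have_labels:
            return "train a specialized detector (YOLO / CenterPoint) on your labels"
        return "auto-label with SAM4D / Point-SAM, then train a specialized detector"
    if exploration:
        return "use a VLM for zero-shot exploration"
    return "generate pseudo-labels with SAM4D"

print(choose_approach(real_time=True, have_labels=False))
```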

Tool Selection Decision Matrix

| Requirement | Recommended Tool |
|-------------|------------------|
| Autonomous driving at scale | Deepen AI or BasicAI |
| R&D/research flexibility | Supervisely or Segments.ai |
| Multi-modal (camera + LiDAR + radar) | Ango Hub or Dataloop |
| Self-hosted/open source | CVAT + 3D plugins or 3D BAT |
| Robotics perception | Segments.ai (2D+3D sync) |
| Budget-conscious | Label Studio + custom scripts |

References

  • /references/sam4d-architecture.md
    - Deep dive on SAM4D UMPE and data engine
  • /references/tool-comparison-matrix.md
    - Detailed feature comparison of all tools
  • /references/hybrid-architecture-examples.md
    - VOLTRON, DrivePI implementation patterns
  • /references/vertical-training-recipes.md
    - Infrastructure, AV, agriculture specifics
