Claude-skill-registry 3d-cv-labeling-2026

Expert in 3D computer vision labeling tools, workflows, and AI-assisted annotation for LiDAR, point clouds, and sensor fusion. Covers SAM4D/Point-SAM, human-in-the-loop architectures, and vertical-specific training strategies. Activate on '3D labeling', 'point cloud annotation', 'LiDAR labeling', 'SAM 3D', 'SAM4D', 'sensor fusion annotation', '3D bounding box', 'semantic segmentation point cloud'. NOT for 2D image labeling (use clip-aware-embeddings), general ML training (use ml-engineer), video annotation without 3D (use computer-vision-pipeline), or VLM prompt engineering (use prompt-engineer).

install
source · Clone the upstream repo
git clone https://github.com/majiayu000/claude-skill-registry
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/3d-cv-labeling-2026" ~/.claude/skills/majiayu000-claude-skill-registry-3d-cv-labeling-2026 && rm -rf "$T"
manifest: skills/data/3d-cv-labeling-2026/SKILL.md
source content

3D Computer Vision Labeling Expert (2026)

Expert guidance on 3D annotation tools, AI-assisted labeling workflows, and training architectures for LiDAR/point cloud computer vision in autonomous vehicles, robotics, infrastructure inspection, and geospatial applications.

When to Use This Skill

Use for:

  • Selecting 3D point cloud annotation tools (BasicAI, Supervisely, Segments.ai, Deepen AI)
  • Implementing SAM4D/Point-SAM for auto-labeling workflows
  • Designing human-in-the-loop annotation pipelines
  • Sensor fusion annotation (camera + LiDAR + radar)
  • Training architecture decisions: specialized models vs VLMs
  • Vertical-specific 3D detection (autonomous driving, inspection, agriculture, wildfire)

NOT for:

  • 2D image labeling without 3D context (use clip-aware-embeddings or Label Studio docs)
  • General ML model training (use ml-engineer)
  • Video annotation without point clouds (use computer-vision-pipeline)
  • VLM prompt engineering (use prompt-engineer)
  • Photogrammetry/3D reconstruction (use geo processing tools)

2026 Tool Landscape Overview

Commercial Leaders

| Tool | Strength | Best For | Key AI Feature |
|------|----------|----------|----------------|
| BasicAI | One-click detection | Autonomous driving | Pre-labeling models fine-tuned for AV |
| Supervisely | Customization | R&D teams | AI tracking, 2D→3D single-click |
| Segments.ai | 2D+3D sync | Robotics perception | Sequential propagation |
| Deepen AI | Sensor calibration | In-house perception | Pixel-perfect multi-sensor |
| Dataloop | Enterprise MLOps | Large annotation teams | Model-assisted + point cloud focus |
| Encord | Full workflow | Multi-modal projects | Track-ID management |
| Ango Hub (iMerit) | Dense annotation | Complex multi-modal | Frame-to-frame propagation |

Open Source Options

| Tool | Maturity | Notes |
|------|----------|-------|
| CVAT | Stable | 3D bounding boxes only, limited interpolation |
| 3D BAT | Good | Full-surround annotation, semi-auto tracking |
| Label Studio | Partial 3D | Better for multi-format, not specialized 3D |

SAM Evolution for 3D (2024-2026)

SAM4D (ICCV 2025) - Multi-Modal + Temporal

Key innovation: Unified Multi-modal Positional Encoding (UMPE) aligns camera and LiDAR in shared 3D space.

Camera Stream → Feature Extraction → ┐
                                      ├→ UMPE Alignment → Promptable 3D Segmentation
LiDAR Stream → Point Encoding     → ┘

Data engine breakthrough: automatic pseudo-label generation, 100x+ faster than human annotation, using:

  1. VFM-driven video masklets
  2. Spatiotemporal 4D reconstruction
  3. Cross-modal masklet fusion

Dataset: Waymo-4DSeg (300k+ camera-LiDAR aligned masklets)

Point-SAM (ICLR 2025) - Native 3D Prompting

Architecture: Efficient transformer designed specifically for point clouds (not adapted from 2D).

Knowledge distillation: 2D SAM → 3D Point-SAM via a data engine that generates:

  • Part-level pseudo-labels
  • Object-level pseudo-labels

Benchmarks: Outperforms state-of-the-art on indoor (ScanNet) and outdoor (nuScenes, Waymo) datasets.

SAMNet++ (2025) - Hybrid Pipeline

Two-stage approach:

  1. SAM performs unsupervised segmentation
  2. Adapted PointNet++ refines for semantic accuracy

Best for: UAV/drone workflows where colorized point clouds from L1 LiDAR + RGB cameras are available.


Human-in-the-Loop Architecture

The Model-in-the-Loop Paradigm (2023-2026)

Old approach: Human labels → Train model → Deploy
New approach: Model assists → Human validates → Rapid iteration

┌─────────────────────────────────────────────────────────┐
│                    LABELING PIPELINE                     │
├─────────────────────────────────────────────────────────┤
│  Raw Data → AI Pre-label → Human Review → QA Check      │
│     │           │              │             │          │
│     │     SAM4D/VLM       Corrections   Consensus      │
│     │     generates       only where    sampling        │
│     │     proposals       AI uncertain                  │
└─────────────────────────────────────────────────────────┘

Efficiency Gains

| Approach | Time for 10k frames | Annotation Quality |
|----------|---------------------|--------------------|
| Manual only | 400 hours | 95% (expert) |
| AI pre-label + review | 50 hours | 97% (AI+human) |
| SAM4D data engine | 4 hours | 92% (pseudo) |

The 80/20 rule: ~80% of ML project time is data prep. Model-in-the-loop cuts this dramatically.
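
Reading the table backwards into per-frame budgets makes the gap concrete. This is derived arithmetic from the totals above, not measured numbers:

```python
# Back-of-the-envelope: convert total hours for 10k frames into per-frame budgets.
FRAMES = 10_000

approaches = {
    "manual only": 400,            # hours, from the table above
    "AI pre-label + review": 50,
    "SAM4D data engine": 4,
}

for name, hours in approaches.items():
    seconds_per_frame = hours * 3600 / FRAMES
    print(f"{name:>22}: {hours:>4} h total ≈ {seconds_per_frame:5.1f} s/frame")
    # -> 144.0 s, 18.0 s, and 1.4 s per frame respectively
```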

Quality Assurance Strategies

  1. Consensus sampling: Multiple annotators label the same subset; measure agreement (see the sketch below)
  2. Active learning: Route uncertain predictions to experts
  3. Tiered review: Tier 1 (critical objects) gets SME validation; Tiers 2/3 use AI confidence thresholds
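
A minimal sketch of the consensus-sampling check (strategy 1), assuming axis-aligned 3D boxes given as (cx, cy, cz, dx, dy, dz). Production tools use rotated-box IoU and proper assignment, so treat this as illustrative:

```python
import numpy as np

def aabb_iou_3d(box_a, box_b):
    """IoU of two axis-aligned 3D boxes given as (cx, cy, cz, dx, dy, dz)."""
    a_min = np.array(box_a[:3]) - np.array(box_a[3:]) / 2
    a_max = np.array(box_a[:3]) + np.array(box_a[3:]) / 2
    b_min = np.array(box_b[:3]) - np.array(box_b[3:]) / 2
    b_max = np.array(box_b[:3]) + np.array(box_b[3:]) / 2
    overlap = np.clip(np.minimum(a_max, b_max) - np.maximum(a_min, b_min), 0, None)
    inter = overlap.prod()
    return inter / (np.prod(box_a[3:]) + np.prod(box_b[3:]) - inter + 1e-9)

def flag_disagreements(annots_a, annots_b, iou_threshold=0.7):
    """Match each of annotator A's boxes to B's best overlap; flag low agreement."""
    flagged = []
    for i, box_a in enumerate(annots_a):
        best = max((aabb_iou_3d(box_a, b) for b in annots_b), default=0.0)
        if best < iou_threshold:
            flagged.append((i, best))
    return flagged

# Toy check: two annotators label the same car with slightly different boxes.
a = [(10.0, 2.0, 0.9, 4.5, 1.8, 1.6)]
b = [(10.2, 2.1, 0.9, 4.4, 1.8, 1.6)]
print(flag_disagreements(a, b))   # [] -> agreement above threshold
```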

Why Specialized Training > VLMs for 3D

The Core Trade-off

| Aspect | Specialized (YOLO, PointPillars) | VLMs (GPT-4V, Gemini) |
|--------|----------------------------------|------------------------|
| Latency | 10-50 ms (real-time) | 500-2000 ms |
| 3D precision | Strong geometric priors | Noisy text-3D alignment |
| Novel objects | Closed-set (what you train) | Open-vocabulary |
| Compute | Edge-deployable | GPU cluster required |
| Hallucinations | None (deterministic) | Yes (safety-critical risk) |
| Domain shift | Struggles (fog, night) | Better generalization |

When to Use Each

Use Specialized Models When:

  • Real-time inference required (autonomous vehicles, robotics)
  • Known object classes (infrastructure defects, crop types)
  • Safety-critical deployment (can't tolerate hallucinations)
  • Edge deployment (drones, embedded systems)

Use VLMs/Foundation Models When:

  • Zero-shot exploration of new domains
  • Generating training data (weak labels)
  • Open-vocabulary requirements ("find anything damaged")
  • Domain adaptation bootstrapping

The Hybrid Architecture (2025+ Best Practice)

                    ┌───────────────────────┐
                    │    VLM (Slow Brain)   │
                    │  • Scene understanding│
                    │  • Open vocabulary    │
                    │  • Anomaly detection  │
                    └──────────┬────────────┘
                               │ High-level context
                               ▼
┌──────────────────────────────────────────────────────────┐
│              Specialized Detector (Fast Brain)           │
│  • Real-time inference (YOLO, PointPillars, CenterPoint)│
│  • Known object detection & tracking                    │
│  • Safety-critical decisions                            │
└──────────────────────────────────────────────────────────┘

Examples:

  • VOLTRON: YOLOv8 + LLaMA2 for hazard identification
  • DrivePI: Point clouds + multi-view + language instructions (0.5B Qwen2.5)
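
One way to wire the slow/fast split is a loop in which the detector runs on every frame and the VLM is queried only on a timer or when detector confidence drops. This is a sketch under that assumption; `fast_detector` and `slow_vlm` are placeholder callables, not real APIs:

```python
import time

def run_hybrid_loop(frames, fast_detector, slow_vlm,
                    vlm_period_s=5.0, low_conf_threshold=0.4):
    """Fast brain runs every frame; slow brain runs on a timer or on low confidence."""
    context = {}                      # latest high-level scene context from the VLM
    last_vlm_call = -float("inf")
    for frame in frames:
        detections = fast_detector(frame)              # e.g. list of (label, score, box)
        min_score = min((d[1] for d in detections), default=1.0)
        now = time.monotonic()
        if now - last_vlm_call > vlm_period_s or min_score < low_conf_threshold:
            context = slow_vlm(frame)                  # slow, open-vocabulary pass
            last_vlm_call = now
        yield detections, context

# Toy stand-ins so the loop runs end to end.
fake_detector = lambda f: [("vehicle", 0.9, (0, 0, 4, 2))]
fake_vlm = lambda f: {"scene": "suburban street, clear weather"}
for out in run_hybrid_loop(range(3), fake_detector, fake_vlm):
    print(out)
```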

Vertical-Specific Training Architecture

Infrastructure Inspection

Objects: Utility poles, insulators, conductors, vegetation, damage types
Sensor fusion: RGB + thermal + LiDAR
Training data needs:

  • Thermal anomaly samples (varied temperatures)
  • Damage taxonomy (cracks, corrosion, rust grades)
  • Vegetation clearance measurements

Architecture:

LiDAR → Point cloud encoder → ┐
Thermal → 2D encoder       → ├→ Fusion → Multi-task head
RGB → 2D encoder           → ┘          ├→ Object detection
                                         ├→ Defect classification
                                         └→ Clearance regression
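
A skeletal PyTorch version of the fusion-plus-multi-task-head idea in the diagram, assuming each sensor branch has already been pooled to a fixed-length embedding; dimensions, class counts, and head shapes are illustrative placeholders:

```python
import torch
import torch.nn as nn

class InspectionFusionHead(nn.Module):
    """Fuse per-modality embeddings, then branch into the three task heads."""
    def __init__(self, lidar_dim=256, thermal_dim=128, rgb_dim=128,
                 n_object_classes=6, n_defect_classes=4):
        super().__init__()
        fused = lidar_dim + thermal_dim + rgb_dim
        self.fusion = nn.Sequential(nn.Linear(fused, 256), nn.ReLU())
        self.object_head = nn.Linear(256, n_object_classes)    # object detection logits (per proposal)
        self.defect_head = nn.Linear(256, n_defect_classes)    # defect classification logits
        self.clearance_head = nn.Linear(256, 1)                 # vegetation clearance regression

    def forward(self, lidar_feat, thermal_feat, rgb_feat):
        x = self.fusion(torch.cat([lidar_feat, thermal_feat, rgb_feat], dim=-1))
        return self.object_head(x), self.defect_head(x), self.clearance_head(x)

# Toy forward pass with random embeddings for a batch of 2 proposals.
head = InspectionFusionHead()
out = head(torch.randn(2, 256), torch.randn(2, 128), torch.randn(2, 128))
print([t.shape for t in out])   # -> shapes (2, 6), (2, 4), (2, 1)
```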

Autonomous Driving

Objects: Vehicles, pedestrians, cyclists, traffic signs, lane markings
Key requirement: Temporal consistency (track-IDs across frames)
Training data needs:

  • Long-tail scenarios (emergency vehicles, animals, debris)
  • Adverse weather (fog, rain, snow, night)
  • Edge cases (construction zones, accidents)

Architecture: CenterPoint, PointPillars, or Voxel-based detectors with BEV (Bird's Eye View) representation.
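
For intuition, a minimal numpy sketch of the BEV rasterization those detectors build on: scatter LiDAR points into a ground-plane grid and keep simple per-cell statistics. PointPillars and friends learn per-pillar features instead of these hand-made ones; ranges and cell size here are illustrative:

```python
import numpy as np

def points_to_bev(points, x_range=(0, 70.4), y_range=(-40, 40), cell=0.2):
    """Rasterize an (N, 3) point cloud into a BEV grid of max height and point count."""
    nx = int(round((x_range[1] - x_range[0]) / cell))
    ny = int(round((y_range[1] - y_range[0]) / cell))
    height = np.full((nx, ny), -np.inf)
    count = np.zeros((nx, ny), dtype=np.int32)

    ix = ((points[:, 0] - x_range[0]) / cell).astype(int)
    iy = ((points[:, 1] - y_range[0]) / cell).astype(int)
    valid = (ix >= 0) & (ix < nx) & (iy >= 0) & (iy < ny)
    for x, y, z in zip(ix[valid], iy[valid], points[valid, 2]):
        height[x, y] = max(height[x, y], z)
        count[x, y] += 1
    height[count == 0] = 0.0
    return np.stack([height, count.astype(np.float32)])   # (2, nx, ny) BEV "image"

bev = points_to_bev(np.random.rand(1000, 3) * [70, 80, 3] + [0, -40, 0])
print(bev.shape)   # (2, 352, 400)
```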

Agriculture/Wildfire

Objects: Crop rows, canopy height, fuel load, fire spread boundaries
Sensor fusion: RGB + multispectral + LiDAR
Training data needs:

  • Crop growth stages
  • Disease/pest visual signatures
  • Fuel load density from LiDAR CHM (Canopy Height Model)

Why not just VLM? VLMs can't:

  • Measure precise heights (LiDAR regression; see the CHM sketch below)
  • Classify at hyperspectral wavelengths
  • Maintain spatial precision for prescription maps
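
As a concrete example of the height and fuel-load point, a toy CHM computation from a classified point cloud. It assumes ground returns are already flagged (column 3 is a 0/1 ground flag); real pipelines use proper ground filtering and rasterization libraries:

```python
import numpy as np

def canopy_height_model(points, cell=1.0):
    """CHM = highest return per cell minus ground height per cell.

    points: (N, 4) array of x, y, z, is_ground (1 for ground returns).
    """
    x0, y0 = points[:, 0].min(), points[:, 1].min()
    ix = ((points[:, 0] - x0) / cell).astype(int)
    iy = ((points[:, 1] - y0) / cell).astype(int)
    nx, ny = ix.max() + 1, iy.max() + 1

    surface = np.full((nx, ny), np.nan)   # digital surface model (all returns)
    ground = np.full((nx, ny), np.nan)    # digital terrain model (ground returns only)
    for gx, gy, z, is_ground in zip(ix, iy, points[:, 2], points[:, 3]):
        surface[gx, gy] = np.nanmax([surface[gx, gy], z])
        if is_ground:
            ground[gx, gy] = np.nanmin([ground[gx, gy], z])

    chm = surface - ground                # canopy height above ground, per cell
    return np.nan_to_num(chm, nan=0.0)

# Toy scene: flat ground at z=0 with one 8 m "tree".
pts = np.array([[0, 0, 0, 1], [1, 0, 0, 1], [1, 1, 8, 0], [1, 1, 0, 1]], dtype=float)
print(canopy_height_model(pts).max())   # 8.0
```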

Common Anti-Patterns

Anti-Pattern: "Just Use SAM on Everything"

Novice thinking: "SAM segments anything, so I'll just run it on my LiDAR data"

Reality:

  • SAM 1/2 are 2D models—they don't understand 3D geometry
  • Point clouds need Point-SAM or SAM4D specifically
  • Raw application produces noisy masks without geometric priors

Correct approach: Use Point-SAM for native 3D prompting, or project the points into the camera image, run SAM in 2D, and lift the masks back to 3D.
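
A minimal numpy sketch of that project-segment-lift route, assuming a pinhole camera with intrinsics K and a LiDAR-to-camera extrinsic T taken from your calibration; the 2D mask here stands in for SAM's output:

```python
import numpy as np

def lift_mask_to_points(points, mask, K, T_lidar_to_cam):
    """Label LiDAR points whose projection falls inside a 2D mask.

    points: (N, 3) LiDAR xyz.   mask: (H, W) boolean segmentation mask.
    K: (3, 3) camera intrinsics. T_lidar_to_cam: (4, 4) extrinsic.
    Returns a boolean (N,) array: True where the point inherits the mask label.
    """
    homo = np.hstack([points, np.ones((len(points), 1))])         # (N, 4)
    cam = (T_lidar_to_cam @ homo.T).T[:, :3]                      # points in camera frame
    in_front = cam[:, 2] > 0.1                                    # drop points behind the camera
    uv = (K @ cam.T).T
    uv = uv[:, :2] / uv[:, 2:3]                                   # perspective divide
    u, v = uv[:, 0].astype(int), uv[:, 1].astype(int)
    h, w = mask.shape
    in_image = (u >= 0) & (u < w) & (v >= 0) & (v < h) & in_front
    labels = np.zeros(len(points), dtype=bool)
    labels[in_image] = mask[v[in_image], u[in_image]]
    return labels

# Toy check: identity extrinsic, one point straight ahead projects to the mask pixel.
K = np.array([[500.0, 0, 320], [0, 500.0, 240], [0, 0, 1]])
mask = np.zeros((480, 640), dtype=bool); mask[240, 320] = True
pts = np.array([[0.0, 0.0, 10.0]])
print(lift_mask_to_points(pts, mask, K, np.eye(4)))   # [ True]
```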

Anti-Pattern: Skipping Human Validation

Novice thinking: "AI pre-labels are 95% accurate, we can skip review"

Reality:

  • 5% error on 100k objects = 5,000 wrong labels
  • Errors compound in edge cases (exactly where you need accuracy)
  • Model learns to reproduce annotation mistakes

Correct approach: Tier 1 (safety-critical) always human-validated. Use confidence thresholds for Tier 2/3.
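
A toy routing policy for that tiering; the classes, tiers, and thresholds here are illustrative, not prescriptive:

```python
# Route AI pre-labels: Tier 1 classes always go to a human; lower tiers only when
# the model is unsure. Thresholds and tier assignments are illustrative.
TIER_1 = {"pedestrian", "cyclist"}          # safety-critical: always reviewed
AUTO_ACCEPT_CONF = {2: 0.90, 3: 0.75}       # tier -> confidence needed to skip review

def route_prelabel(label, confidence, tier):
    if label in TIER_1 or tier == 1:
        return "human_review"
    if confidence >= AUTO_ACCEPT_CONF.get(tier, 1.0):
        return "auto_accept"
    return "human_review"

print(route_prelabel("pedestrian", 0.99, 2))    # human_review (Tier 1 class)
print(route_prelabel("traffic_cone", 0.95, 2))  # auto_accept
print(route_prelabel("debris", 0.60, 3))        # human_review (below threshold)
```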

Anti-Pattern: VLM for Real-Time Inference

Novice thinking: "GPT-4V can identify damage in my photos"

Reality:

  • 500-2000ms latency per frame
  • Can't run on edge devices
  • Hallucination risk in safety-critical contexts

Correct approach: Use VLM for data generation/exploration, specialized model for deployment.

Anti-Pattern: Single-Modal Training

Novice thinking: "LiDAR is enough for 3D detection"

Reality:

  • LiDAR: Precise geometry, no color/texture
  • Camera: Rich semantics, no depth
  • Fusion outperforms single-modal by 5-15% mAP

Correct approach: Sensor fusion from day one. SAM4D shows fusion pseudo-labels > single-modal.


Decision Tree: Choosing Your Approach

                        Do you need real-time inference?
                              /                  \
                           YES                    NO
                            |                      |
                    Use specialized           Is this exploration?
                    detector (YOLO,              /        \
                    CenterPoint)               YES         NO
                            |                  |           |
                    Have labeled data?     Use VLM      Generate
                      /        \           for zero-    pseudo-labels
                   YES          NO         shot         with SAM4D
                    |            |
              Train model    Use SAM4D/
                             Point-SAM for
                             auto-labeling
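
The same tree expressed as a plain function, in case you want to encode it in a project checklist or CI script:

```python
def choose_approach(real_time: bool, exploration: bool = False,
                    have_labels: bool = False) -> str:
    """Encodes the decision tree above; returns a coarse recommendation."""
    if real_time:
        if have_labels:
            return "train a specialized detector (YOLO / CenterPoint) on your labels"
        return "auto-label with SAM4D / Point-SAM, then train a specialized detector"
    if exploration:
        return "use a VLM for zero-shot exploration"
    return "generate pseudo-labels with SAM4D"

print(choose_approach(real_time=True, have_labels=False))
```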

Tool Selection Decision Matrix

| Requirement | Recommended Tool |
|-------------|------------------|
| Autonomous driving at scale | Deepen AI or BasicAI |
| R&D/research flexibility | Supervisely or Segments.ai |
| Multi-modal (camera + LiDAR + radar) | Ango Hub or Dataloop |
| Self-hosted/open source | CVAT + 3D plugins or 3D BAT |
| Robotics perception | Segments.ai (2D+3D sync) |
| Budget-conscious | Label Studio + custom scripts |

References

  • /references/sam4d-architecture.md
    - Deep dive on SAM4D UMPE and data engine
  • /references/tool-comparison-matrix.md
    - Detailed feature comparison of all tools
  • /references/hybrid-architecture-examples.md
    - VOLTRON, DrivePI implementation patterns
  • /references/vertical-training-recipes.md
    - Infrastructure, AV, agriculture specifics
