git clone https://github.com/vibeforge1111/vibeship-spawner-skills
# ai/computer-vision-deep/skill.yaml
id: computer-vision-deep
name: Computer Vision Deep
category: ai
description: >-
  Use when implementing object detection, semantic/instance segmentation,
  3D vision, or video understanding - covers YOLO, SAM, depth estimation,
  and multi-modal vision
patterns:
  golden_rules:
    - rule: "YOLO for speed, SAM for accuracy"
      reason: "Different tools for different constraints"
    - rule: "YOLO + SAM hybrid is powerful"
      reason: "Detection boxes → SAM masks (sketched below)"
    - rule: "Anchor-free is the future"
      reason: "YOLO v8+ is anchor-free, simpler"
    - rule: "Resolution matters enormously"
      reason: "2x resolution ≈ 4x compute, better small objects"
    - rule: "Data augmentation is critical"
      reason: "Geometric + color augmentations improve robustness"
    - rule: "Pre-trained backbones always"
      reason: "ImageNet/CLIP pretrained >>> random init"
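# Sketch of the "YOLO + SAM hybrid" rule: YOLO proposes boxes, SAM turns
# each box into a pixel-accurate mask. Assumes the ultralytics and
# segment-anything packages plus a downloaded sam_vit_b checkpoint;
# file names here are illustrative.
example_yolo_sam_hybrid: |
  import cv2
  from ultralytics import YOLO
  from segment_anything import sam_model_registry, SamPredictor

  bgr = cv2.imread("scene.jpg")
  rgb = cv2.cvtColor(bgr, cv2.COLOR_BGR2RGB)  # SAM expects RGB

  # 1. Fast detection: YOLO proposes boxes in xyxy format.
  boxes = YOLO("yolov8n.pt")(bgr)[0].boxes.xyxy.cpu().numpy()

  # 2. Accurate masks: SAM refines each box prompt into a binary mask.
  sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b.pth")
  predictor = SamPredictor(sam)
  predictor.set_image(rgb)
  for box in boxes:
      masks, scores, _ = predictor.predict(box=box, multimask_output=False)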
task_landscape:
  image_classification:
    description: "What is in the image?"
    output: "Single label or multi-label"
    models: ["ResNet", "ViT", "EfficientNet", "ConvNeXt"]
  object_detection:
    description: "Where are objects?"
    output: "Bounding boxes + classes"
    models: ["YOLO", "DETR", "Faster R-CNN"]
  semantic_segmentation:
    description: "Pixel-wise classification"
    output: "Class map (all cars = same class)"
    models: ["DeepLab", "SegFormer", "UNet"]
  instance_segmentation:
    description: "Separate each object instance"
    output: "Car 1, Car 2, Car 3 distinct"
    models: ["Mask R-CNN", "YOLO-Seg", "SAM"]
  panoptic_segmentation:
    description: "Semantic + Instance unified"
    output: "Every pixel gets a class, countable objects get instance ids"
    models: ["Panoptic FPN", "MaskFormer", "Mask2Former"]
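# A minimal semantic-segmentation sketch with SegFormer from the task
# table above, using HuggingFace transformers; the checkpoint name and
# image path are illustrative.
example_semantic_segmentation: |
  import torch
  from PIL import Image
  from transformers import AutoImageProcessor, SegformerForSemanticSegmentation

  ckpt = "nvidia/segformer-b0-finetuned-ade-512-512"
  processor = AutoImageProcessor.from_pretrained(ckpt)
  model = SegformerForSemanticSegmentation.from_pretrained(ckpt)

  inputs = processor(images=Image.open("street.jpg"), return_tensors="pt")
  with torch.no_grad():
      logits = model(**inputs).logits   # (1, num_classes, H/4, W/4)
  class_map = logits.argmax(dim=1)      # pixel-wise labels at 1/4 scale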
yolo_models:
  # map = COCO mAP50-95; speed_ms = per-image inference latency
  yolov8n: { params: "3.2M", map: "37.3", speed_ms: "1.2", use_case: "Edge, mobile" }
  yolov8s: { params: "11.2M", map: "44.9", speed_ms: "1.8", use_case: "Balanced" }
  yolov8m: { params: "25.9M", map: "50.2", speed_ms: "3.4", use_case: "Accuracy focus" }
  yolov8l: { params: "43.7M", map: "52.9", speed_ms: "5.0", use_case: "High accuracy" }
  yolov8x: { params: "68.2M", map: "53.9", speed_ms: "8.1", use_case: "Maximum accuracy" }
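# Quick-start for the table above: pick a variant by latency budget, then
# tune resolution and NMS at predict time. Assumes the ultralytics
# package; the image path is illustrative.
example_yolo_inference: |
  from ultralytics import YOLO

  model = YOLO("yolov8s.pt")  # the "Balanced" row above
  # imgsz raises input resolution (better small objects, ~4x compute per 2x);
  # conf and iou are the confidence and NMS thresholds to tune.
  results = model.predict("street.jpg", imgsz=1280, conf=0.25, iou=0.45)
  for r in results:
      print(r.boxes.cls.tolist(), r.boxes.conf.tolist())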
foundation_models:
  sam:
    description: "Segment Anything Model"
    capabilities:
      - "Zero-shot segmentation"
      - "Point/box/text prompts"
      - "High-quality masks"
  clip:
    description: "Contrastive Language-Image Pre-training"
    capabilities:
      - "Zero-shot classification"
      - "Image-text matching"
      - "Feature extraction"
  dino:
    description: "Self-supervised vision transformer"
    capabilities:
      - "Self-supervised features"
      - "Part discovery"
      - "Correspondence"
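# Zero-shot classification with CLIP, as listed above; a sketch against
# the openai/clip-vit-base-patch32 checkpoint (label prompts and the
# image path are illustrative).
example_clip_zero_shot: |
  import torch
  from PIL import Image
  from transformers import CLIPModel, CLIPProcessor

  model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
  processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

  labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]
  inputs = processor(text=labels, images=Image.open("photo.jpg"),
                     return_tensors="pt", padding=True)
  with torch.no_grad():
      probs = model(**inputs).logits_per_image.softmax(dim=-1)
  print(dict(zip(labels, probs[0].tolist())))  # label → probability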
anti_patterns:
- pattern: "Training from scratch" problem: "Slow, poor results" solution: "Always use pretrained backbone"
- pattern: "Low resolution for small objects" problem: "Missing detections" solution: "Increase input resolution"
- pattern: "No augmentation" problem: "Overfitting" solution: "Strong augmentation pipeline"
- pattern: "Wrong anchor sizes" problem: "Poor box regression" solution: "Anchor-free (YOLO v8+) or cluster anchors"
- pattern: "Ignoring class imbalance" problem: "Biased predictions" solution: "Focal loss, oversampling"
- pattern: "Not using SAM for annotation" problem: "Slow manual annotation" solution: "SAM-assisted labeling"
implementation_checklist:
  object_detection:
    - "Pretrained model selected (YOLO for speed, DETR for accuracy)"
    - "Input resolution appropriate for object sizes"
    - "Strong data augmentation"
    - "Class imbalance handled"
    - "NMS threshold tuned"
  segmentation:
    - "Task defined (semantic vs instance vs panoptic)"
    - "Model selected (SAM for zero-shot, SegFormer for custom)"
    - "Augmentations applied to both image and mask"
    - "Edge handling considered"
  video:
    - "Temporal consistency (tracking vs per-frame)"
    - "Tracking algorithm selected (ByteTrack, BoT-SORT)"
    - "Frame rate considered"
    - "GPU memory managed (batch processing)"
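# Video-tracking sketch for the checklist above: ultralytics bundles
# ByteTrack and BoT-SORT configs, selected via the tracker argument;
# stream=True yields results frame by frame so GPU memory stays bounded.
# The video path is illustrative.
example_video_tracking: |
  from ultralytics import YOLO

  model = YOLO("yolov8n.pt")
  for result in model.track("traffic.mp4", tracker="bytetrack.yaml", stream=True):
      if result.boxes.id is not None:
          print(result.boxes.id.int().tolist())  # persistent track ids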
handoffs:
  - skill: model-optimization
    trigger: "deployment optimization for vision models"
  - skill: distributed-training
    trigger: "multi-GPU vision training"
  - skill: nlp-advanced
    trigger: "multi-modal vision-language tasks"
ecosystem:
  detection:
    - "Ultralytics YOLO"
    - "Detectron2"
    - "MMDetection"
    - "DETR"
  segmentation:
    - "SAM (Segment Anything)"
    - "SegFormer"
    - "DeepLab"
    - "MMSegmentation"
  depth:
    - "Depth Anything"
    - "MiDaS"
    - "ZoeDepth"
    - "DPT"
  video:
    - "VideoMAE"
    - "ByteTrack"
    - "BoT-SORT"
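# Monocular depth sketch via the transformers depth-estimation pipeline
# with Intel/dpt-large (DPT/MiDaS family, from the depth list above);
# checkpoint and file names are illustrative.
example_depth_estimation: |
  from PIL import Image
  from transformers import pipeline

  estimator = pipeline("depth-estimation", model="Intel/dpt-large")
  out = estimator(Image.open("room.jpg"))
  out["depth"].save("room_depth.png")  # relative depth map as a PIL image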
sources:
  documentation:
    - "Ultralytics YOLO Documentation"
    - "SAM Documentation"
    - "HuggingFace Vision Models"
  papers:
    - "Segment Anything (Kirillov et al.)"
    - "You Only Look Once: Unified, Real-Time Object Detection (Redmon et al.)"
    - "End-to-End Object Detection with Transformers (Carion et al.)"