Awesome-Agent-Skills-for-Empirical-Research computer-vision-guide
Apply computer vision research methods, models, and evaluation tools
install
source · Clone the upstream repo
git clone https://github.com/brycewang-stanford/Awesome-Agent-Skills-for-Empirical-Research
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/brycewang-stanford/Awesome-Agent-Skills-for-Empirical-Research "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/43-wentorai-research-plugins/skills/domains/ai-ml/computer-vision-guide" ~/.claude/skills/brycewang-stanford-awesome-agent-skills-for-empirical-research-computer-vision-g && rm -rf "$T"
manifest:
skills/43-wentorai-research-plugins/skills/domains/ai-ml/computer-vision-guide/SKILL.md
source content
Computer Vision Guide
A skill for conducting computer vision research, covering model architectures, dataset preparation, training pipelines, evaluation metrics, and common experimental protocols for image classification, object detection, and segmentation tasks.
Core Tasks and Architectures
Computer Vision Task Taxonomy
Image Classification:
- Input: Single image
- Output: Class label(s)
- Models: ResNet, EfficientNet, ViT, ConvNeXt

Object Detection:
- Input: Single image
- Output: Bounding boxes + class labels
- Models: YOLO (v5-v9), Faster R-CNN, DETR, RT-DETR

Semantic Segmentation:
- Input: Single image
- Output: Per-pixel class label
- Models: U-Net, DeepLab, SegFormer, Mask2Former

Instance Segmentation:
- Input: Single image
- Output: Per-pixel labels distinguishing individual objects
- Models: Mask R-CNN, Mask2Former, SAM

Image Generation:
- Input: Text prompt or noise
- Output: Generated image
- Models: Stable Diffusion, DALL-E, Imagen
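These task boundaries show up directly in library APIs: the same image goes in, but each task returns a different output structure. A minimal sketch contrasting a classifier and a detector, assuming torchvision >= 0.13 for the Weights API:

import torch
from torchvision import models

image = torch.rand(1, 3, 224, 224)  # dummy RGB image batch in [0, 1]

# Classification: one logit vector per image
classifier = models.resnet50(weights=models.ResNet50_Weights.DEFAULT).eval()
with torch.no_grad():
    logits = classifier(image)  # shape (1, 1000)

# Detection: a list of dicts with boxes, labels, and scores per image
detector = models.detection.fasterrcnn_resnet50_fpn(
    weights=models.detection.FasterRCNN_ResNet50_FPN_Weights.DEFAULT
).eval()
with torch.no_grad():
    detections = detector([image[0]])  # [{'boxes': ..., 'labels': ..., 'scores': ...}]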
Model Architecture Evolution
CNNs (Convolutional Neural Networks): LeNet (1998) -> AlexNet (2012) -> VGG (2014) -> ResNet (2015) -> EfficientNet (2019) -> ConvNeXt (2022)

Vision Transformers: ViT (2020) -> DeiT (2021) -> Swin Transformer (2021) -> BEiT (2021) -> DINOv2 (2023)

Trend: Transformers are competitive with CNNs at scale. Hybrid architectures combining convolutions and attention are common.
Dataset Preparation
Building a Research Dataset
import random
from pathlib import Path

def organize_image_dataset(source_dir: str, split_ratios: dict = None) -> dict:
    """
    Compute per-class train/val/test split sizes for an image dataset.

    Args:
        source_dir: Directory containing class subdirectories
        split_ratios: Dict with 'train', 'val', 'test' ratios
    """
    if split_ratios is None:
        split_ratios = {"train": 0.7, "val": 0.15, "test": 0.15}

    random.seed(42)  # fixed seed so splits are reproducible
    stats = {}
    for class_dir in sorted(Path(source_dir).iterdir()):
        if not class_dir.is_dir():
            continue
        images = list(class_dir.glob("*.jpg")) + list(class_dir.glob("*.png"))
        random.shuffle(images)
        n = len(images)
        n_train = int(n * split_ratios["train"])
        n_val = int(n * split_ratios["val"])
        stats[class_dir.name] = {
            "total": n,
            "train": n_train,
            "val": n_val,
            "test": n - n_train - n_val,  # remainder goes to test
        }
    return stats
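The helper above only computes split sizes; it does not move any files. A minimal sketch of one way to materialize the splits on disk, assuming the same class-subdirectory layout (the materialize_splits name and dest_dir argument are illustrative, not part of the guide):

import random
import shutil
from pathlib import Path

def materialize_splits(source_dir: str, dest_dir: str,
                       split_ratios: dict = None) -> None:
    """Copy images into dest_dir/{train,val,test}/{class}/ subfolders."""
    if split_ratios is None:
        split_ratios = {"train": 0.7, "val": 0.15, "test": 0.15}
    random.seed(42)  # match the seed used when counting splits
    for class_dir in sorted(Path(source_dir).iterdir()):
        if not class_dir.is_dir():
            continue
        images = list(class_dir.glob("*.jpg")) + list(class_dir.glob("*.png"))
        random.shuffle(images)
        n_train = int(len(images) * split_ratios["train"])
        n_val = int(len(images) * split_ratios["val"])
        splits = {
            "train": images[:n_train],
            "val": images[n_train:n_train + n_val],
            "test": images[n_train + n_val:],
        }
        for split_name, files in splits.items():
            out = Path(dest_dir) / split_name / class_dir.name
            out.mkdir(parents=True, exist_ok=True)
            for f in files:
                shutil.copy2(f, out / f.name)  # copy, preserving metadata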
Data Augmentation
from torchvision import transforms

def get_training_transforms(img_size: int = 224) -> transforms.Compose:
    """
    Standard data augmentation pipeline for training.

    Args:
        img_size: Target image size
    """
    return transforms.Compose([
        transforms.RandomResizedCrop(img_size, scale=(0.8, 1.0)),
        transforms.RandomHorizontalFlip(p=0.5),
        transforms.ColorJitter(brightness=0.2, contrast=0.2,
                               saturation=0.2, hue=0.1),
        transforms.RandomRotation(15),
        transforms.ToTensor(),
        transforms.Normalize(
            mean=[0.485, 0.456, 0.406],
            std=[0.229, 0.224, 0.225]
        )
    ])
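At evaluation time, preprocessing should be deterministic: no random crops, flips, or jitter. A common companion pipeline (resize, center crop, same ImageNet normalization), sketched to match the training transforms above; the get_eval_transforms name is an assumption:

from torchvision import transforms

def get_eval_transforms(img_size: int = 224) -> transforms.Compose:
    """Deterministic preprocessing for validation and test sets."""
    return transforms.Compose([
        transforms.Resize(int(img_size * 256 / 224)),  # e.g. 256 for 224 crops
        transforms.CenterCrop(img_size),
        transforms.ToTensor(),
        transforms.Normalize(
            mean=[0.485, 0.456, 0.406],
            std=[0.229, 0.224, 0.225]
        )
    ])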
Training Pipeline
Transfer Learning Workflow
import torch.nn as nn
from torchvision import models

def create_classifier(num_classes: int, backbone: str = "resnet50",
                      pretrained: bool = True) -> nn.Module:
    """
    Create an image classifier using transfer learning.

    Args:
        num_classes: Number of target classes
        backbone: Model architecture name
        pretrained: Whether to use ImageNet-pretrained weights
    """
    if backbone == "resnet50":
        weights = models.ResNet50_Weights.DEFAULT if pretrained else None
        model = models.resnet50(weights=weights)
        model.fc = nn.Linear(model.fc.in_features, num_classes)
    elif backbone == "vit_b_16":
        weights = models.ViT_B_16_Weights.DEFAULT if pretrained else None
        model = models.vit_b_16(weights=weights)
        model.heads.head = nn.Linear(
            model.heads.head.in_features, num_classes
        )
    else:
        raise ValueError(f"Unknown backbone: {backbone}")
    return model
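A common transfer-learning recipe is to train only the replacement head first, then unfreeze the backbone at a lower learning rate. A minimal sketch against create_classifier above; the two-phase split and the learning rates are illustrative choices, not prescribed by the guide:

import torch

model = create_classifier(num_classes=10, backbone="resnet50")

# Phase 1: train only the new head; keep pretrained features frozen
for param in model.parameters():
    param.requires_grad = False
for param in model.fc.parameters():
    param.requires_grad = True
optimizer = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad], lr=1e-3
)

# Phase 2 (optional): unfreeze everything, backbone at a lower learning rate
for param in model.parameters():
    param.requires_grad = True
optimizer = torch.optim.AdamW([
    {"params": model.fc.parameters(), "lr": 1e-3},
    {"params": [p for n, p in model.named_parameters()
                if not n.startswith("fc.")], "lr": 1e-5},
])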
Evaluation Metrics
Metrics by Task
Classification:
- Top-1 Accuracy: Fraction of correct predictions
- Top-5 Accuracy: Correct class in top 5 predictions
- Precision, Recall, F1: Per-class and macro-averaged
- Confusion Matrix: Visualize class-level errors

Object Detection:
- mAP (mean Average Precision): Standard COCO metric
- mAP@0.5: AP at IoU threshold 0.5
- mAP@0.5:0.95: AP averaged over IoU thresholds 0.5 to 0.95
- AP per class: Identifies weak categories

Segmentation:
- mIoU (mean Intersection over Union): Standard metric
- Pixel Accuracy: Fraction of correctly classified pixels
- Dice Coefficient: F1 score at the pixel level
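Segmentation metrics are easy to get subtly wrong; accumulating a single confusion matrix over the whole dataset avoids per-image averaging artifacts. A minimal NumPy sketch of pixel accuracy and mIoU (the segmentation_metrics helper and the ignore_index convention are assumptions, not from the guide):

import numpy as np

def segmentation_metrics(preds: np.ndarray, targets: np.ndarray,
                         num_classes: int, ignore_index: int = 255) -> dict:
    """Pixel accuracy and mIoU from flattened prediction/target arrays."""
    mask = targets != ignore_index
    preds, targets = preds[mask], targets[mask]

    # Confusion matrix: rows = ground truth, columns = prediction
    cm = np.bincount(
        targets * num_classes + preds, minlength=num_classes ** 2
    ).reshape(num_classes, num_classes)

    pixel_acc = np.diag(cm).sum() / cm.sum()
    # Per-class IoU = TP / (TP + FP + FN); absent classes become NaN
    with np.errstate(divide="ignore", invalid="ignore"):
        iou = np.diag(cm) / (cm.sum(axis=0) + cm.sum(axis=1) - np.diag(cm))
    return {"pixel_accuracy": pixel_acc, "mIoU": np.nanmean(iou)}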
Reproducibility Checklist
What to Report in Papers
1. Architecture: Exact model name, number of parameters
2. Pretraining: Dataset and weights used for initialization
3. Training: Optimizer, learning rate schedule, batch size, epochs
4. Augmentation: Full list of augmentations with parameters
5. Hardware: GPU type, number, training time
6. Evaluation: Exact metrics, test set version, evaluation protocol
7. Code: Link to repository with training and evaluation scripts
8. Random seeds: Report seeds used; ideally report mean over 3+ seeds (see the seeding sketch below)
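For item 8, a minimal seeding sketch covering the usual sources of randomness in a PyTorch vision pipeline; full determinism may additionally require torch.use_deterministic_algorithms and careful data-loader configuration (the set_seed helper is an assumption):

import random
import numpy as np
import torch

def set_seed(seed: int = 42) -> None:
    """Seed Python, NumPy, and PyTorch RNGs for a single run."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Trade speed for reproducibility in cuDNN convolutions
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False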
Ethical Considerations
When collecting or using image datasets, consider consent (especially for images of people), geographic and demographic representation, potential for bias amplification, and dual-use concerns. Document the dataset's composition and limitations. Follow the Datasheets for Datasets framework. For generative models, implement safeguards against generating harmful content.