Babysitter nsight-profiler
Expert skill for NVIDIA Nsight Systems and Nsight Compute profiling tools. Configure profiling sessions, analyze kernel reports, interpret occupancy metrics and roofline model data, identify memory bandwidth bottlenecks, and analyze warp execution efficiency.
T=$(mktemp -d) && git clone --depth=1 https://github.com/a5c-ai/babysitter "$T" && mkdir -p ~/.claude/skills && cp -r "$T/library/specializations/gpu-programming/skills/nsight-profiler" ~/.claude/skills/a5c-ai-babysitter-nsight-profiler && rm -rf "$T"
library/specializations/gpu-programming/skills/nsight-profiler/SKILL.md
You are nsight-profiler - a specialized skill for NVIDIA Nsight Systems and Nsight Compute profiling tools. This skill provides expert capabilities for performance analysis and optimization of GPU applications.
Overview
This skill enables AI-powered GPU profiling operations including:
- Configure and execute Nsight Systems profiling sessions
- Analyze Nsight Compute kernel reports
- Interpret occupancy metrics and SM utilization
- Parse and visualize roofline model data
- Identify memory bandwidth bottlenecks
- Analyze warp execution efficiency
- Generate optimization recommendations from profiler data
- Compare kernel performance across different configurations
Prerequisites
- NVIDIA Nsight Systems 2023.1+
- NVIDIA Nsight Compute 2023.1+
- CUDA Toolkit 11.0+
- GPU with compute capability 7.0+ (for full profiling features)
Capabilities
1. Nsight Systems Profiling
System-wide performance analysis:
```bash
# Basic system profile
nsys profile -o report ./cuda_program

# Profile with CUDA API tracing
nsys profile -t cuda,nvtx,osrt -o report ./cuda_program

# Capture GPU metrics
nsys profile --gpu-metrics-device=all -o report ./cuda_program

# Profile specific duration
nsys profile -d 10 -o report ./cuda_program

# Export to multiple formats
nsys export -t sqlite,json report.nsys-rep

# Generate summary statistics
nsys stats report.nsys-rep
```
2. Nsight Compute Profiling
Detailed kernel analysis:
```bash
# Profile all kernels
ncu -o profile ./cuda_program

# Profile specific kernel
ncu --kernel-name myKernel -o profile ./cuda_program

# Full metric collection
ncu --set full -o profile ./cuda_program

# Roofline analysis
ncu --set roofline -o profile ./cuda_program

# Memory analysis
ncu --section MemoryWorkloadAnalysis -o profile ./cuda_program

# Compare two runs
ncu --import baseline.ncu-rep --diff ./cuda_program
```
3. Occupancy Analysis
Analyze and optimize occupancy:
```bash
# Collect occupancy metrics
ncu --section Occupancy -o occupancy ./cuda_program

# Key metrics to analyze:
# - Achieved Occupancy
# - Theoretical Occupancy
# - Block Limit (registers, shared memory, warps)
# - Occupancy Limiter
```
```cuda
// Query occupancy in code
int numBlocks;
int blockSize = 256;
cudaOccupancyMaxActiveBlocksPerMultiprocessor(
    &numBlocks, myKernel, blockSize, sharedMemSize);

// deviceProp was filled by an earlier cudaGetDeviceProperties() call
float occupancy = (numBlocks * blockSize) /
                  (float)deviceProp.maxThreadsPerMultiProcessor;
printf("Theoretical Occupancy: %.2f%%\n", occupancy * 100);
```
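The limits ncu reports in its Occupancy section can also be estimated offline. Below is a minimal Python sketch; `theoretical_occupancy` is a hypothetical helper (not an Nsight or CUDA API), the default SM limits are illustrative, roughly Ampere-class values, and register-allocation granularity is ignored. Real numbers come from `cudaGetDeviceProperties` or the profiler itself.

```python
def theoretical_occupancy(block_size, regs_per_thread, smem_per_block,
                          max_threads_per_sm=2048, max_blocks_per_sm=32,
                          regs_per_sm=65536, smem_per_sm=102400):
    """Estimate per-SM theoretical occupancy and the limiting resource.

    SM limits are illustrative defaults; ignores register allocation
    granularity, so treat results as a rough guide, not ncu's numbers.
    """
    limits = {
        "threads": max_threads_per_sm // block_size,
        "blocks": max_blocks_per_sm,
        "registers": regs_per_sm // max(regs_per_thread * block_size, 1),
        "shared_mem": (smem_per_sm // smem_per_block
                       if smem_per_block else max_blocks_per_sm),
    }
    limiter = min(limits, key=limits.get)       # scarcest resource wins
    blocks = limits[limiter]
    occupancy = blocks * block_size / max_threads_per_sm
    return occupancy, limiter

occ, limiter = theoretical_occupancy(256, 64, 0)
print(f"occupancy={occ:.2f}, limited by {limiter}")
```

At 64 registers per thread the register file caps residency at 4 blocks of 256 threads, so occupancy drops to 0.5; this mirrors the "Block Limit (registers)" line in ncu's report.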
4. Roofline Model Analysis
Performance bound analysis:
```bash
# Generate roofline data
ncu --set roofline -o roofline ./cuda_program

# Key metrics:
# - Achieved FLOP/s
# - Achieved Memory Bandwidth
# - Arithmetic Intensity (FLOP/byte)
# - Ridge Point
```
Interpretation guide:
- Arithmetic intensity left of the ridge point (bandwidth slope is the ceiling): memory bound
- Arithmetic intensity right of the ridge point (compute roof is the ceiling): compute bound
- Achieved performance sitting on the roofline itself: optimal utilization for that intensity
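The ridge-point comparison above reduces to a few lines of arithmetic. A sketch, assuming you supply the peak figures yourself (e.g. from the device datasheet or ncu's SpeedOfLight section); `roofline_bound` is a hypothetical helper, not an Nsight API:

```python
def roofline_bound(ai_flop_per_byte, peak_gflops, peak_gbps):
    """Classify a kernel against a simple two-ceiling roofline.

    ai_flop_per_byte: measured arithmetic intensity (FLOP/byte)
    peak_gflops:      peak compute throughput (GFLOP/s)
    peak_gbps:        peak memory bandwidth (GB/s)
    """
    ridge = peak_gflops / peak_gbps                      # FLOP/byte where roofs meet
    attainable = min(peak_gflops, ai_flop_per_byte * peak_gbps)
    bound = "memory" if ai_flop_per_byte < ridge else "compute"
    return bound, attainable

# Matches the example report later in this document: AI 12.5, 8000 GFLOP/s peak
bound, roof = roofline_bound(12.5, peak_gflops=8000, peak_gbps=2000)
```

With a hypothetical 2000 GB/s device the ridge point is 4 FLOP/byte, so an intensity of 12.5 lands on the compute roof.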
5. Memory Bandwidth Analysis
Identify memory bottlenecks:
```bash
# Memory analysis sections
ncu --section MemoryWorkloadAnalysis \
    --section MemoryWorkloadAnalysis_Chart \
    --section MemoryWorkloadAnalysis_Tables \
    -o memory ./cuda_program
```
Key metrics:
- Global Load/Store Throughput
- L1/L2 Cache Hit Rate
- Shared Memory Bandwidth
- Memory Transactions per Request
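"Memory Transactions per Request" can be sanity-checked from raw counters: L1TEX moves data in 32-byte sectors, so a fully coalesced warp of 32 threads loading 4-byte words needs 4 sectors per request, while fully scattered access needs 32. A sketch; `coalescing_efficiency` is a hypothetical helper, and the counters correspond to the `l1tex__t_sectors_*`/`l1tex__t_requests_*` metrics used in Workflow 2 below:

```python
SECTOR_BYTES = 32   # L1TEX transaction granularity on recent NVIDIA GPUs

def coalescing_efficiency(sectors, requests, bytes_per_thread=4, warp_size=32):
    """Ratio of the ideal (fully coalesced) sectors/request to the measured one.

    1.0 means every sector fetched was fully used; lower values mean
    scattered access is wasting memory transactions.
    """
    ideal = max(1, warp_size * bytes_per_thread // SECTOR_BYTES)  # 4 for floats
    actual = sectors / requests
    return ideal / actual

# 32 sectors per request => every lane touched its own sector (worst case)
eff = coalescing_efficiency(sectors=3200, requests=100)
```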
6. Warp Execution Analysis
Analyze warp efficiency:
```bash
# Warp state analysis
ncu --section WarpStateStats -o warp ./cuda_program

# Scheduler statistics
ncu --section SchedulerStats -o scheduler ./cuda_program
```
Key metrics:
- Warp Cycles Per Issued Instruction
- Eligible Warps Per Active Cycle
- Active Warps Per Scheduler
- Stall Reasons (memory, sync, execution)
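To act on the stall reasons, rank them by their share of stalled warp cycles. A minimal sketch, assuming you have already extracted per-reason cycle counts (e.g. from a `--csv` export); the reason names and `stall_breakdown` helper are illustrative:

```python
def stall_breakdown(stall_cycles):
    """Rank warp stall reasons by share of total stalled cycles.

    stall_cycles: dict mapping reason name -> cycles, e.g. pulled
    from the WarpStateStats section of a CSV export.
    """
    total = sum(stall_cycles.values())
    shares = ((reason, cycles / total) for reason, cycles in stall_cycles.items())
    return sorted(shares, key=lambda kv: kv[1], reverse=True)

# Dominant long-scoreboard stalls typically point at memory latency
ranked = stall_breakdown({"long_scoreboard": 600, "barrier": 250,
                          "not_selected": 150})
```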
7. Kernel Comparison
Compare kernel variants:
```bash
# Baseline capture
ncu -o baseline ./program_v1

# Compare with new version
ncu --import baseline.ncu-rep --diff ./program_v2

# Generate comparison report
ncu --import baseline.ncu-rep \
    --import optimized.ncu-rep \
    --page diff --csv > comparison.csv
```
8. Performance Recommendations
Automated analysis:
```bash
# Get optimization recommendations
ncu --section SpeedOfLight \
    --section SpeedOfLight_RooflineChart \
    -o speedoflight ./cuda_program

# Export with recommendations
ncu --import profile.ncu-rep --page details --csv > details.csv
```
Common Profiling Workflows
Workflow 1: Initial Performance Assessment
```bash
# Step 1: System overview
nsys profile -t cuda -o system_overview ./program
nsys stats system_overview.nsys-rep

# Step 2: Identify hot kernels
ncu --launch-skip 10 --launch-count 5 -o hot_kernels ./program

# Step 3: Deep dive on bottleneck kernel
ncu --kernel-name hotKernel --set full -o detailed ./program
```
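Step 2 can also be driven from the SQLite export produced in Step 1 (`nsys export -t sqlite`). The schema assumed below (`CUPTI_ACTIVITY_KIND_KERNEL` joined to `StringIds` for kernel names) matches recent nsys exports but does change between versions, so treat this as a sketch and verify the tables with `.schema` first:

```python
import sqlite3

def top_kernels(db_path, n=5):
    """Rank kernels by total GPU time from an nsys SQLite export.

    Assumes the CUPTI_ACTIVITY_KIND_KERNEL / StringIds schema; verify
    against your nsys version before relying on it.
    """
    con = sqlite3.connect(db_path)
    rows = con.execute("""
        SELECT s.value AS name, SUM(k."end" - k."start") AS total_ns
        FROM CUPTI_ACTIVITY_KIND_KERNEL k
        JOIN StringIds s ON s.id = k.demangledName
        GROUP BY s.value
        ORDER BY total_ns DESC
        LIMIT ?""", (n,)).fetchall()
    con.close()
    return rows
```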
Workflow 2: Memory Optimization
```bash
# Analyze memory access patterns
ncu --section SourceCounters \
    --section MemoryWorkloadAnalysis \
    --kernel-name targetKernel \
    -o memory_analysis ./program

# Check for coalescing issues
ncu --metrics l1tex__t_sectors_pipe_lsu_mem_global_op_ld.sum,\
l1tex__t_requests_pipe_lsu_mem_global_op_ld.sum \
    -o coalescing ./program
```
Workflow 3: Occupancy Optimization
```bash
# Profile with occupancy focus
ncu --section Occupancy \
    --section LaunchStats \
    -o occupancy ./program

# Test different block sizes
for bs in 64 128 256 512 1024; do
    ncu --section Occupancy -o occ_$bs ./program --block-size $bs
done
```
Process Integration
This skill integrates with the following processes:
- Performance analysis workflow: `performance-profiling-analysis.js`
- Occupancy optimization: `occupancy-optimization.js`
- Warp efficiency: `warp-efficiency-optimization.js`
- Memory optimization: `gpu-memory-optimization.js`
Output Format
When executing operations, provide structured output:
```json
{
  "operation": "kernel-profile",
  "tool": "nsight-compute",
  "kernel": "matrixMultiply",
  "metrics": {
    "duration_us": 125.4,
    "achieved_occupancy": 0.78,
    "theoretical_occupancy": 1.0,
    "compute_throughput_pct": 65.2,
    "memory_throughput_pct": 89.3,
    "roofline": {
      "arithmetic_intensity": 12.5,
      "achieved_gflops": 4500,
      "peak_gflops": 8000,
      "bound": "compute"
    }
  },
  "recommendations": [
    "Increase block size to improve occupancy",
    "Consider loop unrolling to reduce instruction overhead"
  ],
  "artifacts": ["profile.ncu-rep", "summary.csv"]
}
```
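The `recommendations` field can be produced by simple rules over the metrics dict. A hedged sketch; the thresholds and messages are illustrative, not ncu's own heuristics, and `recommend` is a hypothetical helper:

```python
def recommend(metrics):
    """Derive coarse optimization hints from a kernel-profile metrics dict."""
    hints = []
    if metrics["achieved_occupancy"] < 0.5:
        hints.append("Low occupancy: revisit block size and register usage")
    if (metrics["memory_throughput_pct"] > 80
            and metrics["compute_throughput_pct"] < 70):
        hints.append("Memory throughput dominant: check coalescing and data reuse")
    if metrics.get("roofline", {}).get("bound") == "compute":
        hints.append("Compute bound: consider reduced precision or more ILP")
    return hints

# Feed it the metrics object from the schema above
hints = recommend({
    "achieved_occupancy": 0.78,
    "compute_throughput_pct": 65.2,
    "memory_throughput_pct": 89.3,
    "roofline": {"bound": "compute"},
})
```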
Dependencies
- Nsight Systems 2023.1+
- Nsight Compute 2023.1+
- CUDA Toolkit 11.0+
Constraints
- Full profiling requires root/admin privileges
- Some metrics only available on specific GPU architectures
- Profiling adds overhead; results may differ from production
- Nsight Compute profiles one kernel invocation at a time by default