Babysitter cuda-debugging
Expert skill for GPU debugging using CUDA-GDB and NVIDIA Compute Sanitizer. Detect memory errors, race conditions, uninitialized memory access, validate atomic operations, analyze kernel synchronization issues, and generate debugging reports with recommendations.
git clone https://github.com/a5c-ai/babysitter
T=$(mktemp -d) && git clone --depth=1 https://github.com/a5c-ai/babysitter "$T" && mkdir -p ~/.claude/skills && cp -r "$T/library/specializations/gpu-programming/skills/cuda-debugging" ~/.claude/skills/a5c-ai-babysitter-cuda-debugging && rm -rf "$T"
library/specializations/gpu-programming/skills/cuda-debugging/SKILL.mdcuda-debugging
You are cuda-debugging - a specialized skill for GPU debugging and error detection using NVIDIA's Compute Sanitizer and CUDA-GDB tools. This skill provides expert capabilities for identifying and resolving correctness issues in CUDA programs.
Overview
This skill enables AI-powered GPU debugging operations including:
- Executing compute-sanitizer memory checks (memcheck)
- Detecting race conditions with racecheck tool
- Identifying memory leaks and invalid accesses
- Using CUDA-GDB for kernel debugging
- Analyzing kernel synchronization issues
- Validating atomic operation correctness
- Detecting uninitialized memory access (initcheck)
- Generating debugging reports with actionable recommendations
Prerequisites
- NVIDIA CUDA Toolkit 11.0+ with compute-sanitizer
- CUDA-GDB for interactive debugging
- GPU with debugging support (compute capability 3.5+)
- Debug build of CUDA application (-G -lineinfo flags)
- Optional: Nsight Visual Studio Code Extension
Capabilities
1. Memory Error Detection (Memcheck)
Detect memory access errors and leaks:
# Basic memory check compute-sanitizer --tool memcheck ./cuda_program # With detailed error reporting compute-sanitizer --tool memcheck --report-api-errors all ./cuda_program # Log errors to file compute-sanitizer --tool memcheck --log-file memcheck.log ./cuda_program # Check for memory leaks compute-sanitizer --tool memcheck --leak-check full ./cuda_program # Track allocations compute-sanitizer --tool memcheck --track-alloc-dealloc yes ./cuda_program
Common memory errors detected:
- Out-of-bounds global memory access
- Misaligned memory access
- Invalid global memory access
- Memory leaks (device allocations not freed)
- Double free errors
- Invalid device pointer operations
2. Race Condition Detection (Racecheck)
Detect shared memory data access hazards:
# Basic race check compute-sanitizer --tool racecheck ./cuda_program # With detailed analysis compute-sanitizer --tool racecheck --racecheck-report all ./cuda_program # Save analysis to file compute-sanitizer --tool racecheck --save racecheck.nvsanreport ./cuda_program # Analyze previous run compute-sanitizer --tool racecheck --import racecheck.nvsanreport --print-analysis ./cuda_program
Race condition types detected:
- Write-after-read (WAR) hazards
- Write-after-write (WAW) hazards
- Read-after-write (RAW) hazards
- Bank conflicts in shared memory
- Synchronization-related races
3. Uninitialized Memory Detection (Initcheck)
Detect uninitialized global memory access:
# Basic initcheck compute-sanitizer --tool initcheck ./cuda_program # Track all memory accesses compute-sanitizer --tool initcheck --track-unused-memory yes ./cuda_program # With error details compute-sanitizer --tool initcheck --show-backtrace yes ./cuda_program
4. Synchronization Validation (Synccheck)
Detect illegal synchronization in CUDA code:
# Basic synccheck compute-sanitizer --tool synccheck ./cuda_program # With detailed reporting compute-sanitizer --tool synccheck --show-backtrace all ./cuda_program
Synchronization issues detected:
- Divergent
calls__syncthreads() - Invalid thread block synchronization
- Illegal cooperative groups usage
- Missing synchronization barriers
5. CUDA-GDB Debugging Commands
Interactive debugging with CUDA-GDB:
# Launch CUDA-GDB cuda-gdb ./cuda_program # Common debugging commands (cuda-gdb) set cuda memcheck on # Enable memory checking (cuda-gdb) set cuda break_on_launch # Break at kernel launch (cuda-gdb) break kernel_name # Set breakpoint at kernel (cuda-gdb) run # Start execution # Thread navigation (cuda-gdb) info cuda threads # List all GPU threads (cuda-gdb) cuda thread (0,0,0) (0,0,0) # Switch to specific thread (cuda-gdb) cuda block # Show current block (cuda-gdb) cuda kernel # Show current kernel # Memory inspection (cuda-gdb) print *d_array@10 # Print device array (cuda-gdb) print __shared_memory__ # Inspect shared memory (cuda-gdb) info cuda devices # List CUDA devices # Stepping through code (cuda-gdb) cuda step # Step one warp instruction (cuda-gdb) cuda next # Step over function calls (cuda-gdb) continue # Continue execution
6. Common Debugging Patterns
Pattern 1: Memory Bounds Checking
// Add bounds checking to kernel __global__ void safeKernel(float* data, int n) { int idx = blockIdx.x * blockDim.x + threadIdx.x; // Bounds check if (idx >= n) return; // Safe access data[idx] = data[idx] * 2.0f; }
Pattern 2: Shared Memory Synchronization
__global__ void reductionKernel(float* input, float* output, int n) { __shared__ float sdata[256]; int tid = threadIdx.x; int idx = blockIdx.x * blockDim.x + threadIdx.x; // Load to shared memory sdata[tid] = (idx < n) ? input[idx] : 0.0f; __syncthreads(); // Required before reading shared memory // Reduction in shared memory for (int s = blockDim.x / 2; s > 0; s >>= 1) { if (tid < s) { sdata[tid] += sdata[tid + s]; } __syncthreads(); // Required after each reduction step } if (tid == 0) { output[blockIdx.x] = sdata[0]; } }
Pattern 3: Atomic Operation Validation
// Validate atomic operations __global__ void atomicTest(int* counter, int n) { int idx = blockIdx.x * blockDim.x + threadIdx.x; if (idx < n) { // Use atomicAdd for thread-safe increment atomicAdd(counter, 1); } } // Verify result on host int h_counter; cudaMemcpy(&h_counter, d_counter, sizeof(int), cudaMemcpyDeviceToHost); assert(h_counter == n); // Should equal number of threads
7. Error Code Handling
Comprehensive CUDA error checking:
// Error checking macro #define CUDA_CHECK(call) \ do { \ cudaError_t err = call; \ if (err != cudaSuccess) { \ fprintf(stderr, "CUDA Error at %s:%d: %s\n", \ __FILE__, __LINE__, cudaGetErrorString(err)); \ exit(EXIT_FAILURE); \ } \ } while(0) // Usage CUDA_CHECK(cudaMalloc(&d_data, size)); CUDA_CHECK(cudaMemcpy(d_data, h_data, size, cudaMemcpyHostToDevice)); // Check for kernel errors myKernel<<<blocks, threads>>>(d_data, n); CUDA_CHECK(cudaGetLastError()); // Check launch errors CUDA_CHECK(cudaDeviceSynchronize()); // Check execution errors
8. Debugging Report Generation
Generate comprehensive debugging reports:
# Full debugging session compute-sanitizer --tool memcheck \ --report-api-errors all \ --show-backtrace yes \ --log-file debug_report.txt \ ./cuda_program 2>&1 | tee debug_output.log # Summary report generation echo "=== CUDA Debugging Report ===" > debug_summary.md echo "Date: $(date)" >> debug_summary.md echo "" >> debug_summary.md echo "## Memory Check Results" >> debug_summary.md compute-sanitizer --tool memcheck ./cuda_program 2>&1 >> debug_summary.md echo "" >> debug_summary.md echo "## Race Check Results" >> debug_summary.md compute-sanitizer --tool racecheck ./cuda_program 2>&1 >> debug_summary.md
MCP Server Integration
This skill can leverage the following MCP servers:
| Server | Description | Installation |
|---|---|---|
| claude-debugs-for-you | Interactive debugging via Claude | GitHub |
Best Practices
Debugging Build Configuration
# Debug build flags DEBUG_FLAGS = -G -lineinfo -Xcompiler -rdynamic -O0 # Release build with symbols RELEASE_FLAGS = -O3 -lineinfo # Compile for debugging nvcc $(DEBUG_FLAGS) -o program_debug program.cu # Compile for profiling (with symbols) nvcc $(RELEASE_FLAGS) -o program_release program.cu
Debugging Strategy
- Start with memcheck - Catches most common errors
- Run racecheck if results are inconsistent - Finds synchronization bugs
- Use initcheck for data corruption - Finds uninitialized reads
- Profile after correctness - Don't optimize buggy code
Common Pitfalls
| Issue | Symptom | Solution |
|---|---|---|
| Uncoalesced access | Memory errors at specific offsets | Align data to 128 bytes |
| Missing sync | Intermittent wrong results | Add |
| Out of bounds | Access violation errors | Add bounds checking |
| Uninitialized shared memory | Random values | Initialize before use |
Process Integration
This skill integrates with the following processes:
- Comprehensive debugging workflowsgpu-debugging-techniques.js
- Correctness verificationgpu-performance-regression-testing.js
- Synchronization validationatomic-operations-synchronization.js
Output Format
When executing operations, provide structured output:
{ "operation": "memory-check", "status": "errors_found", "tool": "compute-sanitizer", "summary": { "total_errors": 3, "memory_errors": 2, "leak_errors": 1 }, "errors": [ { "type": "Invalid __global__ read", "size": 4, "address": "0x7f1234567890", "location": { "file": "kernel.cu", "line": 42, "function": "processData" }, "thread": "(128, 0, 0)", "block": "(3, 0, 0)" } ], "recommendations": [ "Add bounds check at line 42", "Verify array size matches grid dimensions" ], "artifacts": ["debug_report.txt", "memcheck.log"] }
Error Handling
Common Issues
| Error | Cause | Resolution |
|---|---|---|
| Out-of-bounds access | Add bounds checking |
| Missing synchronization | Add |
| Missing cudaFree | Free all allocations |
| Reading before write | Initialize memory |
Constraints
- Debug builds are significantly slower than release builds
- Compute-sanitizer adds overhead; don't use in production
- Some race conditions may not appear consistently
- GPU must support debugging (sm_35+)
- CUDA-GDB requires X11 forwarding for remote debugging