Claude-skill-registry gpu-parallel-scheduling

GPU-safe parallel processing patterns for KINTSUGI to prevent OOM crashes and ensure Jupyter-compatible progress output

install
source · Clone the upstream repo
git clone https://github.com/majiayu000/claude-skill-registry
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/gpu-parallel-scheduling" ~/.claude/skills/majiayu000-claude-skill-registry-gpu-parallel-scheduling && rm -rf "$T"
manifest: skills/data/gpu-parallel-scheduling/SKILL.md
source content

GPU Parallel Scheduling - Research Notes

Experiment Overview

| Item | Details |
| --- | --- |
| Date | 2025-12-15 |
| Goal | Fix kernel crashes during parallel GPU processing in Notebook 2 |
| Environment | KINTSUGI pipeline, CuPy, multi-GPU HPC, Jupyter notebooks |
| Status | Success |

Context

Notebook 2 (Cycle Processing) was crashing after Channel 1 completed when attempting to process remaining channels in parallel. The crash occurred immediately after the message "[PARALLEL] Processing channels [2, 3, 4] across GPUs..." with no further output.

Root Cause Analysis

Problem 1: Nested ThreadPoolExecutor Causing GPU Thread Explosion

The code structure was:

# OUTER: parallel channel processing
with ThreadPoolExecutor(max_workers=len(GPU_DEVICE_IDS)) as executor:
    for ch in channels:
        # INNER: each submitted channel spawns its own z-plane workers
        executor.submit(process_channel_zplanes_parallel, ch, ...)

# Inside process_channel_zplanes_parallel:
with ThreadPoolExecutor(max_workers=ZPLANES_PER_GPU) as inner_executor:
    # Process z-planes in parallel
    ...

With 2 GPUs, 3 channels, and ZPLANES_PER_GPU=4:

  • Channel assignment: CH2→GPU0, CH3→GPU1, CH4→GPU0 (round-robin)
  • GPU0 gets CH2 + CH4 simultaneously
  • Each channel spawns 4 z-plane workers
  • GPU0 runs 2 × 4 = 8 concurrent BaSiC corrections → OOM CRASH
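The overload is easy to reproduce arithmetically. A minimal, GPU-free sketch of the round-robin pre-assignment (values taken from the scenario above; no KINTSUGI code involved):

```python
from itertools import cycle

GPU_DEVICE_IDS = [0, 1]   # 2 GPUs
channels = [2, 3, 4]      # 3 remaining channels
ZPLANES_PER_GPU = 4       # z-plane workers spawned per channel

# Round-robin pre-assignment, as in the failing code path
assignment = dict(zip(channels, cycle(GPU_DEVICE_IDS)))
# -> {2: 0, 3: 1, 4: 0}: GPU0 gets CH2 and CH4 simultaneously

# Concurrent z-plane workers landing on each GPU
workers = {}
for ch, gpu in assignment.items():
    workers[gpu] = workers.get(gpu, 0) + ZPLANES_PER_GPU

print(workers)  # -> {0: 8, 1: 4}: GPU0 carries 8 concurrent corrections
```

The static assignment never checks whether a GPU is already busy, which is exactly what the queue-based allocation below avoids.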

Problem 2: Jupyter Thread Output Suppression

Print statements from worker threads don't appear in Jupyter notebook output until the cell completes (or never). This made debugging difficult as processing appeared to stall.
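A Jupyter-safe alternative is to make workers silent and report through return values, so all printing happens on the main thread. A minimal sketch (generic worker, not the KINTSUGI function):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def worker(item):
    # Work happens silently; return a status string instead of printing,
    # because print() inside a worker thread may never reach Jupyter output.
    return f"item {item} done"

results = []
with ThreadPoolExecutor(max_workers=2) as executor:
    futures = [executor.submit(worker, i) for i in range(4)]
    for future in as_completed(futures):
        results.append(future.result())
        print(results[-1])  # main thread: reliably visible in Jupyter
```

This is the same pattern the verified workflow below uses for per-channel progress.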

Problem 3: Missing GPU Memory Cleanup Between Channels

GPU memory from Channel 1 wasn't being freed before Channel 2 started, causing cumulative memory pressure.

Verified Workflow

Solution: Queue-Based GPU Allocation

Use a GPU queue to ensure exactly 1 channel per GPU at any time:

import gc
import queue
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

# Create pool of available GPUs
gpu_queue = queue.Queue()
for dev_id in GPU_DEVICE_IDS:
    gpu_queue.put(dev_id)

def process_channel_with_gpu(ch):
    """Process channel, acquiring and releasing GPU from queue."""
    # ACQUIRE: Block until a GPU is available
    dev_id = gpu_queue.get()
    ch_start = time.time()

    try:
        process_channel_zplanes_parallel(
            ...,
            device_id=dev_id,
            zplanes_per_gpu=ZPLANES_PER_GPU,
            ...
        )
        elapsed = time.time() - ch_start

        # GPU cleanup before releasing
        try:
            import cupy as cp
            with cp.cuda.Device(dev_id):
                cp.get_default_memory_pool().free_all_blocks()
                cp.get_default_pinned_memory_pool().free_all_blocks()
        except Exception:
            pass
        gc.collect()

        return (ch, dev_id, elapsed, None)
    except Exception as e:
        return (ch, dev_id, time.time() - ch_start, str(e))
    finally:
        # RELEASE: Return GPU to pool for next channel
        gpu_queue.put(dev_id)

# Process with max_workers = number of GPUs
with ThreadPoolExecutor(max_workers=len(GPU_DEVICE_IDS)) as executor:
    futures = {executor.submit(process_channel_with_gpu, ch): ch
               for ch in remaining_channels}

    # Main thread prints progress (Jupyter-compatible)
    for future in as_completed(futures):
        ch, dev_id, elapsed, error = future.result()
        if error:
            log(f"  [GPU{dev_id}] Channel {ch} ERROR: {error}")
        else:
            log(f"  [GPU{dev_id}] Channel {ch} COMPLETE ({elapsed:.1f}s)")

Key Principles

  1. max_workers = n_gpus: Never more concurrent channels than GPUs
  2. Queue acquisition: Each channel blocks until it gets a GPU
  3. Queue release in finally: GPU always returns to pool, even on error
  4. GPU cleanup before release: Free memory before another channel uses the GPU
  5. Main-thread progress: Use as_completed() to print from the main thread
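These principles can be checked with a small GPU-free simulation: the queue guarantees that no more than len(GPU_IDS) workers ever hold a device at once. The sleep stands in for GPU work; everything else is assumption-free stdlib code:

```python
import queue
import threading
import time
from concurrent.futures import ThreadPoolExecutor

GPU_IDS = [0, 1]              # hypothetical 2-GPU pool
gpu_queue = queue.Queue()
for dev in GPU_IDS:
    gpu_queue.put(dev)

in_use = 0                    # GPUs currently held
peak = 0                      # highest concurrent holders observed
lock = threading.Lock()

def process(ch):
    global in_use, peak
    dev = gpu_queue.get()     # blocks until a GPU is free
    try:
        with lock:
            in_use += 1
            peak = max(peak, in_use)
        time.sleep(0.05)      # stand-in for per-channel GPU work
    finally:
        with lock:
            in_use -= 1
        gpu_queue.put(dev)    # always return the GPU, even on error

with ThreadPoolExecutor(max_workers=len(GPU_IDS)) as ex:
    list(ex.map(process, [2, 3, 4, 5, 6]))  # 5 channels, 2 GPUs

print(peak)  # never exceeds len(GPU_IDS)
```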

Failed Attempts (Critical)

| Attempt | Why it Failed | Lesson Learned |
| --- | --- | --- |
| Parallel channels with iter_cycle(GPU_DEVICE_IDS) | Pre-assigned GPUs didn't prevent multiple channels on the same GPU | Need dynamic GPU allocation, not static assignment |
| max_workers=len(GPU_DEVICE_IDS) without queue | Channels could start on any GPU regardless of assignment | Queue ensures a 1:1 GPU:channel mapping |
| Printing progress from worker threads | Output lost in Jupyter (thread stdout not captured) | Always print from the main thread using as_completed() |
| GPU cleanup only at end of cell | Memory exhausted before the cell completes | Clean up after EACH channel, before releasing the GPU |
| Nested ThreadPoolExecutor | Thread explosion: outer × inner workers | Inner parallelism is OK; the outer level must be limited to n_gpus |
| Adding a status monitoring thread | Monitoring thread output also lost in Jupyter | Only main-thread output is reliable in notebooks |

Final Parameters

GPU Queue Pattern

# Initialize
gpu_queue = queue.Queue()
for dev_id in GPU_DEVICE_IDS:
    gpu_queue.put(dev_id)

# In worker
dev_id = gpu_queue.get()  # Blocks until available
try:
    # ... GPU work ...
finally:
    gpu_queue.put(dev_id)  # Always return

Progress Output Pattern (Jupyter-safe)

with ThreadPoolExecutor(max_workers=n_gpus) as executor:
    futures = {executor.submit(work_fn, item): item for item in items}

    for future in as_completed(futures):
        result = future.result()
        print(f"Completed: {result}")  # Main thread - visible in Jupyter

GPU Cleanup Pattern

try:
    import cupy as cp
    with cp.cuda.Device(device_id):
        cp.get_default_memory_pool().free_all_blocks()
        cp.get_default_pinned_memory_pool().free_all_blocks()
except Exception:
    pass
gc.collect()

Key Insights

  • CuPy is not thread-safe for concurrent operations on the same GPU
  • Jupyter suppresses thread output - always use main thread for progress
  • Queue > static assignment for dynamic GPU allocation
  • Cleanup before release prevents memory accumulation
  • Inner parallelism is safe (z-planes on single GPU) when outer is controlled
  • 1 channel per GPU rule prevents all OOM issues from parallel processing

When to Apply This Pattern

  • Multi-GPU parallel processing in Jupyter notebooks
  • Any CuPy-based batch processing with parallelism
  • When kernel crashes after first item completes in parallel loop
  • When progress output disappears in parallel processing
  • When OOM occurs despite having enough total GPU memory
