Skillsbench python-parallelization
Transform sequential Python code into parallel/concurrent implementations. Use when asked to parallelize Python code, improve code performance through concurrency, convert loops to parallel execution, or identify parallelization opportunities. Handles CPU-bound (multiprocessing), I/O-bound (asyncio, threading), and data-parallel (vectorization) scenarios.
install
source · Clone the upstream repo
git clone https://github.com/benchflow-ai/skillsbench
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/benchflow-ai/skillsbench "$T" && mkdir -p ~/.claude/skills && cp -r "$T/tasks/parallel-tfidf-search/environment/skills/python-parallelization" ~/.claude/skills/benchflow-ai-skillsbench-python-parallelization && rm -rf "$T"
manifest:
tasks/parallel-tfidf-search/environment/skills/python-parallelization/SKILL.mdsource content
Python Parallelization Skill
Transform sequential Python code to leverage parallel and concurrent execution patterns.
Workflow
- Analyze the code to identify parallelization candidates
- Classify the workload type (CPU-bound, I/O-bound, or data-parallel)
- Select the appropriate parallelization strategy
- Transform the code with proper synchronization and error handling
- Verify correctness and measure expected speedup
Parallelization Decision Tree
Is the bottleneck CPU-bound or I/O-bound? CPU-bound (computation-heavy): ├── Independent iterations? → multiprocessing.Pool / ProcessPoolExecutor ├── Shared state needed? → multiprocessing with Manager or shared memory ├── NumPy/Pandas operations? → Vectorization first, then consider numba/dask └── Large data chunks? → chunked processing with Pool.map I/O-bound (network, disk, database): ├── Many independent requests? → asyncio with aiohttp/aiofiles ├── Legacy sync code? → ThreadPoolExecutor ├── Mixed sync/async? → asyncio.to_thread() └── Database queries? → Connection pooling + async drivers Data-parallel (array/matrix ops): ├── NumPy arrays? → Vectorize, avoid Python loops ├── Pandas DataFrames? → Use built-in vectorized methods ├── Large datasets? → Dask for out-of-core parallelism └── GPU available? → Consider CuPy or JAX
Transformation Patterns
Pattern 1: Loop to ProcessPoolExecutor (CPU-bound)
Before:
results = [] for item in items: results.append(expensive_computation(item))
After:
from concurrent.futures import ProcessPoolExecutor with ProcessPoolExecutor() as executor: results = list(executor.map(expensive_computation, items))
Pattern 2: Sequential I/O to Async (I/O-bound)
Before:
import requests def fetch_all(urls): return [requests.get(url).json() for url in urls]
After:
import asyncio import aiohttp async def fetch_all(urls): async with aiohttp.ClientSession() as session: tasks = [fetch_one(session, url) for url in urls] return await asyncio.gather(*tasks) async def fetch_one(session, url): async with session.get(url) as response: return await response.json()
Pattern 3: Nested Loops to Vectorization
Before:
result = [] for i in range(len(a)): row = [] for j in range(len(b)): row.append(a[i] * b[j]) result.append(row)
After:
import numpy as np result = np.outer(a, b)
Pattern 4: Mixed CPU/IO with asyncio
import asyncio from concurrent.futures import ProcessPoolExecutor async def hybrid_pipeline(data, urls): loop = asyncio.get_event_loop() # CPU-bound in process pool with ProcessPoolExecutor() as pool: processed = await loop.run_in_executor(pool, cpu_heavy_fn, data) # I/O-bound with async results = await asyncio.gather(*[fetch(url) for url in urls]) return processed, results
Parallelization Candidates
Look for these patterns in code:
| Pattern | Indicator | Strategy |
|---|---|---|
with independent iterations | No shared mutation | / |
Multiple or file reads | Sequential I/O | |
| Nested loops over arrays | Numerical computation | NumPy vectorization |
or blocking waits | Waiting on external | Threading or async |
| Large list comprehensions | Independent transforms | with chunking |
Safety Requirements
Always preserve correctness when parallelizing:
- Identify shared state - variables modified across iterations break parallelism
- Check dependencies - iteration N depending on N-1 requires sequential execution
- Handle exceptions - wrap parallel code in try/except, use
for granular error handlingexecutor.submit() - Manage resources - use context managers, limit worker count to avoid exhaustion
- Preserve ordering - use
overmap()
when order matterssubmit()
Common Pitfalls
- GIL trap: Threading doesn't help CPU-bound Python code—use multiprocessing
- Pickle failures: Lambda functions and nested classes can't be pickled for multiprocessing
- Memory explosion: ProcessPoolExecutor copies data to each process—use shared memory for large data
- Async in sync: Can't just add
to existing code—requires restructuring call chainasync - Over-parallelization: Parallel overhead exceeds gains for small workloads (<1000 items typically)
Verification Checklist
Before finalizing transformed code:
- Output matches sequential version for test inputs
- No race conditions (shared mutable state properly synchronized)
- Exceptions are caught and handled appropriately
- Resources are properly cleaned up (pools closed, connections released)
- Worker count is bounded (default or explicit limit)
- Added appropriate imports