Claude-skill-registry analyze-simd-usage
Analyze SIMD usage opportunities in Mojo code. Use to find performance optimization opportunities.
install
source · Clone the upstream repo
git clone https://github.com/majiayu000/claude-skill-registry
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/analyze-simd-usage" ~/.claude/skills/majiayu000-claude-skill-registry-analyze-simd-usage && rm -rf "$T"
manifest:
skills/data/analyze-simd-usage/SKILL.md
Analyze SIMD Usage Opportunities
Identify where SIMD (Single Instruction Multiple Data) can improve performance.
When to Use
- Performance-critical tensor operations
- Element-wise operations on large arrays
- Vectorization of loops processing multiple elements
- Optimizing matrix/vector operations
- Finding performance bottlenecks in ML code
Quick Reference
```bash
# Find loops processing arrays/tensors
grep -n "for.*in.*range\|@unroll\|@vectorize" *.mojo

# Find element-wise operations
grep -n "\.load\|\.store\|\.broadcast" *.mojo

# Check for SIMD parameters
grep -n "simd_width\|nelems\|\[.*:\]" *.mojo

# Identify candidates
grep -n "for i in range.*:" -A 10 *.mojo | grep -E "array\[i\]|tensor\[i\]"
```
SIMD Optimization Opportunities
Vectorizable Patterns:
- ✅ Element-wise addition: `a[i] + b[i]` for all i (contrasted in the sketch below)
- ✅ Scalar multiplication: `a[i] * scalar` for all i
- ✅ Unary operations: `exp(a[i])`, `sin(a[i])` for all i
- ✅ Reduction operations: sum, max, min over array
- ❌ Dependent iterations: `a[i] = a[i-1] + value` (sequential)
- ❌ Conditional branches: `if a[i] > threshold:` (hard to vectorize)
- ❌ Function calls: unpredictable latency (avoid in tight loops)
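As a quick illustration of the difference, here is a minimal Mojo sketch (the names `scale_elementwise`, `prefix_sum`, `data`, and `out` are hypothetical, and it assumes a recent Mojo toolchain with the stdlib `List` type): the first loop is vectorizable because iteration i touches only index i, while the second carries `out[i - 1]` into iteration i and must therefore run sequentially.

```mojo
fn scale_elementwise(data: List[Float32], scalar: Float32) -> List[Float32]:
    # Vectorizable: each iteration reads/writes only its own index
    var out = List[Float32](capacity=len(data))
    for i in range(len(data)):
        out.append(data[i] * scalar)
    return out

fn prefix_sum(data: List[Float32]) -> List[Float32]:
    # NOT vectorizable as written: out[i] depends on out[i - 1]
    var out = List[Float32](capacity=len(data))
    for i in range(len(data)):
        if i == 0:
            out.append(data[0])
        else:
            out.append(out[i - 1] + data[i])
    return out
```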
SIMD Width Selection:
- `@parameter fn[simd_width: Int]` - generic SIMD width
- `simd_width=4` - typically good for float32
- `simd_width=8` - optimal for many operations
- `simd_width=16+` - for int32 or specialized ops
- Match hardware capabilities (AVX2=4-8, AVX512=8-16); the native width can be queried as sketched below
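Rather than hard-coding a width, the native lane count can be queried per element type. A minimal sketch follows, assuming `simdwidthof` is importable from `sys.info` as in recent Mojo releases (some versions re-export it from `sys`).

```mojo
from sys.info import simdwidthof  # some Mojo releases re-export this from `sys`

fn main():
    # Native SIMD lane counts on the build machine, e.g. 8 float32 lanes on
    # AVX2 (256-bit), 16 on AVX-512 (512-bit), 4 on 128-bit NEON/SSE.
    print("float32 lanes:", simdwidthof[DType.float32]())
    print("float64 lanes:", simdwidthof[DType.float64]())
    print("int32 lanes:", simdwidthof[DType.int32]())
```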
Vectorization Patterns:
- ✅ `@vectorize` decorator for simple loops
- ✅ `@unroll` for small loops (2-4 iterations), as sketched below
- ✅ Manual SIMD with `.load[]` and `.store[]`
- ✅ Tensor operations with SIMD dimensions
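For the small fixed-trip-count case, a hedged unrolling sketch is shown below. Note that the spelling has changed across Mojo versions (older releases placed an `@unroll` decorator above the loop, newer ones write `@parameter for`), and `accumulate4` is a hypothetical helper, not part of any library.

```mojo
fn accumulate4(v: SIMD[DType.float32, 4]) -> Float32:
    var total: Float32 = 0

    # Unrolled at compile time; older Mojo toolchains wrote `@unroll`
    # directly above an ordinary `for` loop instead.
    @parameter
    for lane in range(4):
        total += v[lane]
    return total

fn main():
    var v = SIMD[DType.float32, 4](1.0, 2.0, 3.0, 4.0)
    print(accumulate4(v))  # 10.0
```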
Analysis Workflow
- Profile code: Identify bottlenecks using time/memory metrics
- Find loops: Locate loops processing large amounts of data
- Check vectorizability: Verify no loop-carried dependencies
- Estimate speedup: SIMD could provide 4-16x improvement
- Implement SIMD: Use @vectorize, @unroll, or manual SIMD
- Measure performance: Verify improvement with benchmarks (see the benchmarking sketch after this list)
- Document changes: Note what was optimized and why
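The "Measure performance" step above is what decides whether an optimization stays. Below is a minimal benchmarking sketch, assuming the stdlib `benchmark` module's `run` and `keep` behave as in recent Mojo releases; `scalar_work` is a hypothetical stand-in for the kernel under test.

```mojo
import benchmark
from benchmark import keep

fn scalar_work():
    # Hypothetical stand-in for the scalar or SIMD kernel being compared.
    var total: Float32 = 0
    for i in range(1_000_000):
        total += Float32(i) * 0.5
    keep(total)  # stop the optimizer from deleting the unused result

fn main():
    var report = benchmark.run[scalar_work]()
    print("mean seconds per run:", report.mean())
```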
Output Format
Report SIMD analysis with:
- Hotspots - Functions/loops using most CPU time
- Vectorization Potential - Operations that could use SIMD
- Estimated Speedup - Expected performance improvement
- Implementation Priority - High/medium/low impact
- Technical Approach - How to implement SIMD
- Risks - Potential issues with vectorization
- Recommendations - Which optimizations to pursue first
Optimization Examples
Example 1: Element-wise Addition
```mojo
from algorithm import vectorize
from sys.info import simdwidthof

alias nelts = simdwidthof[DType.float32]()  # native lanes for float32

# Before: scalar loop
fn add_scalar(a: Tensor, b: Tensor) -> Tensor:
    var result = Tensor(a.shape)
    for i in range(a.num_elements()):
        result._data[i] = a._data[i] + b._data[i]
    return result

# After: vectorized (4x-8x speedup typical; exact load/store/vectorize
# signatures vary across Mojo releases)
fn add_vectorized(a: Tensor, b: Tensor) -> Tensor:
    var result = Tensor(a.shape)

    @parameter
    fn add_simd[simd_width: Int](i: Int):
        result._data.store[simd_width](
            i, a._data.load[simd_width](i) + b._data.load[simd_width](i)
        )

    vectorize[add_simd, nelts](a.num_elements())
    return result
```
Example 2: Reduction (Sum)
```mojo
# Before: scalar loop
fn sum_scalar(tensor: Tensor) -> Float32:
    var total: Float32 = 0
    for i in range(tensor.num_elements()):
        total += tensor._data[i]
    return total

# After: SIMD reduction (load/reduce_add signatures vary across Mojo releases)
fn sum_simd[simd_width: Int](tensor: Tensor) -> Float32:
    # Process simd_width elements at a time, then reduce the partial sums once
    var acc = SIMD[DType.float32, simd_width](0)
    var n = tensor.num_elements()
    var i = 0
    while i + simd_width <= n:
        acc += tensor._data.load[simd_width](i)
        i += simd_width
    var total = acc.reduce_add()
    while i < n:  # scalar tail
        total += tensor._data[i]
        i += 1
    return total
```
Error Handling
| Problem | Solution |
|---|---|
| Vectorization causes wrong results | Check for loop-carried dependencies |
| Segmentation fault with SIMD | Verify alignment and bounds |
| Minimal speedup | May not be vectorizable, profile to confirm |
| Complex logic | Break into simpler vectorizable operations |
| Type mismatches | Ensure SIMD width compatible with element type |
SIMD Decision Tree
- Does loop process large arrays? → YES → Check vectorizability
- Loop-carried dependencies? → YES → Can't vectorize, optimize differently
- Simple operations on many elements? → YES → Use @vectorize or @unroll
- Critical path (hot loop)? → YES → Worth optimizing
- Implement → Measure → Iterate
References
- See mojo-simd-optimize for implementation guidance
- See CLAUDE.md for SIMD code patterns
- See performance section in module documentation