Claude-skill-registry analyze-simd-usage

Analyze SIMD usage opportunities in Mojo code. Use this skill to find places where vectorization can improve performance.

Install

Source · Clone the upstream repo:

git clone https://github.com/majiayu000/claude-skill-registry

Claude Code · Install into ~/.claude/skills/:

T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/analyze-simd-usage" ~/.claude/skills/majiayu000-claude-skill-registry-analyze-simd-usage && rm -rf "$T"

Manifest: skills/data/analyze-simd-usage/SKILL.md

Source Content

Analyze SIMD Usage Opportunities

Identify where SIMD (Single Instruction Multiple Data) can improve performance.

When to Use

  • Performance-critical tensor operations
  • Element-wise operations on large arrays
  • Vectorization of loops processing multiple elements
  • Optimizing matrix/vector operations
  • Finding performance bottlenecks in ML code

Quick Reference

# Find loops processing arrays/tensors
grep -n "for.*in.*range\|@unroll\|@vectorize" *.mojo

# Find element-wise operations
grep -n "\.load\|\.store\|\.broadcast" *.mojo

# Check for SIMD parameters
grep -n "simd_width\|nelems\|\[.*:\]" *.mojo

# Identify candidates
grep -n "for i in range.*:" -A 10 *.mojo | grep -E "array\[i\]|tensor\[i\]"

SIMD Optimization Opportunities

Vectorizable Patterns (contrasted in the sketch after this list):

  • ✅ Element-wise addition: a[i] + b[i] for all i
  • ✅ Scalar multiplication: a[i] * scalar for all i
  • ✅ Unary operations: sin(a[i]), exp(a[i]) for all i
  • ✅ Reduction operations: sum, max, min over an array
  • ❌ Dependent iterations: a[i] = a[i-1] + value (sequential)
  • ❌ Conditional branches: if a[i] > threshold: (hard to vectorize)
  • ❌ Function calls: unpredictable latency (avoid in tight loops)
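
A minimal Mojo sketch contrasting the two shapes (a, out, n, scalar, and value are illustrative names):

# Vectorizable: each iteration touches only index i, so iterations are independent
for i in range(n):
    out[i] = a[i] * scalar

# Not vectorizable as written: iteration i consumes the result of iteration i-1
for i in range(1, n):
    a[i] = a[i - 1] + value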

SIMD Width Selection:

  • @parameter fn[simd_width: Int] - generic over the SIMD width
  • simd_width=4 - typically good for float32
  • simd_width=8 - optimal for many operations
  • simd_width=16+ - for int32 or specialized ops
  • Match hardware capabilities (AVX2: 4 float64 / 8 float32 lanes; AVX-512: 8 / 16) - or query the target, as shown below
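
Rather than hard-coding a width, you can query the target's native width per element type. A sketch, assuming the standard library's sys.info.simdwidthof:

from sys.info import simdwidthof

# Native float32 lane count for the build target:
# typically 8 on AVX2 (256-bit) and 16 on AVX-512 (512-bit)
alias width = simdwidthof[DType.float32]()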

Vectorization Patterns:

  • vectorize (from Mojo's algorithm module) for simple loops
  • @unroll for small fixed loops (2-4 iterations) - see the sketch after this list
  • Manual SIMD with .load[] and .store[]
  • Tensor operations with SIMD dimensions
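
For the small-loop case, a sketch of compile-time unrolling (older Mojo releases spelled this with an @unroll decorator on the loop; newer ones use @parameter for):

fn dot4(a: SIMD[DType.float32, 4], b: SIMD[DType.float32, 4]) -> Float32:
    var total: Float32 = 0

    @parameter
    for i in range(4):  # trip count is known, so the loop unrolls at compile time
        total += a[i] * b[i]
    return total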

Analysis Workflow

  1. Profile code: Identify bottlenecks using time/memory metrics
  2. Find loops: Locate loops processing large amounts of data
  3. Check vectorizability: Verify no loop-carried dependencies
  4. Estimate speedup: SIMD could provide a 4-16x improvement
  5. Implement SIMD: Use vectorize, @unroll, or manual SIMD loads/stores
  6. Measure performance: Verify the improvement with benchmarks (see the timing sketch below)
  7. Document changes: Note what was optimized and why
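
For step 6, a minimal timing harness (a sketch; assumes time.perf_counter_ns from recent Mojo releases and the sum_scalar/sum_simd functions from the examples below - the stdlib benchmark module is the more rigorous option):

from time import perf_counter_ns

fn bench_sum(tensor: Tensor) -> Float64:
    var start = perf_counter_ns()
    _ = sum_scalar(tensor)  # swap in sum_simd[width] to compare
    return Float64(perf_counter_ns() - start) / 1e6  # elapsed milliseconds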

Output Format

Report SIMD analysis with:

  1. Hotspots - Functions/loops using most CPU time
  2. Vectorization Potential - Operations that could use SIMD
  3. Estimated Speedup - Expected performance improvement
  4. Implementation Priority - High/medium/low impact
  5. Technical Approach - How to implement SIMD
  6. Risks - Potential issues with vectorization
  7. Recommendations - Which optimizations to pursue first

Optimization Examples

Example 1: Element-wise Addition

# Before: scalar loop
fn add_scalar(a: Tensor, b: Tensor) -> Tensor:
    var result = Tensor(a.shape)
    for i in range(a.num_elements()):
        result._data[i] = a._data[i] + b._data[i]
    return result

# After: vectorized with algorithm.vectorize
# (sketch: the exact vectorize / load / store signatures vary across
#  Mojo versions; Tensor and _data are this project's own types)
from algorithm import vectorize
from sys.info import simdwidthof

fn add_vectorized(a: Tensor, b: Tensor) -> Tensor:
    var result = Tensor(a.shape)

    @parameter
    fn add_simd[simd_width: Int](i: Int):
        # One load per input, one add, one store: simd_width elements per call
        result._data.store[simd_width](i,
            a._data.load[simd_width](i) + b._data.load[simd_width](i))

    vectorize[add_simd, simdwidthof[DType.float32]()](a.num_elements())
    return result  # 4x-8x speedup typical

Example 2: Reduction (Sum)

# Before: scalar loop
fn sum_scalar(tensor: Tensor) -> Float32:
    var total: Float32 = 0
    for i in range(tensor.num_elements()):
        total += tensor._data[i]
    return total

# After: SIMD reduction (sketch; assumes num_elements() is a multiple of
# simd_width - otherwise add a scalar tail loop, see Error Handling below)
fn sum_simd[simd_width: Int](tensor: Tensor) -> Float32:
    var acc = SIMD[DType.float32, simd_width](0)
    for i in range(0, tensor.num_elements(), simd_width):
        acc += tensor._data.load[simd_width](i)  # simd_width elements per step
    return acc.reduce_add()  # horizontal sum across the lanes

Error Handling

  • Vectorization causes wrong results → Check for loop-carried dependencies
  • Segmentation fault with SIMD → Verify alignment and bounds (see the tail-loop sketch below)
  • Minimal speedup → The loop may not be vectorizable; profile to confirm
  • Complex logic → Break it into simpler vectorizable operations
  • Type mismatches → Ensure the SIMD width is compatible with the element type
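
For the alignment/bounds row: the usual fault is a full-width load running past the end of the buffer. A sketch of the guard-plus-scalar-tail pattern (data, n, and the process helpers are illustrative):

var i = 0
# Full SIMD loads only while a whole simd_width chunk stays in bounds
while i + simd_width <= n:
    process(data.load[simd_width](i))
    i += simd_width
# Scalar tail for the remaining 0 to simd_width-1 elements
while i < n:
    process_one(data[i])
    i += 1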

SIMD Decision Tree

  • Does loop process large arrays? → YES → Check vectorizability
  • Loop-carried dependencies? → YES → Can't vectorize, optimize differently
  • Simple operations on many elements? → YES → Use vectorize or @unroll
  • Critical path (hot loop)? → YES → Worth optimizing
  • Implement → Measure → Iterate

References

  • See mojo-simd-optimize for implementation guidance
  • See CLAUDE.md for SIMD code patterns
  • See performance section in module documentation