Claude-skill-registry analyze-simd-usage

Analyze SIMD usage opportunities in Mojo code. Use this skill to find places where vectorization can improve performance.

Install

Source · Clone the upstream repo:

git clone https://github.com/majiayu000/claude-skill-registry

Claude Code · Install into ~/.claude/skills/:

T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/analyze-simd-usage" ~/.claude/skills/majiayu000-claude-skill-registry-analyze-simd-usage && rm -rf "$T"

Manifest: skills/data/analyze-simd-usage/SKILL.md

Source Content

Analyze SIMD Usage Opportunities

Identify where SIMD (Single Instruction Multiple Data) can improve performance.

When to Use

  • Performance-critical tensor operations
  • Element-wise operations on large arrays
  • Vectorization of loops processing multiple elements
  • Optimizing matrix/vector operations
  • Finding performance bottlenecks in ML code

Quick Reference

# Find loops processing arrays/tensors
grep -n "for.*in.*range\|@unroll\|@vectorize" *.mojo

# Find element-wise operations
grep -n "\.load\|\.store\|\.broadcast" *.mojo

# Check for SIMD parameters
grep -n "simd_width\|nelems\|\[.*:\]" *.mojo

# Identify candidates
grep -n "for i in range.*:" -A 10 *.mojo | grep -E "array\[i\]|tensor\[i\]"

SIMD Optimization Opportunities

Vectorizable Patterns (contrasted in the sketch after this list):

  • ✅ Element-wise addition: a[i] + b[i] for all i
  • ✅ Scalar multiplication: a[i] * scalar for all i
  • ✅ Unary operations: sin(a[i]), exp(a[i]) for all i
  • ✅ Reduction operations: sum, max, min over an array
  • ❌ Dependent iterations: a[i] = a[i-1] + value (sequential)
  • ❌ Conditional branches: if a[i] > threshold: (hard to vectorize)
  • ❌ Function calls: unpredictable latency (avoid in tight loops)
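
A minimal Mojo sketch contrasting the two shapes (a, out, n, scalar, and value are illustrative names):

# Vectorizable: each iteration touches only index i, so iterations are independent
for i in range(n):
    out[i] = a[i] * scalar

# Not vectorizable as written: iteration i consumes the result of iteration i-1
for i in range(1, n):
    a[i] = a[i - 1] + value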

SIMD Width Selection:

  • @parameter fn[simd_width: Int] - generic over the SIMD width
  • simd_width=4 - typically good for float32
  • simd_width=8 - optimal for many operations
  • simd_width=16+ - for int32 or specialized ops
  • Match hardware capabilities (AVX2: 4 float64 / 8 float32 lanes; AVX-512: 8 / 16) - or query the target, as shown below
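
Rather than hard-coding a width, you can query the target's native width per element type. A sketch, assuming the standard library's sys.info.simdwidthof:

from sys.info import simdwidthof

# Native float32 lane count for the build target:
# typically 8 on AVX2 (256-bit) and 16 on AVX-512 (512-bit)
alias width = simdwidthof[DType.float32]()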

Vectorization Patterns:

  • vectorize (from Mojo's algorithm module) for simple loops
  • @unroll for small fixed loops (2-4 iterations) - see the sketch after this list
  • Manual SIMD with .load[] and .store[]
  • Tensor operations with SIMD dimensions
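
For the small-loop case, a sketch of compile-time unrolling (older Mojo releases spelled this with an @unroll decorator on the loop; newer ones use @parameter for):

fn dot4(a: SIMD[DType.float32, 4], b: SIMD[DType.float32, 4]) -> Float32:
    var total: Float32 = 0

    @parameter
    for i in range(4):  # trip count is known, so the loop unrolls at compile time
        total += a[i] * b[i]
    return total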

Analysis Workflow

  1. Profile code: Identify bottlenecks using time/memory metrics
  2. Find loops: Locate loops processing large amounts of data
  3. Check vectorizability: Verify no loop-carried dependencies
  4. Estimate speedup: SIMD could provide a 4-16x improvement
  5. Implement SIMD: Use vectorize, @unroll, or manual SIMD loads/stores
  6. Measure performance: Verify the improvement with benchmarks (see the timing sketch below)
  7. Document changes: Note what was optimized and why
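
For step 6, a minimal timing harness (a sketch; assumes time.perf_counter_ns from recent Mojo releases and the sum_scalar/sum_simd functions from the examples below - the stdlib benchmark module is the more rigorous option):

from time import perf_counter_ns

fn bench_sum(tensor: Tensor) -> Float64:
    var start = perf_counter_ns()
    _ = sum_scalar(tensor)  # swap in sum_simd[width] to compare
    return Float64(perf_counter_ns() - start) / 1e6  # elapsed milliseconds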

Output Format

Report SIMD analysis with:

  1. Hotspots - Functions/loops using most CPU time
  2. Vectorization Potential - Operations that could use SIMD
  3. Estimated Speedup - Expected performance improvement
  4. Implementation Priority - High/medium/low impact
  5. Technical Approach - How to implement SIMD
  6. Risks - Potential issues with vectorization
  7. Recommendations - Which optimizations to pursue first

Optimization Examples

Example 1: Element-wise Addition

# Before: scalar loop
fn add_scalar(a: Tensor, b: Tensor) -> Tensor:
    var result = Tensor(a.shape)
    for i in range(a.num_elements()):
        result._data[i] = a._data[i] + b._data[i]
    return result

# After: vectorized with algorithm.vectorize
# (sketch: the exact vectorize / load / store signatures vary across
#  Mojo versions; Tensor and _data are this project's own types)
from algorithm import vectorize
from sys.info import simdwidthof

fn add_vectorized(a: Tensor, b: Tensor) -> Tensor:
    var result = Tensor(a.shape)

    @parameter
    fn add_simd[simd_width: Int](i: Int):
        # One load per input, one add, one store: simd_width elements per call
        result._data.store[simd_width](i,
            a._data.load[simd_width](i) + b._data.load[simd_width](i))

    vectorize[add_simd, simdwidthof[DType.float32]()](a.num_elements())
    return result  # 4x-8x speedup typical

Example 2: Reduction (Sum)

# Before: scalar loop
fn sum_scalar(tensor: Tensor) -> Float32:
    var total: Float32 = 0
    for i in range(tensor.num_elements()):
        total += tensor._data[i]
    return total

# After: SIMD reduction (sketch; assumes num_elements() is a multiple of
# simd_width - otherwise add a scalar tail loop, see Error Handling below)
fn sum_simd[simd_width: Int](tensor: Tensor) -> Float32:
    var acc = SIMD[DType.float32, simd_width](0)
    for i in range(0, tensor.num_elements(), simd_width):
        acc += tensor._data.load[simd_width](i)  # simd_width elements per step
    return acc.reduce_add()  # horizontal sum across the lanes

Error Handling

  • Vectorization causes wrong results → Check for loop-carried dependencies
  • Segmentation fault with SIMD → Verify alignment and bounds (see the tail-loop sketch below)
  • Minimal speedup → The loop may not be vectorizable; profile to confirm
  • Complex logic → Break it into simpler vectorizable operations
  • Type mismatches → Ensure the SIMD width is compatible with the element type
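
For the alignment/bounds row: the usual fault is a full-width load running past the end of the buffer. A sketch of the guard-plus-scalar-tail pattern (data, n, and the process helpers are illustrative):

var i = 0
# Full SIMD loads only while a whole simd_width chunk stays in bounds
while i + simd_width <= n:
    process(data.load[simd_width](i))
    i += simd_width
# Scalar tail for the remaining 0 to simd_width-1 elements
while i < n:
    process_one(data[i])
    i += 1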

SIMD Decision Tree

  • Does loop process large arrays? → YES → Check vectorizability
  • Loop-carried dependencies? → YES → Can't vectorize, optimize differently
  • Simple operations on many elements? → YES → Use vectorize or @unroll
  • Critical path (hot loop)? → YES → Worth optimizing
  • Implement → Measure → Iterate

References

  • See mojo-simd-optimize for implementation guidance
  • See CLAUDE.md for SIMD code patterns
  • See performance section in module documentation