CoreML Diagnostics

Diagnose model load failures, slow inference, memory issues, compression accuracy loss, compute unit problems, and conversion errors.
Quick Reference
| Symptom | First Check | Pattern |
|---|---|---|
| Model won't load | Deployment target | 1a-1c |
| Slow first load | Cache miss | 2a |
| Slow inference | Compute units | 2b-2c |
| High memory | Concurrent predictions | 3a-3b |
| Bad accuracy after compression | Granularity | 4a-4c |
| Conversion fails | Operation support | 5a-5b |
Decision Tree
```
CoreML issue
├─ Load failure?
│  ├─ "Unsupported model version" → 1a
│  ├─ "Failed to create compute plan" → 1b
│  └─ Other load error → 1c
├─ Performance issue?
│  ├─ First load slow, subsequent fast? → 2a
│  ├─ All predictions slow? → 2b
│  └─ Slow only on specific device? → 2c
├─ Memory issue?
│  ├─ Memory grows during predictions? → 3a
│  └─ Out of memory on load? → 3b
├─ Accuracy degraded?
│  ├─ After palettization? → 4a
│  ├─ After quantization? → 4b
│  └─ After pruning? → 4c
└─ Conversion issue?
   ├─ Operation not supported? → 5a
   └─ Wrong output? → 5b
```
Pattern 1a - "Unsupported model version"
Symptom: Model fails to load with version error.
Cause: Model compiled for newer OS than device supports.
Diagnosis:
```python
# Check the model's minimum deployment target
import coremltools as ct

model = ct.models.MLModel("Model.mlpackage")
print(model.get_spec().specificationVersion)
```
| Spec Version | Minimum iOS |
|---|---|
| 4 | iOS 13 |
| 5 | iOS 14 |
| 6 | iOS 15 |
| 7 | iOS 16 |
| 8 | iOS 17 |
| 9 | iOS 18 |
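A small helper can turn the table above into a programmatic check at conversion or QA time. This is a sketch; the dictionary is just the table transcribed, and the model path is illustrative:

```python
# Sketch: map a model's spec version to the minimum iOS it requires
# (mapping taken from the table above)
import coremltools as ct

MIN_IOS_FOR_SPEC = {4: 13, 5: 14, 6: 15, 7: 16, 8: 17, 9: 18}

model = ct.models.MLModel("Model.mlpackage")
spec_version = model.get_spec().specificationVersion
print(f"Spec v{spec_version} needs iOS {MIN_IOS_FOR_SPEC.get(spec_version, '?')}+")
```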
Fix: Re-convert with lower deployment target:
```python
mlmodel = ct.convert(
    traced,
    minimum_deployment_target=ct.target.iOS16,  # Lower target
)
```
Tradeoff: Loses newer optimizations (SDPA fusion, per-block quantization, MLTensor).
Pattern 1b - "Failed to create compute plan"
Symptom: Model loads on some devices but not others.
Cause: Unsupported operations for target compute unit.
Diagnosis:
- Open model in Xcode
- Create Performance Report
- Check "Unsupported" operations
- Hover for hints
Fix:
```swift
// Force CPU-only to bypass unsupported GPU/NE operations
let config = MLModelConfiguration()
config.computeUnits = .cpuOnly
let model = try MLModel(contentsOf: url, configuration: config)
```
Better fix: Update model precision or operations during conversion:
```python
# Float16 is often better supported
mlmodel = ct.convert(traced, compute_precision=ct.precision.FLOAT16)
```
Pattern 1c - General Load Failures
Symptom: Model fails to load with unclear error.
Checklist:
- Check file exists and is readable
- Check compiled vs source model (runtime needs .mlmodelc)
- Check available disk space (cache needs room)
- Check model isn't corrupted (re-convert)
```swift
// Surface the underlying error instead of a generic failure
do {
    _ = try MLModel(contentsOf: url, configuration: MLModelConfiguration())
} catch {
    print("CoreML load failed: \(error)")  // the NSError usually carries a specific reason
}
```
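To rule out corruption before blaming the runtime, it can help to reload the source package with coremltools off-device. A minimal sketch, assuming the original Model.mlpackage is at hand:

```python
# Sketch: sanity-check the source artifact with coremltools.
# If this load also fails, the package is corrupted; re-convert.
import coremltools as ct

try:
    ct.models.MLModel("Model.mlpackage")
    print("Package loads fine; investigate on-device compilation/cache instead")
except Exception as e:
    print(f"Package itself is broken: {e}")
```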
Pattern 2a - Slow First Load (Cache Miss)
Symptom: First prediction after install/update is slow, subsequent are fast.
Cause: Device specialization not cached.
Diagnosis:
- Profile with Core ML Instrument
- Look at Load event label:
- "prepare and cache" = cache miss (slow)
- "cached" = cache hit (fast)
Why cache misses:
- First launch after install
- System update invalidated cache
- Low disk space cleared cache
- Model file was modified
Mitigation:
```swift
// Warm the specialization cache in the background at app launch
Task.detached(priority: .background) {
    _ = try? await MLModel.load(contentsOf: modelURL, configuration: MLModelConfiguration())
}
```
Note: Cache is tied to (model path + configuration + device). Different configs = different cache entries.
Pattern 2b - All Predictions Slow
Symptom: Predictions consistently slow, not just first one.
Diagnosis:
- Create Xcode Performance Report
- Check compute unit distribution
- Look for high-cost operations
Common causes:
| Cause | Fix |
|---|---|
| Running on CPU when GPU/NE available | Check config |
| Model too large for Neural Engine | Compress model |
| Frequent CPU↔GPU↔NE transfers | Adjust segmentation |
| Dynamic shapes recompiling | Use fixed/enumerated shapes (sketch below) |
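For the dynamic-shapes case in the table, enumerated shapes let CoreML prepare for a known set of sizes instead of recompiling per shape. A minimal conversion sketch; the input name, sizes, and `traced` module are illustrative:

```python
import coremltools as ct

# Pre-declare the input shapes the app will actually use
shapes = ct.EnumeratedShapes(
    shapes=[[1, 3, 224, 224], [1, 3, 512, 512]],
    default=[1, 3, 224, 224],
)
mlmodel = ct.convert(
    traced,  # traced PyTorch module, as in the other examples
    inputs=[ct.TensorType(name="input", shape=shapes)],
)
```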
Profile compute unit usage:
```swift
// Inspect which compute device each operation is planned to run on (iOS 17.4+)
let plan = try await MLComputePlan.load(contentsOf: modelURL, configuration: MLModelConfiguration())
if case .program(let program) = plan.modelStructure,
   let mainFunction = program.functions["main"] {
    for operation in mainFunction.block.operations {
        if let usage = plan.deviceUsage(for: operation) {
            print("\(operation.operatorName): \(usage.preferred)")
        }
    }
}
```
Pattern 2c - Slow on Specific Device
Symptom: Fast on Mac, slow on iPhone (or vice versa).
Cause: Different hardware characteristics.
Diagnosis:
```swift
// Check which compute devices this hardware exposes
let devices = MLModel.availableComputeDevices
print(devices)  // Differs per device (CPU, GPU, Neural Engine)
```
Common issues:
| Scenario | Cause | Fix |
|---|---|---|
| Fast on M-series Mac, slow on iPhone | Model optimized for GPU | Use palettization (Neural Engine) |
| Fast on iPhone, slow on Intel Mac | No Neural Engine | Use quantization (GPU) |
| Slow on older devices | Less compute power | Use more aggressive compression |
Recommendation: Profile on target devices, not just development Mac.
Pattern 3a - Memory Grows During Predictions
Symptom: Memory increases with each prediction, doesn't release.
Cause: Input/output buffers accumulating from concurrent predictions.
Diagnosis:
Instruments → Allocations + Core ML template
- Look for: many concurrent prediction intervals
- Check: MLMultiArray allocations growing
Fix: Limit concurrent predictions:
```swift
// Cap in-flight predictions so input/output buffers can't pile up
actor PredictionLimiter {
    private let maxConcurrent = 2
    private var inFlight = 0

    func predict(_ model: MLModel, input: MLFeatureProvider) async throws -> MLFeatureProvider {
        while inFlight >= maxConcurrent {
            await Task.yield()  // simple backpressure; a continuation queue would be fairer
        }
        inFlight += 1
        defer { inFlight -= 1 }
        return try await model.prediction(from: input)
    }
}
```
Pattern 3b - Out of Memory on Load
Symptom: App crashes or model fails to load on memory-constrained devices.
Cause: Model too large for device memory.
Diagnosis:
```bash
# Check model weight size
ls -lh Model.mlpackage/Data/com.apple.CoreML/weights/
```
Fix options:
| Approach | Compression | Memory Impact |
|---|---|---|
| 8-bit palettization | 2x smaller | 2x less memory |
| 4-bit palettization | 4x smaller | 4x less memory |
| Pruning (50%) | ~2x smaller | ~2x less memory |
Note: Compressed weights are decompressed just-in-time (iOS 17+), so smaller on-disk = smaller in memory.
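A minimal post-training sketch for the palettization rows in the table (coremltools 7+; file paths are illustrative, and nbits can be 8 or 4 per the desired compression):

```python
import coremltools as ct
import coremltools.optimize.coreml as cto

mlmodel = ct.models.MLModel("Model.mlpackage")
config = cto.OptimizationConfig(
    global_config=cto.OpPalettizerConfig(mode="kmeans", nbits=8)
)
compressed = cto.palettize_weights(mlmodel, config)
compressed.save("Model_8bit.mlpackage")
```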
Pattern 4a - Bad Accuracy After Palettization
Symptom: Model output degraded after palettization.
Diagnosis:
- What bit depth? (2-bit most likely to fail)
- What granularity? (per-tensor loses more than per-grouped-channel)
Fix progression:
```python
from coremltools.optimize.coreml import OpPalettizerConfig

# Step 1: Try grouped channels (iOS 18+)
config = OpPalettizerConfig(
    nbits=4,
    granularity="per_grouped_channel",
    group_size=16,
)

# Step 2: If accuracy is still bad, try more bits
config = OpPalettizerConfig(nbits=6)  # plus the granularity settings above

# Step 3: If you still need 4-bit, use training-time calibration
from coremltools.optimize.torch.palettization import DKMPalettizer
# ... training-time compression
```
Key insight: 4-bit per-tensor palettization has only 16 clusters for the entire weight matrix. Grouped channels give 16 clusters per group of 16 channels, a much finer granularity.
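As a back-of-envelope check on that insight, for a hypothetical 4096×4096 weight matrix the cluster counts work out as follows:

```python
# Hypothetical 4096x4096 linear layer, 4-bit palettization
rows, nbits, group_size = 4096, 4, 16

per_tensor_clusters = 2 ** nbits                  # 16 centroids for the whole matrix
n_groups = rows // group_size                     # 256 channel groups
per_group_clusters = n_groups * 2 ** nbits        # 4096 distinct centroids overall

print(per_tensor_clusters, per_group_clusters)    # 16 vs 4096
```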
Pattern 4b - Bad Accuracy After Quantization
Symptom: Model output degraded after INT8/INT4 quantization.
Diagnosis:
- What bit depth?
- What granularity?
Fix progression:
```python
from coremltools.optimize.coreml import OpLinearQuantizerConfig

# Step 1: Use per-block granularity (iOS 18+)
config = OpLinearQuantizerConfig(
    dtype="int4",
    granularity="per_block",
    block_size=32,
)

# Step 2: Use calibration data (GPTQ-style layerwise compression;
# LayerwiseCompressor takes its own LayerwiseCompressorConfig -- see
# the coremltools.optimize.torch docs for your version)
from coremltools.optimize.torch.layerwise_compression import LayerwiseCompressor
compressor = LayerwiseCompressor(model, config)
quantized = compressor.compress(calibration_loader)
```
Note: INT4 quantization works best on Mac GPU. For Neural Engine, prefer palettization.
Pattern 4c - Bad Accuracy After Pruning
Symptom: Model output degraded after weight pruning.
Diagnosis:
- What sparsity level?
- Post-training or training-time?
Thresholds (model-dependent):
- 0-30% sparsity: Usually safe
- 30-50% sparsity: May need calibration
- 50%+ sparsity: Usually needs training-time
Fix:
```python
# Calibration-based pruning (SparseGPT-style layerwise compression,
# coremltools 8+; class names per the coremltools docs -- verify
# against your installed version)
from coremltools.optimize.torch.layerwise_compression import (
    LayerwiseCompressor,
    LayerwiseCompressorConfig,
    ModuleSparseGPTConfig,
)

config = LayerwiseCompressorConfig(
    global_config=ModuleSparseGPTConfig(target_sparsity=0.4)
)
compressor = LayerwiseCompressor(model, config)
sparse = compressor.compress(calibration_loader)
```
Pattern 5a - Operation Not Supported
Symptom: Conversion fails with unsupported operation error.
Diagnosis:
Error: "Op 'custom_op' is not supported for conversion"
Options:
- Check if op is in coremltools: May need newer version
```bash
pip install --upgrade coremltools
```
- Use composite ops: Split into supported primitives
```python
# Instead of: custom_op(x)
# Use:        supported_op1(supported_op2(x))
```
- Register custom op: Advanced, requires MIL programming
```python
from coremltools.converters.mil import Builder as mb
from coremltools.converters.mil.frontend.torch.torch_op_registry import register_torch_op

@register_torch_op
def custom_op(context, node):
    # Map the op to MIL primitives via mb.<op>(...)
    ...
```
Pattern 5b - Conversion Succeeds but Wrong Output
Symptom: Model converts but predictions differ from PyTorch.
Diagnosis checklist:
- Input normalization: Ensure preprocessing matches (see the sketch after this checklist)
```python
# PyTorch often uses ImageNet normalization
# CoreML may need explicit preprocessing
```
- Shape ordering: PyTorch (NCHW) vs CoreML (NHWC for some ops)
```python
# Check shapes at conversion; PyTorch uses NCHW
ct.convert(..., inputs=[ct.ImageType(shape=(1, 3, 224, 224))])
```
- Precision differences: Float16 may differ from Float32
```python
# Force Float32 to match PyTorch numerics
ct.convert(..., compute_precision=ct.precision.FLOAT32)
```
- Random ops: Dropout, random initialization differ
```python
# Ensure eval mode before tracing/converting
model.eval()
```
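For checklist item 1, a common coremltools recipe bakes ImageNet normalization into the converted model via ImageType scale/bias. A sketch: the bias/scale values are the standard ImageNet constants (with the averaged std ~0.226, since scale is a single scalar), and the input shape and `traced` module are illustrative:

```python
import coremltools as ct

# Approximate (x/255 - mean) / std per channel; CoreML's ImageType
# preprocessing applies y = scale * x + bias with one scalar scale
scale = 1 / (0.226 * 255.0)
bias = [-0.485 / 0.229, -0.456 / 0.224, -0.406 / 0.225]

mlmodel = ct.convert(
    traced,
    inputs=[ct.ImageType(shape=(1, 3, 224, 224), scale=scale, bias=bias)],
)
```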
Debug:
```python
# Compare final outputs between PyTorch and CoreML
import numpy as np

torch_output = model(input).detach().numpy()
coreml_output = mlmodel.predict({"input": input.numpy()})["output"]
print(f"Max diff: {np.max(np.abs(torch_output - coreml_output))}")
```
Pressure Scenario - "Model works on simulator but not device"
Wrong approach: Assume simulator bug, ignore.
Right approach:
- Check model spec version vs device iOS version (Pattern 1a)
- Check compute unit availability (Pattern 2c)
- Profile on actual device, not simulator
- Simulator uses host Mac's GPU/CPU, not device Neural Engine
Pressure Scenario - "Ship now, optimize later"
Wrong approach: Compress to smallest possible size without testing.
Right approach:
- Ship Float16 baseline first
- Profile on target devices
- Apply compression incrementally with accuracy testing
- Document compression settings for future optimization
Diagnostic Checklist
When CoreML isn't working:
- Check deployment target matches device iOS
- Check model file is compiled (.mlmodelc)
- Profile load: cached vs uncached
- Profile prediction: which compute units
- Check memory: concurrent predictions limited
- For compression issues: try higher granularity
- For conversion issues: check op support, precision
Resources
WWDC: 2023-10047, 2023-10049, 2024-10159, 2024-10161
Docs: /coreml, /coreml/mlmodel
Skills: coreml, coreml-ref