CoreML Diagnostics

Diagnose model load failures, slow inference, memory issues, compression accuracy loss, compute unit problems, and conversion errors.
Quick Reference
| Symptom | First Check | Pattern |
|---|---|---|
| Model won't load | Deployment target | 1a-1c |
| Slow first load | Cache miss | 2a |
| Slow inference | Compute units | 2b-2c |
| High memory | Concurrent predictions | 3a-3b |
| Bad accuracy after compression | Granularity | 4a-4c |
| Conversion fails | Operation support | 5a-5b |
Decision Tree
```
CoreML issue
├─ Load failure?
│  ├─ "Unsupported model version" → 1a
│  ├─ "Failed to create compute plan" → 1b
│  └─ Other load error → 1c
├─ Performance issue?
│  ├─ First load slow, subsequent fast? → 2a
│  ├─ All predictions slow? → 2b
│  └─ Slow only on specific device? → 2c
├─ Memory issue?
│  ├─ Memory grows during predictions? → 3a
│  └─ Out of memory on load? → 3b
├─ Accuracy degraded?
│  ├─ After palettization? → 4a
│  ├─ After quantization? → 4b
│  └─ After pruning? → 4c
└─ Conversion issue?
   ├─ Operation not supported? → 5a
   └─ Wrong output? → 5b
```
Pattern 1a - "Unsupported model version"
Symptom: Model fails to load with version error.
Cause: Model compiled for newer OS than device supports.
Diagnosis:
```python
# Check the model's minimum deployment target
import coremltools as ct

model = ct.models.MLModel("Model.mlpackage")
print(model.get_spec().specificationVersion)
```
| Spec Version | Minimum iOS |
|---|---|
| 4 | iOS 13 |
| 5 | iOS 14 |
| 6 | iOS 15 |
| 7 | iOS 16 |
| 8 | iOS 17 |
| 9 | iOS 18 |
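A small helper can turn the table above into a programmatic check at conversion or QA time. This is a sketch; the dictionary is just the table transcribed, and the model path is illustrative:

```python
# Sketch: map a model's spec version to the minimum iOS it requires
# (mapping taken from the table above)
import coremltools as ct

MIN_IOS_FOR_SPEC = {4: 13, 5: 14, 6: 15, 7: 16, 8: 17, 9: 18}

model = ct.models.MLModel("Model.mlpackage")
spec_version = model.get_spec().specificationVersion
print(f"Spec v{spec_version} needs iOS {MIN_IOS_FOR_SPEC.get(spec_version, '?')}+")
```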
Fix: Re-convert with lower deployment target:
```python
mlmodel = ct.convert(
    traced,
    minimum_deployment_target=ct.target.iOS16,  # Lower target
)
```
Tradeoff: Loses newer optimizations (SDPA fusion, per-block quantization, MLTensor).
Pattern 1b - "Failed to create compute plan"
Symptom: Model loads on some devices but not others.
Cause: Unsupported operations for target compute unit.
Diagnosis:
- Open model in Xcode
- Create Performance Report
- Check "Unsupported" operations
- Hover for hints
Fix:
```swift
// Force CPU-only to bypass unsupported GPU/NE operations
let config = MLModelConfiguration()
config.computeUnits = .cpuOnly
let model = try MLModel(contentsOf: url, configuration: config)
```
Better fix: Update model precision or operations during conversion:
```python
# Float16 is often better supported
mlmodel = ct.convert(traced, compute_precision=ct.precision.FLOAT16)
```
Pattern 1c - General Load Failures
Symptom: Model fails to load with unclear error.
Checklist:
- Check file exists and is readable
- Check compiled vs source model (runtime needs .mlmodelc)
- Check available disk space (cache needs room)
- Check model isn't corrupted (re-convert)
```swift
// Surface the underlying error instead of a generic failure
do {
    _ = try MLModel(contentsOf: url, configuration: MLModelConfiguration())
} catch {
    print("CoreML load failed: \(error)")  // the NSError usually carries a specific reason
}
```
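To rule out corruption before blaming the runtime, it can help to reload the source package with coremltools off-device. A minimal sketch, assuming the original Model.mlpackage is at hand:

```python
# Sketch: sanity-check the source artifact with coremltools.
# If this load also fails, the package is corrupted; re-convert.
import coremltools as ct

try:
    ct.models.MLModel("Model.mlpackage")
    print("Package loads fine; investigate on-device compilation/cache instead")
except Exception as e:
    print(f"Package itself is broken: {e}")
```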
Pattern 2a - Slow First Load (Cache Miss)
Symptom: First prediction after install/update is slow, subsequent are fast.
Cause: Device specialization not cached.
Diagnosis:
- Profile with Core ML Instrument
- Look at Load event label:
- "prepare and cache" = cache miss (slow)
- "cached" = cache hit (fast)
Why cache misses:
- First launch after install
- System update invalidated cache
- Low disk space cleared cache
- Model file was modified
Mitigation:
```swift
// Warm the specialization cache in the background at app launch
Task.detached(priority: .background) {
    _ = try? await MLModel.load(contentsOf: modelURL, configuration: MLModelConfiguration())
}
```
Note: Cache is tied to (model path + configuration + device). Different configs = different cache entries.
Pattern 2b - All Predictions Slow
Symptom: Predictions consistently slow, not just first one.
Diagnosis:
- Create Xcode Performance Report
- Check compute unit distribution
- Look for high-cost operations
Common causes:
| Cause | Fix |
|---|---|
| Running on CPU when GPU/NE available | Check config |
| Model too large for Neural Engine | Compress model |
| Frequent CPU↔GPU↔NE transfers | Adjust segmentation |
| Dynamic shapes recompiling | Use fixed/enumerated shapes (sketch below) |
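For the dynamic-shapes case in the table, enumerated shapes let CoreML prepare for a known set of sizes instead of recompiling per shape. A minimal conversion sketch; the input name, sizes, and `traced` module are illustrative:

```python
import coremltools as ct

# Pre-declare the input shapes the app will actually use
shapes = ct.EnumeratedShapes(
    shapes=[[1, 3, 224, 224], [1, 3, 512, 512]],
    default=[1, 3, 224, 224],
)
mlmodel = ct.convert(
    traced,  # traced PyTorch module, as in the other examples
    inputs=[ct.TensorType(name="input", shape=shapes)],
)
```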
Profile compute unit usage:
```swift
// Inspect which compute device each operation is planned to run on (iOS 17.4+)
let plan = try await MLComputePlan.load(contentsOf: modelURL, configuration: MLModelConfiguration())
if case .program(let program) = plan.modelStructure,
   let mainFunction = program.functions["main"] {
    for operation in mainFunction.block.operations {
        if let usage = plan.deviceUsage(for: operation) {
            print("\(operation.operatorName): \(usage.preferred)")
        }
    }
}
```
Pattern 2c - Slow on Specific Device
Symptom: Fast on Mac, slow on iPhone (or vice versa).
Cause: Different hardware characteristics.
Diagnosis:
```swift
// Check which compute devices this hardware exposes
let devices = MLModel.availableComputeDevices
print(devices)  // Differs per device (CPU, GPU, Neural Engine)
```
Common issues:
| Scenario | Cause | Fix |
|---|---|---|
| Fast on M-series Mac, slow on iPhone | Model optimized for GPU | Use palettization (Neural Engine) |
| Fast on iPhone, slow on Intel Mac | No Neural Engine | Use quantization (GPU) |
| Slow on older devices | Less compute power | Use more aggressive compression |
Recommendation: Profile on target devices, not just development Mac.
Pattern 3a - Memory Grows During Predictions
Symptom: Memory increases with each prediction, doesn't release.
Cause: Input/output buffers accumulating from concurrent predictions.
Diagnosis:
Instruments → Allocations + Core ML template
- Look for: many concurrent prediction intervals
- Check: MLMultiArray allocations growing
Fix: Limit concurrent predictions:
```swift
// Cap in-flight predictions so input/output buffers can't pile up
actor PredictionLimiter {
    private let maxConcurrent = 2
    private var inFlight = 0

    func predict(_ model: MLModel, input: MLFeatureProvider) async throws -> MLFeatureProvider {
        while inFlight >= maxConcurrent {
            await Task.yield()  // simple backpressure; a continuation queue would be fairer
        }
        inFlight += 1
        defer { inFlight -= 1 }
        return try await model.prediction(from: input)
    }
}
```
Pattern 3b - Out of Memory on Load
Symptom: App crashes or model fails to load on memory-constrained devices.
Cause: Model too large for device memory.
Diagnosis:
```bash
# Check model weight size
ls -lh Model.mlpackage/Data/com.apple.CoreML/weights/
```
Fix options:
| Approach | Compression | Memory Impact |
|---|---|---|
| 8-bit palettization | 2x smaller | 2x less memory |
| 4-bit palettization | 4x smaller | 4x less memory |
| Pruning (50%) | ~2x smaller | ~2x less memory |
Note: Compressed weights are decompressed just-in-time (iOS 17+), so smaller on-disk = smaller in memory.
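A minimal post-training sketch for the palettization rows in the table (coremltools 7+; file paths are illustrative, and nbits can be 8 or 4 per the desired compression):

```python
import coremltools as ct
import coremltools.optimize.coreml as cto

mlmodel = ct.models.MLModel("Model.mlpackage")
config = cto.OptimizationConfig(
    global_config=cto.OpPalettizerConfig(mode="kmeans", nbits=8)
)
compressed = cto.palettize_weights(mlmodel, config)
compressed.save("Model_8bit.mlpackage")
```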
Pattern 4a - Bad Accuracy After Palettization
Symptom: Model output degraded after palettization.
Diagnosis:
- What bit depth? (2-bit most likely to fail)
- What granularity? (per-tensor loses more than per-grouped-channel)
Fix progression:
```python
from coremltools.optimize.coreml import OpPalettizerConfig

# Step 1: Try grouped channels (iOS 18+)
config = OpPalettizerConfig(
    nbits=4,
    granularity="per_grouped_channel",
    group_size=16,
)

# Step 2: If accuracy is still bad, try more bits
config = OpPalettizerConfig(nbits=6)  # plus the granularity settings above

# Step 3: If you still need 4-bit, use training-time calibration
from coremltools.optimize.torch.palettization import DKMPalettizer
# ... training-time compression
```
Key insight: 4-bit per-tensor palettization has only 16 clusters for the entire weight matrix. Grouped channels give 16 clusters per group of 16 channels, a much finer granularity.
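As a back-of-envelope check on that insight, for a hypothetical 4096×4096 weight matrix the cluster counts work out as follows:

```python
# Hypothetical 4096x4096 linear layer, 4-bit palettization
rows, nbits, group_size = 4096, 4, 16

per_tensor_clusters = 2 ** nbits                  # 16 centroids for the whole matrix
n_groups = rows // group_size                     # 256 channel groups
per_group_clusters = n_groups * 2 ** nbits        # 4096 distinct centroids overall

print(per_tensor_clusters, per_group_clusters)    # 16 vs 4096
```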
Pattern 4b - Bad Accuracy After Quantization
Symptom: Model output degraded after INT8/INT4 quantization.
Diagnosis:
- What bit depth?
- What granularity?
Fix progression:
```python
from coremltools.optimize.coreml import OpLinearQuantizerConfig

# Step 1: Use per-block granularity (iOS 18+)
config = OpLinearQuantizerConfig(
    dtype="int4",
    granularity="per_block",
    block_size=32,
)

# Step 2: Use calibration data (GPTQ-style layerwise compression;
# LayerwiseCompressor takes its own LayerwiseCompressorConfig -- see
# the coremltools.optimize.torch docs for your version)
from coremltools.optimize.torch.layerwise_compression import LayerwiseCompressor
compressor = LayerwiseCompressor(model, config)
quantized = compressor.compress(calibration_loader)
```
Note: INT4 quantization works best on Mac GPU. For Neural Engine, prefer palettization.
Pattern 4c - Bad Accuracy After Pruning
Symptom: Model output degraded after weight pruning.
Diagnosis:
- What sparsity level?
- Post-training or training-time?
Thresholds (model-dependent):
- 0-30% sparsity: Usually safe
- 30-50% sparsity: May need calibration
- 50%+ sparsity: Usually needs training-time
Fix:
```python
# Calibration-based pruning (SparseGPT-style layerwise compression,
# coremltools 8+; class names per the coremltools docs -- verify
# against your installed version)
from coremltools.optimize.torch.layerwise_compression import (
    LayerwiseCompressor,
    LayerwiseCompressorConfig,
    ModuleSparseGPTConfig,
)

config = LayerwiseCompressorConfig(
    global_config=ModuleSparseGPTConfig(target_sparsity=0.4)
)
compressor = LayerwiseCompressor(model, config)
sparse = compressor.compress(calibration_loader)
```
Pattern 5a - Operation Not Supported
Symptom: Conversion fails with unsupported operation error.
Diagnosis:
Error: "Op 'custom_op' is not supported for conversion"
Options:
- Check if op is in coremltools: May need newer version
```bash
pip install --upgrade coremltools
```
- Use composite ops: Split into supported primitives
```python
# Instead of: custom_op(x)
# Use:        supported_op1(supported_op2(x))
```
- Register custom op: Advanced, requires MIL programming
```python
from coremltools.converters.mil import Builder as mb
from coremltools.converters.mil.frontend.torch.torch_op_registry import register_torch_op

@register_torch_op
def custom_op(context, node):
    # Map the op to MIL primitives via mb.<op>(...)
    ...
```
Pattern 5b - Conversion Succeeds but Wrong Output
Symptom: Model converts but predictions differ from PyTorch.
Diagnosis checklist:
- Input normalization: Ensure preprocessing matches (see the sketch after this checklist)
```python
# PyTorch often uses ImageNet normalization
# CoreML may need explicit preprocessing
```
- Shape ordering: PyTorch (NCHW) vs CoreML (NHWC for some ops)
```python
# Check shapes at conversion; PyTorch uses NCHW
ct.convert(..., inputs=[ct.ImageType(shape=(1, 3, 224, 224))])
```
- Precision differences: Float16 may differ from Float32
```python
# Force Float32 to match PyTorch numerics
ct.convert(..., compute_precision=ct.precision.FLOAT32)
```
- Random ops: Dropout, random initialization differ
```python
# Ensure eval mode before tracing/converting
model.eval()
```
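For checklist item 1, a common coremltools recipe bakes ImageNet normalization into the converted model via ImageType scale/bias. A sketch: the bias/scale values are the standard ImageNet constants (with the averaged std ~0.226, since scale is a single scalar), and the input shape and `traced` module are illustrative:

```python
import coremltools as ct

# Approximate (x/255 - mean) / std per channel; CoreML's ImageType
# preprocessing applies y = scale * x + bias with one scalar scale
scale = 1 / (0.226 * 255.0)
bias = [-0.485 / 0.229, -0.456 / 0.224, -0.406 / 0.225]

mlmodel = ct.convert(
    traced,
    inputs=[ct.ImageType(shape=(1, 3, 224, 224), scale=scale, bias=bias)],
)
```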
Debug:
```python
# Compare final outputs between PyTorch and CoreML
import numpy as np

torch_output = model(input).detach().numpy()
coreml_output = mlmodel.predict({"input": input.numpy()})["output"]
print(f"Max diff: {np.max(np.abs(torch_output - coreml_output))}")
```
Pressure Scenario - "Model works on simulator but not device"
Wrong approach: Assume simulator bug, ignore.
Right approach:
- Check model spec version vs device iOS version (Pattern 1a)
- Check compute unit availability (Pattern 2c)
- Profile on actual device, not simulator
- Simulator uses host Mac's GPU/CPU, not device Neural Engine
Pressure Scenario - "Ship now, optimize later"
Wrong approach: Compress to smallest possible size without testing.
Right approach:
- Ship Float16 baseline first
- Profile on target devices
- Apply compression incrementally with accuracy testing
- Document compression settings for future optimization
Diagnostic Checklist
When CoreML isn't working:
- Check deployment target matches device iOS
- Check model file is compiled (.mlmodelc)
- Profile load: cached vs uncached
- Profile prediction: which compute units
- Check memory: concurrent predictions limited
- For compression issues: try higher granularity
- For conversion issues: check op support, precision
Resources
WWDC: 2023-10047, 2023-10049, 2024-10159, 2024-10161
Docs: /coreml, /coreml/mlmodel
Skills: coreml, coreml-ref