Skillshub coreml
Use when deploying custom ML models on-device, converting PyTorch models, compressing models, implementing LLM inference, or optimizing CoreML performance. Covers model conversion, compression, stateful models, KV-cache, multi-function models, MLTensor.
git clone https://github.com/ComeOnOliver/skillshub
T=$(mktemp -d) && git clone --depth=1 https://github.com/ComeOnOliver/skillshub "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/CharlesWiltgen/Axiom/coreml" ~/.claude/skills/comeonoliver-skillshub-coreml && rm -rf "$T"
skills/CharlesWiltgen/Axiom/coreml/SKILL.md

CoreML On-Device Machine Learning
Overview
CoreML enables on-device machine learning inference across all Apple platforms. It abstracts hardware details while leveraging Apple Silicon's CPU, GPU, and Neural Engine for high-performance, private, and efficient execution.
Key principle: Start with the simplest approach, then optimize based on profiling. Don't over-engineer compression or caching until you have real performance data.
Decision Tree - CoreML vs Foundation Models
```
Need on-device ML?
├─ Text generation (LLM)?
│  ├─ Simple prompts, structured output? → Foundation Models (ios-ai skill)
│  └─ Custom model, fine-tuned, specific architecture? → CoreML
├─ Custom trained model?
│  └─ Yes → CoreML
├─ Image/audio/sensor processing?
│  └─ Yes → CoreML
└─ Apple's built-in intelligence?
   └─ Yes → Foundation Models (ios-ai skill)
```
Red Flags
Use this skill when you see:
- "Convert PyTorch model to CoreML"
- "Model too large for device"
- "Slow inference performance"
- "LLM on-device"
- "KV-cache" or "stateful model"
- "Model compression" or "quantization"
- MLModel, MLTensor, or coremltools in context
Pattern 1 - Basic Model Conversion
The standard PyTorch → CoreML workflow.
```python
import coremltools as ct
import torch

# Trace the model
model.eval()
traced_model = torch.jit.trace(model, example_input)

# Convert to CoreML
mlmodel = ct.convert(
    traced_model,
    inputs=[ct.TensorType(shape=example_input.shape)],
    minimum_deployment_target=ct.target.iOS18
)

# Save
mlmodel.save("MyModel.mlpackage")
```
Critical: Always set `minimum_deployment_target` to enable the latest optimizations.
Pattern 2 - Model Compression (Post-Training)
Three techniques, each with different tradeoffs:
Palettization (Best for Neural Engine)
Clusters weights into lookup tables. Use per-grouped-channel for better accuracy.
```python
from coremltools.optimize.coreml import (
    OpPalettizerConfig,
    OptimizationConfig,
    palettize_weights,
)

# 4-bit with grouped channels (iOS 18+)
op_config = OpPalettizerConfig(
    mode="kmeans",
    nbits=4,
    granularity="per_grouped_channel",
    group_size=16
)
config = OptimizationConfig(global_config=op_config)
compressed_model = palettize_weights(model, config)
```
| Bits | Compression | Accuracy Impact |
|---|---|---|
| 8-bit | 2x | Minimal |
| 6-bit | 2.7x | Low |
| 4-bit | 4x | Moderate (use grouped channels) |
| 2-bit | 8x | High (requires training-time) |
Quantization (Best for GPU on Mac)
Linear mapping to INT8/INT4. Use per-block for better accuracy.
```python
from coremltools.optimize.coreml import (
    OpLinearQuantizerConfig,
    OptimizationConfig,
    linear_quantize_weights,
)

# INT4 per-block quantization (iOS 18+)
op_config = OpLinearQuantizerConfig(
    mode="linear",
    dtype="int4",
    granularity="per_block",
    block_size=32
)
config = OptimizationConfig(global_config=op_config)
compressed_model = linear_quantize_weights(model, config)
```
Pruning (Combine with other techniques)
Sets weights to zero for sparse representation. Can combine with palettization.
```python
from coremltools.optimize.coreml import (
    OpMagnitudePrunerConfig,
    OptimizationConfig,
    prune_weights,
)

op_config = OpMagnitudePrunerConfig(
    target_sparsity=0.4  # 40% zeros
)
config = OptimizationConfig(global_config=op_config)
sparse_model = prune_weights(model, config)
```
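Pruning composes with palettization by applying the two passes in sequence: prune first, then palettize the pruned model. The sketch below uses only the APIs shown in this skill; the 50%/4-bit settings are illustrative, and newer coremltools releases also expose a joint-compression option, so check your version's docs before relying on sparsity being preserved through the second pass.

```python
from coremltools.optimize.coreml import (
    OpMagnitudePrunerConfig,
    OpPalettizerConfig,
    OptimizationConfig,
    palettize_weights,
    prune_weights,
)

# Step 1: zero out the smallest-magnitude weights (sparsity value is illustrative)
prune_config = OptimizationConfig(
    global_config=OpMagnitudePrunerConfig(target_sparsity=0.5)
)
sparse_model = prune_weights(model, prune_config)

# Step 2: palettize the remaining weights
# NOTE: sketch only - recent coremltools versions add a joint-compression
# option for this combination; verify how your version handles sparsity here.
palettize_config = OptimizationConfig(
    global_config=OpPalettizerConfig(mode="kmeans", nbits=4)
)
compressed_model = palettize_weights(sparse_model, palettize_config)
```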
Pattern 3 - Training-Time Compression
When post-training compression loses too much accuracy, fine-tune with compression.
```python
from coremltools.optimize.torch.palettization import (
    DKMPalettizerConfig,
    DKMPalettizer,
)

# Configure 4-bit palettization
config = DKMPalettizerConfig(global_config={"n_bits": 4})

# Prepare model
palettizer = DKMPalettizer(model, config)
prepared_model = palettizer.prepare()

# Fine-tune (your training loop)
for epoch in range(num_epochs):
    train_epoch(prepared_model, data_loader)
    palettizer.step()

# Finalize
final_model = palettizer.finalize()
```
Tradeoff: Better accuracy than post-training, but requires training data and time.
Pattern 4 - Calibration-Based Compression (iOS 18+)
Middle ground: uses calibration data without full training.
```python
from coremltools.optimize.torch.pruning import (
    MagnitudePrunerConfig,
    LayerwiseCompressor,
)

# Configure
config = MagnitudePrunerConfig(
    target_sparsity=0.4,
    n_samples=128  # Calibration samples
)

# Create pruner
pruner = LayerwiseCompressor(model, config)

# Calibrate
sparse_model = pruner.compress(calibration_data_loader)
```
Pattern 5 - Stateful Models (KV-Cache for LLMs)
For transformer models, use state to avoid recomputing key/value vectors.
PyTorch Model with State
```python
import torch
import torch.nn as nn

class StatefulLLM(nn.Module):
    def __init__(self):
        super().__init__()
        # Register state buffers
        self.register_buffer("keyCache", torch.zeros(batch, heads, seq_len, dim))
        self.register_buffer("valueCache", torch.zeros(batch, heads, seq_len, dim))

    def forward(self, input_ids, causal_mask):
        # Update caches in-place during forward
        # ... attention with KV-cache ...
        return logits
```
Conversion with State
```python
import coremltools as ct

mlmodel = ct.convert(
    traced_model,
    inputs=[
        ct.TensorType(name="input_ids", shape=(1, ct.RangeDim(1, 2048))),
        ct.TensorType(
            name="causal_mask",
            shape=(1, 1, ct.RangeDim(1, 2048), ct.RangeDim(1, 2048))
        ),
    ],
    states=[
        ct.StateType(name="keyCache", ...),
        ct.StateType(name="valueCache", ...),
    ],
    minimum_deployment_target=ct.target.iOS18
)
```
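The `states` entries above are elided. As a hedged sketch of what fully specified states might look like (assuming coremltools' `StateType(wrapped_type=..., name=...)` signature; `num_heads` and `head_dim` are placeholder values, not from the original):

```python
import coremltools as ct
import numpy as np

# Illustrative cache shape: (batch, attention heads, max context length, head dim).
# Placeholder values - substitute your model's real dimensions.
kv_cache_shape = (1, num_heads, 2048, head_dim)

states = [
    ct.StateType(
        wrapped_type=ct.TensorType(shape=kv_cache_shape, dtype=np.float16),
        name="keyCache",   # must match the buffer name registered in PyTorch
    ),
    ct.StateType(
        wrapped_type=ct.TensorType(shape=kv_cache_shape, dtype=np.float16),
        name="valueCache",
    ),
]
```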
Using State at Runtime
```swift
// Create state from model
let state = model.makeState()

// Run prediction with state (updated in-place)
let output = try model.prediction(from: input, using: state)
```
Performance: 1.6x speedup on Mistral-7B (M3 Max) compared to manual KV-cache I/O.
Pattern 6 - Multi-Function Models (Adapters/LoRA)
Deploy multiple adapters in a single model, sharing base weights.
```python
import coremltools as ct
from coremltools.models import MultiFunctionDescriptor
from coremltools.models.utils import save_multifunction

# Convert individual models
sticker_model = ct.convert(sticker_adapter_model, ...)
storybook_model = ct.convert(storybook_adapter_model, ...)

# Save individually
sticker_model.save("sticker.mlpackage")
storybook_model.save("storybook.mlpackage")

# Merge with shared weights
# add_function(source package, source function name, new function name)
desc = MultiFunctionDescriptor()
desc.add_function("sticker.mlpackage", "main", "sticker")
desc.add_function("storybook.mlpackage", "main", "storybook")
save_multifunction(desc, "MultiAdapter.mlpackage")
```
Loading Specific Function
```swift
let config = MLModelConfiguration()
config.functionName = "sticker"  // or "storybook"

let model = try MLModel(contentsOf: modelURL, configuration: config)
```
Pattern 7 - MLTensor for Pipeline Stitching (iOS 18+)
Simplifies computation between models (decoding, post-processing).
```swift
import CoreML

// Create tensors
let scores = MLTensor(shape: [1, vocab_size], scalars: logits)

// Operations (executed asynchronously on Apple Silicon)
let topK = scores.topK(k: 10)
let probs = (topK.values / temperature).softmax()

// Sample from distribution
let sampled = probs.multinomial(numSamples: 1)

// Materialize to access data (blocks until complete)
let shapedArray = await sampled.shapedArray(of: Int32.self)
```
Key insight: MLTensor operations are async. Call `shapedArray()` to materialize results.
Pattern 8 - Async Prediction for Concurrency
Thread-safe concurrent predictions for throughput.
```swift
class ImageProcessor {
    let model: MLModel

    func processImages(_ images: [CGImage]) async throws -> [Output] {
        try await withThrowingTaskGroup(of: Output.self) { group in
            for image in images {
                group.addTask {
                    // Check cancellation before expensive work
                    try Task.checkCancellation()
                    let input = try self.prepareInput(image)
                    // Async prediction - thread safe!
                    return try await self.model.prediction(from: input)
                }
            }
            return try await group.reduce(into: []) { $0.append($1) }
        }
    }
}
```
Warning: Limit concurrent predictions to avoid memory pressure from multiple input/output buffers.
```swift
// Limit concurrency
// (AsyncSemaphore is not in the standard library - use a third-party
// implementation or an actor-based gate)
let semaphore = AsyncSemaphore(value: 2)

for image in images {
    group.addTask {
        await semaphore.wait()
        defer { semaphore.signal() }
        return try await process(image)
    }
}
```
Anti-Patterns
Don't - Load models on main thread at launch
```swift
// BAD - blocks UI
class AppDelegate {
    let model = try! MLModel(contentsOf: url)  // Blocks!
}

// GOOD - lazy async loading
class ModelManager {
    private var model: MLModel?

    func getModel() async throws -> MLModel {
        if let model { return model }
        model = try await Task.detached {
            try MLModel(contentsOf: url)
        }.value
        return model!
    }
}
```
Don't - Reload model for each prediction
```swift
// BAD - reloads every time
func predict(_ input: Input) throws -> Output {
    let model = try MLModel(contentsOf: url)  // Expensive!
    return try model.prediction(from: input)
}

// GOOD - keep model loaded
class Predictor {
    private let model: MLModel

    func predict(_ input: Input) throws -> Output {
        try model.prediction(from: input)
    }
}
```
Don't - Compress without profiling first
```python
# BAD - blind compression
compressed = palettize_weights(model, two_bit_config)  # May break accuracy!

# GOOD - profile, then compress iteratively
# 1. Profile Float16 baseline
# 2. Try 8-bit → check accuracy
# 3. Try 6-bit → check accuracy
# 4. Try 4-bit with grouped channels → check accuracy
# 5. Only use 2-bit with training-time compression
```
Don't - Ignore deployment target
```python
# BAD - misses optimizations
mlmodel = ct.convert(traced_model, inputs=[...])

# GOOD - enables SDPA fusion, per-block quantization, etc.
mlmodel = ct.convert(
    traced_model,
    inputs=[...],
    minimum_deployment_target=ct.target.iOS18
)
```
Pressure Scenarios
Scenario 1 - "Model is 5GB, need it under 2GB for iPhone"
Wrong approach: Jump straight to 2-bit palettization.
Right approach (a sketch of this loop follows the list):
- Start with 8-bit palettization → check accuracy
- Try 6-bit → check accuracy
- Try 4-bit with `per_grouped_channel` → check accuracy
- If still too large, use calibration-based compression
- If still losing accuracy, use training-time compression
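A minimal sketch of that loop using the palettization API from Pattern 2. The `evaluate_accuracy` helper and the 1% accuracy-drop threshold are illustrative assumptions, not part of coremltools:

```python
from coremltools.optimize.coreml import (
    OpPalettizerConfig,
    OptimizationConfig,
    palettize_weights,
)

# Hypothetical helper: run your own eval set through the model and
# return an accuracy metric. Not part of coremltools.
def evaluate_accuracy(mlmodel):
    ...

baseline = evaluate_accuracy(model)  # Float16 baseline
best = None

# Walk down the bit widths, keeping the smallest setting whose accuracy
# drop stays within an acceptable budget (1% here is illustrative).
for nbits in (8, 6, 4):
    op_config = OpPalettizerConfig(
        mode="kmeans",
        nbits=nbits,
        granularity="per_grouped_channel",  # helps most at 4-bit (iOS 18+)
        group_size=16,
    )
    candidate = palettize_weights(model, OptimizationConfig(global_config=op_config))
    if baseline - evaluate_accuracy(candidate) <= 0.01:
        best = candidate
    else:
        break  # accuracy dropped too far; keep the previous setting
```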
Scenario 2 - "LLM inference is too slow"
Wrong approach: Try different compute units randomly.
Right approach:
- Profile with Core ML Instrument
- Check if load is cached (look for "cached" vs "prepare and cache")
- Enable stateful KV-cache
- Check SDPA optimization is enabled (iOS 18+ deployment target)
- Consider INT4 quantization for GPU on Mac
Scenario 3 - "Need multiple LoRA adapters in one app"
Wrong approach: Ship separate models for each adapter.
Right approach:
- Convert each adapter model separately
- Use `MultiFunctionDescriptor` to merge with shared base
- Load specific function via `config.functionName`
- Weights are deduplicated automatically
Checklist
Before deploying a CoreML model:
- Set `minimum_deployment_target` to latest supported iOS
- Profile baseline Float16 performance
- Check if model load is cached
- Consider compression only if size/performance requires it
- Test accuracy after each compression step
- Use async prediction for concurrent workloads
- Limit concurrent predictions to manage memory
- Use state for transformer KV-cache
- Use multi-function for adapter variants
Resources
WWDC: 2023-10047, 2023-10049, 2024-10159, 2024-10161
Docs: /coreml, /coreml/mlmodel, /coreml/mltensor
Skills: coreml-ref, coreml-diag, axiom-ios-ai (Foundation Models)