Skillshub speech
Use when implementing speech-to-text, live transcription, or audio transcription. Covers SpeechAnalyzer (iOS 26+), SpeechTranscriber, volatile/finalized results, AssetInventory model management, audio format handling.
git clone https://github.com/ComeOnOliver/skillshub
T=$(mktemp -d) && git clone --depth=1 https://github.com/ComeOnOliver/skillshub "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/CharlesWiltgen/Axiom/speech" ~/.claude/skills/comeonoliver-skillshub-speech && rm -rf "$T"
skills/CharlesWiltgen/Axiom/speech/SKILL.md

Speech-to-Text with SpeechAnalyzer
Overview
SpeechAnalyzer is Apple's new speech-to-text API introduced in iOS 26. It powers Notes, Voice Memos, Journal, and Call Summarization. The on-device model is faster, more accurate, and better for long-form/distant audio than SFSpeechRecognizer.
Key principle: SpeechAnalyzer is modular—add transcription modules to an analysis session. Results stream asynchronously using Swift's AsyncSequence.
Decision Tree - SpeechAnalyzer vs SFSpeechRecognizer
```
Need speech-to-text?
├─ iOS 26+ only?
│  └─ Yes → SpeechAnalyzer (preferred)
├─ Need iOS 10-25 support?
│  └─ Yes → SFSpeechRecognizer (or DictationTranscriber)
├─ Long-form audio (meetings, lectures)?
│  └─ Yes → SpeechAnalyzer
├─ Distant audio (across room)?
│  └─ Yes → SpeechAnalyzer
└─ Short dictation commands?
   └─ Either works
```
SpeechAnalyzer advantages:
- Better for long-form and conversational audio
- Works well with distant speakers (meetings)
- On-device, private
- Model managed by system (no app size increase)
- Powers Notes, Voice Memos, Journal
DictationTranscriber (iOS 26+): Same languages as SFSpeechRecognizer, but doesn't require user to enable Siri/dictation in Settings.
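As a minimal sketch of the decision tree above, an availability check can route to the right backend at runtime. The enum and helper below are illustrative, not part of the Speech API:

```swift
// Illustrative routing based on OS availability; type and function names are hypothetical.
enum TranscriptionBackend {
    case speechAnalyzer      // iOS 26+ preferred path
    case sfSpeechRecognizer  // legacy fallback for earlier systems
}

func chooseBackend() -> TranscriptionBackend {
    if #available(iOS 26.0, *) {
        return .speechAnalyzer
    } else {
        return .sfSpeechRecognizer
    }
}
```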
Red Flags
Use this skill when you see:
- "Live transcription"
- "Transcribe audio"
- "Speech-to-text"
- "SpeechAnalyzer" or "SpeechTranscriber"
- "Volatile results"
- Building Notes-like or Voice Memos-like features
Pattern 1 - File Transcription (Simplest)
Transcribe an audio file to text in one function.
```swift
import Speech

func transcribe(file: URL, locale: Locale) async throws -> AttributedString {
    // Set up transcriber
    let transcriber = SpeechTranscriber(
        locale: locale,
        preset: .offlineTranscription
    )

    // Collect results asynchronously
    async let transcriptionFuture = try transcriber.results
        .reduce(AttributedString()) { str, result in
            str + result.text
        }

    // Set up analyzer with transcriber module
    let analyzer = SpeechAnalyzer(modules: [transcriber])

    // Analyze the file
    if let lastSample = try await analyzer.analyzeSequence(from: file) {
        try await analyzer.finalizeAndFinish(through: lastSample)
    } else {
        await analyzer.cancelAndFinishNow()
    }

    return try await transcriptionFuture
}
```
Key points:
- `analyzeSequence(from:)` reads the file and feeds audio to the analyzer
- `finalizeAndFinish(through:)` ensures all results are finalized
- Results are `AttributedString` values with timing metadata
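A possible call site for the function above; the file URL and locale are placeholders:

```swift
// Hypothetical usage of transcribe(file:locale:) defined above.
let fileURL = URL(fileURLWithPath: "/path/to/recording.m4a")  // placeholder path
let transcript = try await transcribe(file: fileURL, locale: Locale(identifier: "en-US"))
print(String(transcript.characters))
```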
Pattern 2 - Live Transcription Setup
For real-time transcription from microphone.
Step 1 - Configure SpeechTranscriber
```swift
import Speech

class TranscriptionManager: ObservableObject {
    private var transcriber: SpeechTranscriber?
    private var analyzer: SpeechAnalyzer?
    private var analyzerFormat: AudioFormatDescription?
    private var inputBuilder: AsyncStream<AnalyzerInput>.Continuation?

    @Published var finalizedTranscript = AttributedString()
    @Published var volatileTranscript = AttributedString()

    func setUp() async throws {
        // Create transcriber with options
        transcriber = SpeechTranscriber(
            locale: Locale.current,
            transcriptionOptions: [],
            reportingOptions: [.volatileResults],  // Enable real-time updates
            attributeOptions: [.audioTimeRange]    // Include timing
        )
        guard let transcriber else { throw TranscriptionError.setupFailed }

        // Create analyzer with transcriber module
        analyzer = SpeechAnalyzer(modules: [transcriber])

        // Get required audio format
        analyzerFormat = await SpeechAnalyzer.bestAvailableAudioFormat(
            compatibleWith: [transcriber]
        )

        // Ensure model is available
        try await ensureModel(for: transcriber)

        // Create input stream
        let (stream, builder) = AsyncStream<AnalyzerInput>.makeStream()
        inputBuilder = builder

        // Start analyzer
        try await analyzer?.start(inputSequence: stream)
    }
}
```
Step 2 - Ensure Model Availability
```swift
func ensureModel(for transcriber: SpeechTranscriber) async throws {
    let locale = Locale.current

    // Check if language is supported
    let supported = await SpeechTranscriber.supportedLocales
    guard supported.contains(where: {
        $0.identifier(.bcp47) == locale.identifier(.bcp47)
    }) else {
        throw TranscriptionError.localeNotSupported
    }

    // Check if model is installed
    let installed = await SpeechTranscriber.installedLocales
    if installed.contains(where: {
        $0.identifier(.bcp47) == locale.identifier(.bcp47)
    }) {
        return  // Already installed
    }

    // Download model
    if let downloader = try await AssetInventory.assetInstallationRequest(
        supporting: [transcriber]
    ) {
        // Track progress if needed
        let progress = downloader.progress
        try await downloader.downloadAndInstall()
    }
}
```
Note: Models are stored in system storage, not app storage. Only a limited number of languages can be allocated at once.
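If you surface download progress to the user, one minimal sketch is to hand `downloader.progress` (a Foundation `Progress`) to a SwiftUI view. The view name and polling interval below are illustrative:

```swift
import SwiftUI

// Illustrative progress UI; pass in `downloader.progress` from ensureModel(for:).
struct ModelDownloadView: View {
    let progress: Progress
    @State private var fraction = 0.0

    var body: some View {
        ProgressView(value: fraction) {
            Text("Downloading speech model…")
        }
        .task {
            // Poll fractionCompleted until the download finishes
            while !progress.isFinished {
                fraction = progress.fractionCompleted
                try? await Task.sleep(for: .milliseconds(250))
            }
            fraction = 1.0
        }
    }
}
```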
Step 3 - Handle Results
```swift
func startResultHandling() {
    Task {
        guard let transcriber else { return }
        do {
            for try await result in transcriber.results {
                let text = result.text
                if result.isFinal {
                    // Finalized result - won't change
                    finalizedTranscript += text
                    volatileTranscript = AttributedString()

                    // Access timing info
                    for run in text.runs {
                        if let timeRange = run.audioTimeRange {
                            print("Time: \(timeRange)")
                        }
                    }
                } else {
                    // Volatile result - will be replaced
                    volatileTranscript = text
                }
            }
        } catch {
            print("Transcription failed: \(error)")
        }
    }
}
```
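A common way to present the two published properties is to append the volatile tail to the finalized text. This SwiftUI sketch (view name and styling assumed) shows the idea:

```swift
import SwiftUI

// Illustrative display: finalized text stays fixed, the volatile tail is dimmed.
struct LiveTranscriptView: View {
    @ObservedObject var manager: TranscriptionManager

    private var displayText: AttributedString {
        var volatile = manager.volatileTranscript
        volatile.foregroundColor = .secondary  // mark text that may still change
        return manager.finalizedTranscript + volatile
    }

    var body: some View {
        Text(displayText)
    }
}
```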
Pattern 3 - Audio Recording and Streaming
Connect AVAudioEngine to SpeechAnalyzer.
```swift
import AVFoundation

class AudioRecorder {
    private let audioEngine = AVAudioEngine()
    private var outputContinuation: AsyncStream<AVAudioPCMBuffer>.Continuation?
    private let transcriptionManager: TranscriptionManager

    // Inject the transcription manager created elsewhere
    init(transcriptionManager: TranscriptionManager) {
        self.transcriptionManager = transcriptionManager
    }

    func startRecording() async throws {
        // Request permission
        guard await AVAudioApplication.requestRecordPermission() else {
            throw RecordingError.permissionDenied
        }

        // Configure audio session (iOS)
        #if os(iOS)
        let session = AVAudioSession.sharedInstance()
        try session.setCategory(.playAndRecord, mode: .spokenAudio)
        try session.setActive(true, options: .notifyOthersOnDeactivation)
        #endif

        // Set up transcriber
        try await transcriptionManager.setUp()
        transcriptionManager.startResultHandling()

        // Stream audio to transcriber
        for await buffer in try audioStream() {
            try await transcriptionManager.streamAudio(buffer)
        }
    }

    private func audioStream() throws -> AsyncStream<AVAudioPCMBuffer> {
        let inputNode = audioEngine.inputNode
        let format = inputNode.outputFormat(forBus: 0)

        inputNode.installTap(
            onBus: 0,
            bufferSize: 4096,
            format: format
        ) { [weak self] buffer, time in
            self?.outputContinuation?.yield(buffer)
        }

        audioEngine.prepare()
        try audioEngine.start()

        return AsyncStream { continuation in
            outputContinuation = continuation
        }
    }
}
```
Stream Audio with Format Conversion
```swift
extension TranscriptionManager {
    // Note: declare `converter` as a stored property on TranscriptionManager itself;
    // stored properties are not allowed in extensions.
    // private var converter: AVAudioConverter?

    func streamAudio(_ buffer: AVAudioPCMBuffer) async throws {
        guard let inputBuilder, let analyzerFormat else {
            throw TranscriptionError.notSetUp
        }

        // Convert to analyzer's required format
        let converted = try convertBuffer(buffer, to: analyzerFormat)

        // Send to analyzer
        let input = AnalyzerInput(buffer: converted)
        inputBuilder.yield(input)
    }

    private func convertBuffer(
        _ buffer: AVAudioPCMBuffer,
        to format: AudioFormatDescription
    ) throws -> AVAudioPCMBuffer {
        // Lazily initialize converter
        if converter == nil {
            let sourceFormat = buffer.format
            let destFormat = AVAudioFormat(cmAudioFormatDescription: format)!
            converter = AVAudioConverter(from: sourceFormat, to: destFormat)
        }
        guard let converter else { throw TranscriptionError.conversionFailed }

        let outputBuffer = AVAudioPCMBuffer(
            pcmFormat: converter.outputFormat,
            frameCapacity: buffer.frameLength
        )!

        try converter.convert(to: outputBuffer, from: buffer)
        return outputBuffer
    }
}
```
Pattern 4 - Stopping Transcription
Finalize properly so any remaining volatile results are converted to finalized results.
```swift
func stopRecording() async {
    // Stop audio
    audioEngine.stop()
    audioEngine.inputNode.removeTap(onBus: 0)
    outputContinuation?.finish()

    // Finalize transcription (converts remaining volatile to final)
    try? await analyzer?.finalizeAndFinishThroughEndOfInput()

    // Cancel any pending tasks
    recognizerTask?.cancel()
}
```
Critical: Always call `finalizeAndFinishThroughEndOfInput()` to ensure volatile results are finalized.
Pattern 5 - Model Asset Management
Check Supported Languages
```swift
// Languages the API supports
let supported = await SpeechTranscriber.supportedLocales

// Languages currently installed on device
let installed = await SpeechTranscriber.installedLocales
```
Deallocate Languages
Only a limited number of languages can be allocated at once; deallocate the ones you no longer need.
```swift
func deallocateLanguages() async {
    let allocated = await AssetInventory.allocatedLocales
    for locale in allocated {
        await AssetInventory.deallocate(locale: locale)
    }
}
```
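If you only want to reclaim languages the app no longer uses, a selective variant (helper name assumed) keeps the active locale allocated while releasing the rest:

```swift
import Speech

// Illustrative: release every allocated locale except the one still in use.
func deallocateUnusedLanguages(keeping keep: Locale) async {
    let allocated = await AssetInventory.allocatedLocales
    for locale in allocated where locale.identifier(.bcp47) != keep.identifier(.bcp47) {
        await AssetInventory.deallocate(locale: locale)
    }
}
```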
Pattern 6 - Displaying Results with Timing
Highlight text during audio playback using timing metadata.
```swift
import SwiftUI
import CoreMedia
import Speech

struct TranscriptView: View {
    let transcript: AttributedString
    @Binding var playbackTime: CMTime

    var body: some View {
        Text(highlightedTranscript)
    }

    var highlightedTranscript: AttributedString {
        var result = transcript
        for run in transcript.runs {
            guard let timeRange = run.audioTimeRange else { continue }
            let isActive = timeRange.containsTime(playbackTime)
            if isActive {
                result[run.range].backgroundColor = .yellow
            }
        }
        return result
    }
}
```
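To drive `playbackTime` during playback, one option (the wiring and interval are illustrative) is a periodic time observer on `AVPlayer`:

```swift
import AVFoundation
import Combine
import CoreMedia

// Illustrative: publishes the player's current time so TranscriptView can highlight along.
final class PlaybackTimeSource: ObservableObject {
    @Published var playbackTime: CMTime = .zero
    private var timeObserver: Any?

    func attach(to player: AVPlayer) {
        // Update roughly every 100 ms on the main queue
        let interval = CMTime(value: 1, timescale: 10)
        timeObserver = player.addPeriodicTimeObserver(forInterval: interval, queue: .main) { [weak self] time in
            self?.playbackTime = time
        }
    }

    func detach(from player: AVPlayer) {
        if let timeObserver {
            player.removeTimeObserver(timeObserver)
            self.timeObserver = nil
        }
    }
}
```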
Anti-Patterns
Don't - Forget to finalize
```swift
// BAD - volatile results lost
func stopRecording() {
    audioEngine.stop()
    // Missing finalize!
}

// GOOD - volatile results become finalized
func stopRecording() async {
    audioEngine.stop()
    try? await analyzer?.finalizeAndFinishThroughEndOfInput()
}
```
Don't - Ignore format conversion
```swift
// BAD - format mismatch may fail silently
inputBuilder.yield(AnalyzerInput(buffer: rawBuffer))

// GOOD - convert to analyzer's format
let format = await SpeechAnalyzer.bestAvailableAudioFormat(compatibleWith: [transcriber])
let converted = try convertBuffer(rawBuffer, to: format)
inputBuilder.yield(AnalyzerInput(buffer: converted))
```
Don't - Skip model availability check
```swift
// BAD - may crash if model not installed
let transcriber = SpeechTranscriber(locale: locale, ...)
// Start using immediately

// GOOD - ensure model is ready
let transcriber = SpeechTranscriber(locale: locale, ...)
try await ensureModel(for: transcriber)
// Now safe to use
```
Presets Reference
| Preset | Use Case |
|---|---|
| `.offlineTranscription` | File transcription, no real-time feedback needed |
| | Live transcription with volatile updates |
Options Reference
TranscriptionOptions
- Default: None (standard transcription)
ReportingOptions
- `.volatileResults`: Enable real-time approximate results.
AttributeOptions
- `.audioTimeRange`: Include CMTimeRange for each text segment.
Platform Availability
| Platform | SpeechTranscriber | DictationTranscriber |
|---|---|---|
| iOS 26+ | Yes | Yes |
| macOS Tahoe+ | Yes | Yes |
| watchOS 26+ | No | Yes |
| tvOS 26+ | TBD | TBD |
Hardware requirements: Varies by device. Use `supportedLocales` to check.
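A quick runtime gate built from the same `supportedLocales` call (helper name assumed):

```swift
import Speech

// Illustrative check before exposing transcription UI.
func isTranscriptionAvailable(for locale: Locale = .current) async -> Bool {
    let supported = await SpeechTranscriber.supportedLocales
    return supported.contains { $0.identifier(.bcp47) == locale.identifier(.bcp47) }
}
```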
Integration with Apple Intelligence
Combine with Foundation Models for summarization:
```swift
import FoundationModels

func generateTitle(for transcript: String) async throws -> String {
    let session = LanguageModelSession()
    let prompt = "Generate a short, clever title for this story: \(transcript)"
    let response = try await session.respond(to: prompt)
    return response.content
}
```
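A possible call site, assuming it runs inside TranscriptionManager after recording stops (the wiring below is illustrative):

```swift
// Hypothetical glue: summarize the finalized transcript once recording stops.
func finishAndSuggestTitle() async {
    let plainText = String(finalizedTranscript.characters)
    if let title = try? await generateTitle(for: plainText) {
        print("Suggested title: \(title)")
    }
}
```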
See the axiom-ios-ai skill for Foundation Models details.
Checklist
Before shipping speech-to-text:
- Check locale support with `supportedLocales`
- Ensure model with `AssetInventory.assetInstallationRequest`
- Handle download progress for user feedback
- Convert audio to `bestAvailableAudioFormat`
- Enable `.volatileResults` for live transcription
- Call `finalizeAndFinishThroughEndOfInput()` on stop
- Handle timing with `.audioTimeRange` if needed
- Clear volatile results when finalized result arrives
- Request microphone permission before recording
Resources
WWDC: 2025-277
Docs: /speech, /speech/speechanalyzer, /speech/speechtranscriber
Skills: coreml (on-device ML), axiom-ios-ai (Foundation Models)