Autosearch video-to-text-local

Transcribe video/audio URL or local file to text + SRT using yt-dlp + local mlx-whisper (Apple Silicon). Free, offline, fastest on M-series Macs. Opt-in advanced path for users with Apple Silicon + mlx-whisper installed. Returns raw text and segments; summary is caller's responsibility.

install
source · Clone the upstream repo
git clone https://github.com/0xmariowu/Autosearch
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/0xmariowu/Autosearch "$T" && mkdir -p ~/.claude/skills && cp -r "$T/autosearch/skills/tools/video-to-text-local" ~/.claude/skills/0xmariowu-autosearch-video-to-text-local && rm -rf "$T"
manifest: autosearch/skills/tools/video-to-text-local/SKILL.md
source content

Transcribe video or audio entirely on-device using yt-dlp audio extraction plus local mlx-whisper inference. Works offline, zero API cost, fastest on Apple Silicon M-series with MLX Metal acceleration.
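Under the hood this is a two-step pipeline: yt-dlp downloads and extracts the audio track, then mlx-whisper runs Whisper inference on it locally. The sketch below shows that shape only; the option values and file layout are assumptions for illustration, not the skill's actual transcribe.py code.

import yt_dlp
import mlx_whisper

def transcribe_pipeline(url: str, out_dir: str = "/tmp/demo") -> dict:
    # Step 1: grab the best audio stream and convert it to mp3 with ffmpeg.
    ydl_opts = {
        "format": "bestaudio/best",
        "outtmpl": f"{out_dir}/audio.%(ext)s",
        "postprocessors": [{"key": "FFmpegExtractAudio", "preferredcodec": "mp3"}],
    }
    with yt_dlp.YoutubeDL(ydl_opts) as ydl:
        ydl.download([url])
    # Step 2: transcribe the extracted audio on-device with MLX Whisper.
    return mlx_whisper.transcribe(
        f"{out_dir}/audio.mp3",
        path_or_hf_repo="mlx-community/whisper-large-v3-turbo",
    )

mlx_whisper.transcribe returns the text plus timestamped segments, which the skill then reshapes into raw_txt and subtitle_srt.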

Input Fit

  • YouTube URLs.
  • Bilibili URLs.
  • Douyin URLs.
  • Xiaoyuzhou podcast URLs.
  • Local audio files such as MP3, M4A, WAV, AAC, FLAC, OGG, and OPUS.
  • Local video files such as MP4, MOV, MKV, AVI, WEBM, FLV, and M4V.

Invocation

Call transcribe.py's sync transcribe(url_or_path: str) function with the original URL or local file path:

result = transcribe("https://www.youtube.com/watch?v=example")

Successful calls return:

{
    "ok": True,
    "raw_txt": "...",
    "subtitle_srt": "1\n00:00:00,000 --> 00:00:03,100\n...\n",
    "meta": {
        "language": "en",
        "duration_sec": 123.4,
        "model": "mlx-community/whisper-large-v3-turbo",
        "backend": "mlx-whisper",
    },
    "audio_path": "/tmp/autosearch-video-to-text-local-.../audio.mp3",
    "source": "https://www.youtube.com/watch?v=example",
}
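A minimal way to consume that result, e.g. persisting the transcript and subtitles to disk (importing transcribe from transcribe.py as a plain module is an assumption; the output filenames are arbitrary):

from pathlib import Path
from transcribe import transcribe  # assumes transcribe.py is importable from the working directory

result = transcribe("https://www.youtube.com/watch?v=example")
if result["ok"]:
    # The caller decides how to use the text; this just saves both artifacts.
    Path("transcript.txt").write_text(result["raw_txt"], encoding="utf-8")
    Path("transcript.srt").write_text(result["subtitle_srt"], encoding="utf-8")
    print(result["meta"]["language"], result["meta"]["duration_sec"], result["meta"]["backend"])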

Failure Modes

  • Missing mlx-whisper package returns reason: mlx_whisper_unavailable (install with pip install mlx-whisper, Apple Silicon only).
  • Unsupported URLs, network failures, and non-zero yt-dlp exits return reason: yt_dlp_failed.
  • Missing local ffmpeg for video conversion returns reason: ffmpeg_missing.
  • Model download failure (no network on first run) returns reason: model_download_failed.
  • Other MLX / transcription runtime errors return reason: mlx_whisper_runtime_error.

Downgrade chain: use video-to-text-groq as the free cloud fallback (requires GROQ_API_KEY), then video-to-text-openai as the paid cloud fallback.
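One way a caller might walk that chain, sketched with hypothetical import paths for the two fallback skills (their real module locations are not documented here):

import os
from transcribe import transcribe as transcribe_local            # this skill
from groq_transcribe import transcribe as transcribe_groq        # hypothetical import for video-to-text-groq
from openai_transcribe import transcribe as transcribe_openai    # hypothetical import for video-to-text-openai

def transcribe_with_fallback(url_or_path: str) -> dict:
    result = transcribe_local(url_or_path)
    if result["ok"]:
        return result
    # Free cloud fallback, only usable when a Groq key is configured.
    if os.environ.get("GROQ_API_KEY"):
        result = transcribe_groq(url_or_path)
        if result["ok"]:
            return result
    # Paid cloud fallback as the last resort.
    return transcribe_openai(url_or_path)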

Limits

  • Requires Apple Silicon (M1 / M2 / M3 / M4). Not compatible with Intel Macs, Linux, or Windows.
  • Requires the mlx-whisper package installed separately: pip install mlx-whisper (not a default autosearch dependency because it is macOS/ARM-only).
  • First run downloads the Whisper model to ~/.cache/huggingface/hub/ (~3 GB for large-v3-turbo).
  • Default model: mlx-community/whisper-large-v3-turbo. Override via the AUTOSEARCH_MLX_WHISPER_MODEL env var to any HuggingFace repo or local model path (see the example after this list).
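For example, to point the skill at a different model before transcribing (the repo name below is illustrative; any MLX-converted Whisper repo on HuggingFace, or a local model path, should work):

import os

# Set before the model is loaded; smaller models trade accuracy for speed and download size.
os.environ["AUTOSEARCH_MLX_WHISPER_MODEL"] = "mlx-community/whisper-medium-mlx"  # illustrative repo name

from transcribe import transcribe
result = transcribe("/path/to/lecture.m4a")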

This tool does not produce a summary. It returns raw_txt and subtitles so the runtime AI can decide how to process, summarize, quote, or cite the transcript.

Quality Bar

  • Evidence items have non-empty title and url.
  • No crash on empty or malformed API response.
  • Source channel field matches the channel name.