Claude-skill-registry audio-transcriber
Extracts audio from dashcam MP4 files and produces GPU-accelerated timestamped transcripts with optional speaker diarization. This skill should be used when users request audio transcription from video files, mention dashcam audio/transcribe MP4/extract speech, want to analyze conversations from video footage, need timestamped transcripts with speaker identification, or ask to process video folders with audio extraction.
git clone https://github.com/majiayu000/claude-skill-registry
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/audio-transcriber" ~/.claude/skills/majiayu000-claude-skill-registry-audio-transcriber && rm -rf "$T"
Audio Transcriber
Skill Type: Media Processing & Analysis
Domain: Audio Transcription, Speech Recognition, GPU Acceleration
Version: 2.0
Last Updated: 2025-10-26
Description
Extracts audio from dashcam MP4 files and produces GPU-accelerated timestamped transcripts with optional speaker diarization. Uses faster-whisper with CUDA for efficient processing, organizing outputs by date with comprehensive metadata and quality metrics.
When to Use This Skill:
- User requests audio transcription from video files
- User mentions "dashcam audio", "transcribe MP4", or "extract speech"
- User wants to analyze conversations from video footage
- User needs timestamped transcripts with speaker identification
- User asks to process video folders with audio extraction
Quick Start
User Trigger Phrases
- "Transcribe audio from my dashcam videos"
- "Extract and transcribe speech from [folder/date]"
- "Generate transcripts for [MP4 files/date range]"
- "Process dashcam audio with speaker identification"
- "Create subtitles from video files"
Expected Inputs
- Video Folder Path (required) - Path to MP4 files or date-organized folders
- Date Range (optional) - Single day, range, or "all available"
- Output Directory (optional) - Default: parallel to input with `_transcripts` suffix
- Processing Options (optional) - Model size, formats, diarization, GPU settings
Expected Outputs
- Audio extracts (WAV files) organized by date
- Transcripts in multiple formats (TXT, JSON, SRT, VTT)
- Global INDEX.csv with searchable segment metadata
- Results JSON with GPU metrics and processing statistics
- Quality reports with confidence scores and coverage
Core Capabilities
1. User Input Acquisition (Section 0 Protocol)
CRITICAL: Always follow the Section 0 protocol before processing.
Step 1: Parse User Request
User: "Transcribe my dashcam videos from September 3rd"
AI Detection:
✅ Date detected: 2025-09-03
❌ Folder path not specified → Try auto-discovery
Step 2: Auto-Discovery
Search these locations in order:
- C:\Users\[user]\Desktop\CARDV\Movie_F\YYYYMMDD\*.MP4
- G:\My Drive\PROJECTS\INVESTIGATION\DASHCAM\Movie_F\YYYYMMDD\*.MP4
- Current working directory + \*.mp4
- User-provided explicit path
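The ordered search above can be sketched as a small helper. This is a minimal illustration, not the skill's actual implementation: the function name `auto_discover_videos` is borrowed from the Implementation Guide, but its signature (a list of candidate folders passed in by the caller) is an assumption.

```python
from pathlib import Path
from typing import Iterable, Optional


def auto_discover_videos(candidates: Iterable[Path]) -> Optional[Path]:
    """Return the first candidate folder that contains at least one MP4 file.

    The `candidates` parameter is an assumption: the caller supplies the
    search locations in priority order.
    """
    for folder in candidates:
        if not folder.is_dir():
            continue  # Skip locations that do not exist on this machine
        # Compare the extension case-insensitively (.MP4 and .mp4 both occur)
        if any(p.is_file() and p.suffix.lower() == ".mp4" for p in folder.iterdir()):
            return folder
    return None  # Nothing found: the caller should prompt the user
```

Returning `None` instead of raising keeps the "prompt for missing inputs" step in the caller's hands.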
Step 3: Validate Files
def validate_video_folder(folder_path):
    # Check folder exists
    # Count MP4 files
    # Estimate total audio duration
    # Check disk space for outputs
    # Verify FFmpeg installation
    raise NotImplementedError("outline only")
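One possible way to fill in most of these checks with the standard library is shown below. It is a sketch under assumptions: the `min_free_bytes` threshold and the returned dictionary keys are hypothetical, and the duration estimate is omitted because it would need an `ffprobe` call per file.

```python
import shutil
from pathlib import Path


def validate_video_folder(folder_path: str, min_free_bytes: int = 1_000_000_000) -> dict:
    """Validate a video folder and return a summary (keys are illustrative)."""
    folder = Path(folder_path)
    if not folder.is_dir():
        raise FileNotFoundError(f"Folder not found: {folder}")

    # Count MP4 files, matching the extension case-insensitively
    videos = sorted(
        p for p in folder.iterdir() if p.is_file() and p.suffix.lower() == ".mp4"
    )
    if not videos:
        raise ValueError(f"No MP4 files in {folder}")

    free = shutil.disk_usage(folder).free
    return {
        "video_count": len(videos),
        "total_video_bytes": sum(p.stat().st_size for p in videos),
        "free_bytes": free,
        "enough_disk": free >= min_free_bytes,  # threshold is an assumption
        "ffmpeg_found": shutil.which("ffmpeg") is not None,  # PATH lookup only
    }
```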
Step 4: Prompt for Missing Inputs
If files not found:
🎥 I need video files to transcribe.

I couldn't find MP4s in the expected location.

Please provide ONE of the following:
1. **Direct folder path**: e.g., `C:\Users\yousu\Desktop\CARDV\Movie_F\20250903`
2. **Parent directory**: I'll search for date subfolders
3. **Single video file**: Process just one MP4

💡 Tip: Dashcam folders are usually organized by date (YYYYMMDD)
Step 5: Configuration Summary & Confirmation
📋 **Transcription Configuration Summary**

**Input:**
📁 Folder: C:\Users\yousu\Desktop\CARDV\Movie_F\20250903
📹 Videos found: 47 MP4 files
⏱️ Estimated audio: ~7.8 hours
💾 Estimated output size: ~450 MB (transcripts only)

**Processing:**
🖥️ GPU: NVIDIA GeForce RTX 4080 (detected)
🧠 Model: faster-whisper base (FP16, CUDA)
🎯 Segmentation: fixed 30s chunks
🗣️ Diarization: disabled (opt-in)
📝 Formats: txt, json, srt

**Output:**
💾 Audio extracts: C:\Users\yousu\Desktop\CARDV\Movie_F\20250903\audio\
📄 Transcripts: C:\Users\yousu\Desktop\CARDV\Movie_F\20250903\transcripts\
📊 INDEX.csv: C:\Users\yousu\Desktop\CARDV\Movie_F\20250903\transcripts\INDEX.csv

Ready to proceed? (Yes/No)
NEVER begin processing without user confirmation.
2. Audio Processing Pipeline
A. Audio Extraction (FFmpeg with Retry Matrix)
# Primary extraction command
ffmpeg -i video.mp4 -vn -acodec pcm_s16le -ar 16000 -ac 1 audio.wav

# Retry sequence on failure:
# 1. Codec fallback: pcm_s16le → flac
# 2. Add demuxer args: -fflags +genpts -rw_timeout 30000000
# 3. Extended probe: -analyzeduration 100M -probesize 100M
Quality Checks:
- Verify audio stream exists (ffprobe preflight)
- Check duration matches video duration
- Detect silent/corrupted audio
- Log extraction errors to `_FAILED.json`
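The retry matrix can be expressed as an ordered list of FFmpeg argument lists that are tried until one succeeds. The sketch below is an illustration only: the function names are hypothetical, the `-y` overwrite flag is an addition so that reruns do not block on a prompt, and writing the FLAC fallback to a `.flac` file is an assumption.

```python
import subprocess
from typing import List, Optional


def build_extraction_attempts(video: str, out_stem: str) -> List[List[str]]:
    """Return FFmpeg commands in retry order (primary first)."""
    audio = ["-vn", "-ar", "16000", "-ac", "1"]  # mono, 16 kHz, no video
    pcm = [*audio, "-acodec", "pcm_s16le", f"{out_stem}.wav"]
    return [
        # Primary extraction command
        ["ffmpeg", "-y", "-i", video, *pcm],
        # 1. Codec fallback: pcm_s16le -> flac (output extension is an assumption)
        ["ffmpeg", "-y", "-i", video, *audio, "-acodec", "flac", f"{out_stem}.flac"],
        # 2. Demuxer args go before -i because they are input options
        ["ffmpeg", "-y", "-fflags", "+genpts", "-rw_timeout", "30000000", "-i", video, *pcm],
        # 3. Extended probe, also input options
        ["ffmpeg", "-y", "-analyzeduration", "100M", "-probesize", "100M", "-i", video, *pcm],
    ]


def extract_with_retry(video: str, out_stem: str) -> Optional[str]:
    """Run each attempt in order; return the output path of the first success."""
    for cmd in build_extraction_attempts(video, out_stem):
        if subprocess.run(cmd, capture_output=True).returncode == 0:
            return cmd[-1]
    return None  # All attempts failed: caller should log an ffmpeg_err record
```

Keeping the command construction separate from the runner makes the retry order easy to inspect without FFmpeg installed.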
B. Segmentation (Two Modes)
Fixed Mode (Default):
- Split audio into 30-second chunks
- Predictable processing time
- No external VAD required
- Best for continuous speech
VAD Mode (Advanced):
- Use Silero VAD to detect speech regions
- Variable-length segments (2-60s)
- Skip long silences
- Best for sparse audio (parking mode)
Mutual Exclusion: Only one mode active at a time.
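For the fixed mode, the chunk boundaries follow directly from the audio duration. A minimal sketch, assuming the caller already knows the duration in seconds (this signature differs from the `segment_fixed(audio_file, chunk_size=30)` call used in the Implementation Guide and is hypothetical):

```python
from typing import List, Tuple


def fixed_chunk_bounds(duration_sec: float, chunk_sec: float = 30.0) -> List[Tuple[float, float]]:
    """Split [0, duration_sec) into consecutive chunks of at most chunk_sec."""
    if chunk_sec <= 0:
        raise ValueError("chunk_sec must be positive")
    bounds = []
    start = 0.0
    while start < duration_sec:
        end = min(start + chunk_sec, duration_sec)  # last chunk may be shorter
        bounds.append((start, end))
        start = end
    return bounds
```

For example, a 75-second file yields two full 30-second chunks followed by one 15-second chunk.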
C. GPU Transcription (faster-whisper)
from faster_whisper import WhisperModel

# Load model with GPU optimization
model = WhisperModel(
    "base",
    device="cuda",
    compute_type="float16"
)

# Transcribe with word-level timestamps
segments, info = model.transcribe(
    audio_path,
    beam_size=5,
    word_timestamps=True,
    vad_filter=True
)
GPU Metrics Captured:
- Device name, VRAM, utilization
- CUDA version, driver version
- Average GPU % during run (sampled at 1-2 Hz)
- Memory usage peaks
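One way to collect such samples is to poll `nvidia-smi` with its CSV query output and aggregate the readings afterwards. The sketch below is an assumption about how this could be done, not a description of `scripts/gpu_monitor.py`; the function names and result keys are hypothetical, and only the first GPU is read.

```python
import subprocess
from statistics import mean
from typing import Dict, List

# Standard nvidia-smi query flags; output looks like "35, 6890, 16384"
QUERY = [
    "nvidia-smi",
    "--query-gpu=utilization.gpu,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_sample(line: str) -> Dict[str, float]:
    """Parse one CSV line into GPU utilization and memory percentages."""
    gpu_pct, used_mb, total_mb = (float(x) for x in line.split(","))
    return {"gpu_pct": gpu_pct, "mem_pct": 100.0 * used_mb / total_mb}


def sample_once() -> Dict[str, float]:
    """Take a single reading from the first GPU (requires nvidia-smi on PATH)."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_sample(out.splitlines()[0])


def summarize(samples: List[Dict[str, float]]) -> Dict[str, int]:
    """Average GPU utilization and peak memory percentage over a run."""
    return {
        "gpu_pct": round(mean(s["gpu_pct"] for s in samples)),
        "mem_pct_peak": round(max(s["mem_pct"] for s in samples)),
    }
```

Calling `sample_once()` from a background thread every 0.5 to 1 second would match the 1-2 Hz sampling rate mentioned above.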
D. Speaker Diarization (Optional)
Backends:
- pyannote: State-of-the-art (requires HF token + VRAM)
- speechbrain: Good performance (no auth required)
Label Normalization:
- Different backends → unified `spkA`, `spkB`, etc.
- Consistent across INDEX.csv and JSON outputs
Fallback Behavior:
- If HF token missing → skip diarization, log warning
- If OOM error → disable diarization, continue transcription
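Label normalization can be sketched as a mapping from backend-specific labels to `spkA`, `spkB`, and so on, in order of first appearance. The ordering rule and the numeric suffix used beyond 26 speakers are assumptions, and the function name is hypothetical.

```python
from string import ascii_uppercase
from typing import Dict, List


def normalize_speaker_labels(labels: List[str]) -> List[str]:
    """Map arbitrary backend labels to spkA, spkB, ... by first appearance."""
    mapping: Dict[str, str] = {}
    normalized = []
    for label in labels:
        if label not in mapping:
            idx = len(mapping)
            # Letters for the first 26 speakers, then a numeric suffix (assumption)
            suffix = ascii_uppercase[idx] if idx < len(ascii_uppercase) else str(idx)
            mapping[label] = f"spk{suffix}"
        normalized.append(mapping[label])
    return normalized
```

For instance, the input `["SPEAKER_01", "SPEAKER_00", "SPEAKER_01"]` becomes `["spkA", "spkB", "spkA"]`.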
3. Output Generation
A. File Organization (Per-Day Structure)
C:\Users\yousu\Desktop\CARDV\Movie_F\
└── 20250903\
    ├── audio\
    │   ├── 20250903133516_059495B.wav
    │   ├── 20250903134120_059496B.wav
    │   └── ... (47 files)
    ├── transcripts\
    │   ├── 20250903133516_059495B.txt
    │   ├── 20250903133516_059495B.json
    │   ├── 20250903133516_059495B.srt
    │   └── ... (47 × 3 = 141 files)
    └── INDEX.csv
B. Format Details
TXT (Plain Text):
[00:00:15] Speaker A: Hey, where are we going?
[00:00:18] Speaker B: Just heading to the mall.
[00:00:22] Speaker A: Okay, sounds good.
JSON (Complete Metadata):
{
  "video_file": "20250903133516_059495B.MP4",
  "audio_duration_sec": 60,
  "language": "en",
  "language_confidence": 0.95,
  "segments": [
    {
      "start": 15.2,
      "end": 17.8,
      "text": "Hey, where are we going?",
      "confidence": 0.89,
      "speaker": "spkA",
      "words": [
        {"word": "Hey", "start": 15.2, "end": 15.4, "confidence": 0.92},
        {"word": "where", "start": 15.5, "end": 15.8, "confidence": 0.88}
      ]
    }
  ]
}
SRT (SubRip Subtitles):
1
00:00:15,200 --> 00:00:17,800
[spkA] Hey, where are we going?

2
00:00:18,000 --> 00:00:20,500
[spkB] Just heading to the mall.
VTT (WebVTT):
WEBVTT

00:00:15.200 --> 00:00:17.800
<v spkA>Hey, where are we going?

00:00:18.000 --> 00:00:20.500
<v spkB>Just heading to the mall.
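The SRT and VTT cues differ mainly in the millisecond separator (comma versus period) and the speaker markup. A minimal sketch of cue formatting that reproduces the samples above; the helper names are hypothetical and are not taken from `scripts/format_writers.py`:

```python
def format_timecode(seconds: float, sep: str) -> str:
    """Format seconds as HH:MM:SS<sep>mmm (sep is ',' for SRT, '.' for VTT)."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}{sep}{ms:03d}"


def srt_cue(index: int, start: float, end: float, speaker: str, text: str) -> str:
    """One numbered SRT cue with a bracketed speaker label."""
    times = f"{format_timecode(start, ',')} --> {format_timecode(end, ',')}"
    return f"{index}\n{times}\n[{speaker}] {text}\n"


def vtt_cue(start: float, end: float, speaker: str, text: str) -> str:
    """One WebVTT cue using a voice span for the speaker."""
    times = f"{format_timecode(start, '.')} --> {format_timecode(end, '.')}"
    return f"{times}\n<v {speaker}>{text}\n"
```

Joining cues with a blank line between them (and prepending `WEBVTT` plus a blank line for VTT) gives the file layouts shown above.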
C. INDEX.csv (Global Search Index)
Composite key: `(video_rel, seg_idx)`
| Column | Description |
|---|---|
| | Movie_F / Movie_R / Park_F / Park_R |
| | YYYYMMDD |
| `video_rel` | Relative path from root |
| | Filename without extension |
| `seg_idx` | 0-based segment index |
| | Segment start milliseconds |
| | Segment end milliseconds |
| | Transcript text (truncated to 512 chars) |
| | Full text length |
| | ISO language code |
| | Language detection confidence |
| | Average token confidence |
| | Normalized speaker label |
| | Files generated (txt/json/srt/vtt) |
| | Basename |
| | Basename |
| | e.g., |
| | CUDA version |
| | Driver version |
| | ISO 8601 timestamp |
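Because rows are keyed by `(video_rel, seg_idx)`, appending to the index can skip keys that are already present, which keeps reruns from duplicating rows. The sketch below illustrates the idea with a reduced column set: `video_rel` and `seg_idx` come from the composite key above, while the `text` and `text_len` column names and the function signature are assumptions.

```python
import csv
from pathlib import Path
from typing import Dict, Iterable

# video_rel and seg_idx come from the composite key; the other names are assumed
FIELDS = ["video_rel", "seg_idx", "text", "text_len"]


def append_to_index(index_path, rows: Iterable[Dict], truncate: int = 512) -> int:
    """Append rows whose (video_rel, seg_idx) key is not yet in the CSV."""
    path = Path(index_path)
    seen = set()
    if path.exists():
        with path.open(newline="", encoding="utf-8") as f:
            seen = {(r["video_rel"], int(r["seg_idx"])) for r in csv.DictReader(f)}

    written = 0
    with path.open("a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if path.stat().st_size == 0:
            writer.writeheader()  # New or empty file: start with the header
        for row in rows:
            key = (row["video_rel"], int(row["seg_idx"]))
            if key in seen:
                continue  # Already indexed on a previous run
            seen.add(key)
            text = row["text"]
            writer.writerow({
                "video_rel": key[0],
                "seg_idx": key[1],
                "text": text[:truncate],   # truncated transcript text
                "text_len": len(text),     # full length before truncation
            })
            written += 1
    return written
```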
D. Results JSON (Single Source of Truth)
{
  "status": "ok",
  "summary": {
    "videos_processed": 47,
    "segments": 1847,
    "hours_audio": 7.8,
    "gpu_detected": true,
    "device_count": 1,
    "devices": [
      {
        "index": 0,
        "name": "NVIDIA GeForce RTX 4080",
        "total_mem_mb": 16384,
        "free_mem_mb": 14200
      }
    ],
    "utilization": {
      "gpu_pct": 35,
      "mem_pct": 42,
      "sampling_hz": 2
    },
    "cuda_version": "12.1",
    "driver_version": "546.01",
    "torch_version": "2.2.0+cu121",
    "errors": 0,
    "failed_files": []
  },
  "artifacts": {
    "index_csv": "C:\\Users\\yousu\\Desktop\\CARDV\\Movie_F\\20250903\\INDEX.csv",
    "output_dir": "C:\\Users\\yousu\\Desktop\\CARDV\\Movie_F\\20250903\\transcripts"
  }
}
4. Quality & Error Handling
A. Resume Safety
- Skip existing transcripts unless the `--force` flag is set
- Idempotent: re-running is safe
- Checkpoint support for long runs
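The skip-unless-forced rule could look roughly like this. It is a sketch under assumptions: the function name is hypothetical, and "existing transcript" is read here as "every requested format already exists for that video".

```python
from pathlib import Path
from typing import Iterable


def needs_processing(video_stem: str, transcripts_dir, formats: Iterable[str], force: bool = False) -> bool:
    """Return True if the video should be (re)transcribed."""
    if force:
        return True  # Equivalent of passing --force
    out_dir = Path(transcripts_dir)
    # Process again if any requested format is missing (assumed interpretation)
    return not all((out_dir / f"{video_stem}.{ext}").exists() for ext in formats)
```

Checking all requested formats, not just one, means a run interrupted between the TXT and SRT writers is picked up again on the next pass.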
B. Error Types & Recovery
Per-Video Failures (`{video_stem}_FAILED.json`):
{
  "video_path": "C:\\...\\video.mp4",
  "error_type": "ffmpeg_err",
  "error_message": "Failed to decode audio stream",
  "ffprobe_metadata": {"duration": null, "codec": "h264"},
  "timestamp": "2025-09-03T14:30:00Z"
}
Error types:
- `ffmpeg_err`: Audio extraction failed
- `decode_err`: Whisper decode failed
- `OOM`: Out of GPU memory
- `corrupted`: Container/stream corrupted
- `no_audio`: No audio stream detected
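Writing a per-video failure record with these error types might look like the following sketch. The field names follow the JSON example above; the function signature is an assumption and differs from the two-argument `log_failed(video_file, "no_audio")` call in the Implementation Guide.

```python
import json
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional

ERROR_TYPES = {"ffmpeg_err", "decode_err", "OOM", "corrupted", "no_audio"}


def log_failed(video_path: str, error_type: str, error_message: str,
               out_dir, ffprobe_metadata: Optional[dict] = None) -> Path:
    """Write {video_stem}_FAILED.json next to the other outputs."""
    if error_type not in ERROR_TYPES:
        raise ValueError(f"Unknown error type: {error_type}")
    record = {
        "video_path": video_path,
        "error_type": error_type,
        "error_message": error_message,
        "ffprobe_metadata": ffprobe_metadata or {},
        # UTC timestamp in ISO 8601 form with a trailing Z
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
    }
    target = Path(out_dir) / f"{Path(video_path).stem}_FAILED.json"
    target.write_text(json.dumps(record, indent=2), encoding="utf-8")
    return target
```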
C. SRT/VTT Validation
- Strictly monotonic timestamps
- No overlapping segments
- Clamp gaps <50ms
- Proper timecode formatting (comma vs period)
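These rules can be enforced in a single pass over the segment boundaries. The sketch below takes one reading of "clamp gaps <50ms": when the next segment starts less than 50 ms after the previous one ends (including overlaps), the previous end is snapped to the next start. That interpretation and the function name are assumptions.

```python
from typing import List, Tuple


def clamp_segments(segments: List[Tuple[float, float]], min_gap: float = 0.05) -> List[Tuple[float, float]]:
    """Return (start, end) pairs with overlaps removed and tiny gaps closed.

    Raises ValueError if start times are not strictly increasing or a
    segment has a non-positive duration.
    """
    fixed: List[Tuple[float, float]] = []
    for start, end in segments:
        if end <= start:
            raise ValueError(f"Non-positive duration: {start} -> {end}")
        if fixed:
            prev_start, prev_end = fixed[-1]
            if start <= prev_start:
                raise ValueError("Start times must be strictly increasing")
            if start - prev_end < min_gap:
                # Overlap or sub-threshold gap: end the previous cue where this one starts
                fixed[-1] = (prev_start, start)
        fixed.append((start, end))
    return fixed
```

Raising on non-monotonic input, instead of silently reordering, surfaces upstream timestamp bugs early.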
Implementation Guide
Phase 1: Input Acquisition
# 1. Parse user request
inputs = parse_user_request(user_message)

# 2. Auto-discover video files
if not inputs['video_folder']:
    inputs['video_folder'] = auto_discover_videos()

# 3. Validate inputs
validate_video_folder(inputs['video_folder'])
check_ffmpeg_available()
check_gpu_available()

# 4. Estimate resource requirements
estimate_processing_time(inputs)
estimate_disk_space(inputs)

# 5. Present configuration summary
show_configuration_summary(inputs)

# 6. Wait for confirmation
if not user_confirms():
    return  # Do not proceed
Phase 2: Audio Extraction
for video_file in video_files:
    # FFprobe preflight check
    metadata = ffprobe(video_file)
    if not has_audio_stream(metadata):
        log_failed(video_file, "no_audio")
        continue

    # Extract audio with retry
    try:
        audio_path = extract_audio_ffmpeg(
            video_file,
            output_dir=audio_output_dir,
            sample_rate=16000,
            channels=1
        )
    except FFmpegError as e:
        # Retry with fallback codec
        audio_path = extract_audio_ffmpeg_retry(video_file)
Phase 3: Transcription
# Load model once (reuse for all files)
model = load_whisper_model(
    model_size="base",
    device="cuda",
    compute_type="float16"
)

results = {}
for audio_file in audio_files:
    # Segment audio
    if segmentation_mode == "fixed":
        chunks = segment_fixed(audio_file, chunk_size=30)
    else:
        chunks = segment_vad(audio_file, vad_model)

    # Transcribe each chunk
    all_segments = []
    for chunk in chunks:
        # faster-whisper returns (segments, info); segments is a lazy generator
        segments, _info = model.transcribe(chunk)
        all_segments.extend(segments)

    # Optional: Diarization
    if diarization_enabled:
        all_segments = apply_diarization(audio_file, all_segments)

    # Keep per-file results for output generation
    results[audio_file] = all_segments
Phase 4: Output Generation
# Generate all formats
for video_file, segments in results.items():
    # TXT
    write_txt(segments, output_dir)

    # JSON
    write_json(segments, metadata, output_dir)

    # SRT
    srt_content = generate_srt(segments)
    validate_srt_monotonic(srt_content)
    write_srt(srt_content, output_dir)

    # VTT (optional)
    write_vtt(segments, output_dir)

    # Update INDEX.csv
    append_to_index(segments, index_csv_path)
Phase 5: Completion Report
# Generate results JSON
results_json = {
    "status": "ok",
    "summary": collect_statistics(),
    "artifacts": list_output_files(),
    "gpu_metrics": get_gpu_metrics()
}

# Save to file
save_results_json(results_json, output_dir)

# Report to user
print(f"✅ Complete! Processed {video_count} videos")
print(f" Transcripts: {output_dir}")
print(f" INDEX: {index_csv_path}")
print(f" GPU Util: {avg_gpu_pct}%")
Reference Materials
In This Skill
- SKILL_MANIFEST.md - Complete technical specification (v2.0)
- references/TECHNICAL_SPECIFICATION.md - Detailed implementation rules
- scripts/batch_transcriber.py - Main batch processing script
- scripts/audio_extractor.py - FFmpeg wrapper with retry logic
- scripts/transcriber.py - Whisper transcription engine
- scripts/diarizer.py - Speaker diarization integration
- scripts/format_writers.py - TXT/JSON/SRT/VTT generators
- scripts/gpu_monitor.py - GPU metrics collection
- scripts/validation.py - Input validation and checks
- assets/config_template.json - Default configuration
- assets/params.json - Tunable parameters
External Documentation
- faster-whisper documentation
- FFmpeg audio processing guide
- pyannote.audio diarization guide
- SRT/VTT subtitle format specifications
Tunable Parameters
{
  "whisper": {
    "model_size": "base",
    "device": "cuda",
    "compute_type": "float16",
    "batch_size": 8,
    "beam_size": 5,
    "language": "en",
    "detect_language": false
  },
  "audio": {
    "sample_rate": 16000,
    "channels": 1,
    "format": "wav",
    "keep_intermediate": false
  },
  "segmentation": {
    "mode": "fixed",
    "chunk_length_sec": 30,
    "vad_min_len_sec": 2,
    "vad_max_len_sec": 60
  },
  "diarization": {
    "enabled": false,
    "backend": "pyannote",
    "min_speakers": 1,
    "max_speakers": 10
  },
  "output": {
    "formats": ["txt", "json", "srt"],
    "text_truncate_csv": 512
  },
  "parallel": {
    "max_workers": 3
  }
}
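User overrides can be layered on top of these defaults with a recursive merge, so that changing one key does not require restating a whole section. A minimal sketch; the function name is hypothetical and the skill's own loading logic may differ.

```python
import copy
from typing import Any, Dict


def merge_params(defaults: Dict[str, Any], overrides: Dict[str, Any]) -> Dict[str, Any]:
    """Return defaults updated with overrides, merging nested sections key by key."""
    merged = copy.deepcopy(defaults)  # Never mutate the shared defaults
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_params(merged[key], value)
        else:
            merged[key] = value
    return merged
```

For example, overriding only `{"whisper": {"model_size": "small"}}` leaves `device`, `compute_type`, and the other sections untouched.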
Common Issues & Solutions
Issue 1: "GPU not detected"
Cause: CUDA not installed or incompatible driver
Solution:
- Check: `python -c "import torch; print(torch.cuda.is_available())"`
- Install/update CUDA toolkit
- Fallback to CPU: `--device cpu`
Issue 2: "FFmpeg command failed"
Cause: FFmpeg not in PATH or unsupported codec
Solution:
- Verify: `ffmpeg -version`
- Install from ffmpeg.org
- Use retry matrix with codec fallback
Issue 3: "Out of memory (OOM)"
Cause: GPU VRAM insufficient for model + batch size
Solution:
- Use smaller model: `tiny` or `small`
- Reduce batch size: `--batch 4`
- Process fewer files in parallel
Issue 4: "Diarization failed"
Cause: HF token missing or network error
Solution:
- Set token: `export HF_TOKEN=hf_...`
- Accept pyannote license on HuggingFace
- Disable diarization: `--no-diarize`
Issue 5: "SRT validation errors"
Cause: Overlapping timestamps or malformed timecodes
Solution:
- Enable timestamp clamping: `--clamp-gaps`
- Check for negative durations
- Validate with subtitle validator tool
Security & Privacy Notes
Data Sensitivity
Dashcam audio may contain:
- Personal conversations
- Addresses and locations
- Phone numbers and names
- Private information
Processing Guidelines
- Local Processing Only - Never upload audio to external services
- Secure Storage - Encrypt transcripts if sharing devices
- Redaction - Use `--redact` flag for PII patterns (phone, email)
- Retention - Delete audio extracts after transcription if not needed
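Pattern-based redaction of phone numbers and email addresses can be approximated with regular expressions. The sketch below is illustrative only: the patterns, placeholder tokens, and function name are assumptions and say nothing about how `--redact` is actually implemented. The phone pattern targets North American style numbers and will miss other formats.

```python
import re

# Simplified patterns (assumptions); real PII detection needs broader coverage
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(
    r"(?<!\d)(?:\+?\d{1,2}[\s.-]?)?(?:\(\d{3}\)|\d{3})[\s.-]?\d{3}[\s.-]?\d{4}(?!\d)"
)


def redact(text: str) -> str:
    """Replace email addresses and phone numbers with placeholder tokens."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return PHONE_RE.sub("[PHONE]", text)
```

Since speech recognition often spells numbers out as words, regex redaction of transcripts should be treated as a first pass and followed by manual review.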
Investigation Use
- Designed for legitimate personal data analysis
- NOT an anti-forensics tool
- All conclusions require independent corroboration
Skill Invocation
This skill is invoked when the model detects:
- User mentions "transcribe audio", "dashcam transcription", or "extract speech"
- User requests processing of video/MP4 files for audio content
- User provides paths to video folders
- User asks for subtitles or timestamped transcripts
Success Criteria
A successful audio transcription must:
✅ Obtain all required inputs from user (video folder, output preferences)
✅ Validate all inputs before processing (files exist, FFmpeg available, GPU detected)
✅ Present configuration summary and get confirmation
✅ Extract audio successfully (or log failures)
✅ Transcribe with GPU acceleration (or CPU fallback)
✅ Generate all requested formats (TXT, JSON, SRT, VTT)
✅ Create INDEX.csv with searchable metadata
✅ Include GPU metrics in results JSON
✅ Report output locations to user
Key Principle: Never guess critical inputs. Always validate, confirm, and provide clear feedback.
Version History
- v2.0 (2025-10-26) - Production-ready skill with GPU-first architecture
- v1.5 (2025-10-25) - Added diarization support and retry matrix
- v1.0 (2025-10-20) - Initial release with basic transcription
Last Updated: 2025-10-26
Status: Production Ready
Maintained By: Audio Transcription Pipeline Project