Skillsbench Multimodal Fusion for Speaker Diarization
Combine visual features (face detection, lip movement analysis) with audio features to improve speaker diarization accuracy in video files. Use OpenCV for face detection and lip movement tracking, then fuse visual cues with audio-based speaker embeddings. Essential when processing video files with multiple visible speakers or when audio-only diarization needs visual validation.
git clone https://github.com/benchflow-ai/skillsbench
T=$(mktemp -d) && git clone --depth=1 https://github.com/benchflow-ai/skillsbench "$T" && mkdir -p ~/.claude/skills && cp -r "$T/tasks/speaker-diarization-subtitles/environment/skills/multimodal-fusion" ~/.claude/skills/benchflow-ai-skillsbench-multimodal-fusion-for-speaker-diarization && rm -rf "$T"
tasks/speaker-diarization-subtitles/environment/skills/multimodal-fusion/SKILL.md

Multimodal Fusion for Speaker Diarization
Overview
When working with video files, you can significantly improve speaker diarization by combining audio features with visual features like face detection and lip movement analysis.
When to Use
- Processing video files (not just audio)
- Multiple speakers visible on screen
- Need to disambiguate speakers with similar voices
- Improve accuracy by leveraging visual cues
Visual Feature Extraction
Face Detection
import cv2
import numpy as np

# Initialize face detector
face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + 'haarcascade_frontalface_default.xml'
)

# Process video frames
cap = cv2.VideoCapture(video_path)
fps = cap.get(cv2.CAP_PROP_FPS)

faces_by_time = {}
frame_count = 0
frame_skip = max(1, int(fps / 2))  # Process roughly two frames per second

while cap.isOpened():
    ret, frame = cap.read()
    if not ret:
        break
    if frame_count % frame_skip == 0:
        timestamp = frame_count / fps
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        faces = face_cascade.detectMultiScale(gray, 1.1, 4)
        faces_by_time[timestamp] = len(faces)
    frame_count += 1

cap.release()
Lip Movement Detection
lip_movement_by_time = {}
prev_mouth_roi = None

# Re-open the video for a second pass (the face-detection pass released cap)
cap = cv2.VideoCapture(video_path)
frame_count = 0

while cap.isOpened():
    ret, frame = cap.read()
    if not ret:
        break
    if frame_count % frame_skip == 0:
        timestamp = frame_count / fps
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        faces = face_cascade.detectMultiScale(gray, 1.1, 4)

        lip_moving = False
        for (x, y, w, h) in faces:
            # Extract mouth region (lower 40% of face)
            mouth_roi_y = y + int(h * 0.6)
            mouth_roi_h = int(h * 0.4)
            mouth_region = gray[mouth_roi_y:mouth_roi_y + mouth_roi_h, x:x + w]
            if mouth_region.size > 0:
                if prev_mouth_roi is not None and prev_mouth_roi.shape == mouth_region.shape:
                    # Calculate movement score
                    diff = cv2.absdiff(mouth_region, prev_mouth_roi)
                    movement_score = np.mean(diff)
                    if movement_score > 10:  # Threshold for movement
                        lip_moving = True
                prev_mouth_roi = mouth_region.copy()
                break

        lip_movement_by_time[timestamp] = lip_moving
    frame_count += 1

cap.release()
Temporal Alignment
Visual features need to be aligned with audio timestamps:
def get_faces_at_time(timestamp, tolerance=0.5):
    """Get the number of faces detected at a given timestamp"""
    if not faces_by_time:
        return 0
    closest = min(faces_by_time.keys(), key=lambda t: abs(t - timestamp), default=None)
    if closest is not None and abs(closest - timestamp) < tolerance:
        return faces_by_time[closest]
    return 0


def get_lip_movement_at_time(timestamp, tolerance=0.5):
    """Check if lips are moving at a given timestamp"""
    if not lip_movement_by_time:
        return False
    closest = min(lip_movement_by_time.keys(), key=lambda t: abs(t - timestamp), default=None)
    if closest is not None and abs(closest - timestamp) < tolerance:
        return lip_movement_by_time[closest]
    return False
Fusion Strategies
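The fusion snippets below operate on diarization_turns, a list of per-turn dictionaries produced by the audio-based diarization step. The exact keys depend on your diarization backend; a minimal sketch of the shape assumed here (start and end in seconds, speaker as the audio-derived label) is:

# Assumed structure of diarization_turns; adapt key names to your diarizer's output
diarization_turns = [
    {'start': 0.0, 'end': 3.2, 'speaker': 'SPEAKER_00'},
    {'start': 3.4, 'end': 7.1, 'speaker': 'SPEAKER_01'},
]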
1. Visual-Aided Speaker Assignment
Use visual features to help assign speakers to audio segments:
# For each diarization turn
for turn in diarization_turns:
    turn_center = (turn['start'] + turn['end']) / 2
    faces_at_turn = get_faces_at_time(turn_center)
    lip_moving = get_lip_movement_at_time(turn_center)

    # Use visual cues to refine speaker assignment
    if lip_moving and faces_at_turn > 0:
        # High confidence: speaker is visible and speaking
        turn['confidence'] = 'high'
    elif faces_at_turn > 0:
        # Medium confidence: speaker visible but no clear lip movement
        turn['confidence'] = 'medium'
    else:
        # Low confidence: no visual confirmation
        turn['confidence'] = 'low'
2. Face Count Validation
Use face count to validate speaker count:
# Count unique faces over the video duration
unique_faces = set()
for timestamp in faces_by_time.keys():
    if faces_by_time[timestamp] > 0:
        # In a real implementation, you'd track individual faces
        unique_faces.add(timestamp)

# Validate the predicted speaker count
# (predicted_speaker_count comes from the audio-based diarization step)
if len(unique_faces) > 0:
    visual_speaker_count = max(faces_by_time.values())
    if abs(visual_speaker_count - predicted_speaker_count) > 1:
        # Warning: mismatch between audio and visual speaker counts
        print(f"Warning: Audio predicts {predicted_speaker_count} speakers, "
              f"but video shows up to {visual_speaker_count} faces")
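The snippet above only counts detections per frame. The individual-face tracking it alludes to can be rough-sketched by linking detections across frames via bounding-box overlap. This assumes you store the detected boxes per timestamp rather than just the count; iou and count_face_tracks below are illustrative helpers, not part of the skill, and a real tracker would also handle track termination and identity switches:

def iou(a, b):
    """Intersection-over-union of two (x, y, w, h) boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    x1, y1 = max(ax, bx), max(ay, by)
    x2, y2 = min(ax + aw, bx + bw), min(ay + ah, by + bh)
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def count_face_tracks(boxes_by_time, iou_threshold=0.3):
    """boxes_by_time: {timestamp: [(x, y, w, h), ...]}; returns a rough track count."""
    tracks = []  # each track remembers its most recent bounding box
    for timestamp in sorted(boxes_by_time):
        for box in boxes_by_time[timestamp]:
            match = next((t for t in tracks if iou(t['box'], box) > iou_threshold), None)
            if match:
                match['box'] = box  # extend an existing track
            else:
                tracks.append({'box': box})  # start a new track
    return len(tracks)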
3. Lip Movement Filtering
Flag segments where no one appears to be speaking. Rather than dropping these turns outright, keep them but mark them as low confidence:
# Filter diarization turns based on lip movement
filtered_turns = []
for turn in diarization_turns:
    turn_start = turn['start']
    turn_end = turn['end']

    # Check if lips are moving during this turn
    has_lip_movement = any(
        get_lip_movement_at_time(t)
        for t in np.arange(turn_start, turn_end, 0.1)
    )

    if has_lip_movement:
        filtered_turns.append(turn)
    else:
        # Low confidence: no visual confirmation of speech
        turn['confidence'] = 'low'
        filtered_turns.append(turn)
Best Practices
- Process frames efficiently: Don't process every frame; use frame_skip
- Handle missing visual data: Always have a fallback to audio-only (see the sketch after this list)
- Temporal alignment: Ensure visual and audio timestamps are synchronized
- Confidence scoring: Use visual features to assign confidence scores
- Error handling: Video processing can fail; handle exceptions gracefully
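For example, the audio-only fallback and graceful error handling above can be combined in a small wrapper. This is a minimal sketch; extract_visual_features is the assumed helper used in the integration example below:

def safe_visual_features(video_path):
    """Return (faces_by_time, lip_movement_by_time), or empty dicts on failure."""
    try:
        return extract_visual_features(video_path)
    except Exception as e:  # unreadable file, codec problems, OpenCV errors, ...
        print(f"Visual feature extraction failed ({e}); falling back to audio only")
        return {}, {}

Because get_faces_at_time and get_lip_movement_at_time already return 0 and False when their dictionaries are empty, the rest of the pipeline degrades gracefully to audio-only behavior.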
Integration Example
# Complete pipeline
def multimodal_diarization(video_path, audio_path):
    # 1. Extract visual features
    faces_by_time, lip_movement_by_time = extract_visual_features(video_path)

    # 2. Run audio-based diarization
    audio_turns = run_audio_diarization(audio_path)

    # 3. Fuse visual and audio features
    for turn in audio_turns:
        turn_center = (turn['start'] + turn['end']) / 2
        turn['faces_detected'] = get_faces_at_time(turn_center)
        turn['lip_movement'] = get_lip_movement_at_time(turn_center)
        turn['on_screen'] = turn['faces_detected'] > 0

    return audio_turns
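A hypothetical call (the file names are placeholders) showing the visual metadata attached to each turn:

turns = multimodal_diarization("interview.mp4", "interview.wav")
for t in turns:
    print(f"{t['start']:.1f}-{t['end']:.1f}s  "
          f"on_screen={t['on_screen']}  lip_movement={t['lip_movement']}")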
Limitations
- Visual features require video files (not just audio)
- Face detection may fail in poor lighting or angles
- Lip movement detection is approximate
- Processing video is computationally expensive
When to Skip Visual Features
- Audio-only files (see the pre-check sketch at the end of this section)
- Poor video quality
- No faces visible
- Processing time constraints
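For the audio-only case, a cheap pre-check can decide whether visual processing is worth attempting at all. A minimal sketch, with has_usable_video as a hypothetical helper:

import cv2

def has_usable_video(path, probe_frames=5):
    """Return True if the file yields at least one readable video frame."""
    cap = cv2.VideoCapture(path)
    if not cap.isOpened():
        return False
    readable = any(cap.read()[0] for _ in range(probe_frames))
    cap.release()
    return readable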