Skillsbench Multimodal Fusion for Speaker Diarization
Combine visual features (face detection, lip movement analysis) with audio features to improve speaker diarization accuracy in video files. Use OpenCV for face detection and lip movement tracking, then fuse visual cues with audio-based speaker embeddings. Essential when processing video files with multiple visible speakers or when audio-only diarization needs visual validation.
git clone https://github.com/benchflow-ai/skillsbench
T=$(mktemp -d) && git clone --depth=1 https://github.com/benchflow-ai/skillsbench "$T" && mkdir -p ~/.claude/skills && cp -r "$T/tasks/speaker-diarization-subtitles/environment/skills/multimodal-fusion" ~/.claude/skills/benchflow-ai-skillsbench-multimodal-fusion-for-speaker-diarization && rm -rf "$T"
tasks/speaker-diarization-subtitles/environment/skills/multimodal-fusion/SKILL.md

Multimodal Fusion for Speaker Diarization
Overview
When working with video files, you can significantly improve speaker diarization by combining audio features with visual features like face detection and lip movement analysis.
When to Use
- Processing video files (not just audio)
- Multiple speakers visible on screen
- Need to disambiguate speakers with similar voices
- Improve accuracy by leveraging visual cues
Visual Feature Extraction
Face Detection
import cv2
import numpy as np

# Initialize face detector
face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + 'haarcascade_frontalface_default.xml'
)

# Process video frames
cap = cv2.VideoCapture(video_path)
fps = cap.get(cv2.CAP_PROP_FPS)

faces_by_time = {}
frame_count = 0
frame_skip = max(1, int(fps / 2))  # Process roughly two frames per second

while cap.isOpened():
    ret, frame = cap.read()
    if not ret:
        break
    if frame_count % frame_skip == 0:
        timestamp = frame_count / fps
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        faces = face_cascade.detectMultiScale(gray, 1.1, 4)
        faces_by_time[timestamp] = len(faces)
    frame_count += 1

cap.release()
Lip Movement Detection
lip_movement_by_time = {}
prev_mouth_roi = None

# Re-open the video for a second pass (the face-detection pass released cap)
cap = cv2.VideoCapture(video_path)
frame_count = 0

while cap.isOpened():
    ret, frame = cap.read()
    if not ret:
        break
    if frame_count % frame_skip == 0:
        timestamp = frame_count / fps
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        faces = face_cascade.detectMultiScale(gray, 1.1, 4)

        lip_moving = False
        for (x, y, w, h) in faces:
            # Extract mouth region (lower 40% of face)
            mouth_roi_y = y + int(h * 0.6)
            mouth_roi_h = int(h * 0.4)
            mouth_region = gray[mouth_roi_y:mouth_roi_y + mouth_roi_h, x:x + w]
            if mouth_region.size > 0:
                if prev_mouth_roi is not None and prev_mouth_roi.shape == mouth_region.shape:
                    # Calculate movement score
                    diff = cv2.absdiff(mouth_region, prev_mouth_roi)
                    movement_score = np.mean(diff)
                    if movement_score > 10:  # Threshold for movement
                        lip_moving = True
                prev_mouth_roi = mouth_region.copy()
                break

        lip_movement_by_time[timestamp] = lip_moving
    frame_count += 1

cap.release()
Temporal Alignment
Visual features need to be aligned with audio timestamps:
def get_faces_at_time(timestamp, tolerance=0.5):
    """Get the number of faces detected at a given timestamp"""
    if not faces_by_time:
        return 0
    closest = min(faces_by_time.keys(), key=lambda t: abs(t - timestamp), default=None)
    if closest is not None and abs(closest - timestamp) < tolerance:
        return faces_by_time[closest]
    return 0


def get_lip_movement_at_time(timestamp, tolerance=0.5):
    """Check if lips are moving at a given timestamp"""
    if not lip_movement_by_time:
        return False
    closest = min(lip_movement_by_time.keys(), key=lambda t: abs(t - timestamp), default=None)
    if closest is not None and abs(closest - timestamp) < tolerance:
        return lip_movement_by_time[closest]
    return False
Fusion Strategies
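The fusion snippets below operate on diarization_turns, a list of per-turn dictionaries produced by the audio-based diarization step. The exact keys depend on your diarization backend; a minimal sketch of the shape assumed here (start and end in seconds, speaker as the audio-derived label) is:

# Assumed structure of diarization_turns; adapt key names to your diarizer's output
diarization_turns = [
    {'start': 0.0, 'end': 3.2, 'speaker': 'SPEAKER_00'},
    {'start': 3.4, 'end': 7.1, 'speaker': 'SPEAKER_01'},
]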
1. Visual-Aided Speaker Assignment
Use visual features to help assign speakers to audio segments:
# For each diarization turn
for turn in diarization_turns:
    turn_center = (turn['start'] + turn['end']) / 2
    faces_at_turn = get_faces_at_time(turn_center)
    lip_moving = get_lip_movement_at_time(turn_center)

    # Use visual cues to refine speaker assignment
    if lip_moving and faces_at_turn > 0:
        # High confidence: speaker is visible and speaking
        turn['confidence'] = 'high'
    elif faces_at_turn > 0:
        # Medium confidence: speaker visible but no clear lip movement
        turn['confidence'] = 'medium'
    else:
        # Low confidence: no visual confirmation
        turn['confidence'] = 'low'
2. Face Count Validation
Use face count to validate speaker count:
# Count unique faces over the video duration
unique_faces = set()
for timestamp in faces_by_time.keys():
    if faces_by_time[timestamp] > 0:
        # In a real implementation, you'd track individual faces
        unique_faces.add(timestamp)

# Validate the predicted speaker count
# (predicted_speaker_count comes from the audio-based diarization step)
if len(unique_faces) > 0:
    visual_speaker_count = max(faces_by_time.values())
    if abs(visual_speaker_count - predicted_speaker_count) > 1:
        # Warning: mismatch between audio and visual speaker counts
        print(f"Warning: Audio predicts {predicted_speaker_count} speakers, "
              f"but video shows up to {visual_speaker_count} faces")
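The snippet above only counts detections per frame. The individual-face tracking it alludes to can be rough-sketched by linking detections across frames via bounding-box overlap. This assumes you store the detected boxes per timestamp rather than just the count; iou and count_face_tracks below are illustrative helpers, not part of the skill, and a real tracker would also handle track termination and identity switches:

def iou(a, b):
    """Intersection-over-union of two (x, y, w, h) boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    x1, y1 = max(ax, bx), max(ay, by)
    x2, y2 = min(ax + aw, bx + bw), min(ay + ah, by + bh)
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def count_face_tracks(boxes_by_time, iou_threshold=0.3):
    """boxes_by_time: {timestamp: [(x, y, w, h), ...]}; returns a rough track count."""
    tracks = []  # each track remembers its most recent bounding box
    for timestamp in sorted(boxes_by_time):
        for box in boxes_by_time[timestamp]:
            match = next((t for t in tracks if iou(t['box'], box) > iou_threshold), None)
            if match:
                match['box'] = box  # extend an existing track
            else:
                tracks.append({'box': box})  # start a new track
    return len(tracks)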
3. Lip Movement Filtering
Flag segments where no one appears to be speaking. Rather than dropping these turns outright, keep them but mark them as low confidence:
# Filter diarization turns based on lip movement
filtered_turns = []
for turn in diarization_turns:
    turn_start = turn['start']
    turn_end = turn['end']

    # Check if lips are moving during this turn
    has_lip_movement = any(
        get_lip_movement_at_time(t)
        for t in np.arange(turn_start, turn_end, 0.1)
    )

    if has_lip_movement:
        filtered_turns.append(turn)
    else:
        # Low confidence: no visual confirmation of speech
        turn['confidence'] = 'low'
        filtered_turns.append(turn)
Best Practices
- Process frames efficiently: Don't process every frame; use frame_skip
- Handle missing visual data: Always have a fallback to audio-only (see the sketch after this list)
- Temporal alignment: Ensure visual and audio timestamps are synchronized
- Confidence scoring: Use visual features to assign confidence scores
- Error handling: Video processing can fail; handle exceptions gracefully
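For example, the audio-only fallback and graceful error handling above can be combined in a small wrapper. This is a minimal sketch; extract_visual_features is the assumed helper used in the integration example below:

def safe_visual_features(video_path):
    """Return (faces_by_time, lip_movement_by_time), or empty dicts on failure."""
    try:
        return extract_visual_features(video_path)
    except Exception as e:  # unreadable file, codec problems, OpenCV errors, ...
        print(f"Visual feature extraction failed ({e}); falling back to audio only")
        return {}, {}

Because get_faces_at_time and get_lip_movement_at_time already return 0 and False when their dictionaries are empty, the rest of the pipeline degrades gracefully to audio-only behavior.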
Integration Example
# Complete pipeline
def multimodal_diarization(video_path, audio_path):
    # 1. Extract visual features
    faces_by_time, lip_movement_by_time = extract_visual_features(video_path)

    # 2. Run audio-based diarization
    audio_turns = run_audio_diarization(audio_path)

    # 3. Fuse visual and audio features
    for turn in audio_turns:
        turn_center = (turn['start'] + turn['end']) / 2
        turn['faces_detected'] = get_faces_at_time(turn_center)
        turn['lip_movement'] = get_lip_movement_at_time(turn_center)
        turn['on_screen'] = turn['faces_detected'] > 0

    return audio_turns
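A hypothetical call (the file names are placeholders) showing the visual metadata attached to each turn:

turns = multimodal_diarization("interview.mp4", "interview.wav")
for t in turns:
    print(f"{t['start']:.1f}-{t['end']:.1f}s  "
          f"on_screen={t['on_screen']}  lip_movement={t['lip_movement']}")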
Limitations
- Visual features require video files (not just audio)
- Face detection may fail in poor lighting or angles
- Lip movement detection is approximate
- Processing video is computationally expensive
When to Skip Visual Features
- Audio-only files (see the pre-check sketch at the end of this section)
- Poor video quality
- No faces visible
- Processing time constraints
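For the audio-only case, a cheap pre-check can decide whether visual processing is worth attempting at all. A minimal sketch, with has_usable_video as a hypothetical helper:

import cv2

def has_usable_video(path, probe_frames=5):
    """Return True if the file yields at least one readable video frame."""
    cap = cv2.VideoCapture(path)
    if not cap.isOpened():
        return False
    readable = any(cap.read()[0] for _ in range(probe_frames))
    cap.release()
    return readable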