Skillsbench Speaker Clustering Methods
Choose and implement clustering algorithms for grouping speaker embeddings after VAD and embedding extraction. Compare Hierarchical clustering (auto-tunes speaker count), KMeans (fast, requires known count), and Agglomerative clustering (fixed clusters). Use Hierarchical clustering when speaker count is unknown, KMeans when count is known, and always normalize embeddings before clustering.
git clone https://github.com/benchflow-ai/skillsbench
T=$(mktemp -d) && git clone --depth=1 https://github.com/benchflow-ai/skillsbench "$T" && mkdir -p ~/.claude/skills && cp -r "$T/tasks/speaker-diarization-subtitles/environment/skills/speaker-clustering" ~/.claude/skills/benchflow-ai-skillsbench-speaker-clustering-methods && rm -rf "$T"
tasks/speaker-diarization-subtitles/environment/skills/speaker-clustering/SKILL.mdSpeaker Clustering Methods
Overview
After extracting speaker embeddings from audio segments, you need to cluster them to identify unique speakers. Different clustering methods have different strengths.
When to Use
- After extracting speaker embeddings from VAD segments
- Need to group similar speakers together
- Determining number of speakers automatically or manually
Available Clustering Methods
1. Hierarchical Clustering (Recommended for Auto-tuning)
Best for: Automatically determining number of speakers, flexible threshold tuning
from scipy.cluster.hierarchy import linkage, fcluster from scipy.spatial.distance import pdist import numpy as np # Prepare embeddings embeddings_array = np.array(embeddings_list) n_segments = len(embeddings_array) # Compute distance matrix distances = pdist(embeddings_array, metric='cosine') # Create linkage matrix linkage_matrix = linkage(distances, method='average') # Auto-tune threshold to get reasonable speaker count min_speakers = 2 max_speakers = max(2, min(10, n_segments // 2)) threshold = 0.7 labels = fcluster(linkage_matrix, t=threshold, criterion='distance') n_speakers = len(set(labels)) # Adjust threshold if needed if n_speakers > max_speakers: for t in [0.8, 0.9, 1.0, 1.1, 1.2]: labels = fcluster(linkage_matrix, t=t, criterion='distance') n_speakers = len(set(labels)) if n_speakers <= max_speakers: threshold = t break elif n_speakers < min_speakers: for t in [0.6, 0.5, 0.4]: labels = fcluster(linkage_matrix, t=t, criterion='distance') n_speakers = len(set(labels)) if n_speakers >= min_speakers: threshold = t break print(f"Selected: t={threshold}, {n_speakers} speakers")
Advantages:
- Automatically determines speaker count
- Flexible threshold tuning
- Good for unknown number of speakers
- Can visualize dendrogram
2. KMeans Clustering
Best for: Known number of speakers, fast clustering
from sklearn.cluster import KMeans from sklearn.metrics import silhouette_score import numpy as np # Normalize embeddings embeddings_array = np.array(embeddings_list) norms = np.linalg.norm(embeddings_array, axis=1, keepdims=True) embeddings_normalized = embeddings_array / np.clip(norms, 1e-9, None) # Try different k values and choose best best_k = 2 best_score = -1 best_labels = None for k in range(2, min(7, len(embeddings_normalized))): kmeans = KMeans(n_clusters=k, random_state=0, n_init=10) labels = kmeans.fit_predict(embeddings_normalized) if len(set(labels)) < 2: continue score = silhouette_score(embeddings_normalized, labels, metric='cosine') if score > best_score: best_score = score best_k = k best_labels = labels print(f"Best k={best_k}, silhouette score={best_score:.3f}")
Advantages:
- Fast and efficient
- Works well with known speaker count
- Simple to implement
Disadvantages:
- Requires specifying number of clusters
- May get stuck in local minima
3. Agglomerative Clustering
Best for: Similar to hierarchical but with fixed number of clusters
from sklearn.cluster import AgglomerativeClustering from sklearn.metrics import silhouette_score import numpy as np # Normalize embeddings embeddings_array = np.array(embeddings_list) norms = np.linalg.norm(embeddings_array, axis=1, keepdims=True) embeddings_normalized = embeddings_array / np.clip(norms, 1e-9, None) # Try different numbers of clusters best_n = 2 best_score = -1 best_labels = None for n_clusters in range(2, min(6, len(embeddings_normalized))): clustering = AgglomerativeClustering(n_clusters=n_clusters) labels = clustering.fit_predict(embeddings_normalized) if len(set(labels)) < 2: continue score = silhouette_score(embeddings_normalized, labels, metric='cosine') if score > best_score: best_score = score best_n = n_clusters best_labels = labels print(f"Best n_clusters={best_n}, silhouette score={best_score:.3f}")
Advantages:
- Deterministic results
- Good for fixed number of clusters
- Can use different linkage methods
Comparison Table
| Method | Auto Speaker Count | Speed | Best For |
|---|---|---|---|
| Hierarchical | ✅ Yes | Medium | Unknown speaker count |
| KMeans | ❌ No | Fast | Known speaker count |
| Agglomerative | ❌ No | Medium | Fixed clusters needed |
Embedding Normalization
Always normalize embeddings before clustering:
# L2 normalization embeddings_normalized = embeddings_array / np.clip( np.linalg.norm(embeddings_array, axis=1, keepdims=True), 1e-9, None )
Distance Metrics
- Cosine: Best for speaker embeddings (default)
- Euclidean: Can work but less ideal for normalized embeddings
Choosing Number of Speakers
Method 1: Silhouette Score (for KMeans/Agglomerative)
from sklearn.metrics import silhouette_score best_k = 2 best_score = -1 for k in range(2, min(7, len(embeddings))): labels = clusterer.fit_predict(embeddings) score = silhouette_score(embeddings, labels, metric='cosine') if score > best_score: best_score = score best_k = k
Method 2: Threshold Tuning (for Hierarchical)
# Start with reasonable threshold threshold = 0.7 labels = fcluster(linkage_matrix, t=threshold, criterion='distance') n_speakers = len(set(labels)) # Adjust based on constraints if n_speakers > max_speakers: # Increase threshold to merge more threshold = 0.9 elif n_speakers < min_speakers: # Decrease threshold to split more threshold = 0.5
Post-Clustering: Merging Segments
After clustering, merge adjacent segments with same speaker:
def merge_speaker_segments(labeled_segments, gap_threshold=0.15): """ labeled_segments: list of (start, end, speaker_label) gap_threshold: merge if gap <= this (seconds) """ labeled_segments.sort(key=lambda x: (x[0], x[1])) merged = [] cur_s, cur_e, cur_spk = labeled_segments[0] for s, e, spk in labeled_segments[1:]: if spk == cur_spk and s <= cur_e + gap_threshold: cur_e = max(cur_e, e) else: merged.append((cur_s, cur_e, cur_spk)) cur_s, cur_e, cur_spk = s, e, spk merged.append((cur_s, cur_e, cur_spk)) return merged
Common Issues
- Too many speakers: Increase threshold (hierarchical) or decrease k (KMeans)
- Too few speakers: Decrease threshold (hierarchical) or increase k (KMeans)
- Poor clustering: Check embedding quality, try different normalization
- Over-segmentation: Increase gap_threshold when merging segments
Best Practices
- Normalize embeddings before clustering
- Use cosine distance for speaker embeddings
- Try multiple methods and compare results
- Validate speaker count with visual features if available
- Merge adjacent segments after clustering
- After diarization, use high-quality ASR: Use Whisper
orsmall
model for transcription (see automatic-speech-recognition skill)large-v3