Hacktricks-skills unsupervised-learning-security
Apply unsupervised machine learning algorithms to security data for anomaly detection, clustering, and dimensionality reduction. Use this skill whenever the user needs to analyze unlabeled security data, detect unknown threats, cluster network events, reduce feature dimensions, or identify outliers in logs, traffic, or behavioral data. Trigger for tasks involving K-Means, DBSCAN, HDBSCAN, Isolation Forest, GMM, PCA, t-SNE, or any unsupervised pattern discovery in cybersecurity contexts.
git clone https://github.com/abelrguezr/hacktricks-skills
skills/AI/AI-Unsupervised-Learning-Algorithms/SKILL.MD

Unsupervised Learning for Security Analysis
This skill helps you apply unsupervised machine learning algorithms to security data. Unlike supervised learning, these methods work with unlabeled data to discover hidden patterns, detect anomalies, and cluster similar events.
When to Use This Skill
Use this skill when you need to:
- Detect unknown threats or anomalies in network traffic, logs, or user behavior
- Cluster security events to find groups of similar activity
- Reduce high-dimensional security data for visualization or analysis
- Identify outliers that might indicate attacks or misconfigurations
- Explore unlabeled datasets to understand their structure
- Build threat-hunting pipelines without labeled attack data
Algorithm Selection Guide
Choose the right algorithm based on your use case:
| Use Case | Recommended Algorithm | Why |
|---|---|---|
| Anomaly detection | Isolation Forest, DBSCAN, HDBSCAN | Designed to flag outliers without labels |
| Clustering with known K | K-Means, GMM | Fast, works well when you know cluster count |
| Clustering with unknown K | DBSCAN, HDBSCAN, Hierarchical | Automatically determines cluster count |
| Varying density clusters | HDBSCAN, DBSCAN | Handles clusters of different densities |
| Soft clustering | GMM | Provides probability of cluster membership |
| Dimensionality reduction | PCA (linear), t-SNE (nonlinear) | Compress features for visualization |
| Large datasets | Isolation Forest, K-Means | Efficient O(n log n) or O(n) complexity |
| Visual exploration | t-SNE, PCA | Create 2D/3D plots for human analysis |
Core Algorithms
K-Means Clustering
Best for: Quick clustering when you know the number of groups, spherical clusters of similar size.
How it works:
- Initialize K centroids (randomly or via k-means++)
- Assign each point to nearest centroid
- Recalculate centroids as mean of assigned points
- Repeat until convergence
Security use cases:
- Group network traffic into normal vs. attack patterns
- Cluster log entries by similarity
- Classify malware families by behavior profiles
Key parameters:
- `n_clusters`: Number of clusters (use the Elbow Method or Silhouette Score to determine)
- `random_state`: For reproducibility
- `n_init`: Number of initializations (higher = more stable)
Limitations: Assumes spherical, equally-sized clusters. Sensitive to initialization. Requires normalization.
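A quick way to choose `n_clusters` is a silhouette sweep. A minimal sketch, with synthetic blobs standing in for your scaled security feature matrix:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for scaled security features (4 well-separated groups)
X, _ = make_blobs(n_samples=300, centers=[[0, 0], [8, 0], [0, 8], [8, 8]],
                  cluster_std=0.8, random_state=42)
X = StandardScaler().fit_transform(X)

# Sweep candidate K values and keep the silhouette score for each
scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(f"Best K by silhouette: {best_k}")
```

Pick the K with the highest silhouette; on real security data also sanity-check the Elbow curve, since silhouette can be flat when clusters overlap.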
DBSCAN (Density-Based Clustering)
Best for: Finding arbitrarily shaped clusters, detecting outliers as noise, unknown cluster count.
How it works:
- Groups points within `eps` distance that have at least `min_samples` neighbors
- Points not in any cluster are labeled as noise (-1)
- No need to specify number of clusters
Security use cases:
- Detect port scans or DoS traffic as sparse regions
- Flag zero-day malware that doesn't fit known families
- Identify anomalous user behavior patterns
Key parameters:
- `eps`: Maximum distance between points in the same cluster
- `min_samples`: Minimum points to form a dense region
Limitations: Struggles with varying densities. A single `eps` may not work for all clusters. Can be slow on large datasets.
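Choosing `eps` is usually done with the k-distance heuristic: sort each point's distance to its `min_samples`-th neighbor and look for the knee. A minimal sketch on synthetic blobs (a fixed percentile stands in for eyeballing the knee):

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for scaled flow features
X, _ = make_blobs(n_samples=300, centers=[[0, 0], [6, 0], [0, 6]],
                  cluster_std=0.5, random_state=0)
X = StandardScaler().fit_transform(X)

min_samples = 5
# k-distance curve: distance from each point to its min_samples-th
# nearest neighbor (self included, since we query the fit data)
nn = NearestNeighbors(n_neighbors=min_samples).fit(X)
k_dist = np.sort(nn.kneighbors(X)[0][:, -1])

# Crude knee stand-in: a high percentile of the sorted curve
eps = float(np.percentile(k_dist, 90))

db = DBSCAN(eps=eps, min_samples=min_samples).fit(X)
n_clusters = len(set(db.labels_) - {-1})
print(f"eps~{eps:.2f} -> {n_clusters} clusters, {np.sum(db.labels_ == -1)} noise points")
```

In practice, plot `k_dist` and pick the elbow by hand; the percentile here is just a reproducible placeholder.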
HDBSCAN (Hierarchical DBSCAN)
Best for: Clusters of varying density, modern threat-hunting pipelines, outlier scoring.
How it works:
- Builds hierarchy of density-connected components
- Condenses hierarchy to extract clusters
- Provides outlier scores for each point
Security use cases:
- Cluster HTTP beaconing traffic (legitimate vs. C2)
- Threat-hunting playbooks in XDR suites
- Detect low-density attack patterns hidden in noise
Key parameters:
- `min_cluster_size`: Minimum points per cluster (sensible defaults available)
- `prediction_data=True`: Enable outlier scoring
Advantages over DBSCAN: Handles varying densities, only one main hyperparameter, provides outlier probabilities.
Isolation Forest
Best for: Fast anomaly detection on large, high-dimensional datasets.
How it works:
- Builds random binary trees that partition data
- Anomalies are isolated faster (shorter path length)
- Anomaly score based on average path length across trees
Security use cases:
- Intrusion detection on network traffic logs
- Fraud detection in financial transactions
- Account takeover detection from login patterns
- System metric monitoring for unusual behavior
Key parameters:
- `n_estimators`: Number of trees (100-200 typical)
- `contamination`: Expected fraction of anomalies (e.g., 0.05 for 5%)
Advantages: O(n log n) complexity, works on high-dimensional data, no distribution assumptions.
Gaussian Mixture Models (GMM)
Best for: Soft clustering, probabilistic anomaly detection, ellipsoidal clusters.
How it works:
- Assumes data comes from mixture of K Gaussian distributions
- Uses Expectation-Maximization (EM) to fit parameters
- Each point gets probability of belonging to each cluster
Security use cases:
- Model normal traffic distribution, flag low-likelihood points
- Cluster user behavior with uncertainty quantification
- Phishing detection with probabilistic scoring
Key parameters:
- `n_components`: Number of Gaussian components
- `covariance_type`: 'full', 'tied', 'diag', or 'spherical'
Advantages: Soft assignments, handles different cluster shapes, provides likelihood scores.
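The "flag low-likelihood points" use case above can be sketched directly with `score_samples`. The synthetic "normal" traffic and injected outliers below are assumptions standing in for real feature data:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(42)
# "Normal" traffic with two behavioral modes (hypothetical scaled features)
normal = np.vstack([rng.normal(0, 1, (200, 2)), rng.normal(5, 1, (200, 2))])
# A few out-of-distribution points standing in for attack traffic
attacks = rng.normal(10, 0.5, (5, 2))

gmm = GaussianMixture(n_components=2, covariance_type='full', random_state=42)
gmm.fit(normal)

# score_samples returns per-point log-likelihood; flag anything below
# the 1st percentile of the training distribution
threshold = np.percentile(gmm.score_samples(normal), 1)
flags = gmm.score_samples(attacks) < threshold
print(f"Flagged {flags.sum()} of {len(attacks)} injected points")
```

The percentile threshold plays the same role as `contamination` in Isolation Forest: it sets the expected false-positive budget on normal traffic.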
PCA (Principal Component Analysis)
Best for: Linear dimensionality reduction, noise reduction, feature decorrelation.
How it works:
- Finds orthogonal axes of maximum variance
- Projects data onto top K principal components
- Reduces dimensions while preserving structure
Security use cases:
- Reduce 40+ network metrics to handful of components
- Visualize traffic in 2D/3D for pattern spotting
- Preprocess data for other algorithms
- Remove redundant correlated features
Key parameters:
- `n_components`: Number of components to keep
- `whiten=True`: Scale components to unit variance
Limitations: Linear only, components may be hard to interpret, prioritizes variance not "interestingness".
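A common way to choose `n_components` is the cumulative explained-variance curve. A sketch with synthetic correlated metrics (a low-rank factor model is assumed as a stand-in for 40 real network features):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical 40 correlated network metrics driven by 5 latent factors
latent = rng.normal(size=(500, 5))
X = latent @ rng.normal(size=(5, 40)) + 0.1 * rng.normal(size=(500, 40))
X = StandardScaler().fit_transform(X)

pca = PCA().fit(X)
cumvar = np.cumsum(pca.explained_variance_ratio_)

# Keep the smallest number of components explaining >= 95% of variance
n_keep = int(np.searchsorted(cumvar, 0.95) + 1)
print(f"{n_keep} components explain {cumvar[n_keep - 1]:.1%} of variance")
```

Because the data is (nearly) rank 5, the curve saturates almost immediately; real traffic features rarely collapse this cleanly, so expect a softer knee.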
t-SNE (t-Distributed Stochastic Neighbor Embedding)
Best for: Nonlinear visualization of high-dimensional data.
How it works:
- Converts similarities to probability distributions
- Preserves local neighborhood structure in 2D/3D
- Uses gradient descent to minimize KL divergence
Security use cases:
- Visualize malware family groupings
- Plot network intrusion data to spot attack clusters
- Explore complex security datasets for patterns
Key parameters:
- `perplexity`: Effective number of neighbors (5-50 typical)
- `learning_rate`: Step size for optimization
- `n_iter`: Number of iterations (1000+ for convergence)
Limitations: Computationally heavy O(n²), distances not globally meaningful, can't project new points easily.
Security Considerations
Adversarial Vulnerabilities
Unsupervised learners are not immune to active attackers:
- Data poisoning: Adding ~3% crafted traffic can shift anomaly detector boundaries (Chen et al., IEEE S&P 2024)
- Backdooring: Trigger patterns can force clustering into "benign" groups (BadCME, BlackHat EU 2023)
- Evasion: Attackers can craft patterns that fall into density gaps, hiding as noise (KU Leuven, 2025)
Mitigations
- Model sanitization (TRIM): Discard 1-2% highest-loss points before retraining
- Consensus ensembling: Combine multiple detectors (Isolation Forest + GMM + ECOD), alert if any flags
- Distance-based defense: Re-cluster with different random seeds, ignore points that hop clusters
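The consensus-ensembling mitigation can be sketched as follows. Note this swaps ECOD for a GMM likelihood test so the sketch needs only scikit-learn, and all data here is synthetic:

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
X_train = rng.normal(0, 1, (500, 4))  # stand-in for normal behavior
# Test set: mostly normal, with 5 injected out-of-distribution points at the end
X_test = np.vstack([rng.normal(0, 1, (95, 4)), rng.normal(6, 1, (5, 4))])

# Detector 1: Isolation Forest (predict returns -1 for anomalies)
iso = IsolationForest(contamination=0.05, random_state=0).fit(X_train)
iso_flags = iso.predict(X_test) == -1

# Detector 2: GMM likelihood below the training 5th percentile
gmm = GaussianMixture(n_components=1, random_state=0).fit(X_train)
gmm_flags = gmm.score_samples(X_test) < np.percentile(gmm.score_samples(X_train), 5)

# Consensus: alert if ANY detector flags the point
alerts = iso_flags | gmm_flags
print(f"{alerts.sum()} alerts on {len(X_test)} test points")
```

An attacker now has to evade two structurally different detectors at once, which is harder than shifting a single decision boundary.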
Modern Tooling
- PyOD 2.x: 30+ anomaly detectors, one-line benchmarking
- Anomalib v1.5: Vision-focused, includes PatchCore for screenshot analysis
- scikit-learn 1.5: HDBSCAN wrapper with `score_samples` support
Implementation Patterns
Pattern 1: Anomaly Detection Pipeline
```python
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler

# 1. Prepare features
X = df[['duration', 'bytes', 'packets', 'errors']].values
X_scaled = StandardScaler().fit_transform(X)

# 2. Train detector
iso = IsolationForest(contamination=0.05, random_state=42)
iso.fit(X_scaled)

# 3. Score and flag
df['score'] = iso.decision_function(X_scaled)  # lower = more anomalous
labels = iso.predict(X_scaled)                 # -1 = anomaly, 1 = normal

# 4. Extract anomalies, most anomalous first
anomalies = df[labels == -1].sort_values('score')
```
Pattern 2: Clustering with Multiple Algorithms
```python
from sklearn.cluster import KMeans, DBSCAN
from hdbscan import HDBSCAN

# Try multiple algorithms
kmeans = KMeans(n_clusters=4, random_state=42).fit(X_scaled)
dbscan = DBSCAN(eps=0.5, min_samples=5).fit(X_scaled)
hdb = HDBSCAN(min_cluster_size=15, prediction_data=True).fit(X_scaled)

# Compare results
print(f"K-Means: {len(set(kmeans.labels_))} clusters")
print(f"DBSCAN: {len(set(dbscan.labels_) - {-1})} clusters, {sum(dbscan.labels_ == -1)} noise")
print(f"HDBSCAN: {len(set(hdb.labels_) - {-1})} clusters, {sum(hdb.labels_ == -1)} noise")
```
Pattern 3: Dimensionality Reduction for Visualization
```python
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# PCA for quick linear reduction
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
print(f"PCA explained variance: {pca.explained_variance_ratio_}")

# t-SNE for nonlinear visualization
tsne = TSNE(n_components=2, perplexity=30, random_state=42)
X_tsne = tsne.fit_transform(X_scaled)
```
Pattern 4: Ensemble Anomaly Detection
```python
from pyod.models.ecod import ECOD
from pyod.models.iforest import IForest

# Train multiple detectors
models = [ECOD(), IForest(contamination=0.05)]

# Ensemble: average the anomaly scores (in PyOD, higher = more anomalous)
scores = sum(m.fit(X_train).decision_function(X_test) for m in models) / len(models)
anomalies = X_test[scores > threshold]
```
Best Practices
- Always normalize features before distance-based algorithms (K-Means, DBSCAN, etc.)
- Start with Isolation Forest for anomaly detection - fast, robust, few parameters
- Use HDBSCAN over DBSCAN when possible - handles varying densities better
- Try multiple algorithms and compare results - different algorithms find different patterns
- Visualize with t-SNE to understand cluster structure before deploying
- Set contamination conservatively (1-5%) and tune based on false positive rate
- Use ensemble methods to reduce adversarial vulnerability
- Monitor for concept drift - retrain periodically as attack patterns evolve
Common Pitfalls
- Not normalizing: Distance-based algorithms fail with unscaled features
- Wrong K for K-Means: Use Elbow Method or Silhouette Score to determine
- t-SNE overinterpretation: Distances aren't globally meaningful, focus on local structure
- PCA on nonlinear data: Use t-SNE or autoencoders for nonlinear relationships
- Ignoring noise: DBSCAN/HDBSCAN noise points are often the most interesting (potential attacks)
- Single algorithm reliance: Always try multiple approaches and compare
Next Steps
After running an algorithm:
- Inspect the flagged anomalies or clusters
- Validate against known ground truth if available
- Tune parameters based on precision/recall tradeoff
- Consider ensemble methods for production
- Set up monitoring for model drift and retraining triggers