asi · implementing-alert-fatigue-reduction

Install from source (clone the upstream repo):

```sh
git clone https://github.com/plurigrid/asi
```

Claude Code (install into ~/.claude/skills/):

```sh
T=$(mktemp -d) && git clone --depth=1 https://github.com/plurigrid/asi "$T" && mkdir -p ~/.claude/skills && cp -r "$T/plugins/asi/skills/implementing-alert-fatigue-reduction" ~/.claude/skills/plurigrid-asi-implementing-alert-fatigue-reduction && rm -rf "$T"
```

Manifest: plugins/asi/skills/implementing-alert-fatigue-reduction/SKILL.md
# Implementing Alert Fatigue Reduction

## When to Use
Use this skill when:
- SOC analysts face more alerts than they can reasonably investigate (>100 alerts/analyst/shift)
- False positive rates exceed 70% on key detection rules
- True positives are being missed or dismissed due to alert volume
- Management reports declining analyst morale or increasing turnover related to workload
Do not use this skill to justify disabling detection rules without analysis — reducing alerts must not create detection blind spots.

## Prerequisites
- SIEM with 90+ days of alert disposition data (true positive, false positive, benign)
- Alert metrics: volume, disposition rate, MTTD, MTTR per rule
- Detection engineering resources for rule tuning and testing
- Splunk ES with risk-based alerting (RBA) capability or equivalent
- Baseline analyst capacity metrics (alerts per analyst per shift)
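
A quick preflight for the disposition-data prerequisite (a minimal sketch; it assumes the `notable` index and `status_label` conventions used throughout this skill, and treats 80%+ dispositioned alerts as a workable baseline):

````
``` Confirm 90 days of usable disposition data before starting ```
index=notable earliest=-90d
| stats count AS total,
    sum(eval(if(match(status_label, "^Resolved"), 1, 0))) AS dispositioned
| eval disposition_rate = round(dispositioned / total * 100, 1)
| eval readiness = if(disposition_rate >= 80,
    "READY: disposition data sufficient for baselining",
    "NOT READY: close out historical alerts first")
````
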
## Workflow

### Step 1: Measure Current Alert Quality

Quantify the problem before making changes:
````
``` Alert volume and disposition analysis (last 90 days) ```
index=notable earliest=-90d
| stats count AS total_alerts,
    sum(eval(if(status_label="Resolved - True Positive", 1, 0))) AS true_positives,
    sum(eval(if(status_label="Resolved - False Positive", 1, 0))) AS false_positives,
    sum(eval(if(status_label="Resolved - Benign", 1, 0))) AS benign,
    sum(eval(if(status_label="New" OR status_label="In Progress", 1, 0))) AS unresolved
    by rule_name
| eval fp_rate = round(false_positives / total_alerts * 100, 1)
| eval tp_rate = round(true_positives / total_alerts * 100, 1)
| eval signal_to_noise = round(true_positives / (false_positives + 0.01), 2)
| sort - total_alerts
| table rule_name, total_alerts, true_positives, false_positives, benign, fp_rate, tp_rate, signal_to_noise

``` Top 10 noisiest rules (candidates for tuning): append to the search above ```
| search fp_rate > 70 OR total_alerts > 1000
| sort - false_positives
| head 10
````
Daily alert volume per analyst:
````
index=notable earliest=-30d
| bin _time span=1d
| stats count AS daily_alerts by _time
| stats avg(daily_alerts) AS avg_daily, max(daily_alerts) AS peak_daily, stdev(daily_alerts) AS stdev_daily
| eval alerts_per_analyst = round(avg_daily / 6, 0)  ``` 6 analysts per shift ```
| eval capacity_status = case(
    alerts_per_analyst > 100, "CRITICAL — Exceeds analyst capacity",
    alerts_per_analyst > 50, "WARNING — Approaching capacity limits",
    1=1, "HEALTHY — Within manageable range")
````
### Step 2: Implement Risk-Based Alerting (RBA)
Convert threshold-based alerts to risk scoring in Splunk ES:
````
``` Instead of generating an alert for every failed login, contribute risk ```
``` Risk Rule: Failed Authentication (contributes to risk score, no alert) ```
index=wineventlog EventCode=4625
| stats count by src_ip, TargetUserName, ComputerName
| where count > 5
| eval risk_score = case(
    count > 50, 40,
    count > 20, 25,
    count > 10, 15,
    count > 5, 5)
| eval risk_object = src_ip
| eval risk_object_type = "system"
| eval risk_message = count." failed logins from ".src_ip." targeting ".TargetUserName
| collect index=risk
````
````
``` Risk Rule: Successful Login After Failures (additive risk) ```
index=wineventlog EventCode=4624 Logon_Type=3
| lookup risk_scores src_ip AS src_ip OUTPUT total_risk
| where total_risk > 0
| eval risk_score = 30
| eval risk_message = "Successful login after ".total_risk." risk points from ".src_ip
| collect index=risk
````
````
``` Risk Threshold Alert: only alert when cumulative risk exceeds threshold ```
index=risk earliest=-24h
| stats sum(risk_score) AS total_risk, values(risk_message) AS risk_events, dc(source) AS contributing_rules by risk_object
| where total_risk >= 75
| eval urgency = case(
    total_risk >= 150, "critical",
    total_risk >= 100, "high",
    total_risk >= 75, "medium")
``` This single alert replaces 10+ individual threshold alerts ```
````
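
The second risk rule above reads from a `risk_scores` lookup, which is not built into Splunk ES. One way to populate it is a scheduled search that snapshots per-source risk totals (a sketch; the lookup name, refresh cadence, and 24-hour window are assumptions):

````
``` Run on a schedule (e.g., every 15 minutes) to refresh per-source risk totals ```
index=risk earliest=-24h
| stats sum(risk_score) AS total_risk by risk_object
| rename risk_object AS src_ip
| outputlookup risk_scores
````
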
Before RBA vs After RBA comparison:
```
BEFORE RBA:
  Rule: "Failed Login > 5"    →  847 alerts/day (FP rate: 92%)
  Rule: "Suspicious Process"  →  234 alerts/day (FP rate: 78%)
  Rule: "Network Anomaly"     →  156 alerts/day (FP rate: 85%)
  Total:                       1,237 alerts/day

AFTER RBA:
  Risk aggregation alerts     →   23 alerts/day (FP rate: 18%)
  Each alert contains full context from multiple risk contributions

Reduction: 98% fewer alerts with HIGHER true positive rate
```
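
To produce this comparison from your own data, a before/after split along these lines can help (a sketch; the cutover date is a placeholder for your actual RBA deployment date):

```
index=notable earliest=-180d
| eval period = if(_time < strptime("2024-02-01", "%Y-%m-%d"), "before_rba", "after_rba")
| bin _time span=1d
| stats count AS daily_alerts by _time, period
| stats avg(daily_alerts) AS avg_daily_alerts by period
```
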
### Step 3: Tune High-Volume False Positive Rules
Systematically tune the noisiest rules:
````
``` Identify common false positive patterns ```
index=notable rule_name="Suspicious PowerShell Execution" status_label="Resolved - False Positive" earliest=-90d
| stats count by src, dest, user, CommandLine
| sort - count
| head 20
``` Reveals: SCCM client generating 80% of false positives ```
````
Apply tuning:
````
``` Original rule (generating false positives) ```
index=sysmon EventCode=1 Image="*\\powershell.exe"
    (CommandLine="*-enc*" OR CommandLine="*-encodedcommand*" OR CommandLine="*invoke-expression*")

``` Tuned rule (excluding known legitimate sources) ```
``` The rename makes the subsearch emit CommandLine=<pattern> exclusion conditions ```
index=sysmon EventCode=1 Image="*\\powershell.exe"
    (CommandLine="*-enc*" OR CommandLine="*-encodedcommand*" OR CommandLine="*invoke-expression*")
    NOT [| inputlookup powershell_whitelist.csv | rename CommandLine_pattern AS CommandLine | fields CommandLine]
    NOT (ParentImage="*\\ccmexec.exe" OR ParentImage="*\\sccm*")
    NOT (User="SYSTEM" AND ParentImage="*\\services.exe" AND CommandLine="*Microsoft\\ConfigMgr*")
````
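
The `powershell_whitelist.csv` lookup referenced in the tuned rule is not a stock artifact; a hypothetical shape for it (these patterns are illustrative, not vetted exclusions):

```
CommandLine_pattern
*Microsoft\ConfigMgr*
*PSWindowsUpdate*
*ccmsetup*
```

Patterns this broad trade recall for noise reduction, so each row should go through the approval flow documented below.
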
Document tuning decisions:
```yaml
rule_name: Suspicious PowerShell Execution
tuning_date: 2024-03-15
original_fp_rate: 78%
tuned_fp_rate: 22%
exclusions_added:
  - ParentImage containing ccmexec.exe (SCCM client)
  - User=SYSTEM with ConfigMgr in CommandLine
  - "Scheduled task: Windows Update PowerShell module"
alerts_reduced: ~180/day eliminated
detection_impact: None — exclusions verified against ATT&CK test cases
approved_by: detection_engineering_lead
```
### Step 4: Implement Alert Consolidation
Group related alerts into single incidents:
````
``` Consolidate alerts by source IP within a 5-minute window ```
``` (dedup has no span option, so bucket _time first, then dedup per bucket) ```
index=notable earliest=-1h
| bin _time span=5m
| dedup _time, src, rule_name
| stats count AS alert_count, values(rule_name) AS related_rules, earliest(_time) AS first_alert, latest(_time) AS last_alert by src
| where alert_count > 3
| eval consolidated_alert = src." triggered ".alert_count." related alerts: ".mvjoin(related_rules, ", ")
````
Splunk ES Notable Event Suppression:
````
``` Suppress duplicate alerts for the same source/dest pair within 1 hour ```
`notable`
| bin _time span=1h
| dedup _time, src, dest, rule_name
````
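
Where windowed dedup is too lossy, `transaction` is an alternative that keeps every member event inside the group (a sketch; transaction is heavier at search time, so keep the time range bounded):

````
``` Group each source's alerts within a rolling 5-minute span ```
index=notable earliest=-1h
| transaction src maxspan=5m
| where eventcount > 3
| table _time, src, eventcount, duration
````
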
### Step 5: Implement Tiered Alert Routing
Route alerts based on confidence and severity:
```
ALERT ROUTING STRATEGY
━━━━━━━━━━━━━━━━━━━━━━
Tier 1 (Automated):
  - Risk score < 30: Auto-close with enrichment data logged
  - Known false positive patterns: Auto-suppress (reviewed quarterly)
  - Informational alerts: Route to dashboard only (no queue)

Tier 2 (Analyst Review):
  - Risk score 30-75: Standard triage queue
  - Medium confidence alerts: Analyst decision required
  - Enriched with automated context (VT, AbuseIPDB, asset info)

Tier 3 (Priority Investigation):
  - Risk score > 75: Immediate investigation
  - Deception alerts: Auto-escalate (zero false positives)
  - Known malware detection: Auto-contain + analyst review
```
Implement in Splunk:
````
index=notable
| eval routing = case(
    urgency="critical" OR source="deception", "TIER3_IMMEDIATE",
    urgency="high" AND risk_score > 75, "TIER3_IMMEDIATE",
    urgency="high" OR urgency="medium", "TIER2_STANDARD",
    urgency="low" AND fp_rate > 80, "TIER1_AUTO_CLOSE",
    1=1, "TIER2_STANDARD")
| where routing != "TIER1_AUTO_CLOSE"  ``` Auto-closed alerts removed from queue ```
````
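
Tier 1 auto-closes should remain auditable for the quarterly review called out in the routing strategy. One option is writing them to a summary index instead of discarding them (a sketch; the `auto_closed` index name is an assumption, and the fp_rate condition mirrors the routing search above):

````
``` Audit trail for auto-closed alerts: reviewable, but out of the queue ```
index=notable urgency="low"
| where fp_rate > 80
| eval routing = "TIER1_AUTO_CLOSE", closed_at = now()
| collect index=auto_closed
````
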
### Step 6: Measure Improvement and Maintain
Track alert fatigue metrics over time:
````
``` Weekly alert quality trend ```
index=notable earliest=-90d
| bin _time span=1w
| stats count AS total,
    sum(eval(if(status_label="Resolved - True Positive", 1, 0))) AS tp,
    sum(eval(if(status_label="Resolved - False Positive", 1, 0))) AS fp
    by _time
| eval tp_rate = round(tp / total * 100, 1)
| eval fp_rate = round(fp / total * 100, 1)
| eval alerts_per_analyst = round(total / 42, 0)  ``` 6 analysts * 7 days ```
| table _time, total, tp, fp, tp_rate, fp_rate, alerts_per_analyst
````
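
To catch regressions before analysts feel them, a scheduled meta-alert can fire when per-analyst load climbs back above target (a sketch; the thresholds mirror the Step 1 capacity bands and the 6-analyst shift assumption):

````
``` Fire when yesterday's load exceeds the 40-60 alerts/analyst/shift target ```
index=notable earliest=-1d
| stats count AS daily_alerts
| eval alerts_per_analyst = round(daily_alerts / 6, 0)
| where alerts_per_analyst > 60
| eval message = "Alert volume regression: ".alerts_per_analyst." alerts/analyst/shift"
````
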
## Key Concepts
| Term | Definition |
|---|---|
| Alert Fatigue | Cognitive overload from excessive alert volumes leading analysts to dismiss or ignore valid alerts |
| Risk-Based Alerting (RBA) | Detection approach aggregating risk contributions from multiple events before generating a single high-context alert |
| Signal-to-Noise Ratio | Ratio of true positive alerts to false positives — higher ratio indicates better alert quality |
| False Positive Rate | Percentage of alerts resolved as false positives after investigation — target <30% for production rules |
| Alert Consolidation | Grouping related alerts from the same source/campaign into a single investigation unit |
| Detection Tuning | Process of refining rule logic to exclude known benign patterns while maintaining true positive detection |
## Tools & Systems
- Splunk ES Risk-Based Alerting: Framework converting individual detections into cumulative risk scores per entity
- Splunk ES Adaptive Response: Actions that can auto-close, suppress, or route alerts based on enrichment results
- Elastic Detection Rules: Built-in severity and risk score assignment with exception lists for tuning
- Chronicle SOAR: Google's SOAR platform with automated alert deduplication and grouping capabilities
- Tines: No-code SOAR platform enabling custom alert routing and automated enrichment workflows
## Common Scenarios
- Post-RBA Implementation: Convert 15 threshold alerts into risk contributions, reducing daily volume by 85%
- Quarterly Tuning Cycle: Review top 20 noisiest rules, apply exclusions, measure FP rate improvement
- New Tool Deployment: After deploying new EDR, tune initial detection rules to baseline the environment
- Analyst Capacity Planning: Calculate optimal alert-to-analyst ratio (target 40-60 alerts/analyst/shift); a sketch follows this list
- Compliance Balance: Maintain detection coverage for compliance while reducing operational alert volume
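
For the capacity-planning scenario, a minimal sketch (assumes 30 days of notable volume and the 40-60 alerts/analyst/shift target from above):

````
``` Analysts needed per shift to stay within the 40-60 alert target ```
index=notable earliest=-30d
| bin _time span=1d
| stats count AS daily_alerts by _time
| stats avg(daily_alerts) AS avg_daily
| eval analysts_at_60 = ceiling(avg_daily / 60), analysts_at_40 = ceiling(avg_daily / 40)
````
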
## Output Format
```
ALERT FATIGUE REDUCTION REPORT — Q1 2024
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Before (January 2024):
  Daily Alert Volume:    1,847
  Alerts/Analyst/Shift:  154
  False Positive Rate:   82%
  True Positive Rate:    8%
  Signal-to-Noise:       0.10
  Analyst Morale:        Low (2 resignations in Q4)

After (March 2024):
  Daily Alert Volume:    287 (-84%)
  Alerts/Analyst/Shift:  24
  False Positive Rate:   23% (-72% improvement)
  True Positive Rate:    41% (+413% improvement)
  Signal-to-Noise:       1.78

Changes Implemented:
  [1] Risk-Based Alerting deployed (15 rules converted)  -1,200 alerts/day
  [2] Top 10 noisy rules tuned with exclusion lists        -280 alerts/day
  [3] Alert consolidation (5-min dedup window)              -80 alerts/day
  [4] Tier 1 auto-close for low-confidence alerts           N/A (removed from queue)

Detection Coverage Impact: NONE — ATT&CK coverage maintained at 67%
True Positive Detection Rate: IMPROVED — 12 additional true positives caught per week
```