Claude-skill-registry detecting-anomalies
Detect anomalies in metrics and time-series data using OPAL statistical methods. Use when you need to identify unusual patterns, spikes, drops, or outliers in observability data. Covers statistical outlier detection (Z-score, IQR), threshold-based alerts, rate-of-change detection with window functions, and moving average baselines. Choose pattern based on data distribution and anomaly type.
git clone https://github.com/majiayu000/claude-skill-registry
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/detecting-anomalies" ~/.claude/skills/majiayu000-claude-skill-registry-detecting-anomalies && rm -rf "$T"
skills/data/detecting-anomalies/SKILL.md

Detecting Anomalies
Detect anomalies in metrics and time-series data using OPAL statistical methods. This skill covers multiple detection patterns for different types of anomalies: statistical outliers, sudden spikes/drops, threshold violations, and deviations from moving baselines.
Use when you need to:
- Identify unusual spikes or drops in request volume, errors, latency
- Detect values exceeding normal statistical bounds
- Alert on sudden percentage changes (traffic doubling, sudden drops)
- Compare current values to moving averages
- Find outliers in skewed distributions
Key Concepts
Anomaly Detection Approaches
Statistical Methods (good for gradual changes):
- Z-Score (standard deviation) - Assumes normal distribution
- IQR (Interquartile Range) - Robust to skewed data
- Percentile thresholds - Compare to historical baseline
Temporal Methods (good for sudden changes):
- Rate of change - Detect sudden spikes/drops
- Moving average deviation - Compare to recent baseline
Threshold Methods (simple and interpretable):
- Static thresholds - Known limits (CPU > 90%)
- Dynamic thresholds - Calculated from baseline (current > avg * 1.5)
When to Use Each Pattern
What type of anomaly?
├─ Known threshold (e.g., "CPU > 90%")
│  └─> Threshold-Based Detection (Pattern 3)
├─ Statistical outliers (unusual values)
│  ├─ Normal distribution?
│  │  └─> Z-Score Method (Pattern 1)
│  └─ Skewed distribution?
│     └─> IQR Method (Pattern 2)
├─ Sudden spikes/drops
│  └─> Rate of Change (Pattern 4)
└─ Deviation from recent baseline
   └─> Moving Average (Pattern 5)
Pattern 1: Statistical Outlier Detection (Z-Score)
Concept: Detect values beyond N standard deviations from the mean
When to use:
- Metrics with relatively stable baseline
- Data roughly follows normal distribution
- Need statistically grounded detection
Query:
align 5m, metric_value:sum(m("span_call_count_5m"))
| aggregate avg_val:avg(metric_value), stddev_val:stddev(metric_value), current_val:sum(metric_value), group_by(service_name)
| make_col z_score:(current_val - avg_val) / stddev_val
| make_col upper_bound:avg_val + (2 * stddev_val)
| make_col lower_bound:avg_val - (2 * stddev_val)
| make_col is_anomaly:if(z_score > 2 or z_score < -2, true, false)
| filter is_anomaly = true
| sort desc(z_score)
| limit 20
Threshold tuning:
- z > 2 or z < -2: ~95% confidence (moderate sensitivity)
- z > 3 or z < -3: ~99.7% confidence (low false positives)
- z > 1.5 or z < -1.5: ~87% confidence (high sensitivity)
Example result:
service_name: featureflagservice
avg_val: 11.5
stddev_val: 13.9
current_val: 46
z_score: 2.48
is_anomaly: true
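To make the arithmetic concrete, here is the same rule as a minimal Python sketch; the values are hypothetical, and the OPAL query above is what actually runs:

```python
# Z-score rule in plain Python (illustrative only; values are hypothetical).
from statistics import mean, stdev

baseline = [11, 3, 9, 14, 8, 12, 7]   # recent 5m call counts
current = 46                           # latest bucket

avg_val = mean(baseline)
stddev_val = stdev(baseline)           # sample standard deviation
z_score = (current - avg_val) / stddev_val

# Flag anything beyond 2 standard deviations in either direction.
is_anomaly = z_score > 2 or z_score < -2
print(f"z={z_score:.2f}, anomaly={is_anomaly}")
```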
Pros:
- Statistically grounded
- Well-understood confidence intervals
- Good for normally distributed data
Cons:
- Assumes normal distribution
- Sensitive to extreme outliers in baseline
- Requires sufficient historical data
Pattern 2: IQR (Interquartile Range) Method
Concept: Detect values beyond the interquartile range using Tukey's fences
When to use:
- Skewed distributions (latency, error counts)
- Presence of natural outliers in baseline
- More robust alternative to Z-score
Query:
align 5m, metric_value:sum(m("span_call_count_5m"))
| aggregate p25:percentile(metric_value, 0.25), p75:percentile(metric_value, 0.75), current_val:sum(metric_value), group_by(service_name)
| make_col iqr:p75 - p25
| make_col upper_fence:p75 + (1.5 * iqr)
| make_col lower_fence:p25 - (1.5 * iqr)
| make_col is_outlier:if(current_val > upper_fence or current_val < lower_fence, true, false)
| filter is_outlier = true
| sort desc(current_val)
| limit 20
Threshold tuning:
- 1.5 * IQR: Standard outliers (moderate sensitivity)
- 3 * IQR: Extreme outliers (low false positives)
- 1 * IQR: More sensitive detection
Example result:
service_name: featureflagservice
p25: 1.75
p75: 16.75
iqr: 15
upper_fence: 39.25
current_val: 46
is_outlier: true
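The fence arithmetic, as a minimal Python sketch with hypothetical values (statistics.quantiles with n=4 returns the three quartile cut points):

```python
# Tukey's fences in plain Python (illustrative only).
from statistics import quantiles

baseline = [1, 2, 3, 5, 8, 12, 16, 17, 20]    # hypothetical, skewed
current = 46

p25, _median, p75 = quantiles(baseline, n=4)  # quartiles
iqr = p75 - p25
upper_fence = p75 + 1.5 * iqr
lower_fence = p25 - 1.5 * iqr

is_outlier = current > upper_fence or current < lower_fence
print(f"fences=({lower_fence:.2f}, {upper_fence:.2f}), outlier={is_outlier}")
```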
Pros:
- Robust to skewed distributions
- Not affected by extreme values
- Based on quartiles (median-based)
Cons:
- Less interpretable than Z-score
- May miss anomalies in heavy-tailed distributions
- Requires sufficient data for percentile calculation
Pattern 3: Threshold-Based Detection
Concept: Simple comparison against fixed or dynamic thresholds
When to use:
- Known capacity limits (CPU > 90%, memory > 80%)
- SLO violations (error rate > 1%, latency > 500ms)
- Business rules (orders < 100 per hour)
Static Threshold:
align options(bins: 1), total_calls:sum(m("span_call_count_5m"))
aggregate current_rate:sum(total_calls), group_by(service_name)
make_col threshold:100000
| make_col is_high:if(current_rate > threshold, true, false)
| filter is_high = true
| sort desc(current_rate)
Dynamic Threshold (baseline comparison):
align options(bins: 1), metric_value:sum(m("span_call_count_5m"))
aggregate baseline:avg(metric_value), current:sum(metric_value), group_by(service_name)
make_col threshold:baseline * 1.5
| make_col is_anomaly:if(current > threshold, true, false)
| filter is_anomaly = true
Threshold multiplier guidance:
- 1.5x: High sensitivity (more alerts)
- 2x: Moderate sensitivity (balanced)
- 3x: Low sensitivity (only major spikes)
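The dynamic rule reduces to a single comparison; a minimal Python sketch (function name and values are hypothetical):

```python
# Dynamic threshold: current value vs. a multiple of the baseline average.
def exceeds_dynamic_threshold(current: float, baseline_avg: float,
                              multiplier: float = 1.5) -> bool:
    return current > baseline_avg * multiplier

print(exceeds_dynamic_threshold(180, 100))        # True at the 1.5x setting
print(exceeds_dynamic_threshold(180, 100, 2.0))   # False at the 2x setting
```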
Pros:
- Simple and interpretable
- No assumptions about distribution
- Clear business meaning
Cons:
- Requires domain knowledge to set thresholds
- Static thresholds may not adapt to changing baselines
- May miss subtle anomalies
Pattern 4: Rate of Change Detection
Concept: Detect sudden spikes or drops by comparing to previous time period
When to use:
- Detect sudden traffic spikes or drops
- Identify rapid changes in behavior
- Alert on percentage change thresholds
Query:
align 5m, metric_value:sum(m("span_call_count_5m"))
| make_col previous_value:window(lag(metric_value, 1), group_by(service_name))
| make_col value_change:metric_value - previous_value
| make_col pct_change:if(previous_value > 0, (value_change / previous_value) * 100, 0)
| make_col is_spike:if(pct_change > 100 or pct_change < -50, true, false)
| filter is_spike = true
| sort desc(pct_change)
| limit 20
Critical syntax: Use window(lag(...), group_by(...)), NOT lag(...) over (partition by ...)
Threshold examples:
- pct_change > 100: 2x increase (doubling)
- pct_change > 200: 3x increase
- pct_change < -50: 50% drop
- pct_change < -75: 75% drop
Example result:
service_name: frontend
metric_value: 50
previous_value: 2
value_change: 48
pct_change: 2400
is_spike: true
Pros:
- Detects sudden changes regardless of absolute value
- Adapts to current baseline automatically
- Effective for early spike detection
Cons:
- Sensitive to very low baseline values (small numbers can cause large percentage changes)
- May produce false positives during normal ramp-up/down
- Requires at least 2 time periods of data
Best practices:
- Add minimum value filter to avoid division by small numbers
- Use different thresholds for increases vs decreases
- Consider absolute change threshold in addition to percentage
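A minimal Python sketch combining all three guards (thresholds are hypothetical starting points, not recommendations):

```python
# Rate-of-change check with guards: minimum baseline, minimum absolute
# change, and asymmetric up/down percentage thresholds.
def is_spike(current: float, previous: float,
             up_pct: float = 100.0, down_pct: float = -50.0,
             min_previous: float = 10.0, min_abs_change: float = 5.0) -> bool:
    if previous < min_previous:          # avoid tiny-denominator noise
        return False
    change = current - previous
    if abs(change) < min_abs_change:     # ignore trivial absolute moves
        return False
    pct_change = change / previous * 100
    return pct_change > up_pct or pct_change < down_pct

print(is_spike(current=50, previous=20))  # +150% on a real baseline -> True
print(is_spike(current=50, previous=2))   # baseline too small -> False
```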
Pattern 5: Moving Average Baseline
Concept: Compare current value to recent moving average using sliding window
When to use:
- Smooth noisy metrics for baseline
- Detect deviations from recent average
- Adaptive baseline that follows trends
Query:
align 5m, metric_value:sum(m("span_call_count_5m"))
| make_col moving_avg:window(avg(metric_value), group_by(service_name), frame(back:30m))
| make_col deviation:metric_value - moving_avg
| make_col pct_deviation:if(moving_avg > 0, (deviation / moving_avg) * 100, 0)
| make_col is_anomaly:if(pct_deviation > 50 or pct_deviation < -50, true, false)
| filter is_anomaly = true
| sort desc(pct_deviation)
| limit 20
Frame options:
- frame(back:10m): Short-term baseline (10-minute average)
- frame(back:30m): Medium-term baseline (30-minute average)
- frame(back:1h): Longer-term baseline (1-hour average)
Deviation thresholds:
- > 50%: Moderate deviation from recent average
- > 100%: Doubling compared to recent average
- > 25%: More sensitive detection
Pros:
- Adapts to changing baselines and trends
- Smooths noisy data
- Good for metrics with daily/hourly patterns
Cons:
- Slower to detect anomalies (due to averaging)
- May miss anomalies during rapid baseline shifts
- Requires sufficient lookback data
Best practices:
- Choose frame duration based on metric volatility
- Shorter frames for fast-changing metrics
- Longer frames for more stable baselines
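For intuition, a minimal Python sketch of the sliding-window baseline, using a fixed-length buffer of the last six 5-minute buckets to approximate frame(back:30m) (values are hypothetical):

```python
# Moving-average baseline with a ~30m window of 5m buckets.
from collections import deque

window = deque(maxlen=6)  # last six 5m buckets, roughly 30 minutes

def deviates(value, pct_threshold=50.0):
    """True when value deviates more than pct_threshold% from the window avg."""
    anomaly = False
    if window:
        moving_avg = sum(window) / len(window)
        if moving_avg > 0:
            pct_dev = (value - moving_avg) / moving_avg * 100
            anomaly = abs(pct_dev) > pct_threshold
    window.append(value)
    return anomaly

for v in [10, 12, 11, 9, 10, 11, 25]:
    print(v, deviates(v))   # 25 sits ~138% above the ~10.5 average -> True
```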
Pattern 6: Percentile-Based Threshold
Concept: Compare current value to historical percentile (p95, p99)
When to use:
- SLO violations (latency > p95)
- Detect values above "normal high"
- Comparing current to historical baseline
Query:
align 5m, metric_value:sum(m("span_call_count_5m"))
| aggregate p95:percentile(metric_value, 0.95), p99:percentile(metric_value, 0.99), current:sum(metric_value), group_by(service_name)
| make_col is_anomaly:if(current > p95, true, false)
| filter is_anomaly = true
| sort desc(current)
Percentile choices:
- p95: Detect top 5% unusual values (moderate sensitivity)
- p99: Detect top 1% extreme values (low false positives)
- p90: Detect top 10% (high sensitivity)
Pros:
- Percentile-based SLOs are industry standard
- Automatically adapts to data distribution
- Clear meaning (top X% of values)
Cons:
- Unidirectional (only detects high values, not drops)
- Requires sufficient historical data
- May not detect subtle shifts in distribution
Common Patterns
Pattern: Combine Multiple Detection Methods
Increase confidence by requiring multiple methods to agree:
align 5m, metric_value:sum(m("span_call_count_5m"))
| aggregate avg_val:avg(metric_value), stddev_val:stddev(metric_value), p95:percentile(metric_value, 0.95), current:sum(metric_value), group_by(service_name)
| make_col z_score:(current - avg_val) / stddev_val
| make_col is_zscore_anomaly:if(z_score > 2 or z_score < -2, true, false)
| make_col is_percentile_anomaly:if(current > p95, true, false)
| make_col is_anomaly:if(is_zscore_anomaly = true and is_percentile_anomaly = true, true, false)
| filter is_anomaly = true
Use case: Reduce false positives by requiring consensus
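The consensus logic in a minimal Python sketch (hypothetical values; statistics.quantiles with n=20 yields 19 cut points, the last of which is p95):

```python
# Consensus detection: flag only when Z-score AND percentile rules agree.
from statistics import mean, stdev, quantiles

baseline = [11, 3, 9, 14, 8, 12, 7, 10]
current = 46

z = (current - mean(baseline)) / stdev(baseline)
p95 = quantiles(baseline, n=20)[18]      # 19 cut points; index 18 is p95

is_anomaly = (z > 2 or z < -2) and current > p95
print(f"z={z:.2f}, p95={p95:.1f}, anomaly={is_anomaly}")
```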
Pattern: Multi-Metric Correlation
Detect anomalies across correlated metrics:
align options(bins: 1), requests:sum(m("span_call_count_5m")), errors:sum(m("span_error_count_5m"))
aggregate total_requests:sum(requests), total_errors:sum(errors), group_by(service_name)
make_col error_rate:if(total_requests > 0, (float64(total_errors) / float64(total_requests)) * 100, 0)
| make_col threshold:1.0
| make_col is_high_error:if(error_rate > threshold and total_requests > 100, true, false)
| filter is_high_error = true
Use case: Alert when error rate AND request volume both indicate issues
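The gating logic, sketched in Python (hypothetical thresholds):

```python
# Correlated check: alert only when the error rate is high AND there is
# enough traffic for the rate to be meaningful.
def high_error_rate(total_requests: int, total_errors: int,
                    rate_threshold_pct: float = 1.0,
                    min_requests: int = 100) -> bool:
    if total_requests < min_requests:
        return False
    return total_errors / total_requests * 100 > rate_threshold_pct

print(high_error_rate(5000, 75))   # 1.5% on real volume -> True
print(high_error_rate(40, 5))      # 12.5%, but too little traffic -> False
```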
Pattern: Time-Series Trending
Track anomalies over time using timechart:
align 5m, metric_value:sum(m("span_call_count_5m"))
| aggregate avg_val:avg(metric_value), stddev_val:stddev(metric_value), current:sum(metric_value), group_by(service_name)
| make_col z_score:(current - avg_val) / stddev_val
| make_col is_anomaly:if(z_score > 2 or z_score < -2, true, false)
| filter is_anomaly = true
Result: Multiple rows per service showing anomalies across time buckets
Use case: Visualize when and how often anomalies occur
OPAL Syntax Key Points
Window Functions (LAG/LEAD)
CRITICAL: OPAL uses the window() function, NOT the SQL OVER clause!
✅ CORRECT Syntax:
make_col prev:window(lag(column, 1), group_by(dimension))
make_col next:window(lead(column, 1), group_by(dimension))
make_col moving_avg:window(avg(column), group_by(dimension), frame(back:30m))
❌ WRONG Syntax (SQL-style):
lag(column, 1) over (partition by dimension order by time)
Window function components:
- lag(column, offset): Access previous row value
- lead(column, offset): Access next row value
- group_by(dimension): Partition by dimension
- frame(back:duration): Sliding window lookback period
Derived Columns Must Use Separate make_col
❌ WRONG - Cannot reference derived column in same make_col:
make_col upper_bound:avg + (2 * stddev), is_anomaly:if(value > upper_bound, true, false)
✅ CORRECT - Use separate make_col statements:
make_col upper_bound:avg + (2 * stddev) | make_col is_anomaly:if(value > upper_bound, true, false)
Metrics Query Patterns
Summary (one row per group):
align options(bins: 1), metric:sum(m("metric_name"))
aggregate result:sum(metric), group_by(dimension)
Note: No pipe | between align options(bins: 1) and aggregate!
Time-series (multiple rows per group):
align 5m, metric:sum(m("metric_name")) | aggregate result:sum(metric), group_by(dimension)
Note: Pipe | required between align 5m and aggregate!
Period-Over-Period Comparison with Timeshift + Union
For comparing entire periods (e.g., "this hour" vs "exactly 1 hour ago"), use the timeshift + union pattern with subquery definitions.
Key Difference:
- window(lag()): Compares adjacent buckets (5-min to 5-min, approximate)
- timeshift + union: Compares entire periods (exact time offset: 1h, 1d, 1w)
Working Example (✅ works in all query contexts):
@current <- @ {
  align rate:sum(m("span_call_count_5m"))
  aggregate current_sum:sum(rate), group_by(service_name)
}

@previous <- @ {
  timeshift 1h  # Shift BEFORE align!
  align rate:sum(m("span_call_count_5m"))
  aggregate prev_sum:sum(rate), group_by(service_name)
}

@combined <- @current {
  union @previous
  aggregate current:any_not_null(current_sum), previous:any_not_null(prev_sum), group_by(service_name)
  make_col change:current - previous
  make_col pct_change:if(previous > 0, (change / previous) * 100, 0)
  make_col abs_pct_change:if(pct_change < 0, -pct_change, pct_change)
}

<- @combined {
  filter abs_pct_change > 50
  sort desc(abs_pct_change)
  limit 10
}
Critical Points:
- @subquery <- @: Use @ alone to reference the primary input dataset
- timeshift BEFORE align: Operates on raw data, shifting timestamps before aggregation
- Separate aggregation: Both series must be aggregated independently
- any_not_null() collapses the union: Combines current/previous into a single row per dimension
- Works everywhere: MCP queries, worksheets, and monitors all support this syntax
Use Cases:
- Day-over-day comparison: "Today vs yesterday" (use timeshift 1d)
- Week-over-week trending: "This week vs last week" (use timeshift 7d)
- Hour-over-hour spikes: "This hour vs 1 hour ago" (use timeshift 1h)
- SLA violations: "Current vs same period last month" (use timeshift 30d)
Tested Results:
- ✅ Detected 200% increase in service request rate (18 vs 6)
- ✅ Detected 92% drop in request volume (1 vs 13)
- ✅ Works with any timeshift duration (1h, 6h, 1d, 7d, etc.)
Comparison with window(lag()):
| Feature | window(lag(rate, N)) | timeshift + union |
|---|---|---|
| Time precision | Approximate (N buckets back) | Exact (fixed time offset) |
| Example | ≈ 1 hour (if buckets are 5min) | = exactly 60 minutes |
| Complexity | Simple, one query | More complex, subqueries + union |
| Use case | Real-time spike detection | Period-over-period reporting |
| Best for | "Current vs previous bucket" | "Current vs same time yesterday" |
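The semantic difference can be sketched with pandas (illustrative only; pandas is not part of the OPAL pipeline, and the data is hypothetical):

```python
# .shift(1) mirrors window(lag(...)): compare to the previous bucket.
# .shift(freq="1h") mirrors timeshift 1h: compare to exactly 60 minutes ago.
import pandas as pd

idx = pd.date_range("2024-01-01", periods=36, freq="5min")
s = pd.Series(range(1, 37), index=idx, dtype=float)

prev_bucket = s.shift(1)                         # adjacent 5m bucket
hour_ago = s.shift(freq="1h").reindex(s.index)   # exact 1h offset

pct_vs_prev = (s - prev_bucket) / prev_bucket * 100
pct_vs_hour = (s - hour_ago) / hour_ago * 100
```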
When to use each:
- Use window(lag()) for: Real-time alerts, simple spike detection, fast queries
- Use timeshift + union for: Exact period comparison, day-over-day reports, SLA tracking
Troubleshooting
Issue: "Unknown function 'over()'"
Cause: Using SQL window function syntax instead of OPAL syntax
Solution: Use window(lag(...), group_by(...)) instead of lag(...) over (...)
Example:
# WRONG
make_col prev:lag(value, 1) over (partition by service order by time)

# CORRECT
make_col prev:window(lag(value, 1), group_by(service))
Issue: High false positive rate
Cause: Threshold too sensitive or baseline includes anomalies
Solutions:
- Increase threshold: Use 3-sigma instead of 2-sigma for Z-score
- Combine methods: Require multiple detection methods to agree
- Filter baseline: Exclude known anomaly periods from baseline calculation
- Add minimum value filter: Avoid alerting on very low absolute values
Example with minimum value filter:
| make_col is_spike:if(pct_change > 100 and metric_value > 10, true, false)
Issue: Missing anomalies (false negatives)
Cause: Threshold too strict or wrong detection method for data type
Solutions:
- Decrease threshold: Use 1.5-sigma or lower percentile (p90 instead of p95)
- Try different method: IQR if data is skewed, rate-of-change for sudden spikes
- Check data distribution: Visualize baseline to understand normal range
- Use multiple methods: Catch different types of anomalies
Issue: Division by zero or very small numbers
Cause: Calculating percentage change when previous value is zero or very small
Solution: Add conditional check for minimum denominator:
make_col pct_change:if(previous_value > 5, (value_change / previous_value) * 100, 0)
Issue: Window function returns null values
Cause: The first row in each group has no previous value for lag()
Solution: This is expected behavior: the first row will have null for lag(). Filter the nulls or provide a default:
make_col previous_value:window(lag(metric_value, 1), group_by(service_name)) | filter not is_null(previous_value)
Or substitute a default value (the current lag() syntax does not accept a default directly, so handle the null in a follow-up expression):
make_col pct_change:if(is_null(previous_value), 0, (value_change / previous_value) * 100)
Key Takeaways
- Choose detection method based on anomaly type and data distribution
  - Z-Score for normal distributions
  - IQR for skewed data
  - Rate-of-change for sudden spikes
  - Moving average for trend deviations
- OPAL window functions use different syntax from SQL
  - Use window(lag(...), group_by(...)) NOT lag(...) over (...)
  - Works with both metrics (align) and raw datasets
- Combine multiple methods to reduce false positives
  - Require Z-score AND percentile agreement
  - Add minimum value filters for rate-of-change
  - Correlate multiple metrics (requests + errors)
- Tune thresholds based on metric characteristics
  - Volatile metrics: higher thresholds (3-sigma, 100% change)
  - Stable metrics: lower thresholds (2-sigma, 50% change)
  - Test and iterate based on false positive rate
- Derived columns require separate make_col statements
  - Cannot reference a newly created column in the same make_col
  - Use a pipeline of make_col statements for sequential calculations
- Frame specification enables sliding window calculations
  - frame(back:30m) for a 30-minute moving average
  - Shorter frames for fast-changing metrics
  - Longer frames for stable baselines
- Metrics queries have two distinct patterns
  - options(bins: 1) for summary (no pipe before aggregate)
  - align 5m for time-series (pipe required before aggregate)
- Statistical methods work best with sufficient historical data
  - Need enough data points for meaningful stddev/percentiles
  - Consider minimum sample size (e.g., 24 hours of 5m buckets = 288 samples)
- Rate-of-change detection is powerful but requires careful tuning
  - Very effective for early spike detection
  - Prone to false positives with low baseline values
  - Add minimum value and absolute change filters
- Test detection patterns against historical data
  - Validate false positive rate on known-good periods
  - Verify detection on known anomaly events
  - Adjust thresholds based on operational feedback
When to Use This Skill
Use detecting-anomalies skill when:
- User asks to check for anomalies
- Creating alert rules for unusual behavior
- Investigating performance degradation or incidents
- Identifying outliers in service metrics
- Detecting sudden traffic spikes or drops
- Comparing current values to historical baselines
- Setting up SLO violation alerts
- Analyzing metrics for unusual patterns
Cross-references:
- aggregating-gauge-metrics (for metric query patterns)
- analyzing-tdigest-metrics (for percentile-based detection)
- time-series-analysis (for temporal trending)
- working-with-intervals (for span-based anomaly detection)
- window-functions-deep-dive (to better understand window functions)