Skills error-monitoring

install
source · Clone the upstream repo
git clone https://github.com/TerminalSkills/skills
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/TerminalSkills/skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/error-monitoring" ~/.claude/skills/terminalskills-skills-error-monitoring && rm -rf "$T"
manifest: skills/error-monitoring/SKILL.md
source content

Error Monitoring

Overview

This skill helps you analyze application errors from monitoring platforms. It processes error exports, groups events by root cause, identifies duplicates, and classifies errors by user impact — turning a chaotic error stream into an actionable triage report.

Instructions

When the user provides error data (JSON export, CSV, or pasted logs), follow this process:

1. Parse and Normalize

  • Accept Sentry JSON exports, Datadog event logs, generic JSON arrays, or CSV files
  • Extract key fields: error message, stack trace (top frame), timestamp, occurrence count, status, affected users
  • Normalize error messages by stripping variable parts (IDs, timestamps, file paths) to find true duplicates

2. Group by Root Cause

  • Cluster errors that share the same stack trace origin (file + line + function)
  • Merge errors with identical normalized messages even if stack traces differ slightly
  • Flag groups that are likely the same root cause but reported under different names

3. Assess Impact

For each error group, determine:

  • Frequency: events per day (average and trend — increasing, stable, decreasing)
  • User reach: unique users affected (if available)
  • Severity signals: look for keywords indicating data loss, payment failure, auth issues, or security concerns
  • Self-healing: does the error auto-resolve? (check if occurrences are followed by long gaps)

4. Produce the Report

Output a structured report with:

  • Total events analyzed, time range, unique error groups
  • Top error groups ranked by volume, with: message, source location, event count, percentage of total, first/last seen, impact assessment
  • Duplicate groups (errors that should be merged)
  • Resolved-but-recurring groups (marked resolved but still firing)
  • Recommended priority: P1 (immediate), P2 (soon), P3 (backlog)

Examples

Example 1: Sentry JSON Export Analysis

Input: User provides

errors-export.json
with 3,200 error events from the last 14 days.

Output:

Error Analysis — 14-Day Window
Total Events: 3,200 | Unique Groups: 47 | Time Range: Jan 3 – Jan 17

Top 5 Error Groups:

1. [P3] TypeError: Cannot read property 'name' of null
   Source: src/components/UserCard.tsx:89
   Events: 812 (25.4%) | Users: 34 | Trend: Stable
   Note: Occurs on profile preview for deleted accounts. No data loss.

2. [P1] DatabaseError: deadlock detected
   Source: src/services/order.ts:201
   Events: 156 (4.9%) | Users: 89 | Trend: Increasing (+40% week-over-week)
   Note: Causes failed order submissions. Revenue impact confirmed.

3. [P2] FetchError: network timeout at /api/inventory
   Source: src/lib/api-client.ts:45
   Events: 340 (10.6%) | Users: 201 | Trend: Decreasing
   Note: Correlates with deployment windows. Likely cold-start related.

Duplicates Found:
- Groups #12 and #31 share identical root cause (different error wrappers)
- Groups #8, #19, #22 all stem from expired JWT tokens

Resolved But Recurring:
- Group #5 (CORS error) was resolved on Jan 10 but has 45 new events since

Example 2: CSV Log File

Input: User provides

app-errors.csv
with columns: timestamp, level, message, stack_trace, user_id.

Output: Same structured report format, noting that CSV data lacks occurrence counts so each row is treated as one event.

Guidelines

  • Always show percentages alongside raw counts — "812 events" is less useful than "812 events (25.4%)"
  • When stack traces are missing, fall back to error message similarity for grouping
  • Flag any error containing "payment", "auth", "password", "delete", or "drop" as potentially high-severity regardless of volume
  • If the export is very large (>10,000 events), summarize the top 20 groups and offer to drill into specific ones
  • Never assume an error is low-priority just because it's old — check if it's trending upward
  • Recommend merging duplicate groups to reduce alert noise before any rule changes