Claude-skill-registry aggregating-gauge-metrics
Aggregate pre-computed metrics (gauge, counter, delta types) using OPAL. Use when analyzing request counts, error rates, resource utilization, or any numeric metrics over time. Covers align + m() + aggregate pattern, summary vs time-series output, and common aggregation functions. For percentile metrics (tdigest), see analyzing-tdigest-metrics skill.
git clone https://github.com/majiayu000/claude-skill-registry
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/aggregating-gauge-metrics" ~/.claude/skills/majiayu000-claude-skill-registry-aggregating-gauge-metrics && rm -rf "$T"
Aggregating Gauge Metrics
Pre-computed metrics in Observe store aggregated measurements at regular intervals (typically every 5 minutes). This skill teaches how to query gauge, counter, and delta metric types using OPAL.
When to Use This Skill
- Analyzing request counts, error rates, or throughput metrics
- Tracking resource utilization (CPU, memory, network)
- Computing totals, averages, or rates across time periods
- Creating dashboards with time-series charts
- Working with any gauge, counter, or delta metric type
- When you need summary statistics or trends over time
Prerequisites
- Access to Observe tenant via MCP
- Understanding that metrics are pre-aggregated (not raw events)
- Metric dataset with type: gauge, counter, or delta
- Use discover_context() to find and inspect metrics
Key Concepts
What Are Gauge Metrics?
Gauge metrics are pre-aggregated numeric measurements collected at regular intervals:
Pre-aggregated: Already summarized at collection time (typically 5-minute intervals)
- More efficient than querying raw data
- Faster query performance
- Lower storage costs
Common Metric Types:
- Gauge: Point-in-time value (CPU utilization, memory usage, queue depth)
- Counter: Monotonically increasing value (total requests, bytes sent)
- Delta: Change between intervals (requests per interval, errors per interval)
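To make the counter/delta distinction concrete, here is a minimal Python sketch (not OPAL, and not a description of how Observe stores anything) that derives per-interval deltas from a monotonically increasing counter. The function name and the sample values are made up for illustration.

```python
def counter_to_deltas(samples):
    """Return the change between consecutive counter samples."""
    return [later - earlier for earlier, later in zip(samples, samples[1:])]


# Hypothetical cumulative request counter, sampled once per interval.
total_requests = [100, 130, 175, 175, 240]
deltas = counter_to_deltas(total_requests)
print(deltas)  # [30, 45, 0, 65]
```

Summing the deltas recovers the overall growth of the counter (240 - 100 = 140), which is why a delta metric can be totalled with sum while a raw counter cannot.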
Examples:
- span_call_count_5m: Number of requests per 5-minute interval
- span_error_count_5m: Number of errors per 5-minute interval
- system_cpu_utilization_ratio: CPU utilization percentage
- k8s_pod_memory_available_bytes: Available memory in bytes
CRITICAL: The align Verb is REQUIRED
Unlike datasets (Events/Intervals), metrics MUST use the align verb:
# WRONG - Will not work ❌
m("span_call_count_5m") | statsby total:sum(metric)
# CORRECT - Must use align ✅
align options(bins: 1), rate:sum(m("span_call_count_5m"))
aggregate total_requests:sum(rate)
Why align is required: Metrics are stored as time-series data that must be aligned to a time grid before aggregation.
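As a rough mental model of "aligning to a time grid", the Python sketch below snaps off-grid samples to fixed-width buckets and sums them. It is only an illustration of the idea under assumed data; the function name, the sample points, and the bucketing arithmetic are assumptions, not Observe's implementation.

```python
from collections import defaultdict


def align_sum(points, bucket_seconds):
    """Sum (timestamp, value) points into fixed-width time buckets."""
    buckets = defaultdict(float)
    for ts, value in points:
        bucket_start = ts - (ts % bucket_seconds)  # snap to the grid
        buckets[bucket_start] += value
    return dict(sorted(buckets.items()))


# Hypothetical samples as (unix seconds, value), arriving off-grid.
points = [(0, 4), (120, 6), (310, 5), (590, 1), (601, 2)]
aligned = align_sum(points, 300)
print(aligned)  # {0: 10.0, 300: 6.0, 600: 2.0}
```

Once every series sits on the same grid, values from different series can be combined bucket by bucket, which is the job of the aggregate step.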
Summary vs Time-Series Output
OPAL metrics queries can produce two different output types:
| Output Type | Pattern | Result | Use Case |
|---|---|---|---|
| Summary | align options(bins: 1) | One row per group | Totals, overall statistics |
| Time-Series | align 5m, align 1h, or default | Many rows per group | Trending, dashboards, charts |
Summary pattern - Single statistics across entire time range:
align options(bins: 1), rate:sum(m("metric"))
aggregate total:sum(rate), group_by(service_name)
Output: One row per service
Time-series pattern - Values over time buckets:
align 5m, rate:sum(m("metric"))
| aggregate total:sum(rate), group_by(service_name)
Output: Multiple rows per service (one per 5-minute bucket)
CRITICAL Syntax Difference:
- Summary (bins: 1): NO pipe | between align and aggregate
- Time-series (5m): YES pipe | between align and aggregate
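If queries are generated programmatically, the pipe rule can be encoded once instead of remembered. The following Python helper is a hypothetical sketch (the function name and parameters are not part of OPAL or any Observe client); it only assembles a query string following the rule as stated in this skill.

```python
def build_metric_query(metric, bucket=None, group_by=None):
    """Compose an align + aggregate query string.

    bucket=None -> summary: options(bins: 1), no pipe before aggregate
    bucket="5m" -> time-series: duration, pipe before aggregate
    """
    grouping = f", group_by({', '.join(group_by)})" if group_by else ""
    if bucket is None:
        align = f'align options(bins: 1), rate:sum(m("{metric}"))'
        aggregate = f"aggregate total:sum(rate){grouping}"
    else:
        align = f'align {bucket}, rate:sum(m("{metric}"))'
        aggregate = f"| aggregate total:sum(rate){grouping}"
    return f"{align}\n{aggregate}"


print(build_metric_query("span_call_count_5m", group_by=["service_name"]))
print(build_metric_query("span_call_count_5m", bucket="5m"))
```

The first call prints the summary form and the second the time-series form, so the only difference a caller controls is whether a bucket duration is supplied.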
Discovery Workflow
Step 1: Search for metrics
discover_context("request count", result_type="metric")
discover_context("error", result_type="metric")
discover_context("cpu memory", result_type="metric")
Step 2: Get detailed metric schema
discover_context(metric_name="span_call_count_5m")
Step 3: Verify metric type
Look for: Type: gauge (or counter, delta)
Step 4: Note available dimensions
These are used for group_by():
- service_name, service_namespace
- environment, span_name
- k8s_namespace_name, k8s_pod_name
- etc. (shown in discovery output)
Step 5: Write query
Use align + m() + aggregate pattern with correct dimensions
Basic Patterns
Pattern 1: Total Count Across Time Range
Get overall totals (summary output):
align options(bins: 1), rate:sum(m("span_call_count_5m"))
aggregate total_requests:sum(rate)
Output: Single row with total count across entire time range.
No group_by: Aggregates everything together.
Pattern 2: Totals Per Group
Get totals broken down by dimension:
align options(bins: 1), rate:sum(m("span_call_count_5m"))
aggregate total_requests:sum(rate), group_by(service_name)
Output: One row per service with total requests.
group_by: Use any dimension from metric schema.
Pattern 3: Average Values Per Group
Calculate averages across time range:
align options(bins: 1), cpu:avg(m("system_cpu_utilization_ratio"))
aggregate avg_cpu:avg(cpu), group_by(service_name)
Output: Average CPU utilization per service.
avg() function: Used twice - once in align, once in aggregate.
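Using avg in both stages yields an average of per-series averages, which is not always the same as averaging every raw sample. The Python sketch below illustrates that arithmetic with invented pod names and values; it assumes the two stages behave as described in this pattern and is not a statement about Observe internals.

```python
from statistics import mean

# Hypothetical CPU ratios per pod for one service over the query window.
series = {
    "pod-a": [0.20, 0.40],              # 2 samples
    "pod-b": [0.90, 0.90, 0.90, 0.90],  # 4 samples
}

# Stage 1 (like avg in align with bins: 1): one value per series.
per_series = {pod: mean(values) for pod, values in series.items()}

# Stage 2 (like avg in aggregate): average across the series in the group.
avg_cpu = mean(per_series.values())

# For comparison: pooling every raw sample weights pod-b more heavily.
pooled = mean(v for values in series.values() for v in values)

print(round(avg_cpu, 2), round(pooled, 2))  # 0.6 0.7
```

The gap between the two numbers grows when series in the same group report different numbers of samples, which is worth keeping in mind when reading avg_cpu.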
Pattern 4: Multiple Aggregations
Compute several statistics together:
align options(bins: 1), rate:sum(m("span_call_count_5m"))
aggregate total:sum(rate), average:avg(rate), maximum:max(rate), group_by(service_name)
Output: Multiple columns per service (total, average, maximum).
Pattern 5: Time-Series for Trending
Track values over time buckets:
align 5m, rate:sum(m("span_call_count_5m"))
| aggregate requests_per_5min:sum(rate), group_by(service_name)
Output: Multiple rows per service (one per 5-minute interval).
Note: Pipe | required after align for time-series pattern.
Output columns:
- _c_bucket - Time bucket identifier
- valid_from, valid_to - Bucket boundaries
- Metric values
Common Use Cases
Counting Total Requests by Service
align options(bins: 1), rate:sum(m("span_call_count_5m"))
aggregate total_requests:sum(rate), group_by(service_name)
| sort desc(total_requests)
| limit 10
Use case: Identify top services by request volume.
Counting Errors with Fill for Zero Values
align options(bins: 1), errors:sum(m("span_error_count_5m"))
aggregate total_errors:sum(errors), group_by(service_name)
fill total_errors:0
Use case: Show all services, even those with zero errors.
fill verb: Replaces null values with 0.
Tracking Request Rate Over Time
align 1h, rate:sum(m("span_call_count_5m"))
| aggregate requests_per_hour:sum(rate), group_by(service_name)
Use case: Hourly request trends for dashboards.
Output: Time-series data for charting.
Multiple Metrics in One Query
align options(bins: 1), requests:sum(m("span_call_count_5m")), errors:sum(m("span_error_count_5m"))
aggregate total_requests:sum(requests), total_errors:sum(errors), group_by(service_name)
| make_col error_rate:float64(total_errors) / float64(total_requests)
Use case: Calculate error rate from two metrics.
make_col: Add derived column after aggregation.
Resource Utilization Averages
align options(bins: 1), cpu:avg(m("system_cpu_utilization_ratio"))
aggregate avg_cpu:avg(cpu), max_cpu:max(cpu), group_by(k8s_pod_name)
| sort desc(avg_cpu)
| limit 20
Use case: Find pods with highest CPU usage.
Complete Example
Scenario: You want to analyze request and error rates for your microservices over the last 24 hours.
Step 1: Discover available metrics
discover_context("request error", result_type="metric")
Found metrics:
- span_call_count_5m (type: gauge)
- span_error_count_5m (type: gauge)
Step 2: Get metric details
discover_context(metric_name="span_call_count_5m")
Available dimensions:
service_name, service_namespace, environment, span_name
Step 3: Query for summary statistics
align options(bins: 1), requests:sum(m("span_call_count_5m")), errors:sum(m("span_error_count_5m"))
aggregate total_requests:sum(requests), total_errors:sum(errors), group_by(service_name)
fill total_errors:0
| make_col error_rate:float64(total_errors) / float64(total_requests) * 100.0
| sort desc(total_requests)
Step 4: Interpret results
| service_name | total_requests | total_errors | error_rate |
|---|---|---|---|
| frontend-proxy | 15660 | 0 | 0.0 |
| frontend | 15263 | 35 | 0.23 |
| featureflagservice | 11693 | 0 | 0.0 |
| productcatalogservice | 8813 | 0 | 0.0 |
Insight: Frontend has a 0.23% error rate - investigate errors.
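The error_rate column can be checked by hand. This Python sketch recomputes it from the totals in the table above; the helper name is hypothetical, and the zero-traffic guard is an addition of the sketch rather than something the OPAL query does.

```python
def error_rate_percent(total_requests, total_errors):
    """Errors as a percentage of requests; 0.0 when there is no traffic."""
    if total_requests == 0:
        return 0.0
    return float(total_errors) / float(total_requests) * 100.0


# (service_name, total_requests, total_errors) from the results table.
rows = [
    ("frontend-proxy", 15660, 0),
    ("frontend", 15263, 35),
    ("featureflagservice", 11693, 0),
    ("productcatalogservice", 8813, 0),
]
for service, requests, errors in rows:
    print(service, round(error_rate_percent(requests, errors), 2))
```

For frontend this gives 35 / 15263 * 100, which rounds to 0.23 and matches the table.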
Step 5: Get hourly trends
align 1h, requests:sum(m("span_call_count_5m")), errors:sum(m("span_error_count_5m"))
| aggregate requests_per_hour:sum(requests), errors_per_hour:sum(errors), group_by(service_name)
| filter service_name = "frontend"
Output: Time-series showing frontend requests and errors per hour.
Common Pitfalls
Pitfall 1: Forgetting align Verb
❌ Wrong:
m("span_call_count_5m") | statsby total:sum(metric)
✅ Correct:
align options(bins: 1), rate:sum(m("span_call_count_5m"))
aggregate total:sum(rate)
Why: Metrics MUST use align verb - it's required, not optional.
Pitfall 2: Wrong Pipe Usage
❌ Wrong (pipe with bins:1):
align options(bins: 1), rate:sum(m("metric"))
| aggregate total:sum(rate)
❌ Wrong (no pipe with time duration):
align 5m, rate:sum(m("metric"))
aggregate total:sum(rate)
✅ Correct:
# Summary - NO pipe
align options(bins: 1), rate:sum(m("metric"))
aggregate total:sum(rate)
# Time-series - YES pipe
align 5m, rate:sum(m("metric"))
| aggregate total:sum(rate)
Why: Syntax differs between summary and time-series patterns.
Pitfall 3: Grouping by Non-Existent Dimension
❌ Wrong:
align options(bins: 1), rate:sum(m("metric"))
aggregate total:sum(rate), group_by(service_name)
Error: "field 'service_name' does not exist"
✅ Correct:
# First: discover_context(metric_name="metric") to see available dimensions
# Then: use only dimensions that exist
align options(bins: 1), rate:sum(m("metric"))
aggregate total:sum(rate), group_by(correct_dimension_name)
Why: Not all metrics have the same dimensions - always check first.
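When queries are assembled by a script, the dimension check can be done before anything is sent. The Python helper below is a hypothetical sketch: it assumes the available dimensions have already been copied out of the discover_context() output by hand, and it does not call Observe.

```python
def check_dimensions(requested, available):
    """Raise ValueError listing any group_by fields the metric lacks."""
    missing = [dim for dim in requested if dim not in available]
    if missing:
        raise ValueError(f"unknown dimensions: {', '.join(missing)}")
    return list(requested)


# Dimensions listed for span_call_count_5m in the Complete Example.
available = {"service_name", "service_namespace", "environment", "span_name"}

print(check_dimensions(["service_name", "environment"], available))
try:
    check_dimensions(["k8s_pod_name"], available)
except ValueError as exc:
    print(exc)  # unknown dimensions: k8s_pod_name
```

Failing early with the offending names is usually easier to act on than the "field does not exist" error returned after the query runs.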
Pitfall 4: Using statsby Instead of aggregate
❌ Wrong:
align options(bins: 1), rate:sum(m("metric"))
statsby total:sum(rate)
✅ Correct:
align options(bins: 1), rate:sum(m("metric"))
aggregate total:sum(rate)
Why: After align, use aggregate (not statsby which is for datasets).
Aggregation Functions Reference
Common functions used with gauge metrics:
# Summing values
align options(bins: 1), metric:sum(m("metric_name"))
aggregate total:sum(metric)
# Averaging values
align options(bins: 1), metric:avg(m("metric_name"))
aggregate average:avg(metric)
# Maximum value
align options(bins: 1), metric:max(m("metric_name"))
aggregate maximum:max(metric)
# Minimum value
align options(bins: 1), metric:min(m("metric_name"))
aggregate minimum:min(metric)
# Count of samples
align options(bins: 1), metric:count(m("metric_name"))
aggregate sample_count:count(metric)
Pattern: Function used in both align and aggregate.
Time Bucket Options
Common time durations for time-series queries:
align 1m, ... # 1-minute buckets
align 5m, ... # 5-minute buckets (common)
align 15m, ... # 15-minute buckets
align 1h, ... # 1-hour buckets
align 1d, ... # 1-day buckets
Default: align without duration uses automatic binning (300 bins).
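Choosing a bucket size is mostly a question of how many rows per group the query will return. The Python sketch below does that arithmetic for a 24-hour range; the helper names are hypothetical, and the last line only shows what an even 300-way split would look like, since the exact width Observe picks for automatic binning is not specified here.

```python
import math

UNIT_SECONDS = {"m": 60, "h": 3600, "d": 86400}


def duration_seconds(duration):
    """Convert strings like '5m', '1h', '1d' to seconds."""
    return int(duration[:-1]) * UNIT_SECONDS[duration[-1]]


def bucket_count(range_seconds, duration):
    """Number of buckets needed to cover the query range."""
    return math.ceil(range_seconds / duration_seconds(duration))


day = 24 * 3600
for d in ("1m", "5m", "15m", "1h", "1d"):
    print(d, bucket_count(day, d))  # 1440, 288, 96, 24, 1 buckets

print(day / 300)  # 288.0 seconds per bin for an even 300-bin split
```

For a one-day window, an even 300-bin split lands close to 5-minute buckets, while 1h keeps the output to 24 rows per group.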
Best Practices
- Always use discover_context() first to find metrics and check dimensions
- Verify metric type - this skill is for gauge/counter/delta (NOT tdigest)
- Use summary pattern (bins: 1) for single statistics, reports, totals
- Use time-series pattern (5m, 1h) for dashboards, trending, charts
- Remember pipe rule: bins:1 = no pipe, time duration = yes pipe
- Use fill to replace nulls with zeros for complete results
- Add sort + limit for top-N queries to avoid overwhelming output
- Check available dimensions before using group_by
Related Skills
- analyzing-tdigest-metrics - For percentile metrics (latency, duration p95/p99)
- time-series-analysis - For event/interval trending with timechart (different from metrics)
- aggregating-event-datasets - For aggregating raw events with statsby (different from metrics)
- working-with-intervals - For calculating durations from raw interval data
Summary
Gauge metrics are pre-aggregated measurements that require the align verb:
- Core pattern: align + m() + aggregate
- Metric types: gauge, counter, delta (NOT tdigest)
- Two output modes:
  - Summary: options(bins: 1) → one row per group, NO pipe
  - Time-series: 5m, 1h → many rows per group, YES pipe
- Common functions: sum, avg, max, min, count
- Discovery: Use discover_context() to find metrics and dimensions
Key distinction: Metrics are pre-aggregated (use align), while Events/Intervals are raw data (use statsby/timechart).
Last Updated: November 14, 2025
Version: 1.0
Tested With: Observe OPAL (ServiceExplorer/Service Metrics)