Claude-skill-registry analyzing-tdigest-metrics
Analyze percentile metrics (tdigest type) using OPAL for latency analysis and SLO tracking. Use when calculating p50, p95, p99 from pre-aggregated duration or latency metrics. Covers the critical double-combine pattern with align + m_tdigest() + tdigest_combine + aggregate. For simple metrics (counts, averages), see aggregating-gauge-metrics skill.
git clone https://github.com/majiayu000/claude-skill-registry
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/analyzing-tdigest-metrics" ~/.claude/skills/majiayu000-claude-skill-registry-analyzing-tdigest-metrics && rm -rf "$T"
skills/data/analyzing-tdigest-metrics/SKILL.md
Analyzing TDigest Metrics
TDigest metrics in Observe store pre-aggregated percentile data for efficient latency and duration analysis. This skill teaches the specialized pattern for querying tdigest metrics using OPAL.
When to Use This Skill
- Calculating latency percentiles (p50, p95, p99) for services or endpoints
- Analyzing request duration distributions
- Setting or tracking SLOs (Service Level Objectives) based on percentiles
- Understanding performance characteristics beyond simple averages
- Working with any metric of type tdigest
- When you need accurate percentile calculations from pre-aggregated data
Prerequisites
- Access to Observe tenant via MCP
- Understanding that tdigest metrics are pre-aggregated percentile structures
- Metric dataset with type: tdigest
- Familiarity with percentiles (p50 = median, p95 = 95th percentile, etc.)
- Use discover_context() to find and inspect tdigest metrics
Key Concepts
What Are TDigest Metrics?
TDigest (t-digest) is a probabilistic data structure for estimating percentiles efficiently:
Pre-aggregated percentile data: Not raw values, but compressed statistical summaries
- Stores distribution information in compact form
- Enables close percentile estimates (approximate, not exact)
- Much more efficient than storing all raw values
Why percentiles matter:
- Averages hide outliers: A service with avg 100ms might have p99 at 10 seconds
- SLOs use percentiles: "p95 latency < 500ms" is a common SLO target
- User experience: p95/p99 show what real users experience, not just average case
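To make the first point concrete, here is a minimal Python sketch on an invented sample (98 fast requests, 2 very slow ones). It uses a plain nearest-rank percentile over raw values, not a t-digest, and the helper name `nearest_rank_percentile` is hypothetical.

```python
import math
import statistics


def nearest_rank_percentile(values, q):
    """Return the nearest-rank percentile of values for a quantile q in (0, 1]."""
    ordered = sorted(values)
    # Nearest rank: the smallest value with at least q of the data at or below it.
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]


# Hypothetical sample: 98 fast requests and 2 very slow ones (milliseconds).
latencies_ms = [50] * 98 + [10_000] * 2

mean_ms = statistics.mean(latencies_ms)
p50_ms = nearest_rank_percentile(latencies_ms, 0.50)
p99_ms = nearest_rank_percentile(latencies_ms, 0.99)

print(f"mean={mean_ms} ms, p50={p50_ms} ms, p99={p99_ms} ms")
```

The mean sits well above what a typical request sees and far below what the slowest requests see, which is why the percentiles are reported separately.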
Common Examples:
- span_sn_service_node_duration_tdigest_5m - Service-to-service latency percentiles
- span_sn_service_edge_duration_tdigest_5m - Edge latency percentiles
- request_duration_tdigest_5m - Request duration percentiles
- database_query_duration_tdigest_5m - Database query latency percentiles
CRITICAL: The Double-Combine Pattern
TDigest metrics require a special pattern that's different from gauge metrics:
# WRONG - Missing second combine ❌
align options(bins: 1), combined:tdigest_combine(m_tdigest("metric"))
aggregate p95:tdigest_quantile(combined, 0.95)

# CORRECT - Double-combine pattern ✅
align options(bins: 1), combined:tdigest_combine(m_tdigest("metric"))
aggregate p95:tdigest_quantile(tdigest_combine(combined), 0.95)
Why the double combine?
- First tdigest_combine (in align): Combines tdigest data points within time buckets
- Second tdigest_combine (in aggregate): Re-combines tdigests across groups/dimensions
- Then tdigest_quantile: Calculates the actual percentile value
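One way to see why the distributions have to be combined before the quantile is taken: percentiles computed per group cannot simply be averaged afterwards. The Python sketch below illustrates this with raw sample lists standing in for digests. It is not OPAL and not a t-digest implementation, and the data and helper name are invented.

```python
import math


def nearest_rank_percentile(values, q):
    """Return the nearest-rank percentile of values for a quantile q in (0, 1]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]


# Hypothetical raw latencies (ms) for two groups of very different size.
busy_group = [10] * 1000   # high traffic, fast
quiet_group = [900] * 10   # low traffic, slow

# Naive approach: take p95 per group, then average the two results.
average_of_p95 = (
    nearest_rank_percentile(busy_group, 0.95)
    + nearest_rank_percentile(quiet_group, 0.95)
) / 2

# Merge the distributions first, then take the quantile once.
merged_p95 = nearest_rank_percentile(busy_group + quiet_group, 0.95)

print(f"average of per-group p95: {average_of_p95} ms")
print(f"p95 of merged data:       {merged_p95} ms")
```

The averaged figure gives the small slow group the same weight as the large fast one. Merging first keeps each observation's weight, which is the role the combine step plays for digests.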
Pattern breakdown:
align options(bins: 1), combined:tdigest_combine(m_tdigest("metric_name"))   ← First combine
aggregate p95:tdigest_quantile(
    tdigest_combine(combined),   ← Second combine (NESTED!)
    0.95),                       ← Quantile value (0.0-1.0)
  group_by(service_name)
Percentile Values
Percentiles are specified as decimal values from 0.0 to 1.0:
| Percentile | Value | Meaning |
|---|---|---|
| p50 (median) | 0.50 | 50% of values are below this |
| p75 | 0.75 | 75% of values are below this |
| p90 | 0.90 | 90% of values are below this |
| p95 | 0.95 | 95% of values are below this |
| p99 | 0.99 | 99% of values are below this |
| p99.9 | 0.999 | 99.9% of values are below this |
Common SLO targets: p95 < 500ms, p99 < 1000ms
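The mapping in the table is just a division by 100. A small Python sketch of it follows, with a range check that catches the mistake of passing 95 instead of 0.95. The function name `percentile_label_to_quantile` is hypothetical and not part of OPAL.

```python
def percentile_label_to_quantile(label):
    """Convert a label such as 'p95' or 'p99.9' to a quantile between 0.0 and 1.0."""
    if not label.startswith("p"):
        raise ValueError(f"expected a label like 'p95', got {label!r}")
    quantile = float(label[1:]) / 100
    if not 0.0 <= quantile <= 1.0:
        raise ValueError(f"{label!r} maps outside the 0.0-1.0 range")
    return quantile


for label in ("p50", "p95", "p99", "p99.9"):
    print(label, "->", percentile_label_to_quantile(label))
```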
Summary vs Time-Series (Same as Gauge Metrics)
| Output Type | Pattern | Result | Pipe? |
|---|---|---|---|
| Summary | align options(bins: 1) | One row per group | NO |
| Time-Series | align 5m, align 1h | Many rows per group | YES |
Discovery Workflow
Step 1: Search for tdigest metrics
discover_context("duration tdigest", result_type="metric")
discover_context("latency percentile", result_type="metric")
Step 2: Get detailed metric schema
discover_context(metric_name="span_sn_service_node_duration_tdigest_5m")
Step 3: Verify metric type
Look for: Type: tdigest (critical!)
Step 4: Note available dimensions
Used for group_by():
- service_name, for_service_name
- environment, for_environment
- etc. (shown in discovery output)
Step 5: Write query
Use double-combine pattern with correct dimensions
Basic Patterns
Pattern 1: Overall Percentiles (No Grouping)
Calculate percentiles across all data:
align options(bins: 1), combined:tdigest_combine(m_tdigest("span_sn_service_node_duration_tdigest_5m"))
aggregate p50:tdigest_quantile(tdigest_combine(combined), 0.50),
  p95:tdigest_quantile(tdigest_combine(combined), 0.95),
  p99:tdigest_quantile(tdigest_combine(combined), 0.99)
Output: Single row with overall p50, p95, p99 across entire time range.
Note: Both combines present, no group_by.
Pattern 2: Percentiles Per Service
Calculate percentiles broken down by dimension:
align options(bins: 1), combined:tdigest_combine(m_tdigest("span_sn_service_node_duration_tdigest_5m"))
aggregate p50:tdigest_quantile(tdigest_combine(combined), 0.50),
  p95:tdigest_quantile(tdigest_combine(combined), 0.95),
  p99:tdigest_quantile(tdigest_combine(combined), 0.99),
  group_by(service_name)
Output: One row per service with percentiles.
Pattern 3: Single Percentile (Common for SLOs)
Get just p95 for SLO tracking:
align options(bins: 1), combined:tdigest_combine(m_tdigest("span_sn_service_node_duration_tdigest_5m"))
aggregate p95:tdigest_quantile(tdigest_combine(combined), 0.95), group_by(service_name)
| sort desc(p95)
| limit 10
Output: Top 10 services by p95 latency.
Use case: Identify slowest services for optimization.
Pattern 4: Converting Units
TDigest values are often in nanoseconds - convert for readability:
align options(bins: 1), combined:tdigest_combine(m_tdigest("span_sn_service_node_duration_tdigest_5m"))
aggregate p50_ns:tdigest_quantile(tdigest_combine(combined), 0.50),
  p95_ns:tdigest_quantile(tdigest_combine(combined), 0.95),
  p99_ns:tdigest_quantile(tdigest_combine(combined), 0.99),
  group_by(service_name)
| make_col p50_ms:p50_ns / 1000000,
  p95_ms:p95_ns / 1000000,
  p99_ms:p99_ns / 1000000
Output: Percentiles in both nanoseconds and milliseconds.
Note: Check sample values in discover_context() to identify units.
Pattern 5: Time-Series Percentiles
Track percentiles over time buckets:
align 5m, combined:tdigest_combine(m_tdigest("span_sn_service_node_duration_tdigest_5m"))
| aggregate p95:tdigest_quantile(tdigest_combine(combined), 0.95), group_by(service_name)
Output: Multiple rows per service (one per 5-minute interval).
Note: Pipe | required for time-series pattern.
Use case: Dashboard charts showing latency trends over time.
Common Use Cases
SLO Tracking: p95 Latency Under Threshold
align options(bins: 1), combined:tdigest_combine(m_tdigest("span_sn_service_node_duration_tdigest_5m"))
aggregate p95_ns:tdigest_quantile(tdigest_combine(combined), 0.95), group_by(service_name)
| make_col p95_ms:p95_ns / 1000000
| make_col slo_target:500, meets_slo:if(p95_ms < 500, "yes", "no")
| sort desc(p95_ms)
Use case: Check which services meet p95 < 500ms SLO target.
Output: Services with SLO compliance status.
Latency Distribution Analysis
align options(bins: 1), combined:tdigest_combine(m_tdigest("span_sn_service_node_duration_tdigest_5m"))
aggregate p50:tdigest_quantile(tdigest_combine(combined), 0.50),
  p75:tdigest_quantile(tdigest_combine(combined), 0.75),
  p90:tdigest_quantile(tdigest_combine(combined), 0.90),
  p95:tdigest_quantile(tdigest_combine(combined), 0.95),
  p99:tdigest_quantile(tdigest_combine(combined), 0.99),
  group_by(service_name)
| make_col p50_ms:p50 / 1000000,
  p95_ms:p95 / 1000000,
  p99_ms:p99 / 1000000
Use case: Understand full latency distribution to identify outliers.
Insight: Large gap between p95 and p99 indicates inconsistent performance.
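One simple way to quantify that gap is the ratio of p99 to p95. The Python sketch below uses invented numbers. The `tail_ratio` helper is hypothetical, and what counts as a "large" ratio is left to the reader, since the source gives no threshold.

```python
def tail_ratio(p95, p99):
    """Return p99 / p95; values near 1.0 mean the tail tracks the bulk of requests."""
    if p95 <= 0:
        raise ValueError("p95 must be positive")
    return p99 / p95


# Hypothetical services with the same p95 but very different tails (milliseconds).
steady = tail_ratio(p95=200, p99=240)
spiky = tail_ratio(p95=200, p99=2000)

print(f"steady service: {steady:.1f}x, spiky service: {spiky:.1f}x")
```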
Comparing Services by Latency
align options(bins: 1), combined:tdigest_combine(m_tdigest("span_sn_service_node_duration_tdigest_5m"))
aggregate p95:tdigest_quantile(tdigest_combine(combined), 0.95), group_by(service_name)
| make_col p95_ms:p95 / 1000000
| sort desc(p95_ms)
| limit 10
Use case: Find slowest services to prioritize optimization efforts.
Time-Series for Incident Investigation
align 5m, combined:tdigest_combine(m_tdigest("span_sn_service_node_duration_tdigest_5m"))
| aggregate p95:tdigest_quantile(tdigest_combine(combined), 0.95), group_by(service_name)
| filter service_name = "frontend"
| make_col p95_ms:p95 / 1000000
Use case: See when latency spiked during an incident.
Output: Timeline of p95 latency for specific service.
Multi-Dimension Grouping
align options(bins: 1), combined:tdigest_combine(m_tdigest("span_sn_service_node_duration_tdigest_5m"))
aggregate p95:tdigest_quantile(tdigest_combine(combined), 0.95), group_by(service_name, environment)
| make_col p95_ms:p95 / 1000000
| sort desc(p95_ms)
Use case: Compare latency across services AND environments.
Complete Example
Scenario: You're tracking SLOs for your microservices. The target is p95 latency < 500ms and p99 latency < 1000ms for all production services.
Step 1: Discover tdigest metrics
discover_context("duration tdigest", result_type="metric")
Found: span_sn_service_node_duration_tdigest_5m (type: tdigest)
Step 2: Get metric details
discover_context(metric_name="span_sn_service_node_duration_tdigest_5m")
Available dimensions: service_name, environment, for_service_name
Step 3: Query for SLO compliance
align options(bins: 1), combined:tdigest_combine(m_tdigest("span_sn_service_node_duration_tdigest_5m"))
aggregate p95_ns:tdigest_quantile(tdigest_combine(combined), 0.95),
  p99_ns:tdigest_quantile(tdigest_combine(combined), 0.99),
  group_by(service_name, environment)
| make_col p95_ms:p95_ns / 1000000, p99_ms:p99_ns / 1000000
| make_col p95_slo:if(p95_ms < 500, "✓", "✗"), p99_slo:if(p99_ms < 1000, "✓", "✗")
| filter environment = "production"
| sort desc(p95_ms)
Step 4: Interpret results
| service_name | environment | p95_ms | p99_ms | p95_slo | p99_slo |
|---|---|---|---|---|---|
| frontend | production | 19373.5 | 5641328.2 | ✗ | ✗ |
| featureflagservice | production | 5838.8 | 7473.9 | ✗ | ✗ |
| cartservice | production | 4136.6 | 5898.3 | ✗ | ✗ |
| productcatalogservice | production | 257.0 | 313.1 | ✓ | ✓ |
| currencyservice | production | 54.1 | 125.1 | ✓ | ✓ |
Insight: Frontend, featureflagservice, and cartservice are violating SLOs - need optimization.
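The compliance logic from Step 3 can be replayed outside OPAL as a sanity check. This Python sketch copies the rows from the results table above and applies the same two thresholds. The variable and function names are illustrative only.

```python
P95_TARGET_MS = 500
P99_TARGET_MS = 1000

# Rows copied from the results table: (service_name, p95_ms, p99_ms).
rows = [
    ("frontend", 19373.5, 5641328.2),
    ("featureflagservice", 5838.8, 7473.9),
    ("cartservice", 4136.6, 5898.3),
    ("productcatalogservice", 257.0, 313.1),
    ("currencyservice", 54.1, 125.1),
]


def violates_slo(p95_ms, p99_ms):
    """A service violates the SLO if either percentile misses its target."""
    return p95_ms >= P95_TARGET_MS or p99_ms >= P99_TARGET_MS


violators = [name for name, p95_ms, p99_ms in rows if violates_slo(p95_ms, p99_ms)]
print(violators)
```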
Step 5: Investigate frontend latency over time
align 1h, combined:tdigest_combine(m_tdigest("span_sn_service_node_duration_tdigest_5m"))
| aggregate p95:tdigest_quantile(tdigest_combine(combined), 0.95),
  p99:tdigest_quantile(tdigest_combine(combined), 0.99),
  group_by(service_name)
| filter service_name = "frontend"
| make_col p95_ms:p95 / 1000000, p99_ms:p99 / 1000000
Output: Hourly p95/p99 trends to identify when latency degraded.
Common Pitfalls
Pitfall 1: Forgetting Second Combine
❌ Wrong (most common mistake):
align options(bins: 1), combined:tdigest_combine(m_tdigest("metric"))
aggregate p95:tdigest_quantile(combined, 0.95)
✅ Correct:
align options(bins: 1), combined:tdigest_combine(m_tdigest("metric"))
aggregate p95:tdigest_quantile(tdigest_combine(combined), 0.95)
Why: TDigest requires combining twice - once in align, once in aggregate.
Error message: "the field has to be aggregated or grouped"
Pitfall 2: Using m() Instead of m_tdigest()
❌ Wrong:
align options(bins: 1), combined:tdigest_combine(m("duration_tdigest_5m"))
✅ Correct:
align options(bins: 1), combined:tdigest_combine(m_tdigest("duration_tdigest_5m"))
Why: Tdigest metrics require the m_tdigest() function, not m().
Check: Look for Type: tdigest in discover_context() output.
Pitfall 3: Wrong Pipe Usage (Same as Gauge)
❌ Wrong (pipe with bins:1):
align options(bins: 1), combined:tdigest_combine(m_tdigest("metric"))
| aggregate p95:tdigest_quantile(tdigest_combine(combined), 0.95)
✅ Correct:
# Summary - NO pipe
align options(bins: 1), combined:tdigest_combine(m_tdigest("metric"))
aggregate p95:tdigest_quantile(tdigest_combine(combined), 0.95)

# Time-series - YES pipe
align 5m, combined:tdigest_combine(m_tdigest("metric"))
| aggregate p95:tdigest_quantile(tdigest_combine(combined), 0.95)
Pitfall 4: Percentile Value Out of Range
❌ Wrong:
aggregate p95:tdigest_quantile(tdigest_combine(combined), 95)
✅ Correct:
aggregate p95:tdigest_quantile(tdigest_combine(combined), 0.95)
Why: Quantile values must be 0.0 to 1.0 (not 1 to 100).
Pitfall 5: Not Converting Units
❌ Wrong (values in nanoseconds, hard to read):
aggregate p95:tdigest_quantile(tdigest_combine(combined), 0.95)
Result: p95 = 14675991.25 (what unit is this?)
✅ Correct (convert to milliseconds):
aggregate p95_ns:tdigest_quantile(tdigest_combine(combined), 0.95)
| make_col p95_ms:p95_ns / 1000000
Result: p95_ms = 14.68 (clearly milliseconds)
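Tip: Check sample values in discovery to identify units. The rule of thumb "19-digit numbers = nanoseconds" describes epoch timestamps; duration values are much smaller, so judge them against the latency you expect.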
Percentile Reference
Common percentiles and their meanings:
| Percentile | Decimal | Meaning | Common Use |
|---|---|---|---|
| p50 | 0.50 | Median (middle value) | Typical user experience |
| p75 | 0.75 | 75th percentile | Better than average case |
| p90 | 0.90 | 90th percentile | Catching most outliers |
| p95 | 0.95 | 95th percentile | Standard SLO target |
| p99 | 0.99 | 99th percentile | Tail latency / worst 1% |
| p99.9 | 0.999 | 99.9th percentile | Extreme outliers |
SLO best practice: Track p95 and p99, not just averages.
Unit Conversion Reference
Common time unit conversions (assuming nanoseconds):
# Nanoseconds to milliseconds (most common)
make_col value_ms:value_ns / 1000000

# Nanoseconds to seconds
make_col value_sec:value_ns / 1000000000

# Nanoseconds to microseconds
make_col value_us:value_ns / 1000
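How to identify units: Check sample values in discover_context(). For epoch-style timestamp values, the digit count is a quick guide:
- 19 digits (1760201545280843522) = nanoseconds
- 13 digits (1758543367916) = milliseconds
- 10 digits (1758543367) = seconds
Duration values are far smaller than timestamps, so judge them by plausibility instead: a p95 of 14675991.25 is a believable 14.68 ms if the unit is nanoseconds.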
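The same conversions and the digit-count guide can be written as two small Python helpers for checking values outside OPAL. This is a sketch: both function names are hypothetical, and the digit-count guess is only meaningful for timestamp-sized values like the ones listed above.

```python
NS_PER_MS = 1_000_000


def ns_to_ms(value_ns):
    """Convert a nanosecond value to milliseconds."""
    return value_ns / NS_PER_MS


def guess_unit_by_digits(sample):
    """Guess the unit of a timestamp-sized sample from its digit count."""
    digits = len(str(abs(int(sample))))
    return {19: "nanoseconds", 13: "milliseconds", 10: "seconds"}.get(digits, "unknown")


print(round(ns_to_ms(14675991.25), 2))
print(guess_unit_by_digits(1760201545280843522))
print(guess_unit_by_digits(1758543367916))
print(guess_unit_by_digits(1758543367))
```

Any other digit count returns "unknown" rather than a guess, which is the safer default for duration values.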
Best Practices
- Always use double-combine pattern - most critical rule for tdigest
- Verify metric type - must be tdigest (not gauge)
- Check units - convert nanoseconds to milliseconds for readability
- Use multiple percentiles - p50, p95, p99 show full distribution
- Calculate SLO compliance - add derived columns comparing to targets
- Sort and limit - focus on worst offenders with sort desc() | limit 10
- Use time-series for investigation - see when latency changed
- Group by relevant dimensions - service, environment, endpoint, etc.
Related Skills
- aggregating-gauge-metrics - For count/sum/avg metrics (NOT percentiles)
- working-with-intervals - For calculating percentiles from raw interval data (slower)
- time-series-analysis - For event/interval trending with timechart
Summary
TDigest metrics enable efficient percentile calculations:
- Core pattern: align + m_tdigest() + double tdigest_combine + tdigest_quantile
- Critical rule: Use tdigest_combine() TWICE (in align AND in aggregate)
- Metric function: m_tdigest() (NOT m())
- Percentile values: 0.0 to 1.0 (0.95 = p95)
- Common percentiles: p50 (median), p95 (SLO), p99 (tail latency)
- Units: Often nanoseconds - convert to milliseconds for readability
Key distinction: TDigest metrics use a special double-combine pattern, while gauge metrics use simple m() + aggregate.
Last Updated: November 14, 2025
Version: 1.0
Tested With: Observe OPAL (ServiceExplorer/Service Inspector Metrics)