Agent-skills observability-service-health

Install
source · Clone the upstream repo
git clone https://github.com/elastic/agent-skills
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/elastic/agent-skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/observability/service-health" ~/.claude/skills/elastic-agent-skills-observability-service-health && rm -rf "$T"
manifest: skills/observability/service-health/SKILL.md
Source content

APM Service Health

Assess APM service health using Observability APIs, ES|QL against APM indices, Elasticsearch APIs, and (for correlation and APM-specific logic) the Kibana repo. Use SLOs, firing alerts, ML anomalies, throughput, latency (avg/p95/p99), error rate, and dependency health.

Where to look

  • Observability APIs: Use the SLOs API (Stack | Serverless) to get SLO definitions, status, burn rate, and error budget. Use the Alerting API (Stack | Serverless) to list and manage alerting rules and their alerts for the service. Use the APM annotations API to create or search annotations when needed. A sketch of an SLOs API call follows this list.
  • ES|QL and Elasticsearch: Query traces*apm*,traces*otel* and metrics*apm*,metrics*otel* with ES|QL (see Using ES|QL for APM metrics) for throughput, latency, error rate, and dependency-style aggregations. Use Elasticsearch APIs (e.g. POST _query for ES|QL, or Query DSL) as documented in the Elasticsearch repo for indices and search.
  • APM Correlations: Run the apm-correlations script to get attributes that correlate with high-latency or failed transactions for a given service. It tries the Kibana internal APM correlations API first, then falls back to Elasticsearch significant_terms on traces*apm*,traces*otel*. See APM Correlations script.
  • Infrastructure: Correlate via resource attributes (e.g. k8s.pod.name, container.id, host.name) in traces; query infrastructure or metrics indices with ES|QL/Elasticsearch for CPU and memory. OOM and CPU throttling directly impact APM health.
  • Logs: Use ES|QL or Elasticsearch search on log indices filtered by service.name or trace.id to explain behavior and root cause.
  • Observability Labs: Browse Observability Labs (and its APM tag) for patterns and troubleshooting guidance.
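
A minimal sketch of an SLOs API call (the /api/observability/slos find endpoint, kqlQuery filter, and ApiKey auth are assumptions; adjust path, filter, and auth to your deployment):

# Hedged sketch: list SLO definitions and status from Kibana.
curl -s -H "Authorization: ApiKey $KIBANA_API_KEY" \
  "$KIBANA_URL/api/observability/slos?kqlQuery=slo.name:*my-service*"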

Health criteria

Synthesize health from all of the following when available:

Signal             What to check
SLOs               Burn rate, status (healthy/degrading/violated), error budget.
Firing alerts      Open or recently fired alerts for the service or dependencies.
ML anomalies       Anomaly jobs; score and severity for latency, throughput, or error rate.
Throughput         Request rate; compare to baseline or previous period.
Latency            Avg, p95, p99; compare to SLO targets or history.
Error rate         Failed/total requests; spikes or sustained elevation.
Dependency health  Downstream latency, error rate, availability (ES|QL, APIs, Kibana repo).
Infrastructure     CPU usage, memory; OOM and CPU throttling on pods/containers/hosts.
Logs               App logs filtered by service or trace ID for context and root cause.

Treat a service as unhealthy if SLOs are violated, critical alerts are firing, or ML anomalies indicate severe degradation. Correlate with infrastructure (OOM, CPU throttling), dependencies, and logs (service/trace context) to explain why and suggest next steps.

Using ES|QL for APM metrics

When querying APM data from Elasticsearch (traces*apm*,traces*otel*, metrics*apm*,metrics*otel*), use ES|QL by default where available.

  • Availability: ES|QL is available in Elasticsearch 8.11+ (technical preview; GA in 8.14). It is always available in the Elastic Observability Serverless Complete tier.
  • Scoping to a service: Always filter by service.name (and service.environment when relevant). Combine with a time range on @timestamp:
WHERE service.name == "my-service-name" AND service.environment == "production"
  AND @timestamp >= "2025-03-01T00:00:00Z" AND @timestamp <= "2025-03-01T23:59:59Z"
  • Example patterns: Throughput, latency, and error rate over time: see Kibana trace_charts_definition.ts (getThroughputChart, getLatencyChart, getErrorRateChart). Use FROM <index> | WHERE ... | STATS ... / EVAL ... with BUCKET(@timestamp, ...) and WHERE service.name == "<service_name>". A latency sketch follows this list.
  • Performance: Add LIMIT n to cap rows and token usage. Prefer coarser BUCKET(@timestamp, ...) (e.g. 1 hour) when only trends are needed; finer buckets increase work and result size.
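
A minimal latency sketch following those patterns (assumes the classic APM duration field transaction.duration.us; verify exact field names in trace_charts_definition.ts for your data):

// Latency avg/p95/p99 per hour for one service (durations stored in microseconds)
FROM traces*apm*,traces*otel*
| WHERE service.name == "my-service-name"
  AND @timestamp >= "2025-03-01T00:00:00Z" AND @timestamp <= "2025-03-01T23:59:59Z"
| STATS avg_ms = AVG(transaction.duration.us) / 1000,
        p95_ms = PERCENTILE(transaction.duration.us, 95) / 1000,
        p99_ms = PERCENTILE(transaction.duration.us, 99) / 1000
  BY bucket = BUCKET(@timestamp, 1 hour)
| SORT bucket
| LIMIT 100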

APM Correlations script

When only a subpopulation of transactions has high latency or failures, run the apm-correlations script to list attributes that correlate with those transactions (e.g. host, service version, pod, region). The script tries the Kibana internal APM correlations API first; if unavailable (e.g. 404), it falls back to Elasticsearch significant_terms on traces*apm*,traces*otel*.

# Latency correlations (attributes over-represented in slow transactions)
node skills/observability/service-health/scripts/apm-correlations.js latency-correlations --service-name <name> [--start <iso>] [--end <iso>] [--last-minutes 60] [--transaction-type <t>] [--transaction-name <n>] [--space <id>] [--json]

# Failed transaction correlations
node skills/observability/service-health/scripts/apm-correlations.js failed-correlations --service-name <name> [--start <iso>] [--end <iso>] [--last-minutes 60] [--transaction-type <t>] [--transaction-name <n>] [--space <id>] [--json]

# Test Kibana connection
node skills/observability/service-health/scripts/apm-correlations.js test [--space <id>]

Environment: KIBANA_URL and KIBANA_API_KEY (or KIBANA_USERNAME/KIBANA_PASSWORD) for Kibana; for the fallback, ELASTICSEARCH_URL and ELASTICSEARCH_API_KEY. Use the same time range as the investigation.
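
For example (hypothetical Kibana URL and service name):

# Hedged sketch: correlate slow transactions for one service over the last hour
export KIBANA_URL="https://my-kibana.example.com:5601"
export KIBANA_API_KEY="<api-key>"
node skills/observability/service-health/scripts/apm-correlations.js latency-correlations \
  --service-name api-gateway --last-minutes 60 --json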

Workflow

Service health progress:
- [ ] Step 1: Identify the service (and time range)
- [ ] Step 2: Check SLOs and firing alerts
- [ ] Step 3: Check ML anomalies (if configured)
- [ ] Step 4: Review throughput, latency (avg/p95/p99), error rate
- [ ] Step 5: Assess dependency health (ES|QL/APIs / Kibana repo)
- [ ] Step 6: Correlate with infrastructure and logs
- [ ] Step 7: Summarize health and recommend actions

Step 1: Identify the service

Confirm the service name and time range. Resolve the service from the request; if multiple are in scope, target the most relevant. Use ES|QL on traces*apm*,traces*otel* or metrics*apm*,metrics*otel* (e.g. WHERE service.name == "<name>") or Kibana repo APM routes to obtain service-level data. If the user has not provided a time range, assume the last hour.
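
If the service name needs confirming first, a minimal ES|QL sketch to list services seen recently:

// Services observed in traces over the last hour, busiest first
FROM traces*apm*,traces*otel*
| WHERE @timestamp >= NOW() - 1 hour
| STATS doc_count = COUNT(*) BY service.name
| SORT doc_count DESC
| LIMIT 20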

Step 2: Check SLOs and firing alerts

SLOs: Call the SLOs API to get SLO definitions and status for the service (latency, availability): healthy/degrading/violated state, burn rate, and error budget. Alerts: For active APM alerts, call /api/alerting/rules/_find?search=apm&search_fields=tags&per_page=100&filter=alert.attributes.executionStatus.status:active. When checking one service, include both rules where params.serviceName matches the service and rules where params.serviceName is absent (all-services rules). Do not query .alerts* indices for active-state checks. Correlate with SLO violations or metric changes.
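
A minimal sketch of that Alerting API call (the path is as given above; ApiKey auth is an assumption):

# Hedged sketch: list active APM alerting rules via the Kibana Alerting API
curl -s -H "Authorization: ApiKey $KIBANA_API_KEY" \
  "$KIBANA_URL/api/alerting/rules/_find?search=apm&search_fields=tags&per_page=100&filter=alert.attributes.executionStatus.status:active"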

Step 3: Check ML anomalies

If ML anomaly detection is used, query ML job results or anomaly records (via Elasticsearch ML APIs or indices) for the service and time range. Note high-severity anomalies (latency, throughput, error rate); use anomaly time windows to narrow Steps 4–5.
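
A hedged ES|QL sketch against the default ML results indices (the .ml-anomalies-* pattern and fields such as record_score and partition_field_value are assumptions that depend on job configuration):

// High-severity anomaly records from the default ML results indices
// (note: ML results use "timestamp", not "@timestamp")
FROM .ml-anomalies-*
| WHERE result_type == "record" AND record_score >= 75
  AND timestamp >= NOW() - 24 hours
| KEEP timestamp, job_id, record_score, partition_field_value
| SORT record_score DESC
| LIMIT 50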

Step 4: Review throughput, latency, and error rate

Use ES|QL against traces*apm*,traces*otel* or metrics*apm*,metrics*otel* for the service and time range to get throughput (e.g. req/min), latency (avg, p95, p99), and error rate (failed/total or 5xx/total). Example: FROM traces*apm*,traces*otel* | WHERE service.name == "<service_name>" AND @timestamp >= ... AND @timestamp <= ... | STATS ... Compare to the prior period or SLO targets. See Using ES|QL for APM metrics.

Step 5: Assess dependency health

Obtain dependency and service-map data via ES|QL on traces*apm*,traces*otel* / metrics*apm*,metrics*otel* (e.g. downstream service/span aggregations) or via APM route handlers in the Kibana repo that expose dependency/service-map data. For the service and time range, note downstream latency and error rate; flag slow or failing dependencies as likely causes.
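
A minimal sketch over exit spans (assumes the classic APM fields span.destination.service.resource and span.duration.us; OTel mappings may differ):

// Downstream dependency volume, error rate, and latency from exit spans
FROM traces*apm*,traces*otel*
| WHERE service.name == "my-service-name"
  AND span.destination.service.resource IS NOT NULL
  AND @timestamp >= NOW() - 1 hour
| STATS calls = COUNT(*),
        failures = COUNT(*) WHERE event.outcome == "failure",
        avg_ms = AVG(span.duration.us) / 1000
  BY span.destination.service.resource
| EVAL error_rate = failures::DOUBLE / calls
| SORT error_rate DESC
| LIMIT 20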

Step 6: Correlate with infrastructure and logs

  • APM Correlations (when only a subpopulation is affected): Run node skills/observability/service-health/scripts/apm-correlations.js latency-correlations|failed-correlations --service-name <name> [--start ...] [--end ...] to get correlated attributes. Filter by those attributes and fetch trace samples or errors to confirm root cause. See APM Correlations script.
  • Infrastructure: Use resource attributes from traces (e.g. k8s.pod.name, container.id, host.name) and query infrastructure/metrics indices with ES|QL or Elasticsearch for CPU and memory. OOM and CPU throttling directly impact APM health; correlate their time windows with APM degradation.
  • Logs: Use ES|QL or Elasticsearch on log indices with service.name == "<service_name>" or trace.id == "<trace_id>" to explain behavior and root cause (exceptions, timeouts, restarts).

Step 7: Summarize and recommend

State health (healthy / degraded / unhealthy) with reasons; list concrete next steps.

Examples

Example: ES|QL for a specific service

Scope with WHERE service.name == "<service_name>" and a time range. Throughput and error rate (1-hour buckets; LIMIT caps rows and tokens):

// Hourly throughput and error rate; name the bucket so it can be sorted,
// and cast to DOUBLE to avoid integer division
FROM traces*apm*,traces*otel*
| WHERE service.name == "api-gateway"
  AND @timestamp >= "2025-03-01T00:00:00Z" AND @timestamp <= "2025-03-01T23:59:59Z"
| STATS request_count = COUNT(*), failures = COUNT(*) WHERE event.outcome == "failure" BY bucket = BUCKET(@timestamp, 1 hour)
| EVAL error_rate = failures::DOUBLE / request_count
| SORT bucket
| LIMIT 500

Latency percentiles and exact field names: see Kibana trace_charts_definition.ts.

Example: "Is service X healthy?"

  1. Resolve service X and the time range. Call the SLOs API and Alerting API; run ES|QL on traces*apm*,traces*otel* / metrics*apm*,metrics*otel* for throughput, latency, and error rate; query dependency/service-map data (ES|QL or Kibana repo).
  2. Evaluate SLO status (violated/degrading?), firing rules, ML anomalies, and dependency health.
  3. Answer: Healthy / Degraded / Unhealthy with reasons and next steps (e.g. Observability Labs).

Example: "Why is service Y slow?"

  1. Resolve service Y and the slowness time range. Call the SLOs API and Alerting API; run ES|QL for Y and its dependencies; query ML anomaly results.
  2. Compare latency (avg/p95/p99) to the prior period via ES|QL; identify high-latency or failing dependencies from the dependency data.
  3. Summarize (e.g. p99 up; dependency Z elevated) and recommend next steps (investigate Z; see Observability Labs for latency patterns).

Example: Correlate service to infrastructure (OpenTelemetry)

Use resource attributes on spans/traces to identify the runtime resources (pods, containers, hosts) backing the service. Then check CPU and memory for those resources in the same time window as the APM issue:

  • From the service’s traces or metrics, read resource attributes such as k8s.pod.name, k8s.namespace.name, container.id, or host.name.
  • Run ES|QL or Elasticsearch search on infrastructure/metrics indices filtered by those resource values and the incident time range, as sketched below. Check CPU usage and memory consumption (e.g. system.cpu.total.norm.pct); look for OOMKilled events, CPU throttling, or sustained high CPU/memory that align with APM latency or error spikes.
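
A hedged sketch (assumes a metrics-* index pattern and the k8s.pod.name / system.cpu.total.norm.pct fields; adjust to your collector's schema):

// Normalized CPU per pod over the incident window, 5-minute buckets
FROM metrics-*
| WHERE k8s.pod.name == "my-pod-0"
  AND @timestamp >= "2025-03-01T00:00:00Z" AND @timestamp <= "2025-03-01T23:59:59Z"
| STATS max_cpu = MAX(system.cpu.total.norm.pct) BY bucket = BUCKET(@timestamp, 5 minutes)
| SORT bucket
| LIMIT 300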

Example: Filter logs by service or trace ID

To understand behavior for a specific service or a single trace, filter logs accordingly:

  • By service: Run ES|QL or Elasticsearch search on log indices with service.name == "<service_name>" and a time range to get application logs (errors, warnings, restarts) in the service context (a sketch follows this list).
  • By trace ID: When investigating a specific request, take the trace.id from the APM trace and filter logs by trace.id == "<trace_id>" (or the equivalent field in your log schema). Logs with that trace ID show the full request path and help explain failures or latency.
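
A minimal sketch for the by-service case (assumes a logs-* index pattern and the ECS log.level field; swap in your own log schema):

// Recent error-level logs for one service
FROM logs-*
| WHERE service.name == "my-service-name" AND log.level IN ("error", "critical")
  AND @timestamp >= NOW() - 1 hour
| KEEP @timestamp, log.level, message, trace.id
| SORT @timestamp DESC
| LIMIT 50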

Guidelines

  • Use Observability APIs (SLOs API, Alerting API) and ES|QL on traces*apm*,traces*otel* / metrics*apm*,metrics*otel* (8.11+ or Serverless), filtering by service.name (and service.environment when relevant). For active APM alerts, call /api/alerting/rules/_find?search=apm&search_fields=tags&per_page=100&filter=alert.attributes.executionStatus.status:active. When checking one service, evaluate both rule types: rules where params.serviceName matches the target service, and rules where params.serviceName is absent (all-services rules). Treat either as applicable to the service before declaring health. Do not query .alerts* indices when determining currently active alerts; use the Alerting API response above as the source of truth. For APM correlations, run the apm-correlations script (see APM Correlations script); for dependency/service-map data, use ES|QL or Kibana repo route handlers. For Elasticsearch index and search behavior, see the Elasticsearch APIs in the Elasticsearch repo.
  • Always use the user's time range; avoid assuming "last 1 hour" if the issue is historical.
  • When SLOs exist, anchor the health summary to SLO status and burn rate; when they do not, rely on alerts, anomalies, throughput, latency, error rate, and dependencies.
  • When analyzing only application metrics ingested via OpenTelemetry, use the ES|QL TS (time series) command for efficient metrics queries. The TS command is available in Elasticsearch 9.3+ and is always available in Elastic Observability Serverless.
  • Summary: one short health verdict plus bullet points for evidence and next steps.