GB-Power-Market-JJ grafana-lens
Grafana tools for data visualization, monitoring, alerting, security, and SRE investigation. Use grafana_query, grafana_query_logs, grafana_query_traces, grafana_create_dashboard, grafana_update_dashboard, grafana_create_alert, grafana_share_dashboard, grafana_annotate, grafana_explore_datasources, grafana_list_metrics, grafana_search, grafana_get_dashboard, grafana_check_alerts, grafana_push_metrics, grafana_explain_metric, grafana_security_check, and grafana_investigate. Trigger when asked about metrics, dashboards, monitoring, alerts, costs, token usage, data visualization, PromQL, Prometheus, LogQL, Loki, log queries, error logs, log search, TraceQL, Tempo, traces, distributed tracing, span search, find slow traces, debug session traces, annotations, deployments, sharing charts, investigating alert notifications, pushing custom data (calendar, git, fitness, finance) to Grafana for visualization, pushing historical data, backfilling metrics, recording past data with timestamps, modifying dashboards, adding panels, removing panels, changing dashboard settings, updating dashboard time range, explain metric, metric trend, what is this metric, how has this changed, is this metric normal, why did my bill spike, cost visibility, security monitoring, security check, security audit, am I being attacked, is my agent compromised, suspicious activity, threat detection, prompt injection detection, set up security alerts, investigate, debug, triage, root cause, what's wrong, why is X broken, anomaly detection, RED method, USE method, alert fatigue, postmortem, incident summary.
git clone https://github.com/GeorgeDoors888/GB-Power-Market-JJ
T=$(mktemp -d) && git clone --depth=1 https://github.com/GeorgeDoors888/GB-Power-Market-JJ "$T" && mkdir -p ~/.claude/skills && cp -r "$T/openclaw-skills/skills/awsome-o/grafana-lens/skills" ~/.claude/skills/georgedoors888-gb-power-market-jj-grafana-lens-bc05ba && rm -rf "$T"
T=$(mktemp -d) && git clone --depth=1 https://github.com/GeorgeDoors888/GB-Power-Market-JJ "$T" && mkdir -p ~/.openclaw/skills && cp -r "$T/openclaw-skills/skills/awsome-o/grafana-lens/skills" ~/.openclaw/skills/georgedoors888-gb-power-market-jj-grafana-lens-bc05ba && rm -rf "$T"
openclaw-skills/skills/awsome-o/grafana-lens/skills/SKILL.md

Grafana Lens
You have full native Grafana access — query data, create dashboards, set alerts, receive alert notifications, annotate events, explore datasources, push custom data, and deliver visualizations inline. Works with ANY data in Grafana, not just agent metrics.
Musts
- Always call grafana_explore_datasources first when you need a datasource UID — never guess UIDs
- Always call grafana_search before creating a dashboard — avoid duplicates
- Always call grafana_get_dashboard before grafana_share_dashboard — you need exact panel IDs
- Always call grafana_get_dashboard before grafana_update_dashboard — you need panel IDs and current structure
- Prefer grafana_query for direct answers over creating dashboards — "what's my cost?" needs a number, not a URL
- Prefer grafana_query over grafana_create_dashboard + grafana_share_dashboard for simple data questions — a number is faster than a chart
- Use grafana_query_logs for log searches — LogQL for logs, PromQL for metrics, TraceQL for traces. Never use grafana_query for Loki datasources
- Use grafana_query_traces for trace searches — TraceQL for traces, PromQL for metrics, LogQL for logs. Never use grafana_query or grafana_query_logs for Tempo datasources
- All tools work with ANY Prometheus datasource — not just openclaw_lens_* metrics
- When you see "GRAFANA ALERTS" in prompt context, investigate immediately with grafana_check_alerts — use the suggestedInvestigation field to go directly to querying (it provides the tool, query, and datasource)
- Run grafana_check_alerts with action setup once before alert notifications can reach the agent — this creates the webhook contact point
- Push data before querying or dashboarding it — data is pushed via OTLP and available immediately
- Prefer grafana_explain_metric for "what is this metric?" questions over manual grafana_query — it returns current value, trend, stats, and metadata in one call
- Use queryNames from push response for PromQL queries — don't guess metric names (counters get _total suffix)
- Use openclaw_ext_ prefix for custom metrics — grafana_push_metrics auto-prepends it if missing
- Follow statistics-first discipline for log investigation — always run count/rate LogQL before reading individual entries. Use grafana_query_logs with metric-over-logs queries (count_over_time, rate, topk) before switching to raw log entries
- Silence alerts during investigation — use grafana_check_alerts with action silence to prevent repeat notifications while investigating
- Use list_rules for complete alert health — grafana_check_alerts with action list_rules returns all rules with live eval state (normal/firing/pending/nodata/error), health, and lastEvaluation — no need to cross-reference with the list action
- Use dashboardUid + panelId to re-run panel queries — don't manually extract PromQL/LogQL from get_dashboard output. Both grafana_query and grafana_query_logs accept these params to auto-resolve the panel's query expression and datasource. The tool handles template variable replacement and datasource routing automatically
- Confirm with user before deleting dashboards or alert rules — grafana_update_dashboard with operation delete and grafana_check_alerts with action delete_rule are permanent and cannot be undone
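The statistics-first log discipline can be sketched as a two-call sequence: a metric-over-logs query first, raw entries second. This is a minimal illustration, not a prescribed call format — the datasource UID loki1 and the job="api" label are hypothetical, and the tool/params wrapper is only notation for the two calls:

```json
[
  { "tool": "grafana_query_logs",
    "params": { "datasourceUid": "loki1",
                "expr": "sum by (level) (count_over_time({job=\"api\"}[1h]))",
                "queryType": "range", "start": "now-6h" } },
  { "tool": "grafana_query_logs",
    "params": { "datasourceUid": "loki1",
                "expr": "{job=\"api\", level=\"error\"} |= \"timeout\"",
                "limit": 50 } }
]
```

The first call reveals which streams and time windows carry the errors; the second reads only the narrowed entries.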
Quick Decision Tree
- "What is [metric]?" / "Why did it spike?" → grafana_explain_metric
- "What's the current value of X?" / complex PromQL → grafana_query
- "Find error logs" / "Search logs for..." → grafana_query_logs
- "Find slow traces" / "Show trace for session X" / "Debug distributed spans" → grafana_query_traces
- "Debug this session" / "Why did it fail?" / "What went wrong?" → grafana_query_traces (search error/slow) → grafana_query_traces (get → follow correlationHint) → grafana_query_logs → grafana_query → grafana_annotate
- "Show me a chart" / "Visualize..." → grafana_search → grafana_get_dashboard → grafana_share_dashboard
- "Create a dashboard for..." → grafana_search (check duplicates) → grafana_create_dashboard
- "Add a panel to my dashboard" → grafana_get_dashboard → grafana_update_dashboard
- "Delete this dashboard" → grafana_update_dashboard with operation delete (confirm with user first)
- "Alert me when..." → grafana_check_alerts (setup) → grafana_create_alert
- "List my alert rules" / "What alerts do I have?" → grafana_check_alerts with action list_rules
- "Delete alert rule X" → grafana_check_alerts with action list_rules → delete_rule with ruleUid
- "Track my [custom data]" / "Record my [past data]" → grafana_push_metrics (with optional timestamp for historical data, auto-registers, returns queryNames) → grafana_query with queryNames
- "What data sources do I have?" → grafana_explore_datasources
- "What metrics are available?" → grafana_list_metrics
- "Set up monitoring" / "Monitor my agent" / "What dashboards should I have?" → grafana_search (check existing) → grafana_create_dashboard with llm-command-center → follow suggestedNext chain through remaining templates
- "GenAI observability" / "OTel gen_ai metrics" / "Standard AI monitoring" → grafana_create_dashboard with genai-observability template
- "What happened in session X?" / "Debug this session" → grafana_create_dashboard with session-explorer template → paste session ID
- "Show me LLM traces" / "Show agent logs" → grafana_create_dashboard with llm-command-center template (Loki + Tempo)
- "How much am I spending?" / "Cost analysis" → grafana_create_dashboard with cost-intelligence template
- "Which tools are slow?" / "Tool errors" → grafana_create_dashboard with tool-performance template
- "Queue health" / "Webhook issues" / "Stuck sessions" → grafana_create_dashboard with sre-operations template
- "System health check" / "Status report" / "Review all dashboards" → grafana_explore_datasources → grafana_check_alerts (list + list_rules) → grafana_search → grafana_get_dashboard (audit=true for each) → summarize
- "Audit my dashboard" / "Which panels are broken?" → grafana_get_dashboard (audit=true) → review auditSummary + per-panel health
- "Am I being attacked?" / "Security check" / "Security status" → grafana_security_check
- "Set up security monitoring" → grafana_check_alerts (setup) → grafana_create_dashboard (security-overview) → grafana_create_alert (webhook error burst, cost spike, tool loops, injection signals)
- "Investigate security alert" → grafana_security_check → grafana_query_logs (correlate) → grafana_annotate (mark investigation) → grafana_check_alerts (silence)
- "Investigate this alert" / "Why is X broken?" / "Debug this issue" / "Triage" / "Root cause" → grafana_investigate (multi-signal triage) → follow suggestedHypotheses.testWith for deep-dives
- "Is this metric normal?" / "Is there an anomaly?" → grafana_explain_metric (returns anomaly z-score + seasonality vs 1d/7d ago for 24h period)
- "RED analysis" / "What's the error rate?" / "Service health" → RED Method queries (see sre-investigation.md §2)
- "Alert fatigue" / "Which alerts are noisy?" / "Alert health" → grafana_check_alerts with action analyze — fatigue report
- "Postmortem" / "Incident summary" / "What happened?" → grafana_investigate → 5-Phase methodology → postmortem template (see sre-investigation.md §9)
- "Compare before/after deployment" → grafana_annotate (list, tags: ["deploy"]) → grafana_explain_metric (compareWith: "previous")
Working with Multiple Grafana Instances
When several Grafana environments are configured (dev, staging, prod), every tool accepts an optional instance parameter. grafana_explore_datasources returns availableInstances — use the name values from that list.
Why this matters: Users often need to query production metrics, create dashboards in dev, or compare environments side by side. Each tool call targets one instance.
Smart defaults: Omitting instance always targets the configured default — safe and invisible for single-environment setups. Only specify instance when the user explicitly names a non-default environment.
Cross-environment workflows: Each call is independent. Query prod, create dashboard in dev — just set instance differently on each call. No context switching needed.
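A sketch of such a cross-environment workflow. The instance names prod and dev, the datasource UID, and the metric are assumptions about the local setup, and the tool/params wrapper is only notation:

```json
[
  { "tool": "grafana_query",
    "params": { "instance": "prod", "datasourceUid": "prom1",
                "expr": "sum(rate(http_requests_total[5m]))" } },
  { "tool": "grafana_create_dashboard",
    "params": { "instance": "dev", "template": "metric-explorer",
                "title": "Prod Traffic Explorer" } }
]
```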
Tool Inventory
| Tool | What It Does |
|---|---|
| grafana_explore_datasources | Discover configured datasources (UIDs, types, query routing) — tells you which tool + query language to use for each datasource |
| grafana_list_metrics | Discover available metrics or label values from a datasource. Use with compact for minimal fields in multi-tool chains |
| grafana_query | Run PromQL instant/range queries — get numbers directly |
| grafana_query_logs | Run LogQL queries against Loki — search and filter logs |
| grafana_query_traces | Run TraceQL queries against Tempo — search traces or get full trace by ID |
| grafana_create_dashboard | Create dashboards from templates or custom JSON |
| grafana_update_dashboard | Add/remove/update panels, change dashboard metadata, or delete dashboard |
| grafana_get_dashboard | Get dashboard summary (panels, queries). Use compact for overview scans, audit to health-check all panels in one call |
| grafana_search | Search existing dashboards by title, tags, or starred status |
| grafana_share_dashboard | Render panel as image and deliver inline via messaging |
| grafana_create_alert | Create Grafana-native alert rules on any metric |
| grafana_annotate | Create or list annotations (events) on dashboards |
| grafana_check_alerts | Check, acknowledge, list/delete rules, silence/unsilence, or set up Grafana alert webhook notifications. Use with compact for minimal fields |
| grafana_push_metrics | Push custom data (calendar, git, fitness, finance) via OTLP |
| grafana_explain_metric | Get metric context: current value, trend, stats, metadata, drill-down queries — agent interprets |
| grafana_security_check | Run 6 parallel security checks and return threat-level assessment (green/yellow/red) — "Am I being attacked?" |
| grafana_investigate | Multi-signal investigation triage — gathers metrics, logs, traces, and context in parallel, generates hypotheses with specific tool+params for follow-up |
Tool Details
grafana_explore_datasources
When: First step when user mentions data, metrics, or monitoring. Gets datasource UIDs needed by grafana_query, grafana_query_logs, grafana_query_traces, grafana_list_metrics, grafana_create_alert, and grafana_explain_metric.
Params: instance (optional — target Grafana instance, omit for default).
Example: {}
Example (multi-instance): { "instance": "prod" }
Returns: List of datasources with uid, name, type, isDefault, plus routing hints: queryTool (which agent tool to use, e.g. "grafana_query", "grafana_query_logs", or "grafana_query_traces"), queryLanguage (e.g. "PromQL", "LogQL", "TraceQL"), and supported (boolean — whether an agent tool can query this datasource). Use queryTool to pick the right tool for each datasource. When multiple Grafana instances are configured, also returns instance (which instance was queried) and availableInstances (list of { name, url, isDefault } for all configured instances).
grafana_list_metrics
When: User asks "what metrics are available?" or you need to discover metrics before querying or composing dashboards. Also when grouping metrics by function — metadata mode adds category to each openclaw_* metric. Use purpose when user asks about a specific concern (e.g., "performance metrics", "cost metrics").
Params: datasourceUid (required), prefix (filter by prefix), search (targeted discovery — server-side regex, only matching metrics returned), purpose ("performance" | "cost" | "reliability" | "capacity" — pre-filter by intent, composable with prefix and search), label (list label values instead), metadata (boolean — enriched results with type/help/category), compact (boolean — with metadata, returns only name/type/category, ~60% smaller).
Example names: { "datasourceUid": "prom1", "prefix": "openclaw_lens_" }
Example search: { "datasourceUid": "prom1", "search": "steps" }
Example purpose: { "datasourceUid": "prom1", "purpose": "performance", "metadata": true }
Example combined: { "datasourceUid": "prom1", "prefix": "openclaw_ext_", "search": "fitness" }
Example metadata: { "datasourceUid": "prom1", "metadata": true, "prefix": "openclaw_" }
Example compact: { "datasourceUid": "prom1", "metadata": true, "compact": true }
Returns names: { metrics: ["metric1", "metric2", ...] }. Truncated at 200.
Returns metadata: { metadataSource, categorySummary: { cost: 3, usage: 4, session: 5, ... }, metrics: [{ name, type, help, category?, source? }, ...] }. Use this before composing custom dashboards — type tells you counter vs gauge vs histogram, category groups openclaw_* metrics by function. Search also matches help text. Categories: cost, usage, session, queue, messaging, webhook, tools, agent, custom. categorySummary gives counts per category for quick overview (omitted when no openclaw_* metrics). Purpose maps: performance → session + tools, cost → cost + usage, reliability → webhook + messaging + agent, capacity → queue + session. metadataSource: "prometheus" when Prometheus metadata endpoint has data, "synthetic" when OTLP-only (metadata synthesized from known metric registry — histogram sub-metrics deduplicated, type/help from Grafana Lens definitions). On OTLP stacks, includes hint explaining why metadata is synthetic. source: "synthetic" on individual entries from the registry; source: "custom" on entries from the custom metrics store.
Returns compact: { metadataSource, categorySummary: {...}, metrics: [{ name, type, category? }, ...] }. Same as metadata but drops help, source, labelNames — use in multi-tool chains where you need metric names and types but not full descriptions.
Example label: { "datasourceUid": "prom1", "label": "job" }
Returns label: { label, count, totalCount, values: ["value1", "value2", ...] }. Truncated at 200.
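A typical discover-then-query chain built from the examples above. The datasource UID prom1 is a placeholder, the second expr assumes the first call surfaced the documented cost counter, and the tool/params wrapper is only notation:

```json
[
  { "tool": "grafana_list_metrics",
    "params": { "datasourceUid": "prom1", "search": "cost",
                "metadata": true, "compact": true } },
  { "tool": "grafana_query",
    "params": { "datasourceUid": "prom1",
                "expr": "sum(increase(openclaw_lens_cost_by_model_total[1d])) or vector(0)" } }
]
```

The compact metadata from the first call tells you the metric's type (counter), which is why the second call wraps it in increase() rather than reading the raw value.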
grafana_query
When: User asks a data question that needs a direct answer, not a dashboard. Also for re-running an existing dashboard panel's query with different time ranges.
Params: datasourceUid, expr (PromQL), queryType (instant/range), start (range only, required), end (range only, default "now"), step (range only, optional — auto-calculated from time range if omitted, targeting ~300 datapoints), dashboardUid (optional — resolve query from panel), panelId (optional — use with dashboardUid).
Example instant: { "datasourceUid": "prom1", "expr": "sum(increase(openclaw_lens_cost_by_model_total[1d])) or vector(0)" }
Example range (auto-step): { "datasourceUid": "prom1", "expr": "rate(openclaw_tokens_total[5m])", "queryType": "range", "start": "now-30d" }
Example range (explicit step): { "datasourceUid": "prom1", "expr": "rate(openclaw_tokens_total[5m])", "queryType": "range", "start": "now-1h", "end": "now", "step": "60" }
Example panel re-run: { "dashboardUid": "openclaw-command-center", "panelId": 10, "queryType": "range", "start": "now-7d" }
Tip: start/end accept Unix seconds or relative expressions like "now-1h", "now-7d". For range queries, just set start — end defaults to "now" and step is auto-calculated. Override step only when you need specific resolution.
Tip (panel re-run): Set dashboardUid + panelId to re-run a panel's query without manually extracting PromQL. The tool auto-resolves expr and datasourceUid from the panel definition. Template variables are replaced with wildcards. You can still override expr or datasourceUid explicitly if needed. Get panel IDs from grafana_get_dashboard.
Returns instant: { metrics: [{ metric: {...}, value: "1.23", timestamp: "...", healthContext?: { status, thresholds, description, direction } }], datasourceUid, resultCount, warnings?, hint? } — healthContext is included for well-known openclaw_lens_* gauge metrics, providing SRE-grade health assessment: status ("healthy"/"warning"/"critical"), thresholds (warning/critical values), description (what the metric means), direction ("higher_is_worse"/"lower_is_worse"). Omitted for unknown metrics. Capped at 50 results; when exceeded includes truncated: true, totalResults, and truncationHint advising to narrow the query.
Returns range: { series: [{ metric: {...}, values: [{ time, value }...] }], datasourceUid, resultCount, warnings?, hint? } — truncated to 20 points per series and 50 series max. When series are truncated includes truncated: true, totalSeries, and truncationHint. When step is auto-calculated, includes step: { value: "288s", display: "5m", auto: true }.
Returns (panel re-run): Includes resolvedFrom: "panel", panelTitle, panelType, templateVarsReplaced alongside normal query results. If the panel uses a Loki datasource, returns an error directing you to use grafana_query_logs instead.
Returns (warnings): When Prometheus flags a non-fatal issue (e.g., rate() on a gauge), warnings: [{ cause, suggestion, example? }] is included. Example: rate() on a gauge → cause says "rate() applied to 'metric' which appears to be a gauge", suggestion says "use delta() or deriv() instead", example shows the corrected query.
Returns (hint): When the query returns zero results, hint: { cause, suggestion } explains why (metric may not exist, label filters may not match) and suggests using grafana_list_metrics to verify.
Returns (error with guidance): On query failure, includes guidance: { cause, suggestion, example? } alongside the raw error. Pattern-matched for common PromQL mistakes: unclosed parenthesis, missing range selector, timeout, auth failure, rate on gauge, etc. Omitted when the error is unrecognized.
Tip (chaining): Both instant and range responses include datasourceUid — pass it directly to grafana_create_alert or other tools without re-calling grafana_explore_datasources. This enables zero-friction query→alert chains.
grafana_query_logs
When: User asks about logs, errors, or needs to investigate issues by searching log data. Also for session debugging, OTel log investigation, and re-running existing log panel queries.
Params: datasourceUid, expr (LogQL), queryType (instant/range, default range), start/end (default now-1h/now), step (metric queries only), limit (default 100), direction (backward/forward), lineLimit (max chars per log line, default 500, max 2000), extractFields (boolean, default false — extract structured OTel attributes into a clean fields object), dashboardUid (optional — resolve query from panel), panelId (optional — use with dashboardUid).
Example log search: { "datasourceUid": "loki1", "expr": "{job=\"api\"} |= \"error\"" }
Example with filters: { "datasourceUid": "loki1", "expr": "{job=\"api\"} |~ \"timeout|refused\"", "limit": 50, "direction": "forward" }
Example full stack traces: { "datasourceUid": "loki1", "expr": "{job=\"api\"} |= \"Exception\"", "lineLimit": 2000 }
Example session debugging: { "datasourceUid": "loki1", "expr": "{service_name=\"openclaw\"} | json | component=\"lifecycle\"", "extractFields": true }
Example metric query: { "datasourceUid": "loki1", "expr": "rate({job=\"api\"}[5m])", "queryType": "range", "start": "now-6h", "end": "now", "step": "60" }
Example panel re-run: { "dashboardUid": "openclaw-command-center", "panelId": 18, "start": "now-24h", "extractFields": true }
Returns streams: { entries: [{ labels: {...}, timestamp: "...", line: "..." }], datasourceUid, totalEntries, truncated } — capped at 100 entries, lines at 500 chars (set lineLimit: 2000 for full stack traces).
Returns streams (extractFields): { entries: [{ labels: {...cleaned...}, timestamp: "...", line: "...", fields: { component, event_name, session_id, trace_id, model, duration_s, ... } }], datasourceUid } — infrastructure noise labels removed, openclaw_ prefix stripped from field keys, numeric values auto-converted. Also parses JSON log bodies if present.
Returns streams (traceCorrelation): When extractFields: true and entries contain trace_id, includes traceCorrelation: { traceIds: [...], tool: "grafana_query_traces", tip } — up to 5 unique trace IDs ready for grafana_query_traces with queryType: "get".
Returns metric: Same shape as grafana_query range/instant results (matrix capped at 50 series, vector capped at 50 results — includes datasourceUid, truncated, totalSeries/totalResults, and truncationHint when exceeded).
Returns (panel re-run): Includes resolvedFrom: "panel", panelTitle, panelType, templateVarsReplaced alongside normal results. If the panel uses a Prometheus datasource, returns an error directing you to use grafana_query instead.
Returns (error with guidance): On query failure, includes guidance: { cause, suggestion, example? } alongside the raw error. Pattern-matched for common LogQL mistakes: bare text without stream selector, empty {}, unclosed braces, missing label matchers, auth failure, timeout. Omitted when the error is unrecognized.
Tip: LogQL: {label="value"} selects streams, |= substring filter, |~ regex, != exclude. Metric wrappers: rate(), count_over_time(), bytes_rate(). Use extractFields: true when investigating OTel/lifecycle logs — it surfaces trace_id, session_id, event_name, model, and other attributes as first-class fields instead of buried in raw labels.
Tip (panel re-run): Same as grafana_query — set dashboardUid + panelId to auto-resolve LogQL and datasource. The tool routes Prometheus panels to grafana_query with a helpful error.
grafana_query_traces
When: User asks about traces, distributed tracing, slow spans, session trace hierarchies, or needs to debug request flows across services.
Params: datasourceUid, query (TraceQL expression or trace ID), queryType (search/get, default search), start/end (default now-1h/now), limit (default 20, max 50), minDuration/maxDuration (e.g., "1s", "10s"), dashboardUid (optional — resolve query from panel), panelId (optional — use with dashboardUid).
Example search: { "datasourceUid": "tempo1", "query": "{ resource.service.name = \"openclaw\" }" }
Example search slow: { "datasourceUid": "tempo1", "query": "{ resource.service.name = \"openclaw\" }", "minDuration": "5s" }
Example search with time: { "datasourceUid": "tempo1", "query": "{ span.gen_ai.system = \"anthropic\" }", "start": "now-24h", "limit": 50 }
Example get: { "datasourceUid": "tempo1", "query": "abc123def456789...", "queryType": "get" }
Example panel re-run: { "dashboardUid": "openclaw-session-explorer", "panelId": 12, "start": "now-24h" }
Returns search: { traces: [{ traceId, rootServiceName, rootTraceName, startTime, durationMs, spanCount? }], datasourceUid, totalTraces, truncated?, correlationHint? } — capped at 50 traces. When exceeded includes truncated: true and truncationHint. When traces are found, includes correlationHint: { logQuery, tool, tip } with a ready-to-use LogQL expression for grafana_query_logs.
Returns get: { traceId, spans: [{ traceId, spanId, parentSpanId?, operationName, serviceName, startTime, durationMs, status, kind?, attributes: {...} }], datasourceUid, totalSpans, truncated? } — flattened OTLP spans with resolved attributes (string/number/boolean). Capped at 200 spans. Sorted by start time (earliest first).
Returns (panel re-run): Includes resolvedFrom: "panel", panelTitle, panelType, templateVarsReplaced alongside normal results. If the panel uses a Prometheus or Loki datasource, returns an error directing you to use the correct tool.
Returns (error with guidance): On query failure, includes guidance: { cause, suggestion, example? } alongside the raw error. Pattern-matched for common TraceQL mistakes: syntax errors, invalid attributes, auth failure, timeout, not-found, invalid trace ID. Omitted when the error is unrecognized.
Returns (no results): When search returns zero traces, includes hint: { cause, suggestion } suggesting to broaden the query or check the datasource.
Tip: TraceQL: { } matches all traces, resource.service.name for service filter, span.http.status_code for HTTP spans, name for operation name, duration for span duration, status for error/ok filtering. Use minDuration/maxDuration to find performance outliers. Trace-to-Log: search and get results include correlationHint.logQuery — pass it directly to grafana_query_logs to find correlated logs. Log-to-Trace: grafana_query_logs results (with extractFields: true) include traceCorrelation.traceIds — pass any ID to grafana_query_traces with queryType: "get".
Tip (panel re-run): Same as grafana_query — set dashboardUid + panelId to auto-resolve TraceQL and datasource. The tool routes Prometheus/Loki panels to the correct tool with a helpful error.
grafana_create_dashboard
When: User wants a persistent dashboard for ongoing monitoring.
Params: template or dashboard (custom JSON) — one required. Optional: title (overrides template default), folderUid (target folder), overwrite (default true).
Returns: { uid, url, status, message, suggestedNext?: [{ template, reason }], validation?: DashboardValidation }. For template-based dashboards, suggestedNext lists complementary templates to deploy next. For custom JSON dashboards, validation dry-runs each panel's PromQL and reports per-panel health — check validation.panelsError for broken queries.
Choose the right template (3-tier SRE drill-down hierarchy):
Tier 1 → System: Start here for overall health. Tier 2 → Session: Click a session from Tier 1 to investigate. Tier 3 → Deep Dive: Cost, tool, or SRE details.
| Template | Tier | Domain | Use When |
|---|---|---|---|
| llm-command-center | Tier 1 | System overview | Golden signals, session table with click-to-drill-down, cost, cache, live feeds |
| session-explorer | Tier 2 | Session debug | Per-session trace hierarchy, LLM calls, tool calls, conversation flow |
| cost-intelligence | Tier 3a | Cost analysis | Spending trends, model attribution, cache savings, per-session cost table |
| tool-performance | Tier 3b | Tool analytics | Tool leaderboard, latency ranking, error rates, tool traces |
| sre-operations | Tier 3c | SRE operations | Queue health, webhooks, stuck sessions, tool loops |
| genai-observability | — | OTel gen_ai standard | Industry-standard AI monitoring: token analytics, LLM performance, traces, logs, cache efficiency. Works with any gen_ai data. |
| node-exporter | — | System/DevOps | Server CPU, memory, disk, network |
| | — | Web/DevOps | HTTP request rate, errors, latency (RED signals) |
| metric-explorer | — | Any domain | Deep-dive into any single metric from a dropdown |
| multi-kpi | — | Any domain | 4-metric KPI overview (business, fitness, finance, IoT) |
| weekly-review | — | Any domain | Weekly overview of 2 external metrics with trends + all openclaw_ext_* table |
All AI templates have Loki log-to-trace correlation via Tempo + stable UIDs for cross-dashboard navigation.
Example AI health: { "template": "llm-command-center", "title": "My AI Dashboard" }
Example session debug: { "template": "session-explorer", "title": "Session Debug" }
Example cost analysis: { "template": "cost-intelligence", "title": "My AI Costs" }
Example tool analytics: { "template": "tool-performance", "title": "Tool Health" }
Example SRE ops: { "template": "sre-operations", "title": "SRE Health" }
Example GenAI observability: { "template": "genai-observability", "title": "GenAI Observability" }
Example system: { "template": "node-exporter", "title": "Server Health" }
Example generic: { "template": "metric-explorer", "title": "Explore My Data" }
Example multi-KPI: { "template": "multi-kpi", "title": "Business KPIs" }
Example weekly review: { "template": "weekly-review", "title": "My Weekly Review" }
Example custom with validation: { "dashboard": { "title": "Model Comparison", "panels": [{ "id": 1, "title": "Cost by Model", "type": "timeseries", "targets": [{ "refId": "A", "expr": "sum by (model) (rate(openclaw_lens_cost_by_token_type[1h]))", "datasource": { "uid": "prometheus" } }] }] } }
Custom dashboard validation (returned only for dashboard param, not templates):
validation: { panelsTotal: 3, panelsValid: 1, panelsNoData: 1, panelsError: 1, panelsSkipped: 0, details: [{ panelId: 1, title: "Cost by Model", status: "ok", queries: [{ refId: "A", expr: "...", valid: true, sampleValue: 0.42 }] }, { panelId: 2, title: "Latency", status: "nodata" }, { panelId: 3, title: "Bad Query", status: "error", error: "parse error at char 5" }] }
Panel statuses: ok (query returned data), nodata (valid query, no results — metric may not exist yet), error (PromQL syntax error or datasource issue), skipped (no datasource UID found). Dashboard is always created regardless — validation is informational.
grafana_update_dashboard
When: User wants to add a panel, remove a panel, change a query, update dashboard settings, or delete a dashboard.
Params: uid (required), operation (required: add_panel, remove_panel, update_panel, update_metadata, delete).
add_panel params: panel (object with title, type, targets). Auto-layouts below existing panels.
remove_panel / update_panel params: panelId (preferred) or panelTitle (case-insensitive substring fallback). updates (object) for update_panel.
update_metadata params: title, description, tags, time (e.g., { "from": "now-7d", "to": "now" }), refresh (e.g., "1m").
delete params: None besides uid — permanently removes the dashboard. Always confirm with user first.
Example add: { "uid": "abc123", "operation": "add_panel", "panel": { "title": "Error Rate", "type": "timeseries", "targets": [{ "refId": "A", "expr": "rate(errors_total[5m])", "datasource": { "uid": "prom1" } }] } }
Example add (no datasource): { "uid": "abc123", "operation": "add_panel", "panel": { "title": "Latency", "type": "timeseries", "targets": [{ "refId": "A", "expr": "histogram_quantile(0.99, rate(http_duration_bucket[5m]))" }] } } — validation skipped if no datasource UID found, panel still saved.
Example remove: { "uid": "abc123", "operation": "remove_panel", "panelId": 3 }
Example update panel: { "uid": "abc123", "operation": "update_panel", "panelId": 1, "updates": { "title": "New Title", "targets": [{ "refId": "A", "expr": "new_query" }] } }
Example update metadata: { "uid": "abc123", "operation": "update_metadata", "title": "My Dashboard v2", "time": { "from": "now-7d", "to": "now" }, "refresh": "5m" }
Example delete: { "uid": "abc123", "operation": "delete" }
Returns update: { status: "updated", uid, url, version, operation, panelCount, affectedPanel?: { id, title }, changedFields?: [...], queryValidation?: { validated, results, datasourceUid?, skippedReason? } }.
Returns queryValidation: For add_panel and update_panel (when targets change), PromQL queries are dry-run against Grafana. Each result: { refId, expr, valid: boolean, error?: string, sampleValue?: number }. Panel is always saved — validation is informational. If valid: false, check the error field for PromQL syntax issues. If skippedReason is set, no datasource UID was found — include datasource: { uid: "..." } on targets to enable validation.
Returns delete: { status: "deleted", uid, title, message }.
Tip: targets in update_panel replaces entirely — include all targets, not just changed ones. Include datasource.uid on targets for query validation feedback.
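Because update_panel replaces targets wholesale, the safe pattern is read-modify-write: fetch the panel's current queries, change only the one you mean to, and send the full list back. A minimal sketch (the `build_update` helper and sample panel dict are illustrative, assuming the get_dashboard full-mode shape documented above):

```python
# Sketch: read-modify-write for update_panel. `panel` mimics a panel object
# from grafana_get_dashboard (full mode); update_panel replaces `targets`
# entirely, so we copy every target and only change the matching refId.
def build_update(panel, ref_id, new_expr):
    targets = []
    for q in panel["queries"]:
        expr = new_expr if q["refId"] == ref_id else q["expr"]
        targets.append({"refId": q["refId"], "expr": expr})
    return {"targets": targets}

panel = {"id": 1, "title": "Latency", "queries": [
    {"refId": "A", "expr": "old_query"},
    {"refId": "B", "expr": "other_query"},
]}
updates = build_update(panel, "A", "new_query")
# refId A gets the new expr; refId B survives untouched.
```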
grafana_get_dashboard
When: Need to inspect a dashboard's panels — find panel IDs for sharing, verify structure, scan multiple dashboards for an overview, or audit which panels are returning data. Params:
uid (required). Optional: compact (boolean, default false) — return panel titles and types only, no queries or metadata (~70% smaller). audit (boolean, default false) — dry-run each panel's query and add health status.
Example (full): { "uid": "abc123" }
Example (compact overview): { "uid": "abc123", "compact": true }
Example (audit): { "uid": "abc123", "audit": true }
Returns (full): { uid, title, description?, url, tags, time?, refresh?, panelCount, panels: [{ id, title, type, queries: [{ refId, expr }] }], folderUid, created?, updated? }.
Returns (compact): { uid, title, url, tags, panelCount, panels: [{ id, title, type }] }.
Returns (audit): Same as full, plus each panel gets health: { status: "ok"|"nodata"|"error"|"skipped", error?, sampleValue? } and the response includes auditSummary: { ok, nodata, error, skipped }. Resolves template variable datasources ($prometheus, $loki) and replaces expression template vars with wildcards.
Tip: Use audit: true when the user asks "which panels are broken?" or "audit my dashboard" — it replaces N separate grafana_query calls with one tool call. Use compact: true for lightweight overview scans. Omit both when you need query details (before update or share).
grafana_search
When: User mentions a dashboard by name, before creating one (check duplicates), or for reporting/audit workflows. Params:
query (required). Optional: tags (array — filter by tags), starred (boolean — only starred), sort ("alpha-asc"/"alpha-desc"), limit (number, default 100), enrich (boolean — add updatedAt + panelCount per result, default false).
Example: { "query": "cost" }
Example with tags: { "query": "", "tags": ["production"] }
Example starred: { "query": "", "starred": true, "limit": 10 }
Example enriched: { "query": "", "enrich": true }
Returns: { count, enriched, dashboards: [{ uid, title, url, tags, folderTitle?, folderUid?, updatedAt?, panelCount? }] }. folderTitle/folderUid always included when dashboard is in a folder. updatedAt (ISO 8601) and panelCount only present when enrich: true — enables staleness detection and reporting without per-dashboard get_dashboard calls.
Tip: Use enrich: true for reporting workflows ("which dashboards are stale?", "give me a summary of all dashboards"). Skip enrichment for simple lookups. After finding a dashboard, use grafana_get_dashboard to inspect panels, grafana_share_dashboard to render a chart, or grafana_update_dashboard to modify it.
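Staleness triage over an enriched search result is a simple timestamp filter. A sketch (the `stale_dashboards` helper and sample data are hypothetical; it assumes `updatedAt` is ISO 8601 as documented):

```python
from datetime import datetime, timedelta, timezone

# Hypothetical helper: pick out dashboards not updated in `max_age_days`,
# using the updatedAt field that enrich: true adds to each search result.
def stale_dashboards(dashboards, max_age_days=30, now=None):
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=max_age_days)
    return [
        d["title"] for d in dashboards
        if "updatedAt" in d
        and datetime.fromisoformat(d["updatedAt"].replace("Z", "+00:00")) < cutoff
    ]

result = {"dashboards": [
    {"title": "Fresh", "updatedAt": "2025-06-01T00:00:00Z"},
    {"title": "Stale", "updatedAt": "2024-01-01T00:00:00Z"},
]}
now = datetime(2025, 6, 10, tzinfo=timezone.utc)
print(stale_dashboards(result["dashboards"], 30, now))  # ['Stale']
```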
grafana_share_dashboard
When: User says "show me" or "send me" a chart/dashboard. Params:
dashboardUid, panelId (required). Optional: from (default "now-6h"), to (default "now"), width (default 1000), height (default 500), theme ("light"/"dark", default "dark").
Example: { "dashboardUid": "abc123", "panelId": 2, "from": "now-6h", "to": "now" }
Returns: Image rendered inline (tier 1), or snapshot URL (tier 2), or deep link (tier 3). Always delivers something. Includes deliveryTier ("image" | "snapshot" | "link"), rendererAvailable (boolean — false when Image Renderer plugin is missing), renderFailureReason (why image rendering failed), and remediation (how to fix it). Tier 3 also includes snapshotFailureReason.
Tip: Use grafana_get_dashboard first to find panel IDs. If rendererAvailable is false, tell the user to install the grafana-image-renderer plugin.
grafana_create_alert
When: User wants notifications when a metric crosses a threshold. Params:
title, datasourceUid, expr (PromQL), threshold (all required). Optional: evaluation ("instant"/"rate"/"increase", default "instant"), evaluationWindow (default "5m", used with rate/increase), condition (gt/lt/gte/lte, default gt), for (duration, default 5m), folderUid, labels (e.g., { "severity": "warning" }), annotations (e.g., { "summary": "Cost too high" }), noDataState (NoData/Alerting/OK, default NoData).
IMPORTANT: For counter metrics (*_total), always use evaluation: "rate" (per-second rate) or evaluation: "increase" (total change over window). Raw counter values always increase and will immediately breach any threshold. Use "instant" (default) only for gauges.
Example gauge alert: { "title": "High Cost Alert", "datasourceUid": "prom1", "expr": "openclaw_lens_daily_cost_usd", "threshold": 5, "condition": "gt" }
Example rate alert: { "title": "High Error Rate", "datasourceUid": "prom1", "expr": "openclaw_lens_webhook_error_total", "threshold": 0.1, "evaluation": "rate" }
Example increase alert: { "title": "Token Burst", "datasourceUid": "prom1", "expr": "openclaw_lens_tokens_total", "threshold": 10000, "evaluation": "increase", "evaluationWindow": "1h" }
Returns: { uid, title, status: "created", datasourceUid, url, evaluation?: { mode, window, evaluatedExpr }, metricValidation: { valid, error?, sampleValue? }, message }. The datasourceUid echoes back which datasource the rule targets (verify correctness). metricValidation dry-runs the expression before creation — valid: true + sampleValue confirms data exists; valid: false + error warns of typos/missing metrics. Alert is always created regardless (metric may not have data yet). When evaluation is "rate" or "increase", validation runs the wrapped expression.
Note: Auto-creates a "Grafana Lens Alerts" folder if no folderUid is specified.
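The counter rule above is worth seeing numerically. A sketch with invented samples showing why "instant" misfires on counters while "rate" and "increase" behave:

```python
# Why evaluation: "instant" misfires on counters. A counter only ever grows,
# so its raw value eventually exceeds any threshold; rate() and increase()
# alert on change instead. Samples are (seconds, counter value), invented.
samples = [(0, 1000.0), (60, 1060.0), (120, 1120.0)]

def per_second_rate(samples):
    (t0, v0), (tn, vn) = samples[0], samples[-1]
    return (vn - v0) / (tn - t0)          # what evaluation: "rate" compares

def increase(samples):
    return samples[-1][1] - samples[0][1]  # what evaluation: "increase" compares

print(per_second_rate(samples))  # 1.0 per second, sensible to threshold
print(increase(samples))         # 120.0 total change over the window
# An "instant" alert would compare the raw value 1120.0, which breaches
# e.g. threshold=100 forever regardless of the actual error rate.
```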
grafana_annotate
When: User deploys, changes config, or wants to mark an event for correlation. Params:
action ("create" default, or "list").
Create params: text (required), tags, dashboardUid, panelId, time (epoch ms or relative like "now-2h", default now), timeEnd (epoch ms or relative).
List params: from, to (epoch ms or relative like "now-7d", "now-24h", "now"), tags, limit (default 20).
Time formats: All time params accept epoch ms (e.g., 1700000000000) OR Grafana-style relative strings ("now", "now-1h", "now-7d", "now-30m"). Prefer relative strings — they're simpler and avoid arithmetic errors.
Example create: { "text": "Deployed v2.1.0", "tags": ["deploy", "production"] }
Example create past: { "text": "Incident started", "time": "now-2h", "timeEnd": "now-30m", "tags": ["incident"] }
Example list recent: { "action": "list", "from": "now-7d", "to": "now", "tags": ["deploy"] }
Example list: { "action": "list", "tags": ["deploy"], "limit": 10 }
Returns create: { status: "created", id, message, time, comparisonHint: { beforeWindow: { from, to }, afterWindow: { from, to }, suggestion } }. The comparisonHint provides ready-to-use ISO 8601 time ranges (30-min windows) for before/after comparison via grafana_query — no manual time math needed. For region annotations (with timeEnd), afterWindow starts at timeEnd.
Returns list: { annotations: [{ id, text, tags, time, timeEnd?, dashboardUID?, panelId? }] }.
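Resolving a relative string like "now-2h" to epoch ms is plain arithmetic. A minimal sketch (the `resolve` helper is hypothetical and covers only the `now-<N><unit>` forms listed above, not the full Grafana time grammar):

```python
import re
import time

# Hypothetical helper: resolve Grafana-style relative strings ("now",
# "now-2h", "now-30m") to epoch milliseconds, mirroring what the
# annotation time params accept.
UNITS = {"s": 1, "m": 60, "h": 3600, "d": 86400}

def resolve(rel, now_ms=None):
    now_ms = now_ms if now_ms is not None else int(time.time() * 1000)
    if rel == "now":
        return now_ms
    m = re.fullmatch(r"now-(\d+)([smhd])", rel)
    if not m:
        raise ValueError(f"unsupported relative time: {rel}")
    return now_ms - int(m.group(1)) * UNITS[m.group(2)] * 1000

now_ms = 1_700_000_000_000
print(resolve("now-2h", now_ms))  # 1699992800000
```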
grafana_check_alerts
When: Prompt context shows "GRAFANA ALERTS", need to manage alert rules (list/delete), set up the alert webhook, silence alerts during investigation, or acknowledge an investigated alert. Params:
action ("list" default, "acknowledge", "list_rules", "delete_rule", "silence", "unsilence", "setup").
List params: None — returns all pending (unacknowledged) alerts. Instances capped at 5 per alert.
Acknowledge params: alertId (required) — marks an alert as investigated.
List rules params: compact (boolean, default false — returns only uid/title/state/condition). Full mode returns all configured alert rules from Grafana with UID, title, condition (PromQL), folder, labels, annotations, AND live evaluation state (normal/firing/pending/nodata/error), health, and lastEvaluation. One call gives the complete alert health picture.
Delete rule params: ruleUid (required) — permanently deletes an alert rule. Get UIDs from list_rules.
Silence params: matchers (required — array of { name, value, isRegex? } from alert's commonLabels), duration (default "2h"), comment (optional).
Unsilence params: silenceId (required) — removes a silence so alerts resume notifying.
Setup params: webhookUrl (optional, auto-detected) — creates webhook contact point and notification policy route in Grafana.
Example list: {}
Example acknowledge: { "action": "acknowledge", "alertId": "alert-1" }
Example list rules: { "action": "list_rules" }
Example list rules compact: { "action": "list_rules", "compact": true }
Example delete rule: { "action": "delete_rule", "ruleUid": "abc123-def456" }
Example silence: { "action": "silence", "matchers": [{ "name": "alertname", "value": "HighCost" }], "duration": "2h", "comment": "Investigating cost spike" }
Example unsilence: { "action": "unsilence", "silenceId": "silence-uuid-123" }
Example setup: { "action": "setup" }
Returns list: { status: "success", alertCount, alerts: [{ id, status, title, message, receivedAt, commonLabels, totalInstances, truncated?, suggestedInvestigation?: { datasourceUid, condition, tool, queryLanguage, hint }, instances: [{ status, labels, annotations, startsAt, values }] }] }. suggestedInvestigation is auto-enriched by matching the alert to its rule — provides the PromQL/LogQL expression, datasource, and tool to use for immediate investigation (eliminates the need for separate list_rules + explore_datasources calls).
Returns acknowledge: { status: "acknowledged", alertId }.
Returns list_rules: { status: "success", ruleCount, rules: [{ uid, title, folder, ruleGroup, state, health, lastEvaluation, for, labels, annotations, condition, updated }] }. state is the live evaluation state: "normal" (not firing), "firing", "pending" (within for duration), "nodata", or "error". Falls back to "unknown" if the eval state API is unavailable. health is "ok", "nodata", "error", or "unknown". condition is the extracted PromQL expression from the rule's data queries.
Returns list_rules (compact): { status: "success", ruleCount, rules: [{ uid, title, state, condition }] }. Minimal fields for multi-tool chains — use when you need a quick overview of all rules without details.
Returns delete_rule: { status: "deleted", ruleUid, message }.
Returns silence: { status: "silenced", silenceId, duration, matchers, message }.
Returns unsilence: { status: "unsilenced", silenceId, message }.
Returns setup: { status: "created", contactPointUid, webhookUrl } or { status: "already_exists", contactPointUid }.
Note: Setup is idempotent — safe to call multiple times. Only alerts with managed_by=openclaw label route to the webhook (auto-added by grafana_create_alert). Use list_rules → delete_rule for full alert lifecycle management (create via grafana_create_alert, list/delete via grafana_check_alerts).
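Building silence matchers from an alert's commonLabels is a direct reshaping. A sketch (the helper name is illustrative; the output matches the `{ name, value, isRegex? }` shape the silence action expects):

```python
# Sketch: turn an alert's commonLabels map into the matchers array the
# silence action expects. Sorted for deterministic output.
def matchers_from_labels(common_labels):
    return [{"name": k, "value": v, "isRegex": False}
            for k, v in sorted(common_labels.items())]

common_labels = {"alertname": "HighCost", "severity": "warning"}
print(matchers_from_labels(common_labels))
# [{'name': 'alertname', 'value': 'HighCost', 'isRegex': False},
#  {'name': 'severity', 'value': 'warning', 'isRegex': False}]
```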
grafana_push_metrics
When: User wants to track custom data (calendar events, git commits, fitness stats, financial data) in Grafana. Params:
action ("push" default, "register", "list", "delete").
Push params: metrics (required array) — each: { name, value, labels?, type?, help?, timestamp? }. Names auto-get openclaw_ext_ prefix. timestamp is optional ISO 8601 for historical data (gauge only).
Register params: name (required), type ("gauge"/"counter", default "gauge"), help, labelNames (array), ttlDays.
List params: None — returns all custom metric definitions.
Delete params: name (required) — removes a custom metric.
Example push: { "metrics": [{ "name": "steps_today", "value": 8000 }, { "name": "meetings", "value": 3, "labels": { "type": "standup" } }] }
Example backfill: { "metrics": [{ "name": "steps", "value": 8000, "timestamp": "2025-01-15" }, { "name": "steps", "value": 10500, "timestamp": "2025-01-16" }] }
Example mixed: { "metrics": [{ "name": "steps", "value": 9000, "timestamp": "2025-01-17" }, { "name": "heart_rate", "value": 72 }] }
Example register: { "action": "register", "name": "weight_kg", "type": "gauge", "help": "Body weight", "labelNames": ["person"], "ttlDays": 90 }
Example list: { "action": "list" }
Example delete: { "action": "delete", "name": "old_metric" }
Returns push: { status: "ok", accepted: 2, queryNames: { "openclaw_ext_steps": "openclaw_ext_steps", "openclaw_ext_events": "openclaw_ext_events_total" }, suggestedWorkflow: [{ tool, action, example }], message: "..." }. suggestedWorkflow contains concrete next-step examples using the actual pushed metric names — verify (grafana_query), visualize (grafana_create_dashboard with metric-explorer template), and alert (grafana_create_alert, single-metric only). Partial success supported. Timestamped and real-time points in the same batch are both accepted.
Returns register: { status: "registered", metric: { name, type, help, labelNames, ttlMs }, queryName: "openclaw_ext_events_total", suggestedWorkflow: [{ tool, action, example }] }. suggestedWorkflow shows how to push data and query the registered metric (with rate() wrapping for counters).
Returns list: { count, metrics: [{ name, type, queryName, help, labelNames, createdAt, updatedAt }] }.
Returns delete: { status: "deleted", name }.
Note: Push auto-registers unknown metrics. Response includes queryNames with exact PromQL names and suggestedWorkflow with concrete next steps. Follow suggestedWorkflow to complete the push→visualize pipeline. Timestamped pushes are gauge-only — counters with timestamps are rejected. See external-data.md for naming conventions and backfill patterns.
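A backfill batch is just one timestamped point per day. A sketch that generates the payload shape shown in the backfill example (the `backfill` helper and values are illustrative; gauge-only, per the note above):

```python
from datetime import date, timedelta

# Sketch: build a gauge-only backfill batch for grafana_push_metrics,
# one point per day ending on `end_day`. Values are invented sample data.
def backfill(name, daily_values, end_day):
    start = end_day - timedelta(days=len(daily_values) - 1)
    return {"metrics": [
        {"name": name, "value": v,
         "timestamp": (start + timedelta(days=i)).isoformat()}
        for i, v in enumerate(daily_values)
    ]}

payload = backfill("steps", [8000, 10500, 9200], date(2025, 1, 15))
print(payload["metrics"][0])
# {'name': 'steps', 'value': 8000, 'timestamp': '2025-01-13'}
```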
grafana_explain_metric
When: User asks "what does this metric mean?", "why did it spike?", "is this normal?", or "show me the trend". Params:
datasourceUid (required), expr (PromQL or plain metric name, required), period (24h/7d/30d, default 24h), compareWith ("previous" — compare current period with the same-length window immediately before it).
Example: { "datasourceUid": "prom1", "expr": "openclaw_lens_daily_cost_usd" }
Example counter: { "datasourceUid": "prom1", "expr": "openclaw_lens_tokens_total" }
Example 7d: { "datasourceUid": "prom1", "expr": "openclaw_lens_daily_cost_usd", "period": "7d" }
Example comparison: { "datasourceUid": "prom1", "expr": "openclaw_lens_daily_cost_usd", "period": "7d", "compareWith": "previous" }
Example PromQL: { "datasourceUid": "prom1", "expr": "rate(http_requests_total[5m])", "period": "24h" }
Returns: { metricType?, trendQuery?, current: { value, timestamp }, healthContext?: { status, thresholds, description, direction }, trend: { changePercent, direction, first, last }, stats: { min, max, avg, samples }, comparison?: { previousPeriod: { from, to, avg, min, max, samples }, change: { absolute, percentage, direction } }, metadata: { type, help, unit }, suggestedQueries?: [{ query, description }], suggestedBreakdowns?: string[] }. Sections omitted when data unavailable. changePercent is null when first value is zero. healthContext is included for well-known openclaw_lens_* gauge metrics — same as grafana_query.
Counter-aware: Auto-detects counter metrics (via metadata type or _total suffix) and wraps the trend query in rate(expr[5m]). The current value stays raw (cumulative total), but trend and stats show rate of change. metricType field tells you the detected type (counter/gauge/histogram). trendQuery shows the actual PromQL used for trend (only present when different from expr).
Drill-down: For multi-dimensional metrics (metrics with labels like model, token_type, provider), the response includes suggestedQueries — ready-to-use PromQL queries for grafana_query that break down the metric by each label. Counter metrics get rate() wrapping automatically. Use these to investigate cost attribution, identify top contributors, or decompose aggregates.
Breakdowns: suggestedBreakdowns provides label names for decomposition — always available for known OpenClaw metrics (cost, session, queue, webhook families) even when the metric has no data yet. For unknown metrics, falls back to labels discovered from the instant query. Use these labels with grafana_query to build sum by (label) (...) queries for root-cause analysis.
Period comparison: Use compareWith: "previous" for period-over-period analysis (e.g., this week vs. last week). Returns a comparison object with the previous period's stats and the change (absolute, percentage, direction). Works with counters too (compares rates). Eliminates the need for manual multi-query workflows.
Tip: For simple trend context, call with just period. For "did things improve?" questions, add compareWith: "previous". Metadata only available for plain metric names (not complex PromQL). No need to manually wrap counters in rate() — the tool does it automatically.
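The comparison object's change fields reduce to simple arithmetic over the two window averages. A sketch of how such a change summary could be derived (helper name and values are illustrative, not the tool's implementation):

```python
# Sketch: derive an { absolute, percentage, direction } change summary from
# a current-period and previous-period average, as in compareWith: "previous".
def period_change(current_avg, previous_avg):
    absolute = current_avg - previous_avg
    # percentage is undefined when the previous period averaged zero
    percentage = (absolute / previous_avg * 100) if previous_avg else None
    direction = "up" if absolute > 0 else "down" if absolute < 0 else "flat"
    return {"absolute": absolute, "percentage": percentage, "direction": direction}

print(period_change(12.0, 10.0))
# {'absolute': 2.0, 'percentage': 20.0, 'direction': 'up'}
```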
grafana_security_check
When: User asks "am I being attacked?", "security status", "security audit", "security check", or wants a comprehensive threat assessment. Params:
lookback (time window, default "1h". Use "24h" for daily review, "7d" for weekly).
Example: {}
Example weekly review: { "lookback": "7d" }
Returns: { overallThreatLevel: "green"|"yellow"|"red", summary, checks: [{ name, status, value, detail }], securityEventLogs, limitations: [...], suggestedActions: [...], dashboardTemplate: "security-overview" }.
Checks (6 parallel, via Promise.allSettled — partial failures return partial results):
- webhook_error_ratio — webhook error rate vs received rate. Warning 20%, critical 50%.
- cost_anomaly — daily cost. Warning $10, critical $50.
- tool_loops — active tool loops. Warning 1, critical 3.
- injection_signals — prompt injection patterns detected. Warning 1, critical 5.
- session_enumeration — unique sessions in 1h. Warning 50, critical 200.
- stuck_sessions — stuck sessions. Warning 1, critical 3.
Limitations: Auth failures are NOT observable — openclaw's auth middleware emits no diagnostic events. The tool monitors observable signals (webhook errors, cost spikes, prompt injection patterns, session anomalies) but cannot detect silent auth-layer attacks (bad tokens, brute-force, rate-limiter lockouts). Always includes limitations in response.
Auto-discovery: No datasource UID needed — auto-discovers first Prometheus datasource. Loki is optional (skips log check if absent).
Follow-up: Use dashboardTemplate to create a persistent security dashboard. Use suggestedActions for investigation steps.
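The warning/critical thresholds above roll up naturally into a green/yellow/red level: any check at or past critical is red, any at or past warning is yellow, otherwise green. A sketch of that rollup (thresholds copied from the checks list; the rollup logic itself is an assumption, not the tool's source):

```python
# Sketch: roll the documented warning/critical thresholds up into an
# overall green/yellow/red level. Threshold pairs are (warning, critical).
THRESHOLDS = {
    "cost_anomaly": (10, 50),
    "tool_loops": (1, 3),
    "injection_signals": (1, 5),
    "session_enumeration": (50, 200),
    "stuck_sessions": (1, 3),
}

def threat_level(values):
    level = "green"
    for check, value in values.items():
        warn, crit = THRESHOLDS[check]
        if value >= crit:
            return "red"          # any critical check wins immediately
        if value >= warn:
            level = "yellow"
    return level

print(threat_level({"cost_anomaly": 12, "tool_loops": 0}))  # yellow
```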
grafana_investigate
When: First step for "investigate this", "what's wrong?", "triage", "root cause analysis", "debug this issue", or any multi-signal investigation. Params:
focus (required — alert UID, metric name, or symptom text), timeWindow (optional, default "1h", options: "1h", "6h", "24h"), service (optional, default "openclaw").
Example: { "focus": "high error rate" }
Example with alert: { "focus": "alert-abc", "timeWindow": "6h" }
Example with metric: { "focus": "openclaw_lens_daily_cost_usd", "timeWindow": "24h" }
Returns: { timeWindow, focus, metricSignals: { focus: { current, trend, anomalyScore, anomalySeverity }, red: { rate, errorRate, p95Latency } }, logSignals: { totalVolume, errorCount, bySeverity, topPatterns, sampleErrors }, traceSignals: { errorTraces, slowTraces }, contextSignals: { recentAnnotations, alertsActive }, suggestedHypotheses: [{ hypothesis, evidence, confidence, testWith: { tool, params } }], limitations }.
Auto-discovery: No datasource UID needed — auto-discovers Prometheus, Loki, and Tempo datasources. Loki and Tempo are optional (graceful degradation with limitations noted).
Follow-up: Use suggestedHypotheses[].testWith for deep-dives with specific tools. Use grafana_annotate to mark findings. Use grafana_check_alerts to acknowledge investigated alerts.
Security Monitoring
Honest limitations: OpenClaw's auth middleware does not emit diagnostic events for failed authentication attempts (bad tokens, brute-force, rate-limiter lockouts). Gateway-level auth failures are invisible to Grafana Lens. The security-check tool monitors observable signals (webhook errors, cost spikes, prompt injection patterns, session anomalies) but cannot detect silent auth-layer attacks.
Security PromQL queries:
- Webhook error spike:
rate(openclaw_lens_webhook_error_total[5m]) > 0.5 - Webhook error ratio:
rate(openclaw_lens_webhook_error_total[5m]) / (rate(openclaw_lens_webhook_received_total[5m]) + 0.001) > 0.3 - Active tool loops:
sum(openclaw_lens_tool_loops_active) > 0 - Prompt injection signals:
sum(increase(openclaw_lens_prompt_injection_signals_total[1h])) > 0 - Cost spike:
openclaw_lens_daily_cost_usd > 10 - Unique session anomaly:
openclaw_lens_unique_sessions_1h > 50 - Gateway restarts:
sum(increase(openclaw_lens_gateway_restarts_total[24h])) > 3 - Tool error classification:
sum by (error_class) (rate(openclaw_lens_tool_error_classes_total[5m]))
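The +0.001 in the error-ratio query is a division-by-zero guard for quiet periods. A sketch of the same arithmetic (helper name illustrative):

```python
# Sketch: the webhook error ratio with the +0.001 guard from the PromQL
# above. When no webhooks arrive, both rates are 0 and the ratio stays 0
# instead of dividing by zero and erroring or firing spuriously.
def error_ratio(error_rate, received_rate):
    return error_rate / (received_rate + 0.001)

print(round(error_ratio(0.3, 1.0), 3))  # 0.3 under healthy traffic
print(error_ratio(0.0, 0.0))            # 0.0 on a quiet system
```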
Security LogQL queries:
- Security events: {service_name="openclaw"} | json | event_name=~"prompt_injection.detected|gateway.start|gateway.stop"
- Tool loops: {service_name="openclaw"} | json | component="diagnostic" | event_name="tool.loop"
- Session resets: {service_name="openclaw"} | json | event_name="session.reset"
Composed Workflow Examples
"How's my agent doing?"
- grafana_query with sum(increase(openclaw_lens_cost_by_model_total[1d])) or vector(0) and openclaw_lens_tokens_total for quick numbers
- Or grafana_create_dashboard with llm-command-center template + grafana_share_dashboard for a visual
"Set up monitoring for my agent" / "What dashboards should I create?"
- grafana_search query: "LLM Command Center" — check for existing agent dashboards
- grafana_create_dashboard template: "llm-command-center" — single-pane health: cost, sessions, errors, latency, cache, live feeds
- Follow suggestedNext from the response — it chains you through the remaining templates (session-explorer → cost-intelligence → tool-performance → sre-operations → genai-observability)
- Share dashboard URLs with user
- Optional: grafana_share_dashboard to show a key panel inline
The 6 AI observability templates provide comprehensive OpenClaw monitoring with Loki log-to-trace correlation:
- llm-command-center: Single-pane health view (daily cost, sessions, error rate, P95 latency, cache hit rate, token flow, live error/session feeds, per-model latency, message type breakdown)
- session-explorer: Per-session deep-dive (trace hierarchy, LLM calls, tool calls, conversation flow, cost/token breakdown, TraceQL session spans, latency stats). THE killer feature.
- cost-intelligence: Financial analysis (daily/weekly/monthly trends, model/provider/token attribution pie charts, cache savings, per-session cost table, projected monthly cost, token efficiency)
- tool-performance: Tool reliability (leaderboard bar gauges, p95 latency ranking, error rates, tool traces, tool calls by model, error traces, duration heatmap)
- sre-operations: Infrastructure health (queue depth, webhooks, stuck sessions, tool loops, gen_ai error rate, context window pressure)
- genai-observability: Industry-standard AI monitoring using OTel gen_ai semantic conventions — token analytics, LLM latency heatmap, trace explorer, log intelligence, cache efficiency. Works with any gen_ai data, not just OpenClaw.
"Find error logs in the last hour"
- grafana_explore_datasources — find Loki datasource UID
- grafana_query_logs with {job="api"} |= "error"
"Why did my service crash?"
- grafana_explore_datasources — find Loki UID
- grafana_query_logs with {job="myservice"} |~ "fatal|panic|crash"
- grafana_query — check related metrics around the same time window
"Investigate alert: correlate metrics with logs"
- grafana_check_alerts — get alert details with suggestedInvestigation (includes ready-to-use query, datasource, and tool)
- grafana_query / grafana_query_logs — use suggestedInvestigation.tool with suggestedInvestigation.condition and suggestedInvestigation.datasourceUid
- grafana_query_logs — search logs around alert time for root cause (if PromQL alert)
- grafana_annotate — mark findings on dashboard
- grafana_check_alerts with action: "acknowledge"
"What data do I have in Grafana?"
- grafana_explore_datasources — find Prometheus datasources
- grafana_list_metrics with datasourceUid and metadata: true — list available metrics grouped by category
- Summarize what's available using categorySummary for quick overview (e.g., "5 cost metrics, 8 session metrics, 3 queue metrics")
"My agent feels slow — what performance metrics can I look at?"
- grafana_explore_datasources — get Prometheus UID
- grafana_list_metrics with purpose: "performance", metadata: true — returns only session + tool metrics (latency, duration, tool calls)
- grafana_explain_metric on key metrics (e.g., openclaw_lens_session_latency_avg_ms, openclaw_lens_tool_duration_ms) — get current values and trends
"Alert me if costs exceed $10/day"
- grafana_explore_datasources — get Prometheus UID
- grafana_create_alert with expr: "openclaw_lens_daily_cost_usd", threshold: 10, condition: "gt"
"Query error rate and alert if it's above 1%" (query→alert chaining)
- grafana_query with expr: "rate(http_errors_total[5m])" — response includes datasourceUid
- grafana_create_alert with datasourceUid from the query response, same expr, threshold: 0.01, condition: "gt" — no extra datasource discovery needed
"Show me my cost trends"
- grafana_search with query: "cost" — check for existing dashboard
- If not found: grafana_create_dashboard with cost-intelligence template
- grafana_get_dashboard — find the right panel ID
- grafana_share_dashboard — deliver the chart inline
"Re-run the latency panel from my Command Center for the last 7 days"
- grafana_search with query: "Command Center" — get the dashboard UID
- grafana_get_dashboard with uid — find the latency panel ID
- grafana_query with dashboardUid, panelId, queryType: "range", start: "now-7d" — auto-resolves PromQL and datasource from the panel
"Mark that I deployed version 2"
- grafana_annotate with text: "Deployed v2", tags: ["deploy"]
"I deployed v2.3.0 — compare error rates before and after"
- grafana_annotate with text: "Deployed v2.3.0", tags: ["deploy"] — response includes comparisonHint with before/after time windows
- grafana_query with comparisonHint.beforeWindow time range — get pre-deploy error rate
- grafana_query with comparisonHint.afterWindow time range — get post-deploy error rate
- Compare and report the difference to the user
"What happened around 3pm?"
- grafana_annotate with action: "list", from / to relative times (e.g., "now-6h") or epoch ms, or specific tags
"Monitor my server's health"
- grafana_explore_datasources — get Prometheus UID
- grafana_search with query: "node" — check for existing
- grafana_create_dashboard with template: "node-exporter" — user picks instance from dropdown
"Show me my e-commerce metrics"
- grafana_explore_datasources — get Prometheus UID
- grafana_list_metrics with prefix: "orders_" — discover available metrics
- grafana_create_dashboard with template: "multi-kpi" — user picks 4 business metrics from dropdowns
"I want to explore my fitness data"
- grafana_explore_datasources — get Prometheus UID
- grafana_create_dashboard with template: "metric-explorer" — user browses all fitness_* metrics from dropdown
"Set up HTTP API monitoring with alerts"
- grafana_explore_datasources — get Prometheus UID
- grafana_create_dashboard with template: "http-service"
- grafana_create_alert with expr: "http_requests_total{code=~\"5..\"}", threshold: 0.1, evaluation: "rate" — the tool wraps in rate() automatically
"Build a custom Redis dashboard"
- grafana_explore_datasources — get Prometheus UID
- grafana_list_metrics with metadata: true, prefix: "redis_" — learn metric types
- Compose custom dashboard JSON using panel snippets from the composition guide
- grafana_create_dashboard with the custom JSON — check validation.panelsError for broken queries, fix any errors with grafana_update_dashboard
"Set up full alert monitoring loop"
- grafana_explore_datasources — get Prometheus UID
- grafana_check_alerts with action: "setup" — create webhook contact point
- grafana_create_alert with PromQL condition — rule auto-routes to webhook via managed_by=openclaw label
- When alert fires, agent sees it in prompt context and can investigate
"List all my alert rules and delete one"
- grafana_check_alerts with action: "list_rules" — get all configured rules with UIDs, titles, conditions, AND live eval state (normal/firing/nodata/error)
- Identify the target rule by title, condition, or state
- grafana_check_alerts with action: "delete_rule", ruleUid — permanently remove the rule
"Investigate a firing alert" (triggered by "GRAFANA ALERTS" in prompt context)
- grafana_investigate with focus = alert UID or symptom, timeWindow: "1h" — parallel multi-signal triage (metrics, logs, traces, context)
- grafana_check_alerts with action: "silence", matchers from alert's commonLabels — prevent repeat notifications
- Follow suggestedHypotheses[].testWith from investigate response — deep-dive with specific tools
- grafana_query_logs — search logs around alert time for root cause (statistics first: count_over_time, then samples)
- grafana_query_traces — inspect error/slow traces for span-level detail
- grafana_annotate — mark findings on dashboard
- grafana_check_alerts with action: "acknowledge" — mark as investigated
- grafana_check_alerts with action: "unsilence", silenceId — restore notifications
- Report findings using Evidence Presentation Format (see sre-investigation.md §10)
"Is this metric anomalous?" / "Statistical assessment"
- grafana_explore_datasources — get Prometheus UID
- grafana_explain_metric with period: "24h" — response includes anomaly (z-score, severity) and seasonality (vs 1d/7d ago)
- If anomaly.severity >= "significant": grafana_query with z-score PromQL for time series view
- grafana_query with predict_linear(METRIC[6h], 3600) — where is it heading?
- Interpret: report σ-level, compare with yesterday/last week, forecast trajectory
"Alert fatigue analysis — which alerts are noisy?"
- grafana_check_alerts with action: "analyze" — fatigue report with always-firing, flapping, and healthy classifications
- For always-firing rules: suggest raising thresholds or adding for: duration
- For flapping rules: suggest adding hysteresis or widening threshold bands
- grafana_check_alerts with action: "silence" for rules being tuned
"Generate a postmortem for the last incident"
- grafana_investigate with focus = incident symptom, timeWindow: "6h" — gather all evidence
- grafana_check_alerts (list) — find resolved alerts with timeline
- grafana_annotate (list, tags: ["investigation"]) — build event timeline
- grafana_explain_metric for key metrics — trend/stats for incident period
- grafana_query_logs — error patterns (statistics first)
- grafana_query_traces — error traces for root cause evidence
- Synthesize into blameless postmortem (see sre-investigation.md §9)
"Add an error rate panel to my API dashboard"
- grafana_search with query: "API" — find the dashboard
- grafana_get_dashboard — get current panels and structure
- grafana_update_dashboard with operation: "add_panel", panel: { title: "Error Rate", type: "timeseries", targets: [...] }
"Remove the old latency panel and add a new one"
- grafana_get_dashboard — find panel IDs
- grafana_update_dashboard with operation: "remove_panel", panelId: <old_id>
- grafana_update_dashboard with operation: "add_panel", panel: { ... }
"Change my dashboard time range to 7 days"
- grafana_update_dashboard with operation: "update_metadata", time: { "from": "now-7d", "to": "now" }
"Send my team the weekly dashboard"
- grafana_search with the dashboard name
- grafana_get_dashboard — get panel IDs
- grafana_share_dashboard for each key panel with from: "now-7d"
"Show me my weekly work review"
- grafana_push_metrics — push work data (commits, hours, meetings)
- grafana_create_dashboard with template: "weekly-review" — weekly trends + all custom metrics table
- grafana_share_dashboard — deliver key panels inline
"Track my daily fitness data"
1. `grafana_push_metrics` — push fitness metrics (steps, weight, calories). Response has `queryNames` with exact PromQL names.
2. `grafana_create_dashboard` with `template: "weekly-review"` — weekly trends
3. `grafana_create_alert` using `queryNames` from step 1 for exact PromQL — e.g., `threshold: 5000`, `condition: "lt"`
"Backfill my step count for the past week"
1. `grafana_push_metrics` — push 7 data points with a `timestamp` on each: `{ "metrics": [{ "name": "steps", "value": 8000, "timestamp": "2025-01-13" }, { "name": "steps", "value": 10500, "timestamp": "2025-01-14" }, ...] }`
2. `grafana_create_dashboard` with `template: "metric-explorer"` — visualize the trend
3. `grafana_share_dashboard` with `from: "now-7d"` — deliver the chart
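The backfill payload above can be generated rather than typed by hand. A minimal sketch, assuming the per-day step counts are known (the counts and start date are hypothetical; the `metrics` shape follows the recipe's example):

```python
# Sketch: build a 7-day backfill payload for grafana_push_metrics.
# Counts and start date are hypothetical; each point carries its own timestamp.
from datetime import date, timedelta

counts = [8000, 10500, 9200, 7600, 11000, 9800, 10200]
start = date(2025, 1, 13)

payload = {
    "metrics": [
        {"name": "steps", "value": v, "timestamp": (start + timedelta(days=i)).isoformat()}
        for i, v in enumerate(counts)
    ]
}
print(payload["metrics"][0])
# → {'name': 'steps', 'value': 8000, 'timestamp': '2025-01-13'}
```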
"What custom data have I pushed?"
1. `grafana_push_metrics` with `action: "list"` — see all custom metric definitions with a `queryName` per metric
2. `grafana_query` using the `queryName` values — see current values
   - Or `grafana_list_metrics` with `search: "ext"` — server-side discovery from Prometheus
"System health check — give me a full status report"
1. `grafana_explore_datasources` — discover available datasources
2. `grafana_check_alerts` — check pending alerts + `action: "list_rules"` for configured rules
3. `grafana_search` with `query: ""` and `enrich: true` — find all dashboards with `updatedAt` and `panelCount` for staleness triage
4. `grafana_get_dashboard` with `audit: true` for stale or suspicious dashboards — check which panels have data vs broken
5. Summarize: datasources available, alerts firing, dashboard health (`auditSummary`), stale dashboards, broken panels
"Audit my dashboard — which panels are broken?"
1. `grafana_get_dashboard` with `audit: true` — dry-runs every panel's query
2. Review `auditSummary` for counts (`ok`, `nodata`, `error`, `skipped`)
3. For `error` panels, explain the issue; for `nodata` panels, suggest fixes (missing metrics, wrong datasource)
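The triage in step 3 can be sketched as a small classifier over the audit counts. The `audit_summary` dict below is hypothetical example data using the field names from this recipe:

```python
# Sketch: turn auditSummary counts into a one-line health verdict.
# The counts are hypothetical example data.
audit_summary = {"ok": 9, "nodata": 2, "error": 1, "skipped": 0}

total = sum(audit_summary.values())
broken = audit_summary["error"]
empty = audit_summary["nodata"]

if broken:
    verdict = f"{broken}/{total} panels erroring — fix queries/datasources first"
elif empty:
    verdict = f"{empty}/{total} panels return no data — check metric names"
else:
    verdict = "all panels healthy"
print(verdict)
```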
"What is this metric doing?"
1. `grafana_explore_datasources` — get the Prometheus UID
2. `grafana_explain_metric` with the metric name — get current value, trend, stats, metadata
3. Interpret results for the user — the tool provides data, you provide narrative
"Compare costs this week vs. last week" / "Did the new model help?"
1. `grafana_explore_datasources` — get the Prometheus UID
2. `grafana_explain_metric` with `period: "7d"`, `compareWith: "previous"` — one call gives this week's stats + last week's stats + percentage change
3. Interpret the `comparison.change` object — "costs are down 25% week-over-week, the model switch is working"
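The `comparison.change` interpretation boils down to a percentage delta. A sketch with hypothetical weekly cost sums (the real numbers come from `grafana_explain_metric` with `compareWith: "previous"`):

```python
# Sketch: week-over-week cost comparison.
# The two weekly totals are hypothetical USD values.
this_week, last_week = 18.0, 24.0

change_pct = (this_week - last_week) / last_week * 100
direction = "down" if change_pct < 0 else "up"
print(f"costs are {direction} {abs(change_pct):.0f}% week-over-week")
# → costs are down 25% week-over-week
```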
"Compare metric over different time scales"
1. `grafana_explain_metric` with `period: "24h"` — recent trend
2. `grafana_explain_metric` with `period: "7d"` — weekly context
3. Compare and explain — "up 15% today but down 5% over the week"
"Where are my costs going?" / "Why did my costs spike?"
1. `grafana_explore_datasources` — get the Prometheus UID
2. `grafana_explain_metric` with `openclaw_lens_cost_by_token_type` — get trend + `suggestedBreakdowns` (always: `["model", "token_type", "provider"]`) + `suggestedQueries`
3. Use `suggestedBreakdowns` to know which dimensions matter, or pick from `suggestedQueries` for ready-to-use PromQL
4. `grafana_query` with `sum by (model) (rate(openclaw_lens_cost_by_token_type[5m]))` — get concrete breakdown numbers
5. Explain to the user: "Opus accounts for 80% of costs, mainly from output tokens"
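Once `grafana_query` returns the per-model breakdown, the "accounts for 80%" summary is simple arithmetic. A sketch with hypothetical per-model cost rates:

```python
# Sketch: summarize a sum by (model) cost breakdown into shares.
# The per-model rates are hypothetical query results (USD/min).
by_model = {"opus": 0.40, "sonnet": 0.08, "haiku": 0.02}

total = sum(by_model.values())
shares = {m: v / total * 100 for m, v in by_model.items()}
top = max(shares, key=shares.get)
print(f"{top} accounts for {shares[top]:.0f}% of costs")
# → opus accounts for 80% of costs
```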
"Set up AI observability" / "Show me LLM traces and logs"
1. `grafana_explore_datasources` — verify Prometheus, Loki, and Tempo datasources are available
2. `grafana_search` with `query: "LLM Command Center"` — check whether it already exists
3. `grafana_create_dashboard` with `template: "llm-command-center"` — health overview with live error/session feeds
4. `grafana_create_dashboard` with `template: "session-explorer"` — per-session trace drill-down
5. `grafana_share_dashboard` — deliver an LLM latency or token usage panel inline
"Why is my LLM slow?" (multi-signal investigation)
1. `grafana_explain_metric` with `gen_ai_client_operation_duration_seconds` — check the latency trend
2. `grafana_query_logs` with `{service_name="openclaw"} | json | component="lifecycle" |= "llm.output"`, `extractFields: true` — find slow LLM calls with `fields.duration_s` and `fields.model` surfaced directly
3. `grafana_query_traces` with `queryType: "get"` and `fields.trace_id` from step 2 — inspect spans; the response includes `correlationHint.logQuery` for correlated logs
4. `grafana_query` with a model-specific duration breakdown — identify which model is slow
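Roughly what step 2 surfaces via `extractFields` can be approximated locally: parse JSON log lines and sort by duration. The log lines and field names below are hypothetical illustrations, not the tool's actual output format:

```python
# Sketch: surface the slowest LLM calls from JSON log lines.
# Lines and field names are hypothetical illustrations.
import json

lines = [
    '{"event": "llm.output", "duration_s": 2.1, "model": "haiku", "trace_id": "a1"}',
    '{"event": "llm.output", "duration_s": 19.4, "model": "opus", "trace_id": "b2"}',
    '{"event": "llm.output", "duration_s": 4.7, "model": "sonnet", "trace_id": "c3"}',
]

calls = sorted((json.loads(l) for l in lines), key=lambda c: -c["duration_s"])
slowest = calls[0]
print(f"slowest: {slowest['model']} {slowest['duration_s']}s (trace {slowest['trace_id']})")
```

The `trace_id` of the slowest call is what feeds step 3's `queryType: "get"` lookup.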
"Debug a slow or failing session"
1. `grafana_query_traces` with `{ resource.service.name = "openclaw" && status = error }` or `minDuration: "10s"` — find problematic traces
2. `grafana_query_traces` with `queryType: "get"` and the trace ID — inspect the span hierarchy (the response includes a `correlationHint`)
3. `grafana_query_logs` with the `correlationHint.logQuery` from step 2, `extractFields: true` — correlated logs
4. `grafana_query` with relevant PromQL — check metrics around the same time window
5. `grafana_annotate` with `text: "Debug: [findings]"`, `tags: ["debug"]` — mark the investigation on dashboards
"Which dashboards are stale?" / "Give me a dashboard summary report"
1. `grafana_search` with `query: ""` and `enrich: true` — gets `updatedAt`, `panelCount`, and `folderTitle` for every dashboard
2. Sort by `updatedAt` to identify stale dashboards (e.g., not updated in 30+ days)
3. Summarize: total dashboards, by folder, stale vs active, panel counts
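The staleness cut in step 2 can be sketched over hypothetical enriched search results (the dashboard entries and the fixed "now" are example data):

```python
# Sketch: flag dashboards not updated in 30+ days from enriched
# grafana_search results. Entries and "now" are hypothetical.
from datetime import datetime, timedelta, timezone

now = datetime(2025, 3, 1, tzinfo=timezone.utc)
dashboards = [
    {"title": "API Overview", "updatedAt": "2025-02-27T10:00:00Z"},
    {"title": "Old Batch Jobs", "updatedAt": "2024-11-05T09:30:00Z"},
]

def parse(ts):
    # fromisoformat in older Pythons rejects a trailing "Z"
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))

stale = [d["title"] for d in dashboards
         if now - parse(d["updatedAt"]) > timedelta(days=30)]
print("stale:", stale)
# → stale: ['Old Batch Jobs']
```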
"Am I being attacked?" / "Security status" / "Security audit"
1. `grafana_security_check` — runs 6 parallel security checks and returns a threat level (green/yellow/red)
2. If non-green: follow `suggestedActions` from the response
3. If `dashboardTemplate` is present: `grafana_create_dashboard` with `template: "security-overview"` — persistent monitoring
4. `grafana_create_alert` using the suggested thresholds for ongoing monitoring
"Set up security monitoring"
1. `grafana_check_alerts` with `action: "setup"` — create a webhook contact point (if not already done)
2. `grafana_create_dashboard` with `template: "security-overview"` — 15-panel security dashboard
3. `grafana_create_alert` with `expr: "rate(openclaw_lens_webhook_error_total[5m]) / (rate(openclaw_lens_webhook_received_total[5m]) + 0.001)"`, `threshold: 0.3`, `condition: "gt"` — webhook error burst
4. `grafana_create_alert` with `expr: "openclaw_lens_daily_cost_usd"`, `threshold: 10`, `condition: "gt"` — cost spike
5. `grafana_create_alert` with `expr: "sum(increase(openclaw_lens_prompt_injection_signals_total[1h]))"`, `threshold: 3`, `condition: "gt"` — prompt injection signals
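The webhook error-burst expression in step 3 adds `+ 0.001` to the denominator so the ratio stays defined when no webhooks arrive. The same ratio in Python, with hypothetical per-second rates:

```python
# Sketch: the webhook error-burst ratio from the alert expr above.
# The + 0.001 guard avoids division by zero when traffic is idle.
# Rates are hypothetical per-second values.
errors_rate = 0.8
received_rate = 2.0

ratio = errors_rate / (received_rate + 0.001)
firing = ratio > 0.3  # threshold: 0.3, condition: "gt"
print(f"error ratio {ratio:.2f}, firing={firing}")
```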
"Investigate security alert"
1. `grafana_security_check` — get the current threat assessment
2. `grafana_query_logs` with `{service_name="openclaw"} | json | component="lifecycle" | event_name=~"prompt_injection.detected|gateway.start|gateway.stop"` — security event logs
3. `grafana_query` with specific PromQL for the affected check (e.g., `rate(openclaw_lens_tool_error_classes_total[5m])`)
4. `grafana_annotate` with `text: "Security investigation"`, `tags: ["security"]` — mark the investigation on dashboards
5. `grafana_check_alerts` with `action: "acknowledge"` — mark alerts as investigated
"Clean up old test dashboards"
1. `grafana_search` with `query: "test"` or relevant tags, and `enrich: true` — includes `updatedAt` for the staleness check
2. Confirm with the user which dashboards to delete
3. `grafana_update_dashboard` with `operation: "delete"` for each confirmed dashboard
"Find my production dashboards"
1. `grafana_search` with `tags: ["production"]` or `starred: true`
"Why is the queue backing up?"
1. `grafana_explore_datasources` — get the Prometheus UID
2. `grafana_query` with `openclaw_lens_queue_depth` — current queue depth
3. `grafana_query` with `sum(rate(openclaw_message_queued_total[5m])) by (openclaw_source)` — message inflow by source
4. `grafana_query` with `histogram_quantile(0.95, rate(openclaw_queue_wait_ms_milliseconds_bucket[5m]))` — p95 queue wait time
5. `grafana_query` with `sum(rate(openclaw_queue_lane_enqueue_total[5m]))` vs `sum(rate(openclaw_queue_lane_dequeue_total[5m]))` — enqueue vs dequeue rate
6. `grafana_query` with `openclaw_lens_sessions_stuck` — check for stuck sessions blocking the queue
7. Diagnose: if inflow > drain → scale issue; if stuck sessions → investigate with `grafana_explain_metric`
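The diagnosis in the final step is a comparison of inflow against drain, checked after ruling out stuck sessions. A sketch with hypothetical query results:

```python
# Sketch: queue backlog diagnosis from the queries above.
# Rates and the stuck-session count are hypothetical values.
enqueue_rate = 12.0   # msgs/s, from the enqueue query
dequeue_rate = 7.5    # msgs/s, from the dequeue query
stuck_sessions = 0    # from openclaw_lens_sessions_stuck

if stuck_sessions > 0:
    diagnosis = "stuck sessions blocking the queue — investigate them first"
elif enqueue_rate > dequeue_rate:
    diagnosis = "inflow exceeds drain — scale consumers or shed load"
else:
    diagnosis = "queue should drain — backlog is transient"
print(diagnosis)
```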
Dashboard Composition
For composing custom dashboards from discovered metrics — template selection, metric-to-panel mapping, JSON snippets, worked examples across domains — see references/dashboard-composition.md.
Agent Metrics
For metric names, types, labels, and common PromQL — covering both
openclaw_* (from diagnostics-otel) and openclaw_lens_* (from grafana-lens) — see references/agent-metrics.md.
External Data
For external data naming conventions and integration patterns, see references/external-data.md.
SRE Investigation Patterns
For 5-phase investigation methodology, anomaly detection PromQL, RED/USE method queries, LogQL investigation discipline, TraceQL debugging, SLI/SLO burn rates, cross-signal correlation, and composed investigation workflows — see references/sre-investigation.md.