Learn-skills.dev monitoring-operations
Use when setting up OCI metrics, alarms, or log collection, or troubleshooting missing data and silent alarms. Covers metric namespace naming, MQL dimension requirements, alarm missing-data handling, Service Connector IAM gaps, and Cloud Guard integration. KEYWORDS: monitoring, alarm, metric, MQL, namespace, log, Service Connector, Log Analytics, Cloud Guard, missing data, oci_computeagent.
OCI Monitoring and Observability - Expert Knowledge
NEVER Do This
NEVER debug "missing metrics" within the first 15 minutes
- Metrics are published every 1–5 minutes
- Processing delay adds another 5–10 minutes
- Total lag from event to visible metric: 10–15 minutes
- Premature debugging creates false investigations
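The wait rule above can be written as a small guard. This is only a sketch: the name `too_early_to_debug` and the 15-minute default are illustrative, derived from the lag figures in the list above, not from any OCI API.

```python
from datetime import datetime, timedelta, timezone

# Worst case from the figures above: 5 min publish interval + 10 min processing.
MAX_METRIC_LAG = timedelta(minutes=15)


def too_early_to_debug(event_time: datetime, now: datetime,
                       max_lag: timedelta = MAX_METRIC_LAG) -> bool:
    """Return True while the metric may simply still be in flight."""
    return now - event_time < max_lag


started = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
print(too_early_to_debug(started, started + timedelta(minutes=8)))
print(too_early_to_debug(started, started + timedelta(minutes=20)))
```

Eight minutes after the event the guard says to keep waiting; at twenty minutes a missing metric is worth investigating.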
NEVER use `=` for alarm thresholds with sparse metrics
```
# WRONG - alarm never fires when metric has data gaps
MetricName[1m].mean() = 0

# RIGHT - handle missing data explicitly
MetricName[1m]{dataMissing=zero}.mean() > 0
```
NEVER omit the `resourceId` dimension in metric queries
```
# WRONG - no resourceId filter, so the query is not scoped to one instance
CPUUtilization[1m].mean()

# RIGHT - filter by instance OCID
CPUUtilization[1m]{resourceId="<instance-ocid>"}.mean()
```
Querying without dimensions returns data for ALL resources — usually not what's intended, and rate-limited at 1000 req/min.
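One way to make the `resourceId` rule hard to forget is to build query strings through a helper that refuses to emit an unfiltered expression. The sketch below is plain string formatting; `build_mql` is a hypothetical name, and the metric name and OCID are placeholders in the style used in this document.

```python
def build_mql(metric: str, interval: str, statistic: str,
              dimensions: dict) -> str:
    """Assemble an MQL expression, rejecting queries with no resourceId filter."""
    if "resourceId" not in dimensions:
        raise ValueError("resourceId dimension is required to scope the query")
    filters = ", ".join(f'{key}="{value}"' for key, value in dimensions.items())
    # Doubled braces produce the literal { } that wrap the dimension filter.
    return f"{metric}[{interval}]{{{filters}}}.{statistic}()"


query = build_mql("CPUUtilization", "1m", "mean",
                  {"resourceId": "<instance-ocid>"})
print(query)
```

Callers that pass an empty dimension map get a `ValueError` instead of a query that silently spans every resource.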
NEVER set alarm thresholds without a trigger delay
```
# BAD - fires on every transient CPU spike (alert fatigue)
CPUUtilization[1m].mean() > 80

# BETTER - fires only on sustained breach
CPUUtilization[5m].mean() > 80
# + set trigger delay: 5 minutes (5 consecutive breaches)
```
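To see why a trigger delay suppresses transient spikes, the following sketch models the "consecutive breaches" idea in plain Python. It is a simplified illustration, not a description of how OCI evaluates alarms internally; `alarm_fires` and the sample values are made up for the example.

```python
def alarm_fires(samples: list, threshold: float, pending: int) -> bool:
    """Return True once `pending` consecutive samples exceed `threshold`."""
    streak = 0
    for value in samples:
        # A breaching sample extends the streak; any other sample resets it.
        streak = streak + 1 if value > threshold else 0
        if streak >= pending:
            return True
    return False


spiky = [40, 95, 42, 41, 90, 43]      # isolated one-minute spikes
sustained = [85, 88, 91, 86, 90]      # five breaching minutes in a row

print(alarm_fires(spiky, 80, pending=1))      # no delay: the spike alerts
print(alarm_fires(spiky, 80, pending=5))      # 5-sample delay: stays quiet
print(alarm_fires(sustained, 80, pending=5))  # sustained breach: alerts
```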
NEVER create alarms without notification destinations
```
# WRONG - alarm fires but nobody is notified
oci monitoring alarm create ... --destinations '[]'

# RIGHT - always link to a notification topic
oci monitoring alarm create ... --destinations '["<notification-topic-ocid>"]'
```
Cost impact: undetected production outages = $5,000–50,000+/hour.
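A wrapper that assembles the CLI arguments can enforce the destination rule before anything reaches OCI. This is a sketch under assumptions: `alarm_create_args` is a hypothetical helper, and the flag names are written from memory of the OCI CLI, so confirm them with `oci monitoring alarm create --help` before relying on them.

```python
import json


def alarm_create_args(display_name: str, compartment_id: str, namespace: str,
                      query: str, destinations: list,
                      severity: str = "CRITICAL",
                      pending_duration: str = "PT5M") -> list:
    """Build an argv list for the alarm-create command; empty destinations are rejected."""
    if not destinations:
        raise ValueError("alarm needs at least one notification topic OCID")
    return [
        "oci", "monitoring", "alarm", "create",
        "--display-name", display_name,
        "--compartment-id", compartment_id,
        "--metric-compartment-id", compartment_id,
        "--namespace", namespace,
        "--query-text", query,
        "--severity", severity,
        "--pending-duration", pending_duration,  # ISO 8601 duration (assumed flag)
        "--is-enabled", "true",
        "--destinations", json.dumps(destinations),
    ]


args = alarm_create_args(
    "high-cpu", "<compartment-ocid>", "oci_computeagent",
    'CPUUtilization[5m]{resourceId="<instance-ocid>"}.mean() > 80',
    ["<notification-topic-ocid>"],
)
print(" ".join(args))
```

The list form can be handed to `subprocess.run` directly, which avoids shell-quoting problems with the JSON destinations array.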
NEVER ignore Cloud Guard findings
- Cloud Guard detects misconfigurations before they become incidents
- Wire it: Cloud Guard → Notifications → email/Slack/PagerDuty
- Unresolved findings fail CIS/SOC2/HIPAA audits
Metric Namespace Reference
OCI uses service-specific namespaces — using the wrong namespace returns no data with no error.
| Service | Namespace | Key Metrics |
|---|---|---|
| Compute | `oci_computeagent` | `CPUUtilization` |
| Autonomous DB | `oci_autonomous_database` | |
| Load Balancer | `oci_lbaas` | |
| Object Storage | `oci_objectstorage` | |
Common mistake: using `oci_compute` instead of `oci_computeagent`. The agent namespace requires the OCI Compute Agent to be running on the instance.
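Because a wrong namespace returns no data and no error, a preflight check on the namespace string can save a debugging session. The sketch below covers only the mistake named above; `KNOWN_MISTAKES` and `check_namespace` are hypothetical names, and the mapping is meant to be extended as needed.

```python
# Misspelled namespace -> intended namespace (only the case named in the text).
KNOWN_MISTAKES = {"oci_compute": "oci_computeagent"}


def check_namespace(namespace: str) -> str:
    """Return the namespace unchanged, or raise with a suggested correction."""
    if namespace in KNOWN_MISTAKES:
        suggestion = KNOWN_MISTAKES[namespace]
        raise ValueError(f"'{namespace}' returns no data; did you mean '{suggestion}'?")
    return namespace


print(check_namespace("oci_computeagent"))
```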
Alarm Missing Data Handling
| Setting | Behavior | Use When |
|---|---|---|
| | Alarm fires if no data arrives | Critical services (silence = outage) |
| | Alarm silent if no data | Optional or intermittent monitoring |
| `{dataMissing=zero}` in MQL | Treats gaps as 0 value | Request counters, throughput metrics |
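The difference between ignoring gaps and treating them as zero is easiest to see on a small window of samples. The following is a plain-Python model of the idea, not OCI's evaluation logic; `window_mean` is a hypothetical helper and `None` stands in for a missing datapoint.

```python
from typing import Optional


def window_mean(samples: list, gaps_as_zero: bool) -> Optional[float]:
    """Mean over a window in which None marks a missing datapoint."""
    if gaps_as_zero:
        values = [0.0 if s is None else s for s in samples]
    else:
        values = [s for s in samples if s is not None]
    if not values:
        return None  # nothing to compare against a threshold
    return sum(values) / len(values)


silent = [None, None, None]
print(window_mean(silent, gaps_as_zero=False))  # no value, so no comparison
print(window_mean(silent, gaps_as_zero=True))   # 0.0, a value a threshold can test

partial = [10.0, None, 20.0]
print(window_mean(partial, gaps_as_zero=False))
print(window_mean(partial, gaps_as_zero=True))
```

For a request counter, the zero-filled result matches reality (no requests arrived). For a gauge such as CPU, zero-filling would understate the true value, which is why the table reserves it for counters and throughput metrics.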
Log Collection Troubleshooting
```
Logs not appearing in Log Analytics?
│
├─ Is logging enabled on the resource?
│  └─ Compute: is oci-compute-agent running? (systemctl status oracle-cloud-agent)
│  └─ Functions: is logging enabled in function configuration?
│
├─ Is Service Connector configured and ACTIVE?
│  └─ Source: Log Group → Target: Log Analytics
│  └─ Check status: oci sch service-connector get --id <ocid>
│
├─ IAM policy for Service Connector?
│  └─ "Allow any-user to use log-content in tenancy"
│  └─ "Allow service loganalytics to READ logcontent in tenancy"
│  └─ Missing EITHER policy causes silent failure
│
└─ 10–15 minute ingestion lag?
   └─ Wait before concluding logs are missing
```
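The tree above is an ordered checklist: stop at the first check that fails. A minimal sketch of that ordering follows. The state keys and the `first_failing_check` name are hypothetical, and the booleans are assumed to come from whatever manual or scripted checks you run.

```python
from typing import Optional

# Ordered as in the troubleshooting tree: (state key, advice when the check fails).
CHECKS = [
    ("logging_enabled", "Enable logging on the resource"),
    ("connector_active", "Configure the Service Connector (Log Group -> Log Analytics) and confirm it is ACTIVE"),
    ("iam_policies_present", "Add both IAM policies; missing either one fails silently"),
    ("waited_for_ingestion", "Wait 10-15 minutes before concluding logs are missing"),
]


def first_failing_check(state: dict) -> Optional[str]:
    """Return advice for the first failed check, or None if every check passes."""
    for key, advice in CHECKS:
        if not state.get(key, False):
            return advice
    return None


print(first_failing_check({"logging_enabled": True, "connector_active": True}))
```

A key that is absent from the state dictionary counts as a failed check, so an incomplete investigation never reports "all clear".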
Metric Query Performance
Unfiltered queries scan ALL resources in the compartment, which is slow and consumes rate-limit budget.
```
# Expensive: scans all instances
CPUUtilization[1m].mean()

# Optimized: filter to specific instance
CPUUtilization[1m]{resourceId='<instance-ocid>'}.mean()
```
Rate limit: 1000 metric queries/minute per tenancy. A dashboard with many unfiltered widgets can exhaust this.
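A quick estimate shows how a dashboard reaches that limit. The sketch assumes each widget issues one query per refresh per open viewer, which is a simplification; the function names are illustrative and the 1000/minute figure is the one quoted above.

```python
RATE_LIMIT_PER_MINUTE = 1000  # figure quoted in the text


def queries_per_minute(widgets: int, refresh_seconds: int, viewers: int = 1) -> float:
    """Estimate query load, assuming one query per widget per refresh per viewer."""
    return widgets * viewers * 60 / refresh_seconds


def exceeds_budget(widgets: int, refresh_seconds: int, viewers: int = 1) -> bool:
    return queries_per_minute(widgets, refresh_seconds, viewers) > RATE_LIMIT_PER_MINUTE


# 40 widgets refreshing every 30 s for 15 viewers: 40 * 15 * 2 = 1200 queries/min.
print(queries_per_minute(40, 30, viewers=15), exceeds_budget(40, 30, viewers=15))
# The same dashboard at a 60 s refresh for 5 viewers: 200 queries/min.
print(queries_per_minute(40, 60, viewers=5), exceeds_budget(40, 60, viewers=5))
```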
Progressive Loading Reference
Load `references/oci-monitoring-reference.md` when:
- Need the complete list of OCI service metric namespaces and metric names
- Writing complex MQL expressions (composites, functions, grouping)
- Implementing composite alarm conditions
- Setting up Log Analytics workspace, APM, or Service Connector Hub in detail
Do NOT load for alarm threshold patterns, namespace gotchas, or log troubleshooting — this file covers those.