Claude-skill-registry instrumentation-planning
Plan instrumentation strategy before implementation, covering what to instrument, naming conventions, cardinality management, and instrumentation budget
install
source · Clone the upstream repo
git clone https://github.com/majiayu000/claude-skill-registry
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/instrumentation-planning" ~/.claude/skills/majiayu000-claude-skill-registry-instrumentation-planning && rm -rf "$T"
manifest:
skills/data/instrumentation-planning/SKILL.mdsource content
Instrumentation Planning
Strategic planning for application instrumentation before implementation.
When to Use This Skill
- Planning instrumentation for new services
- Reviewing instrumentation strategy
- Establishing naming conventions
- Managing telemetry cardinality
- Setting instrumentation budgets
Instrumentation Strategy Framework
What to Instrument
Instrumentation Layers: ┌─────────────────────────────────────────────────────────────────┐ │ Layer 1: Automatic/Library Instrumentation │ │ - HTTP clients/servers (auto-captured) │ │ - Database clients (auto-captured) │ │ - Message queue clients (auto-captured) │ │ - Framework-provided metrics │ │ Effort: Low | Coverage: Broad | Customization: Limited │ ├─────────────────────────────────────────────────────────────────┤ │ Layer 2: Business Transaction Instrumentation │ │ - Key user journeys │ │ - Business operations (checkout, signup, etc.) │ │ - Revenue-generating flows │ │ - SLA-bound operations │ │ Effort: Medium | Coverage: Targeted | Value: High │ ├─────────────────────────────────────────────────────────────────┤ │ Layer 3: Debug/Diagnostic Instrumentation │ │ - Algorithmic hot paths │ │ - Cache behavior │ │ - Circuit breaker states │ │ - Retry/fallback paths │ │ Effort: Medium | Coverage: Deep | Use: Troubleshooting │ ├─────────────────────────────────────────────────────────────────┤ │ Layer 4: Business Metrics │ │ - Domain-specific counters │ │ - Conversion rates │ │ - Feature usage │ │ - Customer behavior │ │ Effort: High | Coverage: Custom | Value: Business Insights │ └─────────────────────────────────────────────────────────────────┘
Instrumentation Decision Matrix
instrumentation_decisions: always_instrument: - "Inbound HTTP/gRPC requests" - "Outbound HTTP/gRPC calls" - "Database queries" - "Message publish/consume" - "Authentication/authorization" - "External API calls" - "Cache operations" consider_instrumenting: - "Complex business logic" - "Feature flags evaluation" - "Background jobs" - "Scheduled tasks" - "File I/O operations" - "CPU-intensive operations" avoid_instrumenting: - "Every method call (too noisy)" - "Tight loops (performance impact)" - "Data transformation (low value)" - "Validation helpers" - "Utility functions" decision_criteria: business_value: weight: 0.3 question: "Does this help understand business outcomes?" debugging_value: weight: 0.25 question: "Does this help diagnose production issues?" slo_relevance: weight: 0.25 question: "Does this contribute to SLI measurement?" cost_impact: weight: 0.2 question: "Is the cardinality/volume acceptable?"
Naming Conventions
Metric Naming
metric_naming: format: "[namespace]_[subsystem]_[name]_[unit]" rules: case: "snake_case" unit_suffix: "Always include (_seconds, _bytes, _total)" base_units: "Use base units (seconds not milliseconds)" counter_suffix: "_total for counters" examples: good: - "http_server_requests_total" - "http_server_request_duration_seconds" - "http_server_response_size_bytes" - "db_connections_current" - "order_processing_duration_seconds" - "payment_transactions_total" bad: - "requests (no unit, no namespace)" - "HttpRequestDuration (wrong case)" - "order_latency_ms (use base units)" - "totalOrders (camelCase, no unit)" label_naming: case: "snake_case" avoid: - "Embedded values in name (path=/users)" - "High cardinality labels" good_labels: - "method, status_code, path" - "service, version, environment" bad_labels: - "user_id (high cardinality)" - "request_id (high cardinality)" - "timestamp (not a dimension)"
Span Naming
span_naming: format: "[operation] [resource]" rules: - "Use verb + noun pattern" - "Keep names low cardinality" - "Include operation type, not specific values" - "Be consistent across services" examples: http: pattern: "HTTP {METHOD} {route_template}" good: "HTTP GET /users/{id}" bad: "HTTP GET /users/12345" database: pattern: "{operation} {table}" good: "SELECT orders" bad: "SELECT * FROM orders WHERE id=123" messaging: pattern: "{operation} {queue/topic}" good: "PUBLISH order-events" bad: "publish message to order-events queue" rpc: pattern: "{service}/{method}" good: "OrderService/CreateOrder" bad: "grpc call to order service" attributes: required: - "service.name" - "service.version" - "deployment.environment" recommended: http: - "http.method" - "http.route" - "http.status_code" - "http.target" database: - "db.system" - "db.name" - "db.operation" - "db.statement (sanitized)" messaging: - "messaging.system" - "messaging.destination" - "messaging.operation"
Log Field Naming
log_naming: format: "snake_case for all fields" standard_fields: timestamp: "ISO 8601 format" level: "INFO, WARN, ERROR, etc." message: "Human-readable description" service: "Service name" trace_id: "Correlation ID" span_id: "Current span" domain_fields: pattern: "{domain}_{field}" examples: - "order_id" - "customer_id" - "payment_amount" - "product_sku" avoid: - "Nested objects (flatten for indexing)" - "Arrays of unknown length" - "Large text blobs" - "Sensitive data (PII, secrets)"
Cardinality Management
Understanding Cardinality
Cardinality = Number of unique time series Example: http_requests_total{method="GET", path="/api/users", status="200"} Cardinality = methods × paths × statuses = 5 × 100 × 10 = 5,000 time series With user_id (1M users): = 5 × 100 × 10 × 1,000,000 = 5,000,000,000 time series ← EXPLOSION!
Cardinality Budget
cardinality_budget: planning: total_budget: 100000 # Target max time series per service allocation: automatic_instrumentation: 30% # 30,000 business_transactions: 40% # 40,000 custom_metrics: 20% # 20,000 buffer: 10% # 10,000 per_metric_limits: low_cardinality: max_series: 100 example: "status codes, methods" medium_cardinality: max_series: 1000 example: "endpoints, operations" high_cardinality: max_series: 10000 example: "aggregated by hour" requires: "Justification and approval" monitoring: - "Alert on cardinality growth > 10% per day" - "Weekly cardinality reviews" - "Automatic label value limiting"
Cardinality Reduction Techniques
cardinality_reduction: bucketing: before: "path=/users/12345" after: "path=/users/{id}" technique: "Path template extraction" sampling: description: "Sample high-volume, low-value traces" strategies: head_sampling: "Decide at trace start" tail_sampling: "Decide after seeing full trace" adaptive: "Adjust rate based on volume" aggregation: description: "Pre-aggregate before export" example: "Count by status, not per request" value_limiting: description: "Cap unique values per label" example: "Max 100 unique paths, then 'other'" dropping: description: "Drop low-value dimensions" candidates: - "Instance ID (use service name)" - "Request ID (not for metrics)" - "Full URLs (use route templates)"
Instrumentation Budget
Performance Impact
performance_budget: cpu_overhead: target: "< 1% CPU increase" measurement: "Profile with/without instrumentation" memory_overhead: target: "< 50MB additional heap" components: - "Metric registries" - "Span buffers" - "Log buffers" latency_overhead: target: "< 1ms per request" hot_paths: "< 100μs" data_volume: metrics: target: "< 1GB/day per service" calculation: "series × scrape_interval × 8 bytes" traces: target: "< 10GB/day per service (with sampling)" sampling_rate: "1-10% for high-volume services" logs: target: "< 5GB/day per service" strategies: "Sampling, level gating"
Cost Planning
cost_planning: estimation_formula: metrics: monthly_cost: "time_series × $0.003 (typical cloud pricing)" example: "10,000 series × $0.003 = $30/month" traces: monthly_cost: "spans_per_month × $0.000005" example: "100M spans × $0.000005 = $500/month" logs: monthly_cost: "GB_per_month × $0.50" example: "500GB × $0.50 = $250/month" optimization_strategies: - "Increase scrape interval (15s → 60s)" - "Reduce trace sampling rate" - "Log level gating in production" - "Shorter retention for debug data" - "Downsample old metrics"
Instrumentation Plan Template
instrumentation_plan: service: "{Service Name}" version: "1.0" date: "{Date}" owner: "{Team}" objectives: - "Track SLIs for order processing" - "Enable distributed tracing for debugging" - "Monitor payment success rates" automatic_instrumentation: framework: "OpenTelemetry .NET" enabled: - "ASP.NET Core (HTTP server)" - "HttpClient (HTTP client)" - "Entity Framework Core (database)" - "Azure.Messaging.ServiceBus" configuration: sampling_rate: 0.1 # 10% of traces batch_export_interval: 5000 # ms custom_spans: - name: "ProcessOrder" purpose: "Track order processing duration" attributes: - "order.id" - "order.item_count" - "order.total_amount" events: - "inventory.reserved" - "payment.processed" - name: "ValidatePayment" purpose: "Track payment validation steps" attributes: - "payment.method" - "payment.provider" sensitive: false custom_metrics: counters: - name: "orders_total" labels: ["status", "payment_method"] purpose: "Count orders by outcome" cardinality_estimate: 20 - name: "payment_failures_total" labels: ["reason", "provider"] purpose: "Track payment failure reasons" cardinality_estimate: 50 histograms: - name: "order_processing_duration_seconds" labels: ["order_type"] purpose: "Track order processing latency" buckets: [0.1, 0.5, 1, 2, 5, 10] cardinality_estimate: 10 gauges: - name: "pending_orders_current" labels: [] purpose: "Current pending order count" cardinality_estimate: 1 cardinality_summary: estimated_total: 81 budget: 1000 status: "Within budget" log_strategy: production_level: "INFO" structured_fields: standard: - "trace_id" - "span_id" - "service" - "environment" domain: - "order_id" - "customer_id (hashed)" sampling: debug_logs: "1% in production" cost_estimate: monthly: metrics: "$30" traces: "$200" logs: "$150" total: "$380" review_schedule: frequency: "Quarterly" metrics_to_review: - "Cardinality growth" - "Data volume" - "Cost vs budget"
Related Skills
- Three pillars overviewobservability-patterns
- Trace implementation detailsdistributed-tracing
- What to measure for SLOsslo-sli-error-budget
Last Updated: 2025-12-26