NWave nw-production-readiness

Monitoring, observability, operational procedures, CI/CD lessons learned, and quality gate definitions. Load when assessing production readiness or validating operational excellence.

install
source · Clone the upstream repo
git clone https://github.com/nWave-ai/nWave
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/nWave-ai/nWave "$T" && mkdir -p ~/.claude/skills && cp -r "$T/nWave/skills/nw-production-readiness" ~/.claude/skills/nwave-ai-nwave-nw-production-readiness && rm -rf "$T"
manifest: nWave/skills/nw-production-readiness/SKILL.md
source content

Production Readiness

Monitoring and Observability

Application Monitoring

  • Performance: response time | throughput | latency percentiles (P50, P95, P99)
  • Resources: CPU | memory | database connections | cache hit rates
  • Errors: exception tracking | error rate trends | integration failure detection
  • Business: KPI tracking | conversion funnels | feature usage | revenue impact

Infrastructure Monitoring

Server/container health and resource utilization | Network performance and connectivity | Storage capacity and I/O performance | Security event detection.

Alerting Tiers

TierConditionResponse
PageService down, data loss risk, security breachImmediate response
UrgentError rate >2x baseline, latency SLA breachResponse within 15 min
WarningCapacity >80%, error rate trending upResponse within 1 hour
InfoDeployment complete, metric threshold crossedReview next business day

Operational Procedures

Incident Response

  1. Detect: automated alerting identifies issue
  2. Triage: classify severity, assign responder
  3. Communicate: notify stakeholders per severity level
  4. Resolve: apply fix or rollback
  5. Review: post-incident review within 48 hours
  6. Improve: update runbooks and monitoring based on findings

Maintenance Procedures

Regular update and patching schedule | Backup verification (test restores quarterly) | Security vulnerability scanning (automated, weekly) | Performance baseline recalibration (after major changes).

Knowledge Transfer

Operational runbooks for common procedures | Architecture documentation with system diagrams | Deployment procedures and configuration management | Troubleshooting guides for known failure modes.

Quality Gates for Production Readiness

Before declaring production-ready, all must pass:

  • All acceptance tests passing
  • Unit coverage meets project standard (default: >= 80%)
  • Integration tests validated
  • Performance validated under realistic load
  • Security scan completed (0 critical, 0 high)
  • Monitoring and alerting configured
  • Logging structured and searchable
  • Rollback procedure documented and tested
  • Runbook created for operational procedures
  • On-call team trained on new feature

For CI/CD architecture lessons and measurement coupling pitfalls, see

cicd-and-deployment
skill.