The Four Phases
You MUST complete each phase before proceeding to the next.
Phase 1: Constraints & Trade-offs (The "No")
BEFORE designing the system:
- Identify Non-Functional Requirements (NFRs)
  - Scalability: 1k users or 10M users?
  - Latency: Real-time (Gaming) or Batch (Reporting)?
  - Compliance: HIPAA? GDPR? PCI-DSS?
  - Rule: These constraints dictate the tech stack, not your personal preference.
- The "Buy vs. Build" Decision
  - Should we build a Chat system or use Stream.io?
  - Should we build Auth or use Auth0?
  - Rule: Only build your "Core Differentiator." Buy everything else.
- Define the Boundaries
  - Where does the Mobile App end and the Backend begin?
  - Monolith or Microservices? (Hint: Start with Monolith).
  - Output: A high-level context diagram showing system interactions.
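The scalability question above can be turned into a rough number before any design work starts. The following is a minimal back-of-envelope sketch; the function name and every input (requests per user per day, the peak factor of 3) are illustrative assumptions, not figures from this guide, and should be replaced with measured data.

```python
def estimate_peak_rps(daily_active_users: int,
                      requests_per_user_per_day: float,
                      peak_factor: float = 3.0) -> float:
    """Rough peak requests/second. All inputs are assumptions to be replaced."""
    seconds_per_day = 24 * 60 * 60
    average_rps = daily_active_users * requests_per_user_per_day / seconds_per_day
    # Traffic is never flat; peak_factor is a guessed ratio of peak to average.
    return average_rps * peak_factor


if __name__ == "__main__":
    for users in (1_000, 10_000_000):
        rps = estimate_peak_rps(users, requests_per_user_per_day=50)
        print(f"{users:>10,} users -> ~{rps:,.1f} peak req/s")
```

The gap between the two results is usually large enough to decide whether a single Postgres instance behind a monolith is sufficient or whether sharding and caching belong in the first design.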
Phase 1.5: Modern Architecture Patterns (2026)
Emerging patterns that change the game:
- Edge Compute Strategies
  - Cloudflare Workers / Vercel Edge: Run code closest to users (sub-50ms latency)
  - Use Cases: A/B testing, personalization, auth checks, rate limiting
  - Trade-off: Constrained runtime (no filesystem, tight per-request CPU-time limits that vary by platform and plan, 50ms on some)
  - When: Global apps needing <100ms response times
- WebAssembly (WASM) Backends
  - Compile Rust/Go to WASM: Run in serverless environments with near-native performance
  - Benefits: Much faster cold starts than containers (roughly 10x is commonly claimed; measure for your workload), tiny bundles
  - Tools: wasmtime, Spin (Fermyon), WASI
  - When: CPU-intensive serverless functions (image processing, crypto)
- Local-First Architecture
  - CRDTs: Conflict-free Replicated Data Types (eventual consistency)
  - Offline-First: App works without network, syncs when available
  - Tools: Replicache, ElectricSQL, PowerSync
  - When: Collaborative apps (Figma, Notion patterns)
- AI Integration Points
  - Where to embed LLMs: Content generation, search, recommendations
  - Vector Databases: pgvector (Postgres), Pinecone, Weaviate
  - Cost Management: Cache embeddings, batch requests, use smaller models
  - Latency: Stream responses (SSE/WebSockets), don't block UI
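To make the CRDT idea from the local-first item concrete, here is a minimal sketch of a grow-only counter (G-Counter), one of the simplest CRDTs. It is a teaching example only: the class and the replica names are hypothetical, and it does not reflect how Replicache, ElectricSQL, or PowerSync work internally.

```python
from dataclasses import dataclass, field


@dataclass
class GCounter:
    """Grow-only counter: each replica only increments its own slot."""
    replica_id: str
    counts: dict[str, int] = field(default_factory=dict)

    def increment(self, amount: int = 1) -> None:
        if amount < 0:
            raise ValueError("a G-Counter can only grow")
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + amount

    def merge(self, other: "GCounter") -> None:
        # Element-wise max is commutative, associative, and idempotent,
        # so replicas converge regardless of sync order or duplicate syncs.
        for replica, count in other.counts.items():
            self.counts[replica] = max(self.counts.get(replica, 0), count)

    @property
    def value(self) -> int:
        return sum(self.counts.values())


phone = GCounter("phone")
laptop = GCounter("laptop")
phone.increment(2)    # edits made offline on the phone
laptop.increment(3)   # concurrent edits on the laptop
phone.merge(laptop)   # sync in either order...
laptop.merge(phone)
assert phone.value == laptop.value == 5   # ...and both replicas agree
```

No coordination or locking is needed at write time; the cost is that only operations with a well-defined merge can be modeled this way.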
Phase 2: Technical Consensus
Aligning the experts:
- Write the RFC (Request for Comments)
  - Don't dictate. Write a design doc proposing the solution.
  - List "Alternatives Considered" and why they were rejected.
  - Review: Get sign-off from the Lead Backend, Lead Mobile, and DevOps.
- Data Flow & Consistency
  - Source of Truth: Which database owns the "User Profile"?
  - Consistency Model: Strong (Banking) or Eventual (Social Feed)?
  - Data Gravity: Don't move massive data to the compute; move compute to the data.
- Failure Mode Analysis
  - "What happens if the Payment Gateway is down?"
  - "What happens if the Redis Cache is flushed?"
  - Design for resilience, not perfection.
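One common answer to "What happens if the Payment Gateway is down?" is a circuit breaker with a fallback. The sketch below is an assumption-laden illustration, not a prescribed implementation: the class, its thresholds, and the `charge_via_gateway` / `queue_for_retry` functions are all hypothetical.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitBreaker:
    """Fails fast after repeated errors instead of hammering a dead dependency."""

    def __init__(self, max_failures: int = 3, reset_after_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self.clock = clock              # injectable so tests can control time
        self.failures = 0
        self.opened_at: Optional[float] = None

    def call(self, operation: Callable[[], T], fallback: Callable[[], T]) -> T:
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                return fallback()       # open: skip the dependency entirely
            self.opened_at = None       # cool-down elapsed: allow one trial call
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            return fallback()
        self.failures = 0               # a success closes the circuit
        return result


def charge_via_gateway() -> str:
    raise ConnectionError("payment gateway unreachable")  # simulated outage


def queue_for_retry() -> str:
    return "queued"                     # hypothetical degraded-mode behavior


breaker = CircuitBreaker(max_failures=2)
print([breaker.call(charge_via_gateway, queue_for_retry) for _ in range(3)])
```

The design choice worth debating in the RFC is the fallback itself: queueing a charge for retry is acceptable for some businesses and unacceptable for others.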
Phase 2.5: Observability-Driven Development
Build systems you can actually debug:
- Think in Traces, Not Logs
  - OpenTelemetry Standard: Unified traces, metrics, logs
  - Trace ID Propagation: Follow a request across 10 microservices
  - Span Attributes: Enrich with context (user_id, feature_flag, A/B test)
  - Tools: Jaeger, Tempo, Honeycomb, Datadog APM
- Structured Events Over Text Logs

      {
        "trace_id": "abc123",
        "span_id": "xyz789",
        "service": "payment-api",
        "event": "charge_created",
        "user_id": 42,
        "amount_cents": 1999,
        "duration_ms": 234
      }

  - Benefits: Query like a database, not grep
  - Anti-Pattern: "User 42 charged $19.99 in 234ms" (unparseable)
- Define SLOs Before SLAs
  - SLO: Internal target (99.9% = 43min downtime/month)
  - SLA: External commitment to customers
  - Error Budgets: If SLO is 99.9%, you have 0.1% to "spend" on deploys/experiments
  - Rule: If you burn 50% of error budget, freeze risky changes
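The error-budget arithmetic above is easy to check in code. This is a small sketch assuming a 30-day window and downtime-based accounting; the function names and the `freeze_at` parameter are illustrative, and many teams budget on failed requests rather than minutes.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for an availability SLO over the window."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction such as 0.999")
    return (1.0 - slo) * window_days * 24 * 60


def should_freeze(downtime_minutes: float, slo: float,
                  window_days: int = 30, freeze_at: float = 0.5) -> bool:
    """True once the consumed share of the budget reaches the freeze threshold."""
    consumed = downtime_minutes / error_budget_minutes(slo, window_days)
    return consumed >= freeze_at


budget = error_budget_minutes(0.999)
print(f"99.9% over 30 days allows {budget:.1f} min of downtime")
print("freeze after 25 min of downtime?", should_freeze(25, 0.999))
```

With a 99.9% SLO the budget is 0.001 × 30 × 24 × 60 = 43.2 minutes, so the 50% rule triggers a freeze after about 21.6 minutes of downtime in the window.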
Phase 3: Governance & Standards
Setting the rules of the road:
- Tech Radar
  - Adopt: TypeScript, Terraform, Postgres. (Safe to use).
  - Trial: Go, Rust. (Use with caution/permission).
  - Hold: MongoDB, PHP. (Do not use for new services).
- Standardize Interfaces
  - "All APIs must return standard HTTP status codes."
  - "All logs must be JSON."
  - "All services must expose a /metrics endpoint."
  - "MANDATORY: All code changes must undergo a deep-dive 'fine-toothed comb' review before deployment."
- Cost Governance
  - Review the architecture for hidden costs (e.g., Cross-AZ data transfer).
  - Set the budget ceiling per active user.
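A Tech Radar is most useful when it can be checked automatically, for example in a design-review script. The sketch below encodes the example rings from this section as data; the `review_stack` function and the "raise an RFC" rule for unlisted technologies are assumptions added for illustration.

```python
from enum import Enum


class Ring(Enum):
    ADOPT = "adopt"
    TRIAL = "trial"
    HOLD = "hold"


# Mirrors the example radar above; a real radar would live in a reviewed config file.
TECH_RADAR: dict[str, Ring] = {
    "typescript": Ring.ADOPT, "terraform": Ring.ADOPT, "postgres": Ring.ADOPT,
    "go": Ring.TRIAL, "rust": Ring.TRIAL,
    "mongodb": Ring.HOLD, "php": Ring.HOLD,
}


def review_stack(technologies: list[str]) -> list[str]:
    """Return human-readable findings for a proposed new service's stack."""
    findings = []
    for tech in technologies:
        ring = TECH_RADAR.get(tech.lower())
        if ring is None:
            # Assumed policy: anything unlisted needs a written proposal.
            findings.append(f"{tech}: not on the radar, raise an RFC")
        elif ring is Ring.HOLD:
            findings.append(f"{tech}: on Hold, do not use for new services")
        elif ring is Ring.TRIAL:
            findings.append(f"{tech}: Trial, needs permission")
    return findings


for finding in review_stack(["TypeScript", "Rust", "MongoDB", "Elixir"]):
    print(finding)
```

Technologies in the Adopt ring produce no finding, so an empty result means the proposed stack passes the radar.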
Phase 4: Evolution & Modernization
Managing the rot:
- Tech Debt Management
  - Treat debt like a financial loan. Pay interest (refactoring) regularly.
  - Identify "Load Bearing" legacy code that needs replacement.
- The Strangler Fig Pattern
  - Don't rewrite the old system from scratch (the "Big Bang" rewrite fails).
  - Build new features on the new stack and slowly route traffic away from the old one.
- Mentorship
  - Your job is to make Senior Engineers into Architects.
  - Explain why you made a decision; don't just give orders.
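The "slowly route traffic away" step of the Strangler Fig pattern is often implemented as a percentage rollout at the routing layer. The following is a minimal sketch under that assumption; the function names and both backend URLs are placeholders, and in practice this logic usually lives in a gateway or load balancer rather than in application code.

```python
import hashlib


def routes_to_new_stack(user_id: str, rollout_percent: int) -> bool:
    """Deterministically place a user in the new-stack cohort.

    Hashing keeps each user on the same side between requests, and raising
    rollout_percent only ever moves users from the old stack to the new one.
    """
    if not 0 <= rollout_percent <= 100:
        raise ValueError("rollout_percent must be between 0 and 100")
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100   # stable bucket in 0..99
    return bucket < rollout_percent


def pick_backend(user_id: str, rollout_percent: int) -> str:
    # Both URLs are placeholders for illustration.
    if routes_to_new_stack(user_id, rollout_percent):
        return "https://new.internal.example"
    return "https://legacy.internal.example"


if __name__ == "__main__":
    for percent in (0, 25, 100):
        print(percent, pick_backend("user-42", percent))
```

Setting the percentage back to 0 is the rollback plan, which is the property a "Big Bang" rewrite lacks.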
Red Flags - STOP and Follow Process
If you catch yourself thinking:
- "Let's use Kubernetes because Google uses it." (You are not Google).
- "Microservices will solve our spaghetti code." (Now you have distributed spaghetti).
- "I don't need to write this down, it's in my head." (Bus factor risk).
- "We'll figure out how to migrate the data later." (Data gravity is real).
- "I'll let every team choose their own language." (Operational nightmare).
- "We can push this to deployment without a 'fine-toothed comb' code review."
ALL of these mean: STOP. Return to Phase 1.
Quick Reference
| Phase | Key Activities | Success Criteria |
|---|---|---|
| 1. Constraints | Buy vs Build, NFRs | Tech stack aligned with business needs |
| 2. Consensus | RFCs, Design Review | Team buy-in, clear boundaries |
| 3. Governance | Tech Radar, Standards | Consistent ecosystem |
| 4. Evolution | Refactoring, Mentoring | System survives > 5 years |
📐 Architecture Decision Record (ADR) Template
Use this for every significant technical decision:
    # ADR-###: [Title]

    **Status:** Proposed | Accepted | Deprecated | Superseded
    **Date:** 2026-01-29
    **Deciders:** [Names]

    ## Context
    What forces are at play? Business constraints, technical debt, team skillsets.

    ## Decision
    We will [decision]. We chose [option A] over [option B].

    ## Consequences

    ### Positive
    - What becomes easier?
    - What performance/cost benefits?

    ### Negative
    - What becomes harder?
    - What new risks?

    ### Neutral
    - What stays the same?

    ## Alternatives Considered
    | Option | Pros | Cons | Why Rejected |
    |--------|------|------|---------------|
    | Option A (chosen) | ... | ... | N/A |
    | Option B | ... | ... | ... |
    | Option C | ... | ... | ... |

    ## References
    - [Link to RFC]
    - [Link to prototype]
🛠️ Modern Architecture Stack (2026)
Compute Patterns
- Serverless: AWS Lambda, Cloud Run, Cloudflare Workers
- Containers: ECS/Fargate, Cloud Run, Fly.io
- Edge: Cloudflare Workers, Vercel Edge, Fastly Compute
- WASM: Spin (Fermyon), wasmCloud
Data Patterns
- OLTP: Postgres 17, Neon (serverless), Supabase
- OLAP: ClickHouse, BigQuery, DuckDB
- Caching: Redis 7, Valkey, Dragonfly
- Vector: pgvector, Pinecone, Weaviate
- Local-First: Replicache, ElectricSQL, PowerSync
Observability
- Standard: OpenTelemetry
- APM: Datadog, New Relic, Honeycomb
- Logs: Grafana Loki, CloudWatch, Mezmo
- Metrics: Prometheus, Grafana, Datadog
Cost Optimization
- FinOps: Spot instances, reserved capacity
- Auto-scaling: KEDA, AWS Auto Scaling
- Budget Alerts: CloudWatch billing, Infracost