git clone https://github.com/aiskillstore/marketplace
T=$(mktemp -d) && git clone --depth=1 https://github.com/aiskillstore/marketplace "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/curiouslearner/architecture-documenter" ~/.claude/skills/aiskillstore-marketplace-architecture-documenter && rm -rf "$T"
skills/curiouslearner/architecture-documenter/SKILL.md
Architecture Documenter Skill
Document system architecture and technical design decisions for effective team communication and knowledge sharing.
Instructions
You are a software architecture documentation expert. When invoked:
- Analyze System Architecture:
  - Identify key components and services
  - Understand data flows and interactions
  - Map dependencies and integrations
  - Recognize architectural patterns
  - Assess scalability and reliability
- Create Architecture Documentation:
  - System overview and context
  - Component diagrams and relationships
  - Data flow diagrams
  - Deployment architecture
  - Security architecture
  - Decision records (ADRs)
- Document Technical Decisions:
  - What was decided
  - Why it was decided
  - Alternatives considered
  - Trade-offs made
  - Implementation details
  - Future considerations
- Use Visual Diagrams:
  - System architecture diagrams
  - Sequence diagrams
  - Entity-relationship diagrams
  - Infrastructure diagrams
  - Network topology
  - State machines
- Maintain Living Documentation:
  - Keep docs synchronized with code
  - Version architecture docs
  - Track evolution over time
  - Mark deprecated components
  - Update with lessons learned
Architecture Documentation Templates
System Architecture Document Template
# E-Commerce Platform - System Architecture

**Version**: 2.3
**Last Updated**: January 15, 2024
**Status**: Current
**Authors**: Engineering Team
**Reviewers**: Alice (EM), Bob (Tech Lead)

---

## Table of Contents

1. [Executive Summary](#executive-summary)
2. [System Context](#system-context)
3. [Architecture Overview](#architecture-overview)
4. [Core Components](#core-components)
5. [Data Architecture](#data-architecture)
6. [Infrastructure](#infrastructure)
7. [Security Architecture](#security-architecture)
8. [Scalability & Performance](#scalability--performance)
9. [Deployment](#deployment)
10. [Monitoring & Observability](#monitoring--observability)
11. [Future Considerations](#future-considerations)

---

## Executive Summary

### What This System Does

The E-Commerce Platform is a modern, cloud-native application that enables small to medium businesses to sell products online. It handles the complete e-commerce lifecycle from product catalog management to order fulfillment.

### Key Capabilities

- **Product Management**: Create, update, and manage product catalogs
- **Shopping Experience**: Browse products, search, filter, and compare
- **Checkout & Payments**: Secure checkout with multiple payment options
- **Order Management**: Track orders from placement to delivery
- **User Accounts**: Customer profiles, order history, preferences
- **Admin Dashboard**: Business analytics, inventory management

### System Scale

| Metric | Current | Target (6 months) |
|--------|---------|-------------------|
| Active Users | 5,000 businesses | 15,000 businesses |
| Products | 500,000 | 2,000,000 |
| Daily Orders | 10,000 | 50,000 |
| Monthly GMV | $2M | $10M |
| Peak RPS | 500 | 2,000 |
| Data Storage | 2 TB | 10 TB |

### Technology Stack Summary

- **Frontend**: React, TypeScript, Redux, Material-UI
- **Backend**: Node.js, Express, TypeScript
- **Database**: PostgreSQL (primary), Redis (cache)
- **Storage**: AWS S3
- **Hosting**: AWS (ECS, RDS, ElastiCache, CloudFront)
- **CI/CD**: GitHub Actions
- **Monitoring**: DataDog, Sentry

---

## System Context

### Business Context

**Problem We Solve**: Small businesses struggle with expensive, complex e-commerce solutions. Our platform provides an affordable, easy-to-use alternative.

**Target Users**:
- Small business owners (10-1000 products)
- Digital creators selling physical products
- Retail stores expanding online

**Business Model**: SaaS subscription ($29-$299/month) + transaction fees (2.9% + $0.30)

### System Boundary
┌─────────────────────────────────────────────────────┐
│                 E-Commerce Platform                 │
│                                                     │
│  ┌──────────┐   ┌──────────┐   ┌──────────────┐     │
│  │ Customer │   │ Merchant │   │    Admin     │     │
│  │   Web    │   │Dashboard │   │    Portal    │     │
│  └──────────┘   └──────────┘   └──────────────┘     │
│                                                     │
│  ┌──────────────────────────────────────────────┐   │
│  │               Backend Services               │   │
│  │    (Auth, Product, Order, Payment, etc.)     │   │
│  └──────────────────────────────────────────────┘   │
│                                                     │
│  ┌──────────────────────────────────────────────┐   │
│  │             Data & Storage Layer             │   │
│  │           (PostgreSQL, Redis, S3)            │   │
│  └──────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────┘
                           │
           ┌───────────────┼───────────────┐
           │               │               │
      ┌─────────┐    ┌──────────┐    ┌──────────┐
      │ Stripe  │    │ SendGrid │    │  Shippo  │
      │ Payment │    │  Email   │    │ Shipping │
      └─────────┘    └──────────┘    └──────────┘
### External Dependencies

| Service | Purpose | SLA | Fallback Strategy |
|---------|---------|-----|-------------------|
| Stripe | Payment processing | 99.99% | Queue retries, manual processing |
| SendGrid | Email delivery | 99.95% | Alternative provider (AWS SES) |
| Shippo | Shipping labels | 99.9% | Manual label generation |
| AWS | Infrastructure | 99.99% | Multi-AZ deployment |
| Cloudflare | CDN/DNS | 99.99% | Direct origin access |

---

## Architecture Overview

### High-Level Architecture
                        Internet
                           │
                           ▼
                    ┌──────────────┐
                    │  Cloudflare  │  (CDN, DDoS protection)
                    └──────┬───────┘
                           │
                           ▼
                ┌──────────────────────┐
                │    AWS CloudFront    │  (Static assets)
                └──────────────────────┘
                           │
        ┌──────────────────┼──────────────────┐
        │                  │                  │
        ▼                  ▼                  ▼
┌───────────────┐  ┌───────────────┐  ┌──────────────┐
│     React     │  │  API Gateway  │  │    Admin     │
│   Frontend    │  │   (Express)   │  │    Portal    │
│ (CloudFront)  │  │   (ALB+ECS)   │  │              │
└───────────────┘  └───────┬───────┘  └──────────────┘
                           │
        ┌──────────────────┼──────────────────┐
        │                  │                  │
        ▼                  ▼                  ▼
   ┌─────────┐        ┌──────────┐       ┌──────────┐
   │  Auth   │        │ Product  │       │  Order   │
   │ Service │        │ Service  │       │ Service  │
   └────┬────┘        └────┬─────┘       └────┬─────┘
        │                  │                  │
        └──────────────────┼──────────────────┘
                           │
          ┌────────────────┼────────────────┐
          │                │                │
          ▼                ▼                ▼
    ┌──────────┐    ┌─────────────┐    ┌─────────┐
    │PostgreSQL│    │    Redis    │    │   S3    │
    │  (RDS)   │    │(ElastiCache)│    │(Images) │
    └──────────┘    └─────────────┘    └─────────┘
### Architecture Style

**Primary Pattern**: Modular Monolith (transitioning to Microservices)

**Rationale**:
- **Current**: Modular monolith provides simplicity while maintaining clear boundaries
- **Future**: Easy migration path to microservices as scale increases
- **Trade-off**: Accepts coupling cost for development velocity at current scale

### Key Architectural Principles

1. **Separation of Concerns**: Clear boundaries between modules
2. **API-First**: All features exposed via REST APIs
3. **Stateless Services**: No server-side session state (JWT-based auth)
4. **Caching Strategy**: Cache aggressively, invalidate carefully
5. **Eventual Consistency**: Accept eventual consistency for non-critical data
6. **Fail Fast**: Return errors quickly rather than retry indefinitely
7. **Observability**: Comprehensive logging, metrics, and tracing

---

## Core Components

### Frontend Application

**Technology**: React 18 + TypeScript + Redux Toolkit

**Structure**:
client/
├── components/    # Reusable UI components
├── pages/         # Route-level pages
├── store/         # Redux state management
├── api/           # API client
├── hooks/         # Custom React hooks
└── utils/         # Utility functions
**Key Features**:
- Server-side rendering (SSR) for SEO
- Code splitting by route
- Progressive Web App (PWA) capabilities
- Optimistic UI updates
- Offline support (service workers)

**State Management**:
- **Redux**: Global application state
- **React Query**: Server state caching
- **Local Storage**: User preferences, cart (guest users)

**Performance Targets**:
- First Contentful Paint: <1.5s
- Time to Interactive: <3s
- Lighthouse Score: >90

---

### API Gateway

**Technology**: Express.js + TypeScript

**Responsibilities**:
- Request routing
- Authentication/authorization
- Rate limiting
- Request/response transformation
- API versioning
- CORS handling

**Middleware Pipeline**:

```javascript
Request
  ↓
Logging (Morgan)
  ↓
Rate Limiting (express-rate-limit)
  ↓
CORS (cors)
  ↓
Authentication (JWT verification)
  ↓
Authorization (permission check)
  ↓
Request Validation (Joi)
  ↓
Route Handler
  ↓
Response Formatting
  ↓
Error Handling
  ↓
Response
```
API Versioning Strategy:
- URL versioning: /api/v1/products, /api/v2/products
- Maintain 2 versions simultaneously
- Deprecation warnings in headers
- 6-month sunset period for old versions
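The deprecation policy above could be surfaced to clients roughly as follows. This is a minimal sketch, not the platform's actual middleware: `deprecationHeaders` and `warnOnV1` are hypothetical names, and the choice of the `Deprecation`, `Sunset`, and `Link` headers is an assumption about how the "deprecation warnings in headers" are delivered.

```javascript
// Hypothetical helper: builds response headers for a deprecated API version.
// The sunset date is six months after the deprecation date, per the policy above.
function deprecationHeaders(deprecatedAt, successorPath) {
  const sunset = new Date(deprecatedAt.getTime());
  sunset.setUTCMonth(sunset.getUTCMonth() + 6);
  return {
    // Deprecation carries a Unix timestamp; Sunset carries an HTTP date.
    Deprecation: `@${Math.floor(deprecatedAt.getTime() / 1000)}`,
    Sunset: sunset.toUTCString(),
    Link: `<${successorPath}>; rel="successor-version"`,
  };
}

// Express-style middleware, assuming v1 is the version being retired.
function warnOnV1(deprecatedAt) {
  const headers = deprecationHeaders(deprecatedAt, '/api/v2/products');
  return (req, res, next) => {
    if (req.path.startsWith('/api/v1/')) res.set(headers);
    next();
  };
}
```

Mounting `warnOnV1(new Date(...))` ahead of the v1 routes would leave v2 responses untouched while giving v1 clients a machine-readable retirement date.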
Service Modules
Authentication Service
Responsibilities:
- User registration and login
- JWT token generation and validation
- Password reset flow
- OAuth integration (Google, Facebook)
- Multi-factor authentication (MFA)
Database Schema:
users (
  id UUID PRIMARY KEY,
  email VARCHAR UNIQUE NOT NULL,
  password_hash VARCHAR NOT NULL,
  email_verified BOOLEAN DEFAULT FALSE,
  created_at TIMESTAMP,
  updated_at TIMESTAMP
)

sessions (
  id UUID PRIMARY KEY,
  user_id UUID REFERENCES users(id),
  token_hash VARCHAR NOT NULL,
  expires_at TIMESTAMP,
  created_at TIMESTAMP
)

oauth_accounts (
  id UUID PRIMARY KEY,
  user_id UUID REFERENCES users(id),
  provider VARCHAR NOT NULL, -- 'google', 'facebook'
  provider_user_id VARCHAR NOT NULL,
  access_token VARCHAR,
  refresh_token VARCHAR,
  UNIQUE(provider, provider_user_id)
)
Security Measures:
- Passwords hashed with Argon2id
- JWT tokens with 15-minute expiration
- Refresh tokens with 7-day expiration
- Rate limiting: 5 login attempts per 15 minutes
- Account lockout after 10 failed attempts
- MFA via TOTP (Google Authenticator)
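The rate-limit and lockout rules above can be sketched with a small in-memory tracker. Everything here is illustrative: `LoginAttemptTracker` is a hypothetical class, the counters would more plausibly live in Redis so all instances share them, and treating the lockout as ten consecutive failures is an assumption the source does not spell out.

```javascript
const WINDOW_MS = 15 * 60 * 1000; // 15-minute rate-limit window
const MAX_PER_WINDOW = 5;         // login attempts allowed per window
const LOCKOUT_THRESHOLD = 10;     // failures before lockout (assumed consecutive)

class LoginAttemptTracker {
  // `now` is injectable so the window logic can be exercised without waiting.
  constructor(now = () => Date.now()) {
    this.now = now;
    this.state = new Map(); // email -> { attempts, failures, locked }
  }

  entry(email) {
    if (!this.state.has(email)) {
      this.state.set(email, { attempts: [], failures: 0, locked: false });
    }
    return this.state.get(email);
  }

  // Returns 'locked', 'rate_limited', or 'allowed'; records the attempt when allowed.
  check(email) {
    const entry = this.entry(email);
    if (entry.locked) return 'locked';
    const cutoff = this.now() - WINDOW_MS;
    entry.attempts = entry.attempts.filter((t) => t > cutoff);
    if (entry.attempts.length >= MAX_PER_WINDOW) return 'rate_limited';
    entry.attempts.push(this.now());
    return 'allowed';
  }

  // Call after the password check; a success resets the failure count.
  recordResult(email, success) {
    const entry = this.entry(email);
    entry.failures = success ? 0 : entry.failures + 1;
    if (entry.failures >= LOCKOUT_THRESHOLD) entry.locked = true;
  }
}
```

Keeping the rate limit (a sliding window) separate from the lockout (a failure counter) mirrors the two distinct rules in the list.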
Product Service
Responsibilities:
- Product CRUD operations
- Inventory management
- Search and filtering
- Product recommendations
- Category management
Database Schema:
products (
  id UUID PRIMARY KEY,
  merchant_id UUID REFERENCES users(id),
  name VARCHAR NOT NULL,
  description TEXT,
  price DECIMAL(10,2) NOT NULL,
  inventory_count INTEGER NOT NULL DEFAULT 0,
  category_id UUID REFERENCES categories(id),
  status VARCHAR DEFAULT 'draft', -- draft, active, archived
  created_at TIMESTAMP,
  updated_at TIMESTAMP
)

product_images (
  id UUID PRIMARY KEY,
  product_id UUID REFERENCES products(id) ON DELETE CASCADE,
  url VARCHAR NOT NULL,
  position INTEGER,
  created_at TIMESTAMP
)

categories (
  id UUID PRIMARY KEY,
  name VARCHAR NOT NULL,
  parent_id UUID REFERENCES categories(id),
  slug VARCHAR UNIQUE NOT NULL
)
Search Implementation:
- PostgreSQL full-text search with trigram indexes
- Elasticsearch for advanced features (planned)
- Caching: 5-minute TTL for product lists, 1-hour for individual products
Performance Optimizations:
- Database indexes on common query fields
- N+1 query prevention with eager loading
- Image CDN with automatic resizing
- Aggressive caching with Redis
Order Service
Responsibilities:
- Shopping cart management
- Order creation and processing
- Order status tracking
- Order history
- Invoice generation
Database Schema:
orders (
  id UUID PRIMARY KEY,
  customer_id UUID REFERENCES users(id),
  status VARCHAR NOT NULL, -- pending, paid, shipped, delivered, cancelled
  subtotal DECIMAL(10,2) NOT NULL,
  tax DECIMAL(10,2) NOT NULL,
  shipping DECIMAL(10,2) NOT NULL,
  total DECIMAL(10,2) NOT NULL,
  payment_id VARCHAR,
  shipping_address_id UUID REFERENCES addresses(id),
  created_at TIMESTAMP,
  updated_at TIMESTAMP
)

order_items (
  id UUID PRIMARY KEY,
  order_id UUID REFERENCES orders(id) ON DELETE CASCADE,
  product_id UUID REFERENCES products(id),
  quantity INTEGER NOT NULL,
  price DECIMAL(10,2) NOT NULL,
  product_snapshot JSONB -- Product details at time of purchase
)

order_events (
  id UUID PRIMARY KEY,
  order_id UUID REFERENCES orders(id),
  event_type VARCHAR NOT NULL, -- created, paid, shipped, etc.
  metadata JSONB,
  created_at TIMESTAMP
)
Order State Machine:
pending → paid → processing → shipped → delivered
   ↓       ↓         ↓           ↓
   └───────┴─────────┴───────────┴──→ cancelled
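The state machine above can be expressed as a transition table. This is a sketch transcribed from the diagram; the `canTransition` and `transition` helpers are illustrative names, not functions from the platform's code.

```javascript
// Allowed transitions, read directly off the diagram: every state before
// `delivered` may also move to `cancelled`; the two end states are terminal.
const TRANSITIONS = new Map([
  ['pending', ['paid', 'cancelled']],
  ['paid', ['processing', 'cancelled']],
  ['processing', ['shipped', 'cancelled']],
  ['shipped', ['delivered', 'cancelled']],
  ['delivered', []],
  ['cancelled', []],
]);

function canTransition(from, to) {
  const next = TRANSITIONS.get(from);
  return next !== undefined && next.includes(to);
}

// Returns a copy of the order in the new state, or throws on an illegal move.
function transition(order, to) {
  if (!canTransition(order.status, to)) {
    throw new Error(`Invalid order transition: ${order.status} -> ${to}`);
  }
  return { ...order, status: to };
}
```

For example, `transition({ id: 'o1', status: 'pending' }, 'paid')` yields an order with status `paid`, while attempting `pending` to `shipped` throws.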
Transaction Handling:
- Database transactions for order creation
- Idempotency keys for payment processing
- Inventory reservation system
- Automatic rollback on payment failure
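As an illustration of the idempotency-key idea above, the sketch below runs an operation at most once per key and replays the stored outcome to repeat callers. `IdempotencyStore` is a hypothetical in-memory stand-in; a production version would presumably persist keys in PostgreSQL or Redis, and forgetting a key after a failure is one possible policy, not necessarily the platform's.

```javascript
class IdempotencyStore {
  constructor() {
    this.results = new Map(); // key -> promise of the operation's result
  }

  // Runs `operation` at most once per key; concurrent and repeated calls
  // with the same key share the same result.
  run(key, operation) {
    if (!this.results.has(key)) {
      const pending = Promise.resolve().then(operation);
      // Assumed policy: drop the key on failure so the client can retry.
      pending.catch(() => this.results.delete(key));
      this.results.set(key, pending);
    }
    return this.results.get(key);
  }
}
```

A payment handler might call `store.run(request.idempotencyKey, () => chargeCustomer(order))`, where both the key field and `chargeCustomer` are placeholders, so that a retried request does not charge twice.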
Payment Service
Responsibilities:
- Payment intent creation
- Payment processing (via Stripe)
- Refund handling
- Payment method management
- Transaction history
Integration with Stripe:
// Payment Intent Flow
1. Client requests payment intent
   ↓
2. Server creates Stripe PaymentIntent
   ↓
3. Client collects payment details
   ↓
4. Client confirms payment with Stripe
   ↓
5. Stripe webhook notifies server
   ↓
6. Server updates order status
Webhook Security:
- Stripe signature verification
- Idempotent webhook processing
- Async processing with job queue
- Retry logic for failed webhooks
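The signature check and idempotent processing listed above might look like the sketch below. Stripe signs the string `timestamp.rawBody` with HMAC-SHA256 and sends a header of the form `t=<timestamp>,v1=<hex signature>`; in practice the official SDK's `stripe.webhooks.constructEvent` performs this verification. The helper names, the five-minute tolerance default, and the in-memory `processedEvents` set are assumptions for illustration.

```javascript
const crypto = require('node:crypto');

function sign(rawBody, timestamp, secret) {
  return crypto.createHmac('sha256', secret).update(`${timestamp}.${rawBody}`).digest('hex');
}

// Returns true only for a fresh, correctly signed payload.
function verifyWebhook(rawBody, header, secret, nowSeconds, toleranceSeconds = 300) {
  const parts = Object.fromEntries(header.split(',').map((kv) => kv.split('=')));
  const timestamp = Number(parts.t);
  if (!Number.isFinite(timestamp) || typeof parts.v1 !== 'string') return false;
  if (Math.abs(nowSeconds - timestamp) > toleranceSeconds) return false; // stale or replayed

  const expected = Buffer.from(sign(rawBody, timestamp, secret), 'hex');
  const received = Buffer.from(parts.v1, 'hex');
  // timingSafeEqual throws on length mismatch, so compare lengths first.
  return expected.length === received.length && crypto.timingSafeEqual(expected, received);
}

// Idempotent processing: each event ID is handled at most once.
const processedEvents = new Set(); // would be a database table in practice
function handleOnce(eventId, handler) {
  if (processedEvents.has(eventId)) return false;
  processedEvents.add(eventId);
  handler();
  return true;
}
```

Verification must run against the raw request body, before any JSON parsing, because re-serialized JSON will not match the signed bytes.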
Database Schema:
payments (
  id UUID PRIMARY KEY,
  order_id UUID REFERENCES orders(id),
  stripe_payment_intent_id VARCHAR UNIQUE,
  amount DECIMAL(10,2) NOT NULL,
  status VARCHAR NOT NULL, -- pending, succeeded, failed
  payment_method VARCHAR, -- card, bank_transfer
  metadata JSONB,
  created_at TIMESTAMP,
  updated_at TIMESTAMP
)

refunds (
  id UUID PRIMARY KEY,
  payment_id UUID REFERENCES payments(id),
  stripe_refund_id VARCHAR UNIQUE,
  amount DECIMAL(10,2) NOT NULL,
  reason VARCHAR,
  status VARCHAR NOT NULL,
  created_at TIMESTAMP
)
Data Architecture
Database Design
Primary Database: PostgreSQL 14
Schema Organization:
- public schema: Core application tables
- audit schema: Audit logs and event sourcing
- analytics schema: Denormalized data for reporting
Connection Pooling:
{
  max: 20,        // Max connections
  min: 5,         // Min connections
  idle: 10000,    // Close idle connections after 10s
  acquire: 30000, // Max time to acquire connection
  evict: 1000     // Check for idle connections every 1s
}
Backup Strategy:
- Automated daily backups (RDS snapshots)
- Point-in-time recovery enabled (7-day window)
- Monthly backups retained for 1 year
- Backup tested quarterly
Caching Strategy
Redis Configuration:
- Deployment: AWS ElastiCache (Redis 7.0)
- Mode: Cluster mode enabled
- Nodes: 3 (primary + 2 replicas)
- Eviction policy: LRU (Least Recently Used)
Cache Patterns:
- Cache-Aside (Read-heavy data):
async function getProduct(id) {
  // Try cache first
  let product = await cache.get(`product:${id}`);
  if (!product) {
    // Cache miss - fetch from database
    product = await db.products.findById(id);
    // Store in cache (1 hour TTL)
    await cache.set(`product:${id}`, product, 3600);
  }
  return product;
}
- Write-Through (Critical data):
async function updateProduct(id, data) {
  // Update database
  const product = await db.products.update(id, data);
  // Update cache
  await cache.set(`product:${id}`, product, 3600);
  return product;
}
Cache Invalidation:
// Product updated
await cache.del(`product:${productId}`);
await cache.del(`products:merchant:${merchantId}`);
await cache.del(`products:category:${categoryId}`);

// Pattern-based invalidation
await cache.delPattern(`products:*`);
What We Cache:
| Data Type | TTL | Rationale |
|---|---|---|
| Product details | 1 hour | Infrequently updated |
| Product lists | 5 minutes | Frequently updated |
| User sessions | 15 minutes | Security requirement |
| Search results | 10 minutes | Expensive queries |
| API responses | 1 minute | Rate limit protection |
Data Migration Strategy
Tools: Prisma Migrate (development), custom scripts (production)
Migration Process:
- Create migration in development
- Review SQL in PR
- Test on staging (copy of production data)
- Run on production during low-traffic window
- Rollback plan documented
Zero-Downtime Migrations:
-- Example: Adding non-null column

-- Step 1: Add column as nullable
ALTER TABLE products ADD COLUMN new_field VARCHAR;

-- Step 2: Backfill data
UPDATE products SET new_field = 'default_value' WHERE new_field IS NULL;

-- Step 3: Add NOT NULL constraint
ALTER TABLE products ALTER COLUMN new_field SET NOT NULL;
Infrastructure
AWS Architecture
Regions: Primary: us-east-1, Disaster Recovery: us-west-2
VPC Design:
VPC (10.0.0.0/16)
├── Public Subnets (10.0.1.0/24, 10.0.2.0/24)
│   ├── NAT Gateways
│   └── Application Load Balancer
└── Private Subnets (10.0.10.0/24, 10.0.11.0/24)
    ├── ECS Tasks (Application)
    ├── RDS (Database)
    └── ElastiCache (Redis)
Compute:
- ECS Fargate: Serverless containers for application
- Auto-scaling: Target CPU 70%, min 2 tasks, max 10 tasks
- Task Definition:
CPU: 1024 (1 vCPU)
Memory: 2048 MB
Container Port: 3000
Environment: Production
Database:
- RDS PostgreSQL: db.r5.large (2 vCPU, 16 GB RAM)
- Multi-AZ: Yes (automatic failover)
- Read Replicas: 1 (for analytics queries)
- Storage: 500 GB GP3 (auto-scaling enabled)
Storage:
- S3 Bucket: product-images-prod
- Lifecycle Policy: Move to Glacier after 90 days
- CDN: CloudFront distribution for images
- Backup: Cross-region replication enabled
Networking:
- Load Balancer: Application Load Balancer (ALB)
- SSL/TLS: ACM certificates (auto-renewal)
- WAF: AWS WAF with OWASP rules
- DDoS Protection: AWS Shield Standard
Deployment Architecture
CI/CD Pipeline (GitHub Actions):
Code Push
  ↓
Automated Tests (Unit + Integration)
  ↓
Linting & Type Checking
  ↓
Build Docker Image
  ↓
Push to ECR (Elastic Container Registry)
  ↓
Deploy to Staging (Auto)
  ↓
Integration Tests (Staging)
  ↓
Manual Approval
  ↓
Deploy to Production (Canary)
  ↓
Monitor Metrics (15 minutes)
  ↓
Full Rollout or Rollback
Deployment Strategy: Blue-Green with Canary
Production (Blue)          Canary (Green)
100% traffic → 95% / 5% split → 0% / 100%
                    ↓
            Monitor for 15 min
                    ↓
      Success? Full rollout : Rollback
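The "Success? Full rollout : Rollback" step could be automated along these lines. This is only a sketch: the source does not say which metrics gate the decision, so the error-rate and p95-latency comparison, the `canaryDecision` name, and the shape of the `limits` object are all assumptions supplied by the caller.

```javascript
// Compares canary metrics, collected over the 15-minute monitoring window,
// against the baseline (blue) deployment. Thresholds come from the caller.
function canaryDecision(baseline, canary, limits) {
  const errorRateOk =
    canary.errorRate <= baseline.errorRate + limits.maxErrorRateIncrease;
  const latencyOk =
    canary.p95LatencyMs <= baseline.p95LatencyMs * limits.maxLatencyRatio;
  return errorRateOk && latencyOk ? 'full_rollout' : 'rollback';
}

// Example with made-up numbers: the canary may add at most 0.5 percentage
// points of errors and may be at most 20% slower at p95.
const decision = canaryDecision(
  { errorRate: 0.01, p95LatencyMs: 180 },
  { errorRate: 0.012, p95LatencyMs: 190 },
  { maxErrorRateIncrease: 0.005, maxLatencyRatio: 1.2 }
);
console.log(decision);
```

Comparing against the live baseline rather than fixed numbers keeps the gate meaningful when traffic patterns shift during the deployment window.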
Rollback Procedure:
- Detect issue (automated alerts or manual)
- Trigger rollback command
- Route traffic back to previous version
- Investigate root cause
- Fix and redeploy
Deployment Windows:
- Staging: Anytime
- Production: Tuesday-Thursday, 10 AM - 2 PM EST
- Emergency: 24/7 with on-call approval
Security Architecture
Defense in Depth
Layer 1: Network Security
- VPC isolation
- Security groups (allow-list only)
- Network ACLs
- Private subnets for data layer
- NAT Gateway for outbound traffic
Layer 2: Application Security
- Input validation (all user inputs)
- SQL injection prevention (parameterized queries)
- XSS prevention (sanitization + CSP headers)
- CSRF protection (tokens)
- Rate limiting (DDoS mitigation)
Layer 3: Authentication & Authorization
- JWT with short expiration
- Refresh token rotation
- MFA for admin accounts
- Role-based access control (RBAC)
- Principle of least privilege
Layer 4: Data Security
- Encryption at rest (RDS, S3)
- Encryption in transit (TLS 1.3)
- Secrets in AWS Secrets Manager
- PII data encrypted at field level
- Regular security audits
Security Headers
{
  'Strict-Transport-Security': 'max-age=31536000; includeSubDomains',
  'Content-Security-Policy': "default-src 'self'; script-src 'self' 'unsafe-inline'",
  'X-Frame-Options': 'DENY',
  'X-Content-Type-Options': 'nosniff',
  'Referrer-Policy': 'strict-origin-when-cross-origin',
  'Permissions-Policy': 'geolocation=(), microphone=(), camera=()'
}
Compliance
Standards:
- PCI DSS: Level 2 (Stripe handles Level 1)
- GDPR: User data rights, deletion, export
- SOC 2 Type II: In progress (Q2 2024)
Data Retention:
- User data: Retained until account deletion
- Order data: 7 years (regulatory requirement)
- Logs: 90 days
- Backups: 1 year
Scalability & Performance
Current Capacity
| Metric | Current | Limit | Headroom |
|---|---|---|---|
| Concurrent Users | 500 | 2,000 | 4x |
| Requests/Second | 200 | 1,000 | 5x |
| Database Connections | 50 | 200 | 4x |
| Storage | 500 GB | 2 TB | 4x |
Scaling Strategy
Horizontal Scaling:
- Stateless services (easy to replicate)
- Auto-scaling based on CPU/memory
- Database read replicas for read-heavy workloads
Vertical Scaling:
- Database instance size (requires scheduled downtime)
- Cache cluster size
Caching:
- Application-level caching (Redis)
- CDN for static assets
- Database query result caching
Database Optimization:
- Indexes on frequently queried fields
- Materialized views for complex queries
- Connection pooling
- Query optimization (EXPLAIN ANALYZE)
Performance Budgets
API Response Times (p95):
- GET requests: <200ms
- POST requests: <500ms
- Complex queries: <1s
Frontend Performance (Lighthouse):
- Performance: >90
- Accessibility: 100
- Best Practices: >90
- SEO: 100
Database Query Times (p95):
- Simple queries: <50ms
- Join queries: <100ms
- Aggregations: <500ms
Monitoring & Observability
Metrics
Application Metrics (DataDog):
- Request rate, error rate, duration (RED metrics)
- Active users, sessions
- Business metrics (orders, revenue)
- Custom metrics (cart abandonment, conversion rate)
Infrastructure Metrics:
- CPU, memory, disk usage
- Network throughput
- Database connections, query performance
- Cache hit rate
Dashboards:
- System Health: Overall system status
- API Performance: Endpoint-specific metrics
- Business Metrics: KPIs and conversions
- Database Performance: Query analysis
- Error Tracking: Error rates and trends
Logging
Log Levels:
- ERROR: Application errors requiring investigation
- WARN: Potential issues or degraded performance
- INFO: Significant events (order created, payment succeeded)
- DEBUG: Detailed diagnostic information (disabled in production)
Log Aggregation: CloudWatch Logs → DataDog
Structured Logging:
logger.info('Order created', {
  orderId: '123',
  customerId: '456',
  total: 99.99,
  timestamp: new Date().toISOString()
});
Alerting
Alert Channels:
- Critical: PagerDuty (SMS + Phone)
- High: Slack #incidents
- Medium: Slack #engineering
- Low: Email
Alert Rules:
- name: High Error Rate
  condition: error_rate > 5% for 5 minutes
  severity: CRITICAL
  channel: PagerDuty

- name: Slow API Response
  condition: p95_latency > 1000ms for 10 minutes
  severity: HIGH
  channel: Slack

- name: Database Connection Pool Exhausted
  condition: db_connections > 180 for 5 minutes
  severity: CRITICAL
  channel: PagerDuty

- name: Low Cache Hit Rate
  condition: cache_hit_rate < 70% for 15 minutes
  severity: MEDIUM
  channel: Slack
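Each rule above has the form "metric beyond a threshold for N minutes". A sketch of that evaluation follows, assuming one metric sample per minute and a hypothetical rule shape (`threshold`, `comparison`, `forMinutes`); a monitoring product such as DataDog evaluates these conditions itself, so this only illustrates the semantics.

```javascript
// samples: [{ timestampMs, value }], assumed to arrive about once per minute.
// The rule fires only if every sample in the trailing window breaches the
// threshold, so a single spike does not page anyone.
function shouldFire(samples, rule, nowMs) {
  const windowStart = nowMs - rule.forMinutes * 60 * 1000;
  const inWindow = samples.filter(
    (s) => s.timestampMs >= windowStart && s.timestampMs <= nowMs
  );
  const breaches = rule.comparison === 'above'
    ? (value) => value > rule.threshold
    : (value) => value < rule.threshold;
  // Require enough samples to cover the window before trusting the result.
  return inWindow.length >= rule.forMinutes && inWindow.every((s) => breaches(s.value));
}

// "error_rate > 5% for 5 minutes" expressed in the hypothetical rule shape:
const highErrorRate = { threshold: 0.05, comparison: 'above', forMinutes: 5 };
```

The "Low Cache Hit Rate" rule would use `{ threshold: 0.7, comparison: 'below', forMinutes: 15 }` in the same shape.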
Tracing
Distributed Tracing: DataDog APM
Trace Example:
HTTP Request: GET /api/products/123
├─ Authentication Middleware (5ms)
├─ Authorization Middleware (2ms)
├─ Product Service
│  ├─ Cache Lookup (1ms) [MISS]
│  ├─ Database Query (45ms)
│  └─ Cache Set (2ms)
├─ Response Serialization (3ms)
└─ Total: 58ms
Future Considerations
Planned Improvements (Next 6 Months)
- Microservices Migration
  - Extract payment service first
  - Event-driven architecture with message queue
  - Service mesh (Istio) for inter-service communication
- Search Enhancement
  - Migrate to Elasticsearch
  - Implement faceted search
  - Add product recommendations (ML-based)
- Performance Optimization
  - Implement GraphQL (reduce over-fetching)
  - Server-side rendering for better SEO
  - Optimize database queries (20% improvement target)
- Infrastructure
  - Multi-region deployment for lower latency
  - Kubernetes migration (from ECS)
  - Serverless functions for background jobs
Technical Debt
High Priority:
- Upgrade Node.js from v16 to v20
- Migrate from class components to hooks (React)
- Implement comprehensive integration tests
- Refactor legacy authentication code
Medium Priority:
- Standardize error handling across services
- Improve API documentation (OpenAPI spec)
- Add end-to-end tests for critical flows
Low Priority:
- Migrate from REST to GraphQL
- Implement BFF (Backend for Frontend) pattern
- Add feature flags system
Risks & Mitigation
| Risk | Impact | Probability | Mitigation |
|---|---|---|---|
| Database becomes bottleneck | HIGH | MEDIUM | Read replicas, caching, sharding plan |
| Monolith difficult to scale | MEDIUM | HIGH | Modular architecture, migration plan |
| Third-party service outage | HIGH | LOW | Fallback strategies, circuit breakers |
| Security breach | CRITICAL | LOW | Regular audits, penetration testing |
| Key engineer departure | MEDIUM | MEDIUM | Documentation, knowledge sharing |
Appendices
Glossary
- GMV: Gross Merchandise Value
- RPS: Requests Per Second
- p95: 95th percentile
- TTL: Time To Live
- CDN: Content Delivery Network
- WAF: Web Application Firewall
Change Log
| Version | Date | Changes | Author |
|---|---|---|---|
| 2.3 | 2024-01-15 | Added canary deployment strategy | Alice |
| 2.2 | 2023-12-01 | Updated infrastructure (ECS migration) | Bob |
| 2.1 | 2023-10-15 | Added security architecture section | Frank |
| 2.0 | 2023-09-01 | Major revision - microservices plan | Alice, Bob |
Document Status: Current
Next Review: April 15, 2024
Maintained By: Engineering Team
Questions: #architecture on Slack
### Architecture Decision Record (ADR) Template

```markdown
# ADR-015: Migrate from Sessions to JWT Authentication

**Status**: Accepted
**Date**: January 15, 2024
**Decision Makers**: Alice (EM), Bob (Tech Lead), Carol (Frontend Lead)
**Consulted**: Security Team, DevOps Team

---

## Context

Our current authentication system uses server-side sessions stored in Redis. As we scale to support more users and prepare for multi-region deployment, session management has become a bottleneck.

### Current State

**Session-Based Authentication**:

```javascript
// Login creates server-side session
app.post('/login', (req, res) => {
  const user = authenticate(req.body);
  req.session.userId = user.id; // Stored in Redis
  res.json({ success: true });
});

// Each request validates session
// (the handler must be async because it awaits the user lookup)
app.use(async (req, res, next) => {
  try {
    if (req.session.userId) {
      req.user = await getUser(req.session.userId);
    }
    next();
  } catch (error) {
    next(error);
  }
});
```
Problems:
- Scalability: Every request requires Redis lookup (adds 5-10ms latency)
- Complexity: Session replication across regions is complex
- Memory: 50,000 active sessions = 250MB Redis memory
- Stateful: Cannot easily add new servers (sticky sessions required)
Requirements
- Stateless: No server-side session storage
- Scalable: Support 50k+ concurrent users
- Secure: Resistant to common attacks (XSS, CSRF, token theft)
- Fast: Minimal performance impact (<1ms overhead)
- Compatible: Work with existing mobile apps
Decision
We will migrate from session-based authentication to JSON Web Tokens (JWT).
Implementation
JWT-Based Authentication:
const jwt = require('jsonwebtoken');

// Login generates JWT
app.post('/login', (req, res) => {
  const user = authenticate(req.body);
  const accessToken = jwt.sign(
    { userId: user.id, role: user.role },
    process.env.JWT_SECRET,
    { algorithm: 'HS256', expiresIn: '15m' }
  );
  const refreshToken = jwt.sign(
    { userId: user.id },
    process.env.REFRESH_SECRET,
    { algorithm: 'HS256', expiresIn: '7d' }
  );
  res.json({ accessToken, refreshToken });
});

// Each request validates JWT (no database lookup)
app.use((req, res, next) => {
  const token = req.headers.authorization?.split(' ')[1];
  try {
    // Pin the algorithm to prevent algorithm-confusion attacks
    req.user = jwt.verify(token, process.env.JWT_SECRET, { algorithms: ['HS256'] });
    next();
  } catch (error) {
    res.status(401).json({ error: 'Invalid token' });
  }
});
Token Structure
Access Token (short-lived):
- Payload: { userId, role, permissions }
- Expiration: 15 minutes
- Signature: HMAC SHA256
Refresh Token (long-lived):
- Payload: { userId }
- Expiration: 7 days
- Stored hash in database (for revocation)
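Storing only a hash of each refresh token, as described above, means a database leak does not expose usable tokens while revocation stays a simple delete. A minimal sketch, assuming SHA-256 is acceptable because the tokens are long and random; the `refreshTokenHashes` map is a hypothetical in-memory stand-in for the database table.

```javascript
const crypto = require('node:crypto');

// Stand-in for a database table keyed by token hash.
const refreshTokenHashes = new Map(); // hash -> { userId, expiresAt }

function hashToken(token) {
  return crypto.createHash('sha256').update(token).digest('hex');
}

function storeRefreshToken(token, userId, expiresAt) {
  refreshTokenHashes.set(hashToken(token), { userId, expiresAt });
}

// A token is usable only if its hash is on record and it has not expired.
function isRefreshTokenActive(token, now = Date.now()) {
  const record = refreshTokenHashes.get(hashToken(token));
  return record !== undefined && record.expiresAt > now;
}

// Revocation removes the record; returns false if the token was unknown.
function revokeRefreshToken(token) {
  return refreshTokenHashes.delete(hashToken(token));
}
```

With this shape, the refresh endpoint would call `isRefreshTokenActive` before issuing a new access token, and logout would call `revokeRefreshToken`.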
Alternatives Considered
Alternative 1: Keep Session-Based Auth
Pros:
- No migration needed
- Familiar to team
- Easy to revoke access (delete session)
Cons:
- Scalability issues persist
- Complex multi-region setup
- Requires sticky sessions (load balancer complexity)
Decision: Rejected due to scalability concerns
Alternative 2: OAuth 2.0 Only
Pros:
- Industry standard
- Delegation capabilities
- Well-tested security
Cons:
- Overkill for our use case
- Complex implementation
- Requires authorization server
- Users expect username/password
Decision: Rejected - too complex for current needs. Will add OAuth as option later.
Alternative 3: API Keys
Pros:
- Simple implementation
- Stateless
- Easy to revoke
Cons:
- No expiration (security risk)
- Not suitable for user authentication
- No claims/scopes
Decision: Rejected - better suited for programmatic access, not user auth
Consequences
Positive
- Performance: Eliminate Redis lookup on every request
  - Estimated improvement: 5-10ms per request
  - Reduces Redis load by 80%
- Scalability: Stateless servers
  - No sticky sessions needed
  - Easy horizontal scaling
  - Multi-region deployment simplified
- Mobile Support: Better mobile app experience
  - Tokens stored locally
  - No cookies required
  - Offline token validation
- Developer Experience: Simpler architecture
  - No session middleware
  - Easier testing (no session state)
  - Clear token lifecycle
Negative
- Token Revocation: Cannot immediately revoke access
  - Mitigation: Short token expiration (15 min)
  - Mitigation: Refresh token blacklist
  - Mitigation: Emergency: force re-auth for all users
- Token Size: JWTs larger than session IDs
  - Session ID: ~32 bytes
  - JWT: ~200 bytes
  - Impact: Minimal (200 bytes per request is acceptable)
- Secret Management: JWT secrets are critical
  - Mitigation: Store in AWS Secrets Manager
  - Mitigation: Rotate secrets quarterly
  - Mitigation: Different secrets per environment
- XSS Risk: Tokens accessible to JavaScript
  - Mitigation: Store in httpOnly cookies (where possible)
  - Mitigation: Strict Content Security Policy
  - Mitigation: Short token expiration
Risks
| Risk | Severity | Mitigation |
|---|---|---|
| JWT secret leaked | CRITICAL | Secrets Manager, rotation, monitoring |
| Cannot revoke compromised token | HIGH | Short expiration, refresh token blacklist |
| Algorithm confusion attack | MEDIUM | Explicitly specify algorithm in verification |
| Replay attacks | MEDIUM | Short expiration, HTTPS only |
Implementation Plan
Phase 1: Preparation (Week 1-2)
- Create JWT utility functions
- Update authentication middleware
- Add refresh token endpoint
- Write migration guide for frontend team
- Set up secrets in AWS Secrets Manager
Phase 2: Backend Migration (Week 3-4)
- Deploy JWT endpoints alongside session endpoints
- Add feature flag for JWT authentication
- Comprehensive testing (unit + integration)
- Load testing with JWTs
- Security review
Phase 3: Frontend Migration (Week 5-6)
- Update web app to use JWT
- Update mobile apps to use JWT
- Gradual rollout (10% → 50% → 100%)
- Monitor error rates and performance
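The gradual rollout (10% → 50% → 100%) needs a stable way to decide which users get JWT authentication at each stage. One common approach is hash-based bucketing, sketched below; the `rolloutBucket` and `isJwtEnabled` names and the `jwt-auth` flag name are assumptions, not taken from the platform's feature-flag system.

```javascript
const crypto = require('node:crypto');

// Maps a user ID to a stable bucket in [0, 100). Because the bucket never
// changes, a user enabled at 10% stays enabled at 50% and 100%.
function rolloutBucket(userId, flagName = 'jwt-auth') {
  const digest = crypto.createHash('sha256').update(`${flagName}:${userId}`).digest();
  return digest.readUInt32BE(0) % 100;
}

function isJwtEnabled(userId, rolloutPercent) {
  return rolloutBucket(userId) < rolloutPercent;
}
```

Including the flag name in the hashed string keeps bucket assignments independent between different rollouts, so the same users are not always the first to receive every change.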
Phase 4: Cleanup (Week 7-8)
- Remove session-based auth code
- Remove Redis session storage
- Update documentation
- Postmortem and lessons learned
Rollback Plan
If critical issues arise:
- Disable JWT feature flag
- Route all traffic to session endpoints
- Keep JWT code for investigation
- Identify and fix issues
- Resume migration
Metrics for Success
Performance:
- Average request latency reduced by 5ms
- p95 latency reduced by 10ms
- Redis CPU usage reduced by 80%
Reliability:
- No increase in authentication errors
- <0.1% token validation failures
- Zero security incidents
User Experience:
- Login flow unchanged (transparent migration)
- No increase in support tickets
- Mobile app performance improved
Security Considerations
Token Security:
- Tokens signed with HS256 (HMAC SHA256)
- Secrets: 256-bit randomly generated
- Secrets rotated quarterly
- Algorithm specified in verification (prevent algorithm confusion)
Storage:
- Web: httpOnly cookies (prevents XSS)
- Mobile: Secure storage (Keychain/Keystore)
- Never in localStorage (XSS vulnerable)
Transmission:
- HTTPS only (TLS 1.3)
- Secure, SameSite=Strict cookies
- No tokens in URLs (log exposure)
Validation:
- Verify signature
- Check expiration
- Validate issuer and audience
- Check token not blacklisted (refresh tokens)
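The first three validation steps above can be made concrete with a dependency-free sketch using only Node's `crypto` module. It is illustrative rather than a replacement for a maintained library: with `jsonwebtoken`, the equivalent is `jwt.verify(token, secret, { algorithms: ['HS256'], issuer, audience })`. The `signHs256` helper exists only to produce a token for the example, and the issuer and audience values a caller passes are placeholders. The blacklist check for refresh tokens is not covered here.

```javascript
const crypto = require('node:crypto');

const b64url = (input) => Buffer.from(input).toString('base64url');

// Example-only helper: issues an HS256 token.
function signHs256(payload, secret) {
  const header = b64url(JSON.stringify({ alg: 'HS256', typ: 'JWT' }));
  const body = b64url(JSON.stringify(payload));
  const signature = crypto
    .createHmac('sha256', secret)
    .update(`${header}.${body}`)
    .digest('base64url');
  return `${header}.${body}.${signature}`;
}

// Returns the claims if the token passes every check; throws otherwise.
function verifyHs256(token, secret, { issuer, audience, nowSeconds }) {
  const parts = String(token).split('.');
  if (parts.length !== 3) throw new Error('Malformed token');
  const [header, body, signature] = parts;

  // Pin the algorithm instead of trusting whatever the header claims.
  const { alg } = JSON.parse(Buffer.from(header, 'base64url').toString('utf8'));
  if (alg !== 'HS256') throw new Error('Unexpected algorithm');

  // Verify the signature with a constant-time comparison.
  const expected = crypto.createHmac('sha256', secret).update(`${header}.${body}`).digest();
  const received = Buffer.from(signature, 'base64url');
  if (expected.length !== received.length || !crypto.timingSafeEqual(expected, received)) {
    throw new Error('Invalid signature');
  }

  // Check expiration, issuer, and audience.
  const claims = JSON.parse(Buffer.from(body, 'base64url').toString('utf8'));
  if (typeof claims.exp !== 'number' || claims.exp <= nowSeconds) throw new Error('Token expired');
  if (claims.iss !== issuer) throw new Error('Wrong issuer');
  if (claims.aud !== audience) throw new Error('Wrong audience');
  return claims;
}
```

Checking the signature before parsing the claims means untrusted payload contents are never acted on unless the token is authentic.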
Updates
| Date | Update | Author |
|---|---|---|
| 2024-01-15 | Initial ADR created | Bob |
| 2024-01-20 | Added security review feedback | Frank |
| 2024-02-01 | Updated after implementation | Bob |
Status: Accepted Supersedes: ADR-008 (Session-based Authentication) Related: ADR-012 (API Security), ADR-014 (Multi-region Deployment)
## Usage Examples
@architecture-documenter
@architecture-documenter --type system-overview
@architecture-documenter --type adr
@architecture-documenter --focus security
@architecture-documenter --focus scalability
@architecture-documenter --include-diagrams
@architecture-documenter --update-existing
## Best Practices

### Document Architecture Decisions

**When to create an ADR**:
- Significant technical decisions
- Architecture changes
- Technology choices
- Process changes
- Security decisions

**ADR Structure**:
1. **Context**: What's the situation?
2. **Decision**: What did we decide?
3. **Alternatives**: What else did we consider?
4. **Consequences**: What are the impacts?

### Use Visual Diagrams

**Diagram Types**:
- **System Context**: Show system boundaries
- **Container**: Show high-level architecture
- **Component**: Show internal structure
- **Code**: Show class/module relationships
- **Deployment**: Show infrastructure
- **Sequence**: Show interactions over time

**Tools**:
- Diagrams as code: Mermaid, PlantUML
- Visual tools: Lucidchart, Draw.io
- Cloud-specific: AWS Architecture Diagrams

### Keep Documentation Current

**Documentation Lifecycle**:
- Create during design phase
- Review in code review
- Update with implementation changes
- Quarterly architecture review
- Archive outdated docs (don't delete)

**Version Control**:
- Store docs with code
- Version alongside releases
- Link docs to specific code versions
- Maintain changelog

### Make It Discoverable

**Organization**:
- Central location (wiki, docs folder)
- Clear naming conventions
- Table of contents
- Cross-references
- Search-friendly

**Accessibility**:
- Public within organization
- Easy to navigate
- Multiple entry points
- Links from README

## Notes

- Architecture documentation is for communication, not perfection
- Diagrams speak louder than words - use them liberally
- ADRs capture decisions and context for future reference
- Keep docs synchronized with code changes
- Version architecture docs alongside code
- Regular reviews prevent documentation drift
- Good architecture docs reduce onboarding time significantly
- Document the "why" not just the "what"
- Include trade-offs and alternatives considered
- Make security and scalability explicit
- Link architecture to business goals
- Use consistent notation and terminology