# VP Engineering Perspective
## Engineering Org Design (Team Topologies)
### Team Types
| Type | Purpose | Characteristics | Size |
|------|---------|-----------------|------|
| Stream-aligned | Deliver user/business value | Full-stack, autonomous, owns entire feature slice | 5-8 |
| Platform | Reduce cognitive load for stream teams | Internal products, self-service APIs/tools | 3-6 |
| Enabling | Help teams adopt new capabilities | Coaching, not doing; temporary engagement | 2-3 |
| Complicated subsystem | Deep specialist expertise | ML, payments, security, real-time systems | 2-4 |
### Interaction Modes
| Mode | Description | When to Use |
|------|-------------|-------------|
| Collaboration | Teams work together closely | New capability discovery, high uncertainty |
| X-as-a-Service | One team provides, other consumes | Well-defined API/platform capability |
| Facilitating | One team coaches another | Skill transfer, technology adoption |
### Org Design Template
## Engineering Organization: [Company Name]
### Team Map
```
Stream-aligned Teams:
├── Team Alpha: [product area] (5 people)
│     Owns: [service/feature list]
│     Stack: [tech stack]
├── Team Beta: [product area] (6 people)
│     Owns: [service/feature list]
│     Stack: [tech stack]
└── Team Gamma: [product area] (5 people)
      Owns: [service/feature list]
      Stack: [tech stack]

Platform Team:
└── Team Platform (4 people)
      Provides: CI/CD, observability, developer portal
      Interaction: X-as-a-Service

Enabling Team:
└── Team Enable (2 people)
      Focus: [current initiative - e.g., Kubernetes migration]
      Interaction: Facilitating (rotates every quarter)

Complicated Subsystem:
└── Team ML (3 people)
      Owns: ML pipeline, model serving, feature store
      Interaction: Collaboration with stream teams
```
### Cognitive Load Assessment
| Team | Intrinsic (domain) | Extraneous (tools) | Total | Status |
|------|-------------------|--------------------|----|--------|
| Alpha | 6/10 | 3/10 | 9/10 | At capacity |
| Beta | 5/10 | 4/10 | 9/10 | At capacity |
| Platform | 7/10 | 2/10 | 9/10 | At capacity |
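The totals above are simply intrinsic plus extraneous load, so they can be computed and flagged before more work is assigned. A minimal sketch — the thresholds below are illustrative assumptions, not a standard:

```python
def assess_load(teams, capacity_threshold=8):
    """Flag teams whose combined cognitive load nears capacity.

    teams: dict of name -> (intrinsic, extraneous), each scored 0-10.
    The 8/10 capacity threshold is an illustrative assumption.
    """
    report = {}
    for name, (intrinsic, extraneous) in teams.items():
        total = intrinsic + extraneous
        if total >= capacity_threshold:
            status = "At capacity"
        elif total >= capacity_threshold - 2:
            status = "Nearing capacity"
        else:
            status = "Healthy"
        report[name] = (total, status)
    return report

report = assess_load({"Alpha": (6, 3), "Beta": (5, 4), "Platform": (7, 2)})
```

High extraneous scores point at platform investment; high intrinsic scores point at splitting the domain.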
### Org Design Anti-Patterns
| Anti-Pattern | Symptom | Fix |
|--------------|---------|-----|
| Conway's Law violation | Architecture doesn't match team structure | Align teams to desired architecture |
| Shared services bottleneck | Every team waits for "core team" | Split into platform + self-service |
| Matrix management | Unclear ownership, split loyalty | Single reporting line per IC |
| Too many meetings | "Alignment" overhead > execution | Reduce interaction surface, use async |
| Hero culture | One person knows everything | Document, pair, rotate on-call |
## Process Improvement (Agile Maturity)
### Agile Maturity Model
| Level | Name | Characteristics |
|-------|------|-----------------|
| 1 | Initial | Ad-hoc, no process, firefighting |
| 2 | Managed | Basic scrum/kanban, inconsistent |
| 3 | Defined | Consistent process, metrics tracked |
| 4 | Measured | Data-driven decisions, predictable delivery |
| 5 | Optimizing | Continuous improvement, experiments |
### Process Improvement Framework
## Process Improvement Cycle
### 1. Observe (1 sprint)
- Shadow team ceremonies
- Measure cycle time, WIP, defects
- Interview team members (1-on-1)
### 2. Diagnose
| Problem | Root Cause | Impact |
|---------|-----------|--------|
| [symptom] | [why] | [what it costs] |
### 3. Hypothesize
"If we [change], then [expected outcome], measured by [metric]"
### 4. Experiment (2-3 sprints)
- Implement ONE change at a time
- Measure baseline vs. new
- Collect team feedback
### 5. Evaluate
- Did the metric improve?
- Did the team feel the improvement?
- Any unintended side effects?
### 6. Adopt or Revert
- Improvement verified: document and standardize
- No improvement: revert and try next hypothesis
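Steps 3-6 reduce to a measurable comparison: baseline metric versus experiment metric against a pre-agreed improvement bar. A sketch, assuming cycle time in days as the metric and a 10% bar (both assumptions):

```python
from statistics import median

def evaluate_experiment(baseline, experiment, min_improvement=0.10):
    """Compare median cycle time before and during a process change.

    Returns (relative_improvement, verdict). The 10% minimum
    improvement bar is illustrative, not a standard.
    """
    before, after = median(baseline), median(experiment)
    improvement = (before - after) / before
    verdict = "adopt" if improvement >= min_improvement else "revert"
    return improvement, verdict

# Cycle times (days) for stories before vs. during the experiment
improvement, verdict = evaluate_experiment([5, 6, 4, 7, 5], [3, 4, 3, 5, 4])
```

Medians resist the odd outlier story better than means, which matters with sprint-sized samples.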
### Common Process Fixes
| Problem | Fix | Metric |
|---------|-----|--------|
| Missed deadlines | Smaller stories, better estimation | Story completion rate |
| Too much WIP | WIP limits (Kanban) | Cycle time |
| Unclear requirements | Refinement meetings, acceptance criteria | Defect rate |
| Deployment fear | Feature flags, canary deploys | Deploy frequency |
| Slow code reviews | SLA (24h max), small PRs | Review turnaround |
| Meeting overload | No-meeting days, async updates | Focus time % |
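The WIP-limit row works because of Little's Law: average cycle time equals average WIP divided by average throughput. A quick sketch of the arithmetic:

```python
def expected_cycle_time(wip, throughput_per_week):
    """Little's Law: avg cycle time = avg WIP / avg throughput.

    Units: items in flight and items finished per week -> weeks.
    """
    return wip / throughput_per_week

# A team finishing 5 items/week with 15 items in flight:
before = expected_cycle_time(15, 5)  # 3.0 weeks per item
# Capping WIP at 6 with the same throughput:
after = expected_cycle_time(6, 5)    # 1.2 weeks per item
```

Throughput is unchanged; only waiting time shrinks, which is why WIP limits feel slower per person but ship faster per item.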
## Cross-Team Dependency Management
### Dependency Mapping
## Cross-Team Dependencies: [Quarter]
### Dependency Matrix
| Providing Team | Consuming Team | Dependency | Type | Status | Risk |
|---------------|---------------|-----------|------|--------|------|
| Platform | Alpha | Auth service v2 | Blocking | In progress | Medium |
| Alpha | Beta | User API | Non-blocking | Available | Low |
| ML | Gamma | Rec engine | Blocking | Not started | High |
### Dependency Types
- **Blocking:** Must be completed before consumer can start
- **Non-blocking:** Can work in parallel with mocked interface
- **Soft:** Nice to have, workaround exists
### Visualization
```
Team Alpha    ──blocks──→ Team Beta  (User API)
Team Platform ──blocks──→ Team Alpha (Auth v2)
Team ML       ──blocks──→ Team Gamma (Rec engine)  ← HIGH RISK
```
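Blocking dependencies form a directed graph, so a topological sort yields a safe delivery order and errors out on mutual blocking. A sketch using Python's standard library, with the teams from the matrix above:

```python
from graphlib import TopologicalSorter, CycleError

# provider -> consumers, taken from the dependency matrix
blocking = {"Platform": ["Alpha"], "Alpha": ["Beta"], "ML": ["Gamma"]}

ts = TopologicalSorter()
for provider, consumers in blocking.items():
    for consumer in consumers:
        ts.add(consumer, provider)  # consumer depends on provider

try:
    delivery_order = list(ts.static_order())
except CycleError:
    # Mutual blocking: the teams must renegotiate the contract
    delivery_order = None
```

Providers with no blockers surface first, which is exactly the work to prioritize at quarter start.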
### Dependency Resolution Strategies
| Strategy | When to Use |
|----------|-------------|
| Contract-first | Define API contract, both teams implement independently |
| Embedded engineer | Loan an engineer from providing team |
| Shared interface | Agree on interface, mock until ready |
| Prioritize differently | Move blocking work to top of providing team's backlog |
| Decouple | Feature flags, adapter pattern, event-driven |
| Eliminate | Redesign to remove dependency entirely |
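The "shared interface" strategy can be made concrete with a typed contract: the consumer codes against it and swaps the mock for the real client when it ships. A hypothetical sketch (`RecEngine` and `recommend` are illustrative names, echoing the ML → Gamma dependency above):

```python
from typing import Protocol

class RecEngine(Protocol):
    """Contract agreed between the providing and consuming teams."""
    def recommend(self, user_id: str, limit: int) -> list[str]: ...

class MockRecEngine:
    """Consumer-side stand-in until the real service is ready."""
    def recommend(self, user_id: str, limit: int) -> list[str]:
        return [f"popular-item-{i}" for i in range(limit)]

def render_homepage(engine: RecEngine, user_id: str) -> list[str]:
    # The consumer builds against the contract, not the implementation
    return engine.recommend(user_id, limit=3)

items = render_homepage(MockRecEngine(), "user-42")
```

When the real client lands, only the injected `engine` changes; the consumer's code and tests stay intact.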
### Dependency Anti-Patterns
| Anti-Pattern | Why It's Wrong | The Right Way |
|--------------|----------------|---------------|
| Hidden dependencies | Discovered too late | Map dependencies in planning |
| Dependency as excuse | "Blocked by Team X" for weeks | Escalate immediately, find alternatives |
| Hub team (everything flows through one) | Bottleneck | Distribute ownership, self-service |
| Cross-team code ownership | Slow PRs, merge conflicts | Clear ownership boundaries |
## Engineering Culture Building
### Culture Pillars
## Engineering Culture: [Company Name]
### Our Values (with behaviors)
1. **Ownership**
- Do: Take responsibility end-to-end (build, deploy, monitor)
   - Don't: "Not my code" / "That's an ops problem"
- Measure: On-call engagement, post-incident participation
2. **Craft**
- Do: Write tests, review thoughtfully, refactor proactively
- Don't: "Ship now, fix later" (unless P0)
- Measure: Code review quality, tech debt ratio
3. **Transparency**
- Do: Share context, document decisions, default to public channels
- Don't: Hoarding information, private DMs for team decisions
- Measure: Documentation coverage, team survey
4. **Learning**
- Do: Blameless retros, share mistakes, invest in growth
- Don't: Blame individuals, hide failures
- Measure: Retro action items completed, conference talks
5. **Speed**
- Do: Small PRs, feature flags, iterate quickly
- Don't: Big bang releases, analysis paralysis
- Measure: Lead time, deploy frequency
### Culture Building Practices
| Practice | Frequency | Owner | Goal |
|----------|-----------|-------|------|
| Blameless post-mortems | Per incident | Engineering managers | Learn from failures |
| Engineering all-hands | Monthly | VP Engineering | Alignment, wins, direction |
| Tech talks / brown bags | Biweekly | Rotating engineers | Knowledge sharing |
| Hack days / hackathon | Quarterly | Engineering leads | Innovation, morale |
| Architecture review | Biweekly | Architects | Consistency, quality |
| 1-on-1s | Weekly | Managers | Growth, retention |
| Skip-level 1-on-1s | Monthly | VP/Director | Pulse check, escalation |
| Engineering blog | Monthly+ | Rotating authors | Employer branding |
| Open source contributions | Continuous | Anyone | Community, recruitment |
## OKR Setting for Engineering
### OKR Template
## Engineering OKRs: Q[X] [Year]
### Objective 1: Accelerate delivery velocity
| KR | Target | Current | Status |
|----|--------|---------|--------|
| KR1.1: Reduce lead time from code to production | < 4 hours | 2 days | [on/off track] |
| KR1.2: Increase deploy frequency | 5x/day | 2x/week | [on/off track] |
| KR1.3: Reduce change failure rate | < 5% | 12% | [on/off track] |
### Objective 2: Improve developer experience
| KR | Target | Current | Status |
|----|--------|---------|--------|
| KR2.1: Developer satisfaction score | > 4.2/5 | 3.6/5 | [on/off track] |
| KR2.2: Reduce CI build time | < 5 min | 12 min | [on/off track] |
| KR2.3: New hire productive in < 2 weeks | 90% | 60% | [on/off track] |
### Objective 3: Strengthen reliability
| KR | Target | Current | Status |
|----|--------|---------|--------|
| KR3.1: Achieve 99.95% uptime | 99.95% | 99.8% | [on/off track] |
| KR3.2: Reduce MTTR to < 30 min | 30 min | 2 hours | [on/off track] |
| KR3.3: Zero P0 incidents from known issues | 0 | 3/quarter | [on/off track] |
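KR status is easier to keep honest when computed: progress is the fraction of the baseline-to-target gap closed, which works for both "increase" and "decrease" KRs. A sketch:

```python
def kr_progress(baseline, current, target):
    """Fraction of the baseline -> target gap closed, clamped to [0, 1].

    Works whether the target is above or below the baseline.
    """
    if target == baseline:
        return 1.0
    progress = (current - baseline) / (target - baseline)
    return max(0.0, min(1.0, progress))

# KR1.3: change failure rate, baseline 12% -> target 5%, currently 9%
p = kr_progress(12, 9, 5)  # (9-12)/(5-12) = 3/7 of the gap closed
```

Pairing this with a simple rule (e.g., "on track" if progress keeps pace with the quarter elapsed) replaces the subjective `[on/off track]` cell.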
### OKR Anti-Patterns
| Anti-Pattern | Why It's Wrong | The Right Way |
|--------------|----------------|---------------|
| Feature-based OKRs | "Ship feature X" is a task, not an outcome | Focus on outcomes ("Reduce churn by 10%") |
| Too many OKRs | Diluted focus | 3 objectives, 3-4 KRs each max |
| Binary KRs | No progress signal | Quantitative, measurable, with baseline |
| No alignment | Disconnected from company OKRs | Cascade from company → engineering → team |
| Set and forget | No mid-quarter check | Weekly tracking, monthly review |
## Incident Management Maturity
### Maturity Levels
| Level | Characteristics | Actions |
|-------|-----------------|---------|
| 1: Reactive | No process, ad-hoc response, hero-driven | Document basic runbooks, assign on-call |
| 2: Organized | On-call rotation, basic alerting, Slack channel | Add severity classification, escalation paths |
| 3: Systematic | Incident commander role, structured comms, SLOs | Add blameless post-mortems, action item tracking |
| 4: Proactive | Error budgets, chaos engineering, SLO dashboards | Game days, automated remediation |
| 5: Predictive | ML-based anomaly detection, self-healing | Continuous improvement, near-zero MTTR |
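Level 4's error budgets are simple arithmetic: the budget is the unreliability the SLO allows over the window. A sketch:

```python
def error_budget_minutes(slo, window_days=30):
    """Allowed downtime per window for an availability SLO.

    e.g., a 99.9% SLO over 30 days allows ~43.2 minutes.
    """
    return (1 - slo) * window_days * 24 * 60

def budget_remaining(slo, downtime_minutes, window_days=30):
    """Fraction of the error budget still unspent this window."""
    budget = error_budget_minutes(slo, window_days)
    return (budget - downtime_minutes) / budget

budget = error_budget_minutes(0.999)
remaining = budget_remaining(0.999, downtime_minutes=10.8)
```

When `remaining` approaches zero, the usual policy is to trade feature work for reliability work until the budget recovers.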
### Incident Management Framework
## Incident Response Structure
### Roles
| Role | Responsibility |
|------|---------------|
| Incident Commander (IC) | Coordinates response, makes decisions |
| Technical Lead | Diagnoses and fixes the issue |
| Communications Lead | Stakeholder updates, status page |
| Scribe | Documents timeline and actions |
### Severity Levels
| Level | Definition | Response Time | IC Required | Status Page | Exec Notify |
|-------|-----------|--------------|-------------|-------------|-------------|
| SEV-1 | Full outage | 5 min | Yes | Yes | Immediately |
| SEV-2 | Major degradation | 15 min | Yes | Yes | Within 1h |
| SEV-3 | Minor impact | 1 hour | No | Optional | No |
| SEV-4 | No user impact | Next business day | No | No | No |
### Communication Cadence
| SEV | Internal Update | External Update | Exec Update |
|-----|----------------|----------------|-------------|
| SEV-1 | Every 15 min | Every 30 min | Every 30 min |
| SEV-2 | Every 30 min | Every 1h | Every 2h |
| SEV-3 | Every 2h | If customer-facing | None |
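The severity and cadence tables stay in sync with reality more easily when encoded as data the paging logic reads. A sketch mirroring the tables above (SEV-4's "next business day" is simplified to 24 hours, and SEV-3's optional status page to off — both assumptions):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class SevPolicy:
    response_min: int                   # minutes to first response
    ic_required: bool                   # page an incident commander?
    status_page: bool                   # post to the public status page?
    internal_update_min: Optional[int]  # update cadence in minutes, None = no cadence

POLICIES = {
    "SEV-1": SevPolicy(response_min=5, ic_required=True, status_page=True, internal_update_min=15),
    "SEV-2": SevPolicy(response_min=15, ic_required=True, status_page=True, internal_update_min=30),
    "SEV-3": SevPolicy(response_min=60, ic_required=False, status_page=False, internal_update_min=120),
    "SEV-4": SevPolicy(response_min=24 * 60, ic_required=False, status_page=False, internal_update_min=None),
}

def on_incident(severity: str) -> SevPolicy:
    """Look up the response policy; paging hooks would hang off this."""
    return POLICIES[severity]
```

One source of truth means the runbook, the bot, and the pager can never disagree about what SEV-2 requires.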
### Post-Incident Review Quality Checklist
## Platform Team Strategy
### Platform Team Charter
## Platform Team Charter
### Mission
Reduce cognitive load on stream-aligned teams by providing self-service
infrastructure, tooling, and abstractions.
### Principles
1. Treat internal teams as customers
2. Self-service > ticket-based requests
3. Paved roads, not mandates
4. Measure developer experience, not just uptime
### Product Areas
| Area | What We Provide | Maturity |
|------|----------------|----------|
| CI/CD | Build pipelines, deploy automation | Mature |
| Observability | Logging, metrics, tracing, dashboards | Growing |
| Developer portal | Service catalog, docs, templates | Early |
| Infrastructure | K8s, databases, caching, queues | Mature |
| Security | Secret management, vulnerability scanning | Growing |
### Success Metrics
| Metric | Target | Current |
|--------|--------|---------|
| Time to onboard new service | < 1 day | 1 week |
| Developer satisfaction (platform) | > 4.0/5 | 3.5/5 |
| Self-service adoption rate | > 80% | 50% |
| Support tickets per team per month | < 5 | 12 |
### Roadmap (Next 2 Quarters)
| Quarter | Initiative | Impact |
|---------|-----------|--------|
| Q1 | Internal developer portal | Reduce onboarding time 50% |
| Q1 | Standardized service template | Consistent microservices |
| Q2 | Golden path for new services | < 1 hour to first deploy |
| Q2 | Self-service database provisioning | Remove DBA bottleneck |
### Platform Anti-Patterns
| Anti-Pattern | Symptom | Fix |
|--------------|---------|-----|
| Building for no one | Platform features nobody asked for | Customer interviews, usage metrics |
| Mandatory adoption | Teams forced to use half-baked tools | Make it so good they want to use it |
| Ticket-based everything | Slow provisioning, frustrated teams | Self-service APIs and UIs |
| No documentation | Teams can't use platform without help | Treat docs as product |
| Ivory tower | Platform team disconnected from users | Embed with stream teams periodically |
## Developer Experience (DX) Optimization
### DX Metrics
| Metric | How to Measure | Target |
|--------|----------------|--------|
| Dev environment setup | Time from clone to running | < 15 min |
| CI build time | Pipeline duration (p50/p95) | < 5 min (p50) |
| Code review turnaround | PR open to first review | < 4 hours |
| Deploy to production | Merge to live | < 1 hour |
| Incident notification | Alert to human eyes | < 5 min |
| Documentation freshness | % docs updated in last 90 days | > 80% |
| On-call burden | Pages per week per person | < 2 |
| Context switching | Interruptions per focus block | < 1 |
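The p50/p95 split in the CI metric matters because averages hide tail pain. With a sample of recent build durations, both can be checked against target using the standard library:

```python
from statistics import quantiles

def build_time_percentiles(durations_min):
    """Return (p50, p95) of CI build durations in minutes."""
    qs = quantiles(durations_min, n=100, method="inclusive")
    return qs[49], qs[94]

# Illustrative sample; the last two are cold-cache builds
durations = [3, 3, 4, 4, 4, 5, 5, 6, 12, 15]
p50, p95 = build_time_percentiles(durations)
meets_target = p50 < 5  # the < 5 min (p50) target from the table
```

If p50 is healthy but p95 is several multiples higher, the fix is usually cache warming or flaky-test quarantine rather than a faster runner.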
### DX Improvement Roadmap
## DX Improvement Plan
### Quick Wins (< 1 week each)
- [ ] Pre-configured dev containers / devbox
- [ ] One-command project setup script
- [ ] PR template with checklist
- [ ] Slack bot for deploy status
- [ ] Auto-assign code reviewers
### Medium Term (1-4 weeks)
- [ ] Reduce CI build time by 50%
- [ ] Local development matches production (docker-compose)
- [ ] API documentation auto-generated from code
- [ ] Error messages link to runbooks
- [ ] Feature flag self-service UI
### Long Term (1-3 months)
- [ ] Internal developer portal (Backstage/custom)
- [ ] Self-service infrastructure provisioning
- [ ] Automated dependency updates (Renovate)
- [ ] Golden path templates for new services
- [ ] DX survey and tracking dashboard
### DX Survey Template
## Developer Experience Survey (Quarterly)
Rate 1-5 (1 = terrible, 5 = excellent):
### Development
1. How easy is it to set up your local dev environment?
2. How reliable is your local dev environment?
3. How fast is your CI/CD pipeline?
4. How easy is it to find and understand documentation?
### Collaboration
5. How efficient is your code review process?
6. How well does cross-team collaboration work?
7. How effective are your team's meetings?
### Operations
8. How manageable is on-call?
9. How good are your monitoring and alerting tools?
10. How confident are you in deploying to production?
### Growth
11. How supported do you feel in your career growth?
12. How much time do you spend on meaningful work vs. toil?
### Open Ended
13. What is the biggest time-waster in your day?
14. If you could change one thing about engineering, what would it be?
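Survey answers become actionable once aggregated per category with a flag on anything trending low. A sketch — the category grouping follows the survey above, and the 3.5 alert threshold is an assumption:

```python
from statistics import mean

# Question numbers grouped per the survey sections above
CATEGORIES = {
    "Development": [1, 2, 3, 4],
    "Collaboration": [5, 6, 7],
    "Operations": [8, 9, 10],
    "Growth": [11, 12],
}

def summarize(responses, threshold=3.5):
    """responses: list of dicts mapping question number -> 1..5 rating.

    Returns category -> (average, needs_attention).
    """
    summary = {}
    for category, questions in CATEGORIES.items():
        scores = [r[q] for r in responses for q in questions]
        avg = round(mean(scores), 2)
        summary[category] = (avg, avg < threshold)
    return summary

responses = [
    dict(zip(range(1, 13), [4, 4, 2, 3, 4, 3, 4, 2, 3, 4, 4, 3])),
    dict(zip(range(1, 13), [5, 4, 3, 3, 4, 4, 3, 2, 3, 3, 4, 4])),
]
summary = summarize(responses)
```

Tracking the same categories quarter over quarter turns the survey from a mood ring into a trend line.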
## Release Management at Scale
### Release Strategy Options
| Strategy | When to Use | Complexity |
|----------|-------------|------------|
| Continuous deployment | Mature CI/CD, high test confidence | Low (automated) |
| Release train | Multi-team, coordinated releases | Medium |
| Feature flags | Decouple deploy from release | Medium |
| Blue-green deploy | Zero-downtime requirement | Medium |
| Canary release | Gradual rollout, risk mitigation | High |
| Ring deployment | Internal → beta → GA | High |
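Feature flags, canaries, and ring deployments all need a deterministic way to place a user inside a rollout percentage; hashing a stable ID keeps each user's cohort consistent across requests. A sketch:

```python
import hashlib

def in_rollout(user_id: str, flag: str, percent: int) -> bool:
    """Deterministically bucket a user into the first `percent` of 100
    buckets. Keying by flag gives each feature an independent cohort."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < percent

# The same user always gets the same answer for the same flag:
first = in_rollout("user-42", "new-checkout", 10)
second = in_rollout("user-42", "new-checkout", 10)
```

Ramping from 1% to 10% to 100% only ever adds users to the cohort, so nobody flip-flops between old and new behavior mid-rollout.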
### Release Process (Multi-Team)
## Release Checklist: v[X.Y.Z]
### Pre-Release (T-2 days)
- [ ] All feature branches merged to release branch
- [ ] Release branch passes all tests
- [ ] Cross-team integration tests passing
- [ ] Dependent services compatible (API contracts)
- [ ] Database migrations tested
- [ ] Feature flags configured for new features
- [ ] Rollback plan documented
### Release Day (T-0)
- [ ] Release branch deployed to staging
- [ ] QA sign-off on staging
- [ ] Monitoring dashboards reviewed (baseline)
- [ ] On-call team briefed
- [ ] Canary deployment initiated
- [ ] Canary metrics monitored (error rate, latency, business KPIs)
- [ ] Full rollout completed
- [ ] Post-deploy verification
### Post-Release (T+1)
- [ ] Metrics compared to baseline
- [ ] No regression in error rates or latency
- [ ] Customer support briefed on changes
- [ ] Release notes published
- [ ] Feature flags cleaned up (remove old)
- [ ] Retrospective scheduled (if issues occurred)
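The "metrics compared to baseline" step can be an automated guardrail: flag the release if error rate or p95 latency regresses past tolerance. A sketch with illustrative thresholds:

```python
def verify_release(baseline, current,
                   max_error_increase=0.005, max_latency_ratio=1.10):
    """baseline/current: dicts with 'error_rate' (fraction) and 'p95_ms'.

    Returns (ok, reasons); empty reasons means the release passes.
    The 0.5pp and 10% tolerances are illustrative assumptions.
    """
    reasons = []
    if current["error_rate"] > baseline["error_rate"] + max_error_increase:
        reasons.append("error rate regression")
    if current["p95_ms"] > baseline["p95_ms"] * max_latency_ratio:
        reasons.append("p95 latency regression")
    return (not reasons, reasons)

ok, reasons = verify_release(
    {"error_rate": 0.002, "p95_ms": 180},
    {"error_rate": 0.004, "p95_ms": 210},
)
```

Wiring this check into the pipeline turns "rollback plan documented" into "rollback triggered automatically".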
### Release Metrics
| Metric | What It Measures | Target |
|--------|------------------|--------|
| Release frequency | How often we ship | Weekly or more |
| Release lead time | Code complete to production | < 1 day |
| Release success rate | Releases without rollback | > 95% |
| Rollback rate | How often we revert | < 5% |
| Hotfix frequency | Emergency fixes needed | < 1/month |
| Feature flag cleanup | Stale flags removed | Within 30 days |
### Release Anti-Patterns
| Anti-Pattern | Why It's Wrong | The Right Way |
|--------------|----------------|---------------|
| "Big bang" releases | High risk, hard to debug | Small, frequent releases |
| Release branch lives too long | Merge conflicts, integration hell | Short-lived, merge daily |
| Manual release process | Error-prone, slow | Fully automated pipeline |
| No rollback plan | Stuck with broken release | Always have rollback procedure |
| Feature flags never cleaned | Combinatorial explosion | Clean up within 30 days |
| Friday deployments | Nobody around for issues | Deploy Mon-Thu, observe Fri |
| No release notes | Users/support confused | Automated changelog generation |