Vibecosystem vp-engineering

VP Engineering perspective - org design (team topologies), process improvement, cross-team dependencies, engineering culture, OKRs, incident management maturity, platform strategy, DX optimization, release management at scale

Install

Source · Clone the upstream repo:

git clone https://github.com/vibeeval/vibecosystem

Claude Code · Install into ~/.claude/skills/:

T=$(mktemp -d) && git clone --depth=1 https://github.com/vibeeval/vibecosystem "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/vp-engineering" ~/.claude/skills/vibeeval-vibecosystem-vp-engineering && rm -rf "$T"

Manifest: skills/vp-engineering/SKILL.md

VP Engineering Perspective

Engineering Org Design (Team Topologies)

Team Types

| Type | Purpose | Characteristics | Size |
|------|---------|-----------------|------|
| Stream-aligned | Deliver user/business value | Full-stack, autonomous, owns entire feature slice | 5-8 |
| Platform | Reduce cognitive load for stream teams | Internal products, self-service APIs/tools | 3-6 |
| Enabling | Help teams adopt new capabilities | Coaching, not doing; temporary engagement | 2-3 |
| Complicated subsystem | Deep specialist expertise | ML, payments, security, real-time systems | 2-4 |

Interaction Modes

| Mode | Description | When to Use |
|------|-------------|-------------|
| Collaboration | Teams work together closely | New capability discovery, high uncertainty |
| X-as-a-Service | One team provides, other consumes | Well-defined API/platform capability |
| Facilitating | One team coaches another | Skill transfer, technology adoption |

Org Design Template

## Engineering Organization: [Company Name]

### Team Map

Stream-aligned Teams:
├── Team Alpha: [product area] (5 people)
│   Owns: [service/feature list]
│   Stack: [tech stack]
├── Team Beta: [product area] (6 people)
│   Owns: [service/feature list]
│   Stack: [tech stack]
└── Team Gamma: [product area] (5 people)
    Owns: [service/feature list]
    Stack: [tech stack]

Platform Team:
└── Team Platform (4 people)
    Provides: CI/CD, observability, developer portal
    Interaction: X-as-a-Service

Enabling Team:
└── Team Enable (2 people)
    Focus: [current initiative - e.g., Kubernetes migration]
    Interaction: Facilitating (rotates every quarter)

Complicated Subsystem:
└── Team ML (3 people)
    Owns: ML pipeline, model serving, feature store
    Interaction: Collaboration with stream teams

### Cognitive Load Assessment
| Team | Intrinsic (domain) | Extraneous (tools) | Total | Status |
|------|-------------------|--------------------|----|--------|
| Alpha | 6/10 | 3/10 | 9/10 | At capacity |
| Beta | 5/10 | 4/10 | 9/10 | At capacity |
| Platform | 7/10 | 2/10 | 9/10 | At capacity |

Org Design Anti-Patterns

| Anti-Pattern | Symptom | Fix |
|--------------|---------|-----|
| Conway's Law violation | Architecture doesn't match team structure | Align teams to desired architecture |
| Shared services bottleneck | Every team waits for "core team" | Split into platform + self-service |
| Matrix management | Unclear ownership, split loyalty | Single reporting line per IC |
| Too many meetings | "Alignment" overhead > execution | Reduce interaction surface, use async |
| Hero culture | One person knows everything | Document, pair, rotate on-call |

Process Improvement (Agile Maturity)

Agile Maturity Model

| Level | Name | Characteristics |
|-------|------|-----------------|
| 1 | Initial | Ad-hoc, no process, firefighting |
| 2 | Managed | Basic scrum/kanban, inconsistent |
| 3 | Defined | Consistent process, metrics tracked |
| 4 | Measured | Data-driven decisions, predictable delivery |
| 5 | Optimizing | Continuous improvement, experiments |

Process Improvement Framework

## Process Improvement Cycle

### 1. Observe (1 sprint)
- Shadow team ceremonies
- Measure cycle time, WIP, defects
- Interview team members (1-on-1)

### 2. Diagnose
| Problem | Root Cause | Impact |
|---------|-----------|--------|
| [symptom] | [why] | [what it costs] |

### 3. Hypothesize
"If we [change], then [expected outcome], measured by [metric]"

### 4. Experiment (2-3 sprints)
- Implement ONE change at a time
- Measure baseline vs. new
- Collect team feedback

### 5. Evaluate
- Did the metric improve?
- Did the team feel the improvement?
- Any unintended side effects?

### 6. Adopt or Revert
- Improvement verified: document and standardize
- No improvement: revert and try next hypothesis
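
The measure-and-evaluate steps above can be sketched as a simple baseline-vs-experiment comparison. A minimal sketch; the cycle-time numbers are hypothetical examples, not real data:

```python
# Sketch: compare a baseline metric sample against an experiment sample
# to decide "adopt or revert". Numbers below are hypothetical.
from statistics import mean

def improvement(baseline, experiment, lower_is_better=True):
    """Percent change of the mean, signed so positive means 'improved'."""
    base, new = mean(baseline), mean(experiment)
    change = (base - new) / base if lower_is_better else (new - base) / base
    return round(change * 100, 1)

baseline_cycle_days = [5.0, 6.5, 4.0, 7.0, 5.5]    # before WIP limits
experiment_cycle_days = [3.5, 4.0, 3.0, 5.0, 4.5]  # after WIP limits
pct = improvement(baseline_cycle_days, experiment_cycle_days)
print(f"Cycle time improved {pct}%")  # adopt if this holds up; else revert
```

With real samples you would also want enough data points to rule out noise before standardizing the change.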

Common Process Fixes

| Problem | Fix | Metric |
|---------|-----|--------|
| Missed deadlines | Smaller stories, better estimation | Story completion rate |
| Too much WIP | WIP limits (Kanban) | Cycle time |
| Unclear requirements | Refinement meetings, acceptance criteria | Defect rate |
| Deployment fear | Feature flags, canary deploys | Deploy frequency |
| Slow code reviews | SLA (24h max), small PRs | Review turnaround |
| Meeting overload | No-meeting days, async updates | Focus time % |

Cross-Team Dependency Management

Dependency Mapping

## Cross-Team Dependencies: [Quarter]

### Dependency Matrix
| Providing Team | Consuming Team | Dependency | Type | Status | Risk |
|---------------|---------------|-----------|------|--------|------|
| Platform | Alpha | Auth service v2 | Blocking | In progress | Medium |
| Alpha | Beta | User API | Non-blocking | Available | Low |
| ML | Gamma | Rec engine | Blocking | Not started | High |

### Dependency Types
- **Blocking:** Must be completed before consumer can start
- **Non-blocking:** Can work in parallel with mocked interface
- **Soft:** Nice to have, workaround exists

### Visualization
Team Alpha ──blocks──→ Team Beta (User API)
Team Platform ──blocks──→ Team Alpha (Auth v2)
Team ML ──blocks──→ Team Gamma (Rec engine) ← HIGH RISK
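
The blocking relationships above form a directed graph, which makes chains and cycles easy to surface with a script. A minimal sketch, using the example teams from the matrix above:

```python
# Sketch: order teams by blocking dependencies and detect cycles.
# Edges mean "provider blocks consumer"; names mirror the example matrix.
from collections import defaultdict

def find_order(edges):
    """Return (topological order, has_cycle) for a provider->consumer graph."""
    graph = defaultdict(list)
    indegree = defaultdict(int)
    nodes = set()
    for provider, consumer in edges:
        graph[provider].append(consumer)
        indegree[consumer] += 1
        nodes.update((provider, consumer))
    # Kahn's algorithm: repeatedly peel off teams with no unresolved blockers.
    ready = sorted(n for n in nodes if indegree[n] == 0)
    order = []
    while ready:
        node = ready.pop(0)
        order.append(node)
        for nxt in graph[node]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)
    return order, len(order) != len(nodes)  # leftover nodes => dependency cycle

edges = [("Platform", "Alpha"), ("Alpha", "Beta"), ("ML", "Gamma")]
order, has_cycle = find_order(edges)
print(order, has_cycle)  # a delivery order that respects blockers; no cycle here
```

A cycle in this graph means two teams block each other, which no amount of prioritization fixes; that is when the "Decouple" or "Eliminate" strategies below apply.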

Dependency Resolution Strategies

| Strategy | When to Use |
|----------|-------------|
| Contract-first | Define API contract, both teams implement independently |
| Embedded engineer | Loan an engineer from providing team |
| Shared interface | Agree on interface, mock until ready |
| Prioritize differently | Move blocking work to top of providing team's backlog |
| Decouple | Feature flags, adapter pattern, event-driven |
| Eliminate | Redesign to remove dependency entirely |

Dependency Anti-Patterns

| Anti-Pattern | Why It's Wrong | Right Way |
|--------------|----------------|-----------|
| Hidden dependencies | Discovered too late | Map dependencies in planning |
| Dependency as excuse | "Blocked by Team X" for weeks | Escalate immediately, find alternatives |
| Hub team (everything flows through one) | Bottleneck | Distribute ownership, self-service |
| Cross-team code ownership | Slow PRs, merge conflicts | Clear ownership boundaries |

Engineering Culture Building

Culture Pillars

## Engineering Culture: [Company Name]

### Our Values (with behaviors)

1. **Ownership**
   - Do: Take responsibility end-to-end (build, deploy, monitor)
   - Don't: "Not my code" / "That's an ops problem"
   - Measure: On-call engagement, post-incident participation

2. **Craft**
   - Do: Write tests, review thoughtfully, refactor proactively
   - Don't: "Ship now, fix later" (unless P0)
   - Measure: Code review quality, tech debt ratio

3. **Transparency**
   - Do: Share context, document decisions, default to public channels
   - Don't: Hoard information or make team decisions in private DMs
   - Measure: Documentation coverage, team survey

4. **Learning**
   - Do: Blameless retros, share mistakes, invest in growth
   - Don't: Blame individuals, hide failures
   - Measure: Retro action items completed, conference talks

5. **Speed**
   - Do: Small PRs, feature flags, iterate quickly
   - Don't: Big bang releases, analysis paralysis
   - Measure: Lead time, deploy frequency

Culture Building Practices

| Practice | Frequency | Owner | Goal |
|----------|-----------|-------|------|
| Blameless post-mortems | Per incident | Engineering managers | Learn from failures |
| Engineering all-hands | Monthly | VP Engineering | Alignment, wins, direction |
| Tech talks / brown bags | Biweekly | Rotating engineers | Knowledge sharing |
| Hack days / hackathons | Quarterly | Engineering leads | Innovation, morale |
| Architecture review | Biweekly | Architects | Consistency, quality |
| 1-on-1s | Weekly | Managers | Growth, retention |
| Skip-level 1-on-1s | Monthly | VP/Director | Pulse check, escalation |
| Engineering blog | Monthly+ | Rotating authors | Employer branding |
| Open source contributions | Continuous | Anyone | Community, recruitment |

OKR Setting for Engineering

OKR Template

## Engineering OKRs: Q[X] [Year]

### Objective 1: Accelerate delivery velocity
| KR | Target | Current | Status |
|----|--------|---------|--------|
| KR1.1: Reduce lead time from code to production | < 4 hours | 2 days | [on/off track] |
| KR1.2: Increase deploy frequency | 5x/day | 2x/week | [on/off track] |
| KR1.3: Reduce change failure rate | < 5% | 12% | [on/off track] |

### Objective 2: Improve developer experience
| KR | Target | Current | Status |
|----|--------|---------|--------|
| KR2.1: Developer satisfaction score | > 4.2/5 | 3.6/5 | [on/off track] |
| KR2.2: Reduce CI build time | < 5 min | 12 min | [on/off track] |
| KR2.3: New hire productive in < 2 weeks | 90% | 60% | [on/off track] |

### Objective 3: Strengthen reliability
| KR | Target | Current | Status |
|----|--------|---------|--------|
| KR3.1: Achieve 99.95% uptime | 99.95% | 99.8% | [on/off track] |
| KR3.2: Reduce MTTR to < 30 min | 30 min | 2 hours | [on/off track] |
| KR3.3: Zero P0 incidents from known issues | 0 | 3/quarter | [on/off track] |
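
The [on/off track] status in the template above can be computed as progress from baseline toward target. A minimal sketch; the numbers reuse the example KRs, and the 5x/day = 25/week conversion assumes a five-day work week:

```python
# Sketch: score an OKR key result as % progress from baseline toward target.
# Works for both "lower is better" and "higher is better" KRs.
def kr_progress(baseline, current, target):
    """Return progress in percent, clamped to [0, 100]."""
    total = target - baseline
    if total == 0:
        return 100.0
    done = (current - baseline) / total
    return round(max(0.0, min(1.0, done)) * 100, 1)

# KR2.2: reduce CI build time from 12 min (baseline) toward 5 min (target).
print(kr_progress(baseline=12, current=8, target=5))
# KR1.2: increase deploys from 2/week toward 25/week (5x/day, 5-day week assumed).
print(kr_progress(baseline=2, current=10, target=25))
```

A simple rule of thumb: below the expected pace for the point in the quarter is "off track", otherwise "on track".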

OKR Anti-Patterns

| Anti-Pattern | Why It's Wrong | Right Way |
|--------------|----------------|-----------|
| Feature-based OKRs | "Ship feature X" is a task, not an outcome | Focus on outcomes ("Reduce churn by 10%") |
| Too many OKRs | Diluted focus | 3 objectives, 3-4 KRs each max |
| Binary KRs | No progress signal | Quantitative, measurable, with baseline |
| No alignment | Disconnected from company OKRs | Cascade from company → engineering → team |
| Set and forget | No mid-quarter check | Weekly tracking, monthly review |

Incident Management Maturity

Maturity Levels

| Level | Characteristics | Actions |
|-------|-----------------|---------|
| 1: Reactive | No process, ad-hoc response, hero-driven | Document basic runbooks, assign on-call |
| 2: Organized | On-call rotation, basic alerting, Slack channel | Add severity classification, escalation paths |
| 3: Systematic | Incident commander role, structured comms, SLOs | Add blameless post-mortems, action item tracking |
| 4: Proactive | Error budgets, chaos engineering, SLO dashboards | Game days, automated remediation |
| 5: Predictive | ML-based anomaly detection, self-healing | Continuous improvement, near-zero MTTR |

Incident Management Framework

## Incident Response Structure

### Roles
| Role | Responsibility |
|------|---------------|
| Incident Commander (IC) | Coordinates response, makes decisions |
| Technical Lead | Diagnoses and fixes the issue |
| Communications Lead | Stakeholder updates, status page |
| Scribe | Documents timeline and actions |

### Severity Levels
| Level | Definition | Response Time | IC Required | Status Page | Exec Notify |
|-------|-----------|--------------|-------------|-------------|-------------|
| SEV-1 | Full outage | 5 min | Yes | Yes | Immediately |
| SEV-2 | Major degradation | 15 min | Yes | Yes | Within 1h |
| SEV-3 | Minor impact | 1 hour | No | Optional | No |
| SEV-4 | No user impact | Next business day | No | No | No |

### Communication Cadence
| SEV | Internal Update | External Update | Exec Update |
|-----|----------------|----------------|-------------|
| SEV-1 | Every 15 min | Every 30 min | Every 30 min |
| SEV-2 | Every 30 min | Every 1h | Every 2h |
| SEV-3 | Every 2h | If customer-facing | None |
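
The cadence table above is easy to encode as data so that tooling (for example, a hypothetical incident bot) can nag when an update is overdue. A minimal sketch; the function and its behavior are an illustration, not an existing tool:

```python
# Sketch: encode the communication cadence as data so a bot can enforce it.
# Values mirror the cadence table above; None means "no update required".
CADENCE_MIN = {  # severity -> (internal, external, exec) interval in minutes
    "SEV-1": (15, 30, 30),
    "SEV-2": (30, 60, 120),
    "SEV-3": (120, None, None),  # external only if customer-facing; no exec update
}

def overdue_channels(severity, minutes_since_last_update):
    """Return which update channels are overdue for this severity."""
    internal, external, exec_ = CADENCE_MIN[severity]
    due = []
    if minutes_since_last_update >= internal:
        due.append("internal")
    if external is not None and minutes_since_last_update >= external:
        due.append("external")
    if exec_ is not None and minutes_since_last_update >= exec_:
        due.append("exec")
    return due

print(overdue_channels("SEV-1", 35))  # all three channels are overdue
```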

Post-Incident Review Quality Checklist

  • Timeline is complete and accurate
  • Root cause (not symptoms) identified
  • Contributing factors documented
  • Action items are specific, assigned, and deadlined
  • "5 whys" or similar root cause analysis used
  • Systemic fixes preferred over individual fixes
  • No blame assigned to individuals
  • Detection improvement identified
  • Recovery improvement identified
  • Shared with broader engineering team

Platform Team Strategy

Platform Team Charter

## Platform Team Charter

### Mission
Reduce cognitive load on stream-aligned teams by providing self-service
infrastructure, tooling, and abstractions.

### Principles
1. Treat internal teams as customers
2. Self-service > ticket-based requests
3. Paved roads, not mandates
4. Measure developer experience, not just uptime

### Product Areas
| Area | What We Provide | Maturity |
|------|----------------|----------|
| CI/CD | Build pipelines, deploy automation | Mature |
| Observability | Logging, metrics, tracing, dashboards | Growing |
| Developer portal | Service catalog, docs, templates | Early |
| Infrastructure | K8s, databases, caching, queues | Mature |
| Security | Secret management, vulnerability scanning | Growing |

### Success Metrics
| Metric | Target | Current |
|--------|--------|---------|
| Time to onboard new service | < 1 day | 1 week |
| Developer satisfaction (platform) | > 4.0/5 | 3.5/5 |
| Self-service adoption rate | > 80% | 50% |
| Support tickets per team per month | < 5 | 12 |

### Roadmap (Next 2 Quarters)
| Quarter | Initiative | Impact |
|---------|-----------|--------|
| Q1 | Internal developer portal | Reduce onboarding time 50% |
| Q1 | Standardized service template | Consistent microservices |
| Q2 | Golden path for new services | < 1 hour to first deploy |
| Q2 | Self-service database provisioning | Remove DBA bottleneck |

Platform Anti-Patterns

| Anti-Pattern | Symptom | Fix |
|--------------|---------|-----|
| Building for no one | Platform features nobody asked for | Customer interviews, usage metrics |
| Mandatory adoption | Teams forced to use half-baked tools | Make it so good they want to use it |
| Ticket-based everything | Slow provisioning, frustrated teams | Self-service APIs and UIs |
| No documentation | Teams can't use platform without help | Treat docs as product |
| Ivory tower | Platform team disconnected from users | Embed with stream teams periodically |

Developer Experience (DX) Optimization

DX Metrics

| Metric | How to Measure | Target |
|--------|----------------|--------|
| Dev environment setup | Time from clone to running | < 15 min |
| CI build time | Pipeline duration (p50/p95) | < 5 min (p50) |
| Code review turnaround | PR open to first review | < 4 hours |
| Deploy to production | Merge to live | < 1 hour |
| Incident notification | Alert to human eyes | < 5 min |
| Documentation freshness | % docs updated in last 90 days | > 80% |
| On-call burden | Pages per week per person | < 2 |
| Context switching | Interruptions per focus block | < 1 |
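
The p50/p95 framing for CI build time can be computed with a nearest-rank percentile. A minimal sketch; the durations are hypothetical examples:

```python
# Sketch: compute p50/p95 CI build times from a list of pipeline durations.
# Durations (in minutes) below are hypothetical examples.
def percentile(values, pct):
    """Nearest-rank percentile; good enough for a DX dashboard."""
    ordered = sorted(values)
    rank = max(1, round(pct / 100 * len(ordered)))
    return ordered[rank - 1]

durations_min = [3.2, 4.1, 3.8, 4.5, 12.0, 3.9, 4.2, 5.0, 3.7, 4.4]
p50 = percentile(durations_min, 50)
p95 = percentile(durations_min, 95)
print(f"CI build: p50={p50} min, p95={p95} min (target: p50 < 5 min)")
```

Tracking p95 alongside p50 matters: a healthy median can hide a long tail (here one 12-minute outlier) that dominates developers' perceived wait.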

DX Improvement Roadmap

## DX Improvement Plan

### Quick Wins (< 1 week each)
- [ ] Pre-configured dev containers / devbox
- [ ] One-command project setup script
- [ ] PR template with checklist
- [ ] Slack bot for deploy status
- [ ] Auto-assign code reviewers

### Medium Term (1-4 weeks)
- [ ] Reduce CI build time by 50%
- [ ] Local development matches production (docker-compose)
- [ ] API documentation auto-generated from code
- [ ] Error messages link to runbooks
- [ ] Feature flag self-service UI

### Long Term (1-3 months)
- [ ] Internal developer portal (Backstage/custom)
- [ ] Self-service infrastructure provisioning
- [ ] Automated dependency updates (Renovate)
- [ ] Golden path templates for new services
- [ ] DX survey and tracking dashboard

DX Survey Template

## Developer Experience Survey (Quarterly)

Rate 1-5 (1 = terrible, 5 = excellent):

### Development
1. How easy is it to set up your local dev environment?
2. How reliable is your local dev environment?
3. How fast is your CI/CD pipeline?
4. How easy is it to find and understand documentation?

### Collaboration
5. How efficient is your code review process?
6. How well does cross-team collaboration work?
7. How effective are your team's meetings?

### Operations
8. How manageable is on-call?
9. How good are your monitoring and alerting tools?
10. How confident are you in deploying to production?

### Growth
11. How supported do you feel in your career growth?
12. How much time do you spend on meaningful work vs. toil?

### Open Ended
13. What is the biggest time-waster in your day?
14. If you could change one thing about engineering, what would it be?

Release Management at Scale

Release Strategy Options

| Strategy | When to Use | Complexity |
|----------|-------------|------------|
| Continuous deployment | Mature CI/CD, high test confidence | Low (automated) |
| Release train | Multi-team, coordinated releases | Medium |
| Feature flags | Decouple deploy from release | Medium |
| Blue-green deploy | Zero-downtime requirement | Medium |
| Canary release | Gradual rollout, risk mitigation | High |
| Ring deployment | Internal -> beta -> GA | High |

Release Process (Multi-Team)

## Release Checklist: v[X.Y.Z]

### Pre-Release (T-2 days)
- [ ] All feature branches merged to release branch
- [ ] Release branch passes all tests
- [ ] Cross-team integration tests passing
- [ ] Dependent services compatible (API contracts)
- [ ] Database migrations tested
- [ ] Feature flags configured for new features
- [ ] Rollback plan documented

### Release Day (T-0)
- [ ] Release branch deployed to staging
- [ ] QA sign-off on staging
- [ ] Monitoring dashboards reviewed (baseline)
- [ ] On-call team briefed
- [ ] Canary deployment initiated
- [ ] Canary metrics monitored (error rate, latency, business KPIs)
- [ ] Full rollout completed
- [ ] Post-deploy verification

### Post-Release (T+1)
- [ ] Metrics compared to baseline
- [ ] No regression in error rates or latency
- [ ] Customer support briefed on changes
- [ ] Release notes published
- [ ] Feature flags cleaned up (remove old)
- [ ] Retrospective scheduled (if issues occurred)
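
The "canary metrics monitored" step above can be automated as a gate that compares canary metrics against the pre-deploy baseline. A minimal sketch; the thresholds and metric names are assumptions to wire up to your own monitoring system:

```python
# Sketch: an automated canary gate. Thresholds and metric names are
# hypothetical; feed in real values from your monitoring system.
def canary_gate(baseline, canary, max_error_delta=0.005, max_latency_ratio=1.2):
    """Return (promote, reasons). Abort the rollout if the canary regresses."""
    reasons = []
    if canary["error_rate"] - baseline["error_rate"] > max_error_delta:
        reasons.append("error rate regression")
    if canary["p95_latency_ms"] > baseline["p95_latency_ms"] * max_latency_ratio:
        reasons.append("latency regression")
    return (not reasons, reasons)

baseline = {"error_rate": 0.002, "p95_latency_ms": 180}
canary = {"error_rate": 0.011, "p95_latency_ms": 190}
promote, reasons = canary_gate(baseline, canary)
print(promote, reasons)  # error-rate delta exceeds threshold -> roll back
```

Gating on business KPIs as well as error rate and latency catches regressions that are invisible in infrastructure metrics (e.g. a broken checkout button with a 200 status).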

Release Metrics

| Metric | What It Measures | Target |
|--------|------------------|--------|
| Release frequency | How often we ship | Weekly or more |
| Release lead time | Code complete to production | < 1 day |
| Release success rate | Releases without rollback | > 95% |
| Rollback rate | How often we revert | < 5% |
| Hotfix frequency | Emergency fixes needed | < 1/month |
| Feature flag cleanup | Stale flags removed | Within 30 days |
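
Most of these metrics fall out of a simple release log. A minimal sketch; the release records are hypothetical examples:

```python
# Sketch: derive release success rate and rollback rate from a release log.
# Records below are hypothetical examples.
from datetime import date

releases = [
    {"version": "1.4.0", "shipped": date(2024, 3, 4),  "rolled_back": False, "hotfix": False},
    {"version": "1.4.1", "shipped": date(2024, 3, 6),  "rolled_back": False, "hotfix": True},
    {"version": "1.5.0", "shipped": date(2024, 3, 11), "rolled_back": True,  "hotfix": False},
    {"version": "1.5.1", "shipped": date(2024, 3, 12), "rolled_back": False, "hotfix": False},
]

total = len(releases)
success_rate = sum(not r["rolled_back"] for r in releases) / total
rollback_rate = sum(r["rolled_back"] for r in releases) / total
hotfixes = sum(r["hotfix"] for r in releases)
print(f"success {success_rate:.0%}, rollback {rollback_rate:.0%}, hotfixes {hotfixes}")
```

Keeping the log machine-readable (e.g. emitted by the deploy pipeline) means these numbers stay current without anyone maintaining a spreadsheet.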

Release Anti-Patterns

| Anti-Pattern | Why It's Wrong | Right Way |
|--------------|----------------|-----------|
| "Big bang" releases | High risk, hard to debug | Small, frequent releases |
| Release branch lives too long | Merge conflicts, integration hell | Short-lived, merge daily |
| Manual release process | Error-prone, slow | Fully automated pipeline |
| No rollback plan | Stuck with broken release | Always have rollback procedure |
| Feature flags never cleaned | Combinatorial explosion | Clean up within 30 days |
| Friday deployments | Nobody around for issues | Deploy Mon-Thu, observe Fri |
| No release notes | Users/support confused | Automated changelog generation |