Harness Feature Flags

install

source · Clone the upstream repo

git clone https://github.com/Intense-Visions/harness-engineering

Claude Code · Install into ~/.claude/skills/

T=$(mktemp -d) && git clone --depth=1 https://github.com/Intense-Visions/harness-engineering "$T" && mkdir -p ~/.claude/skills && cp -r "$T/agents/skills/claude-code/harness-feature-flags" ~/.claude/skills/intense-visions-harness-engineering-harness-feature-flags-a4fa94 && rm -rf "$T"

manifest: agents/skills/claude-code/harness-feature-flags/SKILL.md

Flag lifecycle management, A/B testing infrastructure, and gradual rollout design. Ship features safely with controlled exposure and clean retirement.

When to Use

- When designing feature flag architecture for a new project or feature
- When auditing existing flags for staleness, test coverage, or hygiene
- When planning gradual rollout or A/B testing strategies
- NOT for deployment pipeline configuration (use harness-deployment)
- NOT for environment variable management (use harness-secrets)
- NOT for application performance testing under flag variants (use harness-perf)

Process

Phase 1: DETECT -- Discover Flag Definitions and Usage

Identify flag provider. Scan the project for feature flag infrastructure:
- Third-party SDKs: LaunchDarkly (`launchdarkly-node-server-sdk`), Unleash (`unleash-client`), Flagsmith (`flagsmith-nodejs`), Split (`@splitsoftware/splitio`)
- Custom implementations: Files matching `*feature-flag*`, `*toggle*`, `*flags*`
- Configuration-based: `flags.json`, `features.json`, `feature-flags.yaml`
- Environment-based: Feature flags controlled via environment variables (`FEATURE_*`, `FF_*`)
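
The provider scan can be sketched roughly as follows. This is a minimal illustration, not the skill's actual implementation: the function names `detect_providers` and `env_flag_names` are hypothetical, and it assumes a Node project whose dependencies live in a `package.json` document.

```python
import fnmatch
import json

# Package names come from the list above; the mapping to display names is illustrative.
KNOWN_SDKS = {
    "launchdarkly-node-server-sdk": "LaunchDarkly",
    "unleash-client": "Unleash",
    "flagsmith-nodejs": "Flagsmith",
    "@splitsoftware/splitio": "Split",
}
ENV_PATTERNS = ("FEATURE_*", "FF_*")


def detect_providers(package_json_text: str) -> list[str]:
    """Return provider names whose SDK appears in a package.json document."""
    manifest = json.loads(package_json_text)
    deps = {**manifest.get("dependencies", {}), **manifest.get("devDependencies", {})}
    return sorted(name for pkg, name in KNOWN_SDKS.items() if pkg in deps)


def env_flag_names(env: dict[str, str]) -> list[str]:
    """Return environment variable names that look like feature flags."""
    return sorted(k for k in env if any(fnmatch.fnmatchcase(k, p) for p in ENV_PATTERNS))
```

Passing `dict(os.environ)` to `env_flag_names` would cover the environment-based case; custom and configuration-based flags would need a file-name scan on top of this.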
Catalog all flag definitions. For each flag found, record:
- Flag name/key
- Flag type (boolean, multivariate, percentage rollout)
- Default value (what happens when the flag service is unreachable)
- Where it is defined (provider dashboard, config file, code)
- Creation date or first appearance in git history
Map flag usage in code. For each flag, find all evaluation points:
- Source files that check the flag value
- Conditional branches gated by the flag
- Components that render differently based on flag state
- API endpoints with flag-dependent behavior
- Test files that exercise flag variants
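
One simple way to approximate the usage map is a literal search for quoted flag keys. The sketch below assumes flag keys appear as string literals in source files; `map_flag_usage` is a hypothetical helper, and a real scan would likely use the provider's code-reference tooling or an AST.

```python
from collections import defaultdict


def map_flag_usage(
    sources: dict[str, str], flag_keys: list[str]
) -> dict[str, list[tuple[str, int]]]:
    """Map each flag key to the (path, line number) pairs where it appears quoted."""
    usage = defaultdict(list)
    for path, text in sources.items():
        for lineno, line in enumerate(text.splitlines(), start=1):
            for key in flag_keys:
                # Assumption: keys are referenced as quoted string literals.
                if f'"{key}"' in line or f"'{key}'" in line:
                    usage[key].append((path, lineno))
    return dict(usage)
```

Keys that are missing from the result are the "defined but unused" candidates reported in the detection summary.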
Detect flag categories. Classify each flag:
- Release flag: Gates a new feature for gradual rollout (temporary)
- Experiment flag: Controls A/B test variants (temporary)
- Ops flag: Circuit breaker or kill switch (permanent)
- Permission flag: Controls access to premium features (permanent)
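
The catalog fields and the four categories above could be captured in a record like the one below. The class and field names are illustrative assumptions, not a schema the skill prescribes.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum


class Category(Enum):
    RELEASE = "release"
    EXPERIMENT = "experiment"
    OPS = "ops"
    PERMISSION = "permission"


# Release and experiment flags are temporary; ops and permission flags are permanent.
TEMPORARY = {Category.RELEASE, Category.EXPERIMENT}


@dataclass(frozen=True)
class FlagRecord:
    key: str
    flag_type: str      # "boolean", "multivariate", or "percentage"
    default: object     # value served when the flag service is unreachable
    defined_in: str     # "dashboard", "config", or "code"
    first_seen: date    # creation date or first appearance in git history
    category: Category

    @property
    def is_temporary(self) -> bool:
        return self.category in TEMPORARY
```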

Present detection summary:

Feature Flag Detection:
  Provider: LaunchDarkly (SDK v7.x)
  Flags defined: 23
  Flags evaluated in code: 19 (4 defined but unused)
  Categories: 12 release, 5 experiment, 4 ops, 2 permission
  Stale candidates: 6 (release flags older than 90 days)

Phase 2: DESIGN -- Recommend Flag Architecture and Rollout Strategy

Evaluate flag naming convention. Check for consistency:
- Recommended format: `{team}.{feature}.{variant}` (e.g., `payments.stripe-v2.enabled`)
- Flags should have descriptive names (not `flag_123` or `test_flag`)
- Naming should indicate category (prefix with `ops.` for operational flags)
- Temporary flags should include a target removal date in metadata
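
A naming check along these lines could back the convention. The sketch assumes exactly three dot-separated segments, each in lowercase kebab-case; the pattern is an assumption about how strictly the format is enforced.

```python
import re

# One lowercase kebab-case segment, e.g. "stripe-v2".
SEGMENT = r"[a-z0-9]+(?:-[a-z0-9]+)*"
# Assumed shape: team.feature.variant, exactly three segments.
FLAG_NAME = re.compile(rf"{SEGMENT}\.{SEGMENT}\.{SEGMENT}")


def is_valid_flag_name(name: str) -> bool:
    """Return True when the name follows the team.feature.variant convention."""
    return FLAG_NAME.fullmatch(name) is not None
```

Under this pattern `payments.stripe-v2.enabled` passes, while `flag_123` and `test_flag` fail because they contain underscores and no dot-separated segments.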
Design flag evaluation architecture. Recommend patterns for:
- Server-side evaluation: Flags evaluated on the backend, consistent behavior
- Client-side evaluation: Flags evaluated in the browser/app, real-time updates
- Edge evaluation: Flags evaluated at the CDN layer, so flag decisions add minimal latency and need no round trip to the origin
- Default values: Every flag must have a safe default (feature off) for provider outages
Design rollout strategy. For release flags:
- Percentage rollout: Start at 1%, monitor metrics, increase to 5%, 25%, 50%, 100%
- User segment targeting: Internal users first, then beta users, then general availability
- Geographic rollout: Single region first, expand after validation
- Time-based gates: Enable during low-traffic periods for initial validation
- Define success criteria for each rollout stage (error rate, latency, conversion)
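
A percentage rollout is often implemented with deterministic hashing, so that a user who is enabled at 5% stays enabled at 25%. The sketch below is one common approach under that assumption, not a description of any particular provider's algorithm; `bucket`, `is_enabled`, and `next_stage` are hypothetical names.

```python
import hashlib

STAGES = [1, 5, 25, 50, 100]  # percentages from the rollout plan above


def bucket(flag_key: str, user_id: str) -> int:
    """Map a user to a stable bucket in [0, 100) for this flag."""
    digest = hashlib.sha256(f"{flag_key}:{user_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % 100


def is_enabled(flag_key: str, user_id: str, percentage: int) -> bool:
    """Enable the flag for users whose bucket falls below the rollout percentage."""
    return bucket(flag_key, user_id) < percentage


def next_stage(current: int, healthy: bool) -> int:
    """Advance one stage when success criteria hold; otherwise roll back to 0."""
    if not healthy:
        return 0
    later = [s for s in STAGES if s > current]
    return later[0] if later else current
```

Including the flag key in the hash keeps buckets independent across flags, so the same users are not always first in line for every rollout.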
Design A/B testing infrastructure. For experiment flags:
- Consistent user assignment (same user always sees same variant)
- Statistical significance calculation before declaring a winner
- Event tracking integration for measuring experiment outcomes
- Guardrail metrics that auto-disable experiments if health degrades
- Experiment duration limits (no indefinite experiments)
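
Consistent assignment and a basic significance check might look like the following. The significance test shown is a pooled two-proportion z-test, chosen here for illustration; the source does not say which test to use, and the function names are hypothetical.

```python
import hashlib
import math


def assign_variant(experiment: str, user_id: str,
                   variants: tuple[str, ...] = ("control", "treatment")) -> str:
    """Hash the user into a variant so the same user always sees the same one."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).digest()
    return variants[int.from_bytes(digest[:8], "big") % len(variants)]


def two_proportion_p_value(conv_a: int, n_a: int, conv_b: int, n_b: int) -> float:
    """Two-sided p-value for a pooled two-proportion z-test on conversion counts."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0  # no variation at all, so no evidence of a difference
    z = (conv_b / n_b - conv_a / n_a) / se
    # For a standard normal, P(|Z| > |z|) equals erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))
```

A winner would only be declared once the p-value falls below a threshold fixed before the experiment starts, and only within the experiment's declared duration.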
Design fallback behavior. Ensure resilience:
- Provider SDK timeout configuration (fail fast, not block)
- Local cache for flag values (stale-while-revalidate)
- Default values that are safe (feature disabled, not enabled)
- Circuit breaker on flag evaluation errors
- Monitoring on flag evaluation latency and error rate
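
The fallback rules can be combined in a small wrapper around the provider call. This is a simplified, synchronous sketch: it serves the last known value when the provider fails (closer to stale-if-error than to a true stale-while-revalidate, which would refresh in the background). `ResilientFlags` and its `fetch` callable are hypothetical stand-ins for a real SDK client.

```python
import time
from typing import Callable


class ResilientFlags:
    """Flag lookups with a TTL cache and a safe default on provider failure."""

    def __init__(self, fetch: Callable[[str], bool], ttl_seconds: float = 60.0,
                 clock: Callable[[], float] = time.monotonic):
        self._fetch = fetch      # hypothetical provider call; may raise on timeout
        self._ttl = ttl_seconds
        self._clock = clock
        self._cache: dict[str, tuple[bool, float]] = {}

    def is_enabled(self, key: str, default: bool = False) -> bool:
        cached = self._cache.get(key)
        now = self._clock()
        if cached is not None and now - cached[1] < self._ttl:
            return cached[0]  # fresh cache hit, no provider call
        try:
            value = bool(self._fetch(key))
        except Exception:
            # Provider failed: prefer the last known value, else the safe default.
            return cached[0] if cached is not None else default
        self._cache[key] = (value, now)
        return value
```

The default is `False` so that an outage with an empty cache leaves the feature disabled, in line with the safe-default rule above.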

Phase 3: VALIDATE -- Verify Flag Hygiene and Test Coverage

Check test coverage per flag. For each flag, verify:
- Tests exist for both the flag-on and flag-off code paths
- Integration tests cover the default value behavior (provider down scenario)
- No tests depend on a specific flag value being set in production
- Test utilities exist for overriding flag values in test context
Validate flag consistency. Check for common issues:
- Flags checked in code but not defined in the provider (will use defaults)
- Flags defined in the provider but never checked in code (dead flags)
- Same flag checked with different keys in different files (typos)
- Nested flag checks that create combinatorial complexity
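
The first three consistency checks reduce to set comparisons plus a fuzzy match for likely typos. The sketch below uses `difflib.get_close_matches` from the standard library; the 0.85 similarity cutoff and the function name are assumptions.

```python
import difflib


def check_consistency(defined: set[str], referenced: set[str]) -> dict:
    """Compare provider-defined flag keys with keys referenced in code."""
    undefined = sorted(referenced - defined)
    typos = {}
    for key in undefined:
        # A near match to a defined key suggests a typo rather than a new flag.
        close = difflib.get_close_matches(key, sorted(defined), n=1, cutoff=0.85)
        if close:
            typos[key] = close[0]
    return {
        "undefined_in_provider": undefined,          # will silently use defaults
        "dead_flags": sorted(defined - referenced),  # defined but never checked
        "possible_typos": typos,
    }
```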
Check for flag coupling. Identify problematic patterns:
- Flags that depend on other flags (if A and B, then C)
- Flags that must be enabled in a specific order
- Flags that share state or side effects
- Recommend decoupling or consolidating dependent flags
Validate rollout configuration. For active rollouts:
- Percentage values are within expected ranges
- Targeting rules are correctly configured
- Kill switch is functional (can disable the feature instantly)
- Rollback plan is documented

Generate validation report:

Flag Validation: [PASS/WARN/FAIL]

Test coverage: WARN (3 flags missing flag-off tests)
Consistency: PASS (all code references match definitions)
Coupling: WARN (2 flags have interdependencies)
Rollout config: PASS (active rollouts correctly configured)

Issues:
  1. payments.stripe-v2 -- no test for flag-off fallback path
  2. search.new-algorithm + search.reranking -- coupled flags
  3. checkout.express -- missing kill switch test

Phase 4: LIFECYCLE -- Audit Stale Flags and Plan Cleanup

Identify stale flags. Flag candidates for removal:
- Release flags that have been at 100% for more than 30 days
- Experiment flags past their declared end date
- Flags with no evaluation in the last 90 days (from provider analytics)
- Flags whose feature has been fully adopted (no rollback expected)
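
The date-based staleness rules above translate directly into code. In this sketch the `FlagStatus` fields are hypothetical and would be filled from provider analytics; the last rule (full adoption) is a judgment call and is not modeled.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass
class FlagStatus:
    key: str
    category: str                          # "release", "experiment", "ops", "permission"
    at_100_since: Optional[date] = None    # when a release flag reached 100%
    experiment_end: Optional[date] = None  # declared end date of an experiment
    last_evaluated: Optional[date] = None  # from provider analytics


def stale_reason(flag: FlagStatus, today: date) -> Optional[str]:
    """Return why a flag is a removal candidate, or None if it is not stale."""
    if (flag.category == "release" and flag.at_100_since
            and today - flag.at_100_since > timedelta(days=30)):
        return "release flag at 100% for more than 30 days"
    if (flag.category == "experiment" and flag.experiment_end
            and today > flag.experiment_end):
        return "experiment past its declared end date"
    if flag.last_evaluated and today - flag.last_evaluated > timedelta(days=90):
        return "no evaluation in the last 90 days"
    return None
```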
Generate cleanup plan. For each stale flag:
- List all code locations that reference the flag
- Identify which branch to keep (the flag-on path for graduated features)
- Estimate lines of code to remove (dead branch cleanup)
- Determine if any tests need updating after flag removal
- Assign a cleanup owner and deadline
Assess technical debt. Calculate flag burden:
- Total active flags per service (recommend under 20)
- Ratio of temporary to permanent flags
- Average flag age for temporary flags
- Code complexity added by flag branching (conditional paths)
- Estimate of test matrix growth due to flags
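
The countable parts of the flag burden (totals, ratio, average age) can be computed from the catalog. This sketch takes `(category, created)` pairs for one service; the input shape and the function name are assumptions, and code complexity and test matrix growth are left out because they need static analysis.

```python
from datetime import date

TEMPORARY = {"release", "experiment"}
RECOMMENDED_MAX = 20  # "recommend under 20" active flags per service


def flag_burden(flags: list[tuple[str, date]], today: date) -> dict:
    """Summarize flag burden for one service from (category, created) pairs."""
    temp_ages = [(today - created).days for cat, created in flags if cat in TEMPORARY]
    return {
        "total": len(flags),
        "temporary": len(temp_ages),
        "permanent": len(flags) - len(temp_ages),
        "avg_temporary_age_days": sum(temp_ages) / len(temp_ages) if temp_ages else 0.0,
        "over_recommended_max": len(flags) >= RECOMMENDED_MAX,
    }
```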
Recommend lifecycle policies. Design governance:
- Maximum lifetime for release flags (e.g., 90 days)
- Maximum lifetime for experiment flags (e.g., 30 days)
- Required metadata: owner, creation date, expiry date, cleanup ticket
- Automated alerting when flags exceed their lifetime
- Monthly flag review cadence with the team
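
A policy check that could feed automated alerting might look like this. The metadata key names are hypothetical, and the lifetime limits reuse the example values above (90 days for release flags, 30 for experiments).

```python
from datetime import date

# Assumed metadata keys for: owner, creation date, expiry date, cleanup ticket.
REQUIRED = ("owner", "created", "expires", "cleanup_ticket")
MAX_LIFETIME_DAYS = {"release": 90, "experiment": 30}


def policy_violations(category: str, metadata: dict, today: date) -> list[str]:
    """List lifecycle policy problems for one flag; an empty list means compliant."""
    problems = [f"missing metadata: {field}" for field in REQUIRED if field not in metadata]
    limit = MAX_LIFETIME_DAYS.get(category)  # permanent categories have no limit
    created = metadata.get("created")
    if limit is not None and created is not None and (today - created).days > limit:
        problems.append(f"exceeds {limit}-day lifetime")
    return problems
```

Running this over the catalog on a schedule and alerting on any non-empty result would cover the automated-alerting item; the monthly review can then work from the same output.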

Generate lifecycle report:

Flag Lifecycle Report:

Active flags: 23
Stale flags: 6 (candidates for removal)
Technical debt: ~340 lines of dead code across 6 stale flags

Cleanup plan:
  1. payments.old-checkout -- at 100% for 45 days, 4 files, ~80 LOC to remove
  2. search.v1-algorithm -- experiment ended 60 days ago, 2 files, ~40 LOC
  3. onboarding.legacy-flow -- at 100% for 90 days, 6 files, ~120 LOC
  4. notifications.email-v1 -- unused for 120 days, 3 files, ~50 LOC
  5. dashboard.old-charts -- at 100% for 35 days, 2 files, ~30 LOC
  6. auth.password-reset-v1 -- at 100% for 60 days, 1 file, ~20 LOC

Recommendations:
  - Adopt 90-day maximum lifetime policy for release flags
  - Add flag expiry dates to LaunchDarkly metadata
  - Schedule monthly flag cleanup sprint
  - Target: reduce active flag count from 23 to 17

Harness Integration

- `harness skill run harness-feature-flags` -- Primary invocation for flag analysis and lifecycle management.
- `harness validate` -- Run after flag cleanup to verify project health.
- `harness check-deps` -- Verify flag provider SDK dependencies are installed.
- `emit_interaction` -- Present lifecycle report and gather decisions on cleanup priorities.

Success Criteria

- All feature flags in the project are discovered and cataloged
- Flags are categorized by type (release, experiment, ops, permission)
- Stale flags are identified with specific cleanup plans
- Test coverage for both flag-on and flag-off paths is verified
- Rollout configuration is validated for active flags
- Lifecycle policies are recommended with enforcement mechanisms

Rationalizations to Reject

Rationalization	Why It Is Wrong
"This release flag has been at 100% for a while, but removing it is risky"	Release flags at 100% for more than 30 days are stale candidates. Every stale flag adds dead code branches and test matrix complexity.
"We only need to test the flag-on path since that is the path we ship"	Every flag needs test coverage for both paths. The flag-off path IS the fallback when the flag provider is unreachable.
"These two flags depend on each other, but they work fine together"	Coupled flag dependencies are a blocking finding. Flags that require other flags create combinatorial complexity.
"Setting the flag default to true makes the rollout easier"	Every flag must default to safe (feature disabled). A default of true means a provider outage enables the feature for everyone.
"We do not need a naming convention -- our flag count is small"	Inconsistent naming becomes unmanageable as flag count grows. The skill flags inconsistency as a warning even at small scale.

Examples

Example: React SPA with LaunchDarkly

Phase 1: DETECT
  Provider: LaunchDarkly (React SDK v3.x)
  Flags defined: 15 (in LaunchDarkly dashboard)
  Flags in code: 13 (2 defined in the dashboard but unused in code)
  Categories: 8 release, 3 experiment, 2 ops

Phase 2: DESIGN
  Naming: WARN -- inconsistent (mix of camelCase and kebab-case)
  Recommended convention: team.feature.variant (kebab-case)
  Evaluation: Client-side (React SDK), 200ms init timeout
  Default values: WARN -- 2 flags default to true (should default to false)
  Rollout: checkout.express-pay at 25% (targeting premium users)

Phase 3: VALIDATE
  Test coverage: WARN -- 4 flags missing useFlags mock in component tests
  Consistency: PASS
  Coupling: PASS (no interdependent flags)
  Kill switch: PASS (all release flags can be instantly disabled)

Phase 4: LIFECYCLE
  Stale: 3 flags at 100% for 30+ days
  Cleanup estimate: 180 LOC across 8 files
  Recommendation: Remove search.v2-results (at 100% for 65 days, 3 components)
  Result: WARN -- 3 stale flags, 4 missing tests, 2 unsafe defaults

Example: Spring Boot API with Custom Flag System

Phase 1: DETECT
  Provider: Custom implementation (FeatureFlagService.java)
  Flag store: PostgreSQL table (feature_flags)
  Flags defined: 28
  Flags in code: 24 (4 in database but unreferenced)
  Categories: 15 release, 6 experiment, 5 ops, 2 permission

Phase 2: DESIGN
  Architecture: Server-side evaluation with 60s cache refresh
  Concern: No SDK resilience -- database outage disables all flags
  Recommend: Add in-memory cache with stale-while-revalidate pattern
  Recommend: Add health check endpoint for flag service status
  Rollout: Using percentage field in database -- no segment targeting
  Recommend: Add user segment support for targeted rollouts

Phase 3: VALIDATE
  Test coverage: WARN -- @FeatureFlag annotation used but no test helper
    to toggle flags in test context (tests rely on database state)
  Consistency: WARN -- 4 database entries have no code references
  Coupling: FAIL -- 3 flags form a dependency chain
    (billing.new-pricing requires billing.tax-calc-v2 requires billing.currency-v3)
  Recommend: Consolidate into single billing.pricing-v3 flag

Phase 4: LIFECYCLE
  Stale: 8 flags enabled for all users for 60+ days
  Technical debt: ~520 LOC of dead code
  Flag count: 28 (above recommended 20 per service)
  Recommendation: Immediate cleanup sprint targeting 8 stale flags
  Recommendation: Enforce 60-day lifetime policy going forward
  Result: FAIL -- flag coupling detected, high stale flag count

Gates

No flags without default values. Every flag evaluation must specify a default value for provider outage scenarios. Missing defaults are blocking findings.
No stale release flags beyond policy limit. Release flags that exceed their maximum lifetime (default 90 days at 100%) are blocking warnings. They must be cleaned up or have an explicit extension with documented justification.
No coupled flag dependencies. Flags that require other flags to be in a specific state create combinatorial complexity and are fragile. Flag coupling is a blocking finding that requires consolidation.
No flags without test coverage for both paths. Feature flags that lack tests for the off-path leave the fallback behavior unverified. Both paths must be tested.

Escalation

When flag count exceeds 30 per service: The flag system is becoming a maintenance burden. Recommend a flag audit sprint with a target of reducing to under 20. Prioritize removing the oldest release flags first.
When a flag provider outage affects production: If the application does not handle provider unavailability gracefully, this is an architectural issue. Recommend implementing local caching, safe defaults, and a circuit breaker pattern before adding more flags.
When experiment results are inconclusive: Do not extend the experiment indefinitely. Recommend increasing sample size (broader targeting), simplifying the variants, or declaring the experiment inconclusive and removing it.
When flag cleanup requires database migration: Removing flags that are stored in a database may require migration scripts. Coordinate cleanup with a release cycle and ensure rollback is possible if the migration fails.