Agent-almanac probe-feature-flag-state

install
source · Clone the upstream repo
git clone https://github.com/pjt222/agent-almanac
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/pjt222/agent-almanac "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/probe-feature-flag-state" ~/.claude/skills/pjt222-agent-almanac-probe-feature-flag-state-9149f7 && rm -rf "$T"
manifest: skills/probe-feature-flag-state/SKILL.md
source content

Probe Feature-Flag State

Determine whether a named feature flag in a shipped CLI binary is LIVE, DARK, INDETERMINATE, or UNKNOWN, using a four-pronged evidence protocol that pairs every state claim with a specific observation.

When to Use

  • A capability is rumored, documented, or inferred and you need to verify whether the gate actually fires for the running session.
  • You are auditing dark-launched features — code that ships in the bundle but is gated off — to plan integrations responsibly.
  • A prior probe's conclusions need refreshing against a new binary version (the flag may have flipped, been removed, or been merged into a conjunction).
  • You are following up Phase 1 (
    monitor-binary-version-baselines
    ) markers and need to classify each candidate flag's rollout state before moving to Phase 4 wire capture.
  • A user-visible behavior changed and you need to know whether a flag flip or a code change drove it.

Inputs

  • Required: the flag name as it appears in the binary (string-literal form).
  • Required: the CLI binary or bundle file you can read and invoke.
  • Required: an authenticated session against the harness's normal backend (your own account; never another user's).
  • Optional: the binary version identifier — strongly recommended so the evidence table is diff-able against future probes.
  • Optional: a list of suspected co-gates (other flag names that may participate in a conjunction with this one).
  • Optional: a prior probe artifact for the same flag at a different version, for delta analysis.

Procedure

Step 1: Confirm the Flag Name Is Present in the Binary (Prong A — Binary Strings)

Extract the candidate flag name from the bundle to confirm it actually exists as a string literal. Without this, all later prongs are probing thin air.

# Locate the bundle (common shapes: .js, .mjs, .bun, packaged binary)
BUNDLE=/path/to/cli/bundle.js
FLAG=acme_widget_v3   # synthetic placeholder — replace with the candidate

# Confirm the literal exists
grep -c "$FLAG" "$BUNDLE"

# Capture every line where it appears, with surrounding context for Step 2
grep -n -C 3 "$FLAG" "$BUNDLE" > /tmp/flag-context.txt
wc -l /tmp/flag-context.txt

Inspect

/tmp/flag-context.txt
and tag each occurrence as one of:

  • gate-call — appears as the first argument to a gate-shaped function (
    gate("$FLAG", default)
    ,
    isEnabled("$FLAG")
    ,
    flag("$FLAG", ...)
    ).
  • telemetry-call — appears as the first argument to an emit/log/track function.
  • env-var-check — appears in a
    process.env.X
    (or equivalent) lookup.
  • string-table — appears in a static map or registry whose role is unclear.

Expected: at least one occurrence of the flag string in the bundle, and each occurrence tagged with its call-site role.

On failure: if

grep -c
returns 0, the flag is not in this build. Either the input name is wrong (typo, wrong namespace) or the flag was removed in this version. Re-check Phase 1 marker output, then either correct the input or classify as
REMOVED
and stop.

Step 2: Disambiguate Gate from Event from Env Var

The same string can appear as a gate, a telemetry event name, an env var, or all three. The classification depends on call-site, not on the string. Mistaking a telemetry name for a gate produces nonsense reasoning ("this gate must be off") about something that was never a gate.

For each tagged occurrence from Step 1:

  • A gate-call occurrence makes this string eligible for LIVE / DARK / INDETERMINATE classification. Capture the default value passed to the gate (
    gate("$FLAG", false)
    defaults the flag to off;
    gate("$FLAG", true)
    defaults it to on). Record both the literal default and the gate function name.
  • A telemetry-call occurrence does not make the string a gate. It is a label fired when some other gate has already passed. If the only occurrences are telemetry-call, the string is event-only and final classification is
    UNKNOWN
    (name present but not a gate).
  • An env-var-check occurrence usually indicates a kill switch (default-on capability disabled by an env var) or an explicit opt-in (default-off capability enabled by an env var). Note the polarity —
    if (process.env.X) { return null; }
    is a kill switch;
    if (process.env.X) { enable(); }
    is an opt-in.
  • A string-table occurrence must be cross-referenced — look at how the table is consumed downstream.

Expected: for every occurrence, a definite call-site role and (for gate-calls) the recorded default value.

On failure: if a gate-call's surrounding context is too minified to read the default, expand the grep context (

-C 10
) and inspect the full callee. If the default still cannot be determined, record it as
default=?
and downgrade any LIVE/DARK conclusion to INDETERMINATE.

Step 3: Observe Live Invocation Behavior (Prong B — Runtime Probe)

Run the harness in an authenticated session you control and observe whether the gated capability surfaces. This is the single highest-signal prong: the bundle says what can happen, the runtime shows what does happen.

Pick a probe action that would reveal the gate-pass — typically the user-visible behavior the gate guards (a tool appearing in a tool list, a command flag becoming valid, a UI element rendering, an output field appearing in a response).

# Example shape — adapt to the harness
$CLI --list-capabilities | grep -i widget         # does the gated capability appear?
$CLI --help 2>&1 | grep -i "$FLAG"                # is a flag-related option exposed?
$CLI run-some-command --debug 2>&1 | tee probe-runtime.log

Record one of three outcomes:

  • gate-pass observed — the capability surfaced in the session. Classification candidate:
    LIVE
    .
  • gate-pass not observed — the capability did not surface. Classification candidate depends on the default from Step 2 (default-false →
    DARK
    ; default-true → re-check, this is suspicious).
  • gate-pass conditional on a specific input or context not reproducible here — record the condition; classification candidate:
    INDETERMINATE
    .

Expected: a recorded probe action, the observed outcome, and the candidate classification it points to.

On failure: if the probe action itself errors (auth failure, network unreachable, wrong subcommand), the runtime prong is unusable for this round. Fix the session or pick a different probe action; do not infer DARK from a runtime that never ran.

Step 4: Inspect On-Disk State (Prong C — Config, Cache, Session)

Many harnesses persist gate evaluations or override values to disk so they need not be re-fetched. Inspecting this state shows what the harness believed about the flag at last evaluation.

Common locations (adapt to the harness — these are shapes, not specific paths):

# User-level config
ls ~/.config/<harness>/ 2>/dev/null
ls ~/.<harness>/ 2>/dev/null

# Per-project state
ls .<harness>/ 2>/dev/null

# Cache directories
ls ~/.cache/<harness>/ 2>/dev/null

# Search any of these for the flag name
grep -r "$FLAG" ~/.config/<harness>/ ~/.cache/<harness>/ .<harness>/ 2>/dev/null

Record each hit's path, the value associated with the flag, and the file's last-modified time. A recently-modified cache entry overriding a binary default is the strongest possible evidence either way.

Expected: either a confirmed override value with timestamp, or a confirmed absence (no on-disk state mentions this flag).

On failure: if you find the flag mentioned but cannot tell whether the recorded value is a cached server response, a user override, or a stale value, flag the entry for Step 5 (platform cache) reconciliation rather than guessing.

Step 5: Inspect Platform Flag-Service Cache (Prong D)

If the harness uses an external feature-flag service (LaunchDarkly, Statsig, GrowthBook, vendor-internal, etc.), the locally-cached service response is the authoritative current rollout state. Inspect it where available.

# Look for service-shaped cache files
find ~/.cache ~/.config -name "*flag*" -o -name "*feature*" -o -name "*config*" 2>/dev/null | head

# If a cache file is present, parse it for the flag name
jq ".[] | select(.key == \"$FLAG\")" ~/.cache/<harness>/flags.json 2>/dev/null

Record the cached value, the cache timestamp, and (if present) the cache TTL. A platform cache that says

false
overrides a binary default of
true
; a platform cache that says
true
overrides a binary default of
false
.

Expected: either a definite cached value with timestamp, or confirmed absence of a flag-service cache for this harness.

On failure: if the harness has no flag-service or you cannot locate the cache, this prong contributes nothing — that is acceptable. Note "Prong D: not applicable" in the evidence table; do not guess.

Step 6: Handle Conjunction Gates

Some capabilities are guarded by multiple flags that must all be true:

gate("A") && gate("B") && gate("C")
. Any one being DARK is sufficient to make the capability DARK, but the per-flag classification still belongs to each flag individually.

# After finding the gate-call site for the primary flag in Step 2, scan the
# enclosing predicate for other gate(...) calls
grep -n -C 5 "$FLAG" "$BUNDLE" | grep -oE 'gate\("[^"]+"' | sort -u

For each co-gate string surfaced:

  • Repeat Steps 1–5 for that flag (treat each as its own probe).
  • Record the per-flag classification.
  • Compute the capability-level classification: LIVE iff all conjuncts are LIVE; DARK if any conjunct is DARK; INDETERMINATE if no conjunct is DARK and at least one is INDETERMINATE.

Expected: every conjunct identified and individually classified, plus a derived capability-level classification.

On failure: if the predicate is too minified to enumerate cleanly (call site is inlined or wrapped), record the conjunction as "≥1 additional gate, structure unreadable" and downgrade the capability-level classification to INDETERMINATE even if the primary flag looks LIVE.

Step 7: Check for Skill-Substitution

A flag may legitimately be DARK while the user-facing capability it would unlock is reachable through a different, fully-supported route — a different command, a user-invocable skill, an alternate API. The honest finding "flag DARK, capability LIVE via substitution" is common and important; missing it produces panicked dark-launch reports about capabilities users actually have.

For any candidate classification of DARK or INDETERMINATE, ask:

  • Is there a documented user-invokable command, slash command, or skill that delivers the same end-user outcome?
  • Is there an alternate API surface (different endpoint, different tool name) that returns equivalent data?
  • Does the harness publish a user-facing extension point (plugins, custom tools, hooks) that allows users to assemble the equivalent themselves?

If yes to any, append a

substitution:
note to the evidence row recording the alternate route and its observability (how a user reaches it, whether it is documented).

Expected: for every DARK / INDETERMINATE classification, an explicit substitution check — either the route, or the explicit note "no substitution route identified."

On failure: if you suspect a substitution exists but cannot confirm the route, mark "substitution suspected; not confirmed" rather than asserting either way.

Step 8: Assemble the Evidence Table and Final Classification

Combine the four prongs into a single table. Every state claim must be paired with the observation that supports it; re-running the probe at a new version produces a diff-able artifact.

FieldValue
Flag
acme_widget_v3
(synthetic placeholder)
Binary version
<version-id>
Probe date
YYYY-MM-DD
Prong A — stringspresent (3 occurrences: 1 gate-call default=
false
, 2 telemetry)
Prong B — runtimegate-pass not observed in capability list
Prong C — on-diskno override found in
~/.config/<harness>/
Prong D — platform cacheservice cache absent / not applicable
Conjunctionnone — single-gate predicate
Substitutionuser-invokable
widget
slash command delivers equivalent UX
Final stateDARK (capability LIVE via substitution)

Apply the classification rules:

  • LIVE — at least one prong observed gate-pass this session AND no prong contradicts.
  • DARK — flag string present, gate-call default is
    false
    , no prong observed gate-pass, no override flips it on.
  • INDETERMINATE — gate-pass is conditional on an input or context not reproducible in this probe, OR the gate's default could not be determined, OR a conjunct is INDETERMINATE.
  • UNKNOWN — string present but not used as a gate (telemetry-only, string-table-only, env-var-only label).

Save the table as a probe artifact (e.g.,

probes/<flag>-<version>.md
) so future probes diff against it.

Expected: a complete evidence table covering all four prongs, conjunction status, substitution status, and a single final classification.

On failure: if no prong yields a usable signal (binary cannot be read, runtime cannot be invoked, on-disk and platform cache both absent), do not invent a classification. Record

INDETERMINATE
with the reason "no prong yielded signal" and stop.

Validation

  • Every state claim in the evidence table is paired with a specific observation (no bare assertions).
  • The flag's gate-call default value is recorded (or explicitly noted as unreadable).
  • Telemetry-event occurrences are not counted as gate evidence.
  • Conjunction gates have per-flag classifications and a capability-level classification.
  • Every DARK / INDETERMINATE row has an explicit substitution check.
  • The artifact records the binary version so future probes are diff-able.
  • No real product names, version-pinned identifiers, or dark-only flag names appear in any artifact intended for publication (see
    redact-for-public-disclosure
    ).

Common Pitfalls

  • Conflating telemetry events with gates. A string that appears in
    emit("$FLAG", ...)
    is a label, not a gate. A flag that is "telemetry-only" has no rollout state and should be classified UNKNOWN, not DARK.
  • Skipping Prong B (live invocation). Static evidence alone (the binary says
    default=false
    ) is not the same as runtime evidence (the capability did not appear). A flag with default-false in the binary may be flipped to true by a server-side override; only the runtime probe shows what the session actually got.
  • Missing the conjunction. Classifying the primary flag as LIVE because its single occurrence shows
    default=true
    while ignoring the surrounding
    && gate("B") && gate("C")
    produces a falsely confident LIVE for a capability that is actually gated by B or C.
  • Calling DARK without a substitution check. Many DARK flags are genuinely unreachable, but many others have a fully-supported user-invokable route. The substitution check is what turns "alarming dark-launch" into "honest finding."
  • Probing a stale binary version. A probe artifact with no version stamp is useless — you cannot tell whether it reflects current state or last quarter's state. Always record the version, and diff future probes against the artifact.
  • Activating the gate to confirm it. Flipping a flag to test it is not part of this skill. Some dark gates are off for safety reasons (incomplete capability, regulatory hold, unfinished migration). Document; never bypass.
  • Capturing other users' state. Prong C and Prong D inspect your own on-disk state and your own cache. Reading another user's cache is exfiltration and is out of scope.
  • Treating INDETERMINATE as a failure. It is not — it is the honest classification when evidence is partial. Forcing INDETERMINATE results into LIVE or DARK to make the report look decisive is the fastest way to be wrong.

Related Skills

  • monitor-binary-version-baselines
    — Phase 1 of the parent guide; the marker tracking this skill builds on supplies the candidate flag inventory.
  • conduct-empirical-wire-capture
    — Phase 4; deeper runtime evidence (network capture, lifecycle hooks) when Prong B's surface-level probe is insufficient.
  • security-audit-codebase
    — dark-launched code is part of attack-surface archaeology; this skill is the discovery half of that audit.
  • redact-for-public-disclosure
    — Phase 5; the redaction discipline that decides which probe artifacts can leave the private workspace.