Crucible quality-gate

Iterative red-teaming of any artifact (design docs, plans, code, hypotheses, mockups). Loops until clean or stagnation. Invoked by artifact-producing skills or their parent orchestrator.

install
source · Clone the upstream repo
git clone https://github.com/raddue/crucible
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/raddue/crucible "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/quality-gate" ~/.claude/skills/raddue-crucible-quality-gate && rm -rf "$T"
manifest: skills/quality-gate/SKILL.md

Quality Gate

<!-- CANONICAL: shared/dispatch-convention.md -->

All subagent dispatches use disk-mediated dispatch. See `shared/dispatch-convention.md` for the full protocol.

Shared iterative red-teaming mechanism invoked at the end of artifact-producing skills. Provides rigorous adversarial review as the core quality mechanism.

Announce at start: "Running quality gate on [artifact type]."

Skill type: Rigid -- follow exactly, no shortcuts.

Execution model: When this skill is running, YOU are the orchestrator. You drive the loop, dispatch fix agents and reviewers as subagents, track scores, and make escalation decisions. All references to "the orchestrator" in this document refer to you.

Consensus Detection

At the start of the quality gate, check whether the `consensus_query` MCP tool is available in the current environment:

  1. If the tool is available: consensus-eligible rounds will use multi-model dispatch (see Multi-Model Red-Team Review and Multi-Model Consensus in Stagnation Detection below).
  2. If the tool is not available: all rounds use standard single-model dispatch. No degradation, no warnings — the gate behaves exactly as it did before consensus was introduced.

Do NOT:

  • Prompt the user to set up consensus if it is unavailable
  • Log warnings about missing consensus configuration
  • Change any scoring, stagnation, or escalation logic based on consensus availability

Consensus is a transparent enhancement. Its presence improves coverage; its absence changes nothing.

External Model Review (Optional)

At the start of the quality gate, check whether the `external_review` MCP tool is available in the current environment AND `skills.quality_gate` is enabled in the external review config. If either check fails, skip all external review steps silently: no warnings, no prompts.

When It Runs

Every red-team round, alongside the host red-team dispatch. Call `external_review` with:

  • `prompt`: contents of `skills/shared/external-review-prompt.md`
  • `context`: the same artifact context given to the red-team subagent
  • `skill`: `"quality_gate"` (top-level argument for per-skill toggle enforcement)
  • `metadata`: `{"skill": "quality_gate", "round": N}` (traceability)

Consensus Bridge (rounds 1, 4, 7, 10, 13)

On consensus-eligible rounds where both `consensus_query` and `external_review` are available:

  1. Run `external_review` FIRST, before calling `consensus_query`.
  2. Only bridge reviews where `error` is null. Skip errored reviews; their empty content would corrupt the consensus signal.
  3. Pass the non-errored external review responses as the `additional_responses` parameter to `consensus_query`.
  4. The aggregator deduplicates findings across all models (consensus + external), surfaces cross-model disagreements, and tags external-unique findings with confidence levels.
  5. On non-consensus rounds, external review runs independently: its findings are appended to round output but not routed through the aggregator.
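The filtering in step 2 can be sketched as follows. This is a minimal illustration, assuming each external review arrives as a dict with an `error` key; the dict shape, the model names, and the helper name `bridgeable_reviews` are hypothetical and not part of the documented tool contract.

```python
from typing import Any


def bridgeable_reviews(reviews: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Keep only external reviews whose `error` field is null (None).

    Assumption: a review with no `error` key at all is treated as non-errored.
    """
    return [review for review in reviews if review.get("error") is None]


# Hypothetical external_review output: one good review, one errored review.
reviews = [
    {"model": "model-a", "error": None, "content": "Finding: ..."},
    {"model": "model-b", "error": "timeout", "content": ""},
]

# Only the non-errored review would be passed on as additional_responses.
additional_responses = bridgeable_reviews(reviews)
```

Dropping errored reviews before the bridge keeps empty content out of the aggregator, which is the concern step 2 names.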

Scoring Invariant (INV-2)

CRITICAL: External findings do NOT affect the scoring algorithm.

  • The weighted score (Fatal=3, Significant=1) is computed from host red-team findings ONLY
  • External findings are appended to round output for visibility
  • External findings are added to the fix journal context (so the fix agent sees them as additional perspective)
  • External findings are NEVER inputs to the stagnation detection scoring

This invariant is load-bearing. The quality gate's convergence guarantees depend on a single, consistent scoring source. Mixing external signal into scoring would create non-deterministic stagnation behavior.
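A minimal sketch of the host-only weighted score, assuming each finding carries a severity and a source tag. The `Finding` type and the `"host"` / `"external"` tag values are illustrative assumptions; only the weights (Fatal=3, Significant=1) come from this document.

```python
from dataclasses import dataclass

FATAL_WEIGHT = 3
SIGNIFICANT_WEIGHT = 1


@dataclass(frozen=True)
class Finding:
    severity: str  # "Fatal", "Significant", or "Minor"
    source: str    # "host" or "external" (hypothetical tag)


def weighted_score(findings: list[Finding]) -> int:
    """Score host red-team findings only; external and Minor findings add nothing."""
    score = 0
    for finding in findings:
        if finding.source != "host":
            continue  # INV-2: external findings never reach the score
        if finding.severity == "Fatal":
            score += FATAL_WEIGHT
        elif finding.severity == "Significant":
            score += SIGNIFICANT_WEIGHT
    return score


round_findings = [
    Finding("Fatal", "host"),
    Finding("Significant", "host"),
    Finding("Fatal", "external"),  # visible in round output, ignored here
    Finding("Minor", "host"),
]
round_score = weighted_score(round_findings)
```

With one host Fatal and one host Significant, the score is 3 + 1 = 4 regardless of how many external findings accompany the round.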

Graceful Degradation

  • `external_review` tool not available (MCP server not running): skip silently.
  • Response `status` is `"unavailable"` (no config or disabled): skip silently.
  • Response `status` is `"error"` (all models failed): skip the external review and note the failure in round output. Distinct from `"unavailable"`: the feature is configured but every model errored.
  • Response `status` is `"partial"` (some models failed): include available reviews, note which models failed in round output.
  • External review timeout or failure never blocks or delays the host red-team round.

Anti-Rationalization Table — quality-gate

| Rationalization | Rebuttal | Rule |
| --- | --- | --- |
| "This finding is minor, I'll just fix it inline instead of dispatching a fix agent." | Orchestrator-applied fixes break separation of concerns and corrupt the fix journal. Fix-agent overhead for trivial fixes is negligible; the risk of conflation is not. | All fixes route through the fix agent, with no exceptions, no matter how small. |
| "Round N fixed everything, I can return PASS without another red-team round." | Fixing is not passing. A fresh red-team round is the verification step. Skipping it is a skip disguised as a pass. | The gate is only PASS after a fresh red-team round returns 0 Fatal, 0 Significant. |
| "The red-team finding is wrong / overblown, I'll mark it resolved without a fix." | Rationalizing away findings defeats the point of adversarial review. If a finding is wrong, the fix agent explicitly justifies dismissal in the fix journal; the orchestrator does not dismiss findings unilaterally. | Every Fatal/Significant finding is either fixed or documented as dismissed by the fix agent with reasoning. |
| "The score went up but I can tell it's close, skip the stagnation judge." | Stagnation detection uses weighted score, not orchestrator intuition. Score-based inline judgment is the exact failure the judge exists to catch. | Dispatch the stagnation judge whenever score is not strictly lower than the prior round. |
| "Round 15 hit, I'll squeeze in one more round, surely the next will pass." | The 15-round limit is a circuit breaker, not a suggestion. Exceeding it silently is how runaway loops happen. | At round 15, escalate to the user with full round history; never silently continue. |
| "Pre-flight dependency audit is noise for this artifact, skip it." | The audit only runs on `code` artifacts, and on code artifacts it's mandatory. Dependency drift is a documented source of shipped bugs. | Run the dependency audit on every `code` artifact; skip silently only for non-code types. |
| "The user said 'move on', that's approval to skip the gate." | General feedback is never skip approval. Skip requires an unambiguous instruction specifically referencing the gate. | Only an explicit, gate-referencing instruction counts as skip approval. |

How It Works

  1. Receives: artifact content, artifact type, project context
  2. Pre-flight dependency audit (code artifacts only). If artifact type is `code`, run the pre-flight dependency audit (see Pre-Flight Dependency Audit below). If the result is BLOCKED and the user does not approve continuation, abort the gate. For all other artifact types, skip this step entirely: no scan, no output, no scratch files.
  3. Prepares the artifact for review (see Artifact Preparation below)
  4. Invokes `crucible:red-team` as a single-pass reviewer (one dispatch = one review round). Quality-gate owns the iteration loop; red-team produces findings for one round and returns. Red-team does NOT run its own stagnation loop when invoked by quality-gate.
  5. If red-team finds zero Fatal and zero Significant issues: artifact approved. Write final artifact to scratch directory, output consolidated Minor observations from all rounds (see Minor Issue Handling), surface pre-flight audit results (if any) alongside gate results, clean up, and return.
  6. If red-team finds Fatal or Significant issues:
    a. Dispatch a separate fix agent (see Fix Mechanism below): receive the revised artifact, append to fix journal
    b. Dispatch Fix Verifier (see Fix Verification below): one Sonnet check per fix round
    c. Append verifier output to fix journal under the `### Verifier Assessment` heading; write verdict summary to `round-N-verification.md`
    d. If Fatal-severity Unresolved: flag as "prior unresolved Fatal — must address" in next round's fix dispatch (binding, one-round grace)
    e. If Significant-severity Unresolved: append to fix journal as informational context
    f. Invoke a FRESH red-team on the revised artifact (no anchoring)
  7. Track weighted score between rounds (Fatal=3, Significant=1):
    • Strictly lower score → progress, loop again
    • Same or higher score → dispatch the Stagnation Judge (see Stagnation Detection below)
  8. Read the judge's verdict and act on it (see Stagnation Detection below)
  9. Progress notification. After round 5 and every 3 rounds thereafter (rounds 5, 8, 11, 14), emit: "Quality gate round [N]: score progression [list]." If the judge was dispatched, append recurring/new counts. Informational only — no pause.
  10. Global safety limit: 15 rounds. This is a runaway protection circuit-breaker. If you hit 15, escalate to user with full round history.
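The round schedule in steps 9 and 10 reduces to two small predicates. This is a sketch only; the function names are hypothetical, and the round numbers are the ones listed above.

```python
MAX_ROUNDS = 15  # global safety limit from step 10


def should_notify(round_number: int) -> bool:
    """True after round 5 and every 3 rounds thereafter (5, 8, 11, 14)."""
    return round_number >= 5 and (round_number - 5) % 3 == 0


def must_escalate(round_number: int) -> bool:
    """True once the circuit-breaker limit is reached."""
    return round_number >= MAX_ROUNDS


notification_rounds = [n for n in range(1, MAX_ROUNDS + 1) if should_notify(n)]
```

Keeping both checks as pure functions of the round number means they survive compaction: the orchestrator only needs the current round counter from the scratch directory to re-derive them.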

Multi-Model Red-Team Review (when available)

Applies to: Round 1 and every 3rd round thereafter (rounds 1, 4, 7, 10, 13). Intermediate rounds: Standard single-model red-team dispatch (no change).

On consensus-eligible rounds:

  1. Instead of dispatching a single red-team subagent, call `consensus_query(mode: "review")` with the red-team prompt and artifact content
  2. The consensus response provides merged findings with per-finding severity (Fatal/Significant/Minor), confidence (High/Medium/Low based on model agreement), provenance (which models raised it), and unique findings flagged as "potentially novel"
  3. The orchestrator processes these findings exactly as single-model findings: compute weighted score, compare to prior round, dispatch fix agent if needed
  4. Findings from consensus rounds include provenance metadata in `round-N-findings.md`

Cost control: The consensus dispatch replaces (not supplements) the single-model dispatch on eligible rounds. Fallback: If consensus is unavailable on an eligible round, dispatch standard single-model red-team review.
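Round eligibility and the fallback rule can be sketched as below. The function names and the returned labels are illustrative assumptions; the eligible rounds (1, 4, 7, 10, 13) are the ones stated above.

```python
def is_consensus_eligible(round_number: int) -> bool:
    """Round 1 and every 3rd round thereafter: 1, 4, 7, 10, 13."""
    return round_number >= 1 and round_number % 3 == 1


def choose_dispatch(round_number: int, consensus_available: bool) -> str:
    """Consensus replaces the single-model dispatch on eligible rounds only."""
    if consensus_available and is_consensus_eligible(round_number):
        return "consensus"
    return "single-model"  # intermediate rounds, or fallback when unavailable


eligible_rounds = [n for n in range(1, 16) if is_consensus_eligible(n)]
```

Each round yields exactly one dispatch kind, which reflects the cost-control rule that consensus replaces, rather than supplements, the single-model review.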

Non-Skippability

This gate cannot be bypassed without explicit user approval. Task size, complexity, or scope is never a valid reason to skip. The invoking skill is responsible for always dispatching the gate AND letting it run to completion.

The gate is not "done" until it completes with a clean round (0 Fatal, 0 Significant on a fresh review). Fixing findings and moving on without a verification round is a skip, not a pass. The iteration loop exists because fix agents introduce new issues or incompletely resolve old ones — fresh-eyes re-review catches what the fixer missed.

The only valid skip is an unambiguous user instruction specifically referencing the gate (e.g., "skip the quality gate"). General feedback like "looks good" or "move on" is not skip approval. Once a gate has run and presented findings to the user, the user's decision to proceed is authoritative.

Fix Mechanism

The orchestrator coordinates the loop but does NOT fix artifacts directly. Fixes are dispatched to a separate subagent to maintain separation of concerns between coordination, review, and remediation.

| Artifact Type | Fix Agent |
| --- | --- |
| design | Plan Writer subagent revises the doc |
| plan | Plan Writer subagent revises the plan |
| code | Fix subagent (new, not the original implementer) |
| hypothesis | Debugging skill's hypothesis refinement (see below) |
| mockup | Fix subagent |
| translation | Fix subagent revises the translation map |

Before dispatching the fix agent (code artifacts only): If crucible:checkpoint is available, create checkpoint with reason "pre-qg-fix-round-N". Non-code artifacts (design, plan, hypothesis, mockup, translation) skip this step — they are fully captured by the existing artifact-N.md snapshots.

The fix agent receives: (a) the current artifact, (b) the red-team findings, (c) project context, and (d) the fix journal from prior rounds (see Fix Memory below). It returns the revised artifact. The orchestrator writes the revised artifact to the scratch directory and dispatches the next red-team round.

The orchestrator never applies fixes directly. Even trivial fixes go through a fix agent to maintain separation of concerns. The cost of dispatching for a small fix is negligible; the risk of the orchestrator conflating coordination with fixing is not.

Scope Anchoring for Fix Agents

Fix agents are prone to drift — addressing findings by adding unrequested features, restructuring documents, or expanding scope beyond what was asked. This costs real time in re-anchoring and rework.

Before dispatching each fix agent, the orchestrator MUST include in the fix prompt:

  1. Scope statement: "You are fixing ONLY the findings listed below. Do not add features, restructure the document, or make changes outside the scope of these findings."
  2. Change boundary: List the specific sections/files the fix agent is allowed to modify. If a finding requires changes outside these boundaries, the fix agent must flag it rather than making the change.
  3. Drift detection: After the fix agent completes, the orchestrator checks whether the fix touched files or sections not listed in the change boundary. If out-of-scope changes are detected: reject the entire fix round output, re-dispatch the fix agent with explicit instructions to omit the out-of-scope changes, and include the out-of-scope items as context for the next red-team round.
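The drift check in step 3 is essentially a set difference between what the fix agent touched and what the change boundary allowed. A minimal sketch, assuming both are available as lists of file paths or section names (the function name and example paths are hypothetical):

```python
def out_of_scope_changes(touched: list[str], allowed: list[str]) -> list[str]:
    """Return touched files/sections that fall outside the change boundary."""
    return sorted(set(touched) - set(allowed))


# Hypothetical fix round: the boundary allowed two files, the agent touched three.
allowed = ["docs/design.md", "src/parser.py"]
touched = ["src/parser.py", "src/cli.py", "docs/design.md"]

drift = out_of_scope_changes(touched, allowed)
reject_round = bool(drift)  # any drift rejects the entire fix round output
```

A non-empty result triggers the rejection path described above: the whole round is discarded and the out-of-scope items become context for the next red-team round.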

Why this matters: The #1 user friction with the quality gate is fix agents drifting from the original design by adding unrequested content. Scope anchoring turns "stop. skipping. steps." into a structural guardrail.

Fix Memory

Anti-anchoring is a property of review, not remediation. Reviewers need fresh eyes to avoid confirmation bias. Fix agents need institutional memory to avoid repeating failed strategies.

The quality gate maintains a fix journal (`fix-journal.md` in the scratch directory) that accumulates across rounds. After each fix agent completes, the orchestrator appends a structured entry:

## Round N Fix
- **Findings addressed:** [list of Fatal/Significant findings from round N, summarized]
- **Approach taken:** [1-2 sentence description of fix strategy]
- **Files changed:** [list of files modified]
- **Reasoning:** [why this approach was chosen over alternatives]

On subsequent rounds, the fix agent receives the full fix journal. This gives the fix agent critical context:

  • What approaches were already tried (avoid repeating failed strategies)
  • Which files were already modified (avoid unknowingly reverting prior fixes)
  • The reasoning chain across rounds (understand the trajectory of remediation)

Anti-anchoring is preserved. The fix journal is NEVER passed to the red-team reviewer. Reviewers see only the clean artifact. The journal flows exclusively through the remediation path: fix agent writes it, next fix agent reads it, orchestrator maintains it.

Round 1 fix agents receive an empty journal (no prior rounds). This is the only round where the fix agent works without remediation history.
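A sketch of how the orchestrator might render and append an entry in the format shown above. The helper names are hypothetical, and the journal is handled as an in-memory string here; writing it to `fix-journal.md` is left out.

```python
def render_fix_entry(round_number, findings, approach, files_changed, reasoning):
    """Render one journal entry using the '## Round N Fix' template."""
    return "\n".join([
        f"## Round {round_number} Fix",
        f"- **Findings addressed:** {'; '.join(findings)}",
        f"- **Approach taken:** {approach}",
        f"- **Files changed:** {', '.join(files_changed)}",
        f"- **Reasoning:** {reasoning}",
    ]) + "\n"


def append_fix_entry(journal_text: str, entry: str) -> str:
    """Append an entry, separating it from prior entries with a blank line."""
    if not journal_text:
        return entry  # round 1 starts from an empty journal
    return journal_text.rstrip("\n") + "\n\n" + entry


# Illustrative content only; findings and file names are made up.
journal = append_fix_entry("", render_fix_entry(
    1, ["Missing retry bound"], "Added a capped retry loop", ["client.py"],
    "Smallest change that bounds the loop"))
journal = append_fix_entry(journal, render_fix_entry(
    2, ["Retry cap not configurable"], "Read cap from config", ["client.py", "config.py"],
    "Avoids hard-coding; round 1 approach kept intact"))
```

Because entries are only ever appended, later fix agents can read the full trajectory, while the red-team reviewer is simply never handed the journal.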

Why this matters: Without fix memory, the most common causes of stagnation and oscillation are fix agents repeating failed approaches or unknowingly reverting prior fixes while addressing new findings. Fix memory turns these escalation events into solvable problems -- the fix agent can see what was already tried and choose a genuinely different approach.

Compaction recovery: The fix journal is written to `fix-journal.md` in the scratch directory alongside round scores and findings. It is recovered automatically when the orchestrator reads the scratch directory after compaction.

Fix Verification

After each fix agent completes and before the next red-team round, dispatch a Fix Verifier — a dedicated Sonnet agent that checks whether each fix actually resolves its stated finding. No re-fix sub-loop; the verifier checks once, and its output feeds into the fix journal for the next round.

Dispatch method: Task tool (model: Sonnet), same pattern as the stagnation judge. The verifier needs no file access; the orchestrator includes all input in the dispatch file directly.

Input the orchestrator provides:

  1. Round N findings (the findings the fix agent was asked to address)
  2. The current round's fix journal entry only: the `## Round N Fix` section just appended (not the full journal)
  3. Prepared artifact:
    • Non-code (design docs, plans, hypotheses, mockups, translations): post-fix version in full
    • Code: diff + full post-fix source of files touched by the diff. For large implementations (>2000 lines), dispatch one verifier call per finding if context exceeds limits.
  4. The full content of `fix-verifier-prompt.md` as the agent's instructions

Reading the verdict: The verifier returns a per-finding Resolved/Unresolved table and an overall PASS/FAIL.

Handling Unresolved findings:

  • Fatal-severity Unresolved: Flagged as "prior unresolved Fatal — must address" in the next round's fix dispatch. This is binding with one-round grace: if the fix agent addresses it and the next red-team round does NOT re-raise the finding, the binding expires. If the verifier marks the same Fatal as Unresolved again (persistent disagreement), the verdict downgrades to informational. Sonnet should not permanently override Opus.
  • Significant-severity Unresolved: Appended to the fix journal as informational context. The next round's fix agent may address, disagree with, or deprioritize.
  • All Resolved (PASS): Proceed to next red-team round normally.

Fix journal integration: The verifier's output is appended under a `### Verifier Assessment` heading in the fix journal, distinct from the `## Round N Fix` entry format. This keeps verifier assessments on the remediation path (fix agents see them) without contaminating the review path (red-team never sees them).

Anti-anchoring preserved: The verifier is on the remediation path — its output flows to fix agents only, never to the red-team reviewer. Same isolation as the fix journal itself.

Round counter unchanged: The verifier dispatch does not increment the round counter. It is part of the fix step, not a separate review round.

Stagnation Detection

Two-layer system: the orchestrator handles scoring; a dedicated judge agent handles semantic analysis.

First-Pass Check (orchestrator — runs every round)

Stagnation uses weighted scoring (Fatal=3, Significant=1) AND Fatal count tracking.

Progress requires EITHER:

  • Weighted score strictly lower than prior round, OR
  • Fatal count strictly lower AND weighted score same-or-lower

If either condition is met → progress, loop again. No judge needed.

Oscillation detection: If the weighted score increases (not just stays the same), escalate immediately as a regression. Report: "Round N score (X) is higher than Round N-1 score (Y). The fix cycle introduced new issues. Escalating." No judge needed.

Regression with checkpoint: If a pre-qg-fix-round checkpoint exists for the prior round, include in the escalation: "A checkpoint of the pre-fix state exists (`<hash>`). Options: (a) restore to pre-fix checkpoint and retry with different fix strategy, (b) continue with current state, (c) escalate to user." If no checkpoint exists, escalate as currently specified.
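The first-pass check can be sketched as a three-way classification. This is an illustration under stated assumptions: `RoundResult` and the returned labels are hypothetical names, and the logic follows the progress and oscillation rules exactly as written in this section.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RoundResult:
    fatal: int
    significant: int

    @property
    def score(self) -> int:
        # Weighted score: Fatal=3, Significant=1
        return 3 * self.fatal + self.significant


def first_pass(prev: RoundResult, curr: RoundResult) -> str:
    """Classify a round as 'progress', 'regression', or 'judge'."""
    if curr.score > prev.score:
        return "regression"  # oscillation: escalate immediately, no judge
    if curr.score < prev.score:
        return "progress"    # weighted score strictly lower
    if curr.fatal < prev.fatal:
        return "progress"    # same score, but fewer Fatals
    return "judge"           # same score, no Fatal improvement


# 2 Fatal (score 6) -> 1 Fatal + 3 Significant (score 6): fewer Fatals counts as progress.
example = first_pass(RoundResult(2, 0), RoundResult(1, 3))
```

Only the `"judge"` outcome costs an extra dispatch; the other two are decided by arithmetic alone.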

Multi-Model Consensus (when available)

When the `consensus_query` MCP tool is available and consensus mode `verdict` is enabled:

  1. Instead of dispatching a single Sonnet judge via Task tool, call `consensus_query(mode: "verdict")` with:
    • prompt: the stagnation judge prompt from `stagnation-judge-prompt.md`
    • context: round N findings, round N-1 findings, latest fix journal entry, prior comparison files (same inputs as the single-model judge)
    • metadata: `{ artifact_type, round_number, score_progression }`
  2. Read the consensus response:
    • If `status: "complete"` or `status: "partial"`:
      • Use the `synthesis` verdict (PROGRESS/STAGNATION/DIMINISHING_RETURNS)
      • If the verdict is STAGNATION or DIMINISHING_RETURNS and disagreements exist, include the dissent summary in the escalation message: "Stagnation detected (consensus: N/M models agree, dissent: [summary])."
    • If `status: "unavailable"`:
      • Fall back to single-Sonnet judge dispatch (existing behavior)
  3. The comparison file (`round-N-comparison.md`) includes the consensus metadata: models queried, models responded, agreement level, and any dissenting verdicts.

Judge Dispatch (only when first-pass check would trigger stagnation)

If neither progress condition is met AND the score did not increase (i.e., same score, no Fatal count improvement), dispatch the Stagnation Judge, a dedicated Sonnet agent that performs semantic comparison of findings across rounds. If the `consensus_query` tool is not available in the environment, this step uses the standard single-Sonnet dispatch described below.

Dispatch method: Task tool (model: Sonnet). The judge needs no file access; the orchestrator includes all input in the dispatch file directly.

Input the orchestrator provides:

  1. The content of `round-N-findings.md` (current round)
  2. The content of `round-(N-1)-findings.md` (prior round)
  3. The latest fix journal entry only: extract the last `## Round N Fix` section from `fix-journal.md` (not the full journal)
  4. The content of any prior `round-*-comparison.md` files (for consecutive-round state tracking)
  5. The full content of `stagnation-judge-prompt.md` as the agent's instructions

Reading the verdict: The judge returns a structured verdict: PROGRESS, STAGNATION, or DIMINISHING_RETURNS.

  • PROGRESS → loop again
  • STAGNATION → escalate: "Stagnation detected: Round N has [X] recurring issues from round N-1 and [Y] new issues. Recurring: [list from judge]. Escalating."
  • DIMINISHING_RETURNS → escalate: "Quality gate has resolved all prior issues. Round N found [X] new findings, all Structural (require design-level decisions). Remaining findings: [list from judge]. Presenting for user judgment."

Comparison file: The orchestrator saves the judge's full output as `round-N-comparison.md` in the scratch directory. This file is used by future judge dispatches for consecutive-round tracking.

Artifact Preparation

Small artifacts (design docs, plans, hypotheses, mockups, translations)

Pass the full artifact content to the red-team subagent. No preparation needed.

Code artifacts

Code artifacts vary in size. The orchestrator prepares the artifact based on scope:

  • Small implementations (<500 lines diff): Pass the full diff + any new files in full.
  • Medium implementations (500-2000 lines): Pass full source of high-risk files (new files, files with complex logic changes) + summaries of routine changes (imports, wiring, boilerplate). Include a change manifest listing all files with 1-line descriptions.
  • Large implementations (>2000 lines): Split into logical chunks (by subsystem, module, or feature boundary). Run a quality gate on each chunk, then a final cross-chunk round reviewing the integration points. Present the chunking plan to the user before proceeding. Normal stagnation detection, progress notifications, and round 15 safety limit apply to total rounds across all chunks, not per chunk. Chunked compaction recovery: Use a parent run-id for the entire chunked gate. Write `chunk-manifest.md` (lists all chunks with gated/pending status) to the parent scratch directory. Per-chunk round files go in `chunk-N/` subdirectories. Only delete the parent scratch directory after the final cross-chunk round completes. The `active-run.md` marker references the parent run-id throughout.

The red-team subagent receives the prepared artifact, not raw diff. This mirrors audit's Tier 1/Tier 2 context management approach.
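The size thresholds above map to a simple tier selection. A minimal sketch, assuming the orchestrator already knows the diff's line count; the function name and tier labels are hypothetical.

```python
def preparation_tier(diff_lines: int) -> str:
    """Pick the preparation strategy for a code artifact by diff size."""
    if diff_lines < 500:
        return "small"   # full diff + any new files in full
    if diff_lines <= 2000:
        return "medium"  # high-risk files in full + summaries + change manifest
    return "large"       # chunk by subsystem, gate each chunk, then cross-chunk round


tiers = [preparation_tier(n) for n in (120, 500, 2000, 2001)]
```

The boundaries follow the text as written: exactly 500 and exactly 2000 lines both fall in the medium tier.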

Hypothesis artifacts

Hypotheses are 1-2 sentence statements, not plans or designs. The red-team prompt template is plan-centric and does not map well to hypothesis testing. For hypothesis artifacts, the orchestrator frames the red-team dispatch with hypothesis-specific attack vectors:

  • Does this hypothesis explain ALL observed symptoms?
  • What evidence would disprove it?
  • Are there simpler alternative explanations?
  • What assumptions does this hypothesis make that could be wrong?

Include these in the dispatch prompt alongside the standard red-team template. The debugging skill's Phase 3.5 defines these questions -- the quality-gate orchestrator should use them.

Minor Issue Handling

Minor issues do not trigger fix rounds and do not count toward stagnation. However, they accumulate across rounds and contain useful information. Do not silently discard them.

After the gate completes (artifact approved or stagnation escalated):

  1. Consolidate: Collect all Minor observations from all rounds, deduplicate.
  2. Quick-fix pass: Dispatch a fix subagent with the consolidated minors and the final artifact. The fix agent addresses easy wins only — changes that are simple, low-risk, and unambiguous (typos, naming inconsistencies, missing edge-case guards, trivial cleanup). It skips anything requiring judgment or design decisions.
  3. Present remainder: Output any minors the fix agent skipped as "Remaining minor observations" so the user can decide whether to address them. No further red-team round on the quick fixes — the gate is already complete.

Pre-Flight Dependency Audit

Runs ecosystem-appropriate dependency audit commands before the red-team loop begins. Produces an independent supply-chain signal that is surfaced to the orchestrator and user — the red-team never sees audit data.

Artifact-type scoping: Runs only when the artifact type is `code`. Unconditionally skipped for `design`, `plan`, `hypothesis`, `mockup`, and `translation` artifacts. When skipped, no audit section appears in gate output and no scratch files are written.

Timing: Runs after the active-run marker is written (setup phase, before the numbered steps in How It Works) but before artifact preparation and red-team dispatch. The pre-flight completes fully before the first red-team round begins.

Skill Arguments

`skip_blocking` (boolean, default: `false`): Global override. When `true`, disables ALL blocking regardless of `min_blocking_severity`. Findings are still reported in `audit-results.md` but no blocking occurs and the result is FINDINGS (not BLOCKED). `skip_blocking` supersedes `min_blocking_severity` entirely; they do not interact as independent thresholds.

`min_blocking_severity` (string, default: `"critical"`, case-insensitive): The minimum normalized severity at which a finding triggers blocking. Accepted values: `"critical"`, `"high"`, `"moderate"`, `"low"`. Invalid values are rejected with an error before execution begins. This does not change what gets reported (all findings always appear in `audit-results.md`); it only affects whether the result is BLOCKED vs FINDINGS.
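How the two arguments combine can be sketched as below. The function name and the numeric ranks are assumptions for illustration; the accepted values, the case-insensitivity, the up-front validation, and the precedence of `skip_blocking` come from the descriptions above.

```python
SEVERITY_RANK = {"low": 1, "moderate": 2, "high": 3, "critical": 4}


def is_blocked(severities, min_blocking_severity="critical", skip_blocking=False):
    """Return True if any finding meets the blocking threshold.

    `severities` holds normalized severity names; anything outside the four
    blocking levels (e.g. "informational") never blocks.
    """
    threshold = min_blocking_severity.lower()
    if threshold not in SEVERITY_RANK:
        # Invalid values are rejected before any audit work happens.
        raise ValueError(f"invalid min_blocking_severity: {min_blocking_severity!r}")
    if skip_blocking:
        return False  # global override: report findings, never block
    floor = SEVERITY_RANK[threshold]
    return any(SEVERITY_RANK.get(s.lower(), 0) >= floor for s in severities)


default_blocks_on_high = is_blocked(["high"])            # default threshold is "critical"
lowered_blocks_on_high = is_blocked(["high"], "High")    # case-insensitive threshold
```

Validation runs before the `skip_blocking` check so that a typo in the threshold is still reported even when blocking is disabled; that ordering is a design choice of this sketch.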

Manifest Scanning

Walk the directory tree from artifact root, collecting all manifest files matching the supported set:

| Manifest File | Ecosystem |
| --- | --- |
| `package.json` | Node.js |
| `Cargo.toml` | Rust |
| `requirements.txt` | Python |
| `pyproject.toml` | Python |

Excluded directories: `node_modules/`, `.git/`, `target/`, `dist/`, `vendor/`, `third_party/`, `.venv/`, `venv/`. These contain vendored or installed dependencies, not the project's own manifests.

Symlinks are not followed — following them risks infinite recursion in repos with circular symlinks or deeply nested node_modules.
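The scan described above can be sketched with `os.walk`. This is an illustration, not the gate's actual implementation: the function name is hypothetical, and skipping symlinked manifest files (in addition to not descending into symlinked directories) is an assumption about what "not followed" covers.

```python
import os
import tempfile

MANIFESTS = {
    "package.json": "Node.js",
    "Cargo.toml": "Rust",
    "requirements.txt": "Python",
    "pyproject.toml": "Python",
}
EXCLUDED_DIRS = {
    "node_modules", ".git", "target", "dist",
    "vendor", "third_party", ".venv", "venv",
}


def scan_manifests(root: str) -> list[tuple[str, str]]:
    """Collect (path, ecosystem) pairs for supported manifests under root."""
    found = []
    # followlinks=False: os.walk does not descend into symlinked directories.
    for dirpath, dirnames, filenames in os.walk(root, followlinks=False):
        # Prune excluded directories in place so os.walk never enters them.
        dirnames[:] = sorted(d for d in dirnames if d not in EXCLUDED_DIRS)
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            if name in MANIFESTS and not os.path.islink(path):
                found.append((path, MANIFESTS[name]))
    return found


# Demo on a throwaway tree: the manifest under node_modules/ must be ignored.
with tempfile.TemporaryDirectory() as root:
    for rel in ("package.json", "node_modules/dep/package.json", "svc/Cargo.toml"):
        path = os.path.join(root, rel)
        os.makedirs(os.path.dirname(path), exist_ok=True)
        open(path, "w").close()
    demo = [(os.path.relpath(p, root), eco) for p, eco in scan_manifests(root)]
```

Sorting directory and file names makes the resulting list stable across runs, which suits a list that is written once to `preflight-audit.md` and then treated as authoritative.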

npm workspace detection: Before scheduling per-directory `npm audit` runs, inspect each discovered `package.json` for a top-level `"workspaces"` field. If a workspace root is detected, schedule a single `npm audit` from that root directory. Do not schedule separate runs for `package.json` files in subdirectories that are members of that workspace.

Python dual-manifest handling: When a directory contains both `requirements.txt` and `pyproject.toml`, audit both. They may represent different dependency sets. Duplicate findings are deduplicated at result-write time in `audit-results.md` using the key (package name + CVE ID): each unique (package, CVE) pair appears once with a note of which sources reported it. Version differences for the same (package, CVE) pair are noted but not double-counted.
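The deduplication rule can be sketched as a merge keyed on (package, CVE). The finding dict shape, the function name, and the package and CVE identifiers in the example are all made up for illustration.

```python
def dedupe_findings(findings: list[dict]) -> list[dict]:
    """Merge findings that share (package, CVE), recording sources and versions."""
    merged: dict[tuple[str, str], dict] = {}
    for f in findings:
        key = (f["package"], f["cve"])
        entry = merged.setdefault(
            key,
            {"package": f["package"], "cve": f["cve"], "sources": [], "versions": []},
        )
        if f["source"] not in entry["sources"]:
            entry["sources"].append(f["source"])
        if f["version"] not in entry["versions"]:
            entry["versions"].append(f["version"])  # noted, not double-counted
    return list(merged.values())


# Placeholder data: the same advisory reported from both Python manifests.
raw = [
    {"package": "examplepkg", "cve": "CVE-0000-0001", "version": "1.2.0",
     "source": "requirements.txt"},
    {"package": "examplepkg", "cve": "CVE-0000-0001", "version": "1.3.0",
     "source": "pyproject.toml"},
]
unique = dedupe_findings(raw)
```

The merged entry appears once, carries both sources, and keeps both versions as a note, so the finding counts once toward any blocking decision.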

Manifest list finalization: The manifest list is written to `preflight-audit.md` before any audit tool is invoked. This list is the authoritative scope for the run. If compaction occurs after this point, the gate resumes from the recorded list; it does not re-scan.

Zero manifests: If zero manifests are found anywhere in the tree, pre-flight completes as a no-op and notes this in the output summary.

Ecosystem Detection and Ordering

Detected manifests are audited in fixed order for deterministic output: Node.js -> Rust -> Python.

| Manifest File | Audit Command | Notes |
| --- | --- | --- |
| `package.json` | `npm audit --json` | Run from workspace root if applicable, otherwise cwd = manifest directory |
| `Cargo.toml` | `cargo audit --json` | Run with cwd = manifest directory |
| `requirements.txt` | `pip-audit --format json -r requirements.txt` | Explicit `-r` flag; does NOT require active venv |
| `pyproject.toml` | `pip-audit --format json` | Requires active venv or lockfile (see below) |

All detected manifests are audited independently (after workspace consolidation). Each runs as an isolated subprocess. A failure in one audit does not abort or skip audits for other manifests. All ecosystems run to completion before the overall result is computed — a BLOCKED result from one ecosystem does not short-circuit audits for remaining ecosystems.

Audit Tool Availability

Before invoking any audit tool, the gate checks availability:

| Case | Condition | Action |
| --- | --- | --- |
| Available | Tool in PATH, environment ready | Run audit |
| Tool missing | Tool not in PATH | Write warning to `audit-results.md`, surface to user |
| Tool broken | Tool found but `--version` fails | Write warning, skip |
| Environment not ready | Tool found but required environment absent | Write specific reason, skip with warning |

Per-manifest environment readiness checks:

  • `requirements.txt`: `pip-audit -r requirements.txt` reads the file directly. No virtualenv required. Available if `pip-audit` is on PATH.
  • `pyproject.toml`: `pip-audit` without `-r` inspects the installed environment. Requires an active virtualenv or a lockfile (`poetry.lock`, `pdm.lock`, `uv.lock`). If neither is present, skip with: "pip-audit requires a virtual environment or lock file for pyproject.toml; results would be unreliable."
  • `Cargo.toml`: Requires `Cargo.lock` to be present. If absent: "skipped — Cargo.lock absent; run cargo generate-lockfile first."
  • `package.json`: Requires `package-lock.json` (or `npm-shrinkwrap.json`) in the same directory (or workspace root). If absent: "skipped — no lockfile found; run npm install to generate package-lock.json." `npm` must be on PATH.

Python manifest confidence: When only `pyproject.toml` is present (no `requirements.txt` or lockfile in the same directory), include a notice in `audit-results.md`: "Confidence: Reduced — No requirements.txt or lock file found. pip-audit is resolving dependencies from pyproject.toml directly. Results may be incomplete."

Tool availability results are written to `audit-results.md` (not `preflight-audit.md`), because they are discovered at execution time, not scan time.

A run where all manifests are skipped (missing tools or environment-not-ready) is reported as INCONCLUSIVE, not passing.

Audit Tool Error Handling

Audit tools exit non-zero for two distinct reasons:

  • Vulnerabilities found — treated as a successful audit with findings (status: FINDINGS).
  • Audit request failed (network error, registry timeout, corrupt lockfile) — treated as a failed run (status: FAILED). Warning written, gate continues to next manifest.

Exit code contracts per tool:

| Tool | Clean | Findings | Error |
| --- | --- | --- | --- |
| `npm audit` | exit 0 | exit 1 | exit 2+ |
| `cargo audit` | exit 0 | exit 1 | exit 2+ |
| `pip-audit` | exit 0 | exit 1 | exit 2+ (or non-zero with unparseable stdout) |

Use exit codes to distinguish outcomes. Do not parse stderr substring content to classify results.
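
The contract in the table reduces to a three-way classification. The sketch below assumes the exit codes exactly as listed; the function name and the `stdout_parseable` flag are illustrative, and the flag covers the "non-zero with unparseable stdout" case noted for `pip-audit`.

```python
def classify_exit(returncode: int, stdout_parseable: bool = True) -> str:
    """Map an audit tool's exit code to CLEAN, FINDINGS, or FAILED.

    Follows the exit-code table above; never inspects stderr.
    """
    if returncode == 0:
        return "CLEAN"
    if returncode == 1 and stdout_parseable:
        return "FINDINGS"  # a successful audit that reported vulnerabilities
    # exit 2+, a signal (negative code), or findings output that cannot be parsed
    return "FAILED"
```

A `FAILED` result leads to a warning and the next manifest, so one registry timeout does not hide findings elsewhere.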

Severity Normalization

Audit tools use different severity vocabularies. The gate normalizes to a common scale. CVSS boundaries are inclusive on the lower bound, exclusive on the upper (e.g., a CVSS score of exactly 9.0 is Critical, not High).

| Level | npm audit | cargo audit | pip-audit |
| --- | --- | --- | --- |
| Critical | `critical` | CVSS >= 9.0 | CVSS >= 9.0 |
| High | `high` | CVSS >= 7.0 and < 9.0 | CVSS >= 7.0 and < 9.0 |
| Moderate | `moderate` | CVSS >= 4.0 and < 7.0 | CVSS >= 4.0 and < 7.0 |
| Low | `low` | CVSS >= 0.1 and < 4.0 | CVSS >= 0.1 and < 4.0 |
| Informational | | CVSS = 0.0 | CVSS = 0.0 |

CVSS 0.0 findings are classified as Informational: reported in `audit-results.md` but never counted toward blocking. They do not map to any blocking severity level.

If a finding has no CVSS score (advisory-only, no CVE assigned), it is treated as Moderate and flagged with `[no-cvss]` in the output.
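
The normalization rules can be expressed as two small lookups. This is a sketch, assuming CVSS scores arrive as floats rounded to one decimal place; the names `normalize_cvss` and `normalize_npm` are hypothetical.

```python
from typing import Optional, Tuple

NPM_LEVELS = {"critical": "Critical", "high": "High",
              "moderate": "Moderate", "low": "Low"}


def normalize_npm(label: str) -> str:
    """Map an npm audit severity label onto the common scale."""
    return NPM_LEVELS[label.lower()]


def normalize_cvss(score: Optional[float]) -> Tuple[str, bool]:
    """Return (level, no_cvss) for a cargo audit or pip-audit finding.

    Lower bounds are inclusive, upper bounds exclusive. A missing score is
    treated as Moderate, and the True flag marks it for the [no-cvss] tag.
    """
    if score is None:
        return ("Moderate", True)
    if score >= 9.0:
        return ("Critical", False)
    if score >= 7.0:
        return ("High", False)
    if score >= 4.0:
        return ("Moderate", False)
    if score >= 0.1:
        return ("Low", False)
    return ("Informational", False)  # CVSS 0.0: reported, never blocking
```

Checking the thresholds from highest to lowest makes the inclusive lower bound explicit: a score of exactly 9.0 hits the first branch and is Critical.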

Output Model

Pre-flight produces two files under `scratch/<run-id>/`:

`preflight-audit.md`: Scan-time plan. Written before any audit tool runs. Contains only scan-time information:

  • Run ID and `generated-at` timestamp (ISO-8601)
  • Manifest list with path, ecosystem, and deduplication/workspace decisions

This file is not updated after execution begins. It is the immutable record of what the scan discovered.

`audit-results.md`: Execution-time output. Written incrementally as each ecosystem completes. Contains:

  • Tool availability results (discovered at execution time)
  • Per-manifest findings with normalized severity
  • Deduplication notes (same CVE from multiple sources)
  • Reduced-confidence notices
  • Overall result

Each ecosystem section ends with a `status: complete` sentinel line. A section without this sentinel is considered incomplete and must be discarded and re-run on recovery.

Schema for `audit-results.md`:

# Dependency Audit
generated-at: <ISO-8601>
run-id: <run-id>

> This section is independent of red-team findings. The red-team did not see this data.

## Tool Availability
- npm audit: available
- cargo audit: available
- pip-audit (requirements.txt): available
- pip-audit (pyproject.toml): unavailable — no venv or lock file

## Summary
Result: CLEAN | FINDINGS | BLOCKED | INCONCLUSIVE | FAILED
Critical: N  High: N  Moderate: N  Low: N  Informational: N

## npm — packages/api/package.json — FINDINGS
[findings list: package, severity, CVE, fix-available]
status: complete

## pip — src/requirements.txt — FINDINGS
## pip — src/pyproject.toml — FINDINGS
[deduplicated: CVE-2024-XXXXX reported by both src/requirements.txt and src/pyproject.toml — counted once]
status: complete

## Warnings
[environment-not-ready, reduced-confidence, or deduplication notes]
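
The sentinel rule lends itself to a simple scan of the file. The sketch below is one possible reading, not the gate's parser: it assumes every audited section is a `## ` heading with its own `status: complete` line, and it skips the three non-audit headings from the schema. Stacked headings that share a single sentinel, like the two pip headings in the schema, would need extra handling.

```python
from typing import Dict, List

SENTINEL = "status: complete"
NON_AUDIT_HEADINGS = {"Tool Availability", "Summary", "Warnings"}


def completed_sections(text: str) -> List[str]:
    """Return the headings of audit sections that end with the sentinel."""
    sections: Dict[str, List[str]] = {}
    current = None
    for line in text.splitlines():
        if line.startswith("## "):
            current = line[3:].strip()
            sections[current] = []
        elif current is not None:
            sections[current].append(line.strip())

    done = []
    for heading, body in sections.items():
        if heading in NON_AUDIT_HEADINGS:
            continue
        nonblank = [entry for entry in body if entry]
        if nonblank and nonblank[-1] == SENTINEL:
            done.append(heading)
    return done
```

On recovery, any manifest whose heading is missing from the returned list would be audited again.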

Overall Result Computation

When results span multiple manifests with mixed outcomes, the overall `Result:` field uses this precedence (highest wins):

| Priority | Result | Condition |
| --- | --- | --- |
| 1 (highest) | BLOCKED | Findings at or above `min_blocking_severity` and `skip_blocking` is false |
| 2 | FINDINGS | At least one manifest returned vulnerability findings (below blocking threshold or override active) |
| 3 | INCONCLUSIVE | At least one manifest was skipped (tool missing, environment not ready); no findings |
| 4 | FAILED | At least one manifest tool errored; no findings and no skips |
| 5 (lowest) | CLEAN | All manifests completed without findings |

INCONCLUSIVE outranks FAILED because unknown coverage (a manifest exists but was never audited) is more dangerous than a known, retryable tool error.
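
The precedence rules can be sketched as a single function. The per-manifest status strings (`CLEAN`, `FINDINGS`, `SKIPPED`, `FAILED`) and the function signature are assumptions for illustration; `min_blocking_severity` and `skip_blocking` are the settings named in the table.

```python
from typing import Iterable

# Informational is deliberately absent: it never counts toward blocking.
RANK = {"Low": 1, "Moderate": 2, "High": 3, "Critical": 4}


def overall_result(manifest_statuses: Iterable[str],
                   finding_levels: Iterable[str],
                   min_blocking_severity: str,
                   skip_blocking: bool) -> str:
    """Combine per-manifest outcomes into the overall Result field."""
    statuses = set(manifest_statuses)
    threshold = RANK[min_blocking_severity]
    blocking = any(RANK.get(level, 0) >= threshold for level in finding_levels)

    if blocking and not skip_blocking:
        return "BLOCKED"
    if "FINDINGS" in statuses:
        return "FINDINGS"
    if "SKIPPED" in statuses:
        return "INCONCLUSIVE"  # unknown coverage outranks a retryable error
    if "FAILED" in statuses:
        return "FAILED"
    return "CLEAN"
```

Because each check returns early, a skipped manifest and a failed manifest in the same run yield `INCONCLUSIVE`, which matches the ordering rationale above.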

Blocking and Prompting Behavior

When a finding at or above `min_blocking_severity` is present and `skip_blocking` is not `true`:

  • Interactive session (Claude Code can prompt the user): Present the finding summary grouped by fix availability ("Fixable (N)" and "No fix available (M)") and ask whether to continue to red-team review or abort. This grouping gives the user immediate signal on remediation effort: all-fixable blockers are a quick `npm audit fix` / `cargo update` away; no-fix blockers may require dependency replacement or acceptance.
  • Non-interactive context (automated pipeline, piped input): Write `Result: BLOCKED` and return to the parent orchestrator without prompting.

Whether a session is interactive is a Claude Code runtime property, not something the skill detects via TTY heuristics or environment inspection.

Parent-pipeline integration: When the gate returns with `Result: BLOCKED`, the parent orchestrator (build, spec, or direct user invocation) treats this the same as any gate failure: escalate to the user with the blocking findings listed. The `red-team-rounds: 0` field indicates the red-team loop never ran.

Anti-Anchoring Preservation

Neither `preflight-audit.md` nor `audit-results.md` is passed to red-team dispatch. The red-team receives only the artifact under review, unchanged from current behavior. Audit findings are surfaced to the user at gate completion as an independent signal alongside (not merged with) red-team findings.

Stale Audit Results

The `generated-at` timestamp marks when results were produced. Results are valid for that point in time only. The gate does not re-run pre-flight after fix-agent remediation within the same gate run. This is an explicit design boundary: the gate run is a point-in-time evaluation.

Anti-Anchoring Rules

The iterative loop's value depends on each reviewer seeing the artifact with fresh eyes. To prevent information leaking between rounds:

  1. Clean artifact only. The artifact passed to each round's reviewer must be the current version with no revision marks, "Fixed:" annotations, or comments about prior reviews. If the fix agent left review-response comments in the artifact, strip them before the next round.
  2. Standardized framing. The orchestrator's dispatch prompt must use the same framing for every round. Do not mention that prior review rounds occurred, what was fixed, or how many rounds have run. The reviewer sees the artifact as if it is the first review.
  3. No findings forwarding. Never pass prior round findings to the next reviewer. This is already specified in `crucible:red-team` but is restated here because the quality-gate orchestrator is the most likely point of accidental leakage.

Round History and Compaction Recovery

Quality gate writes round state to disk for compaction recovery.

Scratch directory: `~/.claude/projects/<project-hash>/memory/quality-gate/scratch/<run-id>/` where `<run-id>` is a timestamp generated at the start of the gate. This path is persistent and discoverable (matching the audit skill's pattern), so it survives compaction even if the run-id is lost from context: the orchestrator can list the directory to find active runs.

Tool constraint: All scratch directory operations (create, read, list, delete) must use Write, Read, and Glob tools, NOT Bash. Safety hooks block Bash commands referencing `.claude/` paths.

Active run marker: At the start of the gate, write `~/.claude/projects/<project-hash>/memory/quality-gate/active-run-<run-id>.md` containing the run-id and scratch directory path. Delete only your own marker when the gate completes. After compaction, glob for `active-run-*.md` files to locate active runs: recover the one whose run-id matches context, or the most recent if context is lost.

Stale cleanup: At the start of each gate, delete scratch directories whose timestamps are older than 2 hours AND that are NOT referenced by any `active-run-*.md` marker. Also delete any `fix-journal-*.md` handoff files in the `memory/quality-gate/` directory whose mtime is older than 24 hours (the longer window accommodates overnight breaks between QG and forge sessions).
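
The two cleanup windows can be modeled as a selection rule. This sketch only illustrates the decision logic; in the gate itself the listing and deletion go through the Write, Read, and Glob tools as required above. The function names and the dictionary inputs are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Dict, List, Set

SCRATCH_MAX_AGE = timedelta(hours=2)
JOURNAL_MAX_AGE = timedelta(hours=24)


def stale_scratch_runs(started_at: Dict[str, datetime],
                       active_run_ids: Set[str],
                       now: datetime) -> List[str]:
    """Run-ids whose scratch directory is older than 2 hours AND unreferenced."""
    return sorted(
        run_id for run_id, started in started_at.items()
        if now - started > SCRATCH_MAX_AGE and run_id not in active_run_ids
    )


def stale_fix_journals(mtimes: Dict[str, datetime], now: datetime) -> List[str]:
    """fix-journal handoff files whose mtime is older than 24 hours."""
    return sorted(name for name, mtime in mtimes.items()
                  if now - mtime > JOURNAL_MAX_AGE)
```

Both conditions must hold for a scratch directory to be deleted, so a long-running gate with a live marker keeps its state however old it is.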

After each round, write:

  • `round-N-score.md`: weighted score, Fatal count, Significant count, Minor count
  • `round-N-findings.md`: the red-team findings for this round
  • `artifact-N.md`: the artifact snapshot after fixes (input to round N+1)
  • `fix-journal.md`: cumulative fix journal (appended after each fix agent completes; see Fix Memory above)
  • `round-N-comparison.md`: stagnation judge output (only exists for rounds where the judge was dispatched; absence on clean-progress rounds is expected, not an error). When multi-model consensus was used, this file also contains consensus metadata: models queried, models responded, agreement level, and any dissenting verdicts.
  • `round-N-verification.md`: fix verifier verdict summary (written after every fix round; unlike comparison files, these exist for every round that had fixes)

Compaction recovery:

  0. Read `## Compression State` from `pipeline-status.md` to recover Goal, Key Decisions (including parent skill decisions that affect the gate), Active Constraints, and Next Steps. If absent, skip to step 1. Note: quality-gate is invoked by a parent skill (build, debugging, spec), so the Compression State reflects the parent's context. The quality-gate orchestrator inherits this context.

  1. Glob for `active-run-*.md` markers to locate the scratch directory.
  1b. Pre-flight recovery (code artifacts only): Check for `preflight-audit.md` in the scratch directory. If absent, restart from manifest scan. If present, read it to recover the manifest list. Then check `audit-results.md` for completed ecosystem sections (those ending with the `status: complete` sentinel). Sections without the sentinel are discarded as incomplete. Resume from the first manifest not yet present as a complete section. Recovery re-invokes the audit tool for incomplete manifests; no raw output is cached between compaction events. After all manifests complete, regenerate the Summary section of `audit-results.md`.
  2. Read scratch directory to determine current round (highest N in `round-N-score.md` files).
  3. Read the latest `artifact-N.md` as the current artifact state.
  4. Read all `round-N-score.md` files to reconstruct the score progression.
  5. Read all `round-N-comparison.md` files to reconstruct consecutive-round state for the stagnation judge. Absence of comparison files is expected on clean-progress rounds.
  6. Read all `round-N-verification.md` files to recover fix verifier state. If any Fatal-severity Unresolved verdicts exist in the latest verification file, carry them forward as binding context for the next fix dispatch.
  7. Output status to user: "Quality gate recovered after compaction. Round N complete, score progression: [list]. Continuing."
  8. Emit a Compression State Block into the conversation with gate-specific state: current round, score progression, artifact type under review. Inherit Goal and Key Decisions from the parent skill's last Compression State if available.
  8b. Check whether the `consensus_query` MCP tool is available (consensus availability may have changed across the compaction boundary). Use current availability for subsequent rounds regardless of what was used pre-compaction.
  9. Dispatch the next red-team round.
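
Steps 2 through 6 amount to reading the scratch directory listing by file-name pattern. The sketch below shows that bookkeeping only; the function name and the returned dictionary keys are hypothetical, and file contents are not read here.

```python
import re
from typing import Dict, Iterable, List, Optional

ROUND_FILE = re.compile(r"^round-(\d+)-(score|comparison|verification)\.md$")


def recovery_plan(filenames: Iterable[str]) -> Optional[Dict[str, object]]:
    """Derive the current round and the files to read from a directory listing.

    Returns None when no round-N-score.md exists (no round has completed).
    """
    by_kind: Dict[str, List[int]] = {"score": [], "comparison": [],
                                     "verification": []}
    for name in filenames:
        match = ROUND_FILE.match(name)
        if match:
            by_kind[match.group(2)].append(int(match.group(1)))

    if not by_kind["score"]:
        return None
    current = max(by_kind["score"])  # numeric max, so round 10 beats round 9
    return {
        "current_round": current,
        "artifact": f"artifact-{current}.md",
        "score_files": [f"round-{n}-score.md" for n in sorted(by_kind["score"])],
        "comparison_files": [f"round-{n}-comparison.md"
                             for n in sorted(by_kind["comparison"])],
        "verification_files": [f"round-{n}-verification.md"
                               for n in sorted(by_kind["verification"])],
    }
```

Parsing N as an integer matters once a run passes round 9, because a lexicographic sort would rank `round-9-score.md` above `round-10-score.md`.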

Checkpoint Timing

Emit a Compression State Block at:

  • Every 3 rounds: After rounds 3, 6, 9, 12
  • Before stagnation judge dispatch: When the first-pass check would trigger stagnation
  • Gate completion: When the gate passes or escalates (before returning to parent skill)
  • Health transitions: On any GREEN->YELLOW or YELLOW->RED transition

Dead-end handoff (step 5, code artifacts only): After Minor Issue Handling and before cleanup, if `fix-journal.md` exists in the scratch directory and contains 1+ round entries, copy its contents to `~/.claude/projects/<project-hash>/memory/quality-gate/fix-journal-<run-id>.md` (using the gate's run-id). This is a transient handoff artifact for the next forge retrospective. On stagnation/escalation exit paths, also write the handoff file before escalating; stagnated sessions produce the highest-value dead-end data.

Cleanup: Delete scratch directory and your `active-run-<run-id>.md` marker after the gate completes (pass or stagnation). Do NOT delete verdict marker files (`gate-verdict-<run-id>.md`); the build orchestrator is responsible for their lifecycle.

Verdict Marker

After Minor Issue Handling completes and before cleanup begins, write a verdict marker file to a stable location outside the scratch directory. This marker survives scratch cleanup and serves as a cross-skill consistency signal for the build orchestrator's gate ledger.

When: After Minor Issue Handling (the quick-fix pass on consolidated minors) and before cleanup. Written on ALL exit paths — PASS, FAIL, STAGNATION, and ESCALATED. The Verdict field reflects the actual outcome.

Path: `~/.claude/projects/<project-hash>/memory/quality-gate/gate-verdict-<run-id>.md`

Format: Key-value pairs, one per line:

Verdict: PASS | FAIL | STAGNATION | ESCALATED
Phase: <phase name from invoking orchestrator, omit if standalone>
PipelineID: <pipeline-id from invoking orchestrator, omit if standalone>
Rounds: <total round count>
FinalScore: <weighted score from last round>
Timestamp: <ISO-8601>
RunID: <quality-gate run-id>

Tool: Write tool (not Bash) since the path is under `.claude/`.

Standalone invocations: When quality-gate is invoked directly (not by build), the `Phase` and `PipelineID` fields are omitted. The marker is still written; it serves as a completion record even without pipeline context.
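
The marker format, including the standalone case, can be sketched as a small renderer. The function name and parameters are hypothetical; the field names and the set of verdicts come from the format above. The resulting text would still be written with the Write tool.

```python
from typing import Optional

VERDICTS = {"PASS", "FAIL", "STAGNATION", "ESCALATED"}


def render_verdict_marker(verdict: str, rounds: int, final_score: int,
                          timestamp: str, run_id: str,
                          phase: Optional[str] = None,
                          pipeline_id: Optional[str] = None) -> str:
    """Render the key-value verdict marker, one pair per line.

    Phase and PipelineID are omitted when the gate runs standalone.
    """
    if verdict not in VERDICTS:
        raise ValueError(f"unknown verdict: {verdict}")
    lines = [f"Verdict: {verdict}"]
    if phase is not None:
        lines.append(f"Phase: {phase}")
    if pipeline_id is not None:
        lines.append(f"PipelineID: {pipeline_id}")
    lines += [
        f"Rounds: {rounds}",
        f"FinalScore: {final_score}",
        f"Timestamp: {timestamp}",
        f"RunID: {run_id}",
    ]
    return "\n".join(lines) + "\n"
```

Omitting the two lines, rather than writing empty values, keeps a standalone marker distinguishable from a pipeline marker with missing context.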

Stale cleanup exclusion: Verdict markers are NOT subject to the 2-hour stale cleanup that applies to scratch directories. They are deleted by the build orchestrator after writing the corresponding gate ledger entry. Orphaned markers (from crashed runs) are cleaned up during the build skill's ledger initialization.

Invocation Convention

Quality gate is invoked by the outermost orchestrator only — not self-invoked by child skills. This avoids double-gating.

Rule: Skills NEVER self-invoke quality-gate. They only document that their output is gateable. The outermost orchestrator (build, the user session, or another pipeline) always handles gating. This eliminates the ambiguity of skills trying to detect whether they are running standalone or as a sub-skill.

When Used Standalone (user invokes directly)

The user's session is the outermost orchestrator. When a user runs `/design` directly, the design skill produces the doc and documents it as gateable. The user's session (following the design skill's instructions) invokes quality-gate.

When Used as a Sub-Skill of Build

Build is the outermost orchestrator and controls all quality gates:

  • Phase 1 (after design): Quality gate on design doc (artifact type: design)
  • Phase 2 (after plan review): Quality gate on plan (artifact type: plan)
  • Phase 4 (after implementation): Quality gate on full implementation (artifact type: code)

Context from invoking orchestrator: When build invokes quality-gate, it includes a "Context from invoking orchestrator" block in the dispatch prompt containing:

  • `Phase: <phase name>`: "design", "plan", or "code"
  • `PipelineID: <pipeline-id>`: the build's PipelineID (format: `build-YYYYMMDD-HHMMSS`)

Quality-gate reads these values from its dispatch context and includes them in the verdict marker. These are dispatch context values, not tool arguments — quality-gate is a skill, not an API.

Artifact Types

| Type | Produced By | Gate Trigger |
| --- | --- | --- |
| `design` | `crucible:design` | After design doc is saved |
| `plan` | `crucible:planning` | After plan passes review |
| `hypothesis` | `crucible:debugging` | Phase 3.5, before implementation |
| `code` | `crucible:debugging`, build | After implementation/fix |
| `mockup` | `crucible:mockup-builder` | After mockup is created |
| `translation` | `crucible:mock-to-unity` | After self-verification |

Documentation Convention

Each artifact-producing skill's SKILL.md documents:

"This skill produces [artifact type]. The outermost orchestrator invokes

crucible:quality-gate
after [trigger]."

Escalation

Exit paths beyond clean approval:

  • Stagnation → escalate to user with recurring/new classification from the judge: "Stagnation detected: Round N has [X] recurring issues from round N-1 and [Y] new issues. Recurring: [list]. Escalating."
  • Diminishing returns → escalate to user with structural findings from the judge: "Quality gate has resolved all prior issues. Round N found [X] new findings, all Structural (require design-level decisions). Remaining findings: [list]. Presenting for user judgment."
  • Regression (score increased) → escalate immediately, no judge needed: "Round N score (X) is higher than Round N-1 score (Y). The fix cycle introduced new issues. Escalating."
  • Global safety limit reached (15 rounds) → escalate to user with full round history
  • Architectural concerns → escalate immediately (bypass loop)
  • User can interrupt at any time to skip the gate
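
The score-driven part of these exit paths can be sketched as a decision over the score progression. This is an illustration under assumptions: a weighted score of 0 is taken to mean a clean round, and an unchanged score is taken to be the first-pass trigger for the stagnation judge. Stagnation and diminishing-returns verdicts come from the judge and are not modeled here, nor are architectural escalations or user interrupts.

```python
from typing import List

MAX_ROUNDS = 15  # global safety limit


def next_action(scores: List[int]) -> str:
    """Decide the orchestrator's next step from weighted scores per round."""
    if not scores:
        raise ValueError("at least one completed round is required")
    latest = scores[-1]
    if latest == 0:
        return "pass"  # 0 Fatal, 0 Significant on a fresh review
    if len(scores) >= 2 and latest > scores[-2]:
        return "escalate: regression"  # no judge needed
    if len(scores) >= MAX_ROUNDS:
        return "escalate: safety limit"
    if len(scores) >= 2 and latest == scores[-2]:
        return "dispatch stagnation judge"  # assumed first-pass trigger
    return "continue"  # strictly improving, or only one round so far
```

The regression check comes before the safety limit so that a worsening score at round 15 is reported with the more specific reason.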

Red Flags

  • Orchestrator fixing artifacts directly instead of dispatching a fix agent
  • Rationalizing away red-team findings instead of addressing them
  • Skipping the gate without explicit user approval — including autonomous decisions based on task size, complexity, or scope assessment ("this is small", "this is trivial", "this is just a config change")
  • Rationalizing that a change doesn't need adversarial review based on perceived simplicity
  • Declaring the gate complete after fixing findings without a clean verification round — the iteration loop must run to completion (0 Fatal, 0 Significant on a fresh review)
  • Exceeding the 15-round safety limit without escalating
  • Using the same red-team agent across rounds (always dispatch fresh)
  • Declaring stagnation on raw issue count without using weighted score (Fatal=3, Significant=1)
  • Passing revision context, prior findings, round history, or fix journal to the red-team reviewer (fix journal is for fix agents ONLY)
  • Leaving review-response artifacts (comments, annotations) in the artifact between rounds
  • Dispatching a fix agent without the fix journal on round 2+ (fix agents need remediation history)
  • Orchestrator performing semantic comparison inline instead of dispatching the stagnation judge
  • Dispatching the judge when the score is strictly improving (waste — score alone is sufficient)
  • Forgetting to save the judge's output as `round-N-comparison.md` (breaks consecutive-round tracking)
  • Skipping the fix verifier dispatch after a fix agent completes (every fix round gets verified)
  • Passing verifier output to the red-team reviewer (verifier is on the remediation path only)
  • Allowing fix agents to drift outside the declared change boundary without flagging
  • Re-dispatching the fix agent based on verifier results (no re-fix sub-loop — verifier checks once, output feeds into next round)
  • Skipping Compression State Block emission at checkpoint boundaries
  • Emitting a Compression State Block with stale or missing Key Decisions (decisions must be cumulative across all prior blocks)
  • Allowing the Goal field to drift across successive Compression State Blocks (must match original user request)
  • Exceeding 10 entries in the Key Decisions list without overflow-compressing the oldest
  • Using consensus on every red-team round (periodic only: rounds 1, 4, 7, ...)
  • Treating single-model unique findings from consensus as less important than multi-model agreements
  • Passing consensus provenance metadata to the fix agent's red-team framing (provenance is for the fix journal and orchestrator, not for biasing the next reviewer)
  • Including external review findings in the weighted score calculation (INV-2: host red-team findings ONLY)
  • Using external findings as inputs to stagnation detection scoring
  • Blocking the host red-team round on external review availability or timeout
  • Passing pre-flight audit findings (preflight-audit.md or audit-results.md) to red-team dispatch — audit is an independent parallel signal, not red-team input
  • Skipping pre-flight for code artifacts without explicit user approval
  • Re-running pre-flight after fix rounds within the same gate run (pre-flight is a point-in-time evaluation)
  • Treating INCONCLUSIVE audit results as CLEAN — unknown coverage is more dangerous than no findings
  • Running pre-flight for non-code artifact types (design, plan, hypothesis, mockup, translation)
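
Two of the numeric rules in this list are easy to get wrong by hand, so a short sketch may help. It assumes the weights stated above (Fatal=3, Significant=1) and reads "rounds 1, 4, 7, ..." as every third round starting at round 1; the function names are hypothetical.

```python
def weighted_score(fatal: int, significant: int) -> int:
    """Weighted round score; Minor findings do not contribute."""
    return 3 * fatal + significant


def is_consensus_round(round_number: int) -> bool:
    """True for rounds 1, 4, 7, ... (periodic consensus only)."""
    return round_number >= 1 and (round_number - 1) % 3 == 0
```

The weighting is why raw counts mislead: a round with one Fatal finding scores 3, which is worse than a round with two Significant findings scoring 2, even though it has fewer issues.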

Integration

  • crucible:red-team: The engine that performs each review round. Loop ownership: Quality-gate uses red-team as a single-pass reviewer only (one dispatch = one review round, findings returned). Quality-gate owns the iteration loop, stagnation detection, and round tracking. Red-team does NOT run its own stagnation loop when invoked by quality-gate. Red-team's stagnation rules apply only when red-team is invoked directly (e.g., by `crucible:finish`).
  • crucible:design: Produces design docs (gateable artifact)
  • crucible:planning: Produces plans (gateable artifact)
  • crucible:debugging: Produces hypotheses and fixes (gateable artifacts). Note: Debugging's Phase 5 must invoke `crucible:quality-gate` for fix review, not `crucible:red-team` directly. This ensures fixes get iteration tracking, compaction recovery, and user checkpoints.
  • crucible:mockup-builder: Produces mockups (gateable artifact)
  • crucible:mock-to-unity: Produces translation maps and implementations (gateable artifacts)
  • crucible:build: Outermost orchestrator, controls all gates in pipeline
  • crucible:checkpoint: Shadow git checkpoints before code-artifact fix rounds (recommended). Provides rollback target when fix rounds introduce regressions.