# Marketplace judge
Scoring framework for test-kitchen cookoff and omakase-off. Invoked at Phase 4 to evaluate implementations against five criteria. Do not invoke directly; it is called by cookoff/omakase-off.
```bash
# Clone the full marketplace repo:
git clone https://github.com/aiskillstore/marketplace

# Or install just this skill into ~/.claude/skills:
T=$(mktemp -d) && git clone --depth=1 https://github.com/aiskillstore/marketplace "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/2389-research/judge" ~/.claude/skills/aiskillstore-marketplace-judge && rm -rf "$T"
```
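If the one-liner succeeds, the skill should end up in `~/.claude/skills/aiskillstore-marketplace-judge`, and the temporary clone in `$T` is cleaned up afterward.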
`skills/2389-research/judge/SKILL.md`

## Test Kitchen Judge
Score implementations using the 5-criteria framework. Fill out ALL sections exactly as shown.
**Terminology:** This skill uses "impl" but works for both:
- Cookoff: impl-1, impl-2, impl-3 (same design, different implementations)
- Omakase: variant-a, variant-b (different approaches/designs)
### REQUIRED OUTPUT FORMAT
You MUST produce this exact structure. Do not summarize or abbreviate.
```markdown
## Gate Check

| Impl | Tests Pass | Design Adherence |
|------|------------|------------------|
| impl-1 | X/X ✓ or ✗ | Yes/No |
| impl-2 | X/X ✓ or ✗ | Yes/No |

## Feasibility Check

| Impl | Status | Notes |
|------|--------|-------|
| impl-1 | ✓ OK / ⚠️ Flag | Details |
| impl-2 | ✓ OK / ⚠️ Flag | Details |

## Scoring Worksheet

### impl-1

**Fitness for Purpose** (Does it solve the actual problem?)

*Functional requirements:*
- [ ] Primary use case works end-to-end?
- [ ] All explicitly stated requirements implemented?
- [ ] Handles realistic scenarios, not just happy path?

*User needs (beyond literal requirements):*
- [ ] Would the user actually use this, or just demo it?
- [ ] Does it solve the real problem, not just the literal request?
- [ ] Does deployment/distribution match stated needs?

*Future considerations (if relevant):*
- [ ] If growth/scaling mentioned, does architecture support it?
- [ ] If team/collaboration mentioned, is it maintainable by others?

Checklist: _/8 YES → **Score: _/5** (7-8=5, 5-6=4, 4=3, 2-3=2, 0-1=1)

*Note: Not all items apply to every project. Score based on relevant items.*

**Justified Complexity** (Every line earning its keep?)
- Unnecessary abstractions: ___
- Dead code: ___
- Bloat estimate: ___%

*Line count comparison (if multiple impls):*
- This impl: ___ lines
- Smallest impl: ___ lines
- Extra lines justified by: ___

→ **Score: _/5** (5=minimal, 4=slight bloat <10%, 3=10-25% bloat, 2=25-50%, 1=>50%)

**Readability** (Understand core flow in 5 min?)

Violations:
- [ ] Single-letter vars (not loop index): +1 each = __
- [ ] Functions >50 lines: +1 each = __
- [ ] Nesting >3 levels: +1 each = __
- [ ] Magic numbers: +1 each = __
- [ ] Bad function names: +1 each = __

Total violations: __ → **Score: _/5** (0=5, 1-2=4, 3-4=3, 5-7=2, 8+=1)

**Robustness & Scale** (Handles unexpected + growth?)
- [ ] Input validation?
- [ ] External call error handling?
- [ ] Useful error messages?
- [ ] Null/empty handling?
- [ ] Async timeouts?
- [ ] No unbounded loops?
- [ ] O(n log n) or better?
- [ ] Bounded memory?
- [ ] Queries paginated?
- [ ] No blocking I/O in hot path?
- [ ] Backoff/retry logic?
- [ ] Handles 10x load?

Checklist: _/12 YES + feasibility flags → **Score: _/5** (11-12 + no flags=5, 9-10 or minor flag=4, 7-8=3, 5-6 or major flag=2, <5 or critical flag=1)

**Maintainability** (Pain of next change?)
- [ ] Single responsibility per function?
- [ ] Explicit dependencies (no globals)?
- [ ] Business logic separated from infra?
- [ ] New feature = ≤3 files changed?
- [ ] Config externalized?
- [ ] Tests catch regressions?

Checklist: _/6 YES → **Score: _/5** (6=5, 5=4, 4=3, 2-3=2, 0-1=1)

### impl-2

[REPEAT SAME FORMAT]

### impl-3 (if applicable)

[REPEAT SAME FORMAT]

## Judge Scorecard

| Criterion | impl-1 | impl-2 | impl-3 | Best |
|-----------|--------|--------|--------|------|
| Fitness for Purpose | | | | |
| Justified Complexity | | | | |
| Readability | | | | |
| Robustness & Scale | | | | |
| Maintainability | | | | |
| **TOTAL** | /25 | /25 | /25 | |

## Hard Gates

| Gate | Result |
|------|--------|
| Fitness Gate (Δ ≥ 2) | Triggered/Not triggered |
| Critical Flaw (any = 1) | Triggered/Not triggered |

## Winner Selection

**Winner: impl-X** (Score: __/25)

**Selection rationale:** [2-3 sentences explaining WHY this implementation won]

**Trade-offs acknowledged:** [What the other implementations did better]
```
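The checklist-to-score conversions in the worksheet are plain threshold lookups. As a minimal sketch only (the helper names are invented here and nothing below is part of the required output), the buckets translate to:

```python
def bucket(yes_count: int, thresholds: list[tuple[int, int]]) -> int:
    """Map a checklist YES-count to a 1-5 score.

    `thresholds` holds (minimum_yes_count, score) pairs ordered from
    highest minimum to lowest; counts below every minimum score 1.
    """
    for minimum, score in thresholds:
        if yes_count >= minimum:
            return score
    return 1

# Thresholds copied from the worksheet above.
fitness = bucket(6, [(7, 5), (5, 4), (4, 3), (2, 2)])          # _/8 checklist -> 4
robustness = bucket(10, [(11, 5), (9, 4), (7, 3), (5, 2)])     # _/12 checklist -> 4;
# feasibility flags can push Robustness into a lower bucket (not modeled here).
maintainability = bucket(4, [(6, 5), (5, 4), (4, 3), (2, 2)])  # _/6 checklist -> 3

# Readability counts violations instead, so fewer is better:
# 0 -> 5, 1-2 -> 4, 3-4 -> 3, 5-7 -> 2, 8+ -> 1.
def readability(violations: int) -> int:
    for ceiling, score in [(0, 5), (2, 4), (4, 3), (7, 2)]:
        if violations <= ceiling:
            return score
    return 1
```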
### Scoring Reference

#### Score Meanings
| Score | Meaning |
|---|---|
| 5 | Excellent - exceeds expectations |
| 4 | Good - fully meets requirements |
| 3 | Adequate - core works, some gaps |
| 2 | Poor - significant issues |
| 1 | Critical flaw - disqualifying |
#### Hard Gates (Automatic)
- **Fitness Gate**: If Fitness Δ ≥ 2 between impls → the higher-fitness impl WINS immediately
- **Critical Flaw**: If ANY criterion = 1 → that impl is ELIMINATED
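Both gates are mechanical. A hedged Python sketch of how they combine with the totals (the `Impl` shape is invented, and the ordering of the two gates — eliminate critical flaws first, then check Fitness — is an assumption the skill text does not fix):

```python
from dataclasses import dataclass

@dataclass
class Impl:
    name: str
    scores: dict[str, int]  # criterion name -> integer score, 1-5

    @property
    def total(self) -> int:
        return sum(self.scores.values())  # out of 25

def pick_winner(impls: list[Impl]) -> Impl:
    # Critical Flaw gate: any criterion scored 1 eliminates that impl.
    survivors = [i for i in impls if min(i.scores.values()) > 1]
    if not survivors:
        raise ValueError("every impl was eliminated by a critical flaw")

    # Fitness Gate: a Fitness gap of 2+ decides immediately.
    ranked = sorted(survivors,
                    key=lambda i: i.scores["Fitness for Purpose"],
                    reverse=True)
    if len(ranked) >= 2:
        gap = (ranked[0].scores["Fitness for Purpose"]
               - ranked[1].scores["Fitness for Purpose"])
        if gap >= 2:
            return ranked[0]

    # Otherwise the highest total wins.
    return max(survivors, key=lambda i: i.total)
```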
#### Fitness Gate Interpretation
The Fitness Gate triggers the same way in both contexts, but means different things:
| Context | What Fitness Δ ≥ 2 Means |
|---|---|
| Cookoff | One implementation deviated from or misunderstood the design. All impls should have similar Fitness since they're implementing the same spec. A large gap is a red flag. |
| Omakase | One approach genuinely solves the problem better. Different approaches can legitimately have different Fitness. A large gap means one approach is clearly superior. |
In both cases, higher Fitness wins. The interpretation just explains why the gap exists.
#### Feasibility Red Flags

Check these before scoring (the sketch after this list makes the backoff flag concrete):
- O(n²) or worse on unbounded data
- Unbounded memory growth
- Self-DDoS patterns (polling, no backoff)
- Missing pagination
- Blocking I/O in hot path
- No error recovery
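To make the self-DDoS flag concrete, a minimal illustration (all names invented for the example): the first function is the pattern to flag, the second is the shape a reviewer should expect instead.

```python
import random
import time

def poll_forever(fetch):
    """RED FLAG: tight polling loop with no delay, backoff, or bound."""
    while True:
        fetch()  # hammers the upstream as fast as the loop can spin

def fetch_with_backoff(fetch, max_attempts=5, base_delay=0.5):
    """Expected shape: capped retries with exponential backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return fetch()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # retry budget exhausted; surface the error
            delay = base_delay * 2 ** attempt
            time.sleep(delay + random.uniform(0, delay))  # add jitter
```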
### Process

1. Read all implementation code (it should already be in context).
2. Fill out the worksheet for EACH implementation; do not skip sections.
3. Check the hard gates.
4. Announce the winner with a rationale.
**CRITICAL:** Use integer scores only (1-5). Do not use half points like 4.5.

**CRITICAL:** Fill out every checkbox. Do not summarize or abbreviate the worksheet.