Everything-claude-code gan-style-harness
GAN-inspired Generator-Evaluator agent harness for building high-quality applications autonomously. Based on Anthropic's March 2026 harness design paper.
git clone https://github.com/affaan-m/everything-claude-code
T=$(mktemp -d) && git clone --depth=1 https://github.com/affaan-m/everything-claude-code "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/gan-style-harness" ~/.claude/skills/affaan-m-everything-claude-code-gan-style-harness && rm -rf "$T"
skills/gan-style-harness/SKILL.md
GAN-Style Harness Skill
Inspired by Anthropic's Harness Design for Long-Running Application Development (March 24, 2026)
A multi-agent harness that separates generation from evaluation, creating an adversarial feedback loop that drives quality far beyond what a single agent can achieve.
Core Insight
When asked to evaluate their own work, agents are pathological optimists — they praise mediocre output and talk themselves out of legitimate issues. But engineering a separate evaluator to be ruthlessly strict is far more tractable than teaching a generator to self-critique.
This is the same dynamic as GANs (Generative Adversarial Networks): the Generator produces, the Evaluator critiques, and that feedback drives the next iteration.
When to Use
- Building complete applications from a one-line prompt
- Frontend design tasks requiring high visual quality
- Full-stack projects that need working features, not just code
- Any task where "AI slop" aesthetics are unacceptable
- Projects where you want to invest $50-200 for production-quality output
When NOT to Use
- Quick single-file fixes (use standard claude -p)
- Tasks with tight budget constraints (<$10)
- Simple refactoring (use de-sloppify pattern instead)
- Tasks that are already well-specified with tests (use TDD workflow)
Architecture
        ┌─────────────┐
        │   PLANNER   │
        │  (Opus 4.6) │
        └──────┬──────┘
               │ Product Spec
               │ (features, sprints, design direction)
               ▼
  ┌────────────────────────┐
  │                        │
  │   GENERATOR-EVALUATOR  │
  │      FEEDBACK LOOP     │
  │                        │
  │  ┌──────────┐          │
  │  │GENERATOR │--build-->│──┐
  │  │(Opus 4.6)│          │  │
  │  └────▲─────┘          │  │
  │       │                │  │ live app
  │    feedback            │  │
  │       │                │  │
  │  ┌────┴─────┐          │  │
  │  │EVALUATOR │<-test----│──┘
  │  │(Opus 4.6)│          │
  │  │+Playwright│         │
  │  └──────────┘          │
  │                        │
  │   5-15 iterations      │
  └────────────────────────┘
The Three Agents
1. Planner Agent
Role: Product manager — expands a brief prompt into a full product specification.
Key behaviors:
- Takes a one-line prompt and produces a 16-feature, multi-sprint specification
- Defines user stories, technical requirements, and visual design direction
- Is deliberately ambitious — conservative planning leads to underwhelming results
- Produces evaluation criteria that the Evaluator will use later
Model: Opus 4.6 (needs deep reasoning for spec expansion)
2. Generator Agent
Role: Developer — implements features according to the spec.
Key behaviors:
- Works in structured sprints (or continuous mode with newer models)
- Negotiates a "sprint contract" with the Evaluator before writing code
- Uses full-stack tooling: React, FastAPI/Express, databases, CSS
- Manages git for version control between iterations
- Reads Evaluator feedback and incorporates it in next iteration
Model: Opus 4.6 (needs strong coding capability)
3. Evaluator Agent
Role: QA engineer — tests the live running application, not just code.
Key behaviors:
- Uses Playwright MCP to interact with the live application
- Clicks through features, fills forms, tests API endpoints
- Scores against four criteria (configurable):
- Design Quality — Does it feel like a coherent whole?
- Originality — Custom decisions vs. template/AI patterns?
- Craft — Typography, spacing, animations, micro-interactions?
- Functionality — Do all features actually work?
- Returns structured feedback with scores and specific issues
- Is engineered to be ruthlessly strict — never praises mediocre work
Model: Opus 4.6 (needs strong judgment + tool use)
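The "structured feedback with scores and specific issues" could be represented roughly as follows. This is a minimal sketch, not the harness's actual schema: the `Feedback` class, its field names, and the markdown layout are all assumptions made for illustration. Only the feedback-NNN.md file naming comes from this document.

```python
from dataclasses import dataclass, field


@dataclass
class Feedback:
    """One evaluator report. Field names are illustrative, not the harness's schema."""
    iteration: int
    scores: dict[str, int]                      # criterion name -> score on the 1-10 scale
    issues: list[str] = field(default_factory=list)

    def filename(self) -> str:
        # Zero-padded to three digits, matching the feedback-NNN.md naming.
        return f"feedback-{self.iteration:03d}.md"

    def to_markdown(self) -> str:
        # Render scores and issues as two short bullet lists.
        lines = [f"# Evaluator feedback, iteration {self.iteration}", "", "## Scores"]
        lines += [f"- {name}: {score}/10" for name, score in self.scores.items()]
        lines += ["", "## Issues"]
        lines += [f"- {issue}" for issue in self.issues] or ["- (none reported)"]
        return "\n".join(lines) + "\n"
```

Writing the report to a file, rather than passing it inline, lets the generator read it at the start of the next iteration.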
Evaluation Criteria
The default four criteria, each scored 1-10:
## Evaluation Rubric

### Design Quality (weight: 0.3)
- 1-3: Generic, template-like, "AI slop" aesthetics
- 4-6: Competent but unremarkable, follows conventions
- 7-8: Distinctive, cohesive visual identity
- 9-10: Could pass for a professional designer's work

### Originality (weight: 0.2)
- 1-3: Default colors, stock layouts, no personality
- 4-6: Some custom choices, mostly standard patterns
- 7-8: Clear creative vision, unique approach
- 9-10: Surprising, delightful, genuinely novel

### Craft (weight: 0.3)
- 1-3: Broken layouts, missing states, no animations
- 4-6: Works but feels rough, inconsistent spacing
- 7-8: Polished, smooth transitions, responsive
- 9-10: Pixel-perfect, delightful micro-interactions

### Functionality (weight: 0.2)
- 1-3: Core features broken or missing
- 4-6: Happy path works, edge cases fail
- 7-8: All features work, good error handling
- 9-10: Bulletproof, handles every edge case
Scoring
- Weighted score = sum of (criterion_score * weight)
- Pass threshold = 7.0 (configurable)
- Max iterations = 15 (configurable, typically 5-15 sufficient)
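As a worked example of the scoring rule, the sketch below computes the weighted score from the four default weights in the rubric and compares it with the default pass threshold of 7.0. The dictionary keys and function names are illustrative assumptions. Rounding to two decimals is also an added choice, made so that floating-point noise does not flip a borderline result.

```python
# Weights copied from the rubric; the key names are illustrative.
DEFAULT_WEIGHTS = {
    "design_quality": 0.3,
    "originality": 0.2,
    "craft": 0.3,
    "functionality": 0.2,
}


def weighted_score(scores: dict[str, float],
                   weights: dict[str, float] = DEFAULT_WEIGHTS) -> float:
    """Return sum(criterion_score * weight), rounded to two decimals."""
    missing = set(weights) - set(scores)
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    return round(sum(scores[name] * w for name, w in weights.items()), 2)


def passes(scores: dict[str, float], threshold: float = 7.0) -> bool:
    """True when the weighted score meets or exceeds the pass threshold."""
    return weighted_score(scores) >= threshold


example = {"design_quality": 7, "originality": 6, "craft": 8, "functionality": 9}
print(weighted_score(example))  # 0.3*7 + 0.2*6 + 0.3*8 + 0.2*9 = 7.5
print(passes(example))          # True at the default threshold of 7.0
```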
Usage
Via Command
# Full three-agent harness
/project:gan-build "Build a project management app with Kanban boards, team collaboration, and dark mode"

# With custom config
/project:gan-build "Build a recipe sharing platform" --max-iterations 10 --pass-threshold 7.5

# Frontend design mode (generator + evaluator only, no planner)
/project:gan-design "Create a landing page for a crypto portfolio tracker"
Via Shell Script
# Basic usage
./scripts/gan-harness.sh "Build a music streaming dashboard"

# With options
GAN_MAX_ITERATIONS=10 \
GAN_PASS_THRESHOLD=7.5 \
GAN_EVAL_CRITERIA="functionality,performance,security" \
./scripts/gan-harness.sh "Build a REST API for task management"
Via Claude Code (Manual)
# Step 1: Plan
claude -p --model opus "You are a Product Planner. Read PLANNER_PROMPT.md. Expand this brief into a full product spec: 'Build a Kanban board app'. Write spec to spec.md"

# Step 2: Generate (iteration 1)
claude -p --model opus "You are a Generator. Read spec.md. Implement Sprint 1. Start the dev server on port 3000."

# Step 3: Evaluate (iteration 1)
claude -p --model opus --allowedTools "Read,Bash,mcp__playwright__*" "You are an Evaluator. Read EVALUATOR_PROMPT.md. Test the live app at http://localhost:3000. Score against the rubric. Write feedback to feedback-001.md"

# Step 4: Generate (iteration 2, reads feedback)
claude -p --model opus "You are a Generator. Read spec.md and feedback-001.md. Address all issues. Improve the scores."

# Repeat steps 3-4 until pass threshold met
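The "repeat steps 3-4" instruction can be automated with a small driver. The sketch below is an illustration only and assumes a spec.md already exists from Step 1. The `claude` flags and prompts are copied from the manual commands, while every Python name is hypothetical. In particular, `read_score` is a caller-supplied callback that extracts the weighted score from a feedback file, since the feedback format is not specified here.

```python
import subprocess
from typing import Callable, Optional, Sequence


def feedback_file(iteration: int) -> str:
    return f"feedback-{iteration:03d}.md"


def generator_cmd(iteration: int) -> list[str]:
    """Build the generator invocation; later iterations also read prior feedback."""
    if iteration == 1:
        prompt = ("You are a Generator. Read spec.md. Implement Sprint 1. "
                  "Start the dev server on port 3000.")
    else:
        prompt = (f"You are a Generator. Read spec.md and {feedback_file(iteration - 1)}. "
                  "Address all issues. Improve the scores.")
    return ["claude", "-p", "--model", "opus", prompt]


def evaluator_cmd(iteration: int) -> list[str]:
    """Build the evaluator invocation, which writes feedback-NNN.md."""
    prompt = ("You are an Evaluator. Read EVALUATOR_PROMPT.md. "
              "Test the live app at http://localhost:3000. Score against the rubric. "
              f"Write feedback to {feedback_file(iteration)}")
    return ["claude", "-p", "--model", "opus",
            "--allowedTools", "Read,Bash,mcp__playwright__*", prompt]


def run_loop(read_score: Callable[[str], float],
             max_iterations: int = 15,
             pass_threshold: float = 7.0,
             run: Optional[Callable[[Sequence[str]], object]] = None) -> tuple[int, bool]:
    """Alternate generator and evaluator; return (iterations used, passed?)."""
    if run is None:
        run = lambda cmd: subprocess.run(cmd, check=True)
    for iteration in range(1, max_iterations + 1):
        run(generator_cmd(iteration))
        run(evaluator_cmd(iteration))
        if read_score(feedback_file(iteration)) >= pass_threshold:
            return iteration, True
    return max_iterations, False
```

Injecting `run` keeps the loop testable without invoking the CLI; leaving it unset shells out through `subprocess.run`.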
Evolution Across Model Capabilities
The harness should simplify as models improve. Following Anthropic's evolution:
Stage 1 — Weaker Models (Sonnet-class)
- Full sprint decomposition required
- Context resets between sprints (avoid context anxiety)
- 2-agent minimum: Initializer + Coding Agent
- Heavy scaffolding compensates for model limitations
Stage 2 — Capable Models (Opus 4.5-class)
- Full 3-agent harness: Planner + Generator + Evaluator
- Sprint contracts before each implementation phase
- 10-sprint decomposition for complex apps
- Context resets still useful but less critical
Stage 3 — Frontier Models (Opus 4.6-class)
- Simplified harness: single planning pass, continuous generation
- Evaluation reduced to single end-pass (model is smarter)
- No sprint structure needed
- Automatic compaction handles context growth
Key principle: Every harness component encodes an assumption about what the model can't do alone. When models improve, re-test those assumptions. Strip away what's no longer needed.
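One way to make that principle concrete is to list each stage's scaffolding as a set and diff the sets. The sketch below is a deliberate simplification of the three stages described above. The component labels are illustrative, and treating automatic compaction as a replacement for context resets is an interpretation of the Stage 3 bullets.

```python
# Scaffolding per stage, paraphrased from the stage descriptions; labels are illustrative.
STAGES = {
    "stage1": {"initializer", "coding_agent", "sprint_decomposition", "context_resets"},
    "stage2": {"planner", "generator", "evaluator", "sprint_contracts",
               "sprint_decomposition", "context_resets"},
    "stage3": {"planner", "generator", "evaluator", "automatic_compaction"},
}


def stripped(old: str, new: str) -> list[str]:
    """Components present in the old stage but no longer needed in the new one."""
    return sorted(STAGES[old] - STAGES[new])


print(stripped("stage2", "stage3"))
# ['context_resets', 'sprint_contracts', 'sprint_decomposition']
```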
Configuration
Environment Variables
| Variable | Default | Description |
|---|---|---|
| GAN_MAX_ITERATIONS | 15 | Maximum generator-evaluator cycles |
| GAN_PASS_THRESHOLD | 7.0 | Weighted score to pass (1-10) |
| | | Model for planning agent |
| | | Model for generator agent |
| | | Model for evaluator agent |
| GAN_EVAL_CRITERIA | | Comma-separated criteria |
| | | Port for the live app |
| | | Command to start dev server |
| | | Project working directory |
| | | Skip planner, use spec directly |
| | | Evaluation mode (one of the three listed under Evaluation Modes) |
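A harness script might read these settings along the following lines. This is a sketch under stated assumptions: only GAN_MAX_ITERATIONS, GAN_PASS_THRESHOLD, and GAN_EVAL_CRITERIA are named in this document, the defaults of 15 and 7.0 are taken from the Scoring section, and the fallback criteria string is hypothetical.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class HarnessConfig:
    max_iterations: int
    pass_threshold: float
    eval_criteria: tuple[str, ...]


# Hypothetical fallback; the real default string is not shown in this document.
_DEFAULT_CRITERIA = "design_quality,originality,craft,functionality"


def load_config(env: Optional[Mapping[str, str]] = None) -> HarnessConfig:
    """Read harness settings from the environment, with basic validation."""
    env = os.environ if env is None else env
    criteria = env.get("GAN_EVAL_CRITERIA", _DEFAULT_CRITERIA)
    config = HarnessConfig(
        max_iterations=int(env.get("GAN_MAX_ITERATIONS", "15")),
        pass_threshold=float(env.get("GAN_PASS_THRESHOLD", "7.0")),
        eval_criteria=tuple(c.strip() for c in criteria.split(",") if c.strip()),
    )
    if config.max_iterations < 1:
        raise ValueError("GAN_MAX_ITERATIONS must be at least 1")
    if not 1.0 <= config.pass_threshold <= 10.0:
        raise ValueError("GAN_PASS_THRESHOLD must be between 1 and 10")
    return config
```

Passing a plain dictionary instead of `os.environ` makes the parsing easy to check in isolation.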
Evaluation Modes
| Mode | Tools | Best For |
|---|---|---|
| | Browser MCP + live interaction | Full-stack apps with UI |
| | Screenshot + visual analysis | Static sites, design-only |
| | Tests + linting + build | APIs, libraries, CLI tools |
Anti-Patterns
- Evaluator too lenient: If the evaluator passes everything on iteration 1, your rubric is too generous. Tighten scoring criteria and add explicit penalties for common AI patterns.
- Generator ignoring feedback: Ensure feedback is passed as a file, not inline. The generator should read feedback-NNN.md at the start of each iteration.
- Infinite loops: Always set GAN_MAX_ITERATIONS. If the generator can't improve past a score plateau after 3 iterations, stop and flag for human review.
- Evaluator testing superficially: The evaluator must use Playwright to interact with the live app, not just screenshot it. Click buttons, fill forms, test error states.
- Evaluator praising its own fixes: Never let the evaluator suggest fixes and then evaluate those fixes. The evaluator only critiques; the generator fixes.
- Context exhaustion: For long sessions, use Claude Agent SDK's automatic compaction or reset context between major phases.
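The plateau rule from the infinite-loops item (stop after 3 iterations without improvement) can be checked with a few lines. This sketch assumes the weighted scores are collected in iteration order. The `min_gain` tolerance is an added assumption, since the text does not say how small an improvement still counts as a plateau.

```python
def has_plateaued(scores: list[float], patience: int = 3, min_gain: float = 0.1) -> bool:
    """True when none of the last `patience` scores beats the earlier best by `min_gain`."""
    if len(scores) <= patience:
        return False  # not enough history to judge yet
    best_before = max(scores[:-patience])
    return max(scores[-patience:]) < best_before + min_gain


print(has_plateaued([5.0, 6.2, 6.2, 6.1, 6.25]))  # True: three iterations stuck near 6.2
print(has_plateaued([5.0, 6.2, 6.2, 6.1, 6.5]))   # False: the latest iteration improved
```

When the check returns True, the loop should stop and hand the latest feedback file to a human reviewer rather than spend the remaining iterations.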
Results: What to Expect
Based on Anthropic's published results:
| Metric | Solo Agent | GAN Harness | Improvement |
|---|---|---|---|
| Time | 20 min | 4-6 hours | 12-18x longer |
| Cost | $9 | $125-200 | 14-22x more |
| Quality | Barely functional | Production-ready | Phase change |
| Core features | Broken | All working | N/A |
| Design | Generic AI slop | Distinctive, polished | N/A |
The tradeoff is clear: roughly 12-22x more time and cost for a qualitative leap in output quality. This is for projects where quality matters.
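The multipliers in the table follow directly from the time and cost figures; the short calculation below reproduces them (the variable names are illustrative).

```python
solo_minutes, solo_cost = 20, 9      # solo agent: 20 min, $9
harness_hours = (4, 6)               # GAN harness: 4-6 hours
harness_cost = (125, 200)            # GAN harness: $125-200

time_ratio = tuple(h * 60 / solo_minutes for h in harness_hours)
cost_ratio = tuple(c / solo_cost for c in harness_cost)

print(time_ratio)                           # (12.0, 18.0)
print(tuple(round(r) for r in cost_ratio))  # (14, 22)
```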
References
- Anthropic: Harness Design for Long-Running Apps — Original paper by Prithvi Rajasekaran
- Epsilla: The GAN-Style Agent Loop — Architecture deconstruction
- Martin Fowler: Harness Engineering — Broader industry context
- OpenAI: Harness Engineering — OpenAI's parallel work