Claude-skill-registry battle

Install

Clone the upstream repo:
git clone https://github.com/majiayu000/claude-skill-registry

Install into ~/.claude/skills/ for Claude Code:
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/battle" ~/.claude/skills/majiayu000-claude-skill-registry-battle && rm -rf "$T"

Manifest: skills/data/battle/SKILL.md

Battle Skill

Red vs Blue Team Security Competition Orchestrator

Pits a Red Team (attack) against a Blue Team (defense) in a long-running competitive loop. Each team leverages all skills in .pi/skills to attack or defend a target codebase.

Architecture

Based on research into the RvB framework, DARPA AIxCC, and Microsoft PyRIT:

┌─────────────────────────────────────────────────────────┐
│                   Battle Orchestrator                   │
│  - Game loop (RvB pattern)                              │
│  - Concurrent Red/Blue execution                        │
│  - Entropy-driven termination                           │
│  - Checkpointing for overnight runs                     │
└─────────────────────────────────────────────────────────┘
         │                              │
    ┌────┴─────┐                   ┌────┴─────┐
    │ Red Team │                   │ Blue Team│
    │ (Thread) │                   │ (Thread) │
    ├──────────┤                   ├──────────┤
    │ Skills:  │                   │ Skills:  │
    │ - hack   │                   │ - anvil  │
    │ - memory │                   │ - memory │
    └──────────┘                   └──────────┘
         │                              │
         └──────────┬───────────────────┘
                    │
    ┌───────────────┴────────────────────┐
    │           Digital Twin             │
    │  ┌─────────────────────────────┐   │
    │  │ Mode: git_worktree          │   │
    │  │   - Red attacks arena       │   │
    │  │   - Blue patches workspace  │   │
    │  │   - Cherry-pick to test     │   │
    │  ├─────────────────────────────┤   │
    │  │ Mode: docker                │   │
    │  │   - Isolated containers     │   │
    │  │   - Battle network          │   │
    │  ├─────────────────────────────┤   │
    │  │ Mode: qemu                  │   │
    │  │   - Emulated firmware       │   │
    │  │   - GDB attach points       │   │
    │  └─────────────────────────────┘   │
    └────────────────────────────────────┘
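The orchestrator runs Red and Blue as concurrent threads within each round. The sketch below shows that pattern. It assumes each team's turn can be modeled as a plain Python callable that takes the round number and returns a list of findings or patches. `run_round` and the callable interface are hypothetical, not the skill's actual code.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

# Hypothetical team interface: takes the round number, returns finding/patch ids.
TeamTurn = Callable[[int], List[str]]


def run_round(round_no: int, red: TeamTurn, blue: TeamTurn) -> Dict[str, List[str]]:
    """Run both teams' turns for one round concurrently and collect the results."""
    with ThreadPoolExecutor(max_workers=2, thread_name_prefix="battle") as pool:
        red_future = pool.submit(red, round_no)
        blue_future = pool.submit(blue, round_no)
        # result() blocks until each thread finishes and re-raises any exception
        return {"red": red_future.result(), "blue": blue_future.result()}


if __name__ == "__main__":
    def red_turn(k: int) -> List[str]:
        return [f"finding-{k}"]

    def blue_turn(k: int) -> List[str]:
        return [f"patch-{k}"]

    print(run_round(1, red_turn, blue_turn))
```

The sketch submits both turns to a thread pool and calls `result()` on each future. That way, an exception raised inside either team's thread reaches the orchestrator instead of being silently lost.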

Digital Twin Modes

The battle skill supports multiple target types through its Digital Twin system:

1. Source Code (git_worktree)

For battling over git repositories. Creates isolated git worktrees for each team.

./run.sh battle /path/to/repo --rounds 100
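In git_worktree mode, the architecture diagram has Red attacking an "arena" checkout while Blue patches a separate "workspace". The sketch below creates such worktrees with the standard `git worktree add` command. The `root` directory and the arena/workspace directory names are assumptions for illustration.

```python
import subprocess
from pathlib import Path
from typing import List


def worktree_commands(repo: str, root: str, commit: str = "HEAD") -> List[List[str]]:
    """Build `git worktree add` commands for the two team checkouts.

    `root` and the arena/workspace names are hypothetical conventions.
    """
    cmds = []
    for name in ("arena", "workspace"):  # Red attacks arena, Blue patches workspace
        path = str(Path(root) / name)
        # --detach: check out `commit` without creating a branch per team
        cmds.append(["git", "-C", repo, "worktree", "add", "--detach", path, commit])
    return cmds


def create_worktrees(repo: str, root: str, commit: str = "HEAD") -> None:
    """Run the commands; raises CalledProcessError if git fails."""
    for cmd in worktree_commands(repo, root, commit):
        subprocess.run(cmd, check=True)
```

All worktrees of a repository share one object database. A commit Blue makes in the workspace can therefore be applied to the arena with `git -C <root>/arena cherry-pick <sha>`, which corresponds to the "Cherry-pick to test" step in the diagram.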

2. Docker Container (docker)

For battling over containerized applications. Spins up separate containers for each team.

# Using a Docker image
./run.sh battle --docker-image nginx:latest --rounds 100

# Using a Dockerfile in the target directory
./run.sh battle /path/with/Dockerfile --mode docker
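For docker mode, the sketch below builds the underlying Docker CLI calls: one battle network plus one container per team. The `<battle_id>-net` and `<battle_id>-<team>` naming is an assumption, not confirmed behavior of the skill. So is the use of `--internal`, which cuts the network off from external connectivity.

```python
import subprocess
from typing import List


def docker_commands(image: str, battle_id: str) -> List[List[str]]:
    """Build Docker CLI calls for one battle network and one container per team.

    The naming scheme and --internal are assumptions for this sketch.
    """
    network = f"{battle_id}-net"
    # --internal: containers can reach each other but not the outside world
    cmds = [["docker", "network", "create", "--internal", network]]
    for team in ("red", "blue"):
        cmds.append(["docker", "run", "-d", "--name", f"{battle_id}-{team}",
                     "--network", network, image])
    return cmds


def start_battle_containers(image: str, battle_id: str) -> None:
    """Run the commands; raises CalledProcessError if docker fails."""
    for cmd in docker_commands(image, battle_id):
        subprocess.run(cmd, check=True)
```

For the Dockerfile variant, you would first build the image with `docker build -t <tag> <dir>` and then pass that tag as `image`.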

3. Firmware/Microprocessor (qemu)

For battling over firmware and embedded systems. Boots the firmware in the QEMU emulator.

# Auto-detect architecture from ELF header
./run.sh battle firmware.elf --rounds 100

# Specify machine type explicitly
./run.sh battle firmware.bin --qemu-machine arm
./run.sh battle firmware.bin --qemu-machine riscv64
./run.sh battle bios.rom --qemu-machine x86_64

Supported QEMU machines:

  • arm: ARM Cortex-M (STM32, etc.)
  • aarch64: ARM64
  • riscv32 / riscv64: RISC-V
  • x86_64 / i386: x86
  • mips: MIPS (routers, embedded)
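Auto-detection works from three fields in the ELF header:

  • Byte 4 (`EI_CLASS`) distinguishes 32-bit from 64-bit.
  • Byte 5 (`EI_DATA`) gives the byte order.
  • The 16-bit `e_machine` field at offset 18 identifies the CPU architecture.

The sketch below maps those fields onto the machine names listed above. `detect_qemu_machine` is a hypothetical helper. Raw images such as `firmware.bin` carry no header, which is why they need `--qemu-machine`.

```python
import struct
from typing import Optional

# e_machine values from the ELF specification (<elf.h>)
EM_386, EM_MIPS, EM_ARM, EM_X86_64, EM_AARCH64, EM_RISCV = 3, 8, 40, 62, 183, 243


def detect_qemu_machine(header: bytes) -> Optional[str]:
    """Map the first 20 bytes of an ELF file to a --qemu-machine name.

    Returns None for non-ELF input (e.g. a raw .bin) or unknown architectures.
    """
    if len(header) < 20 or header[:4] != b"\x7fELF":
        return None
    ei_class, ei_data = header[4], header[5]  # 1/2 = 32/64-bit; 1/2 = little/big endian
    if ei_data not in (1, 2):
        return None
    (e_machine,) = struct.unpack_from("<H" if ei_data == 1 else ">H", header, 18)
    if e_machine == EM_RISCV:
        return "riscv64" if ei_class == 2 else "riscv32"
    return {
        EM_ARM: "arm",
        EM_AARCH64: "aarch64",
        EM_X86_64: "x86_64",
        EM_386: "i386",
        EM_MIPS: "mips",
    }.get(e_machine)


def detect_from_file(path: str) -> Optional[str]:
    with open(path, "rb") as f:
        return detect_qemu_machine(f.read(20))
```

`e_machine` only identifies the instruction set. An ARM ELF therefore maps to `arm` whether it targets Cortex-M or Cortex-A, and choosing a specific board model would need extra information.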

4. Copy Mode (fallback)

For non-git directories. Creates simple file copies for each team.
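Copy mode needs nothing beyond one recursive copy per team. The sketch below uses `shutil.copytree`. The `<battle_root>/red` and `<battle_root>/blue` layout is a hypothetical convention.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Dict


def make_team_copies(target: str, battle_root: str) -> Dict[str, Path]:
    """Copy-mode fallback: give each team its own full copy of `target`.

    The <battle_root>/red and <battle_root>/blue layout is hypothetical.
    """
    copies: Dict[str, Path] = {}
    for team in ("red", "blue"):
        dest = Path(battle_root) / team
        # copytree raises FileExistsError if dest exists, so stale copies are never reused
        shutil.copytree(target, dest)
        copies[team] = dest
    return copies


if __name__ == "__main__":
    src = Path(tempfile.mkdtemp())
    (src / "app.py").write_text("print('hi')\n")
    print(make_team_copies(str(src), tempfile.mkdtemp()))
```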

Commands

# Start a battle (10 rounds for testing)
./run.sh battle /path/to/codebase --rounds 10

# Start overnight battle (1000 rounds)
./run.sh battle /path/to/codebase --overnight

# Battle a Docker container
./run.sh battle --docker-image myapp:latest --rounds 100

# Battle firmware with QEMU
./run.sh battle firmware.bin --qemu-machine arm --rounds 100

# Check battle status
./run.sh status

# Resume interrupted battle
./run.sh resume <battle-id>

# Generate report from completed battle
./run.sh report <battle-id>
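`resume` depends on the checkpoint saved at the end of every round. The sketch below shows crash-safe checkpointing. It assumes the state is JSON-serializable and stored as `<battle_id>.json`; both are assumptions, and per the game loop the real checkpoint also covers QEMU state and team memories. The sketch writes to a temporary file and then calls `os.replace`, so an interruption mid-write leaves the previous checkpoint intact.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Any, Dict


def save_checkpoint(ckpt_dir: str, battle_id: str, state: Dict[str, Any]) -> Path:
    """Write the battle state so a crash never leaves a half-written checkpoint."""
    path = Path(ckpt_dir) / f"{battle_id}.json"
    fd, tmp = tempfile.mkstemp(dir=ckpt_dir, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())  # make sure the bytes are on disk before the rename
        os.replace(tmp, path)  # atomic rename on POSIX (same filesystem)
    except BaseException:
        os.unlink(tmp)
        raise
    return path


def load_checkpoint(ckpt_dir: str, battle_id: str) -> Dict[str, Any]:
    """Load the last saved state, e.g. for `./run.sh resume <battle-id>`."""
    with open(Path(ckpt_dir) / f"{battle_id}.json") as f:
        return json.load(f)
```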

Scoring System (AIxCC-style)

| Metric | Weight | Description |
|---|---|---|
| Vulnerability Discovery | 1x | Red team finds a vulnerability |
| Exploit Proof | +0.5x | Red team proves exploitability |
| Successful Patch | 3x | Blue team patches a vulnerability |
| Time Decay | Variable | Faster responses score higher |
| Functionality Preserved | Required | Patches must not break code |
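The sketch below shows how the weights in the table might combine per vulnerability. Several details are assumptions:

  • The base unit is 1 point.
  • Time decay uses an exponential half-life (the table only says "Variable").
  • Decay applies to Blue's patches alone.

"Functionality Preserved" is modeled as a gate that zeroes a patch rather than as a weight.

```python
from dataclasses import dataclass
from typing import Optional

DISCOVERY_WEIGHT = 1.0  # Red finds a vulnerability
EXPLOIT_BONUS = 0.5     # extra for a proven exploit
PATCH_WEIGHT = 3.0      # Blue patches it


@dataclass
class Finding:
    proven_exploit: bool = False  # Red proved exploitability
    patched: bool = False         # Blue shipped a patch
    tests_pass: bool = False      # functionality preserved after the patch
    patch_delay_s: float = 0.0    # seconds from discovery to patch


def decay(delay_s: float, half_life_s: Optional[float]) -> float:
    """Hypothetical time decay: value halves every half_life_s seconds."""
    if half_life_s is None:
        return 1.0
    return 0.5 ** (delay_s / half_life_s)


def red_points(f: Finding) -> float:
    return DISCOVERY_WEIGHT + (EXPLOIT_BONUS if f.proven_exploit else 0.0)


def blue_points(f: Finding, half_life_s: Optional[float] = None) -> float:
    # Functionality Preserved is a gate: a patch that breaks the code earns nothing
    if not (f.patched and f.tests_pass):
        return 0.0
    return PATCH_WEIGHT * decay(f.patch_delay_s, half_life_s)
```

With decay disabled, three discoveries and three working patches give Red 3 and Blue 9. That is consistent with the Round 1 line of the example battle at the end of this page.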

Scores

  • TDSR (True Defense Success Rate): Vulnerabilities fixed AND code works
  • FDSR (Fake Defense Success Rate): Attack blocked but code broken
  • ASC (Attack Success Count): Total unique exploits discovered
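The descriptions above do not specify denominators. The sketch below assumes TDSR and FDSR are fractions of all defense attempts, and that ASC deduplicates exploits by a key such as (CWE, location). The class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass
class DefenseAttempt:
    attack_blocked: bool  # the exploit no longer works after the patch
    tests_pass: bool      # regression tests still pass


def defense_rates(attempts: List[DefenseAttempt]) -> Dict[str, float]:
    """TDSR/FDSR as fractions of all defense attempts (denominator is an assumption)."""
    n = len(attempts)
    if n == 0:
        return {"TDSR": 0.0, "FDSR": 0.0}
    true_def = sum(a.attack_blocked and a.tests_pass for a in attempts)
    fake_def = sum(a.attack_blocked and not a.tests_pass for a in attempts)
    return {"TDSR": true_def / n, "FDSR": fake_def / n}


def attack_success_count(exploits: Iterable[Tuple[str, str]]) -> int:
    """ASC: unique exploits, keyed here by a hypothetical (CWE id, file:line) pair."""
    return len(set(exploits))
```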

Game Loop (Learning-Based)

Each round follows a learn → act → reflect pattern:

Round k:

┌─────────────────────────────────────────────────────────────┐
│                    1. RESEARCH PHASE                         │
├─────────────────────────────────────────────────────────────┤
│ Red Team:                      Blue Team:                    │
│ - Recall past attack attempts  - Recall past defenses        │
│ - Query /dogpile for new       - Query /dogpile for          │
│   exploitation techniques        hardening strategies        │
│ - Review opponent's patterns   - Analyze attack evolution    │
│ (Budget: 3 research calls max)                               │
└─────────────────────────────────────────────────────────────┘
                              ↓
┌─────────────────────────────────────────────────────────────┐
│                    2. ACTION PHASE                           │
├─────────────────────────────────────────────────────────────┤
│ Red Team Attack:               Blue Team Defense:            │
│ - Execute learned strategy     - Apply patches via anvil     │
│ - AFL++ fuzzing with coverage  - Verify via QCOW2 overlay    │
│ - Collect crashes/findings     - Run regression tests        │
│ - Tag findings with /taxonomy  - Tag patches with /taxonomy  │
└─────────────────────────────────────────────────────────────┘
                              ↓
┌─────────────────────────────────────────────────────────────┐
│                   3. REFLECTION PHASE                        │
├─────────────────────────────────────────────────────────────┤
│ Both Teams:                                                  │
│ - Archive round episode (actions, outcomes, learnings)       │
│ - Store successful strategies in /memory                     │
│ - Update belief about opponent's capabilities                │
│ - Evolve strategy for next round                             │
└─────────────────────────────────────────────────────────────┘
                              ↓
┌─────────────────────────────────────────────────────────────┐
│                   4. SCORING & CHECKPOINT                    │
├─────────────────────────────────────────────────────────────┤
│ - Calculate AIxCC-style scores                               │
│ - Check termination conditions                               │
│ - Save checkpoint (QEMU state + team memories)               │
└─────────────────────────────────────────────────────────────┘
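The sketch below follows one team through phases 1 to 3. It assumes the research, action, and reflection steps are injected as callables, standing in for the /dogpile, /memory, hack, or anvil calls. Only the 3-call research budget comes from the diagram; the rest is illustrative.

```python
from typing import Callable, List

RESEARCH_BUDGET = 3  # "Budget: 3 research calls max"


def run_team_round(
    queries: List[str],
    research: Callable[[str], str],
    act: Callable[[List[str]], List[str]],
    reflect: Callable[[List[str]], None],
    budget: int = RESEARCH_BUDGET,
) -> List[str]:
    """One team's learn -> act -> reflect cycle; all three hooks are hypothetical."""
    # 1. Research: at most `budget` queries are issued, extras are dropped
    notes = [research(q) for q in queries[:budget]]
    # 2. Action: attack or defend using what was learned
    outcomes = act(notes)
    # 3. Reflection: archive the episode / store strategies for the next round
    reflect(outcomes)
    return outcomes
```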

Memory Architecture

Each team maintains isolated knowledge:

battle_red_<battle_id>/           battle_blue_<battle_id>/
├── strategies/                   ├── strategies/
│   ├── successful_attacks        │   ├── successful_patches
│   └── failed_attempts           │   └── broken_defenses
├── research/                     ├── research/
│   └── dogpile_results           │   └── dogpile_results
├── episodes/                     ├── episodes/
│   ├── round_001.json            │   ├── round_001.json
│   └── round_002.json            │   └── round_002.json
└── taxonomy/                     └── taxonomy/
    ├── cwe_classifications       ├── mitigation_types
    └── severity_scores           └── effectiveness_scores

Neither team can access the other's memory, so each must infer its opponent's strategy from observed behavior. This is what makes the learning genuinely adversarial.
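One way to enforce this isolation is at the storage layer. Every key a team reads or writes is resolved under its own `battle_<team>_<battle_id>` root, and keys that try to escape that root are rejected. The sketch below uses hypothetical names and does not reflect the memory skill's actual API.

```python
from pathlib import PurePosixPath


class MemoryAccessError(PermissionError):
    pass


def namespace(team: str, battle_id: str) -> str:
    """Per-team memory root, mirroring battle_red_<battle_id>/ in the tree above."""
    if team not in ("red", "blue"):
        raise ValueError(f"unknown team: {team}")
    return f"battle_{team}_{battle_id}"


def resolve(team: str, battle_id: str, key: str) -> PurePosixPath:
    """Resolve a memory key for `team`, refusing anything outside its own root."""
    # Absolute keys would replace the root; '..' segments could climb into the opponent's tree
    if PurePosixPath(key).is_absolute() or ".." in PurePosixPath(key).parts:
        raise MemoryAccessError(f"{team} may not access {key!r}")
    return PurePosixPath(namespace(team, battle_id)) / key
```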

Termination Conditions

Battle ends when ANY condition is met:

  1. Null Production: Both teams fail to generate new findings for 3 rounds
  2. Maximum Rounds: Configured limit reached
  3. Metric Convergence: Scores stable for 5 consecutive rounds
  4. Kill Switch: Manual termination via ./run.sh stop
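The four conditions can be checked after every round. The sketch below assumes per-round counts of new findings, cumulative (red, blue) scores, and a flag set by `./run.sh stop`. Treating "stable" as "unchanged within a tolerance `tol`" is also an assumption.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

NULL_ROUNDS = 3         # rounds with no new findings from either team
CONVERGENCE_ROUNDS = 5  # consecutive rounds with stable scores


@dataclass
class BattleState:
    max_rounds: int
    new_findings: List[int] = field(default_factory=list)            # per round, both teams
    scores: List[Tuple[float, float]] = field(default_factory=list)  # (red, blue) after each round
    stop_requested: bool = False                                     # set by ./run.sh stop


def termination_reason(s: BattleState, tol: float = 0.0) -> Optional[str]:
    """Return why the battle should end, or None to keep going."""
    if s.stop_requested:
        return "kill_switch"
    if len(s.scores) >= s.max_rounds:
        return "max_rounds"
    recent = s.new_findings[-NULL_ROUNDS:]
    if len(recent) == NULL_ROUNDS and all(n == 0 for n in recent):
        return "null_production"
    window = s.scores[-CONVERGENCE_ROUNDS:]
    if len(window) == CONVERGENCE_ROUNDS and all(
        abs(r - window[0][0]) <= tol and abs(b - window[0][1]) <= tol for r, b in window
    ):
        return "metric_convergence"
    return None
```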

Task Monitor Integration

Battles register with task-monitor for overnight progress tracking:

# View battle progress in TUI
.pi/skills/task-monitor/run.sh tui --filter battle

Report Output

When a battle completes, the skill generates a report containing:

  • Executive Summary: Winner, key metrics, risk score
  • Vulnerability Report: By severity, category, remediation status
  • Attack Evolution: How Red team adapted over rounds
  • Defense Timeline: Blue team improvements over time
  • Recommendations: Prioritized security improvements

Leveraged Skills

| Skill | Team | Purpose |
|---|---|---|
| hack | Red | Scanning, auditing, exploitation |
| anvil | Blue | Multi-agent patching (Thunderdome) |
| memory | Both | Recall prior strategies |
| treesitter | Blue | Code structure analysis |
| taxonomy | Both | Classify findings |
| task-monitor | Orchestrator | Progress tracking |
| ops-docker | Both | Container management |

Example Battle

# Start 100-round battle on current project
./run.sh battle --target . --rounds 100

# Output:
# Battle ID: battle_20250128_221500
# Target: /home/user/project
# Rounds: 100
#
# Registering with task-monitor...
# Starting Round 1/100...
# [Red] Scanning target with hack...
# [Red] Found 3 potential vulnerabilities
# [Blue] Analyzing attack logs...
# [Blue] Generating patch for SQL injection...
# [Blue] Patch applied, running verification...
# Round 1 complete. Red: 3 pts, Blue: 9 pts
# ...
#
# Battle Complete!
# Winner: Blue Team (847 pts vs 423 pts)
# Report: ./reports/battle_20250128_221500.md