Claude-skill-registry judge-llm

Ultrathink LLM-as-Judge validation of completed work. Uses extended thinking by DEFAULT for thorough evaluation.

install
source · Clone the upstream repo
git clone https://github.com/majiayu000/claude-skill-registry
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/judge-llm" ~/.claude/skills/majiayu000-claude-skill-registry-judge-llm && rm -rf "$T"
manifest: skills/data/judge-llm/SKILL.md
source content

/sw:judge-llm - Ultrathink LLM-as-Judge Validation

ULTRATHINK BY DEFAULT - Validate completed work using extended thinking and the LLM-as-Judge pattern.

Implementation: Opus Model + Timeout Handling

Model: opus, for deepest analysis
Timeout: 60 seconds by default (configurable with --timeout)
Progress log: .specweave/logs/judge-llm.log

Implementation in src/core/skills/skill-judge.ts:

  • Uses Anthropic SDK with user's ANTHROPIC_API_KEY
  • AbortController-based timeout to prevent stuck states
  • Progress logging for visibility during evaluation
  • Fallback to basic pattern matching if no API key
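
The AbortController-based timeout listed above can be sketched roughly as follows. This is a minimal illustration, not the code in skill-judge.ts: `withTimeout` and `JudgeOutcome` are hypothetical names, and the sketch assumes the wrapped request honors the abort signal it is given.

```typescript
// Sketch only: `withTimeout` and `JudgeOutcome` are hypothetical names,
// not taken from skill-judge.ts.
type JudgeOutcome<T> =
  | { timedOut: false; value: T }
  | { timedOut: true };

async function withTimeout<T>(
  run: (signal: AbortSignal) => Promise<T>,
  timeoutMs: number = 60000, // mirrors the documented 60-second default
): Promise<JudgeOutcome<T>> {
  const controller = new AbortController();
  // Abort the in-flight work once the deadline passes.
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    const value = await run(controller.signal);
    return { timedOut: false, value };
  } catch (err) {
    // A failure after our own abort is reported as a timeout;
    // any other error is rethrown unchanged.
    if (controller.signal.aborted) return { timedOut: true };
    throw err;
  } finally {
    // Always clear the timer so a finished request cannot be aborted late.
    clearTimeout(timer);
  }
}
```

The `run` callback only stops early if it actually listens to the signal, so the underlying request would need to be given that signal for the timeout to prevent a stuck state.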

CRITICAL: Extended Thinking is DEFAULT

This command ALWAYS uses ultrathink (extended thinking) for thorough LLM-as-Judge evaluation:

DEFAULT BEHAVIOR = ULTRATHINK MODE
- Extended thinking enabled
- Deep chain-of-thought reasoning
- Thorough multi-dimensional analysis
- ~60-90 seconds for comprehensive evaluation
- Uses Opus model for maximum quality

Use --quick only if you explicitly need faster (but less thorough) validation.
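
As a rough illustration of how the default and --quick modes might differ at the request level, the sketch below builds the request parameters as a plain object. It assumes the skill enables extended thinking through the Messages API `thinking` parameter, which this document does not confirm. `buildJudgeRequest` is a hypothetical helper, the token budgets are illustrative, and the model identifier is left to the caller.

```typescript
// Sketch only: `buildJudgeRequest` is a hypothetical helper and the token
// budgets below are illustrative, not values taken from the skill.
interface JudgeRequest {
  model: string;
  max_tokens: number;
  thinking?: { type: "enabled"; budget_tokens: number };
  messages: { role: "user"; content: string }[];
}

function buildJudgeRequest(model: string, prompt: string, quick: boolean): JudgeRequest {
  const request: JudgeRequest = {
    model,
    max_tokens: quick ? 2048 : 16000,
    messages: [{ role: "user", content: prompt }],
  };
  if (!quick) {
    // Extended thinking needs a budget of at least 1024 tokens
    // that stays below max_tokens.
    request.thinking = { type: "enabled", budget_tokens: 8000 };
  }
  return request;
}
```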

Purpose

Use when you've completed work and want maximum-quality AI validation:

  • Works on any files (not just SpecWeave increments)
  • Uses ultrathink extended thinking for deepest analysis
  • Returns clear verdict with detailed reasoning

Usage

# DEFAULT: Ultrathink validation (recommended)
/sw:judge-llm src/file.ts
/sw:judge-llm "src/**/*.ts"

# Validate git changes (ultrathink by default)
/sw:judge-llm --staged           # Staged changes
/sw:judge-llm --last-commit      # Last commit
/sw:judge-llm --diff main        # Diff vs branch

# Quick mode (ONLY if you need speed over thoroughness)
/sw:judge-llm src/file.ts --quick

# Timeout control (value in milliseconds; default: 60000, i.e. 60s)
/sw:judge-llm src/file.ts --timeout 120000   # 120 seconds
/sw:judge-llm src/file.ts --timeout 30000    # 30 seconds (faster cutoff)

# Additional options
/sw:judge-llm src/file.ts --strict   # Fail on any concern
/sw:judge-llm src/file.ts --fix      # Include fix suggestions
/sw:judge-llm src/file.ts --export   # Export report to markdown
/sw:judge-llm src/file.ts --verbose  # Show progress to console
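
To make the flag semantics above concrete, here is a minimal sketch of how the arguments could be parsed into an options object. `JudgeOptions` and `parseJudgeArgs` are hypothetical names, and the error behavior for unknown flags is an assumption; the real command may handle arguments differently.

```typescript
// Sketch only: `JudgeOptions` and `parseJudgeArgs` are hypothetical names.
interface JudgeOptions {
  targets: string[]; // file paths or globs
  staged: boolean;
  lastCommit: boolean;
  diff?: string; // branch to diff against
  quick: boolean;
  strict: boolean;
  fix: boolean;
  exportReport: boolean;
  verbose: boolean;
  timeoutMs: number;
}

function parseJudgeArgs(argv: string[]): JudgeOptions {
  const opts: JudgeOptions = {
    targets: [], staged: false, lastCommit: false, quick: false, strict: false,
    fix: false, exportReport: false, verbose: false,
    timeoutMs: 60000, // documented default of 60 seconds
  };
  for (let i = 0; i < argv.length; i++) {
    const arg = argv[i];
    switch (arg) {
      case "--staged": opts.staged = true; break;
      case "--last-commit": opts.lastCommit = true; break;
      case "--quick": opts.quick = true; break;
      case "--strict": opts.strict = true; break;
      case "--fix": opts.fix = true; break;
      case "--export": opts.exportReport = true; break;
      case "--verbose": opts.verbose = true; break;
      case "--diff": {
        const branch = argv[++i];
        if (branch === undefined) throw new Error("--diff requires a branch");
        opts.diff = branch;
        break;
      }
      case "--timeout": {
        // The usage examples pass milliseconds (120000 = 120 seconds).
        const ms = Number(argv[++i]);
        if (!isFinite(ms) || ms <= 0) throw new Error("--timeout requires a positive number of milliseconds");
        opts.timeoutMs = ms;
        break;
      }
      default:
        // Assumption: unknown flags are rejected; anything else is a target.
        if (arg.slice(0, 2) === "--") throw new Error("Unknown flag: " + arg);
        opts.targets.push(arg);
    }
  }
  return opts;
}
```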

Visibility & Stuck Detection

Progress is always logged to .specweave/logs/judge-llm.log:

[2026-01-19T10:30:00.000Z] [0.0s] [INFO] Starting LLM Judge evaluation for domain: backend
[2026-01-19T10:30:00.001Z] [0.0s] [INFO] Task: Validate authentication implementation...
[2026-01-19T10:30:00.002Z] [0.0s] [INFO] Using model: opus
[2026-01-19T10:30:00.003Z] [0.0s] [INFO] Timeout: 60000ms
[2026-01-19T10:30:00.004Z] [0.0s] [PROGRESS] Sending request to Opus...
[2026-01-19T10:30:45.000Z] [45.0s] [PROGRESS] Response received, parsing...

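If evaluation gets stuck:

  1. Check .specweave/logs/judge-llm.log for the last progress entry
  2. The default timeout (60s) aborts the evaluation if it is stuck
  3. Increase the timeout with --timeout if the evaluation is legitimately slow (ultrathink runs are quoted at ~60-90 seconds, so a healthy run can exceed the 60-second default)
  4. The result will show timedOut: true if the evaluation was aborted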
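
Checking the log for the last progress entry can be automated along these lines. The sketch assumes every line follows the `[timestamp] [elapsed] [LEVEL] message` layout shown in the sample log; `parseLogLine` and `lastProgress` are hypothetical helpers, not part of the skill.

```typescript
// Sketch only: `LogEntry`, `parseLogLine` and `lastProgress` are hypothetical.
interface LogEntry {
  timestamp: string;
  elapsedSeconds: number;
  level: string;
  message: string;
}

// Matches e.g. "[2026-01-19T10:30:00.004Z] [0.0s] [PROGRESS] Sending request to Opus..."
const LOG_LINE = /^\[([^\]]+)\] \[(\d+(?:\.\d+)?)s\] \[([A-Z]+)\] (.*)$/;

function parseLogLine(line: string): LogEntry | null {
  const m = LOG_LINE.exec(line);
  if (m === null) return null;
  return { timestamp: m[1], elapsedSeconds: Number(m[2]), level: m[3], message: m[4] };
}

// Walk the log backwards and return the most recent PROGRESS entry, if any.
function lastProgress(logText: string): LogEntry | null {
  const lines = logText.split("\n");
  for (let i = lines.length - 1; i >= 0; i--) {
    const entry = parseLogLine(lines[i].trim());
    if (entry !== null && entry.level === "PROGRESS") return entry;
  }
  return null;
}
```

If the last PROGRESS entry is the "Sending request" line and no "Response received" line follows, the evaluation is most likely still waiting on the model or has been aborted by the timeout.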

How It Works

When you invoke /sw:judge-llm, Claude will:

Step 1: Gather Input

Determine what to validate:

  • If file paths provided → read those files
  • If --staged → get staged git changes
  • If --last-commit → get files from last commit
  • If --diff <branch> → get diff against branch
  • If no args → validate recent work in conversation context
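
One way to picture the input-gathering step is as a mapping from the input source to a git invocation. The sketch below is an assumption about how this could be done, not the skill's actual behavior: `InputSource` and `gitArgsFor` are hypothetical, and the chosen git subcommands are simply standard ones that produce the described data.

```typescript
// Sketch only: `InputSource` and `gitArgsFor` are hypothetical names.
type InputSource =
  | { kind: "files"; paths: string[] }
  | { kind: "staged" }
  | { kind: "lastCommit" }
  | { kind: "diff"; branch: string }
  | { kind: "conversation" };

// Returns the arguments to pass to `git`, or null when no git call is needed
// (explicit file paths are read directly; conversation context needs no files).
function gitArgsFor(source: InputSource): string[] | null {
  switch (source.kind) {
    case "staged":
      return ["diff", "--cached"]; // staged changes
    case "lastCommit":
      // names of the files touched by the last commit
      return ["diff-tree", "--no-commit-id", "--name-only", "-r", "HEAD"];
    case "diff":
      return ["diff", source.branch]; // working tree vs. the given branch
    case "files":
    case "conversation":
      return null;
  }
}
```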

Step 2: ULTRATHINK Analysis (Default)

MANDATORY: Use extended thinking for deep LLM-as-Judge evaluation:

Claude MUST use ultrathink/extended thinking to:

1. **DEEP READ**: Thoroughly understand all code, context, and intent
2. **MULTI-DIMENSIONAL ANALYSIS**: Evaluate across ALL dimensions:
   - Correctness: Does it work exactly as intended?
   - Completeness: ALL edge cases handled? ALL requirements met?
   - Security: ANY vulnerabilities? OWASP Top 10 checked?
   - Performance: Algorithmic complexity? Memory usage? Bottlenecks?
   - Maintainability: Clean? Clear? Follows conventions?
   - Testability: Can it be tested? Are tests adequate?
   - Error handling: All failure modes covered?
3. **CRITICAL EVALUATION**: Weigh ALL findings by severity
4. **REASONED VERDICT**: Form verdict based on thorough analysis

Step 3: Return Verdict

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
JUDGE-LLM VERDICT: APPROVED | CONCERNS | REJECTED
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Mode: ULTRATHINK (extended thinking)
Confidence: 0.XX
Files Analyzed: N

REASONING:
[Detailed chain-of-thought from extended thinking]

ISSUES (if any):
🔴 CRITICAL: [title]
   [description]
   📍 [file:line]
   💡 [suggestion]

🟡 HIGH: [title]
   ...

🟢 LOW: [title]
   ...

VERDICT: [summary sentence]
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
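
The report layout above can be produced from a structured result along these lines. The field names in `JudgeResult` and the `renderVerdict` function are hypothetical, and the severity icons are copied from the template.

```typescript
// Sketch only: the types and `renderVerdict` are hypothetical names.
type Severity = "CRITICAL" | "HIGH" | "LOW";
type Verdict = "APPROVED" | "CONCERNS" | "REJECTED";

interface Issue {
  severity: Severity;
  title: string;
  description: string;
  location: string; // "file:line"
  suggestion: string;
}

interface JudgeResult {
  verdict: Verdict;
  mode: string;
  confidence: number; // 0..1
  filesAnalyzed: number;
  reasoning: string;
  issues: Issue[];
  summary: string;
}

const RULE = "━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━";
const ICONS: { [K in Severity]: string } = { CRITICAL: "🔴", HIGH: "🟡", LOW: "🟢" };

function renderVerdict(r: JudgeResult): string {
  const lines: string[] = [
    RULE, "JUDGE-LLM VERDICT: " + r.verdict, RULE, "",
    "Mode: " + r.mode,
    "Confidence: " + r.confidence.toFixed(2),
    "Files Analyzed: " + r.filesAnalyzed, "",
    "REASONING:", r.reasoning, "",
  ];
  // The ISSUES section is only printed when there is something to report.
  if (r.issues.length > 0) {
    lines.push("ISSUES:");
    for (const issue of r.issues) {
      lines.push(
        ICONS[issue.severity] + " " + issue.severity + ": " + issue.title,
        "   " + issue.description,
        "   📍 " + issue.location,
        "   💡 " + issue.suggestion,
        "",
      );
    }
  }
  lines.push("VERDICT: " + r.summary, RULE);
  return lines.join("\n");
}
```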

Verdict Meanings

| Verdict | Meaning | Action |
|---------|---------|--------|
| APPROVED | Work is solid, no significant issues | Safe to proceed |
| CONCERNS | Issues found worth addressing | Review and fix recommended |
| REJECTED | Critical issues found | MUST fix before proceeding |

Validation Modes

Default Mode (ULTRATHINK) - RECOMMENDED

  • Extended thinking ENABLED by default
  • Most thorough validation (~60-90 seconds)
  • Deep multi-dimensional analysis
  • Best for any completed work
  • Cost: ~$0.10-0.25

Quick Mode (--quick)

  • Fast validation (~10-15 seconds)
  • Standard reasoning (no extended thinking)
  • Good for quick sanity checks during development
  • Cost: ~$0.02-0.05

Strict Mode (--strict)

  • Any concern results in REJECTED
  • Use for critical paths, security code, or CI gates
  • Combines with ultrathink by default
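
The relationship between issue severities, strict mode, and the final verdict can be summarized in a small function. This is a sketch under assumptions: the document states that critical issues mean REJECTED and that strict mode rejects on any concern, while treating every other non-empty issue list as CONCERNS is an inference. `decideVerdict` is a hypothetical name.

```typescript
// Sketch only: `decideVerdict` is a hypothetical helper.
type Severity = "CRITICAL" | "HIGH" | "LOW";
type Verdict = "APPROVED" | "CONCERNS" | "REJECTED";

function decideVerdict(severities: Severity[], strict: boolean): Verdict {
  // No findings at all: the work is approved in every mode.
  if (severities.length === 0) return "APPROVED";
  // Strict mode rejects on any concern; otherwise only CRITICAL findings reject.
  const hasCritical = severities.some((s) => s === "CRITICAL");
  if (strict || hasCritical) return "REJECTED";
  // Assumption: remaining HIGH/LOW findings map to CONCERNS.
  return "CONCERNS";
}
```

For example, a run that finds two HIGH issues yields CONCERNS by default and REJECTED under --strict.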

Examples

Example 1: Default ultrathink validation

User: /sw:judge-llm src/core/auth/login.ts

Claude: [Uses extended thinking for deep analysis]

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
JUDGE-LLM VERDICT: APPROVED
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Mode: ULTRATHINK (extended thinking)
Confidence: 0.91
Files Analyzed: 1

REASONING:
After thorough analysis with extended thinking:

The login implementation demonstrates excellent security practices:
- Password hashing uses bcrypt with cost factor 12 (appropriate)
- Rate limiting implemented correctly (5 attempts, 15 min exponential backoff)
- Input validation prevents SQL injection and XSS
- Error messages are generic (don't leak user existence)
- Session tokens use cryptographically secure random generation
- CSRF protection properly implemented

Edge cases handled:
- Empty input validation ✓
- Unicode normalization for usernames ✓
- Timing attack mitigation via constant-time comparison ✓

No security, performance, or maintainability issues found.

VERDICT: Production-ready implementation with excellent security posture.
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Example 2: Validate staged changes

User: /sw:judge-llm --staged

Claude: [Uses extended thinking]

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
JUDGE-LLM VERDICT: CONCERNS
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Mode: ULTRATHINK (extended thinking)
Confidence: 0.84
Files Analyzed: 3

REASONING:
Extended thinking analysis of staged changes reveals:

Positive aspects:
- New API endpoint follows existing patterns
- TypeScript types are correct
- Error handling present

However, thorough analysis found issues:

🟡 HIGH: Missing Input Validation
   User input passed to database without sanitization
   📍 src/api/users.ts:45
   💡 Add Zod schema validation:
   ```typescript
   import { z } from "zod";

   // Reject any request body whose userId is not a UUID string
   const schema = z.object({ userId: z.string().uuid() });
   const { userId } = schema.parse(req.body);
   ```

🟡 HIGH: Information Disclosure Risk
   Stack traces exposed in error responses
   📍 src/api/users.ts:62
   💡 Use production error handler that sanitizes output

🟢 LOW: Missing rate limiting
   New endpoint has no rate limiting
   📍 src/api/users.ts:30
   💡 Add rate limiter middleware

VERDICT: Address HIGH issues before merging. LOW can be follow-up.
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━


### Example 3: Quick validation (when needed)

User: /sw:judge-llm src/utils/format.ts --quick

Claude: [Standard reasoning, no extended thinking]

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
JUDGE-LLM VERDICT: APPROVED
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Mode: QUICK (standard reasoning)
Confidence: 0.75
Files Analyzed: 1

REASONING:
Utility formatting functions look correct. No obvious issues.

VERDICT: Looks good for a utility file.
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━


## Simplest Usage

Just say in your prompt:

"judge-llm my work"
"use judge-llm"
"judge-llm this"


Claude will:
1. Automatically gather context from the conversation
2. Use ULTRATHINK extended thinking by default
3. Apply thorough LLM-as-Judge evaluation

## Difference from /sw:qa

| Aspect | `/sw:qa` | `/sw:judge-llm` |
|--------|-----------------|------------------------|
| **Scope** | Increments only | Any files |
| **Input** | Increment ID | Files, git diff, context |
| **Default Mode** | Standard | **ULTRATHINK** |
| **Pattern** | 7-dimension scoring | Judge LLM reasoning |
| **Focus** | Spec quality, risks | Code correctness |
| **When** | Before increment close | After any work |

## Best Practices

1. **Use by default**: Ultrathink is worth the extra time for quality
2. **Use `--staged`**: Validate before committing
3. **Use `--strict` for critical code**: Payment, auth, security
4. **Fix CRITICAL issues immediately**: Never ignore these
5. **Trust the ultrathink analysis**: Extended thinking catches subtle issues

## Limitations

- ❌ Doesn't execute tests (use test runners)
- ❌ Doesn't auto-apply fixes (only suggests)
- ❌ May miss domain-specific issues
- ❌ Not a replacement for human review

## Related

- `/sw:qa` - Increment-bound quality assessment
- `/sw:validate` - Rule-based increment validation
- `ado-sync-judge` agent - Uses judge pattern for sync validation