Claude-skill-registry judge-llm
Ultrathink LLM-as-Judge validation of completed work. Uses extended thinking by DEFAULT for thorough evaluation.
git clone https://github.com/majiayu000/claude-skill-registry
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/judge-llm" ~/.claude/skills/majiayu000-claude-skill-registry-judge-llm && rm -rf "$T"
skills/data/judge-llm/SKILL.md
/sw:judge-llm - Ultrathink LLM-as-Judge Validation
ULTRATHINK BY DEFAULT - Validate completed work using extended thinking and the LLM-as-Judge pattern.
Implementation: Opus Model + Timeout Handling
Model: opus for deepest analysis
Timeout: 60 seconds default (configurable with --timeout)
Progress Log: .specweave/logs/judge-llm.log
Implementation in src/core/skills/skill-judge.ts:
- Uses Anthropic SDK with user's ANTHROPIC_API_KEY
- AbortController-based timeout to prevent stuck states
- Progress logging for visibility during evaluation
- Fallback to basic pattern matching if no API key
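The AbortController-based timeout listed above could look roughly like the sketch below. It is an illustration under stated assumptions, not the code in src/core/skills/skill-judge.ts: `runWithTimeout` and `JudgeCallResult` are hypothetical names, and the `task` callback stands in for whatever SDK request the skill actually makes. Only the 60-second default and the `timedOut` flag come from this document.

```typescript
// Hypothetical result shape; only `timedOut` is named in this document.
interface JudgeCallResult {
  timedOut: boolean;
  text?: string;
}

// Runs `task` with an AbortSignal and reports timedOut: true if the task
// has not settled within `timeoutMs` (assumed default: 60000 ms).
async function runWithTimeout(
  task: (signal: AbortSignal) => Promise<string>,
  timeoutMs: number = 60000,
): Promise<JudgeCallResult> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);

  // Rejects as soon as the signal fires, so a task that ignores the signal
  // still cannot leave the caller stuck.
  const aborted = new Promise<never>((_resolve, reject) => {
    controller.signal.addEventListener(
      "abort",
      () => reject(new Error("aborted")),
      { once: true },
    );
  });

  try {
    const text = await Promise.race([task(controller.signal), aborted]);
    return { timedOut: false, text };
  } catch (err) {
    // An abort triggered by our timer is reported as a timeout, not an error.
    if (controller.signal.aborted) return { timedOut: true };
    throw err;
  } finally {
    clearTimeout(timer);
  }
}
```

Passing the signal into the task lets a cooperative request cancel itself, while the race guarantees the caller regains control either way. Because this document estimates roughly 60-90 seconds for an ultrathink run, a caller relying on the 60-second default may want to pass a larger `timeoutMs`.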
CRITICAL: Extended Thinking is DEFAULT
This command ALWAYS uses ultrathink (extended thinking) for thorough LLM-as-Judge evaluation:
DEFAULT BEHAVIOR = ULTRATHINK MODE
- Extended thinking enabled
- Deep chain-of-thought reasoning
- Thorough multi-dimensional analysis
- ~60-90 seconds for comprehensive evaluation
- Uses Opus model for maximum quality

Use --quick only if you explicitly need faster (but less thorough) validation.
Purpose
Use when you've completed work and want maximum-quality AI validation:
- Works on any files (not just SpecWeave increments)
- Uses ultrathink extended thinking for deepest analysis
- Returns clear verdict with detailed reasoning
Usage
    # DEFAULT: Ultrathink validation (recommended)
    /sw:judge-llm src/file.ts
    /sw:judge-llm "src/**/*.ts"

    # Validate git changes (ultrathink by default)
    /sw:judge-llm --staged          # Staged changes
    /sw:judge-llm --last-commit     # Last commit
    /sw:judge-llm --diff main       # Diff vs branch

    # Quick mode (ONLY if you need speed over thoroughness)
    /sw:judge-llm src/file.ts --quick

    # Timeout control in milliseconds (default: 60s)
    /sw:judge-llm src/file.ts --timeout 120000   # 120 seconds
    /sw:judge-llm src/file.ts --timeout 30000    # 30 seconds (faster cutoff)

    # Additional options
    /sw:judge-llm src/file.ts --strict    # Fail on any concern
    /sw:judge-llm src/file.ts --fix       # Include fix suggestions
    /sw:judge-llm src/file.ts --export    # Export report to markdown
    /sw:judge-llm src/file.ts --verbose   # Show progress to console
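To make the flag set above concrete, here is a minimal sketch of how the arguments might be parsed into an options object. The `JudgeOptions` shape and `parseJudgeArgs` function are hypothetical and are not taken from the skill's source. The flag names, the millisecond unit for `--timeout`, and the 60-second default follow the usage shown above.

```typescript
// Hypothetical option shape; field names mirror the documented flags.
interface JudgeOptions {
  targets: string[]; // file paths or globs
  staged: boolean;
  lastCommit: boolean;
  diffBranch?: string;
  quick: boolean;
  strict: boolean;
  fix: boolean;
  exportReport: boolean;
  verbose: boolean;
  timeoutMs: number;
}

function parseJudgeArgs(argv: string[]): JudgeOptions {
  const opts: JudgeOptions = {
    targets: [], staged: false, lastCommit: false, quick: false,
    strict: false, fix: false, exportReport: false, verbose: false,
    timeoutMs: 60000, // documented default: 60 seconds
  };
  for (let i = 0; i < argv.length; i++) {
    const arg = argv[i];
    switch (arg) {
      case "--staged": opts.staged = true; break;
      case "--last-commit": opts.lastCommit = true; break;
      case "--quick": opts.quick = true; break;
      case "--strict": opts.strict = true; break;
      case "--fix": opts.fix = true; break;
      case "--export": opts.exportReport = true; break;
      case "--verbose": opts.verbose = true; break;
      case "--diff": {
        const branch = argv[++i]; // the value follows the flag
        if (branch === undefined) throw new Error("--diff requires a branch name");
        opts.diffBranch = branch;
        break;
      }
      case "--timeout": {
        const ms = Number(argv[++i]);
        if (!Number.isFinite(ms) || ms <= 0) {
          throw new Error("--timeout requires a positive number of milliseconds");
        }
        opts.timeoutMs = ms;
        break;
      }
      default:
        if (arg.startsWith("--")) throw new Error(`Unknown flag: ${arg}`);
        opts.targets.push(arg); // anything else is treated as a path or glob
    }
  }
  return opts;
}
```

For example, `parseJudgeArgs(["src/file.ts", "--timeout", "120000", "--strict"])` yields one target, a 120000 ms timeout, and `strict: true`. Rejecting unknown flags is a design choice of this sketch, not documented behavior.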
Visibility & Stuck Detection
Progress is always logged to .specweave/logs/judge-llm.log:

    [2026-01-19T10:30:00.000Z] [0.0s] [INFO] Starting LLM Judge evaluation for domain: backend
    [2026-01-19T10:30:00.001Z] [0.0s] [INFO] Task: Validate authentication implementation...
    [2026-01-19T10:30:00.002Z] [0.0s] [INFO] Using model: opus
    [2026-01-19T10:30:00.003Z] [0.0s] [INFO] Timeout: 60000ms
    [2026-01-19T10:30:00.004Z] [0.0s] [PROGRESS] Sending request to Opus...
    [2026-01-19T10:30:45.000Z] [45.0s] [PROGRESS] Response received, parsing...
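The line layout in the sample log above (ISO timestamp, elapsed seconds to one decimal place, level, message) can be reproduced with a small formatter. This is a sketch inferred from the sample only. `formatLogLine` is a hypothetical helper, and the `LogLevel` union lists just the two levels visible in the sample, so the real logger may define more.

```typescript
// Levels seen in the sample log; the actual logger may support others.
type LogLevel = "INFO" | "PROGRESS";

// Builds one line in the shape: [ISO timestamp] [elapsed s] [LEVEL] message
function formatLogLine(
  startedAt: Date,
  now: Date,
  level: LogLevel,
  message: string,
): string {
  // Elapsed time since the evaluation started, in seconds with one decimal.
  const elapsedSeconds = ((now.getTime() - startedAt.getTime()) / 1000).toFixed(1);
  return `[${now.toISOString()}] [${elapsedSeconds}s] [${level}] ${message}`;
}
```

Called with a start time of 10:30:00.000Z and a current time of 10:30:45.000Z on 2026-01-19, it produces the final line of the sample log.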
If evaluation gets stuck:
- Check .specweave/logs/judge-llm.log for last progress
- Default timeout (60s) will abort if stuck
- Increase timeout with --timeout if legitimately slow
- Result will show timedOut: true if aborted
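Checking the log for the last progress entry can also be done programmatically. The sketch below parses log text in the format shown in the sample and returns the most recent entry. `lastLogEntry` and `LogEntry` are hypothetical names, and the regular expression is inferred from the sample lines rather than taken from the skill's source.

```typescript
interface LogEntry {
  timestamp: string;
  elapsedSeconds: number;
  level: string;
  message: string;
}

// Matches lines shaped like: [ISO timestamp] [12.3s] [LEVEL] message
const LOG_LINE = /^\[([^\]]+)\] \[(\d+(?:\.\d+)?)s\] \[([A-Z]+)\] (.*)$/;

// Returns the last well-formed entry in the log text, or undefined if none.
function lastLogEntry(logText: string): LogEntry | undefined {
  const lines = logText.split(/\r?\n/).filter((line) => line.trim() !== "");
  for (let i = lines.length - 1; i >= 0; i--) {
    const match = LOG_LINE.exec(lines[i]);
    if (match) {
      return {
        timestamp: match[1],
        elapsedSeconds: Number(match[2]),
        level: match[3],
        message: match[4],
      };
    }
  }
  return undefined;
}
```

If the last entry is still "Sending request to Opus..." well after the configured timeout, the run is likely stuck rather than slow.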
How It Works
When you invoke /sw:judge-llm, Claude will:
Step 1: Gather Input
Determine what to validate:
- If file paths provided → read those files
- If --staged → get staged git changes
- If --last-commit → get files from last commit
- If --diff <branch> → get diff against branch
- If no args → validate recent work in conversation context
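The input rules above amount to a small decision function. The sketch below is one possible reading, with several assumptions: `resolveInputSource`, `InputFlags`, and `InputSource` are hypothetical names, the precedence follows the order of the list, and the git argument lists are commands that could serve each case, not necessarily what the skill runs.

```typescript
type InputSource =
  | { kind: "files"; paths: string[] }
  | { kind: "git"; args: string[] }
  | { kind: "conversation" };

// Hypothetical flag shape for this sketch.
interface InputFlags {
  targets: string[];
  staged?: boolean;
  lastCommit?: boolean;
  diffBranch?: string;
}

function resolveInputSource(flags: InputFlags): InputSource {
  // Explicit file paths win (assumed precedence: same order as the list above).
  if (flags.targets.length > 0) return { kind: "files", paths: flags.targets };
  // Staged changes: `git diff --staged` (a synonym of `git diff --cached`).
  if (flags.staged) return { kind: "git", args: ["diff", "--staged"] };
  // Files touched by the last commit.
  if (flags.lastCommit) {
    return {
      kind: "git",
      args: ["diff-tree", "--no-commit-id", "--name-only", "-r", "HEAD"],
    };
  }
  // Diff of the working tree against the named branch.
  if (flags.diffBranch) return { kind: "git", args: ["diff", flags.diffBranch] };
  // No arguments: fall back to recent work in the conversation.
  return { kind: "conversation" };
}
```

Returning the git arguments as data instead of running them keeps the decision logic easy to test. A caller could pass `args` to a process runner of its choice.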
Step 2: ULTRATHINK Analysis (Default)
MANDATORY: Use extended thinking for deep LLM-as-Judge evaluation:
Claude MUST use ultrathink/extended thinking to:

1. **DEEP READ**: Thoroughly understand all code, context, and intent
2. **MULTI-DIMENSIONAL ANALYSIS**: Evaluate across ALL dimensions:
   - Correctness: Does it work exactly as intended?
   - Completeness: ALL edge cases handled? ALL requirements met?
   - Security: ANY vulnerabilities? OWASP Top 10 checked?
   - Performance: Algorithmic complexity? Memory usage? Bottlenecks?
   - Maintainability: Clean? Clear? Follows conventions?
   - Testability: Can it be tested? Are tests adequate?
   - Error handling: All failure modes covered?
3. **CRITICAL EVALUATION**: Weigh ALL findings by severity
4. **REASONED VERDICT**: Form verdict based on thorough analysis
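One way to keep the seven dimensions from drifting is to encode them as data and generate the evaluation prompt from that list. The sketch below is purely illustrative. The prompt wording and the `buildJudgePrompt` function are hypothetical, since the skill's real prompt is not shown in this document. Only the dimension names and their guiding questions are taken from the list above.

```typescript
// The seven dimensions from Step 2, each paired with its guiding question.
const DIMENSIONS: ReadonlyArray<{ name: string; question: string }> = [
  { name: "Correctness", question: "Does it work exactly as intended?" },
  { name: "Completeness", question: "Are all edge cases handled and all requirements met?" },
  { name: "Security", question: "Any vulnerabilities? OWASP Top 10 checked?" },
  { name: "Performance", question: "Algorithmic complexity? Memory usage? Bottlenecks?" },
  { name: "Maintainability", question: "Clean? Clear? Follows conventions?" },
  { name: "Testability", question: "Can it be tested? Are tests adequate?" },
  { name: "Error handling", question: "All failure modes covered?" },
];

// Hypothetical prompt wording, assembled from the checklist above.
function buildJudgePrompt(task: string, source: string): string {
  const checklist = DIMENSIONS
    .map((d, i) => `${i + 1}. ${d.name}: ${d.question}`)
    .join("\n");
  return [
    `Task: ${task}`,
    "Evaluate the work below across every dimension, weigh findings by severity,",
    "and finish with a verdict of APPROVED, CONCERNS, or REJECTED.",
    "",
    checklist,
    "",
    "Work under review:",
    source,
  ].join("\n");
}
```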
Step 3: Return Verdict
    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
    JUDGE-LLM VERDICT: APPROVED | CONCERNS | REJECTED
    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

    Mode: ULTRATHINK (extended thinking)
    Confidence: 0.XX
    Files Analyzed: N

    REASONING:
    [Detailed chain-of-thought from extended thinking]

    ISSUES (if any):
    🔴 CRITICAL: [title]
       [description]
       📍 [file:line]
       💡 [suggestion]

    🟡 HIGH: [title]
       ...

    🟢 LOW: [title]
       ...

    VERDICT: [summary sentence]
    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Verdict Meanings
| Verdict | Meaning | Action |
|---|---|---|
| APPROVED | Work is solid, no significant issues | Safe to proceed |
| CONCERNS | Issues found worth addressing | Review and fix recommended |
| REJECTED | Critical issues found | MUST fix before proceeding |
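The table can be approximated by a deterministic rule, which is useful for tests or CI gates even though the judge itself reaches its verdict through reasoning. The sketch below rests on assumptions: `deriveVerdict` and the `Issue` shape are hypothetical, the severity levels are the three shown in the verdict template, and any non-critical issue is treated as CONCERNS. The `strict` parameter mirrors the documented `--strict` flag, under which any concern results in REJECTED.

```typescript
type Severity = "CRITICAL" | "HIGH" | "LOW";
type Verdict = "APPROVED" | "CONCERNS" | "REJECTED";

// Hypothetical minimal issue shape for this sketch.
interface Issue {
  severity: Severity;
  title: string;
}

function deriveVerdict(issues: Issue[], strict: boolean = false): Verdict {
  // No issues at all: work is solid.
  if (issues.length === 0) return "APPROVED";
  // Critical issues always reject; in strict mode, any issue rejects.
  if (strict || issues.some((issue) => issue.severity === "CRITICAL")) {
    return "REJECTED";
  }
  // Assumption: any remaining issue is worth addressing. A real judge might
  // still approve work whose only findings are LOW.
  return "CONCERNS";
}
```

With a HIGH and a LOW issue this returns CONCERNS, and the same input with `strict` set to `true` returns REJECTED.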
Validation Modes
Default Mode (ULTRATHINK) - RECOMMENDED
- Extended thinking ENABLED by default
- Most thorough validation (~60-90 seconds)
- Deep multi-dimensional analysis
- Best for any completed work
- Cost: ~$0.10-0.25
Quick Mode (--quick)
- Fast validation (~10-15 seconds)
- Standard reasoning (no extended thinking)
- Good for quick sanity checks during development
- Cost: ~$0.02-0.05
Strict Mode (--strict)
- Any concern results in REJECTED
- Use for critical paths, security code, or CI gates
- Combines with ultrathink by default
Examples
Example 1: Default ultrathink validation
    User: /sw:judge-llm src/core/auth/login.ts

    Claude: [Uses extended thinking for deep analysis]

    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
    JUDGE-LLM VERDICT: APPROVED
    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

    Mode: ULTRATHINK (extended thinking)
    Confidence: 0.91
    Files Analyzed: 1

    REASONING:
    After thorough analysis with extended thinking:

    The login implementation demonstrates excellent security practices:
    - Password hashing uses bcrypt with cost factor 12 (appropriate)
    - Rate limiting implemented correctly (5 attempts, 15 min exponential backoff)
    - Input validation prevents SQL injection and XSS
    - Error messages are generic (don't leak user existence)
    - Session tokens use cryptographically secure random generation
    - CSRF protection properly implemented

    Edge cases handled:
    - Empty input validation ✓
    - Unicode normalization for usernames ✓
    - Timing attack mitigation via constant-time comparison ✓

    No security, performance, or maintainability issues found.

    VERDICT: Production-ready implementation with excellent security posture.
    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Example 2: Validate staged changes
    User: /sw:judge-llm --staged

    Claude: [Uses extended thinking]

    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
    JUDGE-LLM VERDICT: CONCERNS
    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

    Mode: ULTRATHINK (extended thinking)
    Confidence: 0.84
    Files Analyzed: 3

    REASONING:
    Extended thinking analysis of staged changes reveals:

    Positive aspects:
    - New API endpoint follows existing patterns
    - TypeScript types are correct
    - Error handling present

    However, thorough analysis found issues:

    🟡 HIGH: Missing Input Validation
       User input passed to database without sanitization
       📍 src/api/users.ts:45
       💡 Add Zod schema validation:
       ```typescript
       import { z } from "zod";
       // Reject any body whose userId is not a UUID string
       const schema = z.object({ userId: z.string().uuid() });
       const { userId } = schema.parse(req.body);
       ```

    🟡 HIGH: Information Disclosure Risk
       Stack traces exposed in error responses
       📍 src/api/users.ts:62
       💡 Use production error handler that sanitizes output

    🟢 LOW: Missing rate limiting
       New endpoint has no rate limiting
       📍 src/api/users.ts:30
       💡 Add rate limiter middleware

    VERDICT: Address HIGH issues before merging. LOW can be follow-up.
    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
### Example 3: Quick validation (when needed)
    User: /sw:judge-llm src/utils/format.ts --quick

    Claude: [Standard reasoning, no extended thinking]

    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
    JUDGE-LLM VERDICT: APPROVED
    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

    Mode: QUICK (standard reasoning)
    Confidence: 0.75
    Files Analyzed: 1

    REASONING:
    Utility formatting functions look correct. No obvious issues.

    VERDICT: Looks good for a utility file.
    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
## Simplest Usage

Just say in your prompt:

    "judge-llm my work"
    "use judge-llm"
    "judge-llm this"

Claude will:

1. Automatically gather context from the conversation
2. Use ULTRATHINK extended thinking by default
3. Apply thorough LLM-as-Judge evaluation

## Difference from /sw:qa

| Aspect | `/sw:qa` | `/sw:judge-llm` |
|---|---|---|
| **Scope** | Increments only | Any files |
| **Input** | Increment ID | Files, git diff, context |
| **Default Mode** | Standard | **ULTRATHINK** |
| **Pattern** | 7-dimension scoring | Judge LLM reasoning |
| **Focus** | Spec quality, risks | Code correctness |
| **When** | Before increment close | After any work |

## Best Practices

1. **Use by default**: Ultrathink is worth the extra time for quality
2. **Use `--staged`**: Validate before committing
3. **Use `--strict` for critical code**: Payment, auth, security
4. **Fix CRITICAL issues immediately**: Never ignore these
5. **Trust the ultrathink analysis**: Extended thinking catches subtle issues

## Limitations

- ❌ Doesn't execute tests (use test runners)
- ❌ Doesn't auto-apply fixes (only suggests)
- ❌ May miss domain-specific issues
- ❌ Not a replacement for human review

## Related

- `/sw:qa` - Increment-bound quality assessment
- `/sw:validate` - Rule-based increment validation
- `ado-sync-judge` agent - Uses judge pattern for sync validation