## Install

### Source · Clone the upstream repo

```bash
git clone https://github.com/ComeOnOliver/skillshub
```

### Claude Code · Install into ~/.claude/skills/

```bash
T=$(mktemp -d) && git clone --depth=1 https://github.com/ComeOnOliver/skillshub "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/LeoYeAI/openclaw-master-skills/agentic-eval" ~/.claude/skills/comeonoliver-skillshub-agentic-eval && rm -rf "$T"
```

### OpenClaw · Install into ~/.openclaw/skills/

```bash
T=$(mktemp -d) && git clone --depth=1 https://github.com/ComeOnOliver/skillshub "$T" && mkdir -p ~/.openclaw/skills && cp -r "$T/skills/LeoYeAI/openclaw-master-skills/agentic-eval" ~/.openclaw/skills/comeonoliver-skillshub-agentic-eval && rm -rf "$T"
```

Manifest: `skills/LeoYeAI/openclaw-master-skills/agentic-eval/SKILL.md`

---
# Agentic Evaluation Patterns
Patterns for self-improvement through iterative evaluation and refinement.
## Overview
Evaluation patterns enable agents to assess and improve their own outputs, moving beyond single-shot generation to iterative refinement loops.
```
Generate → Evaluate → Critique → Refine → Output
              ↑                     │
              └─────────────────────┘
```
## When to Use
- Quality-critical generation: Code, reports, analysis requiring high accuracy
- Tasks with clear evaluation criteria: Defined success metrics exist
- Content requiring specific standards: Style guides, compliance, formatting
## Pattern 1: Basic Reflection
Agent evaluates and improves its own output through self-critique.
```python
import json

# `llm` is assumed throughout these examples: a text-in/text-out model call.

def reflect_and_refine(task: str, criteria: list[str], max_iterations: int = 3) -> str:
    """Generate with reflection loop."""
    output = llm(f"Complete this task:\n{task}")

    for i in range(max_iterations):
        # Self-critique against the stated criteria
        critique = llm(f"""
        Evaluate this output against criteria: {criteria}
        Output: {output}
        Rate each: PASS/FAIL with feedback as JSON.
        """)
        critique_data = json.loads(critique)

        all_pass = all(c["status"] == "PASS" for c in critique_data.values())
        if all_pass:
            return output

        # Refine using feedback from the failed criteria only
        failed = {k: v["feedback"] for k, v in critique_data.items() if v["status"] == "FAIL"}
        output = llm(f"Improve to address: {failed}\nOriginal: {output}")

    return output
```
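A hypothetical call, with criteria phrased as concrete, checkable statements (the task and criteria here are illustrative):

```python
summary = reflect_and_refine(
    task="Summarize the attached incident report for executives.",
    criteria=[
        "No more than 150 words",
        "States the root cause explicitly",
        "Includes one concrete follow-up action",
    ],
)
```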
Key insight: Use structured JSON output for reliable parsing of critique results.
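Concretely, the loop expects the critique to parse into a mapping from criterion to verdict; the keys below are illustrative, not fixed:

```python
critique_data = {
    "root_cause_stated": {"status": "PASS", "feedback": ""},
    "word_limit": {"status": "FAIL", "feedback": "Summary is 212 words; trim to 150."},
}
```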
## Pattern 2: Evaluator-Optimizer
Separate generation and evaluation into distinct components for clearer responsibilities.
```python
import json

class EvaluatorOptimizer:
    def __init__(self, score_threshold: float = 0.8):
        self.score_threshold = score_threshold

    def generate(self, task: str) -> str:
        return llm(f"Complete: {task}")

    def evaluate(self, output: str, task: str) -> dict:
        # Structured scores keep the optimize step targeted
        return json.loads(llm(f"""
        Evaluate output for task: {task}
        Output: {output}
        Return JSON: {{"overall_score": 0-1, "dimensions": {{"accuracy": ..., "clarity": ...}}}}
        """))

    def optimize(self, output: str, feedback: dict) -> str:
        return llm(f"Improve based on feedback: {feedback}\nOutput: {output}")

    def run(self, task: str, max_iterations: int = 3) -> str:
        output = self.generate(task)
        for _ in range(max_iterations):
            evaluation = self.evaluate(output, task)
            if evaluation["overall_score"] >= self.score_threshold:
                break
            output = self.optimize(output, evaluation)
        return output
```
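For example (the task and threshold are hypothetical):

```python
optimizer = EvaluatorOptimizer(score_threshold=0.85)
plan = optimizer.run("Draft a rollout plan for enabling rate limiting on the public API.")
```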
## Pattern 3: Code-Specific Reflection
Test-driven refinement loop for code generation.
```python
class CodeReflector:
    def reflect_and_fix(self, spec: str, max_iterations: int = 3) -> str:
        code = llm(f"Write Python code for: {spec}")
        tests = llm(f"Generate pytest tests for: {spec}\nCode: {code}")

        for _ in range(max_iterations):
            result = run_tests(code, tests)
            if result["success"]:
                return code
            # Feed the failure back as a concrete repair target
            code = llm(f"Fix error: {result['error']}\nCode: {code}")

        return code
```
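`run_tests` is left undefined above. A minimal sketch, assuming pytest is installed and the generated tests import from a `solution` module (the filenames and result shape are assumptions):

```python
import subprocess
import sys
import tempfile
from pathlib import Path

def run_tests(code: str, tests: str, timeout: int = 60) -> dict:
    """Write code and tests to a temp dir, run pytest, report success/error."""
    with tempfile.TemporaryDirectory() as tmp:
        Path(tmp, "solution.py").write_text(code)
        Path(tmp, "test_solution.py").write_text(tests)
        proc = subprocess.run(
            [sys.executable, "-m", "pytest", "test_solution.py", "-x", "-q"],
            cwd=tmp, capture_output=True, text=True, timeout=timeout,
        )
    return {"success": proc.returncode == 0, "error": proc.stdout + proc.stderr}
```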
## Evaluation Strategies
### Outcome-Based
Evaluate whether output achieves the expected result.
```python
def evaluate_outcome(task: str, output: str, expected: str) -> str:
    return llm(
        f"Does output achieve expected outcome? "
        f"Task: {task}, Expected: {expected}, Output: {output}"
    )
```
### LLM-as-Judge
Use LLM to compare and rank outputs.
```python
def llm_judge(output_a: str, output_b: str, criteria: str) -> str:
    # Both candidates must appear in the prompt for the comparison to be grounded
    return llm(
        f"Compare outputs A and B for {criteria}. Which is better and why?\n"
        f"A: {output_a}\nB: {output_b}"
    )
```
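LLM judges are known to show positional bias (favoring whichever output appears first). A hedged mitigation sketch: judge both orderings and only accept a verdict that survives the swap:

```python
def llm_judge_debiased(output_a: str, output_b: str, criteria: str) -> str:
    prompt = "Compare A and B for {c}. Answer with exactly 'A' or 'B'.\nA: {a}\nB: {b}"
    first = llm(prompt.format(c=criteria, a=output_a, b=output_b)).strip()
    second = llm(prompt.format(c=criteria, a=output_b, b=output_a)).strip()
    if first == "A" and second == "B":
        return "A"  # output_a wins in both orderings
    if first == "B" and second == "A":
        return "B"  # output_b wins in both orderings
    return "tie"  # inconsistent verdicts suggest positional bias
```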
### Rubric-Based
Score outputs against weighted dimensions.
```python
import json

RUBRIC = {
    "accuracy": {"weight": 0.4},
    "clarity": {"weight": 0.3},
    "completeness": {"weight": 0.3},
}

def evaluate_with_rubric(output: str, rubric: dict) -> float:
    scores = json.loads(llm(
        f"Rate 1-5 for each dimension: {list(rubric.keys())}\nOutput: {output}"
    ))
    # Weighted average, normalized from the 1-5 scale toward 0-1
    return sum(scores[d] * rubric[d]["weight"] for d in rubric) / 5
```
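A short usage sketch: gate one revision pass on the weighted score (the threshold and prompts are illustrative):

```python
draft = llm("Summarize the Q3 postmortem in one paragraph.")
if evaluate_with_rubric(draft, RUBRIC) < 0.8:
    draft = llm(f"Revise for accuracy, clarity, and completeness:\n{draft}")
```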
## Best Practices
| Practice | Rationale |
|---|---|
| Clear criteria | Define specific, measurable evaluation criteria upfront |
| Iteration limits | Set max iterations (3-5) to prevent infinite loops |
| Convergence check | Stop if the output score isn't improving between iterations (see the sketch after this table) |
| Log history | Keep full trajectory for debugging and analysis |
| Structured output | Use JSON for reliable parsing of evaluation results |
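A minimal sketch of the convergence check and trajectory logging from the table, reusing Pattern 2's `EvaluatorOptimizer`; `min_delta` is an illustrative parameter:

```python
def run_with_convergence(task: str, max_iterations: int = 5, min_delta: float = 0.02) -> str:
    opt = EvaluatorOptimizer()
    output = opt.generate(task)
    prev_score = 0.0
    history = []  # full trajectory, kept for debugging and analysis

    for _ in range(max_iterations):
        evaluation = opt.evaluate(output, task)
        score = evaluation["overall_score"]
        history.append({"output": output, "score": score})
        # Stop on success, or when the score has effectively plateaued
        if score >= opt.score_threshold or score - prev_score < min_delta:
            break
        prev_score = score
        output = opt.optimize(output, evaluation)

    return output
```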
## Quick Start Checklist
```markdown
## Evaluation Implementation Checklist

### Setup
- [ ] Define evaluation criteria/rubric
- [ ] Set score threshold for "good enough"
- [ ] Configure max iterations (default: 3)

### Implementation
- [ ] Implement generate() function
- [ ] Implement evaluate() function with structured output
- [ ] Implement optimize() function
- [ ] Wire up the refinement loop

### Safety
- [ ] Add convergence detection
- [ ] Log all iterations for debugging
- [ ] Handle evaluation parse failures gracefully
```
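For the last safety item, a minimal sketch of graceful parse-failure handling (the repair prompt and retry count are assumptions):

```python
import json

def safe_parse(raw: str, retries: int = 1) -> dict | None:
    """Parse an evaluation response, asking the model to repair invalid JSON."""
    for _ in range(retries + 1):
        try:
            return json.loads(raw)
        except json.JSONDecodeError:
            raw = llm(f"Rewrite the following as valid JSON only, no prose:\n{raw}")
    return None  # caller can fall back to keeping the current draft
```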