ClawForge langsmith-evaluator
Use this skill for ANY question about CREATING evaluators. Covers creating custom metrics, LLM as Judge evaluators, code-based evaluators, and uploading evaluation logic to LangSmith. Does NOT cover RUNNING evaluations.
Clone the repository:

```bash
git clone https://github.com/jackjin1997/ClawForge
```

Or install the skill in one step:

```bash
T=$(mktemp -d) && git clone --depth=1 https://github.com/jackjin1997/ClawForge "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/langsmith-evaluator" ~/.claude/skills/jackjin1997-clawforge-langsmith-evaluator && rm -rf "$T"
```
`skills/langsmith-evaluator/SKILL.md`

LangSmith Evaluator
Create evaluators to measure agent performance on your datasets. LangSmith supports two types: LLM as Judge (uses LLM to grade outputs) and Custom Code (deterministic logic).
Setup
Environment Variables
```bash
LANGSMITH_API_KEY=lsv2_pt_your_api_key_here   # Required
LANGSMITH_WORKSPACE_ID=your-workspace-id      # Optional: for org-scoped keys
OPENAI_API_KEY=your_openai_key                # For LLM as Judge
```
Dependencies
```bash
pip install langsmith langchain-openai python-dotenv
```
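Since `python-dotenv` is in the dependency list, one common pattern (a sketch, not required by the skill) is to keep the variables above in a `.env` file and load them at the top of your evaluator scripts:

```python
import os

from dotenv import load_dotenv

# Read LANGSMITH_API_KEY etc. from a .env file in the working directory
load_dotenv()

# Fail fast if the required key is missing
assert os.environ.get("LANGSMITH_API_KEY"), "LANGSMITH_API_KEY is not set"
```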
Evaluator Format
Evaluators support two function signatures:
Method 1: Dict Parameters (For running evaluations locally):
```python
def evaluator_name(inputs: dict, outputs: dict, reference_outputs: dict = None) -> dict:
    """Evaluate a single prediction."""
    user_query = inputs.get("query", "")
    agent_response = outputs.get("expected_response", "")
    expected = reference_outputs.get("expected_response", "") if reference_outputs else None
    return {
        "key": "metric_name",    # Metric identifier
        "score": 0.85,           # Number or boolean
        "comment": "Reason..."   # Optional explanation
    }
```
Method 2: Run/Example Parameters (For uploading to LangSmith):
```python
def evaluator_name(run, example):
    """Evaluate using run/example dicts.

    Args:
        run: Dict with run["outputs"] containing agent outputs
        example: Dict with example["outputs"] containing expected outputs
    """
    agent_response = run["outputs"].get("expected_response", "")
    expected = example["outputs"].get("expected_response", "")
    return {
        "metric_name": 0.85,     # Metric name as key directly
        "comment": "Reason..."   # Optional explanation
    }
```
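The two signatures carry the same information, so a dict-style evaluator can be wrapped for upload rather than rewritten. A minimal sketch; the wrapper name `as_run_example` is illustrative, and it assumes the example dict also carries an `inputs` key:

```python
def as_run_example(dict_evaluator):
    """Adapt a dict-style evaluator to the (run, example) signature."""
    def wrapped(run, example):
        result = dict_evaluator(
            inputs=example.get("inputs", {}),
            outputs=run["outputs"],
            reference_outputs=example["outputs"],
        )
        # Convert {"key": ..., "score": ...} into {metric_name: score}
        return {result["key"]: result["score"], "comment": result.get("comment", "")}
    return wrapped
```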
LLM as Judge Evaluators
Use structured output for reliable grading:
```python
from typing import TypedDict, Annotated

from langchain_openai import ChatOpenAI


class AccuracyGrade(TypedDict):
    """Structured evaluation output."""
    reasoning: Annotated[str, ..., "Explain your reasoning"]
    is_accurate: Annotated[bool, ..., "True if response is accurate"]
    confidence: Annotated[float, ..., "Confidence 0.0-1.0"]


# Configure model with structured output
judge = ChatOpenAI(model="gpt-4o-mini", temperature=0).with_structured_output(
    AccuracyGrade, method="json_schema", strict=True
)


async def accuracy_evaluator(run, example):
    """Evaluate factual accuracy for LangSmith upload."""
    expected = example["outputs"].get("expected_response", "")
    agent_output = run["outputs"].get("expected_response", "")

    prompt = f"""Expected: {expected}

Agent Output: {agent_output}

Evaluate accuracy:"""

    grade = await judge.ainvoke([{"role": "user", "content": prompt}])
    return {
        "accuracy": 1 if grade["is_accurate"] else 0,
        "comment": f"{grade['reasoning']} (confidence: {grade['confidence']})"
    }
```
Common Metrics: Completeness, correctness, helpfulness, professionalism
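Each of these can follow the same structured-output pattern as the accuracy judge above. As one illustration, here is a hedged sketch of a completeness judge; `CompletenessGrade`, `is_complete`, and the `completeness` metric key are assumptions, not names shipped with the skill:

```python
from typing import TypedDict, Annotated

from langchain_openai import ChatOpenAI


class CompletenessGrade(TypedDict):
    """Structured output for a completeness judge (hypothetical schema)."""
    reasoning: Annotated[str, ..., "Explain your reasoning"]
    is_complete: Annotated[bool, ..., "True if the response covers everything asked"]


completeness_judge = ChatOpenAI(model="gpt-4o-mini", temperature=0).with_structured_output(
    CompletenessGrade, method="json_schema", strict=True
)


async def completeness_evaluator(run, example):
    """Grade whether the agent response fully addresses the expected answer."""
    expected = example["outputs"].get("expected_response", "")
    agent_output = run["outputs"].get("expected_response", "")

    prompt = f"""Expected: {expected}

Agent Output: {agent_output}

Does the agent output cover everything in the expected answer?"""

    grade = await completeness_judge.ainvoke([{"role": "user", "content": prompt}])
    return {
        "completeness": 1 if grade["is_complete"] else 0,
        "comment": grade["reasoning"]
    }
```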
Custom Code Evaluators
Exact Match
```python
def exact_match_evaluator(run, example):
    """Check if output exactly matches expected."""
    output = run["outputs"].get("expected_response", "").strip().lower()
    expected = example["outputs"].get("expected_response", "").strip().lower()
    match = output == expected
    return {
        "exact_match": 1 if match else 0,
        "comment": f"Match: {match}"
    }
```
Trajectory Validation
```python
def trajectory_evaluator(run, example):
    """Evaluate tool call sequence."""
    trajectory = run["outputs"].get("expected_trajectory", [])
    expected = example["outputs"].get("expected_trajectory", [])

    # Exact sequence match
    exact = trajectory == expected
    # All required tools used (order-agnostic)
    all_tools = set(expected).issubset(set(trajectory))
    # Efficiency: count extra steps
    extra_steps = len(trajectory) - len(expected)

    return {
        "trajectory_match": 1 if exact else 0,
        "comment": f"Exact: {exact}, All tools: {all_tools}, Extra: {extra_steps}"
    }
```
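If the binary exact-sequence score is too strict for your use case, a partial-credit variant can score the fraction of expected tools that appear in the trajectory. This is a sketch, not one of the skill's shipped evaluators; the `trajectory_recall` key is illustrative:

```python
def trajectory_recall_evaluator(run, example):
    """Score the fraction of expected tools present in the trajectory."""
    trajectory = run["outputs"].get("expected_trajectory", [])
    expected = set(example["outputs"].get("expected_trajectory", []))

    if not expected:
        return {"trajectory_recall": 1, "comment": "No expected tools"}

    hits = sum(1 for tool in expected if tool in trajectory)
    return {
        "trajectory_recall": hits / len(expected),
        "comment": f"{hits}/{len(expected)} expected tools used"
    }
```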
Single Step Validation
```python
def single_step_evaluator(run, example):
    """Evaluate single node output."""
    output = run["outputs"].get("output", {})
    expected = example["outputs"].get("expected_output", {})
    node_name = run["outputs"].get("node_name", "")

    # For classification nodes
    if "classification" in node_name:
        classification = output.get("classification", "")
        expected_class = expected.get("classification", "")
        match = classification.lower() == expected_class.lower()
        return {
            "classification_correct": 1 if match else 0,
            "comment": f"Output: {classification}, Expected: {expected_class}"
        }

    # For other nodes
    match = output == expected
    return {
        "output_match": 1 if match else 0,
        "comment": f"Match: {match}"
    }
```
Running Evaluations
```python
from langsmith import Client

client = Client()


# Define your agent function
def run_agent(inputs: dict) -> dict:
    """Your agent invocation logic."""
    result = your_agent.invoke(inputs)
    return {"expected_response": result}


# Run evaluation
results = await client.aevaluate(
    run_agent,
    data="Skills: Final Response",  # Dataset name
    evaluators=[
        exact_match_evaluator,
        accuracy_evaluator,
        trajectory_evaluator
    ],
    experiment_prefix="skills-eval-v1",
    max_concurrency=4
)
```
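Note that `aevaluate` must be awaited: in a notebook you can await it directly, while in a plain script you would wrap it in an event loop. A minimal sketch, assuming `run_agent` and the evaluators from the block above are defined:

```python
import asyncio

from langsmith import Client

client = Client()


async def main():
    results = await client.aevaluate(
        run_agent,  # agent function from the block above
        data="Skills: Final Response",
        evaluators=[exact_match_evaluator, accuracy_evaluator],
        experiment_prefix="skills-eval-v1",
        max_concurrency=4,
    )
    # Results can be inspected in the LangSmith UI under the experiment prefix


asyncio.run(main())
```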
Upload Evaluators to LangSmith
The upload script deploys your custom evaluators to LangSmith. Write evaluators specific to your use case, then upload them.
Navigate to `skills/langsmith-evaluator/scripts/` to upload evaluators.

Important: the LangSmith API requires uploaded evaluators to use the `(run, example)` signature, where:
- `run`: dict with `run["outputs"]` containing agent outputs
- `example`: dict with `example["outputs"]` containing expected outputs
Create Evaluator File
```python
# my_project/evaluators/custom_evals.py
def my_custom_evaluator(run, example):
    """Your custom evaluation logic.

    Args:
        run: Dict with run["outputs"] - agent outputs
        example: Dict with example["outputs"] - expected outputs

    Returns:
        Dict with metric_name as key, score as value, optional comment
    """
    # Extract relevant data
    agent_output = run["outputs"].get("expected_trajectory", [])
    expected = example["outputs"].get("expected_trajectory", [])

    # Your custom logic here
    match = agent_output == expected

    return {
        "my_metric": 1 if match else 0,
        "comment": "Custom reasoning here"
    }
```
Upload
```bash
# List existing evaluators
python upload_evaluators.py list

# Upload evaluator
python upload_evaluators.py upload my_evaluators.py \
    --name "Trajectory Match" \
    --function trajectory_match \
    --dataset "Skills: Trajectory" \
    --replace

# Delete evaluator (will prompt for confirmation)
python upload_evaluators.py delete "Trajectory Match"

# Skip confirmation prompts (use with caution)
python upload_evaluators.py delete "Trajectory Match" --yes
python upload_evaluators.py upload my_evaluators.py \
    --name "Trajectory Match" \
    --function trajectory_match \
    --replace --yes
```
Options:
- `--name` - Display name in LangSmith
- `--function` - Function name to extract
- `--dataset` - Target dataset name
- `--project` - Target project name
- `--sample-rate` - Sampling rate (0.0-1.0)
- `--replace` - Replace if exists (will prompt for confirmation)
- `--yes` - Skip confirmation prompts for replace/delete operations
IMPORTANT - Safety Prompts:
- The script prompts for confirmation before any destructive operations (delete, replace)
- ALWAYS respect these prompts - wait for user input before proceeding
- NEVER use the `--yes` flag unless the user explicitly requests it
- The `--yes` flag skips all safety prompts and should only be used in automated workflows when explicitly authorized by the user
Best Practices
- Use structured output for LLM judges - More reliable than parsing free-text
- Match evaluator to dataset type:
  - Final Response → LLM as Judge for quality, Custom Code for format
  - Single Step → Custom Code for exact match
  - Trajectory → Custom Code for sequence/efficiency
- Combine multiple evaluators - Run both subjective (LLM) and objective (code)
- Use async for LLM judges - Enables parallel evaluation, much faster
- Test evaluators independently - Validate on known good/bad examples first (see the sketch after this list)
- Upload to LangSmith - Automatic evaluation on new runs
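For the independent-testing point above, evaluators are plain functions, so you can call them with hand-built run/example dicts before uploading. A minimal sketch using `exact_match_evaluator` from the Custom Code section (the sample strings are made up):

```python
# Hand-built dicts mimic what LangSmith passes to the evaluator
good_run = {"outputs": {"expected_response": "Paris"}}
bad_run = {"outputs": {"expected_response": "London"}}
example = {"outputs": {"expected_response": "Paris"}}

assert exact_match_evaluator(good_run, example)["exact_match"] == 1
assert exact_match_evaluator(bad_run, example)["exact_match"] == 0
print("Evaluator sanity checks passed")
```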
Example Workflow
```bash
# 1. Create evaluators file
cat > evaluators.py <<'EOF'
def exact_match(run, example):
    """Check if output exactly matches expected."""
    output = run["outputs"].get("expected_response", "").strip().lower()
    expected = example["outputs"].get("expected_response", "").strip().lower()
    match = output == expected
    return {
        "exact_match": 1 if match else 0,
        "comment": f"Match: {match}"
    }
EOF

# 2. Upload to LangSmith
python upload_evaluators.py upload evaluators.py \
    --name "Exact Match" \
    --function exact_match \
    --dataset "Skills: Final Response" \
    --replace

# 3. Evaluator runs automatically on new dataset runs
```
Resources
Related Skills
- Use langsmith-trace skill to query and export traces
- Use langsmith-dataset skill to generate evaluation datasets from traces