# Commonly-used-high-value-skills: promptfoo-evaluation
Configures and runs LLM evaluations using the Promptfoo framework. Use when setting up prompt testing, creating evaluation configs (`promptfooconfig.yaml`), writing custom Python assertions, implementing `llm-rubric` for LLM-as-judge, or managing few-shot examples in prompts. Triggers on keywords like "promptfoo", "eval", "LLM evaluation", "prompt testing", or "model comparison".
```bash
# Clone the repository
git clone https://github.com/seaworld008/Commonly-used-high-value-skills

# Or copy just this skill into ~/.claude/skills
T=$(mktemp -d) && git clone --depth=1 https://github.com/seaworld008/Commonly-used-high-value-skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/developer-engineering/promptfoo-evaluation" ~/.claude/skills/seaworld008-commonly-used-high-value-skills-promptfoo-evaluation-3b3ea0 && rm -rf "$T"
```
`skills/developer-engineering/promptfoo-evaluation/SKILL.md`

# Promptfoo Evaluation
## Overview
This skill provides guidance for configuring and running LLM evaluations using Promptfoo, an open-source CLI tool for testing and comparing LLM outputs.
## Quick Start
```bash
# Initialize a new evaluation project
npx promptfoo@latest init

# Run evaluation
npx promptfoo@latest eval

# View results in browser
npx promptfoo@latest view
```
## Configuration Structure
A typical Promptfoo project structure:
```
project/
├── promptfooconfig.yaml     # Main configuration
├── prompts/
│   ├── system.md            # System prompt
│   └── chat.json            # Chat format prompt
├── tests/
│   └── cases.yaml           # Test cases
└── scripts/
    └── metrics.py           # Custom Python assertions
```
## Core Configuration (promptfooconfig.yaml)
```yaml
# yaml-language-server: $schema=https://promptfoo.dev/config-schema.json
description: "My LLM Evaluation"

# Prompts to test
prompts:
  - file://prompts/system.md
  - file://prompts/chat.json

# Models to compare
providers:
  - id: anthropic:messages:claude-sonnet-4-6
    label: Claude-Sonnet-4.6
  - id: openai:gpt-4.1
    label: GPT-4.1

# Test cases
tests: file://tests/cases.yaml

# Concurrency control (MUST be under commandLineOptions, NOT top-level)
commandLineOptions:
  maxConcurrency: 2

# Default assertions for all tests
defaultTest:
  assert:
    - type: python
      value: file://scripts/metrics.py:custom_assert
    - type: llm-rubric
      value: |
        Evaluate the response quality on a 0-1 scale.
      threshold: 0.7

# Output path
outputPath: results/eval-results.json
```
## Prompt Formats
### Text Prompt (system.md)
```markdown
You are a helpful assistant.

Task: {{task}}
Context: {{context}}
```
### Chat Format (chat.json)
[ {"role": "system", "content": "{{system_prompt}}"}, {"role": "user", "content": "{{user_input}}"} ]
### Few-Shot Pattern
Embed examples directly in the prompt, or use the chat format with assistant messages:
[ {"role": "system", "content": "{{system_prompt}}"}, {"role": "user", "content": "Example input: {{example_input}}"}, {"role": "assistant", "content": "{{example_output}}"}, {"role": "user", "content": "Now process: {{actual_input}}"} ]
## Test Cases (tests/cases.yaml)
- description: "Test case 1" vars: system_prompt: file://prompts/system.md user_input: "Hello world" # Load content from files context: file://data/context.txt assert: - type: contains value: "expected text" - type: python value: file://scripts/metrics.py:custom_check threshold: 0.8
## Python Custom Assertions
Create a Python file for custom assertions (e.g., `scripts/metrics.py`):
```python
def get_assert(output: str, context: dict) -> dict:
    """Default assertion function."""
    vars_dict = context.get('vars', {})  # Access test variables
    expected = vars_dict.get('expected', '')

    # Return result
    return {
        "pass": expected in output,
        "score": 0.8,
        "reason": "Contains expected content",
        "named_scores": {"relevance": 0.9}
    }


def custom_check(output: str, context: dict) -> dict:
    """Custom named assertion."""
    word_count = len(output.split())
    passed = 100 <= word_count <= 500
    return {
        "pass": passed,
        "score": min(1.0, word_count / 300),
        "reason": f"Word count: {word_count}"
    }
```
Key points:

- The default function name is `get_assert`
- Specify a different function with `file://path.py:function_name`
- Return a `bool`, a `float` (score), or a `dict` with `pass`/`score`/`reason` (see the sketch below)
- Access test variables via `context['vars']`
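Promptfoo accepts all three return forms from a Python assertion. A minimal sketch (the function names and the keyword check are illustrative, not part of this project):

```python
def pass_fail(output: str, context: dict) -> bool:
    """bool return: simple pass/fail."""
    return "refund" in output


def scored(output: str, context: dict) -> float:
    """float return: interpreted as a 0-1 score."""
    return min(1.0, len(output.split()) / 300)


def detailed(output: str, context: dict) -> dict:
    """dict return: full control over pass/score/reason."""
    ok = "refund" in output
    return {"pass": ok, "score": 1.0 if ok else 0.0, "reason": "keyword check"}
```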
## LLM-as-Judge (llm-rubric)
```yaml
assert:
  - type: llm-rubric
    value: |
      Evaluate the response based on:
      1. Accuracy of information
      2. Clarity of explanation
      3. Completeness
      Score 0.0-1.0 where 0.7+ is passing.
    threshold: 0.7
    provider: openai:gpt-4.1  # Optional: override grader model
```
When using a relay/proxy API, each `llm-rubric` assertion needs its own provider config with `apiBaseUrl`. Otherwise the grader falls back to the default Anthropic/OpenAI endpoint and gets 401 errors:
```yaml
assert:
  - type: llm-rubric
    value: |
      Evaluate quality on a 0-1 scale.
    threshold: 0.7
    provider:
      id: anthropic:messages:claude-sonnet-4-6
      config:
        apiBaseUrl: https://your-relay.example.com/api
```
Best practices:

- Provide clear scoring criteria
- Use `threshold` to set the minimum passing score
- The default grader uses whichever API keys are available (OpenAI → Anthropic → Google); a global override is sketched below
- When using a relay/proxy: every `llm-rubric` must have its own `provider` with `apiBaseUrl`; the main provider's `apiBaseUrl` is NOT inherited
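If every rubric should use the same grader (for example, behind a relay), Promptfoo also supports setting the grading provider once under `defaultTest.options.provider` instead of repeating it per assertion. A sketch, assuming your Promptfoo version supports this option and reusing the relay URL from above:

```yaml
# Hypothetical global grader override for all model-graded assertions
defaultTest:
  options:
    provider:
      id: anthropic:messages:claude-sonnet-4-6
      config:
        apiBaseUrl: https://your-relay.example.com/api
```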
## Common Assertion Types
| Type | Usage | Example |
|---|---|---|
| `contains` | Check substring | `value: "expected text"` |
| `icontains` | Case-insensitive substring | `value: "Expected Text"` |
| `equals` | Exact match | `value: "exact output"` |
| `regex` | Pattern match | `value: "\\d+ days"` |
| `python` | Custom logic | `value: file://scripts/check.py:validate` |
| `llm-rubric` | LLM grading | `value: "Response is accurate and clear"` |
| `latency` | Response time | `threshold: 5000` (ms) |
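For reference, several of these assertion types combined in one test case (the values are illustrative):

```yaml
- description: "Combined assertion example"
  vars:
    user_input: "Summarize the refund policy"
  assert:
    - type: contains
      value: "refund"
    - type: regex
      value: "\\d+ days"
    - type: latency
      threshold: 5000        # milliseconds
    - type: llm-rubric
      value: "The summary is accurate and concise."
      threshold: 0.7
```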
## File References
All `file://` paths are resolved relative to the `promptfooconfig.yaml` location (NOT the YAML file containing the reference). This is a common gotcha when `tests:` references a separate YAML file: the `file://` paths inside that test file still resolve from the config root.
```yaml
# Load file content as variable
vars:
  content: file://data/input.txt

# Load prompt from file
prompts:
  - file://prompts/main.md

# Load test cases from file
tests: file://tests/cases.yaml

# Load Python assertion
assert:
  - type: python
    value: file://scripts/check.py:validate
```
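To make the gotcha concrete, assume this hypothetical layout, where `promptfooconfig.yaml` points at `tests/cases.yaml`:

```
project/
├── promptfooconfig.yaml   # tests: file://tests/cases.yaml
├── data/
│   └── context.txt
└── tests/
    └── cases.yaml         # vars: context: file://data/context.txt
```

Inside `tests/cases.yaml`, `file://data/context.txt` resolves to `project/data/context.txt` (relative to the config), not to `project/tests/data/context.txt`.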
## Running Evaluations
```bash
# Basic run
npx promptfoo@latest eval

# With specific config
npx promptfoo@latest eval --config path/to/config.yaml

# Output to file
npx promptfoo@latest eval --output results.json

# Filter tests
npx promptfoo@latest eval --filter-metadata category=math

# View results
npx promptfoo@latest view
```
## Relay / Proxy API Configuration
When using an API relay or proxy instead of direct Anthropic/OpenAI endpoints:
```yaml
providers:
  - id: anthropic:messages:claude-sonnet-4-6
    label: Claude-Sonnet-4.6
    config:
      max_tokens: 4096
      apiBaseUrl: https://your-relay.example.com/api  # Promptfoo appends /v1/messages

# CRITICAL: maxConcurrency MUST be under commandLineOptions (NOT top-level)
commandLineOptions:
  maxConcurrency: 1  # Respect relay rate limits
```
Key rules:

- `apiBaseUrl` goes in `providers[].config`; Promptfoo appends `/v1/messages` automatically
- `maxConcurrency` must be under `commandLineOptions:`; placing it at the top level is silently ignored
- When using a relay with LLM-as-judge, set `maxConcurrency: 1` to avoid concurrent request limits (generation and grading share the same pool)
- Pass the relay token via the `ANTHROPIC_API_KEY` env var (example below)
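A minimal invocation sketch for the last rule, assuming the relay accepts its token through the same `ANTHROPIC_API_KEY` variable the direct Anthropic API uses (the token value is a placeholder):

```bash
# Export the relay token before running the eval
export ANTHROPIC_API_KEY="your-relay-token"
npx promptfoo@latest eval
```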
## Troubleshooting
**Python not found:**

```bash
export PROMPTFOO_PYTHON=python3
```
**Large outputs truncated:** Outputs over 30,000 characters are truncated. Use `head_limit` in assertions.
**File not found errors:** All `file://` paths resolve relative to the `promptfooconfig.yaml` location.
**maxConcurrency ignored (shows "up to N at a time"):** `maxConcurrency` must be under `commandLineOptions:`, not at the YAML top level. This is a common mistake.
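The two placements side by side, per the rule above (the limit value is illustrative):

```yaml
# Ignored (silently): top-level key is not read
maxConcurrency: 1

# Honored: nested under commandLineOptions
commandLineOptions:
  maxConcurrency: 1
```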
**LLM-as-judge returns 401 with relay API:** Each `llm-rubric` assertion must have its own `provider` with `apiBaseUrl`. The main provider config is not inherited by grader assertions.
**HTML tags in model output inflating metrics:** Models may output `<br>`, `<b>`, etc. in structured content. Strip HTML in Python assertions before measuring:

```python
import re

clean_text = re.sub(r'<[^>]+>', '', raw_text)
```
## Echo Provider (Preview Mode)
Use the echo provider to preview rendered prompts without making API calls:
```yaml
# promptfooconfig-preview.yaml
providers:
  - echo  # Returns prompt as output, no API calls

tests:
  - vars:
      input: "test content"
```
Use cases:
- Preview prompt rendering before expensive API calls
- Verify few-shot examples are loaded correctly
- Debug variable substitution issues
- Validate prompt structure
```bash
# Run preview mode
npx promptfoo@latest eval --config promptfooconfig-preview.yaml
```
Cost: free; no API tokens are consumed.
## Advanced Few-Shot Implementation
### Multi-turn Conversation Pattern
For complex few-shot learning with full examples:
[ {"role": "system", "content": "{{system_prompt}}"}, // Few-shot Example 1 {"role": "user", "content": "Task: {{example_input_1}}"}, {"role": "assistant", "content": "{{example_output_1}}"}, // Few-shot Example 2 (optional) {"role": "user", "content": "Task: {{example_input_2}}"}, {"role": "assistant", "content": "{{example_output_2}}"}, // Actual test {"role": "user", "content": "Task: {{actual_input}}"} ]
Test case configuration:
```yaml
tests:
  - vars:
      system_prompt: file://prompts/system.md
      # Few-shot examples
      example_input_1: file://data/examples/input1.txt
      example_output_1: file://data/examples/output1.txt
      example_input_2: file://data/examples/input2.txt
      example_output_2: file://data/examples/output2.txt
      # Actual test
      actual_input: file://data/test1.txt
```
Best practices:
- Use 1-3 few-shot examples (more may dilute effectiveness)
- Ensure examples match the task format exactly
- Load examples from files for better maintainability
- Use echo provider first to verify structure
## Long Text Handling
For Chinese/long-form content evaluations (10k+ characters):
Configuration:
```yaml
providers:
  - id: anthropic:messages:claude-sonnet-4-6
    config:
      max_tokens: 8192  # Increase for long outputs

defaultTest:
  assert:
    - type: python
      value: file://scripts/metrics.py:check_length
```
Python assertion for text metrics:
```python
import re


def strip_tags(text: str) -> str:
    """Remove HTML tags for pure text."""
    return re.sub(r'<[^>]+>', '', text)


def check_length(output: str, context: dict) -> dict:
    """Check output length constraints."""
    raw_input = context['vars'].get('raw_input', '')
    input_len = len(strip_tags(raw_input))
    output_len = len(strip_tags(output))
    reduction_ratio = (1 - output_len / input_len) if input_len > 0 else 0

    return {
        "pass": 0.7 <= reduction_ratio <= 0.9,
        "score": reduction_ratio,
        "reason": f"Reduction: {reduction_ratio:.1%} (target: 70-90%)",
        "named_scores": {
            "input_length": input_len,
            "output_length": output_len,
            "reduction_ratio": reduction_ratio
        }
    }
```
## Real-World Example
Project: Chinese short-video content curation from long transcripts
Structure:
```
tiaogaoren/
├── promptfooconfig.yaml            # Production config
├── promptfooconfig-preview.yaml    # Preview config (echo provider)
├── prompts/
│   ├── tiaogaoren-prompt.json      # Chat format with few-shot
│   └── v4/system-v4.md             # System prompt
├── tests/cases.yaml                # 3 test samples
├── scripts/metrics.py              # Custom metrics (reduction ratio, etc.)
├── data/                           # 5 samples (2 few-shot, 3 eval)
└── results/
```
See `./tiaogaoren/` (the example project root) for the full implementation.
## Resources
For a detailed API reference and advanced patterns, see `references/promptfoo_api.md`.