Learn-skills.dev ai-observability-promptfoo
Testing and evaluation framework for LLM prompts and applications -- promptfooconfig.yaml, assertions, model-graded evals, red teaming, CI/CD integration, custom providers, and comparative evaluation
git clone https://github.com/NeverSight/learn-skills.dev
T=$(mktemp -d) && git clone --depth=1 https://github.com/NeverSight/learn-skills.dev "$T" && mkdir -p ~/.claude/skills && cp -r "$T/data/skills-md/agents-inc/skills/ai-observability-promptfoo" ~/.claude/skills/neversight-learn-skills-dev-ai-observability-promptfoo && rm -rf "$T"
data/skills-md/agents-inc/skills/ai-observability-promptfoo/SKILL.md

Promptfoo Patterns
Quick Guide: Use promptfoo for systematic LLM evaluation. Define prompts, providers, and test cases in `promptfooconfig.yaml`. Use assertion types (`contains`, `is-json`, `llm-rubric`, `similar`, `cost`, `latency`) to validate outputs. Use `promptfoo eval` to run (exits with code 100 on test failures), `promptfoo view` for the results UI. Use model-graded assertions (`llm-rubric`, `factuality`) for subjective quality. Use `promptfoo redteam run` for security scanning. Use the `--share` flag or `promptfoo share` to share results. All provider API keys come from environment variables -- never hardcode them.
<critical_requirements>
CRITICAL: Before Using This Skill
All code must follow project conventions in CLAUDE.md (kebab-case, named exports, import ordering, `import type`, named constants)
(You MUST define test cases with explicit `assert` arrays -- tests without assertions only capture output without validating it)
(You MUST use `llm-rubric` for subjective quality evaluation -- do NOT rely solely on deterministic assertions for natural language output)
(You MUST set `threshold` on similarity and model-graded assertions -- omitting thresholds uses defaults that may not match your quality bar)
(You MUST use environment variables for all API keys -- never hardcode keys in promptfooconfig.yaml or provider configs)
(You MUST verify the `promptfoo eval` exit code in CI pipelines -- it returns exit code 100 on test failures, exit code 1 on other errors)
</critical_requirements>
Auto-detection: promptfoo, promptfooconfig, promptfooconfig.yaml, promptfoo eval, promptfoo view, promptfoo redteam, llm-rubric, model-graded-closedqa, promptfoo share, promptfoo cache, assertion type, LLM evaluation, prompt testing, red teaming, PROMPTFOO_CONFIG
When to use:
- Writing or evaluating LLM prompts across one or more providers
- Setting up automated test suites for LLM-powered features
- Comparing model outputs side-by-side (GPT vs Claude vs Gemini)
- Running model-graded evaluations (LLM-as-a-judge)
- Red teaming LLM applications for security vulnerabilities
- Integrating LLM quality gates into CI/CD pipelines
- Validating structured output (JSON, function calls) from LLMs
Key patterns covered:
- `promptfooconfig.yaml` structure (prompts, providers, tests, defaultTest)
- Assertion types (deterministic, model-graded, performance)
- Custom TypeScript providers
- Red teaming configuration (plugins, strategies)
- CI/CD integration with GitHub Actions
- Programmatic API (`evaluate()` function)
- Result sharing and caching
When NOT to use:
- Unit testing application code (use your test runner)
- Load testing / benchmarking API throughput (use a load testing tool)
- Runtime monitoring of production LLM calls (use observability tooling)
Examples Index
- Core: Config & Assertions -- promptfooconfig.yaml structure, providers, prompts, test cases, assertion types
- Model-Graded & Advanced Assertions -- llm-rubric, factuality, similar, context evaluation, custom assertions
- Red Teaming -- Security scanning, plugins, strategies, presets
- Custom Providers & Programmatic API -- TypeScript providers, evaluate() function, CI/CD integration
- Quick API Reference -- CLI commands, assertion type table, provider IDs, red team plugins
<philosophy>
Philosophy
Promptfoo brings test-driven development to LLM applications. Instead of manually checking outputs, you define expected behaviors as assertions and run them systematically across prompts and providers.
Core principles:
- Declarative test definitions -- YAML config over imperative test scripts. Define prompts, providers, test cases, and assertions in `promptfooconfig.yaml`. No code required for standard evaluations.
- Assertion-driven validation -- Every test case should have assertions. Deterministic assertions (`contains`, `is-json`, `equals`) for structured output; model-graded assertions (`llm-rubric`, `factuality`) for subjective quality.
- Comparative evaluation -- Run the same tests across multiple providers or prompt variants simultaneously. The results matrix shows which combination performs best.
- Shift-left LLM testing -- Catch prompt regressions in CI before they reach production. `promptfoo eval` exits with code 100 on test failures, making it a natural CI quality gate.
- Red teaming as a first-class concern -- Security scanning for prompt injection, PII leakage, harmful content, and jailbreak vulnerabilities is built in, not bolted on.
</philosophy>
<patterns>
Core Patterns
Pattern 1: Basic Configuration
Every promptfoo project starts with `promptfooconfig.yaml`. Three required sections: `prompts`, `providers`, and `tests`.
```yaml
# promptfooconfig.yaml
description: "Translation quality evaluation"

prompts:
  - "Convert the following to {{language}}: {{input}}"

providers:
  - openai:gpt-4o
  - anthropic:messages:claude-sonnet-4-6

tests:
  - vars:
      language: French
      input: Hello world
    assert:
      - type: icontains
        value: "bonjour"
      - type: llm-rubric
        value: "Output is a natural French translation, not word-for-word"
```
Why good: Declarative config, multi-provider comparison, both deterministic and model-graded assertions
```yaml
# BAD: Tests without assertions
tests:
  - vars:
      language: French
      input: Hello world
    # No assert array -- output is captured but never validated
```
Why bad: Tests without assertions only log output; they never fail -- you lose the entire point of automated evaluation
See: examples/core.md for prompts from files, provider config, defaultTest, variable loading from CSV
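To make the file-based and shared-default features concrete, here is a minimal sketch; the file names (`prompts/translate.txt`, `tests.csv`) are placeholders, and file:// prompt references, `defaultTest`, and CSV-backed tests should be confirmed against your promptfoo version's docs:

```yaml
# promptfooconfig.yaml -- sketch with placeholder file names
prompts:
  - file://prompts/translate.txt # prompt template loaded from a file

defaultTest:
  assert:
    # Merges into every test case's assertions (it does not replace them)
    - type: latency
      threshold: 5000

tests: file://tests.csv # CSV column headers become {{variables}} in prompts
```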
Pattern 2: Deterministic Assertions
Use for outputs with predictable, verifiable structure.
```yaml
assert:
  # String matching
  - type: contains
    value: "error"
  - type: icontains # case-insensitive
    value: "success"
  - type: not-contains
    value: "internal server error"
  - type: starts-with
    value: "{"
  - type: regex
    value: "\\d{4}-\\d{2}-\\d{2}" # date pattern

  # Structured output
  - type: is-json
  - type: contains-json
  - type: is-valid-openai-tools-call

  # Performance
  - type: cost
    threshold: 0.01 # max $0.01 per call
  - type: latency
    threshold: 5000 # max 5 seconds
```
Why good: Fast, deterministic, no LLM cost for evaluation, catches structural regressions immediately
```yaml
# BAD: Using llm-rubric for JSON validation
assert:
  - type: llm-rubric
    value: "Output must be valid JSON"
```
Why bad: Expensive (requires an LLM call), slower, and non-deterministic -- `is-json` does this deterministically for free
See: examples/core.md for all deterministic assertion types with examples
Pattern 3: Model-Graded Assertions
Use for subjective quality where deterministic checks cannot capture intent.
```yaml
assert:
  - type: llm-rubric
    value: "Response is helpful, accurate, and conversational in tone"
    provider: openai:gpt-4o
  - type: factuality
    value: "The capital of France is Paris. It has a population of ~2.1 million."
    provider: openai:gpt-4o
  - type: similar
    value: "The weather in Paris is sunny today"
    threshold: 0.8
  - type: model-graded-closedqa
    value: "Paris is the capital of France"
    provider: openai:gpt-4o
```
Why good: Evaluates subjective quality that deterministic assertions cannot capture, configurable grading provider
```yaml
# BAD: No threshold on similar assertion
assert:
  - type: similar
    value: "expected output"
    # Missing threshold -- uses default which may be too lenient or strict
```
Why bad: The default similarity threshold may not match your quality bar -- always set it explicitly
See: examples/model-graded.md for llm-rubric with custom providers, context evaluation, factuality, custom grading prompts
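For RAG pipelines, the context-evaluation assertions referenced above grade the output against retrieved context passed in as a variable. A hedged sketch -- the assertion names (`context-faithfulness`, `context-relevance`) and the `query`/`context` vars follow promptfoo's docs, but verify them for your version:

```yaml
tests:
  - vars:
      query: "What is our refund window?"
      context: "Refunds are accepted within 30 days of purchase."
    assert:
      - type: context-faithfulness # output must be grounded in the context var
        threshold: 0.9
      - type: context-relevance # retrieved context must actually address the query
        threshold: 0.8
```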
Pattern 4: Red Teaming
Use the `redteam` section to scan for security vulnerabilities.
```yaml
# promptfooconfig.yaml
targets:
  - openai:gpt-4o

redteam:
  purpose: "Customer support chatbot for an e-commerce platform"
  numTests: 10
  plugins:
    - harmful
    - pii
    - contracts
    - hallucination
    - prompt-extraction
  strategies:
    - jailbreak
    - prompt-injection
```
Why good: Declarative security scanning, purpose provides context for realistic attacks, composable plugins and strategies
```yaml
# BAD: Red team without purpose
redteam:
  plugins:
    - harmful
  # Missing purpose -- attacks will be generic and less effective
```
Why bad: Without `purpose`, the red team generator creates generic attacks that miss application-specific vulnerabilities
See: examples/red-teaming.md for presets (OWASP, NIST), advanced strategies, multi-turn attacks
Pattern 5: Custom TypeScript Provider
Use when your LLM integration is not a direct API call (RAG pipelines, agent chains, custom middleware).
```typescript
// providers/my-app.ts
import type {
  ApiProvider,
  ProviderOptions,
  ProviderResponse,
  CallApiContextParams,
} from "promptfoo";

// Placeholder import: myApp stands in for your application's LLM pipeline
import { myApp } from "./my-app";

// NOTE: default export required by promptfoo's file:// provider loader
export default class MyAppProvider implements ApiProvider {
  private config: Record<string, unknown>;

  constructor(options: ProviderOptions) {
    this.config = options.config || {};
  }

  id(): string {
    return "my-app-provider";
  }

  async callApi(
    prompt: string,
    context?: CallApiContextParams,
  ): Promise<ProviderResponse> {
    // Call your application's LLM pipeline
    const result = await myApp.processQuery(prompt);
    return {
      output: result.answer,
      tokenUsage: {
        total: result.totalTokens,
        prompt: result.promptTokens,
        completion: result.completionTokens,
      },
      cost: result.cost,
    };
  }
}
```
```yaml
# promptfooconfig.yaml
providers:
  - file://providers/my-app.ts
```
Why good: Type-safe, full control over LLM pipeline, reports token usage and cost for assertions
See: examples/custom-providers.md for inline function providers, programmatic API, CI/CD integration
Pattern 6: CI/CD Integration
Run evaluations in CI with quality gates.
```yaml
# .github/workflows/llm-eval.yml
name: LLM Eval
on:
  pull_request:
    paths:
      - "prompts/**"
      - "promptfooconfig.yaml"

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: "22"
      - uses: actions/cache@v4
        with:
          path: ~/.cache/promptfoo
          key: ${{ runner.os }}-promptfoo-v1
      - name: Run eval
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: npx promptfoo@latest eval -o results.json --share
```
Why good: Caches LLM responses across runs; `promptfoo eval` exits with code 100 on test failures, so CI fails automatically; `--share` generates a shareable results URL
See: examples/custom-providers.md for npm scripts, quality gate thresholds, programmatic evaluation
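A hedged sketch of the programmatic API mentioned above: `evaluate()` accepts roughly the same shape as the YAML config. The exact return shape varies across promptfoo versions, so treat the result handling below as illustrative:

```typescript
// eval.ts -- programmatic evaluation sketch (result shape may vary by version)
import promptfoo from "promptfoo";

const summary = await promptfoo.evaluate(
  {
    prompts: ["Convert the following to {{language}}: {{input}}"],
    providers: ["openai:gpt-4o"],
    tests: [
      {
        vars: { language: "French", input: "Hello world" },
        assert: [{ type: "icontains", value: "bonjour" }],
      },
    ],
  },
  { maxConcurrency: 2 }, // throttle concurrent provider calls
);

// Mirror the CLI's quality gate: exit 100 if any test failed
const failed = summary.results.filter((r) => !r.success).length;
process.exit(failed > 0 ? 100 : 0);
```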
</patterns>
<decision_framework>
Decision Framework
Which Assertion Type to Use
```text
What are you validating?
+-- Exact or structural match?
|   +-- Exact text -> equals
|   +-- Contains substring -> contains / icontains
|   +-- Regex pattern -> regex
|   +-- Valid JSON -> is-json
|   +-- Valid function call -> is-valid-openai-tools-call
|   +-- Cost under budget -> cost (with threshold)
|   +-- Response time -> latency (with threshold)
+-- Subjective quality?
|   +-- General quality criteria -> llm-rubric
|   +-- Factual accuracy against ground truth -> factuality
|   +-- Semantic similarity -> similar (with threshold)
|   +-- Closed-domain QA accuracy -> model-graded-closedqa
|   +-- RAG context fidelity -> context-faithfulness
+-- Custom logic?
    +-- JavaScript function -> javascript
    +-- Python function -> python
    +-- External service -> webhook
```
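For the custom-logic branch, a `javascript` assertion embeds the check directly in YAML: `output` holds the model response, and the expression's truthiness (or a 0-1 score) decides pass/fail. A minimal sketch; the file path is a placeholder:

```yaml
assert:
  # Inline expression: pass if the output stays under 500 characters
  - type: javascript
    value: "output.length <= 500"
  # File-based function for more complex logic (placeholder path)
  - type: javascript
    value: file://asserts/check-tone.js
```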
When to Use Red Teaming vs Eval
```text
What are you testing?
+-- Prompt quality and correctness?
|   +-- Use promptfoo eval with test cases and assertions
+-- Security vulnerabilities?
|   +-- Use promptfoo redteam run with plugins and strategies
+-- Both?
    +-- Run eval for quality, redteam for security -- separate configs or sections
```
Provider Selection
```text
How does your LLM integration work?
+-- Direct API call to OpenAI/Anthropic/etc.?
|   +-- Use a built-in provider: openai:gpt-4o, anthropic:messages:claude-sonnet-4-6
+-- Custom pipeline (RAG, agents, middleware)?
|   +-- Use a custom TypeScript provider: file://providers/my-app.ts
+-- HTTP endpoint?
|   +-- Use an HTTP provider: id: https://api.example.com/chat
+-- Multiple providers to compare?
    +-- List all in the providers array -- promptfoo runs tests against each
```
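A hedged sketch of the HTTP-provider branch; the endpoint and request body shape are placeholders, and field names such as `transformResponse` should be checked against your promptfoo version's HTTP provider docs:

```yaml
providers:
  - id: https://api.example.com/chat # placeholder endpoint
    config:
      method: POST
      headers:
        Content-Type: application/json
      body:
        message: "{{prompt}}" # promptfoo substitutes the rendered prompt
      transformResponse: json.answer # extract the output field from the JSON response
```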
</decision_framework>
<red_flags>
RED FLAGS
High Priority Issues:
- Tests without `assert` arrays (output is captured but never validated -- tests always "pass")
- Not checking the `promptfoo eval` exit code in CI (`promptfoo eval` exits 100 on test failures -- ensure your CI pipeline treats non-zero exit codes as failures)
- Hardcoded API keys in `promptfooconfig.yaml` (use environment variables)
- Using `llm-rubric` for checks that `is-json` or `contains` can do deterministically (wastes money and adds non-determinism)
- Red teaming without `purpose` (generic attacks miss application-specific vulnerabilities)
Medium Priority Issues:
- Missing `threshold` on `similar` assertions (default may not match your quality bar)
- Not caching in CI (every run makes full API calls -- expensive and slow)
- Using `model-graded-closedqa` when `llm-rubric` would be simpler (closedqa is for specific ground-truth QA)
- Not setting `provider` on model-graded assertions (uses a default which may not be the grader you want)
- Running red team with the default `numTests: 5` in production scans (too few for comprehensive coverage)
Common Mistakes:
- Confusing `prompts` (the LLM prompt templates) with `tests` (the evaluation cases) -- prompts define what to send, tests define what to check
- Using `equals` for natural language output (LLM output is non-deterministic -- use `llm-rubric` or `similar`)
- Forgetting the `{{variable}}` syntax in prompts (promptfoo uses Nunjucks templating, not `${variable}`)
- Putting assertions in `defaultTest` that should only apply to specific tests (assertions in `defaultTest` apply to ALL tests)
- Using paths without the `file://` prefix (promptfoo treats bare paths as literal strings, not file references)
Gotchas & Edge Cases:
- `promptfoo eval` caches LLM responses by default -- use `promptfoo cache clear` or `--no-cache` to force fresh calls
- `--share` uploads results to promptfoo's servers -- do not use it with sensitive data unless self-hosting
- Red team `strategies` wrap `plugins` output -- a plugin generates the malicious content, a strategy delivers it (e.g., via jailbreak encoding)
- `defaultTest.assert` merges with per-test assertions, it does not replace them -- both arrays run
- CSV test files map column headers to variable names -- a header `input` becomes `{{input}}` in prompts
- `transform` in test options runs JavaScript on the output before assertions -- useful for extracting JSON from markdown-wrapped responses (see the sketch after this list)
- Provider configs in YAML use the `config:` key for model parameters (`temperature`, `max_tokens`), not top-level fields
- The `weight` property on assertions affects scoring in the results UI but does not change pass/fail behavior
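A minimal sketch of the `transform` gotcha above -- stripping a markdown code fence before `is-json` runs (the regex and test case are illustrative):

```yaml
tests:
  - vars:
      input: "List three colors as JSON"
    options:
      # Runs before assertions; strips a leading/trailing code fence if present
      transform: 'output.replace(/^[`]{3}(?:json)?\s*|\s*[`]{3}$/g, "")'
    assert:
      - type: is-json
```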
</red_flags>
<critical_reminders>
CRITICAL REMINDERS
All code must follow project conventions in CLAUDE.md (kebab-case, named exports, import ordering, `import type`, named constants)
(You MUST define test cases with explicit `assert` arrays -- tests without assertions only capture output without validating it)
(You MUST use `llm-rubric` for subjective quality evaluation -- do NOT rely solely on deterministic assertions for natural language output)
(You MUST set `threshold` on similarity and model-graded assertions -- omitting thresholds uses defaults that may not match your quality bar)
(You MUST use environment variables for all API keys -- never hardcode keys in promptfooconfig.yaml or provider configs)
(You MUST verify the `promptfoo eval` exit code in CI pipelines -- it returns exit code 100 on test failures, exit code 1 on other errors)
Failure to follow these rules will produce untested, insecure, or falsely-passing LLM evaluation pipelines.
</critical_reminders>