Learn-skills.dev ai-observability-promptfoo
Testing and evaluation framework for LLM prompts and applications -- promptfooconfig.yaml, assertions, model-graded evals, red teaming, CI/CD integration, custom providers, and comparative evaluation
git clone https://github.com/NeverSight/learn-skills.dev
T=$(mktemp -d) && git clone --depth=1 https://github.com/NeverSight/learn-skills.dev "$T" && mkdir -p ~/.claude/skills && cp -r "$T/data/skills-md/agents-inc/skills/ai-observability-promptfoo" ~/.claude/skills/neversight-learn-skills-dev-ai-observability-promptfoo && rm -rf "$T"
data/skills-md/agents-inc/skills/ai-observability-promptfoo/SKILL.md

Promptfoo Patterns
Quick Guide: Use promptfoo for systematic LLM evaluation. Define prompts, providers, and test cases in `promptfooconfig.yaml`. Use assertion types (`contains`, `is-json`, `llm-rubric`, `similar`, `cost`, `latency`) to validate outputs. Use `promptfoo eval` to run (exits with code 100 on test failures), `promptfoo view` for the results UI. Use model-graded assertions (`llm-rubric`, `factuality`) for subjective quality. Use `promptfoo redteam run` for security scanning. Use the `--share` flag or `promptfoo share` to share results. All provider API keys come from environment variables -- never hardcode them.
<critical_requirements>
CRITICAL: Before Using This Skill
All code must follow project conventions in CLAUDE.md (kebab-case, named exports, import ordering, `import type`, named constants)
(You MUST define test cases with explicit `assert` arrays -- tests without assertions only capture output without validating it)
(You MUST use `llm-rubric` for subjective quality evaluation -- do NOT rely solely on deterministic assertions for natural language output)
(You MUST set `threshold` on similarity and model-graded assertions -- omitting thresholds uses defaults that may not match your quality bar)
(You MUST use environment variables for all API keys -- never hardcode keys in promptfooconfig.yaml or provider configs)
(You MUST verify the `promptfoo eval` exit code in CI pipelines -- it returns exit code 100 on test failures, exit code 1 on other errors)
</critical_requirements>
Auto-detection: promptfoo, promptfooconfig, promptfooconfig.yaml, promptfoo eval, promptfoo view, promptfoo redteam, llm-rubric, model-graded-closedqa, promptfoo share, promptfoo cache, assertion type, LLM evaluation, prompt testing, red teaming, PROMPTFOO_CONFIG
When to use:
- Writing or evaluating LLM prompts across one or more providers
- Setting up automated test suites for LLM-powered features
- Comparing model outputs side-by-side (GPT vs Claude vs Gemini)
- Running model-graded evaluations (LLM-as-a-judge)
- Red teaming LLM applications for security vulnerabilities
- Integrating LLM quality gates into CI/CD pipelines
- Validating structured output (JSON, function calls) from LLMs
Key patterns covered:
- `promptfooconfig.yaml` structure (prompts, providers, tests, defaultTest)
- Assertion types (deterministic, model-graded, performance)
- Custom TypeScript providers
- Red teaming configuration (plugins, strategies)
- CI/CD integration with GitHub Actions
- Programmatic API (`evaluate()` function)
- Result sharing and caching
When NOT to use:
- Unit testing application code (use your test runner)
- Load testing / benchmarking API throughput (use a load testing tool)
- Runtime monitoring of production LLM calls (use observability tooling)
Examples Index
- Core: Config & Assertions -- promptfooconfig.yaml structure, providers, prompts, test cases, assertion types
- Model-Graded & Advanced Assertions -- llm-rubric, factuality, similar, context evaluation, custom assertions
- Red Teaming -- Security scanning, plugins, strategies, presets
- Custom Providers & Programmatic API -- TypeScript providers, evaluate() function, CI/CD integration
- Quick API Reference -- CLI commands, assertion type table, provider IDs, red team plugins
<philosophy>
Philosophy
Promptfoo brings test-driven development to LLM applications. Instead of manually checking outputs, you define expected behaviors as assertions and run them systematically across prompts and providers.
Core principles:
- Declarative test definitions -- YAML config over imperative test scripts. Define prompts, providers, test cases, and assertions in `promptfooconfig.yaml`. No code required for standard evaluations.
- Assertion-driven validation -- Every test case should have assertions. Deterministic assertions (`contains`, `is-json`, `equals`) for structured output; model-graded assertions (`llm-rubric`, `factuality`) for subjective quality.
- Comparative evaluation -- Run the same tests across multiple providers or prompt variants simultaneously. The results matrix shows which combination performs best.
- Shift-left LLM testing -- Catch prompt regressions in CI before they reach production. `promptfoo eval` exits with code 100 on test failures, making it a natural CI quality gate.
- Red teaming as a first-class concern -- Security scanning for prompt injection, PII leakage, harmful content, and jailbreak vulnerabilities is built in, not bolted on.
</philosophy>
<patterns>
Core Patterns
Pattern 1: Basic Configuration
Every promptfoo project starts with `promptfooconfig.yaml`. Three required sections: `prompts`, `providers`, and `tests`.
```yaml
# promptfooconfig.yaml
description: "Translation quality evaluation"

prompts:
  - "Convert the following to {{language}}: {{input}}"

providers:
  - openai:gpt-4o
  - anthropic:messages:claude-sonnet-4-6

tests:
  - vars:
      language: French
      input: Hello world
    assert:
      - type: icontains
        value: "bonjour"
      - type: llm-rubric
        value: "Output is a natural French translation, not word-for-word"
```
Why good: Declarative config, multi-provider comparison, both deterministic and model-graded assertions
```yaml
# BAD: Tests without assertions
tests:
  - vars:
      language: French
      input: Hello world
    # No assert array -- output is captured but never validated
```
Why bad: Tests without assertions only log output; they never fail -- you lose the entire point of automated evaluation
See: examples/core.md for prompts from files, provider config, defaultTest, variable loading from CSV
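To make the file-based and shared-default features concrete, here is a minimal sketch; the file names (`prompts/translate.txt`, `tests.csv`) are placeholders, and file:// prompt references, `defaultTest`, and CSV-backed tests should be confirmed against your promptfoo version's docs:

```yaml
# promptfooconfig.yaml -- sketch with placeholder file names
prompts:
  - file://prompts/translate.txt # prompt template loaded from a file

defaultTest:
  assert:
    # Merges into every test case's assertions (it does not replace them)
    - type: latency
      threshold: 5000

tests: file://tests.csv # CSV column headers become {{variables}} in prompts
```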
Pattern 2: Deterministic Assertions
Use for outputs with predictable, verifiable structure.
```yaml
assert:
  # String matching
  - type: contains
    value: "error"
  - type: icontains # case-insensitive
    value: "success"
  - type: not-contains
    value: "internal server error"
  - type: starts-with
    value: "{"
  - type: regex
    value: "\\d{4}-\\d{2}-\\d{2}" # date pattern

  # Structured output
  - type: is-json
  - type: contains-json
  - type: is-valid-openai-tools-call

  # Performance
  - type: cost
    threshold: 0.01 # max $0.01 per call
  - type: latency
    threshold: 5000 # max 5 seconds
```
Why good: Fast, deterministic, no LLM cost for evaluation, catches structural regressions immediately
```yaml
# BAD: Using llm-rubric for JSON validation
assert:
  - type: llm-rubric
    value: "Output must be valid JSON"
```
Why bad: Expensive (requires an LLM call), slower, and non-deterministic -- `is-json` does this deterministically for free
See: examples/core.md for all deterministic assertion types with examples
Pattern 3: Model-Graded Assertions
Use for subjective quality where deterministic checks cannot capture intent.
```yaml
assert:
  - type: llm-rubric
    value: "Response is helpful, accurate, and conversational in tone"
    provider: openai:gpt-4o
  - type: factuality
    value: "The capital of France is Paris. It has a population of ~2.1 million."
    provider: openai:gpt-4o
  - type: similar
    value: "The weather in Paris is sunny today"
    threshold: 0.8
  - type: model-graded-closedqa
    value: "Paris is the capital of France"
    provider: openai:gpt-4o
```
Why good: Evaluates subjective quality that deterministic assertions cannot capture, configurable grading provider
```yaml
# BAD: No threshold on similar assertion
assert:
  - type: similar
    value: "expected output"
    # Missing threshold -- uses default which may be too lenient or strict
```
Why bad: The default similarity threshold may not match your quality bar -- always set it explicitly
See: examples/model-graded.md for llm-rubric with custom providers, context evaluation, factuality, custom grading prompts
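For RAG pipelines, the context-evaluation assertions referenced above grade the output against retrieved context passed in as a variable. A hedged sketch -- the assertion names (`context-faithfulness`, `context-relevance`) and the `query`/`context` vars follow promptfoo's docs, but verify them for your version:

```yaml
tests:
  - vars:
      query: "What is our refund window?"
      context: "Refunds are accepted within 30 days of purchase."
    assert:
      - type: context-faithfulness # output must be grounded in the context var
        threshold: 0.9
      - type: context-relevance # retrieved context must actually address the query
        threshold: 0.8
```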
Pattern 4: Red Teaming
Use the `redteam` section to scan for security vulnerabilities.
```yaml
# promptfooconfig.yaml
targets:
  - openai:gpt-4o

redteam:
  purpose: "Customer support chatbot for an e-commerce platform"
  numTests: 10
  plugins:
    - harmful
    - pii
    - contracts
    - hallucination
    - prompt-extraction
  strategies:
    - jailbreak
    - prompt-injection
```
Why good: Declarative security scanning, purpose provides context for realistic attacks, composable plugins and strategies
```yaml
# BAD: Red team without purpose
redteam:
  plugins:
    - harmful
  # Missing purpose -- attacks will be generic and less effective
```
Why bad: Without `purpose`, the red team generator creates generic attacks that miss application-specific vulnerabilities
See: examples/red-teaming.md for presets (OWASP, NIST), advanced strategies, multi-turn attacks
Pattern 5: Custom TypeScript Provider
Use when your LLM integration is not a direct API call (RAG pipelines, agent chains, custom middleware).
```typescript
// providers/my-app.ts
import type {
  ApiProvider,
  ProviderOptions,
  ProviderResponse,
  CallApiContextParams,
} from "promptfoo";

// Placeholder import: myApp stands in for your application's LLM pipeline
import { myApp } from "./my-app";

// NOTE: default export required by promptfoo's file:// provider loader
export default class MyAppProvider implements ApiProvider {
  private config: Record<string, unknown>;

  constructor(options: ProviderOptions) {
    this.config = options.config || {};
  }

  id(): string {
    return "my-app-provider";
  }

  async callApi(
    prompt: string,
    context?: CallApiContextParams,
  ): Promise<ProviderResponse> {
    // Call your application's LLM pipeline
    const result = await myApp.processQuery(prompt);
    return {
      output: result.answer,
      tokenUsage: {
        total: result.totalTokens,
        prompt: result.promptTokens,
        completion: result.completionTokens,
      },
      cost: result.cost,
    };
  }
}
```
```yaml
# promptfooconfig.yaml
providers:
  - file://providers/my-app.ts
```
Why good: Type-safe, full control over LLM pipeline, reports token usage and cost for assertions
See: examples/custom-providers.md for inline function providers, programmatic API, CI/CD integration
Pattern 6: CI/CD Integration
Run evaluations in CI with quality gates.
```yaml
# .github/workflows/llm-eval.yml
name: LLM Eval
on:
  pull_request:
    paths:
      - "prompts/**"
      - "promptfooconfig.yaml"

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: "22"
      - uses: actions/cache@v4
        with:
          path: ~/.cache/promptfoo
          key: ${{ runner.os }}-promptfoo-v1
      - name: Run eval
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: npx promptfoo@latest eval -o results.json --share
```
Why good: Caches LLM responses across runs; `promptfoo eval` exits with code 100 on test failures, so CI fails automatically; `--share` generates a shareable results URL
See: examples/custom-providers.md for npm scripts, quality gate thresholds, programmatic evaluation
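A hedged sketch of the programmatic API mentioned above: `evaluate()` accepts roughly the same shape as the YAML config. The exact return shape varies across promptfoo versions, so treat the result handling below as illustrative:

```typescript
// eval.ts -- programmatic evaluation sketch (result shape may vary by version)
import promptfoo from "promptfoo";

const summary = await promptfoo.evaluate(
  {
    prompts: ["Convert the following to {{language}}: {{input}}"],
    providers: ["openai:gpt-4o"],
    tests: [
      {
        vars: { language: "French", input: "Hello world" },
        assert: [{ type: "icontains", value: "bonjour" }],
      },
    ],
  },
  { maxConcurrency: 2 }, // throttle concurrent provider calls
);

// Mirror the CLI's quality gate: exit 100 if any test failed
const failed = summary.results.filter((r) => !r.success).length;
process.exit(failed > 0 ? 100 : 0);
```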
</patterns>
<decision_framework>
Decision Framework
Which Assertion Type to Use
```text
What are you validating?
+-- Exact or structural match?
|   +-- Exact text -> equals
|   +-- Contains substring -> contains / icontains
|   +-- Regex pattern -> regex
|   +-- Valid JSON -> is-json
|   +-- Valid function call -> is-valid-openai-tools-call
|   +-- Cost under budget -> cost (with threshold)
|   +-- Response time -> latency (with threshold)
+-- Subjective quality?
|   +-- General quality criteria -> llm-rubric
|   +-- Factual accuracy against ground truth -> factuality
|   +-- Semantic similarity -> similar (with threshold)
|   +-- Closed-domain QA accuracy -> model-graded-closedqa
|   +-- RAG context fidelity -> context-faithfulness
+-- Custom logic?
    +-- JavaScript function -> javascript
    +-- Python function -> python
    +-- External service -> webhook
```
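For the custom-logic branch, a `javascript` assertion embeds the check directly in YAML: `output` holds the model response, and the expression's truthiness (or a 0-1 score) decides pass/fail. A minimal sketch; the file path is a placeholder:

```yaml
assert:
  # Inline expression: pass if the output stays under 500 characters
  - type: javascript
    value: "output.length <= 500"
  # File-based function for more complex logic (placeholder path)
  - type: javascript
    value: file://asserts/check-tone.js
```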
When to Use Red Teaming vs Eval
```text
What are you testing?
+-- Prompt quality and correctness?
|   +-- Use promptfoo eval with test cases and assertions
+-- Security vulnerabilities?
|   +-- Use promptfoo redteam run with plugins and strategies
+-- Both?
    +-- Run eval for quality, redteam for security -- separate configs or sections
```
Provider Selection
```text
How does your LLM integration work?
+-- Direct API call to OpenAI/Anthropic/etc.?
|   +-- Use a built-in provider: openai:gpt-4o, anthropic:messages:claude-sonnet-4-6
+-- Custom pipeline (RAG, agents, middleware)?
|   +-- Use a custom TypeScript provider: file://providers/my-app.ts
+-- HTTP endpoint?
|   +-- Use an HTTP provider: id: https://api.example.com/chat
+-- Multiple providers to compare?
    +-- List all in the providers array -- promptfoo runs tests against each
```
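A hedged sketch of the HTTP-provider branch; the endpoint and request body shape are placeholders, and field names such as `transformResponse` should be checked against your promptfoo version's HTTP provider docs:

```yaml
providers:
  - id: https://api.example.com/chat # placeholder endpoint
    config:
      method: POST
      headers:
        Content-Type: application/json
      body:
        message: "{{prompt}}" # promptfoo substitutes the rendered prompt
      transformResponse: json.answer # extract the output field from the JSON response
```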
</decision_framework>
<red_flags>
RED FLAGS
High Priority Issues:
- Tests without `assert` arrays (output is captured but never validated -- tests always "pass")
- Not checking the `promptfoo eval` exit code in CI (`promptfoo eval` exits 100 on test failures -- ensure your CI pipeline treats non-zero exit codes as failures)
- Hardcoded API keys in `promptfooconfig.yaml` (use environment variables)
- Using `llm-rubric` for checks that `is-json` or `contains` can do deterministically (wastes money and adds non-determinism)
- Red teaming without `purpose` (generic attacks miss application-specific vulnerabilities)
Medium Priority Issues:
- Missing `threshold` on `similar` assertions (default may not match your quality bar)
- Not caching in CI (every run makes full API calls -- expensive and slow)
- Using `model-graded-closedqa` when `llm-rubric` would be simpler (closedqa is for specific ground-truth QA)
- Not setting `provider` on model-graded assertions (uses a default which may not be the grader you want)
- Running red team with the default `numTests: 5` in production scans (too few for comprehensive coverage)
Common Mistakes:
- Confusing `prompts` (the LLM prompt templates) with `tests` (the evaluation cases) -- prompts define what to send, tests define what to check
- Using `equals` for natural language output (LLM output is non-deterministic -- use `llm-rubric` or `similar`)
- Forgetting the `{{variable}}` syntax in prompts (promptfoo uses Nunjucks templating, not `${variable}`)
- Putting assertions in `defaultTest` that should only apply to specific tests (assertions in `defaultTest` apply to ALL tests)
- Using paths without the `file://` prefix (promptfoo treats bare paths as literal strings, not file references)
Gotchas & Edge Cases:
- `promptfoo eval` caches LLM responses by default -- use `promptfoo cache clear` or `--no-cache` to force fresh calls
- `--share` uploads results to promptfoo's servers -- do not use it with sensitive data unless self-hosting
- Red team `strategies` wrap `plugins` output -- a plugin generates the malicious content, a strategy delivers it (e.g., via jailbreak encoding)
- `defaultTest.assert` merges with per-test assertions, it does not replace them -- both arrays run
- CSV test files map column headers to variable names -- a header `input` becomes `{{input}}` in prompts
- `transform` in test options runs JavaScript on the output before assertions -- useful for extracting JSON from markdown-wrapped responses (see the sketch after this list)
- Provider configs in YAML use the `config:` key for model parameters (`temperature`, `max_tokens`), not top-level fields
- The `weight` property on assertions affects scoring in the results UI but does not change pass/fail behavior
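A minimal sketch of the `transform` gotcha above -- stripping a markdown code fence before `is-json` runs (the regex and test case are illustrative):

```yaml
tests:
  - vars:
      input: "List three colors as JSON"
    options:
      # Runs before assertions; strips a leading/trailing code fence if present
      transform: 'output.replace(/^[`]{3}(?:json)?\s*|\s*[`]{3}$/g, "")'
    assert:
      - type: is-json
```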
</red_flags>
<critical_reminders>
CRITICAL REMINDERS
All code must follow project conventions in CLAUDE.md (kebab-case, named exports, import ordering, `import type`, named constants)
(You MUST define test cases with explicit `assert` arrays -- tests without assertions only capture output without validating it)
(You MUST use `llm-rubric` for subjective quality evaluation -- do NOT rely solely on deterministic assertions for natural language output)
(You MUST set `threshold` on similarity and model-graded assertions -- omitting thresholds uses defaults that may not match your quality bar)
(You MUST use environment variables for all API keys -- never hardcode keys in promptfooconfig.yaml or provider configs)
(You MUST verify the `promptfoo eval` exit code in CI pipelines -- it returns exit code 100 on test failures, exit code 1 on other errors)
Failure to follow these rules will produce untested, insecure, or falsely-passing LLM evaluation pipelines.
</critical_reminders>