
Braintrust — AI Evaluation and Observability

install
source · Clone the upstream repo
git clone https://github.com/ComeOnOliver/skillshub
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/ComeOnOliver/skillshub "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/TerminalSkills/skills/braintrust" ~/.claude/skills/comeonoliver-skillshub-braintrust && rm -rf "$T"
manifest: skills/TerminalSkills/skills/braintrust/SKILL.md
source content

Braintrust — AI Evaluation and Observability

You are an expert in Braintrust, the evaluation and observability platform for AI applications. You help developers run systematic evaluations, compare model versions, track experiments, log production traces, and measure quality metrics — with a focus on making AI development as rigorous as traditional software testing.

Core Capabilities

import { Eval } from "braintrust";
import { Factuality, ClosedQA } from "autoevals";

// Authentication: the SDK reads BRAINTRUST_API_KEY from the environment.

// Run an evaluation
await Eval("support-chatbot", {
  data: () => [
    { input: "How do I reset my password?", expected: "Go to Settings > Security > Reset Password" },
    { input: "What's the pricing?", expected: "Plans start at $29/month" },
    { input: "I need a refund", expected: "Contact support at help@example.com" },
  ],
  task: async (input) => {
    const response = await callChatbot(input);
    return response.text;
  },
  scores: [
    // Built-in scorers from autoevals
    Factuality,                            // Does output match expected facts?
    ClosedQA,                              // Is the answer correct given context?
    // Custom scorer: receives a single args object with input, output, and expected
    ({ output, expected }) => {
      const containsKey = expected.toLowerCase().split(" ")
        .some(word => output.toLowerCase().includes(word));
      return { name: "keyword_match", score: containsKey ? 1 : 0 };
    },
  ],
});
// Results visible in Braintrust dashboard with diffs, regressions, improvements
# Python
from braintrust import Eval
from autoevals import Factuality, AnswerRelevancy

# test_pairs and rag_pipeline are assumed to be defined elsewhere in your project
Eval(
    "rag-pipeline",
    data=lambda: [{"input": q, "expected": a} for q, a in test_pairs],
    task=lambda input: rag_pipeline.query(input),
    scores=[Factuality, AnswerRelevancy],  # AnswerRelevancy is autoevals' RAG relevance scorer
)

Installation

npm install braintrust autoevals
# or
pip install braintrust autoevals
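
Both SDKs authenticate via the BRAINTRUST_API_KEY environment variable (the same variable the examples above read), so set it in your shell or CI secrets before running evals:

export BRAINTRUST_API_KEY=<your-api-key>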

Best Practices

  1. Eval-driven development — Write evals first, then iterate on prompts/models; measure before optimizing
  2. Built-in scorers — Use Factuality, ClosedQA, and the relevance scorers from autoevals for LLM-based quality scoring
  3. Custom scorers — Add domain-specific metrics; combine them with the built-in scorers for comprehensive evaluation
  4. Experiments — Each eval run is an experiment; compare runs side-by-side in the dashboard
  5. Production logging — Use braintrust.traced() for production observability; traces land in the same dashboard as evals (see the first sketch after this list)
  6. CI integration — Run evals in CI; fail builds on quality regressions
  7. Dataset management — Store test datasets in Braintrust; version and share them across the team
  8. A/B comparison — Compare two model versions on the same dataset; the dashboard reports statistical significance (see the second sketch after this list)
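
A minimal sketch of the production-logging pattern from item 5, using the braintrust SDK's initLogger and traced; handleQuery and callModel are hypothetical stand-ins for your own request handler and model call, and the project name is a placeholder.

import { initLogger, traced } from "braintrust";

// Send production traces to a Braintrust project (name is a placeholder).
initLogger({ projectName: "support-chatbot" });

// Wrap each request in a span so its input, output, and metadata are logged.
async function handleQuery(question: string): Promise<string> {
  return traced(async (span) => {
    const answer = await callModel(question);  // callModel: your own inference call
    span.log({ input: question, output: answer, metadata: { route: "support" } });
    return answer;
  });
}

Logged traces appear alongside eval experiments in the Braintrust UI, which is what "same dashboard as evals" means in item 5.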
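
And a sketch of items 7 and 8 together: pull a dataset stored in Braintrust and run the same eval once per model version so the resulting experiments can be compared side-by-side. This assumes initDataset fetches the named dataset, that experimentName and metadata are accepted as experiment-level options, and that buildTask is a hypothetical factory returning the task function; all names are placeholders.

import { Eval, initDataset } from "braintrust";
import { Factuality } from "autoevals";

// One experiment per model version, both over the same versioned dataset.
for (const model of ["model-baseline", "model-candidate"]) {
  await Eval("support-chatbot", {
    experimentName: `ab-comparison-${model}`,          // assumed option for naming the experiment
    data: () => initDataset("support-chatbot", { dataset: "regression-set" }),
    task: buildTask(model),                            // hypothetical factory: (input) => Promise<string>
    scores: [Factuality],
    metadata: { model },                               // tag each experiment with the model under test
  });
}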