LLMs-Universal-Life-Science-and-Clinical-Skills- agentic-evals-observability
Design evaluation, tracing, monitoring, and rollback discipline for agent systems. Use when an agent workflow is becoming important enough that you need evidence, not vibes, to decide whether it is good.
install
source · Clone the upstream repo
git clone https://github.com/mdbabumiamssm/LLMs-Universal-Life-Science-and-Clinical-Skills-
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/mdbabumiamssm/LLMs-Universal-Life-Science-and-Clinical-Skills- "$T" && mkdir -p ~/.claude/skills && cp -r "$T/Skills/Agentic_AI/Agentic_Evals_Observability" ~/.claude/skills/mdbabumiamssm-llms-universal-life-science-and-clinical-skills-agentic-evals-obse && rm -rf "$T"
manifest: Skills/Agentic_AI/Agentic_Evals_Observability/SKILL.md
Agentic Evals and Observability
Use this skill when the question changes from "can the agent run" to "can we trust it in production".
Workflow
- Define the task classes, success criteria, and failure classes before running benchmarks.
- Instrument traces first so every eval failure can be debugged at the step level.
- Separate offline evaluation from online monitoring; both are required.
- Score for correctness, tool behavior, cost, latency, and safety, not just final-answer quality; a minimal harness along these lines is sketched after this list.
- Set rollback thresholds before deployment so regressions have teeth.
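A minimal sketch of an offline eval harness in this spirit, in Python. `EvalCase`, `EvalResult`, and the injected `run_agent` callable are illustrative assumptions, not names defined by this skill; the exact-match check stands in for whatever mix of exact checks and LLM judges each failure class actually needs.

```python
# Sketch: offline eval harness with task classes, success criteria, and
# multi-dimension scoring. All names here (EvalCase, run_agent, the
# tool-call dict shape) are illustrative assumptions.
import time
from dataclasses import dataclass, field

@dataclass
class EvalCase:
    task_class: str              # e.g. "extraction", "tool-routing"
    prompt: str
    expected: str                # exact-check target for this sketch
    max_cost_usd: float = 0.05   # per-case cost budget
    max_latency_s: float = 30.0

@dataclass
class EvalResult:
    case: EvalCase
    correct: bool
    tool_calls_ok: bool
    cost_usd: float
    latency_s: float
    safety_flags: list = field(default_factory=list)

def run_eval(cases, run_agent):
    """run_agent(prompt) -> (answer, tool_calls, cost_usd) is assumed;
    tool_calls is assumed to be a list of dicts with a "status" key."""
    results = []
    for case in cases:
        start = time.monotonic()
        answer, tool_calls, cost = run_agent(case.prompt)
        latency = time.monotonic() - start
        results.append(EvalResult(
            case=case,
            # Exact check; swap in an LLM judge or human review per task class.
            correct=(answer.strip() == case.expected.strip()),
            tool_calls_ok=all(c.get("status") == "ok" for c in tool_calls),
            cost_usd=cost,
            latency_s=latency,
        ))
    return results
```

Because scoring covers cost and latency budgets per case, a regression on any one dimension surfaces in the same report rather than hiding behind an unchanged answer-quality number.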
Guardrails
- Do not ship agent changes without a representative eval set.
- Do not rely on one metric; combine exact checks, LLM judges, human review, and cost telemetry.
- Record model, prompt, tool config, and environment for every major run.
- Prefer OTel-compatible tracing so data is portable across observability stacks; a sketch follows this list.
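To make the last two guardrails concrete, here is a sketch that stamps model, prompt, tool config, and environment onto OTel spans using the opentelemetry-sdk. The attribute keys (`agent.model`, `agent.prompt_sha256`, and so on) are assumptions, not an established semantic convention.

```python
# Sketch: record model, prompt, tool config, and environment as span
# attributes on an OTel-compatible trace. Attribute names are
# illustrative assumptions, not a standard.
import hashlib, os, platform
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

provider = TracerProvider()
# ConsoleSpanExporter is for local inspection; swap in an OTLP exporter
# to stay portable across observability stacks.
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("agentic-evals")

def traced_step(step_name, model, prompt, tool_config):
    with tracer.start_as_current_span(step_name) as span:
        span.set_attribute("agent.model", model)
        span.set_attribute("agent.prompt_sha256",
                           hashlib.sha256(prompt.encode()).hexdigest())
        span.set_attribute("agent.tool_config", str(sorted(tool_config.items())))
        span.set_attribute("agent.env",
                           f"{platform.system()} py{platform.python_version()}")
        span.set_attribute("agent.git_sha", os.environ.get("GIT_SHA", "unknown"))
        # Run the actual agent step inside this span so eval failures
        # can be debugged at the step level.
```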
Output Requirements
- Include offline eval design.
- Include online monitoring signals.
- Include at least one rollback threshold tied to quality, safety, or cost; one possible shape is sketched below.
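As a sketch of what a rollback threshold with teeth might look like, the gate below compares online monitoring signals against pre-agreed floors and ceilings. The threshold values and signal names are placeholders, not recommendations.

```python
# Sketch: rollback gate evaluated against online monitoring signals.
# Thresholds and signal names are placeholder assumptions; agree on
# real values before deployment, not after a regression.
ROLLBACK_THRESHOLDS = {
    "task_success_rate_min": 0.90,   # quality: rolling success rate
    "safety_flag_rate_max": 0.01,    # safety: flagged outputs per run
    "cost_per_task_usd_max": 0.25,   # cost: mean spend per task
}

def should_roll_back(signals: dict) -> list:
    """Return the list of breached thresholds; non-empty means roll back."""
    breaches = []
    if signals["task_success_rate"] < ROLLBACK_THRESHOLDS["task_success_rate_min"]:
        breaches.append("quality: task success below floor")
    if signals["safety_flag_rate"] > ROLLBACK_THRESHOLDS["safety_flag_rate_max"]:
        breaches.append("safety: flag rate above ceiling")
    if signals["cost_per_task_usd"] > ROLLBACK_THRESHOLDS["cost_per_task_usd_max"]:
        breaches.append("cost: per-task spend above ceiling")
    return breaches
```

Wiring a check like this into the deploy pipeline, for example as a canary gate that auto-reverts on any breach, is what turns the threshold from documentation into a control.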