LLMs-Universal-Life-Science-and-Clinical-Skills- agentic-evals-observability
Design evaluation, tracing, monitoring, and rollback discipline for agent systems. Use when an agent workflow is becoming important enough that you need evidence, not vibes, to decide whether it is good.
install
source · Clone the upstream repo
git clone https://github.com/mdbabumiamssm/LLMs-Universal-Life-Science-and-Clinical-Skills-
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/mdbabumiamssm/LLMs-Universal-Life-Science-and-Clinical-Skills- "$T" && mkdir -p ~/.claude/skills && cp -r "$T/Skills/Agentic_AI/Agentic_Evals_Observability" ~/.claude/skills/mdbabumiamssm-llms-universal-life-science-and-clinical-skills-agentic-evals-obse && rm -rf "$T"
manifest: Skills/Agentic_AI/Agentic_Evals_Observability/SKILL.md
Agentic Evals and Observability
Use this skill when the question changes from "can the agent run" to "can we trust it in production".
Workflow
- Define the task classes, success criteria, and failure classes before running benchmarks.
- Instrument traces first so every eval failure can be debugged at the step level.
- Separate offline evaluation from online monitoring; both are required.
- Score for correctness, tool behavior, cost, latency, and safety, not just final-answer quality; a minimal harness along these lines is sketched after this list.
- Set rollback thresholds before deployment so regressions have teeth.
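A minimal sketch of an offline eval harness in this spirit, in Python. `EvalCase`, `EvalResult`, and the injected `run_agent` callable are illustrative assumptions, not names defined by this skill; the exact-match check stands in for whatever mix of exact checks and LLM judges each failure class actually needs.

```python
# Sketch: offline eval harness with task classes, success criteria, and
# multi-dimension scoring. All names here (EvalCase, run_agent, the
# tool-call dict shape) are illustrative assumptions.
import time
from dataclasses import dataclass, field

@dataclass
class EvalCase:
    task_class: str              # e.g. "extraction", "tool-routing"
    prompt: str
    expected: str                # exact-check target for this sketch
    max_cost_usd: float = 0.05   # per-case cost budget
    max_latency_s: float = 30.0

@dataclass
class EvalResult:
    case: EvalCase
    correct: bool
    tool_calls_ok: bool
    cost_usd: float
    latency_s: float
    safety_flags: list = field(default_factory=list)

def run_eval(cases, run_agent):
    """run_agent(prompt) -> (answer, tool_calls, cost_usd) is assumed;
    tool_calls is assumed to be a list of dicts with a "status" key."""
    results = []
    for case in cases:
        start = time.monotonic()
        answer, tool_calls, cost = run_agent(case.prompt)
        latency = time.monotonic() - start
        results.append(EvalResult(
            case=case,
            # Exact check; swap in an LLM judge or human review per task class.
            correct=(answer.strip() == case.expected.strip()),
            tool_calls_ok=all(c.get("status") == "ok" for c in tool_calls),
            cost_usd=cost,
            latency_s=latency,
        ))
    return results
```

Because scoring covers cost and latency budgets per case, a regression on any one dimension surfaces in the same report rather than hiding behind an unchanged answer-quality number.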
Guardrails
- Do not ship agent changes without a representative eval set.
- Do not rely on one metric; combine exact checks, LLM judges, human review, and cost telemetry.
- Record model, prompt, tool config, and environment for every major run.
- Prefer OTel-compatible tracing so data is portable across observability stacks; a sketch follows this list.
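To make the last two guardrails concrete, here is a sketch that stamps model, prompt, tool config, and environment onto OTel spans using the opentelemetry-sdk. The attribute keys (`agent.model`, `agent.prompt_sha256`, and so on) are assumptions, not an established semantic convention.

```python
# Sketch: record model, prompt, tool config, and environment as span
# attributes on an OTel-compatible trace. Attribute names are
# illustrative assumptions, not a standard.
import hashlib, os, platform
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

provider = TracerProvider()
# ConsoleSpanExporter is for local inspection; swap in an OTLP exporter
# to stay portable across observability stacks.
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("agentic-evals")

def traced_step(step_name, model, prompt, tool_config):
    with tracer.start_as_current_span(step_name) as span:
        span.set_attribute("agent.model", model)
        span.set_attribute("agent.prompt_sha256",
                           hashlib.sha256(prompt.encode()).hexdigest())
        span.set_attribute("agent.tool_config", str(sorted(tool_config.items())))
        span.set_attribute("agent.env",
                           f"{platform.system()} py{platform.python_version()}")
        span.set_attribute("agent.git_sha", os.environ.get("GIT_SHA", "unknown"))
        # Run the actual agent step inside this span so eval failures
        # can be debugged at the step level.
```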
Output Requirements
- Include offline eval design.
- Include online monitoring signals.
- Include at least one rollback threshold tied to quality, safety, or cost; one possible shape is sketched below.
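As a sketch of what a rollback threshold with teeth might look like, the gate below compares online monitoring signals against pre-agreed floors and ceilings. The threshold values and signal names are placeholders, not recommendations.

```python
# Sketch: rollback gate evaluated against online monitoring signals.
# Thresholds and signal names are placeholder assumptions; agree on
# real values before deployment, not after a regression.
ROLLBACK_THRESHOLDS = {
    "task_success_rate_min": 0.90,   # quality: rolling success rate
    "safety_flag_rate_max": 0.01,    # safety: flagged outputs per run
    "cost_per_task_usd_max": 0.25,   # cost: mean spend per task
}

def should_roll_back(signals: dict) -> list:
    """Return the list of breached thresholds; non-empty means roll back."""
    breaches = []
    if signals["task_success_rate"] < ROLLBACK_THRESHOLDS["task_success_rate_min"]:
        breaches.append("quality: task success below floor")
    if signals["safety_flag_rate"] > ROLLBACK_THRESHOLDS["safety_flag_rate_max"]:
        breaches.append("safety: flag rate above ceiling")
    if signals["cost_per_task_usd"] > ROLLBACK_THRESHOLDS["cost_per_task_usd_max"]:
        breaches.append("cost: per-task spend above ceiling")
    return breaches
```

Wiring a check like this into the deploy pipeline, for example as a canary gate that auto-reverts on any breach, is what turns the threshold from documentation into a control.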