install
source · Clone the upstream repo
git clone https://github.com/spideystreet/medox
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/spideystreet/medox "$T" && mkdir -p ~/.claude/skills && cp -r "$T/.claude/skills/run-eval" ~/.claude/skills/spideystreet-medox-run-eval && rm -rf "$T"
manifest:
.claude/skills/run-eval/SKILL.md
source content
/run-eval
Steps
- Ensure Docker is running

      docker compose ps

  If PostgreSQL or ChromaDB is not up:

      docker compose up -d

- Run the evaluation suite

      uv run dotenv -f .env run -- python scripts/run_eval.py

  Note the experiment name printed (e.g. medox-<hash>).

- Fetch and display results

  Write the following script to /tmp/check_eval.py, then run it:

      from langsmith import Client

      client = Client()
      # Root runs only: one per eval case in the experiment.
      runs = list(client.list_runs(project_name='<experiment_name>', is_root=True))
      print(f'Eval cases: {len(runs)}')
      print()

      passed, failed = 0, 0
      for run in runs:
          # Use the first feedback entry; a run with no feedback counts as FAIL.
          fb = list(client.list_feedback(run_ids=[str(run.id)]))
          score = fb[0].score if fb else None
          comment = fb[0].comment if fb else ''
          prompt = (run.inputs or {}).get('prompt', '').strip()[:75]
          status = 'PASS' if score == 1 else 'FAIL'
          if score == 1:
              passed += 1
          else:
              failed += 1
          print(f'[{status}] {prompt}')
          # Show the evaluator comment unless it is the plain 'OK' marker.
          if comment and comment != 'OK':
              print(f' -> {comment}')

      print()
      print(f'Result: {passed} passed, {failed} failed out of {len(runs)}')

  Replace <experiment_name> with the value printed in step 2, then:

      uv run dotenv -f .env run -- python3 /tmp/check_eval.py

- Investigate failures

  For any [FAIL], read the comment and:
  - Check the relevant node/tool in src/medox/agent/
  - Check the evaluator logic in scripts/run_eval.py
  - Use /add-eval-case to add a regression case if a new edge case was found

- Report summary

  Print the final "Result: N passed, M failed out of X" line to the user.
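
The pass/fail tally in the results script can also be pulled out into a pure function, so the reporting rules can be checked without a LangSmith connection. The following is a minimal sketch, not part of the skill itself: `CaseResult` and `summarize` are hypothetical names introduced here, and the rules (a case passes only when its score equals 1, missing feedback counts as a failure, prompts are truncated to 75 characters, comments equal to 'OK' are hidden) are copied from the script above.

```python
from typing import Iterable, List, NamedTuple, Optional


class CaseResult(NamedTuple):
    """One eval case reduced to what the report needs (hypothetical type)."""
    prompt: str
    score: Optional[float]  # None when the run has no feedback
    comment: str = ""


def summarize(cases: Iterable[CaseResult]) -> List[str]:
    """Build the report lines, ending with the final 'Result:' line."""
    lines: List[str] = []
    passed, failed = 0, 0
    for case in cases:
        # Same rule as the script: only a score of exactly 1 is a pass.
        if case.score == 1:
            status = "PASS"
            passed += 1
        else:
            status = "FAIL"
            failed += 1
        lines.append(f"[{status}] {case.prompt.strip()[:75]}")
        # Evaluator comments are shown unless they are the plain 'OK' marker.
        if case.comment and case.comment != "OK":
            lines.append(f" -> {case.comment}")
    total = passed + failed
    lines.append(f"Result: {passed} passed, {failed} failed out of {total}")
    return lines


if __name__ == "__main__":
    # Made-up cases for illustration only; real values come from LangSmith.
    demo = [
        CaseResult("example prompt A", 1, "OK"),
        CaseResult("example prompt B", 0, "example evaluator comment"),
        CaseResult("example prompt C", None),
    ]
    print("\n".join(summarize(demo)))
```

To use it with the real data, each run from `client.list_runs(...)` would be mapped to a `CaseResult` built from its first feedback entry, and the last element of the returned list is the line to report to the user.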