Claude-skill-registry experiment-analysis

Analyze GRPO training runs for learning dynamics and pipeline performance. Use when diagnosing training issues, reviewing Elo progression, checking throughput, or updating experiment results.

install

source · Clone the upstream repo

```
git clone https://github.com/majiayu000/claude-skill-registry
```

Claude Code · Install into ~/.claude/skills/

```
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/experiment-analysis-bglick13-diplomacy-v2" ~/.claude/skills/majiayu000-claude-skill-registry-experiment-analysis && rm -rf "$T"
```

manifest: skills/data/experiment-analysis-bglick13-diplomacy-v2/SKILL.md

source content

Experiment Analysis

Diagnose GRPO training runs using WandB metrics and Axiom logs.

Quick Reference

| Question | Command |
| --- | --- |
| Full Elo analysis | `uv run python .claude/skills/experiment-analysis/analyze_elo.py <run>` |
| Compare sweep runs | `uv run python .claude/skills/experiment-analysis/analyze_sweep.py --sweep <prefix>` |
| Is model learning? | `uv run python scripts/wandb_cli.py get-metrics -r <run> --all-metrics` |
| Rollout throughput? | `uv run python scripts/axiom_cli.py rollout-timing --last 6h` |
| Any errors? | `uv run python scripts/axiom_cli.py errors --last 1h` |
| Extraction rate? | `uv run python scripts/axiom_cli.py extraction-stats --last 24h` |
| System health? | `uv run python scripts/axiom_cli.py health --last 1h` |

Tools Overview

WandB CLI (`scripts/wandb_cli.py`)

Training metrics and Elo ratings. Use for:

  • Elo trajectory analysis (learning signal)
  • Reward/loss curves
  • KL divergence and grad norm

Axiom CLI (`scripts/axiom_cli.py`)

Real-time logs and events. Use for:

  • Rollout timing and throughput
  • Inference engine performance
  • Error monitoring
  • Order extraction stats

Detailed Guides

Key Metrics

Learning Signal (Fixed Reference Analysis)

Key insight: Win rate against a dynamic league is meaningless. Use FIXED references.

| Metric | Good Sign | Bad Sign |
| --- | --- | --- |
| base_model Elo | Declining | Stable/Rising |
| Baseline bot Elo | Declining (exploited) | Rising |
| Best checkpoint - base_model gap | Growing | Shrinking |
| Older checkpoint Elo | Declining | Stable |
| KL divergence | Stable <0.1 | Spikes >0.2 |

Fixed references (base_model, chaos_bot, etc.) don't change, so any movement in their Elo reflects learning. The Elo gap (best checkpoint - base_model) measures how much better the trained model has become.
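The fixed-reference check can be sketched in a few lines. This is illustrative only: the function names and the sample Elo histories are made up, and in practice the series would come from a WandB export (e.g. via `scripts/wandb_cli.py`).

```python
def elo_gap(best_ckpt_elo: list[float], base_model_elo: list[float]) -> list[float]:
    """Per-step gap between the best checkpoint and the fixed base_model reference."""
    return [b - r for b, r in zip(best_ckpt_elo, base_model_elo)]

def is_learning(gap: list[float]) -> bool:
    """Learning signal: the gap against the fixed reference should grow over training."""
    return len(gap) >= 2 and gap[-1] > gap[0]

# Illustrative histories: the fixed reference's Elo declines (good sign)
# while the best checkpoint pulls ahead.
base = [1500, 1480, 1455, 1430]
best = [1500, 1520, 1545, 1570]

gap = elo_gap(best, base)
print(gap)               # [0, 40, 90, 140]
print(is_learning(gap))  # True
```

A league-relative win rate would not work here, because the opponents themselves improve; only the fixed references make Elo movement attributable to learning.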

Performance

| Metric | Target | Action if Miss |
| --- | --- | --- |
| Rollout p95 duration | <120s | Check inference engine |
| Extraction rate | >95% | Check logits processor |
| Error rate | <1% | Check Axiom errors |
| Grad norm | <50 | Policy may be unstable |
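The table above can be turned into an automated triage pass. A minimal sketch, assuming the metric names and sample values shown here (the thresholds are the targets from the table; everything else is hypothetical):

```python
# Map each metric to (target predicate, suggested action from the table).
CHECKS = {
    "rollout_p95_s":   (lambda v: v < 120,  "Check inference engine"),
    "extraction_rate": (lambda v: v > 0.95, "Check logits processor"),
    "error_rate":      (lambda v: v < 0.01, "Check Axiom errors"),
    "grad_norm":       (lambda v: v < 50,   "Policy may be unstable"),
}

def triage(metrics: dict[str, float]) -> list[str]:
    """Return the suggested action for every metric that misses its target."""
    return [action for name, (ok, action) in CHECKS.items()
            if name in metrics and not ok(metrics[name])]

# Hypothetical snapshot: rollout latency and grad norm are out of range.
sample = {"rollout_p95_s": 140, "extraction_rate": 0.97,
          "error_rate": 0.004, "grad_norm": 80}
print(triage(sample))  # ['Check inference engine', 'Policy may be unstable']
```

In practice the input dict would be populated from `scripts/axiom_cli.py` and `scripts/wandb_cli.py` output rather than hard-coded.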