# experiment-analysis

Analyze GRPO training runs for learning dynamics and pipeline performance. Use when diagnosing training issues, reviewing Elo progression, checking throughput, or updating experiment results.
## Install

Clone the upstream repo:

```sh
git clone https://github.com/majiayu000/claude-skill-registry
```

Install into `~/.claude/skills/` for Claude Code:

```sh
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/experiment-analysis-bglick13-diplomacy-v2" ~/.claude/skills/majiayu000-claude-skill-registry-experiment-analysis && rm -rf "$T"
```

Manifest: `skills/data/experiment-analysis-bglick13-diplomacy-v2/SKILL.md`
# Experiment Analysis

Diagnose GRPO training runs using WandB metrics and Axiom logs.
## Quick Reference

| Question | Command |
|---|---|
| Full Elo analysis | |
| Compare sweep runs | |
| Is model learning? | |
| Rollout throughput? | |
| Any errors? | |
| Extraction rate? | |
| System health? | |
## Tools Overview

### WandB CLI (`scripts/wandb_cli.py`)

Training metrics and Elo ratings. Use for:
- Elo trajectory analysis (learning signal)
- Reward/loss curves
- KL divergence and grad norm
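The KL divergence check reduces to a scan over the exported series. A minimal sketch, assuming the per-step KL values have already been pulled out of the WandB history (the metric name and export format depend on your run config):

```python
def kl_spikes(kl_series, threshold=0.2):
    """Return (step, value) pairs where KL divergence exceeds the threshold.

    Stable KL below ~0.1 is a good sign; spikes above 0.2 suggest the
    policy is drifting too far from the reference model.
    """
    return [(step, kl) for step, kl in enumerate(kl_series) if kl > threshold]

# Example: a single spike at step 3
history = [0.03, 0.05, 0.08, 0.31, 0.06]
print(kl_spikes(history))  # [(3, 0.31)]
```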
### Axiom CLI (`scripts/axiom_cli.py`)

Real-time logs and events. Use for:
- Rollout timing and throughput
- Inference engine performance
- Error monitoring
- Order extraction stats
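Rollout timing boils down to a percentile over event durations. A minimal nearest-rank p95 sketch, assuming the durations (in seconds) have already been extracted from the Axiom logs:

```python
import math

def p95(durations):
    """Nearest-rank 95th percentile of rollout durations in seconds."""
    if not durations:
        raise ValueError("no samples")
    ordered = sorted(durations)
    rank = math.ceil(0.95 * len(ordered)) - 1  # nearest-rank index
    return ordered[rank]

durations = [40, 55, 62, 70, 88, 95, 101, 110, 118, 140]
print(p95(durations))  # 140 -- above the 120 s target, so check the inference engine
```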
## Detailed Guides

- Learning Dynamics - Elo, rewards, KL analysis
- Pipeline Performance - Throughput, timing, errors
- Experiment Tracker Guide - Updating `docs/experiment-tracker.md`
- Examples - Real analysis walkthrough
## Key Metrics

### Learning Signal (Fixed Reference Analysis)
Key insight: Win rate against a dynamic league is meaningless. Use FIXED references.
| Metric | Good Sign | Bad Sign |
|---|---|---|
| base_model Elo | Declining | Stable/Rising |
| Baseline bot Elo | Declining (exploited) | Rising |
| Best checkpoint - base_model gap | Growing | Shrinking |
| Older checkpoint Elo | Declining | Stable |
| KL divergence | Stable <0.1 | Spikes >0.2 |
Fixed references (base_model, chaos_bot, etc.) never change, so any movement in their Elo reflects learning in the trained model. The Elo gap (best checkpoint minus base_model) measures how much better the trained model has become.
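Why a frozen reference's Elo declines: under the standard Elo update, every loss to an improving checkpoint lowers the reference's rating even though its weights never change. A minimal sketch of this effect (textbook Elo formula; K=32 is an illustrative assumption, not the league's actual setting):

```python
def elo_update(r_winner, r_loser, k=32):
    """One standard Elo update: returns the new (winner, loser) ratings."""
    expected_win = 1 / (1 + 10 ** ((r_loser - r_winner) / 400))
    delta = k * (1 - expected_win)
    return r_winner + delta, r_loser - delta

# If the current checkpoint keeps beating the frozen base_model,
# base_model's Elo declines without its weights ever changing:
ckpt, base = 1200.0, 1200.0
for _ in range(10):
    ckpt, base = elo_update(ckpt, base)
gap = ckpt - base  # the "best checkpoint - base_model" gap grows with each win
```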
### Performance
| Metric | Target | Action if Miss |
|---|---|---|
| Rollout p95 duration | <120s | Check inference engine |
| Extraction rate | >95% | Check logits processor |
| Error rate | <1% | Check Axiom errors |
| Grad norm | <50 | Policy may be unstable |
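The targets above can be wired into a quick health check. The helper below is hypothetical: the metric names and thresholds are transcribed from the table, not the actual WandB/Axiom field names:

```python
# Targets from the Performance table: (kind, bound) where "max" means the
# value must stay below the bound and "min" means it must stay above it.
TARGETS = {
    "rollout_p95_s": ("max", 120),
    "extraction_rate": ("min", 0.95),
    "error_rate": ("max", 0.01),
    "grad_norm": ("max", 50),
}

def check_health(metrics):
    """Return the names of metrics that miss their target."""
    misses = []
    for name, (kind, bound) in TARGETS.items():
        value = metrics.get(name)
        if value is None:
            continue  # metric not reported this run
        if kind == "max" and value >= bound:
            misses.append(name)
        elif kind == "min" and value <= bound:
            misses.append(name)
    return misses

print(check_health({"rollout_p95_s": 140, "extraction_rate": 0.97,
                    "error_rate": 0.004, "grad_norm": 12}))
# ['rollout_p95_s']
```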