AutoResearchClaw rl-policy-optimization
Best practices for reinforcement learning policy optimization. Use when working on RL agents, PPO, SAC, or reward design.
install
source · Clone the upstream repo
git clone https://github.com/aiming-lab/AutoResearchClaw
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/aiming-lab/AutoResearchClaw "$T" && mkdir -p ~/.claude/skills && cp -r "$T/researchclaw/skills/builtin/domain/rl-policy-optimization" ~/.claude/skills/aiming-lab-autoresearchclaw-rl-policy-optimization && rm -rf "$T"
manifest: researchclaw/skills/builtin/domain/rl-policy-optimization/SKILL.md
RL Policy Optimization Best Practice
Algorithm selection (see the dispatch sketch after this list):
- Discrete actions: PPO, DQN, A2C
- Continuous actions: SAC, TD3, PPO
- Multi-agent: MAPPO, QMIX
- Offline: CQL, IQL, Decision Transformer
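In code, the action-space type drives the first cut of this choice. A minimal dispatch sketch, assuming stable-baselines3 (the skill itself does not prescribe a library):

```python
import gymnasium as gym
from stable_baselines3 import PPO, SAC

def make_agent(env_id: str):
    """Pick a default algorithm from the action-space type."""
    env = gym.make(env_id)
    if isinstance(env.action_space, gym.spaces.Discrete):
        # Discrete actions: PPO is a robust default (DQN/A2C are alternatives).
        return PPO("MlpPolicy", env)
    if isinstance(env.action_space, gym.spaces.Box):
        # Continuous actions: SAC is a strong default (TD3/PPO also work).
        return SAC("MlpPolicy", env)
    raise ValueError(f"No default for action space {type(env.action_space)}")

agent = make_agent("CartPole-v1")  # Discrete -> PPO
agent = make_agent("Pendulum-v1")  # Box -> SAC
```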
Training recipe:
- PPO: clip=0.2, lr=3e-4, gamma=0.99, GAE lambda=0.95
- SAC: lr=3e-4, tau=0.005, auto-tune alpha
- Use vectorized environments (e.g., gymnasium.vector)
- Normalize observations and rewards
- Log episode return, episode length, value loss, and policy entropy (the sketch below wires all of this up)
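The PPO row of the recipe end to end, again as a stable-baselines3 sketch (Pendulum-v1 is only an illustrative environment): make_vec_env covers the vectorization bullet, VecNormalize the normalization bullet, and verbose=1 logs ep_rew_mean, ep_len_mean, value_loss, and entropy_loss.

```python
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.vec_env import VecNormalize

# 8 parallel environment copies (vectorized rollouts).
venv = make_vec_env("Pendulum-v1", n_envs=8)
# Running-mean/std normalization of observations and rewards.
venv = VecNormalize(venv, norm_obs=True, norm_reward=True)

model = PPO(
    "MlpPolicy",
    venv,
    learning_rate=3e-4,  # lr=3e-4
    gamma=0.99,          # discount factor
    gae_lambda=0.95,     # GAE lambda
    clip_range=0.2,      # PPO clip
    verbose=1,           # logs ep_rew_mean, ep_len_mean, value_loss, entropy_loss
)
model.learn(total_timesteps=200_000)
```

For the SAC row, the analogous call is SAC("MlpPolicy", env, learning_rate=3e-4, tau=0.005, ent_coef="auto"); ent_coef="auto" enables the automatic entropy-coefficient (alpha) tuning.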
Evaluation:
- Report mean ± std over 10+ evaluation episodes (see the loop after this list)
- Use a deterministic policy for evaluation (no action sampling)
- Compare against random policy and simple baselines
- Report sample efficiency (return vs. env steps)
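A plain evaluation loop implementing these points. It assumes an SB3-style model with a predict(obs, deterministic=True) method (such as the one trained above); SB3 users can get the same numbers from evaluate_policy(model, env, n_eval_episodes=10, deterministic=True):

```python
import numpy as np
import gymnasium as gym

def evaluate(model, env_id: str, n_episodes: int = 10, seed: int = 0):
    """Mean +/- std of episodic return under the deterministic policy."""
    env = gym.make(env_id)
    returns = []
    for ep in range(n_episodes):
        obs, _ = env.reset(seed=seed + ep)
        done, ep_return = False, 0.0
        while not done:
            # deterministic=True: take the policy mean, no action sampling
            action, _ = model.predict(obs, deterministic=True)
            obs, reward, terminated, truncated, _ = env.step(action)
            ep_return += float(reward)
            done = terminated or truncated
        returns.append(ep_return)
    # NOTE: if training used VecNormalize, apply the saved obs statistics here too.
    return float(np.mean(returns)), float(np.std(returns))

mean_r, std_r = evaluate(model, "Pendulum-v1")
print(f"return: {mean_r:.1f} +/- {std_r:.1f} over 10 episodes")
```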
Common pitfalls:
- Reward shaping can introduce bias; prefer potential-based shaping, which provably preserves the optimal policy
- Seed sensitivity is high: train with 5+ seeds and report across-seed statistics (see the sketch below)
- Hyperparameter sensitivity: run a small sweep (at minimum learning rate and the clip/entropy coefficients) before drawing conclusions
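For the seed pitfall concretely: a sketch that trains five seeded runs and reports across-seed statistics, assuming stable-baselines3 with CartPole-v1 as a stand-in task (the 50k-step budget is illustrative, not tuned):

```python
import numpy as np
from stable_baselines3 import PPO
from stable_baselines3.common.evaluation import evaluate_policy

def train_and_eval(seed: int) -> float:
    # seed= seeds the NumPy/PyTorch RNGs and the wrapped environment.
    model = PPO("MlpPolicy", "CartPole-v1", seed=seed, verbose=0)
    model.learn(total_timesteps=50_000)
    mean_r, _ = evaluate_policy(model, model.get_env(),
                                n_eval_episodes=10, deterministic=True)
    return mean_r

finals = np.array([train_and_eval(seed) for seed in range(5)])
# Report across-seed statistics, not a single lucky run.
print(f"final return: {finals.mean():.1f} +/- {finals.std():.1f} over 5 seeds")
```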