AutoResearchClaw rl-policy-optimization

Best practices for reinforcement learning policy optimization. Use when working on RL agents, PPO, SAC, or reward design.

install
source · Clone the upstream repo
git clone https://github.com/aiming-lab/AutoResearchClaw
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/aiming-lab/AutoResearchClaw "$T" && mkdir -p ~/.claude/skills && cp -r "$T/researchclaw/skills/builtin/domain/rl-policy-optimization" ~/.claude/skills/aiming-lab-autoresearchclaw-rl-policy-optimization && rm -rf "$T"
manifest: researchclaw/skills/builtin/domain/rl-policy-optimization/SKILL.md
source content

RL Policy Optimization Best Practice

Algorithm selection:

  • Discrete actions: PPO, DQN, A2C
  • Continuous actions: SAC, TD3, PPO
  • Multi-agent: MAPPO, QMIX
  • Offline: CQL, IQL, Decision Transformer
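
A minimal selection sketch, assuming stable-baselines3 and gymnasium are installed; the env ids, the MlpPolicy choice, and the make_agent helper are illustrative, not part of this skill:

  import gymnasium as gym
  from stable_baselines3 import PPO, SAC

  def make_agent(env_id: str):
      # Dispatch on the action space: discrete -> PPO (DQN/A2C also fit),
      # continuous (Box) -> SAC (TD3/PPO also fit).
      env = gym.make(env_id)
      if isinstance(env.action_space, gym.spaces.Discrete):
          return PPO("MlpPolicy", env)
      return SAC("MlpPolicy", env)

  agent = make_agent("CartPole-v1")  # discrete; "Pendulum-v1" would hit the SAC branch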

Training recipe:

  • PPO: clip=0.2, lr=3e-4, gamma=0.99, GAE lambda=0.95
  • SAC: lr=3e-4, tau=0.005, auto-tune alpha
  • Use vectorized environments (e.g., gymnasium.vector)
  • Normalize observations and rewards
  • Log episode return, episode length, value loss, policy entropy
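
A sketch of this recipe, again assuming stable-baselines3 (the hyperparameter names follow its API; CartPole-v1 and the timestep budget are placeholders):

  from stable_baselines3 import PPO
  from stable_baselines3.common.env_util import make_vec_env
  from stable_baselines3.common.vec_env import VecNormalize

  # Vectorized envs plus running normalization of observations and rewards.
  venv = make_vec_env("CartPole-v1", n_envs=8)
  venv = VecNormalize(venv, norm_obs=True, norm_reward=True)

  # PPO with the defaults recommended above; tensorboard_log records
  # episode return/length, value loss, and policy entropy.
  model = PPO(
      "MlpPolicy", venv,
      learning_rate=3e-4, gamma=0.99, gae_lambda=0.95, clip_range=0.2,
      tensorboard_log="./runs",
  )
  model.learn(total_timesteps=200_000)

  # SAC analogue for continuous-action (Box) envs:
  # model = SAC("MlpPolicy", venv, learning_rate=3e-4, tau=0.005, ent_coef="auto")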

Evaluation:

  • Report mean +/- std over 10+ evaluation episodes
  • Use deterministic policy for evaluation
  • Compare against a random policy and simple baselines
  • Report sample efficiency (return vs. env steps)
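
A framework-agnostic evaluation sketch on the gymnasium API; `policy` is a hypothetical callable mapping an observation to a deterministic action (with stable-baselines3 you could instead call its evaluate_policy helper):

  import gymnasium as gym
  import numpy as np

  def evaluate(policy, env_id="CartPole-v1", n_episodes=10, seed=0):
      # Run n_episodes deterministic rollouts; report mean +/- std return.
      env = gym.make(env_id)
      returns = []
      for ep in range(n_episodes):
          obs, _ = env.reset(seed=seed + ep)
          done, total = False, 0.0
          while not done:
              obs, reward, terminated, truncated, _ = env.step(policy(obs))
              total += reward
              done = terminated or truncated
          returns.append(total)
      return float(np.mean(returns)), float(np.std(returns))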

Common pitfalls:

  • Reward shaping can bias the learned policy toward the shaped signal rather than the true objective; potential-based shaping preserves the optimal policy
  • Seed sensitivity is high; train with 5+ seeds and report aggregate statistics (see the sketch below)
  • Hyperparameter sensitivity is real; run at least a small sweep (learning rate, clip range, entropy coefficient) before comparing methods
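
A multi-seed sketch under the same stable-baselines3 assumption as above (the seed count, env id, and timestep budget are illustrative):

  import numpy as np
  from stable_baselines3 import PPO
  from stable_baselines3.common.env_util import make_vec_env
  from stable_baselines3.common.evaluation import evaluate_policy

  final_returns = []
  for seed in range(5):  # 5+ seeds, per the pitfall above
      venv = make_vec_env("CartPole-v1", n_envs=8, seed=seed)
      model = PPO("MlpPolicy", venv, seed=seed)
      model.learn(total_timesteps=100_000)
      eval_env = make_vec_env("CartPole-v1", n_envs=1, seed=1_000 + seed)
      mean_r, _ = evaluate_policy(model, eval_env, n_eval_episodes=10,
                                  deterministic=True)
      final_returns.append(mean_r)

  print(f"{np.mean(final_returns):.1f} +/- {np.std(final_returns):.1f} "
        f"over {len(final_returns)} seeds")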