Awesome-Agent-Skills-for-Empirical-Research reinforcement-learning-guide
Reinforcement learning fundamentals, algorithms, and research
install
source · Clone the upstream repo
git clone https://github.com/brycewang-stanford/Awesome-Agent-Skills-for-Empirical-Research
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/brycewang-stanford/Awesome-Agent-Skills-for-Empirical-Research "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/43-wentorai-research-plugins/skills/domains/ai-ml/reinforcement-learning-guide" ~/.claude/skills/brycewang-stanford-awesome-agent-skills-for-empirical-research-reinforcement-lea && rm -rf "$T"
manifest:
skills/43-wentorai-research-plugins/skills/domains/ai-ml/reinforcement-learning-guide/SKILL.md
source content
Reinforcement Learning Guide
Understand and implement reinforcement learning algorithms from tabular methods through deep RL, including policy gradients, actor-critic, and model-based approaches.
RL Fundamentals
The RL Framework
An agent interacts with an environment to maximize cumulative reward:
```
  Agent                               Environment
    |                                      |
    |---------- action a_t --------------->|
    |                                      |
    |<--- state s_{t+1}, reward r_{t+1} ---|
    |                                      |
```
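This loop maps directly onto the Gymnasium API. A minimal sketch of one episode, assuming `gymnasium` is installed and using `CartPole-v1` with a random policy (both choices are illustrative, not from the source):

```python
import gymnasium as gym

env = gym.make("CartPole-v1")
state, info = env.reset(seed=0)
done = False
total_reward = 0.0

while not done:
    action = env.action_space.sample()   # a_t: random placeholder policy
    state, reward, terminated, truncated, info = env.step(action)
    total_reward += reward               # accumulate r_{t+1}
    done = terminated or truncated       # episode boundary

env.close()
print(f"Episode return: {total_reward}")
```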
| Concept | Symbol | Definition |
|---|---|---|
| State | s | Observation of the environment |
| Action | a | Decision made by the agent |
| Reward | r | Scalar feedback signal |
| Policy | pi(a\|s) | Mapping from states to actions |
| Value function | V(s) | Expected cumulative reward from state s |
| Q-function | Q(s, a) | Expected cumulative reward from (s, a) |
| Discount factor | gamma | Weight of future vs. immediate rewards (0-1) |
| Return | G_t | Sum of discounted future rewards from time t |
Key Equations
```
# Return (discounted cumulative reward)
G_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ...

# Bellman equation for V
V(s) = E[r + gamma * V(s') | s]

# Bellman equation for Q
Q(s, a) = E[r + gamma * max_a' Q(s', a') | s, a]

# Policy gradient theorem
grad J(theta) = E[grad log pi_theta(a|s) * Q(s, a)]
```
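To make the return definition concrete, a short sketch that computes G_t for every step of a finished episode in a single backward pass (pure Python; the function name and example values are illustrative):

```python
def discounted_returns(rewards, gamma=0.99):
    """Compute G_t = r_t + gamma * r_{t+1} + ... for each t, backward in time."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running   # G_t = r_t + gamma * G_{t+1}
        returns[t] = running
    return returns

# Example: three steps with reward 1 each, gamma = 0.9
print(discounted_returns([1.0, 1.0, 1.0], gamma=0.9))  # [2.71, 1.9, 1.0]
```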
Algorithm Taxonomy
| Category | Algorithm | Key Idea | On/Off Policy |
|---|---|---|---|
| Value-based | Q-Learning | Learn Q(s,a), act greedily | Off-policy |
| Value-based | DQN | Q-Learning + neural net + replay buffer | Off-policy |
| Value-based | Double DQN | Two networks to reduce overestimation | Off-policy |
| Value-based | Dueling DQN | Separate value and advantage streams | Off-policy |
| Policy gradient | REINFORCE | Monte Carlo policy gradient | On-policy |
| Policy gradient | PPO | Clipped surrogate objective | On-policy |
| Policy gradient | TRPO | Trust region constraint | On-policy |
| Actor-Critic | A2C/A3C | Advantage actor-critic (parallel) | On-policy |
| Actor-Critic | SAC | Maximum entropy + off-policy AC | Off-policy |
| Actor-Critic | TD3 | Twin delayed DDPG | Off-policy |
| Model-based | Dreamer | World model + imagination | On-policy |
| Model-based | MBPO | Model-based policy optimization | Off-policy |
| Model-based | MuZero | Learned model + planning (MCTS) | Off-policy |
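The tabular Q-Learning row at the top of the table reduces to a one-line Bellman backup. A minimal sketch, assuming a toy discrete problem; `n_states`, `n_actions`, and the learning rate `alpha` are illustrative, not from the source:

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, done, alpha=0.1, gamma=0.99):
    """One off-policy Q-Learning step: move Q[s, a] toward the Bellman target."""
    target = r if done else r + gamma * Q[s_next].max()   # max over a' (greedy bootstrap)
    Q[s, a] += alpha * (target - Q[s, a])
    return Q

# Illustrative table for a toy problem with 16 states and 4 actions
n_states, n_actions = 16, 4
Q = np.zeros((n_states, n_actions))
Q = q_learning_update(Q, s=0, a=1, r=1.0, s_next=5, done=False)
```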
Implementation: DQN
```python
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
from collections import deque
import random


class QNetwork(nn.Module):
    def __init__(self, state_dim, action_dim, hidden_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, action_dim),
        )

    def forward(self, x):
        return self.net(x)


class DQNAgent:
    def __init__(self, state_dim, action_dim, lr=1e-3, gamma=0.99,
                 epsilon=1.0, epsilon_decay=0.995, epsilon_min=0.01,
                 buffer_size=10000, batch_size=64):
        self.action_dim = action_dim
        self.gamma = gamma
        self.epsilon = epsilon
        self.epsilon_decay = epsilon_decay
        self.epsilon_min = epsilon_min
        self.batch_size = batch_size

        self.q_network = QNetwork(state_dim, action_dim)
        self.target_network = QNetwork(state_dim, action_dim)
        self.target_network.load_state_dict(self.q_network.state_dict())
        self.optimizer = optim.Adam(self.q_network.parameters(), lr=lr)
        self.replay_buffer = deque(maxlen=buffer_size)

    def select_action(self, state):
        # Epsilon-greedy exploration
        if random.random() < self.epsilon:
            return random.randint(0, self.action_dim - 1)
        with torch.no_grad():
            q_values = self.q_network(torch.FloatTensor(state))
        return q_values.argmax().item()

    def store_transition(self, state, action, reward, next_state, done):
        self.replay_buffer.append((state, action, reward, next_state, done))

    def train_step(self):
        if len(self.replay_buffer) < self.batch_size:
            return 0.0

        batch = random.sample(self.replay_buffer, self.batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)

        states = torch.FloatTensor(np.array(states))
        actions = torch.LongTensor(actions)
        rewards = torch.FloatTensor(rewards)
        next_states = torch.FloatTensor(np.array(next_states))
        dones = torch.FloatTensor(dones)

        # Current Q values
        q_values = self.q_network(states).gather(1, actions.unsqueeze(1)).squeeze()

        # Target Q values (Double DQN variant: online net selects, target net evaluates)
        with torch.no_grad():
            best_actions = self.q_network(next_states).argmax(1)
            next_q = self.target_network(next_states).gather(
                1, best_actions.unsqueeze(1)).squeeze()
            targets = rewards + self.gamma * next_q * (1 - dones)

        loss = nn.MSELoss()(q_values, targets)
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()

        self.epsilon = max(self.epsilon_min, self.epsilon * self.epsilon_decay)
        return loss.item()

    def update_target(self):
        self.target_network.load_state_dict(self.q_network.state_dict())
```
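A usage sketch wiring `DQNAgent` to an environment. The SKILL.md itself stops at the agent class, so the `gymnasium` dependency, the `CartPole-v1` task, the episode budget, and the 10-episode target-sync interval are all illustrative assumptions:

```python
import gymnasium as gym

env = gym.make("CartPole-v1")
agent = DQNAgent(state_dim=env.observation_space.shape[0],
                 action_dim=env.action_space.n)

for episode in range(200):                       # illustrative episode budget
    state, _ = env.reset()
    done, episode_reward = False, 0.0
    while not done:
        action = agent.select_action(state)
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        agent.store_transition(state, action, reward, next_state, done)
        agent.train_step()
        state, episode_reward = next_state, episode_reward + reward
    if episode % 10 == 0:                        # periodic hard target sync (assumed interval)
        agent.update_target()
        print(f"episode {episode}: return {episode_reward:.1f}")
```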
Implementation: PPO
```python
class PPOAgent:
    def __init__(self, state_dim, action_dim, lr=3e-4, gamma=0.99,
                 lam=0.95, clip_ratio=0.2, epochs=10):
        self.gamma = gamma
        self.lam = lam
        self.clip_ratio = clip_ratio
        self.epochs = epochs

        self.actor = nn.Sequential(
            nn.Linear(state_dim, 64), nn.Tanh(),
            nn.Linear(64, 64), nn.Tanh(),
            nn.Linear(64, action_dim), nn.Softmax(dim=-1),
        )
        self.critic = nn.Sequential(
            nn.Linear(state_dim, 64), nn.Tanh(),
            nn.Linear(64, 64), nn.Tanh(),
            nn.Linear(64, 1),
        )
        self.optimizer = optim.Adam(
            list(self.actor.parameters()) + list(self.critic.parameters()), lr=lr
        )

    def compute_gae(self, rewards, values, dones):
        """Generalized Advantage Estimation."""
        advantages = []
        gae = 0
        for t in reversed(range(len(rewards))):
            next_value = values[t + 1] if t + 1 < len(values) else 0
            delta = rewards[t] + self.gamma * next_value * (1 - dones[t]) - values[t]
            gae = delta + self.gamma * self.lam * (1 - dones[t]) * gae
            advantages.insert(0, gae)
        return torch.FloatTensor(advantages)

    def update(self, states, actions, old_log_probs, rewards, dones):
        values = self.critic(states).squeeze().detach().numpy()
        advantages = self.compute_gae(rewards, values, dones)
        returns = advantages + torch.FloatTensor(values[:len(advantages)])
        # Normalize advantages for stable updates
        advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)

        for _ in range(self.epochs):
            probs = self.actor(states)
            dist = torch.distributions.Categorical(probs)
            new_log_probs = dist.log_prob(actions)
            entropy = dist.entropy().mean()

            # Clipped surrogate objective
            ratio = (new_log_probs - old_log_probs).exp()
            clipped = torch.clamp(ratio, 1 - self.clip_ratio, 1 + self.clip_ratio)
            actor_loss = -torch.min(ratio * advantages, clipped * advantages).mean()

            critic_loss = nn.MSELoss()(self.critic(states).squeeze(), returns)
            loss = actor_loss + 0.5 * critic_loss - 0.01 * entropy

            self.optimizer.zero_grad()
            loss.backward()
            self.optimizer.step()
```
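`update` expects whole-rollout tensors that the class never shows being collected. A hedged sketch of one on-policy rollout and update, again assuming `gymnasium` and a discrete-action task such as `CartPole-v1`; `collect_rollout` and the 2048-step horizon are illustrative choices:

```python
import torch
import gymnasium as gym

def collect_rollout(env, agent, horizon=2048):
    """Gather one on-policy batch: states, actions, old log-probs, rewards, dones."""
    states, actions, log_probs, rewards, dones = [], [], [], [], []
    state, _ = env.reset()
    for _ in range(horizon):
        state_t = torch.FloatTensor(state)
        with torch.no_grad():
            dist = torch.distributions.Categorical(agent.actor(state_t))
        action = dist.sample()
        next_state, reward, terminated, truncated, _ = env.step(action.item())
        done = terminated or truncated
        states.append(state_t)
        actions.append(action)
        log_probs.append(dist.log_prob(action))
        rewards.append(reward)
        dones.append(float(done))
        if done:
            state, _ = env.reset()   # start a fresh episode mid-rollout
        else:
            state = next_state
    return (torch.stack(states), torch.stack(actions),
            torch.stack(log_probs), rewards, dones)

# One PPO iteration on CartPole-v1 (obs dim 4, 2 discrete actions)
env = gym.make("CartPole-v1")
agent = PPOAgent(state_dim=4, action_dim=2)
batch = collect_rollout(env, agent)
agent.update(*batch)
```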
Research Environments
| Environment | Domain | Complexity | Key Paper |
|---|---|---|---|
| Gymnasium (formerly OpenAI Gym) | Classic control, Atari | Low-High | Brockman et al., 2016 |
| MuJoCo | Continuous control, robotics | Medium-High | Todorov et al., 2012 |
| DMControl | Continuous control from pixels | High | Tassa et al., 2018 |
| ProcGen | Procedurally generated games | High (generalization) | Cobbe et al., 2020 |
| Minigrid | Grid-world navigation | Low-Medium | Chevalier-Boisvert et al. |
| Isaac Gym | GPU-accelerated physics sim | High | Makoviychuk et al., 2021 |
| NetHack | Complex roguelike game | Very High | Küttler et al., 2020 |
Top Venues
| Venue | Type | Focus |
|---|---|---|
| NeurIPS | Conference | Broad ML including RL |
| ICML | Conference | Broad ML including RL |
| ICLR | Conference | Representation learning, deep RL |
| AAAI | Conference | Broad AI |
| CoRL | Conference | Robot learning |
| JMLR | Journal | Broad ML (open access) |
| L4DC | Conference | Learning for dynamics and control |
Key Research Directions (2024-2025)
- RLHF / RLAIF: RL from human or AI feedback for LLM alignment
- Offline RL: Learning from pre-collected datasets without environment interaction
- Foundation models for control: Using pre-trained LLMs/VLMs as world models or planners
- Multi-agent RL: Cooperative and competitive settings with communication
- Safe RL: Constrained optimization to ensure safety during training and deployment
- Sample-efficient RL: Closing the sample-complexity gap between model-free and model-based methods