Awesome-Agent-Skills-for-Empirical-Research reinforcement-learning-guide

Reinforcement learning fundamentals, algorithms, and research

Install

Source · Clone the upstream repo

git clone https://github.com/brycewang-stanford/Awesome-Agent-Skills-for-Empirical-Research

Claude Code · Install into ~/.claude/skills/

T=$(mktemp -d) && git clone --depth=1 https://github.com/brycewang-stanford/Awesome-Agent-Skills-for-Empirical-Research "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/43-wentorai-research-plugins/skills/domains/ai-ml/reinforcement-learning-guide" ~/.claude/skills/brycewang-stanford-awesome-agent-skills-for-empirical-research-reinforcement-lea && rm -rf "$T"

Manifest: skills/43-wentorai-research-plugins/skills/domains/ai-ml/reinforcement-learning-guide/SKILL.md

Source content

Reinforcement Learning Guide

Understand and implement reinforcement learning algorithms from tabular methods through deep RL, including policy gradients, actor-critic, and model-based approaches.

RL Fundamentals

The RL Framework

An agent interacts with an environment to maximize cumulative reward:

Agent                                       Environment
  |                                             |
  |---------- action a_t --------------------->|
  |                                             |  (transitions to s_{t+1})
  |<--- reward r_t, next state s_{t+1} --------|
  |                                             |
| Concept         | Symbol    | Definition                                   |
|-----------------|-----------|----------------------------------------------|
| State           | s         | Observation of the environment               |
| Action          | a         | Decision made by the agent                   |
| Reward          | r         | Scalar feedback signal                       |
| Policy          | pi(a\|s)  | Mapping from states to actions               |
| Value function  | V(s)      | Expected cumulative reward from state s      |
| Q-function      | Q(s, a)   | Expected cumulative reward from (s, a)       |
| Discount factor | gamma     | Weight of future vs. immediate rewards (0-1) |
| Return          | G_t       | Sum of discounted future rewards from time t |
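
To make the loop concrete, here is a minimal interaction sketch using the Gymnasium API (gymnasium and the CartPole-v1 task are assumptions, not part of this skill); the random policy is only a placeholder for pi(a|s).

# Minimal agent-environment loop (sketch; assumes `pip install gymnasium` and CartPole-v1)
import gymnasium as gym

env = gym.make("CartPole-v1")
state, info = env.reset(seed=0)
total_reward = 0.0

for t in range(500):
    action = env.action_space.sample()              # placeholder policy: act at random
    next_state, reward, terminated, truncated, info = env.step(action)
    total_reward += reward                          # accumulate the (undiscounted) return
    state = next_state
    if terminated or truncated:                     # episode over: start a new one
        state, info = env.reset()

env.close()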

Key Equations

# Return (discounted cumulative reward)
G_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ...

# Bellman equation for V
V(s) = E[r + gamma * V(s') | s]

# Bellman equation for Q
Q(s, a) = E[r + gamma * max_a' Q(s', a') | s, a]

# Policy gradient theorem
grad_theta J(theta) = E[grad_theta log pi_theta(a|s) * Q(s, a)]
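
These quantities are simple to compute directly; the sketch below shows the discounted return over a finished episode and the one-step Bellman (TD) target used as a regression label for V. The helper names are illustrative, not part of this skill.

# Sketch: discounted return and one-step TD target (illustrative helpers)
def discounted_return(rewards, gamma=0.99):
    """G_t = r_t + gamma * r_{t+1} + ..., computed backwards over one episode."""
    G, returns = 0.0, []
    for r in reversed(rewards):
        G = r + gamma * G
        returns.insert(0, G)
    return returns

def td_target(reward, next_value, gamma=0.99, done=False):
    """Bellman backup r + gamma * V(s'), with no bootstrap on terminal states."""
    return reward + gamma * next_value * (1.0 - float(done))

print(discounted_return([1.0, 1.0, 1.0]))  # [2.9701, 1.99, 1.0]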

Algorithm Taxonomy

| Category        | Algorithm   | Key Idea                                 | On/Off Policy |
|-----------------|-------------|------------------------------------------|---------------|
| Value-based     | Q-Learning  | Learn Q(s,a), act greedily               | Off-policy    |
| Value-based     | DQN         | Q-Learning + neural net + replay buffer  | Off-policy    |
| Value-based     | Double DQN  | Two networks to reduce overestimation    | Off-policy    |
| Value-based     | Dueling DQN | Separate value and advantage streams     | Off-policy    |
| Policy gradient | REINFORCE   | Monte Carlo policy gradient              | On-policy     |
| Policy gradient | PPO         | Clipped surrogate objective              | On-policy     |
| Policy gradient | TRPO        | Trust region constraint                  | On-policy     |
| Actor-Critic    | A2C/A3C     | Advantage actor-critic (parallel)        | On-policy     |
| Actor-Critic    | SAC         | Maximum entropy + off-policy AC          | Off-policy    |
| Actor-Critic    | TD3         | Twin delayed DDPG                        | Off-policy    |
| Model-based     | Dreamer     | World model + imagination                | On-policy     |
| Model-based     | MBPO        | Model-based policy optimization          | Off-policy    |
| Model-based     | MuZero      | Learned model + planning (MCTS)          | Off-policy    |
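
Before moving to deep RL, the value-based row of the table can be illustrated with tabular Q-learning. The sketch below is a minimal version; FrozenLake-v1 via Gymnasium and the hyperparameters are assumptions, not prescribed by this skill.

# Sketch: tabular Q-learning with an epsilon-greedy policy
import numpy as np
import gymnasium as gym

env = gym.make("FrozenLake-v1", is_slippery=False)
Q = np.zeros((env.observation_space.n, env.action_space.n))
alpha, gamma, epsilon = 0.1, 0.99, 0.1              # illustrative hyperparameters

for episode in range(5000):
    s, _ = env.reset()
    done = False
    while not done:
        # Epsilon-greedy action selection
        a = env.action_space.sample() if np.random.rand() < epsilon else int(Q[s].argmax())
        s_next, r, terminated, truncated, _ = env.step(a)
        done = terminated or truncated
        # Q-learning update: move Q(s, a) toward r + gamma * max_a' Q(s', a')
        target = r + gamma * Q[s_next].max() * (not terminated)
        Q[s, a] += alpha * (target - Q[s, a])
        s = s_next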

Implementation: DQN

import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
from collections import deque
import random

class QNetwork(nn.Module):
    def __init__(self, state_dim, action_dim, hidden_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, action_dim)
        )

    def forward(self, x):
        return self.net(x)

class DQNAgent:
    def __init__(self, state_dim, action_dim, lr=1e-3, gamma=0.99,
                 epsilon=1.0, epsilon_decay=0.995, epsilon_min=0.01,
                 buffer_size=10000, batch_size=64):
        self.action_dim = action_dim
        self.gamma = gamma
        self.epsilon = epsilon
        self.epsilon_decay = epsilon_decay
        self.epsilon_min = epsilon_min
        self.batch_size = batch_size

        self.q_network = QNetwork(state_dim, action_dim)
        self.target_network = QNetwork(state_dim, action_dim)
        self.target_network.load_state_dict(self.q_network.state_dict())
        self.optimizer = optim.Adam(self.q_network.parameters(), lr=lr)

        self.replay_buffer = deque(maxlen=buffer_size)

    def select_action(self, state):
        # Epsilon-greedy: explore with probability epsilon, otherwise act greedily
        if random.random() < self.epsilon:
            return random.randint(0, self.action_dim - 1)
        with torch.no_grad():
            q_values = self.q_network(torch.FloatTensor(state))
            return q_values.argmax().item()

    def store_transition(self, state, action, reward, next_state, done):
        self.replay_buffer.append((state, action, reward, next_state, done))

    def train_step(self):
        if len(self.replay_buffer) < self.batch_size:
            return 0.0

        batch = random.sample(self.replay_buffer, self.batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)

        states = torch.FloatTensor(np.array(states))
        actions = torch.LongTensor(actions)
        rewards = torch.FloatTensor(rewards)
        next_states = torch.FloatTensor(np.array(next_states))
        dones = torch.FloatTensor(dones)

        # Current Q values
        q_values = self.q_network(states).gather(1, actions.unsqueeze(1)).squeeze()

        # Target Q values (Double DQN variant)
        with torch.no_grad():
            best_actions = self.q_network(next_states).argmax(1)
            next_q = self.target_network(next_states).gather(1, best_actions.unsqueeze(1)).squeeze()
            targets = rewards + self.gamma * next_q * (1 - dones)

        loss = nn.MSELoss()(q_values, targets)
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()

        self.epsilon = max(self.epsilon_min, self.epsilon * self.epsilon_decay)
        return loss.item()

    def update_target(self):
        self.target_network.load_state_dict(self.q_network.state_dict())
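
A minimal loop for driving this agent might look like the sketch below (CartPole-v1, the episode count, and the target-update interval are assumptions; the sketch reuses the imports and DQNAgent defined above).

# Sketch: training DQNAgent on CartPole-v1 (illustrative hyperparameters)
import gymnasium as gym

env = gym.make("CartPole-v1")
agent = DQNAgent(state_dim=env.observation_space.shape[0],
                 action_dim=env.action_space.n)

for episode in range(300):
    state, _ = env.reset()
    episode_reward, done = 0.0, False
    while not done:
        action = agent.select_action(state)
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        agent.store_transition(state, action, reward, next_state, float(done))
        agent.train_step()
        state = next_state
        episode_reward += reward
    if episode % 10 == 0:
        agent.update_target()                       # periodically sync the target network
        print(f"episode {episode}: return {episode_reward:.1f}")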

Implementation: PPO

class PPOAgent:
    def __init__(self, state_dim, action_dim, lr=3e-4, gamma=0.99,
                 lam=0.95, clip_ratio=0.2, epochs=10):
        self.gamma = gamma
        self.lam = lam
        self.clip_ratio = clip_ratio
        self.epochs = epochs

        self.actor = nn.Sequential(
            nn.Linear(state_dim, 64), nn.Tanh(),
            nn.Linear(64, 64), nn.Tanh(),
            nn.Linear(64, action_dim), nn.Softmax(dim=-1)
        )
        self.critic = nn.Sequential(
            nn.Linear(state_dim, 64), nn.Tanh(),
            nn.Linear(64, 64), nn.Tanh(),
            nn.Linear(64, 1)
        )
        self.optimizer = optim.Adam(
            list(self.actor.parameters()) + list(self.critic.parameters()), lr=lr
        )

    def compute_gae(self, rewards, values, dones):
        """Generalized Advantage Estimation."""
        advantages = []
        gae = 0
        for t in reversed(range(len(rewards))):
            next_value = values[t + 1] if t + 1 < len(values) else 0
            delta = rewards[t] + self.gamma * next_value * (1 - dones[t]) - values[t]
            gae = delta + self.gamma * self.lam * (1 - dones[t]) * gae
            advantages.insert(0, gae)
        return torch.FloatTensor(advantages)

    def update(self, states, actions, old_log_probs, rewards, dones):
        values = self.critic(states).squeeze().detach().numpy()
        advantages = self.compute_gae(rewards, values, dones)
        returns = advantages + torch.FloatTensor(values[:len(advantages)])
        advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)

        for _ in range(self.epochs):
            probs = self.actor(states)
            dist = torch.distributions.Categorical(probs)
            new_log_probs = dist.log_prob(actions)
            entropy = dist.entropy().mean()

            # Probability ratio pi_new(a|s) / pi_old(a|s)
            ratio = (new_log_probs - old_log_probs).exp()
            # PPO-Clip surrogate: limit how far the new policy moves from the old one
            clipped = torch.clamp(ratio, 1 - self.clip_ratio, 1 + self.clip_ratio)
            actor_loss = -torch.min(ratio * advantages, clipped * advantages).mean()

            critic_loss = nn.MSELoss()(self.critic(states).squeeze(), returns)

            # Combined objective: policy loss + value loss - entropy bonus for exploration
            loss = actor_loss + 0.5 * critic_loss - 0.01 * entropy
            self.optimizer.zero_grad()
            loss.backward()
            self.optimizer.step()
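
The update method expects whole-rollout tensors. A minimal sketch of collecting one rollout and calling it (CartPole-v1 and the rollout length are assumptions; the sketch reuses the imports and PPOAgent defined above) might look like:

# Sketch: one rollout followed by a PPO update (illustrative)
import gymnasium as gym

env = gym.make("CartPole-v1")
agent = PPOAgent(state_dim=env.observation_space.shape[0],
                 action_dim=env.action_space.n)

states, actions, log_probs, rewards, dones = [], [], [], [], []
state, _ = env.reset()
for t in range(2048):                               # fixed-length rollout
    with torch.no_grad():
        probs = agent.actor(torch.FloatTensor(state))
        dist = torch.distributions.Categorical(probs)
        action = dist.sample()
    next_state, reward, terminated, truncated, _ = env.step(action.item())
    states.append(state)
    actions.append(action.item())
    log_probs.append(dist.log_prob(action).item())
    rewards.append(reward)
    dones.append(float(terminated or truncated))
    state = next_state if not (terminated or truncated) else env.reset()[0]

agent.update(torch.FloatTensor(np.array(states)),
             torch.LongTensor(actions),
             torch.FloatTensor(log_probs),
             rewards, dones)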

Research Environments

| Environment              | Domain                         | Complexity             | Key Paper                  |
|--------------------------|--------------------------------|------------------------|----------------------------|
| Gymnasium (formerly Gym) | Classic control, Atari         | Low-High               | Brockman et al., 2016      |
| MuJoCo                   | Continuous control, robotics   | Medium-High            | Todorov et al., 2012       |
| DMControl                | Continuous control from pixels | High                   | Tassa et al., 2018         |
| ProcGen                  | Procedurally generated games   | High (generalization)  | Cobbe et al., 2020         |
| Minigrid                 | Grid-world navigation          | Low-Medium             | Chevalier-Boisvert et al.  |
| Isaac Gym                | GPU-accelerated physics sim    | High                   | Makoviychuk et al., 2021   |
| NetHack                  | Complex roguelike game         | Very High              | Küttler et al., 2020       |

Top Venues

| Venue   | Type       | Focus                             |
|---------|------------|-----------------------------------|
| NeurIPS | Conference | Broad ML including RL             |
| ICML    | Conference | Broad ML including RL             |
| ICLR    | Conference | Representation learning, deep RL  |
| AAAI    | Conference | Broad AI                          |
| CoRL    | Conference | Robot learning                    |
| JMLR    | Journal    | Broad ML (open access)            |
| L4DC    | Conference | Learning for dynamics and control |

Key Research Directions (2024-2025)

  1. RLHF / RLAIF: RL from human or AI feedback for LLM alignment
  2. Offline RL: Learning from pre-collected datasets without environment interaction
  3. Foundation models for control: Using pre-trained LLMs/VLMs as world models or planners
  4. Multi-agent RL: Cooperative and competitive settings with communication
  5. Safe RL: Constrained optimization to ensure safety during training and deployment
  6. Sample-efficient RL: Reducing the gap between model-free and model-based sample complexity