Claude-skill-registry exploration-strategies

Master ε-greedy, UCB, curiosity-driven, RND, intrinsic motivation exploration

install

source · Clone the upstream repo

git clone https://github.com/majiayu000/claude-skill-registry

Claude Code · Install into ~/.claude/skills/

T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/exploration-strategies" ~/.claude/skills/majiayu000-claude-skill-registry-exploration-strategies && rm -rf "$T"

manifest: skills/data/exploration-strategies/SKILL.md

source content

Exploration Strategies in Deep RL

When to Use This Skill

Invoke this skill when you encounter:

Exploration-Exploitation Problem: Agent stuck in local optimum, not finding sparse rewards
ε-Greedy Tuning: Designing or debugging epsilon decay schedules
Sparse Reward Environments: Montezuma's Revenge, goal-conditioned tasks, minimal feedback
Large State Spaces: Too many states for random exploration to be effective
Curiosity-Driven Learning: Implementing or understanding intrinsic motivation
RND (Random Network Distillation): Novelty-based exploration for sparse rewards
Count-Based Exploration: Encouraging discovery in discrete/tabular domains
Exploration Stability: Agent explores too much/little, inconsistent performance
Method Selection: Which exploration strategy for this problem?
Computational Cost: Balancing exploration sophistication vs overhead
Boltzmann Exploration: Softmax-based action selection and temperature tuning

Core Problem: Many RL agents get stuck exploiting a local optimum, never finding sparse rewards or exploring high-dimensional state spaces effectively. Choosing the right exploration strategy is fundamental to success.

Do NOT Use This Skill For

Algorithm selection (route to rl-foundations or specific algorithm skills like value-based-methods, policy-gradient-methods)
Reward design issues (route to reward-shaping-engineering)
Environment bugs causing poor exploration (route to rl-debugging first to verify environment works correctly)
Basic RL concepts (route to rl-foundations for MDPs, value functions, Bellman equations)
Training instability unrelated to exploration (route to appropriate algorithm skill or rl-debugging)

Core Principle: The Exploration-Exploitation Tradeoff

The Fundamental Tension

In reinforcement learning, every action selection is a decision:

Exploit: Take the action with highest estimated value (maximize immediate reward)
Explore: Try a different action to learn about its value (find better actions)

Exploitation Extreme:
- Only take the best-known action
- High immediate reward (in training)
- BUT: Stuck in local optimum if initial action wasn't optimal
- Risk: Never find the actual best reward

Exploration Extreme:
- Take random actions uniformly
- Will eventually find any reward
- BUT: Wasting resources on clearly bad actions
- Risk: No learning because too much randomness

Optimal Balance:
- Explore enough to find good actions
- Exploit enough to benefit from learning

Why Exploration Matters

Scenario 1: Sparse Reward Environment

Imagine an agent in Montezuma's Revenge (classic exploration benchmark):

Most states give reward = 0
First coin gives +1 (at step 500+)
Without exploring systematically, random actions won't find that coin in millions of steps

Without an exploration strategy (illustrative timeline):

Steps 0-1,000: Random actions, no reward signal
Steps 1,000-10,000: Stumbles on the coin and slowly learns to reach it
Problem: The first 1,000 steps were pure undirected exploration, and in harder games this phase can last millions of steps

With smart exploration (RND, illustrative timeline):

Steps 0-100: RND detects novel states, guides toward unexplored areas
Steps 100-500: Finds coin much faster because exploring strategically
Result: First reward found in roughly half as many steps

Scenario 2: Local Optimum Trap

Agent finds a small reward (+1) from a simple policy, while a larger reward (+5) exists elsewhere:

Without decay:
- Agent learns exploit_policy achieves +1
- ε-greedy with ε=0.3: Still 30% random (good, explores)
- BUT: 70% exploiting suboptimal policy indefinitely

With decay:
- Step 0: ε=1.0, 100% explore
- Step 100k: ε=0.05, 5% explore
- Step 500k: ε=0.01, 1% explore
- Result: Enough exploration to find +5 reward, then exploit it

Core Rule

Exploration is an investment with declining returns.

Early training: Exploration critical (don't know anything yet)
Mid training: Balanced (learning but not confident)
Late training: Exploitation dominant (confident in good actions)

Part 1: ε-Greedy Exploration

The Baseline Method

ε-Greedy is the simplest exploration strategy: with probability ε, take a random action; otherwise, take the greedy (best-known) action.

import numpy as np

def epsilon_greedy_action(q_values, epsilon):
    """
    Select action using ε-greedy.

    Args:
        q_values: Q(s, *) - values for all actions
        epsilon: exploration probability [0, 1]

    Returns:
        action: int (0 to num_actions-1)
    """
    if np.random.random() < epsilon:
        # Explore: random action
        return np.random.randint(len(q_values))
    else:
        # Exploit: best action
        return np.argmax(q_values)

Why ε-Greedy Works

Simple: Easy to implement and understand
Persistent Exploration: With ε > 0, every action keeps a nonzero probability in every state, so all reachable state-action pairs are eventually tried
Effective Baseline: Works surprisingly well for many tasks
Interpretable: ε has clear meaning (probability of random action)

When ε-Greedy Fails

Problem Space → Exploration Effectiveness:

Small discrete spaces (< 100 actions):
- ε-greedy: Excellent ✓
- Reason: Random exploration covers space quickly

Large discrete spaces (100-10,000 actions):
- ε-greedy: Poor ✗
- Reason: Random action is almost always bad
- Example: Game with 500 actions, random 1/500 chance is right action

Continuous action spaces:
- ε-greedy: Terrible ✗
- Reason: Random action in [-∞, ∞] is meaningless noise
- Alternative: Gaussian noise on action (not true ε-greedy)

Sparse rewards, large state spaces:
- ε-greedy: Hopeless ✗
- Reason: Random exploration won't find rare reward before heat death
- Alternative: Curiosity, RND, intrinsic motivation

ε-Decay Schedules

The key insight: ε should decay over time. Explore early, exploit late.

Linear Decay

def epsilon_linear(step, total_steps, epsilon_start=1.0, epsilon_end=0.1):
    """
    Linear decay from epsilon_start to epsilon_end.

    ε(t) = ε_start - (ε_start - ε_end) * t / T
    """
    t = min(step, total_steps)
    return epsilon_start - (epsilon_start - epsilon_end) * t / total_steps

Properties:

Simple, predictable, easy to tune
Equal exploration reduction per step
Good for most tasks

Guidance:

Use if no special knowledge about task
`epsilon_start = 1.0` (explore fully initially)
`epsilon_end = 0.01` to `0.1` (small residual exploration)
`total_steps = 1,000,000` (typical deep RL)

Exponential Decay

def epsilon_exponential(step, decay_rate=0.9995):
    """
    Exponential decay with constant rate.

    ε(t) = ε_0 * decay_rate^t
    """
    return 1.0 * (decay_rate ** step)

Properties:

Fast initial decay, slow tail
Aggressive early exploration cutoff
Exploration drops exponentially

Guidance:

Use if task rewards are found quickly
`decay_rate = 0.9995` is gentle (ε shrinks about 5% per 100 steps, halving roughly every 1,400 steps)
`decay_rate = 0.999` is aggressive (ε shrinks about 10% per 100 steps, halving roughly every 700 steps)
Watch for premature convergence to local optimum
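
To sanity-check a decay rate before training, it can help to solve ε(t) = ε_0 · decay_rate^t for t. The sketch below does that. The helper names `steps_to_reach` and `decay_rate_for` are illustrative and not part of any library.

```python
import math

def steps_to_reach(target_epsilon, decay_rate, epsilon_start=1.0):
    """Number of steps t at which epsilon_start * decay_rate**t equals target_epsilon."""
    return math.log(target_epsilon / epsilon_start) / math.log(decay_rate)

def decay_rate_for(target_epsilon, steps, epsilon_start=1.0):
    """Per-step decay rate that reaches target_epsilon after exactly `steps` steps."""
    return (target_epsilon / epsilon_start) ** (1.0 / steps)

# How long until epsilon falls to 0.01 for the two rates discussed above?
print(round(steps_to_reach(0.01, 0.9995)))  # roughly 9,200 steps
print(round(steps_to_reach(0.01, 0.999)))   # roughly 4,600 steps

# Which rate spreads the same decay over 1M steps?
print(decay_rate_for(0.01, 1_000_000))
```

Both rates reach ε = 0.01 within about ten thousand steps, which is short compared with a typical 1M-step run, so a rate derived from `decay_rate_for` is usually a safer starting point.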

Polynomial Decay

def epsilon_polynomial(step, total_steps, epsilon_start=1.0,
                       epsilon_end=0.01, power=2.0):
    """
    Polynomial decay: ε(t) = ε_end + (ε_start - ε_end) * (1 - t/T)^p

    power=1: Linear
    power=2: Quadratic (faster early decay)
    power=0.5: Slower decay
    """
    t = min(step, total_steps)
    fraction = t / total_steps
    # Interpolate toward epsilon_end so the schedule never drops to 0
    return epsilon_end + (epsilon_start - epsilon_end) * (1 - fraction) ** power

Properties:

Smooth, tunable decay curve
Power > 1: Fast early decay, slow tail
Power < 1: Slow early decay, fast tail

Guidance:

`power = 2.0`: Quadratic (balanced, common)
`power = 3.0`: Cubic (aggressive early decay)
`power = 0.5`: Slower (gentle early decay)

Practical Guidance: Choosing Epsilon Parameters

Rule of Thumb:
- epsilon_start = 1.0 (explore uniformly initially)
- epsilon_end = 0.01 to 0.1 (maintain minimal exploration)
  - 0.01: For large action spaces (need some exploration)
  - 0.05: Default choice
  - 0.1: For small action spaces (can afford random actions)
- total_steps: Based on training duration
  - Usually 500k to 1M steps
  - Longer if rewards are sparse or delayed

Task-Specific Adjustments:
- Sparse rewards: Longer decay (explore for more steps)
- Dense rewards: Shorter decay (can exploit earlier)
- Large action space: Lower epsilon_end with a longer decay (a uniformly random action is rarely the useful one)
- Small action space: Higher epsilon_end is affordable (random actions cover the space quickly)
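
To see how the three schedules differ in practice, the sketch below prints ε at a few checkpoints. It re-declares compact versions of the schedules so that it runs on its own. The exponential `rate=0.99999` and the floor at `end` are assumptions chosen for a 1M-step run, not values from the text above.

```python
def linear(step, total, start=1.0, end=0.01):
    t = min(step, total)
    return start - (start - end) * t / total

def exponential(step, rate=0.99999, start=1.0, end=0.01):
    # Floor at `end` so exploration never vanishes completely (assumed addition)
    return max(end, start * rate ** step)

def polynomial(step, total, start=1.0, end=0.01, power=2.0):
    frac = min(step, total) / total
    return end + (start - end) * (1 - frac) ** power

total = 1_000_000
print("step      linear  exponential  polynomial")
for step in (0, 100_000, 250_000, 500_000, 1_000_000):
    print(f"{step:>9} {linear(step, total):7.3f} "
          f"{exponential(step):12.3f} {polynomial(step, total):11.3f}")
```

Printing (or plotting) the schedule like this before a long run is a cheap way to catch a decay that bottoms out far earlier than intended.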

ε-Greedy Pitfall 1: Decay Too Fast

# WRONG: Reaches 0.01 after just 10k steps, then keeps shrinking toward 0
epsilon_final = 0.01
decay_steps = 10_000
epsilon = epsilon_final ** (step / decay_steps)  # ← BUG: far too fast, no floor

# CORRECT: Decays gently over training
total_steps = 1_000_000
epsilon = epsilon_linear(step, total_steps, epsilon_start=1.0, epsilon_end=0.01)

Symptom: Agent plateaus early, never improves past initial local optimum

Fix: Use longer decay schedule, ensure epsilon_end > 0

ε-Greedy Pitfall 2: Never Decays (Constant ε)

# WRONG: Fixed epsilon forever
epsilon = 0.3  # Constant

# CORRECT: Decay epsilon over time
epsilon = epsilon_linear(step, total_steps=1_000_000)

Symptom: Agent learns but performance noisy, can't fully exploit learned policy

Fix: Add epsilon decay schedule

ε-Greedy Pitfall 3: Epsilon on Continuous Actions

# WRONG: Discrete epsilon-greedy on continuous actions
action = np.random.uniform(-1, 1) if np.random.random() < epsilon else greedy_action

# CORRECT: Gaussian noise on continuous actions
def continuous_exploration(action, exploration_std=0.1):
    return action + np.random.normal(0, exploration_std, action.shape)

Symptom: Continuous action spaces don't benefit from ε-greedy (random action is meaningless)

Fix: Use Gaussian noise or other continuous exploration methods

Part 2: Boltzmann Exploration

Temperature-Based Action Selection

Instead of a deterministic greedy choice, sample actions with probability proportional to exp(Q/T), that is, a softmax over the Q-values with temperature T.

def boltzmann_exploration(q_values, temperature=1.0):
    """
    Select action using Boltzmann distribution.

    P(a) = exp(Q(s,a) / T) / Σ exp(Q(s,a') / T)

    Args:
        q_values: Q(s, *) - values for all actions
        temperature: Exploration parameter
          T → 0: Becomes deterministic (greedy)
          T → ∞: Becomes uniform random

    Returns:
        action: int (sampled from distribution)
    """
    # Subtract max for numerical stability
    q_shifted = q_values - np.max(q_values)

    # Compute probabilities
    probabilities = np.exp(q_shifted / temperature)
    probabilities = probabilities / np.sum(probabilities)

    # Sample action
    return np.random.choice(len(q_values), p=probabilities)

Properties vs ε-Greedy

| Feature | ε-Greedy | Boltzmann |
|---|---|---|
| Best action | Probability: 1-ε+ε/n | Probability: highest (∝ exp(Q/T)) |
| Other actions | Probability: ε/n each, regardless of Q | Probability: lower, decreasing with Q |
| Action selection | Greedy, with occasional uniform random | Sampled from a softmax distribution |
| Exploration | Uniform random | Biased toward better actions |
| Tuning | ε (1 parameter) | T (1 parameter) |

Key Advantage: Boltzmann balances better—good actions are preferred but still get chances.

Example: Three actions with Q=[10, 0, -10]

ε-Greedy (ε=0.2, random branch uniform over all 3 actions):
- Action 0: P≈0.87 (0.8 exploit + 0.2/3 random)
- Action 1: P≈0.07 (random)
- Action 2: P≈0.07 (random)
- Problem: Actions 1 and 2 (Q=0, -10) are sampled equally often despite very different values

Boltzmann (T=5):
- Action 0: P≈0.87 (exp(10/5)=e^2≈7.39)
- Action 1: P≈0.12 (exp(0/5)=1)
- Action 2: P≈0.02 (exp(-10/5)≈0.135)
- Better: Action 1 gets about 12% while the clearly bad Action 2 gets about 2%
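
The softmax probabilities are easy to check numerically, and they are quite sensitive to the temperature. The sketch below evaluates Q=[10, 0, -10] at several temperatures. `boltzmann_probs` is an illustrative helper that returns the distribution instead of sampling from it.

```python
import numpy as np

def boltzmann_probs(q_values, temperature):
    """Softmax over Q-values with temperature, shifted by the max for stability."""
    q = np.asarray(q_values, dtype=float)
    z = (q - q.max()) / temperature
    p = np.exp(z)
    return p / p.sum()

q = [10.0, 0.0, -10.0]
for T in (1.0, 2.0, 5.0, 20.0):
    print(f"T={T:>4}: {np.round(boltzmann_probs(q, T), 3)}")
```

With a Q-value spread of 20, low temperatures such as T=1 or T=2 are already almost greedy, so the temperature has to be chosen relative to the scale of the Q-values.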

Temperature Decay Schedule

Like epsilon, temperature should decay: start high (explore), end low (exploit).

def temperature_decay(step, total_steps, temp_start=1.0, temp_end=0.1):
    """
    Linear temperature decay.

    T(t) = T_start - (T_start - T_end) * t / T_total
    """
    t = min(step, total_steps)
    return temp_start - (temp_start - temp_end) * t / total_steps

# Usage in training loop
for step in range(total_steps):
    T = temperature_decay(step, total_steps)
    action = boltzmann_exploration(q_values, temperature=T)
    # ...

When to Use Boltzmann vs ε-Greedy

Choose ε-Greedy if:
- Simple implementation preferred
- Discrete action space
- Task has clear good/bad actions (wide Q-value spread)

Choose Boltzmann if:
- Actions have similar Q-values (nuanced exploration)
- Want to bias exploration toward promising actions
- Fine-grained control over exploration desired

Part 3: UCB (Upper Confidence Bound)

Theoretical Optimality

UCB1 is provably near-optimal for the stochastic multi-armed bandit problem (its regret grows only logarithmically with the number of pulls):

def ucb_action(q_values, action_counts, total_visits, c=1.0):
    """
    Select action using Upper Confidence Bound.

    UCB(a) = Q(a) + c * sqrt(ln(N) / N(a))

    Args:
        q_values: Current Q-value estimates
        action_counts: N(a) - times each action visited
        total_visits: N - total visits to state
        c: Exploration constant (usually 1.0 or sqrt(2))

    Returns:
        action: int (maximizing UCB)
    """
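    action_counts = np.asarray(action_counts, dtype=float)

    # Try every action once first: an untried action has unbounded uncertainty
    untried = np.flatnonzero(action_counts == 0)
    if untried.size > 0:
        return int(untried[0])

    # Compute exploration bonus (max(..., 1) keeps the log non-negative)
    exploration_bonus = c * np.sqrt(np.log(max(total_visits, 1)) / action_counts)

    # Upper confidence bound
    ucb = q_values + exploration_bonus

    return int(np.argmax(ucb))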

Why UCB Works

UCB balances exploitation and exploration via optimism under uncertainty:

If Q(a) is high → exploit it
If Q(a) is uncertain (rarely visited) → exploration bonus makes UCB high

Example: Bandit with 2 arms, c = sqrt(2)
- Arm A: Visited 100 times, estimated Q=2.0
- Arm B: Visited 10 times, estimated Q=1.5

UCB(A) = 2.0 + sqrt(2) * sqrt(ln(110) / 100) ≈ 2.0 + 0.31 = 2.31
UCB(B) = 1.5 + sqrt(2) * sqrt(ln(110) / 10) ≈ 1.5 + 0.97 = 2.47

Result: Try Arm B despite lower Q estimate (less certain). With c=1.0 the bonuses shrink to 0.22 and 0.69, and Arm A (2.22 vs 2.19) would still be chosen.
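
The arithmetic for this two-arm bandit can be reproduced in a few lines. The sketch assumes the exploration constant c = sqrt(2), one of the common choices listed in the docstring above.

```python
import numpy as np

q_values = np.array([2.0, 1.5])   # estimated values for arms A and B
counts = np.array([100, 10])      # N(a): pulls per arm
total = counts.sum()              # N = 110 pulls overall
c = np.sqrt(2)                    # assumed exploration constant

bonus = c * np.sqrt(np.log(total) / counts)
ucb = q_values + bonus

print("bonus:", np.round(bonus, 2))   # about 0.31 for A, 0.97 for B
print("ucb:  ", np.round(ucb, 2))     # about 2.31 for A, 2.47 for B
print("chosen arm index:", int(np.argmax(ucb)))  # 1, i.e. arm B
```

The less-visited arm wins only because its bonus outweighs the 0.5 gap in estimated value, so the outcome depends on the choice of c.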

Critical Limitation: Doesn't Scale to Deep RL

UCB assumes tabular setting (small, discrete state space where you can count visits):

# WORKS: Tabular Q-learning
from collections import defaultdict

state_action_counts = defaultdict(int)  # N(s, a)
state_counts = defaultdict(int)  # N(s)

# BREAKS in deep RL:
# With function approximation, states don't repeat exactly
# Can't count "how many times visited state X" in continuous/image observations

Practical Issue:

In image-based RL (Atari, vision), the agent almost never sees exactly the same pixel image twice, so exact state counting is not feasible.

When UCB Applies

Use UCB if:
✓ Discrete action space (< 100 actions)
✓ Discrete state space (< 10,000 states)
✓ Tabular Q-learning (no function approximation)
✓ Rewards come quickly (don't need long-term planning)

Examples: Simple bandits, small Gridworlds, discrete card games

DO NOT use UCB if:
✗ Using neural networks (state approximation)
✗ Continuous actions or large state space
✗ Image observations (pixel space too large)
✗ Sparse rewards (need different methods)

Connection to Deep RL

For deep RL, need to estimate uncertainty without explicit counts:

def deep_ucb_approximation(mean_q, uncertainty, c=1.0):
    """
    Approximate UCB using learned uncertainty (not action counts).

    Used in methods like:
    - Deep Ensembles: Use ensemble variance as uncertainty
    - Dropout: Use MC-dropout variance
    - Bootstrap DQN: Ensemble of Q-networks

    UCB ≈ Q(s,a) + c * uncertainty(s,a)
    """
    return mean_q + c * uncertainty

Modern Approach: Instead of counting visits, learn uncertainty through:

Ensemble Methods: Train multiple Q-networks, use disagreement
Bayesian Methods: Learn posterior over Q-values
Bootstrap DQN: Separate Q-networks give uncertainty estimates

These adapt UCB principles to deep RL.
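
As a rough sketch of the ensemble idea, the disagreement between ensemble members can stand in for the count-based bonus. The function name `ensemble_ucb_action` and the Q-values below are hypothetical; in practice each row would come from a separately trained Q-network.

```python
import numpy as np

def ensemble_ucb_action(ensemble_q, c=1.0):
    """
    Pick the action maximizing mean + c * std across ensemble members.

    ensemble_q: shape (n_members, n_actions), one row of Q(s, ·) per member.
    """
    ensemble_q = np.asarray(ensemble_q, dtype=float)
    mean_q = ensemble_q.mean(axis=0)
    uncertainty = ensemble_q.std(axis=0)  # disagreement between members
    return int(np.argmax(mean_q + c * uncertainty))

# Made-up outputs of 4 Q-networks for one state with 3 actions
ensemble_q = [
    [1.0,  0.2, 0.5],
    [1.1,  1.6, 0.5],
    [0.9, -0.4, 0.5],
    [1.0,  1.0, 0.5],
]
print(ensemble_ucb_action(ensemble_q, c=0.0))  # 0: highest mean, pure exploitation
print(ensemble_ucb_action(ensemble_q, c=2.0))  # 1: members disagree most here
```

Action 1 has a lower mean than action 0, but the members disagree strongly about it, which plays the same role as a low visit count in tabular UCB.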

Part 4: Curiosity-Driven Exploration (ICM)

The Core Insight

Prediction Error as Exploration Signal

Agent is "curious" about states where it can't predict the next state well:

Intuition: If I can't predict what will happen, I probably
haven't learned about this state yet. Let me explore here!

Intrinsic Reward = ||next_state - predicted_next_state||^2

Intrinsic Curiosity Module (ICM)

import torch
import torch.nn as nn

class IntrinsicCuriosityModule(nn.Module):
    """
    ICM = Forward Model + Inverse Model

    Forward Model: Predicts next state from (state, action)
    - Input: current state + action taken
    - Output: predicted next state
    - Error: prediction error = surprise

    Inverse Model: Predicts action from (state, next_state)
    - Input: current state and next state
    - Output: predicted action taken
    - Purpose: Learn representation that distinguishes states
    """

    def __init__(self, state_dim, action_dim, hidden_dim=128):
        super().__init__()

        # self.forward is reserved for nn.Module's forward pass, so the two
        # submodules get distinct names (inverse_model / forward_model)

        # Inverse model: (s, s') → a
        self.inverse_model = nn.Sequential(
            nn.Linear(2 * state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, action_dim)
        )

        # Forward model: (s, a) → s'
        self.forward_model = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, state_dim)
        )

    def compute_intrinsic_reward(self, state, action, next_state):
        """
        Curiosity reward = prediction error of forward model.

        high_error → Unseen state → Reward exploration
        low_error → Seen state → Ignore (already learned)

        `action` must be a float tensor of size action_dim
        (one-hot for discrete actions).
        """
        # The reward is a training signal, not a loss, so no gradients are needed
        with torch.no_grad():
            # Predict next state
            predicted_next = self.forward_model(torch.cat([state, action], dim=-1))

            # Compute prediction error
            prediction_error = torch.norm(next_state - predicted_next, dim=-1)

        # Intrinsic reward is prediction error (exploration bonus)
        return prediction_error

    def loss(self, state, action, next_state, action_pred_logits=None):
        """
        Combine forward and inverse losses.

        Forward loss: Forward model prediction error
        Inverse loss: Inverse model action prediction error
        """
        # Forward loss
        predicted_next = self.forward_model(torch.cat([state, action], dim=-1))
        forward_loss = torch.mean((next_state - predicted_next) ** 2)

        # Inverse loss: predict the action from (s, s') unless logits are supplied
        if action_pred_logits is None:
            action_pred_logits = self.inverse_model(
                torch.cat([state, next_state], dim=-1)
            )
        predicted_action = action_pred_logits
        # MSE suits continuous actions; cross-entropy is the usual choice for discrete ones
        inverse_loss = torch.mean((action - predicted_action) ** 2)

        return forward_loss + inverse_loss

Why Both Forward and Inverse Models?

Forward model alone:
- Can predict next state without learning features
- Might just memorize (Q: Do pixels change when I do action X?)
- Doesn't necessarily learn task-relevant state representation

Inverse model:
- Forces feature learning that distinguishes states
- Can only predict action if states are well-represented
- Improves forward model's learned representation

Together: Forward + Inverse
- Better feature learning (inverse helps)
- Better prediction (forward is primary)

Critical Pitfall: Random Environment Trap

# WRONG: Using curiosity in stochastic environment
# Environment: Atari with pixel randomness/motion artifacts

# Agent is rewarded for FAILING to predict, and pixel noise can never be predicted
# Prediction error = pixels changed randomly
# Intrinsic reward goes to the noisiest state!
# Result: Agent learns nothing about task, just explores random pixels

# BETTER: Use RND instead (next section)
# RND's target is a FROZEN, deterministic function of the current observation,
# so stochastic transitions alone do not produce irreducible error

Key Distinction:

ICM: Learns to predict environment dynamics (prediction error never vanishes where transitions are random)
RND: Predicts a frozen random network's output for the current observation (far less sensitive to transition randomness)

Computational Cost

# ICM adds significant overhead:
# - Forward model network (encoder + layers + output)
# - Inverse model network (encoder + layers + output)
# - Training both networks every step

# Overhead estimate:
# Base agent: 1 network (policy/value)
# With ICM: 3+ networks (policy + forward + inverse)
# Training time: ~2-3× longer
# Memory: ~3× larger

# When justified:
# - Sparse rewards (ICM critical)
# - Large state spaces (ICM helps)
#
# When NOT justified:
# - Dense rewards (environment signal sufficient)
# - Continuous control with simple rewards (Gaussian action noise is enough)

Part 5: RND (Random Network Distillation)

The Elegant Solution

RND is simpler and more robust than ICM:

class RandomNetworkDistillation(nn.Module):
    """
    RND: Intrinsic reward = prediction error of target network

    Key innovation: Target network is RANDOM and FROZEN
    (never updated)

    Two networks:
    1. Target (random, frozen): f_target(s) - fixed throughout training
    2. Predictor (trained): f_predict(s) - learns to predict target

    Intrinsic reward = ||f_target(s) - f_predict(s)||^2

    New state (s not seen) → high prediction error → reward exploration
    Seen state (s familiar) → low prediction error → ignore
    """

    def __init__(self, state_dim, embedding_dim=128):
        super().__init__()

        # Target network: random, never updates
        self.target = nn.Sequential(
            nn.Linear(state_dim, embedding_dim),
            nn.ReLU(),
            nn.Linear(embedding_dim, embedding_dim)
        )

        # Predictor network: learns to mimic target
        self.predictor = nn.Sequential(
            nn.Linear(state_dim, embedding_dim),
            nn.ReLU(),
            nn.Linear(embedding_dim, embedding_dim)
        )

        # Freeze target network
        for param in self.target.parameters():
            param.requires_grad = False

    def compute_intrinsic_reward(self, state, scale=1.0):
        """
        Intrinsic reward = prediction error of target network.

        Args:
            state: Current observation
            scale: Scale factor for reward (usually 0.1-1.0)

        Returns:
            Intrinsic reward (novelty signal)
        """
        # The reward is a signal, not a loss, so neither network needs gradients here
        with torch.no_grad():
            target_features = self.target(state)
            predicted_features = self.predictor(state)

        # L2 prediction error
        prediction_error = torch.norm(
            target_features - predicted_features,
            dim=-1,
            p=2
        )

        return scale * prediction_error

    def predictor_loss(self, state):
        """
        Loss for predictor: minimize prediction error.

        Only update predictor (target stays frozen).
        """
        with torch.no_grad():
            target_features = self.target(state)

        predicted_features = self.predictor(state)

        # MSE loss
        return torch.mean((target_features - predicted_features) ** 2)

Why RND is Elegant

No Environment Model: Doesn't need to model dynamics (unlike ICM)
Robust to Randomness: The target is a fixed, deterministic function of the observation, so random transitions do not create prediction error that can never be learned away
Simple: Just predict random features
Fast: Train only predictor (target frozen)

RND vs ICM Comparison

| Aspect | ICM | RND |
|---|---|---|
| Networks | Forward + Inverse | Target (frozen) + Predictor |
| Learns | Environment dynamics | Random feature prediction |
| Robust to noise | No (breaks with stochastic envs) | Mostly (deterministic target; observations that are themselves pure noise can still look novel) |
| Complexity | High (3+ networks, 2 losses) | Medium (2 networks, 1 loss) |
| Computation | 2-3× base agent | 1.5-2× base agent |
| When to use | Dense features, clean env | Sparse rewards, noisy env |

RND Pitfall: Training Instability

# WRONG: High learning rate, large reward scale
rnd_loss = rnd.predictor_loss(state)
optimizer.zero_grad()
rnd_loss.backward()
optimizer.step()  # ← high learning rate causes divergence

# CORRECT: Careful hyperparameter tuning
rnd_lr = 1e-4  # Conservative learning rate for the predictor
rnd_optimizer = torch.optim.Adam(rnd.predictor.parameters(), lr=rnd_lr)

# Scale intrinsic reward appropriately
intrinsic_reward = rnd.compute_intrinsic_reward(state, scale=0.01)

Symptom: RND rewards explode, agent overfits to novelty

Fix: Lower learning rate for RND, scale intrinsic rewards carefully
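
A fixed scale factor is hard to choose because raw prediction errors shrink as the predictor trains. One common remedy is to divide the intrinsic reward by a running estimate of its standard deviation. The sketch below shows the idea. `RunningStd` is a hypothetical helper, and the exponential samples are only a stand-in for raw RND errors.

```python
import numpy as np

class RunningStd:
    """Tracks mean and variance of a stream of batches (parallel Welford update)."""

    def __init__(self, epsilon=1e-4):
        self.mean = 0.0
        self.var = 1.0
        self.count = epsilon  # tiny prior count avoids division by zero

    def update(self, batch):
        batch = np.asarray(batch, dtype=np.float64)
        batch_mean, batch_var, n = batch.mean(), batch.var(), batch.size
        delta = batch_mean - self.mean
        total = self.count + n
        # Merge the running moments with the moments of the new batch
        m2 = (self.var * self.count + batch_var * n
              + delta ** 2 * self.count * n / total)
        self.mean += delta * n / total
        self.var = m2 / total
        self.count = total

    def normalize(self, batch):
        """Scale rewards to roughly unit variance (the mean is not subtracted)."""
        return np.asarray(batch, dtype=np.float64) / np.sqrt(self.var + 1e-8)

rng = np.random.default_rng(0)
stats = RunningStd()
for _ in range(100):
    raw = rng.exponential(scale=50.0, size=256)  # stand-in for raw RND errors
    stats.update(raw)

print("raw std:       ", raw.std())
print("normalized std:", stats.normalize(raw).std())
```

After normalization the intrinsic reward has a stable scale, so the weight used to combine it with the task reward keeps roughly the same meaning throughout training.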

Part 6: Count-Based Exploration

State Visitation Counts

For discrete/tabular environments, track how many times each state visited:

from collections import defaultdict

class CountBasedExploration:
    """
    Count-based exploration: encourage visiting rarely-seen states.

    Works for:
    ✓ Tabular (small discrete state space)
    ✓ Gridworlds, simple games

    Doesn't work for:
    ✗ Continuous spaces
    ✗ Image observations (never see same image twice)
    ✗ Large state spaces
    """

    def __init__(self):
        self.state_counts = defaultdict(int)

    def compute_intrinsic_reward(self, state, reward_scale=1.0):
        """
        Intrinsic reward inversely proportional to state visitation.

        intrinsic_reward = reward_scale / sqrt(N(s))

        Rarely visited states (small N) → high intrinsic reward
        Frequently visited states (large N) → low intrinsic reward
        """
        count = max(self.state_counts[state], 1)  # Avoid division by zero
        return reward_scale / np.sqrt(count)

    def update_counts(self, state):
        """Increment visitation count for state."""
        self.state_counts[state] += 1

Example: Gridworld with Sparse Reward

# Gridworld: 10×10 grid, reward at (9, 9), start at (0, 0)
# Without an exploration bonus: undirected random walking keeps revisiting the same cells
# With count-based: Directed toward unexplored cells

# Pseudocode (env, q_values, epsilon_greedy, alpha, gamma, epsilon assumed to exist):
beta = 0.1  # example bonus weight ("lambda" is a reserved word in Python)
for episode in range(episodes):
    state = env.reset()
    count_explorer.update_counts(state)
    for step in range(max_steps):
        action = epsilon_greedy(q_values[state], epsilon)
        next_state, env_reward = env.step(action)

        # Update counts, then compute the bonus for the state just reached
        count_explorer.update_counts(next_state)
        intrinsic_reward = count_explorer.compute_intrinsic_reward(next_state)

        # Combine with task reward
        combined_reward = env_reward + beta * intrinsic_reward

        # Q-learning with combined reward
        q_values[state][action] += alpha * (
            combined_reward + gamma * max(q_values[next_state]) - q_values[state][action]
        )
        state = next_state

Critical Limitation: Doesn't Scale

# Works: Small state space
state_space_size = 100  # 10×10 grid
# Can track counts for all states

# Fails: Large/continuous state space
state_space_size = 10**18  # Image observations
# Can't track visitation counts for 10**18 unique states!

Part 7: When Exploration is Critical

Decision Framework

Exploration matters when:

Sparse Rewards (rewards rare, hard to find)
- Examples: Montezuma's Revenge, goal-conditioned tasks, real robotics
- No dense reward signal to guide learning
- Agent must explore to find any reward
- Solution: Intrinsic motivation (curiosity, RND)
Large State Spaces (too many possible states)
- Examples: Image-based RL, continuous control
- Random exploration covers infinitesimal fraction
- Systematic exploration essential
- Solution: Curiosity-driven or RND
Long Horizons (many steps before reward)
- Examples: Multi-goal tasks, planning problems
- Temporal credit assignment hard
- Need to explore systematically to connect actions to delayed rewards
- Solution: Sophisticated exploration strategy
Deceptive Reward Landscape (local optima common)
- Examples: Multiple solutions, trade-offs
- Easy to get stuck in suboptimal policy
- Exploration helps escape local optima
- Solution: Slow decay schedule, maintain exploration

Decision Framework (Quick Check)

Do you have SPARSE rewards?
  YES → Use intrinsic motivation (curiosity, RND)
  NO → Continue

Is state space large (images, continuous)?
  YES → Use curiosity-driven or RND
  NO → Continue

Is exploration reasonably efficient with ε-greedy?
  YES → Use ε-greedy + appropriate decay schedule
  NO → Use curiosity-driven or RND

Example: Reward Structure Analysis

def analyze_reward_structure(rewards):
    """Determine if exploration strategy needed."""

    # Check sparsity
    nonzero_rewards = np.count_nonzero(rewards)
    sparsity = 1 - (nonzero_rewards / len(rewards))

    if sparsity > 0.95:
        print("SPARSE REWARDS detected")
        print("  → Use: Intrinsic motivation (RND or curiosity)")
        print("  → Why: Reward signal too rare to guide learning")

    # Check reward magnitude
    reward_std = np.std(rewards)
    reward_mean = np.mean(rewards)

    if reward_std < 0.1:
        print("WEAK/NOISY REWARDS detected")
        print("  → Use: Intrinsic motivation")
        print("  → Why: Reward signal insufficient to learn from")

    # Check reward coverage
    episode_length = len(rewards)
    if episode_length > 1000:
        print("LONG HORIZONS detected")
        print("  → Use: Strong exploration decay or intrinsic motivation")
        print("  → Why: Temporal credit assignment difficult")

Part 8: Combining Exploration with Task Rewards

Combining Intrinsic and Extrinsic Rewards

When using intrinsic motivation, balance with task reward:

def combine_rewards(extrinsic_reward, intrinsic_reward,
                    intrinsic_scale=0.01):
    """
    Combine extrinsic (task) and intrinsic (curiosity) rewards.

    r_total = r_extrinsic + λ * r_intrinsic

    λ controls tradeoff:
    - λ = 0: Ignore intrinsic reward (no exploration)
    - λ = 0.01: Curiosity helps, task reward primary (typical)
    - λ = 0.1: Curiosity significant
    - λ = 1.0: Curiosity dominates (might ignore task)
    """
    return extrinsic_reward + intrinsic_scale * intrinsic_reward

Challenges: Reward Hacking

# PROBLEM: Intrinsic reward encourages anything novel
# Even if novel thing is useless for task

# Example: Atari with a novelty bonus
# If the screen contains a source of ever-changing pixels, those observations keep looking novel
# The agent may chase them instead of exploring to find coins/power-ups

# SOLUTION 1: Scale intrinsic reward carefully
# Make it significant but not dominant

# SOLUTION 2: Curriculum learning
# Start with high intrinsic reward (discover environment)
# Gradually reduce as agent finds reward signals
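
One simple way to realize the curriculum idea is to anneal the intrinsic weight λ over training, in the same way ε is decayed. The sketch below uses a linear schedule. The function names and the default values 0.1 and 0.001 are assumptions for illustration, not tuned settings.

```python
def intrinsic_scale_schedule(step, total_steps, scale_start=0.1, scale_end=0.001):
    """Linearly anneal the intrinsic reward weight from scale_start to scale_end."""
    t = min(step, total_steps)
    return scale_start - (scale_start - scale_end) * t / total_steps

def combined_reward(extrinsic, intrinsic, step, total_steps):
    """r_total = r_extrinsic + λ(step) * r_intrinsic with a decaying λ."""
    return extrinsic + intrinsic_scale_schedule(step, total_steps) * intrinsic

total_steps = 1_000_000
for step in (0, 500_000, 1_000_000):
    # Same raw rewards, different training stage
    print(step, combined_reward(extrinsic=1.0, intrinsic=5.0,
                                step=step, total_steps=total_steps))
```

Early in training the novelty bonus contributes a noticeable share of the total reward, and by the end the task reward dominates.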

Intrinsic Reward Scale Tuning

# Quick tuning procedure:
for intrinsic_scale in [0.001, 0.01, 0.1, 1.0]:
    agent = RL_Agent(intrinsic_reward_scale=intrinsic_scale)
    for episode in episodes:
        performance = train_episode(agent)

    print(f"Scale={intrinsic_scale}: Performance={performance}")

# Find scale where agent learns task well AND explores
# Usually 0.01-0.1 is sweet spot

Part 9: Common Pitfalls and Debugging

Pitfall 1: Epsilon Decay Too Fast

Symptom: Agent plateaus at poor performance early in training

Root Cause: Epsilon decays to near-zero before agent finds good actions

# WRONG: Decays in 10k steps
epsilon_final = 0.0
epsilon_decay = 0.999  # Per-step decay
# After 10k steps: ε = 0.999^10000 ≈ 0.00005, almost no exploration left

# CORRECT: Decay over full training
total_training_steps = 1_000_000
epsilon = epsilon_linear(step, total_training_steps,
                         epsilon_start=1.0, epsilon_end=0.01)

Diagnosis:

Plot epsilon over training: does it reach 0 too early?
Check if performance improves after epsilon reaches low values

Fix:

Use longer decay (more steps)
Use higher epsilon_end (never go to pure exploitation)

Pitfall 2: Intrinsic Reward Too Strong

Symptom: Agent explores forever, ignores task reward

Root Cause: Intrinsic reward scale too high

# WRONG: Intrinsic reward dominates
r_total = r_task + 1.0 * r_intrinsic
# Agent optimizes novelty, ignores task

# CORRECT: Intrinsic reward is small bonus
r_total = r_task + 0.01 * r_intrinsic
# Task reward primary, intrinsic helps exploration

Diagnosis:

Agent explores everywhere but doesn't collect task rewards
Intrinsic reward signal going to seemingly useless states

Fix:

Reduce intrinsic_reward_scale (try 0.01, 0.001)
Verify agent eventually starts collecting task rewards

Pitfall 3: ε-Greedy on Continuous Actions

Symptom: Exploration ineffective, agent doesn't learn

Root Cause: Random action in continuous space is meaningless

# WRONG: ε-greedy on continuous actions
if np.random.random() < epsilon:
    action = np.random.uniform(-1, 1)  # Random in action space
else:
    action = network(state)  # Neural network action

# Random action is far from learned policy, completely unhelpful

# CORRECT: Gaussian noise on action
action = network(state)
noisy_action = action + np.random.normal(0, exploration_std)
noisy_action = np.clip(noisy_action, -1, 1)

Diagnosis:

Continuous action space and using ε-greedy
Agent not learning effectively

Fix:

Use Gaussian noise: action + N(0, σ)
Decay exploration_std over time (like epsilon decay)
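
The two fixes above can be combined in a few lines. This is a minimal sketch under stated assumptions: `exploration_std_linear` borrows its name from a call used elsewhere in this skill, but its signature and defaults here are guesses, and `add_exploration_noise` is a hypothetical helper. Actions are assumed to be bounded in [-1, 1].

```python
import numpy as np

def exploration_std_linear(step, total_steps, std_start=0.3, std_end=0.01):
    """Linearly anneal the noise scale from std_start to std_end."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return std_start + frac * (std_end - std_start)

def add_exploration_noise(action, std, rng, low=-1.0, high=1.0):
    """Perturb a policy action with independent Gaussian noise per dimension."""
    action = np.asarray(action, dtype=float)
    noise = rng.normal(0.0, std, size=action.shape)
    return np.clip(action + noise, low, high)

rng = np.random.default_rng(0)
policy_action = np.array([0.2, -0.9])  # stand-in for network(state)
for step in (0, 50_000, 100_000):
    std = exploration_std_linear(step, total_steps=100_000)
    print(step, round(std, 3), add_exploration_noise(policy_action, std, rng))
```

Because the noise is centered on the policy's own action, exploration stays close to what the agent has already learned and narrows as the noise scale decays.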

Pitfall 4: Forgetting to Decay Exploration

Symptom: Training loss decreases but policy doesn't improve, noisy behavior

Root Cause: Agent keeps exploring randomly instead of exploiting learned policy

# WRONG: Constant exploration forever
epsilon = 0.3

# CORRECT: Decaying exploration
epsilon = epsilon_linear(step, total_steps)

Diagnosis:

No epsilon decay schedule mentioned in code
Agent behaves randomly even after many training steps

Fix:

Add decay schedule (linear, exponential, polynomial)
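
The three schedule families can be sketched as follows. The function names match calls used in this skill, but the bodies and signatures below are assumptions written for illustration; in particular, the exponential variant is parameterized so that it reaches `epsilon_end` exactly at `total_steps`, which requires `epsilon_end > 0`.

```python
def _progress(step, total_steps):
    """Training progress clamped to [0, 1]."""
    return min(max(step / total_steps, 0.0), 1.0)

def epsilon_linear(step, total_steps, epsilon_start=1.0, epsilon_end=0.01):
    p = _progress(step, total_steps)
    return epsilon_start + p * (epsilon_end - epsilon_start)

def epsilon_exponential(step, total_steps, epsilon_start=1.0, epsilon_end=0.01):
    # Geometric interpolation: drops quickly at first, then flattens out
    p = _progress(step, total_steps)
    return epsilon_start * (epsilon_end / epsilon_start) ** p

def epsilon_polynomial(step, total_steps, epsilon_start=1.0, epsilon_end=0.01,
                       power=2.0):
    p = _progress(step, total_steps)
    return epsilon_end + (epsilon_start - epsilon_end) * (1.0 - p) ** power

# Compare the three curves at a few points of a 100k-step run
for step in (0, 25_000, 50_000, 100_000):
    print(step,
          round(epsilon_linear(step, 100_000), 4),
          round(epsilon_exponential(step, 100_000), 4),
          round(epsilon_polynomial(step, 100_000), 4))
```

At the midpoint the exponential curve is the lowest, the linear curve the highest, and the power-2 polynomial sits in between.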

Pitfall 5: Using Exploration at Test Time

Symptom: Test performance worse than training, highly variable

Root Cause: Applying exploration strategy (ε > 0) at test time

# WRONG: Test with exploration
for test_episode in test_episodes:
    action = epsilon_greedy(q_values, epsilon=0.05)  # Wrong!
    # Agent still explores at test time

# CORRECT: Test with greedy policy
for test_episode in test_episodes:
    action = np.argmax(q_values)  # Deterministic, no exploration

Diagnosis:

Test performance has high variance
Test performance < training performance (exploration hurts)

Fix:

At test time, use greedy/deterministic policy
No ε-greedy, no Boltzmann, no exploration noise
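
A simple guard against this pitfall is an explicit switch in the action-selection function, so evaluation code cannot explore by accident. The sketch below assumes a discrete action space; `select_action` and its `explore` flag are hypothetical names.

```python
import numpy as np

def select_action(q_values, epsilon, rng, explore=True):
    """ε-greedy during training, purely greedy when explore=False."""
    q_values = np.asarray(q_values)
    if explore and rng.random() < epsilon:
        return int(rng.integers(len(q_values)))  # uniform random action
    return int(np.argmax(q_values))              # greedy action

rng = np.random.default_rng(0)
q = [0.1, 0.9, 0.3]
train_actions = [select_action(q, epsilon=0.5, rng=rng) for _ in range(10)]
test_actions = [select_action(q, epsilon=0.5, rng=rng, explore=False)
                for _ in range(10)]
print("train:", train_actions)
print("test: ", test_actions)
```

Passing `explore=False` makes the result independent of whatever ε value the training loop last used.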

Pitfall 6: RND Predictor Overfitting

Symptom: RND loss decreases but intrinsic rewards still large everywhere

Root Cause: Predictor overfits to training data, doesn't generalize to new states

# WRONG: High learning rate, no regularization
rnd_optimizer = Adam(rnd.predictor.parameters(), lr=0.001)
rnd_loss.backward()
rnd_optimizer.step()

# Predictor fits perfectly to seen states but doesn't generalize

# CORRECT: Lower learning rate plus weight decay (L2 regularization)
rnd_optimizer = Adam(rnd.predictor.parameters(), lr=0.0001,
                     weight_decay=1e-5)  # weight_decay value is an example, tune it

Diagnosis:

RND training loss is low (close to 0)
But intrinsic rewards still high for most states
Suggests predictor fitted to training states but not generalizing

Fix:

Reduce RND learning rate
Add weight decay (L2 regularization)
Use batch normalization in predictor
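
The diagnosis above amounts to comparing prediction error on held-out states from already-visited regions against error on novel states. The sketch below shows that check on a deliberately tiny model: `ToyLinearRND` is a hypothetical, NumPy-only stand-in with a frozen random linear target and a linear predictor, not the `RandomNetworkDistillation` class used in this skill, and the 4-dimensional state layout is invented for the example.

```python
import numpy as np

class ToyLinearRND:
    """Toy RND: frozen random linear target, trainable linear predictor."""

    def __init__(self, state_dim, feat_dim=16, lr=0.1, seed=0):
        rng = np.random.default_rng(seed)
        self.w_target = rng.normal(size=(state_dim, feat_dim))  # never updated
        self.w_pred = np.zeros((state_dim, feat_dim))           # trained
        self.lr = lr

    def intrinsic_reward(self, states):
        err = states @ self.w_pred - states @ self.w_target
        return np.mean(err ** 2, axis=1)  # per-state prediction error

    def update(self, states):
        err = states @ self.w_pred - states @ self.w_target
        # Gradient of 0.5 * sum(err**2), averaged over the batch
        self.w_pred -= self.lr * states.T @ err / len(states)

rng = np.random.default_rng(1)
rnd = ToyLinearRND(state_dim=4)

def visited_batch(n):
    # "Visited" states only vary in the first two coordinates
    s = np.zeros((n, 4))
    s[:, :2] = rng.normal(size=(n, 2))
    return s

for _ in range(300):
    rnd.update(visited_batch(64))

held_out_visited = visited_batch(256)     # same region, unseen samples
novel = np.zeros((256, 4))
novel[:, 2:] = rng.normal(size=(256, 2))  # region never visited

print("held-out visited:", rnd.intrinsic_reward(held_out_visited).mean())
print("novel:           ", rnd.intrinsic_reward(novel).mean())
```

A healthy predictor gives near-zero reward on the held-out visited states and a clearly larger reward on the novel ones. If both numbers are similar and large in a real setup, the predictor is not generalizing within the visited region, which is the situation this pitfall describes.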

Pitfall 7: Count-Based on Non-Tabular Problems

Symptom: Exploration ineffective, agent keeps revisiting similar states

Root Cause: State counting doesn't work for continuous/image spaces

# WRONG: Counting state IDs in image-based RL
state = env.render(mode='rgb_array')  # 84x84 image
state_id = hash(state.tobytes())  # Any single-pixel change gives a new ID
count_based_explorer.update_counts(state_id)

# Nearly every frame is "new" because of slight pixel differences,
# so every count stays at 1 and the novelty bonus never decays

# CORRECT: Use RND or curiosity instead
rnd = RandomNetworkDistillation(state_dim)
# RND handles high-dimensional states

Diagnosis:

Using count-based exploration with images/continuous observations
Exploration not working effectively

Fix:

Switch to RND or curiosity-driven methods
Count-based only for small discrete state spaces
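
The failure is easy to reproduce. The sketch below uses synthetic frames, which is an assumption made for illustration: one fixed scene plus ±1 pixel noise. The coarse quantization shown alongside is only there to make the cause visible. It is not part of this skill's recommendations, which remain RND or curiosity for image observations.

```python
from collections import Counter

import numpy as np

rng = np.random.default_rng(0)
base_frame = np.full((84, 84), 100, dtype=np.uint8)  # one underlying "state"

exact_counts = Counter()
coarse_counts = Counter()
for _ in range(100):
    # Same scene, with +/-1 sensor noise on every pixel
    noise = rng.integers(-1, 2, size=base_frame.shape)
    frame = (base_frame.astype(np.int16) + noise).astype(np.uint8)
    exact_counts[frame.tobytes()] += 1          # exact pixel identity
    coarse_counts[(frame // 32).tobytes()] += 1  # 8 intensity buckets

print(len(exact_counts), len(coarse_counts))  # prints "100 1"
```

With exact keys, 100 views of the same scene produce 100 distinct "states", each with count 1, so a count-based bonus never shrinks.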

Part 10: Red Flags and Pressure Tests

Red Flags Checklist

Pressure Test Scenario 1: "My Agent Stuck in Local Optimum"

User Query: "I trained my agent and it found a solution giving +10 reward. But I know the optimal solution is +100. Agent is stuck at +10. How do I fix it?"

What Skill Must Provide:

Diagnosis: Likely exploration problem (local optimum)
Solution: Increase exploration (slower decay, or intrinsic motivation)
Prevention: Design exploration strategy before training

Skill Response:

"Your agent found a local optimum (+10) and exploitation is too strong. This is a classic exploration-exploitation problem.

Quick Fixes:

Check epsilon decay: Is epsilon decaying too fast? Use slower decay schedule.
Check epsilon_end: Is epsilon_end = 0? Change to 0.01-0.05.
Add intrinsic motivation: Your environment might need curiosity-driven exploration.

Systematic Approach:

If rewards are dense (frequent signals): ε-greedy with slow decay
If rewards are sparse (rare signals): Add RND or curiosity
If state space huge: Definitely use RND

Example Fix:

# Instead of epsilon decaying to 0.01 in 100k steps
# Decay over 1M steps with epsilon_end=0.05
epsilon = epsilon_linear(step, total_steps=1_000_000,
                         epsilon_start=1.0, epsilon_end=0.05)
"

### Pressure Test Scenario 2: "Which Exploration for Montezuma's Revenge?"

**User Query:** "I'm training on Montezuma's Revenge (classic sparse reward benchmark). ε-greedy isn't working (agent never finds the first key). What exploration strategy should I use?"

**What Skill Must Provide:**
1. Recognition: This is THE benchmark for intrinsic motivation
2. Solution: RND is standard, ICM is alternative
3. Implementation: Code/guidance for using RND

**Skill Response:**

"Montezuma's Revenge is the canonical example where ε-greedy fails. The first reward (picking up the key) only arrives after a long, precise action sequence (500+ steps) with no reward signal along the way. Random exploration won't find it.

**Recommended:** RND (Random Network Distillation)

Why RND:
- Robust to stochasticity (doesn't try to predict environment)
- Simpler than ICM (no environment model needed)
- Proven to work on Montezuma's Revenge

**Implementation:**
```python
rnd = RandomNetworkDistillation(state_dim=...)
rnd_optimizer = Adam(rnd.predictor.parameters(), lr=1e-4)
lambda_intrinsic = 0.01  # Scale intrinsic reward

for step in training_steps:
    state = env.observation

    # Act first so that r_task exists before it is used
    action = agent.select_action(state)
    next_state, r_task = env.step(action)

    # Get intrinsic reward for the state the agent just reached
    r_intrinsic = rnd.compute_intrinsic_reward(next_state)

    # Combine with task reward
    r_total = r_task + lambda_intrinsic * r_intrinsic

    # Train agent with combined reward (agent update not shown)

    # Update RND predictor on the same state
    rnd_loss = rnd.predictor_loss(next_state)
    rnd_optimizer.zero_grad()
    rnd_loss.backward()
    rnd_optimizer.step()
```

Alternative: Curiosity-driven exploration (ICM) also works if the environment is deterministic.

What to tune:

lambda_intrinsic: Start 0.01, increase if agent doesn't explore enough
RND learning rate: Usually 1e-4 (much smaller than main agent)"

Pressure Test Scenario 3: "ε-Greedy Decay Schedule"

User Query: "I'm using ε-greedy but not sure about the decay schedule. I have 100k training steps. Should I decay epsilon linearly? Over how many steps? To what final value?"

What Skill Must Provide:

Formula: Linear, exponential, polynomial decay
Guidance: Typical values for different problem types
Tuning: How to adjust based on task

Skill Response:

"Quick Answer:

total_steps = 100_000
epsilon = epsilon_linear(step, total_steps,
                         epsilon_start=1.0, epsilon_end=0.05)

Reasoning:

`epsilon_start=1.0`: Explore fully initially (know nothing)
`epsilon_end=0.05`: Keep 5% exploration always (helps with stochasticity)
Linear decay: Simple, works well for most tasks

Decision Tree:

How dense are rewards?
- Sparse: Use slower decay (maybe 0.1 epsilon_end)
- Dense: Use faster decay (0.01 epsilon_end)
How many steps?
- 100k steps: Decay over 50k-100k (half to full training)
- 1M steps: Decay over 500k-1M
- Rule: Decay over 50-100% of training time
Which decay curve?
- Linear (ε_linear): Safe default
- Exponential (ε_exponential): Faster early decay
- Polynomial with power=2: Balanced

Example Tuning:

# Sparse rewards: More exploration needed
epsilon = epsilon_linear(step, total_steps=100_000,
                        epsilon_start=1.0, epsilon_end=0.1)

# Very sparse rewards: Even more
epsilon = epsilon_linear(step, total_steps=100_000,
                        epsilon_start=1.0, epsilon_end=0.2)

# Dense rewards: Faster decay to exploitation
epsilon = epsilon_linear(step, total_steps=50_000,
                        epsilon_start=1.0, epsilon_end=0.01)
"
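
The rule of decaying over 50-100% of training and then holding the final value can be written as a small wrapper. This is a sketch under assumptions: `epsilon_decay_then_hold` and its `decay_fraction` parameter are hypothetical names, not helpers defined by this skill.

```python
def epsilon_decay_then_hold(step, total_steps, decay_fraction=0.5,
                            epsilon_start=1.0, epsilon_end=0.05):
    """Linear decay over the first decay_fraction of training, then constant."""
    if not (0.0 < decay_fraction <= 1.0):
        raise ValueError("decay_fraction must be in (0, 1]")
    decay_steps = decay_fraction * total_steps
    progress = min(max(step / decay_steps, 0.0), 1.0)
    return epsilon_start + progress * (epsilon_end - epsilon_start)

# 100k-step run, decay finished at the halfway point
for step in (0, 25_000, 50_000, 75_000, 100_000):
    print(step, round(epsilon_decay_then_hold(step, 100_000), 4))
```

Setting `decay_fraction=1.0` recovers a plain linear decay over the full run, so the two ends of the 50-100% range can be compared by changing a single argument.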

### Pressure Test Scenario 4: "ICM vs RND vs ε-Greedy"

**User Query:** "I have sparse rewards and a deterministic environment. Should I use ε-greedy, ICM curiosity, or RND? How do I decide?"

**What Skill Must Provide:**
1. Comparison table: Pros/cons of each
2. Decision framework: Which for which situation
3. Implementation difficulty: Is it worth the complexity?

**Skill Response:**

"**Comparison:**

| Method | Complexity | Robustness | When to Use |
|--------|-----------|-----------|------------|
| ε-Greedy | Very Simple | Good on dense rewards | Dense rewards, simple env |
| ICM | Complex (2 trained models) | Poor on stochastic | Sparse, deterministic, learned dynamics features are useful |
| RND | Medium (1 trained model + frozen target) | Excellent | Sparse, any environment |

**Decision Framework:**

1. **Sparse rewards + Deterministic + Lots of learning:**
   → Try ICM first (forward model learns useful features)

2. **Sparse rewards + Any environment:**
   → Use RND (robust, simpler)

3. **Sparse rewards + Large pixel space:**
   → RND (better for high-dimensional observations)

4. **Dense rewards:**
   → Just use ε-greedy (intrinsic motivation not needed)

5. **Not sure, want safe choice:**
   → Use RND (works almost everywhere)

**Implementation Difficulty:**
- ε-Greedy: 5 lines of code
- ICM: 100+ lines (forward + inverse models, 2 losses)
- RND: 50 lines (target + predictor)

**My Recommendation:**
Start with ε-greedy. If it is not working after tuning decay schedules, switch to RND. Only use ICM if the environment is fully deterministic and you care about learning dynamics."

### Pressure Test Scenario 5: "Computational Cost of Curiosity"

**User Query:** "I want to add curiosity-driven exploration (ICM) to my agent. What's the computational overhead? Will it slow down training significantly?"

**What Skill Must Provide:**
1. Overhead estimate: Exact computation cost
2. Tradeoff: When overhead is justified
3. Alternatives: Lighter-weight options

**Skill Response:**

"**Computational Overhead Estimate:**

Base Agent:
- 1 policy/value network
- Training time: 1.0×

With ICM:
- Policy/value network + Forward model + Inverse model
- Forward model: Encoder + hidden layers + state-dim output
- Inverse model: Concatenate states + hidden + action-dim output
- Training time: ~2.5-3.0×

**Memory:**
- Base: X MB
- With ICM: ~3-4× X MB
- GPU memory often bottleneck

**When Overhead is Justified:**
- Sparse rewards (ICM critical to success)
- Large state space (intrinsic motivation helps)
- Willing to wait longer for better exploration

**When Overhead is NOT Justified:**
- Dense rewards (ε-greedy sufficient)
- Real-time training constraints
- Limited GPU memory

**Lighter Alternative:**
Use RND instead of ICM:
- ~1.5-2.0× overhead (vs 2.5-3.0× for ICM)
- Same exploration benefits
- Simpler to implement

**Scaling to Large Models:**
```python
# ICM with huge state encoders can be prohibitive
# Example: Vision transformer encoder → ICM
# That's very expensive

# RND scales better: predictor can be small
# Don't need sophisticated encoder
```

Bottom Line: ICM costs roughly 2.5-3× training time. If you can afford it and rewards are very sparse, it can be worth it. Otherwise try RND or even ε-greedy with slower decay first."
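
A rough way to sanity-check overhead before committing is to count parameters for the extra networks. The sketch below is only a proxy: all layer sizes are hypothetical, the architectures are simplified MLPs (here the ICM forward model predicts next-state features), and parameter counts do not translate directly into wall-clock time.

```python
def mlp_params(layer_sizes):
    """Weights plus biases of a fully connected network."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))

# Hypothetical dimensions, chosen only for illustration
state_dim, action_dim, feat_dim, hidden = 64, 4, 32, 128

agent = mlp_params([state_dim, hidden, hidden, action_dim])

# ICM: shared encoder + forward model + inverse model (all trained)
icm = (mlp_params([state_dim, hidden, feat_dim])
       + mlp_params([feat_dim + action_dim, hidden, feat_dim])
       + mlp_params([2 * feat_dim, hidden, action_dim]))

# RND: predictor is trained, target is frozen (forward pass only)
rnd_trained = mlp_params([state_dim, hidden, feat_dim])
rnd_frozen = mlp_params([state_dim, hidden, feat_dim])

print("agent params:       ", agent)
print("ICM extra (trained):", icm)
print("RND extra (trained):", rnd_trained, "+ frozen:", rnd_frozen)
```

Measuring steps per second with and without the exploration module on the target hardware remains the reliable way to get a real overhead figure.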

Part 11: Rationalization Resistance Table

Rationalization	Reality	Counter-Guidance	Red Flag
"ε-Greedy works everywhere"	Fails on sparse rewards, large spaces	Use ε-greedy for dense/small, intrinsic motivation for sparse/large	Applying ε-greedy to Montezuma's Revenge
"Higher epsilon is better"	High ε → too random, doesn't exploit	Use decay schedule (ε high early, low late)	Using constant ε=0.5 throughout training
"Decay epsilon to zero"	Agent needs residual exploration	Keep ε_end=0.01-0.1 always	Setting ε_final=0 (pure exploitation)
"Curiosity always helps"	Can break with stochasticity (model tries to predict noise)	Use RND for stochastic, ICM for deterministic	Agent learns to explore random noise instead of task
"RND is just ICM simplified"	RND is fundamentally different (frozen random vs learned model)	Understand frozen network prevents overfitting/noise	Not grasping why RND frozen network matters
"More intrinsic reward = faster exploration"	Too much intrinsic reward drowns out task signal	Balance with λ=0.01-0.1, tune on task performance	Agent explores forever, ignores task
"Count-based works anywhere"	Only works tabular (can't count unique images)	Use RND for continuous/high-dimensional spaces	Trying count-based on Atari images
"Boltzmann is always better than ε-greedy"	Boltzmann smoother but harder to tune	Use ε-greedy for simplicity (it works well)	Switching to Boltzmann without clear benefit
"Test with ε>0 for exploration"	Test should use learned policy, not explore	ε=0 or greedy policy at test time	Variable test performance from exploration
"Longer decay is always better"	Very slow decay wastes time in early training	Match decay to task difficulty (faster for easy, slower for hard)	Decaying over 10M steps when training only 1M
"Skip exploration, increase learning rate"	Learning rate is for optimization, exploration for coverage	Use both: exploration strategy + learning rate	Agent oscillates without exploration
"ICM is the SOTA exploration"	RND simpler and more robust	Use RND unless you need environment model	Implementing ICM when RND would suffice

Part 12: Summary and Decision Framework

Quick Decision Tree

START: Need exploration strategy?

├─ Are rewards sparse? (rare reward signal)
│  ├─ YES → Need intrinsic motivation
│  │  ├─ Environment stochastic?
│  │  │  ├─ YES → RND
│  │  │  └─ NO → ICM (or RND for simplicity)
│  │  └─ Choose RND for safety
│  │
│  └─ NO → Dense rewards
│     └─ Use ε-greedy + decay schedule

├─ Is state space large? (images, continuous)
│  ├─ YES → Intrinsic motivation (RND/curiosity)
│  └─ NO → ε-greedy usually sufficient

└─ Choosing decay schedule:
   ├─ Sparse rewards → slower decay (ε_end=0.05-0.1)
   ├─ Dense rewards → faster decay (ε_end=0.01)
   └─ Default: Linear decay over 50% of training
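
The tree above can also be encoded as a small function, which is handy for documenting the choice in experiment configs. This is a sketch: `choose_exploration` and its return format are hypothetical, and it simply mirrors the branches listed above without adding any new rule.

```python
def choose_exploration(sparse_rewards, stochastic_env=False,
                       large_state_space=False):
    """Map the decision tree's questions to a method and an ε_end range."""
    if sparse_rewards:
        # Sparse rewards need intrinsic motivation; RND is the safe choice
        method = "RND" if stochastic_env else "ICM or RND"
        eps_end_range = (0.05, 0.1)
    elif large_state_space:
        method = "RND or curiosity"
        eps_end_range = (0.01, 0.01)
    else:
        method = "epsilon-greedy with decay"
        eps_end_range = (0.01, 0.01)
    return {"method": method, "epsilon_end_range": eps_end_range}

print(choose_exploration(sparse_rewards=True, stochastic_env=True))
print(choose_exploration(sparse_rewards=False))
```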

Implementation Checklist

Define reward structure (dense vs sparse)
Estimate state space size (discrete vs continuous)
Choose exploration method (ε-greedy, curiosity, RND, UCB, count-based)
Set epsilon/temperature parameters (start, end)
Choose decay schedule (linear, exponential, polynomial)
If using intrinsic motivation: set λ (usually 0.01)
Use greedy policy at test time (ε=0)
Monitor exploration vs exploitation (plot epsilon decay)
Tune hyperparameters (decay schedule, λ) based on task performance

Typical Configurations

Dense Rewards, Small Action Space (e.g., simple game)

epsilon = epsilon_linear(step, total_steps=100_000,
                        epsilon_start=1.0, epsilon_end=0.01)
# Fast exploitation, low exploration needed

Sparse Rewards, Discrete Actions (e.g., Atari)

rnd = RandomNetworkDistillation(...)
epsilon = epsilon_linear(step, total_steps=1_000_000,
                        epsilon_start=1.0, epsilon_end=0.05)
r_total = r_task + 0.01 * r_intrinsic
# Intrinsic motivation + slow decay

Continuous Control, Sparse (e.g., Robotics)

rnd = RandomNetworkDistillation(...)
exploration_std = exploration_std_linear(..., std_end=0.01)
action = policy(state) + gaussian_noise(std=exploration_std)
r_total = r_task + 0.01 * r_intrinsic
# Gaussian noise + RND

Key Takeaways

Exploration is fundamental: Don't ignore it. Design exploration strategy before training.
Match method to problem:
- Dense rewards → ε-greedy
- Sparse rewards → Intrinsic motivation (RND preferred)
- Large state space → Intrinsic motivation
Decay exploration over time: Explore early, exploit late.
Avoid common pitfalls:
- Don't decay to zero (ε_end > 0)
- Don't use ε-greedy on continuous actions
- Don't forget decay schedule
- Don't use exploration at test time
Balance intrinsic and extrinsic: If using intrinsic rewards, don't let them dominate.
RND is the safe choice: Works for most exploration problems, simpler than ICM.
Test exploration hypothesis: Plot epsilon or intrinsic rewards, verify exploration strategy is active.

This skill is about systematic exploration design, not just tuning one hyperparameter.