AutoSkill · Adaptive PPO Exploration via Reward History

Implements a dynamic exploration mechanism for a PPO agent that adjusts action variance based on reward trends. It compares recent rewards to historical averages to determine if exploration should be increased.

Install

Source · Clone the upstream repo:

git clone https://github.com/ECNU-ICALK/AutoSkill

Claude Code · Install into ~/.claude/skills/:

T=$(mktemp -d) && git clone --depth=1 https://github.com/ECNU-ICALK/AutoSkill "$T" && mkdir -p ~/.claude/skills && cp -r "$T/SkillBank/ConvSkill/english_gpt4_8/adaptive-ppo-exploration-via-reward-history" ~/.claude/skills/ecnu-icalk-autoskill-adaptive-ppo-exploration-via-reward-history && rm -rf "$T"

Manifest: SkillBank/ConvSkill/english_gpt4_8/adaptive-ppo-exploration-via-reward-history/SKILL.md

Source content

Adaptive PPO Exploration via Reward History

Implements a dynamic exploration mechanism for a PPO agent that adjusts action variance based on reward trends. It compares recent rewards to historical averages to determine if exploration should be increased.

Prompt

Role & Objective

You are a Reinforcement Learning expert implementing a PPOAgent with adaptive exploration. Your goal is to adjust the action sampling variance dynamically based on the agent's reward history to encourage exploration when performance plateaus.

Operational Rules & Constraints

  1. Reward History Management (see the first sketch after this list):

    • Initialize self.rewards_history = [] and self.dynamic_factor_base = 0.05.
    • Implement update_rewards_history(self, reward):
      • Append the reward to self.rewards_history.
      • Keep only the most recent 100 rewards: if len(self.rewards_history) > 100: self.rewards_history = self.rewards_history[-100:].
  2. Dynamic Factor Calculation (second sketch below):

    • Implement a method (e.g., calculate_dynamic_factor) to determine the exploration multiplier:
      • If len(self.rewards_history) < 100, return self.dynamic_factor_base.
      • Calculate recent_avg as the mean of the last 10 rewards (self.rewards_history[-10:]).
      • Calculate earlier_avg as the mean of the previous 90 rewards (self.rewards_history[-100:-10]).
      • If recent_avg <= earlier_avg * 1.1, return self.dynamic_factor_base * 2 (increase exploration).
      • Otherwise, return self.dynamic_factor_base.
  3. Action Selection with Adaptive Variance (third sketch below):

    • In select_action(self, state, performance_metrics):
      • Retrieve dynamic_factor using the calculation method.
      • Calculate bounds_range = self.actor.bounds_high - self.actor.bounds_low.
      • Compute epsilon = (1e-4 + bounds_range * dynamic_factor).clamp(min=0.01).
      • Use this epsilon to adjust the variances for the Multivariate Normal distribution (e.g., variances = action_probs.var(dim=0, keepdim=True).expand(action_probs.shape[0]) + epsilon).
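
Rules 1 and 2 need only plain Python. Below is a minimal sketch of the rule 1 bookkeeping; the class name PPOAgentSketch and the bare __init__ are illustrative placeholders, not names from the upstream skill.

    class PPOAgentSketch:
        def __init__(self):
            # Rule 1: sliding reward window plus the base exploration factor.
            self.rewards_history = []
            self.dynamic_factor_base = 0.05

        def update_rewards_history(self, reward):
            # Append the newest scalar reward, then keep only the most
            # recent 100 entries.
            self.rewards_history.append(reward)
            if len(self.rewards_history) > 100:
                self.rewards_history = self.rewards_history[-100:]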
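
The plateau test from rule 2, continuing the same sketch. Exploration is doubled whenever the recent average fails to exceed the earlier average by more than 10%:

        def calculate_dynamic_factor(self):
            # With fewer than 100 rewards there is no trend to compare yet.
            if len(self.rewards_history) < 100:
                return self.dynamic_factor_base
            recent_avg = sum(self.rewards_history[-10:]) / 10
            earlier_avg = sum(self.rewards_history[-100:-10]) / 90
            # Plateau or decline: recent rewards are not more than 10% above
            # the earlier average, so double the exploration multiplier.
            if recent_avg <= earlier_avg * 1.1:
                return self.dynamic_factor_base * 2
            return self.dynamic_factor_base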
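
Rule 3 in PyTorch, again continuing the sketch. It assumes an actor that exposes bounds_high/bounds_low tensors and returns a batch of action means; expand_as stands in for the rule's expand(action_probs.shape[0]) so the line also works for multi-dimensional actions, and performance_metrics is accepted but unused, per the anti-patterns below.

        def select_action(self, state, performance_metrics):
            # Local imports keep this continuation block self-contained.
            import torch
            from torch.distributions import MultivariateNormal

            dynamic_factor = self.calculate_dynamic_factor()

            # Scale the exploration floor to the width of the action bounds.
            bounds_range = self.actor.bounds_high - self.actor.bounds_low
            epsilon = (1e-4 + bounds_range * dynamic_factor).clamp(min=0.01)

            action_probs = self.actor(state)  # assumed shape: [batch, action_dim]
            variances = action_probs.var(dim=0, keepdim=True).expand_as(action_probs) + epsilon

            # Widened diagonal covariance -> more exploration when rewards plateau.
            dist = MultivariateNormal(action_probs, torch.diag_embed(variances))
            action = dist.sample()
            return action, dist.log_prob(action)

Note that epsilon enters as an additive floor on the per-dimension variance, so even a fully collapsed policy keeps at least 0.01 of sampling noise.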

Anti-Patterns

  • Do not use static epsilon values for exploration.
  • Do not rely on complex multi-dimensional performance metrics for this specific adaptive logic; use the scalar reward history.

Triggers

  • adaptive exploration PPO
  • dynamic variance based on rewards
  • PPO reward history exploration
  • adjust exploration based on reward trends