AutoSkill: Adaptive PPO Exploration via Reward History
Implements a dynamic exploration mechanism for a PPO agent that adjusts action variance based on reward trends. It compares recent rewards to historical averages to determine if exploration should be increased.
git clone https://github.com/ECNU-ICALK/AutoSkill
T=$(mktemp -d) && git clone --depth=1 https://github.com/ECNU-ICALK/AutoSkill "$T" && mkdir -p ~/.claude/skills && cp -r "$T/SkillBank/ConvSkill/english_gpt4_8/adaptive-ppo-exploration-via-reward-history" ~/.claude/skills/ecnu-icalk-autoskill-adaptive-ppo-exploration-via-reward-history && rm -rf "$T"
SkillBank/ConvSkill/english_gpt4_8/adaptive-ppo-exploration-via-reward-history/SKILL.md

Adaptive PPO Exploration via Reward History
Prompt
Role & Objective
You are a Reinforcement Learning expert implementing a PPOAgent with adaptive exploration. Your goal is to adjust the action sampling variance dynamically based on the agent's reward history to encourage exploration when performance plateaus.
Operational Rules & Constraints
- Reward History Management (see the first sketch after this list):
  - Initialize `self.rewards_history = []` and `self.dynamic_factor_base = 0.05`.
  - Implement `update_rewards_history(self, reward)`:
    - Append the reward to `self.rewards_history`.
    - Keep only the most recent 100 rewards: `if len(self.rewards_history) > 100: self.rewards_history = self.rewards_history[-100:]`.
- Dynamic Factor Calculation (see the second sketch below):
  - Implement a method (e.g., `calculate_dynamic_factor`) to determine the exploration multiplier:
    - If `len(self.rewards_history) < 100`, return `self.dynamic_factor_base`.
    - Calculate `recent_avg` as the mean of the last 10 rewards (`self.rewards_history[-10:]`).
    - Calculate `earlier_avg` as the mean of the previous 90 rewards (`self.rewards_history[-100:-10]`).
    - If `recent_avg <= earlier_avg * 1.1`, return `self.dynamic_factor_base * 2` (increase exploration).
    - Otherwise, return `self.dynamic_factor_base`.
- Action Selection with Adaptive Variance (see the third sketch below):
  - In `select_action(self, state, performance_metrics)`:
    - Retrieve `dynamic_factor` using the calculation method.
    - Calculate `bounds_range = self.actor.bounds_high - self.actor.bounds_low`.
    - Compute `epsilon = (1e-4 + bounds_range * dynamic_factor).clamp(min=0.01)`.
    - Use this `epsilon` to adjust the variances for the Multivariate Normal distribution (e.g., `variances = action_probs.var(dim=0, keepdim=True).expand(action_probs.shape[0]) + epsilon`).
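A minimal sketch of the reward-history bookkeeping described in the first rule, assuming the history lives in a plain Python list on the agent (the class skeleton is illustrative, not the repository's code):

```python
class PPOAgent:
    def __init__(self):
        self.rewards_history = []        # rolling window of the last 100 scalar rewards
        self.dynamic_factor_base = 0.05  # baseline exploration multiplier

    def update_rewards_history(self, reward):
        # Append the latest scalar reward and truncate to the most recent 100.
        self.rewards_history.append(reward)
        if len(self.rewards_history) > 100:
            self.rewards_history = self.rewards_history[-100:]
```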
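Building on that, a sketch of the plateau check from the second rule; NumPy is used here for the means, which is an implementation choice rather than a requirement:

```python
import numpy as np

def calculate_dynamic_factor(self):
    # Method on the PPOAgent sketch above.
    # With fewer than 100 rewards there is no stable baseline yet.
    if len(self.rewards_history) < 100:
        return self.dynamic_factor_base
    recent_avg = np.mean(self.rewards_history[-10:])       # last 10 rewards
    earlier_avg = np.mean(self.rewards_history[-100:-10])  # previous 90 rewards
    # Recent rewards have not clearly outpaced the earlier baseline:
    # treat performance as plateaued and double the exploration factor.
    if recent_avg <= earlier_avg * 1.1:
        return self.dynamic_factor_base * 2
    return self.dynamic_factor_base
```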
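Finally, a PyTorch sketch of the adaptive action selection. It assumes `self.actor` returns a mean action vector of shape `(act_dim,)` and exposes `bounds_high`/`bounds_low` tensors of the same shape; those assumptions come from the rules above, not from a fixed API:

```python
import torch
from torch.distributions import MultivariateNormal

def select_action(self, state, performance_metrics):
    # performance_metrics is accepted but unused: per the anti-patterns,
    # the adaptive logic relies only on the scalar reward history.
    dynamic_factor = self.calculate_dynamic_factor()

    action_probs = self.actor(state)  # assumed mean action vector, shape (act_dim,)

    # Exploration noise scaled to the action bounds, floored at 0.01.
    bounds_range = self.actor.bounds_high - self.actor.bounds_low
    epsilon = (1e-4 + bounds_range * dynamic_factor).clamp(min=0.01)

    # Empirical spread of the mean vector, broadcast per dimension, plus epsilon.
    variances = action_probs.var(dim=0, keepdim=True).expand(action_probs.shape[0]) + epsilon

    dist = MultivariateNormal(action_probs, covariance_matrix=torch.diag(variances))
    action = dist.sample()
    return action, dist.log_prob(action)
```

The `clamp(min=0.01)` floor keeps every diagonal variance strictly positive, so the covariance matrix stays valid no matter how small the dynamic factor gets.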
Anti-Patterns
- Do not use static epsilon values for exploration.
- Do not rely on complex multi-dimensional performance metrics for this specific adaptive logic; use the scalar reward history.
Triggers
- adaptive exploration PPO
- dynamic variance based on rewards
- PPO reward history exploration
- adjust exploration based on reward trends