AutoSkill Adaptive PPO Exploration via Reward History

Implements a dynamic exploration mechanism for PPO agents by tracking a sliding window of rewards and adjusting action variance based on performance trends.

install

source · Clone the upstream repo

git clone https://github.com/ECNU-ICALK/AutoSkill

Claude Code · Install into ~/.claude/skills/

T=$(mktemp -d) && git clone --depth=1 https://github.com/ECNU-ICALK/AutoSkill "$T" && mkdir -p ~/.claude/skills && cp -r "$T/SkillBank/ConvSkill/english_gpt4_8_GLM4.7/adaptive-ppo-exploration-via-reward-history" ~/.claude/skills/ecnu-icalk-autoskill-adaptive-ppo-exploration-via-reward-history-d4c9e0 && rm -rf "$T"

manifest: SkillBank/ConvSkill/english_gpt4_8_GLM4.7/adaptive-ppo-exploration-via-reward-history/SKILL.md

source content

Adaptive PPO Exploration via Reward History

Prompt

Role & Objective

You are a Reinforcement Learning Engineer specializing in Proximal Policy Optimization (PPO). Your task is to implement an adaptive exploration strategy that adjusts the agent's action variance based on the history of received rewards.

Communication & Style Preferences

Use clear, executable Python code.
Maintain the specific variable names and logic structures provided by the user.
Ensure the explanation focuses on the integration of the reward history mechanism into the PPO training loop.

Operational Rules & Constraints

Reward History Management:
- Initialize `self.rewards_history = []` and `self.dynamic_factor_base` (e.g., 0.05) in the agent's `__init__`.
- Implement `update_rewards_history(self, reward)` to append the new reward and truncate the list to the most recent 100 entries (`self.rewards_history = self.rewards_history[-100:]`).
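The two rules above can be sketched as a small class. This is a minimal illustration, not the skill's reference implementation: the class name `RewardHistoryAgent` is hypothetical, and converting the reward with `float()` is an added assumption so that tensors or NumPy scalars are not retained in the list.

```python
class RewardHistoryAgent:
    """Hypothetical holder for the reward-history state (name is illustrative)."""

    def __init__(self, dynamic_factor_base=0.05):
        self.rewards_history = []
        self.dynamic_factor_base = dynamic_factor_base

    def update_rewards_history(self, reward):
        # Assumption: store a plain float rather than a tensor or NumPy scalar.
        self.rewards_history.append(float(reward))
        # Keep only the most recent 100 entries.
        self.rewards_history = self.rewards_history[-100:]


agent = RewardHistoryAgent()
for step in range(150):
    agent.update_rewards_history(step)

print(len(agent.rewards_history))  # 100
print(agent.rewards_history[0])    # 50.0
```

After 150 updates only rewards 50 through 149 remain, so the window always reflects the latest 100 steps.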
Dynamic Factor Calculation:
- Implement a method (e.g., `get_dynamic_factor`) that calculates a scalar to adjust exploration.
- Logic: If `len(self.rewards_history) >= 100`:
  - Calculate `recent_avg` as the mean of the last 10 rewards.
  - Calculate `earlier_avg` as the mean of the previous 90 rewards (indices -100 to -10).
  - If `recent_avg <= earlier_avg * 1.1`, return `self.dynamic_factor_base * 2` (increase exploration).
  - Else, return `self.dynamic_factor_base` (maintain base exploration).
- If history is insufficient, return `self.dynamic_factor_base`.
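The dynamic factor logic above could look roughly like the following. For the sake of a self-contained example it is written as a free function that takes the history as an argument, whereas the skill describes it as an agent method; the thresholds and window sizes are the ones specified in the rules.

```python
from statistics import mean


def get_dynamic_factor(rewards_history, dynamic_factor_base=0.05):
    """Return the exploration factor for the given reward window."""
    if len(rewards_history) >= 100:
        recent_avg = mean(rewards_history[-10:])       # last 10 rewards
        earlier_avg = mean(rewards_history[-100:-10])  # the 90 rewards before them
        if recent_avg <= earlier_avg * 1.1:
            # Recent rewards are not more than 10% above the earlier average.
            return dynamic_factor_base * 2
        return dynamic_factor_base
    # Fewer than 100 rewards recorded so far: keep the base factor.
    return dynamic_factor_base


stagnating = [1.0] * 100
improving = [1.0] * 90 + [2.0] * 10
too_short = [1.0] * 20

print(get_dynamic_factor(stagnating))  # 0.1
print(get_dynamic_factor(improving))   # 0.05
print(get_dynamic_factor(too_short))   # 0.05
```

One property worth keeping in mind: when `earlier_avg` is negative, multiplying by 1.1 lowers the threshold instead of raising it, so the comparison behaves differently for environments with negative rewards.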

Action Selection Integration:
- In `select_action`, call the dynamic factor method.
- Calculate `bounds_range = self.actor.bounds_high - self.actor.bounds_low`.
- Compute `epsilon = (1e-4 + bounds_range * dynamic_factor).clamp(min=0.01)`.
- Use this `epsilon` to adjust the variance of the action distribution (e.g., `variances = action_probs.var(...) + epsilon`).
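To make the arithmetic concrete, here is a plain-Python mirror of the `epsilon` expression, applied per action dimension. The skill itself uses a tensor expression with `.clamp(min=0.01)`; the helper name `compute_epsilon` and the bounds below are hypothetical values chosen only for illustration.

```python
def compute_epsilon(bounds_low, bounds_high, dynamic_factor):
    """Per-dimension epsilon, floored at 0.01 to mimic `.clamp(min=0.01)`."""
    epsilon = []
    for low, high in zip(bounds_low, bounds_high):
        bounds_range = high - low
        epsilon.append(max(1e-4 + bounds_range * dynamic_factor, 0.01))
    return epsilon


# Hypothetical action bounds: one wide dimension and one narrow one.
bounds_low = [0.0, 0.0]
bounds_high = [2.0, 0.1]

base = compute_epsilon(bounds_low, bounds_high, dynamic_factor=0.05)
boosted = compute_epsilon(bounds_low, bounds_high, dynamic_factor=0.10)

print([round(e, 4) for e in base])     # [0.1001, 0.01]
print([round(e, 4) for e in boosted])  # [0.2001, 0.0101]
```

In this sketch the narrow dimension sits on the 0.01 floor at the base factor and only rises above it once the factor doubles, while the wide dimension scales directly with the factor.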

Training Loop Integration:
- In the training loop, immediately after `next_state, reward, done, _ = env.step(action)`, call `agent.update_rewards_history(reward)`.
- Do not call `update_rewards_history` inside `select_action`, as the reward is not available until after the environment step.
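The call order described above can be sketched with stand-in objects. `StubEnv` and `StubAgent` are hypothetical placeholders with a toy reward; they only exist to show where `update_rewards_history` sits relative to `env.step`, and they assume the 4-tuple `step` return used in the rule.

```python
import random


class StubEnv:
    """Hypothetical environment with a 4-tuple `step` return."""

    def __init__(self, episode_length=5):
        self.episode_length = episode_length
        self.t = 0

    def reset(self):
        self.t = 0
        return 0.0

    def step(self, action):
        self.t += 1
        reward = -abs(action)  # toy reward: actions closer to 0 score higher
        done = self.t >= self.episode_length
        return float(self.t), reward, done, {}


class StubAgent:
    """Hypothetical agent exposing only the two methods the loop needs."""

    def __init__(self):
        self.rewards_history = []
        self.dynamic_factor_base = 0.05

    def select_action(self, state):
        # Must not update rewards_history: no reward exists yet at this point.
        return random.uniform(-1.0, 1.0)

    def update_rewards_history(self, reward):
        self.rewards_history.append(reward)
        self.rewards_history = self.rewards_history[-100:]


env, agent = StubEnv(), StubAgent()
for episode in range(3):
    state, done = env.reset(), False
    while not done:
        action = agent.select_action(state)
        next_state, reward, done, _ = env.step(action)
        agent.update_rewards_history(reward)  # immediately after env.step
        state = next_state

print(len(agent.rewards_history))  # 15
```

Three episodes of five steps each yield fifteen recorded rewards, one per environment step.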

Anti-Patterns

Do not use performance metrics (like PowerDissipation) for the dynamic factor if the user explicitly requested using the scalar reward value.
Do not invent arbitrary thresholds or window sizes other than those specified (100 history size, 10 recent, 90 earlier, 1.1 multiplier).
Do not place the reward update call before `env.step()`.

Interaction Workflow

Initialize the agent with the history list and base factor.
During training, select action, step environment, get reward, and update history.
During action selection, retrieve the dynamic factor based on the updated history and adjust variance accordingly.

Triggers

adaptive exploration PPO
dynamic factor reward history
PPO variance adjustment
reward based exploration
adjust exploration based on rewards