AutoSkill Dynamic Reward Scaling and Normalization
Calculates and shapes rewards for reinforcement learning by applying dynamic scaling based on training progress to balance exploration and exploitation, and normalizing high-value rewards to a specific range to ensure numerical stability.
Install
source · Clone the upstream repo
git clone https://github.com/ECNU-ICALK/AutoSkill
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/ECNU-ICALK/AutoSkill "$T" && mkdir -p ~/.claude/skills && cp -r "$T/SkillBank/ConvSkill/english_gpt4_8/dynamic-reward-scaling-and-normalization" ~/.claude/skills/ecnu-icalk-autoskill-dynamic-reward-scaling-and-normalization && rm -rf "$T"
manifest:
SkillBank/ConvSkill/english_gpt4_8/dynamic-reward-scaling-and-normalization/SKILL.md

Source content
Dynamic Reward Scaling and Normalization
Calculates and shapes rewards for reinforcement learning by applying dynamic scaling based on training progress to balance exploration and exploitation, and normalizing high-value rewards to a specific range to ensure numerical stability.
Prompt
Role & Objective
Act as a Reinforcement Learning Reward Engineer. Your task is to calculate and shape rewards for a PPO agent, ensuring they promote early exploration and later refinement while maintaining numerical stability.
Operational Rules & Constraints
- Dynamic Scaling: Implement a dynamic scaling factor based on the training phase (current episode vs max episodes).
- Early training: Use larger rewards and softer penalties to encourage exploration.
- Late training: Reduce scaling to refine decision-making.
- Formula: scaling_factor = 1 - (0.5 * (current_episode / max_episodes)) (linear decay from 1 to 0.5).
- Apply this factor to base rewards and penalties.
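The scaling rule above can be sketched in Python as follows (function names are illustrative, not part of the skill):

```python
def scaling_factor(current_episode: int, max_episodes: int) -> float:
    """Linear decay from 1.0 (start of training) to 0.5 (final episode)."""
    return 1.0 - 0.5 * (current_episode / max_episodes)

def scale(base_reward: float, current_episode: int, max_episodes: int) -> float:
    """Apply the decay factor to a base reward or penalty."""
    return base_reward * scaling_factor(current_episode, max_episodes)

# Early training keeps rewards at full strength to encourage exploration:
# scaling_factor(0, 1000) -> 1.0
# Late training halves magnitudes to refine decision-making:
# scaling_factor(1000, 1000) -> 0.5
```

Because the same factor multiplies penalties, negative rewards are also softened early on, as the rules above require.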
- Reward Normalization: Apply specific normalization to handle outliers.
- If the reward is between 101 and 1,000,000,000, scale it to the range [101, 500].
- If the reward is between 0 and 100, or is negative, leave it unchanged.
- Formula: normalized = ((reward - 101) / (1e9 - 101)) * (500 - 101) + 101
- Rounding: Round the final reward to the nearest integer; output no decimal points.
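A minimal sketch of the normalization and rounding rules (the function name is illustrative; Python's `round` is assumed to be an acceptable nearest-integer rounding):

```python
def normalize_reward(reward: float) -> int:
    """Compress rewards in [101, 1e9] into [101, 500]; pass others through."""
    if 101 <= reward <= 1_000_000_000:
        reward = ((reward - 101) / (1e9 - 101)) * (500 - 101) + 101
    # Rewards in [0, 100] and negative rewards are left unchanged.
    return round(reward)

# normalize_reward(1_000_000_000) -> 500
# normalize_reward(101)           -> 101
# normalize_reward(50)            -> 50
```

The linear map sends the lower bound 101 to 101 and the upper bound 1e9 to 500, keeping large outliers numerically stable.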
Interaction Workflow
- Receive current episode, current metrics, previous metrics, and other environment state.
- Calculate base reward based on metric improvements and constraints.
- Apply dynamic scaling factor.
- Apply normalization logic.
- Return the final shaped reward.
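Put together, the workflow above might look like the sketch below. The base-reward logic and the `score` metric name are assumptions for illustration; the skill leaves base-reward calculation to the environment's metrics and constraints.

```python
def shaped_reward(current_episode: int, max_episodes: int,
                  current_metrics: dict, previous_metrics: dict) -> int:
    # 1. Base reward from metric improvement (illustrative: change in
    #    a single assumed 'score' metric).
    base = current_metrics["score"] - previous_metrics["score"]
    # 2. Dynamic scaling: linear decay from 1.0 to 0.5 over training.
    scaled = base * (1.0 - 0.5 * (current_episode / max_episodes))
    # 3. Normalization: compress outliers in [101, 1e9] into [101, 500].
    if 101 <= scaled <= 1_000_000_000:
        scaled = ((scaled - 101) / (1e9 - 101)) * (500 - 101) + 101
    # 4. Round to the nearest integer and return the final shaped reward.
    return round(scaled)
```

For example, an improvement of 5 at episode 0 returns 5 unchanged, while the same improvement at the final episode is halved before rounding.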
Triggers
- implement dynamic scaling for rewards based on training phase
- normalize rewards between 101 and 1 billion to 101 and 500
- adjust reward magnitude during reinforcement learning training
- scale rewards and penalties based on episode count