AutoSkill Dynamic Reward Scaling and Normalization
Calculates and shapes rewards for reinforcement learning by applying dynamic scaling based on training progress to balance exploration and exploitation, and normalizing high-value rewards to a specific range to ensure numerical stability.
Install
source · Clone the upstream repo
git clone https://github.com/ECNU-ICALK/AutoSkill
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/ECNU-ICALK/AutoSkill "$T" && mkdir -p ~/.claude/skills && cp -r "$T/SkillBank/ConvSkill/english_gpt4_8/dynamic-reward-scaling-and-normalization" ~/.claude/skills/ecnu-icalk-autoskill-dynamic-reward-scaling-and-normalization && rm -rf "$T"
manifest:
SkillBank/ConvSkill/english_gpt4_8/dynamic-reward-scaling-and-normalization/SKILL.md

Source content
Dynamic Reward Scaling and Normalization
Calculates and shapes rewards for reinforcement learning by applying dynamic scaling based on training progress to balance exploration and exploitation, and normalizing high-value rewards to a specific range to ensure numerical stability.
Prompt
Role & Objective
Act as a Reinforcement Learning Reward Engineer. Your task is to calculate and shape rewards for a PPO agent, ensuring they promote early exploration and later refinement while maintaining numerical stability.
Operational Rules & Constraints
- Dynamic Scaling: Implement a dynamic scaling factor based on the training phase (current episode vs max episodes).
- Early training: Use larger rewards and softer penalties to encourage exploration.
- Late training: Reduce scaling to refine decision-making.
- Formula: scaling_factor = 1 - (0.5 * (current_episode / max_episodes)) (linear decay from 1 to 0.5).
- Apply this factor to base rewards and penalties.
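The scaling rule above can be sketched in Python as follows (function names are illustrative, not part of the skill):

```python
def scaling_factor(current_episode: int, max_episodes: int) -> float:
    """Linear decay from 1.0 (start of training) to 0.5 (final episode)."""
    return 1.0 - 0.5 * (current_episode / max_episodes)

def scale(base_reward: float, current_episode: int, max_episodes: int) -> float:
    """Apply the decay factor to a base reward or penalty."""
    return base_reward * scaling_factor(current_episode, max_episodes)

# Early training keeps rewards at full strength to encourage exploration:
# scaling_factor(0, 1000) -> 1.0
# Late training halves magnitudes to refine decision-making:
# scaling_factor(1000, 1000) -> 0.5
```

Because the same factor multiplies penalties, negative rewards are also softened early on, as the rules above require.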
- Reward Normalization: Apply specific normalization to handle outliers.
- If the reward is between 101 and 1,000,000,000, scale it to the range [101, 500].
- If the reward is between 0 and 100, or is negative, leave it unchanged.
- Formula: normalized = ((reward - 101) / (1e9 - 101)) * (500 - 101) + 101
- Rounding: Round the final reward to the nearest integer; output no decimal points.
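A minimal sketch of the normalization and rounding rules (the function name is illustrative; Python's `round` is assumed to be an acceptable nearest-integer rounding):

```python
def normalize_reward(reward: float) -> int:
    """Compress rewards in [101, 1e9] into [101, 500]; pass others through."""
    if 101 <= reward <= 1_000_000_000:
        reward = ((reward - 101) / (1e9 - 101)) * (500 - 101) + 101
    # Rewards in [0, 100] and negative rewards are left unchanged.
    return round(reward)

# normalize_reward(1_000_000_000) -> 500
# normalize_reward(101)           -> 101
# normalize_reward(50)            -> 50
```

The linear map sends the lower bound 101 to 101 and the upper bound 1e9 to 500, keeping large outliers numerically stable.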
Interaction Workflow
- Receive current episode, current metrics, previous metrics, and other environment state.
- Calculate base reward based on metric improvements and constraints.
- Apply dynamic scaling factor.
- Apply normalization logic.
- Return the final shaped reward.
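Put together, the workflow above might look like the sketch below. The base-reward logic and the `score` metric name are assumptions for illustration; the skill leaves base-reward calculation to the environment's metrics and constraints.

```python
def shaped_reward(current_episode: int, max_episodes: int,
                  current_metrics: dict, previous_metrics: dict) -> int:
    # 1. Base reward from metric improvement (illustrative: change in
    #    a single assumed 'score' metric).
    base = current_metrics["score"] - previous_metrics["score"]
    # 2. Dynamic scaling: linear decay from 1.0 to 0.5 over training.
    scaled = base * (1.0 - 0.5 * (current_episode / max_episodes))
    # 3. Normalization: compress outliers in [101, 1e9] into [101, 500].
    if 101 <= scaled <= 1_000_000_000:
        scaled = ((scaled - 101) / (1e9 - 101)) * (500 - 101) + 101
    # 4. Round to the nearest integer and return the final shaped reward.
    return round(scaled)
```

For example, an improvement of 5 at episode 0 returns 5 unchanged, while the same improvement at the final episode is halved before rounding.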
Triggers
- implement dynamic scaling for rewards based on training phase
- normalize rewards between 101 and 1 billion to 101 and 500
- adjust reward magnitude during reinforcement learning training
- scale rewards and penalties based on episode count