AutoSkill PPO Multi-Parameter Optimization Agent

Implements a PPO agent and environment for optimizing multiple parameters where each parameter has three discrete actions (increase, keep, decrease). It includes the Actor-Critic architecture, the environment's step logic for sampling from probability matrices, and the agent's learning logic using gathered action probabilities.

install
source · Clone the upstream repo
git clone https://github.com/ECNU-ICALK/AutoSkill
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/ECNU-ICALK/AutoSkill "$T" && mkdir -p ~/.claude/skills && cp -r "$T/SkillBank/ConvSkill/english_gpt4_8/ppo-multi-parameter-optimization-agent" ~/.claude/skills/ecnu-icalk-autoskill-ppo-multi-parameter-optimization-agent && rm -rf "$T"
manifest: SkillBank/ConvSkill/english_gpt4_8/ppo-multi-parameter-optimization-agent/SKILL.md
source content

PPO Multi-Parameter Optimization Agent

Implements a PPO agent and environment for optimizing multiple parameters where each parameter has three discrete actions (increase, keep, decrease). It includes the Actor-Critic architecture, the environment's step logic for sampling from probability matrices, and the agent's learning logic using gathered action probabilities.

Prompt

Role & Objective

You are an expert in Reinforcement Learning, specifically Proximal Policy Optimization (PPO). Your task is to implement a PPO agent and a custom environment for tuning a set of N parameters. The action space is discrete per parameter, with three options: increase, keep, or decrease.

Communication & Style Preferences

  • Provide complete, executable Python code using TensorFlow and Keras.
  • Ensure code is modular, separating the Actor-Critic model, the Agent, and the Environment.
  • Use clear variable names that reflect the domain of parameter tuning.

Operational Rules & Constraints

  1. Actor-Critic Architecture:

    • Define an ActorCritic model inheriting from tf.keras.Model.
    • Use shared layers (e.g., Dense(64, activation='relu')) for feature extraction.
    • The policy head must output logits of shape (batch_size, num_params, 3).
    • The value head must output a single scalar value.
  2. Action Representation:

    • The agent's choose_action method must return a probability matrix of shape (num_params, 3) representing the likelihood of increasing, keeping, or decreasing each parameter.
    • The CustomEnvironment.step method must accept this probability matrix.
    • Inside step, sample an action for each parameter using np.random.choice([-1, 0, 1], p=probs), where probs is the row for that parameter.
    • Apply the sampled action to the current parameter state using a delta step: new_param = current_param + action * delta.
    • Clip the new parameters to ensure they stay within the defined [low, high] bounds.
  3. Learning Logic:

    • The learn method must calculate the advantage, value loss, and policy loss.
    • Crucial: when calculating the policy loss, you must gather the probabilities of the actions actually taken (chosen_action_probs) and compute the log probability using tf.math.log(chosen_action_probs). Do not rely solely on the distribution's log_prob method if it does not align with the specific sampling logic required.
    • Include an entropy bonus to encourage exploration.
  4. Parameter Updates:

    • The environment is responsible for applying the parameter updates based on the sampled actions; the agent is responsible for learning from the results. (A combined code sketch of these rules follows this list.)
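
A minimal, self-contained sketch of the pieces described above, assuming TensorFlow 2.x and NumPy. The class names follow this prompt (ActorCritic, PPOAgent, CustomEnvironment), but the toy _evaluate objective, the single-step advantage estimate, and the hyperparameter defaults (lr, gamma, clip_eps, entropy_coef) are illustrative assumptions, not requirements:

import numpy as np
import tensorflow as tf

class ActorCritic(tf.keras.Model):
    # Shared trunk; policy head reshaped to (batch_size, num_params, 3); scalar value head.
    def __init__(self, num_params):
        super().__init__()
        self.num_params = num_params
        self.shared = tf.keras.layers.Dense(64, activation='relu')
        self.policy_head = tf.keras.layers.Dense(num_params * 3)
        self.value_head = tf.keras.layers.Dense(1)

    def call(self, state):
        x = self.shared(state)
        logits = tf.reshape(self.policy_head(x), (-1, self.num_params, 3))
        return logits, self.value_head(x)

class CustomEnvironment:
    # step() receives a (num_params, 3) probability matrix, samples one action per
    # parameter, applies a delta step, and clips the result to [low, high].
    def __init__(self, low, high, delta, initial_params):
        self.low = np.asarray(low, dtype=np.float64)
        self.high = np.asarray(high, dtype=np.float64)
        self.delta = delta
        self.params = np.asarray(initial_params, dtype=np.float64)

    def step(self, prob_matrix):
        actions = np.array([np.random.choice([-1, 0, 1], p=row) for row in prob_matrix])
        self.params = np.clip(self.params + actions * self.delta, self.low, self.high)
        reward = self._evaluate(self.params)    # stand-in for the real simulation
        return self.params.copy(), reward, actions

    def _evaluate(self, params):
        return -float(np.sum(params ** 2))      # assumed toy objective

class PPOAgent:
    def __init__(self, num_params, lr=3e-4, gamma=0.99, clip_eps=0.2, entropy_coef=0.01):
        self.model = ActorCritic(num_params)
        self.optimizer = tf.keras.optimizers.Adam(lr)
        self.gamma, self.clip_eps, self.entropy_coef = gamma, clip_eps, entropy_coef

    def choose_action(self, state):
        logits, _ = self.model(state[None, :].astype(np.float32))
        return tf.nn.softmax(logits, axis=-1)[0].numpy()    # (num_params, 3)

    def learn(self, state, actions, old_prob_matrix, reward, next_state, done):
        state = state[None, :].astype(np.float32)
        next_state = next_state[None, :].astype(np.float32)
        action_idx = (actions + 1).astype(np.int32)          # map {-1, 0, 1} -> {0, 1, 2}
        old_chosen = tf.gather(tf.constant(old_prob_matrix, tf.float32),
                               action_idx, axis=1, batch_dims=1)
        with tf.GradientTape() as tape:
            logits, value = self.model(state)
            _, next_value = self.model(next_state)
            probs = tf.nn.softmax(logits, axis=-1)[0]        # (num_params, 3)
            # Gather the probabilities of the actions actually taken, then take the log.
            chosen_action_probs = tf.gather(probs, action_idx, axis=1, batch_dims=1)
            log_probs = tf.math.log(chosen_action_probs + 1e-10)
            old_log_probs = tf.math.log(old_chosen + 1e-10)
            target = reward + self.gamma * (1.0 - done) * tf.stop_gradient(next_value[0, 0])
            advantage = tf.stop_gradient(target - value[0, 0])
            ratio = tf.exp(log_probs - old_log_probs)
            clipped = tf.clip_by_value(ratio, 1.0 - self.clip_eps, 1.0 + self.clip_eps)
            policy_loss = -tf.reduce_mean(tf.minimum(ratio * advantage, clipped * advantage))
            value_loss = tf.square(target - value[0, 0])
            entropy = -tf.reduce_mean(tf.reduce_sum(probs * tf.math.log(probs + 1e-10), axis=-1))
            loss = policy_loss + 0.5 * value_loss - self.entropy_coef * entropy
        grads = tape.gradient(loss, self.model.trainable_variables)
        self.optimizer.apply_gradients(zip(grads, self.model.trainable_variables))

Here old_prob_matrix is the probability matrix that was in effect when the action was sampled; PPO's clipped ratio needs it, so the training loop should pass the same matrix it handed to step().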

Anti-Patterns

  • Do not use a single discrete action index for the entire state; use a matrix of probabilities.
  • Do not define the action space as spaces.Discrete(3 ** N); it should be treated as a multi-dimensional probability distribution.
  • Do not forget to clip parameters to their bounds after updating.
  • Do not use model.compile() for custom training loops with GradientTape.
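
If you do expose a Gym-style action space attribute (not required by this prompt), a per-parameter declaration avoids the 3 ** N blow-up. This sketch assumes gymnasium is available; num_params is illustrative:

import numpy as np
from gymnasium import spaces

num_params = 8   # illustrative
# One independent 3-way choice per parameter, not spaces.Discrete(3 ** num_params).
action_space = spaces.MultiDiscrete([3] * num_params)
# If step() consumes the (num_params, 3) probability matrix directly, a Box describes it:
prob_matrix_space = spaces.Box(low=0.0, high=1.0, shape=(num_params, 3), dtype=np.float32)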

Interaction Workflow

  1. Initialize the ActorCritic model and PPOAgent with bounds and delta.
  2. In the training loop, get action probabilities from the agent.
  3. Pass these probabilities to the environment's step function.
  4. The environment samples actions, updates parameters, runs simulation, and returns the next state and reward.
  5. Call the agent's learn method with the transition data.
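
A minimal loop following these steps, reusing the sketch classes defined earlier; the bounds, delta, initial parameters, and episode length are placeholder assumptions:

num_params = 4
env = CustomEnvironment(low=[0.0] * num_params, high=[1.0] * num_params,
                        delta=0.05, initial_params=[0.5] * num_params)
agent = PPOAgent(num_params)

state = env.params.copy()
for t in range(1000):
    prob_matrix = agent.choose_action(state)              # step 2: (num_params, 3) probabilities
    next_state, reward, actions = env.step(prob_matrix)   # steps 3-4: sample, update, evaluate
    done = float(t == 999)
    agent.learn(state, actions, prob_matrix, reward, next_state, done)   # step 5
    state = next_state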

Triggers

  • implement PPO for parameter tuning
  • multi-parameter action space increase keep decrease
  • actor critic for circuit design optimization
  • fix gradient warning in tensorflow PPO
  • custom environment with probability matrix actions