AutoSkill PyTorch Fusedbun Optimizer Implementation

Generates a new PyTorch optimizer class by fusing logic from two provided source implementations. The output must be error-free and memory-efficient, and must include detailed code comments attributing each feature to its source optimizer, along with a technical architecture write-up.

install
source · Clone the upstream repo
git clone https://github.com/ECNU-ICALK/AutoSkill
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/ECNU-ICALK/AutoSkill "$T" && mkdir -p ~/.claude/skills && cp -r "$T/SkillBank/ConvSkill/english_gpt4_8/pytorch-fusedbun-optimizer-implementation" ~/.claude/skills/ecnu-icalk-autoskill-pytorch-fusedbun-optimizer-implementation && rm -rf "$T"
manifest: SkillBank/ConvSkill/english_gpt4_8/pytorch-fusedbun-optimizer-implementation/SKILL.md
source content

PyTorch Fusedbun Optimizer Implementation

Generates a new PyTorch optimizer class by fusing logic from two provided source implementations. The output must be error-free and memory-efficient, and must include detailed code comments attributing each feature to its source optimizer, along with a technical architecture write-up.

Prompt

Role & Objective

You are a PyTorch optimizer developer. Your task is to implement a custom optimizer class named Fusedbun that fuses techniques from the SM3 and Adalite optimizers. The implementation must be error-free, heavily commented, and must include specific mechanisms for momentum, gradient centralization, sparse updates, and Hessian approximation.

Operational Rules & Constraints

  1. Class Structure: Inherit from torch.optim.Optimizer.
  2. Initialization: The __init__ method must accept params, lr (required), eps, beta_decay, Lambda (weight decay), momentum_beta, and prepare_hessian (a boolean flag).
  3. Step Method Signature: The step method must accept an optional closure argument: def step(self, closure=None):.
  4. Closure Handling: If closure is provided, call it to compute the loss at the beginning of the step.
  5. Gradient Centralization: For any parameter gradient grad where len(grad.shape) > 1, centralize the gradient by subtracting its mean: grad -= grad.mean(dim=tuple(range(1, len(grad.shape))), keepdim=True). Add a comment explaining that this stabilizes training.
  6. Momentum: Implement a momentum buffer. Update it using momentum_beta and blend it with the current gradient.
  7. Sparse Update Mechanism: For parameters where p.dim() > 1, implement the following specific logic:
    • Create a mask: mask = grad.abs() > eps.
    • Zero out small gradients: grad = grad * mask.
    • Conditionally update the squared-gradient average (exp_avg_sq) using torch.where(mask, exp_avg_sq*beta_decay + (1-beta_decay)*grad.pow(2), exp_avg_sq).
    • For scalar parameters (else branch), update exp_avg_sq normally using mul_ and addcmul_.
    • Add comments explaining that this focuses updates on significant gradients to handle sparsity.
  8. Hessian Approximation: If prepare_hessian is True, initialize and maintain a separate state buffer exp_hessian. Update it similarly to exp_avg_sq and use its square root (plus eps) as the denominator for the update step instead of exp_avg_sq.
  9. Weight Decay: Apply weight decay using the Lambda parameter whenever it is non-zero.
  10. Comments: Every line of code must have a comment explaining exactly what the tensor operation or mathematical step is doing. (These rules are assembled in the sketch that follows this list.)
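
The rules above pin down the public surface (the class name Fusedbun; the hyperparameters lr, eps, beta_decay, Lambda, momentum_beta, and prepare_hessian; the state buffers exp_avg_sq and exp_hessian) but leave several choices open. The sketch below is one plausible assembly under stated assumptions, not the canonical SM3/Adalite fusion: zero-initialized buffers, applying momentum before the sparsity mask, the default hyperparameter values, and the decoupled form of weight decay are all illustrative choices.

import torch
from torch.optim import Optimizer


class Fusedbun(Optimizer):
    # Sketch of a fused optimizer combining SM3-style sparse second-moment
    # handling with Adalite-style centralization and momentum, per the rules above.
    def __init__(self, params, lr, eps=1e-8, beta_decay=0.999, Lambda=0.0,
                 momentum_beta=0.9, prepare_hessian=False):
        # Collect hyperparameters into the defaults dict Optimizer expects.
        defaults = dict(lr=lr, eps=eps, beta_decay=beta_decay, Lambda=Lambda,
                        momentum_beta=momentum_beta,
                        prepare_hessian=prepare_hessian)
        super().__init__(params, defaults)  # Register params and defaults.

    @torch.no_grad()  # The updates below must not be tracked by autograd.
    def step(self, closure=None):  # Rule 3: optional closure argument.
        loss = None
        if closure is not None:  # Rule 4: compute the loss at the start of the step.
            with torch.enable_grad():
                loss = closure()
        for group in self.param_groups:  # Each group carries its own hyperparameters.
            eps = group['eps']
            beta_decay = group['beta_decay']
            for p in group['params']:
                if p.grad is None:  # Skip parameters with no gradient.
                    continue
                grad = p.grad
                if len(grad.shape) > 1:
                    # Rule 5: centralization stabilizes training by removing
                    # the per-slice mean from multi-dimensional gradients.
                    grad = grad - grad.mean(dim=tuple(range(1, len(grad.shape))), keepdim=True)
                state = self.state[p]
                if len(state) == 0:  # Lazily allocate buffers (zero init is an assumption).
                    state['momentum'] = torch.zeros_like(p)
                    state['exp_avg_sq'] = torch.zeros_like(p)
                    if group['prepare_hessian']:
                        state['exp_hessian'] = torch.zeros_like(p)
                buf = state['momentum']
                # Rule 6: decay the momentum buffer and blend in the new gradient.
                buf.mul_(group['momentum_beta']).add_(grad, alpha=1 - group['momentum_beta'])
                grad = buf  # Use the blended direction from here on.
                exp_avg_sq = state['exp_avg_sq']
                if p.dim() > 1:
                    # Rule 7: focus updates on significant gradients to handle sparsity.
                    mask = grad.abs() > eps  # Flag entries large enough to matter.
                    grad = grad * mask       # Zero out small gradients.
                    exp_avg_sq.copy_(torch.where(  # EMA only where the mask is set.
                        mask,
                        exp_avg_sq * beta_decay + (1 - beta_decay) * grad.pow(2),
                        exp_avg_sq))
                else:
                    # Scalar parameters (else branch): plain EMA via mul_/addcmul_.
                    exp_avg_sq.mul_(beta_decay).addcmul_(grad, grad, value=1 - beta_decay)
                if group['prepare_hessian']:
                    # Rule 8: maintain exp_hessian like exp_avg_sq and prefer its
                    # square root (plus eps) as the update denominator.
                    exp_hessian = state['exp_hessian']
                    exp_hessian.mul_(beta_decay).addcmul_(grad, grad, value=1 - beta_decay)
                    denom = exp_hessian.sqrt().add_(eps)
                else:
                    denom = exp_avg_sq.sqrt().add_(eps)  # Default denominator.
                if group['Lambda'] != 0:
                    # Rule 9: weight decay via Lambda; the decoupled shrinkage shown
                    # here is an assumption, the spec does not fix coupled vs. decoupled.
                    p.mul_(1 - group['lr'] * group['Lambda'])
                # Parameter update: step along -lr * grad / denom.
                p.addcdiv_(grad, denom, value=-group['lr'])
        return loss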
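
Rule 7 is the subtlest step, and its torch.where pattern can be sanity-checked in isolation. The tensors and threshold below are made up purely for illustration:

import torch
eps, beta_decay = 1e-2, 0.9
grad = torch.tensor([[0.5, 1e-4], [2.0, 0.0]])  # Two large entries, two tiny ones.
exp_avg_sq = torch.ones(2, 2)                   # Pretend prior second-moment state.
mask = grad.abs() > eps                         # Only 0.5 and 2.0 pass the threshold.
grad = grad * mask                              # Tiny entries become exactly zero.
exp_avg_sq = torch.where(mask, exp_avg_sq * beta_decay + (1 - beta_decay) * grad.pow(2), exp_avg_sq)
print(exp_avg_sq)                               # Masked-out entries keep their old value of 1.0.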
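
Finally, a hypothetical smoke test showing the call pattern; the toy model and random data are placeholders:

model = torch.nn.Linear(10, 2)                             # Toy model for illustration.
opt = Fusedbun(model.parameters(), lr=1e-3, prepare_hessian=True)
loss = model(torch.randn(4, 10)).sum()                     # Dummy scalar loss.
loss.backward()                                            # Populate .grad buffers.
opt.step()                                                 # One Fusedbun update.
opt.zero_grad()                                            # Clear gradients for the next iteration.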

Anti-Patterns

  • Do not omit the closure argument in the step method.
  • Do not skip the specific sparse-update logic involving torch.where.
  • Do not forget gradient centralization for multi-dimensional parameters.
  • Do not leave the code uncommented.

Triggers

  • implement fusedbun optimizer
  • sm3 adalite fusion optimizer
  • custom optimizer with sparse updates
  • pytorch optimizer with hessian approximation and centralization
  • fuse these two optimizers
  • create a new optimizer from these implementations
  • combine adalite and sm3 code
  • generate a fused optimizer with comments