AutoSkill PyTorch Fusedbun Optimizer Implementation
Generates a new PyTorch optimizer class by fusing logic from two provided source implementations. The output must be error-free, memory-efficient, and include detailed code comments attributing features to their source optimizers, along with a technical architecture writeup.
install
source · Clone the upstream repo
git clone https://github.com/ECNU-ICALK/AutoSkill
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/ECNU-ICALK/AutoSkill "$T" && mkdir -p ~/.claude/skills && cp -r "$T/SkillBank/ConvSkill/english_gpt4_8/pytorch-fusedbun-optimizer-implementation" ~/.claude/skills/ecnu-icalk-autoskill-pytorch-fusedbun-optimizer-implementation && rm -rf "$T"
manifest:
SkillBank/ConvSkill/english_gpt4_8/pytorch-fusedbun-optimizer-implementation/SKILL.md
source content
PyTorch Fusedbun Optimizer Implementation
Generates a new PyTorch optimizer class by fusing logic from two provided source implementations. The output must be error-free, memory-efficient, and include detailed code comments attributing features to their source optimizers, along with a technical architecture writeup.
Prompt
Role & Objective
You are a PyTorch optimizer developer. Your task is to implement a custom optimizer class named `Fusedbun` that fuses techniques from the SM3 and Adalite optimizers. The implementation must be error-free, heavily commented, and include specific mechanisms for momentum, gradient centralization, sparse updates, and Hessian approximation.
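Read together, the constraints in the next section imply an update of roughly the following shape. The notation is ours, not the skill's: \(\beta_m\) = `momentum_beta`, \(\beta\) = `beta_decay`, \(\eta\) = `lr`, \(\Lambda\) = `Lambda`; the Hessian variant swaps \(v_t\) for a separate `exp_hessian` buffer.

$$
g_t \leftarrow g_t - \mathrm{mean}(g_t), \qquad
m_t = \beta_m m_{t-1} + (1-\beta_m)\, g_t,
$$

$$
v_t = \begin{cases} \beta v_{t-1} + (1-\beta)\, m_t^2 & \text{where } |m_t| > \epsilon \\ v_{t-1} & \text{otherwise,} \end{cases} \qquad
\theta_t = (1 - \eta\Lambda)\,\theta_{t-1} - \eta\,\frac{m_t}{\sqrt{v_t} + \epsilon}.
$$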
Operational Rules & Constraints
- Class Structure: Inherit from `torch.optim.Optimizer`.
- Initialization: The `__init__` method must accept `params`, `lr` (required), `eps`, `beta_decay`, `Lambda` (weight decay), `momentum_beta`, and `prepare_hessian` (boolean flag).
- Step Method Signature: The `step` method must accept an optional `closure` argument: `def step(self, closure=None):`.
- Closure Handling: If `closure` is provided, call it to compute the loss at the beginning of the step.
- Gradient Centralization: For any parameter gradient `grad` where `len(grad.shape) > 1`, centralize the gradient by subtracting its mean: `grad -= grad.mean(dim=tuple(range(1, len(grad.shape))), keepdim=True)`. Add a comment explaining that this stabilizes training.
- Momentum: Implement a momentum buffer. Update it using `momentum_beta` and blend it with the current gradient.
- Sparse Update Mechanism: For parameters where `p.dim() > 1`, implement the following specific logic:
  - Create a mask: `mask = grad.abs() > eps`.
  - Zero out small gradients: `grad = grad * mask`.
  - Conditionally update the squared gradient average (`exp_avg_sq`) using `torch.where(mask, exp_avg_sq*beta_decay + (1-beta_decay)*grad.pow(2), exp_avg_sq)`.
  - For scalar parameters (else branch), update `exp_avg_sq` normally using `mul_` and `addcmul_`.
  - Add comments explaining that this focuses updates on significant gradients to handle sparsity.
- Hessian Approximation: If `prepare_hessian` is True, initialize and maintain a separate state buffer `exp_hessian`. Update it similarly to `exp_avg_sq` and use its square root (plus `eps`) as the denominator for the update step instead of `exp_avg_sq`.
- Weight Decay: Apply weight decay using the `Lambda` parameter if it is non-zero.
- Comments: Every line of code must have a comment explaining exactly what the tensor operation or mathematical step is doing. (A minimal sketch satisfying these rules follows this list.)
Anti-Patterns
- Do not omit the `closure` argument in the `step` method.
- Do not skip the specific sparse update logic involving `torch.where`.
- Do not forget gradient centralization for multi-dimensional parameters.
- Do not leave the code uncommented.
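A hypothetical smoke test for the sketch above (the names come from that sketch, not a published API); it exercises both the multi-dimensional and scalar branches and passes the closure instead of omitting it:

```python
import torch
from torch import nn

# Assumes the Fusedbun sketch above is in scope.
model = nn.Linear(10, 2)  # 2-D weight (sparse branch) + 1-D bias (else branch)
opt = Fusedbun(model.parameters(), lr=1e-3, Lambda=1e-2, prepare_hessian=True)

x, y = torch.randn(8, 10), torch.randn(8, 2)

def closure():
    # Recompute the loss and gradients; step() calls this at the start.
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(x), y)
    loss.backward()
    return loss

loss = opt.step(closure)  # do not omit the closure argument
print(loss.item())
```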
Triggers
- implement fusedbun optimizer
- sm3 adalite fusion optimizer
- custom optimizer with sparse updates
- pytorch optimizer with hessian approximation and centralization
- fuse these two optimizers
- create a new optimizer from these implementations
- combine adalite and sm3 code
- generate a fused optimizer with comments