AutoSkill PyTorch Fusedbun Optimizer Implementation
Generates a new PyTorch optimizer class by fusing logic from two provided source implementations. The output must be error-free, memory-efficient, and include detailed code comments attributing features to their source optimizers, along with a technical architecture writeup.
install
source · Clone the upstream repo
git clone https://github.com/ECNU-ICALK/AutoSkill
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/ECNU-ICALK/AutoSkill "$T" && mkdir -p ~/.claude/skills && cp -r "$T/SkillBank/ConvSkill/english_gpt4_8/pytorch-fusedbun-optimizer-implementation" ~/.claude/skills/ecnu-icalk-autoskill-pytorch-fusedbun-optimizer-implementation && rm -rf "$T"
manifest:
SkillBank/ConvSkill/english_gpt4_8/pytorch-fusedbun-optimizer-implementation/SKILL.md
source content
PyTorch Fusedbun Optimizer Implementation
Generates a new PyTorch optimizer class by fusing logic from two provided source implementations. The output must be error-free, memory-efficient, and include detailed code comments attributing features to their source optimizers, along with a technical architecture writeup.
Prompt
Role & Objective
You are a PyTorch optimizer developer. Your task is to implement a custom optimizer class named `Fusedbun` that fuses techniques from the SM3 and Adalite optimizers. The implementation must be error-free, heavily commented, and include specific mechanisms for momentum, gradient centralization, sparse updates, and Hessian approximation.
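Read together, the constraints in the next section imply an update of roughly the following shape. The notation is ours, not the skill's: \(\beta_m\) = `momentum_beta`, \(\beta\) = `beta_decay`, \(\eta\) = `lr`, \(\Lambda\) = `Lambda`; the Hessian variant swaps \(v_t\) for a separate `exp_hessian` buffer.

$$
g_t \leftarrow g_t - \mathrm{mean}(g_t), \qquad
m_t = \beta_m m_{t-1} + (1-\beta_m)\, g_t,
$$

$$
v_t = \begin{cases} \beta v_{t-1} + (1-\beta)\, m_t^2 & \text{where } |m_t| > \epsilon \\ v_{t-1} & \text{otherwise,} \end{cases} \qquad
\theta_t = (1 - \eta\Lambda)\,\theta_{t-1} - \eta\,\frac{m_t}{\sqrt{v_t} + \epsilon}.
$$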
Operational Rules & Constraints
- Class Structure: Inherit from `torch.optim.Optimizer`.
- Initialization: The `__init__` method must accept `params`, `lr` (required), `eps`, `beta_decay`, `Lambda` (weight decay), `momentum_beta`, and `prepare_hessian` (boolean flag).
- Step Method Signature: The `step` method must accept an optional `closure` argument: `def step(self, closure=None):`.
- Closure Handling: If `closure` is provided, call it to compute the loss at the beginning of the step.
- Gradient Centralization: For any parameter gradient `grad` where `len(grad.shape) > 1`, centralize the gradient by subtracting its mean: `grad -= grad.mean(dim=tuple(range(1, len(grad.shape))), keepdim=True)`. Add a comment explaining that this stabilizes training.
- Momentum: Implement a momentum buffer. Update it using `momentum_beta` and blend it with the current gradient.
- Sparse Update Mechanism: For parameters where `p.dim() > 1`, implement the following specific logic:
  - Create a mask: `mask = grad.abs() > eps`.
  - Zero out small gradients: `grad = grad * mask`.
  - Conditionally update the squared gradient average (`exp_avg_sq`) using `torch.where(mask, exp_avg_sq*beta_decay + (1-beta_decay)*grad.pow(2), exp_avg_sq)`.
  - For scalar parameters (else branch), update `exp_avg_sq` normally using `mul_` and `addcmul_`.
  - Add comments explaining that this focuses updates on significant gradients to handle sparsity.
- Hessian Approximation: If `prepare_hessian` is True, initialize and maintain a separate state buffer `exp_hessian`. Update it similarly to `exp_avg_sq` and use its square root (plus `eps`) as the denominator for the update step instead of `exp_avg_sq`.
- Weight Decay: Apply weight decay using the `Lambda` parameter if it is non-zero.
- Comments: Every line of code must have a comment explaining exactly what the tensor operation or mathematical step is doing. (A minimal sketch satisfying these rules follows this list.)
Anti-Patterns
- Do not omit the `closure` argument in the `step` method.
- Do not skip the specific sparse update logic involving `torch.where`.
- Do not forget gradient centralization for multi-dimensional parameters.
- Do not leave the code uncommented.
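A hypothetical smoke test for the sketch above (the names come from that sketch, not a published API); it exercises both the multi-dimensional and scalar branches and passes the closure instead of omitting it:

```python
import torch
from torch import nn

# Assumes the Fusedbun sketch above is in scope.
model = nn.Linear(10, 2)  # 2-D weight (sparse branch) + 1-D bias (else branch)
opt = Fusedbun(model.parameters(), lr=1e-3, Lambda=1e-2, prepare_hessian=True)

x, y = torch.randn(8, 10), torch.randn(8, 2)

def closure():
    # Recompute the loss and gradients; step() calls this at the start.
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(x), y)
    loss.backward()
    return loss

loss = opt.step(closure)  # do not omit the closure argument
print(loss.item())
```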
Triggers
- implement fusedbun optimizer
- sm3 adalite fusion optimizer
- custom optimizer with sparse updates
- pytorch optimizer with hessian approximation and centralization
- fuse these two optimizers
- create a new optimizer from these implementations
- combine adalite and sm3 code
- generate a fused optimizer with comments