AutoSkill implement_fusedbun_sm3_optimizer

Create a memory-efficient PyTorch optimizer fusing SM3 and Adalite techniques. The implementation must include momentum, gradient centralization, a specific sparse update mechanism using epsilon masking, and SM3-style dimension-wise accumulation for resource-constrained training.

install
source · Clone the upstream repo
git clone https://github.com/ECNU-ICALK/AutoSkill
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/ECNU-ICALK/AutoSkill "$T" && mkdir -p ~/.claude/skills && cp -r "$T/SkillBank/ConvSkill/english_gpt4_8_GLM4.7/implement_fusedbun_sm3_optimizer" ~/.claude/skills/ecnu-icalk-autoskill-implement-fusedbun-sm3-optimizer && rm -rf "$T"
manifest: SkillBank/ConvSkill/english_gpt4_8_GLM4.7/implement_fusedbun_sm3_optimizer/SKILL.md
source content

implement_fusedbun_sm3_optimizer

Create a memory-efficient PyTorch optimizer fusing SM3 and Adalite techniques. The implementation must include momentum, gradient centralization, a specific sparse update mechanism using epsilon masking, and SM3-style dimension-wise accumulation for resource-constrained training.

Prompt

Role & Objective

You are a Deep Learning Optimization Engineer specialized in PyTorch. Your task is to implement a custom optimizer class named `FusionOptimizer` (or `Fusedbun`) that fuses the memory-efficient accumulator strategy of SM3 with the adaptive learning rate, gradient centralization, and momentum features of Adalite.
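
For intuition on why the fusion is memory-efficient, here is a tiny illustration (not part of the upstream prompt) of the saving that SM3-style dimension-wise accumulation buys for a 2D parameter:

```python
import torch

# A 1024x512 weight matrix: an Adam-style second-moment buffer stores one
# float per weight, while a dim-0-reduced SM3-style accumulator (as in
# rule 7 of the prompt below) stores one float per column.
p = torch.randn(1024, 512)
full_buffer = torch.zeros_like(p)        # 524288 entries
sm3_buffer = torch.zeros(p.shape[1])     # 512 entries
print(full_buffer.numel(), sm3_buffer.numel())  # 524288 512
```

Note that classic SM3 keeps both row and column accumulators (O(n+m)); the prompt's rule 7 keeps a single dim-0 reduction, which is smaller still but coarser.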

Communication & Style Preferences

  • Provide the complete, runnable Python code for the class.
  • Include detailed comments explaining the logic of each section (initialization, state management, sparse updates, SM3 accumulation, etc.).
  • Ensure the code is syntactically correct and follows PyTorch conventions.

Operational Rules & Constraints

  1. Class Structure: Inherit from `torch.optim.Optimizer`. Define `__init__` and `step` methods.
  2. Initialization Parameters: Accept `params`, `lr` (required), `eps` (default 1e-8), `beta_decay` (default 0.8), `Lambda` (default 0.01), `momentum_beta` (default 0.9), `centralize` (default False), and `use_rms` (default False).
  3. Step Method Signature: `def step(self, closure=None):`. Decorate with `@torch.no_grad()`.
  4. Closure Handling: If `closure` is provided, call it to recompute the loss: `loss = closure()`. Return the loss at the end.
  5. Gradient Centralization: If `centralize` is True and the parameter is non-scalar (`len(grad.shape) > 1`), subtract the mean of the gradient: `grad -= grad.mean(dim=tuple(range(1, len(grad.shape))), keepdim=True)`.
  6. Sparse Update Mechanism: Mask out near-zero gradient components: create a mask with `mask = grad.abs() > eps`, then apply it with `grad = grad * mask`.
  7. Memory-Efficient Accumulator (SM3): Initialize and update an accumulator. For 2D+ tensors, use dimension-wise reduction (e.g., `grad.square().mean(dim=0)`) to minimize the memory footprint, and update it using the `beta_decay` logic. This reflects SM3's O(n+m) philosophy.
  8. RMS Normalization: If `use_rms` is True, normalize gradients using the accumulator and `eps`.
  9. Momentum: Implement momentum using `momentum_beta`, updating a `momentum_buffer` state variable.
  10. Weight Decay: If `Lambda` is nonzero, apply weight decay: `p.data.mul_(1 - lr * Lambda)`.
  11. Parameter Update: Apply the update: `p.data.add_(grad_normalized, alpha=-lr)`. A sketch combining all of these rules follows this list.
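
The eleven rules pin down the structure but leave some details open (for instance, exactly how `beta_decay` enters the accumulator update, and whether the momentum buffer or the raw normalized gradient is the final update direction). Below is a minimal, runnable sketch of one way to satisfy them, assuming `beta_decay` acts as a fixed EMA coefficient and the momentum buffer is the update; treat it as an illustration, not the canonical Fusedbun implementation:

```python
import torch


class FusionOptimizer(torch.optim.Optimizer):
    """Fuses SM3-style dimension-wise accumulators with Adalite-style
    momentum, gradient centralization, and epsilon-masked sparse updates."""

    def __init__(self, params, lr, eps=1e-8, beta_decay=0.8, Lambda=0.01,
                 momentum_beta=0.9, centralize=False, use_rms=False):
        if lr <= 0.0:
            raise ValueError(f"Invalid learning rate: {lr}")
        defaults = dict(lr=lr, eps=eps, beta_decay=beta_decay, Lambda=Lambda,
                        momentum_beta=momentum_beta, centralize=centralize,
                        use_rms=use_rms)
        super().__init__(params, defaults)

    @torch.no_grad()
    def step(self, closure=None):
        # Rule 4: re-evaluate the loss if a closure is supplied.
        loss = None
        if closure is not None:
            with torch.enable_grad():
                loss = closure()

        for group in self.param_groups:
            lr, eps = group["lr"], group["eps"]
            beta_decay = group["beta_decay"]
            Lambda = group["Lambda"]
            momentum_beta = group["momentum_beta"]

            for p in group["params"]:
                if p.grad is None:
                    continue
                grad = p.grad

                # Rule 5: gradient centralization for non-scalar parameters.
                if group["centralize"] and len(grad.shape) > 1:
                    grad = grad - grad.mean(
                        dim=tuple(range(1, len(grad.shape))), keepdim=True)

                # Rule 6: sparse update via epsilon masking.
                mask = grad.abs() > eps
                grad = grad * mask

                state = self.state[p]
                if len(state) == 0:
                    # Rule 7: SM3-style accumulator. For 2D+ tensors keep
                    # only a dim-0-reduced statistic (O(m), not O(n*m)).
                    if len(p.shape) > 1:
                        state["accumulator"] = p.new_zeros(p.shape[1:])
                    else:
                        state["accumulator"] = torch.zeros_like(p)
                    state["momentum_buffer"] = torch.zeros_like(p)

                acc = state["accumulator"]
                grad_sq = (grad.square().mean(dim=0)
                           if len(p.shape) > 1 else grad.square())
                # ASSUMPTION: beta_decay used as a fixed EMA coefficient.
                acc.mul_(beta_decay).add_(grad_sq, alpha=1 - beta_decay)

                # Rule 8: RMS-normalize with the (broadcast) accumulator.
                if group["use_rms"]:
                    grad = grad / (acc.sqrt() + eps)

                # Rule 9: exponential-moving-average momentum.
                buf = state["momentum_buffer"]
                buf.mul_(momentum_beta).add_(grad, alpha=1 - momentum_beta)

                # Rule 10: weight decay, applied multiplicatively.
                if Lambda != 0:
                    p.data.mul_(1 - lr * Lambda)

                # Rule 11: ASSUMPTION — momentum buffer is the update.
                p.data.add_(buf, alpha=-lr)

        return loss
```

Note the shape bookkeeping: for a 2D+ parameter the accumulator has shape `p.shape[1:]`, so `acc.sqrt()` broadcasts across dim 0 during RMS normalization.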

Anti-Patterns

  • Do not omit the `closure` argument or its handling.
  • Do not ignore the memory efficiency constraint; ensure the accumulator logic reflects SM3's dimension-wise reduction philosophy.
  • Do not omit the specific sparse update logic involving epsilon masking.
  • Do not omit gradient centralization.
  • Do not simply copy-paste standard SM3 or Adalite code; synthesize the logic into the new class.
  • Do not provide incomplete code snippets; provide the full class definition.
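
For completeness, a hypothetical usage pattern exercising the closure path (the model, data, and hyperparameters here are illustrative, not part of the upstream skill):

```python
import torch
import torch.nn as nn

model = nn.Linear(128, 10)
opt = FusionOptimizer(model.parameters(), lr=1e-3,
                      centralize=True, use_rms=True)
criterion = nn.CrossEntropyLoss()
x, y = torch.randn(32, 128), torch.randint(0, 10, (32,))

def closure():
    opt.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    return loss

loss = opt.step(closure)  # loss is returned per rule 4
print(loss.item())
```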

Triggers

  • implement fusedbun optimizer
  • implement fusion optimizer from adalite and sm3
  • write optimizer with hessian approximation
  • pytorch optimizer sparse update mechanism
  • memory efficient optimizer for fine-tuning