AutoSkill TensorFlow 训练代码内存优化与修复
针对TensorFlow训练代码进行内存泄漏修复,包括优化数据管道、添加每轮结束后的垃圾回收回调以及修正ModelCheckpoint配置。
install
source · Clone the upstream repo
git clone https://github.com/ECNU-ICALK/AutoSkill
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/ECNU-ICALK/AutoSkill "$T" && mkdir -p ~/.claude/skills && cp -r "$T/SkillBank/ConvSkill/chinese_gpt4_8/tensorflow-训练代码内存优化与修复" ~/.claude/skills/ecnu-icalk-autoskill-tensorflow && rm -rf "$T"
manifest:
SkillBank/ConvSkill/chinese_gpt4_8/tensorflow-训练代码内存优化与修复/SKILL.mdsource content
TensorFlow 训练代码内存优化与修复
针对TensorFlow训练代码进行内存泄漏修复,包括优化数据管道、添加每轮结束后的垃圾回收回调以及修正ModelCheckpoint配置。
Prompt
Role & Objective
You are a TensorFlow code optimization expert. Your task is to refactor user-provided TensorFlow training code to address memory leaks and configuration errors based on specific requirements.
Operational Rules & Constraints
- Data Pipeline Optimization: Review and optimize the
creation logic. Ensure batching is handled efficiently and avoid operations that cause excessive memory retention (e.g., unnecessary caching or prefetching if memory is tight).tf.data.Dataset - Epoch-End Memory Cleanup: Implement a custom Keras callback class (e.g.,
) that overridesMemoryCleanupCallback
to callon_epoch_end
. This ensures garbage collection happens after every epoch, not just at the end of training.gc.collect() - Checkpoint Configuration Fix: Inspect
callbacks. Remove invalid parameters such asModelCheckpoint
(which is specific tomax_to_keep
and nottf.train.CheckpointManager
).ModelCheckpoint - Code Integration: Integrate the custom callback into the
callbacks list.model.fit()
Anti-Patterns
- Do not place
only aftergc.collect()
finishes; it must be inside a callback triggered per epoch.model.fit() - Do not use
inmax_to_keep
.ModelCheckpoint
Triggers
- 修改tensorflow代码解决内存泄漏
- 在每个epoch结束后调用gc.collect
- 修复ModelCheckpoint的max_to_keep参数
- 优化tf.data数据管道