Awesome-omni-skill doc_processor
A comprehensive tool for parsing, cleaning, generating content for, and reconstructing MS Word (.docx) documents.
git clone https://github.com/diegosouzapw/awesome-omni-skill
T=$(mktemp -d) && git clone --depth=1 https://github.com/diegosouzapw/awesome-omni-skill "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/tools/doc_processor" ~/.claude/skills/diegosouzapw-awesome-omni-skill-doc-processor && rm -rf "$T"
skills/tools/doc_processor/SKILL.md
Doc Processor Skill
This skill allows you to "re-architect" a Word document. It can extract the deep structure, wipe content to create a template, generate new content based on rules or AI, and refill the document.
Capabilities
- Parse Structure: Extract a hierarchical JSON representation including Sections, Paragraphs, Tables, and "Slots".
- Clean Template: Create a "Clean" blank version of the document.
- Generate Content: Produce a content map based on the parsed structure and a user topic.
- Local Repository Integration: Automatically queries the local question bank for authentic exam materials.
- Source Citation: All borrowed content is properly annotated with exam source information.
- Build Document: Inject content back into the Clean Template.
Usage Workflow
Task: "Rewrite this lesson plan for the topic 'Past Tense'."
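⚠️ Important: Check the Template File Path
Background:
Files under /tmp/ are cleaned up when a session ends. If the user supplies a template path such as /tmp/xxx.docx, the file may no longer exist in a new session.
Solution:
- Check before generating: use `os.path.exists()` to verify that the template file exists.
- If the file does not exist: ask the user for the correct template path. Do not assume the file exists.
- Advise the user: keep template files outside /tmp/ (for example, in ~/Documents/).
Standard Workflow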
1. Parse Original:
   ```bash
   python skills/doc_processor/scripts/parser.py input.docx > structure.json
   ```
   (Redirecting the output to a file is optional.)
2. Create Template (Clean):
   ```bash
   python skills/doc_processor/scripts/cleaner.py input.docx template_clean.docx
   ```
   ⚠️ Where to save the cleaned template:
   - Save it outside /tmp/, for example: ~/Documents/templates/lesson_template_clean.docx
   - Or save it in the working directory: /Users/xielk/webdata/english/lesson/templates/
3. Generate Content (The "Brain"):
   - Goal: Create a `content.json` file that maps `structure.json` IDs to new content.
   - Process:
     - Read the `structure.json` to find the Slot IDs (`p_X`, `t_X`) and their types.
     - MANDATORY: Query Local Question Bank via Index System (CRITICAL CONSTRAINT)
       - MUST use the Index + On-Demand Loading system to access exam questions. NEVER directly load all docx files (65MB+).
       - Workflow:
         - Load index file (`/Users/xielk/webdata/english/lesson/resource/index.json`)
         - Search index for matching files (search filename and preview text)
         - Load only the most relevant 3-5 docx files on-demand
         - Extract questions with proper citations
       - Implementation:
         ```python
         from skills.doc_processor.scripts.searcher import search_question_bank

         # Search for questions matching topic and student profile
         results, questions = search_question_bank(
             topic="非谓语动词",  # Grammar topic
             district="嘉定",     # Student's district (priority)
             year="2025"          # Most recent year (priority)
         )

         # questions contains content with source annotations
         for q in questions:
             print(q['content'])  # Question text
             print(q['source'])   # Source: (2025 嘉定一模)
         ```
       - NEVER fabricate or hallucinate exam questions. All content MUST be sourced from the local repository.
       - Citation Requirement: EVERY piece of content MUST be annotated: `(YYYY 区域 考试类型)`
         - Examples: `(2025 徐汇一模)`, `(2024 浦东二模)`, `(2023 嘉定一模)`
       - Priority Rules:
         - Most recent year (2025 > 2024 > 2023)
         - Student's district (if specified)
         - Load max 3-5 files, max 5 questions per file (control token usage)
     - STRICTLY ADHERE to Rules from `.agent/rules/lesson.md`:
       - Length Constraint: Resulting doc MUST be > 14 pages. You must generate EXTENSIVE examples, detailed logic explanations, and sufficient practice questions to meet this. Do not compress content.
       - Time Duration: Content must cover a full 2-hour lesson.
       - Topic Focus: Single core topic (e.g., "Prepositions") only. All examples must align.
       - Structure Mapping:
         - Row 1-3: Teaching Objectives & Difficulties.
         - Row 6: Icebreaker/Review.
         - Row 7-10: Knowledge Points (Deep Dive). This is the bulk. Use "Methodology + Logic" style (When/Why/Trap/How).
         - Row 15: Variant Practice (Part A: Drill, Part B: Application).
         - Row 17: Class Quiz (Part A: Real Exams, Part B: Extension).
         - Row 18: Reflection.
       - Exam Alignment: Use tags like `(2023 Shanghai Zhongkao)` or `(2024 Pudong Model)`.
       - Formatting: No Markdown symbols (`**`, `|`), use `____` for blanks.
     - Synthesize Content:
       - Write a JSON file where Keys = IDs, Values = Strings (or Arrays for Tables).
       - Ensure all exam questions, reading passages, and reference materials include proper source citations as specified above.
   - Action: Save the result to `content.json`.
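4. Build Final Doc: Run the builder script to inject your generated content into the clean template.
   ```bash
   python skills/doc_processor/scripts/builder.py <path_to_clean_template_docx> <path_to_content_json> <path_to_final_docx>
   ```
   ⚠️ Error handling:
   If the template file does not exist (FileNotFoundError), follow this procedure:
   ```python
   import os

   template_path = "/tmp/xxx.docx"  # Path supplied by the user

   if not os.path.exists(template_path):
       # 1. Report the error
       print(f"❌ Template file does not exist: {template_path}")
       # 2. Explain the likely causes
       print("Possible causes:")
       print("  • Files under /tmp/ are cleaned up when the session ends")
       print("  • The file path is wrong")
       print("  • The file was moved or deleted")
       # 3. Ask the user
       print("\n💡 Please provide the correct template file path:")
       print("  Consider copying the template outside /tmp/, e.g. ~/Documents/templates/")
       # 4. Wait for the user to supply a new path (in the conversation)
       # Do not continue generating; that would produce a badly formatted document!
   ```
   Handling in a new session:
   ```
   User: Generate a lesson plan for me; the template is /tmp/template.docx
   Assistant: Checking whether the file exists...
   If the file is missing:
     "⚠️ The template file /tmp/template.docx does not exist!
      Files under /tmp/ are cleaned up when a session ends.
      Please provide the correct template path, or upload the template again.
      Consider saving the template under ~/Documents/."
   User: (provides a new path or uploads the file again)
   Assistant: (continues generating with the correct path)
   ```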
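The content map produced in step 3 can be sketched as follows. This is a minimal illustration, not the skill's actual generator: the helper names `build_content_map` and `check_formatting` are hypothetical, and the sketch assumes that slot IDs look like `p_0` or `t_3`, that paragraph slots take a string, and that table slots take a flat list of cell strings.

```python
import json
import re

SLOT_ID = re.compile(r"^[pt]_\d+$")  # assumed ID shape: p_X paragraphs, t_X tables
FORBIDDEN = ("**", "|")              # Markdown symbols the formatting rule disallows


def check_formatting(text):
    """Return the forbidden Markdown symbols found in text."""
    return [sym for sym in FORBIDDEN if sym in text]


def build_content_map(entries):
    """Validate {slot_id: str | list[str]} and return it as a content map."""
    content = {}
    for slot_id, value in entries.items():
        if not SLOT_ID.match(slot_id):
            raise ValueError(f"unexpected slot ID: {slot_id}")
        cells = value if isinstance(value, list) else [value]
        for cell in cells:
            bad = check_formatting(cell)
            if bad:
                raise ValueError(f"{slot_id} contains Markdown symbols: {bad}")
        content[slot_id] = value
    return content


# Hypothetical slots: one paragraph and one table row with a blank.
content = build_content_map({
    "p_0": "Topic: Past Tense",
    "t_1": ["Fill in the blank: She ____ (go) to school yesterday.", "went"],
})
print(json.dumps(content, ensure_ascii=False, indent=2))
```

Running a check like this before calling `builder.py` catches stray Markdown symbols and mistyped slot IDs early, while the output is still plain JSON.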
Scripts Reference
- `scripts/parser.py`: Analyzes structure. Returns valid JSON.
- `scripts/cleaner.py`: Wipes content cells/paragraphs.
- `scripts/generator.py`: Optional mock script. In real usage, the Agent generates the `content.json`.
- `scripts/builder.py`: Fills blocks by ID. Matches iteration order of `parser.py`.
Local Question Bank Integration (Mandatory Constraint)
Repository Path Configuration
Default Path:
/Users/xielk/webdata/english/lesson/resource
This directory contains authentic exam materials organized by:
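- District: `徐汇/` (Xuhui), `浦东/` (Pudong), `嘉定/` (Jiading), etc.
- Year: `2025/`, `2024/`, `2023/`, etc.
- Type: `一模/` (first mock exam), `二模/` (second mock exam), `中考/` (Zhongkao), etc.
- Category: `语法/` (grammar), `阅读/` (reading), `作文/` (composition), etc.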
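Index System (Index + On-Demand Loading)
Solving the large-file problem: the question bank totals about 65MB, so loading every docx directly would incur a huge token cost. Use the index + on-demand loading mechanism instead:
1. Build the index (run on first use or whenever the question bank is updated)
```bash
# Build the index (only needs to run once, takes about 10 seconds)
python skills/doc_processor/scripts/indexer.py
```
Index file location:
/Users/xielk/webdata/english/lesson/resource/index.json
The index contains:
- File path and file name
- Year, district, exam type, and question type (parsed automatically)
- Preview content (first 500 characters)
- File size and modification time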
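How the indexer derives year, district, and exam type is not shown in this document. As a hedged illustration only, the sketch below guesses those fields from a file path, assuming they appear somewhere in the path text (for example `嘉定/2025/一模/语法/...`). The function `parse_metadata` and the keyword lists are hypothetical and limited to the values mentioned here.

```python
import re

DISTRICTS = ("徐汇", "浦东", "嘉定")   # districts named in this document
EXAM_TYPES = ("一模", "二模", "中考")  # exam types named in this document
CATEGORIES = ("语法", "阅读", "作文")  # categories named in this document


def first_match(text, candidates):
    """Return the first candidate that occurs in text, or None."""
    for candidate in candidates:
        if candidate in text:
            return candidate
    return None


def parse_metadata(path):
    """Guess index metadata from a path; missing fields come back as None."""
    year = re.search(r"20\d{2}", path)
    return {
        "path": path,
        "year": year.group(0) if year else None,
        "district": first_match(path, DISTRICTS),
        "exam_type": first_match(path, EXAM_TYPES),
        "category": first_match(path, CATEGORIES),
    }


# Hypothetical path following the directory layout described above.
entry = parse_metadata("嘉定/2025/一模/语法/非谓语动词.docx")
print(entry)
```

Keeping only these small fields plus a short preview per file is what lets the index stay cheap to load compared with the docx files themselves.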
2. How to search
Option A: Use the Searcher class (recommended)
```python
from skills.doc_processor.scripts.searcher import QuestionBankSearcher

# Initialize (loads the index; very few tokens)
searcher = QuestionBankSearcher()

# Search the index (index only; no docx is loaded)
results = searcher.search(
    keyword="非谓语",   # Keyword
    district="徐汇",    # Optional: filter by district
    year="2025",        # Optional: filter by year
    limit=10            # Number of results to return
)

# Smart search (index + on-demand docx loading)
idx_results, questions = searcher.smart_search(
    topic="非谓语",
    district="嘉定",            # Prefer the student's district
    year="2025",
    max_docs=3,                 # Load at most 3 files
    max_questions_per_doc=5     # At most 5 questions per file
)

# questions holds question content and source annotations
for q in questions:
    print(q['content'])  # Question text
    print(q['source'])   # Source: (2025 嘉定一模)
```
Option B: Convenience function
```python
from skills.doc_processor.scripts.searcher import search_question_bank

# One-call search
results, questions = search_question_bank(
    topic="定语从句",
    district="浦东",
    year="2024"
)
```
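3. Token cost comparison
| Approach | Token usage | Notes |
|---|---|---|
| Load all docx files directly (65MB) | Huge | ❌ Not recommended |
| Pre-convert to txt, then full-text search | Large | ⚠️ Somewhat better but still expensive |
| Index + on-demand loading | Minimal | ✅ Loads only the 3-5 files needed |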
Search Strategy (MUST FOLLOW)
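Search through the index system:
- Load the index (minimal tokens, done once)
- Search the index (match file names and preview content)
- Load on demand (only the 3-5 most relevant docx files)
- Extract questions (with source annotations)
Concrete steps:
```bash
# Step 1: Make sure the index has been built
python skills/doc_processor/scripts/indexer.py

# Step 2: Search with the searcher from Python
python << 'PYEOF'
from skills.doc_processor.scripts.searcher import search_question_bank

# Search grammar questions (prefer Jiading district, 2025)
results, questions = search_question_bank("非谓语", "嘉定", "2025")

# Search reading materials
results, passages = search_question_bank("阅读B篇", "徐汇", "2024")

# Search sample compositions
results, compositions = search_question_bank("中考作文", None, "2023")
PYEOF
```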
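To make the index-first idea concrete, here is a minimal sketch of filtering index entries without opening any docx file. The entry fields (`name`, `preview`, `year`, `district`) and the function `search_index` are assumptions for illustration; the real `index.json` schema and the internals of `searcher.py` may differ.

```python
def search_index(entries, keyword, district=None, year=None, limit=5):
    """Filter index entries by keyword (file name or preview) and optional fields."""
    hits = []
    for entry in entries:
        text = entry.get("name", "") + entry.get("preview", "")
        if keyword not in text:
            continue
        if district and entry.get("district") != district:
            continue
        if year and entry.get("year") != year:
            continue
        hits.append(entry)
    return hits[:limit]


# Hypothetical index entries; a real index would be loaded from index.json.
sample_index = [
    {"name": "2025嘉定一模语法.docx", "preview": "非谓语动词专项", "year": "2025", "district": "嘉定"},
    {"name": "2024浦东二模阅读.docx", "preview": "阅读B篇", "year": "2024", "district": "浦东"},
    {"name": "2025徐汇一模语法.docx", "preview": "非谓语与定语从句", "year": "2025", "district": "徐汇"},
]

matches = search_index(sample_index, "非谓语", district="嘉定", year="2025")
print([m["name"] for m in matches])
```

Only the files named in the result would then be opened, which is where the 3-5 file cap applies.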
Source Citation Format (Mandatory)
Every piece of content extracted from the repository MUST include source annotation:
Format:
`(YYYY 区域 考试类型 [题型])`, that is, (year district exam-type [question-type])
Examples:
- `(2025 徐汇一模 语法单选)`: 2025 Xuhui District First Mock Exam, Grammar MCQ
- `(2024 浦东二模 阅读B篇)`: 2024 Pudong District Second Mock Exam, Reading Passage B
- `(2023 Shanghai Zhongkao 作文)`: 2023 Shanghai High School Entrance Exam, Composition
- `(2024 Jiading Model 完形填空)`: 2024 Jiading District Mock Exam, Cloze Test
Placement:
- Place citation immediately after the question title or passage title
- Example:
  ```
  【例题1】选择最佳答案(2025 徐汇一模 语法单选)
  The problem ______ at the meeting tomorrow is important.
  A. to be discussed  B. being discussed  C. discussed  D. to discuss
  ```
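A small helper can keep citations in one consistent shape. The sketch below is an assumption-labeled illustration: `format_citation` and `annotate_title` are hypothetical names, and the sketch follows the Chinese-style examples, where district and exam type are written together without a space.

```python
def format_citation(year, district, exam_type, question_type=None):
    """Build a citation such as (2025 徐汇一模 语法单选)."""
    parts = [str(year), f"{district}{exam_type}"]
    if question_type:
        parts.append(question_type)  # the question type is optional
    return "(" + " ".join(parts) + ")"


def annotate_title(title, citation):
    """Place the citation immediately after a question or passage title."""
    return f"{title}{citation}"


citation = format_citation(2025, "徐汇", "一模", "语法单选")
print(citation)
print(annotate_title("【例题1】选择最佳答案", citation))
```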
Priority Rules
When multiple sources are available, select in this order:
- Recency: Prioritize 2025 over 2024 over 2023
- Student's District: If student is from Jiading, use Jiading papers first
- Difficulty Match: Select materials matching the student's current level (e.g., a score of 98 → medium difficulty; avoid overly basic items)
- Topic Relevance: Exact topic match > Related topic > General review
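The ordering above can be expressed as a sort key. This is a minimal sketch, assuming each candidate is a dict with hypothetical `year`, `district`, and `relevance` fields, where `relevance` is 2 for an exact topic match, 1 for a related topic, and 0 for general review. Difficulty matching is left out because this document does not say how difficulty is recorded.

```python
def rank_candidates(candidates, student_district=None):
    """Sort by year (newest first), then student's district, then topic relevance."""
    def key(item):
        in_district = bool(student_district) and item["district"] == student_district
        return (
            -int(item["year"]),          # newer years sort first
            0 if in_district else 1,     # student's district before others
            -item.get("relevance", 0),   # exact match before related before general
        )
    return sorted(candidates, key=key)


# Hypothetical candidates for illustration.
candidates = [
    {"id": "a", "year": "2024", "district": "嘉定", "relevance": 2},
    {"id": "b", "year": "2025", "district": "徐汇", "relevance": 2},
    {"id": "c", "year": "2025", "district": "嘉定", "relevance": 1},
]
ranked = rank_candidates(candidates, student_district="嘉定")
print([c["id"] for c in ranked])
```

Because the key is a tuple, recency always dominates: a 2025 paper from another district still ranks above a 2024 paper from the student's own district.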
Error Handling
If required content is NOT found in the repository:
- Expand search to adjacent years (e.g., if 2025 not found, try 2024)
- Expand search to other districts (e.g., if 徐汇 not found, try 浦东)
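- If still not found, inform the user: "No matching questions for [specific year/district] were found in the question bank; similar questions from [alternative source] were used instead"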
- NEVER fabricate exam questions or pretend they exist in the repository
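The fallback order can be sketched as a loop over progressively wider searches. `find_with_fallback` and the `search` callable it receives are hypothetical; the sketch assumes `search(topic, district, year)` returns a list of questions, and it returns an empty list rather than inventing content when nothing is found.

```python
def find_with_fallback(search, topic, district, year,
                       other_districts=(), earlier_years=()):
    """Try the exact request, then adjacent years, then other districts.

    Returns (questions, note); note is None when the exact request succeeded.
    """
    attempts = [(district, year)]
    attempts += [(district, y) for y in earlier_years]
    attempts += [(d, y) for d in other_districts for y in (year, *earlier_years)]
    for try_district, try_year in attempts:
        questions = search(topic, try_district, try_year)
        if questions:
            if (try_district, try_year) == (district, year):
                return questions, None
            note = (f"No questions found for {year} {district}; "
                    f"used similar questions from {try_year} {try_district} instead")
            return questions, note
    return [], f"No questions found for topic {topic!r}; nothing was substituted"


# Hypothetical in-memory bank standing in for the real searcher.
bank = {("非谓语", "浦东", "2024"): ["Q1 (2024 浦东二模)"]}


def fake_search(topic, district, year):
    return bank.get((topic, district, year), [])


questions, note = find_with_fallback(
    fake_search, "非谓语", "徐汇", "2025",
    other_districts=("浦东",), earlier_years=("2024",),
)
print(questions)
print(note)
```

Returning the note alongside the questions keeps the substitution visible, so the user can be told which source was actually used.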
Content Types to Search
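- Grammar Questions: 单选题 (multiple choice), 填空题 (fill in the blank), 改错题 (error correction), 完成句子 (sentence completion)
- Reading Materials: A篇应用文 (Passage A, practical text), B篇记叙文 (Passage B, narrative), C篇首字母填空 (Passage C, initial-letter cloze), D篇回答问题 (Passage D, answering questions)
- Compositions: 中考作文范文 (Zhongkao model compositions), 满分作文 (full-score compositions), 常见话题模板 (templates for common topics)
- Vocabulary: 考纲词汇 (syllabus vocabulary), 高频短语 (high-frequency phrases), 固定搭配 (fixed collocations)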
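Shanghai Zhongkao Question Type Structure
You must understand the structure of the Shanghai Zhongkao English paper (it differs from other regions):
| Section | Content | Points | Notes |
|---|---|---|---|
| Part 1 | Listening | 30 | Short dialogues, long dialogues, short passages |
| Part 2 | Phonetics / Grammar / Vocabulary | 40 | Phonetics, word formation, grammar multiple choice |
| Part 3 | Reading comprehension | 50 | Four passages: A/B/C/D |
| - Passage A (A篇) | Practical-text reading | about 12 | Ads, notices, guides; 3-4 multiple-choice questions |
| - Passage B (B篇) | Narrative reading | about 12 | Stories; 3-4 multiple-choice questions |
| - Passage C (C篇) | Initial-letter cloze (首字母填空) | 14 | ⚠️ Not multiple choice! Fill in blanks from given initial letters (7 blanks × 2 points) |
| - Passage D (D篇) | Answering questions | 12 | Answer questions after reading (6 questions) |
| Part 4 | Writing | 20 | Assigned-topic composition (80-100 words) |
⚠️ Common mistake:
❌ Wrong: Passage C is multiple-choice reading comprehension (that is the national-paper format)
✅ Right: In the Shanghai Zhongkao, Passage C is an initial-letter cloze (cloze with initial letters)
Passage C characteristics:
- A short passage of 150-200 words
- 7 blanks, each with its initial letter given
- Fill in the correct word based on the context and the initial letter
- Skills tested: spelling, grammatical collocation, contextual logic
Search keyword reference:
- C篇 / 首字母填空 / 首字母 (Passage C / initial-letter cloze / initial letter)
- Not: 阅读理解 / 阅读C篇 / 选择题 (reading comprehension / reading Passage C / multiple choice)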
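As an illustration of the Passage C format, the sketch below turns chosen words in a passage into initial-letter blanks. `make_initial_letter_cloze` is a hypothetical helper, not one of the skill's scripts, and the blank style (initial letter followed by `____`) is an assumption based on the rule that blanks are written as `____`. It is only a format demo; actual exam items must still come from the question bank.

```python
import re


def make_initial_letter_cloze(passage, targets):
    """Replace the first occurrence of each target word with its initial letter + blank."""
    answers = []
    for word in targets:
        pattern = re.compile(rf"\b{re.escape(word)}\b")
        passage, count = pattern.subn(f"{word[0]}____", passage, count=1)
        if count:
            answers.append(word)  # keep the answer key in blank order of processing
    return passage, answers


# Hypothetical two-blank example; a real Passage C has 7 blanks worth 2 points each.
text = "Reading is a good habit. It helps us learn new words every day."
cloze, answers = make_initial_letter_cloze(text, ["habit", "learn"])
print(cloze)
print(answers)
```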