Claude-skill-registry chunking-strategies
Document chunking strategies for RAG systems. Use when implementing document processing pipelines to determine optimal chunking approaches based on document type and retrieval requirements.
install
source · Clone the upstream repo
git clone https://github.com/majiayu000/claude-skill-registry
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/chunking-strategies" ~/.claude/skills/majiayu000-claude-skill-registry-chunking-strategies && rm -rf "$T"
manifest:
skills/data/chunking-strategies/SKILL.md
Chunking Strategies Skill
This skill provides chunking strategies for RAG document processing.
Chunking Methods
1. Fixed-Size Chunking
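    def fixed_size_chunk(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
        # Sizes are measured in characters, not tokens.
        # The overlap must be smaller than the chunk size, or the window never advances.
        if not 0 <= overlap < chunk_size:
            raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
        chunks = []
        start = 0
        while start < len(text):
            end = start + chunk_size
            chunks.append(text[start:end])
            if end >= len(text):
                break  # The window reached the end; skip a redundant tail chunk.
            start = end - overlap
        return chunks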
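The chunk sizes recommended later in this skill are given in tokens, while `fixed_size_chunk` counts characters. The following is a minimal sketch of the same sliding window applied to a token list. The name `fixed_size_token_chunk` is hypothetical, and whitespace splitting stands in for a real tokenizer.

```python
def fixed_size_token_chunk(
    tokens: list[str], chunk_size: int = 500, overlap: int = 50
) -> list[list[str]]:
    """Slide a fixed-size window over a token list (hypothetical helper)."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # The window already covers the final token.
    return chunks


# Whitespace splitting is an assumption; swap in the tokenizer of your model.
words = "one two three four five six seven eight nine ten".split()
windows = fixed_size_token_chunk(words, chunk_size=4, overlap=1)
```

Each window shares its first token with the end of the previous one, so a sentence cut at a boundary still appears whole in at least one chunk when the overlap is large enough.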
2. Semantic Chunking
Split on natural boundaries (sentences, paragraphs).
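    def count_tokens(text: str) -> int:
        # Placeholder (assumption): whitespace word count; swap in your tokenizer.
        return len(text.split())

    def semantic_chunk(text: str, max_tokens: int = 500) -> list[str]:
        paragraphs = text.split("\n\n")
        chunks = []
        current_chunk = []
        current_tokens = 0
        for para in paragraphs:
            para_tokens = count_tokens(para)
            # Flush only a non-empty chunk, so an oversized first paragraph
            # does not emit an empty string.
            if current_chunk and current_tokens + para_tokens > max_tokens:
                chunks.append("\n\n".join(current_chunk))
                current_chunk = [para]
                current_tokens = para_tokens
            else:
                current_chunk.append(para)
                current_tokens += para_tokens
        if current_chunk:
            chunks.append("\n\n".join(current_chunk))
        return chunks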
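`semantic_chunk` splits on paragraphs only, so a single paragraph longer than `max_tokens` is emitted as one oversized chunk. One possible extension, sketched below, falls back to sentence boundaries for such paragraphs. The helper names, the regex-based sentence splitter, and the whitespace token counter are all assumptions for illustration.

```python
import re


def count_tokens(text: str) -> int:
    # Stand-in tokenizer (assumption): whitespace-separated words.
    return len(text.split())


def split_sentences(text: str) -> list[str]:
    # Naive boundary: '.', '!' or '?' followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def pack(units: list[str], max_tokens: int, joiner: str) -> list[str]:
    """Greedily pack units into chunks of at most max_tokens each."""
    chunks, current, current_tokens = [], [], 0
    for unit in units:
        n = count_tokens(unit)
        if current and current_tokens + n > max_tokens:
            chunks.append(joiner.join(current))
            current, current_tokens = [], 0
        current.append(unit)
        current_tokens += n
    if current:
        chunks.append(joiner.join(current))
    return chunks


def semantic_chunk_with_fallback(text: str, max_tokens: int = 500) -> list[str]:
    chunks = []
    pending = []  # Paragraphs that fit on their own, waiting to be packed.
    for para in text.split("\n\n"):
        if not para.strip():
            continue
        if count_tokens(para) > max_tokens:
            # Flush what we have, then split the oversized paragraph by sentence.
            chunks.extend(pack(pending, max_tokens, "\n\n"))
            pending = []
            chunks.extend(pack(split_sentences(para), max_tokens, " "))
        else:
            pending.append(para)
    chunks.extend(pack(pending, max_tokens, "\n\n"))
    return chunks


sample = (
    "Short intro here.\n\n"
    "One two three four. Five six seven eight. Nine ten eleven twelve.\n\n"
    "Tail."
)
result = semantic_chunk_with_fallback(sample, max_tokens=5)
```

A single sentence longer than `max_tokens` still passes through unsplit here; handling that case would need a further fallback such as fixed-size windows.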
3. Recursive Chunking
Hierarchical splitting on multiple separators.
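    SEPARATORS = ["\n\n", "\n", ". ", " "]

    def recursive_chunk(text: str, max_size: int, separators: list[str] = SEPARATORS) -> list[str]:
        if len(text) <= max_size:
            return [text]
        if not separators:
            # No separators left: hard-split into fixed windows instead of
            # truncating, so no text is lost.
            return [text[i:i + max_size] for i in range(0, len(text), max_size)]
        sep, rest = separators[0], separators[1:]
        chunks = []
        for part in text.split(sep):
            if not part:
                continue  # Skip empty pieces produced by consecutive separators.
            if len(part) <= max_size:
                chunks.append(part)
            else:
                # Still too large: retry with the next, finer separator.
                chunks.extend(recursive_chunk(part, max_size, rest))
        return chunks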
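`recursive_chunk` returns every split piece as its own chunk, so a text that ends up split on `" "` can produce one word per chunk. A common follow-up is a merge pass that joins adjacent pieces while they still fit. The sketch below shows one way to do that. `merge_pieces` is a hypothetical helper, and joining with a single space is an assumption, since the original separator of each piece is not tracked.

```python
def merge_pieces(pieces: list[str], max_size: int, joiner: str = " ") -> list[str]:
    """Greedily join adjacent pieces while the result stays within max_size characters."""
    merged = []
    current = ""
    for piece in pieces:
        if not piece:
            continue
        candidate = piece if not current else current + joiner + piece
        if len(candidate) <= max_size:
            current = candidate
        else:
            if current:
                merged.append(current)
            current = piece  # A piece larger than max_size is kept as-is.
    if current:
        merged.append(current)
    return merged


pieces = ["alpha", "beta", "gamma", "delta", "epsilon"]
merged = merge_pieces(pieces, max_size=11)
```

Applied to the output of `recursive_chunk`, this keeps chunks close to `max_size` instead of leaving many very small ones.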
Chunking by Document Type
| Document Type | Recommended Strategy | Chunk Size |
|---|---|---|
| Technical docs | Semantic (headers) | 500-1000 tokens |
| Legal documents | Semantic (sections) | 1000-2000 tokens |
| Code | Function/class based | 200-500 tokens |
| Conversations | Message boundaries | 100-300 tokens |
| General text | Recursive | 300-500 tokens |
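For the "Code" row, function and class boundaries can be found with a parser rather than with text separators. The sketch below is one possible approach for Python source only, using the standard-library `ast` module. The function name `chunk_python_source` and the metadata keys are assumptions.

```python
import ast


def chunk_python_source(source: str) -> list[dict]:
    """Return one chunk per top-level function or class, with line positions."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # lineno is 1-based and end_lineno is inclusive (Python 3.8+).
            # Decorator lines sit above lineno and are not included here.
            body = "\n".join(lines[node.lineno - 1:node.end_lineno])
            chunks.append(
                {
                    "name": node.name,
                    "start_line": node.lineno,
                    "end_line": node.end_lineno,
                    "content": body,
                }
            )
    return chunks


sample = '''import os

def greet(name):
    return f"hello {name}"

class Greeter:
    def run(self):
        return greet("world")
'''
code_chunks = chunk_python_source(sample)
```

Module-level statements such as imports are skipped in this sketch; a fuller pipeline might collect them into a separate chunk or attach them as metadata.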
Chunk Enrichment
    from dataclasses import dataclass

    @dataclass
    class EnrichedChunk:
        content: str
        metadata: dict
        summary: str  # LLM-generated
        keywords: list[str]
        parent_id: str  # For hierarchical retrieval
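As an illustration of how these fields might be filled, the sketch below builds `EnrichedChunk` records from a list of chunk strings. The `enrich` and `top_keywords` helpers are hypothetical. The keyword extractor is a simple word-frequency stand-in, and the summary is a truncated prefix used as a placeholder where the source expects an LLM-generated summary.

```python
import re
from collections import Counter
from dataclasses import dataclass


@dataclass
class EnrichedChunk:
    content: str
    metadata: dict
    summary: str
    keywords: list[str]
    parent_id: str


def top_keywords(text: str, k: int = 3) -> list[str]:
    # Frequency-based stand-in for keyword extraction (assumption).
    words = re.findall(r"[a-z]{4,}", text.lower())
    return [word for word, _ in Counter(words).most_common(k)]


def enrich(doc_id: str, chunks: list[str]) -> list[EnrichedChunk]:
    enriched = []
    for position, content in enumerate(chunks):
        enriched.append(
            EnrichedChunk(
                content=content,
                metadata={"source": doc_id, "position": position},
                summary=content[:80],  # Placeholder for an LLM-generated summary.
                keywords=top_keywords(content),
                parent_id=doc_id,  # Links each chunk back to its parent document.
            )
        )
    return enriched


records = enrich(
    "doc-1",
    ["retrieval retrieval retrieval chunk chunk overlap", "second chunk text"],
)
```

With `parent_id` set, a retriever can match on a small chunk and then fetch the parent document or a larger surrounding section for context.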
Best Practices
- Add overlap between adjacent chunks (10-20% of the chunk size)
- Preserve semantic boundaries such as sentences, paragraphs, and sections
- Include metadata with each chunk (source, position)
- Consider hierarchical chunking for long documents
- Test retrieval quality with different chunk sizes
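The last practice can be made concrete with a small evaluation loop: chunk the same text at several sizes, run a fixed set of queries, and compare how often the top retrieved chunk contains the expected answer. The sketch below is a toy version under stated assumptions. The word-overlap scorer stands in for a real embedding-based retriever, and the text, queries, and function names are invented for illustration.

```python
def chunk_words(text: str, size: int, overlap: int) -> list[str]:
    """Fixed-size word windows; size and overlap are counted in words."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size]) for i in range(0, len(words), step)]


def retrieve(query: str, chunks: list[str]) -> str:
    # Toy lexical scorer (assumption): count of shared lowercase words.
    query_words = set(query.lower().split())
    return max(chunks, key=lambda c: len(query_words & set(c.lower().split())))


def hit_rate(text: str, cases: list[tuple[str, str]], size: int, overlap: int) -> float:
    """Fraction of queries whose top chunk contains the expected answer."""
    chunks = chunk_words(text, size, overlap)
    hits = sum(answer in retrieve(query, chunks) for query, answer in cases)
    return hits / len(cases)


text = (
    "The cache expires after ten minutes. "
    "The index rebuilds every night at two. "
    "Backups are stored in the cold archive."
)
cases = [
    ("when does the cache expire", "ten minutes"),
    ("where are backups stored", "cold archive"),
]
results = {size: hit_rate(text, cases, size, overlap=0) for size in (4, 8, 20)}
```

In practice the same loop would run against a labeled query set and the production retriever, with the chunk size chosen from the measured hit rate rather than from a default.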