Agent-Skills-for-Context-Engineering: book-sft-pipeline
This skill should be used when the user asks to "fine-tune on books", "create SFT dataset", "train style model", "extract ePub text", or mentions style transfer, LoRA training, book segmentation, or author voice replication.
```sh
# Clone the full repository
git clone https://github.com/muratcankoylan/Agent-Skills-for-Context-Engineering

# Or copy only this skill into ~/.claude/skills
T=$(mktemp -d) && git clone --depth=1 https://github.com/muratcankoylan/Agent-Skills-for-Context-Engineering "$T" && mkdir -p ~/.claude/skills && cp -r "$T/examples/book-sft-pipeline" ~/.claude/skills/muratcankoylan-agent-skills-for-context-engineering-book-sft-pipeline && rm -rf "$T"
```
examples/book-sft-pipeline/SKILL.md

Book SFT Pipeline
A complete system for converting books into SFT datasets and training style-transfer models. This skill teaches the pipeline from raw ePub to a model that writes in any author's voice.
When to Activate
Activate this skill when:
- Building fine-tuning datasets from literary works
- Creating author-voice or style-transfer models
- Preparing training data for Tinker or similar SFT platforms
- Designing text segmentation pipelines for long-form content
- Training small models (8B or less) on limited data
Core Concepts
The Three Pillars of Book SFT
1. Intelligent Segmentation
Text chunks must be semantically coherent. Breaking mid-sentence teaches the model to produce fragmented output. Target: 150-400 words per chunk, always at natural boundaries.

2. Diverse Instruction Generation
Use multiple prompt templates and system prompts to prevent overfitting. A single prompt style leads to memorization. Use 15+ prompt templates with 5+ system prompts.

3. Style Over Content
The goal is learning the author's rhythm and vocabulary patterns, not memorizing plots. Synthetic instructions describe what happens without quoting the text.
Pipeline Architecture
```
┌─────────────────────────────────────────────────────────────────┐
│                       ORCHESTRATOR AGENT                        │
│  Coordinates pipeline phases, manages state, handles failures   │
└──────────────────────┬──────────────────────────────────────────┘
                       │
       ┌───────────────┼───────────────┬───────────────┐
       ▼               ▼               ▼               ▼
┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│  EXTRACTION  │ │ SEGMENTATION │ │ INSTRUCTION  │ │   DATASET    │
│    AGENT     │ │    AGENT     │ │    AGENT     │ │   BUILDER    │
│ ePub → Text  │ │ Text → Chunks│ │   Chunks →   │ │   Pairs →    │
│              │ │ 150-400 words│ │   Prompts    │ │    JSONL     │
└──────────────┘ └──────────────┘ └──────────────┘ └──────────────┘
                                │
                ┌───────────────┴───────────────┐
                ▼                               ▼
        ┌──────────────┐                ┌──────────────┐
        │   TRAINING   │                │  VALIDATION  │
        │    AGENT     │                │    AGENT     │
        │   LoRA on    │                │ AI detector  │
        │    Tinker    │                │ Originality  │
        └──────────────┘                └──────────────┘
```
Phase 1: Text Extraction
Critical Rules
- Always source ePub over PDF - OCR errors become learned patterns
- Use paragraph-level extraction - Extract from `<p>` tags to preserve breaks
- Remove front/back matter - Copyright and TOC pollute the dataset
```python
# Extract text from ePub paragraphs
from epub2 import EPub
from bs4 import BeautifulSoup

def extract_epub(path):
    book = EPub(path)
    chapters = []
    for item in book.flow:
        html = book.get_chapter(item.id)
        soup = BeautifulSoup(html, 'html.parser')
        paragraphs = [p.get_text().strip() for p in soup.find_all('p')]
        chapters.append('\n\n'.join(p for p in paragraphs if p))
    return '\n\n'.join(chapters)
```
Phase 2: Intelligent Segmentation
Smaller Chunks + Overlap
Smaller chunks (150-400 words) produce more training examples and better style transfer than larger chunks (250-650 words).
```python
def segment(text, min_words=150, max_words=400):
    paragraphs = text.split('\n\n')
    chunks, buffer, buffer_words = [], [], 0
    for para in paragraphs:
        words = len(para.split())
        if buffer_words + words > max_words and buffer_words >= min_words:
            chunks.append('\n\n'.join(buffer))
            # Keep last paragraph for overlap
            buffer = [buffer[-1], para] if buffer else [para]
            buffer_words = sum(len(p.split()) for p in buffer)
        else:
            buffer.append(para)
            buffer_words += words
    if buffer:
        chunks.append('\n\n'.join(buffer))
    return chunks
```
Expected Results
For an 86,000-word book:
- Old method (250-650 words): ~150 chunks
- New method (150-400 + overlap): ~300 chunks
- With 2 variants per chunk: 600+ training examples
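The arithmetic is easy to sanity-check before running the full pipeline. A minimal sketch, assuming an average chunk of ~280 words; the overlap parameter is illustrative, since actual overlap depends on paragraph length:

```python
def estimate_dataset_size(total_words, avg_chunk_words=280,
                          overlap_words=0, variants=2):
    # Each chunk advances the text window by (chunk size - overlap),
    # so overlap shrinks the stride and yields more chunks.
    stride = avg_chunk_words - overlap_words
    n_chunks = total_words // stride
    return n_chunks, n_chunks * variants

print(estimate_dataset_size(86_000))                    # (307, 614)
print(estimate_dataset_size(86_000, overlap_words=60))  # (390, 780)
```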
Phase 3: Diverse Instruction Generation
The Key Insight
Using a single prompt template causes memorization. Diverse templates teach the underlying style.
```python
SYSTEM_PROMPTS = [
    "You are an expert creative writer capable of emulating specific literary styles.",
    "You are a literary writer with deep knowledge of classic prose styles.",
    "You are a creative writer skilled at emulating distinctive authorial voices.",
    "You write prose that captures the essence of modernist literature.",
    "You are a talented writer who can channel classic American authors.",
]

PROMPT_TEMPLATES = [
    "Write a passage in the style of {author}: {desc}",
    "Channel {author}'s voice to write about: {desc}",
    "In {author}'s distinctive prose style, describe: {desc}",
    "Write this scene as {author} would have: {desc}",
    "Using {author}'s repetitive technique, describe: {desc}",
    "Capture the rhythm of {author} in this passage: {desc}",
    "Write like {author}: {desc}",
    "In the voice of {author}, write: {desc}",
    "This is a literary exercise. Write like {author}: {desc}",
    "Can you write in {author}'s style? {desc}",
]
```
Instruction Generation
```python
INSTRUCTION_PROMPT = """Describe what is happening in this excerpt in 2-3 sentences.
Focus on: characters present, actions, emotions, setting.
Do NOT quote the text directly.

Excerpt:
{text}
"""

# Use a fast, cheap LLM (e.g., Gemini Flash)
instruction = llm_call(INSTRUCTION_PROMPT.format(text=chunk))
```
Phase 4: Dataset Construction
Message Format
{ "messages": [ {"role": "system", "content": "You are an expert creative writer..."}, {"role": "user", "content": "Write in the style of Author: Scene description..."}, {"role": "assistant", "content": "The actual book text from chunk..."} ] }
Multiple Variants Per Chunk
```python
def build_examples(chunk, instruction, author, variants=2):
    examples = []
    for i in range(variants):
        system = SYSTEM_PROMPTS[i % len(SYSTEM_PROMPTS)]
        template = PROMPT_TEMPLATES[(chunk.id + i) % len(PROMPT_TEMPLATES)]
        user = template.format(author=author, desc=instruction)
        examples.append({"messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": user},
            {"role": "assistant", "content": chunk.text}
        ]})
    return examples
```
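To render these pairs as Tinker-compatible JSONL, one JSON object per line, a minimal sketch is enough; `chunks` and `instructions` are assumed to come from the earlier phases:

```python
import json

def write_jsonl(chunks, instructions, author, path="dataset.jsonl"):
    with open(path, "w", encoding="utf-8") as f:
        for chunk, instruction in zip(chunks, instructions):
            for example in build_examples(chunk, instruction, author):
                # One compact JSON object per line; no pretty-printing
                f.write(json.dumps(example, ensure_ascii=False) + "\n")
```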
Phase 5: LoRA Training on Tinker
Configuration
```python
CONFIG = {
    "model_name": "Qwen/Qwen3-8B-Base",  # Base, not instruct
    "lora_rank": 32,                     # 352MB adapter
    "learning_rate": 5e-4,               # Higher for LoRA
    "batch_size": 4,
    "epochs": 3,
}
```
Why Base Model?
Use base (pretrained) models, not instruction-tuned versions:
- Base models are more malleable for new styles
- Instruct models have patterns that resist overwriting
- Style is a low-level pattern that base models capture better
Training Loop
```python
import tinker
from tinker import types

service_client = tinker.ServiceClient()
training_client = await service_client.create_lora_training_client_async(
    base_model="Qwen/Qwen3-8B-Base", rank=32
)

for epoch in range(3):
    for batch in batches:
        await training_client.forward_backward_async(batch, loss_fn="cross_entropy")
        await training_client.optim_step_async(types.AdamParams(learning_rate=5e-4))

result = await training_client.save_weights_for_sampler_async(name="final")
```
Phase 6: Validation
Modern Scenario Test
Test with scenarios that couldn't exist in the original book:
```python
TEST_PROMPTS = [
    "Write about a barista making lattes",
    "Describe lovers communicating through text messages",
    "Write about someone anxious about climate change",
]
```
If the model applies style markers to modern scenarios, it learned style, not content.
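This check can be partially mechanized by counting style markers in the outputs. A minimal sketch; the regex markers below are hypothetical stand-ins for whatever patterns characterize the target author (e.g., heavy coordination or word repetition), not measured features:

```python
import re

# Hypothetical markers for a repetition-heavy prose style
STYLE_MARKERS = [
    r"\band\b.+\band\b.+\band\b",        # long coordinated clauses
    r"\b(\w+)\b(?:\W+\w+){0,4}\W+\1\b",  # a word repeated within five words
]

def style_score(output):
    # Fraction of markers that appear at least once in the output
    hits = sum(bool(re.search(p, output, re.IGNORECASE)) for p in STYLE_MARKERS)
    return hits / len(STYLE_MARKERS)
```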
Originality Verification
```bash
# Search training data for output phrases
grep "specific phrase from output" dataset.jsonl
# Should return: No matches
```
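grep only catches exact matches. A slightly more robust sketch checks word-level n-gram overlap against the assistant turns in the training file; the 8-gram size is an assumed threshold, not a standard:

```python
import json

def ngrams(text, n=8):
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def leaked_ngrams(output, dataset_path="dataset.jsonl", n=8):
    out_grams = ngrams(output, n)
    leaked = set()
    with open(dataset_path, encoding="utf-8") as f:
        for line in f:
            for msg in json.loads(line)["messages"]:
                if msg["role"] == "assistant":
                    leaked |= out_grams & ngrams(msg["content"], n)
    return leaked  # should be empty for genuinely novel output
```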
AI Detector Testing
Test outputs with GPTZero, Pangram, or ZeroGPT.
Known Issues and Solutions
Character Name Leakage
Symptom: Model uses original character names in new scenarios.
Cause: Limited name diversity from one book.
Solution: Train on multiple books or add synthetic examples.
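Leakage is cheap to detect automatically. A sketch, assuming you keep a list of proper names from the source book; the names below are placeholders (borrowed from the Gertrude Stein case study referenced later):

```python
# Placeholder list; populate from the source book's characters
CHARACTER_NAMES = ["Melanctha", "Jeff Campbell"]

def leaked_names(output):
    return [n for n in CHARACTER_NAMES if n.lower() in output.lower()]
```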
Model Parrots Exact Phrases
Symptom: Outputs contain exact sentences from training data.
Cause: Too few prompt variations or too many epochs.
Solution: Use 15+ templates, limit to 3 epochs.
Fragmented Outputs
Symptom: Sentences feel incomplete.
Cause: Poor segmentation breaking mid-thought.
Solution: Always break at paragraph boundaries.
Guidelines
- Always source ePub over PDF - OCR errors become learned patterns
- Never break mid-sentence - Boundaries must be grammatically complete
- Use diverse prompts - 15+ templates, 5+ system prompts
- Use base models - Not instruct versions
- Use smaller chunks - 150-400 words for more examples
- Reserve test set - 50 examples minimum (see the split sketch after this list)
- Test on modern scenarios - Proves style transfer vs memorization
- Verify originality - Grep training data for output phrases
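For the test-set guideline, a minimal split sketch; the fixed seed keeps the held-out set stable across pipeline reruns:

```python
import random

def split_dataset(examples, test_size=50, seed=42):
    assert len(examples) > test_size, "need more examples than the test floor"
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)
    return shuffled[test_size:], shuffled[:test_size]  # (train, test)
```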
Expected Results
| Metric | Value |
|---|---|
| Training examples | 500-1000 per book |
| Model | Qwen/Qwen3-8B-Base |
| LoRA rank | 32 |
| Adapter size | ~350 MB |
| Training time | ~15 min |
| Loss reduction | 90%+ |
| Style transfer success | ~50% of outputs rated perfect |
Cost Estimate
| Component | Cost |
|---|---|
| LLM (instruction generation) | ~$0.50 |
| Tinker training (15 min) | ~$1.50 |
| Total | ~$2.00 |
Integration with Context Engineering Skills
This example applies several skills from the Agent Skills for Context Engineering collection:
project-development
The pipeline follows the staged, idempotent architecture pattern:
- Acquire: Extract text from ePub
- Prepare: Segment into training chunks
- Process: Generate synthetic instructions
- Parse: Build message format
- Render: Output Tinker-compatible JSONL
- Train: LoRA fine-tuning
- Validate: Modern scenario testing
Each phase is resumable and produces intermediate artifacts for debugging.
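What "resumable with intermediate artifacts" can look like in practice, as a sketch; the stage names mirror the list above, and the artifact paths are illustrative assumptions:

```python
import json
from pathlib import Path

ARTIFACTS = Path("artifacts")

def run_stage(name, fn, *inputs):
    """Run a stage once; later runs reuse the cached artifact on disk."""
    out = ARTIFACTS / f"{name}.json"
    if out.exists():
        return json.loads(out.read_text())
    result = fn(*inputs)
    ARTIFACTS.mkdir(exist_ok=True)
    out.write_text(json.dumps(result))
    return result

# text = run_stage("acquire", extract_epub, "book.epub")
# chunks = run_stage("prepare", segment, text)
```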
context-compression
Segmentation is a form of context compression for training. The core insight from context-compression applies: information density matters more than information quantity. Smaller, coherent chunks (150-400 words) produce better style transfer than larger, diluted chunks.
The two-tier strategy mirrors context compression evaluation (sketched after the list):
- Tier 1: Fast, deterministic compression
- Tier 2: LLM-assisted for edge cases
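Applied to segmentation, the two tiers could look like this sketch; `looks_clean` is a hypothetical heuristic (e.g., "chunk ends with terminal punctuation") and `llm_segment` a stand-in for an LLM-assisted re-split:

```python
def segment_two_tier(text):
    # Tier 1: fast, deterministic paragraph-boundary segmentation
    chunks = segment(text)
    # Tier 2: escalate only the chunks the heuristic flags
    return [c if looks_clean(c) else llm_segment(c) for c in chunks]
```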
multi-agent-patterns
The pipeline uses the supervisor/orchestrator pattern:
- Orchestrator coordinates phases and manages state
- Specialized agents (Extraction, Segmentation, Instruction, Builder) have isolated contexts
- Each agent receives only the information needed for its task
This matches the principle that sub-agents exist primarily to isolate context rather than simulate roles.
evaluation
Validation follows the end-state evaluation pattern:
- Functional testing: Does output match expected style markers?
- Originality verification: Is content genuinely generated?
- External validation: AI detector scores
The "modern scenario" test is a form of out-of-distribution evaluation that proves generalization.
context-fundamentals
Prompt diversity prevents attention collapse on single patterns. When training with identical prompt structures, the model memorizes the instruction-response mapping. Diverse templates force attention across the style patterns themselves.
References
Internal references:
- Segmentation Strategies - Text chunking patterns
- Tinker Format Specification - Datum structure
- Tinker API Documentation - Full API reference
Related skills from Agent Skills for Context Engineering:
- project-development - Pipeline architecture patterns
- context-compression - Compression strategies
- multi-agent-patterns - Agent coordination
- evaluation - Evaluation frameworks
- context-fundamentals - Attention and information density
External resources:
- Research Paper - Chakrabarty et al. 2025
- Dataset on Hugging Face
- Gertrude Stein Case Study - Complete working example
Skill Metadata
Created: 2025-12-26
Last Updated: 2025-12-28
Author: Muratcan Koylan
Version: 2.0.0
Standalone: Yes (separate from main context-engineering collection)