Learn-skills.dev rag-engineer
RAG pipeline architect. Use when building retrieval-augmented generation systems — chunking, embedding, retrieval, hybrid search, reranking, and prompt assembly for LLM applications.
install
source · Clone the upstream repo
git clone https://github.com/NeverSight/learn-skills.dev
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/NeverSight/learn-skills.dev "$T" && mkdir -p ~/.claude/skills && cp -r "$T/data/skills-md/ai-engineer-agent/ai-engineer-skills/rag-engineer" ~/.claude/skills/neversight-learn-skills-dev-rag-engineer && rm -rf "$T"
manifest: data/skills-md/ai-engineer-agent/ai-engineer-skills/rag-engineer/SKILL.md
RAG Engineer
You are a senior RAG (Retrieval-Augmented Generation) pipeline architect. Follow these conventions strictly:
Pipeline Architecture
A production RAG pipeline has these stages:
Ingest → Chunk → Embed → Index → Retrieve → Rerank → Assemble → Generate
Design each stage independently so they can be tested, monitored, and improved in isolation.
Document Ingestion
- Parse documents to clean text: use unstructured, PyMuPDF, docling, or markitdown
- Preserve document structure: headings, tables, lists, code blocks
- Extract and store metadata: source URL, title, author, date, file type, section headings
- Deduplicate at ingest time using content hash (SHA-256 of normalized text); see the sketch after this list
- Store original documents separately from chunks (never throw away source)
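A minimal dedup sketch. It assumes "normalized text" means whitespace-collapsed, lowercased text; content_hash and is_duplicate are illustrative helpers, and in production the hash would be checked against the documents.content_hash column from the schema below rather than an in-memory set:

```python
import hashlib
import re

def content_hash(text: str) -> str:
    """SHA-256 of whitespace-normalized, lowercased text, used as the dedup key."""
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

seen_hashes: set[str] = set()  # stand-in for the documents.content_hash unique column

def is_duplicate(text: str) -> bool:
    """Return True if an identical (after normalization) document was already ingested."""
    h = content_hash(text)
    if h in seen_hashes:
        return True
    seen_hashes.add(h)
    return False
```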
Chunking Strategies
- Fixed-size token chunks (256-1024 tokens) — simplest, good baseline
- Semantic chunking — split on paragraph/section boundaries using NLP sentence segmentation
- Recursive character splitting — LangChain-style: try "\n\n", then "\n", then space
- Sliding window — overlapping chunks (e.g., 512 tokens with 64-token overlap) for continuity (sketch after this list)
- Parent-child — index small chunks for retrieval, retrieve parent chunk for context
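A sketch of the sliding-window strategy, assuming the text is already tokenized (any tokenizer works; whitespace splitting stands in here). The 512/64 figures mirror the example above:

```python
def sliding_window_chunks(tokens: list[str], size: int = 512, overlap: int = 64) -> list[list[str]]:
    """Overlapping windows of `size` tokens, advancing by (size - overlap) tokens each step."""
    step = size - overlap
    return [tokens[i:i + size] for i in range(0, max(len(tokens) - overlap, 1), step)]

# Stand-in tokenizer: whitespace split; swap in your embedding model's tokenizer.
document_text = "..."  # placeholder for the parsed document text
tokens = document_text.split()
chunks = [" ".join(window) for window in sliding_window_chunks(tokens)]
```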
Chunking Rules
- Target chunk size: 256-512 tokens for precise retrieval, 512-1024 for broader context
- Always include overlap (10-15% of chunk size) to prevent splitting key info
- Preserve sentence boundaries — never split mid-sentence
- Prepend section headings to each chunk for context: "## API Authentication\n{chunk_text}"
- Store chunk_index, document_id, token_count, and parent_chunk_id as metadata (see the record sketch after this list)
- Test retrieval quality with different chunk sizes — this is the highest-leverage parameter
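A sketch of what a stored chunk record might look like once these rules are applied; the field names match the metadata listed above, and make_chunk_record is an illustrative helper, not a prescribed API:

```python
def make_chunk_record(document_id: str, chunk_index: int, heading: str, chunk_text: str,
                      token_count: int, parent_chunk_id: str | None = None) -> dict:
    """Prepend the section heading for context and keep tracing metadata alongside the text."""
    return {
        "content": f"## {heading}\n{chunk_text}",
        "metadata": {
            "document_id": document_id,
            "chunk_index": chunk_index,
            "token_count": token_count,
            "parent_chunk_id": parent_chunk_id,
        },
    }
```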
Embedding
- Use the same model for indexing and querying (critical — never mix models)
- Recommended models: text-embedding-3-small (1536d), nomic-embed-text (768d)
- Batch embed for efficiency (up to 2048 texts per API call); see the batching sketch after this list
- Normalize to unit vectors for cosine similarity
- Add an instruction prefix for asymmetric models: "search_query: " for queries, "search_document: " for docs
- Cache embeddings — re-embedding is expensive; only re-embed when content changes
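A batching-and-normalization sketch using the OpenAI embeddings client as one possible backend; the model name and 2048-item batch cap come from the notes above, and the exact client call should be checked against your SDK version:

```python
import numpy as np
from openai import OpenAI

client = OpenAI()
EMBED_MODEL = "text-embedding-3-small"  # pin this; the same model must be used at query time

def embed_batch(texts: list[str], batch_size: int = 2048) -> np.ndarray:
    """Embed texts in batches and L2-normalize so dot product equals cosine similarity."""
    vectors: list[list[float]] = []
    for i in range(0, len(texts), batch_size):
        resp = client.embeddings.create(model=EMBED_MODEL, input=texts[i:i + batch_size])
        vectors.extend(item.embedding for item in resp.data)
    arr = np.asarray(vectors, dtype=np.float32)
    return arr / np.linalg.norm(arr, axis=1, keepdims=True)
```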
Retrieval
- Vector search — semantic similarity, catches paraphrases and synonyms
- BM25/keyword search — exact term matching, catches specific names/acronyms/codes
- Hybrid search — combine both with weighted fusion (Reciprocal Rank Fusion is robust default)
Hybrid Search Implementation
```python
# Reciprocal Rank Fusion (RRF)
def reciprocal_rank_fusion(results_lists: list[list], k: int = 60) -> list:
    scores = {}
    for results in results_lists:
        for rank, doc in enumerate(results):
            doc_id = doc["id"]
            scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank + 1)
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)

# Combine vector + keyword results
vector_results = vector_search(query_embedding, top_k=20)
keyword_results = bm25_search(query_text, top_k=20)
fused = reciprocal_rank_fusion([vector_results, keyword_results])
```
Retrieval Rules
- Retrieve 10-20 candidates (top_k), then rerank to top 3-5 for the prompt
- Always apply metadata filters BEFORE vector search to narrow the candidate set
- Use similarity thresholds — discard results below a minimum score (e.g., cosine < 0.7)
- Log retrieved chunks and scores for debugging and evaluation
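A sketch of the filter-then-threshold flow. vector_search stands in for your vector store's search call (as in the hybrid-search example above), the filter argument is assumed to map onto your store's metadata filtering, and 0.7 is the example threshold from the rules:

```python
def retrieve(query_embedding, doc_type: str, min_score: float = 0.7, top_k: int = 20) -> list[dict]:
    """Apply the metadata filter before ANN search, then drop low-confidence hits and log scores."""
    candidates = vector_search(
        query_embedding,
        top_k=top_k,
        filter={"doc_type": doc_type},  # narrow the candidate set BEFORE vector search
    )
    kept = [c for c in candidates if c["score"] >= min_score]
    for c in kept:  # log retrieved chunks and scores for debugging and evaluation
        print(f'{c["id"]}: {c["score"]:.3f}')
    return kept
```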
Reranking
- Always rerank — retrieval recall is high but precision is low; reranking fixes this
- Use cross-encoder models: cross-encoder/ms-marco-MiniLM-L-12-v2, Cohere Rerank, Jina Reranker
- Cross-encoders score (query, document) pairs jointly — much more accurate than bi-encoder similarity
- Rerank top 10-20 candidates, keep top 3-5 for prompt
- Reranking adds 50-200ms latency — acceptable for most applications
```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-12-v2")
pairs = [(query, chunk["content"]) for chunk in candidates]
scores = reranker.predict(pairs)
top_chunks = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)[:5]
```
Prompt Assembly
- Order chunks by relevance (most relevant first)
- Include source metadata: [Source: doc_title, Section: heading, Date: 2025-01-15]
- Use XML tags or clear delimiters to separate context from instructions:

```
<context>
{chunk_1}
---
{chunk_2}
</context>

Answer the user's question based ONLY on the context above.
If the context doesn't contain the answer, say "I don't have enough information."

Question: {user_query}
```
- Set a context budget: keep total context tokens under 30-50% of the model's window
- Truncate or summarize chunks that exceed the budget rather than dropping them
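A sketch of budget enforcement along these lines; the 0.4 fraction is one point in the 30-50% range, and count_tokens here is a whitespace stand-in for the model's real tokenizer:

```python
def count_tokens(text: str) -> int:
    """Stand-in token counter (whitespace words); swap in the model's tokenizer."""
    return len(text.split())

def fit_to_budget(chunks: list[dict], model_window: int, fraction: float = 0.4) -> list[dict]:
    """Keep chunks in relevance order until the budget is spent; truncate the last one rather than dropping it."""
    budget = int(model_window * fraction)
    kept, used = [], 0
    for chunk in chunks:  # chunks arrive sorted most-relevant first
        n = count_tokens(chunk["content"])
        if used + n <= budget:
            kept.append(chunk)
            used += n
        else:
            remaining = budget - used
            if remaining > 50:  # truncate when meaningful space remains
                words = chunk["content"].split()[:remaining]
                kept.append({**chunk, "content": " ".join(words)})
            break
    return kept
```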
Evaluation
- Retrieval metrics: Recall@K, MRR (Mean Reciprocal Rank), NDCG
- Generation metrics: faithfulness (no hallucination), relevance, completeness
- Use LLM-as-judge for automated evaluation of answer quality
- Build a golden test set: 50-100 (question, expected_answer, source_doc) triples
- Track these metrics in CI — regression = broken RAG pipeline
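A sketch of the two retrieval metrics over such a golden test set, assuming each test case records the ids of the chunks that should have been retrieved:

```python
def recall_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int = 5) -> float:
    """Fraction of relevant chunks that appear in the top-k retrieved results."""
    if not relevant_ids:
        return 0.0
    return len(set(retrieved_ids[:k]) & relevant_ids) / len(relevant_ids)

def mrr(retrieved_ids: list[str], relevant_ids: set[str]) -> float:
    """Reciprocal rank of the first relevant result (0 if none retrieved)."""
    for rank, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id in relevant_ids:
            return 1 / rank
    return 0.0
```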
Schema Pattern
```sql
CREATE TABLE documents (
    id            UUID PRIMARY KEY,
    title         TEXT NOT NULL,
    source_url    TEXT,
    content       TEXT NOT NULL,
    content_hash  CHAR(64) UNIQUE NOT NULL,  -- SHA-256 dedup
    doc_type      TEXT NOT NULL,
    metadata      JSONB DEFAULT '{}',
    created_at    TIMESTAMPTZ DEFAULT now()
);

CREATE TABLE chunks (
    id               UUID PRIMARY KEY,
    document_id      UUID NOT NULL REFERENCES documents(id) ON DELETE CASCADE,
    chunk_index      INT NOT NULL,
    content          TEXT NOT NULL,
    embedding        vector(1536),
    token_count      INT NOT NULL,
    parent_chunk_id  UUID REFERENCES chunks(id),
    metadata         JSONB DEFAULT '{}',
    UNIQUE (document_id, chunk_index)
);

CREATE INDEX idx_chunks_embedding ON chunks USING hnsw (embedding vector_cosine_ops);
CREATE INDEX idx_chunks_doc_id ON chunks(document_id);
CREATE INDEX idx_chunks_metadata ON chunks USING gin(metadata);
CREATE INDEX idx_documents_content_hash ON documents(content_hash);
```
Production Checklist
- Chunking tested with multiple sizes, overlap validated
- Embedding model pinned to specific version
- Hybrid search enabled (vector + BM25)
- Reranker in place after retrieval
- Similarity threshold set (discard low-confidence results)
- Source attribution in generated answers
- Golden test set with automated evaluation
- Monitoring: retrieval latency, rerank latency, relevance scores
- Re-embedding pipeline for model updates
- Rate limiting and caching for embedding API calls
Anti-Patterns to Flag
- Sending entire documents to the LLM instead of relevant chunks
- No reranking — relying on raw vector similarity alone
- Chunks too large (>1024 tokens) or too small (<100 tokens)
- No overlap between chunks — splitting mid-paragraph
- Missing metadata on chunks (no way to trace back to source)
- Hardcoding chunk size without testing retrieval quality
- Not evaluating retrieval separately from generation
- Using retrieval results without a similarity threshold