
RAG Pipeline

Install

Source: clone the upstream repo

git clone https://github.com/sah1l/awesome-claude-code

Claude Code: install into ~/.claude/skills/

T=$(mktemp -d) && git clone --depth=1 https://github.com/sah1l/awesome-claude-code "$T" && mkdir -p ~/.claude/skills && cp -r "$T/.claude/skills/rag-pipeline" ~/.claude/skills/sah1l-awesome-claude-code-rag-pipeline && rm -rf "$T"

Manifest: .claude/skills/rag-pipeline/SKILL.md

Source content

RAG Pipeline

Why This Skill Exists

Retrieval-Augmented Generation (RAG) is the most common pattern for building LLM applications that use private or up-to-date data. But naive implementations — dump everything into a vector DB, retrieve top-5, stuff into prompt — produce unreliable results. This skill covers the full pipeline with production-tested patterns for each stage.

Pipeline Overview

Documents → Parsing → Chunking → Embedding → Indexing
                                                 │  (index is queried at retrieval time)
User Query → Embedding → Retrieval → Reranking → Context Assembly → LLM → Response

Each stage has tunable knobs that dramatically affect output quality. The defaults below are reasonable starting points — measure and adjust for your domain.

Stage 1: Document Ingestion

Parsing

Extract clean text from source documents. Garbage in, garbage out.

| Source | Tool | Watch Out For |
|---|---|---|
| PDF | PyMuPDF, pdfplumber | Tables, multi-column layouts, headers/footers |
| HTML | BeautifulSoup, trafilatura | Boilerplate (nav, ads, footer) — extract main content only |
| Markdown | Direct parse | Preserve heading hierarchy for metadata |
| Code | Tree-sitter | Preserve function/class boundaries |
| Office docs | python-docx, openpyxl | Embedded images, track changes |
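
For HTML, trafilatura handles most boilerplate removal on its own. A minimal sketch (function names per trafilatura's documented API; the URL is a placeholder):

import trafilatura

# Fetch a page and extract the main article text, dropping nav/ads/comments
html = trafilatura.fetch_url("https://example.com/post")
text = trafilatura.extract(html, include_comments=False, include_tables=True)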

Metadata Extraction

Attach metadata at parse time — you'll need it for filtering later.

document = {
    "content": "...",
    "metadata": {
        "source": "docs/architecture.md",
        "title": "System Architecture",
        "doc_type": "technical",
        "last_modified": "2025-03-15",
        "section": "Backend Services",   # heading hierarchy
    }
}

Stage 2: Chunking

Chunking strategy is the single biggest lever for RAG quality.

Strategies

| Strategy | How It Works | Best For |
|---|---|---|
| Fixed-size | Split every N tokens with overlap | Simple, predictable, baseline |
| Recursive | Split by paragraphs → sentences → words | General-purpose text |
| Semantic | Split when embedding similarity drops | Long-form, topic-shifting content |
| Document-structure | Split on headings, sections, functions | Technical docs, code |

Fixed-Size with Overlap

chunk_size = 512      # tokens per chunk
chunk_overlap = 64    # tokens overlap between chunks

# Why overlap? Prevents information from being split across chunk boundaries
# A fact at the end of chunk N is also at the start of chunk N+1
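
A minimal token-level splitter implementing these settings (a sketch assuming tiktoken and the cl100k_base encoding; swap in your model's tokenizer):

import tiktoken

def fixed_size_chunks(text, chunk_size=512, chunk_overlap=64):
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    step = chunk_size - chunk_overlap
    # Each window starts `step` tokens after the previous one, so the last
    # `chunk_overlap` tokens of chunk N reappear at the start of chunk N+1
    return [enc.decode(tokens[i:i + chunk_size]) for i in range(0, len(tokens), step)]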

Recursive Splitting (Recommended Default)

from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separators=["\n\n", "\n", ". ", " ", ""],  # try biggest breaks first
)
chunks = splitter.split_text(document)

Document-Structure Splitting

# For Markdown — split on headings; each section keeps its own heading
# (carry parent headings in metadata if you need the full hierarchy)
def chunk_by_headings(markdown_text):
    sections = []
    current_section = {"heading": "", "content": ""}

    for line in markdown_text.split("\n"):
        if line.startswith("#"):
            if current_section["content"].strip():
                sections.append(current_section)
            current_section = {"heading": line, "content": ""}
        else:
            current_section["content"] += line + "\n"

    if current_section["content"].strip():
        sections.append(current_section)
    return sections
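
Semantic Splitting

The semantic strategy from the table above has no standard one-liner; a sketch of the core idea, assuming sentence-transformers and an illustrative model and threshold (tune both for your corpus):

from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumption: any embedding model works

def semantic_chunks(sentences, threshold=0.6):
    if not sentences:
        return []
    # Normalized embeddings make the dot product equal to cosine similarity
    emb = model.encode(sentences, normalize_embeddings=True)
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        # Start a new chunk where adjacent sentences stop being similar
        if float(np.dot(emb[i - 1], emb[i])) < threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sentences[i])
    chunks.append(" ".join(current))
    return chunks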

Chunk Size Guidelines

| Content Type | Chunk Size | Why |
|---|---|---|
| FAQ / Q&A | 200-300 tokens | Each Q&A is self-contained |
| Technical docs | 500-1000 tokens | Needs enough context for a concept |
| Legal / contracts | 1000-1500 tokens | Clauses reference each other |
| Code | Function/class level | Natural boundaries |

Stage 3: Embedding

Model Selection

| Model | Dimensions | Speed | Quality | Cost |
|---|---|---|---|---|
| OpenAI text-embedding-3-small | 1536 | Fast | Good | Low |
| OpenAI text-embedding-3-large | 3072 | Medium | Better | Medium |
| Cohere embed-v4.0 | 1024 | Fast | Very good | Medium |
| Open-source (e5-large, BGE) | 1024 | Self-hosted | Good | Infra cost |

Key trade-off: Higher dimensions = better quality = more storage + slower search. Start with a smaller model and upgrade only if retrieval quality is measurably insufficient.

Embedding Best Practices

  • Embed queries and documents the same way — or use models with separate query/document modes (asymmetric embedding)
  • Normalize vectors — makes dot-product search equivalent to cosine similarity
  • Batch embed — don't embed one document at a time (see the sketch after this list)
  • Cache embeddings — re-embedding unchanged documents is wasted compute
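
A sketch combining the last two points, batching and caching by content hash (client usage per OpenAI's Python SDK; the in-memory dict stands in for a real cache):

import hashlib
from openai import OpenAI

client = OpenAI()
_cache = {}  # sha256(text) -> embedding vector

def embed_batch(texts, model="text-embedding-3-small"):
    keys = [hashlib.sha256(t.encode()).hexdigest() for t in texts]
    missing = [(k, t) for k, t in zip(keys, texts) if k not in _cache]
    if missing:
        # One API call for all uncached texts instead of one call per text
        resp = client.embeddings.create(model=model, input=[t for _, t in missing])
        for (k, _), item in zip(missing, resp.data):
            _cache[k] = item.embedding
    return [_cache[k] for k in keys]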

Stage 4: Indexing (Vector Database)

Options

| Database | Type | Best For |
|---|---|---|
| pgvector | Extension | Already using PostgreSQL, moderate scale |
| Pinecone | Managed | No infra management, large scale |
| Weaviate | Managed/Self-hosted | Hybrid search built-in |
| Milvus | Self-hosted | Full control, very large scale |
| Chroma | In-process | Prototyping, small datasets |
| Qdrant | Self-hosted/Cloud | High performance, rich filtering |

Choosing

Start here:
  Already use PostgreSQL?  →  pgvector (simplest path)
  <1M vectors?             →  pgvector or Chroma
  Need managed + scale?    →  Pinecone or Weaviate
  Need full control?       →  Milvus or Qdrant

Index Type

  • Flat (brute force): Exact results, slow at scale. Fine for <100K vectors.
  • IVF: Partitions vectors into clusters. Fast, approximate.
  • HNSW: Graph-based, excellent recall/speed trade-off. Default choice for most use cases (sketch below).
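
With pgvector, for example, an HNSW index is a single statement. A sketch assuming psycopg 3, a chunks table with a vector column named embedding, and pgvector >= 0.5; the tuning values are illustrative:

import psycopg

with psycopg.connect("dbname=rag") as conn:
    # m and ef_construction trade index build time and memory for recall
    conn.execute(
        """CREATE INDEX IF NOT EXISTS chunks_embedding_hnsw
           ON chunks USING hnsw (embedding vector_cosine_ops)
           WITH (m = 16, ef_construction = 64)"""
    )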

Stage 5: Retrieval

Basic Retrieval

results = vector_db.similarity_search(
    query_embedding,
    top_k=10,              # retrieve more than you need (reranker will filter)
    score_threshold=0.7,   # minimum similarity — filter noise
)

Hybrid Search (Keyword + Semantic)

Semantic search misses exact matches (product codes, error messages, proper nouns). Combine with keyword search for best results.

# Weighted Reciprocal Rank Fusion — merge the semantic and keyword rankings
def hybrid_search(query, alpha=0.5):
    semantic_results = vector_search(query, top_k=20)   # your vector-DB search
    keyword_results = bm25_search(query, top_k=20)      # your BM25/keyword search

    scores = {}
    for rank, doc in enumerate(semantic_results):
        scores[doc.id] = alpha * (1 / (rank + 60))      # RRF term; 60 is the usual k constant
    for rank, doc in enumerate(keyword_results):
        scores[doc.id] = scores.get(doc.id, 0) + (1 - alpha) * (1 / (rank + 60))

    return sorted(scores.items(), key=lambda x: x[1], reverse=True)

Metadata Filtering

Pre-filter before vector search — dramatically reduces noise.

results = vector_db.similarity_search(
    query_embedding,
    top_k=10,
    filter={
        "doc_type": "technical",
        "last_modified": {"$gte": "2025-01-01"},
    },
)

Stage 6: Reranking

Bi-encoder retrieval (embedding similarity) is fast but rough. Cross-encoder reranking is slower but much more precise.

from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

# Score each (query, document) pair
pairs = [(query, doc.content) for doc in retrieved_docs]
scores = reranker.predict(pairs)

# Keep top results after reranking
reranked = sorted(zip(retrieved_docs, scores), key=lambda x: x[1], reverse=True)
final_docs = [doc for doc, score in reranked[:5]]

Pipeline: Retrieve 20 with embeddings (fast) → Rerank to top 5 with cross-encoder (precise).
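
Glued together, the serving path is short. Every helper here is hypothetical shorthand for the stages above:

def answer(user_query):
    candidates = hybrid_search(user_query)                 # Stage 5: retrieve wide (top 20)
    final_docs = rerank(user_query, candidates, top_n=5)   # Stage 6: rerank narrow (top 5)
    prompt = build_prompt(user_query, final_docs)          # Stage 7: assemble context
    return llm(prompt)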

Stage 7: Context Assembly

Prompt Structure

context = "\n\n---\n\n".join([
    f"Source: {doc.metadata['source']}\n{doc.content}"
    for doc in final_docs
])

prompt = f"""Answer the question based on the provided context.
If the context doesn't contain enough information, say so — don't guess.

Context:
{context}

Question: {user_query}

Answer:"""

Context Window Budget

| Component | Token Budget |
|---|---|
| System prompt | 200-500 |
| Retrieved context | 2000-6000 |
| User query | 50-200 |
| Response headroom | 1000-2000 |

Rule: Don't stuff the maximum context. More context = more noise = worse answers. 3-5 highly relevant chunks outperform 20 mediocre ones.
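
A sketch of enforcing the budget, packing chunks best-first until the context allowance is spent (tiktoken is an assumption; use your model's tokenizer):

import tiktoken

def pack_context(docs, budget=4000):
    # docs assumed sorted best-first (e.g., reranker order)
    enc = tiktoken.get_encoding("cl100k_base")
    picked, used = [], 0
    for doc in docs:
        n = len(enc.encode(doc.content))
        if used + n > budget:
            break  # stop rather than truncate mid-chunk
        picked.append(doc)
        used += n
    return picked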

Evaluation

Metrics

| Metric | What It Measures | Target |
|---|---|---|
| Retrieval precision | % of retrieved docs that are relevant | >80% |
| Retrieval recall | % of relevant docs that were retrieved | >70% |
| Answer faithfulness | Does the answer match the retrieved context? | >90% |
| Answer relevance | Does the answer address the question? | >85% |
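
Precision and recall require labeled relevance judgments; given those, the computation is small (a sketch, with document IDs as plain strings):

def precision_recall(retrieved_ids, relevant_ids):
    retrieved, relevant = set(retrieved_ids), set(relevant_ids)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# e.g. precision_recall(["d1", "d7"], ["d1", "d3"]) -> (0.5, 0.5)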

RAGAS Framework

Automated evaluation using LLM-as-judge:

from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision

# `dataset`: your eval set with question, answer, retrieved contexts,
# and (for some metrics) ground-truth columns
results = evaluate(
    dataset,
    metrics=[faithfulness, answer_relevancy, context_precision],
)

Test Set Design

  • Include questions with known answers (ground truth)
  • Include questions that should return "I don't know" (out-of-scope)
  • Include questions requiring information from multiple chunks
  • Include questions with exact match requirements (codes, names, dates)

Anti-Patterns

| Anti-Pattern | Problem | Fix |
|---|---|---|
| Stuffing max context | Dilutes relevant information | Retrieve fewer, higher-quality chunks |
| Ignoring chunk boundaries | Splits mid-sentence, loses meaning | Use overlap or semantic chunking |
| No metadata filtering | Retrieves outdated or irrelevant docs | Filter by date, type, source |
| Single retrieval strategy | Misses exact matches or semantic matches | Use hybrid search |
| No reranking | Top-k by embedding isn't precise enough | Add cross-encoder reranking |
| Embedding once, never updating | Stale index as documents change | Incremental re-indexing pipeline |
| No evaluation | No way to measure if changes help or hurt | RAGAS or manual eval set |
| Treating all docs equally | API reference ≠ blog post ≠ changelog | Chunk and weight by document type |

$ARGUMENTS

When invoked with arguments, treat them as a description of the RAG use case (data sources, query types, scale). Design a pipeline following these patterns: recommend chunking strategy, embedding model, vector DB, retrieval approach, and evaluation plan — tailored to the specific use case.