Claude-Skills rag-architect

install
source · Clone the upstream repo
git clone https://github.com/borghei/Claude-Skills
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/borghei/Claude-Skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/engineering/rag-architect" ~/.claude/skills/borghei-claude-skills-rag-architect && rm -rf "$T"
manifest: engineering/rag-architect/SKILL.md
source content

RAG Architect

The agent designs, implements, and optimizes production-grade Retrieval-Augmented Generation pipelines, covering the full lifecycle from document chunking through evaluation.

Workflow

  1. Analyze corpus -- Profile the document collection: count, average length, format mix (PDF, HTML, Markdown), language(s), and domain. Validate that sample documents are accessible before proceeding (see the profiling sketch after this list).
  2. Select chunking strategy -- Choose from the Chunking Strategy Matrix based on corpus characteristics. Set chunk size, overlap, and boundary rules. Run a test split on 100 sample documents.
  3. Choose embedding model -- Select an embedding model from the Embedding Model table based on domain, latency budget, and cost constraints. Verify dimension compatibility with the target vector database.
  4. Select vector database -- Pick a vector store from the Vector Database Comparison based on scale, query patterns, and operational requirements.
  5. Design retrieval pipeline -- Configure retrieval strategy (dense, sparse, or hybrid). Add reranking if precision requirements exceed 0.85. Set the top-K parameter and similarity threshold.
  6. Implement query transformations -- If query-document style mismatch exists, enable HyDE. If queries are ambiguous, enable multi-query generation. Validate each transformation improves retrieval metrics on a held-out set.
  7. Configure guardrails -- Enable PII detection, toxicity filtering, hallucination detection, and source attribution. Set confidence score thresholds.
  8. Evaluate end-to-end -- Run the RAGAS evaluation framework. Verify faithfulness > 0.90, context relevance > 0.80, answer relevance > 0.85. Iterate on weak components.
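
A minimal corpus-profiling sketch for step 1, assuming the corpus sits in a local directory and that tiktoken is available for approximate token counts; the directory path, extensions, and encoding name are illustrative, not part of this skill's scripts:

```python
from collections import Counter
from pathlib import Path

import tiktoken  # assumption: used only to approximate token counts


def profile_corpus(root: str, extensions=(".md", ".txt", ".html")) -> dict:
    """Count documents, average token length, and format mix for a corpus."""
    enc = tiktoken.get_encoding("cl100k_base")
    token_counts, formats = [], Counter()
    for path in Path(root).rglob("*"):
        if path.is_file() and path.suffix.lower() in extensions:
            text = path.read_text(errors="ignore")
            token_counts.append(len(enc.encode(text)))
            formats[path.suffix.lower()] += 1
    return {
        "documents": len(token_counts),
        "avg_tokens": round(sum(token_counts) / max(len(token_counts), 1)),
        "format_mix": dict(formats),
    }


print(profile_corpus("./docs"))
```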

Chunking Strategy Matrix

| Strategy | Best For | Chunk Size | Overlap | Pros | Cons |
|---|---|---|---|---|---|
| Fixed-size (token) | Uniform docs, consistent sizing | 512-2048 tokens | 10-20% | Predictable, simple | Breaks semantic units |
| Sentence-based | Narrative text, articles | 3-8 sentences | 1 sentence | Preserves language boundaries | Variable sizes |
| Paragraph-based | Structured docs, technical manuals | 1-3 paragraphs | 0-1 paragraph | Preserves topic coherence | Highly variable sizes |
| Semantic | Long-form, research papers | Dynamic | Topic-shift detection | Best coherence | Computationally expensive |
| Recursive | Mixed content types | Dynamic, multi-level | Per-level | Optimal utilization | Complex implementation |
| Document-aware | Multi-format collections | Format-specific | Section-level | Preserves metadata | Format-specific code required |
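
To make the first row concrete, a minimal fixed-size chunker with overlap; it uses whitespace splitting as a stand-in for a real tokenizer, and the 512/64 defaults are illustrative values within the ranges above:

```python
def fixed_size_chunks(text: str, chunk_size: int = 512, overlap: int = 64) -> list[str]:
    """Split text into fixed-size token windows with overlap between neighbors."""
    tokens = text.split()  # whitespace proxy for the embedding model's tokenizer
    step = chunk_size - overlap  # overlap must stay smaller than chunk_size
    chunks = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + chunk_size]
        if window:
            chunks.append(" ".join(window))
        if start + chunk_size >= len(tokens):
            break
    return chunks
```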

Embedding Model Comparison

| Model | Dimensions | Speed | Quality | Cost | Best For |
|---|---|---|---|---|---|
| all-MiniLM-L6-v2 | 384 | ~14K tok/s | Good | Free (local) | Prototyping, low-latency |
| all-mpnet-base-v2 | 768 | ~2.8K tok/s | Better | Free (local) | Balanced production use |
| text-embedding-3-small | 1536 | API | High | $0.02/1M tokens | Cost-effective production |
| text-embedding-3-large | 3072 | API | Highest | $0.13/1M tokens | Maximum quality |
| Domain fine-tuned | Varies | Varies | Domain-best | Training cost | Specialized domains (legal, medical) |
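
A quick sketch of local embedding with the first model in the table, assuming the sentence-transformers package is installed; the sample chunks are placeholders:

```python
from sentence_transformers import SentenceTransformer

# Local model from the table above; swap in "all-mpnet-base-v2" for higher quality.
model = SentenceTransformer("all-MiniLM-L6-v2")

chunks = ["Rotate API keys every 90 days.", "Deploys run through the staging cluster."]
embeddings = model.encode(chunks, normalize_embeddings=True)  # shape: (2, 384)
print(embeddings.shape)
```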

Vector Database Comparison

| Database | Type | Scaling | Key Feature | Best For |
|---|---|---|---|---|
| Pinecone | Managed | Auto-scaling | Metadata filtering, hybrid search | Production, managed preference |
| Weaviate | Open source | Horizontal | GraphQL API, multi-modal | Complex data types |
| Qdrant | Open source | Distributed | High perf, low memory (Rust) | Performance-critical |
| Chroma | Embedded | Limited | Simple API, SQLite-backed | Prototyping, small-scale |
| pgvector | PostgreSQL ext | PostgreSQL scaling | ACID, SQL joins | Existing PostgreSQL infra |

Retrieval Strategies

| Strategy | When to Use | Implementation |
|---|---|---|
| Dense (vector similarity) | Default for semantic search | Cosine similarity with k-NN/ANN |
| Sparse (BM25/TF-IDF) | Exact keyword matching needed | Elasticsearch or inverted index |
| Hybrid (dense + sparse) | Best of both needed | Reciprocal Rank Fusion (RRF) with tuned weights |
| + Reranking | Precision must exceed 0.85 | Cross-encoder reranker after initial retrieval |
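
A minimal sketch of weighted Reciprocal Rank Fusion for the hybrid row; the inputs are assumed to be ranked document-ID lists already returned by the dense and sparse retrievers, and k=60 is the conventional RRF constant:

```python
def rrf_fuse(dense_ids: list[str], sparse_ids: list[str],
             dense_weight: float = 0.7, sparse_weight: float = 0.3, k: int = 60) -> list[str]:
    """Fuse two ranked lists with weighted Reciprocal Rank Fusion and deduplicate."""
    scores: dict[str, float] = {}
    for weight, ranking in ((dense_weight, dense_ids), (sparse_weight, sparse_ids)):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


fused = rrf_fuse(["d3", "d1", "d7"], ["d1", "d9", "d3"])
print(fused)  # d1 and d3 rise because both retrievers returned them
```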

Query Transformation Techniques

| Technique | When to Use | How It Works |
|---|---|---|
| HyDE | Query/document style mismatch | LLM generates hypothetical answer; embed that instead of query |
| Multi-query | Ambiguous queries | Generate 3-5 query variations; retrieve for each; deduplicate |
| Step-back | Specific questions needing general context | Transform to broader query; retrieve general + specific |
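
A multi-query sketch for the ambiguous-query case; `generate_variations` and `retrieve` are hypothetical callables standing in for the LLM call and the retriever, not functions provided by this skill:

```python
def multi_query_retrieve(query: str, generate_variations, retrieve,
                         n_variations: int = 4, top_k: int = 10) -> list[dict]:
    """Expand an ambiguous query into variations, retrieve for each, and deduplicate by chunk id."""
    variations = [query] + generate_variations(query, n=n_variations)  # hypothetical LLM call
    seen, merged = set(), []
    for q in variations:
        for chunk in retrieve(q, top_k=top_k):  # hypothetical retriever call
            if chunk["id"] not in seen:
                seen.add(chunk["id"])
                merged.append(chunk)
    return merged
```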

Context Window Optimization

  • Relevance ordering: Most relevant chunks first in the context window
  • Diversity: Deduplicate semantically similar chunks
  • Token budget: Fit within the model's context limit; reserve tokens for the system prompt and answer (see the packing sketch after this list)
  • Hierarchical inclusion: Include section summary before detailed chunks when available
  • Compression: Summarize low-relevance chunks; extract key facts from verbose passages
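
A sketch of the token-budget bullet, assuming chunks arrive already ordered most-relevant first and that `count_tokens` is a hypothetical wrapper around the generation model's tokenizer; the 8,000/1,500 numbers are illustrative:

```python
def pack_context(chunks: list[str], count_tokens, budget: int = 8000,
                 reserved: int = 1500) -> list[str]:
    """Greedily add the most relevant chunks until the remaining token budget is spent."""
    remaining = budget - reserved  # hold back tokens for the system prompt and the answer
    packed = []
    for chunk in chunks:  # assumed sorted most-relevant first
        cost = count_tokens(chunk)
        if cost > remaining:
            continue  # skip chunks that no longer fit; shorter ones may still fit
        packed.append(chunk)
        remaining -= cost
    return packed
```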

Evaluation Metrics (RAGAS Framework)

| Metric | Target | What It Measures |
|---|---|---|
| Faithfulness | > 0.90 | Answers grounded in retrieved context |
| Context Relevance | > 0.80 | Retrieved chunks relevant to query |
| Answer Relevance | > 0.85 | Answer addresses the original question |
| Precision@K | > 0.70 | % of top-K results that are relevant |
| Recall@K | > 0.80 | % of relevant docs found in top-K |
| MRR | > 0.75 | Reciprocal rank of first relevant result |
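
The retrieval-side rows reduce to a few lines of arithmetic; a sketch of Precision@K, Recall@K, and MRR for a single query (the RAGAS generation metrics need an LLM judge and are not shown):

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    return sum(doc in relevant for doc in retrieved[:k]) / k


def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    return sum(doc in relevant for doc in retrieved[:k]) / max(len(relevant), 1)


def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


retrieved, relevant = ["d4", "d2", "d9", "d1", "d6"], {"d2", "d1", "d8"}
print(precision_at_k(retrieved, relevant, 5))  # 0.4
print(recall_at_k(retrieved, relevant, 5))     # 0.666...
print(reciprocal_rank(retrieved, relevant))    # 0.5 (first relevant hit at rank 2)
```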

Guardrails

  • PII detection: Scan retrieved chunks and generated responses for PII; redact or block
  • Hallucination detection: Compare generated claims against source documents via NLI
  • Source attribution: Every factual claim must cite a retrieved chunk
  • Confidence scoring: Return a confidence level; if it falls below the threshold, return "I don't have enough information" (see the sketch after this list)
  • Injection prevention: Sanitize user queries; reject prompt injection attempts
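
A sketch of the confidence-scoring guardrail, using the top retrieval similarity as the confidence signal; `retrieve` and `generate` are hypothetical callables and the 0.45 threshold is an illustrative value to tune on a held-out set:

```python
FALLBACK = "I don't have enough information to answer that from the indexed documents."


def answer_with_guardrail(query: str, retrieve, generate, min_similarity: float = 0.45) -> str:
    """Refuse to answer when the best retrieved chunk falls below the similarity threshold."""
    hits = retrieve(query)  # hypothetical: returns [{"text": ..., "score": ...}, ...] sorted by score
    if not hits or hits[0]["score"] < min_similarity:
        return FALLBACK
    return generate(query, context=[h["text"] for h in hits])  # hypothetical LLM call with citations
```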

Example: Internal Knowledge Base RAG Pipeline

corpus:
  documents: 12,000 Confluence pages + 3,000 PDFs
  avg_length: 2,400 tokens
  languages: [English]
  domain: internal engineering docs

pipeline:
  chunking:
    strategy: recursive
    max_tokens: 512
    overlap: 50 tokens
    boundary: paragraph
  embedding:
    model: text-embedding-3-small
    dimensions: 1536
    batch_size: 100
  vector_db:
    engine: pgvector
    index: HNSW (ef_construction=128, m=16)
    reason: "Existing PostgreSQL infra; ACID compliance for audit"
  retrieval:
    strategy: hybrid
    dense_weight: 0.7
    sparse_weight: 0.3
    top_k: 10
    reranker: cross-encoder/ms-marco-MiniLM-L-12-v2
    final_k: 5

evaluation_results:
  faithfulness: 0.93
  context_relevance: 0.84
  answer_relevance: 0.88
  precision_at_5: 0.76
  recall_at_10: 0.85
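
A sketch of the pgvector side of this example, assuming psycopg2 is installed and using an illustrative `chunks` table; the HNSW parameters mirror the index settings above:

```python
import psycopg2

conn = psycopg2.connect("dbname=kb user=rag")  # assumed connection string
with conn, conn.cursor() as cur:
    cur.execute("CREATE EXTENSION IF NOT EXISTS vector;")
    cur.execute("""
        CREATE TABLE IF NOT EXISTS chunks (
            id        bigserial PRIMARY KEY,
            content   text NOT NULL,
            embedding vector(1536)  -- matches text-embedding-3-small
        );
    """)
    cur.execute("""
        CREATE INDEX IF NOT EXISTS chunks_embedding_hnsw
        ON chunks USING hnsw (embedding vector_cosine_ops)
        WITH (m = 16, ef_construction = 128);
    """)

    # Dense top-10 by cosine distance; query_embedding is a 1536-dim list from the embedding model.
    query_embedding = [0.0] * 1536
    vector_literal = "[" + ",".join(map(str, query_embedding)) + "]"
    cur.execute(
        "SELECT id, content FROM chunks ORDER BY embedding <=> %s::vector LIMIT 10;",
        (vector_literal,),
    )
    rows = cur.fetchall()
```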

Production Patterns

  • Caching: Query-level (exact match), semantic (similar queries via embedding distance < 0.05; see the semantic-cache sketch after this list), chunk-level (embedding cache)
  • Streaming: Stream generation tokens to the client as they are produced rather than waiting for the full answer; show sources after generation
  • Fallbacks: If primary vector DB is unavailable, serve from read-replica; if retrieval returns no results above threshold, say so explicitly
  • Document refresh: Incremental re-embedding on change detection; full re-index weekly
  • Cost control: Batch embeddings, cache aggressively, route simple queries to BM25 only
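
A semantic-cache sketch for the caching bullet above, assuming query embeddings are L2-normalized so cosine distance is one minus the dot product; the 0.05 cutoff comes from that bullet:

```python
import numpy as np


class SemanticCache:
    """Serve a cached answer when a new query embeds within a small cosine distance of an old one."""

    def __init__(self, max_distance: float = 0.05):
        self.max_distance = max_distance
        self.embeddings: list[np.ndarray] = []  # normalized query embeddings
        self.answers: list[str] = []

    def get(self, query_embedding: np.ndarray) -> str | None:
        for cached, answer in zip(self.embeddings, self.answers):
            if 1.0 - float(np.dot(cached, query_embedding)) < self.max_distance:
                return answer
        return None

    def put(self, query_embedding: np.ndarray, answer: str) -> None:
        self.embeddings.append(query_embedding)
        self.answers.append(answer)
```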

Common Pitfalls

| Problem | Solution |
|---|---|
| Chunks break mid-sentence | Use boundary-aware chunking with sentence/paragraph overlap |
| Low retrieval precision | Add cross-encoder reranker; tune similarity threshold |
| High latency (> 2s) | Cache embeddings; use faster model; reduce top-K |
| Inconsistent quality | Implement RAGAS evaluation in CI; add quality scoring |
| Scalability bottleneck | Shard vector DB; implement auto-scaling; add caching layer |

Scripts

Chunking Optimizer

Analyzes the corpus and recommends an optimal chunking strategy with parameters.

Retrieval Evaluator

Runs evaluation suite (precision, recall, MRR, NDCG) against a test query set.

Pipeline Benchmarker

Measures end-to-end latency, throughput, and cost per query across configurations.

Troubleshooting

| Problem | Cause | Solution |
|---|---|---|
| Chunks contain incomplete sentences or broken code blocks | Fixed-size chunking ignoring semantic boundaries | Switch to sentence-based or semantic (heading-aware) chunking; enable boundary detection in `chunking_optimizer.py` |
| Retrieved context is relevant but answer is wrong | LLM hallucinating beyond retrieved chunks | Enable faithfulness evaluation via RAGAS; add source attribution guardrails; raise the confidence threshold so low-confidence answers surface "I don't know" responses |
| Precision@K below 0.50 despite relevant documents existing | Embedding model does not capture domain vocabulary | Fine-tune embedding model on domain data or switch to a domain-specific model; add cross-encoder reranking stage |
| Query latency exceeds 2 seconds | Large top-K, no caching, or unoptimized HNSW index | Reduce top-K, enable query-level and semantic caching, tune HNSW parameters (ef_search, m) |
| Recall drops after adding new documents | Stale embeddings or index fragmentation after incremental inserts | Trigger full re-index; verify new documents pass chunking pipeline; check embedding model version consistency |
| Hybrid retrieval returns duplicate chunks | Dense and sparse retrievers returning overlapping results without deduplication | Apply Reciprocal Rank Fusion (RRF) with deduplication before reranking; tune dense/sparse weight ratio |
| Evaluation metrics fluctuate across runs | Non-deterministic embedding batching or insufficient test query set | Fix random seeds, increase evaluation sample size, run evaluations on a frozen ground-truth set |

Success Criteria

  • Faithfulness > 0.90 -- Generated answers are grounded in retrieved context as measured by the RAGAS faithfulness metric.
  • Context Relevance > 0.80 -- At least 80% of retrieved chunks are relevant to the user query.
  • Precision@5 > 0.70 -- On average, at least 70% of the documents in each top-5 result set are relevant.
  • End-to-end latency < 500ms -- P95 query-to-response latency stays under 500 milliseconds for interactive workloads.
  • Recall@10 > 0.85 -- The system retrieves at least 85% of relevant documents within the top 10 results.
  • Chunk boundary quality > 0.80 -- At least 80% of chunks end on clean sentence or paragraph boundaries as reported by `chunking_optimizer.py`.
  • Monthly cost within budget -- Total embedding, vector DB, and reranking costs stay within the budget ceiling defined in requirements.

Scope & Limitations

This skill covers:

  • End-to-end RAG pipeline architecture design: chunking, embedding, vector storage, retrieval, reranking, and evaluation.
  • Quantitative chunking analysis across four strategy families (fixed-size, sentence, paragraph, semantic).
  • Retrieval quality evaluation using standard IR metrics (Precision@K, Recall@K, MRR, NDCG) with a built-in TF-IDF baseline.
  • Automated pipeline design with component selection, cost projection, and Mermaid architecture diagrams.

This skill does NOT cover:

  • LLM prompt engineering or generation-side optimization -- see `engineering/prompt-engineer-toolkit`.
  • Database schema design for metadata stores alongside vector databases -- see `engineering/database-designer`.
  • Production observability, alerting, and SLO dashboards for deployed pipelines -- see `engineering/observability-designer`.
  • Agent orchestration or multi-step reasoning workflows that sit on top of RAG retrieval -- see `engineering/agent-workflow-designer`.

Integration Points

| Skill | Integration | Data Flow |
|---|---|---|
| `engineering/prompt-engineer-toolkit` | Optimize system prompts and few-shot examples fed alongside retrieved chunks | Pipeline design output → prompt templates that reference chunk format and metadata |
| `engineering/database-designer` | Design relational metadata stores (tags, access control, source tracking) paired with the vector database | Vector DB recommendation → metadata schema for hybrid storage |
| `engineering/observability-designer` | Set up latency, throughput, and accuracy monitoring for the deployed RAG pipeline | Evaluation metrics and SLO targets → dashboards and alerting rules |
| `engineering/agent-workflow-designer` | Embed the RAG retrieval step inside multi-agent reasoning workflows | Retrieval config → agent tool definition with top-K and threshold parameters |
| `engineering/ci-cd-pipeline-builder` | Automate embedding re-indexing, evaluation regression tests, and deployment on document changes | Evaluation thresholds → CI gate that blocks deploys when metrics regress |
| `engineering/api-design-reviewer` | Review the query and ingestion API surface exposed by the RAG service | Pipeline config → OpenAPI spec review for search and ingest endpoints |

Tool Reference

chunking_optimizer.py

Purpose: Analyzes a document corpus and evaluates multiple chunking strategies (fixed-size, sentence-based, paragraph-based, semantic/heading-aware) to recommend the optimal approach with configuration parameters.

Usage:

`python chunking_optimizer.py <directory> [options]`

Flags / Parameters:

| Flag | Type | Default | Description |
|---|---|---|---|
| `directory` | positional, required | -- | Directory containing text/markdown documents to analyze |
| `--output`, `-o` | string | None | Output file path for results in JSON format |
| `--config`, `-c` | string | None | JSON configuration file to customize strategy parameters (fixed_sizes, overlaps, sentence_max_sizes, paragraph_max_sizes, semantic_max_sizes) |
| `--extensions` | string list | `.txt .md .markdown` | File extensions to include when scanning the corpus |
| `--verbose`, `-v` | flag | off | Print all strategy scores in addition to the recommendation |

Example:

`python chunking_optimizer.py ./docs --output results.json --extensions .txt .md --verbose`

Output Formats:

  • Console -- Corpus summary, recommended strategy name, performance score, reasoning text, and two sample chunks. With `--verbose`, all strategy scores are listed.
  • JSON (`--output`) -- Full results object containing `corpus_info`, `strategy_results` (per-strategy size statistics, boundary quality, semantic coherence, vocabulary statistics, performance score), `recommendation` (best strategy, all scores, reasoning), and `sample_chunks`.

retrieval_evaluator.py

Purpose: Evaluates retrieval system performance using a built-in TF-IDF baseline retriever and standard information retrieval metrics: Precision@K, Recall@K, MRR, and NDCG. Includes failure analysis and improvement recommendations.

Usage:

`python retrieval_evaluator.py <queries> <corpus> <ground_truth> [options]`

Flags / Parameters:

| Flag | Type | Default | Description |
|---|---|---|---|
| `queries` | positional, required | -- | JSON file containing queries (a list of `{"id": ..., "query": ...}` objects, or `{"queries": [...]}`) |
| `corpus` | positional, required | -- | Directory containing the document corpus |
| `ground_truth` | positional, required | -- | JSON file mapping query IDs to lists of relevant document IDs |
| `--output`, `-o` | string | None | Output file path for results in JSON format |
| `--k-values` | int list | `1 3 5 10` | K values used when computing Precision@K, Recall@K, and NDCG@K |
| `--extensions` | string list | `.txt .md .markdown` | File extensions to include from the corpus directory |
| `--verbose`, `-v` | flag | off | Print detailed per-metric values and failure analysis counts |

Example:

`python retrieval_evaluator.py queries.json ./corpus ground_truth.json --output eval.json --k-values 1 5 10 --verbose`

Output Formats:

  • Console -- Evaluation summary table (Precision@1, Precision@5, Recall@5, MRR, NDCG@5) with performance assessment and numbered improvement recommendations. With `--verbose`, all aggregate metrics and failure analysis counts are printed.
  • JSON (`--output`) -- Full results object containing `aggregate_metrics`, `query_results` (per-query metrics, retrieved docs, relevant docs), `failure_analysis` (poor precision/recall counts, zero-result counts, query length analysis, failure patterns), `evaluation_summary`, and `recommendations`.

rag_pipeline_designer.py

Purpose: Accepts a system requirements specification and generates a complete RAG pipeline design including component recommendations (chunking, embedding, vector DB, retrieval, reranking, evaluation), cost projections, a Mermaid architecture diagram, and deployment configuration templates.

Usage:

`python rag_pipeline_designer.py <requirements> [options]`

Flags / Parameters:

| Flag | Type | Default | Description |
|---|---|---|---|
| `requirements` | positional, required | -- | JSON file containing system requirements (document_types, document_count, avg_document_size, queries_per_day, query_patterns, latency_requirement, budget_monthly, accuracy_priority, cost_priority, maintenance_complexity) |
| `--output`, `-o` | string | None | Output file path for the pipeline design in JSON format |
| `--verbose`, `-v` | flag | off | Print full configuration templates for each component |

Example:

`python rag_pipeline_designer.py requirements.json --output pipeline_design.json --verbose`

Output Formats:

  • Console -- Design summary with total monthly cost, per-component recommendations (name, rationale, cost), and a Mermaid architecture diagram. With `--verbose`, full JSON configuration templates for each component are printed.
  • JSON (`--output`) -- Complete pipeline design object containing per-component `ComponentRecommendation` fields (name, type, config, rationale, pros, cons, cost_monthly), `total_cost`, `architecture_diagram` (Mermaid markup), and `config_templates` (per-component configs plus deployment/scaling/monitoring settings).