Awesome-Agent-Skills-for-Empirical-Research · rag-methodology-guide
RAG architecture for academic knowledge retrieval and synthesis

Install

Source · Clone the upstream repo:

git clone https://github.com/brycewang-stanford/Awesome-Agent-Skills-for-Empirical-Research

Claude Code · Install into ~/.claude/skills/:

T=$(mktemp -d) && git clone --depth=1 https://github.com/brycewang-stanford/Awesome-Agent-Skills-for-Empirical-Research "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/43-wentorai-research-plugins/skills/tools/knowledge-graph/rag-methodology-guide" ~/.claude/skills/brycewang-stanford-awesome-agent-skills-for-empirical-research-rag-methodology-g && rm -rf "$T"

Manifest: skills/43-wentorai-research-plugins/skills/tools/knowledge-graph/rag-methodology-guide/SKILL.md
RAG Methodology Guide
Design and implement Retrieval-Augmented Generation (RAG) systems for academic research, including document chunking, embedding strategies, retrieval pipelines, and evaluation.
What Is RAG?
Retrieval-Augmented Generation (RAG) augments a language model's generation with relevant information retrieved from an external knowledge base. For academic research, this enables:
- Question answering over a personal paper library
- Literature synthesis across hundreds of papers
- Fact-checking claims against source documents
- Generating citations with provenance
RAG Pipeline Architecture
Query: "What are the main challenges of protein folding?" | v [1. Query Processing] |-- Embed query using embedding model |-- Optional: Query expansion / HyDE | v [2. Retrieval] |-- Search vector database for top-k relevant chunks |-- Optional: Reranking with cross-encoder | v [3. Context Assembly] |-- Combine retrieved chunks into a prompt |-- Add metadata (source, page, citation) | v [4. Generation] |-- LLM generates answer grounded in retrieved context |-- Include inline citations | v Answer with citations
Step 1: Document Ingestion and Chunking
Chunking Strategies
| Strategy | Description | Best For |
|---|---|---|
| Fixed-size | Split every N characters/tokens | Simple, fast, baseline |
| Sentence-based | Split on sentence boundaries | Natural reading units |
| Paragraph-based | Split on paragraph breaks | Coherent semantic units |
| Section-based | Split on document headings | Academic papers |
| Recursive | Hierarchically split (heading > paragraph > sentence) | General purpose |
| Semantic | Split on topic shifts using embeddings | Best quality, slower |
Implementation
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

def chunk_academic_paper(text, chunk_size=1000, chunk_overlap=200):
    """Chunk an academic paper using recursive splitting."""
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
        separators=[
            "\n## ",   # H2 headings (section breaks)
            "\n### ",  # H3 headings (subsection breaks)
            "\n\n",    # Paragraph breaks
            "\n",      # Line breaks
            ". ",      # Sentence breaks
            " ",       # Word breaks
        ],
        length_function=len,
    )
    chunks = splitter.split_text(text)
    return chunks

# Add metadata to each chunk
def create_documents(paper_text, metadata):
    """Create chunks with source metadata for citation tracking."""
    chunks = chunk_academic_paper(paper_text)
    documents = []
    for i, chunk in enumerate(chunks):
        documents.append({
            "text": chunk,
            "metadata": {
                **metadata,
                "chunk_index": i,
                "chunk_total": len(chunks),
            },
        })
    return documents

# Example usage
docs = create_documents(
    paper_text=extracted_text,
    metadata={
        "title": "Attention Is All You Need",
        "authors": "Vaswani et al.",
        "year": 2017,
        "doi": "10.48550/arXiv.1706.03762",
        "source_file": "vaswani2017attention.pdf",
    },
)
```
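The recursive splitter above is a solid default. The semantic strategy from the table can be sketched by embedding consecutive sentences and starting a new chunk wherever similarity drops, as below; the 0.6 threshold and model choice are illustrative assumptions, not values prescribed by this guide:

```python
# Semantic chunking sketch: split where adjacent sentences diverge in meaning.
# Threshold and embedding model are illustrative assumptions.
import numpy as np
from sentence_transformers import SentenceTransformer

sem_model = SentenceTransformer("all-MiniLM-L6-v2")

def semantic_chunks(sentences, threshold=0.6):
    """Group consecutive sentences, opening a new chunk on a topic shift."""
    embeddings = sem_model.encode(sentences, normalize_embeddings=True)
    chunks, current = [], [sentences[0]]
    for prev, nxt, sent in zip(embeddings, embeddings[1:], sentences[1:]):
        similarity = float(np.dot(prev, nxt))  # cosine (vectors are normalized)
        if similarity < threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sent)
    chunks.append(" ".join(current))
    return chunks
```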
Step 2: Embedding and Indexing
Embedding Model Selection
| Model | Dimensions | Quality | Speed | Cost |
|---|---|---|---|---|
| OpenAI text-embedding-3-small | 1536 | Good | Fast | $0.02/1M tokens |
| OpenAI text-embedding-3-large | 3072 | Excellent | Fast | $0.13/1M tokens |
| Cohere embed-v3 | 1024 | Excellent | Fast | $0.10/1M tokens |
| sentence-transformers/all-MiniLM-L6-v2 | 384 | Good | Very fast | Free (local) |
| BAAI/bge-large-en-v1.5 | 1024 | Excellent | Medium | Free (local) |
| nomic-embed-text | 768 | Good | Fast | Free (local) |
Vector Database Options
| Database | Type | Scalability | Features |
|---|---|---|---|
| ChromaDB | Embedded | Small-medium | Simple, good for prototyping |
| FAISS | Library | Large | From Facebook AI Research; GPU support |
| Pinecone | Cloud | Large | Managed, serverless |
| Weaviate | Self-hosted/Cloud | Large | Hybrid search, filters |
| Qdrant | Self-hosted/Cloud | Large | Rich filtering, payload storage |
| pgvector | PostgreSQL extension | Medium | SQL integration |
Building the Index
```python
import chromadb
from sentence_transformers import SentenceTransformer

# Initialize embedding model (local, free)
embed_model = SentenceTransformer("BAAI/bge-large-en-v1.5")

# Initialize ChromaDB
client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection(
    name="research_papers",
    metadata={"hnsw:space": "cosine"},
)

# Index documents
def index_documents(documents):
    """Add documents to the vector database."""
    texts = [doc["text"] for doc in documents]
    embeddings = embed_model.encode(texts, show_progress_bar=True).tolist()
    ids = [f"doc_{i}" for i in range(len(documents))]
    metadatas = [doc["metadata"] for doc in documents]
    collection.add(
        documents=texts,
        embeddings=embeddings,
        metadatas=metadatas,
        ids=ids,
    )
    print(f"Indexed {len(documents)} chunks")

index_documents(docs)
```
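For comparison with the table above, the same index can be built with FAISS. A sketch, reusing `embed_model` and `docs` and normalizing vectors so inner product equals cosine similarity; note that FAISS stores only vectors, so chunk metadata must be kept in a parallel list keyed by row index:

```python
# FAISS sketch: exact inner-product search over L2-normalized vectors.
# Assumes `embed_model` and `docs` from this section.
import faiss
import numpy as np

texts = [doc["text"] for doc in docs]
vectors = np.asarray(
    embed_model.encode(texts, normalize_embeddings=True), dtype="float32"
)

index = faiss.IndexFlatIP(vectors.shape[1])  # inner product == cosine here
index.add(vectors)

query_vec = np.asarray(
    embed_model.encode(["protein folding challenges"], normalize_embeddings=True),
    dtype="float32",
)
scores, indices = index.search(query_vec, 5)  # top-5 rows into `docs`
```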
Step 3: Retrieval
Basic Retrieval
```python
def retrieve(query, top_k=5):
    """Retrieve the most relevant chunks for a query."""
    query_embedding = embed_model.encode([query]).tolist()
    results = collection.query(
        query_embeddings=query_embedding,
        n_results=top_k,
        include=["documents", "metadatas", "distances"],
    )
    retrieved = []
    for doc, meta, dist in zip(
        results["documents"][0],
        results["metadatas"][0],
        results["distances"][0],
    ):
        retrieved.append({
            "text": doc,
            "metadata": meta,
            "similarity": 1 - dist,  # Convert cosine distance to similarity
        })
    return retrieved

# Example
results = retrieve("What are the main components of the Transformer architecture?")
for r in results:
    print(f"[{r['similarity']:.3f}] {r['metadata'].get('title', 'N/A')}")
    print(f"  {r['text'][:150]}...")
```
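ChromaDB also accepts metadata filters at query time, which is useful for scoping retrieval to a subset of the library. A sketch; the filter value is illustrative, and any field stored in Step 1 works:

```python
# Scoped retrieval sketch: restrict the search with a metadata filter.
# The filter value is illustrative.
results = collection.query(
    query_embeddings=embed_model.encode(["attention mechanisms"]).tolist(),
    n_results=5,
    where={"year": 2017},  # only chunks from papers published in 2017
)
```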
Advanced Retrieval: Hybrid Search
```python
from rank_bm25 import BM25Okapi

def hybrid_retrieve(query, top_k=5, alpha=0.7):
    """Combine dense (semantic) and sparse (keyword) retrieval."""
    # Dense retrieval (vector similarity)
    dense_results = retrieve(query, top_k=top_k * 2)

    # Sparse retrieval (BM25 keyword matching)
    # Assume all_documents is a list of all chunk texts, in chunk order
    tokenized_corpus = [doc.split() for doc in all_documents]
    bm25 = BM25Okapi(tokenized_corpus)
    bm25_scores = bm25.get_scores(query.split())
    sparse_top_k = bm25_scores.argsort()[-top_k * 2:][::-1]

    # Reciprocal Rank Fusion (RRF), weighted by alpha
    rrf_scores = {}
    k = 60  # RRF constant
    for rank, result in enumerate(dense_results):
        doc_id = result["metadata"].get("chunk_index", rank)
        rrf_scores[doc_id] = rrf_scores.get(doc_id, 0) + alpha / (k + rank + 1)
    for rank, idx in enumerate(sparse_top_k):
        rrf_scores[idx] = rrf_scores.get(idx, 0) + (1 - alpha) / (k + rank + 1)

    # Sort by fused score and return the top-k (chunk_index, score) pairs
    sorted_results = sorted(rrf_scores.items(), key=lambda x: x[1], reverse=True)
    return sorted_results[:top_k]
```
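The reranking step from the pipeline diagram can sit on top of either retriever: over-retrieve, then rescore each (query, chunk) pair jointly with a cross-encoder. A sketch using sentence-transformers; the checkpoint name is a common public choice, not one prescribed by this guide:

```python
# Cross-encoder reranking sketch. The checkpoint is a common public
# MS MARCO reranker, assumed here for illustration.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query, retrieved, top_k=5):
    """Rescore (query, chunk) pairs and keep the best top_k."""
    pairs = [(query, r["text"]) for r in retrieved]
    scores = reranker.predict(pairs)
    for r, score in zip(retrieved, scores):
        r["rerank_score"] = float(score)
    return sorted(retrieved, key=lambda r: r["rerank_score"], reverse=True)[:top_k]

# Example: over-retrieve with the dense retriever, then rerank down to 5
question = "What are the main components of the Transformer architecture?"
final_context = rerank(question, retrieve(question, top_k=20))
```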
Step 4: Generation with Citations
```python
def generate_answer(query, retrieved_contexts):
    """Generate an answer with inline citations using an LLM."""
    # Build context string with citation markers
    context_parts = []
    for i, ctx in enumerate(retrieved_contexts, 1):
        source = f"{ctx['metadata'].get('authors', 'Unknown')}, {ctx['metadata'].get('year', 'N/A')}"
        context_parts.append(f"[{i}] ({source}): {ctx['text']}")
    context_string = "\n\n".join(context_parts)

    prompt = f"""Based on the following research paper excerpts, answer the question.
Use inline citations like [1], [2] to reference specific sources.
Only use information from the provided excerpts.
If the excerpts do not contain enough information, say so.

EXCERPTS:
{context_string}

QUESTION: {query}

ANSWER (with inline citations):"""

    # Send to LLM (example with OpenAI)
    # response = openai.chat.completions.create(
    #     model="gpt-4",
    #     messages=[{"role": "user", "content": prompt}],
    #     temperature=0.1
    # )
    # return response.choices[0].message.content

    return prompt  # Return the prompt for inspection
```
Evaluation Metrics
| Metric | Measures | Tool |
|---|---|---|
| Retrieval precision | Are retrieved chunks relevant? | Manual annotation |
| Retrieval recall | Are all relevant chunks retrieved? | Known-relevant set |
| NDCG | Ranking quality of retrieved results | BEIR benchmark |
| Answer correctness | Is the generated answer factually correct? | Human evaluation |
| Faithfulness | Does the answer only use information from retrieved context? | RAGAS framework |
| Answer relevance | Does the answer address the question? | RAGAS framework |
| Context relevance | Are the retrieved contexts relevant to the question? | RAGAS framework |
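Given a small hand-labeled set of known-relevant chunk IDs per question, the retrieval precision and recall rows above can be computed directly. A sketch; the IDs are hypothetical placeholders:

```python
# Retrieval precision/recall sketch against a hand-labeled relevance set.
# `relevant_ids` would come from manual annotation, as noted in the table.
def retrieval_precision_recall(retrieved_ids, relevant_ids):
    retrieved, relevant = set(retrieved_ids), set(relevant_ids)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Example with hypothetical chunk IDs
precision, recall = retrieval_precision_recall(
    retrieved_ids=["doc_3", "doc_7", "doc_12"],
    relevant_ids=["doc_3", "doc_12", "doc_40"],
)
print(f"precision={precision:.2f}, recall={recall:.2f}")  # 0.67, 0.67
```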
```python
# Using RAGAS for automated RAG evaluation
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision

# Prepare evaluation dataset (ragas expects a Hugging Face Dataset)
eval_data = {
    "question": ["What is the Transformer architecture?"],
    "answer": ["The Transformer uses self-attention mechanisms..."],
    "contexts": [["The Transformer model architecture eschews recurrence..."]],
    "ground_truth": ["The Transformer is a neural network architecture..."],
}

result = evaluate(
    dataset=Dataset.from_dict(eval_data),
    metrics=[faithfulness, answer_relevancy, context_precision],
)
print(result)
```
Best Practices for Academic RAG
- Chunk by section: Academic papers have natural section boundaries. Use them.
- Preserve metadata: Always store title, authors, year, DOI, and page number with each chunk for proper citation.
- Use domain-specific embeddings: Models fine-tuned on scientific text (e.g., SPECTER2) outperform general models for academic content; see the sketch after this list.
- Rerank after retrieval: A cross-encoder reranker significantly improves precision over embedding-only retrieval.
- Handle tables and figures: Extract tables as text or structured data; do not ignore them during chunking.
- Evaluate systematically: Use RAGAS or a custom evaluation set to measure retrieval and generation quality before deploying.
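For the domain-specific embeddings point, a minimal sketch: SPECTER2 itself loads through allenai's adapters setup, so for brevity this example uses the original SPECTER model, which is published directly on sentence-transformers; the model choice is an assumption, not a recommendation from this guide:

```python
# Domain-specific embedding sketch using the original SPECTER model
# (SPECTER2 needs the `adapters` library and is omitted here for brevity).
from sentence_transformers import SentenceTransformer

sci_model = SentenceTransformer("sentence-transformers/allenai-specter")

# SPECTER was trained on "title [SEP] abstract" pairs, so format inputs that way
paper = "Attention Is All You Need [SEP] The dominant sequence transduction models..."
embedding = sci_model.encode(paper)
```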