Awesome-Agent-Skills-for-Empirical-Research i3

install
source · Clone the upstream repo
git clone https://github.com/brycewang-stanford/Awesome-Agent-Skills-for-Empirical-Research
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/brycewang-stanford/Awesome-Agent-Skills-for-Empirical-Research "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/25-HosungYou-Diverga/skills/i3" ~/.claude/skills/brycewang-stanford-awesome-agent-skills-for-empirical-research-i3 && rm -rf "$T"
manifest: skills/25-HosungYou-Diverga/skills/i3/SKILL.md

⛔ Prerequisites (v8.2 — MCP Enforcement)

diverga_check_prerequisites("i3") → must return approved: true
If not approved → AskUserQuestion for each missing checkpoint (see .claude/references/checkpoint-templates.md)

Checkpoints During Execution

  • 🟠 SCH_RAG_READINESS →
    diverga_mark_checkpoint("SCH_RAG_READINESS", decision, rationale)

Fallback (MCP unavailable)

Read .research/decision-log.yaml directly to verify prerequisites. Conversation history is a last resort.


I3-RAGBuilder

Agent ID: I3
Category: I - Systematic Review Automation
Tier: LOW (Haiku)
Icon: 🗄️⚡

Overview

Builds a RAG (Retrieval-Augmented Generation) system from PRISMA-selected papers. It uses free, locally run embeddings and ChromaDB, so the RAG build stage costs $0. It handles PDF download, text extraction, chunking, and vector database creation.

Zero-Cost Stack

| Component | Tool | Cost |
| --- | --- | --- |
| PDF Download | requests | $0 |
| Text Extraction | PyMuPDF | $0 |
| Embeddings | all-MiniLM-L6-v2 | $0 (local) |
| Vector DB | ChromaDB | $0 (local) |
| Chunking | LangChain | $0 |

Total RAG Building Cost: $0

Input Schema

Required:
  - project_path: "string"

Optional:
  - chunk_size_tokens: "int (default: 500)"
  - chunk_overlap_tokens: "int (default: 100)"
  - embedding_model: "string (default: all-MiniLM-L6-v2)"
  - delay_between_downloads: "float (default: 2.0)"
  - download_timeout: "int (default: 30)"
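
The defaults above can be captured in a small configuration object. This is a minimal sketch, not the skill's actual loader: the class name `RagBuildConfig` and its sanity checks are assumptions made for illustration, and the time units are presumed to be seconds.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RagBuildConfig:
    """Hypothetical container mirroring the input schema above."""

    project_path: str                         # required
    chunk_size_tokens: int = 500
    chunk_overlap_tokens: int = 100
    embedding_model: str = "all-MiniLM-L6-v2"
    delay_between_downloads: float = 2.0      # presumably seconds
    download_timeout: int = 30                # presumably seconds

    def __post_init__(self):
        # These checks are assumptions, not rules stated by the skill.
        if not self.project_path:
            raise ValueError("project_path is required")
        if self.chunk_size_tokens <= 0:
            raise ValueError("chunk_size_tokens must be positive")
        if not 0 <= self.chunk_overlap_tokens < self.chunk_size_tokens:
            raise ValueError("chunk_overlap_tokens must be in [0, chunk_size_tokens)")


config = RagBuildConfig(project_path="my-review")
print(config)
```

Rejecting an overlap that is equal to or larger than the chunk size matters because a sliding-window chunker would otherwise never advance.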

Output Schema

main_output:
  stage: "rag_build"
  pdf_download:
    total_papers: "int"
    downloaded: "int"
    failed: "int"
    success_rate: "string"
    total_size_mb: "int"
  rag_build:
    total_chunks: "int"
    avg_chunks_per_paper: "float"
    chunk_size_tokens: "int"
    chunk_overlap_tokens: "int"
    embedding_model: "string"
    embedding_dimensions: "int"
    vector_db: "string"
  output_paths:
    pdfs: "string"
    chroma_db: "string"
    rag_config: "string"

Human Checkpoint Protocol

🟠 SCH_RAG_READINESS (RECOMMENDED)

Before completing RAG build, I3 SHOULD:

  1. REPORT build status:

    RAG Build Complete
    
    PDF Download:
    - Total papers: 287
    - PDFs downloaded: 245 (85.4%)
    - PDFs unavailable: 42
    
    Vector Database:
    - Total chunks: 4,850
    - Avg chunks/paper: 19.8
    - Embedding model: all-MiniLM-L6-v2
    - Database: ChromaDB
    
    Storage:
    - PDF size: 1.2 GB
    - Vector DB size: 450 MB
    
    Ready for research queries?
    
  2. ASK if user wants to proceed

  3. CONFIRM RAG is ready for queries
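
The figures in the status report can be derived from three raw counts. The sketch below is illustrative only: `summarize_build` and `render_report` are hypothetical helper names, and it assumes the average is taken over downloaded papers, which is consistent with the sample report (4,850 chunks / 245 PDFs ≈ 19.8).

```python
def summarize_build(total_papers: int, downloaded: int, total_chunks: int) -> dict:
    """Derive report figures from raw counts (keys follow the output schema)."""
    failed = total_papers - downloaded
    success_rate = f"{downloaded / total_papers:.1%}" if total_papers else "n/a"
    # Assumption: the average is per downloaded paper, not per selected paper.
    avg_chunks = round(total_chunks / downloaded, 1) if downloaded else 0.0
    return {
        "total_papers": total_papers,
        "downloaded": downloaded,
        "failed": failed,
        "success_rate": success_rate,
        "total_chunks": total_chunks,
        "avg_chunks_per_paper": avg_chunks,
    }


def render_report(stats: dict) -> str:
    """Format the download and vector database parts of the checkpoint report."""
    return "\n".join([
        "RAG Build Complete",
        "",
        "PDF Download:",
        f"- Total papers: {stats['total_papers']}",
        f"- PDFs downloaded: {stats['downloaded']} ({stats['success_rate']})",
        f"- PDFs unavailable: {stats['failed']}",
        "",
        "Vector Database:",
        f"- Total chunks: {stats['total_chunks']:,}",
        f"- Avg chunks/paper: {stats['avg_chunks_per_paper']}",
    ])


stats = summarize_build(total_papers=287, downloaded=245, total_chunks=4850)
print(render_report(stats))
```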

Execution Commands

# Run from your project's working directory.
# Note: cd "$(pwd)" is a no-op; replace it with the path to your own checkout.
cd "$(pwd)"

# Stage 4: PDF Download
python scripts/04_download_pdfs.py \
  --project {project_path} \
  --delay 2.0 \
  --timeout 30

# Stage 5: RAG Build
python scripts/05_build_rag.py \
  --project {project_path} \
  --chunk-size 1000 \
  --chunk-overlap 200 \
  --embedding-model sentence-transformers/all-MiniLM-L6-v2

Chunking Strategy (v1.2.6: Token-Based)

Problem: The documentation said "1000 tokens", but the code actually split on 1000 characters.

Fix: Token-based chunking with tiktoken

import tiktoken
tokenizer = tiktoken.get_encoding("cl100k_base")

# Settings
chunk_size_tokens = 500    # Actual tokens
chunk_overlap_tokens = 100  # Actual tokens

# Character fallback (if tiktoken unavailable)
chunk_size_chars = 1000
chunk_overlap_chars = 200
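
A sliding window over the token sequence is one way to apply these settings. The sketch below is an assumption about how such a chunker could look, not the code in `05_build_rag.py`; the function names `chunk_items` and `chunk_text` are hypothetical. The window logic is generic, so the same function serves the character fallback.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def chunk_items(items: Sequence[T], size: int, overlap: int) -> List[List[T]]:
    """Split a sequence into windows of `size` items sharing `overlap` items."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks: List[List[T]] = []
    for start in range(0, len(items), step):
        chunks.append(list(items[start:start + size]))
        if start + size >= len(items):
            break  # the last window already reaches the end of the sequence
    return chunks


def chunk_text(text: str) -> List[str]:
    """Token-based chunking with a character fallback (hypothetical wrapper)."""
    try:
        import tiktoken
    except ImportError:
        # Character fallback: 1000 characters with 200 overlap.
        return ["".join(c) for c in chunk_items(text, 1000, 200)]
    enc = tiktoken.get_encoding("cl100k_base")
    return [enc.decode(c) for c in chunk_items(enc.encode(text), 500, 100)]


# 1200 fake token IDs yield windows starting at 0, 400, and 800.
windows = chunk_items(list(range(1200)), size=500, overlap=100)
print([(w[0], w[-1]) for w in windows])
```

Decoding token windows back to text can split a multi-byte character at a window boundary, so a production chunker may prefer to track character offsets instead.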

Embedding Model Options

| Model | Dimensions | Speed | Quality |
| --- | --- | --- | --- |
| all-MiniLM-L6-v2 (default) | 384 | Fast | Good |
| all-mpnet-base-v2 | 768 | Medium | Better |
| bge-small-en-v1.5 | 384 | Fast | Good |
| e5-small-v2 | 384 | Fast | Good |

All models run locally at zero cost.

PDF Download Strategy

Open Access Sources

| Source | URL Pattern | Success Rate |
| --- | --- | --- |
| Semantic Scholar | openAccessPdf.url | ~40% |
| OpenAlex | open_access.oa_url | ~50% |
| arXiv | arxiv.org/pdf/{id}.pdf | 100% |
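
The URL patterns above can be turned into an ordered list of download candidates for each paper. This is a sketch under stated assumptions: the paper record is taken to be a dict that merges Semantic Scholar and OpenAlex fields, the `arxiv_id` key is hypothetical, and trying sources in order of listed success rate is a design choice of this example rather than documented behavior.

```python
from typing import List


def candidate_pdf_urls(record: dict) -> List[str]:
    """Return candidate PDF URLs, most reliable source first, without duplicates."""
    candidates = []

    arxiv_id = record.get("arxiv_id")  # hypothetical key for the arXiv identifier
    if arxiv_id:
        candidates.append(f"https://arxiv.org/pdf/{arxiv_id}.pdf")

    # OpenAlex: open_access.oa_url (the field may be missing or null)
    candidates.append((record.get("open_access") or {}).get("oa_url"))

    # Semantic Scholar: openAccessPdf.url (the field may be missing or null)
    candidates.append((record.get("openAccessPdf") or {}).get("url"))

    urls: List[str] = []
    for url in candidates:
        if url and url not in urls:
            urls.append(url)
    return urls


paper = {
    "arxiv_id": "2005.11401",
    "open_access": {"oa_url": "https://example.org/oa.pdf"},
    "openAccessPdf": None,
}
print(candidate_pdf_urls(paper))
```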

Retry Logic

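import time
from requests.exceptions import Timeout  # assumes download_pdf uses requests

max_retries = 3
base_delay = 2.0  # seconds

for attempt in range(max_retries):
    try:
        download_pdf(url)  # download helper defined elsewhere
        break
    except Timeout:
        if attempt < max_retries - 1:
            # Exponential backoff: wait 2s, then 4s, before the next attempt
            delay = base_delay * (2 ** attempt)
            time.sleep(delay)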

Validation

  • Minimum file size: 1KB
  • Content-Type: application/pdf
  • PDF header check: %PDF-
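
The three checks can be combined into a single predicate applied to each downloaded response. This is a minimal sketch: the function name `is_valid_pdf` is hypothetical, and it assumes "1KB" means 1024 bytes and that a missing Content-Type header counts as a failure.

```python
PDF_MAGIC = b"%PDF-"
MIN_SIZE_BYTES = 1024  # assumption: "1KB" is read as 1024 bytes


def is_valid_pdf(content: bytes, content_type: str) -> bool:
    """Apply the size, Content-Type, and header checks to a downloaded body."""
    if len(content) < MIN_SIZE_BYTES:
        return False
    # Ignore parameters such as "; charset=binary" when comparing media types.
    media_type = content_type.split(";")[0].strip().lower()
    if media_type != "application/pdf":
        return False
    return content.startswith(PDF_MAGIC)


body = b"%PDF-1.7\n" + b"0" * 2000
print(is_valid_pdf(body, "application/pdf"))
print(is_valid_pdf(b"<html>Access denied</html>", "text/html"))
```

Checking the header bytes as well as the Content-Type guards against publisher landing pages that are served with a PDF media type but contain HTML.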

Vector Database Structure

data/04_rag/
├── chroma_db/
│   ├── chroma.sqlite3      # Metadata store
│   ├── {collection_id}/    # Vector embeddings
│   └── index/              # HNSW index
└── rag_config.json         # Configuration

Query Testing

After the build, I3 tests retrieval using the research question:

# Test query
results = vectorstore.similarity_search(
    research_question,
    k=5
)

# Report results
for doc in results:
    print(f"- {doc.metadata['title']} ({doc.metadata['year']})")
    print(f"  Preview: {doc.page_content[:150]}...")

Auto-Trigger Keywords

| Keywords (EN) | Keywords (KR) | Action |
| --- | --- | --- |
| build RAG, create vector database | RAG 구축, 벡터 DB | Activate I3 |
| download PDFs | PDF 다운로드 | Activate I3 |
| embed documents | 문서 임베딩 | Activate I3 |
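
A case-insensitive substring match is one simple way to act on these keywords. How the trigger is actually evaluated is not documented here, so the sketch below is purely illustrative, and `should_activate_i3` is a hypothetical name.

```python
# Keywords copied from the table above, lowercased for matching.
TRIGGER_KEYWORDS = (
    "build rag",
    "create vector database",
    "download pdfs",
    "embed documents",
    "rag 구축",
    "벡터 db",
    "pdf 다운로드",
    "문서 임베딩",
)


def should_activate_i3(message: str) -> bool:
    """Return True if the message contains any I3 trigger keyword."""
    text = message.casefold()
    return any(keyword in text for keyword in TRIGGER_KEYWORDS)


print(should_activate_i3("Please build RAG from the screened papers"))
print(should_activate_i3("Summarize the abstract"))
```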

Absorbed Capabilities (v11.0)

From B5 — Parallel Document Processor

  • Distributed Workload Splitting: Partition PDF collection into balanced worker batches by file size, configurable worker count (default: CPU cores - 1, max: 8), dynamic rebalancing
  • High-Throughput PDF Reading: Parallel text extraction using multiprocessing Pool, per-worker memory limits (default: 2GB), automatic fallback (PyMuPDF -> pdfplumber -> OCR), streaming mode for PDFs > 50MB
  • Batch Extraction Pipeline: Pool-based parallel processing with configurable chunk size and overlap
  • Performance Targets: <50 PDFs sequential (<5 min), 50-200 PDFs 4 workers (<10 min), 200-500 PDFs 6 workers (<20 min), 500+ PDFs 8 workers (<45 min)
  • Error Handling in Parallel Mode: Failed PDFs logged without halting other workers, retry queue for transient failures, checkpoint files for resuming interrupted processing
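
The workload splitting described above can be approximated with a greedy heuristic: sort files from largest to smallest and always hand the next file to the least loaded worker. This sketch is one possible reading of "balanced worker batches by file size", not the skill's implementation; the function names are hypothetical and dynamic rebalancing is not shown.

```python
import heapq
import os
from typing import Dict, List, Optional


def default_worker_count(max_workers: int = 8) -> int:
    """CPU cores minus one, capped at max_workers and never below one."""
    cores = os.cpu_count() or 1
    return max(1, min(cores - 1, max_workers))


def split_by_size(file_sizes: Dict[str, int], workers: Optional[int] = None) -> List[List[str]]:
    """Greedily assign files, largest first, to the currently lightest batch."""
    workers = workers or default_worker_count()
    heap = [(0, idx) for idx in range(workers)]  # (total bytes, batch index)
    batches: List[List[str]] = [[] for _ in range(workers)]
    for name, size in sorted(file_sizes.items(), key=lambda kv: kv[1], reverse=True):
        load, idx = heapq.heappop(heap)
        batches[idx].append(name)
        heapq.heappush(heap, (load + size, idx))
    return batches


sizes_mb = {"a.pdf": 50, "b.pdf": 30, "c.pdf": 20, "d.pdf": 10, "e.pdf": 10}
print(split_by_size(sizes_mb, workers=2))
```

Each resulting batch could then be passed to a `multiprocessing.Pool` worker, with empty batches skipped when there are fewer files than workers.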

Error Handling

| Error | Action |
| --- | --- |
| PDF corrupt | Skip, log to failed list |
| OCR needed | Fall back to pytesseract |
| Memory limit | Process in batches |
| Embedding timeout | Retry with smaller batch |
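
The "retry with smaller batch" action can be sketched as a loop that halves the batch size whenever the embedding call times out. The names below are hypothetical, `embed_fn` stands in for whatever callable produces embeddings, and the sketch assumes that callable raises the built-in `TimeoutError`.

```python
from typing import Callable, List, Sequence


def embed_with_shrinking_batches(
    texts: Sequence[str],
    embed_fn: Callable[[Sequence[str]], List[list]],
    batch_size: int = 64,
    min_batch_size: int = 1,
) -> List[list]:
    """Embed texts in batches, halving the batch size after each timeout."""
    vectors: List[list] = []
    position = 0
    while position < len(texts):
        batch = texts[position:position + batch_size]
        try:
            vectors.extend(embed_fn(batch))
            position += len(batch)
        except TimeoutError:
            if batch_size <= min_batch_size:
                raise  # cannot shrink further; surface the failure
            batch_size = max(min_batch_size, batch_size // 2)
    return vectors


def fake_embed(batch):
    """Stand-in embedder that times out on batches larger than four texts."""
    if len(batch) > 4:
        raise TimeoutError("batch too large")
    return [[float(len(text))] for text in batch]


chunks = [f"chunk {i}" for i in range(10)]
print(len(embed_with_shrinking_batches(chunks, fake_embed, batch_size=16)))
```

Processing in fixed-size batches also bounds peak memory, which corresponds to the "Memory limit" row in the same table.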

Dependencies

requires: ["I2-screening-assistant"]
sequential_next: []
parallel_compatible: []

Related Agents

  • I0-review-pipeline-orchestrator: Pipeline coordination
  • I2-screening-assistant: PRISMA screening