# Claude-Code-Scientist: literature-search

Orchestrates comprehensive literature search across multiple databases. Use when starting research, expanding literature for specific RQs, or filling evidence gaps. Implements a triple-search strategy with citation expansion.

Install:

```bash
# Option 1: clone the full repo
git clone https://github.com/rhowardstone/Claude-Code-Scientist

# Option 2: install only this skill into ~/.claude/skills
T=$(mktemp -d) && git clone --depth=1 https://github.com/rhowardstone/Claude-Code-Scientist "$T" && mkdir -p ~/.claude/skills && cp -r "$T/.claude/skills/literature-search" ~/.claude/skills/rhowardstone-claude-code-scientist-literature-search && rm -rf "$T"
```

From `.claude/skills/literature-search/SKILL.md`:

# Literature Search Workflow
Execute comprehensive literature search for research questions.
## ⛔ STOP - READ THIS BEFORE DOING ANYTHING ⛔

DO NOT CALL MCP TOOLS DIRECTLY. DO NOT CALL `search_openalex`. DO NOT CALL `search_pubmed`.

If you are about to call an MCP literature tool, STOP. You are doing it wrong.

MANDATORY: Use the bulk Python pipeline instead.
## WHY THIS MATTERS
Each MCP call returns ~12,000 tokens directly into your context. 5 searches × 12k tokens = 60k tokens = CONTEXT EXHAUSTION = COMPACTION = LOST WORK.
The Python pipeline runs OUTSIDE your context:
- Searches 3 databases in parallel
- Downloads PDFs in bulk
- Extracts structured sections
- Saves to JSON files
- You read only the summaries
VIOLATION OF THIS RULE WILL CAUSE SESSION FAILURE.
## CRITICAL: Token Conservation Architecture
YOU ARE THE ORCHESTRATOR, NOT THE READER.
The main conversation context is EXPENSIVE. Every token you consume here is money burned.
THE POWER IS IN THE PYTHON PIPELINE, NOT MANUAL MCP CALLS.
Craig has a bulk literature acquisition pipeline that can process 1000+ papers in ~2 hours:
- Parallel PDF downloads across multiple sources (Unpaywall, arXiv, PMC, bioRxiv, etc.)
- PyMuPDF4LLM / Marker AI for text extraction
- Structured section extraction (abstract, intro, methods, results, discussion → JSON)
- Pre-reading that lit scouts can query with `jq` instead of reading full papers
MANDATORY WORKFLOW:

- Save RQs to `$SESSION_DIR/rqs.json` (goal decomposition does this)
- Run: `./scripts/run_literature_pipeline.sh $SESSION_DIR`
- Pipeline creates pre-read structured JSON files + `pipeline.log` with detailed progress
- Spawn lit-scout subagents to query the structured JSON
- Lit scouts use `jq` to extract sections, NOT read raw papers

Note: `$SESSION_DIR` is set by `./session.sh` for parallel-safe operation; it falls back to `workspace/current`.
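End to end, the mandatory path is short enough to sketch in full (a minimal sketch, assuming `./session.sh` exports `SESSION_DIR` into the shell; the fallback comes from the note above):

```bash
# Minimal end-to-end sketch.
# ASSUMPTION: ./session.sh exports SESSION_DIR; the documented fallback is workspace/current.
SESSION_DIR="${SESSION_DIR:-workspace/current}"
mkdir -p "$SESSION_DIR"

# 1. RQs (normally written by goal decomposition)
cat > "$SESSION_DIR/rqs.json" <<'EOF'
{"research_questions": [{"id": "RQ1", "question": "What is the effect of X on Y?"}]}
EOF

# 2. Bulk pipeline (searches, downloads, pre-reads -- all outside your context)
./scripts/run_literature_pipeline.sh "$SESSION_DIR"

# 3. Confirm outputs before spawning lit scouts
jq '.paper_count' "$SESSION_DIR/literature/raw_papers.json"
```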
YOU MUST NOT:
- Read full paper text in main context (15k+ tokens per paper = WASTE)
- Process paper content directly (that's what lit-scouts do)
- Do detailed evidence extraction yourself (delegate to subagents)
- Call MCP tools one-by-one for 50+ papers (use bulk pipeline instead)
## Step 1: Save Research Questions

Create `$SESSION_DIR/rqs.json`:

```json
{
  "research_questions": [
    {"id": "RQ1", "question": "What is the effect of X on Y?"},
    {"id": "RQ2", "question": "How does Z compare to W?"}
  ]
}
```
## Step 2: Run Bulk Literature Pipeline
USE THE HELPER SCRIPT - IT HANDLES EVERYTHING:

```bash
./scripts/run_literature_pipeline.sh $SESSION_DIR
```
For long searches, run in background:
```bash
./scripts/run_literature_pipeline.sh $SESSION_DIR --background
# Monitor: tail -f $SESSION_DIR/literature/pipeline.log
```
DO NOT construct complex multiline bash commands. The script handles PYTHONPATH, validation, and error reporting.
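If you need to block until a background run finishes, a simple poll works. One caveat: the exact completion marker is not documented here; `prisma_flow.json` is listed among the pipeline's final outputs, so its appearance is used as the signal below.

```bash
# Wait for the background pipeline to finish.
# ASSUMPTION: prisma_flow.json is written near the end of the run;
# if it never appears, check $SESSION_DIR/literature/pipeline.log for errors.
until [ -f "$SESSION_DIR/literature/prisma_flow.json" ]; do
  sleep 30
done
echo "Pipeline done: $(jq '.paper_count' "$SESSION_DIR/literature/raw_papers.json") papers found"
```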
This produces:

- `$SESSION_DIR/literature/raw_papers.json`: all discovered papers
- `$SESSION_DIR/literature/preread_papers.json`: papers with structured sections
- `$SESSION_DIR/literature/subsets/RQ1_papers.json`: papers per RQ for lit scouts
- `$SESSION_DIR/literature/prisma_flow.json`: PRISMA-style counts
JSON Schema (IMPORTANT - don't guess, use this):

```jsonc
// raw_papers.json and preread_papers.json structure:
{
  "papers": [            // <-- Access via data['papers'], NOT data[:10]
    {
      "doi": "10.1234/example",
      "title": "Paper title",
      "authors": ["Last, First", ...],
      "year": 2024,
      "abstract": "...",
      "source": "openalex|pubmed|semantic_scholar",
      "sections": {      // Only in preread_papers.json
        "abstract": "...",
        "introduction": "...",
        "methods": "...",
        "results": "...",
        "discussion": "..."
      }
    }
  ],
  "paper_count": 47
}
```
Query examples:
```bash
jq '.paper_count' raw_papers.json              # Get count
jq '.papers[:5] | .[].title' raw_papers.json   # First 5 titles
jq '.papers[] | select(.doi)' raw_papers.json  # Papers with DOIs
```
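A few more patterns over the same schema are handy for triaging the corpus before spawning scouts (field names taken from the schema above):

```bash
# Count papers from 2022 onward
jq '[.papers[] | select(.year >= 2022)] | length' raw_papers.json

# Source and title, tab-separated, sorted by source
jq -r '.papers[] | [.source, .title] | @tsv' raw_papers.json | sort

# Papers per source database, e.g. {"openalex": 30, "pubmed": 12, ...}
jq '[.papers[].source] | group_by(.) | map({(.[0]): length}) | add' raw_papers.json
```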
Alternative: Step-by-step if you need control:
```bash
# Set PYTHONPATH once for the session
export PYTHONPATH="$HOME/.craig:$PYTHONPATH"

# 1. Search only (replace with your actual search query)
python3 -m craig.cli.literature_pipeline search "sentence embedding retrieval" \
  --max-papers 100 \
  --output $SESSION_DIR/literature/search_results.json

# 2. Pre-read separately (bulk parallel download + extraction)
python3 -m craig.cli.literature_pipeline preread \
  $SESSION_DIR/literature/search_results.json \
  --output $SESSION_DIR/literature/preread_papers.json \
  --concurrent 10
```
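Between the two steps, a quick triage of the search hits can save download time (this assumes `search_results.json` follows the same schema as `raw_papers.json` above; adjust field names if it differs):

```bash
# How many hits, and a skim of the first ten titles
jq '.paper_count' "$SESSION_DIR/literature/search_results.json"
jq -r '.papers[:10][] | "\(.year)  \(.title)"' "$SESSION_DIR/literature/search_results.json"
```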
## Step 3: Spawn Lit Scouts
After the pipeline completes, spawn lit-scout subagents to analyze the pre-read papers.
The key insight: lit scouts DON'T read raw PDFs. They query the pre-read structured JSON with `jq`:

```bash
# Lit scout queries pre-read paper sections
jq '.papers[0].sections.results' $SESSION_DIR/literature/preread_papers.json
jq '.papers[] | select(.doi == "10.1234/abc") | .sections.methods' preread_papers.json
```
USE THE TASK TOOL WITH THESE EXACT PARAMETERS:
Task tool call:

- subagent_type: "lit-scout"
- model: "haiku" <-- COST SAVINGS: Haiku is perfect for structured extraction
- run_in_background: true
- description: "Lit scout: [RQ theme]"
- prompt: see below
Example prompt for the Task tool:

```
You are lit-scout-1, analyzing papers for RQ1.

Your data is PRE-READ - you do NOT need to download or extract PDFs.

Your assignment: $SESSION_DIR/literature/subsets/RQ1_papers.json

This file contains pre-read papers with structured sections:
- sections.abstract
- sections.introduction
- sections.methods
- sections.results
- sections.discussion
- sections.conclusion

Use `jq` to query specific sections efficiently:
  jq '.papers[0].sections.results' RQ1_papers.json
  jq '.papers[] | .title, .sections.abstract' RQ1_papers.json

For each paper, extract 2-5 claims with full provenance:
{
  "claim_text": "Specific finding",
  "source_doi": "10.xxxx/xxxxx",
  "quote": "Exact text from paper",
  "section": "results",
  "confidence": 0.9
}

Output to: $SESSION_DIR/literature/evidence/RQ1_evidence.json

Research Question to address:
- RQ1: [question text]
```
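When a scout reports back, spot-check its evidence file. Hedged: the prompt above fixes the claim-object shape but not the container, so the snippet below assumes claims land in a top-level JSON array.

```bash
# ASSUMPTION: RQ1_evidence.json is a top-level array of claim objects
jq 'length' "$SESSION_DIR/literature/evidence/RQ1_evidence.json"
jq '.[0] | {claim_text, source_doi, section, confidence}' \
  "$SESSION_DIR/literature/evidence/RQ1_evidence.json"
```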
## Agent Scaling
Spawn 1-3 lit scouts depending on paper volume:
- <30 papers: 1 scout
- 30-100 papers: 2 scouts
- >100 papers: 3 scouts (max, due to concurrency limits)
Launch agents in parallel by making multiple Task tool calls in a single message.
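The scaling rule is mechanical enough to script; a sketch, with thresholds copied from the list above:

```bash
# Pick the scout count from the pipeline's paper count
N=$(jq '.paper_count' "$SESSION_DIR/literature/raw_papers.json")
if   [ "$N" -lt 30 ];  then SCOUTS=1
elif [ "$N" -le 100 ]; then SCOUTS=2
else                        SCOUTS=3   # max, due to concurrency limits
fi
echo "Spawn $SCOUTS lit scout(s) for $N papers"
```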
## What the Pipeline Does (Under the Hood)
The `craig.cli.literature_pipeline` wraps Craig's full infrastructure:

1. Triple Search (per RQ):
   - Keyword searches via OpenAlex + PubMed
   - Natural language "Google the question" via Semantic Scholar embeddings
   - Citation graph expansion (forward + backward citations)
2. Deduplication: by DOI and title similarity
3. Bulk PDF Acquisition (parallel, with fallbacks):
   - Unpaywall (open access finder)
   - bioRxiv / medRxiv (preprints)
   - arXiv
   - PMC (PubMed Central)
   - Europe PMC
   - OA aggregators (CORE, BASE, DOAJ)
4. Pre-Reading (structured extraction):
   - PyMuPDF4LLM (optimized for LLM consumption)
   - Marker AI (AI-powered, best for scientific papers)
   - PyMuPDF / pdfplumber (fallbacks)
   - Section detection (abstract, intro, methods, results, discussion)
   - Figure/table caption extraction
5. Caching: PDFs cached at `~/.craig/pdf-cache/`, text cached separately
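If downloads seem slow or repeated, peek at the cache. The PDF cache path is documented above; where extracted text is cached is not stated here, so check your craig install for that location.

```bash
# How big is the PDF cache, and what's in it?
du -sh ~/.craig/pdf-cache/
ls ~/.craig/pdf-cache/ | head
```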
## Output Tracking
After pipeline completes, verify outputs:
```bash
ls -la $SESSION_DIR/literature/
# Expected:
#   raw_papers.json      - All discovered papers
#   preread_papers.json  - Papers with structured sections
#   prisma_flow.json     - PRISMA-style counts
#   subsets/             - Per-RQ paper subsets for lit scouts

jq '.paper_count' $SESSION_DIR/literature/raw_papers.json
jq '.successful' $SESSION_DIR/literature/preread_papers.json
```
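One more check worth running: that every RQ actually got a subset file. RQ ids come from `rqs.json`, and the `<id>_papers.json` naming follows the `RQ1_papers.json` example above.

```bash
# Verify a subset file exists for each research question
for id in $(jq -r '.research_questions[].id' "$SESSION_DIR/rqs.json"); do
  f="$SESSION_DIR/literature/subsets/${id}_papers.json"
  [ -f "$f" ] && echo "OK       $f" || echo "MISSING  $f"
done
```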
## World Model Updates
Update world model with PRISMA-style flow:
```bash
# Read PRISMA flow from pipeline output
cat $SESSION_DIR/literature/prisma_flow.json
```
## Completion Criteria
You have NOT completed literature search until:
- RQs saved to `$SESSION_DIR/rqs.json`
- Bulk pipeline run: `./scripts/run_literature_pipeline.sh $SESSION_DIR`
- Pre-read papers available in `$SESSION_DIR/literature/preread_papers.json`
- Per-RQ subsets in `$SESSION_DIR/literature/subsets/`
- Lit-scout subagents SPAWNED via Task tool with `run_in_background: true`
- PRISMA flow tracked
If you're calling MCP tools one-by-one for 50+ papers, you're doing it WRONG. If you're reading full paper text in main context, you're doing it WRONG.
Use the bulk pipeline. Spawn lit-scouts. Let them query structured JSON.