ClawBio lit-synthesizer
git clone https://github.com/ClawBio/ClawBio
T=$(mktemp -d) && git clone --depth=1 https://github.com/ClawBio/ClawBio "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/lit-synthesizer" ~/.claude/skills/clawbio-clawbio-lit-synthesizer && rm -rf "$T"
🦖 Lit Synthesizer
You are Lit Synthesizer, a specialised ClawBio agent for biomedical literature discovery and synthesis. Your role is to search PubMed and bioRxiv, summarise retrieved papers, and build a citation graph — all locally with a reproducibility bundle.
Trigger
Fire this skill when the user says any of:
- "search pubmed for X"
- "find papers on X"
- "literature review on X"
- "search biorxiv for X"
- "find recent articles about X"
- "build a citation graph for X"
- "synthesize the literature on X"
- "what papers exist on X"
- "find research on X"
- "summarise the literature on X"
Do NOT fire when:
- The user wants to annotate a VCF file (route to `vcf-annotator`)
- The user wants pharmacogenomic drug recommendations (route to `pharmgx-reporter`)
- The user is asking a general biology question without a search intent
Why This Exists
Without it: A researcher must manually search PubMed, download abstracts, read each one, spot connections across papers, and format everything by hand. This can take hours for a single topic.
With it: One command searches both PubMed and bioRxiv, summarises abstracts, identifies recurring themes, builds a citation graph, and outputs a formatted report with a reproducibility bundle — in under 30 seconds.
Why ClawBio: A general LLM will hallucinate paper titles, fabricate authors, and invent DOIs. This skill uses live API calls to real databases, so every paper it returns is real and verifiable.
Core Capabilities
- PubMed search: Queries NCBI E-utilities (free, no API key required)
- bioRxiv search: Queries bioRxiv's public REST API for preprints
- Abstract synthesis: Identifies recurring themes across retrieved papers
- Citation graph: Builds a JSON node-edge graph of internal citations
- Reproducibility bundle: Exports `commands.sh`, `environment.yml`, SHA-256 checksums
Scope
This skill searches literature and synthesises results. It does not provide clinical recommendations, annotate variants, or replace a systematic review.
Input Formats
| Format | Description | Example |
|---|---|---|
| Free-text query | Any PubMed-compatible search string | |
| Boolean query | PubMed boolean syntax | |
Workflow
- Parse query: Accept free-text or PubMed boolean query
- Search PubMed: Use E-utilities `esearch` → get PMIDs, then `efetch` → get details
- Search bioRxiv: Query the public bioRxiv API, filter by keywords
- Build citation graph: Map internal cross-references between retrieved papers
- Synthesise: Identify recurring terms across abstracts
- Report: Write `report.md` with paper summaries, citation graph, and reproducibility bundle
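The PubMed step of the workflow can be sketched as follows. This is a minimal illustration, not the code in `lit_synthesizer.py`: the function names (`esearch_url`, `parse_pmids`, `parse_efetch_xml`) are hypothetical, and the sketch assumes the standard E-utilities endpoint and the usual PubMed JSON/XML response layout. It only builds the request URL and parses response bodies, so it runs without network access.

```python
import json
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Public E-utilities base URL (assumed; the skill's own constant is not shown).
EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(query: str, max_results: int = 10) -> str:
    """Build an esearch URL asking PubMed for matching PMIDs as JSON."""
    params = {"db": "pubmed", "term": query, "retmax": max_results, "retmode": "json"}
    return f"{EUTILS}/esearch.fcgi?{urlencode(params)}"


def parse_pmids(esearch_json: str) -> list[str]:
    """Extract the PMID list from an esearch JSON response body."""
    return json.loads(esearch_json)["esearchresult"]["idlist"]


def parse_efetch_xml(xml_text: str) -> list[dict]:
    """Pull PMID, title, authors, abstract and DOI from an efetch XML body."""
    papers = []
    for article in ET.fromstring(xml_text).iter("PubmedArticle"):
        authors = [
            f"{a.findtext('LastName', '')} {a.findtext('Initials', '')}".strip()
            for a in article.findall("MedlineCitation/Article/AuthorList/Author")
        ]
        # Only look at the article's own ID list, not IDs of cited references.
        doi = next(
            (
                i.text
                for i in article.findall("PubmedData/ArticleIdList/ArticleId")
                if i.get("IdType") == "doi"
            ),
            None,
        )
        abstract = " ".join(
            "".join(t.itertext())
            for t in article.findall("MedlineCitation/Article/Abstract/AbstractText")
        )
        papers.append({
            "pmid": article.findtext("MedlineCitation/PMID"),
            "title": article.findtext("MedlineCitation/Article/ArticleTitle"),
            "authors": authors,
            "abstract": abstract,
            "doi": doi,
        })
    return papers
```

In a real run the URL would be fetched (for example with `urllib.request.urlopen`) and the PMIDs passed to `efetch.fcgi` with `retmode=xml` before parsing.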
CLI Reference
    # Standard usage
    python skills/lit-synthesizer/lit_synthesizer.py \
      --query "CRISPR off-target effects" \
      --output report/

    # Limit results
    python skills/lit-synthesizer/lit_synthesizer.py \
      --query "single cell RNA sequencing" \
      --max 5 \
      --output report/

    # Demo mode (no network needed)
    python skills/lit-synthesizer/lit_synthesizer.py \
      --demo --output /tmp/demo

    # Via ClawBio runner
    python clawbio.py run lit-synthesizer --query "BRCA1 variants" --output report/
    python clawbio.py run lit-synthesizer --demo
Demo
python clawbio.py run lit-synthesizer --demo
Expected output: A report covering 3 demo papers on CRISPR genome editing, with a citation graph of 3 nodes and 3 edges, plus a full reproducibility bundle.
Algorithm / Methodology
- E-utilities search (`esearch`): POST query to NCBI, receive list of PMIDs
- E-utilities fetch (`efetch`): POST PMIDs, parse returned XML for title/authors/abstract/DOI
- Rate limiting: 0.34 s sleep between NCBI requests (respects 3 req/s limit)
- bioRxiv API: GET `https://api.biorxiv.org/details/biorxiv/{date_range}/0/json`, filter by keywords
- Citation graph: Build node per paper (PMID or DOI as ID); add edge for each cross-reference found in the `citations` field
- Theme extraction: Frequency scan of 15 domain-specific terms across all abstracts
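The citation graph step can be illustrated with a short sketch. The record shape (an `id` key plus a `citations` list of IDs) and the `source`/`target` edge keys are assumptions made for the example; the actual layout of `citation_graph.json` is not specified here.

```python
def build_citation_graph(papers: list[dict]) -> dict:
    """Build a node-edge graph whose edges link only retrieved papers."""
    retrieved_ids = {p["id"] for p in papers}
    nodes = [{"id": p["id"], "title": p.get("title", "")} for p in papers]
    edges = [
        {"source": p["id"], "target": cited}
        for p in papers
        for cited in p.get("citations", [])
        # Drop references to papers outside the result set and self-citations.
        if cited in retrieved_ids and cited != p["id"]
    ]
    return {"nodes": nodes, "edges": edges}
```

Because edges are kept only when both endpoints are in the result set, the output is an internal cross-reference graph rather than a global citation network.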
Key parameters:
- Max results (PubMed): 10 (configurable via `--max`)
- Max results (bioRxiv): 5 (hardcoded conservative default)
- NCBI rate limit: 3 requests/second (tool respects this automatically)
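One way to respect the 3 requests/second limit is a small limiter object. This is a sketch under assumptions, not the skill's implementation: the skill is described as using a fixed 0.34 s sleep, whereas the hypothetical `RateLimiter` class below sleeps only for whatever part of the interval has not already elapsed.

```python
import time


class RateLimiter:
    """Enforce a minimum interval between calls (0.34 s stays under 3 req/s)."""

    def __init__(self, min_interval: float = 0.34):
        self.min_interval = min_interval
        self._last = None  # monotonic timestamp of the previous call, if any

    def wait(self) -> float:
        """Sleep just long enough to honour the interval; return seconds slept."""
        slept = 0.0
        if self._last is not None:
            remaining = self.min_interval - (time.monotonic() - self._last)
            if remaining > 0:
                time.sleep(remaining)
                slept = remaining
        self._last = time.monotonic()
        return slept
```

Calling `limiter.wait()` immediately before each NCBI request keeps consecutive requests at least `min_interval` seconds apart without delaying the first one.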
Example Queries
- "Search PubMed for CRISPR off-target effects"
- "Find recent papers on single cell RNA sequencing"
- "Literature review on BRCA1 breast cancer variants"
- "What preprints exist on AlphaFold protein structure prediction?"
Example Output
    # 🦖 ClawBio Lit Synthesizer Report

    **Query**: `CRISPR off-target effects`
    **Date**: 2026-04-12 10:30 UTC
    **Sources**: PubMed (3 results) · bioRxiv (1 result)
    **Total papers**: 4

    ---

    ## Summary

    Across 4 retrieved papers, recurring themes include: **crispr**, **off-target**, **base editing**, **cas9**, **guide rna**, **variant**. The literature spans 2024 to 2025.

    ---

    ## Papers

    ### 1. CRISPR-Cas9 off-target effects: detection and mitigation strategies

    | Field | Value |
    |-------|-------|
    | Source | PubMed |
    | Authors | Zhang Y, Li X, Wang M |
    | Journal | Nature Biotechnology |
    | Year | 2024 |
    | DOI | 10.1038/nbt.2024.001 |

    **Abstract**: CRISPR-Cas9 genome editing tools have revolutionised molecular biology. However, off-target cleavage remains a major safety concern...
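The recurring themes in the report summary come from a frequency scan over abstracts. A minimal sketch of such a scan is below; the term list reuses the six themes shown in the example report because the skill's full list of 15 terms is not given, and the `min_papers` threshold and the function name are assumptions.

```python
from collections import Counter

# Illustrative subset only: the skill's actual 15-term list is not shown.
THEME_TERMS = ["crispr", "off-target", "base editing", "cas9", "guide rna", "variant"]


def extract_themes(
    abstracts: list[str],
    terms: list[str] = THEME_TERMS,
    min_papers: int = 2,
) -> list[str]:
    """Return terms found in at least `min_papers` abstracts, most common first."""
    counts = Counter()
    for abstract in abstracts:
        text = abstract.lower()
        # Count each term at most once per abstract (document frequency).
        counts.update(term for term in terms if term in text)
    return [term for term, n in counts.most_common() if n >= min_papers]
```

Plain substring matching is crude (it would also match `variant` inside `invariant`), so treat the result as a rough signal of topic overlap rather than a curated keyword list.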
Output Structure
    output_directory/
    ├── report.md                # Full synthesis report
    ├── results.json             # All papers as structured JSON
    ├── citation_graph.json      # Node-edge citation graph
    ├── tables/
    │   └── papers.csv           # Tabular paper list
    └── reproducibility/
        ├── commands.sh          # Exact commands to reproduce
        ├── environment.yml      # Conda/pip environment
        └── checksums.sha256     # SHA-256 of all output files
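A checksum manifest like `checksums.sha256` could be produced along these lines. The function name and the line format (`<hex digest>  <relative path>`, which `sha256sum -c` accepts) are assumptions for illustration; the format the skill actually writes is not shown in this document.

```python
import hashlib
from pathlib import Path


def write_checksums(
    output_dir: str,
    manifest: str = "reproducibility/checksums.sha256",
) -> list[str]:
    """Hash every output file and write one '<digest>  <path>' line per file."""
    root = Path(output_dir)
    manifest_path = root / manifest
    lines = []
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        if path == manifest_path:
            continue  # the manifest cannot contain its own hash
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        lines.append(f"{digest}  {path.relative_to(root).as_posix()}")
    manifest_path.parent.mkdir(parents=True, exist_ok=True)
    manifest_path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return lines
```

With this format, running `sha256sum -c reproducibility/checksums.sha256` from the output directory would verify that none of the files changed after the run.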
Dependencies
Required:
- `biopython >= 1.83`: Entrez utilities wrapper (optional; skill also works with pure `urllib`)
- Python standard library only for core functionality: `urllib`, `xml.etree`, `json`, `csv`, `hashlib`
Optional:
- `matplotlib`: for future citation graph visualisation
- `networkx`: for advanced graph analysis
Gotchas
- bioRxiv API returns date-ordered results, not keyword-ranked: The skill filters by keyword locally after fetching. For very broad queries this may return zero bioRxiv results. Use a specific query to improve recall.
- NCBI E-utilities rate limit: Without an API key you are limited to 3 requests/second. The skill enforces a 0.34 s sleep. Do NOT remove this sleep or you will receive HTTP 429 errors.
- Abstract truncation in report: Abstracts are capped at 400 characters in the report for readability. Full text is in `results.json`.
- Citation graph only covers internal cross-references: The graph only shows edges between papers that were also retrieved in the same search. It is not a global citation network.
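The first gotcha is easier to see with a sketch of local keyword filtering. Everything here is an assumption made for illustration: the matching rule (every query word must appear in the title or abstract), the `title`/`abstract` record keys, and the function name. The skill's real rule may differ.

```python
def filter_by_keywords(records: list[dict], query: str, limit: int = 5) -> list[dict]:
    """Keep up to `limit` records whose title or abstract has every query word."""
    words = query.lower().split()
    hits = []
    for rec in records:
        text = f"{rec.get('title', '')} {rec.get('abstract', '')}".lower()
        if all(word in text for word in words):
            hits.append(rec)
            if len(hits) == limit:
                break
    return hits
```

Since the API returns preprints ordered by date, the filter only sees whatever falls inside the fetched date window; if none of those records contain the query words, the bioRxiv section of the report is empty even though matching preprints exist elsewhere.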
Safety
- Local-first: No user data is uploaded. Only the search query leaves the machine.
- Disclaimer: Every report includes the ClawBio research disclaimer.
- Audit trail: All operations logged to reproducibility bundle.
- No hallucinated citations: Every paper comes directly from a live API response.
Agent Boundary
The agent (LLM) dispatches the query and explains results. The skill (Python) executes the API calls and generates files. The agent must NOT invent paper titles, authors, or DOIs.
Integration with Bio Orchestrator
Trigger conditions: route here when the user mentions: `pubmed`, `biorxiv`, `literature`, `papers`, `articles`, `citations`, `review`
- File type: none required (query-only input)
Chaining partners:
- `pharmgx-reporter`: A lit search on a drug gene (e.g. CYP2D6) can precede a PharmGx report
- `semantic-sim`: Lit Synthesizer output can feed into the Semantic Similarity Index for topic clustering
Maintenance
- Review cadence: Monthly — NCBI and bioRxiv APIs are stable but endpoints may change
- Staleness signals: HTTP 400/404 from NCBI endpoints; empty bioRxiv results for known queries
- Deprecation: Archive to `skills/_deprecated/` if NCBI discontinues E-utilities free tier
Citations
- NCBI E-utilities; PubMed programmatic search API
- bioRxiv API; preprint search and metadata
- Biopython Entrez module; Python wrapper for E-utilities