SciAgent-Skills scientific-literature-search
Systematic strategies for searching, retrieving, and analyzing scientific literature across PubMed, arXiv, Google Scholar, and AI-assisted tools. Covers the PICO framework for clinical question formulation, three-tiered search strategy (database-specific, AI-assisted, content extraction), PubMed field tags and MeSH vocabulary, boolean query construction, and full-text extraction workflows. Consult this guide when planning a literature search, constructing database queries, or deciding which search tier to use for a given research question.
```shell
git clone https://github.com/jaechang-hits/SciAgent-Skills
```

```shell
T=$(mktemp -d) && git clone --depth=1 https://github.com/jaechang-hits/SciAgent-Skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/scientific-writing/scientific-literature-search" ~/.claude/skills/jaechang-hits-sciagent-skills-scientific-literature-search && rm -rf "$T"
```
skills/scientific-writing/scientific-literature-search/SKILL.md

Scientific Literature Search
Overview
Scientific literature search is the foundation of evidence-based research. A well-executed search maximizes recall (finding all relevant papers) while maintaining precision (avoiding irrelevant results). This guide provides a systematic approach that combines database-specific query strategies, AI-assisted synthesis, and direct content extraction, organized into a three-tiered framework that scales from targeted lookups to comprehensive landscape reviews.
Key Concepts
The PICO Framework
For clinical and biomedical questions, structure queries using the PICO framework:
- P (Population): Who are you studying? (e.g., "Diabetes Mellitus"[MeSH])
- I (Intervention): What treatment or exposure? (e.g., "Metformin"[MeSH])
- C (Comparison): What is the alternative? (e.g., placebo, standard care)
- O (Outcome): What result are you measuring? (e.g., "Cardiovascular Diseases"[MeSH])
PICO queries can be combined with publication type filters to target specific evidence levels:
```
"Diabetes Mellitus"[MeSH] AND "Metformin"[MeSH] AND "Cardiovascular Diseases"[MeSH] AND ("clinical trial"[Publication Type] OR "meta-analysis"[Publication Type])
```
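Assembling such a query from PICO components can be sketched as a small helper; `build_pico_query` is a hypothetical function for illustration, not part of the toolkit, and the resulting string would be passed to query_pubmed:

```python
def build_pico_query(population, intervention, outcome, comparison=None,
                     pub_types=None):
    """Combine PICO components into a PubMed boolean query string."""
    parts = [population, intervention]
    if comparison:
        parts.append(comparison)
    parts.append(outcome)
    query = " AND ".join(parts)
    if pub_types:
        # Target specific evidence levels with publication type filters
        type_filter = " OR ".join(
            f'"{t}"[Publication Type]' for t in pub_types
        )
        query += f" AND ({type_filter})"
    return query

query = build_pico_query(
    population='"Diabetes Mellitus"[MeSH]',
    intervention='"Metformin"[MeSH]',
    outcome='"Cardiovascular Diseases"[MeSH]',
    pub_types=["clinical trial", "meta-analysis"],
)
```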
Three-Tiered Search Strategy
Literature search is most effective when approached in tiers of increasing breadth:
Tier 1 -- Database-Specific Searches (Most Reliable)
Query established academic databases (PubMed, arXiv, Google Scholar) for peer-reviewed, indexed content. This is the most reliable tier and should always be the starting point.
- PubMed (query_pubmed): Primary database for biomedical and life science literature. Supports MeSH controlled vocabulary and advanced field tags.
- arXiv (query_arxiv): Preprint server for physics, mathematics, computer science, and quantitative biology. Results appear faster than peer-reviewed journals.
- Google Scholar (query_scholar): Broadest coverage across all academic disciplines. Note: has aggressive rate limits on automated queries.
Best for: finding specific papers, systematic reviews, clinical evidence, preprints.
Tier 2 -- AI-Assisted Web Search (Comprehensive)
Use AI tools (advanced_web_search_claude) to synthesize broader context, identify research trends, and surface recent developments not yet indexed in databases. Also use general web search (search_google) for protocols, tutorials, and software documentation.
Best for: understanding the research landscape, complex multi-faceted questions, finding recent developments, identifying key researchers.
Avoid for: specific paper lookups (use Tier 1), citation counts (use Google Scholar), systematic reviews requiring reproducibility, searches where exact query terms must be documented.
Tier 3 -- Direct Content Extraction (Deep Dive)
Extract and analyze full-text content, PDFs, and supplementary materials from identified papers using extract_url_content, extract_pdf_content, and fetch_supplementary_info_from_doi.
Best for: detailed methodology extraction, data retrieval, protocol identification, supplementary data access.
PubMed Field Tags
PubMed supports field-specific searching to improve precision:
| Tag | Description | Example |
|---|---|---|
| [MeSH] | Medical Subject Heading (controlled vocabulary) | "Diabetes Mellitus"[MeSH] |
| [Title] | Title field only | CRISPR[Title] |
| [Title/Abstract] | Title or abstract | "base editing"[Title/Abstract] |
| [Author] | Author name | Smith J[Author] |
| [Journal] | Journal name | Nature[Journal] |
| [Publication Type] | Article type filter | "clinical trial"[Publication Type] |
| [Date - Publication] | Publication date range | "2022"[Date - Publication]:"2024"[Date - Publication] |
| [MeSH Major Topic] | MeSH term as major focus of the article | "Neoplasms"[MeSH Major Topic] |
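As a sketch of how these tags combine, a query can be assembled from tagged terms; `tag` is a hypothetical helper, and the finished string would be passed to query_pubmed:

```python
def tag(term, field):
    """Wrap a term in PubMed field-tag syntax, e.g. "CRISPR"[Title]."""
    return f'"{term}"[{field}]'

# Disease as a MeSH concept, method phrase restricted to title/abstract,
# then narrowed to a recent publication date range.
query = " AND ".join([
    tag("Neoplasms", "MeSH"),
    tag("immune checkpoint inhibitors", "Title/Abstract"),
])
query += ' AND ("2022"[Date - Publication]:"2024"[Date - Publication])'
```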
Boolean Operators
Boolean operators control how search terms combine:
```python
# AND: All terms must be present -- narrows results
results = query_pubmed("CRISPR AND cancer AND therapy")

# OR: Any term can be present -- broadens results (use for synonyms)
results = query_pubmed("(tumor OR tumour OR neoplasm) AND immunotherapy")

# NOT: Exclude terms -- use sparingly to avoid losing relevant papers
results = query_pubmed("cancer immunotherapy NOT review")
```
Use parentheses to group OR terms together before combining with AND.
arXiv Subject Categories
arXiv organizes preprints by subject category. Biology-related categories include:
| Category | Description |
|---|---|
| q-bio.BM | Biomolecules |
| q-bio.CB | Cell Behavior |
| q-bio.GN | Genomics |
| q-bio.MN | Molecular Networks |
| q-bio.NC | Neurons and Cognition |
| q-bio.QM | Quantitative Methods |
| cs.AI | Artificial Intelligence |
| cs.LG | Machine Learning |
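A category-restricted search can be sketched with the arXiv API's cat: prefix; this assumes query_arxiv forwards the query string to the arXiv API unchanged, and `category_query` is a hypothetical helper:

```python
# Biology-related arXiv categories mapped to their codes
QBIO_CATEGORIES = {
    "Biomolecules": "q-bio.BM",
    "Cell Behavior": "q-bio.CB",
    "Genomics": "q-bio.GN",
    "Molecular Networks": "q-bio.MN",
    "Neurons and Cognition": "q-bio.NC",
    "Quantitative Methods": "q-bio.QM",
}

def category_query(description, keywords):
    """Build an arXiv API query restricted to one q-bio category."""
    return f'cat:{QBIO_CATEGORIES[description]} AND all:{keywords}'

q = category_query("Genomics", '"single cell"')
# q could then be passed to query_arxiv(q, max_papers=10)
```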
Decision Framework
Use this tree to determine which search tier and database to start with:
```
What type of question are you answering?
├── Clinical / biomedical question
│   ├── Specific drug or treatment → Tier 1: PubMed with PICO query
│   ├── Disease mechanism → Tier 1: PubMed with MeSH terms
│   └── Clinical trial evidence → Tier 1: PubMed filtered by Publication Type
├── Computational / quantitative methods
│   ├── ML model or algorithm → Tier 1: arXiv (cs.LG, cs.AI)
│   ├── Computational biology method → Tier 1: arXiv (q-bio.*) + PubMed
│   └── Software tool or pipeline → Tier 2: AI-assisted web search
├── Broad research landscape
│   ├── Current state of a field → Tier 2: AI-assisted web search
│   ├── Recent developments (last 6 months) → Tier 2: AI-assisted web search
│   └── Cross-disciplinary question → Tier 1: Google Scholar + Tier 2
├── Specific paper or data
│   ├── Known paper details → Tier 1: any database by title/author/DOI
│   ├── Methodology or protocol → Tier 3: full-text extraction
│   └── Supplementary data → Tier 3: DOI-based supplementary fetch
└── Protocols / reagents
    ├── Lab protocol → Tier 2: web search for protocols.io, etc.
    └── Validated reagents → Tier 2: AI-assisted web search
```
| Scenario | Recommended Tier and Database | Rationale |
|---|---|---|
| Systematic review of clinical evidence | Tier 1: PubMed with MeSH + publication type filters | Reproducible, documented search strategy required |
| Finding a preprint on a new ML method | Tier 1: arXiv with category and keyword search | Preprints appear on arXiv before journals |
| Understanding the research landscape | Tier 2: AI-assisted web search | Requires synthesis across many sources |
| Extracting a specific protocol from a paper | Tier 3: PDF content extraction | Need full-text access to methods section |
| Finding papers across disciplines | Tier 1: Google Scholar | Broadest coverage across fields |
| Identifying key researchers in a niche area | Tier 2: AI-assisted web search | Requires contextual synthesis |
| Downloading supplementary data tables | Tier 3: DOI-based supplementary fetch | Direct access to supplementary files |
Best Practices
- Use controlled vocabulary (MeSH) for PubMed searches: Free-text searches miss papers that use different terminology. MeSH terms map synonyms to a single concept, improving recall without sacrificing precision.

```python
# Free text misses synonyms
query_pubmed("heart attack treatment")

# MeSH captures all synonyms
query_pubmed('"Myocardial Infarction"[MeSH] AND "Drug Therapy"[MeSH]')
```

- Include synonyms and alternative terms with OR: Scientific concepts often have multiple names (e.g., tumor/tumour/neoplasm). Group synonyms with OR inside parentheses to avoid missing relevant papers.

```python
query_pubmed("(myocardial infarction OR heart attack) AND (treatment OR therapy)")
```

- Use phrase searching for multi-word concepts: Quoting exact phrases prevents the search engine from splitting terms and matching them independently.

```python
query_pubmed('"single cell RNA sequencing" AND methods')
```

- Filter by publication type when seeking specific evidence: Clinical trials, systematic reviews, and meta-analyses each answer different questions. Use [Publication Type] to target the evidence level you need.

```python
query_pubmed("COVID-19 vaccine efficacy AND clinical trial[Publication Type]")
```

- Start broad, then narrow iteratively: Begin with core concepts (2-3 terms) and review initial results. Add specificity based on what you find -- more terms, date ranges, field tags, or publication types.

```python
# Step 1: Broad
results = query_pubmed("CRISPR base editing iPSC", max_papers=20)

# Step 2: Add MeSH and specificity
results = query_pubmed(
    '"CRISPR-Cas Systems"[MeSH] AND "base editing" AND "induced pluripotent stem cells" AND efficiency',
    max_papers=20
)

# Step 3: Filter by date
results = query_pubmed(
    '"CRISPR-Cas Systems"[MeSH] AND "base editing" AND "induced pluripotent stem cells" AND efficiency AND ("2022"[Date - Publication]:"2024"[Date - Publication])',
    max_papers=20
)
```

- Cross-reference multiple databases: No single database covers all literature. Use PubMed for biomedical content, arXiv for computational preprints, and Google Scholar for cross-disciplinary coverage.
- Assess result quality systematically: Evaluate papers for source reliability (peer-reviewed journal), author credentials, recency, study design appropriateness, sample size adequacy, reproducibility, declared conflicts of interest, and citation count.
Common Pitfalls
- Overly long and specific queries: Packing too many terms into a single query causes missed results because all terms must match simultaneously.
- How to avoid: Limit queries to core concepts (3-5 terms). Run separate searches for sub-topics and combine results manually.

```python
# Too specific -- misses relevant papers
query_pubmed("CRISPR Cas9 gene editing HEK293T cells 2024 efficiency optimization delivery")

# Better -- core concepts only
query_pubmed("CRISPR Cas9 gene editing optimization efficiency")
```

- Relying on a single database: PubMed has a biomedical focus, arXiv covers preprints, Google Scholar spans disciplines. Using only one database guarantees blind spots.
- How to avoid: Always search at least two databases. For computational biology, combine PubMed and arXiv. For cross-disciplinary topics, include Google Scholar.
- Ignoring publication dates: Scientific knowledge evolves rapidly. Foundational papers remain relevant, but methods and clinical evidence may be superseded.
- How to avoid: Check publication dates in all results. For methods papers, prefer the last 3-5 years. For foundational concepts, older papers are acceptable but verify with recent reviews.
- Skipping title and abstract review before deep-diving: Not all search results that match keywords are actually relevant. Downloading and reading full texts without screening wastes time.
- How to avoid: Always screen titles and abstracts first. Only extract full text (Tier 3) for papers that pass screening.
- Using NOT operators too aggressively: The NOT operator can inadvertently exclude relevant papers that mention the excluded term in a different context.
- How to avoid: Use NOT sparingly. Prefer adding positive terms to narrow results rather than excluding terms. When you must use NOT, verify that excluded results are genuinely irrelevant.
- Ignoring Google Scholar rate limits: Google Scholar aggressively rate-limits automated queries, which can block further searches.
- How to avoid: Use Google Scholar sparingly. Add delays between requests. Prefer PubMed or arXiv for bulk searching and reserve Google Scholar for cross-disciplinary checks.
- Not documenting the search strategy: For systematic reviews and reproducible research, an undocumented search cannot be verified or reproduced.
- How to avoid: Record your search terms, databases queried, date ranges, and number of results at each stage. This is essential for systematic reviews and good practice for all searches.
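A minimal search log, sufficient for PRISMA-style reporting, can be kept as structured records; this is an illustrative sketch (the `log_search` helper and CSV layout are assumptions, not part of the toolkit):

```python
import csv
from datetime import date

# One row per executed search: enough detail to reproduce the query later
search_log = []

def log_search(database, query, n_results, filters=""):
    """Append a reproducible record of a single search."""
    search_log.append({
        "date": date.today().isoformat(),
        "database": database,
        "query": query,
        "filters": filters,
        "n_results": n_results,
    })

log_search("PubMed", '"Myocardial Infarction"[MeSH]', 152,
           filters="2020-2024, clinical trial")

# Persist the log for the methods section of a systematic review
with open("search_log.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=search_log[0].keys())
    writer.writeheader()
    writer.writerows(search_log)
```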
Workflow
- Step 1: Define the research question
- Identify the main concept, population/model, intervention/method, desired outcome, and time frame
- For clinical questions, map to the PICO framework
- Example: "Find recent papers on CRISPR base editing efficiency in human iPSCs" decomposes to: main concept = CRISPR base editing, model = human iPSCs, outcome = efficiency, time frame = last 3 years
- Step 2: Construct and execute database queries (Tier 1)
- Start with PubMed for biomedical topics, arXiv for computational topics
- Begin with a broad query using 2-3 core terms
- Refine with MeSH terms, field tags, date filters, and publication type filters
```python
from biomni.tool.literature import query_pubmed, query_arxiv, query_scholar

# PubMed: biomedical literature
results = query_pubmed(
    '"CRISPR-Cas Systems"[MeSH] AND "Gene Editing"[MeSH]',
    max_papers=20
)

# arXiv: computational biology preprints
results = query_arxiv("protein structure prediction", max_papers=10)

# Google Scholar: broad cross-disciplinary coverage
result = query_scholar("single cell RNA sequencing analysis methods")
```

- Step 3: Supplement with AI-assisted search (Tier 2)
- Use AI-assisted web search for landscape overviews and recent developments
- Use general web search for protocols, tutorials, and documentation
```python
from biomni.tool.literature import advanced_web_search_claude

results = advanced_web_search_claude(
    "What are the latest developments in CAR-T cell therapy for solid tumors in 2024?",
    max_searches=3
)
```

- Step 4: Evaluate and filter results
- Screen titles and abstracts for relevance
- Prioritize by recency, journal quality, citation count, and study design
- For clinical evidence, prioritize RCTs, systematic reviews, and meta-analyses
- For methods, prioritize protocol papers and method comparisons
- Decision point: If too many results, add more specific terms or filters. If too few, broaden terms and add synonyms.
- Step 5: Deep dive into key papers (Tier 3)
- Extract full text from high-priority papers
- Download supplementary materials for data and protocols
- Check reference lists for additional relevant papers
```python
from biomni.tool.literature import extract_url_content, extract_pdf_content, fetch_supplementary_info_from_doi

# Extract article content from URL
content = extract_url_content("https://www.nature.com/articles/nature12373")

# Extract text from PDF
content = extract_pdf_content("https://arxiv.org/pdf/1706.03762.pdf")

# Download supplementary files using DOI
log = fetch_supplementary_info_from_doi(
    "10.1038/nature12373",
    output_dir="./supplementary_materials"
)
```

- Step 6: Document and iterate
- Record all search terms, databases, filters, and result counts
- If gaps remain, revisit Steps 2-3 with refined queries
- For systematic reviews, follow PRISMA guidelines for reporting
Common Search Scenarios
The following scenarios illustrate how to combine the three tiers for typical research questions.
Finding Methods and Protocols
Start with PubMed for published methodology papers, then supplement with web search for step-by-step protocols from resources like protocols.io.
```python
from biomni.tool.literature import query_pubmed, search_google

# Search for methodology papers in PubMed
results = query_pubmed(
    '"Western Blotting"[MeSH] AND (protocol OR method OR technique)',
    max_papers=10
)

# Check web for step-by-step protocols
results = search_google("Western blot protocol for membrane proteins", num_results=5)
```
Understanding Disease Mechanisms
Begin with review articles for a broad overview, then drill into specific mechanistic studies.
```python
# Find review articles first for an overview
results = query_pubmed(
    '"Alzheimer Disease"[MeSH] AND pathophysiology AND review[Publication Type]',
    max_papers=10
)

# Then find specific mechanistic studies
results = query_pubmed(
    '"Alzheimer Disease"[MeSH] AND ("amyloid beta"[MeSH] OR tau) AND mechanism',
    max_papers=20
)
```
Finding Drug and Treatment Information
Use publication type filters to separate clinical trial evidence from systematic reviews.
```python
# Clinical trials for a specific drug-condition pair
results = query_pubmed(
    '"Drug Name"[Substance Name] AND "Condition"[MeSH] AND clinical trial[Publication Type]',
    max_papers=20
)

# Systematic reviews and meta-analyses
results = query_pubmed(
    '"Drug Name" AND "Condition" AND (systematic review[Publication Type] OR meta-analysis[Publication Type])',
    max_papers=10
)
```
Tracking Latest Developments
Combine AI-assisted search for synthesis with database searches for recent indexed publications.
```python
from biomni.tool.literature import advanced_web_search_claude, query_pubmed

# AI-assisted synthesis of recent advances
results = advanced_web_search_claude(
    "What are the most significant advances in CAR-T cell therapy in 2024?",
    max_searches=3
)

# Supplement with recent PubMed results
results = query_pubmed(
    '"Chimeric Antigen Receptor T-Cell Therapy"[MeSH] AND "2024"[Date - Publication]',
    max_papers=20
)
```
Finding Specific Reagents and Materials
Use AI-assisted search for validated reagent recommendations, supplemented by general web search.
```python
from biomni.tool.literature import advanced_web_search_claude, search_google

# Search for validated reagents
results = advanced_web_search_claude(
    "validated antibodies for Western blot detection of p53 protein",
    max_searches=2
)

# Search supplier databases
results = search_google("p53 antibody Western blot validated", num_results=5)
```
Comparative Analysis Across Methods
Use AI-assisted search for synthesized comparisons of techniques or tools.
```python
from biomni.tool.literature import advanced_web_search_claude

# Compare approaches with AI synthesis
results = advanced_web_search_claude(
    "Compare different CRISPR delivery methods for in vivo gene editing: viral vectors vs lipid nanoparticles",
    max_searches=5
)
```
Quality Assessment Checklist
When evaluating search results, apply these criteria:
- Source reliability: Is the paper from a peer-reviewed journal?
- Author credentials: Are the authors established experts in the field?
- Recency: Is the information current enough for your purpose?
- Study design: Is the design appropriate for the question (e.g., RCT for efficacy, cohort for risk)?
- Sample size: Is it adequate for the conclusions drawn?
- Reproducibility: Are methods described clearly enough to replicate?
- Conflicts of interest: Are any conflicts declared?
- Citation count: Has the paper been well-cited by subsequent work?
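Applied across many screened papers, the checklist can be turned into a simple scoring pass; the criterion names below mirror the list, while `checklist_score` and its usage are an illustrative sketch:

```python
# Criteria from the checklist above; each is a boolean judgment
# made by the reviewer for a given paper.
CRITERIA = [
    "peer_reviewed", "expert_authors", "recent_enough",
    "appropriate_design", "adequate_sample", "reproducible_methods",
    "conflicts_declared", "well_cited",
]

def checklist_score(assessment):
    """Return the fraction of checklist criteria a paper satisfies."""
    return sum(bool(assessment.get(c)) for c in CRITERIA) / len(CRITERIA)

# A paper meeting half the criteria scores 0.5
score = checklist_score({
    "peer_reviewed": True,
    "recent_enough": True,
    "appropriate_design": True,
    "adequate_sample": True,
})
```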
Further Reading
- PubMed Help -- Official guide to PubMed search syntax, field tags, filters, and advanced features
- arXiv Help Pages -- Documentation on arXiv search, subject categories, and submission process
- MeSH Browser -- NLM tool for browsing and searching the Medical Subject Headings controlled vocabulary
- PRISMA Statement -- Guidelines for transparent reporting of systematic reviews and meta-analyses
- Cochrane Handbook for Systematic Reviews -- Gold-standard methodology for systematic literature reviews
Related Skills
- pubmed-database -- Direct PubMed API access for programmatic literature retrieval
- scientific-manuscript-writing -- Structuring literature review sections within manuscripts
- research-question-formulation -- Frameworks for defining answerable research questions