SciAgent-Skills scientific-literature-search

Systematic strategies for searching, retrieving, and analyzing scientific literature across PubMed, arXiv, Google Scholar, and AI-assisted tools. Covers the PICO framework for clinical question formulation, three-tiered search strategy (database-specific, AI-assisted, content extraction), PubMed field tags and MeSH vocabulary, boolean query construction, and full-text extraction workflows. Consult this guide when planning a literature search, constructing database queries, or deciding which search tier to use for a given research question.

install
source · Clone the upstream repo
git clone https://github.com/jaechang-hits/SciAgent-Skills
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/jaechang-hits/SciAgent-Skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/scientific-writing/scientific-literature-search" ~/.claude/skills/jaechang-hits-sciagent-skills-scientific-literature-search && rm -rf "$T"
manifest: skills/scientific-writing/scientific-literature-search/SKILL.md
source content

Scientific Literature Search

Overview

Scientific literature search is the foundation of evidence-based research. A well-executed search maximizes recall (finding all relevant papers) while maintaining precision (avoiding irrelevant results). This guide provides a systematic approach that combines database-specific query strategies, AI-assisted synthesis, and direct content extraction, organized into a three-tiered framework that scales from targeted lookups to comprehensive landscape reviews.

Key Concepts

The PICO Framework

For clinical and biomedical questions, structure queries using the PICO framework:

  • P (Population): Who are you studying? (e.g., "Diabetes Mellitus"[MeSH])
  • I (Intervention): What treatment or exposure? (e.g., "Metformin"[MeSH])
  • C (Comparison): What is the alternative? (e.g., placebo, standard care)
  • O (Outcome): What result are you measuring? (e.g., "Cardiovascular Diseases"[MeSH])

PICO queries can be combined with publication type filters to target specific evidence levels:

"Diabetes Mellitus"[MeSH] AND "Metformin"[MeSH] AND "Cardiovascular Diseases"[MeSH] AND ("clinical trial"[Publication Type] OR "meta-analysis"[Publication Type])

Three-Tiered Search Strategy

Literature search is most effective when approached in tiers of increasing breadth:

Tier 1 -- Database-Specific Searches (Most Reliable)

Query established academic databases (PubMed, arXiv, Google Scholar) for peer-reviewed, indexed content. This is the most reliable tier and should always be the starting point.

  • PubMed (query_pubmed): Primary database for biomedical and life science literature. Supports MeSH controlled vocabulary and advanced field tags.
  • arXiv (query_arxiv): Preprint server for physics, mathematics, computer science, and quantitative biology. Results appear faster than in peer-reviewed journals.
  • Google Scholar (query_scholar): Broadest coverage across all academic disciplines. Note: aggressively rate-limits automated queries.
Best for: finding specific papers, systematic reviews, clinical evidence, preprints.

Tier 2 -- AI-Assisted Web Search (Comprehensive)

Use AI tools (advanced_web_search_claude) to synthesize broader context, identify research trends, and surface recent developments not yet indexed in databases. Also use general web search (search_google) for protocols, tutorials, and software documentation.

Best for: understanding the research landscape, complex multi-faceted questions, finding recent developments, identifying key researchers.

Avoid for: specific paper lookups (use Tier 1), citation counts (use Google Scholar), systematic reviews requiring reproducibility, searches where exact query terms must be documented.

Tier 3 -- Direct Content Extraction (Deep Dive)

Extract and analyze full-text content, PDFs, and supplementary materials from identified papers using extract_url_content, extract_pdf_content, and fetch_supplementary_info_from_doi.

Best for: detailed methodology extraction, data retrieval, protocol identification, supplementary data access.

PubMed Field Tags

PubMed supports field-specific searching to improve precision:

| Tag | Description | Example |
|---|---|---|
| [MeSH] | Medical Subject Heading (controlled vocabulary) | "Neoplasms"[MeSH] |
| [Title] | Title field only | "CRISPR"[Title] |
| [Title/Abstract] | Title or abstract | "gene therapy"[Title/Abstract] |
| [Author] | Author name | "Zhang F"[Author] |
| [Journal] | Journal name | "Nature"[Journal] |
| [Publication Type] | Article type filter | "Review"[Publication Type] |
| [Date - Publication] | Publication date range | "2020/01/01"[Date - Publication]:"2024/12/31"[Date - Publication] |
| [MeSH Major Topic] | MeSH term as the article's major focus | "CRISPR-Cas Systems"[MeSH Major Topic] |

Boolean Operators

Boolean operators control how search terms combine:

from biomni.tool.literature import query_pubmed

# AND: All terms must be present -- narrows results
results = query_pubmed("CRISPR AND cancer AND therapy")

# OR: Any term can be present -- broadens results (use for synonyms)
results = query_pubmed("(tumor OR tumour OR neoplasm) AND immunotherapy")

# NOT: Exclude terms -- use sparingly to avoid losing relevant papers
results = query_pubmed("cancer immunotherapy NOT review")

Use parentheses to group OR terms together before combining with AND.

arXiv Subject Categories

arXiv organizes preprints by subject category. Biology-related categories include:

| Category | Description |
|---|---|
| q-bio.BM | Biomolecules |
| q-bio.CB | Cell Behavior |
| q-bio.GN | Genomics |
| q-bio.MN | Molecular Networks |
| q-bio.NC | Neurons and Cognition |
| q-bio.QM | Quantitative Methods |
| cs.AI | Artificial Intelligence |
| cs.LG | Machine Learning |
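
To scope a search to one of these categories, the arXiv API accepts a cat: field prefix in the query string. A minimal sketch, assuming query_arxiv forwards the query to the arXiv search API unchanged:

from biomni.tool.literature import query_arxiv

# Restrict results to the Genomics category; the cat:/all: prefixes are
# arXiv API syntax, and pass-through by query_arxiv is an assumption
results = query_arxiv("cat:q-bio.GN AND all:single-cell", max_papers=10)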

Decision Framework

Use this tree to determine which search tier and database to start with:

What type of question are you answering?
├── Clinical / biomedical question
│   ├── Specific drug or treatment → Tier 1: PubMed with PICO query
│   ├── Disease mechanism → Tier 1: PubMed with MeSH terms
│   └── Clinical trial evidence → Tier 1: PubMed filtered by Publication Type
├── Computational / quantitative methods
│   ├── ML model or algorithm → Tier 1: arXiv (cs.LG, cs.AI)
│   ├── Computational biology method → Tier 1: arXiv (q-bio.*) + PubMed
│   └── Software tool or pipeline → Tier 2: AI-assisted web search
├── Broad research landscape
│   ├── Current state of a field → Tier 2: AI-assisted web search
│   ├── Recent developments (last 6 months) → Tier 2: AI-assisted web search
│   └── Cross-disciplinary question → Tier 1: Google Scholar + Tier 2
├── Specific paper or data
│   ├── Known paper details → Tier 1: any database by title/author/DOI
│   ├── Methodology or protocol → Tier 3: full-text extraction
│   └── Supplementary data → Tier 3: DOI-based supplementary fetch
└── Protocols / reagents
    ├── Lab protocol → Tier 2: web search for protocols.io, etc.
    └── Validated reagents → Tier 2: AI-assisted web search

| Scenario | Recommended Tier and Database | Rationale |
|---|---|---|
| Systematic review of clinical evidence | Tier 1: PubMed with MeSH + publication type filters | Reproducible, documented search strategy required |
| Finding a preprint on a new ML method | Tier 1: arXiv with category and keyword search | Preprints appear on arXiv before journals |
| Understanding the research landscape | Tier 2: AI-assisted web search | Requires synthesis across many sources |
| Extracting a specific protocol from a paper | Tier 3: PDF content extraction | Needs full-text access to the methods section |
| Finding papers across disciplines | Tier 1: Google Scholar | Broadest coverage across fields |
| Identifying key researchers in a niche area | Tier 2: AI-assisted web search | Requires contextual synthesis |
| Downloading supplementary data tables | Tier 3: DOI-based supplementary fetch | Direct access to supplementary files |

Best Practices

  1. Use controlled vocabulary (MeSH) for PubMed searches: Free-text searches miss papers that use different terminology. MeSH terms map synonyms to a single concept, improving recall without sacrificing precision.

    # Free text misses synonyms
    query_pubmed("heart attack treatment")
    # MeSH captures all synonyms
    query_pubmed('"Myocardial Infarction"[MeSH] AND "Drug Therapy"[MeSH]')
    
  2. Include synonyms and alternative terms with OR: Scientific concepts often have multiple names (e.g., tumor/tumour/neoplasm). Group synonyms with OR inside parentheses to avoid missing relevant papers.

    query_pubmed("(myocardial infarction OR heart attack) AND (treatment OR therapy)")
    
  3. Use phrase searching for multi-word concepts: Quoting exact phrases prevents the search engine from splitting terms and matching them independently.

    query_pubmed('"single cell RNA sequencing" AND methods')
    
  4. Filter by publication type when seeking specific evidence: Clinical trials, systematic reviews, and meta-analyses each answer different questions. Use [Publication Type] to target the evidence level you need.

    query_pubmed("COVID-19 vaccine efficacy AND clinical trial[Publication Type]")
    
  5. Start broad, then narrow iteratively: Begin with core concepts (2-3 terms) and review initial results. Add specificity based on what you find -- more terms, date ranges, field tags, or publication types.

    # Step 1: Broad
    results = query_pubmed("CRISPR base editing iPSC", max_papers=20)
    # Step 2: Add MeSH and specificity
    results = query_pubmed(
        '"CRISPR-Cas Systems"[MeSH] AND "base editing" AND "induced pluripotent stem cells" AND efficiency',
        max_papers=20
    )
    # Step 3: Filter by date
    results = query_pubmed(
        '"CRISPR-Cas Systems"[MeSH] AND "base editing" AND "induced pluripotent stem cells" AND efficiency AND ("2022"[Date - Publication]:"2024"[Date - Publication])',
        max_papers=20
    )
    
  6. Cross-reference multiple databases: No single database covers all literature. Use PubMed for biomedical content, arXiv for computational preprints, and Google Scholar for cross-disciplinary coverage.

  7. Assess result quality systematically: Evaluate papers for source reliability (peer-reviewed journal), author credentials, recency, study design appropriateness, sample size adequacy, reproducibility, declared conflicts of interest, and citation count.

Common Pitfalls

  1. Overly long and specific queries: Packing too many terms into a single query causes missed results because all terms must match simultaneously.

    • How to avoid: Limit queries to core concepts (3-5 terms). Run separate searches for sub-topics and combine results manually.
    # Too specific -- misses relevant papers
    query_pubmed("CRISPR Cas9 gene editing HEK293T cells 2024 efficiency optimization delivery")
    # Better -- core concepts only
    query_pubmed("CRISPR Cas9 gene editing optimization efficiency")
    
  2. Relying on a single database: PubMed has biomedical focus, arXiv covers preprints, Google Scholar spans disciplines. Using only one database guarantees blind spots.

    • How to avoid: Always search at least two databases. For computational biology, combine PubMed and arXiv. For cross-disciplinary topics, include Google Scholar.
  3. Ignoring publication dates: Scientific knowledge evolves rapidly. Foundational papers remain relevant, but methods and clinical evidence may be superseded.

    • How to avoid: Check publication dates in all results. For methods papers, prefer the last 3-5 years. For foundational concepts, older papers are acceptable but verify with recent reviews.
  4. Skipping title and abstract review before deep-diving: Not all search results that match keywords are actually relevant. Downloading and reading full texts without screening wastes time.

    • How to avoid: Always screen titles and abstracts first. Only extract full text (Tier 3) for papers that pass screening.
  5. Using NOT operators too aggressively: The NOT operator can inadvertently exclude relevant papers that mention the excluded term in a different context.

    • How to avoid: Use NOT sparingly. Prefer adding positive terms to narrow results rather than excluding terms. When you must use NOT, verify that excluded results are genuinely irrelevant.
  6. Ignoring Google Scholar rate limits: Google Scholar aggressively rate-limits automated queries, which can block further searches.

    • How to avoid: Use Google Scholar sparingly. Add delays between requests. Prefer PubMed or arXiv for bulk searching and reserve Google Scholar for cross-disciplinary checks.
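    • Example (a minimal sketch; the delay length is illustrative, not a documented limit):
    import time
    from biomni.tool.literature import query_scholar

    # Space out Scholar queries to reduce the chance of being blocked
    for q in ["single cell RNA sequencing", "spatial transcriptomics methods"]:
        result = query_scholar(q)
        time.sleep(30)  # illustrative pause; tune to observed rate limits
    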
  7. Not documenting the search strategy: For systematic reviews and reproducible research, an undocumented search cannot be verified or reproduced.

    • How to avoid: Record your search terms, databases queried, date ranges, and number of results at each stage. This is essential for systematic reviews and good practice for all searches.
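    • Example (a minimal search-log sketch; field names and values are illustrative):
    import csv
    from datetime import date

    # One record per query; export for a PRISMA-style audit trail
    search_log = [{
        "date": date.today().isoformat(),
        "database": "PubMed",
        "query": '"CRISPR-Cas Systems"[MeSH] AND "base editing"',
        "filters": "2022-2024",
        "n_results": 0,  # fill in from the actual result count
    }]

    with open("search_log.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=search_log[0].keys())
        writer.writeheader()
        writer.writerows(search_log)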

Workflow

  1. Step 1: Define the research question

    • Identify the main concept, population/model, intervention/method, desired outcome, and time frame
    • For clinical questions, map to the PICO framework
    • Example: "Find recent papers on CRISPR base editing efficiency in human iPSCs" decomposes to: main concept = CRISPR base editing, model = human iPSCs, outcome = efficiency, time frame = last 3 years
  2. Step 2: Construct and execute database queries (Tier 1)

    • Start with PubMed for biomedical topics, arXiv for computational topics
    • Begin with a broad query using 2-3 core terms
    • Refine with MeSH terms, field tags, date filters, and publication type filters
    from biomni.tool.literature import query_pubmed, query_arxiv, query_scholar
    
    # PubMed: biomedical literature
    results = query_pubmed(
        '"CRISPR-Cas Systems"[MeSH] AND "Gene Editing"[MeSH]',
        max_papers=20
    )
    
    # arXiv: computational biology preprints
    results = query_arxiv("protein structure prediction", max_papers=10)
    
    # Google Scholar: broad cross-disciplinary coverage
    result = query_scholar("single cell RNA sequencing analysis methods")
    
  3. Step 3: Supplement with AI-assisted search (Tier 2)

    • Use AI-assisted web search for landscape overviews and recent developments
    • Use general web search for protocols, tutorials, and documentation
    from biomni.tool.literature import advanced_web_search_claude
    
    results = advanced_web_search_claude(
        "What are the latest developments in CAR-T cell therapy for solid tumors in 2024?",
        max_searches=3
    )
    
  4. Step 4: Evaluate and filter results

    • Screen titles and abstracts for relevance
    • Prioritize by recency, journal quality, citation count, and study design
    • For clinical evidence, prioritize RCTs, systematic reviews, and meta-analyses
    • For methods, prioritize protocol papers and method comparisons
    • Decision point: If too many results, add more specific terms or filters. If too few, broaden terms and add synonyms.
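    • A minimal screening sketch (assumes each result exposes "title" and "abstract" fields -- adapt to query_pubmed's actual return format):
    # Hypothetical screening pass over `results` from Step 2; the dict keys
    # "title" and "abstract" are assumptions about the return format
    must_mention = ("base editing", "ipsc")
    keep = []
    for paper in results:
        text = (paper.get("title", "") + " " + paper.get("abstract", "")).lower()
        if all(term in text for term in must_mention):
            keep.append(paper)
    print(f"{len(keep)} of {len(results)} papers pass screening")
    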
  5. Step 5: Deep dive into key papers (Tier 3)

    • Extract full text from high-priority papers
    • Download supplementary materials for data and protocols
    • Check reference lists for additional relevant papers
    from biomni.tool.literature import extract_url_content, extract_pdf_content, fetch_supplementary_info_from_doi
    
    # Extract article content from URL
    content = extract_url_content("https://www.nature.com/articles/nature12373")
    
    # Extract text from PDF
    content = extract_pdf_content("https://arxiv.org/pdf/1706.03762.pdf")
    
    # Download supplementary files using DOI
    log = fetch_supplementary_info_from_doi(
        "10.1038/nature12373",
        output_dir="./supplementary_materials"
    )
    
  6. Step 6: Document and iterate

    • Record all search terms, databases, filters, and result counts
    • If gaps remain, revisit Steps 2-3 with refined queries
    • For systematic reviews, follow PRISMA guidelines for reporting

Common Search Scenarios

The following scenarios illustrate how to combine the three tiers for typical research questions.

Finding Methods and Protocols

Start with PubMed for published methodology papers, then supplement with web search for step-by-step protocols from resources like protocols.io.

from biomni.tool.literature import query_pubmed, search_google

# Search for methodology papers in PubMed
results = query_pubmed(
    '"Western Blotting"[MeSH] AND (protocol OR method OR technique)',
    max_papers=10
)

# Check web for step-by-step protocols
results = search_google("Western blot protocol for membrane proteins", num_results=5)

Understanding Disease Mechanisms

Begin with review articles for a broad overview, then drill into specific mechanistic studies.

from biomni.tool.literature import query_pubmed

# Find review articles first for an overview
results = query_pubmed(
    '"Alzheimer Disease"[MeSH] AND pathophysiology AND review[Publication Type]',
    max_papers=10
)

# Then find specific mechanistic studies
results = query_pubmed(
    '"Alzheimer Disease"[MeSH] AND ("amyloid beta"[MeSH] OR tau) AND mechanism',
    max_papers=20
)

Finding Drug and Treatment Information

Use publication type filters to separate clinical trial evidence from systematic reviews.

from biomni.tool.literature import query_pubmed

# Clinical trials for a specific drug-condition pair
results = query_pubmed(
    '"Drug Name"[Substance Name] AND "Condition"[MeSH] AND clinical trial[Publication Type]',
    max_papers=20
)

# Systematic reviews and meta-analyses
results = query_pubmed(
    '"Drug Name" AND "Condition" AND (systematic review[Publication Type] OR meta-analysis[Publication Type])',
    max_papers=10
)

Tracking Latest Developments

Combine AI-assisted search for synthesis with database searches for recent indexed publications.

from biomni.tool.literature import advanced_web_search_claude, query_pubmed

# AI-assisted synthesis of recent advances
results = advanced_web_search_claude(
    "What are the most significant advances in CAR-T cell therapy in 2024?",
    max_searches=3
)

# Supplement with recent PubMed results
results = query_pubmed(
    '"Chimeric Antigen Receptor T-Cell Therapy"[MeSH] AND "2024"[Date - Publication]',
    max_papers=20
)

Finding Specific Reagents and Materials

Use AI-assisted search for validated reagent recommendations, supplemented by general web search.

from biomni.tool.literature import advanced_web_search_claude, search_google

# Search for validated reagents
results = advanced_web_search_claude(
    "validated antibodies for Western blot detection of p53 protein",
    max_searches=2
)

# Search supplier databases
results = search_google("p53 antibody Western blot validated", num_results=5)

Comparative Analysis Across Methods

Use AI-assisted search for synthesized comparisons of techniques or tools.

from biomni.tool.literature import advanced_web_search_claude

# Compare approaches with AI synthesis
results = advanced_web_search_claude(
    "Compare different CRISPR delivery methods for in vivo gene editing: viral vectors vs lipid nanoparticles",
    max_searches=5
)

Quality Assessment Checklist

When evaluating search results, apply these criteria:

  • Source reliability: Is the paper from a peer-reviewed journal?
  • Author credentials: Are the authors established experts in the field?
  • Recency: Is the information current enough for your purpose?
  • Study design: Is the design appropriate for the question (e.g., RCT for efficacy, cohort for risk)?
  • Sample size: Is it adequate for the conclusions drawn?
  • Reproducibility: Are methods described clearly enough to replicate?
  • Conflicts of interest: Are any conflicts declared?
  • Citation count: Has the paper been well-cited by subsequent work?

Further Reading

  • PubMed Help -- Official guide to PubMed search syntax, field tags, filters, and advanced features
  • arXiv Help Pages -- Documentation on arXiv search, subject categories, and submission process
  • MeSH Browser -- NLM tool for browsing and searching the Medical Subject Headings controlled vocabulary
  • PRISMA Statement -- Guidelines for transparent reporting of systematic reviews and meta-analyses
  • Cochrane Handbook for Systematic Reviews -- Gold-standard methodology for systematic literature reviews

Related Skills

  • pubmed-database -- Direct PubMed API access for programmatic literature retrieval
  • scientific-manuscript-writing -- Structuring literature review sections within manuscripts
  • research-question-formulation -- Frameworks for defining answerable research questions