LLMs-Universal-Life-Science-and-Clinical-Skills- literature

install
source · Clone the upstream repo
git clone https://github.com/mdbabumiamssm/LLMs-Universal-Life-Science-and-Clinical-Skills-
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/mdbabumiamssm/LLMs-Universal-Life-Science-and-Clinical-Skills- "$T" && mkdir -p ~/.claude/skills && cp -r "$T/Skills/LLM_Research" ~/.claude/skills/mdbabumiamssm-llms-universal-life-science-and-clinical-skills-literature && rm -rf "$T"
manifest: Skills/LLM_Research/SKILL.md
source content

Literature Parsing Skill

Purpose

Parse scientific literature (PDFs, URLs, DOIs) to extract GEO accessions and metadata, then download datasets for downstream omics analysis.

Methodology

1. Input Processing

Accepts multiple input types:

  • URL: PubMed, bioRxiv, journal article links
  • DOI: Digital Object Identifier (e.g., 10.1038/s41586-021-03569-1)
  • PubMed ID: PMID (e.g., 33234567)
  • PDF: Uploaded scientific paper
  • Text: Raw text containing GEO references

2. Metadata Extraction

Extracts structured information:

  • GEO Accessions: GSE (study-level), GSM (sample-level)
  • Organism: Species (e.g., Homo sapiens, Mus musculus)
  • Tissue: Tissue type or organ
  • Cell Type: Cell type if specified
  • Technology: Sequencing platform (10x, Visium, etc.)

3. Data Download

Downloads datasets from GEO:

  • Resolves GSE to find all associated GSM samples
  • Downloads expression matrices (.h5ad, .mtx, .csv)
  • Organizes files by accession:
    data/GSE123456/
  • Generates metadata.json with extracted information

4. Error Handling

  • Retry with fallbacks: PDF parsing → text extraction → manual patterns
  • Partial results: Returns successfully extracted data even if some downloads fail
  • Logging: Detailed logs for debugging

Output

  • *data/GSE/**: Downloaded datasets organized by accession
  • output/literature-parse_*/report.md: Extraction report
  • output/literature-parse_*/metadata.json: Structured metadata

Usage

# Parse from URL
python skills/literature/literature_parse.py \
  --input "https://pubmed.ncbi.nlm.nih.gov/12345" \
  --output output/literature_results

# Parse from DOI
python skills/literature/literature_parse.py \
  --input "10.1038/s41586-021-03569-1" \
  --input-type doi \
  --output output/literature_results

# Parse PDF
python skills/literature/literature_parse.py \
  --input paper.pdf \
  --input-type file \
  --output output/literature_results

Integration

After extraction, the bot automatically suggests appropriate analysis skills based on:

  • Data type (spatial, single-cell, bulk)
  • Organism and tissue
  • Available files

Dependencies

  • pypdf: PDF text extraction
  • requests: HTTP requests
  • beautifulsoup4: HTML parsing
  • GEOparse: GEO data access (optional, fallback to direct API)