SciAgent-Skills transformers-bio-nlp
Use HuggingFace Transformers with biomedical language models for scientific NLP tasks. Load BioBERT, PubMedBERT, BioGPT, and BioMedLM for named entity recognition (genes, diseases, chemicals), relation extraction, question answering on biomedical literature, text classification, and abstract summarization. Covers model loading, tokenization of biomedical text, inference pipelines, and fine-tuning on domain-specific datasets. Alternatives: spaCy with en_core_sci_lg (rule-based NER), Stanza (Stanford NLP, biomedical models), NLTK (classical NLP).
```bash
git clone https://github.com/jaechang-hits/SciAgent-Skills
```

```bash
T=$(mktemp -d) && git clone --depth=1 https://github.com/jaechang-hits/SciAgent-Skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/scientific-computing/transformers-bio-nlp" ~/.claude/skills/jaechang-hits-sciagent-skills-transformers-bio-nlp && rm -rf "$T"
```
skills/scientific-computing/transformers-bio-nlp/SKILL.md

Transformers for Biomedical NLP
Overview
HuggingFace Transformers provides a unified API to load, run, and fine-tune 500+ biomedical language models. The key biomedical models — BioBERT (trained on PubMed abstracts + PMC full text), PubMedBERT (trained from scratch on PubMed), BioGPT (generative, trained on PubMed), and BioMedLM — significantly outperform general-purpose BERT on biomedical NER, relation extraction, and question answering. The `pipeline()` abstraction handles tokenization, inference, and postprocessing in one call. Fine-tuning on task-specific labeled data (e.g., BC5CDR for chemical/disease NER) takes under an hour on a single GPU. The `datasets` library provides direct access to standard biomedical benchmarks.
When to Use
- Extracting gene names, disease mentions, drug names, or chemical entities from biomedical abstracts (NER)
- Classifying abstracts by topic, sentiment of clinical outcomes, or PICO elements for systematic reviews
- Answering specific questions from biomedical literature using extractive QA (BioASQ format)
- Generating hypotheses or summaries from biomedical text using BioGPT or BioMedLM
- Fine-tuning a pre-trained biomedical model on a custom labeled dataset (e.g., your lab's annotations)
- Embedding biomedical sentences for semantic similarity search across literature
- Use spaCy + en_core_sci_lg for fast rule-augmented NER; use Stanza for dependency parsing
Prerequisites
- Python packages: `transformers`, `torch`, `datasets`, `accelerate`, `sentencepiece`
- GPU: Strongly recommended for fine-tuning; inference on CPU is viable for single texts
- Data requirements: plain text biomedical strings; for fine-tuning, annotated data in BIO/IOB format

```bash
pip install transformers torch datasets accelerate sentencepiece

# For GPU (CUDA 11.8)
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu118
```
Quick Start
```python
from transformers import pipeline

# Named entity recognition with a biomedical NER model
ner = pipeline("ner", model="d4data/biomedical-ner-all", aggregation_strategy="simple")
text = "BRCA1 mutations are associated with increased risk of breast cancer and ovarian cancer."
entities = ner(text)
for ent in entities:
    print(f"  {ent['word']:20s} {ent['entity_group']:10s} score={ent['score']:.3f}")
```
Core API
Module 1: Named Entity Recognition (NER)
Extract biomedical entities using pre-trained NER models.
```python
from transformers import pipeline

# Pre-trained biomedical NER models. Common choices:
#   "d4data/biomedical-ner-all"      — multi-entity biomedical NER
#   "pruas/BENT-PubMedBERT-NER-Gene" — gene-specific NER
#   "allenai/scibert_scivocab_cased" — scientific-domain base model (needs a fine-tuned NER head)
ner_pipe = pipeline(
    "ner",
    model="d4data/biomedical-ner-all",
    aggregation_strategy="simple",  # merge subword tokens into words
    device=-1,                      # -1=CPU, 0=GPU
)

abstracts = [
    "Imatinib inhibits the BCR-ABL1 tyrosine kinase and is first-line treatment for CML.",
    "EGFR mutations in non-small cell lung cancer predict response to erlotinib.",
]
for text in abstracts:
    entities = ner_pipe(text)
    print(f"\nText: {text[:60]}...")
    for e in entities:
        print(f"  [{e['entity_group']}] '{e['word']}' (score={e['score']:.2f})")
```
```python
# Manual tokenization + inference for batch processing
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

# Use a model fine-tuned for token classification so id2label carries real entity labels
model_name = "d4data/biomedical-ner-all"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)
model.eval()

text = "Metformin activates AMPK and reduces hepatic glucose production in type 2 diabetes."
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
    outputs = model(**inputs)
logits = outputs.logits  # shape: (1, seq_len, n_labels)
predictions = logits.argmax(dim=-1)[0]

tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
labels = [model.config.id2label[p.item()] for p in predictions]
for token, label in zip(tokens[1:-1], labels[1:-1]):  # skip [CLS] and [SEP]
    if label != "O":
        print(f"  {token:20s} {label}")
```
Module 2: Text Classification
Classify biomedical abstracts or sentences.
```python
from transformers import pipeline

# Zero-shot classification — no fine-tuning needed
zs_clf = pipeline("zero-shot-classification", model="facebook/bart-large-mnli", device=-1)

abstract = """
This randomized controlled trial evaluated the efficacy of pembrolizumab
versus chemotherapy in patients with advanced non-small-cell lung cancer.
Overall survival was significantly improved in the pembrolizumab arm
(HR=0.60, 95% CI 0.41-0.89).
"""
candidate_labels = ["clinical trial", "basic research", "meta-analysis", "review"]
result = zs_clf(abstract, candidate_labels)

print("Zero-shot classification:")
for label, score in zip(result["labels"], result["scores"]):
    print(f"  {label:20s}: {score:.3f}")
```
```python
# Fine-tuned sentiment/outcome classification
from transformers import pipeline

# Placeholder: a general sentiment model. Swap in a sequence-classification
# model fine-tuned for clinical outcomes for real use.
clf = pipeline("text-classification",
               model="distilbert-base-uncased-finetuned-sst-2-english",
               device=-1)
sentences = [
    "Treatment significantly improved overall survival (p<0.001).",
    "No statistically significant difference was observed between groups.",
]
results = clf(sentences)
for sent, result in zip(sentences, results):
    print(f"  [{result['label']} | {result['score']:.2f}] {sent[:50]}...")
```
Module 3: Biomedical Question Answering
Extract answers from biomedical text passages.
```python
from transformers import pipeline

# Extractive QA: find answer span within context
qa_pipe = pipeline(
    "question-answering",
    model="sultan/BioM-ELECTRA-Large-SQuAD2",  # biomedical QA model
    device=-1,
)

context = """
BRCA1 is a tumor suppressor gene located on chromosome 17q21. Pathogenic
variants in BRCA1 confer a lifetime breast cancer risk of 50-72% and ovarian
cancer risk of 44-46%. BRCA1 protein functions in DNA double-strand break
repair via homologous recombination.
"""
questions = [
    "What chromosome is BRCA1 located on?",
    "What is the lifetime breast cancer risk from BRCA1 variants?",
    "What DNA repair pathway does BRCA1 participate in?",
]
for q in questions:
    result = qa_pipe(question=q, context=context)
    print(f"Q: {q}")
    print(f"A: {result['answer']} (score={result['score']:.3f})\n")
```
Module 4: Text Generation with BioGPT
Generate biomedical text, hypotheses, and summaries.
```python
from transformers import AutoTokenizer, BioGptForCausalLM
import torch

model_name = "microsoft/biogpt"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = BioGptForCausalLM.from_pretrained(model_name)
model.eval()

prompt = "The role of VEGF in tumor angiogenesis"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=100,
        num_beams=5,
        early_stopping=True,
        no_repeat_ngram_size=3,
        pad_token_id=tokenizer.eos_token_id,
    )
generated = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(f"Generated:\n{generated}")
```
Module 5: Sentence Embeddings for Semantic Search
Embed biomedical text for similarity search and clustering.
```python
from transformers import AutoTokenizer, AutoModel
import torch
import numpy as np

def mean_pooling(model_output, attention_mask):
    """Mean pooling across token embeddings."""
    token_embeddings = model_output.last_hidden_state
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return (token_embeddings * input_mask_expanded).sum(1) / input_mask_expanded.sum(1)

# PubMedBERT for biomedical sentence embeddings
model_name = "microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
model.eval()

sentences = [
    "BRCA1 is involved in DNA double-strand break repair.",
    "Homologous recombination requires BRCA1 and BRCA2.",
    "Metformin inhibits hepatic gluconeogenesis via AMPK.",
]
inputs = tokenizer(sentences, padding=True, truncation=True, max_length=512, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
embeddings = mean_pooling(outputs, inputs["attention_mask"])
embeddings = torch.nn.functional.normalize(embeddings, p=2, dim=1).numpy()

# Cosine similarity (vectors are unit-normalized, so the dot product suffices)
sim_01 = np.dot(embeddings[0], embeddings[1])
sim_02 = np.dot(embeddings[0], embeddings[2])
print(f"Similarity (BRCA1 repair vs. HR):        {sim_01:.3f}")
print(f"Similarity (BRCA1 repair vs. Metformin): {sim_02:.3f}")
```
Module 6: Fine-Tuning on Custom Data
Fine-tune a biomedical model on a labeled NER dataset.
```python
from transformers import (AutoTokenizer, AutoModelForTokenClassification,
                          TrainingArguments, Trainer,
                          DataCollatorForTokenClassification)
from datasets import Dataset
import numpy as np

# Example: minimal NER fine-tuning setup
model_name = "microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract"
label_list = ["O", "B-GENE", "I-GENE", "B-DISEASE", "I-DISEASE"]
id2label = {i: l for i, l in enumerate(label_list)}
label2id = {l: i for i, l in enumerate(label_list)}

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(
    model_name,
    num_labels=len(label_list),
    id2label=id2label,
    label2id=label2id,
)

# Training arguments
training_args = TrainingArguments(
    output_dir="./biomed_ner_finetuned",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    warmup_steps=100,
    weight_decay=0.01,
    logging_dir="./logs",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
)

print(f"Model ready for fine-tuning: {model_name}")
print(f"Labels: {label_list}")
# trainer = Trainer(model=model, args=training_args, ...)
# trainer.train()
```
Key Concepts
Tokenization of Biomedical Text
Biomedical text contains special tokens (gene symbols, drug names, chemical SMILES, numeric values) that WordPiece and BPE tokenizers split unexpectedly. For example, "BRCA1" → `["BR", "##CA", "##1"]`. This subword splitting does not affect classification tasks but does affect NER — use `aggregation_strategy="simple"` or `"first"` in `pipeline()` to merge subword predictions back to word level.
BIO Labeling Scheme
NER uses BIO (Begin-Inside-Outside) tagging: `B-GENE` marks the first token of a gene name, `I-GENE` marks continuation tokens, and `O` marks non-entity tokens. During fine-tuning, align labels to subword tokens by setting non-first subword labels to -100 (ignored by the loss function).
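The -100 alignment can be sketched with a fast tokenizer's `word_ids()` mapping. This is an illustrative helper (the function name and the `bert-base-uncased` tokenizer are assumptions, not part of the skill's API):

```python
from transformers import AutoTokenizer

# Sketch: align word-level BIO labels to subword tokens. Special tokens and
# non-first subword pieces receive -100 so the loss function ignores them.
def align_labels(words, word_labels, tokenizer, label2id):
    enc = tokenizer(words, is_split_into_words=True, truncation=True)
    aligned, prev_word = [], None
    for word_id in enc.word_ids():
        if word_id is None:            # [CLS], [SEP], padding
            aligned.append(-100)
        elif word_id != prev_word:     # first subword keeps the word's label
            aligned.append(label2id[word_labels[word_id]])
        else:                          # continuation subwords are ignored
            aligned.append(-100)
        prev_word = word_id
    return enc, aligned

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
label2id = {"O": 0, "B-GENE": 1, "I-GENE": 2}
words = ["BRCA1", "regulates", "repair"]
enc, labels = align_labels(words, ["B-GENE", "O", "O"], tokenizer, label2id)
print(labels)
```

`word_ids()` requires a fast (Rust-backed) tokenizer, which `AutoTokenizer` returns by default for BERT-family models.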
Common Workflows
Workflow 1: Batch Abstract NER and Entity Aggregation
```python
from transformers import pipeline
import pandas as pd

ner_pipe = pipeline("ner", model="d4data/biomedical-ner-all",
                    aggregation_strategy="simple", device=-1)

abstracts = [
    "Pembrolizumab combined with chemotherapy significantly improved progression-free survival in HER2-positive breast cancer.",
    "Inhibition of EGFR by gefitinib is effective in patients with activating EGFR mutations in exons 19 and 21.",
    "CRISPR-Cas9 editing of the PCSK9 gene in hepatocytes reduces LDL cholesterol in murine models.",
]

records = []
for i, text in enumerate(abstracts):
    entities = ner_pipe(text)
    for e in entities:
        records.append({
            "abstract_id": i,
            "entity": e["word"],
            "type": e["entity_group"],
            "score": round(e["score"], 3),
        })

df = pd.DataFrame(records)
print(df.groupby("type")["entity"].apply(list).to_string())
df.to_csv("extracted_entities.csv", index=False)
print(f"\nExtracted {len(df)} entity mentions across {len(abstracts)} abstracts")
```
Workflow 2: Semantic Similarity Ranking for Literature Retrieval
```python
from transformers import AutoTokenizer, AutoModel
import torch
import numpy as np

model_name = "microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
model.eval()

def embed(texts):
    enc = tokenizer(texts, padding=True, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc)
    vecs = out.last_hidden_state[:, 0, :]  # [CLS] token
    return torch.nn.functional.normalize(vecs, dim=1).numpy()

query = "CRISPR base editing for correction of point mutations in genetic disease"
corpus = [
    "Base editing enables precise single-base changes in genomic DNA without double-strand breaks.",
    "CAR-T cell therapy targets CD19 in B-cell acute lymphoblastic leukemia.",
    "Prime editing uses reverse transcriptase to install targeted edits at specific loci.",
    "RNA interference silences gene expression via RISC-mediated mRNA cleavage.",
]
q_emb = embed([query])
c_emb = embed(corpus)
scores = (q_emb @ c_emb.T).flatten()

ranked = sorted(zip(scores, corpus), reverse=True)
print("Top results:")
for score, text in ranked:
    print(f"  [{score:.3f}] {text[:70]}...")
```
Key Parameters
| Parameter | Module/Function | Default | Range / Options | Effect |
|---|---|---|---|---|
| `model` | `pipeline()` | — | HuggingFace model ID string | Pre-trained model to load; must match task |
| `aggregation_strategy` | NER `pipeline()` | `"none"` | `"none"`, `"simple"`, `"first"`, `"max"` | Merge subword NER predictions; use `"simple"` for word-level output |
| `device` | `pipeline()` | -1 | -1 (CPU), 0 (GPU 0), 1 (GPU 1) | Inference device |
| `max_length` | tokenizer | 512 | 128–2048 (model-dependent) | Max token length; truncates longer inputs |
| `max_new_tokens` | `generate()` | 20 | 1–1000 | Tokens to generate for text generation models |
| `num_beams` | `generate()` | 1 | 1–10 | Beam search width; larger = better quality, slower |
| `num_train_epochs` | `TrainingArguments` | 3 | 1–10 | Fine-tuning epochs |
| `per_device_train_batch_size` | `TrainingArguments` | 8 | 4–32 | Batch size per GPU; reduce if OOM |
| `weight_decay` | `TrainingArguments` | 0.0 | 0.01–0.1 | L2 regularization for fine-tuning |
Best Practices
- Use domain-specific models, not general BERT: PubMedBERT, trained from scratch on PubMed, outperforms BERT-base by 5–15% on biomedical NER. Always start with biomedical pre-training before fine-tuning on task-specific data.
- Verify model licenses before production use: Some models (BioGPT, BioMedLM) have research-only licenses. Check the HuggingFace model card's license field before deploying in commercial applications.
- Use `aggregation_strategy="simple"` for word-level NER output: The default `"none"` returns subword tokens, making post-processing difficult. `"simple"` merges subword tokens into word-level entity groups.
- Truncate at sentence boundaries, not mid-sentence: Long biomedical abstracts that exceed 512 tokens should be split at sentence boundaries before encoding. Mid-sentence truncation degrades NER accuracy for entities near the cutoff.
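The sentence-boundary splitting above can be sketched with a naive regex splitter (an illustration only — a real pipeline might use scispacy or `nltk.sent_tokenize` for biomedical-aware segmentation; `split_sentences` is a hypothetical helper):

```python
import re

# Naive splitter: break after ., !, or ? followed by whitespace.
# Abbreviations like "e.g." will fool it — hence the scispacy/nltk caveat.
def split_sentences(text):
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s.strip()]

abstract = ("BRCA1 is a tumor suppressor gene. Pathogenic variants raise "
            "breast cancer risk. BRCA1 acts in homologous recombination.")
for sent in split_sentences(abstract):
    print(sent)  # feed each sentence to the NER pipeline separately
```

Each resulting sentence stays well under the 512-token limit, so no entity is lost to truncation.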
Common Recipes
Recipe: Extract Drug-Disease Pairs from PubMed Abstracts
```python
from transformers import pipeline
from itertools import product

ner = pipeline("ner", model="d4data/biomedical-ner-all",
               aggregation_strategy="simple", device=-1)

def extract_drug_disease_pairs(text):
    entities = ner(text)
    drugs = [e["word"] for e in entities if e["entity_group"] in ("DRUG", "CHEMICAL")]
    diseases = [e["word"] for e in entities if e["entity_group"] in ("DISEASE", "CONDITION")]
    return list(product(drugs, diseases))

text = ("Imatinib and nilotinib both target BCR-ABL1 in chronic myeloid leukemia "
        "and Philadelphia chromosome-positive ALL.")
pairs = extract_drug_disease_pairs(text)
print("Drug-Disease pairs:")
for drug, disease in pairs:
    print(f"  {drug} → {disease}")
```
Recipe: Sentence-Level Abstract Filtering
```python
from transformers import pipeline

clf = pipeline("zero-shot-classification", model="facebook/bart-large-mnli", device=-1)

abstracts = [
    "We present a phase 3 randomized controlled trial of semaglutide in type 2 diabetes.",
    "Structural analysis of the SARS-CoV-2 spike protein RBD domain by cryo-EM.",
    "A retrospective cohort study of 1,200 ICU patients during the COVID-19 pandemic.",
]
label_options = ["randomized controlled trial", "observational study",
                 "structural biology", "computational study"]

for abstract in abstracts:
    result = clf(abstract, label_options)
    print(f"Type: {result['labels'][0]} ({result['scores'][0]:.2f})")
    print(f"  {abstract[:70]}...\n")
```
Troubleshooting
| Problem | Cause | Solution |
|---|---|---|
| `CUDA out of memory` during inference | Batch too large for GPU VRAM | Reduce batch size; use `device=-1` for CPU; load with `torch_dtype=torch.float16` for FP16 |
| NER returns subword tokens (`##CA`) | `aggregation_strategy` not set | Set `aggregation_strategy="simple"` in `pipeline()` |
| Model download times out | Large model files (1–10 GB); slow connection | Set `HF_HUB_DOWNLOAD_TIMEOUT` and download manually with `huggingface-cli download` |
| NER misses entities at end of long abstracts | Input truncated at 512 tokens | Split abstracts into sentences; process each separately |
| Fine-tuning loss is `NaN` | Learning rate too high or gradient explosion | Reduce `learning_rate` to 2e-5; enable gradient clipping |
| Wrong entities for specialized domain | Generic biomedical model not suited to subdomain | Fine-tune on domain-labeled data; use more specific model (e.g., gene-only NER) |
| BioGPT generates repetitive text | `no_repeat_ngram_size` too small | Set `no_repeat_ngram_size=3` or a `repetition_penalty`; increase `num_beams` |
Related Skills
- `pubmed-database` — retrieve PubMed abstracts that serve as input to biomedical NLP pipelines
- `biorxiv-database` — retrieve preprints for NLP analysis before peer review
- `scientific-critical-thinking` — evaluate quality of NLP-extracted evidence before using for research conclusions
References
- HuggingFace Transformers docs — pipeline, tokenizer, and training API
- BioBERT paper: Lee et al. (2020), Bioinformatics — pre-training on PubMed and PMC
- PubMedBERT paper: Gu et al. (2021), ACL — from-scratch pre-training on PubMed
- BioGPT paper: Luo et al. (2022), Briefings in Bioinformatics — generative biomedical language model
- BioCreative benchmarks — standard NER and relation extraction datasets