SciAgent-Skills transformers-bio-nlp

Use HuggingFace Transformers with biomedical language models for scientific NLP tasks. Load BioBERT, PubMedBERT, BioGPT, and BioMedLM for named entity recognition (genes, diseases, chemicals), relation extraction, question answering on biomedical literature, text classification, and abstract summarization. Covers model loading, tokenization of biomedical text, inference pipelines, and fine-tuning on domain-specific datasets. Alternatives: spaCy with en_core_sci_lg (scispaCy statistical NER), Stanza (Stanford NLP, biomedical models), NLTK (classical NLP).

install
source · Clone the upstream repo
git clone https://github.com/jaechang-hits/SciAgent-Skills
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/jaechang-hits/SciAgent-Skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/scientific-computing/transformers-bio-nlp" ~/.claude/skills/jaechang-hits-sciagent-skills-transformers-bio-nlp && rm -rf "$T"
manifest: skills/scientific-computing/transformers-bio-nlp/SKILL.md
source content

Transformers for Biomedical NLP

Overview

HuggingFace Transformers provides a unified API to load, run, and fine-tune 500+ biomedical language models. The key biomedical models — BioBERT (trained on PubMed abstracts + PMC full text), PubMedBERT (trained from scratch on PubMed), BioGPT (generative, trained on PubMed), and BioMedLM — significantly outperform general-purpose BERT on biomedical NER, relation extraction, and question answering. The pipeline() abstraction handles tokenization, inference, and postprocessing in one call. Fine-tuning on task-specific labeled data (e.g., BC5CDR for chemical/disease NER) takes under an hour on a single GPU. The datasets library provides direct access to standard biomedical benchmarks.

When to Use

  • Extracting gene names, disease mentions, drug names, or chemical entities from biomedical abstracts (NER)
  • Classifying abstracts by topic, sentiment of clinical outcomes, or PICO elements for systematic reviews
  • Answering specific questions from biomedical literature using extractive QA (BioASQ format)
  • Generating hypotheses or summaries from biomedical text using BioGPT or BioMedLM
  • Fine-tuning a pre-trained biomedical model on a custom labeled dataset (e.g., your lab's annotations)
  • Embedding biomedical sentences for semantic similarity search across literature
  • Prefer spaCy + en_core_sci_lg for fast NER, or Stanza for dependency parsing, when transformer-level accuracy is not required

Prerequisites

  • Python packages: transformers, torch, datasets, accelerate, sentencepiece
  • GPU: Strongly recommended for fine-tuning; inference on CPU is viable for single texts
  • Data requirements: plain text biomedical strings; for fine-tuning, annotated data in BIO/IOB format
pip install transformers torch datasets accelerate sentencepiece sacremoses  # sacremoses is required by the BioGPT tokenizer
# For GPU (CUDA 11.8)
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu118

Quick Start

from transformers import pipeline

# Named entity recognition with a fine-tuned biomedical NER model
ner = pipeline("ner", model="d4data/biomedical-ner-all",
               aggregation_strategy="simple")

text = "BRCA1 mutations are associated with increased risk of breast cancer and ovarian cancer."
entities = ner(text)
for ent in entities:
    print(f"  {ent['word']:20s} {ent['entity_group']:10s} score={ent['score']:.3f}")

Core API

Module 1: Named Entity Recognition (NER)

Extract biomedical entities using pre-trained NER models.

from transformers import pipeline, AutoTokenizer, AutoModelForTokenClassification

# Pre-trained biomedical NER checkpoints
# Common choices:
#   "d4data/biomedical-ner-all"       — multi-entity biomedical NER
#   "pruas/BENT-PubMedBERT-NER-Gene"  — gene-specific NER
# Note: base encoders such as "allenai/scibert_scivocab_cased" ship without
# an NER head and must be fine-tuned before use with this pipeline
ner_pipe = pipeline(
    "ner",
    model="d4data/biomedical-ner-all",
    aggregation_strategy="simple",  # merge subword tokens into words
    device=-1  # -1=CPU, 0=GPU
)

abstracts = [
    "Imatinib inhibits the BCR-ABL1 tyrosine kinase and is first-line treatment for CML.",
    "EGFR mutations in non-small cell lung cancer predict response to erlotinib.",
]

for text in abstracts:
    entities = ner_pipe(text)
    print(f"\nText: {text[:60]}...")
    for e in entities:
        print(f"  [{e['entity_group']}] '{e['word']}' (score={e['score']:.2f})")
# Manual tokenization + inference for batch processing
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

model_name = "d4data/biomedical-ner-all"  # must be a checkpoint with a trained NER head
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)
model.eval()

text = "Metformin activates AMPK and reduces hepatic glucose production in type 2 diabetes."
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)

with torch.no_grad():
    outputs = model(**inputs)

logits = outputs.logits  # shape: (1, seq_len, n_labels)
predictions = logits.argmax(dim=-1)[0]
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
labels = [model.config.id2label[p.item()] for p in predictions]

for token, label in zip(tokens[1:-1], labels[1:-1]):  # skip [CLS] and [SEP]
    if label != "O":
        print(f"  {token:20s} {label}")

Module 2: Text Classification

Classify biomedical abstracts or sentences.

from transformers import pipeline

# Zero-shot classification — no fine-tuning needed
zs_clf = pipeline("zero-shot-classification",
                  model="facebook/bart-large-mnli",
                  device=-1)

abstract = """
This randomized controlled trial evaluated the efficacy of pembrolizumab versus
chemotherapy in patients with advanced non-small-cell lung cancer. Overall survival
was significantly improved in the pembrolizumab arm (HR=0.60, 95% CI 0.41-0.89).
"""

candidate_labels = ["clinical trial", "basic research", "meta-analysis", "review"]
result = zs_clf(abstract, candidate_labels)
print("Zero-shot classification:")
for label, score in zip(result["labels"], result["scores"]):
    print(f"  {label:20s}: {score:.3f}")
# Fine-tuned sentiment/outcome classification
from transformers import pipeline

# Example: classify clinical outcome sentiment. No widely-used off-the-shelf
# biomedical outcome classifier exists; the general sentiment model below is a
# stand-in. Substitute a model fine-tuned on your own outcome labels.
clf = pipeline("text-classification",
               model="distilbert-base-uncased-finetuned-sst-2-english",
               device=-1)

sentences = [
    "Treatment significantly improved overall survival (p<0.001).",
    "No statistically significant difference was observed between groups.",
]
results = clf(sentences)
for sent, result in zip(sentences, results):
    print(f"  [{result['label']} | {result['score']:.2f}] {sent[:50]}...")

Module 3: Biomedical Question Answering

Extract answers from biomedical text passages.

from transformers import pipeline

# Extractive QA: find answer span within context
qa_pipe = pipeline(
    "question-answering",
    model="sultan/BioM-ELECTRA-Large-SQuAD2",  # biomedical QA model
    device=-1
)

context = """
BRCA1 is a tumor suppressor gene located on chromosome 17q21. Pathogenic variants
in BRCA1 confer a lifetime breast cancer risk of 50-72% and ovarian cancer risk
of 44-46%. BRCA1 protein functions in DNA double-strand break repair via
homologous recombination.
"""

questions = [
    "What chromosome is BRCA1 located on?",
    "What is the lifetime breast cancer risk from BRCA1 variants?",
    "What DNA repair pathway does BRCA1 participate in?",
]

for q in questions:
    result = qa_pipe(question=q, context=context)
    print(f"Q: {q}")
    print(f"A: {result['answer']} (score={result['score']:.3f})\n")

Module 4: Text Generation with BioGPT

Generate biomedical text, hypotheses, and summaries.

from transformers import AutoTokenizer, BioGptForCausalLM
import torch

model_name = "microsoft/biogpt"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = BioGptForCausalLM.from_pretrained(model_name)
model.eval()

prompt = "The role of VEGF in tumor angiogenesis"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=100,
        num_beams=5,
        early_stopping=True,
        no_repeat_ngram_size=3,
        pad_token_id=tokenizer.eos_token_id,
    )

generated = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(f"Generated:\n{generated}")

Module 5: Sentence Embeddings for Semantic Search

Embed biomedical text for similarity search and clustering.

from transformers import AutoTokenizer, AutoModel
import torch
import numpy as np

def mean_pooling(model_output, attention_mask):
    """Mean pooling across token embeddings."""
    token_embeddings = model_output.last_hidden_state
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return (token_embeddings * input_mask_expanded).sum(1) / input_mask_expanded.sum(1)

# PubMedBERT for biomedical sentence embeddings
model_name = "microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
model.eval()

sentences = [
    "BRCA1 is involved in DNA double-strand break repair.",
    "Homologous recombination requires BRCA1 and BRCA2.",
    "Metformin inhibits hepatic gluconeogenesis via AMPK.",
]

inputs = tokenizer(sentences, padding=True, truncation=True,
                   max_length=512, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

embeddings = mean_pooling(outputs, inputs["attention_mask"])
embeddings = torch.nn.functional.normalize(embeddings, p=2, dim=1).numpy()

# Compute cosine similarity
from numpy.linalg import norm
sim_01 = np.dot(embeddings[0], embeddings[1])
sim_02 = np.dot(embeddings[0], embeddings[2])
print(f"Similarity (BRCA1 repair vs. HR): {sim_01:.3f}")
print(f"Similarity (BRCA1 repair vs. Metformin): {sim_02:.3f}")

Module 6: Fine-Tuning on Custom Data

Fine-tune a biomedical model on a labeled NER dataset.

from transformers import (AutoTokenizer, AutoModelForTokenClassification,
                           TrainingArguments, Trainer, DataCollatorForTokenClassification)
from datasets import Dataset
import numpy as np

# Example: minimal NER fine-tuning setup
model_name = "microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract"
label_list = ["O", "B-GENE", "I-GENE", "B-DISEASE", "I-DISEASE"]
id2label = {i: l for i, l in enumerate(label_list)}
label2id = {l: i for i, l in enumerate(label_list)}

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(
    model_name, num_labels=len(label_list), id2label=id2label, label2id=label2id
)

# Training arguments
training_args = TrainingArguments(
    output_dir="./biomed_ner_finetuned",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    warmup_steps=100,
    weight_decay=0.01,
    logging_dir="./logs",
    evaluation_strategy="epoch",  # renamed to eval_strategy= in newer transformers releases
    save_strategy="epoch",
    load_best_model_at_end=True,
)

print(f"Model ready for fine-tuning: {model_name}")
print(f"Labels: {label_list}")
# trainer = Trainer(model=model, args=training_args, ...)
# trainer.train()
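
The Trainer call is left commented out above because it still needs a dataset and a metrics function. Below is a minimal compute_metrics sketch, computing token-level accuracy only; entity-level F1 via the seqeval package is the usual choice in practice, and compute_metrics is simply the conventional hook name that Trainer expects:

```python
import numpy as np

def compute_metrics(eval_pred):
    """Token-level accuracy, skipping the -100 positions that mark
    padding and non-first subword tokens."""
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    mask = labels != -100            # only score real, first-subword tokens
    correct = (preds == labels) & mask
    return {"token_accuracy": float(correct.sum() / mask.sum())}

# Toy check: 2 sequences, 3 tokens each, 5 labels
logits = np.zeros((2, 3, 5))
logits[0, 0, 1] = 1.0   # predict B-GENE
logits[0, 1, 2] = 1.0   # predict I-GENE
labels = np.array([[1, 2, -100], [0, 0, -100]])
print(compute_metrics((logits, labels)))  # {'token_accuracy': 1.0}
```

Pass it as Trainer(..., compute_metrics=compute_metrics) so it runs at each epoch-end evaluation.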

Key Concepts

Tokenization of Biomedical Text

Biomedical text contains special tokens (gene symbols, drug names, chemical SMILES, numeric values) that WordPiece and BPE tokenizers split unexpectedly. For example, "BRCA1" → ["BR", "##CA", "##1"]. This subword splitting does not affect classification tasks but does affect NER — use aggregation_strategy="simple" or "first" in pipeline() to merge subword predictions back to word level.

BIO Labeling Scheme

NER uses BIO (Begin-Inside-Outside) tagging: B-GENE marks the first token of a gene name, I-GENE marks continuation tokens, and O marks non-entity tokens. During fine-tuning, align labels to subword tokens by setting non-first subword labels to -100 (ignored by the loss function).
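
A minimal sketch of that alignment, assuming word_ids comes from a fast tokenizer's word_ids() method; the align_labels helper itself is hypothetical, not a transformers API:

```python
def align_labels(word_labels, word_ids, label2id):
    """Map word-level BIO labels onto subword tokens. word_ids is the output
    of a fast tokenizer's BatchEncoding.word_ids(); None marks special tokens."""
    aligned, prev = [], None
    for wid in word_ids:
        if wid is None:
            aligned.append(-100)                        # [CLS]/[SEP]/padding
        elif wid != prev:
            aligned.append(label2id[word_labels[wid]])  # first subword keeps the label
        else:
            aligned.append(-100)                        # later subwords ignored by the loss
        prev = wid
    return aligned

label2id = {"O": 0, "B-GENE": 1, "I-GENE": 2}
# "BRCA1 mutations" -> subwords: [CLS] BR ##CA ##1 mutations [SEP]
word_ids = [None, 0, 0, 0, 1, None]
print(align_labels(["B-GENE", "O"], word_ids, label2id))
# [-100, 1, -100, -100, 0, -100]
```

An alternative convention labels continuation subwords with the I- tag instead of -100; either works as long as training and evaluation agree.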

Common Workflows

Workflow 1: Batch Abstract NER and Entity Aggregation

from transformers import pipeline
import pandas as pd

ner_pipe = pipeline("ner", model="d4data/biomedical-ner-all",
                    aggregation_strategy="simple", device=-1)

abstracts = [
    "Pembrolizumab combined with chemotherapy significantly improved progression-free survival in HER2-positive breast cancer.",
    "Inhibition of EGFR by gefitinib is effective in patients with activating EGFR mutations in exons 19 and 21.",
    "CRISPR-Cas9 editing of the PCSK9 gene in hepatocytes reduces LDL cholesterol in murine models.",
]

records = []
for i, text in enumerate(abstracts):
    entities = ner_pipe(text)
    for e in entities:
        records.append({
            "abstract_id": i,
            "entity": e["word"],
            "type": e["entity_group"],
            "score": round(e["score"], 3),
        })

df = pd.DataFrame(records)
print(df.groupby("type")["entity"].apply(list).to_string())
df.to_csv("extracted_entities.csv", index=False)
print(f"\nExtracted {len(df)} entity mentions across {len(abstracts)} abstracts")

Workflow 2: Semantic Similarity Ranking for Literature Retrieval

from transformers import AutoTokenizer, AutoModel
import torch
import numpy as np

model_name = "microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
model.eval()

def embed(texts):
    enc = tokenizer(texts, padding=True, truncation=True,
                    max_length=512, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc)
    vecs = out.last_hidden_state[:, 0, :]  # [CLS] token
    return torch.nn.functional.normalize(vecs, dim=1).numpy()

query = "CRISPR base editing for correction of point mutations in genetic disease"
corpus = [
    "Base editing enables precise single-base changes in genomic DNA without double-strand breaks.",
    "CAR-T cell therapy targets CD19 in B-cell acute lymphoblastic leukemia.",
    "Prime editing uses reverse transcriptase to install targeted edits at specific loci.",
    "RNA interference silences gene expression via RISC-mediated mRNA cleavage.",
]

q_emb = embed([query])
c_emb = embed(corpus)
scores = (q_emb @ c_emb.T).flatten()
ranked = sorted(zip(scores, corpus), reverse=True)

print("Top results:")
for score, text in ranked:
    print(f"  [{score:.3f}] {text[:70]}...")

Key Parameters

| Parameter | Module/Function | Default | Range / Options | Effect |
|---|---|---|---|---|
| model | pipeline() | | HuggingFace model ID string | Pre-trained model to load; must match task |
| aggregation_strategy | NER pipeline | "none" | "none", "simple", "first", "average" | Merge subword NER predictions; use "simple" for word-level output |
| device | pipeline() | -1 | -1 (CPU), 0 (GPU 0), 1 (GPU 1) | Inference device |
| max_length | tokenizer | 512 | 128–2048 (model-dependent) | Max token length; truncates longer inputs |
| max_new_tokens | model.generate() | 20 | 1–1000 | Tokens to generate for text generation models |
| num_beams | model.generate() | 1 | 1–10 | Beam search width; larger = better quality, slower |
| num_train_epochs | TrainingArguments | 3 | 1–10 | Fine-tuning epochs |
| per_device_train_batch_size | TrainingArguments | 8 | 4–32 | Batch size per GPU; reduce if OOM |
| weight_decay | TrainingArguments | 0.0 | 0.01–0.1 | L2 regularization for fine-tuning |

Best Practices

  1. Use domain-specific models, not general BERT: PubMedBERT trained from scratch on PubMed outperforms BERT-base by 5–15% on biomedical NER. Always start with biomedical pre-training before fine-tuning on task-specific data.

  2. Verify model licenses before production use: Some models (BioGPT, BioMedLM) have research-only licenses. Check the HuggingFace model card's license field before deploying in commercial applications.

  3. Use aggregation_strategy="simple" for word-level NER output: The default "none" returns subword tokens, making post-processing difficult. "simple" merges subword tokens using the first-token strategy.

  4. Truncate at sentence boundaries, not mid-sentence: Long biomedical abstracts that exceed 512 tokens should be split at sentence boundaries before encoding. Mid-sentence truncation degrades NER accuracy for entities near the cutoff.

Common Recipes

Recipe: Extract Drug-Disease Pairs from PubMed Abstracts

from transformers import pipeline
from itertools import product

ner = pipeline("ner", model="d4data/biomedical-ner-all",
               aggregation_strategy="simple", device=-1)

def extract_drug_disease_pairs(text):
    entities = ner(text)
    drugs    = [e["word"] for e in entities if e["entity_group"] in ("DRUG", "CHEMICAL")]
    diseases = [e["word"] for e in entities if e["entity_group"] in ("DISEASE", "CONDITION")]
    return list(product(drugs, diseases))

text = "Imatinib and nilotinib both target BCR-ABL1 in chronic myeloid leukemia and Philadelphia chromosome-positive ALL."
pairs = extract_drug_disease_pairs(text)
print("Drug-Disease pairs:")
for drug, disease in pairs:
    print(f"  {drug} → {disease}")

Recipe: Sentence-Level Abstract Filtering

from transformers import pipeline

clf = pipeline("zero-shot-classification",
               model="facebook/bart-large-mnli", device=-1)

abstracts = [
    "We present a phase 3 randomized controlled trial of semaglutide in type 2 diabetes.",
    "Structural analysis of the SARS-CoV-2 spike protein RBD domain by cryo-EM.",
    "A retrospective cohort study of 1,200 ICU patients during the COVID-19 pandemic.",
]

label_options = ["randomized controlled trial", "observational study", "structural biology", "computational study"]

for abstract in abstracts:
    result = clf(abstract, label_options)
    print(f"Type: {result['labels'][0]} ({result['scores'][0]:.2f})")
    print(f"  {abstract[:70]}...\n")

Troubleshooting

| Problem | Cause | Solution |
|---|---|---|
| CUDA out of memory during inference | Batch too large for GPU VRAM | Reduce batch size; use device=-1 for CPU; use model.half() for FP16 |
| NER returns subword tokens (##CA) | aggregation_strategy not set | Set aggregation_strategy="simple" in pipeline() |
| Model download times out | Large model files (1–10 GB); slow connection | Set HF_HUB_OFFLINE=1 and download manually with huggingface-cli download |
| NER misses entities at end of long abstracts | Input truncated at 512 tokens | Split abstracts into sentences; process each separately |
| Fine-tuning loss is NaN | Learning rate too high or gradient explosion | Reduce learning_rate to 2e-5; enable gradient clipping with max_grad_norm=1.0 |
| Wrong entities for specialized domain | Generic biomedical model not suited to subdomain | Fine-tune on domain-labeled data; use a more specific model (e.g., gene-only NER) |
| BioGPT generates repetitive text | no_repeat_ngram_size too small | Set no_repeat_ngram_size=3 or 4; increase num_beams |

Related Skills

  • pubmed-database — retrieve PubMed abstracts that serve as input to biomedical NLP pipelines
  • biorxiv-database — retrieve preprints for NLP analysis before peer review
  • scientific-critical-thinking — evaluate quality of NLP-extracted evidence before using for research conclusions
