Medical-research-skills pdb-database
Access the RCSB Protein Data Bank (PDB) to search, download, and programmatically retrieve 3D macromolecular structures and metadata; use when you need structure discovery (text/sequence/3D similarity) or automated structural data ingestion for structural biology and drug discovery workflows.
install
source · Clone the upstream repo
git clone https://github.com/aipoch/medical-research-skills
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/aipoch/medical-research-skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/scientific-skills/Evidence Insight/pdb-database" ~/.claude/skills/aipoch-medical-research-skills-pdb-database && rm -rf "$T"
manifest: scientific-skills/Evidence Insight/pdb-database/SKILL.md
When to Use
Use this skill when you need to:
- Find protein/nucleic acid 3D structures by keywords, organism, experimental method, or resolution.
- Identify related structures via sequence similarity (e.g., homolog search for modeling).
- Identify related structures via 3D structure similarity (e.g., fold-level comparisons).
- Download coordinates (PDB/mmCIF) for downstream analysis, visualization, docking, or modeling.
- Run batch retrieval of metadata/coordinates to feed pipelines in drug discovery, protein engineering, or structural bioinformatics.
Key Features
- Text and attribute-based search over RCSB PDB entries.
- Sequence similarity search with configurable thresholds (e-value, identity).
- Structure similarity search using an existing entry as a query.
- Programmatic metadata retrieval via the RCSB Data API (schema-based or GraphQL).
- Direct coordinate downloads in PDB and mmCIF formats.
- Batch processing patterns for multiple PDB IDs.
Dependencies
- `rcsb-api` (latest recommended; provides `rcsbapi.search` and `rcsbapi.data`)
- `requests>=2.0` (HTTP downloads)
- `biopython>=1.80` (optional; parsing/analyzing PDB coordinates)
Install (example):
uv pip install rcsb-api requests biopython
Example Usage
The following script is end-to-end runnable: it searches for a target, fetches metadata, downloads coordinates, and parses the structure.
```python
#!/usr/bin/env python3
import pathlib

import requests
from rcsbapi.search import TextQuery, AttributeQuery
from rcsbapi.search.attrs import rcsb_entry_info
from rcsbapi.data import fetch, Schema
from Bio.PDB import PDBParser


def download_text(url: str, out_path: pathlib.Path) -> None:
    r = requests.get(url, timeout=60)
    r.raise_for_status()
    out_path.write_text(r.text, encoding="utf-8")


def main():
    out_dir = pathlib.Path("pdb_out")
    out_dir.mkdir(exist_ok=True)

    # 1) Search: hemoglobin entries with resolution < 2.0 Å
    q_text = TextQuery("hemoglobin")
    q_res = AttributeQuery(
        attribute=rcsb_entry_info.resolution_combined,
        operator="less",
        value=2.0,
    )
    query = q_text & q_res
    pdb_ids = list(query())[:5]
    if not pdb_ids:
        raise SystemExit("No results found.")
    pdb_id = pdb_ids[0]
    print(f"Selected PDB ID: {pdb_id}")

    # 2) Fetch entry metadata
    entry = fetch(pdb_id, schema=Schema.ENTRY)
    title = entry.get("struct", {}).get("title")
    method = (entry.get("exptl") or [{}])[0].get("method")
    resolution = (entry.get("rcsb_entry_info") or {}).get("resolution_combined")
    deposit_date = (entry.get("rcsb_accession_info") or {}).get("deposit_date")
    print("Metadata:")
    print(f"  Title: {title}")
    print(f"  Method: {method}")
    print(f"  Resolution: {resolution}")
    print(f"  Deposit date: {deposit_date}")

    # 3) Download coordinates (PDB and mmCIF)
    pdb_path = out_dir / f"{pdb_id}.pdb"
    cif_path = out_dir / f"{pdb_id}.cif"
    download_text(f"https://files.rcsb.org/download/{pdb_id}.pdb", pdb_path)
    download_text(f"https://files.rcsb.org/download/{pdb_id}.cif", cif_path)
    print(f"Downloaded: {pdb_path} and {cif_path}")

    # 4) Parse PDB coordinates (example: count atoms)
    parser = PDBParser(QUIET=True)
    structure = parser.get_structure(pdb_id, str(pdb_path))
    atom_count = sum(1 for _ in structure.get_atoms())
    chain_ids = sorted({chain.id for chain in structure.get_chains()})
    print("Parsed structure:")
    print(f"  Chains: {chain_ids}")
    print(f"  Atom count: {atom_count}")


if __name__ == "__main__":
    main()
```
Implementation Details
Search Modes and Query Composition
- Text search uses free-text matching over entry annotations (titles, keywords, descriptions).
- Attribute search filters by structured fields (e.g., organism, method, resolution).
- Sequence similarity search typically supports:
  - `evalue_cutoff`: lower is more stringent (fewer, more confident hits).
  - `identity_cutoff`: fraction identity threshold (e.g., `0.9` for near-identical).
- Structure similarity search uses an existing structure (e.g., an `entry_id`) as the geometric reference.
- Queries can be combined with boolean logic:
  - `query1 & query2` (AND)
  - `query1 | query2` (OR)
  - `~query` (NOT), where supported by the client
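The search modes and boolean composition listed above can also be written out as the JSON request body that the RCSB Search API accepts. The sketch below builds such bodies with the standard library only. The helper names (`text_node`, `attribute_node`, `sequence_node`, `structure_node`, `group`, `build_request`) are hypothetical and not part of `rcsbapi`, and the endpoint URL and field names are assumptions based on the public Search API v2 request format, so check them against the official schema before relying on them. NOT is left out of the sketch.

```python
import json

# Assumption: public Search API v2 endpoint (not stated in this document).
SEARCH_URL = "https://search.rcsb.org/rcsbsearch/v2/query"


def text_node(value):
    """Free-text match over entry annotations."""
    return {"type": "terminal", "service": "full_text",
            "parameters": {"value": value}}


def attribute_node(attribute, operator, value):
    """Filter on one structured field, e.g. resolution or method."""
    return {"type": "terminal", "service": "text",
            "parameters": {"attribute": attribute,
                           "operator": operator,
                           "value": value}}


def sequence_node(sequence, evalue_cutoff=0.1, identity_cutoff=0.9):
    """Sequence similarity; lower e-value and higher identity are stricter."""
    return {"type": "terminal", "service": "sequence",
            "parameters": {"value": sequence,
                           "sequence_type": "protein",
                           "evalue_cutoff": evalue_cutoff,
                           "identity_cutoff": identity_cutoff}}


def structure_node(entry_id, assembly_id="1"):
    """3D similarity, using an existing entry as the geometric reference."""
    return {"type": "terminal", "service": "structure",
            "parameters": {"value": {"entry_id": entry_id,
                                     "assembly_id": assembly_id},
                           "operator": "strict_shape_match"}}


def group(operator, *nodes):
    """Combine nodes with AND / OR, mirroring `q1 & q2` and `q1 | q2`."""
    if operator not in ("and", "or"):
        raise ValueError(f"unsupported logical operator: {operator!r}")
    return {"type": "group", "logical_operator": operator, "nodes": list(nodes)}


def build_request(query, return_type="entry", rows=5):
    """Wrap a query tree in a request body that asks for the first `rows` hits."""
    return {"query": query,
            "return_type": return_type,
            "request_options": {"paginate": {"start": 0, "rows": rows}}}


# Same intent as the hemoglobin example: text AND resolution < 2.0 Å.
payload = build_request(
    group("and",
          text_node("hemoglobin"),
          attribute_node("rcsb_entry_info.resolution_combined", "less", 2.0))
)
print(json.dumps(payload, indent=2))
```

Sending `payload` as a JSON POST body to `SEARCH_URL` should return the matching identifiers. Building the tree as plain dictionaries keeps the query easy to inspect and log before any request is made.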
Data Retrieval (Schema vs GraphQL)
- Schema-based fetch (e.g., `Schema.ENTRY`, `Schema.POLYMER_ENTITY`) is convenient for common objects and stable access patterns.
- GraphQL fetch is best when you need a custom selection of fields in one request (reduce round-trips and payload).
Example GraphQL pattern:
```python
from rcsbapi.data import fetch

query = """
{
  entry(entry_id: "4HHB") {
    struct { title }
    exptl { method }
    rcsb_entry_info {
      resolution_combined
      deposited_atom_count
    }
  }
}
"""

data = fetch(query_type="graphql", query=query)
```
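When several entries are needed, a single GraphQL request can cover all of them, which is where the round-trip savings come from. The following is a minimal sketch, assuming the public GraphQL endpoint at `https://data.rcsb.org/graphql` and its `entries(entry_ids: [...])` root field. `build_entries_query` and `run_graphql` are hypothetical helpers that use only the standard library instead of `rcsbapi`.

```python
import json
import urllib.request

# Assumption: public Data API GraphQL endpoint (not stated in this document).
GRAPHQL_URL = "https://data.rcsb.org/graphql"


def build_entries_query(entry_ids, selection):
    """Return one GraphQL document that selects `selection` for every ID."""
    # json.dumps gives correctly quoted and escaped string literals.
    ids = ", ".join(json.dumps(entry_id.upper()) for entry_id in entry_ids)
    return f"{{ entries(entry_ids: [{ids}]) {{ rcsb_id {selection} }} }}"


def run_graphql(query, timeout=60):
    """POST the query and return the decoded `data` object (needs network)."""
    body = json.dumps({"query": query}).encode("utf-8")
    request = urllib.request.Request(
        GRAPHQL_URL,
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        payload = json.load(response)
    if payload.get("errors"):
        raise RuntimeError(payload["errors"])
    return payload["data"]


query = build_entries_query(
    ["4hhb", "1tup"],
    "struct { title } rcsb_entry_info { resolution_combined }",
)
print(query)
```

Calling `run_graphql(query)` would then return all requested entries in one response. Keeping query construction separate from the network call makes the generated document easy to review or test offline.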
Coordinate Downloads and Formats
- PDB: legacy text format; widely supported but less expressive for large/complex structures.
- mmCIF (PDBx): modern standard; preferred for completeness and large structures.
Direct download endpoints:
- https://files.rcsb.org/download/{PDB_ID}.pdb
- https://files.rcsb.org/download/{PDB_ID}.cif
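A small helper can turn the endpoint templates above into validated URLs and local files. This is a sketch under stated assumptions: `coordinate_url` and `download_coordinates` are hypothetical names, the ID check only covers classic four-character IDs, and the download uses the standard library rather than `requests`.

```python
import pathlib
import re
import urllib.request

DOWNLOAD_BASE = "https://files.rcsb.org/download"
# Assumption: classic 4-character IDs (digit followed by three alphanumerics).
_PDB_ID_RE = re.compile(r"^[0-9][A-Za-z0-9]{3}$")


def coordinate_url(pdb_id, fmt="cif"):
    """Build the direct download URL for one entry in PDB or mmCIF format."""
    if fmt not in ("pdb", "cif"):
        raise ValueError(f"unsupported format: {fmt!r}")
    if not _PDB_ID_RE.match(pdb_id):
        raise ValueError(f"not a 4-character PDB ID: {pdb_id!r}")
    return f"{DOWNLOAD_BASE}/{pdb_id.upper()}.{fmt}"


def download_coordinates(pdb_id, out_dir, fmt="cif", timeout=60):
    """Download one coordinate file unless it is already on disk (needs network)."""
    out_dir = pathlib.Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    target = out_dir / f"{pdb_id.upper()}.{fmt}"
    if target.exists():
        return target  # reuse the earlier download
    with urllib.request.urlopen(coordinate_url(pdb_id, fmt), timeout=timeout) as resp:
        target.write_bytes(resp.read())
    return target


print(coordinate_url("4hhb", "pdb"))
print(coordinate_url("4hhb"))
```

The helper defaults to mmCIF because it is the preferred format for completeness and large structures; pass `fmt="pdb"` when a downstream tool only reads the legacy format.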
Batch Processing Pattern
For batch metadata retrieval, iterate over IDs and call `fetch(pdb_id, schema=Schema.ENTRY)`; handle exceptions per-ID to keep pipelines robust. For large batches, consider rate limiting and caching to avoid repeated downloads.
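The batch pattern described above (per-ID error handling, rate limiting, caching) can be sketched independently of any particular client. In the example below, `batch_fetch` is a hypothetical helper, `fetch_one` stands for whatever metadata call the pipeline uses, and the fixed delay and JSON-file cache are illustrative choices rather than documented RCSB requirements.

```python
import json
import pathlib
import tempfile
import time


def batch_fetch(pdb_ids, fetch_one, cache_dir="pdb_cache", delay_s=0.2):
    """Fetch metadata for each ID and return (results, errors) dictionaries."""
    cache = pathlib.Path(cache_dir)
    cache.mkdir(parents=True, exist_ok=True)
    results, errors = {}, {}
    for pdb_id in pdb_ids:
        key = pdb_id.upper()
        cached = cache / f"{key}.json"
        if cached.exists():
            # Cache hit: no remote call and no delay needed.
            results[key] = json.loads(cached.read_text(encoding="utf-8"))
            continue
        try:
            record = fetch_one(key)
        except Exception as exc:  # one bad ID must not stop the batch
            errors[key] = repr(exc)
        else:
            cached.write_text(json.dumps(record), encoding="utf-8")
            results[key] = record
        time.sleep(delay_s)  # simple fixed-interval rate limit
    return results, errors


if __name__ == "__main__":
    def fake_fetch(pdb_id):
        """Stand-in for a real metadata call; fails for one ID on purpose."""
        if pdb_id == "0BAD":
            raise KeyError(pdb_id)
        return {"rcsb_id": pdb_id}

    with tempfile.TemporaryDirectory() as tmp:
        ok, failed = batch_fetch(["4hhb", "0bad", "1tup"], fake_fetch,
                                 cache_dir=tmp, delay_s=0)
        print(sorted(ok), sorted(failed))
```

In a real pipeline, `fetch_one` would wrap the metadata call and the returned `errors` dictionary would be logged or retried. Passing the fetch function in as an argument keeps the loop testable without network access.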
Reference Documentation
If present in this repository, consult `references/api_reference.md` for advanced endpoint usage, query patterns, schema notes, rate limits, and troubleshooting.