SciAgent-Skills pdb-database

Query RCSB PDB (200K+ experimental structures) via rcsb-api Python SDK. Text, attribute, sequence, and structure similarity search. Fetch metadata via Schema or GraphQL. Download PDB/mmCIF coordinate files. For AlphaFold predicted structures use alphafold-database-access.

install

source · Clone the upstream repo

git clone https://github.com/jaechang-hits/SciAgent-Skills

Claude Code · Install into ~/.claude/skills/

T=$(mktemp -d) && git clone --depth=1 https://github.com/jaechang-hits/SciAgent-Skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/structural-biology-drug-discovery/pdb-database" ~/.claude/skills/jaechang-hits-sciagent-skills-pdb-database && rm -rf "$T"

manifest: skills/structural-biology-drug-discovery/pdb-database/SKILL.md

source content

PDB Database

Overview

RCSB PDB is the worldwide repository for 3D structural data of biological macromolecules with 200,000+ experimentally determined structures. The

rcsb-api

Python SDK provides unified access to Search API (find PDB IDs by text, attributes, sequence, or structure similarity) and Data API (retrieve metadata and coordinates). Use this skill for programmatic structural biology queries, drug target analysis, and protein family comparisons.

When to Use

Searching for protein or nucleic acid crystal/cryo-EM/NMR structures by keyword or property
Finding structures similar to a query sequence (MMseqs2) or 3D geometry (BioZernike)
Retrieving experimental metadata (resolution, method, organism, deposition date) for structure sets
Downloading coordinate files (PDB, mmCIF) for molecular dynamics, docking, or visualization
Building structure-based datasets for machine learning or drug discovery pipelines
Comparing protein-ligand complexes across a target family
For AlphaFold predicted structures, use
```
alphafold-database-access
```
instead
For protein sequence/annotation queries without structures, use
```
uniprot-protein-database
```
instead

Prerequisites

Python packages:
```
rcsb-api
```
(unified Search + Data SDK), optionally
```
requests
```
for file downloads,
```
biopython
```
for coordinate parsing
No API key required: RCSB PDB is freely accessible
Rate limits: No published hard limit; implement exponential backoff on HTTP 429. Recommended: 2-5 requests/second for batch operations

pip install rcsb-api requests
# Optional: pip install biopython  # for PDB/mmCIF coordinate parsing

Quick Start

Typical search-then-fetch pattern:

from rcsbapi.search import TextQuery, AttributeQuery
from rcsbapi.search.attrs import rcsb_entry_info, rcsb_entity_source_organism
from rcsbapi.data import fetch, Schema

# 1. Search: high-resolution human kinases
query = (
    TextQuery("kinase") &
    AttributeQuery(
        attribute=rcsb_entity_source_organism.scientific_name,
        operator="exact_match", value="Homo sapiens"
    ) &
    AttributeQuery(
        attribute=rcsb_entry_info.resolution_combined,
        operator="less", value=2.0
    )
)
pdb_ids = list(query())
print(f"Found {len(pdb_ids)} structures")  # e.g., Found 1523 structures

# 2. Fetch metadata for first hit
data = fetch(pdb_ids[0], schema=Schema.ENTRY)
print(data["struct"]["title"])
print(f"Resolution: {data['rcsb_entry_info']['resolution_combined']} A")

Core API

Module 1: Text and Attribute Search

Text search finds entries by keyword across all indexed fields.

from rcsbapi.search import TextQuery

query = TextQuery("hemoglobin")
results = list(query())
print(f"Found {len(results)} structures")  # Found ~750 structures

Attribute search queries specific properties with typed operators.

from rcsbapi.search import AttributeQuery
from rcsbapi.search.attrs import rcsb_entity_source_organism, exptl, rcsb_entry_info

# Human proteins
q_human = AttributeQuery(
    attribute=rcsb_entity_source_organism.scientific_name,
    operator="exact_match", value="Homo sapiens"
)

# X-ray only
q_xray = AttributeQuery(
    attribute=exptl.method,
    operator="exact_match", value="X-RAY DIFFRACTION"
)

# Resolution range
q_res = AttributeQuery(
    attribute=rcsb_entry_info.resolution_combined,
    operator="range", value=(1.5, 2.5)
)

results = list(q_human & q_xray & q_res)
print(f"Human X-ray 1.5-2.5A: {len(results)} structures")

Module 2: Sequence Similarity Search

Find structures with similar sequences using MMseqs2. Supports protein, DNA, and RNA.

from rcsbapi.search import SequenceQuery

# Protein sequence search (KRAS)
query = SequenceQuery(
    value="MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQVVIDGETCLLDILDTAG"
          "QEEYSAMRDQYMRTGEGFLCVFAINNTKSFEDIHHYREQIKRVKDSEDVPMVLVGNKCDL"
          "PSRTVDTKQAQDLARSYGIPFIETSAKTRQGVDDAFYTLVREIRKHKEKMSK",
    evalue_cutoff=0.1,
    identity_cutoff=0.9
)
results = list(query())
print(f"Sequence hits: {len(results)}")

# DNA sequence search
dna_query = SequenceQuery(
    value="ACGTACGTACGTACGTACGT",
    evalue_cutoff=1e-5,
    identity_cutoff=0.8,
    sequence_type="dna"  # "protein" (default), "dna", "rna"
)

Module 3: Structure Similarity Search

Find structures with similar 3D geometry using BioZernike descriptors.

from rcsbapi.search import StructSimilarityQuery

# Search by full entry
query = StructSimilarityQuery(
    structure_search_type="entry",
    entry_id="4HHB"
)
results = list(query())
print(f"Structurally similar to 4HHB: {len(results)}")

# Search by specific chain
chain_query = StructSimilarityQuery(
    structure_search_type="chain",
    entry_id="4HHB",
    chain_id="A"
)

# Search by biological assembly
assembly_query = StructSimilarityQuery(
    structure_search_type="assembly",
    entry_id="4HHB",
    assembly_id="1"
)

Module 4: Data Retrieval (Schema and GraphQL)

Fetch structured metadata for known PDB IDs using Schema objects or custom GraphQL.

from rcsbapi.data import fetch, Schema

# Entry-level data
entry = fetch("4HHB", schema=Schema.ENTRY)
print(entry["struct"]["title"])
print(entry["exptl"][0]["method"])  # X-RAY DIFFRACTION
print(entry["rcsb_entry_info"]["resolution_combined"])  # 1.74

# Polymer entity data (append _N for entity number)
entity = fetch("4HHB_1", schema=Schema.POLYMER_ENTITY)
print(entity["entity_poly"]["pdbx_seq_one_letter_code"][:50])
print(entity["rcsb_entity_source_organism"][0]["scientific_name"])

Custom GraphQL for flexible cross-level queries in a single request:

from rcsbapi.data import fetch

gql = """{
  entry(entry_id: "4HHB") {
    struct { title }
    exptl { method }
    rcsb_entry_info { resolution_combined deposited_atom_count polymer_entity_count }
    rcsb_accession_info { deposit_date initial_release_date }
  }
}"""
data = fetch(query_type="graphql", query=gql)
info = data["data"]["entry"]
print(f"Title: {info['struct']['title']}")
print(f"Atoms: {info['rcsb_entry_info']['deposited_atom_count']}")

Module 5: File Download

Download coordinate files from RCSB file servers.

import requests

def download_structure(pdb_id, fmt="cif", output_dir="."):
    """Download PDB/mmCIF/FASTA. URLs: .pdb, .cif, /fasta/entry/{ID}, .pdb1 (assembly)."""
    url = f"https://files.rcsb.org/download/{pdb_id}.{fmt}"
    resp = requests.get(url)
    if resp.status_code == 200:
        path = f"{output_dir}/{pdb_id}.{fmt}"
        with open(path, "w") as f:
            f.write(resp.text)
        print(f"Downloaded {path} ({len(resp.text)} bytes)")
        return path
    print(f"Error {resp.status_code} for {pdb_id}")
    return None

# Download both formats
download_structure("4HHB", fmt="pdb")
download_structure("4HHB", fmt="cif")

Module 6: Query Composition

Combine queries with Python bitwise operators (

AND,

OR,

NOT).

from rcsbapi.search import TextQuery, AttributeQuery
from rcsbapi.search.attrs import rcsb_entry_info, rcsb_entity_source_organism, refine
import datetime

# AND: high-resolution human structures
q_and = (
    AttributeQuery(attribute=rcsb_entity_source_organism.scientific_name,
                    operator="exact_match", value="Homo sapiens") &
    AttributeQuery(attribute=rcsb_entry_info.resolution_combined,
                    operator="less", value=2.0)
)

# OR: human or mouse
q_or = (
    AttributeQuery(attribute=rcsb_entity_source_organism.scientific_name,
                    operator="exact_match", value="Homo sapiens") |
    AttributeQuery(attribute=rcsb_entity_source_organism.scientific_name,
                    operator="exact_match", value="Mus musculus")
)

# NOT: exclude low resolution
q_not = TextQuery("protein") & ~AttributeQuery(
    attribute=rcsb_entry_info.resolution_combined, operator="greater", value=3.0
)

# Complex: recent high-quality structures
one_month_ago = (datetime.date.today() - datetime.timedelta(days=30)).isoformat()
from rcsbapi.search.attrs import rcsb_accession_info
q_complex = (
    AttributeQuery(attribute=rcsb_entry_info.resolution_combined,
                    operator="less", value=2.0) &
    AttributeQuery(attribute=refine.ls_R_factor_R_free,
                    operator="less", value=0.25) &
    AttributeQuery(attribute=rcsb_accession_info.initial_release_date,
                    operator="range", value=(one_month_ago, datetime.date.today().isoformat()))
)
print(f"Recent high-quality: {len(list(q_complex()))} structures")

Module 7: Batch Operations with Rate Limiting

Fetch data for multiple structures with exponential backoff.

import time
from rcsbapi.search import TextQuery
from rcsbapi.data import fetch, Schema

def batch_fetch(pdb_ids, schema=Schema.ENTRY, delay=0.5, max_retries=3):
    """Fetch metadata for multiple PDB IDs with rate-limit handling."""
    results = {}
    for i, pdb_id in enumerate(pdb_ids):
        wait = delay
        for attempt in range(max_retries):
            try:
                results[pdb_id] = fetch(pdb_id, schema=schema)
                break
            except Exception as e:
                if "429" in str(e):
                    print(f"Rate limited on {pdb_id}, waiting {wait}s...")
                    time.sleep(wait)
                    wait *= 2
                else:
                    print(f"Error {pdb_id}: {e}")
                    break
        if (i + 1) % 10 == 0:
            print(f"Fetched {i + 1}/{len(pdb_ids)}")
        time.sleep(delay)
    return results

# Usage
query = TextQuery("insulin")
pdb_ids = list(query())[:20]  # First 20 hits
data = batch_fetch(pdb_ids, delay=0.3)
for pid, d in list(data.items())[:3]:
    print(f"{pid}: {d['struct']['title']}")

Key Concepts

PDB Data Hierarchy

PDB organizes data in a strict hierarchy. Use the correct Schema for each level:

Level	ID Format	Schema	Content
Entry	`4HHB`	`Schema.ENTRY`	Experiment, resolution, deposition
Polymer Entity	`4HHB_1`	`Schema.POLYMER_ENTITY`	Sequence, organism, molecular weight
Nonpolymer Entity	`4HHB_1`	`Schema.NONPOLYMER_ENTITY`	Ligands, cofactors, ions
Branched Entity	`4HHB_1`	`Schema.BRANCHED_ENTITY`	Oligosaccharides
Assembly	`4HHB/1`	`Schema.ASSEMBLY`	Biological unit, stoichiometry
Chain Instance	`4HHB.A`	`Schema.POLYMER_ENTITY_INSTANCE`	Individual chain coordinates
Chemical Component	`ATP`	`Schema.CHEM_COMP`	Small molecule metadata

Common Data Fields

Entry:

struct.title

exptl[].method

rcsb_entry_info.resolution_combined

rcsb_entry_info.deposited_atom_count

rcsb_accession_info.deposit_date

rcsb_accession_info.initial_release_date

Polymer entity:

entity_poly.pdbx_seq_one_letter_code

rcsb_polymer_entity.formula_weight

rcsb_entity_source_organism.scientific_name

rcsb_entity_source_organism.ncbi_taxonomy_id

Assembly:

rcsb_assembly_info.polymer_entity_count

rcsb_assembly_info.assembly_id

Search Query Types

Query Type	Class	Use Case	Engine
Full text	`TextQuery`	Keyword search across all fields	Lucene
Attribute	`AttributeQuery`	Property filters (organism, resolution)	Exact match
Sequence	`SequenceQuery`	Sequence similarity (protein/DNA/RNA)	MMseqs2
Sequence Motif	`SequenceMotifQuery`	Regex/PROSITE motif patterns	Custom
Structure	`StructSimilarityQuery`	3D geometry similarity	BioZernike
Structure Motif	`StructMotifQuery`	3D residue arrangement patterns	InvertedIndex
Chemical	`ChemSimilarityQuery`	Ligand similarity by SMILES/InChI	Fingerprint

AttributeQuery Operators

Operator	Type	Example Value
`exact_match`	String	`"Homo sapiens"`
`contains_words`	String	`"kinase inhibitor"`
`contains_phrase`	String	`"tyrosine kinase"`
`equals`	Numeric	`2.0`
`greater` / `less`	Numeric	`3.0`
`greater_or_equal` / `less_or_equal`	Numeric	`1.5`
`range`	Numeric/Date	`(1.5, 2.5)` or `("2024-01-01", "2024-12-31")`
`exists`	Any	`None` (field has value)
`in`	List	`["X-RAY DIFFRACTION", "ELECTRON MICROSCOPY"]`

File Format Comparison

Format	Extension	Best For	Notes
PDB	`.pdb`	Legacy tools, small structures	99,999 atom limit, fixed-width
mmCIF	`.cif`	Modern standard, large structures	No atom limit, key-value pairs
BinaryCIF	`.bcif`	Web viewers, bandwidth-limited	Compressed binary, via ModelServer

Return Types

Control what search results return via

ReturnType

from rcsbapi.search import TextQuery, ReturnType

query = TextQuery("hemoglobin")
# Default: PDB IDs
ids = list(query())  # ['4HHB', '1A3N', ...]
# With scores
scored = list(query(return_type=ReturnType.ENTRY, return_scores=True))
# [{'identifier': '4HHB', 'score': 0.95}, ...]
# Polymer entities
entities = list(query(return_type=ReturnType.POLYMER_ENTITY))
# ['4HHB_1', '4HHB_2', ...]

Common Workflows

Workflow 1: Drug Target Structure Search

Goal: Find high-resolution structures of a drug target with bound ligands.

from rcsbapi.search import TextQuery, AttributeQuery
from rcsbapi.search.attrs import rcsb_entry_info, rcsb_entity_source_organism
from rcsbapi.data import fetch, Schema
import time

# Search: human EGFR, high resolution
query = (
    TextQuery("EGFR epidermal growth factor receptor") &
    AttributeQuery(attribute=rcsb_entity_source_organism.scientific_name,
                    operator="exact_match", value="Homo sapiens") &
    AttributeQuery(attribute=rcsb_entry_info.resolution_combined,
                    operator="less", value=2.5)
)
pdb_ids = list(query())
print(f"EGFR structures: {len(pdb_ids)}")

# Filter for structures with bound ligands
targets = []
for pid in pdb_ids[:10]:
    data = fetch(pid, schema=Schema.ENTRY)
    n_lig = data.get("rcsb_entry_info", {}).get("nonpolymer_entity_count", 0)
    if n_lig and n_lig > 0:
        targets.append({"pdb_id": pid, "title": data["struct"]["title"],
                         "resolution": data["rcsb_entry_info"]["resolution_combined"],
                         "ligands": n_lig})
    time.sleep(0.3)

for t in sorted(targets, key=lambda x: x["resolution"]):
    print(f"{t['pdb_id']}: {t['resolution']}A, {t['ligands']} ligands - {t['title'][:60]}")

Workflow 2: Protein Family Structural Comparison

Goal: Compare structures across a protein family by sequence similarity.

from rcsbapi.search import SequenceQuery
from rcsbapi.data import fetch, Schema
import time

# Find structures similar to KRAS
query = SequenceQuery(
    value="MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQVVIDGETCLLDILDTAG",
    evalue_cutoff=1e-5, identity_cutoff=0.5
)
hits = list(query())
print(f"Family members: {len(hits)}")

# Collect metadata with rate limiting
family = []
for pid in hits[:15]:
    try:
        d = fetch(pid, schema=Schema.ENTRY)
        family.append({"pdb_id": pid, "title": d["struct"]["title"],
                        "resolution": d.get("rcsb_entry_info", {}).get("resolution_combined"),
                        "method": d["exptl"][0]["method"]})
    except Exception as e:
        print(f"Skip {pid}: {e}")
    time.sleep(0.3)

for f in sorted(family, key=lambda x: x["resolution"] or 99):
    res = f"{f['resolution']:.1f}" if f["resolution"] else "N/A"
    print(f"{f['pdb_id']}: {res}A {f['method']:<20} {f['title'][:50]}")

Workflow 3: Structure Download and Coordinate Parsing

Goal: Download a structure and extract chain-level information with BioPython.

import requests
from Bio.PDB import PDBParser, MMCIFParser

# Download mmCIF
pdb_id = "4HHB"
url = f"https://files.rcsb.org/download/{pdb_id}.cif"
resp = requests.get(url)
with open(f"{pdb_id}.cif", "w") as f:
    f.write(resp.text)

# Parse and analyze
parser = MMCIFParser(QUIET=True)
structure = parser.get_structure(pdb_id, f"{pdb_id}.cif")

for model in structure:
    for chain in model:
        residues = [r for r in chain if r.id[0] == " "]  # standard residues
        atoms = sum(len(list(r.get_atoms())) for r in residues)
        print(f"Chain {chain.id}: {len(residues)} residues, {atoms} atoms")
# Chain A: 141 residues, 1070 atoms
# Chain B: 146 residues, 1123 atoms
# ...

Key Parameters

Parameter	Function/Endpoint	Default	Range / Options	Effect
`evalue_cutoff`	`SequenceQuery`	`0.1`	`1e-10` to `10`	E-value threshold for sequence hits
`identity_cutoff`	`SequenceQuery`	`0.9`	`0.0` - `1.0`	Minimum sequence identity fraction
`sequence_type`	`SequenceQuery`	`"protein"`	`"protein"` , `"dna"` , `"rna"`	Query sequence type
`operator`	`AttributeQuery`	(required)	See Operators table	Comparison type for attribute filter
`structure_search_type`	`StructSimilarityQuery`	(required)	`"entry"` , `"chain"` , `"assembly"`	Granularity of structure comparison
`return_type`	`query()`	`ReturnType.ENTRY`	`ENTRY` , `POLYMER_ENTITY` , `ASSEMBLY` , etc.	What identifiers to return
`schema`	`fetch()`	(required)	`Schema.ENTRY` , `POLYMER_ENTITY` , `ASSEMBLY` , etc.	Data hierarchy level to retrieve

Best Practices

Search then fetch: Use Search API to get PDB IDs first, then Data API to retrieve metadata. Avoid fetching blindly.
Use GraphQL for complex queries: When you need fields from multiple hierarchy levels (entry + entity + assembly), a single GraphQL query is more efficient than multiple REST calls.
Cache aggressively: PDB data changes weekly (Wednesday releases). Cache results within a session to avoid redundant requests.
Prefer mmCIF over PDB format: PDB format has a 99,999 atom limit and is being phased out. Always use
```
.cif
```
for new work.

Use typed attribute imports: Import attributes from

rcsbapi.search.attrs

for IDE autocompletion and typo prevention:

from rcsbapi.search.attrs import rcsb_entry_info
# Better than raw string: "rcsb_entry_info.resolution_combined"

Inspect query JSON for debugging: Call
```
query.to_dict()
```
to see the JSON payload sent to the API.

Troubleshooting

Problem	Cause	Solution
`ModuleNotFoundError: rcsbapi`	Package not installed	`pip install rcsb-api` (not `rcsbsearchapi` , which is deprecated)
HTTP 404 on fetch	PDB ID obsolete or invalid	Check ID at rcsb.org; use `rcsb_accession_info.status_code` to find superseding entry
HTTP 429 Too Many Requests	Rate limit exceeded	Add `time.sleep(0.5)` between requests; use exponential backoff (see Module 7)
HTTP 500 Internal Server Error	Temporary server issue	Retry after 5-10s delay; check status.rcsb.org
Empty search results	Query too restrictive or attribute name wrong	Relax filters; verify attribute paths via `rcsbapi.search.attrs`
`KeyError` on fetch result	Field not present for this entry	Use `.get()` with default: `data.get("rcsb_entry_info", {}).get("resolution_combined")`
Sequence search returns 0 hits	E-value or identity too strict	Lower `identity_cutoff` (try 0.3) and raise `evalue_cutoff` (try 1.0)
Downloaded PDB file truncated	Structure too large for PDB format	Download mmCIF ( `.cif` ) instead; PDB has 99,999 atom limit

Bundled Resources

Self-contained consolidation: Consolidates original SKILL.md (308 lines) and references/api_reference.md (618 lines, 926 total). ~150 lines of duplicated code deduplicated; effective original ~776 lines.

Content from api_reference.md consolidated into: Key Concepts (data hierarchy, fields, query types, operators, return types), Core API Modules 1-7 (search patterns, GraphQL, download helper, rate limiting, batch), Workflows (drug-target, quality filtering), Troubleshooting table.

Omitted: ModelServer API, VolumeServer API, Sequence Coordinates API, Alignment API (4 specialized server APIs -- rarely used programmatically; documented at data.rcsb.org). Debugging subprocess/curl tip replaced by

to_dict()

in Best Practices.

Related Skills

alphafold-database-access -- AI-predicted structures; use when no experimental structure exists
uniprot-protein-database -- protein annotations, sequences, ID mapping (UniProt accession needed for AlphaFold)
biopython-molecular-biology -- parse downloaded PDB/mmCIF files, extract coordinates, compute distances
autodock-vina-docking -- downstream molecular docking using PDB structures as receptors
rdkit-cheminformatics -- analyze ligands extracted from PDB complexes

References

RCSB PDB -- main portal and web search
rcsb-api Python SDK documentation -- official SDK docs
rcsb-api GitHub repository -- source code and issues
RCSB PDB Web APIs overview -- REST, GraphQL, and search API docs
Data API reference -- full endpoint documentation