SciAgent-Skills alphafold-database-access

install

source · Clone the upstream repo

git clone https://github.com/jaechang-hits/SciAgent-Skills

Claude Code · Install into ~/.claude/skills/

T=$(mktemp -d) && git clone --depth=1 https://github.com/jaechang-hits/SciAgent-Skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/structural-biology-drug-discovery/alphafold-database-access" ~/.claude/skills/jaechang-hits-sciagent-skills-alphafold-database-access && rm -rf "$T"

manifest: skills/structural-biology-drug-discovery/alphafold-database-access/SKILL.md

source content

AlphaFold Database Access

Overview

AlphaFold DB is a public repository of AI-predicted 3D protein structures for over 200 million proteins, maintained by DeepMind and EMBL-EBI. Access predictions via BioPython or REST API, download coordinate files in multiple formats, analyze confidence metrics, and retrieve bulk proteome datasets via Google Cloud.

When to Use

Retrieving AI-predicted protein structures by UniProt accession
Downloading PDB/mmCIF coordinate files for structural analysis or docking
Analyzing prediction confidence (pLDDT per-residue, PAE domain-level)
Bulk-downloading entire proteome predictions via Google Cloud
Comparing predicted structures with experimental PDB structures
Building structural models for proteins lacking experimental data
Identifying high-confidence binding sites for drug discovery
For experimental structures only → use PDB directly
For running AlphaFold predictions → use ColabFold or local AlphaFold

Prerequisites

# Core (BioPython for structure access)
pip install biopython requests numpy matplotlib

# Optional: Google Cloud for bulk access
pip install google-cloud-bigquery google-cloud-storage

Quick Start

from Bio.PDB import alphafold_db, MMCIFParser
import requests, numpy as np

# 1. Get prediction for a protein
uniprot_id = "P00520"  # ABL1 kinase
predictions = list(alphafold_db.get_predictions(uniprot_id))
af_id = predictions[0]['entryId']  # AF-P00520-F1

# 2. Download structure
cif_file = alphafold_db.download_cif_for(predictions[0], directory="./structures")

# 3. Check confidence
conf = requests.get(f"https://alphafold.ebi.ac.uk/files/{af_id}-confidence_v4.json").json()
scores = conf['confidenceScore']
print(f"Mean pLDDT: {np.mean(scores):.1f}, High-conf residues: {sum(1 for s in scores if s > 90)}/{len(scores)}")

Core API

1. Prediction Retrieval

BioPython (recommended for single proteins):

from Bio.PDB import alphafold_db

# Get prediction metadata
predictions = list(alphafold_db.get_predictions("P00520"))
pred = predictions[0]
print(f"AlphaFold ID: {pred['entryId']}")
print(f"Gene: {pred['gene']}, Species: {pred['organismScientificName']}")

# Get Structure objects directly
structures = list(alphafold_db.get_structural_models_for("P00520"))

REST API (for metadata or integration):

import requests

uniprot_id = "P00520"
url = f"https://alphafold.ebi.ac.uk/api/prediction/{uniprot_id}"
response = requests.get(url)
data = response.json()

# Response includes download URLs for all file types
pred = data[0]
print(f"CIF: {pred['cifUrl']}")
print(f"PDB: {pred['pdbUrl']}")
print(f"PAE: {pred['paeDocUrl']}")

3D-Beacons federated API (query multiple structure providers):

url = f"https://www.ebi.ac.uk/pdbe/pdbe-kb/3dbeacons/api/uniprot/summary/{uniprot_id}.json"
data = requests.get(url).json()
af_structures = [s for s in data['structures'] if s['provider'] == 'AlphaFold DB']

2. Structure File Download

import requests

af_id = "AF-P00520-F1"
version = "v4"
base = "https://alphafold.ebi.ac.uk/files"

# mmCIF (recommended — full metadata, supports large structures)
cif = requests.get(f"{base}/{af_id}-model_{version}.cif")
with open(f"{af_id}.cif", "w") as f:
    f.write(cif.text)

# PDB format (legacy — limited to 99,999 atoms)
pdb = requests.get(f"{base}/{af_id}-model_{version}.pdb")
with open(f"{af_id}.pdb", "wb") as f:
    f.write(pdb.content)

# Confidence JSON (per-residue pLDDT scores)
conf = requests.get(f"{base}/{af_id}-confidence_{version}.json").json()

# PAE matrix JSON (inter-residue confidence)
pae = requests.get(f"{base}/{af_id}-predicted_aligned_error_{version}.json").json()

3. Confidence Metrics Analysis

pLDDT (per-residue confidence, 0–100):

import numpy as np

conf_url = f"https://alphafold.ebi.ac.uk/files/{af_id}-confidence_v4.json"
conf = requests.get(conf_url).json()
scores = conf['confidenceScore']

# Classify residues by confidence
very_high = sum(1 for s in scores if s > 90)
high = sum(1 for s in scores if 70 < s <= 90)
low = sum(1 for s in scores if 50 < s <= 70)
very_low = sum(1 for s in scores if s <= 50)
print(f"Very high (>90): {very_high}, High (70-90): {high}, Low (50-70): {low}, Very low (<50): {very_low}")

PAE (Predicted Aligned Error) visualization:

import matplotlib.pyplot as plt

pae_url = f"https://alphafold.ebi.ac.uk/files/{af_id}-predicted_aligned_error_v4.json"
pae = requests.get(pae_url).json()
pae_matrix = np.array(pae['distance'])

plt.figure(figsize=(10, 8))
plt.imshow(pae_matrix, cmap='viridis_r', vmin=0, vmax=30)
plt.colorbar(label='PAE (Å)')
plt.title(f'Predicted Aligned Error: {af_id}')
plt.xlabel('Residue')
plt.ylabel('Residue')
plt.savefig(f'{af_id}_pae.png', dpi=300, bbox_inches='tight')
# Low PAE (<5 Å) = confident relative positioning; >15 Å = uncertain domain arrangement

4. Bulk Data Access (Google Cloud)

# List available data
gsutil ls gs://public-datasets-deepmind-alphafold-v4/

# Download entire proteome by taxonomy ID
gsutil -m cp gs://public-datasets-deepmind-alphafold-v4/proteomes/proteome-tax_id-9606-*_v4.tar .

# Download accession index
gsutil cp gs://public-datasets-deepmind-alphafold-v4/accession_ids.csv .

BigQuery metadata queries:

from google.cloud import bigquery

client = bigquery.Client()
query = """
SELECT entryId, uniprotAccession, gene, organismScientificName,
       globalMetricValue, fractionPlddtVeryHigh
FROM `bigquery-public-data.deepmind_alphafold.metadata`
WHERE organismScientificName = 'Homo sapiens'
  AND fractionPlddtVeryHigh > 0.8
  AND isReviewed = TRUE
LIMIT 100
"""
df = client.query(query).to_dataframe()
print(f"Found {len(df)} high-confidence human proteins")

5. Structure Parsing & Analysis

from Bio.PDB import MMCIFParser
import numpy as np
from scipy.spatial.distance import pdist, squareform

parser = MMCIFParser(QUIET=True)
structure = parser.get_structure("protein", f"{af_id}-model_v4.cif")

# Extract alpha-carbon coordinates
coords, plddt_scores = [], []
for model in structure:
    for chain in model:
        for residue in chain:
            if 'CA' in residue:
                coords.append(residue['CA'].get_coord())
                plddt_scores.append(residue['CA'].get_bfactor())  # pLDDT stored as B-factor

coords = np.array(coords)
print(f"Residues: {len(coords)}, Mean pLDDT: {np.mean(plddt_scores):.1f}")

# Contact map (Cα-Cα < 8 Å)
dist_matrix = squareform(pdist(coords))
contacts = np.where((dist_matrix > 0) & (dist_matrix < 8))
print(f"Contacts: {len(contacts[0]) // 2}")

Key Concepts

Confidence Interpretation

Metric	Range	Interpretation	Suitable For
pLDDT >90	Very high	Backbone + side-chain reliable	Detailed analysis, docking
pLDDT 70–90	High	Backbone generally reliable	Fold analysis, domain ID
pLDDT 50–70	Low	Use with caution	May be flexible/disordered
pLDDT <50	Very low	Likely disordered	Exclude from analysis
PAE <5 Å	Confident	Reliable relative domain positions	Multi-domain assembly
PAE 5–10 Å	Moderate	Uncertain arrangement	Treat domains independently
PAE >15 Å	Uncertain	Domains may be mobile	Do not trust orientation

AlphaFold ID Format

Format:

AF-{UniProt_accession}-F{fragment_number}

(e.g.,

AF-P00520-F1

). Large proteins may be split into fragments (F1, F2, ...). Current database version: v4 — include version suffix in all file URLs.

File Types

File	URL Suffix	Format	Use
Model coordinates	`-model_v4.cif`	mmCIF	Structural analysis (recommended)
Model coordinates	`-model_v4.pdb`	PDB	Legacy tools (<99,999 atoms)
Model coordinates	`-model_v4.bcif`	Binary CIF	Compressed (~70% smaller)
Confidence	`-confidence_v4.json`	JSON	Per-residue pLDDT array
Aligned error	`-predicted_aligned_error_v4.json`	JSON	N×N PAE matrix
PAE image	`-predicted_aligned_error_v4.png`	PNG	Quick visual assessment

Common Workflows

Workflow 1: Single Protein Structure Analysis

from Bio.PDB import alphafold_db, MMCIFParser
import requests, numpy as np

uniprot_id = "P04637"  # p53 tumor suppressor

# Retrieve and download
predictions = list(alphafold_db.get_predictions(uniprot_id))
cif_file = alphafold_db.download_cif_for(predictions[0], directory="./structures")
af_id = predictions[0]['entryId']

# Parse structure
parser = MMCIFParser(QUIET=True)
structure = parser.get_structure("p53", cif_file)

# Extract pLDDT from B-factors
plddt = [r['CA'].get_bfactor() for m in structure for c in m for r in c if 'CA' in r]
print(f"Length: {len(plddt)}, Mean pLDDT: {np.mean(plddt):.1f}")

# Identify high-confidence regions for docking
high_conf_regions = [(i+1, s) for i, s in enumerate(plddt) if s > 90]
print(f"High-confidence residues: {len(high_conf_regions)}/{len(plddt)}")

Workflow 2: Batch Protein Processing

from Bio.PDB import alphafold_db
import requests, numpy as np, pandas as pd, time

uniprot_ids = ["P00520", "P12931", "P04637", "P38398"]
results = []

for uid in uniprot_ids:
    try:
        preds = list(alphafold_db.get_predictions(uid))
        if not preds:
            continue
        af_id = preds[0]['entryId']
        conf = requests.get(f"https://alphafold.ebi.ac.uk/files/{af_id}-confidence_v4.json").json()
        scores = conf['confidenceScore']
        results.append({
            'uniprot': uid, 'alphafold_id': af_id,
            'length': len(scores), 'mean_plddt': np.mean(scores),
            'frac_high_conf': sum(1 for s in scores if s > 90) / len(scores)
        })
        time.sleep(0.2)  # Rate limit: 100-200ms between requests
    except Exception as e:
        print(f"Error {uid}: {e}")

df = pd.DataFrame(results)
print(df.to_string(index=False))

Key Parameters

Parameter	Module	Default	Range	Effect
`uniprot_id`	All	—	UniProt accession	Primary query identifier
`version`	Download	`v4`	v1–v4	Database version (always use latest)
`directory`	BioPython	`"."`	Path	Download destination
`QUIET`	MMCIFParser	`False`	bool	Suppress parser warnings
`vmin/vmax`	PAE plot	0/30	Å	PAE colormap range
`taxonomy_id`	GCS bulk	—	NCBI tax ID	Species for proteome download
`fractionPlddtVeryHigh`	BigQuery	—	0.0–1.0	Filter by high-confidence fraction
Concurrent requests	API	—	≤10	Max parallel API requests
Request delay	API	—	100–200ms	Delay between sequential requests

Best Practices

Use BioPython for single proteins, Google Cloud for bulk — individual API downloads are slow for >100 proteins; GCS parallel download is orders of magnitude faster
Always check pLDDT before downstream analysis — low-confidence regions (pLDDT <50) are likely disordered and should be excluded from docking, contact analysis, or binding site prediction
Anti-pattern — trusting all regions equally: AlphaFold predictions lack ligands, PTMs, cofactors, and multi-chain context. High pLDDT does not guarantee functional accuracy
Cache downloaded files locally — avoid re-downloading the same structures; AlphaFold files are static per version
Use PAE for multi-domain proteins — pLDDT tells you per-residue confidence, but PAE reveals whether domain orientations are reliable. Low inter-domain PAE (<5 Å) = trust the arrangement; high PAE (>15 Å) = treat domains independently
Pin database version in reproducible analyses — include
```
_v4
```
in URLs and document which version was used

Common Recipes

Recipe 1: Proteome Download by Species

import subprocess

def download_proteome(taxonomy_id: int, output_dir: str = "./proteomes"):
    """Download all AlphaFold predictions for a species via GCS."""
    if not isinstance(taxonomy_id, int):
        raise ValueError("taxonomy_id must be an integer")
    pattern = f"gs://public-datasets-deepmind-alphafold-v4/proteomes/proteome-tax_id-{taxonomy_id}-*_v4.tar"
    subprocess.run(["gsutil", "-m", "cp", pattern, f"{output_dir}/"], check=True)

# Human (9606), E. coli (83333), Mouse (10090)
download_proteome(9606)

Recipe 2: High-Confidence Region Extraction

def extract_high_conf_residues(plddt_scores, threshold=90):
    """Extract contiguous high-confidence regions."""
    regions, start = [], None
    for i, score in enumerate(plddt_scores):
        if score > threshold and start is None:
            start = i
        elif score <= threshold and start is not None:
            regions.append((start + 1, i, i - start))  # 1-indexed
            start = None
    if start is not None:
        regions.append((start + 1, len(plddt_scores), len(plddt_scores) - start))
    return regions

# Usage: regions = extract_high_conf_residues(plddt_scores)
# Returns: [(start_res, end_res, length), ...]

Recipe 3: PAE-Based Domain Segmentation

import numpy as np

def segment_domains(pae_matrix, threshold=10.0):
    """Simple domain segmentation from PAE matrix."""
    n = pae_matrix.shape[0]
    # Average PAE for each residue pair -> symmetric
    sym_pae = (pae_matrix + pae_matrix.T) / 2
    # Cluster: residues with low mutual PAE are in the same domain
    domains, current_domain = [0] * n, 0
    for i in range(1, n):
        if sym_pae[i-1, i] > threshold:
            current_domain += 1
        domains[i] = current_domain
    return domains

Troubleshooting

Problem	Cause	Solution
`404 Not Found` from API	No AlphaFold prediction for this UniProt ID	Check if protein is in AlphaFold DB; some organisms not covered
`429 Too Many Requests`	Exceeded rate limit	Add `time.sleep(0.2)` between requests; use GCS for bulk
Empty predictions list	UniProt ID not in database	Verify ID at alphafold.ebi.ac.uk; try canonical isoform
Large protein split into fragments	Protein >2700 residues	Check all fragments (F1, F2, ...); stitch manually if needed
pLDDT values all low (<50)	Intrinsically disordered protein	Expected behavior; structure prediction unreliable for IDPs
PAE matrix asymmetric	PAE[i][j] ≠ PAE[j][i] by design	PAE measures error when aligned on residue i; symmetrize for clustering
`ModuleNotFoundError: Bio.PDB.alphafold_db`	BioPython version too old	Upgrade: `pip install --upgrade biopython>=1.80`
GCS download fails	gsutil not configured	Run `gcloud auth login` or use anonymous access for public data
BigQuery quota exceeded	Free tier limit (1 TB/month)	Optimize queries with `LIMIT` ; use GCS for bulk file access instead

Bundled Resources

references/api_schemas_reference.md

Detailed API data schemas and lookup tables: REST API response fields, mmCIF data categories, confidence JSON schema, PAE JSON schema, BigQuery metadata table fields, HTTP error codes, rate limiting guidelines, and version history (v1–v4). Consult for field-level details when parsing API responses or building custom queries. Scripts functionality (none in original). Original api_reference.md content partially relocated to Core API (endpoints, common code patterns) and Key Concepts (confidence thresholds, file types table); schemas, field catalogs, and error codes retained in this reference.

Related Skills

autodock-vina — molecular docking using AlphaFold structures as receptor
biopython — general protein structure parsing and analysis beyond AlphaFold

References

AlphaFold DB: https://alphafold.ebi.ac.uk/
API Documentation: https://alphafold.ebi.ac.uk/api-docs
Jumper et al. (2021) Nature 596, 583–589: https://doi.org/10.1038/s41586-021-03819-2
Varadi et al. (2024) Nucleic Acids Res. 52, D368–D375: https://doi.org/10.1093/nar/gkad1011
BioPython AlphaFold module: https://biopython.org/docs/dev/api/Bio.PDB.alphafold_db.html
Google Cloud AlphaFold: https://console.cloud.google.com/marketplace/product/bigquery-public-data/deepmind-alphafold