SciAgent-Skills pubchem-compound-search
Query PubChem database (110M+ compounds) via PubChemPy and PUG-REST API. Search compounds by name/CID/SMILES, retrieve molecular properties (MW, LogP, TPSA), perform similarity and substructure searches, access bioactivity data. For local cheminformatics computation use rdkit; for multi-database queries use bioservices.
git clone https://github.com/jaechang-hits/SciAgent-Skills
T=$(mktemp -d) && git clone --depth=1 https://github.com/jaechang-hits/SciAgent-Skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/structural-biology-drug-discovery/pubchem-compound-search" ~/.claude/skills/jaechang-hits-sciagent-skills-pubchem-compound-search && rm -rf "$T"
PubChem Compound Search
Overview
PubChem is the world's largest freely available chemical database with 110M+ compounds. This skill covers searching compounds by name, structure, or identifier, retrieving molecular properties, performing similarity/substructure searches, and accessing bioactivity data through PubChemPy (Python wrapper) and PUG-REST API (direct HTTP).
When to Use
- Looking up a compound by name, CAS number, or SMILES to get its PubChem CID and properties
- Retrieving molecular properties (molecular weight, LogP, TPSA, H-bond counts) for known compounds
- Finding structurally similar compounds via Tanimoto similarity search
- Searching for compounds containing a specific substructure (pharmacophore screening)
- Converting between chemical identifier formats (name ↔ CID ↔ SMILES ↔ InChI)
- Accessing bioactivity screening data (assay results, active/inactive status)
- Batch property comparison across a set of drug candidates
- For local molecular computation (fingerprints, descriptors, 3D conformers), use `rdkit` instead
- For querying multiple databases (UniProt, KEGG, ChEMBL) in one workflow, use `bioservices` instead
Prerequisites
- Python packages: `pubchempy`, `requests` (for direct API), `pandas` (for batch processing)
- No API key required: PubChem is freely accessible
- Rate limits: Max 5 requests/second, 400 requests/minute
```bash
pip install pubchempy requests pandas
```
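The rate limits above can be enforced on the client side with a small throttle. This is a minimal sketch for a single-threaded script; `Throttle` and its parameters are hypothetical names for illustration and are not part of PubChemPy.

```python
import time


class Throttle:
    """Block so that successive calls are at least `min_interval` seconds apart."""

    def __init__(self, max_per_second=5, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / max_per_second
        self._clock = clock   # injectable so the class can be tested without waiting
        self._sleep = sleep
        self._last = None     # timestamp of the previous call, if any

    def wait(self):
        """Sleep just long enough to honor the minimum interval."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


# Hypothetical usage: call throttle.wait() before each PubChem request.
throttle = Throttle(max_per_second=4)  # 4 req/s is 240 req/min, under both caps
```

Staying slightly under 5 requests per second leaves headroom for timing jitter and keeps the per-minute total below 400.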
Quick Start
```python
import pubchempy as pcp

# Search by name → get properties
compound = pcp.get_compounds("aspirin", "name")[0]
print(f"CID: {compound.cid}")
print(f"SMILES: {compound.canonical_smiles}")
print(f"MW: {compound.molecular_weight}, LogP: {compound.xlogp}")
print(f"HBD: {compound.h_bond_donor_count}, HBA: {compound.h_bond_acceptor_count}")
```
Workflow
Step 1: Compound Search
Search by name, CID, SMILES, InChI, or molecular formula.
```python
import pubchempy as pcp

# By name
compounds = pcp.get_compounds("caffeine", "name")
print(f"Found {len(compounds)} compounds for 'caffeine'")

# By CID (fastest)
compound = pcp.Compound.from_cid(2244)  # Aspirin
print(f"CID 2244 = {compound.iupac_name}")

# By SMILES
compound = pcp.get_compounds("CC(=O)OC1=CC=CC=C1C(=O)O", "smiles")[0]
print(f"SMILES lookup: CID {compound.cid}")

# By molecular formula (returns all matches)
formula_matches = pcp.get_compounds("C9H8O4", "formula")
print(f"Formula C9H8O4 matches: {len(formula_matches)} compounds")
```
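When identifiers arrive as raw strings of mixed type, the namespace argument has to be chosen per identifier. The sketch below is a heuristic only: `guess_namespace` is a hypothetical helper, and it deliberately does not try to detect SMILES or formulas, which are ambiguous and should be passed with an explicit namespace.

```python
import re

# An InChIKey is 14 letters, a hyphen, 10 letters, a hyphen, and 1 letter.
INCHIKEY_RE = re.compile(r"^[A-Z]{14}-[A-Z]{10}-[A-Z]$")


def guess_namespace(identifier):
    """Guess a namespace string ("cid", "inchi", "inchikey", "name") for an identifier."""
    text = str(identifier).strip()
    if text.isascii() and text.isdigit():
        return "cid"
    if text.startswith("InChI="):
        return "inchi"
    if INCHIKEY_RE.match(text):
        return "inchikey"
    # Names, synonyms, and CAS numbers all go through the "name" namespace.
    return "name"


print(guess_namespace(2244))
print(guess_namespace("RYYVLZVUVIJVGH-UHFFFAOYSA-N"))
print(guess_namespace("50-78-2"))
```

The result can then be passed on, for example as `pcp.get_compounds(identifier, guess_namespace(identifier))`.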
Step 2: Property Retrieval
Get molecular properties for one or more compounds.
```python
import pubchempy as pcp

# Full compound object
compound = pcp.get_compounds("ibuprofen", "name")[0]
print(f"MW: {compound.molecular_weight}")
print(f"LogP: {compound.xlogp}")
print(f"TPSA: {compound.tpsa}")
print(f"Rotatable bonds: {compound.rotatable_bond_count}")

# Selective property retrieval (more efficient for specific needs)
props = pcp.get_properties(
    ["MolecularWeight", "XLogP", "TPSA", "HBondDonorCount"],
    "aspirin",
    "name",
)
print(props)  # List of dicts
```
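The same selective retrieval is available through PUG-REST directly. The sketch below builds the request URL and indexes a response by CID without touching the network. `property_url` and `rows_by_cid` are hypothetical helper names, and the sample payload is a hand-written illustration of the `PropertyTable` shape rather than a captured response. It is meant for names and CIDs; SMILES containing characters such as `/` or `#` are usually better sent in a POST body.

```python
from urllib.parse import quote

BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def property_url(identifier, namespace, properties):
    """Build a PUG-REST URL requesting the given properties as JSON."""
    ident = quote(str(identifier), safe="")  # percent-encode spaces and the like
    props = ",".join(properties)
    return f"{BASE}/compound/{namespace}/{ident}/property/{props}/JSON"


def rows_by_cid(payload):
    """Index the rows of a PropertyTable payload by CID."""
    rows = payload.get("PropertyTable", {}).get("Properties", [])
    return {row["CID"]: row for row in rows}


# Illustrative payload (values written by hand for this example).
sample = {"PropertyTable": {"Properties": [
    {"CID": 2244, "MolecularWeight": "180.16", "XLogP": 1.2},
]}}

print(property_url("aspirin", "name", ["MolecularWeight", "XLogP"]))
print(rows_by_cid(sample)[2244]["MolecularWeight"])
```

Note that `MolecularWeight` is shown as a string in the sample; converting with `float()` before comparing is the safe choice.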
Step 3: Similarity Search
Find structurally similar compounds using Tanimoto coefficient.
```python
import pubchempy as pcp

# Get reference compound SMILES
ref = pcp.get_compounds("gefitinib", "name")[0]

# Similarity search (may take 15-30s for async processing)
similar = pcp.get_compounds(
    ref.canonical_smiles,
    "smiles",
    searchtype="similarity",
    Threshold=85,  # Tanimoto threshold (0-100)
    MaxRecords=50,
)
print(f"Found {len(similar)} compounds with ≥85% similarity to gefitinib")
for comp in similar[:5]:
    print(f"  CID {comp.cid}: MW={comp.molecular_weight}")
```
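To make the threshold concrete, the Tanimoto coefficient of two fingerprints is the number of shared on-bits divided by the number of bits set in either. The sketch below uses small hand-picked bit sets as a toy illustration; PubChem computes the score server-side on its own 2D fingerprints, so these numbers are not PubChem output.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient of two fingerprints given as collections of on-bit indices."""
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    if union == 0:
        return 0.0  # convention for two empty fingerprints
    return len(a & b) / union


# Toy fingerprints: six shared bits, one unique bit on each side.
reference = {1, 4, 7, 9, 12, 15, 20}
candidate = {1, 4, 7, 9, 12, 15, 33}

score = tanimoto(reference, candidate)  # 6 shared / 8 in the union = 0.75
passes = score * 100 >= 85              # compare on the 0-100 scale used by Threshold
print(f"Tanimoto = {score:.2f}, passes 85% threshold: {passes}")
```

Even a single differing bit on each side drops this pair to 0.75, which is why a threshold of 85 returns only close analogues.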
Step 4: Substructure Search
Find compounds containing a specific structural motif.
```python
import pubchempy as pcp

# Search for sulfonamide-containing compounds
hits = pcp.get_compounds(
    "S(=O)(=O)N",
    "smiles",
    searchtype="substructure",
    MaxRecords=100,
)
print(f"Found {len(hits)} compounds with sulfonamide group")
```
Step 5: Bioactivity Data Access
Retrieve biological screening results via PUG-REST API.
```python
import requests

cid = 2244  # Aspirin
url = f"https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/{cid}/assaysummary/JSON"
response = requests.get(url)

if response.status_code == 200:
    data = response.json()
    rows = data.get("Table", {}).get("Row", [])
    print(f"Aspirin has {len(rows)} bioassay records")
```
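The `Table`/`Row` structure read above stores column names separately from the cell values, so counting active versus inactive results takes one more step. The sketch below pairs each row's cells with the column names. `assay_rows` and `outcome_counts` are hypothetical helpers, the sample payload is hand-written, and the column name `"Activity Outcome"` is an assumption that should be checked against a real response.

```python
from collections import Counter


def assay_rows(payload):
    """Turn an assaysummary payload into a list of {column: value} dicts."""
    table = payload.get("Table", {})
    columns = table.get("Columns", {}).get("Column", [])
    return [dict(zip(columns, row.get("Cell", []))) for row in table.get("Row", [])]


def outcome_counts(rows, column="Activity Outcome"):
    """Count rows per value of the outcome column (column name is an assumption)."""
    return Counter(row.get(column, "") for row in rows)


# Hand-written sample in the same shape as the response parsed above.
sample = {"Table": {
    "Columns": {"Column": ["AID", "Activity Outcome"]},
    "Row": [
        {"Cell": ["1", "Active"]},
        {"Cell": ["2", "Inactive"]},
        {"Cell": ["3", "Inactive"]},
    ],
}}

rows = assay_rows(sample)
print(outcome_counts(rows))
```

With a live response, `assay_rows(response.json())` would replace the sample payload.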
Step 6: Batch Property Comparison
Compare properties across multiple compounds.
```python
import time

import pandas as pd
import pubchempy as pcp

compounds = ["aspirin", "ibuprofen", "naproxen", "celecoxib"]
results = []

for name in compounds:
    comp = pcp.get_compounds(name, "name")[0]
    results.append({
        "Name": name,
        "CID": comp.cid,
        "MW": comp.molecular_weight,
        "LogP": comp.xlogp,
        "TPSA": comp.tpsa,
        "HBD": comp.h_bond_donor_count,
        "HBA": comp.h_bond_acceptor_count,
    })
    time.sleep(0.25)  # Respect rate limits

df = pd.DataFrame(results)
print(df.to_string(index=False))
```
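When CIDs are already known, a list of them can be sent in one request instead of looping name by name, which reduces the request count considerably. The sketch below only shows the chunking step; `chunked` is a hypothetical helper, the batch size is an arbitrary choice, and the PubChemPy call is left as a comment because it needs network access.

```python
def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


cids = [2244, 3672, 156391, 2662]  # aspirin, ibuprofen, naproxen, celecoxib

for batch in chunked(cids, 2):
    print(batch)
    # With PubChemPy, each batch could be fetched in a single call, e.g.:
    # rows = pcp.get_properties(["MolecularWeight", "XLogP", "TPSA"], batch, "cid")
```

Keeping batches to a moderate size avoids very long URLs and keeps each response quick to parse.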
Step 7: Identifier Format Conversion
Convert between chemical identifier formats.
```python
import pubchempy as pcp

compound = pcp.get_compounds("caffeine", "name")[0]
print(f"CID: {compound.cid}")
print(f"IUPAC: {compound.iupac_name}")
print(f"SMILES: {compound.canonical_smiles}")
print(f"InChI: {compound.inchi}")
print(f"InChIKey: {compound.inchikey}")
print(f"Formula: {compound.molecular_formula}")

# Download structure files (argument order: format, output path, identifier, namespace)
pcp.download("SDF", "caffeine.sdf", "caffeine", "name", overwrite=True)
print("Downloaded caffeine.sdf")
```
Key Parameters
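| Parameter | Function | Default | Range / Options | Effect |
|---|---|---|---|---|
| `namespace` | `get_compounds` | required | `name`, `cid`, `smiles`, `inchi`, `formula` | Identifier type for search |
| `searchtype` | `get_compounds` | `None` | `similarity`, `substructure` | Type of structure search |
| `Threshold` | similarity search | `90` | `0` to `100` | Tanimoto similarity cutoff (%) |
| `MaxRecords` | structure search | PUG-REST default (2,000,000) | positive integer | Maximum results returned |
| `properties` | `get_properties` | required | See API reference | Which molecular properties to retrieve |
| `record_type` | `get_compounds` | `2d` | `2d`, `3d` | Structure dimensionality |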
Common Recipes
Recipe: Drug-Likeness Screening (Lipinski's Rule of Five)
When to use: Quick check of whether a compound is likely to be orally bioavailable (a rule of thumb, not a guarantee).
```python
import pubchempy as pcp


def check_lipinski(name):
    comp = pcp.get_compounds(name, "name")[0]
    rules = {
        # molecular_weight may arrive as a string, so convert before comparing
        "MW ≤ 500": float(comp.molecular_weight) <= 500,
        # a missing XLogP (None) is treated as 0, i.e. as passing
        "LogP ≤ 5": (comp.xlogp or 0) <= 5,
        "HBD ≤ 5": comp.h_bond_donor_count <= 5,
        "HBA ≤ 10": comp.h_bond_acceptor_count <= 10,
    }
    violations = sum(1 for v in rules.values() if not v)
    return rules, violations


rules, v = check_lipinski("metformin")
print(f"Violations: {v}/4 ({'PASS' if v <= 1 else 'FAIL'})")
for rule, passed in rules.items():
    print(f"  {'✓' if passed else '✗'} {rule}")
```
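For batches it can be convenient to apply the same four criteria to property dictionaries, such as those returned by `get_properties`, rather than fetching a full compound per name. The variant below is a sketch under two assumptions: the dictionaries use the PUG-REST property names shown, and a missing value is reported as unknown instead of being counted as a pass. `lipinski_violations` is a hypothetical helper and the sample values are invented for illustration.

```python
LIMITS = {
    "MolecularWeight": 500,
    "XLogP": 5,
    "HBondDonorCount": 5,
    "HBondAcceptorCount": 10,
}


def lipinski_violations(props):
    """Return (failed, unknown): criteria exceeded and criteria with no value."""
    failed, unknown = [], []
    for key, limit in LIMITS.items():
        value = props.get(key)
        if value is None:
            unknown.append(key)
        elif float(value) > limit:  # float() also handles string-valued weights
            failed.append(key)
    return failed, unknown


# Invented property dictionaries, for illustration only.
large_lipophilic = {"MolecularWeight": "612.7", "XLogP": 6.1,
                    "HBondDonorCount": 2, "HBondAcceptorCount": 9}
missing_logp = {"MolecularWeight": "129.16",
                "HBondDonorCount": 3, "HBondAcceptorCount": 1}

print(lipinski_violations(large_lipophilic))
print(lipinski_violations(missing_logp))
```

Reporting unknowns separately makes it visible when a pass depends on a property PubChem did not provide.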
Recipe: Get All Synonyms for a Compound
When to use: Finding alternative names, trade names, or CAS numbers.
```python
import pubchempy as pcp

synonyms = pcp.get_synonyms("aspirin", "name")
if synonyms:
    names = synonyms[0]["Synonym"]
    print(f"Found {len(names)} synonyms for aspirin:")
    for name in names[:10]:
        print(f"  {name}")
```
Recipe: Download 2D Structure Image
When to use: Generating structure images for reports or presentations.
```python
import requests

cid = 2519  # Caffeine
url = f"https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/{cid}/PNG?image_size=large"
response = requests.get(url)
response.raise_for_status()  # avoid saving an error page as a PNG

with open("caffeine_structure.png", "wb") as f:
    f.write(response.content)
print("Saved caffeine_structure.png")
```
Expected Outputs
- Compound search: `pubchempy.Compound` objects with properties (CID, name, SMILES, MW, etc.)
- Property retrieval: List of dictionaries with requested properties
- Similarity search: List of `Compound` objects at or above the similarity threshold
- Bioactivity query: JSON with assay results (activity outcome, assay ID, target)
- Structure download: SDF, JSON, or PNG files
Troubleshooting
| Problem | Cause | Solution |
|---|---|---|
| Empty result list (`IndexError` on `[0]`) | No compounds found for query | Check spelling; try alternative names or CID |
| Request timeout (>30s) | Large similarity/substructure search | Reduce `MaxRecords`; PubChemPy handles async polling automatically |
| Empty property values (`None`) | Property not available for this compound | Check if property exists before use, e.g. `if comp.xlogp is not None:` |
| HTTP 503 (server busy) | Rate limit exceeded | Add `time.sleep(0.25)` between requests; max 5 req/sec |
| `BadRequestError` (HTTP 400) | Invalid SMILES or identifier | Validate SMILES syntax; use canonical SMILES from RDKit |
| Formula search returns too many hits | Common formula shared by many isomers | Use SMILES or InChI for more specific searches |
| Bioactivity API returns empty | Compound has no bioassay data | Not all compounds have been tested; check PubChem web interface |
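For transient failures such as timeouts or rate-limit responses, retrying with increasing delays is a common remedy. The sketch below is a generic helper and not part of PubChemPy: `with_retries` and its parameters are hypothetical, and which exception types are worth retrying is an assumption to adjust for the client in use.

```python
import time


def with_retries(func, attempts=4, base_delay=1.0, retry_on=(Exception,), sleep=time.sleep):
    """Call func(); on a listed exception, wait base_delay * 2**n and try again."""
    for attempt in range(attempts):
        try:
            return func()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)


# Offline demonstration: a function that fails twice, then succeeds.
calls = {"n": 0}


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("simulated transient failure")
    return "ok"


delays = []
result = with_retries(flaky, retry_on=(ConnectionError,), sleep=delays.append)
print(result, delays)

# Hypothetical use with PubChemPy (requires network access):
# with_retries(lambda: pcp.get_compounds("aspirin", "name"),
#              retry_on=(pcp.PubChemHTTPError,))
```

Limiting `retry_on` to transient error types keeps genuine mistakes, such as an invalid SMILES string, from being retried pointlessly.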
References
- PubChem PUG-REST API — official REST API docs
- PubChemPy documentation — Python wrapper docs
- PubChem PUG-REST tutorial — step-by-step guide
- PubChem database — web interface