SciAgent-Skills pubchem-compound-search

Query PubChem database (110M+ compounds) via PubChemPy and PUG-REST API. Search compounds by name/CID/SMILES, retrieve molecular properties (MW, LogP, TPSA), perform similarity and substructure searches, access bioactivity data. For local cheminformatics computation use rdkit; for multi-database queries use bioservices.

install

source · Clone the upstream repo

git clone https://github.com/jaechang-hits/SciAgent-Skills

Claude Code · Install into ~/.claude/skills/

T=$(mktemp -d) && git clone --depth=1 https://github.com/jaechang-hits/SciAgent-Skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/structural-biology-drug-discovery/pubchem-compound-search" ~/.claude/skills/jaechang-hits-sciagent-skills-pubchem-compound-search && rm -rf "$T"

manifest: skills/structural-biology-drug-discovery/pubchem-compound-search/SKILL.md

source content

PubChem Compound Search

Overview

PubChem is the world's largest freely available chemical database with 110M+ compounds. This skill covers searching compounds by name, structure, or identifier, retrieving molecular properties, performing similarity/substructure searches, and accessing bioactivity data through PubChemPy (Python wrapper) and PUG-REST API (direct HTTP).

When to Use

Looking up a compound by name, CAS number, or SMILES to get its PubChem CID and properties
Retrieving molecular properties (molecular weight, LogP, TPSA, H-bond counts) for known compounds
Finding structurally similar compounds via Tanimoto similarity search
Searching for compounds containing a specific substructure (pharmacophore screening)
Converting between chemical identifier formats (name ↔ CID ↔ SMILES ↔ InChI)
Accessing bioactivity screening data (assay results, active/inactive status)
Batch property comparison across a set of drug candidates
For local molecular computation (fingerprints, descriptors, 3D conformers), use `rdkit` instead
For querying multiple databases (UniProt, KEGG, ChEMBL) in one workflow, use `bioservices` instead

Prerequisites

Python packages: `pubchempy`, `requests` (for direct API), `pandas` (for batch processing)
No API key required: PubChem is freely accessible
Rate limits: Max 5 requests/second, 400 requests/minute

pip install pubchempy requests pandas
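
The rate limits above can be enforced client-side with a small throttle. The following is a minimal sketch, not part of PubChemPy: the `Throttle` class and its parameters are hypothetical names introduced here, and it only spaces calls per second (at 5 calls/second or fewer, the 400/minute cap is never reached).

```python
import time


class Throttle:
    """Space out calls so they never exceed `max_per_second` (hypothetical helper)."""

    def __init__(self, max_per_second=5.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / max_per_second
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call; None before the first call

    def wait(self):
        """Block until at least `min_interval` has passed since the previous call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


throttle = Throttle(max_per_second=4)  # stay a little under the 5/s cap
# for name in ["aspirin", "ibuprofen"]:
#     throttle.wait()
#     ...issue one PubChem request here...
```

Calling `throttle.wait()` immediately before each request keeps a loop under the limit without hard-coding a fixed `time.sleep` after every call.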

Quick Start

import pubchempy as pcp

# Search by name → get properties
compound = pcp.get_compounds("aspirin", "name")[0]
print(f"CID: {compound.cid}")
print(f"SMILES: {compound.canonical_smiles}")
print(f"MW: {compound.molecular_weight}, LogP: {compound.xlogp}")
print(f"HBD: {compound.h_bond_donor_count}, HBA: {compound.h_bond_acceptor_count}")

Workflow

Step 1: Compound Search

Search by name, CID, SMILES, InChI, or molecular formula.

import pubchempy as pcp

# By name
compounds = pcp.get_compounds("caffeine", "name")
print(f"Found {len(compounds)} compounds for 'caffeine'")

# By CID (fastest)
compound = pcp.Compound.from_cid(2244)  # Aspirin
print(f"CID 2244 = {compound.iupac_name}")

# By SMILES
compound = pcp.get_compounds("CC(=O)OC1=CC=CC=C1C(=O)O", "smiles")[0]
print(f"SMILES lookup: CID {compound.cid}")

# By molecular formula (returns all matches)
formula_matches = pcp.get_compounds("C9H8O4", "formula")
print(f"Formula C9H8O4 matches: {len(formula_matches)} compounds")
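
When identifiers arrive as raw strings, the namespace has to be chosen before calling `get_compounds`. The sketch below shows one possible heuristic; `guess_namespace` is a hypothetical helper, not a PubChemPy function. It does not try to detect SMILES, because short SMILES strings are hard to tell apart from names, and it relies on the assumption that CAS numbers are found through the `"name"` namespace (as the "When to Use" list implies).

```python
import re

# Standard InChIKey layout: 14 letters, 10 letters, 1 letter, joined by hyphens.
_INCHIKEY = re.compile(r"^[A-Z]{14}-[A-Z]{10}-[A-Z]$")


def guess_namespace(identifier):
    """Return a likely namespace string for a raw identifier (heuristic only)."""
    text = str(identifier).strip()
    if text.isdigit():
        return "cid"
    if text.startswith("InChI="):
        return "inchi"
    if _INCHIKEY.match(text):
        return "inchikey"
    # Everything else, including CAS numbers such as "50-78-2", is treated as a name.
    return "name"


for raw in ["2244", "RYYVLZVUVIJVGH-UHFFFAOYSA-N", "50-78-2", "caffeine"]:
    print(raw, "->", guess_namespace(raw))
```

The result can be passed straight through, for example `pcp.get_compounds(raw, guess_namespace(raw))`.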

Step 2: Property Retrieval

Get molecular properties for one or more compounds.

import pubchempy as pcp

# Full compound object
compound = pcp.get_compounds("ibuprofen", "name")[0]
print(f"MW: {compound.molecular_weight}")
print(f"LogP: {compound.xlogp}")
print(f"TPSA: {compound.tpsa}")
print(f"Rotatable bonds: {compound.rotatable_bond_count}")

# Selective property retrieval (more efficient for specific needs)
props = pcp.get_properties(
    ["MolecularWeight", "XLogP", "TPSA", "HBondDonorCount"],
    "aspirin", "name"
)
print(props)  # List of dicts
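
The same selective retrieval can be done over PUG-REST directly. As a rough sketch, the URL follows the pattern `compound/<namespace>/<identifier>/property/<comma-separated properties>/<format>`; `build_property_url` below is a hypothetical helper that only assembles the string and sends nothing. Identifiers containing `/` (some SMILES and all InChI strings) are percent-encoded here, which may not be what the service expects, so treat this as suitable for names and CIDs.

```python
from urllib.parse import quote

PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def build_property_url(identifier, namespace, properties, output="JSON"):
    """Build a PUG-REST property URL for a single compound identifier."""
    ident = quote(str(identifier), safe="")  # encode spaces and other unsafe characters
    props = ",".join(properties)
    return f"{PUG_REST}/compound/{namespace}/{ident}/property/{props}/{output}"


url = build_property_url("aspirin", "name", ["MolecularWeight", "XLogP", "TPSA"])
print(url)
```

The resulting URL can be fetched with `requests.get(url, timeout=30)` in the same way as the bioactivity example in Step 5.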

Step 3: Similarity Search

Find structurally similar compounds using Tanimoto coefficient.

import pubchempy as pcp

# Get reference compound SMILES
ref = pcp.get_compounds("gefitinib", "name")[0]

# Similarity search (may take 15-30s for async processing)
similar = pcp.get_compounds(
    ref.canonical_smiles, "smiles",
    searchtype="similarity",
    Threshold=85,       # Tanimoto threshold (0-100)
    MaxRecords=50
)
print(f"Found {len(similar)} compounds with ≥85% similarity to gefitinib")
for comp in similar[:5]:
    print(f"  CID {comp.cid}: MW={comp.molecular_weight}")
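
For intuition about what `Threshold` means, the Tanimoto coefficient of two fingerprints is the number of shared "on" bits divided by the number of bits set in either. The sketch below computes it on plain Python sets with made-up bit positions; PubChem computes the score server-side on its own fingerprints, so this is only an illustration of the formula.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient |A ∩ B| / |A ∪ B| for two sets of 'on' bit positions."""
    a, b = set(bits_a), set(bits_b)
    union = a | b
    if not union:
        return 0.0  # two empty fingerprints: define similarity as 0
    return len(a & b) / len(union)


# Made-up bit positions, for illustration only.
reference = {1, 4, 7, 9, 12}
candidate = {1, 4, 7, 10}
score = tanimoto(reference, candidate)  # 3 shared bits out of 6 distinct bits
print(f"Tanimoto = {score:.2f}")
```

A score of 0.50 would fall well below `Threshold=85`, so this candidate would not be returned by the search above.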

Step 4: Substructure Search

Find compounds containing a specific structural motif.

import pubchempy as pcp

# Search for sulfonamide-containing compounds
hits = pcp.get_compounds(
    "S(=O)(=O)N", "smiles",
    searchtype="substructure",
    MaxRecords=100
)
print(f"Found {len(hits)} compounds with sulfonamide group")

Step 5: Bioactivity Data Access

Retrieve biological screening results via PUG-REST API.

import requests

cid = 2244  # Aspirin
url = f"https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/{cid}/assaysummary/JSON"
response = requests.get(url, timeout=30)  # avoid hanging indefinitely on a slow response

if response.status_code == 200:
    data = response.json()
    rows = data.get("Table", {}).get("Row", [])
    print(f"Aspirin has {len(rows)} bioassay records")
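
The `Row` entries are easier to work with once paired with their column names. The sketch below assumes the payload lists headers under `Table.Columns.Column` and values under each row's `Cell`, and that one column is called `"Activity Outcome"`; both are assumptions to check against a real response. The sample payload is made up.

```python
def assay_rows_to_dicts(data):
    """Turn an assaysummary payload into a list of {column: value} dicts."""
    table = data.get("Table", {})
    columns = table.get("Columns", {}).get("Column", [])  # assumed header location
    rows = table.get("Row", [])
    return [dict(zip(columns, row.get("Cell", []))) for row in rows]


def count_outcomes(records, key="Activity Outcome"):
    """Count records per outcome label; `key` is an assumed column name."""
    counts = {}
    for rec in records:
        outcome = rec.get(key, "Unknown")
        counts[outcome] = counts.get(outcome, 0) + 1
    return counts


# Made-up payload with the assumed shape (not real assay data).
sample = {
    "Table": {
        "Columns": {"Column": ["AID", "CID", "Activity Outcome"]},
        "Row": [
            {"Cell": ["1", "2244", "Active"]},
            {"Cell": ["2", "2244", "Inactive"]},
            {"Cell": ["3", "2244", "Inactive"]},
        ],
    }
}
records = assay_rows_to_dicts(sample)
print(count_outcomes(records))
```

With a live response, `assay_rows_to_dicts(response.json())` would replace the sample payload.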

Step 6: Batch Property Comparison

Compare properties across multiple compounds.

import pubchempy as pcp
import pandas as pd
import time

compounds = ["aspirin", "ibuprofen", "naproxen", "celecoxib"]
results = []

for name in compounds:
    comp = pcp.get_compounds(name, "name")[0]
    results.append({
        "Name": name, "CID": comp.cid,
        "MW": comp.molecular_weight, "LogP": comp.xlogp,
        "TPSA": comp.tpsa, "HBD": comp.h_bond_donor_count,
        "HBA": comp.h_bond_acceptor_count,
    })
    time.sleep(0.25)  # Respect rate limits

df = pd.DataFrame(results)
print(df.to_string(index=False))
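
The loop above raises `IndexError` as soon as one name has no match. One way to make a batch more forgiving is to collect misses separately, as in this sketch. `collect_properties` is a hypothetical helper; the `lookup` argument stands in for a call such as `lambda n: pcp.get_compounds(n, "name")`, so the example runs offline with stand-in objects whose values are placeholders.

```python
import time
from types import SimpleNamespace


def collect_properties(names, lookup, delay=0.25, sleep=time.sleep):
    """Look up each name; return (rows, missing) instead of raising on a miss."""
    rows, missing = [], []
    for i, name in enumerate(names):
        if i:  # pause between requests, not before the first one
            sleep(delay)
        hits = lookup(name)
        if not hits:
            missing.append(name)
            continue
        comp = hits[0]
        rows.append({"Name": name, "CID": comp.cid,
                     "MW": comp.molecular_weight, "LogP": comp.xlogp})
    return rows, missing


# Offline stand-in for PubChemPy; attribute values are placeholders.
fake_db = {"aspirin": [SimpleNamespace(cid=2244, molecular_weight=180.16, xlogp=1.2)]}
rows, missing = collect_properties(
    ["aspirin", "not-a-compound"],
    lookup=lambda n: fake_db.get(n, []),
    sleep=lambda s: None,  # skip real sleeping in this demo
)
print(rows)
print("No match:", missing)
```

The returned `rows` list can be handed to `pd.DataFrame(rows)` exactly as in the step above.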

Step 7: Identifier Format Conversion

Convert between chemical identifier formats.

import pubchempy as pcp

compound = pcp.get_compounds("caffeine", "name")[0]
print(f"CID:      {compound.cid}")
print(f"IUPAC:    {compound.iupac_name}")
print(f"SMILES:   {compound.canonical_smiles}")
print(f"InChI:    {compound.inchi}")
print(f"InChIKey: {compound.inchikey}")
print(f"Formula:  {compound.molecular_formula}")

# Download structure files
pcp.download("SDF", "caffeine.sdf", "caffeine", "name", overwrite=True)  # args: format, path, identifier, namespace
print("Downloaded caffeine.sdf")

Key Parameters

Parameter	Function	Default	Range / Options	Effect
`namespace`	`get_compounds`	`"cid"`	`"name"`, `"cid"`, `"smiles"`, `"inchi"`, `"inchikey"`, `"formula"`	Identifier type for search
`searchtype`	`get_compounds`	`None`	`"similarity"`, `"substructure"`, `"superstructure"`	Type of structure search
`Threshold`	similarity search	`90`	`0`-`100`	Tanimoto similarity cutoff (%)
`MaxRecords`	structure search	`None`	`1`-`10000`	Maximum results returned
`properties`	`get_properties`	required	See API reference	Which molecular properties to retrieve
`record_type`	`download`	`"2d"`	`"2d"`, `"3d"`	Structure dimensionality

Common Recipes

Recipe: Drug-Likeness Screening (Lipinski's Rule of Five)

When to use: Quick check of whether a compound is likely to be orally bioavailable. The rule is a heuristic, not a guarantee.

import pubchempy as pcp

def check_lipinski(name):
    comp = pcp.get_compounds(name, "name")[0]
    rules = {
        # float() guards against PubChemPy versions that return the weight as a string
        "MW ≤ 500": float(comp.molecular_weight) <= 500,
        "LogP ≤ 5": (comp.xlogp or 0) <= 5,
        "HBD ≤ 5": comp.h_bond_donor_count <= 5,
        "HBA ≤ 10": comp.h_bond_acceptor_count <= 10,
    }
    violations = sum(1 for v in rules.values() if not v)
    return rules, violations

rules, v = check_lipinski("metformin")
print(f"Violations: {v}/4 — {'PASS' if v <= 1 else 'FAIL'}")
for rule, passed in rules.items():
    print(f"  {'✓' if passed else '✗'} {rule}")
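
The recipe above treats a missing LogP as 0, which silently counts as a pass. A variant that reports missing values separately might look like the sketch below. `lipinski_report` is a hypothetical helper operating on plain numbers, so it can be fed from `Compound` attributes or from `get_properties` output; the input values in the demo are illustrative.

```python
def lipinski_report(mw, logp, hbd, hba):
    """Return (violations, unknown) lists for Lipinski's Rule of Five."""
    limits = {"MW": (mw, 500), "LogP": (logp, 5), "HBD": (hbd, 5), "HBA": (hba, 10)}
    violations, unknown = [], []
    for label, (value, limit) in limits.items():
        if value is None:
            unknown.append(label)  # no data: neither a pass nor a violation
        elif float(value) > limit:  # float() also accepts numeric strings
            violations.append(label)
    return violations, unknown


violations, unknown = lipinski_report(mw=650, logp=None, hbd=6, hba=4)
print("Violations:", violations)
print("Unknown:", unknown)
```

Keeping unknown values visible lets the caller decide whether a compound with missing data should pass, fail, or be flagged for manual review.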

Recipe: Get All Synonyms for a Compound

When to use: Finding alternative names, trade names, or CAS numbers.

import pubchempy as pcp

synonyms = pcp.get_synonyms("aspirin", "name")
if synonyms:
    names = synonyms[0]["Synonym"]
    print(f"Found {len(names)} synonyms for aspirin:")
    for name in names[:10]:
        print(f"  {name}")

Recipe: Download 2D Structure Image

When to use: Generating structure images for reports or presentations.

import requests

cid = 2519  # Caffeine
url = f"https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/{cid}/PNG?image_size=large"
response = requests.get(url, timeout=30)
response.raise_for_status()  # stop here rather than saving an error response as a PNG
with open("caffeine_structure.png", "wb") as f:
    f.write(response.content)
print("Saved caffeine_structure.png")

Expected Outputs

Compound search: `pubchempy.Compound` objects with properties (CID, name, SMILES, MW, etc.)
Property retrieval: List of dictionaries with requested properties
Similarity search: List of `Compound` objects sorted by similarity
Bioactivity query: JSON with assay results (activity outcome, assay ID, target)
Structure download: SDF, JSON, or PNG files

Troubleshooting

Problem	Cause	Solution
`IndexError: list index out of range`	No compounds found for query	Check spelling; try alternative names or CID
Request timeout (>30s)	Large similarity/substructure search	Reduce `MaxRecords`; PubChemPy handles async polling automatically
Empty property values (`None`)	Property not available for this compound	Check if property exists before use: `if comp.xlogp is not None`
`HTTP 503 Service Unavailable`	Rate limit exceeded	Add `time.sleep(0.25)` between requests; max 5 req/sec
`BadRequestError`	Invalid SMILES or identifier	Validate SMILES syntax; use canonical SMILES from RDKit
Formula search returns too many hits	Common formula shared by many isomers	Use SMILES or InChI for more specific searches
Bioactivity API returns empty	Compound has no bioassay data	Not all compounds have been tested; check PubChem web interface
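
For transient failures such as the 503 row above, retrying with a growing delay is a common pattern. The sketch below is generic and not tied to PubChemPy: `call_with_retry` and its parameters are hypothetical, and the caller chooses which exception types count as retryable. The demo uses a stand-in function that fails twice before succeeding.

```python
import time


def call_with_retry(func, retry_on=(Exception,), attempts=4, base_delay=0.5,
                    sleep=time.sleep):
    """Call func(); on a retryable error wait base_delay * 2**attempt and try again."""
    for attempt in range(attempts):
        try:
            return func()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)


# Stand-in for a flaky request: fails twice, then succeeds.
calls = {"n": 0}

def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("simulated 503")
    return "ok"

delays = []
result = call_with_retry(flaky, retry_on=(ConnectionError,), sleep=delays.append)
print(result, delays)
```

In real use, `func` would wrap a single PubChem request and `retry_on` would list only the errors worth retrying, so that invalid identifiers fail immediately instead of being retried.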

References

PubChem PUG-REST API — official REST API docs
PubChemPy documentation — Python wrapper docs
PubChem PUG-REST tutorial — step-by-step guide
PubChem database — web interface