SciAgent-Skills pubchem-compound-search

Query PubChem database (110M+ compounds) via PubChemPy and PUG-REST API. Search compounds by name/CID/SMILES, retrieve molecular properties (MW, LogP, TPSA), perform similarity and substructure searches, access bioactivity data. For local cheminformatics computation use rdkit; for multi-database queries use bioservices.

install
source · Clone the upstream repo
git clone https://github.com/jaechang-hits/SciAgent-Skills
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/jaechang-hits/SciAgent-Skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/structural-biology-drug-discovery/pubchem-compound-search" ~/.claude/skills/jaechang-hits-sciagent-skills-pubchem-compound-search && rm -rf "$T"
manifest: skills/structural-biology-drug-discovery/pubchem-compound-search/SKILL.md
source content

PubChem Compound Search

Overview

PubChem is the world's largest freely available chemical database with 110M+ compounds. This skill covers searching compounds by name, structure, or identifier, retrieving molecular properties, performing similarity/substructure searches, and accessing bioactivity data through PubChemPy (Python wrapper) and PUG-REST API (direct HTTP).

When to Use

  • Looking up a compound by name, CAS number, or SMILES to get its PubChem CID and properties
  • Retrieving molecular properties (molecular weight, LogP, TPSA, H-bond counts) for known compounds
  • Finding structurally similar compounds via Tanimoto similarity search
  • Searching for compounds containing a specific substructure (pharmacophore screening)
  • Converting between chemical identifier formats (name ↔ CID ↔ SMILES ↔ InChI)
  • Accessing bioactivity screening data (assay results, active/inactive status)
  • Batch property comparison across a set of drug candidates
  • For local molecular computation (fingerprints, descriptors, 3D conformers), use
    rdkit
    instead
  • For querying multiple databases (UniProt, KEGG, ChEMBL) in one workflow, use
    bioservices
    instead

Prerequisites

  • Python packages:
    pubchempy
    ,
    requests
    (for direct API),
    pandas
    (for batch processing)
  • No API key required: PubChem is freely accessible
  • Rate limits: Max 5 requests/second, 400 requests/minute
pip install pubchempy requests pandas

Quick Start

import pubchempy as pcp

# Search by name → get properties
compound = pcp.get_compounds("aspirin", "name")[0]
print(f"CID: {compound.cid}")
print(f"SMILES: {compound.canonical_smiles}")
print(f"MW: {compound.molecular_weight}, LogP: {compound.xlogp}")
print(f"HBD: {compound.h_bond_donor_count}, HBA: {compound.h_bond_acceptor_count}")

Workflow

Step 1: Compound Search

Search by name, CID, SMILES, InChI, or molecular formula.

import pubchempy as pcp

# By name
compounds = pcp.get_compounds("caffeine", "name")
print(f"Found {len(compounds)} compounds for 'caffeine'")

# By CID (fastest)
compound = pcp.Compound.from_cid(2244)  # Aspirin
print(f"CID 2244 = {compound.iupac_name}")

# By SMILES
compound = pcp.get_compounds("CC(=O)OC1=CC=CC=C1C(=O)O", "smiles")[0]
print(f"SMILES lookup: CID {compound.cid}")

# By molecular formula (returns all matches)
formula_matches = pcp.get_compounds("C9H8O4", "formula")
print(f"Formula C9H8O4 matches: {len(formula_matches)} compounds")

Step 2: Property Retrieval

Get molecular properties for one or more compounds.

import pubchempy as pcp

# Full compound object
compound = pcp.get_compounds("ibuprofen", "name")[0]
print(f"MW: {compound.molecular_weight}")
print(f"LogP: {compound.xlogp}")
print(f"TPSA: {compound.tpsa}")
print(f"Rotatable bonds: {compound.rotatable_bond_count}")

# Selective property retrieval (more efficient for specific needs)
props = pcp.get_properties(
    ["MolecularWeight", "XLogP", "TPSA", "HBondDonorCount"],
    "aspirin", "name"
)
print(props)  # List of dicts

Step 3: Similarity Search

Find structurally similar compounds using Tanimoto coefficient.

import pubchempy as pcp

# Get reference compound SMILES
ref = pcp.get_compounds("gefitinib", "name")[0]

# Similarity search (may take 15-30s for async processing)
similar = pcp.get_compounds(
    ref.canonical_smiles, "smiles",
    searchtype="similarity",
    Threshold=85,       # Tanimoto threshold (0-100)
    MaxRecords=50
)
print(f"Found {len(similar)} compounds with ≥85% similarity to gefitinib")
for comp in similar[:5]:
    print(f"  CID {comp.cid}: MW={comp.molecular_weight}")

Step 4: Substructure Search

Find compounds containing a specific structural motif.

import pubchempy as pcp

# Search for sulfonamide-containing compounds
hits = pcp.get_compounds(
    "S(=O)(=O)N", "smiles",
    searchtype="substructure",
    MaxRecords=100
)
print(f"Found {len(hits)} compounds with sulfonamide group")

Step 5: Bioactivity Data Access

Retrieve biological screening results via PUG-REST API.

import requests

cid = 2244  # Aspirin
url = f"https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/{cid}/assaysummary/JSON"
response = requests.get(url)

if response.status_code == 200:
    data = response.json()
    rows = data.get("Table", {}).get("Row", [])
    print(f"Aspirin has {len(rows)} bioassay records")

Step 6: Batch Property Comparison

Compare properties across multiple compounds.

import pubchempy as pcp
import pandas as pd
import time

compounds = ["aspirin", "ibuprofen", "naproxen", "celecoxib"]
results = []

for name in compounds:
    comp = pcp.get_compounds(name, "name")[0]
    results.append({
        "Name": name, "CID": comp.cid,
        "MW": comp.molecular_weight, "LogP": comp.xlogp,
        "TPSA": comp.tpsa, "HBD": comp.h_bond_donor_count,
        "HBA": comp.h_bond_acceptor_count,
    })
    time.sleep(0.25)  # Respect rate limits

df = pd.DataFrame(results)
print(df.to_string(index=False))

Step 7: Identifier Format Conversion

Convert between chemical identifier formats.

import pubchempy as pcp

compound = pcp.get_compounds("caffeine", "name")[0]
print(f"CID:      {compound.cid}")
print(f"IUPAC:    {compound.iupac_name}")
print(f"SMILES:   {compound.canonical_smiles}")
print(f"InChI:    {compound.inchi}")
print(f"InChIKey: {compound.inchikey}")
print(f"Formula:  {compound.molecular_formula}")

# Download structure files
pcp.download("SDF", "caffeine", "name", "caffeine.sdf", overwrite=True)
print("Downloaded caffeine.sdf")

Key Parameters

ParameterFunctionDefaultRange / OptionsEffect
namespace
get_compounds
required
"name"
,
"cid"
,
"smiles"
,
"inchi"
,
"formula"
Identifier type for search
searchtype
get_compounds
None
"similarity"
,
"substructure"
Type of structure search
Threshold
similarity search
90
0
-
100
Tanimoto similarity cutoff (%)
MaxRecords
structure search
None
1
-
10000
Maximum results returned
properties
get_properties
requiredSee API referenceWhich molecular properties to retrieve
record_type
download
"2d"
"2d"
,
"3d"
Structure dimensionality

Common Recipes

Recipe: Drug-Likeness Screening (Lipinski's Rule of Five)

When to use: Quick check if a compound is orally bioavailable.

import pubchempy as pcp

def check_lipinski(name):
    comp = pcp.get_compounds(name, "name")[0]
    rules = {
        "MW ≤ 500": comp.molecular_weight <= 500,
        "LogP ≤ 5": (comp.xlogp or 0) <= 5,
        "HBD ≤ 5": comp.h_bond_donor_count <= 5,
        "HBA ≤ 10": comp.h_bond_acceptor_count <= 10,
    }
    violations = sum(1 for v in rules.values() if not v)
    return rules, violations

rules, v = check_lipinski("metformin")
print(f"Violations: {v}/4 — {'PASS' if v <= 1 else 'FAIL'}")
for rule, passed in rules.items():
    print(f"  {'✓' if passed else '✗'} {rule}")

Recipe: Get All Synonyms for a Compound

When to use: Finding alternative names, trade names, or CAS numbers.

import pubchempy as pcp

synonyms = pcp.get_synonyms("aspirin", "name")
if synonyms:
    names = synonyms[0]["Synonym"]
    print(f"Found {len(names)} synonyms for aspirin:")
    for name in names[:10]:
        print(f"  {name}")

Recipe: Download 2D Structure Image

When to use: Generating structure images for reports or presentations.

import requests

cid = 2519  # Caffeine
url = f"https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/{cid}/PNG?image_size=large"
response = requests.get(url)
with open("caffeine_structure.png", "wb") as f:
    f.write(response.content)
print("Saved caffeine_structure.png")

Expected Outputs

  • Compound search:
    pubchempy.Compound
    objects with properties (CID, name, SMILES, MW, etc.)
  • Property retrieval: List of dictionaries with requested properties
  • Similarity search: List of
    Compound
    objects sorted by similarity
  • Bioactivity query: JSON with assay results (activity outcome, assay ID, target)
  • Structure download: SDF, JSON, or PNG files

Troubleshooting

ProblemCauseSolution
IndexError: list index out of range
No compounds found for queryCheck spelling; try alternative names or CID
Request timeout (>30s)Large similarity/substructure searchReduce
MaxRecords
; PubChemPy handles async polling automatically
Empty property values (
None
)
Property not available for this compoundCheck if property exists before use:
if comp.xlogp is not None
HTTP 503 Service Unavailable
Rate limit exceededAdd
time.sleep(0.25)
between requests; max 5 req/sec
BadRequestError
Invalid SMILES or identifierValidate SMILES syntax; use canonical SMILES from RDKit
Formula search returns too many hitsCommon formula shared by many isomersUse SMILES or InChI for more specific searches
Bioactivity API returns emptyCompound has no bioassay dataNot all compounds have been tested; check PubChem web interface

References