SciAgent-Skills chembl-database-bioactivity

Query ChEMBL bioactive molecules and drug discovery data using the Python SDK. Search compounds by structure/properties, retrieve bioactivity data (IC50, Ki, EC50), find inhibitors for targets, perform SAR studies, and access drug mechanism/indication data for medicinal chemistry research.

install
source · Clone the upstream repo
git clone https://github.com/jaechang-hits/SciAgent-Skills
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/jaechang-hits/SciAgent-Skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/structural-biology-drug-discovery/chembl-database-bioactivity" ~/.claude/skills/jaechang-hits-sciagent-skills-chembl-database-bioactivity && rm -rf "$T"
manifest: skills/structural-biology-drug-discovery/chembl-database-bioactivity/SKILL.md

ChEMBL Database — Bioactivity Queries

Overview

Query the ChEMBL bioactive molecule database (2M+ compounds, 19M+ bioactivity measurements, 13K+ targets) using the chembl_webresource_client Python SDK. Covers compound search, target lookup, bioactivity retrieval, structure-based search, and drug information access.

When to Use

  • Finding compounds by name, ChEMBL ID, or physicochemical properties
  • Querying bioactivity data (IC50, Ki, EC50) for specific targets
  • Performing similarity or substructure searches using SMILES
  • Retrieving drug mechanisms of action and clinical indications
  • Identifying inhibitors, agonists, or bioactive molecules for a target
  • Analyzing structure-activity relationships (SAR) across compound series
  • Filtering molecules by Lipinski rule-of-5 or other drug-likeness criteria
  • For general cheminformatics (SMILES manipulation, fingerprints, descriptors) use rdkit-cheminformatics instead

Prerequisites

uv pip install chembl_webresource_client
# Optional: pandas for tabular analysis
uv pip install pandas

Rate limiting: The SDK handles rate limiting internally via automatic retries and caching, so no time.sleep() is needed between queries. For large-scale data retrieval (100K+ records), consider ChEMBL bulk downloads instead of API queries.

Quick Start

from chembl_webresource_client.new_client import new_client

# Each entity type has its own client endpoint
molecule = new_client.molecule
target = new_client.target
activity = new_client.activity

# Retrieve a molecule by ChEMBL ID
aspirin = molecule.get('CHEMBL25')
print(f"{aspirin['pref_name']}: MW={aspirin['molecule_properties']['mw_freebase']}")

# Search targets by name
egfr_targets = target.filter(pref_name__icontains='EGFR', target_type='SINGLE PROTEIN')
print(f"Found {len(list(egfr_targets))} EGFR-related targets")

# Query bioactivities with filters
potent = activity.filter(
    target_chembl_id='CHEMBL203',  # EGFR
    standard_type='IC50',
    standard_value__lte=100,       # <= 100 nM
    standard_units='nM'
)

Key Concepts

Filter Operators

ChEMBL uses Django-style query filters on all endpoints:

| Operator | Meaning | Example |
|---|---|---|
| __exact | Exact match (default) | target_type__exact='SINGLE PROTEIN' |
| __iexact | Case-insensitive exact | pref_name__iexact='aspirin' |
| __contains | Substring match | pref_name__contains='kinase' |
| __icontains | Case-insensitive substring | pref_name__icontains='egfr' |
| __startswith | Prefix match | pref_name__startswith='Epi' |
| __endswith | Suffix match | pref_name__endswith='nib' |
| __gt / __gte | Greater than (or equal) | standard_value__gte=10 |
| __lt / __lte | Less than (or equal) | standard_value__lte=100 |
| __range | Value in range | molecule_properties__mw_freebase__range=[300, 500] |
| __in | Value in list | target_chembl_id__in=['CHEMBL203', 'CHEMBL240'] |
| __isnull | Null check | pchembl_value__isnull=False |
| __regex | Regular expression | pref_name__regex='^EGF.*kinase$' |
| __search | Full-text search | description__search='apoptosis' |
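
Because the operators are plain keyword-argument suffixes, a filter can also be assembled programmatically and unpacked into .filter(). The sketch below uses a hypothetical helper, lookup(), which is not part of the SDK; it only joins name parts with double underscores, and the actual query call is left as a comment so the block runs without network access.

```python
def lookup(field, op=None, nested_in=None):
    """Join name parts into a Django-style lookup keyword.

    Hypothetical helper for illustration; not part of chembl_webresource_client.
    """
    parts = [part for part in (nested_in, field, op) if part]
    return "__".join(parts)


# Build the keyword arguments for a molecule query step by step
filters = {
    lookup("mw_freebase", "range", nested_in="molecule_properties"): [300, 500],
    lookup("alogp", "lte", nested_in="molecule_properties"): 5,
    lookup("pref_name", "icontains"): "nib",
}

for key, value in filters.items():
    print(f"{key} = {value!r}")

# With the SDK installed, the dict can be unpacked into a query:
#   from chembl_webresource_client.new_client import new_client
#   results = new_client.molecule.filter(**filters)
```

Keeping the filters in a dict makes it easy to add or drop a criterion without rewriting the call.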

Core Endpoints

| Endpoint | Access | Description |
|---|---|---|
| molecule | new_client.molecule | Compound structures, properties, synonyms |
| target | new_client.target | Protein and non-protein biological targets |
| activity | new_client.activity | Bioassay measurement results |
| assay | new_client.assay | Experimental assay details |
| drug | new_client.drug | Approved pharmaceutical information |
| mechanism | new_client.mechanism | Drug mechanism of action data |
| drug_indication | new_client.drug_indication | Drug therapeutic indications |
| similarity | new_client.similarity | Tanimoto similarity search |
| substructure | new_client.substructure | Substructure search |
| image | new_client.image | SVG molecular structure images |
| molecule_form | new_client.molecule_form | Parent/salt forms |
| protein_class | new_client.protein_class | Protein classification hierarchy |
| target_component | new_client.target_component | Target component details |
| cell_line | new_client.cell_line | Cell line information |
| tissue | new_client.tissue | Tissue type information |
| compound_structural_alert | new_client.compound_structural_alert | Structural alerts for toxicity |
| document | new_client.document | Literature source references |

Molecular Properties

Properties accessible via molecule['molecule_properties']:

| Field | Description |
|---|---|
| mw_freebase | Molecular weight (free base) |
| full_mwt | Full molecular weight (including salts) |
| alogp | Calculated LogP |
| hba | Hydrogen bond acceptors |
| hbd | Hydrogen bond donors |
| psa | Polar surface area |
| rtb | Rotatable bonds |
| num_ro5_violations | Lipinski rule-of-5 violations |
| ro3_pass | Rule of 3 compliance |
| cx_most_apka | Most acidic pKa |
| cx_most_bpka | Most basic pKa |
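
Not every molecule record carries a full property block, and numeric values may arrive as strings. The following is a minimal sketch of a defensive accessor; get_property() is a hypothetical helper, and the two sample records are hand-written stand-ins rather than real API responses.

```python
def get_property(mol, field, as_type=float, default=None):
    """Read one field from molecule_properties, tolerating missing data.

    Hypothetical helper; assumes numeric values may be strings or None.
    """
    # molecule_properties can be absent or explicitly None
    props = mol.get("molecule_properties") or {}
    value = props.get(field)
    if value is None:
        return default
    try:
        return as_type(value)
    except (TypeError, ValueError):
        return default


# Hand-written sample records (illustrative values only)
complete = {
    "molecule_chembl_id": "CHEMBL25",
    "molecule_properties": {"mw_freebase": "180.16", "hbd": 1},
}
sparse = {"molecule_chembl_id": "EXAMPLE-1", "molecule_properties": None}

print(get_property(complete, "mw_freebase"))  # 180.16
print(get_property(sparse, "mw_freebase"))    # None
```

Returning a default instead of raising keeps loops over large result sets from stopping at the first incomplete record.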

Target Information Fields

Key fields in target records:

| Field | Description |
|---|---|
| target_chembl_id | ChEMBL target identifier |
| pref_name | Preferred target name |
| target_type | Type: SINGLE PROTEIN, PROTEIN COMPLEX, ORGANISM |
| organism | Target organism (e.g., Homo sapiens) |
| tax_id | NCBI taxonomy ID |
| target_components | Component details (UniProt accession, etc.) |
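
As a rough illustration of how target_components can be used, the sketch below pulls UniProt accessions out of a target record. It assumes each component is a dict with an accession key; uniprot_accessions() is a hypothetical helper, and the sample record is hand-written and abridged.

```python
def uniprot_accessions(target_record):
    """Collect accessions from a target record's components.

    Hypothetical helper; assumes each component dict has an 'accession' key.
    """
    components = target_record.get("target_components") or []
    return [c["accession"] for c in components if c.get("accession")]


# Hand-written, abridged sample shaped like a target record
sample_target = {
    "target_chembl_id": "CHEMBL203",
    "target_type": "SINGLE PROTEIN",
    "organism": "Homo sapiens",
    "target_components": [
        {"accession": "P00533", "component_type": "PROTEIN"},
    ],
}

print(uniprot_accessions(sample_target))  # ['P00533']
```

For a PROTEIN COMPLEX target the same function would return one accession per component that has one.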

Bioactivity Data Fields

Key fields in activity records:

| Field | Description |
|---|---|
| standard_type | Activity type: IC50, Ki, Kd, EC50, etc. |
| standard_value | Numerical activity value |
| standard_units | Units: nM, uM, etc. |
| pchembl_value | Normalized activity (-log10 scale, comparable across types) |
| activity_comment | Activity annotations |
| data_validity_comment | Data quality flags (check before analysis) |
| potential_duplicate | Duplicate flag |
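
To make the -log10 scale concrete, here is a minimal sketch that converts a concentration into the same scale as pchembl_value, i.e. the negative base-10 logarithm of the molar value. The function name to_pchembl() and the unit table are assumptions for illustration; ChEMBL applies its own eligibility criteria before assigning a pchembl_value, which this sketch does not reproduce.

```python
import math

# Conversion factors from common concentration units to molar
UNIT_TO_MOLAR = {"M": 1.0, "mM": 1e-3, "uM": 1e-6, "nM": 1e-9, "pM": 1e-12}


def to_pchembl(value, units="nM"):
    """Return -log10 of the molar concentration (hypothetical helper)."""
    if units not in UNIT_TO_MOLAR:
        raise ValueError(f"Unsupported unit: {units!r}")
    molar = float(value) * UNIT_TO_MOLAR[units]
    if molar <= 0:
        raise ValueError("Concentration must be positive")
    return -math.log10(molar)


# 100 nM corresponds to 7 on the scale, 1 uM to 6
print(round(to_pchembl(100, "nM"), 2))  # 7.0
print(round(to_pchembl("1", "uM"), 2))  # 6.0
```

A tenfold increase in potency therefore adds one unit, which is why thresholds such as "greater than 6" are convenient shorthand for sub-micromolar activity.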

Core API

1. Molecule Queries

molecule = new_client.molecule

# By ChEMBL ID
aspirin = molecule.get('CHEMBL25')

# By name (case-insensitive)
results = molecule.filter(pref_name__icontains='imatinib')

# By properties (Lipinski rule-of-5 compliant)
drug_like = molecule.filter(
    molecule_properties__mw_freebase__lte=500,
    molecule_properties__alogp__lte=5,
    molecule_properties__hba__lte=10,
    molecule_properties__hbd__lte=5
)

# By property range
mid_weight = molecule.filter(
    molecule_properties__mw_freebase__range=[300, 500]
)

2. Target Queries

target = new_client.target

# By ChEMBL ID
egfr = target.get('CHEMBL203')
print(f"{egfr['pref_name']} ({egfr['organism']})")

# Search by name and type
kinases = target.filter(
    target_type='SINGLE PROTEIN',
    pref_name__icontains='kinase'
)

# By organism
human_targets = target.filter(organism='Homo sapiens')

3. Bioactivity Data

activity = new_client.activity

# Potent inhibitors for a target
potent = activity.filter(
    target_chembl_id='CHEMBL203',
    standard_type='IC50',
    standard_value__lte=100,
    standard_units='nM'
)

# All activities for a compound (with pChEMBL values)
compound_acts = activity.filter(
    molecule_chembl_id='CHEMBL25',
    pchembl_value__isnull=False
)

# Multiple activity types
ki_data = activity.filter(
    target_chembl_id='CHEMBL240',
    standard_type__in=['IC50', 'Ki', 'Kd']
)
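
A compound often has several measurements against the same target, so a common follow-up is to collapse them to one value per compound. The sketch below takes the median pchembl_value per molecule using only the standard library; median_pchembl() is a hypothetical helper, and the records are made-up stand-ins for what activity.filter() yields (the IDs A, B, C are placeholders, not ChEMBL IDs).

```python
from collections import defaultdict
from statistics import median


def median_pchembl(records):
    """Map each molecule ID to the median of its pchembl_value entries."""
    grouped = defaultdict(list)
    for rec in records:
        value = rec.get("pchembl_value")
        if value is None:
            continue  # skip measurements without a normalized value
        grouped[rec["molecule_chembl_id"]].append(float(value))
    return {mol_id: median(values) for mol_id, values in grouped.items()}


# Made-up records; values are given as strings on the assumption
# that the API may return them that way
sample_records = [
    {"molecule_chembl_id": "A", "pchembl_value": "7.0"},
    {"molecule_chembl_id": "A", "pchembl_value": "8.0"},
    {"molecule_chembl_id": "B", "pchembl_value": "6.5"},
    {"molecule_chembl_id": "C", "pchembl_value": None},
]

print(median_pchembl(sample_records))  # {'A': 7.5, 'B': 6.5}
```

The median is used here because it is less sensitive than the mean to a single outlying assay; other aggregations may suit a given analysis better.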

4. Structure-Based Search

# Similarity search (Tanimoto)
similarity = new_client.similarity
similar = similarity.filter(
    smiles='CC(=O)Oc1ccccc1C(=O)O',  # aspirin
    similarity=85  # >=85% similarity
)

# Substructure search
substructure = new_client.substructure
benzimidazoles = substructure.filter(smiles='c1ccc2[nH]cnc2c1')

5. Drug and Mechanism Data

drug = new_client.drug
mechanism = new_client.mechanism
drug_indication = new_client.drug_indication

# Drug details
drug_info = drug.get('CHEMBL25')

# Mechanisms of action
mechs = mechanism.filter(molecule_chembl_id='CHEMBL941')
for m in mechs:
    print(f"{m['mechanism_of_action']} → {m.get('target_chembl_id')}")

# Therapeutic indications
indications = drug_indication.filter(molecule_chembl_id='CHEMBL941')
for ind in indications:
    print(f"{ind.get('mesh_heading')} (Phase {ind.get('max_phase_for_ind')})")

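# SVG molecular image
image = new_client.image
svg_data = image.get('CHEMBL25')
# The payload may arrive as bytes or str depending on the client version,
# so pick the file mode to match instead of assuming text
mode = 'wb' if isinstance(svg_data, bytes) else 'w'
with open('aspirin.svg', mode) as f:
    f.write(svg_data)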

Common Workflows

Workflow 1: Find Inhibitors for a Target

from chembl_webresource_client.new_client import new_client
import pandas as pd

# Step 1: Identify the target
targets = new_client.target.filter(pref_name__icontains='BRAF', target_type='SINGLE PROTEIN')
target_id = list(targets)[0]['target_chembl_id']

# Step 2: Query potent activities
activities = new_client.activity.filter(
    target_chembl_id=target_id,
    standard_type='IC50',
    standard_value__lte=100,
    standard_units='nM',
    pchembl_value__isnull=False
)

# Step 3: Convert to DataFrame for analysis
df = pd.DataFrame(list(activities))
df['standard_value'] = pd.to_numeric(df['standard_value'])
print(f"Found {len(df)} potent compounds")
print(df[['molecule_chembl_id', 'standard_value', 'pchembl_value']].head(10))

Workflow 2: Analyze a Known Drug

from chembl_webresource_client.new_client import new_client

chembl_id = 'CHEMBL941'  # Imatinib

# Drug information
drug_info = new_client.molecule.get(chembl_id)
print(f"Name: {drug_info['pref_name']}")
print(f"MW: {drug_info['molecule_properties']['mw_freebase']}")

# Mechanisms of action
mechs = list(new_client.mechanism.filter(molecule_chembl_id=chembl_id))
for m in mechs:
    print(f"Mechanism: {m['mechanism_of_action']}")

# Indications
indications = list(new_client.drug_indication.filter(molecule_chembl_id=chembl_id))
for ind in indications:
    print(f"Indication: {ind.get('mesh_heading')} (Phase {ind.get('max_phase_for_ind')})")

# All bioactivity data
activities = list(new_client.activity.filter(
    molecule_chembl_id=chembl_id, pchembl_value__isnull=False
))
print(f"Total bioactivity records: {len(activities)}")

Workflow 3: SAR Study

from chembl_webresource_client.new_client import new_client
import pandas as pd

# Step 1: Find similar compounds to lead
similar = new_client.similarity.filter(
    smiles='c1ccc2c(c1)cc(nc2N)c3ccc(cc3)NC(=O)c4ccccc4',  # lead compound
    similarity=80
)
analogs = list(similar)

# Step 2: Collect activities for each analog
records = []
for compound in analogs[:20]:  # limit for demo
    cid = compound['molecule_chembl_id']
    acts = list(new_client.activity.filter(
        molecule_chembl_id=cid,
        standard_type='IC50',
        pchembl_value__isnull=False
    ))
    for act in acts:
        records.append({
            'chembl_id': cid,
            'target': act.get('target_pref_name'),
            'IC50_nM': act.get('standard_value'),
            'pchembl': act.get('pchembl_value'),
            'mw': (compound.get('molecule_properties') or {}).get('mw_freebase'),
            'alogp': (compound.get('molecule_properties') or {}).get('alogp')
        })

# Step 3: Analyze property-activity relationships
df = pd.DataFrame(records)
if not df.empty:
    df['IC50_nM'] = pd.to_numeric(df['IC50_nM'])
    print(df.groupby('target')['IC50_nM'].describe())

Common Recipes

Recipe: Virtual Screening Filter (Lipinski + Activity)

from chembl_webresource_client.new_client import new_client

candidates = new_client.molecule.filter(
    molecule_properties__mw_freebase__range=[300, 500],
    molecule_properties__alogp__lte=5,
    molecule_properties__hba__lte=10,
    molecule_properties__hbd__lte=5,
    molecule_properties__num_ro5_violations=0
)
print(f"Drug-like candidates: {len(list(candidates))}")
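
The same Lipinski thresholds can be checked locally on records that have already been downloaded, for example to re-screen a cached result set. This is a minimal sketch: count_ro5_violations() is a hypothetical helper, it treats missing values as non-violating (an assumption), and its count may differ from ChEMBL's own num_ro5_violations field. The sample records are made up.

```python
# Upper limits from Lipinski's rule of five, keyed by ChEMBL property name
RO5_LIMITS = {"mw_freebase": 500, "alogp": 5, "hba": 10, "hbd": 5}


def count_ro5_violations(mol):
    """Count how many rule-of-five limits a molecule record exceeds."""
    props = mol.get("molecule_properties") or {}
    violations = 0
    for field, limit in RO5_LIMITS.items():
        value = props.get(field)
        # Missing values are skipped rather than counted (assumption)
        if value is not None and float(value) > limit:
            violations += 1
    return violations


# Made-up records for illustration
small = {"molecule_properties": {"mw_freebase": "180.16", "alogp": "1.31", "hba": 3, "hbd": 1}}
large = {"molecule_properties": {"mw_freebase": "812.5", "alogp": "6.2", "hba": 12, "hbd": 3}}

print(count_ro5_violations(small))  # 0
print(count_ro5_violations(large))  # 3
```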

Recipe: Client Configuration

from chembl_webresource_client.settings import Settings

Settings.Instance().CACHING = True           # enable/disable cache
Settings.Instance().CACHE_EXPIRE = 86400     # cache duration (seconds)
Settings.Instance().TIMEOUT = 30             # request timeout (seconds)
Settings.Instance().TOTAL_RETRIES = 3        # retry count on failure

Recipe: Export Activities to CSV

import pandas as pd
from chembl_webresource_client.new_client import new_client

activities = new_client.activity.filter(
    target_chembl_id='CHEMBL203',
    standard_type='IC50',
    pchembl_value__isnull=False
)
df = pd.DataFrame(list(activities))
df.to_csv('egfr_activities.csv', index=False)
print(f"Exported {len(df)} records")

Key Parameters

| Parameter | Endpoint | Default | Description |
|---|---|---|---|
| similarity | similarity.filter() | - | Tanimoto threshold (0-100), typically 70-90 |
| standard_type | activity.filter() | - | Activity type: IC50, Ki, Kd, EC50 |
| standard_value__lte | activity.filter() | - | Max activity value (nM) |
| pchembl_value | activity.filter() | - | Normalized -log10 activity (>6 = potent) |
| target_type | target.filter() | - | SINGLE PROTEIN, PROTEIN COMPLEX, ORGANISM |
| CACHING | Settings | True | Enable HTTP response caching |
| CACHE_EXPIRE | Settings | 86400 | Cache TTL in seconds |
| TIMEOUT | Settings | 30 | HTTP request timeout in seconds |
| TOTAL_RETRIES | Settings | 3 | Auto-retry count on failure |

Troubleshooting

| Problem | Cause | Solution |
|---|---|---|
| Empty results from filter | No matches or too strict filters | Relax filters; verify IDs exist with .get() first |
| KeyError or AttributeError on molecule properties | Not all molecules have full property data | Use (mol.get('molecule_properties') or {}).get('field'), which also covers a property block that is present but None |
| Query returns unexpectedly few results | Lazy evaluation not consumed | Convert to list() before checking length |
| Slow queries | Large result sets paginated automatically | Add more filters to narrow results; use __range |
| 404 on .get() | Invalid ChEMBL ID | Verify ID format (e.g., CHEMBL25, not 25) |
| Stale data | Aggressive caching | Set Settings.Instance().CACHING = False or clear cache |
| Timeout errors | Server overload or large query | Increase TIMEOUT; split into smaller queries |
| Mixed units in activity data | Different assays use different units | Filter by standard_units='nM' or use pchembl_value |
| Duplicate activity records | Same measurement from different sources | Check potential_duplicate and data_validity_comment |
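
Malformed identifiers are easy to catch before a request is sent. The sketch below validates and normalizes a ChEMBL ID with a regular expression; normalize_chembl_id() is a hypothetical helper, and prefixing bare digits with CHEMBL is a convenience choice of this example, not SDK behavior.

```python
import re

# A ChEMBL ID is the literal prefix followed by one or more digits
CHEMBL_ID = re.compile(r"CHEMBL\d+")


def normalize_chembl_id(raw):
    """Return an upper-cased, validated ChEMBL ID or raise ValueError."""
    text = str(raw).strip().upper()
    if text.isdigit():
        # Convenience: treat a bare number such as 25 as CHEMBL25
        text = "CHEMBL" + text
    if not CHEMBL_ID.fullmatch(text):
        raise ValueError(f"Not a ChEMBL ID: {raw!r}")
    return text


print(normalize_chembl_id(" chembl25 "))  # CHEMBL25
print(normalize_chembl_id(25))            # CHEMBL25
```

A well-formed ID can still be absent from the database, so a 404 on .get() remains possible after this check.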

Best Practices

  • Use pchembl_value for cross-study comparisons: it normalizes IC50/Ki/EC50 to a comparable -log10 scale
  • Always check data_validity_comment before using activity values; it flags data quality issues
  • Filter by standard_units to ensure consistent units across results
  • Pagination is automatic: the SDK handles pagination transparently, so iterate directly over query results without manual page handling
  • Use lazy evaluation: queries execute only when iterated, so convert to list() only when you need all results in memory
  • Cache results: the SDK caches for 24h by default; leverage this for repeated queries
  • For bulk data (>100K records): use ChEMBL FTP downloads rather than API queries
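
Two of these practices, checking quality flags and iterating lazily, combine naturally in a generator. The sketch below is one possible shape under stated assumptions: clean_activities() is a hypothetical helper, potential_duplicate is assumed to be encoded as 1 (or "1") when set, and the sample records are made up to stand in for an activity query result.

```python
from itertools import islice


def clean_activities(records):
    """Lazily yield records with no validity comment and no duplicate flag."""
    for rec in records:
        if rec.get("data_validity_comment"):
            continue  # flagged by a data quality comment
        if rec.get("potential_duplicate") in (1, "1"):
            continue  # flagged as a potential duplicate (assumed encoding)
        yield rec


# Made-up records standing in for an activity query result
sample = [
    {"activity_id": 1, "data_validity_comment": None, "potential_duplicate": 0},
    {"activity_id": 2, "data_validity_comment": "Outside typical range", "potential_duplicate": 0},
    {"activity_id": 3, "data_validity_comment": None, "potential_duplicate": 1},
    {"activity_id": 4, "data_validity_comment": None, "potential_duplicate": 0},
]

# islice stops after the first clean record, so later records are never read
first_clean = list(islice(clean_activities(iter(sample)), 1))
print([rec["activity_id"] for rec in first_clean])  # [1]
```

Because the function is a generator, it can wrap a query result directly and only pulls as many pages as the consumer actually iterates over.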

Related Skills

  • rdkit-cheminformatics: SMILES manipulation, molecular descriptors, fingerprints
  • datamol-cheminformatics: Molecular preprocessing and featurization
  • pubchem-compound-search: Alternative compound database (NIH)

References

Bundled Resources

Self-contained entry (no references/ directory). Original total: 662 lines (SKILL.md 389 + api_reference.md 273). Scripts: 279 lines (example_queries.py).

Original file disposition:

  • SKILL.md (389 lines) → Core API, Workflows, Quick Start. "Common Use Cases" consolidated (rule 7b): Find Kinase Inhibitors → Workflow 1 pattern, Virtual Screening → Recipe, Drug Repurposing → omitted (trivial loop over drug endpoint, not a distinct analytical workflow). "Important Notes" section routed to Best Practices and Troubleshooting (rule 9)
  • references/api_reference.md (273 lines) → Consolidated inline. Filter Operators → Key Concepts table. Core Endpoints listing → Key Concepts table (all endpoints). Molecular Properties → Key Concepts table. Bioactivity Data Fields → Key Concepts table. Target Information Fields → Key Concepts table. Configuration/Settings → Common Recipes. Error handling/rate limiting → Troubleshooting + Best Practices. Response formats (JSON/XML/YAML) → omitted (JSON is default and only format used via Python SDK). Advanced query examples already covered in Core API
  • scripts/example_queries.py (279 lines) → Thin-wrapper functions absorbed into Core API modules: get_molecule_info / search_molecules_by_name / find_molecules_by_properties → Module 1 (Molecule Queries); get_target_info / search_targets_by_name → Module 2 (Target Queries); get_bioactivity_data / get_compound_bioactivities → Module 3 (Bioactivity Data); find_similar_compounds / substructure_search → Module 4 (Structure-Based Search); get_drug_info → Module 5 (Drug and Mechanism Data); find_kinase_inhibitors → Workflow 1; export_to_dataframe → Workflow 1 + Recipe (Export)

Retention: ~460 lines / 662 original = ~69%. Vendor metadata stripped (rule 13). Agent-behavior section stripped (rule 4).