SciAgent-Skills chembl-database-bioactivity
Query ChEMBL bioactive molecules and drug discovery data using the Python SDK. Search compounds by structure/properties, retrieve bioactivity data (IC50, Ki, EC50), find inhibitors for targets, perform SAR studies, and access drug mechanism/indication data for medicinal chemistry research.
git clone https://github.com/jaechang-hits/SciAgent-Skills
T=$(mktemp -d) && git clone --depth=1 https://github.com/jaechang-hits/SciAgent-Skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/structural-biology-drug-discovery/chembl-database-bioactivity" ~/.claude/skills/jaechang-hits-sciagent-skills-chembl-database-bioactivity && rm -rf "$T"
skills/structural-biology-drug-discovery/chembl-database-bioactivity/SKILL.mdChEMBL Database — Bioactivity Queries
Overview
Query the ChEMBL bioactive molecule database (2M+ compounds, 19M+ bioactivity measurements, 13K+ targets) using the
chembl_webresource_client Python SDK. Covers compound search, target lookup, bioactivity retrieval, structure-based search, and drug information access.
When to Use
- Finding compounds by name, ChEMBL ID, or physicochemical properties
- Querying bioactivity data (IC50, Ki, EC50) for specific targets
- Performing similarity or substructure searches using SMILES
- Retrieving drug mechanisms of action and clinical indications
- Identifying inhibitors, agonists, or bioactive molecules for a target
- Analyzing structure-activity relationships (SAR) across compound series
- Filtering molecules by Lipinski rule-of-5 or other drug-likeness criteria
- For general cheminformatics (SMILES manipulation, fingerprints, descriptors) use rdkit-cheminformatics instead
Prerequisites
uv pip install chembl_webresource_client # Optional: pandas for tabular analysis uv pip install pandas
Rate limiting: The SDK handles rate limiting internally via automatic retries and caching. No
time.sleep() needed between queries. For large-scale data retrieval (100K+ records), consider ChEMBL bulk downloads instead of API queries.
Quick Start
from chembl_webresource_client.new_client import new_client # Each entity type has its own client endpoint molecule = new_client.molecule target = new_client.target activity = new_client.activity # Retrieve a molecule by ChEMBL ID aspirin = molecule.get('CHEMBL25') print(f"{aspirin['pref_name']}: MW={aspirin['molecule_properties']['mw_freebase']}") # Search targets by name egfr_targets = target.filter(pref_name__icontains='EGFR', target_type='SINGLE PROTEIN') print(f"Found {len(list(egfr_targets))} EGFR-related targets") # Query bioactivities with filters potent = activity.filter( target_chembl_id='CHEMBL203', # EGFR standard_type='IC50', standard_value__lte=100, # <= 100 nM standard_units='nM' )
Key Concepts
Filter Operators
ChEMBL uses Django-style query filters on all endpoints:
| Operator | Meaning | Example |
|---|---|---|
| Exact match (default) | |
| Case-insensitive exact | |
| Substring match | |
| Case-insensitive substring | |
| Prefix match | |
| Suffix match | |
/ | Greater than (or equal) | |
/ | Less than (or equal) | |
| Value in range | |
| Value in list | |
| Null check | |
| Regular expression | |
| Full-text search | |
Core Endpoints
| Endpoint | Access | Description |
|---|---|---|
| | Compound structures, properties, synonyms |
| | Protein and non-protein biological targets |
| | Bioassay measurement results |
| | Experimental assay details |
| | Approved pharmaceutical information |
| | Drug mechanism of action data |
| | Drug therapeutic indications |
| | Tanimoto similarity search |
| | Substructure search |
| | SVG molecular structure images |
| | Parent/salt forms |
| | Protein classification hierarchy |
| | Target component details |
| | Cell line information |
| | Tissue type information |
| | Structural alerts for toxicity |
| | Literature source references |
Molecular Properties
Properties accessible via
molecule['molecule_properties']:
| Field | Description |
|---|---|
| Molecular weight (free base) |
| Full molecular weight (including salts) |
| Calculated LogP |
| Hydrogen bond acceptors |
| Hydrogen bond donors |
| Polar surface area |
| Rotatable bonds |
| Lipinski rule-of-5 violations |
| Rule of 3 compliance |
| Most acidic pKa |
| Most basic pKa |
Target Information Fields
Key fields in target records:
| Field | Description |
|---|---|
| ChEMBL target identifier |
| Preferred target name |
| Type: SINGLE PROTEIN, PROTEIN COMPLEX, ORGANISM |
| Target organism (e.g., Homo sapiens) |
| NCBI taxonomy ID |
| Component details (UniProt accession, etc.) |
Bioactivity Data Fields
Key fields in activity records:
| Field | Description |
|---|---|
| Activity type: IC50, Ki, Kd, EC50, etc. |
| Numerical activity value |
| Units: nM, uM, etc. |
| Normalized activity (-log10 scale, comparable across types) |
| Activity annotations |
| Data quality flags (check before analysis) |
| Duplicate flag |
Core API
1. Molecule Queries
molecule = new_client.molecule # By ChEMBL ID aspirin = molecule.get('CHEMBL25') # By name (case-insensitive) results = molecule.filter(pref_name__icontains='imatinib') # By properties (Lipinski rule-of-5 compliant) drug_like = molecule.filter( molecule_properties__mw_freebase__lte=500, molecule_properties__alogp__lte=5, molecule_properties__hba__lte=10, molecule_properties__hbd__lte=5 ) # By property range mid_weight = molecule.filter( molecule_properties__mw_freebase__range=[300, 500] )
2. Target Queries
target = new_client.target # By ChEMBL ID egfr = target.get('CHEMBL203') print(f"{egfr['pref_name']} ({egfr['organism']})") # Search by name and type kinases = target.filter( target_type='SINGLE PROTEIN', pref_name__icontains='kinase' ) # By organism human_targets = target.filter(organism='Homo sapiens')
3. Bioactivity Data
activity = new_client.activity # Potent inhibitors for a target potent = activity.filter( target_chembl_id='CHEMBL203', standard_type='IC50', standard_value__lte=100, standard_units='nM' ) # All activities for a compound (with pChEMBL values) compound_acts = activity.filter( molecule_chembl_id='CHEMBL25', pchembl_value__isnull=False ) # Multiple activity types ki_data = activity.filter( target_chembl_id='CHEMBL240', standard_type__in=['IC50', 'Ki', 'Kd'] )
4. Structure-Based Search
# Similarity search (Tanimoto) similarity = new_client.similarity similar = similarity.filter( smiles='CC(=O)Oc1ccccc1C(=O)O', # aspirin similarity=85 # >=85% similarity ) # Substructure search substructure = new_client.substructure benzimidazoles = substructure.filter(smiles='c1ccc2[nH]cnc2c1')
5. Drug and Mechanism Data
drug = new_client.drug mechanism = new_client.mechanism drug_indication = new_client.drug_indication # Drug details drug_info = drug.get('CHEMBL25') # Mechanisms of action mechs = mechanism.filter(molecule_chembl_id='CHEMBL941') for m in mechs: print(f"{m['mechanism_of_action']} → {m.get('target_chembl_id')}") # Therapeutic indications indications = drug_indication.filter(molecule_chembl_id='CHEMBL941') for ind in indications: print(f"{ind.get('mesh_heading')} (Phase {ind.get('max_phase_for_ind')})") # SVG molecular image image = new_client.image svg_data = image.get('CHEMBL25') with open('aspirin.svg', 'w') as f: f.write(svg_data)
Common Workflows
Workflow 1: Find Inhibitors for a Target
from chembl_webresource_client.new_client import new_client import pandas as pd # Step 1: Identify the target targets = new_client.target.filter(pref_name__icontains='BRAF', target_type='SINGLE PROTEIN') target_id = list(targets)[0]['target_chembl_id'] # Step 2: Query potent activities activities = new_client.activity.filter( target_chembl_id=target_id, standard_type='IC50', standard_value__lte=100, standard_units='nM', pchembl_value__isnull=False ) # Step 3: Convert to DataFrame for analysis df = pd.DataFrame(list(activities)) df['standard_value'] = pd.to_numeric(df['standard_value']) print(f"Found {len(df)} potent compounds") print(df[['molecule_chembl_id', 'standard_value', 'pchembl_value']].head(10))
Workflow 2: Analyze a Known Drug
from chembl_webresource_client.new_client import new_client chembl_id = 'CHEMBL941' # Imatinib # Drug information drug_info = new_client.molecule.get(chembl_id) print(f"Name: {drug_info['pref_name']}") print(f"MW: {drug_info['molecule_properties']['mw_freebase']}") # Mechanisms of action mechs = list(new_client.mechanism.filter(molecule_chembl_id=chembl_id)) for m in mechs: print(f"Mechanism: {m['mechanism_of_action']}") # Indications indications = list(new_client.drug_indication.filter(molecule_chembl_id=chembl_id)) for ind in indications: print(f"Indication: {ind.get('mesh_heading')} (Phase {ind.get('max_phase_for_ind')})") # All bioactivity data activities = list(new_client.activity.filter( molecule_chembl_id=chembl_id, pchembl_value__isnull=False )) print(f"Total bioactivity records: {len(activities)}")
Workflow 3: SAR Study
from chembl_webresource_client.new_client import new_client import pandas as pd # Step 1: Find similar compounds to lead similar = new_client.similarity.filter( smiles='c1ccc2c(c1)cc(nc2N)c3ccc(cc3)NC(=O)c4ccccc4', # lead compound similarity=80 ) analogs = list(similar) # Step 2: Collect activities for each analog records = [] for compound in analogs[:20]: # limit for demo cid = compound['molecule_chembl_id'] acts = list(new_client.activity.filter( molecule_chembl_id=cid, standard_type='IC50', pchembl_value__isnull=False )) for act in acts: records.append({ 'chembl_id': cid, 'target': act.get('target_pref_name'), 'IC50_nM': act.get('standard_value'), 'pchembl': act.get('pchembl_value'), 'mw': compound.get('molecule_properties', {}).get('mw_freebase'), 'alogp': compound.get('molecule_properties', {}).get('alogp') }) # Step 3: Analyze property-activity relationships df = pd.DataFrame(records) if not df.empty: df['IC50_nM'] = pd.to_numeric(df['IC50_nM']) print(df.groupby('target')['IC50_nM'].describe())
Common Recipes
Recipe: Virtual Screening Filter (Lipinski + Activity)
from chembl_webresource_client.new_client import new_client candidates = new_client.molecule.filter( molecule_properties__mw_freebase__range=[300, 500], molecule_properties__alogp__lte=5, molecule_properties__hba__lte=10, molecule_properties__hbd__lte=5, molecule_properties__num_ro5_violations=0 ) print(f"Drug-like candidates: {len(list(candidates))}")
Recipe: Client Configuration
from chembl_webresource_client.settings import Settings Settings.Instance().CACHING = True # enable/disable cache Settings.Instance().CACHE_EXPIRE = 86400 # cache duration (seconds) Settings.Instance().TIMEOUT = 30 # request timeout (seconds) Settings.Instance().TOTAL_RETRIES = 3 # retry count on failure
Recipe: Export Activities to CSV
import pandas as pd from chembl_webresource_client.new_client import new_client activities = new_client.activity.filter( target_chembl_id='CHEMBL203', standard_type='IC50', pchembl_value__isnull=False ) df = pd.DataFrame(list(activities)) df.to_csv('egfr_activities.csv', index=False) print(f"Exported {len(df)} records")
Key Parameters
| Parameter | Endpoint | Default | Description |
|---|---|---|---|
| | — | Tanimoto threshold (0-100), typically 70-90 |
| | — | Activity type: IC50, Ki, Kd, EC50 |
| | — | Max activity value (nM) |
| | — | Normalized -log10 activity (>6 = potent) |
| | — | SINGLE PROTEIN, PROTEIN COMPLEX, ORGANISM |
| | | Enable HTTP response caching |
| | | Cache TTL in seconds |
| | | HTTP request timeout in seconds |
| | | Auto-retry count on failure |
Troubleshooting
| Problem | Cause | Solution |
|---|---|---|
| Empty results from filter | No matches or too strict filters | Relax filters; verify IDs exist with first |
on molecule properties | Not all molecules have full property data | Use |
| Query returns unexpectedly few results | Lazy evaluation not consumed | Convert to before checking length |
| Slow queries | Large result sets paginated automatically | Add more filters to narrow results; use |
on | Invalid ChEMBL ID | Verify ID format (e.g., CHEMBL25, not 25) |
| Stale data | Aggressive caching | Set or clear cache |
| Timeout errors | Server overload or large query | Increase ; split into smaller queries |
| Mixed units in activity data | Different assays use different units | Filter by or use |
| Duplicate activity records | Same measurement from different sources | Check and |
Best Practices
- Use
for cross-study comparisons — it normalizes IC50/Ki/EC50 to a comparable -log10 scalepchembl_value - Always check
before using activity values — flags data quality issuesdata_validity_comment - Filter by
to ensure consistent units across resultsstandard_units - Pagination is automatic: the SDK handles pagination transparently — iterate directly over query results without manual page handling. Convert to
only when you need all results in memorylist() - Use lazy evaluation: queries execute only when iterated — convert to
only when neededlist() - Cache results: the SDK caches for 24h by default — leverage this for repeated queries
- For bulk data (>100K records): use ChEMBL FTP downloads rather than API queries
Related Skills
— SMILES manipulation, molecular descriptors, fingerprintsrdkit-cheminformatics
— Molecular preprocessing and featurizationdatamol-cheminformatics
— Alternative compound database (NIH)pubchem-compound-search
References
- ChEMBL website: https://www.ebi.ac.uk/chembl/
- API documentation: https://www.ebi.ac.uk/chembl/api/data/docs
- Python client: https://github.com/chembl/chembl_webresource_client
- Interface docs: https://chembl.gitbook.io/chembl-interface-documentation/
- Example notebooks: https://github.com/chembl/notebooks
Bundled Resources
Self-contained entry (no
references/ directory). Original total: 662 lines (SKILL.md 389 + api_reference.md 273). Scripts: 279 lines (example_queries.py).
Original file disposition:
(389 lines) → Core API, Workflows, Quick Start. "Common Use Cases" consolidated (rule 7b): Find Kinase Inhibitors → Workflow 1 pattern, Virtual Screening → Recipe, Drug Repurposing → omitted (trivial loop over drug endpoint, not a distinct analytical workflow). "Important Notes" section routed to Best Practices and Troubleshooting (rule 9)SKILL.md
(273 lines) → Consolidated inline. Filter Operators → Key Concepts table. Core Endpoints listing → Key Concepts table (all endpoints). Molecular Properties → Key Concepts table. Bioactivity Data Fields → Key Concepts table. Target Information Fields → Key Concepts table. Configuration/Settings → Common Recipes. Error handling/rate limiting → Troubleshooting + Best Practices. Response formats (JSON/XML/YAML) → omitted (JSON is default and only format used via Python SDK). Advanced query examples already covered in Core APIreferences/api_reference.md
(279 lines) → Thin-wrapper functions absorbed into Core API modules:scripts/example_queries.py
/get_molecule_info
/search_molecules_by_name
→ Module 1 (Molecule Queries);find_molecules_by_properties
/get_target_info
→ Module 2 (Target Queries);search_targets_by_name
/get_bioactivity_data
→ Module 3 (Bioactivity Data);get_compound_bioactivities
/find_similar_compounds
→ Module 4 (Structure-Based Search);substructure_search
→ Module 5 (Drug and Mechanism Data);get_drug_info
→ Workflow 1;find_kinase_inhibitors
→ Workflow 1 + Recipe (Export)export_to_dataframe
Retention: ~460 lines / 662 original = ~69%. Vendor metadata stripped (rule 13). Agent-behavior section stripped (rule 4).