SciAgent-Skills pytdc-therapeutics-data-commons

install
source · Clone the upstream repo
git clone https://github.com/jaechang-hits/SciAgent-Skills
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/jaechang-hits/SciAgent-Skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/structural-biology-drug-discovery/pytdc-therapeutics-data-commons" ~/.claude/skills/jaechang-hits-sciagent-skills-pytdc-therapeutics-data-commons && rm -rf "$T"
manifest: skills/structural-biology-drug-discovery/pytdc-therapeutics-data-commons/SKILL.md
source content

PyTDC (Therapeutics Data Commons)

Overview

PyTDC is an open-science platform providing AI-ready datasets and benchmarks for drug discovery. It organizes therapeutics data into three categories: single-instance prediction (molecular/protein properties), multi-instance prediction (drug-target interactions), and generation (molecule design, retrosynthesis). All datasets come with standardized splits, evaluation metrics, and molecular oracles.

When to Use

  • Loading curated ADME, toxicity, or bioactivity datasets for ML model training
  • Benchmarking drug discovery models with standardized 5-seed evaluation protocols
  • Predicting drug-target or drug-drug interactions with proper cold-split evaluation
  • Generating novel molecules and scoring them with molecular oracles (QED, SA, DRD2, GSK3B)
  • Accessing scaffold-based or temporal train/test splits for pharmaceutical ML
  • Converting molecular representations (SMILES to PyG graphs, ECFP fingerprints, SELFIES)
  • For chemical database queries (compound search, bioactivity), use `chembl-database-bioactivity` instead
  • For molecular featurization beyond format conversion, use `molfeat` instead

Prerequisites

uv pip install PyTDC
# Core deps: numpy, pandas, scikit-learn, tqdm, fuzzywuzzy
# Optional: rdkit (scaffold splits), torch-geometric (PyG conversion)

API Note: TDC downloads datasets on first access (~10-500 MB per dataset). Specify `path='data/'` to control the download location. No API key required.

Quick Start

from tdc.single_pred import ADME
from tdc import Evaluator

# Load dataset with scaffold split
data = ADME(name='Caco2_Wang')
split = data.get_split(method='scaffold', seed=42, frac=[0.7, 0.1, 0.2])
train, valid, test = split['train'], split['valid'], split['test']
print(f"Train: {len(train)}, Valid: {len(valid)}, Test: {len(test)}")
# Train: ~640, Valid: ~91, Test: ~182

# Evaluate predictions
evaluator = Evaluator(name='MAE')
# score = evaluator(test['Y'].values, predictions)

Core API

Module 1: Single-Instance Prediction — Dataset Access

Load datasets for predicting properties of individual molecules or proteins.

from tdc.single_pred import ADME, Tox, HTS, QM

# ADME — pharmacokinetic properties
data = ADME(name='Caco2_Wang')       # Intestinal permeability (regression)
data = ADME(name='BBB_Martins')       # Blood-brain barrier (binary)
data = ADME(name='Lipophilicity_AstraZeneca')  # LogD (regression)
data = ADME(name='Solubility_AqSolDB')         # Aqueous solubility

# Toxicity — adverse effects
data = Tox(name='hERG')              # Cardiotoxicity (binary)
data = Tox(name='AMES')              # Mutagenicity (binary)
data = Tox(name='DILI')              # Drug-induced liver injury
data = Tox(name='ClinTox')           # Clinical trial toxicity

# Access data as DataFrame
df = data.get_data(format='df')
print(df.columns.tolist())
# ['Drug_ID', 'Drug', 'Y'] — Drug is SMILES, Y is target label
print(f"Dataset size: {len(df)}, Label range: [{df['Y'].min():.2f}, {df['Y'].max():.2f}]")

Other single-prediction tasks: `HTS` (screening), `QM` (quantum mechanics), `Yields`, `Epitope`, `Develop`, `CRISPROutcome`.

Module 2: Multi-Instance Prediction — Interaction Datasets

Load datasets for predicting interactions between pairs of biomedical entities.

from tdc.multi_pred import DTI, DDI, PPI

# Drug-Target Interaction — binding affinity
data = DTI(name='BindingDB_Kd')      # 52,284 pairs, Kd values
data = DTI(name='DAVIS')             # 30,056 pairs, kinase binding
data = DTI(name='KIBA')              # 118,254 pairs, kinase bioactivity

# Drug-Drug Interaction — interaction type prediction
data = DDI(name='DrugBank')           # 191,808 pairs, 86 interaction types

# Protein-Protein Interaction
data = PPI(name='HuRI')

# Multi-instance data format
df = data.get_data(format='df')
print(df.columns.tolist())
# ['Drug_ID', 'Drug', 'Target_ID', 'Target', 'Y']
# Drug=SMILES, Target=protein sequence, Y=binding affinity or class

Other multi-instance tasks: `GDA`, `DrugRes`, `DrugSyn`, `PeptideMHC`, `AntibodyAff`, `MTI`, `Catalyst`, `TrialOutcome`.

Module 3: Generation Tasks — Molecular Design

Load training sets and oracles for molecule generation and retrosynthesis.

from tdc.generation import MolGen, RetroSyn, PairMolGen
from tdc import Oracle

# Molecule generation — training data
data = MolGen(name='ChEMBL_V29')     # 1.6M drug-like SMILES
split = data.get_split()
train_smiles = split['train']['Drug'].tolist()

# Oracle scoring — evaluate generated molecules
oracle = Oracle(name='GSK3B')         # GSK3B inhibition predictor (0-1)
score = oracle('CC(C)Cc1ccc(cc1)C(C)C(O)=O')
print(f"GSK3B score: {score:.4f}")

# Batch evaluation
scores = oracle(['CCO', 'c1ccccc1', 'CC(=O)O'])
print(f"Batch scores: {scores}")

# Retrosynthesis — reaction prediction
data = RetroSyn(name='USPTO')         # 1.9M reactions
split = data.get_split()

# Paired generation — prodrug design
data = PairMolGen(name='Prodrug')

Module 4: Data Splits and Evaluation

Apply meaningful data splits and standardized evaluation metrics.

from tdc.single_pred import ADME
from tdc.multi_pred import DTI
from tdc import Evaluator

# Scaffold split — ensures chemical diversity between sets
data = ADME(name='Caco2_Wang')
split = data.get_split(method='scaffold', seed=42, frac=[0.7, 0.1, 0.2])

# Cold splits — for DTI (unseen drugs/targets in test set)
data = DTI(name='BindingDB_Kd')
cold_drug = data.get_split(method='cold_drug', seed=1)
cold_target = data.get_split(method='cold_target', seed=1)

# Verify no overlap in cold split
train_drugs = set(cold_drug['train']['Drug_ID'])
test_drugs = set(cold_drug['test']['Drug_ID'])
print(f"Drug overlap: {len(train_drugs & test_drugs)}")  # 0

# Evaluation metrics
eval_mae = Evaluator(name='MAE')
eval_auc = Evaluator(name='ROC-AUC')
eval_spearman = Evaluator(name='Spearman')
# score = eval_mae(y_true, y_pred)

Available split methods: `random`, `scaffold` (Bemis-Murcko), `cold_drug`, `cold_target`, `cold_drug_target`, `temporal`.
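The cold splits above all share one grouping idea: partition by entity rather than by row, so every row for a given drug (or target) lands in exactly one set. A minimal pure-Python sketch of that logic (illustrative only, not TDC's implementation):

```python
import random

def cold_split(rows, entity_key, frac=(0.7, 0.1, 0.2), seed=1):
    """Assign whole entities (e.g. all rows sharing a Drug_ID) to one set,
    so train/valid/test share no entities -- the idea behind cold_drug."""
    entities = sorted({r[entity_key] for r in rows})
    random.Random(seed).shuffle(entities)
    n = len(entities)
    n_train, n_valid = int(frac[0] * n), int(frac[1] * n)
    buckets = {e: 'train' for e in entities[:n_train]}
    buckets.update({e: 'valid' for e in entities[n_train:n_train + n_valid]})
    buckets.update({e: 'test' for e in entities[n_train + n_valid:]})
    split = {'train': [], 'valid': [], 'test': []}
    for r in rows:
        split[buckets[r[entity_key]]].append(r)
    return split

# 50 interaction rows over 10 drugs; each drug ends up in exactly one set
rows = [{'Drug_ID': f'D{i % 10}', 'Y': i} for i in range(50)]
split = cold_split(rows, 'Drug_ID')
train_ids = {r['Drug_ID'] for r in split['train']}
test_ids = {r['Drug_ID'] for r in split['test']}
print(len(train_ids & test_ids))  # 0 -- no drug overlap
```

A row-level random split of the same data would almost certainly place some drugs in both train and test, which is exactly the leakage cold splits prevent.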

Available metrics: Classification — `ROC-AUC`, `PR-AUC`, `F1`, `Accuracy`, `Kappa`. Regression — `RMSE`, `MAE`, `R2`, `MSE`. Ranking — `Spearman`, `Pearson`. Multi-label — `Micro-F1`, `Macro-F1`.
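For intuition, the two most common regression metrics reduce to a few lines; the sketch below mirrors what `Evaluator(name='MAE')` and `Evaluator(name='RMSE')` compute (an illustration, not TDC's code):

```python
import math

def mae(y_true, y_pred):
    # Mean absolute error: average magnitude of the residuals
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    # Root mean squared error: penalizes large residuals more than MAE
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 2.0, 2.5, 5.0]
print(mae(y_true, y_pred))   # 0.5
print(rmse(y_true, y_pred))  # ~0.612
```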

Module 5: Benchmark Groups

Run standardized multi-seed evaluation protocols for model comparison.

from tdc.benchmark_group import admet_group

# Load ADMET benchmark (22 datasets)
group = admet_group(path='data/')

# Standard 5-seed evaluation protocol
benchmark = group.get('Caco2_Wang')
name = benchmark['name']
train_val, test = benchmark['train_val'], benchmark['test']

predictions_list = []
for seed in [1, 2, 3, 4, 5]:
    # Seed-specific train/valid split drawn from the combined train_val set
    train, valid = group.get_train_valid_split(benchmark=name, split_type='default', seed=seed)
    # Train your model on train, tune on valid
    # y_pred = model.predict(test['Drug'])
    y_pred = test['Y'].values  # placeholder
    predictions_list.append({name: y_pred})

# Aggregate across seeds: returns {name: [mean, std]} per benchmark
results = group.evaluate_many(predictions_list)
print(f"Mean MAE: {results[name][0]:.4f} ± {results[name][1]:.4f}")

Key Concepts

Dataset Organization

| Category | Import Path | Task Examples | Data Format |
|---|---|---|---|
| Single-Instance | `tdc.single_pred` | ADME, Tox, HTS, QM | Drug (SMILES) + Y (label) |
| Multi-Instance | `tdc.multi_pred` | DTI, DDI, PPI, DrugSyn | Drug + Target + Y |
| Generation | `tdc.generation` | MolGen, RetroSyn | SMILES collections |
| Benchmark | `tdc.benchmark_group` | admet_group | Curated splits |

Oracle Categories

| Category | Examples | Speed | Output Range |
|---|---|---|---|
| Biochemical | DRD2, GSK3B, JNK3, 5HT2A | Medium (ML) | 0-1 probability |
| Physicochemical | QED, SA, LogP, MW | Fast (rule-based) | Varies by metric |
| Composite | Isomer_Meta, Median1/2, Rediscovery | Medium | 0-1 combined |
| Specialized | ASKCOS, Docking, Vina | Slow (external) | Varies |
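Because output ranges differ across categories, raw oracle scores should not be summed directly. Below is a hypothetical normalizer that maps each score to "higher is better in [0, 1]" before weighting; the SA rescaling assumes the 1-10, lower-is-better convention noted above:

```python
def normalize_sa(sa_score):
    # SA spans 1 (easy) to 10 (hard); map to [0, 1] with higher = easier
    return (10.0 - sa_score) / 9.0

def combined_score(qed, sa, activity, weights=(0.3, 0.3, 0.4)):
    # QED and ML activity oracles already lie in [0, 1]; only SA needs rescaling
    parts = (qed, normalize_sa(sa), activity)
    return sum(w * p for w, p in zip(weights, parts))

print(normalize_sa(1.0))   # 1.0 (easiest synthesis)
print(normalize_sa(10.0))  # 0.0 (hardest)
print(round(combined_score(qed=0.8, sa=2.5, activity=0.6), 3))  # 0.73
```

The weights and the linear SA rescaling here are illustrative choices; check each oracle's documented range before combining, as Best Practices recommends.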

Data Processing Utilities

| Utility | Function | Example |
|---|---|---|
| Format conversion | `MolConvert(src, dst)` | SMILES → PyG, ECFP, SELFIES, DGL |
| Molecule filters | `MolFilter(filters)` | PAINS, BMS, Glaxo, drug-likeness |
| Label binarization | `label_transform()` | Continuous → binary at threshold |
| Unit conversion | `label_transform(from_unit, to_unit)` | nM → pIC50 |
| ID resolution | `cid2smiles()`, `uniprot2seq()` | PubChem CID → SMILES |
| Dataset listing | `retrieve_dataset_names(task)` | List all ADME datasets |
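The nM → pIC50 conversion in the table is a fixed logarithmic transform: pIC50 is the negative log10 of the molar concentration, which for a value in nM equals 9 - log10(nM). A one-line sketch for intuition:

```python
import math

def nm_to_pic50(ic50_nm):
    # pIC50 = -log10(IC50 in mol/L); 1 nM = 1e-9 M, so this equals 9 - log10(nM)
    return -math.log10(ic50_nm * 1e-9)

print(round(nm_to_pic50(1.0), 6))     # 9.0 (a 1 nM inhibitor)
print(round(nm_to_pic50(1000.0), 6))  # 6.0 (a 1 uM inhibitor)
```

Higher pIC50 means a more potent compound, which is why labels are often transformed this way before regression.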

Common Workflows

Workflow 1: Multi-Seed ADME Model Evaluation

from tdc.single_pred import ADME
from tdc import Evaluator
import numpy as np

data = ADME(name='Caco2_Wang')
evaluator = Evaluator(name='MAE')

results = []
for seed in [1, 2, 3, 4, 5]:
    split = data.get_split(method='scaffold', seed=seed)
    train, valid, test = split['train'], split['valid'], split['test']
    # model.fit(train['Drug'], train['Y'])
    # preds = model.predict(test['Drug'])
    preds = test['Y'].values + np.random.normal(0, 0.1, len(test))  # placeholder
    score = evaluator(test['Y'].values, preds)
    results.append(score)
    print(f"Seed {seed}: MAE = {score:.4f}")

print(f"Mean MAE: {np.mean(results):.4f} ± {np.std(results):.4f}")

Workflow 2: Multi-Objective Molecular Scoring

from tdc import Oracle
import numpy as np

# Define multi-objective scoring
oracles = {
    'QED': (Oracle(name='QED'), 0.3),      # drug-likeness
    'SA': (Oracle(name='SA'), 0.3),         # synthetic accessibility
    'GSK3B': (Oracle(name='GSK3B'), 0.4),   # target activity
}

test_smiles = ['CC(C)Cc1ccc(cc1)C(C)C(O)=O', 'c1ccc2c(c1)cc1ccc3cccc4ccc2c1c34']

for smi in test_smiles:
    scores = {}
    weighted_sum = 0
    for name, (oracle, weight) in oracles.items():
        score = oracle(smi)
        scores[name] = score
        weighted_sum += score * weight
    print(f"SMILES: {smi[:30]}...")
    print(f"  Scores: {scores}")
    print(f"  Weighted: {weighted_sum:.4f}")

Workflow 3: Cold-Split DTI Evaluation (text-only)

  1. Load DTI dataset —
    DTI(name='BindingDB_Kd')
    (Core API Module 2)
  2. Apply cold_drug split —
    data.get_split(method='cold_drug', seed=seed)
    (Core API Module 4)
  3. Verify zero drug overlap between train and test sets (Module 4 overlap check)
  4. Train model on train set, predict on test set
  5. Evaluate with Spearman correlation —
    Evaluator(name='Spearman')
    (Module 4)
  6. Repeat for 5 seeds and report mean ± std (same pattern as Workflow 1)
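Step 5's metric is easy to sanity-check by hand: Spearman correlation is just Pearson correlation computed on ranks. A dependency-free sketch that ignores ties (which TDC's evaluator handles properly):

```python
def rankdata(values):
    # 1-based rank positions; assumes no ties for simplicity
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    for rank, i in enumerate(order, start=1):
        ranks[i] = float(rank)
    return ranks

def spearman(y_true, y_pred):
    # Pearson correlation on ranks; with no ties both rank vectors are
    # permutations of 1..n, so their variances are equal
    rt, rp = rankdata(y_true), rankdata(y_pred)
    n = len(rt)
    mean = (n + 1) / 2.0
    cov = sum((a - mean) * (b - mean) for a, b in zip(rt, rp))
    var = sum((a - mean) ** 2 for a in rt)
    return cov / var

print(spearman([1, 2, 3, 4], [10, 20, 30, 40]))  # 1.0 -- perfect rank agreement
print(spearman([1, 2, 3, 4], [40, 30, 20, 10]))  # -1.0 -- reversed ranks
```

Rank correlation is preferred for cold-split DTI because it rewards getting the ordering of binding affinities right, even when absolute predicted values are miscalibrated on unseen drugs.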

Key Parameters

| Parameter | Function/Module | Default | Range/Options | Effect |
|---|---|---|---|---|
| `method` | `get_split()` | `'scaffold'` | `random`, `scaffold`, `cold_drug`, `cold_target`, `temporal` | Split strategy for train/test |
| `seed` | `get_split()` | `42` | 1-5 for benchmarks | Reproducibility; use 5 seeds for benchmarks |
| `frac` | `get_split()` | `[0.7, 0.1, 0.2]` | Sum must equal 1.0 | Train/valid/test proportions |
| `name` | `Evaluator()` | — | `MAE`, `RMSE`, `ROC-AUC`, `Spearman`, etc. | Evaluation metric |
| `name` | `Oracle()` | — | `QED`, `SA`, `GSK3B`, `DRD2`, etc. | Scoring function for molecules |
| `src`/`dst` | `MolConvert()` | — | `SMILES`, `SELFIES`, `PyG`, `DGL`, `ECFP4` | Molecular representation formats |
| `path` | `admet_group()` | `'data/'` | Any directory | Dataset download/cache location |
| `format` | `get_data()` | `'df'` | `'df'`, `'dict'` | Output data format |

Best Practices

  1. Always use scaffold splits for molecular property prediction — random splits leak structural information and inflate performance metrics
  2. Report 5-seed evaluations with mean ± std — single-seed results are unreliable for method comparison
  3. Use cold splits for interaction prediction — `cold_drug` tests generalization to unseen drugs, `cold_target` to unseen targets
  4. Filter molecules early with `MolFilter` (PAINS, drug-likeness) before training to remove problematic compounds
  5. Normalize oracles appropriately — QED returns 0-1, SA returns 1-10 (lower is better), binding scores vary. Check oracle documentation before combining
  6. Cache datasets by specifying a persistent `path` — avoids re-downloading large datasets across sessions

Common Recipes

Recipe 1: Dataset Exploration and Statistics

from tdc.single_pred import ADME
from tdc.utils import retrieve_dataset_names

# List all ADME datasets
datasets = retrieve_dataset_names('ADME')
print(f"Available ADME datasets: {datasets}")

# Load and inspect
data = ADME(name='Caco2_Wang')
df = data.get_data(format='df')
print(f"Size: {len(df)}")
print(f"Label stats: mean={df['Y'].mean():.2f}, std={df['Y'].std():.2f}")
print(f"SMILES example: {df['Drug'].iloc[0]}")

Recipe 2: Molecule Format Conversion Pipeline

from tdc.chem_utils import MolConvert

# SMILES to multiple representations
smiles = 'CC(C)Cc1ccc(cc1)C(C)C(O)=O'

converter_ecfp = MolConvert(src='SMILES', dst='ECFP4')
converter_selfies = MolConvert(src='SMILES', dst='SELFIES')

ecfp = converter_ecfp(smiles)
selfies = converter_selfies(smiles)
print(f"ECFP4 shape: {ecfp.shape}")    # (1024,) binary fingerprint
print(f"SELFIES: {selfies}")

Recipe 3: Custom Oracle with Constraint Satisfaction

from tdc import Oracle

# Define property constraints
constraints = {
    'QED': (Oracle(name='QED'), 0.5, 1.0),       # min, max
    'SA': (Oracle(name='SA'), 1.0, 4.0),
    'LogP': (Oracle(name='LogP'), -0.5, 5.0),
}

def check_constraints(smiles):
    """Check if molecule satisfies all property constraints."""
    results = {}
    all_pass = True
    for name, (oracle, lo, hi) in constraints.items():
        score = oracle(smiles)
        passed = lo <= score <= hi
        results[name] = {'score': score, 'passed': passed}
        all_pass = all_pass and passed
    return all_pass, results

passed, details = check_constraints('CC(C)Cc1ccc(cc1)C(C)C(O)=O')
for name, info in details.items():
    status = "PASS" if info['passed'] else "FAIL"
    print(f"{name}: {info['score']:.3f} [{status}]")

Troubleshooting

| Problem | Cause | Solution |
|---|---|---|
| `ModuleNotFoundError: tdc` | Package not installed | `uv pip install PyTDC` |
| Scaffold split fails | Missing RDKit dependency | `uv pip install rdkit` for scaffold decomposition |
| Dataset download timeout | Large dataset or slow connection | Set `path='data/'` for persistent cache; retry |
| `KeyError` on dataset name | Wrong name or task category | Use `retrieve_dataset_names('ADME')` to list valid names |
| Oracle returns NaN | Invalid SMILES or RDKit parse failure | Validate SMILES with RDKit `MolFromSmiles()` first |
| Cold split empty test set | Too few unique entities | Use `frac=[0.7, 0.1, 0.2]` with larger datasets |
| Benchmark evaluation error | Wrong prediction format | `evaluate_many` expects a list of per-seed prediction dicts keyed by benchmark name |
| Memory error on large dataset | Full dataset loaded to memory | Process in chunks or use smaller split fractions |
| PyG conversion fails | torch-geometric not installed | `uv pip install torch-geometric` for graph conversion |

Bundled Resources

references/datasets_catalog.md

Covers: complete catalog of all TDC datasets organized by task category (single-instance, multi-instance, generation) with dataset names, sizes, label types, and data sources. Relocated inline: top ADME/Tox/DTI datasets with code examples consolidated into Core API Modules 1-2. Omitted: None — all dataset entries preserved in catalog format.

references/oracles_utilities.md

Covers: detailed oracle documentation (all 17+ oracles with parameters, speed tiers, output ranges, custom oracle template) and data processing utilities (format conversion targets, molecule filter types, label transformation, entity resolution). Consolidated from the original `oracles.md` + `utilities.md`. Relocated inline: Quick Start oracle usage pattern, top evaluation metrics, split methods, MolConvert pattern → Core API Modules 3-4. Omitted: distribution learning KS-test example — niche statistical comparison; leaderboard submission guide — platform-specific.

Script Disposition

  • load_and_split_data.py (215 lines): scaffold/cold split patterns → Core API Module 4; custom split fractions → Key Parameters; evaluation examples → Workflow 1. Thin wrappers around `get_split()` and `Evaluator()`.
  • benchmark_evaluation.py (328 lines): 5-seed protocol → Core API Module 5 + Workflow 1; multi-dataset evaluation → Workflow 1 pattern; leaderboard guide → omitted (platform-specific).
  • molecular_generation.py (405 lines): single/batch oracle usage → Core API Module 3; multi-objective scoring → Workflow 2; constraint satisfaction → Recipe 3; distribution learning → omitted (niche).

Related Skills

  • chembl-database-bioactivity — for querying ChEMBL compound/target/activity data directly
  • molfeat — for advanced molecular featurization beyond TDC's built-in MolConvert
  • rdkit-cheminformatics — for molecular manipulation, substructure search, descriptor calculation

References