Medical-research-skills pytdc
Therapeutics Data Commons (PyTDC) for AI-ready therapeutic ML datasets and benchmarks; use it when you need standardized dataset loading, meaningful splits (e.g., scaffold/cold-start), and consistent evaluation for ADME/Toxicity/DTI/DDI or molecular optimization.
```shell
git clone https://github.com/aipoch/medical-research-skills
T=$(mktemp -d) && git clone --depth=1 https://github.com/aipoch/medical-research-skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/scientific-skills/Evidence Insight/pytdc" ~/.claude/skills/aipoch-medical-research-skills-pytdc && rm -rf "$T"
```
scientific-skills/Evidence Insight/pytdc/SKILL.md

When to Use
- You need curated, AI-ready datasets for drug discovery tasks (e.g., ADME, toxicity, bioactivity, DTI/DDI).
- You want standardized benchmarks with consistent evaluation protocols (including multi-seed benchmark groups).
- You require meaningful data splits such as scaffold splits (chemical diversity) or cold-start splits (unseen drugs/targets).
- You are building models for single-instance prediction (molecular/protein properties) or multi-instance prediction (interactions).
- You are doing goal-directed molecular generation and need property Oracles for scoring/optimization.
Key Features
- Dataset access by task type
- Single-instance prediction: ADME, Toxicity (Tox), HTS, QM, and more.
- Multi-instance prediction: DTI, DDI, PPI, and more.
- Generation: MolGen, RetroSyn, PairMolGen.
- Standardized splitting utilities
- `random`, `scaffold`, and cold-start variants such as `cold_drug`, `cold_target`, and `cold_drug_target`, plus `temporal` where applicable.
- Unified evaluation
- Built-in `Evaluator` with common metrics (ROC-AUC, PR-AUC, RMSE, MAE, Spearman, etc.).
- Benchmark groups
- Curated collections (e.g., ADMET group) with recommended evaluation protocols (commonly 5 seeds).
- Chem/data utilities
- Molecular format conversion (e.g., SMILES → PyG), filtering, balancing, negative sampling, entity retrieval (CID→SMILES, UniProt→sequence).
- Molecular Oracles
- Property scoring functions usable for goal-directed generation workflows (see `references/oracles.md`).
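To make the negative-sampling utility above concrete, here is a pure-Python sketch of the idea: given observed positive drug–target pairs, unobserved pairs are drawn as negatives. This is an illustration only, not PyTDC's implementation; the pair data and sampling policy are toy assumptions.

```python
import random


def sample_negatives(positive_pairs, drugs, targets, n, seed=0):
    """Sample drug-target pairs not present in the positive set.

    Illustrative stand-in for a negative-sampling utility; real tools
    (e.g., PyTDC's data-processing helpers) offer richer options.
    """
    rng = random.Random(seed)
    positives = set(positive_pairs)
    negatives = set()
    while len(negatives) < n:
        pair = (rng.choice(drugs), rng.choice(targets))
        if pair not in positives:  # keep only unobserved pairs
            negatives.add(pair)
    return sorted(negatives)


positives = [("D1", "T1"), ("D2", "T2")]
negs = sample_negatives(positives, ["D1", "D2"], ["T1", "T2"], n=2)
# Every sampled pair is guaranteed to be outside the positive set.
```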
Dependencies
Install (recommended):

```shell
uv pip install PyTDC
```

Upgrade:

```shell
uv pip install PyTDC --upgrade
```
Core runtime dependencies (installed automatically; versions depend on the PyTDC release you install):
- `PyTDC` (latest from PyPI), `numpy`, `pandas`, `scikit-learn`, `tqdm`, `seaborn`, `fuzzywuzzy`
Optional dependencies may be pulled in automatically depending on which submodules you use (e.g., graph backends or chemistry toolchains).
Example Usage
A complete runnable example that:
- loads an ADME dataset,
- performs a scaffold split,
- trains a simple baseline model,
- evaluates with a standard metric,
- queries an Oracle score for a SMILES.
```python
# pip install PyTDC scikit-learn
from tdc.single_pred import ADME
from tdc import Evaluator, Oracle
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import Ridge


def main():
    # 1) Load a single-instance prediction dataset (ADME)
    data = ADME(name="Caco2_Wang")

    # 2) Create a scaffold split (train/valid/test)
    split = data.get_split(method="scaffold", seed=42, frac=[0.7, 0.1, 0.2])
    train, valid, test = split["train"], split["valid"], split["test"]

    # 3) Train a simple baseline model on SMILES strings
    #    (character n-gram + ridge regression; replace with your own model)
    model = Pipeline(
        steps=[
            ("featurizer", CountVectorizer(analyzer="char", ngram_range=(2, 5))),
            ("regressor", Ridge(alpha=1.0)),
        ]
    )
    model.fit(train["Drug"], train["Y"])

    # 4) Evaluate on the test set using a TDC Evaluator
    y_pred = model.predict(test["Drug"])
    evaluator = Evaluator(name="MAE")
    mae = evaluator(test["Y"], y_pred)
    print(f"Test MAE: {mae:.4f}")

    # 5) Oracle scoring example (property scoring for a SMILES)
    oracle = Oracle(name="DRD2")
    score = oracle("CC(C)Cc1ccc(cc1)C(C)C(O)=O")
    print(f"DRD2 Oracle score: {score}")


if __name__ == "__main__":
    main()
```
Related references and templates (if present in this skill package):
- Oracle catalog and usage: `references/oracles.md`
- Utility functions (splits, processing, retrieval): `references/utilities.md`
- Dataset catalog: `references/datasets.md`
- Workflow templates: `scripts/load_and_split_data.py`, `scripts/benchmark_evaluation.py`, `scripts/molecular_generation.py`
Implementation Details
1) Dataset Access Pattern
PyTDC datasets follow a consistent interface:
```python
from tdc.<problem> import <Task>

data = <Task>(name="<DatasetName>")
df = data.get_data(format="df")
split = data.get_split(method="scaffold", seed=1, frac=[0.7, 0.1, 0.2])
```
`<problem>` is typically one of:
- `single_pred` (single-entity property prediction)
- `multi_pred` (pairwise/multi-entity interaction prediction)
- `generation` (molecule/reaction generation tasks)
2) Splitting Strategies (Key Parameters)
Use `get_split(...)` to obtain `{"train": ..., "valid": ..., "test": ...}`.
Common parameters:
- `method`: split strategy
- `seed`: random seed for reproducibility
- `frac`: `[train, valid, test]` fractions (when supported)
Typical methods:
- `random`: random shuffling split
- `scaffold`: Bemis–Murcko scaffold-based split to reduce scaffold leakage and improve chemical generalization
- Cold-start (commonly for interaction tasks like DTI/DDI):
  - `cold_drug`: test contains unseen drugs
  - `cold_target`: test contains unseen targets
  - `cold_drug_target`: test contains unseen drugs and targets
- `temporal`: time-based split for datasets with timestamps (when available)
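To build intuition for why scaffold splits reduce leakage, here is a minimal pure-Python sketch of the grouping idea: molecules sharing a scaffold key are assigned as a whole group, so no scaffold straddles train and test. The `records` data and scaffold keys are toy placeholders; PyTDC's real split derives Bemis–Murcko scaffolds from the molecules themselves.

```python
def scaffold_split(records, frac_train=0.8):
    """Group records by scaffold key and assign whole groups to train/test.

    `records` is a list of (smiles, scaffold_key) tuples; the keys here
    are toy stand-ins for real Bemis-Murcko scaffolds.
    """
    groups = {}
    for smiles, scaffold in records:
        groups.setdefault(scaffold, []).append(smiles)

    # Fill train with the largest scaffold groups first (a common heuristic).
    ordered = sorted(groups.values(), key=len, reverse=True)
    n_train = int(frac_train * sum(len(g) for g in ordered))

    train, test = [], []
    for group in ordered:
        (train if len(train) + len(group) <= n_train else test).extend(group)
    return train, test


records = [
    ("c1ccccc1O", "benzene"), ("c1ccccc1N", "benzene"),
    ("c1ccccc1C", "benzene"), ("C1CCNCC1", "piperidine"),
    ("C1CCNCC1C", "piperidine"), ("CCO", "acyclic"),
]
train, test = scaffold_split(records, frac_train=0.7)
# No scaffold key appears in both train and test.
```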
Example:
```python
split = data.get_split(method="cold_target", seed=1)
```
3) Standardized Evaluation
TDC provides a unified evaluator:
```python
from tdc import Evaluator

evaluator = Evaluator(name="ROC-AUC")  # classification
score = evaluator(y_true, y_pred)
```
Choose metrics appropriate to the task type:
- Classification: `ROC-AUC`, `PR-AUC`, `F1`, `Accuracy`, etc.
- Regression: `RMSE`, `MAE`, `R2`, `Spearman`, `Pearson`, etc.
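As a sanity check on what metric names like `MAE` and `Accuracy` compute, here are the underlying formulas in plain Python. This illustrates the standard metric definitions only; it is not PyTDC's `Evaluator` code.

```python
def mae(y_true, y_pred):
    """Mean absolute error: average of |y_true - y_pred| (regression)."""
    assert len(y_true) == len(y_pred)
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def accuracy(y_true, y_pred):
    """Fraction of exact label matches (classification)."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


print(mae([3.0, -0.5, 2.0], [2.5, 0.0, 2.0]))  # ≈ 0.3333
print(accuracy([1, 0, 1], [1, 1, 1]))          # ≈ 0.6667
```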
4) Data Schemas (What Columns to Expect)
While schemas vary by task, common conventions include:
- Single-instance prediction (e.g., ADME/Tox): `Drug` (often SMILES) and label `Y`; sometimes `Drug_ID`/`Compound_ID`
- Multi-instance prediction (e.g., DTI): `Drug` (SMILES), `Target` (protein sequence), label `Y`, plus identifiers such as `Drug_ID`, `Target_ID`
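Since column conventions vary by task, a lightweight schema check before training can catch surprises early. The helper below is an illustrative sketch (not part of PyTDC) built from the column names listed above; pass it `df.columns` from a loaded dataset.

```python
# Minimal expected columns per task type, per the conventions above.
EXPECTED = {
    "single_pred": {"Drug", "Y"},
    "multi_pred": {"Drug", "Target", "Y"},
}


def check_schema(columns, task_kind):
    """Raise if a dataset lacks the columns its task type expects."""
    missing = EXPECTED[task_kind] - set(columns)
    if missing:
        raise ValueError(f"missing columns for {task_kind}: {sorted(missing)}")
    return True


check_schema(["Drug", "Target", "Y", "Drug_ID"], "multi_pred")  # passes
```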
5) Oracles for Molecular Optimization
Oracles provide a callable scoring interface:
```python
from tdc import Oracle

oracle = Oracle(name="GSK3B")
score = oracle("CCO...")                 # single SMILES
scores = oracle(["SMILES1", "SMILES2"])  # batch of SMILES
```
Use Oracles to:
- score candidate molecules during generation,
- define optimization objectives,
- compare molecules under consistent property predictors.
For the full list of Oracles and their expected inputs/outputs, see `references/oracles.md`.
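A typical goal-directed workflow treats the Oracle as a black-box scorer inside a select-and-iterate loop. The sketch below uses a toy stand-in scoring function so it runs without PyTDC; in practice you would substitute a real `Oracle(name=...)` for `toy_oracle`, whose scoring rule here is purely illustrative.

```python
def toy_oracle(smiles):
    """Stand-in scorer: rewards ring-closure digits and length.

    Replace with a real property predictor, e.g. tdc.Oracle(name="DRD2").
    """
    return smiles.count("1") * 0.5 + len(smiles) * 0.01


def select_best(candidates, oracle, k=2):
    """Score every candidate and keep the top-k (core of a generation loop)."""
    return sorted(candidates, key=oracle, reverse=True)[:k]


candidates = ["CCO", "c1ccccc1", "c1ccc2ccccc2c1", "CCN"]
best = select_best(candidates, toy_oracle, k=2)
# The two ring-containing candidates score highest under the toy oracle.
```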