SciAgent-Skills deepchem
Deep learning framework for drug discovery and materials science. 60+ models (GCN, GAT, AttentiveFP, MPNN, DMPNN, ChemBERTa, GROVER), 50+ molecular featurizers, MoleculeNet benchmarks, hyperparameter optimization, transfer learning. Unified load-featurize-split-train-evaluate API. For fingerprint-only cheminformatics use rdkit-cheminformatics; for featurization hub without training use molfeat-molecular-featurization.
Clone the full repository:

```bash
git clone https://github.com/jaechang-hits/SciAgent-Skills
```

Or install just this skill:

```bash
T=$(mktemp -d) && git clone --depth=1 https://github.com/jaechang-hits/SciAgent-Skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/structural-biology-drug-discovery/deepchem" ~/.claude/skills/jaechang-hits-sciagent-skills-deepchem && rm -rf "$T"
```

skills/structural-biology-drug-discovery/deepchem/SKILL.md

DeepChem — Deep Learning for Drug Discovery
Overview
DeepChem is an open-source Python framework providing a unified API for molecular machine learning across drug discovery, materials science, and quantum chemistry. It wraps 60+ model architectures (graph neural networks, transformers, classical ML) with 50+ molecular featurizers and standardized datasets (MoleculeNet), enabling end-to-end workflows from SMILES strings to trained predictive models.
When to Use
- Predicting molecular properties (solubility, toxicity, binding affinity) from SMILES
- Benchmarking models on MoleculeNet standardized datasets (BBBP, Tox21, ESOL, FreeSolv, etc.)
- Training graph neural networks on molecular graphs (GCN, GAT, AttentiveFP, MPNN, DMPNN)
- Fine-tuning pretrained chemical language models (ChemBERTa, GROVER, MolFormer)
- Running hyperparameter optimization for molecular ML models
- Virtual screening and hit prioritization with trained models
- Materials property prediction from crystal structures (CGCNN, MEGNet)
- Protein-ligand interaction modeling and binding affinity prediction
- For fingerprint-based cheminformatics without deep learning, use rdkit-cheminformatics instead
- For featurization only (no model training), use molfeat-molecular-featurization instead
Prerequisites
- Python packages: `deepchem` (core), `torch` or `tensorflow` (backend-dependent models)
- GPU: Recommended for graph neural networks and transformer models; CPU sufficient for classical ML and fingerprint models
- Data: SMILES strings with property labels (CSV), or MoleculeNet datasets (auto-downloaded)
```bash
# Core installation (includes RDKit, scikit-learn, XGBoost)
pip install deepchem

# With PyTorch backend (GNN models)
pip install deepchem[torch]

# With TensorFlow backend (legacy models)
pip install deepchem[tensorflow]

# Full installation (all backends + extras)
pip install deepchem[all]
```
Quick Start
```python
import deepchem as dc

# Load MoleculeNet dataset with featurization + scaffold split
tasks, datasets, transformers = dc.molnet.load_delaney(featurizer="ECFP")
train, valid, test = datasets

# Train and evaluate a multitask regressor
model = dc.models.MultitaskRegressor(n_tasks=1, n_features=1024, dropouts=0.2)
model.fit(train, nb_epoch=50)

metric = dc.metrics.Metric(dc.metrics.pearson_r2_score)
print(f"Test R2: {model.evaluate(test, [metric])}")  # {'pearson_r2_score': ~0.7}
```
Core API
Module 1: Data Loading and Processing
Load molecular data from CSV files or MoleculeNet benchmark datasets.
```python
import deepchem as dc

# Load from CSV (SMILES + property columns)
loader = dc.data.CSVLoader(
    tasks=["measured_log_solubility"],
    feature_field="smiles",
    featurizer=dc.feat.CircularFingerprint(size=2048, radius=3),
)
dataset = loader.create_dataset("solubility_data.csv")
print(f"Samples: {dataset.X.shape[0]}, Features: {dataset.X.shape[1]}")
# Samples: 1128, Features: 2048

# Load from SDF (3D structures)
sdf_loader = dc.data.SDFLoader(
    tasks=["activity"],
    featurizer=dc.feat.CoulombMatrix(max_atoms=50),
)
dataset_3d = sdf_loader.create_dataset("molecules.sdf")
```
```python
# Load MoleculeNet benchmark datasets (auto-download + featurize + split)
# Available: load_delaney, load_bbbp, load_tox21, load_hiv, load_qm7, load_qm9, etc.
tasks, datasets, transformers = dc.molnet.load_tox21(featurizer="ECFP", splitter="scaffold")
train, valid, test = datasets
print(f"Tasks: {len(tasks)}, Train: {len(train)}, Test: {len(test)}")
# Tasks: 12, Train: ~6264, Test: ~631

# Inverse-transform predictions back to original scale (assumes a trained `model`)
y_pred = model.predict(test)
y_original = transformers[0].untransform(y_pred)
```
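The transformer round-trip above can be sketched with plain NumPy: a z-score transform of the targets and its exact inverse, which is conceptually what `NormalizationTransformer` and `untransform` do. The helper names here are illustrative, not DeepChem API.

```python
import numpy as np

def normalize(y):
    """Z-score targets; keep mean/std so predictions can be mapped back."""
    mu, sigma = y.mean(), y.std()
    return (y - mu) / sigma, mu, sigma

def untransform(y_scaled, mu, sigma):
    """Inverse of normalize: recover values on the original scale."""
    return y_scaled * sigma + mu

y = np.array([-2.18, -3.30, -0.74, 0.41])   # e.g. log-solubility labels
y_norm, mu, sigma = normalize(y)
assert abs(y_norm.mean()) < 1e-9            # zero mean after transform
recovered = untransform(y_norm, mu, sigma)
print(np.allclose(recovered, y))            # True
```

Training on normalized targets and forgetting the inverse step is a common source of "predictions on the wrong scale" bugs; the round-trip above is the invariant to check.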
Module 2: Molecular Featurization
Convert molecules to numerical representations for ML. DeepChem provides 50+ featurizers spanning fingerprints, descriptors, graph features, and Coulomb matrices.
```python
import deepchem as dc

smiles = ["CCO", "CC(=O)O", "c1ccccc1", "CC(C)O"]

# Fingerprints (most common for classical ML)
ecfp = dc.feat.CircularFingerprint(size=2048, radius=3)
fp_features = ecfp.featurize(smiles)
print(f"ECFP shape: {fp_features.shape}")  # (4, 2048)

# RDKit descriptors (interpretable physicochemical properties)
rdkit_desc = dc.feat.RDKitDescriptors()
desc_features = rdkit_desc.featurize(smiles)
print(f"Descriptor shape: {desc_features.shape}")  # (4, 208)

# Graph features (for GNN models — returns ConvMol objects)
graph_feat = dc.feat.ConvMolFeaturizer()
graphs = graph_feat.featurize(smiles)
print(f"Atoms in first mol: {graphs[0].get_num_atoms()}")  # 3

# Mol2Vec embeddings (pretrained word2vec on molecular substructures)
mol2vec = dc.feat.Mol2VecFingerprint()
embeddings = mol2vec.featurize(smiles)
print(f"Mol2Vec shape: {embeddings.shape}")  # (4, 300)
```
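The core idea behind fixed-size fingerprints like ECFP can be sketched without RDKit: hash each substructure identifier and fold it into a fixed-length bit vector, accepting collisions. This is a toy illustration of the folding step only; real ECFP enumerates circular atom environments.

```python
import numpy as np

def fold_to_fingerprint(substructure_ids, size=2048):
    """Set one bit per hashed substructure; collisions are accepted (folding)."""
    fp = np.zeros(size, dtype=np.int8)
    for s in substructure_ids:
        fp[hash(s) % size] = 1
    return fp

# Toy substructure identifiers standing in for circular atom environments
fp = fold_to_fingerprint(["C-O", "C-C", "O-H", "C-C-O"], size=2048)
print(fp.shape, int(fp.sum()))  # (2048,) with at most 4 bits set
```

Folding is why two different substructures can share a bit: increasing `size` reduces collisions at the cost of sparser, larger vectors.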
Module 3: Model Training and Evaluation
DeepChem provides MultitaskRegressor and MultitaskClassifier as general-purpose models, plus specialized architectures for graph and sequence data.
```python
import deepchem as dc

# Load dataset
tasks, datasets, transformers = dc.molnet.load_delaney(featurizer="ECFP")
train, valid, test = datasets

# Regression model (fingerprint input)
model = dc.models.MultitaskRegressor(
    n_tasks=1,
    n_features=1024,
    layer_sizes=[1000, 500],
    dropouts=0.25,
    learning_rate=0.001,
    batch_size=64,
)
model.fit(train, nb_epoch=100)

# Evaluate with multiple metrics
metrics = [
    dc.metrics.Metric(dc.metrics.pearson_r2_score),
    dc.metrics.Metric(dc.metrics.mean_absolute_error),
    dc.metrics.Metric(dc.metrics.rms_score),
]
results = model.evaluate(test, metrics)
print(f"R2: {results['pearson_r2_score']:.3f}, MAE: {results['mean_absolute_error']:.3f}")
```
```python
import numpy as np

# Classification model (e.g., Tox21 toxicity prediction)
tasks, datasets, transformers = dc.molnet.load_tox21(featurizer="ECFP")
train, valid, test = datasets

clf = dc.models.MultitaskClassifier(
    n_tasks=len(tasks),
    n_features=1024,
    layer_sizes=[1000, 500],
    dropouts=0.5,
    learning_rate=0.001,
)
clf.fit(train, nb_epoch=50)

roc_metric = dc.metrics.Metric(dc.metrics.roc_auc_score, np.mean)
print(f"Mean ROC-AUC: {clf.evaluate(test, [roc_metric])}")
```
Module 4: Graph Neural Networks
GNNs operate directly on molecular graphs (atoms as nodes, bonds as edges), avoiding information loss from fixed fingerprints.
```python
import deepchem as dc

# Load with graph featurizer
tasks, datasets, transformers = dc.molnet.load_delaney(featurizer="GraphConv")
train, valid, test = datasets

# Graph Convolutional Network (Duvenaud et al.)
gcn_model = dc.models.GraphConvModel(
    n_tasks=1,
    mode="regression",
    graph_conv_layers=[64, 64],
    dense_layer_size=256,
    dropout=0.2,
    learning_rate=0.001,
    batch_size=64,
)
gcn_model.fit(train, nb_epoch=100)

metric = dc.metrics.Metric(dc.metrics.pearson_r2_score)
print(f"GCN R2: {gcn_model.evaluate(test, [metric])}")
```
```python
# AttentiveFP (Xiong et al.) — attention-based GNN, strong on molecular properties
tasks, datasets, transformers = dc.molnet.load_delaney(
    featurizer=dc.feat.MolGraphConvFeaturizer(use_edges=True)
)
train, valid, test = datasets

attfp_model = dc.models.AttentiveFPModel(
    n_tasks=1,
    mode="regression",
    num_layers=2,
    graph_feat_size=200,
    num_timesteps=2,
    dropout=0.2,
    learning_rate=0.001,
    batch_size=64,
)
attfp_model.fit(train, nb_epoch=100)

metric = dc.metrics.Metric(dc.metrics.pearson_r2_score)
print(f"AttentiveFP R2: {attfp_model.evaluate(test, [metric])}")
```
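A single message-passing round, the core operation these GNN architectures share, can be sketched in plain Python: each atom's new feature vector aggregates its neighbors' features over the bond graph. This is a hypothetical toy, not the DeepChem internals, and it omits the learned weight matrices and nonlinearities.

```python
import numpy as np

def message_pass(node_feats, adjacency):
    """One round: each node sums its neighbors' features into its own."""
    new_feats = node_feats.copy()
    for i, neighbors in enumerate(adjacency):
        for j in neighbors:
            new_feats[i] += node_feats[j]
    return new_feats

# Ethanol-like graph: C(0)-C(1)-O(2); features are one-hot element types [C, O]
feats = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
adj = [[1], [0, 2], [1]]  # neighbor lists per atom
out = message_pass(feats, adj)
print(out)  # central atom 1 now mixes C and O information: [[2,0], [2,1], [1,1]]
```

Stacking more rounds (the `num_layers` parameter) lets information travel further across the graph, which is why deeper message passing captures larger substructures.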
Module 5: Transfer Learning
Fine-tune pretrained chemical language models for downstream tasks with limited data.
```python
import deepchem as dc
from deepchem.models.torch_models import ChemBERTaModel

# ChemBERTa — SMILES-based transformer (pretrained on 77M molecules)
tasks, datasets, transformers = dc.molnet.load_bbbp(featurizer=dc.feat.SmilesTokenizer())
train, valid, test = datasets

chemberta = ChemBERTaModel(
    task="classification",
    n_tasks=1,
    model_dir="chemberta_finetuned/",
)

# Fine-tune on downstream task (BBB permeability)
chemberta.fit(train, nb_epoch=10)
metric = dc.metrics.Metric(dc.metrics.roc_auc_score)
print(f"ChemBERTa ROC-AUC: {chemberta.evaluate(test, [metric])}")
```
Module 6: Predictions on New Molecules
Run inference on new molecules with a trained model.
```python
import deepchem as dc
import numpy as np

# Assume trained model from Module 3
# Featurize new molecules using same featurizer
featurizer = dc.feat.CircularFingerprint(size=1024, radius=2)
new_smiles = ["c1cc(O)ccc1", "CC(=O)Nc1ccc(O)cc1", "OC(=O)c1ccccc1"]
new_features = featurizer.featurize(new_smiles)
new_dataset = dc.data.NumpyDataset(X=new_features)

predictions = model.predict(new_dataset)
for smi, pred in zip(new_smiles, predictions):
    print(f"{smi}: {pred[0]:.2f}")

# Ensemble predictions from multiple models for robustness
models = [model1, model2, model3]  # trained models
all_preds = np.array([m.predict(new_dataset) for m in models])
ensemble_mean = all_preds.mean(axis=0)
ensemble_std = all_preds.std(axis=0)
print(f"Ensemble prediction: {ensemble_mean[0][0]:.2f} +/- {ensemble_std[0][0]:.2f}")
```
Key Concepts
Unified API Pattern
All DeepChem workflows follow a consistent 5-step pattern:
Load Data → Featurize → Split → Train → Evaluate
- Load: `CSVLoader`, `SDFLoader`, or `dc.molnet.load_*()` (auto-loads MoleculeNet datasets)
- Featurize: Pass a featurizer to the loader, or call `featurizer.featurize(smiles)` directly
- Split: `ScaffoldSplitter` (recommended for drug discovery), `RandomSplitter`, `ButinaSplitter`
- Train: `model.fit(train_dataset, nb_epoch=N)`
- Evaluate: `model.evaluate(test_dataset, metrics_list)`
Model Selection Guide
| Data Type | Model | Key Feature | Use When |
|---|---|---|---|
| SMILES + fingerprints | `MultitaskRegressor` | Fast, baseline | First attempt, small datasets |
| SMILES + fingerprints | `MultitaskClassifier` | Multi-label | Multi-task classification (Tox21) |
| Molecular graphs | `GraphConvModel` | Learned fingerprints | Medium datasets, general properties |
| Molecular graphs | `GATModel` | Attention mechanism | When atom importance matters |
| Molecular graphs | `AttentiveFPModel` | Graph + timestep attention | State-of-art molecular properties |
| Molecular graphs | `MPNNModel` | Message passing | Complex molecular interactions |
| Molecular graphs | `DMPNNModel` | Directed MPNN | Bond-level predictions |
| SMILES strings | ChemBERTa | Pretrained transformer | Low-data regime, transfer learning |
| SMILES strings | GROVER | Graph + transformer | Rich molecular representations |
| Crystal structures | `CGCNNModel` | Crystal graph CNN | Materials property prediction |
| Crystal structures | `MEGNetModel` | Graph networks | Materials and molecules |
| Protein-ligand complexes | `AtomicConvModel` | Complex modeling | Binding affinity prediction |
| Tabular features | `SklearnModel`, `GBDTModel` | Classical ML | Interpretability, baselines |
Featurizer Selection Guide
| Featurizer | Class | Output | Best For |
|---|---|---|---|
| ECFP/Morgan | `CircularFingerprint` | Binary vector (1024-2048) | General QSAR, fast baselines |
| MACCS Keys | `MACCSKeysFingerprint` | 167-bit vector | Substructure filtering |
| RDKit 2D | `RDKitDescriptors` | 200+ descriptors | Interpretable models |
| Mol2Vec | `Mol2VecFingerprint` | 300-dim embedding | Similarity, clustering |
| ConvMol | `ConvMolFeaturizer` | Graph features | `GraphConvModel` input |
| MolGraph | `MolGraphConvFeaturizer` | Node + edge features | `AttentiveFPModel`, `MPNNModel` |
| Weave | `WeaveFeaturizer` | Pair features | `WeaveModel` input |
| Coulomb Matrix | `CoulombMatrix` | Atom-pair distances | QM property prediction |
| SMILES tokens | `SmilesTokenizer` | Token IDs | ChemBERTa, transformer models |
Data Splitting Strategies
| Splitter | Use Case | Why |
|---|---|---|
| `ScaffoldSplitter` | Drug discovery (default) | Tests generalization to new chemotypes |
| `RandomSplitter` | Quick experiments | Baseline, but overestimates performance |
| `ButinaSplitter` | Diversity-based | Clusters by Tanimoto similarity |
| `FingerprintSplitter` | Chemical similarity | Groups structurally similar molecules |
| `MaxMinSplitter` | Maximum diversity test | Extreme generalization test |
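The logic behind scaffold splitting can be sketched generically: group molecule indices by a scaffold key, then assign whole groups (largest first) to the training set until its quota is filled, so no scaffold straddles the split. A toy version with string keys standing in for Murcko scaffolds:

```python
from collections import defaultdict

def grouped_split(scaffolds, train_frac=0.8):
    """Split indices so molecules sharing a scaffold land on the same side."""
    groups = defaultdict(list)
    for idx, scaf in enumerate(scaffolds):
        groups[scaf].append(idx)
    train, test = [], []
    # Assign largest scaffold groups to train first, then overflow to test
    for members in sorted(groups.values(), key=len, reverse=True):
        if len(train) + len(members) <= train_frac * len(scaffolds):
            train.extend(members)
        else:
            test.extend(members)
    return train, test

scaffolds = ["benzene", "benzene", "pyridine", "benzene", "indole"]
train, test = grouped_split(scaffolds, train_frac=0.8)
# No scaffold appears on both sides of the split
assert not {scaffolds[i] for i in train} & {scaffolds[i] for i in test}
```

Because whole scaffold groups move together, the test set contains only chemotypes the model never saw, which is exactly why scaffold-split scores are lower and more honest than random-split scores.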
Common Workflows
Workflow 1: QSAR from CSV Data
Goal: Build a property prediction model from a CSV file with SMILES and activity columns.
```python
import deepchem as dc

# Step 1: Load and featurize CSV data
loader = dc.data.CSVLoader(
    tasks=["pIC50"],
    feature_field="smiles",
    featurizer=dc.feat.CircularFingerprint(size=2048, radius=3),
)
dataset = loader.create_dataset("bioactivity_data.csv")

# Step 2: Normalize targets
transformer = dc.trans.NormalizationTransformer(transform_y=True, dataset=dataset)
dataset = transformer.transform(dataset)

# Step 3: Scaffold split (realistic for drug discovery)
splitter = dc.splits.ScaffoldSplitter()
train, valid, test = splitter.train_valid_test_split(dataset)
print(f"Train: {len(train)}, Valid: {len(valid)}, Test: {len(test)}")

# Step 4: Train model
model = dc.models.MultitaskRegressor(
    n_tasks=1,
    n_features=2048,
    layer_sizes=[1000, 500],
    dropouts=0.25,
    learning_rate=0.001,
    batch_size=64,
)
model.fit(train, nb_epoch=100)

# Step 5: Evaluate
metrics = [
    dc.metrics.Metric(dc.metrics.pearson_r2_score),
    dc.metrics.Metric(dc.metrics.mean_absolute_error),
]
results = model.evaluate(test, metrics)
print(f"R2: {results['pearson_r2_score']:.3f}, MAE: {results['mean_absolute_error']:.3f}")
```
Workflow 2: MoleculeNet Benchmark Comparison
Goal: Compare multiple models on a MoleculeNet benchmark dataset.
```python
import deepchem as dc
from sklearn.ensemble import RandomForestClassifier

# Load dataset with graph featurizer
tasks, datasets, transformers = dc.molnet.load_bbbp(
    featurizer="GraphConv", splitter="scaffold"
)
train, valid, test = datasets
metric = dc.metrics.Metric(dc.metrics.roc_auc_score)

# Model 1: Graph Convolutional Network
gcn = dc.models.GraphConvModel(n_tasks=1, mode="classification", dropout=0.2)
gcn.fit(train, nb_epoch=50)
gcn_score = gcn.evaluate(test, [metric])

# Model 2: Random Forest baseline (needs fingerprints)
tasks_fp, datasets_fp, _ = dc.molnet.load_bbbp(featurizer="ECFP", splitter="scaffold")
train_fp, _, test_fp = datasets_fp
rf = dc.models.SklearnModel(
    model=RandomForestClassifier(n_estimators=500),
    model_dir="rf_model/",
)
rf.fit(train_fp)
rf_score = rf.evaluate(test_fp, [metric])

print(f"GCN ROC-AUC: {gcn_score['roc_auc_score']:.3f}")
print(f"RF ROC-AUC: {rf_score['roc_auc_score']:.3f}")
```
Workflow 3: Transfer Learning Pipeline
Goal: Fine-tune a pretrained model on a small dataset.
- Load pretrained ChemBERTa model (see Module 5 for code)
- Prepare the downstream dataset with the `SmilesTokenizer` featurizer
- Fine-tune with a reduced learning rate (`1e-5` to `5e-5`) for 5-15 epochs
- Evaluate on a held-out scaffold split — expect gains over fingerprint baselines when training data < 1000 samples
- Save the fine-tuned model: `model.save_checkpoint()`
- See `references/workflows_model_catalog.md` Workflow 1 for complete hyperparameter optimization code
Key Parameters
| Parameter | Module | Typical / Default | Range / Options | Effect |
|---|---|---|---|---|
| `n_features` | MultitaskRegressor/Classifier | Required | Matches featurizer output | Input feature dimension |
| `layer_sizes` | MultitaskRegressor/Classifier | `[1000, 500]` | `[500]` to `[2000, 1000, 500]` | Hidden layer dimensions |
| `dropouts` | All neural models | `0.25` | - | Regularization strength |
| `learning_rate` | All neural models | `0.001` | - | Training step size |
| `batch_size` | All neural models | `64` | - | Samples per gradient update |
| `nb_epoch` | `model.fit()` | `50` | - | Training iterations |
| `size` | `CircularFingerprint` | `2048` | - | Fingerprint bit length |
| `radius` | `CircularFingerprint` | `2` | - | Substructure neighborhood radius |
| `graph_conv_layers` | `GraphConvModel` | `[64, 64]` | `[32, 32]` to `[128, 128, 128]` | Graph convolution widths |
| `num_layers` | `AttentiveFPModel` | `2` | - | GNN message passing depth |
| `graph_feat_size` | `AttentiveFPModel` | `200` | - | Graph feature dimension |
| `splitter` | `dc.molnet.load_*` | `"scaffold"` | `"scaffold"`, `"random"`, `"butina"` | Data splitting strategy |
Best Practices
- Always use scaffold splitting for drug discovery: Random splits leak structural information and overestimate performance. Scaffold splits test generalization to novel chemotypes.
- Normalize regression targets: Apply `NormalizationTransformer(transform_y=True)` before training. Remember to `untransform()` predictions for interpretable values.
- Start with fingerprint baselines: Train `MultitaskRegressor` + ECFP first. Only move to GNNs if the fingerprint baseline is insufficient — GNNs need more data and compute.
  ```python
  # Baseline first
  baseline = dc.models.MultitaskRegressor(n_tasks=1, n_features=2048)
  ```
- Match featurizer to model: GNN models require graph featurizers (`ConvMolFeaturizer`, `MolGraphConvFeaturizer`). Fingerprint models need `CircularFingerprint`. Mixing causes silent errors.
- Anti-pattern -- Do not use random split for drug discovery benchmarks: Results with `RandomSplitter` are not publishable for molecular property prediction. Reviewers expect scaffold or temporal splits.
- Handle missing labels in multi-task datasets: Tox21 and many bioactivity datasets have missing values. DeepChem handles NaN labels automatically during training (masked loss), but verify with `np.isnan(dataset.y).sum()`.
- Use early stopping via validation set: Monitor validation loss to prevent overfitting, especially with GNN models.
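The early-stopping practice above amounts to a simple loop: track the best validation score and stop after `patience` epochs without improvement. This is a generic sketch with hypothetical `train_one_epoch`/`validate` callables; with DeepChem you would typically wrap `model.fit(train, nb_epoch=1)` per iteration.

```python
def train_with_early_stopping(train_one_epoch, validate, max_epochs=100, patience=10):
    """Stop when validation loss has not improved for `patience` epochs."""
    best_loss, best_epoch = float("inf"), 0
    for epoch in range(max_epochs):
        train_one_epoch()
        val_loss = validate()
        if val_loss < best_loss:
            best_loss, best_epoch = val_loss, epoch
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` epochs
    return best_epoch, best_loss

# Simulated validation curve: improves, then overfits
losses = iter([1.0, 0.8, 0.6, 0.65, 0.7, 0.72, 0.9, 1.1, 1.2, 1.3, 1.4, 1.5])
best_epoch, best_loss = train_with_early_stopping(
    lambda: None, lambda: next(losses), max_epochs=12, patience=3
)
print(best_epoch, best_loss)  # best validation loss 0.6, reached at epoch 2
```

In practice you would also checkpoint the model at each new best epoch (e.g. with `save_checkpoint`) and restore that checkpoint before final evaluation.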
Common Recipes
Recipe: Hyperparameter Search
When to use: Optimize model performance before final evaluation.
```python
import deepchem as dc

tasks, datasets, transformers = dc.molnet.load_delaney(featurizer="ECFP")
train, valid, test = datasets

# Define parameter grid (n_tasks is fixed; everything else is searched)
params = {
    "n_features": [1024],
    "layer_sizes": [[500], [1000, 500], [1000, 500, 250]],
    "dropouts": [0.1, 0.25, 0.5],
    "learning_rate": [0.001, 0.0005],
}

optimizer = dc.hyper.GridHyperparamOpt(
    lambda **p: dc.models.MultitaskRegressor(n_tasks=1, **p)
)
metric = dc.metrics.Metric(dc.metrics.pearson_r2_score)
best_model, best_params, all_results = optimizer.hyperparam_search(
    params, train, valid, metric, logdir="hyperparam_logs/"
)
print(f"Best params: {best_params}")
print(f"Best R2: {best_model.evaluate(test, [metric])}")
```
Recipe: Save and Reload Models
When to use: Deploy trained models or resume training.
```python
# Save model checkpoint
model.save_checkpoint(model_dir="saved_model/")

# Reload model
loaded_model = dc.models.MultitaskRegressor(n_tasks=1, n_features=2048)
loaded_model.restore(model_dir="saved_model/")
predictions = loaded_model.predict(test)
```
Recipe: Custom Metric
When to use: Evaluate models with domain-specific metrics.
```python
import deepchem as dc
import numpy as np

def enrichment_factor(y_true, y_pred, top_fraction=0.01):
    """Enrichment factor at top X% of ranked predictions."""
    n = len(y_true)
    n_top = max(int(n * top_fraction), 1)
    top_indices = np.argsort(y_pred.flatten())[-n_top:]
    hits_in_top = y_true.flatten()[top_indices].sum()
    expected = y_true.sum() * top_fraction
    return hits_in_top / expected if expected > 0 else 0.0

ef_metric = dc.metrics.Metric(enrichment_factor, mode="regression")
# Assumes a trained `model` and `test` set from earlier modules
print(f"EF@1%: {model.evaluate(test, [ef_metric])}")
```
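The enrichment-factor formula can be sanity-checked on synthetic data where the answer is known: if a model ranks all actives at the top, EF at a given fraction equals `1 / top_fraction` (capped by the active rate). The function below repeats the recipe's definition so the check is self-contained.

```python
import numpy as np

def enrichment_factor(y_true, y_pred, top_fraction=0.01):
    """Actives found in the top X% of predictions vs. random expectation."""
    n = len(y_true)
    n_top = max(int(n * top_fraction), 1)
    top_indices = np.argsort(y_pred.flatten())[-n_top:]
    hits_in_top = y_true.flatten()[top_indices].sum()
    expected = y_true.sum() * top_fraction
    return hits_in_top / expected if expected > 0 else 0.0

# 1000 molecules, 50 actives, perfect ranking: all actives get the highest scores
y_true = np.array([1] * 50 + [0] * 950)
y_pred = np.array([2.0] * 50 + [0.0] * 950)
ef = enrichment_factor(y_true, y_pred, top_fraction=0.05)
print(ef)  # 20.0 — perfect enrichment at 5% (= 1 / 0.05)
```

Running the same check with shuffled predictions should give an EF near 1.0, the random-screening baseline.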
Troubleshooting
| Problem | Cause | Solution |
|---|---|---|
| `ImportError` when building GNN models | PyTorch not installed | `pip install deepchem[torch]` for GNN models |
| Shape mismatch between features and model | Featurizer output size does not match model | Check featurizer output size and set `n_features` accordingly |
| NaN loss during training | Learning rate too high or unnormalized targets | Apply `NormalizationTransformer`, reduce learning rate to `1e-4` |
| Low scaffold-split performance | Model memorizes scaffolds, not properties | Use more data, try GNN models, or add regularization (dropout 0.3-0.5) |
| GPU out-of-memory errors | Batch size too large for GPU | Reduce `batch_size` (32 or 16), or use CPU for small datasets |
| Featurization fails on some SMILES | Invalid or complex SMILES strings | Pre-filter with RDKit: `Chem.MolFromSmiles(smi) is not None` |
| Model predicts constant values | Targets not normalized or too few epochs | Apply `NormalizationTransformer`, increase `nb_epoch` |
| Slow featurization | Large dataset with expensive featurizer | Use `CircularFingerprint` (fast) or featurize in parallel batches |
Bundled Resources
- references/workflows_model_catalog.md -- Extended workflows (hyperparameter optimization with full code, MolGAN generative models, materials property prediction with CGCNN/MEGNet, protein-ligand modeling, custom model architecture) plus complete model catalog (60+ models organized by category) and complete featurizer catalog (50+ featurizers). Covers: workflows 4-8 from original, extended model and featurizer inventories, MoleculeNet dataset catalog. Relocated inline: top 3 workflows (QSAR, MoleculeNet benchmark, transfer learning) are in Common Workflows; core model/featurizer tables are in Key Concepts. Omitted: detailed installation troubleshooting for TensorFlow 1.x (deprecated) and Docker-specific setup (covered by official docs).
Related Skills
- rdkit-cheminformatics -- molecular manipulation, fingerprints, substructure search (upstream featurization)
- molfeat-molecular-featurization -- 100+ featurizers with scikit-learn API (featurization-only alternative)
- datamol-cheminformatics -- Pythonic molecular processing (upstream data prep)
- pytdc-therapeutics-data-commons -- curated ADMET/DTI datasets with standardized splits (complementary data source)
- torch-geometric-graph-neural-networks -- lower-level PyG for custom GNN architectures (alternative for advanced users)
- scikit-learn-machine-learning -- classical ML baselines that DeepChem wraps via `SklearnModel`
References
- DeepChem documentation -- official API docs and tutorials
- DeepChem GitHub -- source code, examples, issues
- MoleculeNet benchmark paper -- Wu et al. 2018, benchmark dataset descriptions
- DeepChem tutorials -- Jupyter notebook tutorials