SciAgent-Skills datamol-cheminformatics

install
source · Clone the upstream repo
git clone https://github.com/jaechang-hits/SciAgent-Skills
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/jaechang-hits/SciAgent-Skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/structural-biology-drug-discovery/datamol-cheminformatics" ~/.claude/skills/jaechang-hits-sciagent-skills-datamol-cheminformatics && rm -rf "$T"
manifest: skills/structural-biology-drug-discovery/datamol-cheminformatics/SKILL.md
source content

Datamol Cheminformatics Toolkit

Overview

Datamol provides a lightweight, Pythonic abstraction layer over RDKit for molecular cheminformatics. It simplifies common drug discovery operations — SMILES parsing, standardization, descriptors, fingerprints, clustering, scaffolds, conformers, and visualization — with sensible defaults, built-in parallelization, and cloud storage support via fsspec. All molecular objects are native

rdkit.Chem.Mol
instances, ensuring full RDKit compatibility.

When to Use

  • Parsing, validating, and standardizing molecular structures from SMILES, SDF, or other formats
  • Computing molecular descriptors and fingerprints for ML featurization
  • Similarity searching and diversity selection from compound libraries
  • Clustering compounds by structural similarity (Butina clustering)
  • Scaffold analysis and scaffold-based train/test splitting for ML
  • BRICS/RECAP molecular fragmentation for fragment-based design
  • 3D conformer generation and analysis
  • Visualizing molecules as grids with alignment and highlighting
  • Batch processing molecular datasets with parallelization
  • For quick gene lookups use gget instead; for advanced substructure queries or custom fingerprints, use RDKit directly

Prerequisites

uv pip install datamol
import datamol as dm
import numpy as np
import pandas as pd

Quick Start

import datamol as dm

# Parse and standardize
mol = dm.to_mol("CC(=O)Oc1ccccc1C(=O)O")  # Aspirin
mol = dm.standardize_mol(mol)
print(dm.to_smiles(mol))  # Canonical SMILES

# Compute descriptors
desc = dm.descriptors.compute_many_descriptors(mol)
print(f"MW: {desc['mw']:.1f}, LogP: {desc['logp']:.2f}, TPSA: {desc['tpsa']:.1f}")

# Generate fingerprint
fp = dm.to_fp(mol, fp_type='ecfp', radius=2, n_bits=2048)
print(f"Fingerprint shape: {fp.shape}")  # (2048,)

Core API

1. Molecular I/O & Standardization

Parsing molecules:

import datamol as dm

# From SMILES (returns None on failure)
mol = dm.to_mol("CCO")
if mol is None:
    print("Invalid SMILES")

# Format conversions
smiles = dm.to_smiles(mol, isomeric=True)  # Canonical SMILES
inchi = dm.to_inchi(mol)
inchikey = dm.to_inchikey(mol)
selfies = dm.to_selfies(mol)

Standardization (always recommended for external data):

mol = dm.standardize_mol(
    mol,
    disconnect_metals=True,
    normalize=True,
    reionize=True
)
clean_smiles = dm.standardize_smiles("C(C)O")  # From SMILES directly

File I/O:

# Reading (supports local, S3, GCS, HTTP via fsspec)
df = dm.read_sdf("compounds.sdf", mol_column='mol')
df = dm.read_csv("data.csv", smiles_column="SMILES", mol_column="mol")
df = dm.read_excel("compounds.xlsx", sheet_name=0, mol_column="mol")
df = dm.open_df("file.sdf")  # Auto-detect format

# Writing
dm.to_sdf(df, "output.sdf", mol_column="mol")
dm.to_smi(mols, "output.smi")
dm.to_xlsx(df, "output.xlsx", mol_columns=["mol"])  # Renders molecule images

# Remote files
df = dm.read_sdf("s3://bucket/compounds.sdf")
dm.to_sdf(mols, "s3://bucket/output.sdf")

2. Descriptors & Properties

import datamol as dm

mol = dm.to_mol("c1ccc(cc1)CCN")

# Standard descriptor set (single molecule)
desc = dm.descriptors.compute_many_descriptors(mol)
# Returns dict: {'mw': 121.18, 'logp': 1.41, 'hbd': 1, 'hba': 1,
#                'tpsa': 26.02, 'n_aromatic_atoms': 6, ...}

# Batch computation (parallel)
mols = [dm.to_mol(s) for s in ["CCO", "c1ccccc1", "CC(=O)O"]]
desc_df = dm.descriptors.batch_compute_many_descriptors(
    mols, n_jobs=-1, progress=True
)
print(desc_df.head())

# Specific descriptors
n_stereo = dm.descriptors.n_stereo_centers(mol)
n_aromatic = dm.descriptors.n_aromatic_atoms(mol)
aromatic_ratio = dm.descriptors.n_aromatic_atoms_proportion(mol)
n_rigid = dm.descriptors.n_rigid_bonds(mol)

Drug-likeness filtering (Lipinski Rule of Five):

def is_druglike(mol):
    desc = dm.descriptors.compute_many_descriptors(mol)
    return (desc['mw'] <= 500 and desc['logp'] <= 5
            and desc['hbd'] <= 5 and desc['hba'] <= 10)

druglike = [m for m in mols if is_druglike(m)]
print(f"Drug-like: {len(druglike)}/{len(mols)}")

3. Fingerprints & Similarity

import datamol as dm

mol = dm.to_mol("c1ccc(cc1)CCN")

# Fingerprint types
fp_ecfp = dm.to_fp(mol, fp_type='ecfp', radius=2, n_bits=2048)  # Morgan/ECFP
fp_maccs = dm.to_fp(mol, fp_type='maccs')    # MACCS keys (167 bits)
fp_topo = dm.to_fp(mol, fp_type='topological')  # Topological
fp_ap = dm.to_fp(mol, fp_type='atompair')    # Atom pairs

# Pairwise distances (Tanimoto distance = 1 - similarity)
mols = [dm.to_mol(s) for s in ["CCO", "CCCO", "c1ccccc1"]]
dist_matrix = dm.pdist(mols, n_jobs=-1)
print(f"Distance vector shape: {dist_matrix.shape}")

# Distances between two sets
query = [dm.to_mol("CCO")]
library = [dm.to_mol(s) for s in ["CCCO", "c1ccccc1", "CC(=O)O"]]
distances = dm.cdist(query, library, n_jobs=-1)
print(f"Query-library distances: {distances.shape}")

4. Clustering & Diversity Selection

import datamol as dm

mols = [dm.to_mol(s) for s in smiles_list]  # Assume smiles_list defined

# Butina clustering (suitable for ~1000 molecules, builds full distance matrix)
clusters = dm.cluster_mols(mols, cutoff=0.2, n_jobs=-1)
for i, cluster in enumerate(clusters[:5]):
    print(f"Cluster {i}: {len(cluster)} molecules")

# Diversity selection (works for larger libraries)
diverse_mols = dm.pick_diverse(mols, npick=100)
print(f"Selected {len(diverse_mols)} diverse molecules")

# Cluster centroids
centroids = dm.pick_centroids(mols, npick=50)
print(f"Selected {len(centroids)} centroids")

5. Scaffolds & Fragments

Murcko scaffold extraction:

import datamol as dm
from collections import Counter

mol = dm.to_mol("c1ccc(cc1)CCN")
scaffold = dm.to_scaffold_murcko(mol)
print(f"Scaffold: {dm.to_smiles(scaffold)}")

# Scaffold frequency analysis
scaffolds = [dm.to_scaffold_murcko(m) for m in mols]
scaffold_smiles = [dm.to_smiles(s) for s in scaffolds]
counts = Counter(scaffold_smiles)
print(f"Top scaffolds: {counts.most_common(5)}")

# Scaffold-based train/test split (for ML)
scaffold_to_mols = {}
for mol, scaf in zip(mols, scaffold_smiles):
    scaffold_to_mols.setdefault(scaf, []).append(mol)
scaffolds_list = list(scaffold_to_mols.keys())
split_idx = int(0.8 * len(scaffolds_list))
train_mols = [m for s in scaffolds_list[:split_idx] for m in scaffold_to_mols[s]]
test_mols = [m for s in scaffolds_list[split_idx:] for m in scaffold_to_mols[s]]

Fragmentation:

mol = dm.to_mol("CC(=O)Oc1ccccc1C(=O)O")  # Aspirin

# BRICS (16 bond types, retrosynthetic)
brics_frags = dm.fragment.brics(mol)
print(f"BRICS fragments: {brics_frags}")  # Set of fragment SMILES with [1*] attachment points

# RECAP (11 bond types, combinatorial)
recap_frags = dm.fragment.recap(mol)

# MMPA (matched molecular pair analysis)
mmpa_frags = dm.fragment.mmpa_frag(mol)

6. 3D Conformers

import datamol as dm

mol = dm.to_mol("c1ccc(cc1)CCN")

# Generate conformers
mol_3d = dm.conformers.generate(
    mol,
    n_confs=50,           # Number to generate
    rms_cutoff=0.5,       # Filter similar (Angstroms)
    minimize_energy=True,  # UFF minimization
    method='ETKDGv3'      # Embedding method
)
print(f"Generated {mol_3d.GetNumConformers()} conformers")

# Access coordinates
conf = mol_3d.GetConformer(0)
positions = conf.GetPositions()  # Nx3 array
print(f"Atom positions shape: {positions.shape}")

# Cluster conformers by RMSD
clusters = dm.conformers.cluster(mol_3d, rms_cutoff=1.0)
centroids = dm.conformers.return_centroids(mol_3d, clusters)

# Solvent accessible surface area
sasa = dm.conformers.sasa(mol_3d, n_jobs=-1)
print(f"SASA values: {sasa[:3]}")

Key Concepts

Datamol vs RDKit Decision Guide

Use Datamol when...Use RDKit directly when...
Standard SMILES ↔ Mol conversionsCustom fingerprint definitions
Batch processing with parallelizationLow-level atom/bond manipulation
Quick descriptor computationSubstructure query optimization
File I/O (SDF, CSV, Excel, cloud)Reaction enumeration (large-scale)
Clustering & diversity selectionCustom force field parameters
Scaffold analysisAdvanced stereochemistry handling

Key Data Types

  • All molecules are native
    rdkit.Chem.Mol
    objects — fully compatible with RDKit functions
  • Fingerprints are numpy arrays (dense bit vectors)
  • DataFrames use pandas with a
    mol
    column containing Mol objects
  • Distance matrices use Tanimoto distance (0 = identical, 1 = completely different)

Parallelization

Functions supporting

n_jobs
parameter:
dm.read_sdf
,
dm.descriptors.batch_compute_many_descriptors
,
dm.cluster_mols
,
dm.pdist
,
dm.cdist
,
dm.conformers.sasa
. Use
n_jobs=-1
for all cores,
progress=True
for progress bars.

Common Workflows

1. Drug Discovery Pipeline: Load → Filter → Cluster → Visualize

import datamol as dm

# 1. Load and standardize
df = dm.read_sdf("compounds.sdf")
df['mol'] = df['mol'].apply(lambda m: dm.standardize_mol(m) if m else None)
df = df[df['mol'].notna()]
print(f"Loaded {len(df)} valid molecules")

# 2. Compute descriptors and filter by drug-likeness
desc_df = dm.descriptors.batch_compute_many_descriptors(
    df['mol'].tolist(), n_jobs=-1, progress=True
)
druglike = (desc_df['mw'] <= 500) & (desc_df['logp'] <= 5) & (desc_df['hbd'] <= 5) & (desc_df['hba'] <= 10)
filtered_df = df[druglike.values].reset_index(drop=True)
print(f"Drug-like compounds: {len(filtered_df)}")

# 3. Select diverse subset
diverse = dm.pick_diverse(filtered_df['mol'].tolist(), npick=100)

# 4. Visualize
dm.viz.to_image(diverse[:20], legends=[dm.to_smiles(m) for m in diverse[:20]],
                n_cols=5, mol_size=(300, 300), outfile="diverse_hits.png")

2. Virtual Screening: Query → Similarity → Rank

import datamol as dm
import numpy as np

# Query actives and screening library
actives = [dm.to_mol(s) for s in active_smiles]  # Known actives
library = [dm.to_mol(s) for s in library_smiles]  # Screening library

# Calculate distances (Tanimoto)
distances = dm.cdist(actives, library, n_jobs=-1)
min_distances = distances.min(axis=0)  # Best match to any active
similarities = 1 - min_distances

# Rank and select top hits
top_idx = np.argsort(similarities)[::-1][:100]
top_hits = [library[i] for i in top_idx]
top_scores = [similarities[i] for i in top_idx]
print(f"Top hit similarity: {top_scores[0]:.3f}")

# Visualize top hits
dm.viz.to_image(top_hits[:20],
    legends=[f"Sim: {s:.3f}" for s in top_scores[:20]],
    outfile="screening_hits.png")

3. SAR Analysis: Group by Scaffold → Compare Activities

import datamol as dm

# Group compounds by scaffold
scaffolds = [dm.to_scaffold_murcko(m) for m in mols]
scaffold_smiles = [dm.to_smiles(s) for s in scaffolds]

sar_df = pd.DataFrame({
    'mol': mols, 'scaffold': scaffold_smiles, 'activity': activities
})

# Analyze each scaffold series
for scaffold, group in sar_df.groupby('scaffold'):
    if len(group) >= 3:
        print(f"Scaffold: {scaffold} | N={len(group)} | "
              f"Activity: {group['activity'].min():.2f}–{group['activity'].max():.2f}")
        dm.viz.to_image(group['mol'].tolist(), align=True,
            legends=[f"Act: {a:.2f}" for a in group['activity']])

Key Parameters

FunctionParameterDefaultDescription
dm.to_fp
fp_type
'ecfp'
Fingerprint type: ecfp, maccs, topological, atompair
dm.to_fp
radius
2
Morgan radius (ecfp only); radius=2 ≈ ECFP4
dm.to_fp
n_bits
2048
Fingerprint length (ecfp, topological)
dm.cluster_mols
cutoff
0.2
Tanimoto distance threshold (0=identical, 1=different)
dm.pick_diverse
npick
requiredNumber of diverse molecules to select
dm.conformers.generate
n_confs
None
Number of conformers (None = auto)
dm.conformers.generate
rms_cutoff
None
RMSD filter threshold (Angstroms)
dm.conformers.generate
method
'ETKDGv3'
Embedding: ETKDGv3, ETKDGv2, ETKDG
dm.standardize_mol
disconnect_metals
False
Remove metal-ligand bonds
dm.read_sdf
sanitize
True
Apply molecule sanitization
dm.read_sdf
remove_hs
True
Remove explicit hydrogens
dm.viz.to_image
align
False
Align molecules by MCS
dm.viz.to_image
use_svg
False
Output SVG (True) or PNG (False)

Best Practices

  1. Always standardize molecules from external sources — call
    dm.standardize_mol()
    with
    disconnect_metals=True, normalize=True, reionize=True
    before any analysis. Different SMILES representations of the same molecule will produce different fingerprints
  2. Check for None after parsing
    dm.to_mol()
    returns
    None
    for invalid SMILES. Filter these before batch operations to avoid crashes
  3. Use parallel processing for datasets — pass
    n_jobs=-1, progress=True
    to batch operations. Sequential processing of 10,000+ molecules is unnecessarily slow
  4. Choose fingerprints by use case — ECFP (Morgan): general structural similarity; MACCS: fast, smaller space; Atom pairs: distance-sensitive. ECFP with radius=2, n_bits=2048 is the most common default
  5. Mind clustering scale limits — Butina clustering (
    dm.cluster_mols
    ) builds a full distance matrix. Use for ≤~1,000 molecules. For larger sets, use
    dm.pick_diverse()
    or hierarchical methods
  6. Use scaffold splitting for ML — random splits leak similar structures into train/test. Always use scaffold-based splitting for molecular property prediction models
  7. Leverage fsspec for cloud data — all I/O functions accept S3, GCS, and HTTP paths directly. Install
    s3fs
    or
    gcsfs
    for cloud support

Common Recipes

Recipe: Batch SMILES Validation and Standardization

When to use: Clean a list of SMILES strings before any downstream analysis.

import datamol as dm

smiles_list = ["CC(=O)Oc1ccccc1C(=O)O", "c1ccccc1", "invalid_smiles", "CC(N)C(=O)O"]
mols = [dm.to_mol(s) for s in smiles_list]
valid = [(s, m) for s, m in zip(smiles_list, mols) if m is not None]
standardized = [(s, dm.standardize_mol(m)) for s, m in valid]
print(f"Valid: {len(valid)}/{len(smiles_list)}")
for orig, mol in standardized:
    print(f"  {orig} → {dm.to_smiles(mol)}")

Recipe: Pairwise Similarity Matrix

When to use: Compare a small compound set against each other or a reference library.

import datamol as dm
import numpy as np

smiles = ["CC(=O)Oc1ccccc1C(=O)O", "c1ccc(cc1)C(=O)O", "CC(N)C(=O)O", "c1ccccc1"]
mols = [dm.to_mol(s) for s in smiles]
fps = [dm.to_fp(m) for m in mols]

# Pairwise Tanimoto similarity
n = len(fps)
sim_matrix = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        sim_matrix[i, j] = dm.similarity.tanimoto(fps[i], fps[j])
print(f"Similarity matrix shape: {sim_matrix.shape}")
print(f"Most similar pair: {np.unravel_index(np.argsort(sim_matrix.ravel())[-3], (n, n))}")

Troubleshooting

ProblemCauseSolution
dm.to_mol()
returns None
Invalid or non-canonical SMILESTry
dm.standardize_smiles()
first; check for kekulization issues
MemoryError during clusteringFull distance matrix for large setUse
dm.pick_diverse()
instead of
dm.cluster_mols
for >1000 molecules
Slow conformer generationToo many conformers or large moleculeReduce
n_confs
, increase
rms_cutoff
, or limit molecule size
Remote file access failsMissing fsspec backendInstall
s3fs
(AWS),
gcsfs
(GCP), or
adlfs
(Azure)
Descriptor computation failsMolecule has no conformerStandardize first; some 3D descriptors need
dm.conformers.generate()
dm.to_xlsx
missing images
openpyxl not installed
uv pip install openpyxl
Inconsistent fingerprintsDifferent SMILES for same moleculeStandardize all molecules before fingerprint computation
Scaffold extraction returns full moleculeNo ring system in moleculeMurcko scaffolds require at least one ring; acyclic molecules return themselves
Reaction product is NoneReactant doesn't match SMARTS patternVerify reactant matches reaction template; check atom mapping
Import error for
dm.viz
Missing visualization dependencies
uv pip install Pillow cairosvg

Related Skills

  • rdkit-cheminformatics — full RDKit API for advanced operations not covered by datamol's simplified interface
  • pubchem-compound-search — retrieve compound data by name, CID, or structure from PubChem
  • scikit-learn-machine-learning — ML model training using datamol-generated features
  • matplotlib-scientific-plotting — custom publication-quality molecular property plots

References