Computational-chemistry-agent-skills rdkit-repr

install
source · Clone the upstream repo
git clone https://github.com/jinzhezenggroup/computational-chemistry-agent-skills
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/jinzhezenggroup/computational-chemistry-agent-skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/molecular-representation/rdkit-repr" ~/.claude/skills/jinzhezenggroup-computational-chemistry-agent-skills-rdkit-repr && rm -rf "$T"
manifest: molecular-representation/rdkit-repr/SKILL.md

RDKit Molecular Featurization

This skill provides practical command patterns for RDKit descriptor and fingerprint extraction using the standardized CLI wrapper <skill_path>/scripts/rdkit_helper.py.

Key behaviors (important for Agents):

  • The script prints environment detection (Python/RDKit/NumPy/Pandas) by default.
  • Invalid SMILES are skipped and logged to *.skipped.csv (no crash).
  • Each run ends by printing absolute output paths like:
    • [RESULT] desc_csv=/abs/path.csv
    • [RESULT] fp_npy=/abs/path.npy
    • [RESULT] fp_csv=/abs/path.csv
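
Because the output paths arrive as plain `[RESULT] key=/abs/path` lines on stdout, an agent can collect them with a small parser. The following is a minimal sketch, assuming the line format is exactly as listed above; the function name `parse_result_paths` and the sample text are illustrative and not part of the skill.

```python
import re

# Matches lines such as "[RESULT] fp_npy=/abs/path.npy" (format taken from the list above).
RESULT_LINE = re.compile(r"^\[RESULT\]\s+(\w+)=(.+)$")


def parse_result_paths(stdout_text):
    """Return {key: path} for every [RESULT] line found in captured stdout."""
    paths = {}
    for line in stdout_text.splitlines():
        match = RESULT_LINE.match(line.strip())
        if match:
            key, path = match.groups()
            paths[key] = path.strip()
    return paths


# Hand-written text standing in for real script output (not an actual log).
sample_stdout = """environment banner line
[RESULT] desc_csv=/abs/path.csv
[RESULT] fp_npy=/abs/path.npy
"""
print(parse_result_paths(sample_stdout))
```

In practice the text would come from something like `subprocess.run(..., capture_output=True, text=True).stdout`; lines that do not start with `[RESULT]` are ignored.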

Quick Start

Check CLI help:

uv run <skill_path>/scripts/rdkit_helper.py --help

Check subcommand help:

uv run <skill_path>/scripts/rdkit_helper.py desc --help
uv run <skill_path>/scripts/rdkit_helper.py fp --help
uv run <skill_path>/scripts/rdkit_helper.py list-desc --help

Disable environment printing (optional):

uv run <skill_path>/scripts/rdkit_helper.py --no-env desc --smiles "CCO" --output out.csv

Core Tasks

1) Compute physicochemical descriptors → .csv

Single SMILES (default preset: physchem, 25 descriptors):

uv run <skill_path>/scripts/rdkit_helper.py desc \
    --smiles "CCO" \
    --output /tmp/CCO.desc.csv

From CSV (default SMILES column is smiles):

uv run <skill_path>/scripts/rdkit_helper.py desc \
    --file data.csv \
    --smiles-col smiles \
    --output data.desc.csv

From SMI:

uv run <skill_path>/scripts/rdkit_helper.py desc \
    --file molecules.smi \
    --output molecules.desc.csv
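
For .smi input, the Agent Checklist states that the first token of each line is read as the SMILES. To preview what would be picked up before running the command, a sketch like the one below may help. It assumes whitespace-separated tokens and skips blank lines; how the real script treats blank lines or headers is not documented here, and `first_tokens` is a hypothetical helper name.

```python
def first_tokens(smi_text):
    """Return the first whitespace-separated token of each non-empty line."""
    tokens = []
    for line in smi_text.splitlines():
        parts = line.split()
        if parts:  # blank lines are skipped here; the script's own handling is an assumption
            tokens.append(parts[0])
    return tokens


# Small hand-written .smi sample: SMILES, optional name, mixed separators.
sample = "CCO ethanol\nc1ccccc1\tbenzene\n\nCC(=O)O\n"
print(first_tokens(sample))
```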

Choose a descriptor preset:

# Lipinski drug-likeness (6 descriptors: MolWt, MolLogP, NumHDonors, ...)
uv run <skill_path>/scripts/rdkit_helper.py desc \
    --file data.csv --preset lipinski --output data.lipinski.csv

# Extended physicochemical (25 descriptors, default)
uv run <skill_path>/scripts/rdkit_helper.py desc \
    --file data.csv --preset physchem --output data.physchem.csv

# Topological / graph indices (56 descriptors: BalabanJ, BertzCT, Chi*, PEOE_VSA*, ...)
uv run <skill_path>/scripts/rdkit_helper.py desc \
    --file data.csv --preset topological --output data.topo.csv

# All RDKit descriptors (~200 descriptors)
uv run <skill_path>/scripts/rdkit_helper.py desc \
    --file data.csv --preset all --output data.all_desc.csv

Select specific descriptors (overrides --preset):

uv run <skill_path>/scripts/rdkit_helper.py desc \
    --file data.csv \
    --descriptors "MolWt,MolLogP,TPSA,NumHDonors,NumHAcceptors" \
    --output data.custom.csv

Skip merging the original CSV columns back in (output contains only smiles + descriptors):

uv run <skill_path>/scripts/rdkit_helper.py desc \
    --file data.csv --preset physchem --no-merge --output data.desc_only.csv

2) Compute molecular fingerprints → .npy or .csv

Available fingerprint types:

Type            Description                                               Default bits
morgan2         Morgan circular FP radius 2 (ECFP4-like), bit vector      2048
morgan3         Morgan circular FP radius 3 (ECFP6-like), bit vector      2048
morgan2_count   Morgan radius-2 count vector                              2048
rdkit           RDKit path-based FP, bit vector                           2048
maccs           MACCS structural keys (167-bit vector, --nbits ignored)   167
topological     Topological torsion FP (count vector, hashed to --nbits)  2048
atompair        Atom-pair FP (count vector, hashed to --nbits)            2048
layered         Layered substructure FP, bit vector                       2048
pattern         SMARTS pattern FP, bit vector                             2048

Single SMILES, output as NumPy array (.npy):

uv run <skill_path>/scripts/rdkit_helper.py fp \
    --smiles "CCO" \
    --type morgan2 \
    --output /tmp/CCO.morgan2.npy

From CSV, Morgan ECFP4 (2048 bits):

uv run <skill_path>/scripts/rdkit_helper.py fp \
    --file data.csv \
    --smiles-col smiles \
    --type morgan2 \
    --nbits 2048 \
    --output data.morgan2.npy

From SMI, MACCS keys (always 167 bits):

uv run <skill_path>/scripts/rdkit_helper.py fp \
    --file molecules.smi \
    --type maccs \
    --output molecules.maccs.npy

Output as CSV (smiles + bit_0 … bit_N-1 columns):

uv run <skill_path>/scripts/rdkit_helper.py fp \
    --file data.csv \
    --type rdkit \
    --nbits 1024 \
    --format csv \
    --output data.rdkfp.csv

Atom-pair fingerprint, 4096 bits:

uv run <skill_path>/scripts/rdkit_helper.py fp \
    --file data.csv \
    --type atompair \
    --nbits 4096 \
    --output data.atompair.npy

3) List available descriptors

List all descriptors and built-in presets:

uv run <skill_path>/scripts/rdkit_helper.py list-desc

List descriptors in a specific preset group:

uv run <skill_path>/scripts/rdkit_helper.py list-desc --group lipinski
uv run <skill_path>/scripts/rdkit_helper.py list-desc --group physchem
uv run <skill_path>/scripts/rdkit_helper.py list-desc --group topological
uv run <skill_path>/scripts/rdkit_helper.py list-desc --group all

Descriptor Presets Reference

Preset        Count  Typical Use
lipinski      6      Quick drug-likeness screening (Ro5 filter)
physchem      25     General ML features: MW, logP, TPSA, ring counts, charge stats, …
topological   56     Graph/topology indices: Balaban J, Kappa, Chi, PEOE_VSA, EState_VSA, …
all           ~200   Full RDKit descriptor set (includes fragment counts, MQN, etc.)

Output Format Notes

desc output (CSV):

  • Columns: smiles, then one column per descriptor.
  • When --file is a .csv and --no-merge is not set, the original CSV columns are appended.
  • Rows contain only valid SMILES (invalid ones are logged to *.skipped.csv).
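
As one example of consuming the descriptor CSV, the sketch below counts rule-of-five violations per row using only the standard library. It assumes the output contains columns named MolWt, MolLogP, NumHDonors and NumHAcceptors (descriptor names that appear in this document; the exact column set of the lipinski preset is not confirmed here). The sample rows are typed by hand for illustration and are not real script output.

```python
import csv
import io


def ro5_violations(row):
    """Count rule-of-five violations for one descriptor row (dict of strings)."""
    checks = [
        float(row["MolWt"]) > 500,         # molecular weight above 500
        float(row["MolLogP"]) > 5,         # logP above 5
        float(row["NumHDonors"]) > 5,      # more than 5 H-bond donors
        float(row["NumHAcceptors"]) > 10,  # more than 10 H-bond acceptors
    ]
    return sum(checks)


# Illustrative values only; a real file would come from the `desc` subcommand.
sample_csv = (
    "smiles,MolWt,MolLogP,NumHDonors,NumHAcceptors\n"
    "CCO,46.07,0.0,1,1\n"
    + "C" * 40 + ",563.1,15.0,0,0\n"
)

rows = list(csv.DictReader(io.StringIO(sample_csv)))
# Common convention: a molecule passes with at most one violation.
passing = [r["smiles"] for r in rows if ro5_violations(r) <= 1]
print(passing)
```

To use this on an actual output file, replace `io.StringIO(sample_csv)` with an open file handle for the path reported on the `[RESULT] desc_csv=` line.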

fp output:

  • .npy (default): NumPy array of shape (N_valid, nbits), dtype uint8 (bit) or int32 (count).
  • .csv: smiles column followed by bit_0 … bit_{nbits-1} columns.
  • MACCS keys always produce 167 bits regardless of --nbits.
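
The CSV layout (a smiles column followed by bit_i columns) can be read without NumPy. The sketch below loads such rows and computes a Tanimoto similarity between two bit vectors, which is one common use of bit fingerprints. The helper names and the four-bit sample are illustrative assumptions; real output has nbits columns.

```python
import csv
import io


def load_bit_rows(csv_text):
    """Return [(smiles, [bits...])] from CSV text with smiles + bit_i columns."""
    reader = csv.DictReader(io.StringIO(csv_text))
    bit_cols = [c for c in reader.fieldnames if c.startswith("bit_")]
    return [(row["smiles"], [int(row[c]) for c in bit_cols]) for row in reader]


def tanimoto(a, b):
    """Tanimoto similarity of two equal-length 0/1 vectors."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return both / either if either else 0.0


# Hand-written toy fingerprint rows, not real script output.
sample = """smiles,bit_0,bit_1,bit_2,bit_3
CCO,1,0,1,0
CCN,1,1,1,0
"""
rows = load_bit_rows(sample)
similarity = tanimoto(rows[0][1], rows[1][1])
print(round(similarity, 3))
```

This formula applies to bit vectors; count fingerprints such as morgan2_count would need a count-aware similarity instead.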

Agent Checklist

When using this skill on behalf of users:

  1. Confirm input format:
    • .csv requires a SMILES column (default smiles)
    • .smi uses the first token of each line as SMILES
  2. Quote SMILES containing special shell characters (brackets/parentheses):
    • Example: --smiles "[C@@H](O)(F)Cl"
  3. For CSV workflows, verify column names:
    • desc: --smiles-col
    • fp: --smiles-col
  4. Choose the right preset or fingerprint type for the downstream task:
    • Drug screening / Ro5: --preset lipinski
    • General ML featurization: --preset physchem or --type morgan2
    • Structural similarity search: --type morgan2 or --type rdkit
    • Substructure screening: --type maccs or --type pattern
  5. Watch for skipped SMILES:
    • Check *.skipped.csv and decide whether to fix or permanently drop them
  6. Always capture absolute output paths:
    • Look for [RESULT] ...=/abs/path in stdout
  7. If debugging is needed, enable full traceback:
    • RDKIT_HELPER_TRACE=1 uv run <skill_path>/scripts/rdkit_helper.py ...
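
When an agent launches the helper from Python, the quoting and traceback items above can be handled in one place. The sketch below builds an argument list and environment for `subprocess.run`; passing a list avoids shell quoting issues with SMILES such as `[C@@H](O)(F)Cl`. The function `build_command` is a hypothetical helper, and only flags shown in this document are used.

```python
import os
import shlex


def build_command(script_path, subcommand, options, trace=False):
    """Return (argv, env) for running the helper via `uv run`."""
    argv = ["uv", "run", script_path, subcommand]
    for flag, value in options.items():
        argv.append(flag)
        if value is not None:  # None marks a flag without a value, e.g. --no-merge
            argv.append(str(value))
    env = dict(os.environ)
    if trace:
        env["RDKIT_HELPER_TRACE"] = "1"  # enables full traceback, per the checklist
    return argv, env


argv, env = build_command(
    "<skill_path>/scripts/rdkit_helper.py",  # placeholder path, as in the document
    "desc",
    {"--smiles": "[C@@H](O)(F)Cl", "--output": "out.csv"},
    trace=True,
)
# shlex.join shows an equivalent, safely quoted shell command for logging.
print(shlex.join(argv))
```

The pair can then be passed on as `subprocess.run(argv, env=env, capture_output=True, text=True)`, with stdout scanned for the `[RESULT]` lines.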
