Computational-chemistry-agent-skills rdkit-repr
git clone https://github.com/jinzhezenggroup/computational-chemistry-agent-skills
T=$(mktemp -d) && git clone --depth=1 https://github.com/jinzhezenggroup/computational-chemistry-agent-skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/molecular-representation/rdkit-repr" ~/.claude/skills/jinzhezenggroup-computational-chemistry-agent-skills-rdkit-repr && rm -rf "$T"
RDKit Molecular Featurization
This skill provides practical command patterns for RDKit descriptor and fingerprint extraction using the standardized CLI wrapper `<skill_path>/scripts/rdkit_helper.py`.
Key behaviors (important for Agents):
- The script prints environment detection (Python/RDKit/NumPy/Pandas) by default.
- Bad/illegal SMILES are skipped and logged to `*.skipped.csv` (no crash).
- Each run ends by printing absolute output paths like:
[RESULT] desc_csv=/abs/path.csv
[RESULT] fp_npy=/abs/path.npy
[RESULT] fp_csv=/abs/path.csv
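For an agent that captures the script's stdout, the `[RESULT]` lines can be pulled out with a small parser. The following is a minimal sketch that assumes each result line has exactly the `[RESULT] key=/abs/path` shape shown above; the function name `parse_result_paths` is hypothetical and not part of the skill.

```python
import re

# Matches lines shaped like "[RESULT] desc_csv=/abs/path.csv" (shape taken from this document).
RESULT_LINE = re.compile(r"^\[RESULT\]\s+(\w+)=(.+)$")


def parse_result_paths(stdout: str) -> dict:
    """Collect key -> path pairs from the [RESULT] lines in captured stdout."""
    paths = {}
    for line in stdout.splitlines():
        match = RESULT_LINE.match(line.strip())
        if match:
            key, path = match.groups()
            paths[key] = path.strip()
    return paths


# Illustrative input only: the first line stands in for any other output the script prints.
sample = "some other output\n[RESULT] desc_csv=/abs/path.csv\n[RESULT] fp_npy=/abs/path.npy\n"
print(parse_result_paths(sample))
```

Lines that do not match the pattern are ignored, so environment-detection output does not need to be disabled first.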
Quick Start
Check CLI help:
uv run <skill_path>/scripts/rdkit_helper.py --help
Check subcommand help:
uv run <skill_path>/scripts/rdkit_helper.py desc --help
uv run <skill_path>/scripts/rdkit_helper.py fp --help
uv run <skill_path>/scripts/rdkit_helper.py list-desc --help
Disable environment printing (optional):
uv run <skill_path>/scripts/rdkit_helper.py --no-env desc --smiles "CCO" --output out.csv
Core Tasks
1) Compute physicochemical descriptors → .csv
Single SMILES (default preset: `physchem`, 25 descriptors):
uv run <skill_path>/scripts/rdkit_helper.py desc \
  --smiles "CCO" \
  --output /tmp/CCO.desc.csv
From CSV (default SMILES column is `smiles`):
uv run <skill_path>/scripts/rdkit_helper.py desc \
  --file data.csv \
  --smiles-col smiles \
  --output data.desc.csv
From SMI:
uv run <skill_path>/scripts/rdkit_helper.py desc \
  --file molecules.smi \
  --output molecules.desc.csv
Choose a descriptor preset:
# Lipinski drug-likeness (6 descriptors: MolWt, MolLogP, NumHDonors, ...)
uv run <skill_path>/scripts/rdkit_helper.py desc \
  --file data.csv --preset lipinski --output data.lipinski.csv

# Extended physicochemical (25 descriptors, default)
uv run <skill_path>/scripts/rdkit_helper.py desc \
  --file data.csv --preset physchem --output data.physchem.csv

# Topological / graph indices (56 descriptors: BalabanJ, BertzCT, Chi*, PEOE_VSA*, ...)
uv run <skill_path>/scripts/rdkit_helper.py desc \
  --file data.csv --preset topological --output data.topo.csv

# All RDKit descriptors (~200 descriptors)
uv run <skill_path>/scripts/rdkit_helper.py desc \
  --file data.csv --preset all --output data.all_desc.csv
Select specific descriptors (overrides `--preset`):
uv run <skill_path>/scripts/rdkit_helper.py desc \
  --file data.csv \
  --descriptors "MolWt,MolLogP,TPSA,NumHDonors,NumHAcceptors" \
  --output data.custom.csv
Suppress merging back original CSV columns (output only smiles + descriptors):
uv run <skill_path>/scripts/rdkit_helper.py desc \
  --file data.csv --preset physchem --no-merge --output data.desc_only.csv
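When these commands are launched from Python rather than typed into a shell, building an argument list avoids shell-quoting problems with SMILES strings. The sketch below only assembles the `desc` flags shown in this section. The helper `build_desc_command` is hypothetical, and its rule that exactly one of `smiles` or `file` must be given is an assumption of this sketch, not documented behavior of the script.

```python
import shlex

# Placeholder path exactly as written in this document; substitute the real skill path.
SCRIPT = "<skill_path>/scripts/rdkit_helper.py"


def build_desc_command(output, smiles=None, file=None, smiles_col=None,
                       preset=None, descriptors=None, no_merge=False):
    """Return an argv list for the `desc` subcommand (hypothetical helper)."""
    # Assumption of this sketch: a run takes either one SMILES or one input file.
    if (smiles is None) == (file is None):
        raise ValueError("pass exactly one of smiles= or file=")
    argv = ["uv", "run", SCRIPT, "desc"]
    argv += ["--smiles", smiles] if smiles is not None else ["--file", file]
    if smiles_col:
        argv += ["--smiles-col", smiles_col]
    if descriptors:
        # --descriptors overrides --preset, so the preset is not passed alongside it.
        argv += ["--descriptors", ",".join(descriptors)]
    elif preset:
        argv += ["--preset", preset]
    if no_merge:
        argv.append("--no-merge")
    argv += ["--output", output]
    return argv


argv = build_desc_command("/tmp/out.csv", smiles="[C@@H](O)(F)Cl", preset="lipinski")
# shlex.join renders a copy-pasteable shell line with the SMILES safely quoted.
print(shlex.join(argv))
```

The list can be passed directly to `subprocess.run(argv, capture_output=True, text=True)`, in which case no shell is involved and brackets or parentheses in the SMILES need no escaping.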
2) Compute molecular fingerprints → .npy or .csv
Available fingerprint types:
| Type | Description | Default bits |
|---|---|---|
| `morgan2` | Morgan circular FP, radius 2 (ECFP4-like), bit vector | 2048 |
| (see `fp --help`) | Morgan circular FP, radius 3 (ECFP6-like), bit vector | 2048 |
| (see `fp --help`) | Morgan radius-2 count vector | 2048 |
| `rdkit` | RDKit path-based FP, bit vector | 2048 |
| `maccs` | MACCS 167 structural keys, bit vector (`--nbits` ignored) | 167 |
| (see `fp --help`) | Topological torsion FP (count vector, hashed to `--nbits`) | 2048 |
| `atompair` | Atom-pair FP (count vector, hashed to `--nbits`) | 2048 |
| (see `fp --help`) | Layered substructure FP, bit vector | 2048 |
| `pattern` | SMARTS pattern FP, bit vector | 2048 |
Single SMILES, output as NumPy array (.npy):
uv run <skill_path>/scripts/rdkit_helper.py fp \
  --smiles "CCO" \
  --type morgan2 \
  --output /tmp/CCO.morgan2.npy
From CSV, Morgan ECFP4 (2048 bits):
uv run <skill_path>/scripts/rdkit_helper.py fp \
  --file data.csv \
  --smiles-col smiles \
  --type morgan2 \
  --nbits 2048 \
  --output data.morgan2.npy
From SMI, MACCS keys (always 167 bits):
uv run <skill_path>/scripts/rdkit_helper.py fp \
  --file molecules.smi \
  --type maccs \
  --output molecules.maccs.npy
Output as CSV (smiles + bit_0 … bit_N-1 columns):
uv run <skill_path>/scripts/rdkit_helper.py fp \
  --file data.csv \
  --type rdkit \
  --nbits 1024 \
  --format csv \
  --output data.rdkfp.csv
Atom-pair fingerprint, 4096 bits:
uv run <skill_path>/scripts/rdkit_helper.py fp \
  --file data.csv \
  --type atompair \
  --nbits 4096 \
  --output data.atompair.npy
3) List available descriptors
List all descriptors and built-in presets:
uv run <skill_path>/scripts/rdkit_helper.py list-desc
List descriptors in a specific preset group:
uv run <skill_path>/scripts/rdkit_helper.py list-desc --group lipinski
uv run <skill_path>/scripts/rdkit_helper.py list-desc --group physchem
uv run <skill_path>/scripts/rdkit_helper.py list-desc --group topological
uv run <skill_path>/scripts/rdkit_helper.py list-desc --group all
Descriptor Presets Reference
| Preset | Count | Typical Use |
|---|---|---|
| `lipinski` | 6 | Quick drug-likeness screening (Ro5 filter) |
| `physchem` | 25 | General ML features: MW, logP, TPSA, ring counts, charge stats, … |
| `topological` | 56 | Graph/topology indices: Balaban J, Kappa, Chi, PEOE_VSA, EState_VSA, … |
| `all` | ~200 | Full RDKit descriptor set (includes fragment counts, MQN, etc.) |
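As an illustration of the Ro5 screening use case, a descriptor CSV can be filtered with the standard library alone. This is a sketch under assumptions: it presumes the CSV has a `smiles` column plus columns named `MolWt`, `MolLogP`, `NumHDonors`, and `NumHAcceptors` (whether `NumHAcceptors` is part of the `lipinski` preset is not confirmed here; `list-desc --group lipinski` shows the actual list). The thresholds are the commonly cited rule-of-five limits, and the function names and sample rows are made up for the example.

```python
import csv
import io

# Commonly cited rule-of-five upper limits, keyed by the assumed column names.
RO5_LIMITS = {"MolWt": 500.0, "MolLogP": 5.0, "NumHDonors": 5.0, "NumHAcceptors": 10.0}


def count_ro5_violations(row):
    """Count how many descriptors in one CSV row exceed their Ro5 limit."""
    return sum(float(row[name]) > limit for name, limit in RO5_LIMITS.items())


def filter_ro5(handle, max_violations=1):
    """Return the SMILES of rows with at most `max_violations` Ro5 violations."""
    reader = csv.DictReader(handle)
    return [row["smiles"] for row in reader
            if count_ro5_violations(row) <= max_violations]


# Made-up rows for illustration only; these are not output from the tool.
sample = io.StringIO(
    "smiles,MolWt,MolLogP,NumHDonors,NumHAcceptors\n"
    "mol_A,300.0,2.5,1,4\n"
    "mol_B,650.0,6.2,6,11\n"
)
print(filter_ro5(sample))
```

Allowing one violation by default follows the usual reading of the rule; pass `max_violations=0` for a stricter screen.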
Output Format Notes
`desc` output (CSV):
- Columns: `smiles`, then one column per descriptor.
- When `--file` is a `.csv` and `--no-merge` is not set, original CSV columns are appended.
- Rows only contain valid SMILES (invalid ones are logged to `*.skipped.csv`).
`fp` output:
- `.npy` (default): NumPy array of shape `(N_valid, nbits)`, dtype `uint8` (bit) or `int32` (count).
- `.csv`: `smiles` column followed by `bit_0` … `bit_{nbits-1}` columns.
- MACCS keys always produce 167 bits regardless of `--nbits`.
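To show how the CSV fingerprint layout might be consumed downstream, the sketch below reads the `smiles` and `bit_*` columns and computes a Tanimoto similarity between two molecules. It assumes the cells hold numeric values and treats any non-zero entry as an "on" bit, which discards count information for count-type fingerprints. The function names and the 4-bit sample are hypothetical.

```python
import csv
import io


def read_fp_csv(handle):
    """Return a list of (smiles, set of on-bit indices) from a fingerprint CSV."""
    reader = csv.DictReader(handle)
    bit_cols = [c for c in reader.fieldnames if c.startswith("bit_")]
    rows = []
    for row in reader:
        # Column "bit_17" maps to index 17; non-zero cells count as set bits.
        on_bits = {int(c[len("bit_"):]) for c in bit_cols if float(row[c]) != 0}
        rows.append((row["smiles"], on_bits))
    return rows


def tanimoto(a, b):
    """Tanimoto similarity of two on-bit sets: |a & b| / |a | b|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Tiny made-up 4-bit example; real files have nbits columns (167 for MACCS).
sample = io.StringIO("smiles,bit_0,bit_1,bit_2,bit_3\nCCO,1,0,1,0\nCCN,1,1,0,0\n")
(_, fp_a), (_, fp_b) = read_fp_csv(sample)
print(tanimoto(fp_a, fp_b))
```

For large datasets the `.npy` output is the more compact choice; the CSV route is mainly convenient when the SMILES must stay attached to each row.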
Agent Checklist
When using this skill for users:
- Confirm input format:
  - `.csv` requires a SMILES column (default `smiles`)
  - `.smi` uses the first token of each line as SMILES
- Quote SMILES containing special shell characters (brackets/parentheses):
  - Example: `--smiles "[C@@H](O)(F)Cl"`
- For CSV workflows, verify column names:
  - `desc`: `--smiles-col`
  - `fp`: `--smiles-col`
- Choose the right preset or fingerprint type for the downstream task:
  - Drug screening / Ro5: `--preset lipinski`
  - General ML featurization: `--preset physchem` or `--type morgan2`
  - Structural similarity search: `--type morgan2` or `--type rdkit`
  - Substructure matching: `--type maccs` or `--type pattern`
- Watch for skipped SMILES:
  - Check `*.skipped.csv` and decide whether to fix or permanently drop them
- Always capture absolute output paths:
  - Look for `[RESULT] ...=/abs/path` in stdout
- If debugging is needed, enable full traceback:
  - `RDKIT_HELPER_TRACE=1 uv run <skill_path>/scripts/rdkit_helper.py ...`
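The input-format items of the checklist can be pre-checked before invoking the script. The sketch below is a hypothetical pre-flight helper, not part of the skill: it tests whether a CSV header contains the SMILES column and extracts the first whitespace-separated token from each non-empty `.smi` line. How the script itself treats headers or comment lines in `.smi` files is not covered here.

```python
import csv
import io


def csv_has_column(handle, smiles_col="smiles"):
    """Return True if the CSV header row contains the given column name."""
    header = next(csv.reader(handle), [])
    return smiles_col in header


def smiles_from_smi(handle):
    """Yield the first whitespace-separated token of each non-empty line."""
    for line in handle:
        tokens = line.split()
        if tokens:
            yield tokens[0]


# In-memory stand-ins for data.csv and molecules.smi.
print(csv_has_column(io.StringIO("id,smiles\n1,CCO\n")))
print(list(smiles_from_smi(io.StringIO("CCO ethanol\n\nc1ccccc1\tbenzene\n"))))
```

If `csv_has_column` returns False, pass the real column name through `--smiles-col` rather than renaming the file's header.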
References
- RDKit documentation: https://www.rdkit.org/docs/
- RDKit descriptor list: https://www.rdkit.org/docs/GettingStartedInPython.html#list-of-available-descriptors
- RDKit fingerprint guide: https://www.rdkit.org/docs/GettingStartedInPython.html#fingerprinting-and-molecular-similarity