Computational-chemistry-agent-skills rdkit-repr

install

source · Clone the upstream repo

git clone https://github.com/jinzhezenggroup/computational-chemistry-agent-skills

Claude Code · Install into ~/.claude/skills/

T=$(mktemp -d) && git clone --depth=1 https://github.com/jinzhezenggroup/computational-chemistry-agent-skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/molecular-representation/rdkit-repr" ~/.claude/skills/jinzhezenggroup-computational-chemistry-agent-skills-rdkit-repr && rm -rf "$T"

manifest: molecular-representation/rdkit-repr/SKILL.md

source content

RDKit Molecular Featurization

This skill provides practical command patterns for RDKit descriptor and fingerprint extraction using the standardized CLI wrapper:

<skill_path>/scripts/rdkit_helper.py

Key behaviors (important for Agents):

The script prints environment detection (Python/RDKit/NumPy/Pandas) by default.
Bad/illegal SMILES are skipped and logged to `*.skipped.csv` (no crash).
Each run ends by printing absolute output paths like:

```
[RESULT] desc_csv=/abs/path.csv
[RESULT] fp_npy=/abs/path.npy
[RESULT] fp_csv=/abs/path.csv
```
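
For agents that need these paths programmatically, the following is a minimal sketch of parsing the `[RESULT]` lines from captured stdout. It assumes each line has exactly the `[RESULT] key=/abs/path` shape shown above; the function name `parse_result_paths` is hypothetical and not part of the helper.

```python
import re

# Matches lines such as "[RESULT] desc_csv=/abs/path.csv" (format assumed from the examples above).
RESULT_LINE = re.compile(r"^\[RESULT\]\s+(\w+)=(.+)$")


def parse_result_paths(stdout_text):
    """Collect key -> path pairs from the helper's stdout, ignoring all other lines."""
    results = {}
    for line in stdout_text.splitlines():
        match = RESULT_LINE.match(line.strip())
        if match:
            results[match.group(1)] = match.group(2).strip()
    return results


# Environment-detection lines and other output are skipped because they do not match.
sample_stdout = "Python detected\n[RESULT] desc_csv=/abs/path.csv\n"
print(parse_result_paths(sample_stdout))
```

Keying on the text before `=` lets the caller tell a descriptor CSV from a fingerprint array without inspecting file extensions.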

Quick Start

Check CLI help:

uv run <skill_path>/scripts/rdkit_helper.py --help

Check subcommand help:

uv run <skill_path>/scripts/rdkit_helper.py desc --help
uv run <skill_path>/scripts/rdkit_helper.py fp --help
uv run <skill_path>/scripts/rdkit_helper.py list-desc --help

Disable environment printing (optional):

uv run <skill_path>/scripts/rdkit_helper.py --no-env desc --smiles "CCO" --output out.csv

Core Tasks

1) Compute physicochemical descriptors → .csv

Single SMILES (default preset: `physchem`, 25 descriptors):

uv run <skill_path>/scripts/rdkit_helper.py desc \
    --smiles "CCO" \
    --output /tmp/CCO.desc.csv

From CSV (default SMILES column is `smiles`):

uv run <skill_path>/scripts/rdkit_helper.py desc \
    --file data.csv \
    --smiles-col smiles \
    --output data.desc.csv

From SMI:

uv run <skill_path>/scripts/rdkit_helper.py desc \
    --file molecules.smi \
    --output molecules.desc.csv
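
This skill states that `.smi` input uses the first token of each line as the SMILES. A rough sketch for previewing what would be read from such a file is below. Splitting on whitespace and skipping blank lines are assumptions about the helper's behavior, and `smiles_from_smi_text` is a hypothetical name.

```python
def smiles_from_smi_text(text):
    """Return the first whitespace-separated token of each non-empty line."""
    smiles = []
    for line in text.splitlines():
        parts = line.split()
        if parts:  # assumption: blank lines carry no molecule
            smiles.append(parts[0])
    return smiles


# Each line may carry a name or ID after the SMILES; only the first token is kept.
sample_smi = "CCO ethanol\n\nc1ccccc1\tbenzene\n"
print(smiles_from_smi_text(sample_smi))  # ['CCO', 'c1ccccc1']
```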

Choose a descriptor preset:

# Lipinski drug-likeness (6 descriptors: MolWt, MolLogP, NumHDonors, ...)
uv run <skill_path>/scripts/rdkit_helper.py desc \
    --file data.csv --preset lipinski --output data.lipinski.csv

# Extended physicochemical (25 descriptors, default)
uv run <skill_path>/scripts/rdkit_helper.py desc \
    --file data.csv --preset physchem --output data.physchem.csv

# Topological / graph indices (56 descriptors: BalabanJ, BertzCT, Chi*, PEOE_VSA*, ...)
uv run <skill_path>/scripts/rdkit_helper.py desc \
    --file data.csv --preset topological --output data.topo.csv

# All RDKit descriptors (~200 descriptors)
uv run <skill_path>/scripts/rdkit_helper.py desc \
    --file data.csv --preset all --output data.all_desc.csv

Select specific descriptors (overrides `--preset`):

uv run <skill_path>/scripts/rdkit_helper.py desc \
    --file data.csv \
    --descriptors "MolWt,MolLogP,TPSA,NumHDonors,NumHAcceptors" \
    --output data.custom.csv

Suppress merging back original CSV columns (output only smiles + descriptors):

uv run <skill_path>/scripts/rdkit_helper.py desc \
    --file data.csv --preset physchem --no-merge --output data.desc_only.csv

2) Compute molecular fingerprints → .npy or .csv

Available fingerprint types:

| Type | Description | Default bits |
| --- | --- | --- |
| `morgan2` | Morgan circular FP radius 2 (ECFP4-like), bit vector | 2048 |
| `morgan3` | Morgan circular FP radius 3 (ECFP6-like), bit vector | 2048 |
| `morgan2_count` | Morgan radius-2 count vector | 2048 |
| `rdkit` | RDKit path-based FP, bit vector | 2048 |
| `maccs` | MACCS 167 structural keys (bit vector, `--nbits` ignored) | 167 |
| `topological` | Topological torsion FP (count vector, hashed to `--nbits`) | 2048 |
| `atompair` | Atom-pair FP (count vector, hashed to `--nbits`) | 2048 |
| `layered` | Layered substructure FP, bit vector | 2048 |
| `pattern` | SMARTS pattern FP, bit vector | 2048 |

Single SMILES, output as NumPy array (.npy):

uv run <skill_path>/scripts/rdkit_helper.py fp \
    --smiles "CCO" \
    --type morgan2 \
    --output /tmp/CCO.morgan2.npy

From CSV, Morgan ECFP4 (2048 bits):

uv run <skill_path>/scripts/rdkit_helper.py fp \
    --file data.csv \
    --smiles-col smiles \
    --type morgan2 \
    --nbits 2048 \
    --output data.morgan2.npy

From SMI, MACCS keys (always 167 bits):

uv run <skill_path>/scripts/rdkit_helper.py fp \
    --file molecules.smi \
    --type maccs \
    --output molecules.maccs.npy

Output as CSV (smiles + bit_0 … bit_N-1 columns):

uv run <skill_path>/scripts/rdkit_helper.py fp \
    --file data.csv \
    --type rdkit \
    --nbits 1024 \
    --format csv \
    --output data.rdkfp.csv

Atom-pair fingerprint, 4096 bits:

uv run <skill_path>/scripts/rdkit_helper.py fp \
    --file data.csv \
    --type atompair \
    --nbits 4096 \
    --output data.atompair.npy

3) List available descriptors

List all descriptors and built-in presets:

uv run <skill_path>/scripts/rdkit_helper.py list-desc

List descriptors in a specific preset group:

uv run <skill_path>/scripts/rdkit_helper.py list-desc --group lipinski
uv run <skill_path>/scripts/rdkit_helper.py list-desc --group physchem
uv run <skill_path>/scripts/rdkit_helper.py list-desc --group topological
uv run <skill_path>/scripts/rdkit_helper.py list-desc --group all

Descriptor Presets Reference

| Preset | Count | Typical Use |
| --- | --- | --- |
| `lipinski` | 6 | Quick drug-likeness screening (Ro5 filter) |
| `physchem` | 25 | General ML features: MW, logP, TPSA, ring counts, charge stats, … |
| `topological` | 56 | Graph/topology indices: Balaban J, Kappa, Chi, PEOE_VSA, EState_VSA, … |
| `all` | ~200 | Full RDKit descriptor set (includes fragment counts, MQN, etc.) |

Output Format Notes

`desc` output (CSV):

- Columns: `smiles`, then one column per descriptor.
- When `--file` is a `.csv` and `--no-merge` is not set, original CSV columns are appended.
- Rows only contain valid SMILES (invalid ones are logged to `*.skipped.csv`).
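
Because merged output mixes descriptor columns with the original CSV columns, downstream code may want to pick descriptors by name. The sketch below shows one way to do that with the standard library. The `assay_id` column and the numeric values are placeholders rather than real helper output, and `load_descriptor_rows` is a hypothetical name.

```python
import csv
import io


def load_descriptor_rows(csv_text, descriptor_names):
    """Return (smiles, {descriptor: float}) pairs, ignoring any merged-in columns."""
    rows = []
    for record in csv.DictReader(io.StringIO(csv_text)):
        # Assumption: descriptor cells hold text that float() can parse.
        values = {name: float(record[name]) for name in descriptor_names}
        rows.append((record["smiles"], values))
    return rows


# Placeholder content shaped like a merged `desc` output for --descriptors "MolWt,TPSA".
sample_csv = "smiles,MolWt,TPSA,assay_id\nCCO,1.0,2.0,A17\n"
print(load_descriptor_rows(sample_csv, ["MolWt", "TPSA"]))
```

For a file on disk, the same reader can be fed an open file handle instead of `io.StringIO`.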

`fp` output:

- `.npy` (default): NumPy array of shape `(N_valid, nbits)`, dtype `uint8` (bit) or `int32` (count).
- `.csv`: `smiles` column followed by `bit_0` … `bit_{nbits-1}` columns.
- MACCS keys always produce 167 bits regardless of `--nbits`.
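
When the CSV form is used, the fingerprint can be rebuilt as per-molecule integer vectors by ordering the `bit_*` columns by index. The following is a minimal standard-library sketch. It assumes the cells are plain integers, uses a 4-column placeholder instead of a realistic width, and `load_fingerprint_csv` is a hypothetical name.

```python
import csv
import io


def load_fingerprint_csv(csv_text):
    """Return (smiles_list, vectors) with bit columns ordered by their numeric index."""
    reader = csv.DictReader(io.StringIO(csv_text))
    bit_columns = sorted(
        (name for name in reader.fieldnames if name.startswith("bit_")),
        key=lambda name: int(name[len("bit_"):]),
    )
    smiles, vectors = [], []
    for record in reader:
        smiles.append(record["smiles"])
        # Assumption: each cell is an integer literal such as "0", "1", or a count.
        vectors.append([int(record[name]) for name in bit_columns])
    return smiles, vectors


# Placeholder rows; real output has nbits columns (167 for MACCS).
sample_fp_csv = "smiles,bit_0,bit_1,bit_2,bit_3\nCCO,0,1,0,1\n"
names, fps = load_fingerprint_csv(sample_fp_csv)
print(names, fps, len(fps[0]))
```

Checking `len(fps[0])` against the expected width is a cheap way to confirm that the `--nbits` value (or the fixed MACCS length) matches what the downstream model expects.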

Agent Checklist

When using this skill for users:

1. Confirm input format:
   - `.csv` requires a SMILES column (default `smiles`)
   - `.smi` uses the first token of each line as SMILES
2. Quote SMILES containing special shell characters (brackets/parentheses):
   - Example: `--smiles "[C@@H](O)(F)Cl"`
3. For CSV workflows, verify column names:
   - `desc`: `--smiles-col`
   - `fp`: `--smiles-col`
4. Choose the right preset or fingerprint type for the downstream task:
   - Drug screening / Ro5: `--preset lipinski`
   - General ML featurization: `--preset physchem` or `--type morgan2`
   - Structural similarity search: `--type morgan2` or `--type rdkit`
   - Substructure matching: `--type maccs` or `--type pattern`
5. Watch for skipped SMILES:
   - Check `*.skipped.csv` and decide whether to fix or permanently drop them
6. Always capture absolute output paths:
   - Look for `[RESULT] ...=/abs/path` in stdout
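
The skipped-SMILES check from the list above can be automated roughly as follows. This sketch only relies on the documented `*.skipped.csv` suffix. The demo file name, its column names, and the assumption that the first row is a header are all hypothetical, as is the function name `count_skipped`.

```python
import csv
import tempfile
from pathlib import Path


def count_skipped(output_dir):
    """Map each *.skipped.csv file name in output_dir to its number of data rows."""
    counts = {}
    for path in sorted(Path(output_dir).glob("*.skipped.csv")):
        with path.open(newline="") as handle:
            rows = list(csv.reader(handle))
        counts[path.name] = max(len(rows) - 1, 0)  # assumption: first row is a header
    return counts


# Demo against a stand-in file; real files are written by the helper next to its output.
with tempfile.TemporaryDirectory() as tmp:
    demo_file = Path(tmp) / "data.desc.skipped.csv"  # hypothetical name
    demo_file.write_text("smiles,reason\nC1CC,placeholder reason\n")
    demo_counts = count_skipped(tmp)
print(demo_counts)
```

A non-zero count is a signal to inspect the file before training on the output, since row order in the results no longer lines up one-to-one with the input.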

If debugging is needed, enable full traceback:

RDKIT_HELPER_TRACE=1 uv run <skill_path>/scripts/rdkit_helper.py ...
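
When the helper is launched from Python rather than a shell, passing arguments as a list avoids the quoting problem entirely, and the trace variable can be set through the child environment. The sketch below only builds the command and does not run it. The relative script path stands in for `<skill_path>/scripts/rdkit_helper.py`, and `build_desc_command` is a hypothetical name.

```python
import os
import shlex


def build_desc_command(script_path, smiles, output_path, trace=False):
    """Return (argv, env) for a `desc` run; flags are the ones shown in this document."""
    argv = ["uv", "run", script_path, "desc", "--smiles", smiles, "--output", output_path]
    env = dict(os.environ)
    if trace:
        env["RDKIT_HELPER_TRACE"] = "1"  # enables the full traceback described above
    return argv, env


argv, env = build_desc_command(
    "scripts/rdkit_helper.py",  # placeholder for <skill_path>/scripts/rdkit_helper.py
    "[C@@H](O)(F)Cl",
    "/tmp/out.csv",
    trace=True,
)
# shlex.join shows the shell-safe form for logging; brackets and parentheses get quoted.
print(shlex.join(argv))
# To execute: subprocess.run(argv, env=env, capture_output=True, text=True)
```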

References

RDKit documentation: https://www.rdkit.org/docs/
RDKit descriptor list: https://www.rdkit.org/docs/GettingStartedInPython.html#list-of-available-descriptors
RDKit fingerprint guide: https://www.rdkit.org/docs/GettingStartedInPython.html#fingerprinting-and-molecular-similarity