Medical-research-skills biopython-advanced

Advanced Biopython modules for motifs, population genetics, sequence utilities, restriction analysis, clustering, and GenomeDiagram visualization; use when you need extended bioinformatics analysis beyond basic sequence I/O and alignment.

install

source · Clone the upstream repo

git clone https://github.com/aipoch/medical-research-skills

Claude Code · Install into ~/.claude/skills/

T=$(mktemp -d) && git clone --depth=1 https://github.com/aipoch/medical-research-skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/scientific-skills/Data Analysis/biopython-advanced" ~/.claude/skills/aipoch-medical-research-skills-biopython-advanced && rm -rf "$T"

manifest: scientific-skills/Data Analysis/biopython-advanced/SKILL.md

source content

Source: https://github.com/aipoch/medical-research-skills

biopython-advanced

When to Use

You need motif discovery/statistics (e.g., PWM/consensus, motif counts across multiple sequences).
You want restriction enzyme site analysis (e.g., find cut sites for specific enzymes in a DNA sequence).
You need codon usage / sequence utility calculations (e.g., codon frequency from CDS, GC content, basic sequence stats).
You are working with population genetics (PopGen) utilities for advanced analyses.
You need advanced visualization such as GenomeDiagram-style plots for genomic features.

Key Features

Motif analysis using Biopython’s
```
Bio.motifs
```
(counts, consensus, simple statistics).
Restriction analysis using
```
Bio.Restriction
```
(enzyme lookup, cut site detection).
Sequence utilities via
```
Bio.SeqUtils
```
(codon usage and related helpers).
Access to additional advanced tools such as CodonTable, SeqFeature, and IUPACData when needed.
Standardized workflow conventions:
- Write configuration to
```
config/task_config.json
```
  as an intermediate artifact.
- Run tasks uniformly via
```
python scripts/<task_name>.py
```
  .
- Avoid stacking many CLI flags; keep parameters in config files.
- Always use
```
encoding="utf-8"
```
  for file I/O; JSON output uses
```
ensure_ascii=False
```
  .

Dependencies

Required:

biopython (>=1.80)
numpy (>=1.21)

Optional (for reporting/plotting):

reportlab (>=3.6)
matplotlib (>=3.5)

Example Usage

The following examples are complete runnable scripts that follow the conventions:

configuration stored in
```
config/task_config.json
```
invoked as
```
python scripts/<task_name>.py
```
explicit UTF-8 encoding and
```
ensure_ascii=False
```
for JSON output

1) Motif Statistics

config/task_config.json

{
  "task": "motif_stats",
  "sequences": ["ATGCATGCATGC", "ATGCGTGCATGC", "ATGCATGTATGC"]
}

scripts/motif_stats.py

import json
from Bio import motifs
from Bio.Seq import Seq

def main():
    with open("config/task_config.json", "r", encoding="utf-8") as f:
        cfg = json.load(f)

    seqs = [Seq(s) for s in cfg["sequences"]]
    m = motifs.create(seqs)

    result = {
        "alphabet": str(m.alphabet),
        "length": m.length,
        "counts": {k: dict(v) for k, v in m.counts.items()},
        "consensus": str(m.consensus),
        "degenerate_consensus": str(m.degenerate_consensus),
    }

    with open("outputs/motif_stats.json", "w", encoding="utf-8") as f:
        json.dump(result, f, ensure_ascii=False, indent=2)

if __name__ == "__main__":
    main()

Run:

python scripts/motif_stats.py

2) Restriction Enzyme Cleavage Sites

config/task_config.json

{
  "task": "restriction_sites",
  "sequence": "GAATTCGCGGAATTC",
  "enzymes": ["EcoRI", "BamHI"]
}

scripts/restriction_sites.py

import json
from Bio.Seq import Seq
from Bio.Restriction import RestrictionBatch

def main():
    with open("config/task_config.json", "r", encoding="utf-8") as f:
        cfg = json.load(f)

    seq = Seq(cfg["sequence"])
    batch = RestrictionBatch(cfg["enzymes"])
    analysis = batch.search(seq)

    # Convert enzyme keys to strings for JSON serialization
    result = {str(enzyme): positions for enzyme, positions in analysis.items()}

    with open("outputs/restriction_sites.json", "w", encoding="utf-8") as f:
        json.dump(result, f, ensure_ascii=False, indent=2)

if __name__ == "__main__":
    main()

Run:

python scripts/restriction_sites.py

3) Codon Usage Frequency (CDS)

config/task_config.json

{
  "task": "codon_usage",
  "cds": "ATGGCTGCTGCTGCTTAA"
}

scripts/codon_usage.py

import json
from collections import Counter

def main():
    with open("config/task_config.json", "r", encoding="utf-8") as f:
        cfg = json.load(f)

    cds = cfg["cds"].upper().replace(" ", "").replace("\n", "")
    codons = [cds[i:i+3] for i in range(0, len(cds) - (len(cds) % 3), 3)]
    counts = Counter(codons)
    total = sum(counts.values()) or 1

    result = {
        "total_codons": total,
        "codon_counts": dict(sorted(counts.items())),
        "codon_frequencies": {k: v / total for k, v in sorted(counts.items())},
        "note": "This example computes raw codon frequencies from the provided CDS. Validate CDS frame and stop codons for your use case."
    }

    with open("outputs/codon_usage.json", "w", encoding="utf-8") as f:
        json.dump(result, f, ensure_ascii=False, indent=2)

if __name__ == "__main__":
    main()

Run:

python scripts/codon_usage.py

Implementation Details

Configuration-first execution
- All task parameters are stored in
```
config/task_config.json
```
  to keep CLI invocation stable and reproducible.
- Scripts read the config as the single source of truth and write results to
```
outputs/*.json
```
  .
Motif statistics (
```
Bio.motifs
```
)
- A motif is created from aligned sequences of equal length.
- Outputs typically include:
  - ```
  counts
```
  : per-position nucleotide counts
- ```
consensus
```
    and
```
degenerate_consensus
```
    : derived consensus sequences
- If sequences differ in length, you must align/trim/pad them before motif creation.
Restriction analysis (
```
Bio.Restriction
```
)
- ```
RestrictionBatch(enzymes).search(seq)
```
  returns cut positions per enzyme.
- Enzyme objects are converted to strings for JSON serialization.
Codon usage
- The example computes codon frequencies by splitting the CDS into triplets in-frame.
- Practical considerations:
  - Ensure the CDS length is a multiple of 3 (or decide how to handle remainder bases).
  - Confirm the correct reading frame and whether to include terminal stop codons.
  - For organism-specific codon usage tables, integrate
```
Bio.Data.CodonTable
```
    as needed.
I/O requirements
- Always open files with
```
encoding="utf-8"
```
  .
- Use
```
json.dump(..., ensure_ascii=False)
```
  to preserve non-ASCII characters in outputs.
Further reference
- See
```
references/advanced.md
```
  for additional notes and module coverage (motifs/PopGen/SeqUtils/Restriction/Cluster, GenomeDiagram, CodonTable/SeqFeature/IUPACData).