BioSkills bio-clip-seq-clip-preprocessing

Preprocess CLIP-seq data including adapter trimming, UMI extraction, and PCR duplicate removal. Use when preparing raw CLIP, iCLIP, or eCLIP reads for peak calling.

install
source · Clone the upstream repo
git clone https://github.com/GPTomics/bioSkills
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/GPTomics/bioSkills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/clip-seq/clip-preprocessing" ~/.claude/skills/gptomics-bioskills-bio-clip-seq-clip-preprocessing && rm -rf "$T"
manifest: clip-seq/clip-preprocessing/SKILL.md
source content

Version Compatibility

Reference examples tested with: cutadapt 4.4+, pysam 0.22+

Before using code patterns, verify installed versions match. If versions differ:

  • Python:
    pip show <package>
    then
    help(module.function)
    to check signatures
  • CLI:
    <tool> --version
    then
    <tool> --help
    to confirm flags

If code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.

CLIP-seq Preprocessing

"Preprocess my CLIP-seq reads" → Extract UMIs, trim adapters, and remove PCR duplicates from raw CLIP/iCLIP/eCLIP reads to prepare for alignment and peak calling.

  • CLI:
    umi_tools extract
    for UMI handling,
    cutadapt
    for adapter trimming

UMI Extraction (eCLIP/iCLIP)

Goal: Extract UMI barcodes from CLIP-seq reads and append them to read names for downstream deduplication.

Approach: Run umi_tools extract with a barcode pattern matching the UMI length and position in the read.

# Extract UMI from read 1
umi_tools extract \
    --stdin=reads_R1.fastq.gz \
    --read2-in=reads_R2.fastq.gz \
    --bc-pattern=NNNNNNNNNN \
    --stdout=R1_umi.fastq.gz \
    --read2-out=R2_umi.fastq.gz

# bc-pattern: UMI barcode pattern
# N = UMI base
# For eCLIP: typically 10-nt UMI in read 1

Adapter Trimming

# Trim adapters after UMI extraction
cutadapt \
    -a AGATCGGAAGAGCACACGTCT \
    -A AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT \
    -m 18 \
    -o trimmed_R1.fastq.gz \
    -p trimmed_R2.fastq.gz \
    R1_umi.fastq.gz R2_umi.fastq.gz

Two-Pass Trimming (eCLIP)

Goal: Remove inline adapters from eCLIP reads that appear at both 3' and 5' ends.

Approach: Run two sequential cutadapt passes: first trim the 3' adapter, then trim any remaining 5' adapter from read-through events.

# eCLIP protocol has inline adapters
# First pass: trim 3' adapter
cutadapt -a AGATCGGAAGAGC -m 18 -o pass1.fq.gz input.fq.gz

# Second pass: trim 5' adapter (read-through)
cutadapt -g AGATCGGAAGAGC -m 18 -o pass2.fq.gz pass1.fq.gz

PCR Duplicate Removal

# After alignment, deduplicate using UMIs
umi_tools dedup \
    --stdin=aligned.bam \
    --stdout=deduped.bam \
    --paired \
    --method=unique

# Methods:
# unique: Exact UMI match
# cluster: Allow UMI mismatches (default)
# adjacency: Network-based clustering

Python Preprocessing

from umi_tools import UMIClusterer
import pysam

def count_umis_per_position(bam_path):
    '''Count unique UMIs at each genomic position'''
    from collections import defaultdict

    position_umis = defaultdict(set)

    with pysam.AlignmentFile(bam_path, 'rb') as bam:
        for read in bam:
            if read.is_unmapped:
                continue

            # Extract UMI from read name (added by umi_tools extract)
            umi = read.query_name.split('_')[-1]
            pos = (read.reference_name, read.reference_start)
            position_umis[pos].add(umi)

    return {pos: len(umis) for pos, umis in position_umis.items()}

Quality Control

def clip_qc(bam_path):
    '''CLIP-seq specific QC metrics'''
    import pysam

    total = 0
    unique_positions = set()
    read_lengths = []

    with pysam.AlignmentFile(bam_path, 'rb') as bam:
        for read in bam:
            if read.is_unmapped:
                continue
            total += 1
            unique_positions.add((read.reference_name, read.reference_start))
            read_lengths.append(read.query_length)

    return {
        'total_reads': total,
        'unique_positions': len(unique_positions),
        'mean_read_length': sum(read_lengths) / len(read_lengths),
        'complexity': len(unique_positions) / total
    }

Related Skills

  • clip-alignment - Align preprocessed reads
  • read-qc/umi-processing - General UMI handling
  • clip-peak-calling - Call peaks from aligned reads