BioSkills bio-clip-seq-clip-preprocessing
Preprocess CLIP-seq data including adapter trimming, UMI extraction, and PCR duplicate removal. Use when preparing raw CLIP, iCLIP, or eCLIP reads for peak calling.
git clone https://github.com/GPTomics/bioSkills
T=$(mktemp -d) && git clone --depth=1 https://github.com/GPTomics/bioSkills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/clip-seq/clip-preprocessing" ~/.claude/skills/gptomics-bioskills-bio-clip-seq-clip-preprocessing && rm -rf "$T"
clip-seq/clip-preprocessing/SKILL.mdVersion Compatibility
Reference examples tested with: cutadapt 4.4+, pysam 0.22+
Before using code patterns, verify installed versions match. If versions differ:
- Python:
thenpip show <package>
to check signatureshelp(module.function) - CLI:
then<tool> --version
to confirm flags<tool> --help
If code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.
CLIP-seq Preprocessing
"Preprocess my CLIP-seq reads" → Extract UMIs, trim adapters, and remove PCR duplicates from raw CLIP/iCLIP/eCLIP reads to prepare for alignment and peak calling.
- CLI:
for UMI handling,umi_tools extract
for adapter trimmingcutadapt
UMI Extraction (eCLIP/iCLIP)
Goal: Extract UMI barcodes from CLIP-seq reads and append them to read names for downstream deduplication.
Approach: Run umi_tools extract with a barcode pattern matching the UMI length and position in the read.
# Extract UMI from read 1 umi_tools extract \ --stdin=reads_R1.fastq.gz \ --read2-in=reads_R2.fastq.gz \ --bc-pattern=NNNNNNNNNN \ --stdout=R1_umi.fastq.gz \ --read2-out=R2_umi.fastq.gz # bc-pattern: UMI barcode pattern # N = UMI base # For eCLIP: typically 10-nt UMI in read 1
Adapter Trimming
# Trim adapters after UMI extraction cutadapt \ -a AGATCGGAAGAGCACACGTCT \ -A AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT \ -m 18 \ -o trimmed_R1.fastq.gz \ -p trimmed_R2.fastq.gz \ R1_umi.fastq.gz R2_umi.fastq.gz
Two-Pass Trimming (eCLIP)
Goal: Remove inline adapters from eCLIP reads that appear at both 3' and 5' ends.
Approach: Run two sequential cutadapt passes: first trim the 3' adapter, then trim any remaining 5' adapter from read-through events.
# eCLIP protocol has inline adapters # First pass: trim 3' adapter cutadapt -a AGATCGGAAGAGC -m 18 -o pass1.fq.gz input.fq.gz # Second pass: trim 5' adapter (read-through) cutadapt -g AGATCGGAAGAGC -m 18 -o pass2.fq.gz pass1.fq.gz
PCR Duplicate Removal
# After alignment, deduplicate using UMIs umi_tools dedup \ --stdin=aligned.bam \ --stdout=deduped.bam \ --paired \ --method=unique # Methods: # unique: Exact UMI match # cluster: Allow UMI mismatches (default) # adjacency: Network-based clustering
Python Preprocessing
from umi_tools import UMIClusterer import pysam def count_umis_per_position(bam_path): '''Count unique UMIs at each genomic position''' from collections import defaultdict position_umis = defaultdict(set) with pysam.AlignmentFile(bam_path, 'rb') as bam: for read in bam: if read.is_unmapped: continue # Extract UMI from read name (added by umi_tools extract) umi = read.query_name.split('_')[-1] pos = (read.reference_name, read.reference_start) position_umis[pos].add(umi) return {pos: len(umis) for pos, umis in position_umis.items()}
Quality Control
def clip_qc(bam_path): '''CLIP-seq specific QC metrics''' import pysam total = 0 unique_positions = set() read_lengths = [] with pysam.AlignmentFile(bam_path, 'rb') as bam: for read in bam: if read.is_unmapped: continue total += 1 unique_positions.add((read.reference_name, read.reference_start)) read_lengths.append(read.query_length) return { 'total_reads': total, 'unique_positions': len(unique_positions), 'mean_read_length': sum(read_lengths) / len(read_lengths), 'complexity': len(unique_positions) / total }
Related Skills
- clip-alignment - Align preprocessed reads
- read-qc/umi-processing - General UMI handling
- clip-peak-calling - Call peaks from aligned reads