Claude-skill-registry bio-duplicate-handling
Mark and remove PCR/optical duplicates using samtools fixmate and markdup. Use when preparing alignments for variant calling or when duplicate reads would bias analysis.
install
source · Clone the upstream repo
git clone https://github.com/majiayu000/claude-skill-registry
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/duplicate-handling" ~/.claude/skills/majiayu000-claude-skill-registry-bio-duplicate-handling && rm -rf "$T"
manifest:
skills/data/duplicate-handling/SKILL.mdsource content
Duplicate Handling
Mark and remove PCR/optical duplicates using samtools.
Why Remove Duplicates?
PCR duplicates are identical copies of the same original molecule, created during library preparation. They:
- Inflate coverage artificially
- Bias allele frequencies
- Can create false positive variant calls
Optical duplicates are clusters read multiple times due to their proximity on the flowcell.
Duplicate Marking Workflow
The standard samtools workflow requires multiple steps:
# 1. Sort by name (required for fixmate) samtools sort -n -o namesort.bam input.bam # 2. Add mate information with fixmate samtools fixmate -m namesort.bam fixmate.bam # 3. Sort by coordinate (required for markdup) samtools sort -o coordsort.bam fixmate.bam # 4. Mark duplicates samtools markdup coordsort.bam marked.bam # 5. Index result samtools index marked.bam
Pipeline Version
samtools sort -n input.bam | \ samtools fixmate -m - - | \ samtools sort - | \ samtools markdup - marked.bam samtools index marked.bam
samtools fixmate
Adds mate information required by markdup. Must be run on name-sorted BAM.
Basic Usage
samtools fixmate namesorted.bam fixmate.bam
Add Mate Score Tag (-m)
# Required for markdup to work correctly samtools fixmate -m namesorted.bam fixmate.bam
Multi-threaded
samtools fixmate -m -@ 4 namesorted.bam fixmate.bam
Remove Secondary/Unmapped
samtools fixmate -r -m namesorted.bam fixmate.bam
samtools markdup
Marks or removes duplicate alignments. Requires coordinate-sorted BAM with mate tags from fixmate.
Mark Duplicates (Keep in File)
samtools markdup input.bam marked.bam
Remove Duplicates
samtools markdup -r input.bam deduped.bam
Output Statistics
samtools markdup -s input.bam marked.bam 2> markdup_stats.txt
Optical Duplicate Distance
# Set pixel distance for optical duplicate detection (default: 100) samtools markdup -d 2500 input.bam marked.bam
Multi-threaded
samtools markdup -@ 4 input.bam marked.bam
Write Stats to File
samtools markdup -f stats.txt input.bam marked.bam
Duplicate Statistics
Check Duplicate Rate
samtools flagstat marked.bam # Look for "duplicates" line
Count Duplicates
# Count reads with duplicate flag samtools view -c -f 1024 marked.bam
Percentage Duplicates
total=$(samtools view -c marked.bam) dups=$(samtools view -c -f 1024 marked.bam) echo "scale=2; $dups * 100 / $total" | bc
pysam Python Alternative
Full Pipeline
import pysam # Sort by name pysam.sort('-n', '-o', 'namesort.bam', 'input.bam') # Fixmate pysam.fixmate('-m', 'namesort.bam', 'fixmate.bam') # Sort by coordinate pysam.sort('-o', 'coordsort.bam', 'fixmate.bam') # Mark duplicates pysam.markdup('coordsort.bam', 'marked.bam') # Index pysam.index('marked.bam')
Check Duplicate Flag
import pysam with pysam.AlignmentFile('marked.bam', 'rb') as bam: total = 0 duplicates = 0 for read in bam: total += 1 if read.is_duplicate: duplicates += 1 print(f'Total: {total}') print(f'Duplicates: {duplicates}') print(f'Rate: {duplicates/total*100:.2f}%')
Filter Out Duplicates
import pysam with pysam.AlignmentFile('marked.bam', 'rb') as infile: with pysam.AlignmentFile('nodup.bam', 'wb', header=infile.header) as outfile: for read in infile: if not read.is_duplicate: outfile.write(read)
Mark Duplicates Manually (Simple Case)
import pysam from collections import defaultdict def simple_markdup(input_bam, output_bam): seen = defaultdict(set) with pysam.AlignmentFile(input_bam, 'rb') as infile: with pysam.AlignmentFile(output_bam, 'wb', header=infile.header) as outfile: for read in infile: if read.is_unmapped: outfile.write(read) continue key = (read.reference_id, read.reference_start, read.is_reverse, read.next_reference_id, read.next_reference_start) if key in seen: read.is_duplicate = True else: seen[key].add(read.query_name) outfile.write(read) simple_markdup('sorted.bam', 'marked.bam')
Alternative: From Aligner
Some aligners can mark duplicates directly:
BWA-MEM2 with samblaster
bwa-mem2 mem ref.fa R1.fq R2.fq | \ samblaster | \ samtools sort -o marked.bam
Using Picard (Alternative Tool)
java -jar picard.jar MarkDuplicates \ I=input.bam \ O=marked.bam \ M=metrics.txt
Quick Reference
| Task | Command |
|---|---|
| Full workflow | |
| Mark duplicates | |
| Remove duplicates | |
| Count duplicates | |
| View non-duplicates | |
| Get stats | |
Duplicate FLAG
| Flag | Value | Meaning |
|---|---|---|
| 0x400 | 1024 | PCR or optical duplicate |
Filter Commands
# View only duplicates samtools view -f 1024 marked.bam # View non-duplicates only samtools view -F 1024 marked.bam # Count non-duplicates samtools view -c -F 1024 marked.bam
Common Errors
| Error | Cause | Solution |
|---|---|---|
| Input not name-sorted | Run first |
| fixmate not run with -m | Re-run fixmate with flag |
| Input to markdup not sorted | Run after fixmate |
Related Skills
- alignment-sorting - Sort by name/coordinate for workflow
- alignment-filtering - Filter duplicates from output
- bam-statistics - Check duplicate rates with flagstat
- variant-calling - Duplicate marking before calling