BioSkills bio-duplicate-handling
Mark and remove PCR/optical duplicates using samtools fixmate and markdup. Use when preparing alignments for variant calling or when duplicate reads would bias analysis.
git clone https://github.com/GPTomics/bioSkills
T=$(mktemp -d) && git clone --depth=1 https://github.com/GPTomics/bioSkills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/alignment-files/duplicate-handling" ~/.claude/skills/gptomics-bioskills-bio-duplicate-handling && rm -rf "$T"
alignment-files/duplicate-handling/SKILL.mdVersion Compatibility
Reference examples tested with: picard 3.1+, pysam 0.22+, samtools 1.19+
Before using code patterns, verify installed versions match. If versions differ:
- Python:
thenpip show <package>
to check signatureshelp(module.function) - CLI:
then<tool> --version
to confirm flags<tool> --help
If code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.
Duplicate Handling
"Remove PCR duplicates from my BAM file" → Mark or remove duplicate reads using the fixmate-sort-markdup pipeline to prevent duplicate bias in variant calling.
- CLI:
,samtools fixmate
(samtools)samtools markdup - Python:
,pysam.fixmate()
(pysam)pysam.markdup()
Mark and remove PCR/optical duplicates using samtools.
Why Remove Duplicates?
PCR duplicates are identical copies of the same original molecule, created during library preparation. They:
- Inflate coverage artificially
- Bias allele frequencies
- Can create false positive variant calls
Optical duplicates are clusters read multiple times due to their proximity on the flowcell.
Duplicate Marking Workflow
Goal: Mark PCR/optical duplicates so they can be excluded from downstream variant calling and coverage analysis.
Approach: Name-sort, add mate tags with fixmate, coordinate-sort, then run markdup. The pipeline version avoids intermediate files.
Reference (samtools 1.19+):
# 1. Sort by name (required for fixmate) samtools sort -n -o namesort.bam input.bam # 2. Add mate information with fixmate samtools fixmate -m namesort.bam fixmate.bam # 3. Sort by coordinate (required for markdup) samtools sort -o coordsort.bam fixmate.bam # 4. Mark duplicates samtools markdup coordsort.bam marked.bam # 5. Index result samtools index marked.bam
Pipeline Version
samtools sort -n input.bam | \ samtools fixmate -m - - | \ samtools sort - | \ samtools markdup - marked.bam samtools index marked.bam
samtools fixmate
Adds mate information required by markdup. Must be run on name-sorted BAM.
Basic Usage
samtools fixmate namesorted.bam fixmate.bam
Add Mate Score Tag (-m)
# Required for markdup to work correctly samtools fixmate -m namesorted.bam fixmate.bam
Multi-threaded
samtools fixmate -m -@ 4 namesorted.bam fixmate.bam
Remove Secondary/Unmapped
samtools fixmate -r -m namesorted.bam fixmate.bam
samtools markdup
Marks or removes duplicate alignments. Requires coordinate-sorted BAM with mate tags from fixmate.
Mark Duplicates (Keep in File)
samtools markdup input.bam marked.bam
Remove Duplicates
samtools markdup -r input.bam deduped.bam
Output Statistics
samtools markdup -s input.bam marked.bam 2> markdup_stats.txt
Optical Duplicate Distance
# Set pixel distance for optical duplicate detection (default: 100) samtools markdup -d 2500 input.bam marked.bam
Multi-threaded
samtools markdup -@ 4 input.bam marked.bam
Write Stats to File
samtools markdup -f stats.txt input.bam marked.bam
Duplicate Statistics
Check Duplicate Rate
samtools flagstat marked.bam # Look for "duplicates" line
Count Duplicates
# Count reads with duplicate flag samtools view -c -f 1024 marked.bam
Percentage Duplicates
total=$(samtools view -c marked.bam) dups=$(samtools view -c -f 1024 marked.bam) echo "scale=2; $dups * 100 / $total" | bc
pysam Python Alternative
Full Pipeline
import pysam # Sort by name pysam.sort('-n', '-o', 'namesort.bam', 'input.bam') # Fixmate pysam.fixmate('-m', 'namesort.bam', 'fixmate.bam') # Sort by coordinate pysam.sort('-o', 'coordsort.bam', 'fixmate.bam') # Mark duplicates pysam.markdup('coordsort.bam', 'marked.bam') # Index pysam.index('marked.bam')
Check Duplicate Flag
import pysam with pysam.AlignmentFile('marked.bam', 'rb') as bam: total = 0 duplicates = 0 for read in bam: total += 1 if read.is_duplicate: duplicates += 1 print(f'Total: {total}') print(f'Duplicates: {duplicates}') print(f'Rate: {duplicates/total*100:.2f}%')
Filter Out Duplicates
import pysam with pysam.AlignmentFile('marked.bam', 'rb') as infile: with pysam.AlignmentFile('nodup.bam', 'wb', header=infile.header) as outfile: for read in infile: if not read.is_duplicate: outfile.write(read)
Mark Duplicates Manually (Simple Case)
import pysam from collections import defaultdict def simple_markdup(input_bam, output_bam): seen = defaultdict(set) with pysam.AlignmentFile(input_bam, 'rb') as infile: with pysam.AlignmentFile(output_bam, 'wb', header=infile.header) as outfile: for read in infile: if read.is_unmapped: outfile.write(read) continue key = (read.reference_id, read.reference_start, read.is_reverse, read.next_reference_id, read.next_reference_start) if key in seen: read.is_duplicate = True else: seen[key].add(read.query_name) outfile.write(read) simple_markdup('sorted.bam', 'marked.bam')
Alternative: From Aligner
Some aligners can mark duplicates directly:
BWA-MEM2 with samblaster
bwa-mem2 mem ref.fa R1.fq R2.fq | \ samblaster | \ samtools sort -o marked.bam
Using Picard (Alternative Tool)
java -jar picard.jar MarkDuplicates \ I=input.bam \ O=marked.bam \ M=metrics.txt
Quick Reference
| Task | Command |
|---|---|
| Full workflow | |
| Mark duplicates | |
| Remove duplicates | |
| Count duplicates | |
| View non-duplicates | |
| Get stats | |
Duplicate FLAG
| Flag | Value | Meaning |
|---|---|---|
| 0x400 | 1024 | PCR or optical duplicate |
Filter Commands
# View only duplicates samtools view -f 1024 marked.bam # View non-duplicates only samtools view -F 1024 marked.bam # Count non-duplicates samtools view -c -F 1024 marked.bam
Common Errors
| Error | Cause | Solution |
|---|---|---|
| Input not name-sorted | Run first |
| fixmate not run with -m | Re-run fixmate with flag |
| Input to markdup not sorted | Run after fixmate |
Related Skills
- alignment-sorting - Sort by name/coordinate for workflow
- alignment-filtering - Filter duplicates from output
- bam-statistics - Check duplicate rates with flagstat
- variant-calling - Duplicate marking before calling