BioSkills bio-duplicate-handling

Mark and remove PCR/optical duplicates using samtools fixmate and markdup. Use when preparing alignments for variant calling or when duplicate reads would bias analysis.

install
source · Clone the upstream repo
git clone https://github.com/GPTomics/bioSkills
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/GPTomics/bioSkills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/alignment-files/duplicate-handling" ~/.claude/skills/gptomics-bioskills-bio-duplicate-handling && rm -rf "$T"
manifest: alignment-files/duplicate-handling/SKILL.md
source content

Version Compatibility

Reference examples tested with: picard 3.1+, pysam 0.22+, samtools 1.19+

Before using code patterns, verify installed versions match. If versions differ:

  • Python:
    pip show <package>
    then
    help(module.function)
    to check signatures
  • CLI:
    <tool> --version
    then
    <tool> --help
    to confirm flags

If code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.

Duplicate Handling

"Remove PCR duplicates from my BAM file" → Mark or remove duplicate reads using the fixmate-sort-markdup pipeline to prevent duplicate bias in variant calling.

  • CLI:
    samtools fixmate
    ,
    samtools markdup
    (samtools)
  • Python:
    pysam.fixmate()
    ,
    pysam.markdup()
    (pysam)

Mark and remove PCR/optical duplicates using samtools.

Why Remove Duplicates?

PCR duplicates are identical copies of the same original molecule, created during library preparation. They:

  • Inflate coverage artificially
  • Bias allele frequencies
  • Can create false positive variant calls

Optical duplicates are clusters read multiple times due to their proximity on the flowcell.

Duplicate Marking Workflow

Goal: Mark PCR/optical duplicates so they can be excluded from downstream variant calling and coverage analysis.

Approach: Name-sort, add mate tags with fixmate, coordinate-sort, then run markdup. The pipeline version avoids intermediate files.

Reference (samtools 1.19+):

# 1. Sort by name (required for fixmate)
samtools sort -n -o namesort.bam input.bam

# 2. Add mate information with fixmate
samtools fixmate -m namesort.bam fixmate.bam

# 3. Sort by coordinate (required for markdup)
samtools sort -o coordsort.bam fixmate.bam

# 4. Mark duplicates
samtools markdup coordsort.bam marked.bam

# 5. Index result
samtools index marked.bam

Pipeline Version

samtools sort -n input.bam | \
    samtools fixmate -m - - | \
    samtools sort - | \
    samtools markdup - marked.bam

samtools index marked.bam

samtools fixmate

Adds mate information required by markdup. Must be run on name-sorted BAM.

Basic Usage

samtools fixmate namesorted.bam fixmate.bam

Add Mate Score Tag (-m)

# Required for markdup to work correctly
samtools fixmate -m namesorted.bam fixmate.bam

Multi-threaded

samtools fixmate -m -@ 4 namesorted.bam fixmate.bam

Remove Secondary/Unmapped

samtools fixmate -r -m namesorted.bam fixmate.bam

samtools markdup

Marks or removes duplicate alignments. Requires coordinate-sorted BAM with mate tags from fixmate.

Mark Duplicates (Keep in File)

samtools markdup input.bam marked.bam

Remove Duplicates

samtools markdup -r input.bam deduped.bam

Output Statistics

samtools markdup -s input.bam marked.bam 2> markdup_stats.txt

Optical Duplicate Distance

# Set pixel distance for optical duplicate detection (default: 100)
samtools markdup -d 2500 input.bam marked.bam

Multi-threaded

samtools markdup -@ 4 input.bam marked.bam

Write Stats to File

samtools markdup -f stats.txt input.bam marked.bam

Duplicate Statistics

Check Duplicate Rate

samtools flagstat marked.bam
# Look for "duplicates" line

Count Duplicates

# Count reads with duplicate flag
samtools view -c -f 1024 marked.bam

Percentage Duplicates

total=$(samtools view -c marked.bam)
dups=$(samtools view -c -f 1024 marked.bam)
echo "scale=2; $dups * 100 / $total" | bc

pysam Python Alternative

Full Pipeline

import pysam

# Sort by name
pysam.sort('-n', '-o', 'namesort.bam', 'input.bam')

# Fixmate
pysam.fixmate('-m', 'namesort.bam', 'fixmate.bam')

# Sort by coordinate
pysam.sort('-o', 'coordsort.bam', 'fixmate.bam')

# Mark duplicates
pysam.markdup('coordsort.bam', 'marked.bam')

# Index
pysam.index('marked.bam')

Check Duplicate Flag

import pysam

with pysam.AlignmentFile('marked.bam', 'rb') as bam:
    total = 0
    duplicates = 0
    for read in bam:
        total += 1
        if read.is_duplicate:
            duplicates += 1

    print(f'Total: {total}')
    print(f'Duplicates: {duplicates}')
    print(f'Rate: {duplicates/total*100:.2f}%')

Filter Out Duplicates

import pysam

with pysam.AlignmentFile('marked.bam', 'rb') as infile:
    with pysam.AlignmentFile('nodup.bam', 'wb', header=infile.header) as outfile:
        for read in infile:
            if not read.is_duplicate:
                outfile.write(read)

Mark Duplicates Manually (Simple Case)

import pysam
from collections import defaultdict

def simple_markdup(input_bam, output_bam):
    seen = defaultdict(set)

    with pysam.AlignmentFile(input_bam, 'rb') as infile:
        with pysam.AlignmentFile(output_bam, 'wb', header=infile.header) as outfile:
            for read in infile:
                if read.is_unmapped:
                    outfile.write(read)
                    continue

                key = (read.reference_id, read.reference_start, read.is_reverse,
                       read.next_reference_id, read.next_reference_start)

                if key in seen:
                    read.is_duplicate = True
                else:
                    seen[key].add(read.query_name)

                outfile.write(read)

simple_markdup('sorted.bam', 'marked.bam')

Alternative: From Aligner

Some aligners can mark duplicates directly:

BWA-MEM2 with samblaster

bwa-mem2 mem ref.fa R1.fq R2.fq | \
    samblaster | \
    samtools sort -o marked.bam

Using Picard (Alternative Tool)

java -jar picard.jar MarkDuplicates \
    I=input.bam \
    O=marked.bam \
    M=metrics.txt

Quick Reference

TaskCommand
Full workflow
sort -n | fixmate -m | sort | markdup
Mark duplicates
samtools markdup in.bam out.bam
Remove duplicates
samtools markdup -r in.bam out.bam
Count duplicates
samtools view -c -f 1024 marked.bam
View non-duplicates
samtools view -F 1024 marked.bam
Get stats
samtools markdup -s in.bam out.bam

Duplicate FLAG

FlagValueMeaning
0x4001024PCR or optical duplicate

Filter Commands

# View only duplicates
samtools view -f 1024 marked.bam

# View non-duplicates only
samtools view -F 1024 marked.bam

# Count non-duplicates
samtools view -c -F 1024 marked.bam

Common Errors

ErrorCauseSolution
mate not found
Input not name-sortedRun
samtools sort -n
first
no MC tag
fixmate not run with -mRe-run fixmate with
-m
flag
not coordinate sorted
Input to markdup not sortedRun
samtools sort
after fixmate

Related Skills

  • alignment-sorting - Sort by name/coordinate for workflow
  • alignment-filtering - Filter duplicates from output
  • bam-statistics - Check duplicate rates with flagstat
  • variant-calling - Duplicate marking before calling