BioSkills bio-variant-calling-structural-variant-calling

Call structural variants (SVs) from sequencing data using Manta, Delly, GRIDSS, and LUMPY. Detects deletions, insertions, inversions, duplications, and translocations too large for standard SNV callers. Use when detecting structural variants from short-read or long-read data and building consensus callsets.

install
source · Clone the upstream repo
git clone https://github.com/GPTomics/bioSkills
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/GPTomics/bioSkills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/variant-calling/structural-variant-calling" ~/.claude/skills/gptomics-bioskills-bio-variant-calling-structural-variant-calling && rm -rf "$T"
manifest: variant-calling/structural-variant-calling/SKILL.md
source content

Version Compatibility

Reference examples tested with: Manta 1.6+, Delly 1.2+, GRIDSS 2.13+, bcftools 1.19+, samtools 1.19+, SURVIVOR 1.0.7+, Sniffles2 2.2+

Before using code patterns, verify installed versions match. If versions differ:

  • CLI: <tool> --version, then <tool> --help to confirm flags

If code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.
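
Before checking versions, it helps to confirm that each tool is on PATH at all. The following is a minimal sketch; `require_tools` is a hypothetical helper name, not part of any of the tools listed above.

```shell
# Hypothetical helper: report which of the named tools are missing from PATH.
# Prints one missing tool per line and returns non-zero if any are missing.
require_tools() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            printf '%s\n' "$tool"
            missing=1
        fi
    done
    return "$missing"
}

# Example: list any absent tools before starting a run.
# require_tools configManta.py delly gridss bcftools samtools SURVIVOR
```

Once every tool resolves, run `<tool> --version` on each and compare against the versions listed above.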

Structural Variant Calling

"Call structural variants from my WGS data" -> Detect large genomic rearrangements (deletions, insertions, inversions, duplications, translocations) using split-read, discordant-pair, and assembly-based evidence.

  • CLI: configManta.py (Manta), delly call, gridss (GRIDSS), lumpyexpress / smoove call

SV Detection Limitations by Platform

Not all SV types are equally detectable across sequencing platforms. This table reflects practical detection performance, not theoretical capability:

| SV Type | Short-read Detection | Long-read Detection | Key Limitation |
| --- | --- | --- | --- |
| Deletion | Good (read-pair + split-read) | Excellent | Short reads miss deletions in repetitive regions |
| Duplication | Moderate (read-pair + depth) | Good | Tandem vs dispersed distinction unreliable with short reads |
| Inversion | Moderate (read-pair) | Good | Breakpoints in repeats cause false negatives |
| Insertion | Poor (limited by read length) | Excellent | Short reads cannot resolve insertions longer than the read length |
| Translocation | Moderate (discordant pairs) | Good | High false positive rate near centromeres/telomeres |
| Complex/nested | Poor | Good (with assembly) | Multiple overlapping SVs confound short-read signals |

Caller Comparison

| Feature | Manta | Delly | GRIDSS | Smoove/LUMPY |
| --- | --- | --- | --- | --- |
| Method | Read-pair + split-read + local assembly | Read-pair + split-read | Positional de Bruijn graph assembly | Read-pair + split-read |
| Speed | Fastest | Moderate | Slowest (2-5x Manta) | Moderate |
| DEL detection | Good | Good | Best precision | Good |
| INS detection | Good | Limited (small INS only) | Good | Cannot detect |
| Somatic mode | Yes | Yes | Yes (GRIDSS2/GRIPSS) | Limited |
| RNA-seq | Yes | No | No | No |
| Single breakends | No | No | Yes | No |
| Complex SVs | Limited | No | Yes (via LINX) | No |

GRIDSS produces the highest precision for deletions and uniquely detects single breakend events (one side of a breakpoint where the partner cannot be mapped). Manta provides the best speed-to-accuracy ratio for most applications. Delly excels at joint calling across cohorts. LUMPY/Smoove lacks insertion detection entirely.

Consensus Calling Strategy

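A commonly recommended strategy is to run four callers (for example Delly, GRIDSS, Manta, and SvABA) and keep calls supported by at least two of them. Each caller has distinct algorithmic biases, so the union of callsets is noisy while the strict intersection discards many true events; 2-of-4 agreement balances sensitivity against false positives. SvABA is not covered in this skill; the SURVIVOR example below uses Smoove as the fourth caller.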

Manta

configManta.py \
    --bam sample.bam \
    --referenceFasta reference.fa \
    --runDir manta_run

manta_run/runWorkflow.py -j 8

# Output: manta_run/results/variants/
# - diploidSV.vcf.gz (germline SVs)
# - candidateSV.vcf.gz (all candidates before scoring)
# - candidateSmallIndels.vcf.gz (simple indels below the minimum scored size, 50bp by default; intended as Strelka input)

Manta Tumor-Normal Mode

configManta.py \
    --tumorBam tumor.bam \
    --normalBam normal.bam \
    --referenceFasta reference.fa \
    --runDir manta_somatic

manta_somatic/runWorkflow.py -j 8

# Output includes:
# - somaticSV.vcf.gz (somatic SVs, scored by tumor/normal evidence ratio)
# - diploidSV.vcf.gz (germline SVs)

Manta Options

# WES mode (adjusts depth filters for uneven exome coverage); regions.bed.gz must be bgzip-compressed and tabix-indexed
configManta.py \
    --bam sample.bam \
    --referenceFasta reference.fa \
    --exome \
    --callRegions regions.bed.gz \
    --runDir manta_exome

# RNA-seq mode (handles split alignments across splice junctions)
configManta.py \
    --bam rnaseq.bam \
    --referenceFasta reference.fa \
    --rna \
    --runDir manta_rna

Delly

delly call -g reference.fa -o sv_calls.bcf sample.bam
bcftools view sv_calls.bcf > sv_calls.vcf

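# Joint calling on a small batch of samples; for large cohorts, Delly's documented germline workflow calls per sample, merges sites with delly merge, then re-genotypes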
delly call -g reference.fa -o joint_svs.bcf sample1.bam sample2.bam sample3.bam

Delly Somatic Mode

delly call -g reference.fa -o svs.bcf tumor.bam normal.bam

# Column 1 must match the sample names in svs.bcf (the BAM SM tags); column 2 is tumor or control
printf 'tumor\ttumor\nnormal\tcontrol\n' > samples.tsv

delly filter -f somatic -o somatic_svs.bcf -s samples.tsv svs.bcf
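
The sample sheet passed to `delly filter -s` only works if its first column matches the sample names Delly read from the BAM read groups. The sketch below shows one way to derive those names instead of typing them by hand. `sm_from_header` and `write_delly_samples` are hypothetical helper names, and the usage lines assume each BAM contains a single sample.

```shell
# Extract unique SM (sample name) values from SAM header text on stdin,
# e.g. the output of: samtools view -H tumor.bam
sm_from_header() {
    awk -F'\t' '
        /^@RG/ {
            for (i = 2; i <= NF; i++) {
                if ($i ~ /^SM:/) {
                    sm = substr($i, 4)
                    if (!(sm in seen)) { seen[sm] = 1; print sm }
                }
            }
        }'
}

# Write the two-column sample sheet: sample name, then tumor or control.
# $1 = tumor sample name, $2 = control sample name
write_delly_samples() {
    printf '%s\ttumor\n%s\tcontrol\n' "$1" "$2"
}

# Usage (assumes samtools is installed):
# tumor_sm=$(samtools view -H tumor.bam | sm_from_header)
# normal_sm=$(samtools view -H normal.bam | sm_from_header)
# write_delly_samples "$tumor_sm" "$normal_sm" > samples.tsv
```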

Delly SV Types

delly call -t DEL -g ref.fa -o deletions.bcf sample.bam
delly call -t DUP -g ref.fa -o duplications.bcf sample.bam
delly call -t INV -g ref.fa -o inversions.bcf sample.bam
delly call -t BND -g ref.fa -o translocations.bcf sample.bam
delly call -t INS -g ref.fa -o insertions.bcf sample.bam

GRIDSS

GRIDSS uses positional de Bruijn graph assembly to reconstruct breakpoints, producing the highest precision among short-read callers. It also detects single breakend events, where only one side of a rearrangement maps to the reference. These matter for viral integrations, centromeric breakpoints, and highly rearranged cancer genomes.

gridss \
    --reference reference.fa \
    --output gridss_svs.vcf \
    --assembly gridss_assembly.bam \
    --threads 8 \
    sample.bam

GRIDSS Somatic Mode (GRIDSS2 + GRIPSS)

# GRIDSS2 with paired tumor-normal
gridss \
    --reference reference.fa \
    --output gridss_raw.vcf \
    --assembly gridss_assembly.bam \
    --labels normal,tumor \
    --threads 8 \
    normal.bam tumor.bam

# GRIPSS post-filtering (somatic/germline classification)
gripss \
    -ref_genome reference.fa \
    -ref_genome_version 38 \
    -sample tumor \
    -reference normal \
    -vcf gridss_raw.vcf \
    -output_dir gripss_output/

Complex rearrangement reconstruction is available via LINX, which interprets GRIDSS breakpoints into higher-order SV events (chromothripsis, breakage-fusion-bridge cycles).

LUMPY

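# Discordant pairs: drop proper pairs, unmapped reads/mates, secondary alignments, and duplicates
samtools view -b -F 1294 sample.bam > discordant.bam
# Split reads: extract with the script shipped in the LUMPY repository
samtools view -h sample.bam | \
    /path/to/lumpy-sv/scripts/extractSplitReads_BwaMem -i stdin | \
    samtools view -Sb - > splitters.bam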

lumpyexpress \
    -B sample.bam \
    -S splitters.bam \
    -D discordant.bam \
    -o lumpy_svs.vcf
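
The `-F 1294` mask used for the discordant-read extraction is the bitwise OR of five standard SAM flag bits. The short sketch below reconstructs it; the variable names are illustrative only.

```shell
# SAM flag bits excluded by -F 1294 when extracting discordant pairs
PROPER_PAIR=2        # read mapped in a proper pair
UNMAPPED=4           # read unmapped
MATE_UNMAPPED=8      # mate unmapped
SECONDARY=256        # secondary alignment
DUPLICATE=1024       # PCR or optical duplicate

mask=$((PROPER_PAIR | UNMAPPED | MATE_UNMAPPED | SECONDARY | DUPLICATE))
echo "$mask"   # prints 1294
```

What remains after this filter is mapped, primary, non-duplicate reads whose pairs are not flagged as proper, which is the read-pair signal LUMPY uses.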

Smoove (LUMPY Wrapper)

smoove call \
    --name sample \
    --fasta reference.fa \
    --outdir smoove_output \
    -p 8 \
    --genotype \
    sample.bam

# Output: smoove_output/sample-smoove.genotyped.vcf.gz

Merge Multiple Callers with SURVIVOR

Goal: Increase confidence in SV calls by requiring support from multiple callers with distinct algorithmic approaches.

Approach: Run 2-4 callers independently, then merge callsets with SURVIVOR requiring agreement on breakpoint proximity and SV type. Using max_dist=1000bp allows for the breakpoint imprecision inherent in short-read callers while min_callers=2 filters false positives unique to any single algorithm.

ls manta_svs.vcf delly_svs.vcf gridss_svs.vcf smoove_svs.vcf > vcf_list.txt

# max_dist=1000  min_callers=2  type_agree=1  strand_agree=1  estimate_dist=0  min_size=50
SURVIVOR merge vcf_list.txt 1000 2 1 1 0 50 merged_svs.vcf

The 1000bp max_dist accounts for breakpoint position uncertainty across callers (Manta and GRIDSS resolve breakpoints more precisely than Delly/LUMPY). Requiring type_agree=1 prevents merging a deletion call with a duplication call at the same locus.
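
After merging, it is often useful to see which caller combinations support each call, or to tighten the support threshold without rerunning the merge. The sketch below assumes the merged VCF is uncompressed and carries the `SUPP` and `SUPP_VEC` INFO tags that SURVIVOR writes. `filter_min_support` and `count_supp_vec` are hypothetical helper names.

```shell
# Keep header lines plus records whose SUPP (number of supporting callsets)
# is at least $1. Reads an uncompressed VCF on stdin.
filter_min_support() {
    awk -F'\t' -v min="$1" '
        /^#/ { print; next }
        {
            supp = 0
            n = split($8, kv, ";")
            for (i = 1; i <= n; i++)
                if (kv[i] ~ /^SUPP=/) supp = substr(kv[i], 6) + 0
            if (supp >= min) print
        }'
}

# Tabulate SUPP_VEC patterns. Each pattern has one 0/1 digit per input VCF,
# in the order the files appear in vcf_list.txt.
count_supp_vec() {
    awk -F'\t' '
        !/^#/ {
            n = split($8, kv, ";")
            for (i = 1; i <= n; i++)
                if (kv[i] ~ /^SUPP_VEC=/) c[substr(kv[i], 10)]++
        }
        END { for (v in c) print v "\t" c[v] }' | sort
}

# Usage:
# filter_min_support 3 < merged_svs.vcf > merged_svs.supp3.vcf
# count_supp_vec < merged_svs.vcf
```

A pattern table dominated by single-caller combinations such as `0100` would suggest that one caller is contributing most of the noise.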

Filter SV Calls

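# Keep calls with QUAL of at least 20
bcftools view -i 'QUAL >= 20' svs.vcf > svs.filtered.vcf
# Keep calls spanning at least 50bp; records without SVLEN (e.g. BND) are dropped
bcftools view -i 'ABS(SVLEN) >= 50' svs.vcf > svs.min50.vcf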

# Filter by SV type
bcftools view -i 'SVTYPE="DEL"' svs.vcf > deletions.vcf
bcftools view -i 'SVTYPE="INS"' svs.vcf > insertions.vcf
bcftools view -i 'SVTYPE="INV"' svs.vcf > inversions.vcf
bcftools view -i 'SVTYPE="DUP"' svs.vcf > duplications.vcf
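# BND records cover all breakend-notation calls, including intrachromosomal ones (Manta 1.6+ reports inversions as BND pairs)
bcftools view -i 'SVTYPE="BND"' svs.vcf > translocations.vcf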

bcftools view -f PASS svs.vcf > svs.pass.vcf
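
A quick tally of calls per SV type before and after filtering shows what each filter removed. The following is a minimal sketch using awk on an uncompressed VCF; `count_svtypes` is a hypothetical helper name.

```shell
# Count records per INFO/SVTYPE in an uncompressed VCF read from stdin.
# Records without an SVTYPE tag are reported as "NA".
count_svtypes() {
    awk -F'\t' '
        !/^#/ {
            type = "NA"
            n = split($8, kv, ";")
            for (i = 1; i <= n; i++)
                if (kv[i] ~ /^SVTYPE=/) type = substr(kv[i], 8)
            c[type]++
        }
        END { for (t in c) print t "\t" c[t] }' | sort
}

# Usage:
# count_svtypes < svs.vcf
# count_svtypes < svs.pass.vcf
```

Comparing the two tallies makes it obvious when a filter has removed an entire class, for example every BND record.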

Annotate SVs

AnnotSV \
    -SVinputFile svs.vcf \
    -genomeBuild GRCh38 \
    -outputFile annotated_svs

# Output includes: gene overlap, DGV frequency, gnomAD-SV population AF, ClinVar pathogenicity

SV Types

| Type | Code | Description | Typical Size Range |
| --- | --- | --- | --- |
| Deletion | DEL | Sequence removed | 50bp - 100Mb |
| Insertion | INS | Novel sequence inserted | 50bp - 10kb (short-read); unlimited (long-read) |
| Inversion | INV | Sequence orientation reversed | 1kb - 10Mb |
| Duplication | DUP | Sequence copied (tandem or dispersed) | 1kb - 10Mb |
| Translocation | BND | Breakend connecting different chromosomes | N/A (inter-chromosomal) |

Coverage Guidelines

| Coverage | Detection Ability | Practical Guidance |
| --- | --- | --- |
| 10x | Large SVs only (>1kb) | Limited breakpoint accuracy; high false negative rate for SVs <1kb; suitable only for large deletion screening |
| 30x | Most SVs detected | Standard for WGS; good sensitivity for DEL/DUP/INV >300bp; moderate INS detection |
| 50x+ | Small SVs, precise breakpoints | Better sensitivity near repetitive regions; resolves complex SVs; recommended for clinical applications |

Below 30x, split-read evidence becomes sparse and callers rely more heavily on read-pair signals, which have lower breakpoint resolution (~300-500bp uncertainty vs ~10bp for split-reads).
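
To see which coverage tier a BAM falls into, a rough depth estimate can be derived from `samtools idxstats` output as mapped reads times read length divided by total reference length. The sketch below is an approximation under stated assumptions: the read length is supplied by the user, and soft-clipping, duplicates, and uneven coverage are ignored. `approx_depth` is a hypothetical helper name.

```shell
# Approximate mean depth from `samtools idxstats` output on stdin.
# Columns: contig name, contig length, mapped reads, unmapped reads.
# $1 = read length in bp (an assumption; supply the value for your library)
approx_depth() {
    awk -F'\t' -v rl="$1" '
        $1 != "*" { len += $2; mapped += $3 }
        END { if (len > 0) printf "%.1f\n", mapped * rl / len }'
}

# Usage (assumes an indexed BAM and 150bp reads):
# samtools idxstats sample.bam | approx_depth 150
```

For anything beyond a sanity check, a per-base tool such as `samtools coverage` gives a more faithful number.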

Short-read vs Long-read Decision Framework

Short reads are sufficient for: deletions >300bp, balanced translocations, large tandem duplications, and population-scale screening where cost per sample matters.

Long reads are necessary for: insertions exceeding read length, complex/nested SVs, SVs in repetitive regions (segmental duplications, LINE/SINE elements), complete breakpoint resolution, and phased SV haplotyping.

Cost consideration: short reads for population-scale SV surveys (hundreds of samples), long reads for clinical-grade SV characterization where completeness matters more than throughput.

Long-Read SV Callers

| Caller | Best For | Key Strengths |
| --- | --- | --- |
| Sniffles2 | ONT/HiFi general | 11.8x faster than v1; population-level merging from per-sample .snf files; mosaic SV detection; best overall accuracy |
| CuteSV2 | ONT data | Highest recall for ONT; signature-based clustering handles noisy reads |
| pbsv | PacBio HiFi | Official PacBio tool; best paired with PBMM2 aligner; tandem repeat aware |
| Severus | Somatic SVs | Phased breakpoint graph approach; resolves complex somatic rearrangements (Nature Biotechnology 2025) |

Recommended Aligner-Caller Pairings

  • Minimap2 + CuteSV2: ONT general purpose; fastest end-to-end
  • Winnowmap + Sniffles2: high accuracy in repetitive regions (Winnowmap downweights repetitive k-mers)
  • PBMM2 + pbsv: PacBio HiFi data; PBMM2 produces the CIGAR strings pbsv expects

See long-read-sequencing/structural-variants for long-read SV calling workflows with full pipeline examples.

Related Skills

  • long-read-sequencing/structural-variants - Long-read SV calling with Sniffles2, CuteSV, pbsv
  • copy-number/cnvkit-analysis - Copy number variant detection (complements SV calling for dosage changes)
  • variant-calling/filtering-best-practices - VCF filtering strategies applicable to SV callsets
  • variant-calling/variant-annotation - Functional annotation of variants including SVs
  • alignment-files/alignment-filtering - BAM preparation and quality filtering before SV calling