BioSkills bio-variant-calling-structural-variant-calling
Call structural variants (SVs) from sequencing data using Manta, Delly, GRIDSS, and LUMPY. Detects deletions, insertions, inversions, duplications, and translocations too large for standard SNV callers. Use when detecting structural variants from short-read or long-read data and building consensus callsets.
git clone https://github.com/GPTomics/bioSkills
T=$(mktemp -d) && git clone --depth=1 https://github.com/GPTomics/bioSkills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/variant-calling/structural-variant-calling" ~/.claude/skills/gptomics-bioskills-bio-variant-calling-structural-variant-calling && rm -rf "$T"
Version Compatibility
Reference examples tested with: Manta 1.6+, Delly 1.2+, GRIDSS 2.13+, bcftools 1.19+, samtools 1.19+, SURVIVOR 1.0.7+, Sniffles2 2.2+
Before using code patterns, verify installed versions match. If versions differ:
- CLI: `<tool> --version` then `<tool> --help` to confirm flags
If code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.
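The version check above can be scripted. This is a minimal sketch, assuming GNU-style `sort -V` is available; `version_ge` and `check_tool` are hypothetical helper names, and the version-banner parsing is a guess that will not fit every tool.

```shell
# Hypothetical helper: succeeds if installed version ($1) >= required version ($2).
version_ge() {
    # sort -V orders version strings; if the required version sorts first
    # (or both are equal), the installed version is new enough.
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Hypothetical helper: compare a tool's reported version with a minimum.
# Assumption: the tool accepts --version and prints a dotted version number.
check_tool() {
    tool=$1
    required=$2
    if ! command -v "$tool" >/dev/null 2>&1; then
        echo "$tool: not found on PATH"
        return 2
    fi
    installed=$("$tool" --version 2>&1 | grep -oE '[0-9]+(\.[0-9]+)+' | head -n 1)
    if [ -z "$installed" ]; then
        echo "$tool: could not determine version"
        return 2
    fi
    if version_ge "$installed" "$required"; then
        echo "$tool $installed OK (>= $required)"
    else
        echo "$tool $installed is older than $required"
        return 1
    fi
}

check_tool samtools 1.19 || true
check_tool bcftools 1.19 || true
```

Comparing with `sort -V` rather than plain string comparison matters here because 1.9 is older than 1.19.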
Structural Variant Calling
"Call structural variants from my WGS data" -> Detect large genomic rearrangements (deletions, insertions, inversions, duplications, translocations) using split-read, discordant-pair, and assembly-based evidence.
- CLI: `configManta.py` (Manta), `delly call`, `gridss` (GRIDSS), `lumpyexpress`/`smoove call`
SV Detection Limitations by Platform
Not all SV types are equally detectable across sequencing platforms. This table reflects practical detection performance, not theoretical capability:
| SV Type | Short-read Detection | Long-read Detection | Key Limitation |
|---|---|---|---|
| Deletion | Good (read-pair + split-read) | Excellent | Short reads miss deletions in repetitive regions |
| Duplication | Moderate (read-pair + depth) | Good | Tandem vs dispersed distinction unreliable with short reads |
| Inversion | Moderate (read-pair) | Good | Breakpoints in repeats cause false negatives |
| Insertion | Poor (limited by read length) | Excellent | Short reads cannot resolve insertions >read length |
| Translocation | Moderate (discordant pairs) | Good | High false positive rate near centromeres/telomeres |
| Complex/nested | Poor | Good (with assembly) | Multiple overlapping SVs confound short-read signals |
Caller Comparison
| Feature | Manta | Delly | GRIDSS | Smoove/LUMPY |
|---|---|---|---|---|
| Method | Read-pair + split-read + local assembly | Read-pair + split-read | Positional de Bruijn graph assembly | Read-pair + split-read |
| Speed | Fastest | Moderate | Slowest (2-5x Manta) | Moderate |
| DEL detection | Good | Good | Best precision | Good |
| INS detection | Good | Limited (small INS only) | Good | Cannot detect |
| Somatic mode | Yes | Yes | Yes (GRIDSS2/GRIPSS) | Limited |
| RNA-seq | Yes | No | No | No |
| Single breakends | No | No | Yes | No |
| Complex SVs | Limited | No | Yes (via LINX) | No |
GRIDSS produces the highest precision for deletions and uniquely detects single breakend events (one side of a breakpoint where the partner cannot be mapped). Manta provides the best speed-to-accuracy ratio for most applications. Delly excels at joint calling across cohorts. LUMPY/Smoove lacks insertion detection entirely.
Consensus Calling Strategy
Current best practice is to run Delly, GRIDSS, Manta, and SvABA and keep only calls supported by at least 2 of the 4 callers. Each caller has distinct algorithmic biases, so the union of all callsets is noisy while the strict intersection is too conservative; 2-of-4 agreement keeps most of the sensitivity of the union while removing false positives unique to a single caller.
Manta

    # Configure the workflow (germline, single sample)
    configManta.py \
        --bam sample.bam \
        --referenceFasta reference.fa \
        --runDir manta_run

    # Execute with 8 parallel jobs
    manta_run/runWorkflow.py -j 8

    # Output: manta_run/results/variants/
    #   - diploidSV.vcf.gz (germline SVs)
    #   - candidateSV.vcf.gz (all candidates before scoring)
    #   - candidateSmallIndels.vcf.gz (simple indels below the scored-SV size
    #     threshold, 50bp by default; intended as Strelka input)

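Manta expects a sorted, indexed BAM and an indexed reference FASTA. A pre-flight check along these lines can catch missing indexes before the run is configured. It is a sketch under stated assumptions: `check_sv_inputs` is a hypothetical helper, and it only looks for `.bai` and `.fai` files next to the inputs (CRAM and its `.crai` index are not handled).

```shell
# Hypothetical pre-flight check: confirm inputs and their index files exist.
check_sv_inputs() {
    bam=$1
    ref=$2
    status=0
    for f in "$bam" "$ref"; do
        if [ ! -f "$f" ]; then
            echo "missing input file: $f" >&2
            status=1
        fi
    done
    # samtools index writes sample.bam.bai; some pipelines use sample.bai instead
    if [ ! -f "$bam.bai" ] && [ ! -f "${bam%.bam}.bai" ]; then
        echo "missing BAM index for $bam (try: samtools index $bam)" >&2
        status=1
    fi
    if [ ! -f "$ref.fai" ]; then
        echo "missing FASTA index for $ref (try: samtools faidx $ref)" >&2
        status=1
    fi
    return "$status"
}

if check_sv_inputs sample.bam reference.fa; then
    echo "inputs look ready for configManta.py"
else
    echo "fix the issues above before configuring the run"
fi
```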
Manta Tumor-Normal Mode

    configManta.py \
        --tumorBam tumor.bam \
        --normalBam normal.bam \
        --referenceFasta reference.fa \
        --runDir manta_somatic

    manta_somatic/runWorkflow.py -j 8

    # Output includes:
    #   - somaticSV.vcf.gz (somatic SVs, scored by tumor/normal evidence ratio)
    #   - diploidSV.vcf.gz (germline SVs)

Manta Options

    # WES mode (adjusts depth filters for uneven exome coverage)
    configManta.py \
        --bam sample.bam \
        --referenceFasta reference.fa \
        --exome \
        --callRegions regions.bed.gz \
        --runDir manta_exome

    # RNA-seq mode (handles split alignments across splice junctions)
    configManta.py \
        --bam rnaseq.bam \
        --referenceFasta reference.fa \
        --rna \
        --runDir manta_rna

Delly

    # Single-sample calling; output is BCF
    delly call -g reference.fa -o sv_calls.bcf sample.bam

    # Convert BCF to VCF
    bcftools view sv_calls.bcf > sv_calls.vcf

    # Joint calling across cohort (recommended for population studies)
    delly call -g reference.fa -o joint_svs.bcf sample1.bam sample2.bam sample3.bam

Delly Somatic Mode
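
    # Call SVs jointly on the tumor and its matched normal
    delly call -g reference.fa -o svs.bcf tumor.bam normal.bam

    # Sample sheet: column 1 = sample ID as it appears in the BCF header,
    # column 2 = tumor or control (printf is more portable than echo -e)
    printf 'tumor\ttumor\nnormal\tcontrol\n' > samples.tsv

    # Keep SVs that pass the somatic filter
    delly filter -f somatic -o somatic_svs.bcf -s samples.tsv svs.bcf
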
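The sample sheet passed to `delly filter -s` is easy to get wrong: spaces instead of tabs, or a missing control row. The check below is a minimal sketch. `validate_delly_samples` is a hypothetical helper that only validates the file's shape, not whether the sample IDs match the BCF header.

```shell
# Hypothetical check of a Delly somatic sample sheet read from stdin.
# Expects two tab-separated columns; column 2 must be "tumor" or "control",
# and at least one of each must be present.
validate_delly_samples() {
    awk -F'\t' '
        NF != 2 {
            printf "line %d: expected 2 tab-separated columns\n", NR
            bad = 1
            next
        }
        $2 != "tumor" && $2 != "control" {
            printf "line %d: second column must be tumor or control\n", NR
            bad = 1
            next
        }
        { seen[$2] = 1 }
        END {
            if (!("tumor" in seen))   { print "no tumor sample listed";   bad = 1 }
            if (!("control" in seen)) { print "no control sample listed"; bad = 1 }
            exit bad
        }
    '
}

printf 'tumor\ttumor\nnormal\tcontrol\n' | validate_delly_samples \
    && echo "sample sheet looks well-formed"
```

On a real run this would be called as `validate_delly_samples < samples.tsv` before `delly filter`.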
Delly SV Types

    # Restrict calling to a single SV type with -t
    delly call -t DEL -g ref.fa -o deletions.bcf sample.bam
    delly call -t DUP -g ref.fa -o duplications.bcf sample.bam
    delly call -t INV -g ref.fa -o inversions.bcf sample.bam
    delly call -t BND -g ref.fa -o translocations.bcf sample.bam
    delly call -t INS -g ref.fa -o insertions.bcf sample.bam

GRIDSS
GRIDSS uses positional de Bruijn graph assembly to reconstruct breakpoints, producing the highest precision among short-read callers. It detects single breakend events where only one side of a rearrangement maps to the reference, which is critical for viral integrations, centromeric breakpoints, and highly rearranged cancer genomes.

    gridss \
        --reference reference.fa \
        --output gridss_svs.vcf \
        --assembly gridss_assembly.bam \
        --threads 8 \
        sample.bam

GRIDSS Somatic Mode (GRIDSS2 + GRIPSS)

    # GRIDSS2 with paired tumor-normal
    gridss \
        --reference reference.fa \
        --output gridss_raw.vcf \
        --assembly gridss_assembly.bam \
        --labels normal,tumor \
        --threads 8 \
        normal.bam tumor.bam

    # GRIPSS post-filtering (somatic/germline classification)
    gripss \
        -ref_genome reference.fa \
        -ref_genome_version 38 \
        -sample tumor \
        -reference normal \
        -vcf gridss_raw.vcf \
        -output_dir gripss_output/

Complex rearrangement reconstruction is available via LINX, which interprets GRIDSS breakpoints into higher-order SV events (chromothripsis, breakage-fusion-bridge cycles).
LUMPY

    # Extract discordant read pairs
    samtools view -b -F 1294 sample.bam > discordant.bam

    # Extract split-read alignments
    samtools view -h sample.bam | \
        /path/to/lumpy-sv/scripts/extractSplitReads_BwaMem -i stdin | \
        samtools view -Sb - > splitters.bam

    # Call SVs from the full BAM plus the two evidence subsets
    lumpyexpress \
        -B sample.bam \
        -S splitters.bam \
        -D discordant.bam \
        -o lumpy_svs.vcf

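The `-F 1294` filter in the discordant-read step is opaque at first glance. The sketch below decodes a SAM FLAG value into its component bits, using the bit definitions from the SAM specification; `decode_flag` is a hypothetical helper name.

```shell
# Decode a SAM FLAG value into the names of the bits it sets.
decode_flag() {
    flag=$1
    for entry in \
        "1:paired" "2:proper_pair" "4:unmapped" "8:mate_unmapped" \
        "16:reverse" "32:mate_reverse" "64:read1" "128:read2" \
        "256:secondary" "512:qc_fail" "1024:duplicate" "2048:supplementary"
    do
        bit=${entry%%:*}      # number before the colon
        name=${entry#*:}      # label after the colon
        if [ $(( flag & bit )) -ne 0 ]; then
            printf '%s\n' "$name"
        fi
    done
}

decode_flag 1294
# prints: proper_pair, unmapped, mate_unmapped, secondary, duplicate (one per line)
```

Because `-F` excludes any read with one of these bits set, what remains are primary, non-duplicate alignments where both mates are mapped but not as a proper pair, which is the discordant-pair signal LUMPY needs.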
Smoove (LUMPY Wrapper)

    # --genotype runs svtyper and produces the .genotyped.vcf.gz output
    smoove call \
        --name sample \
        --fasta reference.fa \
        --outdir smoove_output \
        --genotype \
        -p 8 \
        sample.bam

    # Output: smoove_output/sample-smoove.genotyped.vcf.gz

Merge Multiple Callers with SURVIVOR
Goal: Increase confidence in SV calls by requiring support from multiple callers with distinct algorithmic approaches.
Approach: Run 2-4 callers independently, then merge callsets with SURVIVOR requiring agreement on breakpoint proximity and SV type. Using max_dist=1000bp allows for the breakpoint imprecision inherent in short-read callers while min_callers=2 filters false positives unique to any single algorithm.

    # One uncompressed VCF per caller; the list order defines the SUPP_VEC digit order
    ls manta_svs.vcf delly_svs.vcf gridss_svs.vcf smoove_svs.vcf > vcf_list.txt

    # Positional arguments:
    # max_dist=1000 min_callers=2 type_agree=1 strand_agree=1 estimate_dist=0 min_size=50
    SURVIVOR merge vcf_list.txt 1000 2 1 1 0 50 merged_svs.vcf

The 1000bp max_dist accounts for breakpoint position uncertainty across callers (Manta and GRIDSS resolve breakpoints more precisely than Delly/LUMPY). Requiring type_agree=1 prevents merging a deletion call with a duplication call at the same locus.
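After merging, it helps to see which caller combinations actually support the retained calls. SURVIVOR records this in the `SUPP_VEC` INFO field, one digit per input VCF in list order. The tally below is a minimal sketch; `tally_supp_vec` is a hypothetical helper and the toy records are invented so that the block runs without SURVIVOR installed.

```shell
# Tally merged calls by SUPP_VEC. Reads a SURVIVOR-merged VCF on stdin.
tally_supp_vec() {
    awk -F'\t' '
        /^#/ { next }                                  # skip header lines
        match($8, /SUPP_VEC=[01]+/) {
            # drop the 9-character "SUPP_VEC=" prefix, keep the 0/1 digits
            vec = substr($8, RSTART + 9, RLENGTH - 9)
            count[vec]++
        }
        END { for (v in count) print v "\t" count[v] }
    ' | sort
}

# Toy records (hypothetical), four callers per SUPP_VEC.
{
    printf '##fileformat=VCFv4.2\n'
    printf 'chr1\t1000\tsv1\tN\t<DEL>\t.\tPASS\tSUPP=3;SUPP_VEC=1110;SVTYPE=DEL\n'
    printf 'chr2\t5000\tsv2\tN\t<DUP>\t.\tPASS\tSUPP=2;SUPP_VEC=1100;SVTYPE=DUP\n'
    printf 'chr3\t9000\tsv3\tN\t<DEL>\t.\tPASS\tSUPP=2;SUPP_VEC=1100;SVTYPE=DEL\n'
} | tally_supp_vec

# On a real callset: tally_supp_vec < merged_svs.vcf
```

With the list order used above (Manta, Delly, GRIDSS, Smoove), a vector of `1100` means the call was supported by Manta and Delly only. A callset dominated by a single pair of callers may indicate that the other callers need parameter review.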
Filter SV Calls

    # Quality and size filters
    bcftools view -i 'QUAL >= 20' svs.vcf > svs.filtered.vcf
    bcftools view -i 'ABS(SVLEN) >= 50' svs.vcf > svs.min50.vcf

    # Filter by SV type
    bcftools view -i 'SVTYPE="DEL"' svs.vcf > deletions.vcf
    bcftools view -i 'SVTYPE="INS"' svs.vcf > insertions.vcf
    bcftools view -i 'SVTYPE="INV"' svs.vcf > inversions.vcf
    bcftools view -i 'SVTYPE="DUP"' svs.vcf > duplications.vcf
    bcftools view -i 'SVTYPE="BND"' svs.vcf > translocations.vcf

    # Keep only records with FILTER=PASS
    bcftools view -f PASS svs.vcf > svs.pass.vcf

Annotate SVs

    AnnotSV \
        -SVinputFile svs.vcf \
        -genomeBuild GRCh38 \
        -outputFile annotated_svs

    # Output includes: gene overlap, DGV frequency, gnomAD-SV population AF, ClinVar pathogenicity

SV Types
| Type | Code | Description | Typical Size Range |
|---|---|---|---|
| Deletion | DEL | Sequence removed | 50bp - 100Mb |
| Insertion | INS | Novel sequence inserted | 50bp - 10kb (short-read); unlimited (long-read) |
| Inversion | INV | Sequence orientation reversed | 1kb - 10Mb |
| Duplication | DUP | Sequence copied (tandem or dispersed) | 1kb - 10Mb |
| Translocation | BND | Breakend connecting different chromosomes | N/A (inter-chromosomal) |
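Callers do not all emit the same set of type codes, so it is worth checking which ones a callset contains before filtering by type. The sketch below counts records per `SVTYPE` value using only awk; `count_svtypes` is a hypothetical helper and the toy records are invented for illustration.

```shell
# Count records per SVTYPE. Reads an uncompressed VCF on stdin.
count_svtypes() {
    awk -F'\t' '
        /^#/ { next }                          # skip header lines
        {
            n = split($8, fields, ";")         # INFO is column 8
            for (i = 1; i <= n; i++) {
                if (fields[i] ~ /^SVTYPE=/) {
                    type = substr(fields[i], 8)    # text after "SVTYPE="
                    count[type]++
                }
            }
        }
        END { for (t in count) print t "\t" count[t] }
    ' | sort
}

# Toy records (hypothetical) so the sketch runs without a real callset.
{
    printf '##fileformat=VCFv4.2\n'
    printf 'chr1\t1000\tsv1\tN\t<DEL>\t.\tPASS\tSVTYPE=DEL;SVLEN=-300\n'
    printf 'chr1\t8000\tsv2\tN\t<DEL>\t.\tPASS\tEND=9500;SVTYPE=DEL\n'
    printf 'chr2\t5000\tsv3\tN\tN]chr7:100]\t.\tPASS\tSVTYPE=BND\n'
} | count_svtypes

# On a real callset: count_svtypes < svs.vcf
```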
Coverage Guidelines
| Coverage | Detection Ability | Practical Guidance |
|---|---|---|
| 10x | Large SVs only (>1kb) | Limited breakpoint accuracy; high false negative rate for SVs <1kb; suitable only for large deletion screening |
| 30x | Most SVs detected | Standard for WGS; good sensitivity for DEL/DUP/INV >300bp; moderate INS detection |
| 50x+ | Small SVs, precise breakpoints | Better sensitivity near repetitive regions; resolves complex SVs; recommended for clinical applications |
Below 30x, split-read evidence becomes sparse and callers rely more heavily on read-pair signals, which have lower breakpoint resolution (~300-500bp uncertainty vs ~10bp for split-reads).
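A back-of-envelope calculation shows why split-read evidence thins out at low depth. The model here is an assumption for illustration, not a figure from any caller: coverage is uniform, a read counts as a usable split read only if the breakpoint lies at least `m` bases from either end, and the variant is present at allele fraction `f`. Under those assumptions the expected number of split reads is coverage × f × (L − 2m) / L for read length L.

```shell
# Expected split reads spanning one breakpoint under a uniform-coverage model.
# $1 = coverage, $2 = read length, $3 = minimum anchor per side (bp),
# $4 = variant allele fraction (0.5 for a heterozygous germline SV)
expected_split_reads() {
    awk -v c="$1" -v len="$2" -v m="$3" -v f="$4" \
        'BEGIN { printf "%.1f\n", c * f * (len - 2 * m) / len }'
}

# 150bp reads, 20bp anchor on each side, heterozygous variant
for cov in 10 30 50; do
    printf '%sx: %s expected split reads\n' "$cov" \
        "$(expected_split_reads "$cov" 150 20 0.5)"
done
# 10x: 3.7, 30x: 11.0, 50x: 18.3
```

At 10x a heterozygous breakpoint is expected to be spanned by only about four usable split reads, before any loss to mapping quality or repeats, which is consistent with callers falling back on read-pair signal at low depth.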
Short-read vs Long-read Decision Framework
Short reads are sufficient for: deletions >300bp, balanced translocations, large tandem duplications, and population-scale screening where cost per sample matters.
Long reads are necessary for: insertions exceeding read length, complex/nested SVs, SVs in repetitive regions (segmental duplications, LINE/SINE elements), complete breakpoint resolution, and phased SV haplotyping.
Cost consideration: short reads for population-scale SV surveys (hundreds of samples), long reads for clinical-grade SV characterization where completeness matters more than throughput.
Long-Read SV Callers
| Caller | Best For | Key Strengths |
|---|---|---|
| Sniffles2 | ONT/HiFi general | 11.8x faster than v1; multi-sample population merging; mosaic SV detection; best overall accuracy |
| CuteSV2 | ONT data | Highest recall for ONT; signature-based clustering handles noisy reads |
| pbsv | PacBio HiFi | Official PacBio tool; best paired with PBMM2 aligner; tandem repeat aware |
| Severus | Somatic SVs | Phased breakpoint graph approach; resolves complex somatic rearrangements (Nature Biotechnology 2025) |
Recommended Aligner-Caller Pairings
- Minimap2 + CuteSV2: ONT general purpose; fastest end-to-end
- Winnowmap + Sniffles2: high accuracy in repetitive regions (Winnowmap downweights repetitive k-mers)
- PBMM2 + pbsv: PacBio HiFi data; PBMM2 produces the CIGAR strings pbsv expects
See long-read-sequencing/structural-variants for long-read SV calling workflows with full pipeline examples.
Related Skills
- long-read-sequencing/structural-variants - Long-read SV calling with Sniffles2, CuteSV, pbsv
- copy-number/cnvkit-analysis - Copy number variant detection (complements SV calling for dosage changes)
- variant-calling/filtering-best-practices - VCF filtering strategies applicable to SV callsets
- variant-calling/variant-annotation - Functional annotation of variants including SVs
- alignment-files/alignment-filtering - BAM preparation and quality filtering before SV calling