SciAgent-Skills cnvkit-copy-number

Detect somatic copy number variants (CNVs) from WES, WGS, or targeted sequencing BAM files with CNVkit v0.9.x. Pipeline: calculate bin-level coverage in target/antitarget regions, normalize against a reference, segment copy ratios with CBS or HMM, call amplifications and deletions, generate scatter/diagram plots, estimate tumor purity and ploidy, and export to VCF/SEG. Both CLI and Python API (cnvlib) shown. Use GATK CNV instead for deep WGS with population-scale controls; use CNVkit for targeted or exome sequencing where antitarget bins are critical.

install
source · Clone the upstream repo
git clone https://github.com/jaechang-hits/SciAgent-Skills
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/jaechang-hits/SciAgent-Skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/genomics-bioinformatics/cnvkit-copy-number" ~/.claude/skills/jaechang-hits-sciagent-skills-cnvkit-copy-number && rm -rf "$T"
manifest: skills/genomics-bioinformatics/cnvkit-copy-number/SKILL.md
source content

CNVkit Copy Number Analysis

Overview

CNVkit detects somatic copy number variants (CNVs) from whole-exome sequencing (WES), whole-genome sequencing (WGS), or targeted panel BAM files. It calculates read depth in both on-target (capture) bins and off-target (antitarget) bins, corrects for GC bias and library depth, segments the log2 copy ratio profile with circular binary segmentation (CBS) or a hidden Markov model (HMM), and calls amplifications and deletions. CNVkit provides both a CLI (

cnvkit.py
) and a Python API (
cnvlib
) for integration into analysis pipelines, and produces scatter plots, chromosome diagrams, heatmaps, and export files in VCF, BED, and SEG formats.

When to Use

  • Calling somatic copy number variants from tumor-normal paired exome (WES) or targeted panel sequencing
  • Detecting copy number alterations in tumor-only samples using a pooled normal reference
  • Running CNV analysis on whole-genome sequencing (WGS) data with the
    --method wgs
    mode
  • Estimating tumor purity and ploidy for samples where purity is unknown, to interpret copy ratio calls
  • Generating SEG format copy number files for GISTIC2, cBioPortal, or IGV visualization
  • Identifying focal amplifications (e.g., ERBB2, MYC) or homozygous deletions (e.g., CDKN2A, RB1)
  • Use GATK CNV (
    gatk DenoiseReadCounts
    /
    gatk ModelSegments
    ) instead for deep WGS cohorts with large matched panel-of-normals (PoN); CNVkit is better suited for targeted/exome data
  • Use Control-FREEC instead when you need allele-frequency-based B-allele fraction modeling alongside CNV calling

Prerequisites

  • Software: CNVkit v0.9.x, Python 3.8+, R (for CBS segmentation), samtools
  • Python packages:
    cnvlib
    (installed as part of CNVkit),
    matplotlib
    ,
    pandas
  • Input files: sorted, indexed BAM files (tumor ± matched normal); BED file of capture targets; reference genome FASTA; access to R with DNAcopy package for CBS
  • Data requirements: minimum ~50× mean target coverage for WES; WGS works at 20-30×

Check before installing: The tool may already be available in the current environment (e.g., inside a

pixi
/
conda
env). Run
command -v cnvkit.py
first and skip the install commands below if it returns a path. When running inside a pixi project, invoke the tool via
pixi run cnvkit.py
rather than bare
cnvkit.py
.

# Install CNVkit via conda (recommended — handles R/DNAcopy dependency)
conda install -c bioconda cnvkit

# Or via pip (requires R + DNAcopy already installed)
pip install cnvkit

# Verify
cnvkit.py version
# cnvkit 0.9.10

# Install R DNAcopy (for CBS segmentation)
Rscript -e 'if (!requireNamespace("BiocManager")) install.packages("BiocManager"); BiocManager::install("DNAcopy")'

# Index BAM files if not already indexed
samtools index tumor.bam
samtools index normal.bam

Quick Start

# One-command paired tumor/normal CNV analysis (WES)
cnvkit.py batch tumor.bam \
    --normal normal.bam \
    --targets targets.bed \
    --fasta GRCh38.fa \
    --output-dir cnvkit_results/ \
    --diagram --scatter \
    --method hybrid

# Output files in cnvkit_results/:
#   tumor.targetcoverage.cnn   — target bin coverage
#   tumor.antitargetcoverage.cnn — antitarget coverage
#   tumor.cnr                  — copy number ratios
#   tumor.cns                  — segmented copy numbers
#   tumor-scatter.png          — genome-wide scatter plot
#   tumor-diagram.pdf          — chromosome diagram
echo "CNV analysis complete"

Workflow

Step 1: Create Copy Number Reference

Build a reference from one or more normal BAM files. This corrects for systematic biases (GC content, mappability) and sets the neutral baseline.

# Option A: Paired normal reference (single matched normal)
cnvkit.py reference normal.targetcoverage.cnn normal.antitargetcoverage.cnn \
    --fasta GRCh38.fa \
    -o reference_normal.cnn

# Option B: Flat reference (no normal; uses GC/mappability correction only)
# Use when no matched normal is available
cnvkit.py reference \
    --targets targets.bed \
    --fasta GRCh38.fa \
    --output flat_reference.cnn

# Option C: Pooled normal reference from multiple normals (most robust)
cnvkit.py batch \
    normal1.bam normal2.bam normal3.bam \
    --normal \
    --targets targets.bed \
    --fasta GRCh38.fa \
    --output-reference pooled_reference.cnn \
    --output-dir normals_cov/

echo "Reference created: pooled_reference.cnn"

Step 2: Calculate Coverage in Target and Antitarget Bins

Bin the target BED file and compute per-bin read depth for tumor and normal samples.

# First, create accessible bins from the target BED
cnvkit.py target targets.bed \
    --annotate refFlat.txt \
    --split \
    -o targets.split.bed

cnvkit.py antitarget targets.bed \
    --access data/access-5k-mappable.hg38.bed \
    -o antitargets.bed

# Calculate coverage for tumor sample
cnvkit.py coverage tumor.bam targets.split.bed \
    -o tumor.targetcoverage.cnn

cnvkit.py coverage tumor.bam antitargets.bed \
    -o tumor.antitargetcoverage.cnn

echo "Coverage files:"
echo "  tumor.targetcoverage.cnn"
echo "  tumor.antitargetcoverage.cnn"
# Python API equivalent: compute coverage with cnvlib
import cnvlib

# Load and inspect coverage files
target_cov = cnvlib.read("tumor.targetcoverage.cnn")
antitarget_cov = cnvlib.read("tumor.antitargetcoverage.cnn")

print(f"Target bins: {len(target_cov)}")
print(f"Antitarget bins: {len(antitarget_cov)}")
print(f"Mean target depth: {target_cov['depth'].mean():.1f}×")
print(f"Median target depth: {target_cov['depth'].median():.1f}×")

# Check coverage distribution
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(8, 4))
target_cov["depth"].clip(upper=500).hist(bins=50, ax=ax, color="#2c6fad", alpha=0.7)
ax.set_xlabel("Read depth")
ax.set_ylabel("Bin count")
ax.set_title("Target bin coverage distribution")
plt.tight_layout()
plt.savefig("coverage_distribution.png", dpi=150)
print("Saved: coverage_distribution.png")

Step 3: Normalize and Correct Copy Ratios

Normalize tumor coverage against the reference (correcting for GC bias, library depth, and target efficiency).

# Fix sample-level biases relative to the reference
cnvkit.py fix \
    tumor.targetcoverage.cnn \
    tumor.antitargetcoverage.cnn \
    pooled_reference.cnn \
    -o tumor.cnr

echo "Copy number ratios: tumor.cnr"
# tumor.cnr columns: chromosome, start, end, gene, log2, depth, weight
# Inspect copy ratio file with cnvlib Python API
import cnvlib
import pandas as pd

cnr = cnvlib.read("tumor.cnr")
print(f"Total bins: {len(cnr)}")
print(f"Chromosomes: {sorted(cnr.chromosome.unique())}")

# Convert to DataFrame and inspect
df = cnr.data
print(f"\nLog2 copy ratio summary:")
print(df["log2"].describe().round(3))

# Flag high-amplitude events
high_amp = df[df["log2"] >= 2.0]
hom_del  = df[df["log2"] <= -3.0]
print(f"\nHigh amplitude bins (log2 >= 2.0): {len(high_amp)}")
print(f"Homozygous deletion bins (log2 <= -3.0): {len(hom_del)}")
if not high_amp.empty:
    print(high_amp[["chromosome", "start", "end", "gene", "log2"]].head())

Step 4: Segment Copy Number Ratios

Identify contiguous regions of similar copy ratio using CBS (Circular Binary Segmentation) or HMM segmentation.

# CBS segmentation (default; requires R DNAcopy)
cnvkit.py segment tumor.cnr \
    -o tumor.cns \
    --method cbs

# HMM segmentation (no R required; faster)
cnvkit.py segment tumor.cnr \
    -o tumor.hmm.cns \
    --method hmm

echo "Segments: tumor.cns"
# tumor.cns columns: chromosome, start, end, gene, log2, cn, depth, p_ttest, weight, probes
# Python API: segment and inspect
import cnvlib
import subprocess

# Run segmentation via Python subprocess (mirrors CLI)
subprocess.run(
    ["cnvkit.py", "segment", "tumor.cnr", "-o", "tumor.cns", "--method", "cbs"],
    check=True
)

# Load and analyze segments
cns = cnvlib.read("tumor.cns")
df_seg = cns.data
print(f"Total segments: {len(df_seg)}")
print(f"\nSegment log2 summary:")
print(df_seg["log2"].describe().round(3))

# Large segments (>5 Mb) with copy gain or loss
large_events = df_seg[
    ((df_seg["end"] - df_seg["start"]) > 5_000_000) &
    (df_seg["log2"].abs() > 0.3)
].copy()
large_events["size_mb"] = (large_events["end"] - large_events["start"]) / 1e6
print(f"\nLarge CNV segments (>5 Mb, |log2|>0.3): {len(large_events)}")
print(large_events[["chromosome", "start", "end", "log2", "size_mb"]].head(8).to_string(index=False))

Step 5: Call CNV States

Assign integer copy number states and classify amplifications and deletions.

# Call with default thresholds (diploid normal)
cnvkit.py call tumor.cns \
    -o tumor.call.cns \
    --ploidy 2

# Call with tumor purity estimate (if known)
cnvkit.py call tumor.cns \
    --purity 0.7 \
    --ploidy 2 \
    -o tumor.call.purity.cns

echo "Called CNVs: tumor.call.cns"
# Parse called CNV file and classify events
import cnvlib
import pandas as pd

cns_called = cnvlib.read("tumor.call.cns")
df = cns_called.data

# Classify by log2 thresholds (diploid assumed)
# log2 >= 1.0 = high-level amplification (CN >= 4)
# log2 0.2–1.0 = copy gain (CN = 3)
# log2 -1.0–-0.2 = heterozygous deletion (CN = 1)
# log2 <= -3.5 = homozygous deletion (CN = 0)

def classify_cnv(log2):
    if log2 >= 1.0:    return "AMP"
    if log2 >= 0.2:    return "GAIN"
    if log2 <= -3.5:   return "HOMDEL"
    if log2 <= -1.0:   return "LOSS"
    return "NEUTRAL"

df["cnv_class"] = df["log2"].apply(classify_cnv)
print("CNV class counts:")
print(df["cnv_class"].value_counts().to_string())

# Focal amplifications in known oncogenes
oncogenes = ["ERBB2", "MYC", "EGFR", "CCND1", "CDK6", "MDM2", "KRAS"]
focal_amps = df[(df["cnv_class"] == "AMP") &
                (df["gene"].str.split(",").apply(
                    lambda genes: any(g in oncogenes for g in genes)))]
if not focal_amps.empty:
    print(f"\nFocal amplifications in oncogenes:")
    print(focal_amps[["chromosome", "start", "end", "gene", "log2", "cn"]].to_string(index=False))

Step 6: Visualize CNV Profile

Generate scatter plots and chromosome diagrams to review the copy number landscape.

# Genome-wide scatter plot (CNR bins + segments)
cnvkit.py scatter tumor.cnr \
    -s tumor.cns \
    -o tumor-scatter.png

# Chromosome diagram (color-coded by CN state)
cnvkit.py diagram tumor.cnr \
    -s tumor.cns \
    -o tumor-diagram.pdf

# Heatmap across multiple samples
cnvkit.py heatmap sample1.cns sample2.cns sample3.cns \
    -o cohort_heatmap.pdf

echo "Plots saved: tumor-scatter.png, tumor-diagram.pdf, cohort_heatmap.pdf"
# Custom scatter plot with matplotlib highlighting specific genes
import cnvlib
import matplotlib.pyplot as plt
import numpy as np

cnr = cnvlib.read("tumor.cnr")
cns = cnvlib.read("tumor.cns")

# Plot chr7 (EGFR locus) in detail
fig, ax = plt.subplots(figsize=(12, 4))
chrom = "chr7"
cnr_chr = cnr.data[cnr.data["chromosome"] == chrom]
cns_chr = cns.data[cns.data["chromosome"] == chrom]

# Bin dots
ax.scatter(cnr_chr["start"], cnr_chr["log2"],
           s=3, alpha=0.3, color="#aaa", label="Bins")
# Segment lines
for _, seg in cns_chr.iterrows():
    ax.hlines(seg["log2"], seg["start"], seg["end"],
              colors=("#d32f2f" if seg["log2"] > 0.3 else
                      "#1565c0" if seg["log2"] < -0.3 else "#666"),
              lw=3, label="_nolegend_")

# Mark EGFR
egfr_start, egfr_end = 55_019_017, 55_211_628
ax.axvspan(egfr_start, egfr_end, alpha=0.15, color="orange")
ax.text((egfr_start + egfr_end) / 2, ax.get_ylim()[1] * 0.85,
        "EGFR", ha="center", fontsize=9, color="darkorange")

ax.axhline(0, color="black", lw=0.8, ls="--")
ax.axhline(0.585, color="#d32f2f", lw=0.5, ls=":")   # gain threshold
ax.axhline(-1.0, color="#1565c0", lw=0.5, ls=":")    # loss threshold
ax.set_xlabel("Chromosome 7 position (bp)")
ax.set_ylabel("Log2 copy ratio")
ax.set_title(f"CNV Profile — {chrom}")
plt.tight_layout()
plt.savefig("chr7_cnv_scatter.png", dpi=150, bbox_inches="tight")
print("Saved: chr7_cnv_scatter.png")

Step 7: Estimate Tumor Purity and Ploidy

Use CNVkit's purity/ploidy estimation to interpret absolute copy numbers.

# Estimate purity and ploidy from the segmented CNV profile
cnvkit.py call tumor.cns \
    --purity auto \
    --ploidy 2 \
    --method clonal \
    -o tumor.call.auto.cns \
    --center median

# Print purity/ploidy estimate embedded in header
head -5 tumor.call.auto.cns
# Python API: purity/ploidy estimation with cnvlib
import cnvlib
from cnvlib import segmetrics

cns = cnvlib.read("tumor.cns")
cnr = cnvlib.read("tumor.cnr")

# Compute segment-level statistics
cns_with_stats = segmetrics.do_segmetrics(cnr, cns,
    location_stats=["mean", "median"],
    spread_stats=["stdev"])

# Export stats to CSV for review
df = cns_with_stats.data
df.to_csv("tumor_segment_stats.csv", index=False)
print(f"Segment stats saved: tumor_segment_stats.csv")
print(f"Segments: {len(df)}")
print(df[["chromosome", "start", "end", "log2", "cn", "mean", "stdev"]].head(8).to_string(index=False))

Step 8: Export to VCF, BED, and SEG Formats

Export CNV calls for downstream tools (GISTIC2, cBioPortal, IGV, clinical reporting).

# Export to VCF (for clinical variant databases)
cnvkit.py export vcf tumor.call.cns \
    -o tumor.cnv.vcf

# Export to SEG format (for GISTIC2 and cBioPortal)
cnvkit.py export seg tumor.call.cns \
    -o tumor.seg

# Export to BED format (for bedtools/IGV)
cnvkit.py export bed tumor.call.cns \
    -o tumor.cnv.bed

echo "Exported:"
echo "  tumor.cnv.vcf   — VCF format"
echo "  tumor.seg       — SEG format (GISTIC2 / cBioPortal)"
echo "  tumor.cnv.bed   — BED format (IGV / bedtools)"
# Parse and summarize SEG file
import pandas as pd

seg = pd.read_csv("tumor.seg", sep="\t",
                  names=["sample", "chrom", "start", "end", "n_probes", "log2"],
                  comment="#")
print(f"SEG file: {len(seg)} segments")
print(seg.head())

# Count amplifications and deletions by chromosome arm
amp_count = (seg["log2"] >= 0.585).sum()
del_count  = (seg["log2"] <= -1.0).sum()
print(f"\nAmplifications (log2 >= 0.585): {amp_count}")
print(f"Deletions (log2 <= -1.0):       {del_count}")
seg.to_csv("tumor_seg_annotated.csv", index=False)

Key Parameters

ParameterDefaultRange / OptionsEffect
--method
(batch/coverage)
"hybrid"
"hybrid"
,
"wgs"
,
"amplicon"
Sequencing method; selects target binning strategy
--segment-method
(segment)
"cbs"
"cbs"
,
"hmm"
,
"haar"
,
"none"
Segmentation algorithm; CBS requires R DNAcopy
--ploidy
(call)
2
1
6
Assumed baseline ploidy for absolute CN calling
--purity
(call)
1.0
0.1
1.0
or
"auto"
Tumor cell fraction; corrects log2 ratios for admixed normal
--target-avg-size
(target)
200
(WES)
50
500
bp
Desired mean target bin size after splitting
--antitarget-avg-size
150000
10000
500000
bp
Antitarget bin size (larger = fewer bins, less noise)
--drop-low-coverage
(segment)
offflagDrop bins with depth < 5× before segmentation
-p
/
--processes
(coverage)
1
1
–CPU count
Parallel processes for coverage calculation
--scatter
(batch)
offflagAutomatically generate scatter plot
--diagram
(batch)
offflagAutomatically generate chromosome diagram

Common Recipes

Recipe: Tumor-Only Analysis with Flat Reference

When to use: No matched normal is available; use a flat GC/mappability-corrected reference.

# Step 1: Create flat reference
cnvkit.py reference \
    --targets targets.bed \
    --fasta GRCh38.fa \
    --output flat_reference.cnn

# Step 2: Run batch on tumor-only
cnvkit.py batch tumor.bam \
    --reference flat_reference.cnn \
    --output-dir tumor_only_results/

echo "Tumor-only results: tumor_only_results/"

Recipe: WGS Mode for Whole-Genome Data

When to use: Input is whole-genome sequencing (no capture BED needed).

# WGS mode uses a uniform genome-wide bin grid
cnvkit.py batch tumor_wgs.bam \
    --normal normal_wgs.bam \
    --method wgs \
    --fasta GRCh38.fa \
    --output-dir wgs_results/ \
    --scatter

echo "WGS CNV analysis complete: wgs_results/"

Recipe: Snakemake Integration for Multi-Sample Cohort

When to use: Automate CNVkit for a cohort of paired tumor/normal samples.

# Snakefile — CNVkit batch pipeline
configfile: "config.yaml"
SAMPLES = config["tumor_samples"]   # list of sample names
GENOME  = config["genome_fasta"]
TARGETS = config["targets_bed"]
POOLED  = config["pooled_reference"]  # pre-built pooled normal reference

rule all:
    input:
        expand("cnvkit/{sample}.call.cns", sample=SAMPLES),
        expand("cnvkit/{sample}-scatter.png", sample=SAMPLES),

rule cnvkit_batch:
    input:
        tumor  = "bam/{sample}.tumor.bam",
        ref    = POOLED,
    output:
        cnr    = "cnvkit/{sample}.cnr",
        cns    = "cnvkit/{sample}.cns",
        call   = "cnvkit/{sample}.call.cns",
        scatter = "cnvkit/{sample}-scatter.png",
    threads: 4
    shell:
        """
        cnvkit.py batch {input.tumor} \
            --reference {input.ref} \
            --targets {TARGETS} \
            --fasta {GENOME} \
            --output-dir cnvkit/ \
            --scatter --processes {threads}
        cnvkit.py call cnvkit/{wildcards.sample}.cns \
            --ploidy 2 -o {output.call}
        """

Recipe: Export Gene-Level CNV Table

When to use: Summarize copy number status for a panel of cancer genes from a called CNS file.

import cnvlib
import pandas as pd

cns = cnvlib.read("tumor.call.cns")
df = cns.data

# Cancer genes of interest
cancer_genes = ["ERBB2", "MYC", "EGFR", "CDKN2A", "RB1", "TP53",
                "KRAS", "PTEN", "BRCA1", "BRCA2", "MDM2", "CDK4"]

rows = []
for gene in cancer_genes:
    hits = df[df["gene"].str.contains(gene, na=False)]
    if hits.empty:
        rows.append({"gene": gene, "log2_mean": float("nan"), "cn": "n/a", "status": "not_detected"})
    else:
        best = hits.loc[hits["log2"].abs().idxmax()]
        status = ("AMP" if best["log2"] >= 1.0 else
                  "GAIN" if best["log2"] >= 0.2 else
                  "HOMDEL" if best["log2"] <= -3.5 else
                  "LOSS" if best["log2"] <= -1.0 else "NEUTRAL")
        rows.append({"gene": gene, "log2_mean": round(best["log2"], 3),
                     "cn": best.get("cn", "?"), "status": status})

result_df = pd.DataFrame(rows)
print(result_df.to_string(index=False))
result_df.to_csv("cancer_gene_cnv_summary.csv", index=False)
print("\nSaved: cancer_gene_cnv_summary.csv")

Expected Outputs

Output FileFormatDescription
tumor.targetcoverage.cnn
CNNPer-bin target region read depth and log2 coverage
tumor.antitargetcoverage.cnn
CNNPer-bin antitarget region coverage (off-target reads)
tumor.cnr
CNRNormalized log2 copy ratio per bin (GC/depth corrected)
tumor.cns
CNSSegmented copy ratios (CBS/HMM output); one row per segment
tumor.call.cns
CNSCalled copy number states with integer CN and purity adjustment
tumor-scatter.png
PNGGenome-wide scatter plot of bins + segment overlays
tumor-diagram.pdf
PDFChromosome arm diagram color-coded by CN state
tumor.seg
SEGGISTIC2/cBioPortal-format segment file
tumor.cnv.vcf
VCFVCF-format CNV calls for clinical databases

Troubleshooting

ProblemCauseSolution
High noise in antitarget binsLow off-target coverage (<0.1× mean)Increase
--antitarget-avg-size
to 500kb; use
--drop-low-coverage
to exclude low bins before segmentation
CBS segmentation fails with
Error in DNAcopy
R DNAcopy not installed or incompatibleInstall via
BiocManager::install("DNAcopy")
; alternatively use
--method hmm
(no R required)
All segments near 0 (no CNVs detected)Low tumor purity (<20%) or shallow coverageVerify purity with
cnvkit.py call --purity auto
; check target coverage depth with
cnvkit.py coverage
Wavy GC bias across chromosomesGC normalization failed or flat reference used with biased tumorRebuild reference with matched normal BAMs; check
cnvkit.py reference
includes
--fasta
for GC correction
Many very short segments (over-segmentation)CBS threshold too sensitiveAdd
--threshold 0.2
or
--smooth-cbs
to reduce false segment boundaries; increase minimum segment size
ImportError: No module named cnvlib
CNVkit not installed in active environmentActivate the correct conda env:
conda activate cnvkit_env
; verify
cnvkit.py version
Chromosome naming mismatchBAM uses
1
but reference uses
chr1
or vice versa
Ensure BAM, BED, and FASTA all use the same chromosome naming convention

References