SciAgent-Skills deeptools-ngs-analysis
NGS analysis CLI toolkit for ChIP-seq, RNA-seq, ATAC-seq. BAM→bigWig conversion with normalization (RPGC, CPM, RPKM), sample correlation/PCA, heatmaps and profile plots around genomic features, enrichment fingerprints. For alignment use STAR/BWA; for peak calling use MACS2.
git clone https://github.com/jaechang-hits/SciAgent-Skills
T=$(mktemp -d) && git clone --depth=1 https://github.com/jaechang-hits/SciAgent-Skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/genomics-bioinformatics/deeptools-ngs-analysis" ~/.claude/skills/jaechang-hits-sciagent-skills-deeptools-ngs-analysis && rm -rf "$T"
skills/genomics-bioinformatics/deeptools-ngs-analysis/SKILL.mddeepTools — NGS Data Analysis Toolkit
Overview
deepTools is a command-line toolkit for processing and visualizing high-throughput sequencing data. It converts BAM alignments to normalized coverage tracks (bigWig), performs quality control (correlation, PCA, fingerprint), and generates publication-quality heatmaps and profile plots around genomic features. Supports ChIP-seq, RNA-seq, ATAC-seq, and MNase-seq.
When to Use
- Converting BAM files to normalized bigWig coverage tracks
- Comparing ChIP-seq treatment vs input control (log2 ratio tracks)
- Assessing sample quality: replicate correlation, PCA, coverage depth
- Evaluating ChIP enrichment strength (fingerprint plots)
- Creating heatmaps and profile plots around TSS, peaks, or other genomic regions
- Analyzing ATAC-seq data with Tn5 offset correction
- Generating strand-specific RNA-seq coverage tracks
- For read alignment, use STAR, BWA, or bowtie2 instead
- For peak calling, use MACS2 or HOMER instead
- For BAM/VCF file manipulation, use pysam instead
Prerequisites
pip install deeptools # Verify installation bamCoverage --version
Input requirements: BAM files must be sorted and indexed (
.bai file present). Generate index with samtools index input.bam. BED files for genomic regions (genes, peaks) in standard 3+ column format.
Quick Start
# Convert BAM to normalized bigWig bamCoverage --bam sample.bam --outFileName sample.bw \ --normalizeUsing RPGC --effectiveGenomeSize 2913022398 \ --binSize 10 --numberOfProcessors 8 # Create heatmap around TSS computeMatrix reference-point -S sample.bw -R genes.bed \ -b 3000 -a 3000 --referencePoint TSS -o matrix.gz plotHeatmap -m matrix.gz -o heatmap.png --colorMap RdBu
Core API
1. BAM to Coverage Conversion
Convert BAM alignments to normalized coverage tracks (bigWig or bedGraph).
# Basic conversion with RPGC normalization bamCoverage --bam input.bam --outFileName output.bw \ --normalizeUsing RPGC --effectiveGenomeSize 2913022398 \ --binSize 10 --numberOfProcessors 8 \ --extendReads 200 --ignoreDuplicates # CPM normalization (simpler, no genome size needed) bamCoverage --bam input.bam --outFileName output.bw \ --normalizeUsing CPM --binSize 10 -p 8 # RNA-seq: strand-specific coverage bamCoverage --bam rnaseq.bam --outFileName forward.bw \ --filterRNAstrand forward --normalizeUsing CPM -p 8 # IMPORTANT: Never use --extendReads for RNA-seq (spans splice junctions)
2. Sample Comparison
Compare treatment vs control or generate ratio tracks.
# Log2 ratio: treatment / control bamCompare -b1 treatment.bam -b2 control.bam -o log2ratio.bw \ --operation log2 --scaleFactorsMethod readCount \ --extendReads 200 -p 8 # Subtract control from treatment bamCompare -b1 treatment.bam -b2 control.bam -o subtract.bw \ --operation subtract --scaleFactorsMethod readCount
3. Quality Control
Assess sample quality, replicate concordance, and enrichment strength.
# Sample correlation heatmap multiBamSummary bins --bamfiles rep1.bam rep2.bam rep3.bam \ -o counts.npz --binSize 10000 -p 8 plotCorrelation -in counts.npz --corMethod pearson \ --whatToShow heatmap -o correlation.png # Good: replicates cluster with r > 0.9 # PCA of samples plotPCA -in counts.npz -o pca.png --plotTitle "Sample PCA" # ChIP enrichment fingerprint plotFingerprint -b input.bam chip.bam -o fingerprint.png \ --extendReads 200 --ignoreDuplicates # Good ChIP: steep rise curve; flat diagonal = poor enrichment # Coverage depth assessment plotCoverage -b sample.bam -o coverage.png --ignoreDuplicates -p 8 # Fragment size distribution (paired-end) bamPEFragmentSize -b sample.bam -o fragsize.png
4. Heatmaps and Profile Plots
Visualize signal around genomic features (TSS, peaks, gene bodies).
# Reference-point mode: signal around TSS computeMatrix reference-point -S chip.bw -R genes.bed \ -b 3000 -a 3000 --referencePoint TSS -o matrix.gz -p 8 # Scale-regions mode: signal across gene bodies computeMatrix scale-regions -S chip.bw -R genes.bed \ -b 1000 -a 1000 --regionBodyLength 5000 -o matrix.gz -p 8 # Generate heatmap plotHeatmap -m matrix.gz -o heatmap.png \ --colorMap RdBu --kmeans 3 --sortUsing mean # Generate profile plot plotProfile -m matrix.gz -o profile.png \ --plotType lines --colors blue red # Multiple signal files: compare marks computeMatrix reference-point -S h3k4me3.bw h3k27me3.bw -R genes.bed \ -b 3000 -a 3000 --referencePoint TSS -o multi_matrix.gz plotHeatmap -m multi_matrix.gz -o multi_heatmap.png
5. Read Filtering and Processing
Filter reads before analysis or correct for assay-specific biases.
# Filter by mapping quality and fragment size alignmentSieve --bam input.bam --outFile filtered.bam \ --minMappingQuality 10 --minFragmentLength 150 \ --maxFragmentLength 700 # ATAC-seq: apply Tn5 offset correction (+4/-5 bp shift) alignmentSieve --bam atac.bam --outFile shifted.bam --ATACshift # Then index: samtools index shifted.bam # GC bias correction (only if significant bias detected) computeGCBias -b input.bam --effectiveGenomeSize 2913022398 \ -g genome.2bit --GCbiasFrequenciesFile gc_freq.txt -p 8 correctGCBias -b input.bam --effectiveGenomeSize 2913022398 \ --GCbiasFrequenciesFile gc_freq.txt -o corrected.bam
6. Enrichment Analysis
Quantify signal enrichment at specific regions.
# Signal enrichment at peak regions plotEnrichment -b chip.bam input.bam --BED peaks.bed \ -o enrichment.png --ignoreDuplicates -p 8
Key Concepts
Normalization Methods
| Method | Formula | When to Use | Requires |
|---|---|---|---|
| RPGC | 1× genome coverage | ChIP-seq, ATAC-seq | |
| CPM | Counts per million | Any assay, quick comparison | Nothing |
| RPKM | Per kb per million | RNA-seq gene-level | Nothing |
| BPM | Bins per million | Similar to CPM | Nothing |
| None | Raw counts | Not recommended for comparison | Nothing |
Rule: Use RPGC for ChIP-seq/ATAC-seq (accounts for genome size). Use CPM for quick comparisons. Use RPKM for RNA-seq gene-level analysis.
Effective Genome Sizes
| Organism | Assembly | Effective Size |
|---|---|---|
| Human | GRCh38/hg38 | 2,913,022,398 |
| Mouse | GRCm38/mm10 | 2,652,783,500 |
| Zebrafish | GRCz11 | 1,368,780,147 |
| Drosophila | dm6 | 142,573,017 |
| C. elegans | ce10/ce11 | 100,286,401 |
computeMatrix Modes
| Mode | Use When | Key Params |
|---|---|---|
| Signal around a fixed point (TSS, peak summit) | , , |
| Signal across variable-length features (gene bodies) | , , |
Common Workflows
Workflow: ChIP-seq QC and Visualization
#!/bin/bash # Complete ChIP-seq QC + visualization pipeline CHIP="chip.bam" INPUT="input.bam" GENES="genes.bed" PEAKS="peaks.bed" GSIZE=2913022398 THREADS=8 # 1. QC: sample correlation multiBamSummary bins --bamfiles $INPUT $CHIP -o summary.npz -p $THREADS plotCorrelation -in summary.npz --corMethod pearson --whatToShow heatmap -o correlation.png # 2. QC: enrichment fingerprint plotFingerprint -b $INPUT $CHIP -o fingerprint.png --extendReads 200 --ignoreDuplicates # 3. Convert to normalized bigWig bamCoverage --bam $CHIP --outFileName chip.bw --normalizeUsing RPGC \ --effectiveGenomeSize $GSIZE --extendReads 200 --ignoreDuplicates -p $THREADS # 4. Log2 ratio track bamCompare -b1 $CHIP -b2 $INPUT -o log2ratio.bw --operation log2 \ --scaleFactorsMethod readCount --extendReads 200 -p $THREADS # 5. Heatmap at TSS computeMatrix reference-point -S chip.bw log2ratio.bw -R $GENES \ -b 3000 -a 3000 --referencePoint TSS -o tss_matrix.gz -p $THREADS plotHeatmap -m tss_matrix.gz -o tss_heatmap.png --colorMap RdBu --kmeans 3 # 6. Profile at peaks computeMatrix reference-point -S chip.bw -R $PEAKS \ -b 2000 -a 2000 -o peak_matrix.gz -p $THREADS plotProfile -m peak_matrix.gz -o peak_profile.png
Workflow: ATAC-seq Analysis
#!/bin/bash ATAC="atac.bam" PEAKS="atac_peaks.bed" GSIZE=2913022398 THREADS=8 # 1. Apply Tn5 offset correction (+4/-5 bp) alignmentSieve --bam $ATAC --outFile shifted.bam --ATACshift -p $THREADS samtools index shifted.bam # 2. Generate RPGC-normalized coverage bamCoverage --bam shifted.bam --outFileName atac.bw \ --normalizeUsing RPGC --effectiveGenomeSize $GSIZE \ --binSize 5 --extendReads -p $THREADS # 3. Check nucleosome periodicity (expect 200bp/400bp peaks) bamPEFragmentSize -b shifted.bam -o fragsize.png \ --maxFragmentLength 1000 --binSize 1 # 4. Heatmap at ATAC peaks computeMatrix reference-point -S atac.bw -R $PEAKS \ -b 2000 -a 2000 -o atac_matrix.gz -p $THREADS plotHeatmap -m atac_matrix.gz -o atac_heatmap.png --colorMap Blues --kmeans 2
Key Parameters
| Parameter | Tool(s) | Default | Range | Effect |
|---|---|---|---|---|
| bamCoverage, bamCompare | None | RPGC, CPM, RPKM, BPM, None | Coverage normalization method |
| bamCoverage, bamCompare | — | See table above | Required for RPGC normalization |
| bamCoverage, multiBamSummary | 50 | 1–10000 | Resolution in bp; smaller = larger files |
| bamCoverage, bamCompare | False | integer (bp) | Extend to fragment length (ChIP: YES, RNA: NO) |
| Most tools | False | True/False | Remove PCR duplicates |
| Most tools | 1 | 1–N cores | Parallel processing |
| bamCompare | log2 | log2, ratio, subtract, add, mean, reciprocal_ratio | Sample comparison operation |
| computeMatrix | TSS | TSS, TES, center | Anchor point for reference-point mode |
/ | computeMatrix | 500 | 100–10000 bp | Upstream/downstream distance from reference |
| plotHeatmap | None | 1–20 | Number of clusters for heatmap rows |
| Most tools | None | 0–60 | Minimum alignment quality filter |
Best Practices
-
Always extend reads for ChIP-seq: Use
(or actual fragment length) — ChIP fragments are longer than reads.--extendReads 200 -
Never extend reads for RNA-seq:
would span splice junctions, creating artifacts.--extendReads -
Anti-pattern — comparing with different normalizations: Always use the same normalization method across all samples in a comparison.
-
Use
for parameter testing: Test on a single chromosome (--region
) before running on the full genome — saves hours.--region chr1:1-10000000 -
Always use
: Most tools parallelize well — use all available cores.--numberOfProcessors -
Anti-pattern — using RPGC without
: Will silently produce wrong results. Always specify the correct genome size.--effectiveGenomeSize -
Run QC before analysis: Check fingerprint and correlation before investing time in heatmaps/profiles. Poor enrichment means downstream visualizations will be noise.
Common Recipes
Recipe: Multi-Sample Correlation Matrix
# Compare 6 samples across the genome multiBamSummary bins --bamfiles sample{1..6}.bam \ -o all_samples.npz --binSize 10000 -p 8 \ --labels S1 S2 S3 S4 S5 S6 # Pearson correlation heatmap plotCorrelation -in all_samples.npz --corMethod pearson \ --whatToShow heatmap -o pearson_corr.png --plotNumbers # Spearman correlation + PCA plotCorrelation -in all_samples.npz --corMethod spearman \ --whatToShow heatmap -o spearman_corr.png plotPCA -in all_samples.npz -o pca.png
Recipe: Gene Body Coverage Profile
# Scale-regions mode for gene body analysis computeMatrix scale-regions -S sample.bw -R genes.bed \ -b 1000 -a 1000 --regionBodyLength 5000 -o gene_body.gz -p 8 plotProfile -m gene_body.gz -o gene_body_profile.png \ --plotType lines --perGroup
Troubleshooting
| Problem | Cause | Solution |
|---|---|---|
| Missing file | Run |
| Out of memory | Large genome, small bin size | Increase ; process with |
| Very slow processing | Single-threaded execution | Add (or available cores) |
| bigWig files very large | Bin size too small | Increase or larger |
| Flat ChIP fingerprint | Poor ChIP enrichment | Biological issue — consider repeating ChIP experiment |
| RNA-seq artifacts at exon boundaries | used with RNA-seq | Remove for RNA-seq data |
| ATAC-seq signal offset | Missing Tn5 correction | Apply before analysis |
| Mismatched genome assemblies | BAM and BED use different assemblies | Verify both use same genome build (hg38 vs hg19) |
Related Skills
- pysam-genomic-files — programmatic BAM/VCF manipulation for custom filtering before deepTools
- matplotlib-scientific-plotting — customize deepTools output figures beyond built-in options
References
- deepTools documentation — official user guide and tool reference
- deepTools Galaxy — web-based interface
- Ramirez et al. (2016) "deepTools2: a next generation web server for deep-sequencing data analysis" — Nucleic Acids Research