LLMs-Universal-Life-Science-and-Clinical-Skills- long-read-qc


install

source · Clone the upstream repo

git clone https://github.com/mdbabumiamssm/LLMs-Universal-Life-Science-and-Clinical-Skills-

Claude Code · Install into ~/.claude/skills/

T=$(mktemp -d) && git clone --depth=1 https://github.com/mdbabumiamssm/LLMs-Universal-Life-Science-and-Clinical-Skills- "$T" && mkdir -p ~/.claude/skills && cp -r "$T/Skills/Genomics/long-read-sequencing/long-read-qc" ~/.claude/skills/mdbabumiamssm-llms-universal-life-science-and-clinical-skills-long-read-qc && rm -rf "$T"

manifest: Skills/Genomics/long-read-sequencing/long-read-qc/SKILL.md

source content

name: bio-longread-qc
description: Quality control for long-read sequencing data using NanoPlot, NanoStat, and chopper. Generate QC reports, filter reads by length and quality, and visualize read characteristics. Use when assessing ONT or PacBio run quality or filtering reads before assembly or alignment.
tool_type: cli
primary_tool: nanoplot
measurable_outcome: Execute skill workflow successfully with valid output within 15 minutes.
allowed-tools:
  - read_file
  - run_shell_command

Long-Read Quality Control

NanoPlot - Visualization

# From FASTQ
NanoPlot --fastq reads.fastq.gz -o nanoplot_output -t 4

# From BAM
NanoPlot --bam aligned.bam -o nanoplot_output -t 4

# From sequencing summary (fastest)
NanoPlot --summary sequencing_summary.txt -o nanoplot_output

NanoPlot - Common Options

# Comments cannot follow a trailing backslash, so the options are described here:
#   --N50                      show the N50 mark in plots
#   --plots                    plot types
#   --format                   output formats
#   --maxlength / --minlength  max / min read length for plots
NanoPlot --fastq reads.fastq.gz \
    -o nanoplot_output \
    -t 8 \
    --N50 \
    --title "Sample QC" \
    --plots hex dot \
    --format png pdf \
    --color darkblue \
    --maxlength 50000 \
    --minlength 500

NanoStat - Statistics Only

# Quick statistics (no plots)
NanoStat --fastq reads.fastq.gz --threads 4

# From BAM
NanoStat --bam aligned.bam --threads 4

# Output to file
NanoStat --fastq reads.fastq.gz --threads 4 > qc_stats.txt

chopper - Filter Reads

# Filter by length and quality: keep reads with quality >= 10 and length >= 1000 bp
gunzip -c reads.fastq.gz | chopper -q 10 -l 1000 | gzip > filtered.fastq.gz

chopper - Common Options

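# Comments cannot follow a trailing backslash, so the options are described here:
#   --quality                  min quality
#   --minlength / --maxlength  min / max read length
#   --headcrop / --tailcrop    bases removed from the start / end of each read
gunzip -c reads.fastq.gz | chopper \
    --quality 10 \
    --minlength 1000 \
    --maxlength 50000 \
    --headcrop 50 \
    --tailcrop 50 \
    --threads 4 \
    | gzip > filtered.fastq.gz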
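
The thresholds above can be read as a simple per-read predicate. The following is a minimal sketch of that idea, not chopper's implementation: the function name `passes_filter`, the inclusive upper bound, and the example reads are all hypothetical, and cropping is left out.

```python
def passes_filter(length, mean_quality, min_quality=10.0,
                  min_length=1000, max_length=50000):
    """Return True if a read meets the length and quality thresholds.

    Assumption: all bounds are treated as inclusive here.
    """
    return mean_quality >= min_quality and min_length <= length <= max_length


# Hypothetical (length, mean quality) pairs, for illustration only
reads = [(800, 12.5), (1500, 9.2), (4200, 14.1), (60000, 15.0)]
kept = [read for read in reads if passes_filter(*read)]
print(kept)
```

Only the third read survives: the first is too short, the second is below the quality cutoff, and the fourth exceeds the maximum length.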

NanoFilt - Alternative Filter

# Filter with NanoFilt
gunzip -c reads.fastq.gz | NanoFilt -q 10 -l 1000 | gzip > filtered.fastq.gz

# With more options
gunzip -c reads.fastq.gz | NanoFilt \
    --quality 10 \
    --length 1000 \
    --maxlength 50000 \
    --headcrop 50 \
    | gzip > filtered.fastq.gz

Porechop - Adapter Trimming

# Trim adapters
porechop -i reads.fastq.gz -o trimmed.fastq.gz --threads 8

# With barcode splitting
porechop -i reads.fastq.gz -b output_dir/ --threads 8

Generate Summary Statistics

# Quick summary with seqkit
seqkit stats reads.fastq.gz

# Detailed stats
seqkit stats -a reads.fastq.gz

# Watch stats during basecalling
seqkit watch --fields ReadLen,MeanQual reads.fastq.gz

PycoQC - From Basecalling

# Generate QC report from sequencing_summary.txt
pycoQC -f sequencing_summary.txt -o pycoqc_report.html

# With BAM for alignment stats
pycoQC -f sequencing_summary.txt -a aligned.bam -o pycoqc_report.html

Calculate N50

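# With seqkit (tabular output; print the value in the N50 column)
# A plain `grep N50` would match only the header row, not the value
seqkit stats -a -T reads.fastq.gz | \
    awk -F'\t' 'NR==1 {for (i=1; i<=NF; i++) if ($i == "N50") col=i} NR>1 {print "N50:", $col}'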

# Manual calculation
# -n -l prints names and lengths only; the length is always the last column
seqkit fx2tab -n -l reads.fastq.gz | awk -F'\t' '{print $NF}' | sort -rn | \
    awk '{sum+=$1; len[NR]=$1} END {
        target=sum/2; cumsum=0;
        for(i=1; i<=NR; i++) {
            cumsum+=len[i];
            if(cumsum>=target) {print "N50:", len[i]; break}
        }
    }'
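
The same N50 logic can be sketched in Python, which is easier to unit-test than the awk one-liner. This is an illustrative sketch: the function name `n50` and the example lengths are hypothetical, and it mirrors the cumulative-sum rule used in the manual calculation above (first length at which the running total reaches half of all bases).

```python
def n50(lengths):
    """Return the N50 of a collection of read lengths.

    Reads are sorted longest first; N50 is the length of the read at
    which the running total first reaches half of all bases.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("no read lengths given")


# Hypothetical read lengths, for illustration only
example_lengths = [2, 3, 4, 5, 6]
print(n50(example_lengths))  # prints 5 (6 + 5 = 11 >= 20 / 2)
```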

Parse FASTQ Quality in Python

import numpy as np
from Bio import SeqIO

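lengths = []
qualities = []

# SeqIO.parse expects uncompressed text; for reads.fastq.gz, pass a handle
# opened with gzip.open(path, 'rt') instead of the file name
for record in SeqIO.parse('reads.fastq', 'fastq'):
    lengths.append(len(record))
    # Arithmetic mean of Phred scores; this runs higher than a mean
    # computed from the per-base error probabilities
    qualities.append(np.mean(record.letter_annotations['phred_quality']))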

print(f'Total reads: {len(lengths)}')
print(f'Total bases: {sum(lengths):,}')
print(f'Mean length: {np.mean(lengths):.0f}')
print(f'Median length: {np.median(lengths):.0f}')
print(f'Mean quality: {np.mean(qualities):.1f}')
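
Phred scores are logarithmic, so averaging them directly tends to overstate read quality. Long-read QC tools are generally understood to average the per-base error probabilities instead and convert the result back to a Phred score. A minimal sketch of that calculation follows; the function name `mean_quality` and the example scores are hypothetical, and this is not taken from any tool's source.

```python
import math


def mean_quality(phred_scores):
    """Average Phred scores via their error probabilities."""
    if not phred_scores:
        raise ValueError("empty quality list")
    # Convert each score to an error probability: p = 10^(-Q/10)
    mean_error = sum(10 ** (-q / 10) for q in phred_scores) / len(phred_scores)
    # Convert the mean error probability back to a Phred score
    return -10 * math.log10(mean_error)


# Hypothetical per-base qualities, for illustration only
scores = [10, 30]
print(f'Arithmetic mean: {sum(scores) / len(scores):.1f}')
print(f'Probability-based mean: {mean_quality(scores):.1f}')
```

For these two bases the arithmetic mean is 20.0, while the probability-based mean is about 13.0, because the single Q10 base dominates the expected error.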

NanoPlot Output Files

File	Description
NanoStats.txt	Summary statistics
NanoPlot-report.html	Interactive report
LengthvsQualityScatterPlot	Length vs Q plot
WeightedHistogramReadlength	Read length distribution
Yield_By_Length	Cumulative yield

Key Parameters - NanoPlot

Parameter	Description
--fastq	Input FASTQ
--bam	Input BAM
--summary	Sequencing summary
-o	Output directory
-t	Threads
--N50	Show N50 line
--plots	Plot types
--format	Output formats

Key Parameters - chopper

Parameter	Default	Description
-q	0	Min quality
-l	0	Min length
--maxlength	inf	Max length
--headcrop	0	Trim from start
--tailcrop	0	Trim from end
-t	4	Threads

Quality Thresholds

Q Score	Accuracy	Typical Use
Q7	~80%	Very low quality
Q10	~90%	Basic filtering
Q15	~97%	Moderate filtering
Q20	~99%	High quality (SUP)
Q30	~99.9%	Very high (HiFi)
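
The accuracy column follows from the Phred definition, Q = -10 * log10(P_error). A short sketch that reproduces the table values (the function name `phred_to_accuracy` is hypothetical):

```python
def phred_to_accuracy(q):
    """Convert a Phred quality score to per-base accuracy (0 to 1)."""
    return 1 - 10 ** (-q / 10)


# The Q scores listed in the table above
for q in (7, 10, 15, 20, 30):
    print(f'Q{q}: {phred_to_accuracy(q):.2%}')
```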

Related Skills

long-read-alignment - Align filtered reads
sequence-io/fastq-quality - FASTQ quality analysis
medaka-polishing - Polish with filtered reads