LLMs-Universal-Life-Science-and-Clinical-Skills- basecalling

<!--

<!-- # COPYRIGHT NOTICE # This file is part of the "Universal Biomedical Skills" project. # Copyright (c) 2026 MD BABU MIA, PhD <md.babu.mia@mssm.edu> # All Rights Reserved. # # This code is proprietary and confidential. # Unauthorized copying of this file, via any medium is strictly prohibited. # # Provenance: Authenticated by MD BABU MIA -->

name: bio-basecalling
description: "Convert raw Nanopore signal data (FAST5/POD5) to nucleotide sequences using Dorado basecaller. Covers model selection, GPU acceleration, modified base detection, and quality filtering. Use when processing raw Nanopore data before alignment. Note: Guppy is deprecated; use Dorado for all new analyses."
tool_type: cli
primary_tool: dorado
measurable_outcome: Execute skill workflow successfully with valid output within 15 minutes.
allowed-tools:

  • read_file
  • run_shell_command

Nanopore Basecalling

Convert raw electrical signal from Nanopore sequencing into nucleotide sequences.

Dorado (Recommended)

Dorado is the current production basecaller from Oxford Nanopore Technologies (ONT) and the replacement for Guppy, with better accuracy and speed.

Basic Basecalling

dorado basecaller sup pod5_dir/ > calls.bam

Choose Model

dorado basecaller fast pod5_dir/ > calls.bam
dorado basecaller hac pod5_dir/ > calls.bam
dorado basecaller sup pod5_dir/ > calls.bam

Model Speed vs Accuracy

Model  Speed    Accuracy  Use Case
fast   Fastest  Lower     Quick preview
hac    Medium   High      General use
sup    Slowest  Highest   Publication quality
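
The table above can be folded into a small lookup for scripts. This is a minimal sketch: `pick_model` and the labels `preview`, `general`, and `publication` are hypothetical names chosen for illustration, not Dorado options.

```shell
# Hypothetical helper: map a use case from the table to a Dorado model tier.
# The labels preview/general/publication are illustrative, not Dorado options.
pick_model() {
    case "$1" in
        preview)     echo fast ;;
        general)     echo hac ;;
        publication) echo sup ;;
        *) echo "unknown use case: $1" >&2; return 1 ;;
    esac
}

# Example (requires dorado):
# dorado basecaller "$(pick_model general)" pod5_dir/ > calls.bam
```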

Specific Model Version

# Version tags differ between Dorado releases; confirm this one with: dorado download --list
dorado download --model dna_r10.4.1_e8.2_400bps_sup@v5.1.0
dorado basecaller dna_r10.4.1_e8.2_400bps_sup@v5.1.0 pod5_dir/ > calls.bam
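
A full model name packs several fields into one string. The sketch below splits the name used above into those fields. It assumes the analyte_pore_chemistry_speed_tier@version layout seen in the R10.4.1 names in this document; names with a different number of fields will not map cleanly. `describe_model` is a hypothetical helper.

```shell
# Split a full model name into its parts using POSIX parameter expansion.
# describe_model is a hypothetical helper; it assumes five underscore-separated
# fields followed by "@version", as in the R10.4.1 names shown above.
describe_model() (
    name=${1%@*}        # everything before "@"
    version=${1##*@}    # everything after "@"
    IFS=_
    set -- $name        # split the name on underscores
    echo "analyte=$1 pore=$2 chemistry=$3 speed=$4 tier=$5 version=$version"
)

describe_model dna_r10.4.1_e8.2_400bps_sup@v5.1.0
```

The function body runs in a subshell, so the temporary change to `IFS` does not leak into the calling script.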

List Available Models

dorado download --list

Output FASTQ Instead of BAM

dorado basecaller sup pod5_dir/ --emit-fastq > calls.fastq

Modified Base Detection

dorado basecaller sup,5mCG_5hmCG pod5_dir/ > calls_mods.bam
dorado basecaller sup,5mCG pod5_dir/ > calls_5mc.bam
dorado basecaller sup,6mA pod5_dir/ > calls_6ma.bam
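
The model argument in these commands is a comma-separated string: the tier first, then the modification names. A sketch of building that string in a script follows. `model_with_mods` is a hypothetical helper, and it assumes your Dorado version accepts more than one modification in the same argument.

```shell
# Join a model tier and any number of modification names with commas,
# matching the "sup,5mCG_5hmCG" form used above.
# model_with_mods is a hypothetical helper name.
model_with_mods() (
    arg=$1
    shift
    for mod in "$@"; do
        arg="$arg,$mod"
    done
    echo "$arg"
)

# Example (requires dorado):
# dorado basecaller "$(model_with_mods sup 5mCG_5hmCG 6mA)" pod5_dir/ > calls_mods.bam
```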

GPU Selection

dorado basecaller sup pod5_dir/ --device cuda:0 > calls.bam
dorado basecaller sup pod5_dir/ --device cuda:0,1 > calls.bam
dorado basecaller sup pod5_dir/ --device cpu > calls.bam
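
For scripts that run on both GPU and CPU-only machines, the device can be chosen at run time. This sketch falls back to `cpu` when `nvidia-smi` is missing or lists no GPUs. `pick_device` is a hypothetical helper, and `cuda:all` is assumed to be accepted by your Dorado build.

```shell
# Choose a value for --device: all visible CUDA GPUs if nvidia-smi lists any, else CPU.
# pick_device is a hypothetical helper; "cuda:all" is an assumption about your Dorado build.
pick_device() {
    if command -v nvidia-smi >/dev/null 2>&1 &&
       nvidia-smi -L 2>/dev/null | grep -q '^GPU '; then
        echo cuda:all
    else
        echo cpu
    fi
}

# Example (requires dorado):
# dorado basecaller sup pod5_dir/ --device "$(pick_device)" > calls.bam
```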

Batch Size for Memory

dorado basecaller sup pod5_dir/ --batchsize 64 > calls.bam

Duplex Calling

dorado duplex sup pod5_dir/ > duplex.bam

Demultiplexing During Basecalling

# Classify barcodes while basecalling
dorado basecaller sup pod5_dir/ --kit-name SQK-NBD114-24 > calls.bam
# Split the already-classified reads into one file per barcode
dorado demux calls.bam --output-dir demuxed/ --no-classify

Trim Adapters

dorado basecaller sup pod5_dir/ --trim adapters > calls.bam
dorado basecaller sup pod5_dir/ --no-trim > calls_untrimmed.bam

Resume Interrupted Run

dorado basecaller sup pod5_dir/ --resume-from calls.bam > calls_complete.bam

Guppy (Deprecated - Legacy Only)

Guppy is deprecated and no longer receiving updates. Use Dorado for all new analyses. Guppy examples below are only for maintaining legacy pipelines.

Basic Basecalling

guppy_basecaller \
    -i fast5_dir/ \
    -s output_dir/ \
    -c dna_r10.4.1_e8.2_400bps_sup.cfg \
    --device cuda:0

CPU Mode

guppy_basecaller \
    -i fast5_dir/ \
    -s output_dir/ \
    -c dna_r10.4.1_e8.2_400bps_fast.cfg \
    --num_callers 8 \
    --cpu_threads_per_caller 4

High Accuracy Model

guppy_basecaller \
    -i fast5_dir/ \
    -s output_dir/ \
    -c dna_r10.4.1_e8.2_400bps_hac.cfg \
    --device cuda:0

Super Accuracy Model

guppy_basecaller \
    -i fast5_dir/ \
    -s output_dir/ \
    -c dna_r10.4.1_e8.2_400bps_sup.cfg \
    --device cuda:0

List Available Configs

guppy_basecaller --print_workflows
ls /opt/ont/guppy/data/*.cfg

Modified Base Calling

guppy_basecaller \
    -i fast5_dir/ \
    -s output_dir/ \
    -c dna_r10.4.1_e8.2_400bps_modbases_5mc_cg_sup.cfg \
    --device cuda:0

Barcoding During Basecalling

guppy_basecaller \
    -i fast5_dir/ \
    -s output_dir/ \
    -c dna_r10.4.1_e8.2_400bps_sup.cfg \
    --device cuda:0 \
    --barcode_kits SQK-NBD114-24

Output BAM

guppy_basecaller \
    -i fast5_dir/ \
    -s output_dir/ \
    -c dna_r10.4.1_e8.2_400bps_sup.cfg \
    --device cuda:0 \
    --bam_out \
    --index

POD5 File Handling

POD5 is ONT's current raw-signal file format and replaces FAST5.

Convert FAST5 to POD5

pod5 convert fast5 fast5_dir/*.fast5 --output pod5_dir/

Merge POD5 Files

pod5 merge pod5_dir/*.pod5 --output merged.pod5

Inspect POD5

pod5 inspect reads input.pod5
pod5 inspect summary input.pod5

Subset POD5

# Keep only the reads listed in read_ids.txt (one read ID per line)
pod5 filter input.pod5 --output subset.pod5 --ids read_ids.txt
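
The `read_ids.txt` file used above is a plain list of read IDs, one per line. One way to produce it is from a FASTQ of reads you want to keep. The sketch below assumes standard four-line FASTQ records; `fastq_read_ids` is a hypothetical helper.

```shell
# Write one read ID per line from a 4-line-per-record FASTQ.
# The ID is the first whitespace-delimited token of each header, minus the leading "@".
# fastq_read_ids is a hypothetical helper name.
fastq_read_ids() {
    awk 'NR % 4 == 1 { sub(/^@/, "", $1); print $1 }' "$1"
}

# Example: fastq_read_ids selected.fastq > read_ids.txt
```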

Quality Filtering

Filter with Chopper (After Basecalling)

gunzip -c calls.fastq.gz | chopper -q 10 -l 500 | gzip > filtered.fastq.gz

Filter by Quality Score

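# Assumes each header carries a mean-quality tag (qs:f:<value> or qs:i:<value>),
# e.g. a FASTQ made with: samtools fastq -T qs calls.bam
gunzip -c calls.fastq.gz | \
    awk 'BEGIN{OFS="\n"} {h=$0; getline seq; getline plus; getline qual;
         q=-1; n=split(h, a, /[ \t]+/);
         for(i=2; i<=n; i++) if(a[i] ~ /^qs:[if]:/) {split(a[i], t, ":"); q=t[3]}
         if(q+0 >= 10) print h, seq, plus, qual}' | \
    gzip > q10_filtered.fastq.gz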
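
When the header carries no usable quality value, a mean quality can be computed from the quality string itself. The sketch below averages per-base error probabilities and converts the mean back to a Phred score, which is more conservative than averaging the Phred values directly. It assumes Phred+33 encoding and uncompressed four-line records; `fastq_mean_q` is a hypothetical helper.

```shell
# Print "<read_id><TAB><mean Q>" for each FASTQ record.
# Averages error probabilities (not raw Phred scores); assumes Phred+33 encoding.
# fastq_mean_q is a hypothetical helper name.
fastq_mean_q() {
    awk 'BEGIN { for (i = 33; i <= 126; i++) ord[sprintf("%c", i)] = i }
         NR % 4 == 1 { id = substr($1, 2) }
         NR % 4 == 0 {
             n = length($0); p = 0
             for (i = 1; i <= n; i++)
                 p += 10 ^ (-(ord[substr($0, i, 1)] - 33) / 10)
             if (n > 0) printf "%s\t%.2f\n", id, -10 * log(p / n) / log(10)
         }' "$1"
}

# Example: list IDs of reads with mean Q >= 10
# fastq_mean_q calls.fastq | awk '$2 >= 10 { print $1 }'
```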

NanoFilt (Alternative)

gunzip -c calls.fastq.gz | NanoFilt -q 10 -l 500 | gzip > filtered.fastq.gz

Basecalling QC

NanoPlot

NanoPlot --fastq calls.fastq.gz -o qc_report/ --plots hex dot
# Dorado writes unaligned BAM, so use --ubam; --bam expects sorted, aligned BAM
NanoPlot --ubam calls.bam -o qc_report/

pycoQC (From Sequencing Summary)

pycoQC -f sequencing_summary.txt -o pycoqc_report.html

Basic Stats

seqkit stats calls.fastq.gz

awk 'NR%4==2 {sum+=length($0); count++} END {print "Reads:", count, "Mean length:", sum/count}' calls.fastq
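
Mean length alone can hide a skewed length distribution, so long-read QC usually reports the read-length N50 as well. A minimal sketch, assuming an uncompressed four-line FASTQ; `fastq_n50` is a hypothetical helper.

```shell
# Print the read-length N50 of an uncompressed FASTQ: the length L such that
# reads of length >= L contain at least half of all bases.
# fastq_n50 is a hypothetical helper name.
fastq_n50() {
    awk 'NR % 4 == 2 { print length($0) }' "$1" |
        sort -nr |
        awk '{ len[NR] = $1; total += $1 }
             END {
                 for (i = 1; i <= NR; i++) {
                     run += len[i]
                     if (run * 2 >= total) { print len[i]; exit }
                 }
             }'
}

# Example: fastq_n50 calls.fastq
```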

Model Selection Guide

R10.4.1 Chemistry (Current)

Model                         Use
dna_r10.4.1_e8.2_400bps_fast  Quick analysis
dna_r10.4.1_e8.2_400bps_hac   Routine work
dna_r10.4.1_e8.2_400bps_sup   High accuracy

R9.4.1 Chemistry (Legacy)

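Model                   Use
dna_r9.4.1_450bps_fast  Quick analysis
dna_r9.4.1_450bps_hac   Routine work
dna_r9.4.1_450bps_sup   High accuracy

These are the Guppy config names. Dorado releases that still ship R9.4.1 models name them dna_r9.4.1_e8_fast, dna_r9.4.1_e8_hac, and dna_r9.4.1_e8_sup; check dorado download --list.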

Complete Pipeline

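#!/bin/bash
# Usage: <script> INPUT_DIR OUTPUT_DIR [MODEL]
# Stop on the first error, on unset variables, and on failures inside pipes
set -euo pipefail

INPUT=${1:?usage: $0 INPUT_DIR OUTPUT_DIR [MODEL]}
OUTPUT=${2:?usage: $0 INPUT_DIR OUTPUT_DIR [MODEL]}
MODEL=${3:-sup}

mkdir -p "$OUTPUT"

# Convert legacy FAST5 input to POD5 first; otherwise use the input directory as-is
if [ -d "$INPUT/fast5" ]; then
    echo "Converting FAST5 to POD5..."
    mkdir -p "$OUTPUT/pod5"
    pod5 convert fast5 "$INPUT"/fast5/*.fast5 --output "$OUTPUT/pod5/"
    INPUT_DIR="$OUTPUT/pod5"
else
    INPUT_DIR="$INPUT"
fi

echo "Basecalling with $MODEL model..."
dorado basecaller "$MODEL" "$INPUT_DIR" > "$OUTPUT/calls.bam"

echo "Converting to FASTQ..."
samtools fastq "$OUTPUT/calls.bam" | gzip > "$OUTPUT/calls.fastq.gz"

echo "Filtering..."
# Keep reads with mean quality >= 10 and length >= 500 bp
gunzip -c "$OUTPUT/calls.fastq.gz" | chopper -q 10 -l 500 | gzip > "$OUTPUT/filtered.fastq.gz"

echo "QC report..."
NanoPlot --fastq "$OUTPUT/filtered.fastq.gz" -o "$OUTPUT/qc/"

echo "Done!"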
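
The pipeline calls five external tools and fails partway through if one is missing. A pre-flight check can catch that before any basecalling starts. A minimal sketch follows; `missing_tools` is a hypothetical helper.

```shell
# Print any of the named commands that are not on PATH.
# Returns non-zero if at least one is missing.
# missing_tools is a hypothetical helper name.
missing_tools() {
    status=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
            status=1
        fi
    done
    return "$status"
}

# Example, placed near the top of the pipeline script:
# missing_tools dorado pod5 samtools chopper NanoPlot || exit 1
```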

GPU Requirements

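Model  VRAM Required  Speed (R10.4.1)
fast   4 GB           ~450 bases/s
hac    8 GB           ~200 bases/s
sup    12 GB          ~50 bases/s

Treat these figures as rough guides to relative cost only. Actual memory use and throughput depend on the GPU, batch size, and Dorado version, so benchmark on your own hardware.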
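
The VRAM column can be turned into a quick check against the memory a GPU reports. This sketch takes a value in MiB and returns the largest tier whose requirement in the table is met, treating 1 GB as 1024 MiB. `tier_for_vram` is a hypothetical helper, and the thresholds are only as reliable as the table.

```shell
# Map GPU memory in MiB to the largest tier whose table requirement is met.
# tier_for_vram is a hypothetical helper; thresholds use 1 GB = 1024 MiB.
tier_for_vram() {
    if   [ "$1" -ge 12288 ]; then echo sup
    elif [ "$1" -ge 8192 ];  then echo hac
    elif [ "$1" -ge 4096 ];  then echo fast
    else echo none
    fi
}

# Example (requires an NVIDIA driver): total memory of the first GPU, in MiB
# vram=$(nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits | head -n 1)
# tier_for_vram "$vram"
```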

Troubleshooting

Out of Memory

dorado basecaller sup pod5_dir/ --batchsize 32 > calls.bam
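
Rather than guessing a batch size, a wrapper can halve it after each failed attempt. This is a sketch under stated assumptions: `basecall_with_backoff` is a hypothetical helper, it assumes the command exits non-zero when it runs out of memory, and it retries on any failure, not only memory errors.

```shell
# Retry a command with a halved --batchsize after each failure.
# Usage: basecall_with_backoff START_BATCHSIZE OUTPUT_FILE COMMAND [ARGS...]
# Hypothetical helper: assumes a non-zero exit status on out-of-memory errors.
basecall_with_backoff() {
    bs=$1
    out=$2
    shift 2
    while [ "$bs" -ge 1 ]; do
        if "$@" --batchsize "$bs" > "$out"; then
            echo "succeeded with --batchsize $bs" >&2
            return 0
        fi
        echo "failed with --batchsize $bs, halving" >&2
        bs=$((bs / 2))
    done
    rm -f "$out"    # do not leave a partial output file behind
    return 1
}

# Example (requires dorado):
# basecall_with_backoff 64 calls.bam dorado basecaller sup pod5_dir/
```

Progress messages go to stderr so that the output file contains only what the basecaller wrote.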

Slow CPU Basecalling

dorado basecaller fast pod5_dir/ --device cpu > calls.bam

Check GPU Usage

nvidia-smi -l 1
watch -n 1 nvidia-smi

Related Skills

  • long-read-alignment - Align basecalled reads
  • long-read-qc - QC after basecalling
  • medaka-polishing - Polish using basecalled reads
  • structural-variants - SV detection from long reads
<!-- AUTHOR_SIGNATURE: 9a7f3c2e-MD-BABU-MIA-2026-MSSM-SECURE -->