LLMs-Universal-Life-Science-and-Clinical-Skills- medaka-polishing

<!-- # COPYRIGHT NOTICE # This file is part of the "Universal Biomedical Skills" project. # Copyright (c) 2026 MD BABU MIA, PhD <md.babu.mia@mssm.edu> # All Rights Reserved. # # This code is proprietary and confidential. # Unauthorized copying of this file, via any medium is strictly prohibited. # # Provenance: Authenticated by MD BABU MIA -->

name: bio-longread-medaka
description: Polish assemblies and call variants from Oxford Nanopore data using medaka. Uses neural networks trained on specific basecaller versions. Use when improving ONT-only assemblies or calling variants from Nanopore data without short-read polishing.
tool_type: cli
primary_tool: medaka
measurable_outcome: Execute skill workflow successfully with valid output within 15 minutes.
allowed-tools:

  • read_file
  • run_shell_command

Medaka Polishing and Variant Calling

Basic Consensus Polishing

# Polish assembly with medaka
medaka_consensus -i reads.fastq.gz \
    -d draft_assembly.fa \
    -o medaka_output \
    -t 4 \
    -m r1041_e82_400bps_sup_v5.0.0
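
A polishing run can take a while, so it may be worth confirming that the inputs are present before starting. The following is a minimal sketch; `check_inputs` is a hypothetical helper, not part of medaka, and it only tests that each file exists and is non-empty.

```shell
# Hypothetical helper (not part of medaka): verify that every input file
# exists and is non-empty. Returns 1 on the first file that fails the check.
check_inputs() (
    for f in "$@"; do
        if [ ! -s "$f" ]; then
            echo "missing or empty input: $f" >&2
            exit 1
        fi
    done
)

# Usage (not run here):
# check_inputs reads.fastq.gz draft_assembly.fa && \
#     medaka_consensus -i reads.fastq.gz -d draft_assembly.fa \
#         -o medaka_output -t 4 -m r1041_e82_400bps_sup_v5.0.0
```

The function body runs in a subshell, so the `exit 1` ends only the check and not the calling shell.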

Variant Calling (Haploid)

# Call variants against reference (medaka v2.0+)
medaka_variant \
    -i reads.fastq.gz \
    -r reference.fa \
    -o output_dir \
    -m r1041_e82_400bps_sup_v5.0.0

Note: Diploid variant calling has been deprecated in medaka v2.0. For diploid samples, use Clair3 instead.

Step-by-Step Workflow (medaka v2.0+)

# 1. Align reads to reference/draft
minimap2 -ax map-ont reference.fa reads.fastq.gz | \
    samtools sort -o aligned.bam
samtools index aligned.bam

# 2. Run neural network inference
medaka inference aligned.bam consensus.hdf \
    --model r1041_e82_400bps_sup_v5.0.0 \
    --threads 2                          # scaling is poor beyond 2 threads

# 3. Create consensus sequence from probabilities
medaka sequence consensus.hdf reference.fa polished.fa

# 4. Call variants from probabilities
medaka vcf reference.fa consensus.hdf variants.vcf
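
Because each step consumes the previous step's output, it can help to stop at the first failure instead of continuing with stale files. A minimal sketch follows; `run_step` is a hypothetical wrapper, not a medaka command.

```shell
# Hypothetical wrapper: log a command to stderr, run it, and report failure.
run_step() {
    echo ">> $*" >&2
    "$@" || { echo "step failed: $1" >&2; return 1; }
}

# Usage (not run here); chaining with && stops at the first failed step:
# run_step samtools index aligned.bam &&
# run_step medaka inference aligned.bam consensus.hdf \
#     --model r1041_e82_400bps_sup_v5.0.0 --threads 2 &&
# run_step medaka sequence consensus.hdf reference.fa polished.fa &&
# run_step medaka vcf reference.fa consensus.hdf variants.vcf
```

The alignment step is a pipeline, so it cannot be passed to `run_step` as a single command; it would need to run separately, or inside `sh -c`.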

List Available Models

# See all available models
medaka tools list_models

# Current (R10.4.1) models are named:
# r{pore}_{chemistry}_{speed}bps_{accuracy}_{version}
# e.g., r1041_e82_400bps_sup_v5.0.0
# Older R9.4.1 models (e.g., r941_min_sup_g507) follow a different scheme
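
The naming pattern above can be split mechanically on underscores. The sketch below assumes the five-field form shown; `parse_model` is a hypothetical helper, and names with a different number of fields (such as `r941_min_sup_g507`) are rejected.

```shell
# Hypothetical helper: split a model name of the form
# r{pore}_{chemistry}_{speed}bps_{accuracy}_{version} into its five fields.
parse_model() (
    set -f          # disable globbing while word-splitting
    IFS=_
    set -- $1       # split the name on underscores
    [ "$#" -eq 5 ] || { echo "unexpected model name format" >&2; exit 1; }
    echo "pore=$1 chemistry=$2 speed=$3 accuracy=$4 version=$5"
)

# Usage:
# parse_model r1041_e82_400bps_sup_v5.0.0
```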

Common Models

| Model | Description |
|---|---|
| r1041_e82_400bps_sup_v5.0.0 | R10.4.1, E8.2, SUP basecalling |
| r1041_e82_400bps_hac_v5.0.0 | R10.4.1, E8.2, HAC basecalling |
| r941_min_sup_g507 | R9.4.1, MinION, SUP |
| r941_min_hac_g507 | R9.4.1, MinION, HAC |

Choose Model Based on Basecaller

# Check which basecaller was used in your data
# Then select matching model

# For Guppy/Dorado SUP basecalling on R10.4.1
medaka_consensus -m r1041_e82_400bps_sup_v5.0.0 ...

# For HAC basecalling
medaka_consensus -m r1041_e82_400bps_hac_v5.0.0 ...
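
In a script, the basecaller-to-model choice can be made explicit with a small lookup. This sketch covers only the two R10.4.1 models named in this document; `model_for_accuracy` is a hypothetical helper, and any other basecalling mode is treated as an error rather than guessed.

```shell
# Hypothetical helper: map a basecalling accuracy mode to the matching
# R10.4.1 model named in this document. Other modes are rejected.
model_for_accuracy() {
    case "$1" in
        sup) echo "r1041_e82_400bps_sup_v5.0.0" ;;
        hac) echo "r1041_e82_400bps_hac_v5.0.0" ;;
        *)   echo "no model mapped for '$1'" >&2; return 1 ;;
    esac
}

# Usage (not run here):
# model=$(model_for_accuracy sup) &&
#     medaka_consensus -i reads.fastq.gz -d draft.fa -o out -m "$model"
```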

Polish Region Only

# Polish specific region (medaka v2.0+)
medaka inference aligned.bam consensus.hdf \
    --model r1041_e82_400bps_sup_v5.0.0 \
    --region chr1:1000000-2000000
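
Region-restricted inference also allows one job per contig. The sketch below only prints the commands; it assumes a `samtools faidx` index (`.fai`, contig name in the first tab-separated column) and an input named `aligned.bam`. `plan_region_jobs` and the per-contig `.hdf` file names are hypothetical.

```shell
# Hypothetical helper: print one `medaka inference` command per contig
# listed in a FASTA index (.fai). Nothing is executed.
plan_region_jobs() (
    fai=$1
    model=$2
    cut -f1 "$fai" | while IFS= read -r contig; do
        echo "medaka inference aligned.bam ${contig}.hdf --model $model --region $contig"
    done
)

# Usage (not run here):
# samtools faidx reference.fa
# plan_region_jobs reference.fa.fai r1041_e82_400bps_sup_v5.0.0
```

Printing the commands first makes it easy to review them or hand them to a job scheduler. Combining the per-contig outputs afterwards is not covered here.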

Multiple Rounds of Polishing

# First round
medaka_consensus -i reads.fastq.gz -d draft.fa -o round1 -m model

# Second round (diminishing returns, usually not needed)
medaka_consensus -i reads.fastq.gz -d round1/consensus.fasta -o round2 -m model
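
The round-to-round chaining, where each round's `consensus.fasta` becomes the next round's draft, can be sketched as a loop. `plan_rounds` is a hypothetical helper that prints the commands without running them.

```shell
# Hypothetical helper: print the medaka_consensus command for each round,
# feeding the previous round's consensus.fasta in as the next draft.
plan_rounds() (
    reads=$1
    draft=$2
    model=$3
    rounds=$4
    i=1
    while [ "$i" -le "$rounds" ]; do
        echo "medaka_consensus -i $reads -d $draft -o round$i -m $model"
        draft="round$i/consensus.fasta"
        i=$((i + 1))
    done
)

# Usage:
# plan_rounds reads.fastq.gz draft.fa r1041_e82_400bps_sup_v5.0.0 2
```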

Call Variants from Existing BAM

# If you already have aligned BAM (medaka v2.0+)
medaka inference aligned.bam consensus.hdf --model r1041_e82_400bps_sup_v5.0.0
medaka vcf reference.fa consensus.hdf variants.vcf

Filter VCF Output

# Filter by quality
bcftools filter -i 'QUAL>20' variants.vcf > variants.filtered.vcf

# Get high-confidence calls
bcftools view -i 'FILTER="PASS"' variants.vcf > variants.pass.vcf
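
After filtering, comparing record counts shows how many calls survived. A minimal sketch for uncompressed VCF files follows; `count_records` is a hypothetical helper that counts non-header, non-blank lines with `awk`.

```shell
# Hypothetical helper: count data records (lines that are neither
# headers starting with '#' nor blank) in an uncompressed VCF.
count_records() {
    awk '!/^#/ && NF { n++ } END { print n + 0 }' "$1"
}

# Usage (not run here):
# echo "kept $(count_records variants.pass.vcf) of $(count_records variants.vcf)"
```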

Output Files

| File | Description |
|---|---|
| consensus.fasta | Polished sequence |
| consensus.hdf | Neural network outputs |
| variants.vcf | Variant calls |
| calls_to_draft.bam | Alignments used |

Key Parameters

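| Parameter | Description |
|---|---|
| -i | Input reads (FASTQ) |
| -d | Draft assembly/reference |
| -o | Output directory |
| -m | Model name |
| -t | Threads |
| -b | Batch size (limited by GPU memory) |
| --region | Specific region to process (used with medaka inference) |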

GPU Acceleration

# GPU run (if available): -b sets the batch size, which is limited by GPU memory
medaka_consensus -i reads.fastq.gz -d draft.fa -o output \
    -m r1041_e82_400bps_sup_v5.0.0 \
    -b 100 \
    -t 4
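
In shell, a comment placed after a line-continuation backslash ends the command early, so per-flag comments need another mechanism. One option, sketched here, is to build the argument list one step at a time with `set --`. `consensus_cmd` is a hypothetical helper, and the file names are the placeholders used above.

```shell
# Hypothetical helper: assemble the medaka_consensus command line one step
# at a time so each flag can carry its own comment. Prints the command.
consensus_cmd() (
    batch=$1
    threads=$2
    set -- medaka_consensus
    set -- "$@" -i reads.fastq.gz -d draft.fa -o output
    set -- "$@" -m r1041_e82_400bps_sup_v5.0.0
    set -- "$@" -b "$batch"      # batch size, limited by GPU memory
    set -- "$@" -t "$threads"    # CPU threads
    echo "$*"                    # replace with "$@" to run the command
)

# Usage:
# consensus_cmd 100 4
```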

Related Skills

  • long-read-alignment - Generate input alignments
  • structural-variants - Find SVs from polished assembly
  • variant-calling/variant-calling - Short-read variant calling comparison
<!-- AUTHOR_SIGNATURE: 9a7f3c2e-MD-BABU-MIA-2026-MSSM-SECURE -->