BioSkills bio-genome-assembly-long-read-assembly
De novo genome assembly from Oxford Nanopore or PacBio long reads using Flye and Canu. Produces highly contiguous assemblies suitable for complete bacterial genomes and resolving complex regions. Use when assembling genomes from ONT or PacBio reads.
git clone https://github.com/GPTomics/bioSkills
T=$(mktemp -d) && git clone --depth=1 https://github.com/GPTomics/bioSkills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/genome-assembly/long-read-assembly" ~/.claude/skills/gptomics-bioskills-bio-genome-assembly-long-read-assembly && rm -rf "$T"
Version Compatibility
Reference examples tested with: Canu 2.2+, Flye 2.9+, hifiasm 0.19+, wtdbg2 2.5+
Before using code patterns, verify installed versions match. If versions differ:
- CLI: `<tool> --version` then `<tool> --help` to confirm flags
If code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.
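The version check described above can be scripted. The following is a minimal sketch, not part of the original skill: `version_ge` is a hypothetical helper name, and it assumes GNU `sort` with the `-V` (version sort) flag is available.

```shell
#!/usr/bin/env bash
# version_ge INSTALLED MINIMUM  (hypothetical helper, for illustration only)
# Succeeds (exit status 0) when INSTALLED is the same as or newer than MINIMUM.
# Assumes GNU sort, whose -V flag orders dotted version strings.
version_ge() {
    local installed=$1 minimum=$2
    # The older of the two versions sorts first; the check passes
    # when that older one is the required minimum.
    [ "$(printf '%s\n%s\n' "$installed" "$minimum" | sort -V | head -n 1)" = "$minimum" ]
}
```

A call such as `version_ge 2.9.1 2.9 || echo "Flye is older than 2.9" >&2` could guard a pipeline. Each tool prints its version in a slightly different format, so extracting the bare number from `<tool> --version` is left to the caller.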
Long-Read Assembly
"Assemble a genome from long reads" → Build a contiguous de novo assembly from ONT or PacBio reads, producing complete or near-complete chromosomes.
- CLI: `flye --nano-raw reads.fq -o output` (ONT), `canu -p asm -d output genomeSize=5m -nanopore reads.fq` (ONT/PacBio)
Tool Comparison
| Tool | Speed | Memory | Best For |
|---|---|---|---|
| Flye | Fast | Moderate | General purpose, bacteria, ONT |
| Canu | Slow | High | High accuracy, complex genomes |
| Wtdbg2 | Very fast | Low | Draft assemblies |
Note: For PacBio HiFi data, see the dedicated hifi-assembly skill which covers hifiasm.
Flye
Installation
conda install -c bioconda flye
Basic Usage
    # Oxford Nanopore
    flye --nano-raw reads.fastq.gz --out-dir flye_output --threads 16

    # PacBio CLR
    flye --pacbio-raw reads.fastq.gz --out-dir flye_output --threads 16

    # PacBio HiFi
    flye --pacbio-hifi reads.fastq.gz --out-dir flye_output --threads 16
Read Type Options
| Option | Read Type |
|---|---|
| `--nano-raw` | ONT regular reads |
| `--nano-corr` | ONT corrected reads |
| `--nano-hq` | ONT Q20+ reads (Guppy 5+) |
| `--pacbio-raw` | PacBio CLR |
| `--pacbio-corr` | PacBio corrected |
| `--pacbio-hifi` | PacBio HiFi/CCS |
Key Options
| Option | Description |
|---|---|
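| `--out-dir` | Output directory |
| `--threads` | Number of threads |
| `--genome-size` | Estimated genome size (e.g., 5m, 100m) |
| `--iterations` | Polishing iterations (default: 1) |
| `--meta` | Metagenome mode |
| `--plasmids` | Recover plasmids (reported as discontinued from Flye 2.9; confirm with `flye --help`) |
| `--keep-haplotypes` | Don't collapse haplotypes |
| `--scaffold` | Enable scaffolding |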
Genome Size Estimation
    # Provide an estimate if the exact size is unknown
    flye --nano-raw reads.fq.gz --out-dir output --genome-size 5m

    # Size formats: 1000, 1k, 1m, 1g
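To reuse the same size string elsewhere in a script (for example, in a coverage calculation), it helps to expand it into a plain base count. The sketch below uses a hypothetical helper, `size_to_bases`, and assumes decimal multipliers (1k = 1,000 bases), the usual convention for genome sizes; the tool's own internal parsing may differ slightly.

```shell
# size_to_bases SIZE  (hypothetical helper, for illustration only)
# Converts a size such as 5m, 100k, 1g, or 1000 to a plain base count.
# Assumption: k/m/g are decimal multipliers (1e3, 1e6, 1e9).
size_to_bases() {
    awk -v s="$1" 'BEGIN {
        mult = 1
        unit = tolower(substr(s, length(s)))      # last character of the input
        if (unit == "k") mult = 1e3
        else if (unit == "m") mult = 1e6
        else if (unit == "g") mult = 1e9
        if (mult > 1) s = substr(s, 1, length(s) - 1)   # drop the unit letter
        printf "%.0f\n", s * mult
    }'
}
```

For example, `size_to_bases 5m` expands the bacterial estimate used throughout this document into a number that arithmetic tools can consume.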
Output Files
    flye_output/
    ├── assembly.fasta        # Final assembly
    ├── assembly_graph.gfa    # Assembly graph
    ├── assembly_info.txt     # Contig statistics
    └── flye.log              # Log file
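The contig statistics file can be condensed into a one-line summary. This is a sketch under stated assumptions: `summarize_assembly_info` is a hypothetical helper, and it assumes `assembly_info.txt` is tab-separated with a `#`-prefixed header, contig length in column 2, and a circularity flag in column 4 (`Y`, or possibly `+` in older files). Check the header of your own file before relying on it.

```shell
# summarize_assembly_info FILE  (hypothetical helper, for illustration only)
# Prints contig count, total length, and number of circular contigs.
# Assumptions: tab-separated columns, '#' header line, length in column 2,
# circularity flag in column 4 ("Y", or "+" in older files).
summarize_assembly_info() {
    awk -F'\t' '
        /^#/ { next }                          # skip the header line
        NF >= 4 {
            contigs++
            total += $2
            if ($4 == "Y" || $4 == "+") circular++
        }
        END {
            printf "contigs=%d total_bp=%.0f circular=%d\n", contigs, total, circular
        }
    ' "$1"
}
```

For a bacterial isolate, a small contig count with most contigs flagged circular suggests a complete or near-complete assembly.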
Bacterial Assembly
    flye \
        --nano-raw bacteria.fastq.gz \
        --out-dir bacteria_assembly \
        --genome-size 5m \
        --threads 16
Metagenome Assembly
    flye \
        --nano-raw metagenome.fastq.gz \
        --out-dir meta_assembly \
        --meta \
        --threads 32
With Plasmid Recovery
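    # --plasmids is reported as discontinued from Flye 2.9;
    # confirm with `flye --help` and drop the flag if it is rejected
    flye \
        --nano-raw isolate.fastq.gz \
        --out-dir assembly \
        --plasmids \
        --threads 16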
Canu
Installation
conda install -c bioconda canu
Basic Usage
    # ONT reads
    canu -p assembly -d canu_output genomeSize=5m -nanopore reads.fastq.gz

    # PacBio HiFi
    canu -p assembly -d canu_output genomeSize=5m -pacbio-hifi reads.fastq.gz
Key Options
| Option | Description |
|---|---|
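| `-p` | Assembly prefix |
| `-d` | Output directory |
| `genomeSize=` | Estimated size (required) |
| `maxThreads=` | Max threads |
| `maxMemory=` | Max memory (e.g., 64g) |
| `useGrid=false` | Disable grid execution |
| `correctedErrorRate=` | Expected error rate |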
Read Type Options
| Option | Read Type |
|---|---|
| `-nanopore` | ONT reads |
| `-nanopore-raw` | ONT raw (deprecated) |
| `-pacbio` | PacBio CLR |
| `-pacbio-hifi` | PacBio HiFi/CCS |
Fast Mode
    canu -p asm -d output genomeSize=5m \
        -nanopore reads.fq.gz \
        useGrid=false \
        maxThreads=16 \
        maxMemory=32g
High-Quality Mode (PacBio HiFi)
    canu -p asm -d output genomeSize=5m \
        -pacbio-hifi reads.fq.gz \
        correctedErrorRate=0.01
Output Files
    canu_output/
    ├── assembly.contigs.fasta      # Contigs
    ├── assembly.unassembled.fasta
    ├── assembly.report
    └── assembly.seqStore/
Wtdbg2 (Fast Draft)
Installation
conda install -c bioconda wtdbg
Basic Usage
    # Assemble
    wtdbg2 -x ont -g 5m -t 16 -i reads.fq.gz -o draft

    # Consensus
    wtpoa-cns -t 16 -i draft.ctg.lay.gz -o draft.ctg.fa
Platform Presets
| Preset | Platform |
|---|---|
| `-x ont` | ONT R9 |
| `-x ccs` | PacBio HiFi |
| `-x rs` / `-x sq` | PacBio CLR (RS II / Sequel) |
| | ONT R10 |
Complete Workflows
Goal: Run end-to-end long-read assembly pipelines from raw reads to contigs.
Approach: Use Flye for initial assembly, optionally followed by short-read polishing.
ONT Bacterial Assembly
    #!/bin/bash
    set -euo pipefail

    READS=$1
    OUTDIR=$2
    SIZE=${3:-5m}    # genome size estimate; defaults to 5m

    echo "=== ONT Bacterial Assembly ==="

    # Flye assembly (variables quoted so paths with spaces work)
    flye \
        --nano-raw "$READS" \
        --out-dir "${OUTDIR}/flye" \
        --genome-size "$SIZE" \
        --threads 16

    # Stats
    echo "Assembly statistics:"
    cat "${OUTDIR}/flye/assembly_info.txt"

    echo "Assembly: ${OUTDIR}/flye/assembly.fasta"
Hybrid Assembly (Long + Short)
    #!/bin/bash
    set -euo pipefail

    LONG=$1
    SHORT_R1=$2
    SHORT_R2=$3
    OUTDIR=$4

    # 1. Long-read assembly with Flye
    flye --nano-raw "$LONG" --out-dir "${OUTDIR}/flye" --genome-size 5m --threads 16

    # 2. Polish with short reads (Pilon), using $SHORT_R1 and $SHORT_R2
    # See assembly-polishing skill
Quality Expectations
| Metric | Bacterial | Eukaryotic |
|---|---|---|
| Contigs | 1-10 | 100-1000+ |
| N50 | >1 Mb | Variable |
| Complete chromosomes | Often | Rare |
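To compare an assembly against the N50 expectations in the table, the statistic can be computed directly from the contig FASTA. The sketch below uses a hypothetical helper, `fasta_n50`, and assumes an uncompressed FASTA file. N50 is taken here as the length of the contig at which the running sum of lengths, sorted longest first, reaches half the total assembly length. For routine work the assembly-qc skill (QUAST) is the more complete option.

```shell
# fasta_n50 FASTA  (hypothetical helper, for illustration only)
# Prints the N50 contig length of an uncompressed FASTA file.
fasta_n50() {
    awk '
        /^>/ { if (len) print len; len = 0; next }   # new record: emit previous length
        { len += length($0) }                         # accumulate sequence characters
        END { if (len) print len }                    # emit the final record
    ' "$1" |
    sort -rn |
    awk '
        { lens[NR] = $1; total += $1 }
        END {
            # Walk contigs from longest to shortest until half the total is covered.
            for (i = 1; i <= NR; i++) {
                running += lens[i]
                if (running * 2 >= total) { print lens[i]; exit }
            }
        }
    '
}
```

Running `fasta_n50 flye_output/assembly.fasta` on a finished bacterial assembly should report a value above 1 Mb if the chromosome assembled into one or a few contigs.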
Troubleshooting
Low Contiguity
- Check coverage (need >30x)
- Try increasing iterations in Flye
- Consider supplementing with short reads
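The coverage check in the first bullet is simple arithmetic: total read bases divided by the genome size. The sketch below uses a hypothetical helper, `fastq_coverage`, and assumes uncompressed four-line FASTQ records and a genome size given as a plain base count.

```shell
# fastq_coverage FASTQ GENOME_SIZE_BP  (hypothetical helper, for illustration only)
# Prints estimated fold coverage: total read bases / genome size.
# Assumes uncompressed FASTQ with exactly four lines per record.
fastq_coverage() {
    awk -v g="$2" '
        NR % 4 == 2 { bases += length($0) }   # second line of each record is the sequence
        END { printf "%.1f\n", bases / g }
    ' "$1"
}
```

For gzipped input, a process substitution such as `fastq_coverage <(gzip -dc reads.fastq.gz) 5000000` avoids writing a decompressed copy. A result well under 30 suggests that sequencing depth, rather than assembler settings, is limiting contiguity.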
Memory Issues
- Use Flye (more memory efficient)
- Reduce threads
- Filter reads by length/quality
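Length filtering, the last bullet above, can be sketched with `awk` alone. `filter_fastq_min_length` is a hypothetical helper that assumes uncompressed four-line FASTQ records on standard input; dedicated tools such as Filtlong also handle quality-based filtering and are the better choice for production use.

```shell
# filter_fastq_min_length MIN_LEN < in.fastq > out.fastq
# (hypothetical helper, for illustration only)
# Keeps only reads whose sequence is at least MIN_LEN bases long.
# Assumes uncompressed FASTQ with exactly four lines per record.
filter_fastq_min_length() {
    awk -v min="$1" '
        NR % 4 == 1 { header = $0 }
        NR % 4 == 2 { seq = $0 }
        NR % 4 == 3 { plus = $0 }
        NR % 4 == 0 && length(seq) >= min {
            # $0 is the quality line; print the whole record
            print header; print seq; print plus; print $0
        }
    '
}
```

A typical call would be `gzip -dc reads.fastq.gz | filter_fastq_min_length 1000 > filtered.fastq`, with the threshold chosen to suit the read length distribution.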
Misassemblies
- Polish with Pilon/medaka
- Validate with short reads
- Check for contamination
Related Skills
- hifi-assembly - PacBio HiFi assembly with hifiasm
- assembly-polishing - Polish long-read assemblies
- assembly-qc - QUAST and BUSCO assessment
- short-read-assembly - Hybrid with Illumina
- long-read-sequencing - Read QC and alignment