BioSkills bio-genome-assembly-metagenome-assembly
Metagenome assembly from long reads with metaFlye and from short reads with metaSPAdes, followed by binning. Use when reconstructing genomes from microbial communities, recovering metagenome-assembled genomes (MAGs), or resolving strain-level variation in complex samples.
git clone https://github.com/GPTomics/bioSkills
T=$(mktemp -d) && git clone --depth=1 https://github.com/GPTomics/bioSkills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/genome-assembly/metagenome-assembly" ~/.claude/skills/gptomics-bioskills-bio-genome-assembly-metagenome-assembly && rm -rf "$T"
Version Compatibility
Reference examples tested with: QUAST 5.2+, SPAdes 3.15+, minimap2 2.26+, pandas 2.2+, samtools 1.19+
Before using code patterns, verify installed versions match. If versions differ:
- Python: `pip show <package>`, then `help(module.function)` to check signatures
- CLI: `<tool> --version`, then `<tool> --help` to confirm flags
If code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.
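The version check described above can be scripted. The following is a minimal sketch, assuming each tool prints a dotted version number when called with `--version`. The helper names (`parse_version`, `meets_minimum`, `check_tool`) and the executable names in `MINIMUMS` are illustrative choices, not part of the skill itself.

```python
import re
import shutil
import subprocess


def parse_version(text):
    """Return the first dotted version number found in text as a tuple of ints."""
    match = re.search(r"(\d+(?:\.\d+)+)", text)
    if match is None:
        return None
    return tuple(int(part) for part in match.group(1).split("."))


def meets_minimum(found, minimum):
    """Compare two version tuples, padding the shorter one with zeros."""
    width = max(len(found), len(minimum))
    found_padded = found + (0,) * (width - len(found))
    minimum_padded = minimum + (0,) * (width - len(minimum))
    return found_padded >= minimum_padded


def check_tool(executable, minimum, version_flag="--version"):
    """Return 'missing', 'unknown', 'too old' or 'ok' for an executable on PATH."""
    if shutil.which(executable) is None:
        return "missing"
    result = subprocess.run([executable, version_flag],
                            capture_output=True, text=True, check=False)
    # Some tools print their version to stderr, so inspect both streams.
    found = parse_version(result.stdout + result.stderr)
    if found is None:
        return "unknown"
    return "ok" if meets_minimum(found, minimum) else "too old"


# Minimums copied from the "tested with" line; executable names are assumptions.
MINIMUMS = {
    "samtools": (1, 19),
    "minimap2": (2, 26),
    "spades.py": (3, 15),
    "quast.py": (5, 2),
}

if __name__ == "__main__":
    for tool, minimum in MINIMUMS.items():
        print(f"{tool}: {check_tool(tool, minimum)}")
```

A result of `unknown` means the version string could not be parsed, in which case reading `<tool> --help` by hand is still the fallback.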
Metagenome Assembly
"Assemble genomes from my metagenome data" → Reconstruct individual microbial genomes (MAGs) from mixed community sequencing reads using metagenome-aware assemblers and binning.
- CLI: `flye --meta --nano-raw reads.fq` (long-read), `metaspades.py -1 R1.fq -2 R2.fq` (short-read)
Overview
Metagenome assembly reconstructs genomes from mixed microbial communities. Long reads enable recovery of complete circular genomes and resolution of strain-level differences.
metaFlye (Long Reads)
Goal: Assemble metagenome contigs from long reads while handling uneven coverage across species.
Approach: Run Flye in --meta mode, which accounts for varying coverage depths in mixed communities.
```bash
# ONT metagenome assembly
flye --nano-raw reads.fastq.gz \
    --meta \
    --out-dir flye_meta \
    --threads 32

# PacBio HiFi metagenome
flye --pacbio-hifi reads.hifi.fastq.gz \
    --meta \
    --out-dir flye_meta_hifi \
    --threads 32

# Key output files:
# assembly.fasta      - assembled contigs
# assembly_graph.gfa  - assembly graph
# assembly_info.txt   - contig statistics
```
metaSPAdes (Short Reads)
Goal: Assemble metagenome contigs from Illumina paired-end reads.
Approach: Run metaSPAdes, which uses multi-k-mer de Bruijn graph assembly optimized for metagenomes.
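```bash
# Illumina paired-end metagenome
metaspades.py -1 R1.fastq.gz -2 R2.fastq.gz \
    -o spades_meta \
    -t 32 \
    -m 500

# With multiple libraries
# Note: the SPAdes manual describes meta mode as taking a single paired-end
# library; confirm with `metaspades.py --help` that your version accepts this.
metaspades.py \
    --pe1-1 lib1_R1.fq.gz --pe1-2 lib1_R2.fq.gz \
    --pe2-1 lib2_R1.fq.gz --pe2-2 lib2_R2.fq.gz \
    -o spades_meta -t 32
```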
Hybrid Assembly
Goal: Combine long-read contiguity with short-read accuracy in metagenome assembly.
Approach: Assemble with metaFlye from long reads, then polish the assembly with Pilon using short reads.
```bash
# Assemble the long reads with metaFlye
flye --nano-raw ont_reads.fastq.gz \
    --meta \
    --out-dir flye_hybrid \
    --threads 32

# Polish with short reads
# short_reads.bam: short reads aligned to flye_hybrid/assembly.fasta, sorted and indexed
pilon --genome flye_hybrid/assembly.fasta \
    --frags short_reads.bam \
    --output polished \
    --threads 16
```
Key Parameters
metaFlye
| Parameter | Description |
|---|---|
| --meta | Metagenome mode (handles uneven coverage) |
| --min-overlap | Minimum overlap for assembly (default: auto) |
| --genome-size | Estimated total size (optional for meta) |
| --iterations | Polishing iterations (default: 1) |
| --keep-haplotypes | Preserve strain variants |
metaSPAdes
| Parameter | Description |
|---|---|
| -m | Memory limit in GB |
| --only-assembler | Skip error correction |
| -k | K-mer sizes (auto-selected by default) |
| --phred-offset | Quality encoding (33 or 64) |
Binning Workflow
Goal: Recover individual genomes (MAGs) from a metagenome assembly.
Approach: Map reads back to the assembly for coverage, compute per-contig depth, bin with MetaBAT2, and assess quality with CheckM2.
"Bin the contigs from my metagenome assembly into individual genomes" --> Map reads for coverage, cluster contigs by composition and coverage, then evaluate bins.
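```bash
# Step 1: Map reads back to assembly (sorted and indexed BAM)
minimap2 -ax map-ont -t 32 assembly.fasta reads.fastq.gz | \
    samtools sort -o mapped.bam -
samtools index mapped.bam

# Step 2: Generate depth file
# Note: the read-identity filter defaults were designed for short reads; for noisy
# long reads check `jgi_summarize_bam_contig_depths --help` (e.g. --percentIdentity)
jgi_summarize_bam_contig_depths --outputDepth depth.txt mapped.bam

# Step 3: Bin with MetaBAT2
metabat2 -i assembly.fasta -a depth.txt -o bins/bin -t 32

# Step 4: Assess bin quality with CheckM2
checkm2 predict --input bins/ --output-directory checkm2_out -x fa --threads 32
```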
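Before binning, it can help to check how many contigs in the depth table are long and deep enough to be binned at all. The sketch below assumes the tab-separated layout written by `jgi_summarize_bam_contig_depths` (columns `contigName`, `contigLen`, `totalAvgDepth`, then per-BAM columns). The function name, the sample rows, and the `min_depth` cut-off are hypothetical; `min_length=2500` mirrors MetaBAT2's default minimum contig size.

```python
import csv
import io


def summarize_depth(handle, min_length=2500, min_depth=1.0):
    """Count contigs in a depth table that pass length and depth cut-offs."""
    reader = csv.DictReader(handle, delimiter="\t")
    total = 0
    usable = 0
    for row in reader:
        total += 1
        # int(float(...)) accepts lengths written as "8200" or "8200.0".
        length = int(float(row["contigLen"]))
        depth = float(row["totalAvgDepth"])
        if length >= min_length and depth >= min_depth:
            usable += 1
    return {"contigs": total, "usable": usable}


# Hypothetical depth table for illustration only.
EXAMPLE_DEPTH = (
    "contigName\tcontigLen\ttotalAvgDepth\tmapped.bam\tmapped.bam-var\n"
    "contig_1\t152340\t35.2\t35.2\t12.1\n"
    "contig_2\t1800\t40.0\t40.0\t9.5\n"
    "contig_3\t8200\t0.4\t0.4\t0.3\n"
)

summary = summarize_depth(io.StringIO(EXAMPLE_DEPTH))
print(summary)  # {'contigs': 3, 'usable': 1}
```

For a real run, pass `open("depth.txt")` instead of the in-memory example. A low usable fraction suggests that the limitation is the assembly or the sequencing depth rather than the binner.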
SemiBin2 (Deep Learning Binning)
Goal: Improve MAG recovery using deep learning-based contig binning.
Approach: Run SemiBin2, which uses a neural network over contig composition and coverage for more accurate bin assignments (a pretrained model is used when --environment is given).
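```bash
# Single-sample binning
SemiBin2 single_easy_bin \
    -i assembly.fasta \
    -b mapped.bam \
    -o semibin_out \
    --environment global

# Multi-sample binning (better for time-series)
# Here assembly.fasta is expected to be the combined contigs of all samples, with
# each BAM mapped against it; see the SemiBin2 documentation for how to build it.
SemiBin2 multi_easy_bin \
    -i assembly.fasta \
    -b sample1.bam sample2.bam sample3.bam \
    -o semibin_multi
```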
Quality Assessment
Goal: Evaluate assembly contiguity, bin completeness, and taxonomic composition.
Approach: Run seqkit for basic stats, CheckM2 for bin quality, GTDB-Tk for taxonomy, and MetaQUAST for assembly metrics.
```bash
# Assembly stats
seqkit stats assembly.fasta

# CheckM2 for bin completeness
checkm2 predict -i bins/ -o checkm2_out -x fa -t 32

# GTDB-Tk for taxonomic classification
# (--extension fa matches the .fa bins; GTDB-Tk looks for .fna by default)
gtdbtk classify_wf --genome_dir bins/ --out_dir gtdbtk_out --extension fa --cpus 32

# MetaQUAST for assembly metrics
metaquast.py -o metaquast_out assembly.fasta -t 32
```
Circular Genome Detection
Goal: Identify complete circular genomes (e.g., bacterial chromosomes, plasmids) in the assembly.
Approach: Parse Flye's assembly_info.txt for circularity flags and extract matching contigs.
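```bash
# Flye marks circular contigs with "Y" in the circ. column (column 4) of assembly_info.txt.
# Match that column only: the repeat column also holds Y/N values.
awk -F'\t' '$4 == "Y" {print $1}' flye_meta/assembly_info.txt > circular_contigs.txt

# Extract circular contigs
seqkit grep -f circular_contigs.txt flye_meta/assembly.fasta > circular_genomes.fasta
```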
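The same selection can be done in Python, which makes it easy to add a length filter (for example, to separate chromosome-sized circles from plasmid-sized ones). This is a sketch that assumes the column names Flye writes in the header of `assembly_info.txt` (`#seq_name`, `length`, `circ.`, `repeat`, ...). The function name and sample rows are hypothetical.

```python
import csv
import io


def circular_contigs(handle, min_length=0):
    """Return names of contigs flagged 'Y' in the 'circ.' column."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    header[0] = header[0].lstrip("#")  # header line starts with "#seq_name"
    name_col = header.index("seq_name")
    length_col = header.index("length")
    circ_col = header.index("circ.")
    names = []
    for row in reader:
        if not row:
            continue
        # Look at the circ. column only; the repeat column also holds Y/N.
        if row[circ_col] == "Y" and int(row[length_col]) >= min_length:
            names.append(row[name_col])
    return names


# Hypothetical assembly_info.txt content for illustration only.
EXAMPLE_INFO = (
    "#seq_name\tlength\tcov.\tcirc.\trepeat\tmult.\talt_group\tgraph_path\n"
    "contig_1\t2845120\t42\tY\tN\t1\t*\t1\n"
    "contig_2\t51230\t18\tN\tY\t3\t*\t2,3\n"
    "contig_3\t6204\t95\tY\tN\t1\t*\t4\n"
)

print(circular_contigs(io.StringIO(EXAMPLE_INFO)))
# ['contig_1', 'contig_3']
print(circular_contigs(io.StringIO(EXAMPLE_INFO), min_length=100_000))
# ['contig_1']
```

Note that `contig_2` is skipped even though its row contains a `Y`, because that flag sits in the repeat column.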
Python Pipeline
Goal: Provide a reusable Python workflow from metagenome assembly through binning to quality assessment.
Approach: Chain metaFlye assembly, MetaBAT2 binning, and CheckM2 quality filtering, returning high-quality MAGs.
```python
import subprocess
from pathlib import Path

import pandas as pd


def run_metaflye(reads, output_dir, read_type='nano-raw', threads=32):
    """Assemble long reads with Flye in --meta mode; return the contig FASTA path."""
    cmd = ['flye', f'--{read_type}', reads, '--meta',
           '--out-dir', output_dir, '--threads', str(threads)]
    subprocess.run(cmd, check=True)
    return Path(output_dir) / 'assembly.fasta'


def run_binning(assembly, bam, output_dir, threads=32):
    """Compute per-contig depth from a sorted BAM and bin contigs with MetaBAT2."""
    bins_dir = Path(output_dir) / 'bins'
    # Create output_dir and bins/ first so the depth file can be written
    bins_dir.mkdir(parents=True, exist_ok=True)
    depth_file = Path(output_dir) / 'depth.txt'
    subprocess.run(['jgi_summarize_bam_contig_depths',
                    '--outputDepth', str(depth_file), bam], check=True)
    subprocess.run(['metabat2', '-i', assembly, '-a', str(depth_file),
                    '-o', str(bins_dir / 'bin'), '-t', str(threads)], check=True)
    return bins_dir


def assess_bins(bins_dir, output_dir, threads=32):
    """Run CheckM2; return bins with >90% completeness and <5% contamination."""
    subprocess.run(['checkm2', 'predict', '--input', str(bins_dir),
                    '--output-directory', output_dir, '-x', 'fa',
                    '--threads', str(threads)], check=True)
    results = pd.read_csv(Path(output_dir) / 'quality_report.tsv', sep='\t')
    high_quality = results[(results['Completeness'] > 90) & (results['Contamination'] < 5)]
    return high_quality


# Example workflow
assembly = run_metaflye('ont_reads.fq.gz', 'flye_out')
# mapped.bam must hold the reads aligned to this assembly, coordinate-sorted
# (the minimap2 + samtools sort step from the Binning Workflow)
bins = run_binning(str(assembly), 'mapped.bam', 'binning_out')
hq_bins = assess_bins(bins, 'checkm2_out')
print(f'High-quality MAGs: {len(hq_bins)}')
```
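The pipeline keeps only high-quality bins, but bins that fall just below the cut-off are often still worth reporting. The sketch below sorts bins into three tiers using only the standard library. The high tier uses the thresholds from the pipeline above. The medium tier (at least 50% complete, under 10% contamination) follows the commonly cited MIMAG convention and is an assumption added here, as are the function names and the sample rows. The column names `Name`, `Completeness` and `Contamination` are assumed to match CheckM2's `quality_report.tsv`.

```python
import csv
import io


def quality_tier(completeness, contamination):
    """Classify one bin by completeness and contamination percentages."""
    if completeness > 90 and contamination < 5:
        return "high"
    if completeness >= 50 and contamination < 10:
        return "medium"
    return "low"


def tier_bins(handle):
    """Group bin names from a CheckM2-style quality report by tier."""
    reader = csv.DictReader(handle, delimiter="\t")
    tiers = {"high": [], "medium": [], "low": []}
    for row in reader:
        tier = quality_tier(float(row["Completeness"]), float(row["Contamination"]))
        tiers[tier].append(row["Name"])
    return tiers


# Hypothetical report rows for illustration only.
EXAMPLE_REPORT = (
    "Name\tCompleteness\tContamination\n"
    "bin.1\t97.3\t1.2\n"
    "bin.2\t71.5\t6.8\n"
    "bin.3\t34.0\t0.5\n"
    "bin.4\t95.0\t12.0\n"
)

tiers = tier_bins(io.StringIO(EXAMPLE_REPORT))
print(tiers)
# {'high': ['bin.1'], 'medium': ['bin.2'], 'low': ['bin.3', 'bin.4']}
```

Here `bin.4` lands in the low tier despite its high completeness, because contamination is evaluated independently of completeness.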
Expected Outputs
| Metric | Good Assembly |
|---|---|
| N50 | >50 kb |
| Largest contig | >1 Mb |
| HQ MAGs (>90% complete, <5% contam) | Varies by sample |
| Circular genomes | Sample dependent |
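For reference, N50 is the contig length at which the contigs of that length or longer account for at least half of the total assembly size. A minimal pure-Python sketch follows. The function names and example lengths are illustrative; tools such as seqkit or MetaQUAST report the same statistic.

```python
import io


def n50(lengths):
    """Return the N50 of a collection of contig lengths (0 if empty)."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0


def fasta_lengths(handle):
    """Return sequence lengths from a FASTA file handle, in file order."""
    lengths = []
    current = None
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if current is not None:
                lengths.append(current)
            current = 0
        elif current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


# Hypothetical contig lengths: total 200 kb, so the halfway point is 100 kb.
example_lengths = [80_000, 60_000, 30_000, 20_000, 10_000]
print(n50(example_lengths))  # 60000

# The same calculation from FASTA text
toy_fasta = io.StringIO(">a\nACGT\nAC\n>b\nGG\n")
print(fasta_lengths(toy_fasta))  # [6, 2]
```

In the example the two longest contigs together reach 140 kb, which passes the halfway point, so the N50 is 60 kb and would meet the >50 kb guideline in the table.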
Troubleshooting
| Issue | Solution |
|---|---|
| Few long contigs | Increase read depth or length |
| High chimeric rate | Use --keep-haplotypes in Flye |
| Poor binning | Add more samples for differential coverage |
| Missing taxa | Check read QC; consider targeted enrichment |
Related Skills
- genome-assembly/contamination-detection - CheckM2/GUNC
- metagenomics/taxonomic-profiling - Kraken2/Bracken
- metagenomics/functional-profiling - HUMAnN
- long-read-sequencing/read-qc - Input quality control