BioSkills bio-genome-assembly-assembly-qc

Assess genome assembly quality using QUAST for contiguity metrics and BUSCO for completeness. Essential for evaluating assembly success and comparing assemblers. Use when evaluating assembly completeness and quality.

install
source · Clone the upstream repo
git clone https://github.com/GPTomics/bioSkills
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/GPTomics/bioSkills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/genome-assembly/assembly-qc" ~/.claude/skills/gptomics-bioskills-bio-genome-assembly-assembly-qc && rm -rf "$T"
manifest: genome-assembly/assembly-qc/SKILL.md

Version Compatibility

Reference examples tested with: BUSCO 5.5+, QUAST 5.2+, SPAdes 3.15+, pandas 2.2+

Before using code patterns, verify installed versions match. If versions differ:

  • Python: run pip show <package>, then help(module.function) to check signatures
  • CLI: run <tool> --version, then <tool> --help to confirm flags

If code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.
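The version check above can be scripted. The following is a minimal sketch, assuming each tool prints a dotted version number somewhere in its `--version` output; the helper names (`extract_version`, `meets_minimum`, `tool_version_output`) and the sample strings are illustrative, not part of QUAST or BUSCO.

```python
import re
import subprocess

def extract_version(text):
    """Return the first dotted version number in text as a tuple of ints, or None."""
    match = re.search(r'(\d+(?:\.\d+)+)', text)
    if not match:
        return None
    return tuple(int(part) for part in match.group(1).split('.'))

def meets_minimum(text, minimum):
    """True if the version found in text is at least `minimum`, e.g. (5, 2)."""
    version = extract_version(text)
    return version is not None and version >= tuple(minimum)

def tool_version_output(command):
    """Run e.g. ['busco', '--version'] and return stdout plus stderr (not called here)."""
    done = subprocess.run(command, capture_output=True, text=True, check=False)
    return done.stdout + done.stderr

# Sample strings are assumptions about what a tool might print.
print(meets_minimum('QUAST v5.2.0', (5, 2)))
print(meets_minimum('BUSCO 5.4.7', (5, 5)))
```

On a machine with the tools installed, `meets_minimum(tool_version_output(['busco', '--version']), (5, 5))` would perform the same check against the real binary.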

Assembly QC

"Assess my genome assembly quality" → Evaluate assembly contiguity (N50, total length, misassemblies) and gene completeness using conserved single-copy orthologs.

  • CLI: quast.py assembly.fa -r reference.fa (contiguity), busco -i assembly.fa -m genome -l lineage (completeness)

Key Metrics

| Metric | Good Assembly |
|---|---|
| N50 | High (relative to genome) |
| L50 | Low |
| Contigs | Few |
| Misassemblies | 0 (with reference) |
| BUSCO Complete | >95% |
| BUSCO Duplicated | <5% (unless polyploid) |

QUAST

Installation

conda install -c bioconda quast

Basic Usage

quast.py assembly.fasta -o quast_output

With Reference Genome

quast.py assembly.fasta -r reference.fasta -o quast_output

Compare Multiple Assemblies

quast.py assembly1.fa assembly2.fa assembly3.fa -o comparison

Key Options

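| Option | Description |
|---|---|
| -o | Output directory |
| -r | Reference genome |
| -g | Gene annotations (GFF) |
| -t | Threads |
| -m | Min contig length (default: 500) |
| --large | For large genomes (>100 Mbp) |
| --fragmented | Reference genome is fragmented (e.g. a scaffold-level reference) |
| -s, --split-scaffolds | Input is scaffolds; also evaluates a version split at N-gaps |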

With Gene Annotations

quast.py assembly.fasta -r reference.fasta -g genes.gff -o quast_output

For Large Genomes

quast.py --large assembly.fasta -o quast_output -t 16

Output Files

quast_output/
├── report.txt        # Summary statistics
├── report.html       # Interactive report
├── report.tsv        # Tab-separated stats
├── icarus.html       # Contig viewer
└── aligned_stats/    # If reference provided

Key Output Metrics

| Metric | Description |
|---|---|
| Total length | Sum of contig lengths |
| # contigs | Number of contigs (>= min length) |
| Largest contig | Length of the largest contig |
| N50 | Contigs of this length or longer contain 50% of the assembly |
| N90 | Contigs of this length or longer contain 90% of the assembly |
| L50 | Smallest number of contigs whose combined length reaches 50% of the assembly |
| GC % | GC content |
| # misassemblies | With reference: structural errors |
| Genome fraction | With reference: % of reference covered |
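To make the N50, N90, and L50 definitions concrete, here is a minimal sketch that computes them from a list of contig lengths. The function name `nx_lx` and the toy lengths are illustrative; QUAST itself first drops contigs below the minimum length (`-m`), which this sketch does not do.

```python
def nx_lx(lengths, fraction=0.5):
    """Return (Nx, Lx) for the given fraction of total assembly length.

    Nx is the length of the contig at which the running total (longest
    contigs first) reaches `fraction` of the assembly; Lx is how many
    contigs were needed to get there.
    """
    if not lengths:
        raise ValueError('no contig lengths given')
    ordered = sorted(lengths, reverse=True)
    threshold = sum(ordered) * fraction
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= threshold:
            return length, count

# Toy contig lengths (total 300), purely illustrative.
contigs = [100, 80, 60, 40, 20]
n50, l50 = nx_lx(contigs)        # running totals: 100, 180 (>= 150)
n90, l90 = nx_lx(contigs, 0.9)   # running totals: 100, 180, 240, 280 (>= 270)
print(n50, l50, n90, l90)
```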

BUSCO

Installation

conda install -c bioconda busco

Basic Usage

busco -i assembly.fasta -m genome -l bacteria_odb10 -o busco_output

Key Options

| Option | Description |
|---|---|
| -i | Input assembly |
| -m | Mode: genome, proteins, transcriptome |
| -l | Lineage dataset |
| -o | Output name |
| -c | CPU threads |
| --auto-lineage | Auto-detect lineage |
| --offline | Use downloaded datasets only |
| --list-datasets | List available lineages |

List Available Lineages

busco --list-datasets

Common Lineages

| Lineage | Use For |
|---|---|
| bacteria_odb10 | Bacteria |
| archaea_odb10 | Archaea |
| eukaryota_odb10 | General eukaryote |
| fungi_odb10 | Fungi |
| metazoa_odb10 | Animals |
| vertebrata_odb10 | Vertebrates |
| mammalia_odb10 | Mammals |
| viridiplantae_odb10 | Plants |
| saccharomycetes_odb10 | Yeasts |

Auto-Lineage Detection

busco -i assembly.fasta -m genome --auto-lineage -o busco_output

Output Files

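busco_output/
├── short_summary.specific.<lineage>.<output_name>.txt   # Quick summary
└── run_<lineage>/
    ├── short_summary.txt           # Same summary, per lineage run
    ├── full_table.tsv              # All BUSCO results
    ├── missing_busco_list.tsv      # Missing genes
    └── busco_sequences/            # BUSCO gene sequences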

Interpret Results

C:98.5%[S:97.0%,D:1.5%],F:0.5%,M:1.0%,n:4085

C - Complete (total)
S - Single-copy
D - Duplicated
F - Fragmented
M - Missing
n - Total BUSCO groups
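The fields in the score line are related: C is the sum of S and D, and C, F, and M together cover all n groups. A minimal sketch of that sanity check follows; the function name `busco_consistent` and the 0.2 percentage-point rounding tolerance are assumptions for illustration.

```python
def busco_consistent(complete, single, duplicated, fragmented, missing, tol=0.2):
    """Check that S + D matches C and that C + F + M adds up to 100%.

    Percentages are rounded to one decimal in the summary, so a small
    tolerance (an assumed 0.2 points) absorbs rounding error.
    """
    parts_match = abs((single + duplicated) - complete) <= tol
    total_match = abs((complete + fragmented + missing) - 100.0) <= tol
    return parts_match and total_match

# Values taken from the example score line above.
print(busco_consistent(98.5, 97.0, 1.5, 0.5, 1.0))
```

A failed check usually means the line was parsed incorrectly rather than that BUSCO reported inconsistent numbers.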

Quality Thresholds

| Quality | Complete | Missing |
|---|---|---|
| Excellent | >95% | <2% |
| Good | >90% | <5% |
| Acceptable | >80% | <10% |
| Poor | <80% | >10% |
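The thresholds above can be applied mechanically. This sketch assumes a tier requires both its Complete and Missing conditions to hold, and that values exactly on a boundary fall into the lower tier; the table does not specify either point, and `classify_busco` is a hypothetical helper name.

```python
def classify_busco(complete, missing):
    """Map BUSCO Complete and Missing percentages to a quality tier."""
    tiers = [
        ('Excellent', 95, 2),
        ('Good', 90, 5),
        ('Acceptable', 80, 10),
    ]
    for label, min_complete, max_missing in tiers:
        # Assumption: both conditions must hold for the tier to apply.
        if complete > min_complete and missing < max_missing:
            return label
    return 'Poor'

print(classify_busco(98.5, 1.0))
print(classify_busco(96.0, 3.0))
```

Treat the label as a first-pass screen; the right thresholds depend on the lineage dataset and the organism.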

Complete QC Workflow

Goal: Run a comprehensive assembly quality assessment combining contiguity and completeness metrics.

Approach: Execute QUAST for contiguity statistics and BUSCO for gene completeness, optionally with a reference genome.

#!/bin/bash
set -euo pipefail

# Positional arguments; only the assembly is required
ASSEMBLY=${1:?"Usage: $0 <assembly.fa> [reference.fa] [lineage] [outdir]"}
REFERENCE=${2:-}
LINEAGE=${3:-bacteria_odb10}
OUTDIR=${4:-assembly_qc}

mkdir -p "$OUTDIR"

echo "=== Assembly QC ==="

# QUAST: pass the reference only when one was supplied
echo "Running QUAST..."
if [ -n "$REFERENCE" ]; then
    quast.py "$ASSEMBLY" -r "$REFERENCE" -o "${OUTDIR}/quast" -t 8
else
    quast.py "$ASSEMBLY" -o "${OUTDIR}/quast" -t 8
fi

# BUSCO: --out_path places the results folder directly in ${OUTDIR}/busco
echo "Running BUSCO..."
busco -i "$ASSEMBLY" -m genome -l "$LINEAGE" -o busco --out_path "$OUTDIR" -c 8

# Summary
echo ""
echo "=== QUAST Summary ==="
cat "${OUTDIR}/quast/report.txt"

echo ""
echo "=== BUSCO Summary ==="
cat "${OUTDIR}"/busco/short_summary*.txt

echo ""
echo "Reports saved to $OUTDIR"

Compare Assemblies

Goal: Evaluate multiple assemblies side-by-side to select the best one.

Approach: Run QUAST with multiple input assemblies and labeled names, then generate BUSCO comparison plots.

QUAST Comparison

quast.py \
    spades_assembly.fa \
    flye_assembly.fa \
    canu_assembly.fa \
    -r reference.fa \
    -l "SPAdes,Flye,Canu" \
    -o assembly_comparison
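Once the comparison has run, the combined report.tsv can be ranked programmatically. The sketch below uses only the standard library and assumes the layout QUAST normally writes: a header row starting with "Assembly" followed by one column per label, then one row per metric. The function names and the sample numbers are made up for illustration.

```python
import csv

def load_quast_report(lines):
    """Return {assembly_label: {metric: value_string}} from report.tsv lines."""
    rows = [row for row in csv.reader(lines, delimiter='\t') if row]
    labels = rows[0][1:]  # header row: "Assembly", then one label per assembly
    table = {label: {} for label in labels}
    for row in rows[1:]:
        for label, value in zip(labels, row[1:]):
            table[label][row[0]] = value
    return table

def rank_by_metric(table, metric, higher_is_better=True):
    """Sort assembly labels by a numeric metric."""
    return sorted(table, key=lambda label: float(table[label][metric]),
                  reverse=higher_is_better)

# Made-up numbers, not real results.
example = [
    "Assembly\tSPAdes\tFlye\tCanu",
    "# contigs\t120\t8\t15",
    "N50\t85000\t2100000\t950000",
]
table = load_quast_report(example)
print(rank_by_metric(table, 'N50'))
print(rank_by_metric(table, '# contigs', higher_is_better=False))
```

For a real run, pass an open file handle, for example `load_quast_report(open('assembly_comparison/report.tsv', newline=''))`. A higher N50 alone does not make an assembly better, so check misassemblies and BUSCO scores alongside it.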

BUSCO Comparison

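# Run BUSCO on each assembly and collect the summaries in one folder
mkdir -p summaries
for asm in spades.fa flye.fa canu.fa; do
    name=$(basename "$asm" .fa)
    busco -i "$asm" -m genome -l bacteria_odb10 -o "busco_${name}"
    cp "busco_${name}"/short_summary.*.txt summaries/
done

# Generate comparison plot; generate_plot.py reads every
# short_summary*.txt file in the directory given with -wd
generate_plot.py -wd summaries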

Python: Parse QUAST Output

Goal: Programmatically extract assembly metrics from QUAST reports.

Approach: Read the tab-separated report.tsv file and transpose it for easy metric access.

import pandas as pd

def parse_quast(report_tsv):
    '''Parse QUAST report.tsv file.'''
    df = pd.read_csv(report_tsv, sep='\t', index_col=0)
    return df.T

stats = parse_quast('quast_output/report.tsv')
print(f"N50: {stats['N50'].values[0]}")
print(f"Total length: {stats['Total length'].values[0]}")
print(f"# contigs: {stats['# contigs'].values[0]}")

Python: Parse BUSCO Output

Goal: Programmatically extract BUSCO completeness metrics from summary files.

Approach: Parse the short_summary.txt file using regex to capture completeness, duplication, fragmentation, and missing percentages.

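import glob
import re

def parse_busco_summary(summary_file):
    '''Parse the one-line score from a BUSCO short summary.'''
    with open(summary_file) as f:
        text = f.read()

    # Matches e.g. C:98.5%[S:97.0%,D:1.5%],F:0.5%,M:1.0%,n:4085
    num = r'(\d+(?:\.\d+)?)'
    pattern = rf'C:{num}%\[S:{num}%,D:{num}%\],F:{num}%,M:{num}%,n:(\d+)'
    match = re.search(pattern, text)

    if match:
        return {
            'complete': float(match.group(1)),
            'single': float(match.group(2)),
            'duplicated': float(match.group(3)),
            'fragmented': float(match.group(4)),
            'missing': float(match.group(5)),
            'total': int(match.group(6))
        }
    return None

# BUSCO 5 names the top-level summary short_summary.specific.<lineage>.<output_name>.txt
summary_files = glob.glob('busco_output/short_summary*.txt')
result = parse_busco_summary(summary_files[0]) if summary_files else None
if result:
    print(f"Complete: {result['complete']}%")
else:
    print('No BUSCO summary found, or the score line was not recognized')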

MetaQUAST (Metagenomes)

Goal: Assess metagenome assembly quality accounting for multiple reference genomes.

Approach: Run MetaQUAST, which reports per-genome metrics. When no references are supplied, it tries to identify and download suitable reference genomes automatically.

metaquast.py metagenome_assembly.fa -o metaquast_output -t 16

Troubleshooting

Low N50

  • Check coverage depth
  • Consider longer reads
  • Try different assembler

Low BUSCO Completeness

  • Check input read quality
  • Verify correct lineage dataset
  • May indicate real gene loss (compare to relatives)

High Duplication in BUSCO

  • Normal for polyploids
  • May indicate contamination
  • Check for uncollapsed haplotypes (both alleles assembled as separate contigs)
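To see where duplicated BUSCOs sit, the per-gene table can be tallied by sequence. This is a minimal sketch, assuming the usual full_table.tsv layout: comment lines starting with `#`, then tab-separated rows whose first three columns are BUSCO id, status, and sequence name. The function name and the sample rows are illustrative.

```python
import csv
from collections import Counter

def busco_status_counts(lines):
    """Return (status counts per BUSCO id, duplicated-copy counts per sequence)."""
    status_by_id = {}
    duplicated_on = Counter()
    for row in csv.reader(lines, delimiter='\t'):
        if not row or row[0].startswith('#'):
            continue  # skip blank lines and header comments
        busco_id, status = row[0], row[1]
        status_by_id[busco_id] = status  # duplicated ids appear once per copy
        if status == 'Duplicated' and len(row) > 2:
            duplicated_on[row[2]] += 1
    return Counter(status_by_id.values()), duplicated_on

# Made-up rows, not real results.
example = [
    "# Busco id\tStatus\tSequence",
    "1at2\tComplete\tctg1",
    "2at2\tDuplicated\tctg1",
    "2at2\tDuplicated\tctg2",
    "3at2\tMissing",
]
statuses, duplicated_on = busco_status_counts(example)
print(statuses)
print(duplicated_on.most_common())
```

Duplicates concentrated on a few small contigs point toward contamination or redundant haplotype contigs, while duplicates spread evenly across the assembly are more consistent with polyploidy.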

Related Skills

  • short-read-assembly - SPAdes assembly
  • long-read-assembly - Flye/Canu assembly
  • assembly-polishing - Improve accuracy
  • metagenomics - Metagenome analysis