OpenClaw-Medical-Skills bio-genome-assembly-assembly-qc

<!--

install
source · Clone the upstream repo
git clone https://github.com/FreedomIntelligence/OpenClaw-Medical-Skills
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/FreedomIntelligence/OpenClaw-Medical-Skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/bio-genome-assembly-assembly-qc" ~/.claude/skills/freedomintelligence-openclaw-medical-skills-bio-genome-assembly-assembly-qc && rm -rf "$T"
OpenClaw · Install into ~/.openclaw/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/FreedomIntelligence/OpenClaw-Medical-Skills "$T" && mkdir -p ~/.openclaw/skills && cp -r "$T/skills/bio-genome-assembly-assembly-qc" ~/.openclaw/skills/freedomintelligence-openclaw-medical-skills-bio-genome-assembly-assembly-qc && rm -rf "$T"
manifest: skills/bio-genome-assembly-assembly-qc/SKILL.md
source content
<!-- # COPYRIGHT NOTICE # This file is part of the "Universal Biomedical Skills" project. # Copyright (c) 2026 MD BABU MIA, PhD <md.babu.mia@mssm.edu> # All Rights Reserved. # # This code is proprietary and confidential. # Unauthorized copying of this file, via any medium is strictly prohibited. # # Provenance: Authenticated by MD BABU MIA -->

name: bio-genome-assembly-assembly-qc
description: Assess genome assembly quality using QUAST for contiguity metrics and BUSCO for completeness. Essential for evaluating assembly success and comparing assemblers. Use when evaluating assembly completeness and quality.
tool_type: cli
primary_tool: QUAST
measurable_outcome: Execute skill workflow successfully with valid output within 15 minutes.
allowed-tools:

  • read_file
  • run_shell_command

Assembly QC

Evaluate genome assembly quality with contiguity metrics (QUAST) and gene completeness (BUSCO).

Key Metrics

| Metric | Good Assembly |
|--------|---------------|
| N50 | High (relative to genome) |
| L50 | Low |
| Contigs | Few |
| Misassemblies | 0 (with reference) |
| BUSCO Complete | >95% |
| BUSCO Duplicated | <5% (unless polyploid) |

QUAST

Installation

conda install -c bioconda quast

Basic Usage

quast.py assembly.fasta -o quast_output

With Reference Genome

quast.py assembly.fasta -r reference.fasta -o quast_output

Compare Multiple Assemblies

quast.py assembly1.fa assembly2.fa assembly3.fa -o comparison

Key Options

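| Option | Description |
|--------|-------------|
| -o | Output directory |
| -r | Reference genome |
| -g | Gene annotations (GFF) |
| -t | Threads |
| -m | Min contig length (default: 500) |
| --large | For large genomes (>100 Mbp) |
| --fragmented | Reference genome is fragmented (e.g. scaffold-level); misassemblies caused by the fragmentation are not counted |
| -s, --split-scaffolds | Input is scaffolds (includes N-gaps); QUAST also evaluates a version split at the gaps. Older releases spell the long form --scaffolds |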

With Gene Annotations

quast.py assembly.fasta -r reference.fasta -g genes.gff -o quast_output

For Large Genomes

quast.py --large assembly.fasta -o quast_output -t 16

Output Files

quast_output/
├── report.txt        # Summary statistics
├── report.html       # Interactive report
├── report.tsv        # Tab-separated stats
├── icarus.html       # Contig viewer
└── aligned_stats/    # If reference provided

Key Output Metrics

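| Metric | Description |
|--------|-------------|
| Total length | Sum of contig lengths |
| # contigs | Number of contigs (>= min length) |
| Largest contig | Length of largest contig |
| N50 | Contigs of this length or longer contain at least 50% of the assembly |
| N90 | Contigs of this length or longer contain at least 90% of the assembly |
| L50 | Smallest number of contigs, taken longest first, that together reach 50% of the assembly |
| GC % | GC content |
| # misassemblies | With reference: structural errors |
| Genome fraction | With reference: % of reference covered |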
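
The N50, N90 and L50 definitions above can be made concrete with a small calculation. The following is a minimal sketch, not QUAST's own code: the function name `nx_lx` and the toy contig lengths are made up for illustration, and QUAST additionally drops contigs below the minimum length before computing these values.

```python
def nx_lx(contig_lengths, fraction=0.5):
    """Return (Nx, Lx) for a list of contig lengths.

    Contigs are added from longest to shortest until the running total
    reaches `fraction` of the assembly. Nx is the length of the contig that
    crosses the target; Lx is how many contigs were needed.
    """
    if not contig_lengths:
        raise ValueError("no contigs supplied")
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")

    ordered = sorted(contig_lengths, reverse=True)
    target = fraction * sum(ordered)
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= target:
            return length, count
    # Only reachable through floating-point rounding; fall back to all contigs.
    return ordered[-1], len(ordered)


lengths = [100, 80, 60, 40, 20]       # hypothetical assembly, 300 bp in total
n50, l50 = nx_lx(lengths)             # fraction=0.5 gives N50 and L50
n90, l90 = nx_lx(lengths, 0.9)        # fraction=0.9 gives N90 and L90
print(n50, l50, n90, l90)             # 80 2 40 4
```

In the toy example the two longest contigs (100 + 80) already cover half of the 300 bp, so N50 is 80 and L50 is 2.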

BUSCO

Installation

conda install -c bioconda busco

Basic Usage

busco -i assembly.fasta -m genome -l bacteria_odb10 -o busco_output

Key Options

| Option | Description |
|--------|-------------|
| -i | Input assembly |
| -m | Mode: genome, proteins, transcriptome |
| -l | Lineage dataset |
| -o | Output name |
| -c | CPU threads |
| --auto-lineage | Auto-detect lineage |
| --offline | Use downloaded datasets only |
| --list-datasets | List available lineages |

List Available Lineages

busco --list-datasets

Common Lineages

| Lineage | Use For |
|---------|---------|
| bacteria_odb10 | Bacteria |
| archaea_odb10 | Archaea |
| eukaryota_odb10 | General eukaryote |
| fungi_odb10 | Fungi |
| metazoa_odb10 | Animals |
| vertebrata_odb10 | Vertebrates |
| mammalia_odb10 | Mammals |
| viridiplantae_odb10 | Plants |
| saccharomycetes_odb10 | Yeasts |

Auto-Lineage Detection

busco -i assembly.fasta -m genome --auto-lineage -o busco_output

Output Files

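Layout as written by BUSCO v4/v5; file names may differ in other versions.

busco_output/
├── short_summary.specific.<lineage>.busco_output.txt   # Quick summary
├── logs/                                               # Run logs
└── run_<lineage>/
    ├── full_table.tsv              # All BUSCO results
    ├── missing_busco_list.tsv      # Missing genes
    └── busco_sequences/            # BUSCO gene sequences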
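
To look past the one-line summary, full_table.tsv can be tallied directly. The sketch below assumes the BUSCO v4/v5 table layout (tab-separated, comment lines starting with `#`, BUSCO id in the first column and status in the second); the function name and the example rows are invented for illustration.

```python
import csv
import io
from collections import Counter


def tally_busco_statuses(handle):
    """Count BUSCO groups per status from a full_table.tsv-style stream."""
    status_by_id = {}
    for row in csv.reader(handle, delimiter="\t"):
        # Skip blank lines, comment/header lines and malformed rows.
        if len(row) < 2 or row[0].startswith("#"):
            continue
        busco_id, status = row[0], row[1]
        # A duplicated group appears once per copy; keep one entry per id.
        status_by_id[busco_id] = status
    return Counter(status_by_id.values())


# Made-up rows for illustration only; real tables carry more columns.
example = io.StringIO(
    "# Busco id\tStatus\tSequence\n"
    "1001at2\tComplete\tcontig_1\n"
    "1002at2\tDuplicated\tcontig_1\n"
    "1002at2\tDuplicated\tcontig_7\n"
    "1003at2\tFragmented\tcontig_3\n"
    "1004at2\tMissing\n"
)
counts = tally_busco_statuses(example)
print(dict(counts))
```

For a real run, pass an open file handle for full_table.tsv instead of the `io.StringIO` example.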

Interpret Results

C:98.5%[S:97.0%,D:1.5%],F:0.5%,M:1.0%,n:4085

C - Complete (total)
S - Single-copy
D - Duplicated
F - Fragmented
M - Missing
n - Total BUSCO groups
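
The categories are related: C is the sum of S and D, and C, F and M together account for all n groups. A minimal sketch of that sanity check follows; the function name and the tolerance (to absorb rounding to one decimal place) are assumptions, not part of BUSCO.

```python
def check_busco_percentages(c, s, d, f, m, tolerance=0.2):
    """Return a list of inconsistencies in a BUSCO summary line (empty if none)."""
    problems = []
    if abs((s + d) - c) > tolerance:
        problems.append("S + D does not equal C")
    if abs((c + f + m) - 100.0) > tolerance:
        problems.append("C + F + M does not sum to 100%")
    return problems


# Values from the example line above: C:98.5%[S:97.0%,D:1.5%],F:0.5%,M:1.0%
print(check_busco_percentages(98.5, 97.0, 1.5, 0.5, 1.0))   # []
```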

Quality Thresholds

| Quality | Complete | Missing |
|---------|----------|---------|
| Excellent | >95% | <2% |
| Good | >90% | <5% |
| Acceptable | >80% | <10% |
| Poor | <80% | >10% |
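
The thresholds above can be applied programmatically. This is a sketch under one stated assumption: a tier is awarded only when both its Complete and Missing conditions hold, and anything that fits no tier is reported as Poor. The table itself does not say how to treat mixed cases, and the function name is hypothetical.

```python
def busco_quality(complete_pct, missing_pct):
    """Map BUSCO Complete/Missing percentages to the quality tiers above."""
    # (label, Complete must exceed, Missing must be below)
    tiers = [
        ("Excellent", 95.0, 2.0),
        ("Good", 90.0, 5.0),
        ("Acceptable", 80.0, 10.0),
    ]
    for label, min_complete, max_missing in tiers:
        if complete_pct > min_complete and missing_pct < max_missing:
            return label
    return "Poor"


print(busco_quality(98.5, 1.0))   # Excellent (C:98.5%, M:1.0%)
print(busco_quality(92.0, 6.0))   # Acceptable: Missing is too high for Good
```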

Complete QC Workflow

#!/bin/bash
set -euo pipefail

# Arguments: ASSEMBLY [REFERENCE] [LINEAGE] [OUTDIR]
ASSEMBLY=${1:?Usage: $0 ASSEMBLY [REFERENCE] [LINEAGE] [OUTDIR]}
REFERENCE=${2:-}
LINEAGE=${3:-bacteria_odb10}
OUTDIR=${4:-assembly_qc}

mkdir -p "$OUTDIR"

echo "=== Assembly QC ==="

# QUAST: pass the reference only when one was supplied
echo "Running QUAST..."
if [ -n "$REFERENCE" ]; then
    quast.py "$ASSEMBLY" -r "$REFERENCE" -o "${OUTDIR}/quast" -t 8
else
    quast.py "$ASSEMBLY" -o "${OUTDIR}/quast" -t 8
fi

# BUSCO: --out_path (BUSCO v4+) writes the run directly under $OUTDIR
echo "Running BUSCO..."
busco -i "$ASSEMBLY" -m genome -l "$LINEAGE" -o busco --out_path "$OUTDIR" -c 8

# Summary
echo ""
echo "=== QUAST Summary ==="
cat "${OUTDIR}/quast/report.txt"

echo ""
echo "=== BUSCO Summary ==="
cat "${OUTDIR}"/busco/short_summary*.txt

echo ""
echo "Reports saved to $OUTDIR"

Compare Assemblies

QUAST Comparison

quast.py \
    spades_assembly.fa \
    flye_assembly.fa \
    canu_assembly.fa \
    -r reference.fa \
    -l "SPAdes,Flye,Canu" \
    -o assembly_comparison
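
A multi-assembly run writes one column per assembly to report.tsv, so the comparison can be ranked in a few lines. The sketch below assumes that layout (metric names in the first column, one column per label); the function name and all numbers in the example are invented for illustration and are not real results.

```python
import io

import pandas as pd


def rank_assemblies(report_tsv, metric="N50", higher_is_better=True):
    """Rank assemblies in a QUAST report.tsv by a single metric."""
    table = pd.read_csv(report_tsv, sep="\t", index_col=0)
    # Metric rows can hold "-" for missing values; coerce those to NaN.
    values = pd.to_numeric(table.loc[metric], errors="coerce")
    return values.sort_values(ascending=not higher_is_better)


# Made-up report with the labels from the command above.
example = io.StringIO(
    "Assembly\tSPAdes\tFlye\tCanu\n"
    "# contigs\t120\t8\t15\n"
    "N50\t85000\t2400000\t1100000\n"
)
ranking = rank_assemblies(example)
print(ranking.index.tolist())   # ['Flye', 'Canu', 'SPAdes']
```

Pass `higher_is_better=False` for metrics where smaller values are preferable, such as `# contigs` or `# misassemblies`.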

BUSCO Comparison

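# Run BUSCO on each assembly and collect the short summaries in one folder
mkdir -p busco_summaries
for asm in spades.fa flye.fa canu.fa; do
    name=$(basename "$asm" .fa)
    busco -i "$asm" -m genome -l bacteria_odb10 -o "busco_${name}"
    cp "busco_${name}"/short_summary.*.txt busco_summaries/
done

# Generate comparison plot from the collected summaries (requires R with ggplot2)
generate_plot.py -wd busco_summaries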

Python: Parse QUAST Output

import pandas as pd

def parse_quast(report_tsv):
    '''Parse QUAST report.tsv file.'''
    df = pd.read_csv(report_tsv, sep='\t', index_col=0)
    return df.T

stats = parse_quast('quast_output/report.tsv')
print(f"N50: {stats['N50'].values[0]}")
print(f"Total length: {stats['Total length'].values[0]}")
print(f"# contigs: {stats['# contigs'].values[0]}")

Python: Parse BUSCO Output

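import glob
import re

def parse_busco_summary(summary_file):
    '''Parse the one-line result string from a BUSCO short summary.'''
    with open(summary_file) as f:
        text = f.read()

    # Accept percentages with or without a decimal part (98.5 or 100)
    num = r'(\d+(?:\.\d+)?)'
    pattern = rf'C:{num}%\[S:{num}%,D:{num}%\],F:{num}%,M:{num}%,n:(\d+)'
    match = re.search(pattern, text)

    if match:
        return {
            'complete': float(match.group(1)),
            'single': float(match.group(2)),
            'duplicated': float(match.group(3)),
            'fragmented': float(match.group(4)),
            'missing': float(match.group(5)),
            'total': int(match.group(6))
        }
    return None

# BUSCO v4+ names the file short_summary.specific.<lineage>.<output>.txt
summary_files = glob.glob('busco_output/short_summary*.txt')
result = parse_busco_summary(summary_files[0]) if summary_files else None
if result:
    print(f"Complete: {result['complete']}%")
else:
    print("No BUSCO summary line found")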

MetaQUAST (Metagenomes)

metaquast.py metagenome_assembly.fa -o metaquast_output -t 16

Troubleshooting

Low N50

  • Check coverage depth
  • Consider longer reads
  • Try different assembler

Low BUSCO Completeness

  • Check input read quality
  • Verify correct lineage dataset
  • May indicate real gene loss (compare to relatives)

High Duplication in BUSCO

  • Normal for polyploids
  • May indicate contamination
  • Check for collapsed haplotypes

Related Skills

  • short-read-assembly - SPAdes assembly
  • long-read-assembly - Flye/Canu assembly
  • assembly-polishing - Improve accuracy
  • metagenomics - Metagenome analysis
<!-- AUTHOR_SIGNATURE: 9a7f3c2e-MD-BABU-MIA-2026-MSSM-SECURE -->