BioSkills bio-genome-assembly-scaffolding

Scaffold contigs into chromosome-level assemblies using Hi-C data with YaHS, 3D-DNA, SALSA2, and validate with BUSCO and contact maps. Use when scaffolding contigs to chromosome-level assemblies.

install

source · Clone the upstream repo

git clone https://github.com/GPTomics/bioSkills

Claude Code · Install into ~/.claude/skills/

T=$(mktemp -d) && git clone --depth=1 https://github.com/GPTomics/bioSkills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/genome-assembly/scaffolding" ~/.claude/skills/gptomics-bioskills-bio-genome-assembly-scaffolding && rm -rf "$T"

manifest: genome-assembly/scaffolding/SKILL.md

source content

Version Compatibility

Reference examples tested with: BUSCO 5.5+, BWA 0.7.17+, QUAST 5.2+, bedtools 2.31+, samtools 1.19+

Before using code patterns, verify installed versions match. If versions differ:

CLI:
```
<tool> --version
```
then
```
<tool> --help
```
to confirm flags

If code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.

Genome Scaffolding

"Scaffold my contigs to chromosome level" → Order and orient contigs into chromosome-scale scaffolds using Hi-C proximity ligation data.

CLI:

yahs contigs.fa hic_alignments.bam

3d-dna

salsa2

Hi-C Data Preprocessing

Goal: Prepare Hi-C read alignments for scaffolding tools.

Approach: Align Hi-C reads with BWA mem using Hi-C-specific flags (-5SP), then filter and deduplicate contacts with pairtools.

# Align Hi-C reads to draft assembly
bwa index draft_assembly.fa
bwa mem -5SP -t 16 draft_assembly.fa hic_R1.fq.gz hic_R2.fq.gz | \
    samtools view -@ 8 -bhS - > aligned.bam

# Filter for Hi-C contacts (pairtools)
pairtools parse --min-mapq 40 --walks-policy 5unique --max-inter-align-gap 30 \
    --nproc-in 8 --nproc-out 8 --chroms-path draft_assembly.fa.fai aligned.bam | \
    pairtools sort --nproc 8 | \
    pairtools dedup --nproc 8 --mark-dups | \
    pairtools split --output-pairs contacts.pairs.gz

YaHS Scaffolding (Recommended)

Goal: Order and orient contigs into chromosome-scale scaffolds using Hi-C contact data.

Approach: Convert alignments to BED format and run YaHS, which uses an iterative scaffolding algorithm with contact frequency optimization.

# Index assembly
samtools faidx draft_assembly.fa

# Convert BAM to BED
bedtools bamtobed -i aligned.bam | sort -k4 > aligned.bed

# Run YaHS
yahs draft_assembly.fa aligned.bed -o scaffolds

# Output files:
# scaffolds_scaffolds_final.fa - final scaffolds
# scaffolds_scaffolds_final.agp - AGP file
# scaffolds.bin - contact matrix

YaHS with Error Correction

Goal: Generate scaffolds with contact maps suitable for manual curation in Juicebox.

Approach: Run YaHS with error correction, then create a .hic file from the scaffolding output for visualization.

# Run with error correction
yahs draft_assembly.fa aligned.bed -o scaffolds --no-contig-ec

# Generate contact map for juicebox
juicer pre scaffolds.bin scaffolds_scaffolds_final.agp draft_assembly.fa.fai | \
    sort -k2,2d -k6,6d -T ./ --parallel=8 -S 50G | \
    awk 'NF' > scaffolds.pre.txt

# Create .hic file
java -Xmx48G -jar juicer_tools.jar pre scaffolds.pre.txt scaffolds.hic \
    <(cut -f1,2 scaffolds_scaffolds_final.fa.fai)

3D-DNA Pipeline

Goal: Scaffold contigs using the 3D-DNA automated pipeline with Juicer integration.

Approach: Run the 3D-DNA assembly pipeline on Juicer-processed Hi-C data, then optionally refine with Juicebox review.

# Prepare input (requires Juicer aligned data)
# Run Juicer first to get merged_nodups.txt

# Run 3D-DNA
run-asm-pipeline.sh -r 2 draft_assembly.fa merged_nodups.txt

# Output: draft_assembly.final.fasta

# Generate review assembly for Juicebox
run-asm-pipeline-post-review.sh -r draft_assembly.final.review.assembly \
    draft_assembly.final.fasta merged_nodups.txt

SALSA2 Scaffolding

Goal: Scaffold contigs using SALSA2's proximity-guided assembly algorithm.

Approach: Run SALSA2 with Hi-C alignments and restriction enzyme sites for scaffolding decisions.

# Run SALSA2
python run_pipeline.py -a draft_assembly.fa -l draft_assembly.fa.fai \
    -b aligned.bed -e GATC -o salsa_output -m yes

# With multiple restriction enzymes
python run_pipeline.py -a draft_assembly.fa -l draft_assembly.fa.fai \
    -b aligned.bed -e GATC,GANTC -o salsa_output -m yes -p yes

Generate Contact Map

Goal: Create multi-resolution contact matrices for visualization and validation.

Approach: Load contacts into cooler format, balance the matrix, and generate a multi-resolution .mcool file.

# Using cooler
cooler cload pairs -c1 2 -p1 3 -c2 4 -p2 5 \
    draft_assembly.fa.fai:10000 contacts.pairs.gz scaffolds.cool

# Balance matrix
cooler balance scaffolds.cool

# Multi-resolution (mcool)
cooler zoomify scaffolds.cool -o scaffolds.mcool

Visualize with HiGlass

Goal: Interactively visualize Hi-C contact maps to verify scaffolding correctness.

Approach: Convert data to HiGlass-compatible formats and serve via Docker container.

# Convert to higlass format
clodius aggregate bedfile --chromsizes-filename chrom.sizes \
    --output-file scaffolds.beddb scaffold_boundaries.bed

# Load into higlass server
docker run --detach --publish 8888:80 \
    --volume ~/hg-data:/data \
    higlass/higlass-docker:latest

Manual Curation (Juicebox)

# Load .hic file in Juicebox Assembly Tools (JBAT)
# Perform manual corrections:
# - Break misjoins
# - Order/orient scaffolds
# - Merge scaffolds

# Export corrected assembly
# File -> Export Assembly -> FASTA

Post-Scaffolding Gap Filling

# TGS-GapCloser for long-read gap filling
tgsgapcloser --scaff scaffolds.fa --reads ont_reads.fq.gz \
    --output filled --thread 16 --ne

# LR_Gapcloser alternative
LR_Gapcloser.sh -i scaffolds.fa -l ont_reads.fq.gz -t 16 -o gapclosed.fa

Validate Scaffolding

# Check chromosome-scale contiguity
seqkit stats scaffolds.fa

# BUSCO on scaffolds
busco -i scaffolds.fa -l eukaryota_odb10 -o busco_scaffolds -m genome -c 16

# N50/L50 statistics
assembly-stats scaffolds.fa

# Compare pre/post scaffolding
quast.py draft_assembly.fa scaffolds.fa -o quast_comparison

Check Telomeres

# Find telomeric repeats (vertebrate TTAGGG)
seqkit locate -i -p 'TTAGGG{10,}' scaffolds.fa > telomeres_forward.bed
seqkit locate -i -p 'CCCTAA{10,}' scaffolds.fa > telomeres_reverse.bed

# Count chromosomes with telomeres on both ends
awk '$2 < 1000' telomeres_forward.bed | cut -f1 | sort -u > left_telomeres.txt
awk -v OFS='\t' 'NR==FNR{len[$1]=$2;next} $3 > len[$1]-1000' \
    scaffolds.fa.fai telomeres_reverse.bed | cut -f1 | sort -u > right_telomeres.txt
comm -12 left_telomeres.txt right_telomeres.txt > complete_chromosomes.txt

Rename to Chromosomes

# After manual curation, rename scaffolds to chromosomes
awk '/^>/{print ">chr" ++i; next}{print}' scaffolds.fa > chromosomes.fa

# Or with mapping file
seqkit replace -p '(.+)' -r '{kv}' -k name_mapping.tsv scaffolds.fa > chromosomes.fa

Related Skills

genome-assembly/long-read-assembly - Generate initial contigs
genome-assembly/assembly-polishing - Polish before scaffolding
genome-assembly/assembly-qc - Validate final assembly
hi-c-analysis/hic-data-io - Hi-C data processing