BioSkills bio-genome-assembly-hifi-assembly

High-quality genome assembly from PacBio HiFi reads using hifiasm with phasing support. Use when building reference-quality diploid assemblies from HiFi data, especially with trio or Hi-C phasing for fully resolved haplotypes.

install

source · Clone the upstream repo

git clone https://github.com/GPTomics/bioSkills

Claude Code · Install into ~/.claude/skills/

T=$(mktemp -d) && git clone --depth=1 https://github.com/GPTomics/bioSkills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/genome-assembly/hifi-assembly" ~/.claude/skills/gptomics-bioskills-bio-genome-assembly-hifi-assembly && rm -rf "$T"

manifest: genome-assembly/hifi-assembly/SKILL.md

source content

Version Compatibility

Reference examples tested with: BUSCO 5.5+, QUAST 5.2+

Before using code patterns, verify installed versions match. If versions differ:

Python:
```
pip show <package>
```
then
```
help(module.function)
```
to check signatures
CLI:
```
<tool> --version
```
then
```
<tool> --help
```
to confirm flags

If code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.

HiFi Assembly

"Assemble a genome from HiFi reads" → Build a phased, reference-quality diploid assembly from PacBio HiFi reads with optional trio or Hi-C phasing for full haplotype resolution.

CLI:
```
hifiasm -o output reads.fq.gz
```

Basic Assembly

Goal: Produce a primary contig assembly from PacBio HiFi reads.

Approach: Run hifiasm with default parameters and convert GFA output to FASTA.

# Primary assembly (single haplotype consensus)
hifiasm -o output_prefix -t 32 reads.hifi.fastq.gz

# Output files:
# output_prefix.bp.p_ctg.gfa  - Primary contigs
# output_prefix.bp.a_ctg.gfa  - Alternate contigs
# output_prefix.bp.hap1.p_ctg.gfa - Haplotype 1 (if phased)
# output_prefix.bp.hap2.p_ctg.gfa - Haplotype 2 (if phased)

# Convert GFA to FASTA
awk '/^S/{print ">"$2;print $3}' output_prefix.bp.p_ctg.gfa > assembly.fasta

Trio-Binned Phasing

Goal: Generate fully phased haplotype assemblies using parental short-read data.

Approach: Build k-mer databases from parental reads with yak, then supply them to hifiasm for trio binning.

# With parental short reads for trio binning
hifiasm -o trio_asm -t 32 \
    -1 paternal.yak \
    -2 maternal.yak \
    child.hifi.fastq.gz

# Create yak databases from parental Illumina reads first
yak count -b37 -t16 -o paternal.yak paternal_R1.fq.gz paternal_R2.fq.gz
yak count -b37 -t16 -o maternal.yak maternal_R1.fq.gz maternal_R2.fq.gz

Hi-C Phasing

Goal: Phase haplotypes without parental data using chromatin proximity.

Approach: Supply Hi-C paired-end reads to hifiasm for contact-based haplotype resolution.

# Use Hi-C reads for phasing (no parents needed)
hifiasm -o hic_asm -t 32 \
    --h1 hic_R1.fastq.gz \
    --h2 hic_R2.fastq.gz \
    reads.hifi.fastq.gz

# Produces fully phased hap1 and hap2 assemblies

Key Parameters

Parameter	Default	Description
-t	1	Threads
-l	0	Purge level (0=none, 1=light, 2=aggressive)
-s	0.55	Similarity threshold for duplicate detection
--primary	-	Output primary contigs only (no alternates)
--n-hap	2	Expected number of haplotypes
-D	5.0	Drop reads with depth > D*average
-N	100	Consider up to N overlaps for each read

Purge Duplicates

Goal: Control haplotype duplication levels in the primary assembly.

Approach: Adjust the hifiasm purge level parameter based on sample heterozygosity.

# Aggressive purging for high heterozygosity
hifiasm -o asm -t 32 -l 2 reads.hifi.fastq.gz

# Minimal purging for inbred samples
hifiasm -o asm -t 32 -l 0 reads.hifi.fastq.gz

Ultra-Long ONT Integration

Goal: Improve contiguity across complex repeats by supplementing HiFi with ultra-long ONT reads.

Approach: Supply ONT reads via the --ul flag to span regions that HiFi reads alone cannot resolve.

# Combine HiFi accuracy with ONT length
hifiasm -o hybrid_asm -t 32 \
    --ul ont_ultralong.fastq.gz \
    hifi_reads.fastq.gz

# UL reads help span complex repeats

Assembly Stats

Goal: Evaluate assembly contiguity, completeness, and correctness.

Approach: Compute summary statistics with seqkit/assembly-stats, run QUAST for contiguity metrics, and BUSCO for gene completeness.

# Quick stats with seqkit
seqkit stats assembly.fasta

# Detailed with assembly-stats
assembly-stats assembly.fasta

# QUAST assessment
quast.py -o quast_output assembly.fasta

# BUSCO completeness
busco -i assembly.fasta -l mammalia_odb10 -o busco_out -m genome

Memory and Runtime

Genome Size	HiFi Coverage	RAM	Time (32 cores)
3 Gb	30x	~200 GB	12-24 hours
3 Gb	60x	~400 GB	24-48 hours
500 Mb	40x	~64 GB	2-4 hours

Python Wrapper

Goal: Provide a reusable Python interface for running hifiasm with various phasing modes.

Approach: Wrap the hifiasm CLI call with optional Hi-C and ultra-long ONT parameters, then convert GFA output to FASTA.

import subprocess
from pathlib import Path

def run_hifiasm(hifi_reads, output_prefix, threads=32, purge_level=0,
                hic_r1=None, hic_r2=None, ul_reads=None):
    cmd = ['hifiasm', '-o', output_prefix, '-t', str(threads), '-l', str(purge_level)]

    if hic_r1 and hic_r2:
        cmd.extend(['--h1', hic_r1, '--h2', hic_r2])

    if ul_reads:
        cmd.extend(['--ul', ul_reads])

    cmd.append(hifi_reads)
    subprocess.run(cmd, check=True)

    gfa = Path(f'{output_prefix}.bp.p_ctg.gfa')
    fasta = Path(f'{output_prefix}.fasta')

    with open(fasta, 'w') as out:
        with open(gfa) as f:
            for line in f:
                if line.startswith('S'):
                    parts = line.strip().split('\t')
                    out.write(f'>{parts[1]}\n{parts[2]}\n')

    return fasta

# Example
assembly = run_hifiasm('sample.hifi.fq.gz', 'sample_asm', threads=48, hic_r1='hic_R1.fq.gz', hic_r2='hic_R2.fq.gz')

Troubleshooting

Issue	Solution
High duplication	Increase purge level (-l 2)
Missing haplotypes	Add Hi-C or trio data for phasing
Memory errors	Reduce -D parameter or downsample reads
Fragmented assembly	Check read quality; consider UL ONT addition

Related Skills

genome-assembly/assembly-qc - QUAST and BUSCO
genome-assembly/scaffolding - YaHS Hi-C scaffolding
genome-assembly/contamination-detection - CheckM2 decontamination
long-read-sequencing/read-qc - HiFi quality control