BioSkills bio-genome-annotation-eukaryotic-gene-prediction

Predict protein-coding genes in eukaryotic genomes using BRAKER3 for combined RNA-seq and protein evidence, or GALBA for protein-only evidence. Runs Augustus with trained parameters for accurate gene models. Use when annotating a newly assembled eukaryotic genome or improving existing gene predictions.

install
source · Clone the upstream repo
git clone https://github.com/GPTomics/bioSkills
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/GPTomics/bioSkills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/genome-annotation/eukaryotic-gene-prediction" ~/.claude/skills/gptomics-bioskills-bio-genome-annotation-eukaryotic-gene-prediction && rm -rf "$T"
manifest: genome-annotation/eukaryotic-gene-prediction/SKILL.md
source content

Version Compatibility

Reference examples tested with: BUSCO 5.5+, HISAT2 2.2.1+, pandas 2.2+, samtools 1.19+

Before using code patterns, verify installed versions match. If versions differ:

  • Python:
    pip show <package>
    then
    help(module.function)
    to check signatures
  • CLI:
    <tool> --version
    then
    <tool> --help
    to confirm flags

If code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.

Eukaryotic Gene Prediction

"Predict genes in my eukaryotic genome" → Identify protein-coding gene structures (exons, introns, UTRs) using RNA-seq alignment evidence and/or protein homology to train ab initio predictors.

  • CLI:
    braker.pl --genome=assembly.fa --bam=rnaseq.bam --prot_seq=proteins.fa
    (BRAKER3)

Predict protein-coding genes in eukaryotic genomes using evidence-based methods. BRAKER3 combines RNA-seq and protein homology evidence for the most accurate predictions. GALBA provides an alternative when only protein evidence is available.

Prerequisites

CRITICAL: The input assembly must be softmasked (repeats in lowercase). Run repeat-annotation first to softmask the genome. Unmasked assemblies produce many false positive gene predictions in repetitive regions.

BRAKER3 (RNA-seq + Protein Evidence)

BRAKER3 is the preferred pipeline when RNA-seq data exists. It combines GeneMark-ETP with Augustus training for high-accuracy gene models.

Evidence Preparation

# Align RNA-seq with HISAT2
hisat2-build assembly_softmasked.fasta genome_index
hisat2 -x genome_index -1 reads_R1.fastq.gz -2 reads_R2.fastq.gz \
    --dta -p 16 | samtools sort -@ 4 -o rnaseq_sorted.bam
samtools index rnaseq_sorted.bam

# Download OrthoDB proteins for the relevant clade
# Available: Metazoa, Viridiplantae, Fungi, Arthropoda, Vertebrata
wget https://bioinf.uni-greifswald.de/bioinf/partitioned_odb11/Viridiplantae.fa.gz
gunzip Viridiplantae.fa.gz

Run BRAKER3

braker.pl \
    --genome=assembly_softmasked.fasta \
    --bam=rnaseq_sorted.bam \
    --prot_seq=Viridiplantae.fa \
    --softmasking \
    --threads=16 \
    --species=my_species \
    --workingdir=braker3_out \
    --gff3

Key Options

OptionDescription
--genome
Softmasked genome assembly (FASTA)
--bam
RNA-seq alignments (BAM, can specify multiple comma-separated)
--prot_seq
Protein evidence (FASTA)
--softmasking
Genome is softmasked (required)
--species
Species name for Augustus training
--threads
CPU threads
--gff3
Output in GFF3 format (default is GTF)
--fungus
Use fungal-specific parameters
--workingdir
Output directory
--UTR
Predict UTRs (requires sufficient RNA-seq coverage)

Output Files

braker3_out/
├── braker.gtf           # Gene predictions (GTF)
├── braker.gff3          # Gene predictions (GFF3, if --gff3)
├── braker.codingseq     # CDS nucleotide sequences
├── braker.aa            # Protein sequences
├── hintsfile.gff        # All evidence hints
├── Augustus/            # Trained Augustus parameters
└── GeneMark-ETP/        # GeneMark intermediate files

GALBA (Protein-Only Evidence)

GALBA works best with closely related species proteins. Always prefer BRAKER3 when RNA-seq data is available, as intron evidence from RNA-seq substantially improves splice site accuracy.

galba.pl \
    --genome=assembly_softmasked.fasta \
    --prot_seq=closely_related_proteins.fa \
    --softmasking \
    --threads=16 \
    --species=my_species \
    --workingdir=galba_out \
    --gff3

When to Use GALBA vs BRAKER3

ScenarioTool
RNA-seq + proteins availableBRAKER3
Proteins only, closely related speciesGALBA
Proteins only, distant speciesBRAKER3 --prot_seq only (uses ProtHint)
No external evidenceAugustus with pre-trained species (last resort)

Augustus Standalone

For genomes with pre-trained parameters or when rerunning predictions with different hints.

# List available pre-trained species
augustus --species=help

# Run with pre-trained species
augustus \
    --species=arabidopsis \
    --softmasking=1 \
    --gff3=on \
    --UTR=off \
    assembly_softmasked.fasta > augustus_predictions.gff3

# Run with hints (evidence)
augustus \
    --species=my_species \
    --softmasking=1 \
    --gff3=on \
    --hintsfile=hints.gff \
    --extrinsicCfgFile=extrinsic.cfg \
    assembly_softmasked.fasta > augustus_hints.gff3

Evaluation with BUSCO

# Evaluate predicted proteins
busco -i braker3_out/braker.aa -m proteins -l embryophyta_odb10 -o busco_proteins -c 8

# Compare to genome-mode BUSCO
busco -i assembly_softmasked.fasta -m genome -l embryophyta_odb10 -o busco_genome -c 8

Interpretation

MetricMeaning
Protein BUSCO > Genome BUSCOAnnotation captures most genes
Protein BUSCO < Genome BUSCOMissing gene models, re-examine evidence
High duplicationCheck for retained duplicates vs artifacts

Python: Parse Gene Models

Goal: Compute summary statistics for predicted gene models to assess annotation quality and compare against known species benchmarks.

Approach: Load the GFF3 into a gffutils database, iterate through gene and transcript features to compute counts, exons per transcript, intron lengths, and single-exon gene fraction.

import gffutils
import pandas as pd

def load_gene_models(gff_file):
    '''Load eukaryotic gene models from GFF3/GTF.'''
    db = gffutils.create_db(str(gff_file), ':memory:', merge_strategy='merge')
    return db

def gene_model_stats(db):
    '''Compute statistics for predicted gene models.'''
    genes = list(db.features_of_type('gene'))
    transcripts = list(db.features_of_type(['mRNA', 'transcript']))

    exon_counts, intron_lengths, gene_lengths = [], [], []
    for gene in genes:
        gene_lengths.append(gene.end - gene.start + 1)
        for tx in db.children(gene, featuretype=['mRNA', 'transcript']):
            exons = list(db.children(tx, featuretype='exon'))
            exon_counts.append(len(exons))
            sorted_exons = sorted(exons, key=lambda e: e.start)
            for i in range(len(sorted_exons) - 1):
                intron_len = sorted_exons[i + 1].start - sorted_exons[i].end - 1
                intron_lengths.append(intron_len)

    stats = {
        'total_genes': len(genes),
        'total_transcripts': len(transcripts),
        'transcripts_per_gene': len(transcripts) / len(genes) if genes else 0,
        'median_exons_per_transcript': pd.Series(exon_counts).median() if exon_counts else 0,
        'median_gene_length': pd.Series(gene_lengths).median() if gene_lengths else 0,
        'median_intron_length': pd.Series(intron_lengths).median() if intron_lengths else 0,
        'single_exon_genes': sum(1 for e in exon_counts if e == 1),
    }
    return stats

db = load_gene_models('braker3_out/braker.gff3')
stats = gene_model_stats(db)
for key, val in stats.items():
    print(f'{key}: {val}')

Troubleshooting

Low Gene Count

  • Verify softmasking (lowercase repeats present in assembly)
  • Check RNA-seq alignment rate (>70% expected)
  • Ensure OrthoDB proteins match the correct clade

Many Single-Exon Genes

  • May indicate bacterial contamination
  • Check if assembly was properly softmasked
  • Consider filtering predictions by evidence support

BRAKER3 Fails

  • Ensure all dependencies are in PATH (GeneMark, Augustus, DIAMOND)
  • Check that genome headers have no special characters (pipes, spaces)
  • Simplify FASTA headers:
    sed 's/ .*//' assembly.fasta > clean.fasta

Related Skills

  • repeat-annotation - PREREQUISITE: softmask repeats before gene prediction
  • functional-annotation - Add GO/KEGG/Pfam to predicted proteins
  • read-alignment/star-alignment - Alternative RNA-seq aligner for evidence
  • genome-assembly/assembly-qc - Verify assembly quality before prediction