SciAgent-Skills prokka-genome-annotation
Annotate prokaryotic genome assemblies (bacteria, archaea, viruses) with Prokka's BLAST/HMM-based pipeline. Identifies CDS, rRNA, tRNA, tmRNA, and signal peptides against Pfam, TIGRFAMs, and RefSeq databases. Produces GFF3, GenBank, protein FASTA, and TSV outputs. Use PGAP instead when submitting to NCBI GenBank; use Bakta for faster annotation with NCBI-compatible outputs on modern assemblies.
git clone https://github.com/jaechang-hits/SciAgent-Skills
T=$(mktemp -d) && git clone --depth=1 https://github.com/jaechang-hits/SciAgent-Skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/genomics-bioinformatics/prokka-genome-annotation" ~/.claude/skills/jaechang-hits-sciagent-skills-prokka-genome-annotation && rm -rf "$T"
skills/genomics-bioinformatics/prokka-genome-annotation/SKILL.mdProkka Genome Annotation
Overview
Prokka is a command-line pipeline for rapid annotation of prokaryotic genomes (bacteria, archaea, and viruses). It uses a tiered search strategy: protein-coding genes (CDS) are predicted with Prodigal and searched first against a genus-specific database, then RefSeq proteins, then Pfam/TIGRFAMs HMMs. Non-coding RNA genes (rRNA, tRNA, tmRNA) are identified with Barrnap, Aragorn, and Infernal. Prokka processes a single FASTA assembly in minutes and outputs a comprehensive annotation in GFF3, GenBank, FASTA, and tabular formats.
When to Use
- Annotating a newly assembled bacterial or archaeal genome from Illumina, PacBio, or Nanopore assemblies
- Getting functional protein annotations (CDS with product names, EC numbers, GO terms) from a draft or complete genome
- Preparing annotation files for downstream comparative genomics (Roary pan-genome, OrthoFinder)
- Annotating viral or phage genomes when kingdom-specific databases are important
- Performing metagenome-assembled genome (MAG) annotation with the
flag--metagenome - Parsing annotated outputs in Python with BioPython for downstream sequence or feature analysis
- Use PGAP (NCBI Prokaryotic Genome Annotation Pipeline) instead when the goal is NCBI GenBank submission with standards compliance
- Use Bakta instead for faster annotation with built-in NCBI-compatible outputs and a more regularly updated database
Prerequisites
- Software: Prokka ≥ 1.14, Perl 5, Prodigal, Barrnap, HMMER3, BLAST+, Aragorn, Infernal, tbl2asn
- Python packages (for output parsing):
,biopython
,pandasmatplotlib - Input: assembled genome in FASTA format (complete or draft with multiple contigs)
- Environment: conda strongly recommended to handle the Perl and C dependency stack
Check before installing: The tool may already be available in the current environment (e.g., inside a
/pixienv). Runcondafirst and skip the install commands below if it returns a path. When running inside a pixi project, invoke the tool viacommand -v prokkarather than barepixi run prokka.prokka
# Install Prokka via conda/mamba (recommended) conda install -c conda-forge -c bioconda prokka # Or with mamba (faster) mamba install -c conda-forge -c bioconda prokka # Verify installation and database setup prokka --version # prokka 1.14.6 # Check that required tools are on PATH prokka --depends # prokka needs: awk, sed, grep, makeblastdb, blastp, hmmscan, ... # Install Python parsing dependencies pip install biopython pandas matplotlib
Quick Start
# Annotate a bacterial genome assembly — results in results/ directory prokka genome.fasta \ --outdir results/ \ --prefix sample1 \ --kingdom Bacteria \ --cpus 4 # Check output summary cat results/sample1.txt # Organism: Genus species strain # Contigs: 1 # Bases: 4639675 # CDS: 4140 # rRNA: 22 # tRNA: 86 echo "Annotation complete. Key output files:" ls results/sample1.{gff,gbk,faa,ffn,tsv}
Workflow
Step 1: Install and Verify Prokka
Install Prokka and confirm all dependent tools are accessible in the current environment.
# Create a dedicated conda environment conda create -n prokka_env -c conda-forge -c bioconda prokka python=3.10 -y conda activate prokka_env # Verify Prokka version and all tool dependencies prokka --version # prokka 1.14.6 prokka --depends # Checking that required tools are installed... # OK: makeblastdb is installed (2.13.0+) # OK: blastp is installed (2.13.0+) # OK: hmmscan is installed (3.3.2) # OK: prodigal is installed (2.6.3) # OK: barrnap is installed (0.9) # Check available genus-specific databases bundled with Prokka ls $(conda info --base)/envs/prokka_env/db/genus/ # Archaea Bacteria Mitochondria Viruses # Install Python parsing tools pip install biopython pandas matplotlib
Step 2: Prepare the Input Genome
Clean and rename contigs to comply with Prokka's header requirements before annotation.
from Bio import SeqIO import re # Load and inspect assembly input_fasta = "genome.fasta" records = list(SeqIO.parse(input_fasta, "fasta")) print(f"Input assembly: {len(records)} contigs") total_bases = sum(len(r) for r in records) print(f"Total bases: {total_bases:,}") print(f"Largest contig: {max(len(r) for r in records):,} bp") print(f"N50 approx: see assembly stats tool") # Rename contigs to short IDs compatible with Prokka (max 37 chars) # Prokka requires: no spaces, no special characters in header cleaned = [] for i, rec in enumerate(records, 1): new_id = f"contig_{i:04d}" new_rec = rec.__class__(rec.seq, id=new_id, description=f"len={len(rec.seq)}") cleaned.append(new_rec) SeqIO.write(cleaned, "genome_clean.fasta", "fasta") print(f"\nWrote genome_clean.fasta with {len(cleaned)} renamed contigs") # genome_clean.fasta: contig_0001 through contig_NNNN
# Alternatively, clean headers with a simple bash one-liner awk '/^>/{print ">contig_" ++i; next}{print}' genome.fasta > genome_clean.fasta # Filter out short contigs (< 200 bp) to reduce annotation noise awk '/^>/{header=$0; next} length($0) >= 200 {print header; print}' \ genome_clean.fasta > genome_filtered.fasta echo "Filtered assembly ready: $(grep -c '>' genome_filtered.fasta) contigs"
Step 3: Run Basic Prokka Annotation
Run Prokka with standard options for a bacterial genome, specifying genus/species for database selection.
# Basic annotation with genus/species hint (uses genus-specific protein database first) prokka genome_clean.fasta \ --outdir annotation/ \ --prefix E_coli_K12 \ --kingdom Bacteria \ --genus Escherichia \ --species coli \ --strain K12 \ --cpus 8 \ --mincontiglen 200 # Expected runtime: 2–10 minutes for a typical 4–6 Mb bacterial genome echo "Prokka annotation output files:" ls annotation/ # E_coli_K12.err E_coli_K12.faa E_coli_K12.ffn # E_coli_K12.fna E_coli_K12.gbk E_coli_K12.gff # E_coli_K12.log E_coli_K12.sqn E_coli_K12.tbl # E_coli_K12.tsv E_coli_K12.txt
Step 4: Parse Annotation Summary (TSV)
Load the TSV output for a quick overview of annotated features and their functional assignments.
import pandas as pd # Load the annotation TSV (tab-delimited feature table) tsv_file = "annotation/E_coli_K12.tsv" df = pd.read_csv(tsv_file, sep="\t") print(f"Total features: {len(df)}") print(f"Columns: {list(df.columns)}") # Columns: [locus_tag, ftype, length_bp, gene, EC_number, COG, product] # Feature type summary print("\nFeature type counts:") print(df["ftype"].value_counts().to_string()) # CDS 4140 # tRNA 86 # rRNA 22 # tmRNA 1 # Functional gene annotations (non-hypothetical CDS) cds_df = df[df["ftype"] == "CDS"].copy() hypothetical = cds_df["product"].str.contains("hypothetical", case=False, na=True) print(f"\nCDS with known function: {(~hypothetical).sum()}") print(f"Hypothetical proteins: {hypothetical.sum()}") # Genes with EC numbers (enzymes) ec_annotated = cds_df[cds_df["EC_number"].notna() & (cds_df["EC_number"] != "")] print(f"CDS with EC numbers: {len(ec_annotated)}") print(ec_annotated[["locus_tag", "gene", "EC_number", "product"]].head(5).to_string(index=False))
Step 5: Parse GenBank Output with BioPython
Read the GenBank file to access per-gene sequences, qualifiers, and feature coordinates.
from Bio import SeqIO import pandas as pd # Parse GenBank file gbk_file = "annotation/E_coli_K12.gbk" records = list(SeqIO.parse(gbk_file, "genbank")) print(f"Contigs in GenBank: {len(records)}") # Iterate over CDS features and extract details rows = [] for rec in records: for feat in rec.features: if feat.type != "CDS": continue qualifiers = feat.qualifiers rows.append({ "contig": rec.id, "locus_tag": qualifiers.get("locus_tag", ["?"])[0], "gene": qualifiers.get("gene", [""])[0], "product": qualifiers.get("product", ["hypothetical protein"])[0], "EC_number": qualifiers.get("EC_number", [""])[0], "protein_id": qualifiers.get("protein_id", [""])[0], "start": int(feat.location.start), "end": int(feat.location.end), "strand": feat.location.strand, "aa_length": len(qualifiers.get("translation", [""])[0]), }) features_df = pd.DataFrame(rows) print(f"CDS features extracted: {len(features_df)}") print(features_df.head(3).to_string(index=False)) # Retrieve protein sequence for a specific gene gene_name = "dnaA" gene_feat = features_df[features_df["gene"] == gene_name] if not gene_feat.empty: idx = gene_feat.index[0] print(f"\n{gene_name}: {features_df.loc[idx, 'aa_length']} aa")
Step 6: Visualize Annotation Statistics
Generate a summary barplot of feature types and functional annotation coverage.
import pandas as pd import matplotlib.pyplot as plt import matplotlib.patches as mpatches tsv_file = "annotation/E_coli_K12.tsv" df = pd.read_csv(tsv_file, sep="\t") cds = df[df["ftype"] == "CDS"].copy() n_cds = len(cds) n_known = (~cds["product"].str.contains("hypothetical", case=False, na=True)).sum() n_hypo = n_cds - n_known n_ec = cds["EC_number"].notna().sum() n_rrna = (df["ftype"] == "rRNA").sum() n_trna = (df["ftype"] == "tRNA").sum() fig, axes = plt.subplots(1, 2, figsize=(11, 4)) # Left panel: feature type counts types = df["ftype"].value_counts() colors_left = ["#2980B9", "#27AE60", "#E74C3C", "#F39C12"] + ["#95A5A6"] * len(types) axes[0].bar(types.index, types.values, color=colors_left[:len(types)], edgecolor="white") axes[0].set_title("Annotated Feature Counts") axes[0].set_xlabel("Feature Type") axes[0].set_ylabel("Count") for i, (label, val) in enumerate(types.items()): axes[0].text(i, val + 5, str(val), ha="center", fontsize=9) # Right panel: CDS functional annotation breakdown labels = ["Known function", "Hypothetical protein", "Enzyme (EC number)"] sizes = [n_known, n_hypo, n_ec] colors_right = ["#2980B9", "#BDC3C7", "#E74C3C"] bars = axes[1].bar(labels, sizes, color=colors_right, edgecolor="white") axes[1].set_title(f"CDS Functional Annotation\n(n={n_cds} total CDS)") axes[1].set_ylabel("Count") axes[1].tick_params(axis="x", rotation=15) for bar, val in zip(bars, sizes): axes[1].text(bar.get_x() + bar.get_width() / 2, val + 10, str(val), ha="center", fontsize=9) plt.suptitle("Prokka Genome Annotation Summary", fontsize=12, fontweight="bold") plt.tight_layout() plt.savefig("prokka_annotation_summary.png", dpi=150, bbox_inches="tight") print(f"Saved prokka_annotation_summary.png") print(f"CDS: {n_cds} | Known function: {n_known} | Hypothetical: {n_hypo}") print(f"rRNA: {n_rrna} | tRNA: {n_trna} | EC-annotated CDS: {n_ec}")
Step 7: Batch Annotation Across Multiple Genomes
Annotate multiple genome assemblies sequentially and collect summary statistics.
#!/bin/bash # batch_prokka.sh — annotate all FASTA files in a directory INPUT_DIR="genomes/" OUTPUT_DIR="annotations/" mkdir -p "$OUTPUT_DIR" for FASTA in "$INPUT_DIR"/*.fasta; do SAMPLE=$(basename "$FASTA" .fasta) echo "Annotating: $SAMPLE" prokka "$FASTA" \ --outdir "${OUTPUT_DIR}/${SAMPLE}" \ --prefix "$SAMPLE" \ --kingdom Bacteria \ --cpus 4 \ --mincontiglen 200 \ --quiet echo " Done: ${OUTPUT_DIR}/${SAMPLE}/${SAMPLE}.txt" done echo "Batch annotation complete."
# Collect summary statistics from all Prokka .txt files from pathlib import Path import pandas as pd annotation_dir = Path("annotations/") rows = [] for txt_file in sorted(annotation_dir.glob("*/*.txt")): sample = txt_file.stem stats = {} with open(txt_file) as f: for line in f: line = line.strip() if ": " in line: key, val = line.split(": ", 1) stats[key.strip()] = val.strip() rows.append({ "sample": sample, "contigs": int(stats.get("Contigs", 0)), "bases": int(stats.get("Bases", 0)), "CDS": int(stats.get("CDS", 0)), "rRNA": int(stats.get("rRNA", 0)), "tRNA": int(stats.get("tRNA", 0)), }) summary_df = pd.DataFrame(rows) print(f"Annotated genomes: {len(summary_df)}") print(summary_df.to_string(index=False)) summary_df.to_csv("batch_annotation_summary.csv", index=False) print("\nSaved: batch_annotation_summary.csv")
Step 8: Compare Annotations Between Strains
Use the protein FASTA outputs for cross-strain comparison with identity-based clustering.
from Bio import SeqIO, pairwise2 import pandas as pd # Load protein sequences from two strains def load_proteins(faa_file): """Return dict of locus_tag -> protein sequence.""" return {rec.id: str(rec.seq) for rec in SeqIO.parse(faa_file, "fasta")} strain_a = load_proteins("annotation_A/strain_A.faa") strain_b = load_proteins("annotation_B/strain_B.faa") print(f"Strain A proteins: {len(strain_a)}") print(f"Strain B proteins: {len(strain_b)}") # Compare gene counts and size distributions import matplotlib.pyplot as plt import numpy as np len_a = [len(seq) for seq in strain_a.values()] len_b = [len(seq) for seq in strain_b.values()] fig, ax = plt.subplots(figsize=(8, 4)) bins = np.linspace(0, 1500, 50) ax.hist(len_a, bins=bins, alpha=0.6, label=f"Strain A (n={len(len_a)})", color="#2980B9") ax.hist(len_b, bins=bins, alpha=0.6, label=f"Strain B (n={len(len_b)})", color="#E74C3C") ax.set_xlabel("Protein Length (aa)") ax.set_ylabel("Count") ax.set_title("Protein Length Distribution by Strain") ax.legend() plt.tight_layout() plt.savefig("strain_protein_length_comparison.png", dpi=150, bbox_inches="tight") print("Saved strain_protein_length_comparison.png") # Find proteins unique to each strain by size/count difference print(f"\nSize difference: {abs(len(strain_a) - len(strain_b))} proteins") print("→ Use Roary or OrthoFinder for formal pan-genome analysis")
Key Parameters
| Parameter | Default | Range / Options | Effect |
|---|---|---|---|
| | , , , | Selects gene prediction model and default databases |
| — | any genus name string | Prioritizes genus-specific protein database for CDS annotation |
| — | any species name string | Combined with for locus_tag prefix and organism metadata |
| — | any string | Added to organism metadata in GenBank output |
| — | path to FASTA file | Custom protein database prepended before RefSeq search |
| — | path to HMM file | Additional custom HMM database for specialized annotation |
| | – | E-value cutoff for BLAST and HMMER hits |
| | –CPU count | Parallel processes for BLAST and HMMER searches |
| | any integer | Skip contigs shorter than this length (bp) |
| off | flag | Disables Prodigal training on this genome (for MAGs) |
| off | flag | Enable Infernal rRNA/ncRNA search against Rfam (slower) |
| off | flag | Skip rRNA prediction (use when assembly has no rRNA genes) |
| off | flag | Skip tRNA prediction |
Common Recipes
Recipe: Annotation with Custom Protein Database
When to use: You have closely related reference proteins (e.g., characterized isolate) to improve annotation accuracy.
# Use a custom protein database to enhance annotation of a novel strain # Custom proteins are searched first, before internal Prokka databases prokka genome.fasta \ --proteins reference_proteins.faa \ --outdir custom_annotation/ \ --prefix novel_strain \ --kingdom Bacteria \ --cpus 4 echo "Custom DB annotation complete:" grep "CDS" custom_annotation/novel_strain.txt
Recipe: Metagenome-Assembled Genome (MAG) Annotation
When to use: Annotating a MAG where Prodigal cannot train on the full genome sequence.
# --metagenome disables Prodigal model training (uses meta mode) # --mincontiglen 500 discards short, potentially chimeric contigs prokka MAG_bin_42.fasta \ --metagenome \ --kingdom Bacteria \ --outdir mag_annotation/ \ --prefix MAG_bin_42 \ --mincontiglen 500 \ --cpus 4 echo "MAG annotation complete:" cat mag_annotation/MAG_bin_42.txt
Recipe: Extract Specific Gene Sequences
When to use: Retrieve nucleotide or protein sequences for a target gene or pathway from the annotation.
from Bio import SeqIO # Extract all sequences for a specific gene name from protein FASTA faa_file = "annotation/E_coli_K12.faa" target_gene = "dnaA" matches = [] for rec in SeqIO.parse(faa_file, "fasta"): # Prokka FASTA header: >locus_tag gene product if target_gene in rec.description: matches.append(rec) print(f"Proteins matching '{target_gene}': {len(matches)}") for rec in matches: print(f" {rec.id}: {len(rec.seq)} aa — {rec.description}") # Save matching sequences if matches: SeqIO.write(matches, f"{target_gene}_proteins.faa", "fasta") print(f"Saved: {target_gene}_proteins.faa")
Recipe: Convert GFF to Pandas DataFrame
When to use: Work with genomic coordinates for feature overlap analysis or visualization.
import pandas as pd def parse_gff(gff_file): """Parse Prokka GFF3 file into a DataFrame (skips sequence section).""" rows = [] with open(gff_file) as f: for line in f: if line.startswith("##FASTA"): break if line.startswith("#") or not line.strip(): continue parts = line.rstrip("\n").split("\t") if len(parts) < 9: continue attr_dict = {} for item in parts[8].split(";"): if "=" in item: k, v = item.split("=", 1) attr_dict[k] = v rows.append({ "seqname": parts[0], "source": parts[1], "feature": parts[2], "start": int(parts[3]), "end": int(parts[4]), "score": parts[5], "strand": parts[6], "frame": parts[7], "locus_tag": attr_dict.get("ID", ""), "gene": attr_dict.get("gene", ""), "product": attr_dict.get("product", ""), }) return pd.DataFrame(rows) gff_df = parse_gff("annotation/E_coli_K12.gff") print(f"GFF features: {len(gff_df)}") print(gff_df[gff_df["feature"] == "CDS"].head(5)[ ["seqname", "start", "end", "strand", "gene", "product"] ].to_string(index=False))
Expected Outputs
| Output File | Format | Description |
|---|---|---|
| GFF3 | Genome annotation with feature coordinates and attributes; includes FASTA sequence at the end |
| GenBank | Full GenBank-format annotation for use in Geneious, BioPython, and submission prep |
| FASTA | Predicted protein sequences for all CDS features |
| FASTA | Nucleotide sequences for all annotated features (CDS, rRNA, tRNA) |
| FASTA | Nucleotide FASTA of the complete assembly (contigs) |
| TSV | Tab-delimited summary table: locus_tag, ftype, length_bp, gene, EC_number, COG, product |
| Text | One-line summary counts: Contigs, Bases, CDS, rRNA, tRNA, tmRNA |
| TBL | Feature table format for tbl2asn GenBank submission |
| SQN | ASN.1 format for NCBI submission (generated by tbl2asn) |
| Text | Warnings and errors from tbl2asn validation step |
| Text | Full Prokka run log with timing per step |
Troubleshooting
| Problem | Cause | Solution |
|---|---|---|
| No tRNA detected on short or highly fragmented assembly | Add flag; check if assembly is too fragmented (N50 < 1 kb) |
| BLAST+ not on PATH | ; ensure prokka env is activated |
| Assembler produced long contig headers | Pre-process FASTA to shorten headers: |
| Very high hypothetical protein rate (>60%) | Divergent organism with few database matches | Add with closely related strain FAA; consider flag |
| Path to custom protein file is incorrect | Use absolute path; verify file exists with |
| Duplicate product qualifiers in annotation | Ignore if exporting for local use; for NCBI submission, use PGAP instead |
| Annotation is very slow (>30 min for ~5 Mb) | not set or set to 1; enabled | Set to available thread count; disable for faster runs |
| rRNA count is 0 for complete genome | flag was set, or barrnap threshold too strict | Remove ; check barrnap is installed with |
References
- Prokka GitHub: tseemann/prokka — source code, installation instructions, and detailed parameter reference
- Seemann T (2014) Bioinformatics 30(14):2068–2069 — original Prokka paper with benchmarks vs. RAST
- Prodigal documentation — gene prediction engine used by Prokka for CDS calling
- Barrnap GitHub: tseemann/barrnap — rRNA prediction tool bundled with Prokka
- BioPython GenBank parsing tutorial — reference for parsing Prokka GenBank output in Python