BioSkills bio-genome-annotation-ncrna-annotation

Identify non-coding RNAs including tRNAs, rRNAs, snoRNAs, and regulatory RNAs using Infernal covariance model searches against Rfam and tRNAscan-SE for tRNA prediction. Use when performing genome-wide ncRNA annotation with assembly input producing GFF output.

install
source · Clone the upstream repo
git clone https://github.com/GPTomics/bioSkills
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/GPTomics/bioSkills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/genome-annotation/ncrna-annotation" ~/.claude/skills/gptomics-bioskills-bio-genome-annotation-ncrna-annotation && rm -rf "$T"
manifest: genome-annotation/ncrna-annotation/SKILL.md
source content

Version Compatibility

Reference examples tested with: pandas 2.2+

Before using code patterns, verify installed versions match. If versions differ:

  • Python:
    pip show <package>
    then
    help(module.function)
    to check signatures
  • CLI:
    <tool> --version
    then
    <tool> --help
    to confirm flags

If code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.

Non-Coding RNA Annotation

"Find non-coding RNAs in my genome" → Scan a genome assembly for rRNAs, tRNAs, snoRNAs, and other ncRNA families using covariance model search and specialized detectors.

  • CLI:
    cmscan --rfam --tblout hits.tbl Rfam.cm assembly.fa
    (Infernal),
    tRNAscan-SE -o trnas.txt assembly.fa

Identify and annotate non-coding RNAs in genome assemblies using Infernal (general ncRNAs via Rfam covariance models) and tRNAscan-SE (specialized tRNA detection).

Infernal / cmscan

Infernal uses covariance models (CMs) from Rfam to identify ncRNA families by both sequence and secondary structure similarity.

Rfam Database Setup

# Download Rfam covariance models (~600 MB compressed)
wget https://ftp.ebi.ac.uk/pub/databases/Rfam/CURRENT/Rfam.cm.gz
gunzip Rfam.cm.gz

# Press the CM database (required for cmscan)
cmpress Rfam.cm

# Download clan information (for overlap resolution)
wget https://ftp.ebi.ac.uk/pub/databases/Rfam/CURRENT/Rfam.clanin

Basic cmscan

# Search genome against all Rfam families
# --cut_ga: Use Rfam gathering threshold (recommended, family-specific)
# --rfam: Run in Rfam mode (skip slow alignment for large models)
# --nohmmonly: Use CM scoring only (more accurate, slower)
# --fmt 2: Output in table format
cmscan \
    --cut_ga \
    --rfam \
    --nohmmonly \
    --tblout ncrna_hits.tbl \
    --fmt 2 \
    --cpu 16 \
    --clanin Rfam.clanin \
    Rfam.cm \
    genome.fasta > ncrna_hits.out

Key Options

OptionDescription
--cut_ga
Use Rfam gathering threshold (recommended)
--rfam
Speed optimization for Rfam-scale searches
--nohmmonly
Force CM-only scoring (more sensitive for structured RNAs)
--clanin
Rfam clan file for resolving overlapping hits
--tblout
Tabular output file
--fmt 2
Extended table format with accessions
--cpu
CPU threads
-E
E-value threshold (default: 10; --cut_ga is preferred)

Convert cmscan Output to GFF3

# Using Infernal's built-in conversion
grep -v '^#' ncrna_hits.tbl | \
awk 'BEGIN{OFS="\t"} {
    if ($10 == "+") strand = "+"; else strand = "-";
    if ($10 == "+") {start=$8; end=$9} else {start=$9; end=$8};
    print $4, "Infernal", "ncRNA", start, end, $17, strand, ".", "ID="$2";Name="$3";rfam_acc="$2";evalue="$17
}' > ncrna_infernal.gff3

E-value Guidelines

CategoryTypical E-valueNotes
High confidence< 1e-10Unambiguous family assignment
Moderate1e-10 to 1e-5Likely real, verify context
Rfam GA thresholdfamily-specificRecommended cutoff with --cut_ga
Marginal> 1e-3May be pseudogenes or degraded copies

tRNAscan-SE

Specialized tRNA detector using covariance models. More sensitive and specific for tRNAs than Infernal alone.

Basic Usage

# Bacterial/archaeal mode
tRNAscan-SE -B -o trna_results.txt --gff trna.gff3 genome.fasta

# Eukaryotic mode
tRNAscan-SE -E -o trna_results.txt --gff trna.gff3 genome.fasta

# General mode (auto-detect domain)
tRNAscan-SE -G -o trna_results.txt --gff trna.gff3 genome.fasta

Key Options

OptionDescription
-B
Bacterial mode
-A
Archaeal mode
-E
Eukaryotic mode
-G
General (mixed/unknown) mode
-o
Tabular output
--gff
GFF3 output
--detail
Detailed output with isotype info
--thread
CPU threads
-Q
Covariance model scoring only (slower, more accurate)

Domain-Specific Models

DomainFlagNotes
Bacteria
-B
Shorter introns, smaller genomes
Archaea
-A
Split tRNAs, bulge-helix-bulge introns
Eukaryota
-E
Longer introns, pseudogenes common
Organelle
-O
Mitochondrial/chloroplast tRNAs
Mitochondrial
-M
Mammalian mitochondrial-specific

Expected tRNA Counts

Organism TypeExpected tRNAsNotes
Bacteria30-90Fewer isotypes, some shared
Archaea30-60Similar to bacteria
Yeast275-400Many isodecoders
Nematode600-900Gene family expansion
Mammal400-600Many pseudogenes
Plant500-1,000+Large gene families

barrnap (rRNA Detection)

barrnap predicts rRNA genes using HMM models. Note: barrnap has been unmaintained since 2018. Bakta handles rRNA detection internally for prokaryotes.

# Detect rRNAs
barrnap --kingdom bac genome.fasta > rrna.gff3

# Kingdoms: bac (bacteria), arc (archaea), euk (eukaryote), mito (mitochondria)
barrnap --kingdom euk --threads 4 genome.fasta > rrna.gff3

Combining ncRNA Annotations

Goal: Merge Infernal and tRNAscan-SE results into a unified ncRNA annotation, using the best tool for each RNA type.

Approach: Parse Infernal tabular output and classify hits by Rfam family name into broad ncRNA categories, parse tRNAscan-SE GFF for tRNA calls, then combine by dropping Infernal tRNA hits (tRNAscan-SE is more accurate for tRNAs) and keeping Infernal for all other ncRNA types.

import pandas as pd
from collections import defaultdict

def parse_infernal_tbl(tbl_file):
    '''Parse Infernal cmscan tabular output.'''
    hits = []
    with open(tbl_file) as f:
        for line in f:
            if line.startswith('#'):
                continue
            parts = line.split()
            if len(parts) < 18:
                continue
            strand = '+' if parts[9] == '+' else '-'
            start, end = (int(parts[7]), int(parts[8])) if strand == '+' else (int(parts[8]), int(parts[7]))
            hits.append({
                'target': parts[0],
                'rfam_acc': parts[1],
                'rfam_name': parts[2],
                'seqid': parts[3],
                'start': start,
                'end': end,
                'strand': strand,
                'evalue': float(parts[16]),
                'score': float(parts[15]),
                'rna_type': classify_rfam(parts[2]),
            })
    return pd.DataFrame(hits)

def classify_rfam(rfam_name):
    '''Classify Rfam family into broad ncRNA categories.'''
    name_lower = rfam_name.lower()
    if 'rrna' in name_lower or 'ssu' in name_lower or 'lsu' in name_lower:
        return 'rRNA'
    if 'trna' in name_lower:
        return 'tRNA'
    if 'snorna' in name_lower or 'snord' in name_lower or 'snora' in name_lower:
        return 'snoRNA'
    if 'mir' in name_lower:
        return 'miRNA'
    if 'riboswitch' in name_lower or 'thermoregulator' in name_lower:
        return 'riboswitch'
    if 'ires' in name_lower or 'leader' in name_lower:
        return 'cis-reg'
    return 'other_ncRNA'

def parse_trnascan_gff(gff_file):
    '''Parse tRNAscan-SE GFF3 output.'''
    trnas = []
    with open(gff_file) as f:
        for line in f:
            if line.startswith('#'):
                continue
            parts = line.strip().split('\t')
            if len(parts) < 9 or parts[2] != 'tRNA':
                continue
            attrs = dict(item.split('=') for item in parts[8].split(';') if '=' in item)
            trnas.append({
                'seqid': parts[0],
                'start': int(parts[3]),
                'end': int(parts[4]),
                'strand': parts[6],
                'isotype': attrs.get('isotype', 'unknown'),
                'anticodon': attrs.get('anticodon', 'unknown'),
                'score': float(parts[5]) if parts[5] != '.' else 0,
                'rna_type': 'tRNA',
            })
    return pd.DataFrame(trnas)

def combine_ncrna_annotations(infernal_tbl, trnascan_gff, output_gff):
    '''Combine Infernal and tRNAscan-SE results, preferring tRNAscan for tRNAs.'''
    infernal_df = parse_infernal_tbl(infernal_tbl)
    trna_df = parse_trnascan_gff(trnascan_gff)

    # Remove Infernal tRNA hits (tRNAscan-SE is more accurate for tRNAs)
    infernal_no_trna = infernal_df[infernal_df['rna_type'] != 'tRNA']

    summary = defaultdict(int)
    for rna_type in infernal_no_trna['rna_type'].unique():
        summary[rna_type] = len(infernal_no_trna[infernal_no_trna['rna_type'] == rna_type])
    summary['tRNA'] = len(trna_df)

    print('=== ncRNA Summary ===')
    for rna_type, count in sorted(summary.items()):
        print(f'  {rna_type}: {count}')
    print(f'  Total: {sum(summary.values())}')

    return infernal_no_trna, trna_df, summary

Troubleshooting

cmscan Is Slow

  • Use
    --rfam
    flag for Rfam-mode optimizations
  • Use
    --FZ 2
    to further speed up (less sensitive)
  • Split genome into chunks and run in parallel

Missing Expected ncRNAs

  • Verify Rfam.cm is current version
  • Check that cmpress was run successfully (look for .i1m, .i1p, .i1f, .i1i files)
  • For mitochondrial/chloroplast ncRNAs, search organelle sequence separately

tRNAscan-SE Finds Too Many Pseudogenes

  • Normal for eukaryotic genomes (especially mammals)
  • Filter by score: high-confidence tRNAs typically score > 50 bits
  • Use
    --detail
    flag to identify pseudogenes in output

Related Skills

  • prokaryotic-annotation - Bakta includes ncRNA annotation
  • eukaryotic-gene-prediction - Gene prediction excludes ncRNAs
  • rna-structure/ncrna-search - Targeted ncRNA homology searches