BioSkills bio-genome-annotation-ncrna-annotation

Identify non-coding RNAs including tRNAs, rRNAs, snoRNAs, and regulatory RNAs using Infernal covariance model searches against Rfam and tRNAscan-SE for tRNA prediction. Use when performing genome-wide ncRNA annotation with assembly input producing GFF output.

install

source · Clone the upstream repo

git clone https://github.com/GPTomics/bioSkills

Claude Code · Install into ~/.claude/skills/

T=$(mktemp -d) && git clone --depth=1 https://github.com/GPTomics/bioSkills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/genome-annotation/ncrna-annotation" ~/.claude/skills/gptomics-bioskills-bio-genome-annotation-ncrna-annotation && rm -rf "$T"

manifest: genome-annotation/ncrna-annotation/SKILL.md

source content

Version Compatibility

Reference examples tested with: pandas 2.2+

Before using code patterns, verify installed versions match. If versions differ:

Python:
```
pip show <package>
```
then
```
help(module.function)
```
to check signatures
CLI:
```
<tool> --version
```
then
```
<tool> --help
```
to confirm flags

If code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.

Non-Coding RNA Annotation

"Find non-coding RNAs in my genome" → Scan a genome assembly for rRNAs, tRNAs, snoRNAs, and other ncRNA families using covariance model search and specialized detectors.

CLI:

cmscan --rfam --tblout hits.tbl Rfam.cm assembly.fa

(Infernal),

tRNAscan-SE -o trnas.txt assembly.fa

Identify and annotate non-coding RNAs in genome assemblies using Infernal (general ncRNAs via Rfam covariance models) and tRNAscan-SE (specialized tRNA detection).

Infernal / cmscan

Infernal uses covariance models (CMs) from Rfam to identify ncRNA families by both sequence and secondary structure similarity.

Rfam Database Setup

# Download Rfam covariance models (~600 MB compressed)
wget https://ftp.ebi.ac.uk/pub/databases/Rfam/CURRENT/Rfam.cm.gz
gunzip Rfam.cm.gz

# Press the CM database (required for cmscan)
cmpress Rfam.cm

# Download clan information (for overlap resolution)
wget https://ftp.ebi.ac.uk/pub/databases/Rfam/CURRENT/Rfam.clanin

Basic cmscan

# Search genome against all Rfam families
# --cut_ga: Use Rfam gathering threshold (recommended, family-specific)
# --rfam: Run in Rfam mode (skip slow alignment for large models)
# --nohmmonly: Use CM scoring only (more accurate, slower)
# --fmt 2: Output in table format
cmscan \
    --cut_ga \
    --rfam \
    --nohmmonly \
    --tblout ncrna_hits.tbl \
    --fmt 2 \
    --cpu 16 \
    --clanin Rfam.clanin \
    Rfam.cm \
    genome.fasta > ncrna_hits.out

Key Options

Option	Description
`--cut_ga`	Use Rfam gathering threshold (recommended)
`--rfam`	Speed optimization for Rfam-scale searches
`--nohmmonly`	Force CM-only scoring (more sensitive for structured RNAs)
`--clanin`	Rfam clan file for resolving overlapping hits
`--tblout`	Tabular output file
`--fmt 2`	Extended table format with accessions
`--cpu`	CPU threads
`-E`	E-value threshold (default: 10; --cut_ga is preferred)

Convert cmscan Output to GFF3

# Using Infernal's built-in conversion
grep -v '^#' ncrna_hits.tbl | \
awk 'BEGIN{OFS="\t"} {
    if ($10 == "+") strand = "+"; else strand = "-";
    if ($10 == "+") {start=$8; end=$9} else {start=$9; end=$8};
    print $4, "Infernal", "ncRNA", start, end, $17, strand, ".", "ID="$2";Name="$3";rfam_acc="$2";evalue="$17
}' > ncrna_infernal.gff3

E-value Guidelines

Category	Typical E-value	Notes
High confidence	< 1e-10	Unambiguous family assignment
Moderate	1e-10 to 1e-5	Likely real, verify context
Rfam GA threshold	family-specific	Recommended cutoff with --cut_ga
Marginal	> 1e-3	May be pseudogenes or degraded copies

tRNAscan-SE

Specialized tRNA detector using covariance models. More sensitive and specific for tRNAs than Infernal alone.

Basic Usage

# Bacterial/archaeal mode
tRNAscan-SE -B -o trna_results.txt --gff trna.gff3 genome.fasta

# Eukaryotic mode
tRNAscan-SE -E -o trna_results.txt --gff trna.gff3 genome.fasta

# General mode (auto-detect domain)
tRNAscan-SE -G -o trna_results.txt --gff trna.gff3 genome.fasta

Key Options

Option	Description
`-B`	Bacterial mode
`-A`	Archaeal mode
`-E`	Eukaryotic mode
`-G`	General (mixed/unknown) mode
`-o`	Tabular output
`--gff`	GFF3 output
`--detail`	Detailed output with isotype info
`--thread`	CPU threads
`-Q`	Covariance model scoring only (slower, more accurate)

Domain-Specific Models

Domain	Flag	Notes
Bacteria	`-B`	Shorter introns, smaller genomes
Archaea	`-A`	Split tRNAs, bulge-helix-bulge introns
Eukaryota	`-E`	Longer introns, pseudogenes common
Organelle	`-O`	Mitochondrial/chloroplast tRNAs
Mitochondrial	`-M`	Mammalian mitochondrial-specific

Expected tRNA Counts

Organism Type	Expected tRNAs	Notes
Bacteria	30-90	Fewer isotypes, some shared
Archaea	30-60	Similar to bacteria
Yeast	275-400	Many isodecoders
Nematode	600-900	Gene family expansion
Mammal	400-600	Many pseudogenes
Plant	500-1,000+	Large gene families

barrnap (rRNA Detection)

barrnap predicts rRNA genes using HMM models. Note: barrnap has been unmaintained since 2018. Bakta handles rRNA detection internally for prokaryotes.

# Detect rRNAs
barrnap --kingdom bac genome.fasta > rrna.gff3

# Kingdoms: bac (bacteria), arc (archaea), euk (eukaryote), mito (mitochondria)
barrnap --kingdom euk --threads 4 genome.fasta > rrna.gff3

Combining ncRNA Annotations

Goal: Merge Infernal and tRNAscan-SE results into a unified ncRNA annotation, using the best tool for each RNA type.

Approach: Parse Infernal tabular output and classify hits by Rfam family name into broad ncRNA categories, parse tRNAscan-SE GFF for tRNA calls, then combine by dropping Infernal tRNA hits (tRNAscan-SE is more accurate for tRNAs) and keeping Infernal for all other ncRNA types.

import pandas as pd
from collections import defaultdict

def parse_infernal_tbl(tbl_file):
    '''Parse Infernal cmscan tabular output.'''
    hits = []
    with open(tbl_file) as f:
        for line in f:
            if line.startswith('#'):
                continue
            parts = line.split()
            if len(parts) < 18:
                continue
            strand = '+' if parts[9] == '+' else '-'
            start, end = (int(parts[7]), int(parts[8])) if strand == '+' else (int(parts[8]), int(parts[7]))
            hits.append({
                'target': parts[0],
                'rfam_acc': parts[1],
                'rfam_name': parts[2],
                'seqid': parts[3],
                'start': start,
                'end': end,
                'strand': strand,
                'evalue': float(parts[16]),
                'score': float(parts[15]),
                'rna_type': classify_rfam(parts[2]),
            })
    return pd.DataFrame(hits)

def classify_rfam(rfam_name):
    '''Classify Rfam family into broad ncRNA categories.'''
    name_lower = rfam_name.lower()
    if 'rrna' in name_lower or 'ssu' in name_lower or 'lsu' in name_lower:
        return 'rRNA'
    if 'trna' in name_lower:
        return 'tRNA'
    if 'snorna' in name_lower or 'snord' in name_lower or 'snora' in name_lower:
        return 'snoRNA'
    if 'mir' in name_lower:
        return 'miRNA'
    if 'riboswitch' in name_lower or 'thermoregulator' in name_lower:
        return 'riboswitch'
    if 'ires' in name_lower or 'leader' in name_lower:
        return 'cis-reg'
    return 'other_ncRNA'

def parse_trnascan_gff(gff_file):
    '''Parse tRNAscan-SE GFF3 output.'''
    trnas = []
    with open(gff_file) as f:
        for line in f:
            if line.startswith('#'):
                continue
            parts = line.strip().split('\t')
            if len(parts) < 9 or parts[2] != 'tRNA':
                continue
            attrs = dict(item.split('=') for item in parts[8].split(';') if '=' in item)
            trnas.append({
                'seqid': parts[0],
                'start': int(parts[3]),
                'end': int(parts[4]),
                'strand': parts[6],
                'isotype': attrs.get('isotype', 'unknown'),
                'anticodon': attrs.get('anticodon', 'unknown'),
                'score': float(parts[5]) if parts[5] != '.' else 0,
                'rna_type': 'tRNA',
            })
    return pd.DataFrame(trnas)

def combine_ncrna_annotations(infernal_tbl, trnascan_gff, output_gff):
    '''Combine Infernal and tRNAscan-SE results, preferring tRNAscan for tRNAs.'''
    infernal_df = parse_infernal_tbl(infernal_tbl)
    trna_df = parse_trnascan_gff(trnascan_gff)

    # Remove Infernal tRNA hits (tRNAscan-SE is more accurate for tRNAs)
    infernal_no_trna = infernal_df[infernal_df['rna_type'] != 'tRNA']

    summary = defaultdict(int)
    for rna_type in infernal_no_trna['rna_type'].unique():
        summary[rna_type] = len(infernal_no_trna[infernal_no_trna['rna_type'] == rna_type])
    summary['tRNA'] = len(trna_df)

    print('=== ncRNA Summary ===')
    for rna_type, count in sorted(summary.items()):
        print(f'  {rna_type}: {count}')
    print(f'  Total: {sum(summary.values())}')

    return infernal_no_trna, trna_df, summary

Troubleshooting

cmscan Is Slow

Use
```
--rfam
```
flag for Rfam-mode optimizations
Use
```
--FZ 2
```
to further speed up (less sensitive)
Split genome into chunks and run in parallel

Missing Expected ncRNAs

Verify Rfam.cm is current version
Check that cmpress was run successfully (look for .i1m, .i1p, .i1f, .i1i files)
For mitochondrial/chloroplast ncRNAs, search organelle sequence separately

tRNAscan-SE Finds Too Many Pseudogenes

Normal for eukaryotic genomes (especially mammals)
Filter by score: high-confidence tRNAs typically score > 50 bits
Use
```
--detail
```
flag to identify pseudogenes in output

Related Skills

prokaryotic-annotation - Bakta includes ncRNA annotation
eukaryotic-gene-prediction - Gene prediction excludes ncRNAs
rna-structure/ncrna-search - Targeted ncRNA homology searches