BioSkills bio-genome-annotation-repeat-annotation
Identify and classify repetitive elements and transposable elements using RepeatModeler for de novo repeat library construction and RepeatMasker for genome-wide repeat annotation. Quantify TE expression from RNA-seq with TEtranscripts. Use when masking repeats before gene prediction or analyzing transposable element activity.
git clone https://github.com/GPTomics/bioSkills
T=$(mktemp -d) && git clone --depth=1 https://github.com/GPTomics/bioSkills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/genome-annotation/repeat-annotation" ~/.claude/skills/gptomics-bioskills-bio-genome-annotation-repeat-annotation && rm -rf "$T"
genome-annotation/repeat-annotation/SKILL.md
Version Compatibility
Reference examples tested with: DESeq2 1.42+, STAR 2.7.11+, matplotlib 3.8+, pandas 2.2+
Before using code patterns, verify installed versions match. If versions differ:
- Python: `pip show <package>` then `help(module.function)` to check signatures
- CLI: `<tool> --version` then `<tool> --help` to confirm flags
If code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.
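The Python-side check can also be scripted. The sketch below reads installed versions with the standard library's `importlib.metadata` and compares them with a minimum version. The helper names (`installed_version`, `version_tuple`, `meets_minimum`) are invented for this example, and the numeric comparison is a simplifying assumption that ignores pre-release suffixes.

```python
import re
from importlib import metadata


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def version_tuple(version):
    """Turn '2.2.1' into (2, 2, 1), keeping only the leading digits of each part."""
    numbers = []
    for piece in version.split('.'):
        match = re.match(r'\d+', piece)
        if match is None:
            break
        numbers.append(int(match.group()))
    return tuple(numbers)


def meets_minimum(package, minimum):
    """True if the package is installed at or above the minimum version."""
    found = installed_version(package)
    return found is not None and version_tuple(found) >= version_tuple(minimum)
```

For example, `meets_minimum('pandas', '2.2')` reports whether the installed pandas matches the version the examples were tested with.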
Repeat and Transposable Element Annotation
"Mask repeats in my genome assembly" → Build a de novo repeat library and annotate/softmask repetitive elements as a prerequisite for gene prediction.
- CLI: `RepeatModeler -database mydb` (library), `RepeatMasker -lib custom-lib.fa -xsmall assembly.fa` (masking)
Identify, classify, and mask repetitive elements using RepeatModeler (de novo library construction) and RepeatMasker (genome-wide annotation). Softmasked output is a prerequisite for eukaryotic gene prediction.
RepeatModeler (De Novo Library)
RepeatModeler builds a species-specific repeat library by detecting repetitive elements de novo from the assembly.
Build Database and Run

    # Build RepeatModeler database
    BuildDatabase -name my_genome -engine ncbi assembly.fasta

    # Run RepeatModeler (this takes hours to days depending on genome size)
    # -LTRStruct enables LTR structural detection (recommended)
    RepeatModeler -database my_genome -pa 16 -LTRStruct

Key Options
| Option | Description |
|---|---|
| `-database` | Database name from BuildDatabase |
| `-pa` | Parallel processes |
| `-LTRStruct` | Enable LTR structural detection pipeline |
| `-engine` | Search engine: ncbi (RMBLAST) or abblast |
Output

    my_genome-families.fa     # Consensus repeat library
    my_genome-families.stk    # Stockholm alignments
    RM_*/                     # Working directory with intermediate files

The output `*-families.fa` is the repeat library used by RepeatMasker.
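Before masking, it can help to see how many families the library holds per class. The sketch below assumes the `>name#Class/Family` header convention that RepeatModeler uses for its consensus sequences. The function name `count_library_classes` is hypothetical, and headers without a `#` are counted as `Unknown`.

```python
from collections import Counter


def count_library_classes(fasta_lines):
    """Count repeat families per top-level class from FASTA header lines."""
    counts = Counter()
    for line in fasta_lines:
        if not line.startswith('>'):
            continue
        tokens = line[1:].split()
        if not tokens:
            continue
        seq_id = tokens[0]  # e.g. rnd-1_family-0#LINE/L1
        if '#' in seq_id:
            classification = seq_id.split('#', 1)[1]
        else:
            classification = 'Unknown'
        # Keep only the part before the slash (LINE, LTR, DNA, ...)
        counts[classification.split('/', 1)[0]] += 1
    return counts
```

Usage: `with open('my_genome-families.fa') as handle: print(count_library_classes(handle))`. A large share of `Unknown` families suggests that many consensus sequences could not be classified.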
RepeatMasker (Genome-Wide Annotation)
With De Novo Library

    # Use species-specific de novo library (recommended)
    RepeatMasker \
        -lib my_genome-families.fa \
        -pa 16 \
        -xsmall \
        -gff \
        -dir repeatmasker_out \
        assembly.fasta

With Dfam/RepBase Library

    # Use Dfam curated library for a known species
    RepeatMasker \
        -species "Homo sapiens" \
        -pa 16 \
        -xsmall \
        -gff \
        -dir repeatmasker_out \
        assembly.fasta

Combined Library (De Novo + Known)

    # Combine de novo and curated libraries for best results
    cat my_genome-families.fa known_repeats.fa > combined_lib.fa

    RepeatMasker \
        -lib combined_lib.fa \
        -pa 16 \
        -xsmall \
        -gff \
        -dir repeatmasker_out \
        assembly.fasta

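Plain `cat` keeps every record, so an identifier present in both libraries ends up in the combined file twice. As an optional alternative, the sketch below merges libraries and keeps only the first record seen for each sequence ID. The function name `merge_libraries` is hypothetical, and treating a repeated ID as a duplicate record is an assumption; it does not detect different names for the same sequence.

```python
def merge_libraries(*libraries):
    """Yield FASTA lines from several libraries, skipping repeated sequence IDs.

    Each library is an iterable of lines (for example an open file handle).
    """
    seen = set()
    for library in libraries:
        keep = False  # ignore any sequence lines that precede the first header
        for line in library:
            line = line.rstrip('\n')
            if line.startswith('>'):
                tokens = line[1:].split()
                seq_id = tokens[0] if tokens else ''
                keep = seq_id not in seen
                seen.add(seq_id)
            if keep and line:
                yield line
```

Usage: open both library files, then write `'\n'.join(merge_libraries(de_novo, known))` to `combined_lib.fa`.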
Key Options
| Option | Description |
|---|---|
| `-lib` | Custom repeat library FASTA |
| `-species` | Species name (uses Dfam database) |
| `-pa` | Parallel processes |
| `-xsmall` | Softmask output (lowercase repeats, required for gene prediction) |
| `-gff` | Generate GFF output |
| `-dir` | Output directory |
| `-nolow` | Skip low-complexity masking |
| `-noint` | Skip interspersed repeats |
| `-engine` | Search engine: crossmatch, ncbi, hmmer, abblast |
| `-s` | Slow/sensitive search |
| `-q` | Quick search (5-10% less sensitive) |
Output Files

    repeatmasker_out/
    ├── assembly.fasta.masked     # Masked genome (softmasked with -xsmall; otherwise N's replace repeats)
    ├── assembly.fasta.out        # Detailed repeat annotation table
    ├── assembly.fasta.tbl        # Summary statistics table
    ├── assembly.fasta.out.gff    # GFF annotation of repeats
    └── assembly.fasta.cat.gz     # Search result alignments

Softmasking for Gene Prediction
The `-xsmall` flag produces softmasked output where repeats are lowercase. This is the required input format for BRAKER3 and most gene prediction tools.

    # RepeatMasker does not modify the input; it writes assembly.fasta.masked
    # next to it (or into the -dir directory when one is given).
    # Optional: keep a separately named copy of the original
    cp assembly.fasta assembly_original.fasta

    RepeatMasker -lib my_genome-families.fa -pa 16 -xsmall assembly.fasta

    # With -xsmall, assembly.fasta.masked is the softmasked output
    mv assembly.fasta.masked assembly_softmasked.fasta

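A quick sanity check on the softmasked file is to measure what fraction of bases is lowercase. The sketch below is a minimal version of that check. The function name `softmask_fraction` is hypothetical, and it counts every character on a sequence line, including `N`, as a base.

```python
def softmask_fraction(fasta_lines):
    """Return the fraction of sequence characters that are lowercase."""
    lowercase = 0
    total = 0
    for line in fasta_lines:
        if line.startswith('>'):
            continue  # skip FASTA headers
        sequence = line.strip()
        total += len(sequence)
        lowercase += sum(1 for base in sequence if base.islower())
    return lowercase / total if total else 0.0
```

Usage: `with open('assembly_softmasked.fasta') as handle: print(f'{softmask_fraction(handle):.1%}')`. A value near zero means the file is not softmasked.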
TEtranscripts (TE Expression)
Quantify transposable element expression from RNA-seq data using TEtranscripts, which works with DESeq2 for differential TE expression.

    # Requires STAR alignment with multi-mapping reads retained
    STAR --runThreadN 16 \
        --genomeDir star_index \
        --readFilesIn reads_R1.fq.gz reads_R2.fq.gz \
        --readFilesCommand zcat \
        --outSAMtype BAM SortedByCoordinate \
        --winAnchorMultimapNmax 100 \
        --outFilterMultimapNmax 100 \
        --outFileNamePrefix sample_

    # Run TEtranscripts for differential expression
    TEtranscripts \
        --treatment sample1.bam sample2.bam sample3.bam \
        --control ctrl1.bam ctrl2.bam ctrl3.bam \
        --GTF genes.gtf \
        --TE te_annotation.gtf \
        --mode multi \
        --sortByPos

Key TEtranscripts Options
| Option | Description |
|---|---|
| `--treatment` | Treatment BAM files |
| `--control` | Control BAM files |
| `--GTF` | Gene annotation GTF |
| `--TE` | TE annotation GTF (from RepeatMasker) |
| `--mode` | multi (recommended) or uniq |
| `--sortByPos` | Input sorted by position |
| `--stranded` | Strand-specific protocol (yes, no, reverse) |
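RepeatMasker does not emit the TE GTF directly, so its annotations need converting first. The sketch below turns parsed records into GTF lines. It rests on two assumptions to verify against the TEtranscripts documentation before use: the attribute layout (`gene_id`, `transcript_id`, `family_id`, `class_id`) is modeled on the TE GTF files distributed for TEtranscripts, and the `rmsk` source label and `_dupN` copy suffix are illustrative. The function names are hypothetical, and the input dicts use the same keys as `parse_repeatmasker_out` in the Python section of this document.

```python
from collections import defaultdict


def record_to_gtf(record, copy_index, source='rmsk'):
    """Format one repeat annotation as a GTF line (assumed attribute layout)."""
    # 'DNA/hAT-Charlie' -> class 'DNA', family 'hAT-Charlie'
    te_class, _, te_family = record['repeat_class'].partition('/')
    if not te_family:
        te_family = te_class
    name = record['repeat_name']
    attributes = (
        f'gene_id "{name}"; transcript_id "{name}_dup{copy_index}"; '
        f'family_id "{te_family}"; class_id "{te_class}";'
    )
    fields = [record['seqid'], source, 'exon', str(record['start']),
              str(record['end']), '.', record['strand'], '.', attributes]
    return '\t'.join(fields)


def records_to_gtf(records):
    """Convert records to GTF lines, numbering the copies of each repeat name."""
    copies = defaultdict(int)
    lines = []
    for record in records:
        copies[record['repeat_name']] += 1
        lines.append(record_to_gtf(record, copies[record['repeat_name']]))
    return lines
```

With a DataFrame from `parse_repeatmasker_out`, `records_to_gtf(rm_df.to_dict('records'))` yields one line per annotated copy. Simple repeats and low-complexity regions would normally be filtered out before the conversion.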
Python: Repeat Statistics
Goal: Parse RepeatMasker output to summarize repeat content by class and visualize the repeat divergence landscape.
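Approach: Read the RepeatMasker `.out` file into a DataFrame, group by repeat class to compute total bp and genome percentage, then plot a divergence histogram stratified by major TE classes (LINE, SINE, LTR, DNA). The `.out` percent divergence is uncorrected, so it only approximates the Kimura distance used in published repeat landscapes.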

    import pandas as pd


    def parse_repeatmasker_out(out_file):
        '''Parse RepeatMasker .out file into a DataFrame.'''
        records = []
        with open(out_file) as f:
            for i, line in enumerate(f):
                # The first three lines are the column header block
                if i < 3:
                    continue
                parts = line.split()
                if len(parts) < 15:
                    continue
                records.append({
                    'score': int(parts[0]),
                    'perc_div': float(parts[1]),
                    'perc_del': float(parts[2]),
                    'perc_ins': float(parts[3]),
                    'seqid': parts[4],
                    'start': int(parts[5]),
                    'end': int(parts[6]),
                    # RepeatMasker reports the complement strand as 'C'
                    'strand': '+' if parts[8] == '+' else '-',
                    'repeat_name': parts[9],
                    'repeat_class': parts[10],
                    'length': int(parts[6]) - int(parts[5]) + 1,
                })
        return pd.DataFrame(records)


    def repeat_summary(rm_df, genome_size):
        '''Summarize repeat content by class.'''
        class_summary = rm_df.groupby('repeat_class').agg(
            count=('repeat_name', 'count'),
            total_bp=('length', 'sum'),
        ).sort_values('total_bp', ascending=False)
        class_summary['pct_genome'] = class_summary['total_bp'] / genome_size * 100

        # Overlapping annotations are counted more than once in this sum
        total_masked = rm_df['length'].sum()
        print('=== Repeat Summary ===')
        print(f'Total masked: {total_masked:,} bp ({total_masked/genome_size:.1%} of genome)')
        print('\nBy class:')
        for cls, row in class_summary.iterrows():
            # iterrows() upcasts the row to float, so cast the counts back to int
            print(f'  {cls}: {int(row["count"]):,} elements, {int(row["total_bp"]):,} bp ({row["pct_genome"]:.1f}%)')
        return class_summary


    def repeat_landscape(rm_df, output_file='repeat_landscape.png'):
        '''Plot repeat divergence landscape from .out percent divergence.'''
        import matplotlib.pyplot as plt

        fig, ax = plt.subplots(figsize=(12, 6))
        major_classes = ['LINE', 'SINE', 'LTR', 'DNA']
        colors = {'LINE': '#1f77b4', 'SINE': '#ff7f0e', 'LTR': '#2ca02c', 'DNA': '#d62728'}

        for cls in major_classes:
            subset = rm_df[rm_df['repeat_class'].str.contains(cls, case=False, na=False)]
            if len(subset) > 0:
                ax.hist(subset['perc_div'], bins=50, range=(0, 50), alpha=0.6,
                        label=cls, color=colors.get(cls))

        # perc_div is the uncorrected divergence from the consensus, not a
        # Kimura-corrected distance
        ax.set_xlabel('Divergence from consensus (%)')
        ax.set_ylabel('Count')
        ax.set_title('Repeat Landscape')
        ax.legend()
        plt.savefig(output_file, dpi=150, bbox_inches='tight')
        plt.close()

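Summing the `length` column counts a base twice when two annotations overlap, so the "total masked" figure from `repeat_summary` can exceed the true masked fraction. The sketch below is one way to get a non-overlapping total by merging intervals per sequence. The function name `merged_bp` is hypothetical, and coordinates are assumed to be 1-based and inclusive, as in the `.out` file.

```python
def merged_bp(intervals):
    """Total bases covered by (seqid, start, end) intervals, counting overlaps once."""
    total = 0
    current_seq = None
    covered_end = 0  # last position already counted on the current sequence
    for seqid, start, end in sorted(intervals):
        if seqid != current_seq:
            current_seq = seqid
            covered_end = 0
        if end <= covered_end:
            continue  # fully inside a region that was already counted
        # Count only the part that extends past the covered region
        total += end - max(start, covered_end + 1) + 1
        covered_end = end
    return total
```

With the DataFrame from `parse_repeatmasker_out`, the call would look like `merged_bp(rm_df[['seqid', 'start', 'end']].itertuples(index=False, name=None))`. Dividing the result by the genome size gives a masked percentage that can be checked against the `.tbl` summary.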
Expected Repeat Content
| Organism | Repeat Content | Notes |
|---|---|---|
| Bacteria | 1-5% | Mostly IS elements |
| Yeast | 3-5% | Ty elements |
| Drosophila | 15-25% | LTR-rich |
| Zebrafish | 45-55% | DNA transposon-rich |
| Human | 45-50% | LINE/SINE-rich |
| Maize | 80-85% | LTR-rich |
Troubleshooting
RepeatModeler Runs Very Slowly
- Normal for large genomes (days for mammalian-size)
- Use `-pa` for parallelization
- Consider EDTA as alternative for plant genomes
Low Masking Percentage
- May indicate novel repeats not in database
- Always run RepeatModeler before RepeatMasker
- Combine de novo + known libraries
Gene Prediction Finds Too Many Genes After Masking
- Verify softmasking with: `grep -v '^>' assembly.fasta | tr -cd 'a-z' | wc -c` (a count of 0 means no bases are softmasked)
- Ensure you are using `-xsmall`, not the default hardmasking
Related Skills
- eukaryotic-gene-prediction - Requires softmasked genome from repeat annotation
- genome-assembly/assembly-qc - Assess assembly quality including repeat content
- differential-expression/deseq2-basics - Differential TE expression analysis