Claude-skill-registry bio-genome-intervals-gtf-gff-handling
Parse, query, and convert GTF and GFF3 annotation files. Extract gene, transcript, and exon coordinates using gffread, gtfparse, and gffutils. Use when extracting specific features from gene annotations or converting between annotation formats.
install
source · Clone the upstream repo
git clone https://github.com/majiayu000/claude-skill-registry
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/gtf-gff-handling" ~/.claude/skills/majiayu000-claude-skill-registry-bio-genome-intervals-gtf-gff-handling && rm -rf "$T"
manifest:
skills/data/gtf-gff-handling/SKILL.mdsource content
GTF/GFF Handling
GTF and GFF3 are standard gene annotation formats. Both use 1-based coordinates.
Format Comparison
| Feature | GTF | GFF3 |
|---|---|---|
| Coordinate system | 1-based, inclusive | 1-based, inclusive |
| Hierarchy | Implicit (gene_id, transcript_id) | Explicit (Parent attribute) |
| Attribute format | key "value"; | key=value; |
| Comments | # | # |
| Fasta sequences | Not standard | ##FASTA directive |
GTF Format
chr1 HAVANA gene 11869 14409 . + . gene_id "ENSG00000223972"; gene_name "DDX11L1"; chr1 HAVANA transcript 11869 14409 . + . gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; chr1 HAVANA exon 11869 12227 . + . gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; exon_number "1";
GFF3 Format
chr1 HAVANA gene 11869 14409 . + . ID=ENSG00000223972;Name=DDX11L1 chr1 HAVANA mRNA 11869 14409 . + . ID=ENST00000456328;Parent=ENSG00000223972 chr1 HAVANA exon 11869 12227 . + . ID=exon1;Parent=ENST00000456328
Parse GTF with gtfparse (Python)
Installation
pip install gtfparse
Basic Parsing
import gtfparse # Load entire GTF df = gtfparse.read_gtf('annotation.gtf') # View columns print(df.columns) # ['seqname', 'source', 'feature', 'start', 'end', 'score', 'strand', 'frame', # 'gene_id', 'transcript_id', 'gene_name', ...] # Filter by feature type genes = df[df['feature'] == 'gene'] transcripts = df[df['feature'] == 'transcript'] exons = df[df['feature'] == 'exon'] # Get specific gene gene_df = df[df['gene_name'] == 'TP53']
Extract Gene Coordinates
import gtfparse df = gtfparse.read_gtf('annotation.gtf') # All genes genes = df[df['feature'] == 'gene'][['seqname', 'start', 'end', 'strand', 'gene_id', 'gene_name']] # Convert to BED format (0-based) genes_bed = genes.copy() genes_bed['start'] = genes_bed['start'] - 1 # GTF is 1-based, BED is 0-based genes_bed = genes_bed[['seqname', 'start', 'end', 'gene_name', 'gene_id', 'strand']] genes_bed.to_csv('genes.bed', sep='\t', header=False, index=False)
Get Exons for Gene
import gtfparse df = gtfparse.read_gtf('annotation.gtf') # Get all exons for TP53 tp53_exons = df[(df['gene_name'] == 'TP53') & (df['feature'] == 'exon')] tp53_exons = tp53_exons[['seqname', 'start', 'end', 'transcript_id', 'exon_number']] print(tp53_exons)
Parse GFF with gffutils (Python)
Installation
pip install gffutils
Create Database
import gffutils # Create database (slow first time, fast for subsequent queries) db = gffutils.create_db('annotation.gff3', 'annotation.db', force=True, merge_strategy='create_unique') # Or load existing database db = gffutils.FeatureDB('annotation.db')
Query Features
import gffutils db = gffutils.FeatureDB('annotation.db') # Count features by type for featuretype in db.featuretypes(): count = db.count_features_of_type(featuretype) print(f'{featuretype}: {count}') # Get all genes for gene in db.features_of_type('gene'): print(f'{gene.id}: {gene.seqid}:{gene.start}-{gene.end}') # Get gene by ID gene = db['ENSG00000141510'] # TP53 print(f'{gene.attributes["Name"][0]}: {gene.seqid}:{gene.start}-{gene.end}') # Get children (transcripts, exons) for transcript in db.children(gene, featuretype='mRNA'): print(f' Transcript: {transcript.id}') for exon in db.children(transcript, featuretype='exon'): print(f' Exon: {exon.start}-{exon.end}')
Get Introns
import gffutils db = gffutils.FeatureDB('annotation.db') # Get introns for a transcript transcript = db['ENST00000269305'] introns = list(db.interfeatures(db.children(transcript, featuretype='exon'), new_featuretype='intron')) for intron in introns: print(f'Intron: {intron.start}-{intron.end}')
Convert Formats with gffread (CLI)
Installation
conda install -c bioconda gffread
GTF to GFF3
gffread annotation.gtf -o annotation.gff3
GFF3 to GTF
gffread annotation.gff3 -T -o annotation.gtf
Extract Sequences
# Extract transcript sequences gffread -w transcripts.fa -g genome.fa annotation.gtf # Extract CDS sequences gffread -x cds.fa -g genome.fa annotation.gtf # Extract protein sequences gffread -y proteins.fa -g genome.fa annotation.gtf
Filter Features
# Keep only protein-coding genes gffread annotation.gtf -C -o coding.gtf # Keep specific gene types gffread annotation.gtf --keep-genes=protein_coding -o coding.gtf
Extract Regions with bedtools
Get Promoters
# Extract TSS (transcript start sites) awk '$3 == "transcript"' annotation.gtf | \ awk -v OFS='\t' '{ if ($7 == "+") print $1, $4-1, $4, ".", ".", $7; else print $1, $5-1, $5, ".", ".", $7; }' > tss.bed # Get promoter regions (2kb upstream of TSS) bedtools flank -i tss.bed -g genome.txt -l 2000 -r 0 -s > promoters.bed
Get Gene Bodies
# Extract gene coordinates to BED awk '$3 == "gene"' annotation.gtf | \ awk -v OFS='\t' '{ split($0, a, "gene_id \""); split(a[2], b, "\""); print $1, $4-1, $5, b[1], ".", $7; }' > genes.bed
Get Exons
# Extract unique exons awk '$3 == "exon"' annotation.gtf | \ awk -v OFS='\t' '{print $1, $4-1, $5, ".", ".", $7}' | \ sort -k1,1 -k2,2n | uniq > exons.bed
Python: GTF to BED Conversion
import gtfparse import pandas as pd def gtf_to_bed(gtf_path, feature_type='gene', output_path=None): '''Convert GTF features to BED format.''' df = gtfparse.read_gtf(gtf_path) features = df[df['feature'] == feature_type].copy() # Convert to 0-based coordinates bed = pd.DataFrame({ 'chrom': features['seqname'], 'start': features['start'] - 1, 'end': features['end'], 'name': features.get('gene_name', features.get('gene_id', '.')), 'score': 0, 'strand': features['strand'] }) if output_path: bed.to_csv(output_path, sep='\t', header=False, index=False) return bed # Usage genes_bed = gtf_to_bed('annotation.gtf', 'gene', 'genes.bed') exons_bed = gtf_to_bed('annotation.gtf', 'exon', 'exons.bed')
Validate GTF/GFF
# Check GTF format gffread -E annotation.gtf # Check GFF3 format gffread -E annotation.gff3 # Detailed validation gt gff3validator annotation.gff3 # requires genometools
Common Attributes
GTF Attributes
| Attribute | Description |
|---|---|
| gene_id | Ensembl gene ID |
| gene_name | Gene symbol |
| gene_biotype | protein_coding, lncRNA, etc. |
| transcript_id | Ensembl transcript ID |
| transcript_name | Transcript symbol |
| exon_number | Exon position in transcript |
| exon_id | Ensembl exon ID |
GFF3 Attributes
| Attribute | Description |
|---|---|
| ID | Unique feature identifier |
| Name | Display name |
| Parent | Parent feature ID |
| Dbxref | Database cross-references |
| gene_biotype | Gene type |
Memory-Efficient Processing
import gtfparse # Process large files in chunks (gtfparse loads all into memory) # For very large files, use gffutils database approach # Or filter during parsing df = gtfparse.read_gtf('annotation.gtf', features=['gene', 'exon']) # Only load specific features
Related Skills
- bed-file-basics - BED format and conversion
- interval-arithmetic - Gene/exon overlap analysis
- proximity-operations - TSS proximity analysis
- differential-expression - Gene coordinate mapping