Claude-skill-registry bio-genome-intervals-gtf-gff-handling

Parse, query, and convert GTF and GFF3 annotation files. Extract gene, transcript, and exon coordinates using gffread, gtfparse, and gffutils. Use when extracting specific features from gene annotations or converting between annotation formats.

install
source · Clone the upstream repo
git clone https://github.com/majiayu000/claude-skill-registry
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/gtf-gff-handling" ~/.claude/skills/majiayu000-claude-skill-registry-bio-genome-intervals-gtf-gff-handling && rm -rf "$T"
manifest: skills/data/gtf-gff-handling/SKILL.md
source content

GTF/GFF Handling

GTF and GFF3 are standard gene annotation formats. Both use 1-based coordinates.

Format Comparison

FeatureGTFGFF3
Coordinate system1-based, inclusive1-based, inclusive
HierarchyImplicit (gene_id, transcript_id)Explicit (Parent attribute)
Attribute formatkey "value";key=value;
Comments##
Fasta sequencesNot standard##FASTA directive

GTF Format

chr1	HAVANA	gene	11869	14409	.	+	.	gene_id "ENSG00000223972"; gene_name "DDX11L1";
chr1	HAVANA	transcript	11869	14409	.	+	.	gene_id "ENSG00000223972"; transcript_id "ENST00000456328";
chr1	HAVANA	exon	11869	12227	.	+	.	gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; exon_number "1";

GFF3 Format

chr1	HAVANA	gene	11869	14409	.	+	.	ID=ENSG00000223972;Name=DDX11L1
chr1	HAVANA	mRNA	11869	14409	.	+	.	ID=ENST00000456328;Parent=ENSG00000223972
chr1	HAVANA	exon	11869	12227	.	+	.	ID=exon1;Parent=ENST00000456328

Parse GTF with gtfparse (Python)

Installation

pip install gtfparse

Basic Parsing

import gtfparse

# Load entire GTF
df = gtfparse.read_gtf('annotation.gtf')

# View columns
print(df.columns)
# ['seqname', 'source', 'feature', 'start', 'end', 'score', 'strand', 'frame',
#  'gene_id', 'transcript_id', 'gene_name', ...]

# Filter by feature type
genes = df[df['feature'] == 'gene']
transcripts = df[df['feature'] == 'transcript']
exons = df[df['feature'] == 'exon']

# Get specific gene
gene_df = df[df['gene_name'] == 'TP53']

Extract Gene Coordinates

import gtfparse

df = gtfparse.read_gtf('annotation.gtf')

# All genes
genes = df[df['feature'] == 'gene'][['seqname', 'start', 'end', 'strand', 'gene_id', 'gene_name']]

# Convert to BED format (0-based)
genes_bed = genes.copy()
genes_bed['start'] = genes_bed['start'] - 1  # GTF is 1-based, BED is 0-based
genes_bed = genes_bed[['seqname', 'start', 'end', 'gene_name', 'gene_id', 'strand']]
genes_bed.to_csv('genes.bed', sep='\t', header=False, index=False)

Get Exons for Gene

import gtfparse

df = gtfparse.read_gtf('annotation.gtf')

# Get all exons for TP53
tp53_exons = df[(df['gene_name'] == 'TP53') & (df['feature'] == 'exon')]
tp53_exons = tp53_exons[['seqname', 'start', 'end', 'transcript_id', 'exon_number']]
print(tp53_exons)

Parse GFF with gffutils (Python)

Installation

pip install gffutils

Create Database

import gffutils

# Create database (slow first time, fast for subsequent queries)
db = gffutils.create_db('annotation.gff3', 'annotation.db',
                         force=True, merge_strategy='create_unique')

# Or load existing database
db = gffutils.FeatureDB('annotation.db')

Query Features

import gffutils

db = gffutils.FeatureDB('annotation.db')

# Count features by type
for featuretype in db.featuretypes():
    count = db.count_features_of_type(featuretype)
    print(f'{featuretype}: {count}')

# Get all genes
for gene in db.features_of_type('gene'):
    print(f'{gene.id}: {gene.seqid}:{gene.start}-{gene.end}')

# Get gene by ID
gene = db['ENSG00000141510']  # TP53
print(f'{gene.attributes["Name"][0]}: {gene.seqid}:{gene.start}-{gene.end}')

# Get children (transcripts, exons)
for transcript in db.children(gene, featuretype='mRNA'):
    print(f'  Transcript: {transcript.id}')
    for exon in db.children(transcript, featuretype='exon'):
        print(f'    Exon: {exon.start}-{exon.end}')

Get Introns

import gffutils

db = gffutils.FeatureDB('annotation.db')

# Get introns for a transcript
transcript = db['ENST00000269305']
introns = list(db.interfeatures(db.children(transcript, featuretype='exon'),
                                 new_featuretype='intron'))
for intron in introns:
    print(f'Intron: {intron.start}-{intron.end}')

Convert Formats with gffread (CLI)

Installation

conda install -c bioconda gffread

GTF to GFF3

gffread annotation.gtf -o annotation.gff3

GFF3 to GTF

gffread annotation.gff3 -T -o annotation.gtf

Extract Sequences

# Extract transcript sequences
gffread -w transcripts.fa -g genome.fa annotation.gtf

# Extract CDS sequences
gffread -x cds.fa -g genome.fa annotation.gtf

# Extract protein sequences
gffread -y proteins.fa -g genome.fa annotation.gtf

Filter Features

# Keep only protein-coding genes
gffread annotation.gtf -C -o coding.gtf

# Keep specific gene types
gffread annotation.gtf --keep-genes=protein_coding -o coding.gtf

Extract Regions with bedtools

Get Promoters

# Extract TSS (transcript start sites)
awk '$3 == "transcript"' annotation.gtf | \
    awk -v OFS='\t' '{
        if ($7 == "+") print $1, $4-1, $4, ".", ".", $7;
        else print $1, $5-1, $5, ".", ".", $7;
    }' > tss.bed

# Get promoter regions (2kb upstream of TSS)
bedtools flank -i tss.bed -g genome.txt -l 2000 -r 0 -s > promoters.bed

Get Gene Bodies

# Extract gene coordinates to BED
awk '$3 == "gene"' annotation.gtf | \
    awk -v OFS='\t' '{
        split($0, a, "gene_id \""); split(a[2], b, "\"");
        print $1, $4-1, $5, b[1], ".", $7;
    }' > genes.bed

Get Exons

# Extract unique exons
awk '$3 == "exon"' annotation.gtf | \
    awk -v OFS='\t' '{print $1, $4-1, $5, ".", ".", $7}' | \
    sort -k1,1 -k2,2n | uniq > exons.bed

Python: GTF to BED Conversion

import gtfparse
import pandas as pd

def gtf_to_bed(gtf_path, feature_type='gene', output_path=None):
    '''Convert GTF features to BED format.'''
    df = gtfparse.read_gtf(gtf_path)
    features = df[df['feature'] == feature_type].copy()

    # Convert to 0-based coordinates
    bed = pd.DataFrame({
        'chrom': features['seqname'],
        'start': features['start'] - 1,
        'end': features['end'],
        'name': features.get('gene_name', features.get('gene_id', '.')),
        'score': 0,
        'strand': features['strand']
    })

    if output_path:
        bed.to_csv(output_path, sep='\t', header=False, index=False)
    return bed

# Usage
genes_bed = gtf_to_bed('annotation.gtf', 'gene', 'genes.bed')
exons_bed = gtf_to_bed('annotation.gtf', 'exon', 'exons.bed')

Validate GTF/GFF

# Check GTF format
gffread -E annotation.gtf

# Check GFF3 format
gffread -E annotation.gff3

# Detailed validation
gt gff3validator annotation.gff3  # requires genometools

Common Attributes

GTF Attributes

AttributeDescription
gene_idEnsembl gene ID
gene_nameGene symbol
gene_biotypeprotein_coding, lncRNA, etc.
transcript_idEnsembl transcript ID
transcript_nameTranscript symbol
exon_numberExon position in transcript
exon_idEnsembl exon ID

GFF3 Attributes

AttributeDescription
IDUnique feature identifier
NameDisplay name
ParentParent feature ID
DbxrefDatabase cross-references
gene_biotypeGene type

Memory-Efficient Processing

import gtfparse

# Process large files in chunks (gtfparse loads all into memory)
# For very large files, use gffutils database approach

# Or filter during parsing
df = gtfparse.read_gtf('annotation.gtf',
                        features=['gene', 'exon'])  # Only load specific features

Related Skills

  • bed-file-basics - BED format and conversion
  • interval-arithmetic - Gene/exon overlap analysis
  • proximity-operations - TSS proximity analysis
  • differential-expression - Gene coordinate mapping