BioSkills bio-rna-quantification-featurecounts-counting
Count reads per gene from aligned BAM files using Subread featureCounts. Use when processing BAM files from STAR/HISAT2 to generate gene-level counts for DESeq2/edgeR.
git clone https://github.com/GPTomics/bioSkills
T=$(mktemp -d) && git clone --depth=1 https://github.com/GPTomics/bioSkills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/rna-quantification/featurecounts-counting" ~/.claude/skills/gptomics-bioskills-bio-rna-quantification-featurecounts-counting && rm -rf "$T"
rna-quantification/featurecounts-counting/SKILL.mdVersion Compatibility
Reference examples tested with: DESeq2 1.42+, HISAT2 2.2.1+, STAR 2.7.11+, Subread 2.0+, edgeR 4.0+, pandas 2.2+, scanpy 1.10+
Before using code patterns, verify installed versions match. If versions differ:
- Python:
thenpip show <package>
to check signatureshelp(module.function) - R:
thenpackageVersion('<pkg>')
to verify parameters?function_name - CLI:
then<tool> --version
to confirm flags<tool> --help
If code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.
featureCounts Counting
"Count reads per gene from my BAM files" → Assign aligned reads to genomic features using a GTF annotation to produce a gene-by-sample count matrix for DE analysis.
- CLI:
featureCounts -a genes.gtf -o counts.txt sample1.bam sample2.bam
Count reads mapping to genomic features (genes, exons) from BAM files.
Basic Usage
# Single sample featureCounts -a annotation.gtf -o counts.txt aligned.bam # Multiple samples (recommended - single matrix output) featureCounts -a annotation.gtf -o counts.txt sample1.bam sample2.bam sample3.bam # All BAMs in directory featureCounts -a annotation.gtf -o counts.txt *.bam
Paired-End Data
# Count fragments, not reads (required for paired-end) featureCounts -p --countReadPairs -a annotation.gtf -o counts.txt *.bam # Check proper pairs only featureCounts -p --countReadPairs -B -C -a annotation.gtf -o counts.txt *.bam
Flags:
- Input is paired-end-p
- Count fragments instead of reads--countReadPairs
- Only count properly paired reads-B
- Don't count chimeric fragments-C
Strand-Specific Libraries
# Unstranded (default) featureCounts -s 0 -a annotation.gtf -o counts.txt *.bam # Forward stranded (e.g., dUTP, NSR) featureCounts -s 1 -a annotation.gtf -o counts.txt *.bam # Reverse stranded (e.g., Illumina TruSeq, most common) featureCounts -s 2 -a annotation.gtf -o counts.txt *.bam
Determining strandedness: Use
infer_experiment.py from RSeQC or check library prep protocol.
Feature Types
# Count at gene level (default) featureCounts -t exon -g gene_id -a annotation.gtf -o counts.txt *.bam # Count at transcript level featureCounts -t exon -g transcript_id -a annotation.gtf -o counts.txt *.bam # Count CDS only featureCounts -t CDS -g gene_id -a annotation.gtf -o counts.txt *.bam
Flags:
- Feature type in GTF (default: exon)-t
- Meta-feature attribute (default: gene_id)-g
Multi-Mapping Reads
# Discard multi-mappers (default, recommended for DE) featureCounts -a annotation.gtf -o counts.txt *.bam # Count multi-mappers (fractional) featureCounts -M --fraction -a annotation.gtf -o counts.txt *.bam # Count multi-mappers (full count to each location) featureCounts -M -a annotation.gtf -o counts.txt *.bam
Overlapping Features
# Discard reads overlapping multiple features (default) featureCounts -a annotation.gtf -o counts.txt *.bam # Count reads overlapping multiple features featureCounts -O -a annotation.gtf -o counts.txt *.bam # Fractional count for overlaps featureCounts -O --fraction -a annotation.gtf -o counts.txt *.bam
Performance Options
# Use multiple threads featureCounts -T 8 -a annotation.gtf -o counts.txt *.bam # Use less memory (slower) featureCounts --largeBAM -a annotation.gtf -o counts.txt *.bam
Output Files
featureCounts produces two files:
- counts.txt - Main count matrix
Geneid Chr Start End Strand Length sample1.bam sample2.bam GENE1 chr1 100 500 + 400 1523 1891 GENE2 chr1 1000 2000 - 1000 892 756
- counts.txt.summary - Assignment statistics
Status sample1.bam sample2.bam Assigned 1523456 1678234 Unassigned_Unmapped 12345 11234 Unassigned_NoFeatures 234567 245678
Extract Count Matrix
# Remove first 6 columns (metadata) to get just counts cut -f1,7- counts.txt | tail -n +2 > count_matrix.txt
Python Processing
import pandas as pd counts = pd.read_csv('counts.txt', sep='\t', comment='#') count_matrix = counts.set_index('Geneid').iloc[:, 5:] # Skip metadata columns count_matrix.columns = [c.replace('.bam', '') for c in count_matrix.columns] count_matrix.to_csv('count_matrix.csv')
R Processing
counts <- read.table('counts.txt', header=TRUE, row.names=1, skip=1) count_matrix <- counts[, 6:ncol(counts)] # Skip metadata columns colnames(count_matrix) <- gsub('.bam', '', colnames(count_matrix))
Common Issues
Low assignment rate:
- Check strandedness setting (
)-s - Verify GTF matches reference genome version
- Check BAM alignment quality
Zero counts for known expressed genes:
- Ensure feature type matches GTF (
vs-t exon
)-t gene - Check gene_id attribute name in GTF
Related Skills
- alignment-files/sam-bam-basics - Input BAM file handling
- genome-intervals/gtf-gff-handling - GTF annotation files
- differential-expression/deseq2-basics - Downstream analysis with counts
- rna-quantification/count-matrix-qc - QC of count data