SciAgent-Skills star-rna-seq-aligner
Splice-aware RNA-seq read aligner producing sorted BAM files and splice junction tables. Performs genome index generation and two-pass alignment for improved junction detection. Outputs coordinate-sorted BAM, splice junctions (SJ.out.tab), alignment statistics (Log.final.out), and optional gene count tables. Use Salmon instead for ultra-fast pseudoalignment; use STAR when you need a genome-aligned BAM for downstream tools (variant calling, visualization, ENCODE pipelines).
git clone https://github.com/jaechang-hits/SciAgent-Skills
T=$(mktemp -d) && git clone --depth=1 https://github.com/jaechang-hits/SciAgent-Skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/genomics-bioinformatics/star-rna-seq-aligner" ~/.claude/skills/jaechang-hits-sciagent-skills-star-rna-seq-aligner && rm -rf "$T"
skills/genomics-bioinformatics/star-rna-seq-aligner/SKILL.mdSTAR — Spliced RNA-seq Aligner
Overview
STAR (Spliced Transcripts Alignment to a Reference) aligns RNA-seq reads to a genome in a splice-aware manner, identifying novel and annotated splice junctions in a single pass. It generates coordinate-sorted BAM files compatible with samtools, IGV, deeptools, and GATK. STAR's 2-pass mode re-aligns reads using junctions discovered in the first pass, improving sensitivity for novel splice sites. With
--quantMode GeneCounts, STAR simultaneously produces gene-level read count tables without requiring a separate featureCounts or HTSeq step.
When to Use
- Aligning bulk RNA-seq reads to a reference genome when downstream tools require a BAM file (variant calling, visualization, deeptools)
- Running ENCODE-compliant RNA-seq pipelines that mandate genome alignment
- Discovering novel splice junctions and alternative splicing events in the dataset
- Generating gene count tables alongside BAM alignment in a single step with
--quantMode GeneCounts - Processing long reads or reads with high mismatch rates by tuning
--outFilterMismatchNmax - Use Salmon instead when you only need transcript/gene quantification and do not need a BAM file — Salmon is 20-50× faster
Prerequisites
- Software: STAR ≥ 2.7.0 (conda or compiled binary)
- Reference files: genome FASTA + GTF annotation (same assembly)
- RAM: 30–32 GB for human/mouse genome index; 8–16 GB for smaller genomes
- Disk: ~25 GB for human genome index, ~5–10 GB per sample BAM
Check before installing: The tool may already be available in the current environment (e.g., inside a
/pixienv). Runcondafirst and skip the install commands below if it returns a path. When running inside a pixi project, invoke the tool viacommand -v STARrather than barepixi run STAR.STAR
# Install with conda (recommended) conda install -c bioconda star # Verify STAR --version # STAR_2.7.11a # Or compile from source git clone https://github.com/alexdobin/STAR cd STAR/source && make STAR
Quick Start
# 1. Generate genome index (~30 min, run once) STAR --runMode genomeGenerate \ --runThreadN 8 \ --genomeDir genome/star_index \ --genomeFastaFiles genome/GRCh38.fa \ --sjdbGTFfile genome/gencode.v47.gtf \ --sjdbOverhang 100 # ReadLength - 1 # 2. Align paired-end reads (~10-20 min) STAR --runThreadN 8 \ --genomeDir genome/star_index \ --readFilesIn sample_R1.fastq.gz sample_R2.fastq.gz \ --readFilesCommand zcat \ --outSAMtype BAM SortedByCoordinate \ --outFileNamePrefix results/sample/ # 3. Index the BAM samtools index results/sample/Aligned.sortedByCoord.out.bam
Workflow
Step 1: Prepare Reference Files
Download a genome FASTA and matching GTF annotation (same assembly version).
# Download GRCh38 genome and GENCODE annotation wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_47/GRCh38.primary_assembly.genome.fa.gz wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_47/gencode.v47.primary_assembly.annotation.gtf.gz gunzip GRCh38.primary_assembly.genome.fa.gz gencode.v47.primary_assembly.annotation.gtf.gz mkdir -p genome/star_index echo "Genome and GTF ready." ls -lh GRCh38.primary_assembly.genome.fa gencode.v47.primary_assembly.annotation.gtf
Step 2: Generate Genome Index
Build the STAR genome index — required once per genome/read-length combination.
# Standard human genome index (requires ~32 GB RAM) STAR --runMode genomeGenerate \ --runThreadN 16 \ --genomeDir genome/star_index/ \ --genomeFastaFiles GRCh38.primary_assembly.genome.fa \ --sjdbGTFfile gencode.v47.primary_assembly.annotation.gtf \ --sjdbOverhang 100 # For small genomes (e.g., E. coli ~4.6 Mb), reduce genomeSAindexNbases # STAR --runMode genomeGenerate \ # --genomeSAindexNbases 11 \ # --genomeDir genome/ecoli_index/ ... echo "Index complete: $(ls genome/star_index/ | wc -l) files"
Step 3: Align RNA-seq Reads
Align single-end or paired-end FASTQ files to the indexed genome.
# Single-end alignment STAR --runThreadN 8 \ --genomeDir genome/star_index/ \ --readFilesIn sample1.fastq.gz \ --readFilesCommand zcat \ --outSAMtype BAM SortedByCoordinate \ --outSAMattributes NH HI AS NM MD \ --outFileNamePrefix results/sample1/ # Paired-end alignment STAR --runThreadN 8 \ --genomeDir genome/star_index/ \ --readFilesIn sample1_R1.fastq.gz sample1_R2.fastq.gz \ --readFilesCommand zcat \ --outSAMtype BAM SortedByCoordinate \ --outSAMattributes NH HI AS NM MD \ --outFileNamePrefix results/sample1/ echo "BAM: results/sample1/Aligned.sortedByCoord.out.bam"
Step 4: Run 2-Pass Alignment for Improved Sensitivity
Two-pass mode collects splice junctions from the first pass and uses them as annotation for the second pass.
# First pass — collect splice junctions STAR --runThreadN 8 \ --genomeDir genome/star_index/ \ --readFilesIn sample1_R1.fastq.gz sample1_R2.fastq.gz \ --readFilesCommand zcat \ --outSAMtype None \ --outFileNamePrefix pass1/sample1/ # Second pass — realign with all junctions from pass 1 SJ_FILES=$(ls pass1/*/SJ.out.tab | tr '\n' ' ') STAR --runThreadN 8 \ --genomeDir genome/star_index/ \ --readFilesIn sample1_R1.fastq.gz sample1_R2.fastq.gz \ --readFilesCommand zcat \ --sjdbFileChrStartEnd $SJ_FILES \ --outSAMtype BAM SortedByCoordinate \ --outFileNamePrefix results/sample1/ # Alternative: single-command 2-pass STAR --runThreadN 8 \ --genomeDir genome/star_index/ \ --readFilesIn sample1_R1.fastq.gz sample1_R2.fastq.gz \ --readFilesCommand zcat \ --twopassMode Basic \ --outSAMtype BAM SortedByCoordinate \ --outFileNamePrefix results/sample1/
Step 5: Check Alignment Statistics
Parse the alignment log to assess mapping rate and read quality.
# View the alignment summary cat results/sample1/Log.final.out # Parse key metrics with python python3 - << 'EOF' import re, sys from pathlib import Path log = Path("results/sample1/Log.final.out").read_text() metrics = {} for line in log.splitlines(): if "|" in line: key, _, val = line.partition("|") metrics[key.strip()] = val.strip() print(f"Unique mapping: {metrics.get('Uniquely mapped reads %', 'N/A')}") print(f"Multi-mapping: {metrics.get('% of reads mapped to multiple loci', 'N/A')}") print(f"Too many mismatches:{metrics.get('% of reads unmapped: too many mismatches', 'N/A')}") print(f"Total input reads: {metrics.get('Number of input reads', 'N/A')}") EOF
Step 6: Generate Gene Count Tables
Enable simultaneous gene counting during alignment using
--quantMode GeneCounts.
# Align and count simultaneously STAR --runThreadN 8 \ --genomeDir genome/star_index/ \ --readFilesIn sample1_R1.fastq.gz sample1_R2.fastq.gz \ --readFilesCommand zcat \ --outSAMtype BAM SortedByCoordinate \ --quantMode GeneCounts \ --outFileNamePrefix results/sample1/ # ReadsPerGene.out.tab has 4 columns: # gene_id unstranded stranded_fwd stranded_rev head results/sample1/ReadsPerGene.out.tab # Load into pandas (select column based on library strandedness) python3 - << 'EOF' import pandas as pd df = pd.read_csv("results/sample1/ReadsPerGene.out.tab", sep="\t", header=None, skiprows=4, names=["gene_id", "unstranded", "fwd", "rev"]) # For unstranded library: use column 2 (unstranded) counts = df.set_index("gene_id")["unstranded"] print(f"Genes with counts > 0: {(counts > 0).sum()}") print(counts[counts > 0].sort_values(ascending=False).head()) EOF
Key Parameters
| Parameter | Default | Range/Options | Effect |
|---|---|---|---|
| | 1–64 | CPU threads for alignment |
| | ReadLength-1 | Splice junction overhang; set to ReadLength-1 |
| | , | Output format and sort order |
| | 0–33 | Max mismatches per read; lower for stricter mapping |
| | 1–9999 | Max genomic loci per read; reads exceeding limit marked unmapped |
| | , | Enable gene counting or transcriptome BAM |
| | , | Enable 2-pass alignment for novel junction discovery |
| | 1–1e9 | Maximum intron length; reduce for bacterial genomes |
| | | Write unmapped reads to FASTQ |
| | 10–14 | SA index size; set log2(GenomeSize)/2 − 1 for small genomes |
Common Recipes
Recipe 1: Batch Align All Samples
#!/bin/bash # Align all paired-end samples in a directory SAMPLES=(ctrl_1 ctrl_2 treat_1 treat_2) INDEX="genome/star_index" DATA="data" OUT="results" THREADS=12 mkdir -p "$OUT" for sample in "${SAMPLES[@]}"; do echo "Aligning: $sample" mkdir -p "$OUT/$sample" STAR --runThreadN "$THREADS" \ --genomeDir "$INDEX" \ --readFilesIn "$DATA/${sample}_R1.fastq.gz" "$DATA/${sample}_R2.fastq.gz" \ --readFilesCommand zcat \ --outSAMtype BAM SortedByCoordinate \ --quantMode GeneCounts \ --twopassMode Basic \ --outFileNamePrefix "$OUT/$sample/" samtools index "$OUT/$sample/Aligned.sortedByCoord.out.bam" echo "Done: $sample — $(grep 'Uniquely mapped reads %' $OUT/$sample/Log.final.out | awk '{print $NF}')" done
Recipe 2: Build Gene Count Matrix Across Samples
import pandas as pd from pathlib import Path results_dir = Path("results") samples = ["ctrl_1", "ctrl_2", "treat_1", "treat_2"] strandedness = "unstranded" # or "fwd" / "rev" col_map = {"unstranded": 1, "fwd": 2, "rev": 3} col = col_map[strandedness] counts = {} for sample in samples: count_file = results_dir / sample / "ReadsPerGene.out.tab" df = pd.read_csv(count_file, sep="\t", header=None, skiprows=4) counts[sample] = df.set_index(0)[col] matrix = pd.DataFrame(counts) matrix = matrix[matrix.sum(axis=1) > 0] # drop zero-count genes matrix.to_csv("gene_count_matrix.tsv", sep="\t") print(f"Count matrix: {matrix.shape} (genes × samples)") print(matrix.head())
Recipe 3: Integrate with DESeq2 via pydeseq2
import pandas as pd from pydeseq2.dds import DeseqDataSet from pydeseq2.default_inference import DefaultInference # Load count matrix from STAR output counts = pd.read_csv("gene_count_matrix.tsv", sep="\t", index_col=0).T metadata = pd.DataFrame({ "condition": ["control", "control", "treated", "treated"] }, index=counts.index) # Run DESeq2 dds = DeseqDataSet(counts=counts, metadata=metadata, design_factors="condition", inference=DefaultInference(n_cpus=4)) dds.deseq2() print("DESeq2 complete — see dds.varm['LFC'] for results")
Expected Outputs
| Output | Format | Description |
|---|---|---|
| BAM | Coordinate-sorted aligned reads; index with |
| TSV | Splice junction table with coverage, motif, and novelty flags |
| Text | Alignment statistics: unique mapping %, multimappers %, etc. |
| TSV | Gene counts (4 columns: unstranded/fwd/rev) when |
| FASTQ | Unmapped reads (when ) |
| Text | Verbose run log; check for warnings and parameter echoes |
Troubleshooting
| Problem | Cause | Solution |
|---|---|---|
| Unique mapping < 60% | Wrong genome assembly or species contamination | Verify genome FASTA matches sample species; run FastQC to check overrepresented sequences |
| Wrong path or incomplete index | Re-run ; check contains , , files |
| Out of memory during genome generation | Not enough RAM for genome SA index | Add (or lower) for small genomes; request ≥32 GB RAM for human |
files not decompressed | Missing | Add for gzip-compressed inputs |
| R1/R2 read count mismatch | Verify FASTQ files with `zcat file.fastq.gz |
missing | not set | Re-run with or use featureCounts on BAM |
| Very high multimapping (>20%) | Highly repetitive genome or wrong | Reduce ; use to output only one alignment per read |
| Genome index takes too long | Large genome + slow disk | Use SSD storage; pre-built indices available from ENCODE and Ensembl |
References
- STAR GitHub repository — source code, releases, and STAR manual PDF
- Dobin A et al. (2013) "STAR: ultrafast universal RNA-seq aligner" — Bioinformatics 29(1):15-21. DOI:10.1093/bioinformatics/bts635
- ENCODE RNA-seq alignment standards — ENCODE project guidelines using STAR
- GENCODE genome annotations — recommended GTF source for human/mouse STAR indices