SciAgent-Skills salmon-rna-quantification

Ultra-fast RNA-seq transcript and gene-level quantification using quasi-mapping (no BAM required). Builds a k-mer index from a transcriptome FASTA, then quantifies reads in minutes. Outputs transcript-level TPM/count tables (quant.sf) with optional GC-bias and sequence-bias correction. Integrates directly with tximeta/tximport for DESeq2 or edgeR. Use STAR instead when a genome-aligned BAM is required for variant calling or visualization.

install
source · Clone the upstream repo
git clone https://github.com/jaechang-hits/SciAgent-Skills
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/jaechang-hits/SciAgent-Skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/genomics-bioinformatics/salmon-rna-quantification" ~/.claude/skills/jaechang-hits-sciagent-skills-salmon-rna-quantification && rm -rf "$T"
manifest: skills/genomics-bioinformatics/salmon-rna-quantification/SKILL.md
source content

Salmon — Fast RNA-seq Quantification

Overview

Salmon quantifies transcript abundance from RNA-seq reads using quasi-mapping — matching reads to a k-mer index of the transcriptome without full genome alignment. This makes Salmon 20–50× faster than alignment-based tools while producing accurate TPM and estimated count values. Salmon corrects for sequence-specific bias (

--seqBias
), GC-content bias (
--gcBias
), and fragment length distribution automatically. Output
quant.sf
files integrate directly with
tximeta
(R) or
pydeseq2
(Python) for differential expression analysis. For improved accuracy, decoy-aware indexing uses the full genome to identify spurious quasi-mappings.

When to Use

  • Performing fast RNA-seq quantification when you do not need a genome-aligned BAM file
  • Running large-scale RNA-seq studies where alignment speed is a bottleneck (Salmon is 20-50× faster than STAR + featureCounts)
  • Computing TPM and estimated counts from bulk RNA-seq for differential expression with DESeq2 or edgeR
  • Correcting for GC bias, fragment length, and sequence context bias with
    --gcBias --seqBias
  • Estimating transcript-level uncertainty via bootstrap resampling with
    --numBootstraps
  • Use STAR instead when you need a genome-aligned BAM for downstream tools (variant calling, deeptools, IGV visualization)
  • Use Kallisto as an alternative for similar speed; Salmon provides better bias correction and decoy-aware indexing

Prerequisites

  • Software: Salmon ≥ 1.10 (conda or pre-compiled binary)
  • Reference: transcriptome FASTA (cDNA sequences, e.g., GENCODE or Ensembl) + genome FASTA for decoy-aware indexing
  • Python packages:
    pandas
    for parsing output;
    pydeseq2
    for differential expression

Check before installing: The tool may already be available in the current environment (e.g., inside a

pixi
/
conda
env). Run
command -v salmon
first and skip the install commands below if it returns a path. When running inside a pixi project, invoke the tool via
pixi run salmon
rather than bare
salmon
.

# Install with conda (recommended)
conda install -c bioconda salmon

# Verify
salmon --version
# salmon 1.10.3

# Or download pre-compiled binary
wget https://github.com/COMBINE-lab/salmon/releases/download/v1.10.0/salmon-1.10.0_linux_x86_64.tar.gz
tar xzvf salmon-1.10.0_linux_x86_64.tar.gz
export PATH="$PWD/salmon-latest_linux_x86_64/bin:$PATH"

Quick Start

# 1. Build transcriptome index (~5 min)
salmon index -t transcriptome.fa -i salmon_index/ -p 8

# 2. Quantify paired-end reads (~2-5 min per sample)
salmon quant \
    -i salmon_index/ \
    -l A \
    -1 sample_R1.fastq.gz \
    -2 sample_R2.fastq.gz \
    -p 8 \
    --gcBias --validateMappings \
    -o results/sample1/

# Output: results/sample1/quant.sf
head results/sample1/quant.sf

Workflow

Step 1: Download Transcriptome Reference

Fetch a transcript FASTA from GENCODE or Ensembl (cDNA sequences only — not genome).

# Human transcriptome from GENCODE (recommended)
wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_47/gencode.v47.transcripts.fa.gz
gunzip gencode.v47.transcripts.fa.gz

# Count transcripts
grep -c "^>" gencode.v47.transcripts.fa
# ~252,000 transcripts

echo "Reference ready."
ls -lh gencode.v47.transcripts.fa

Step 2: Build Salmon Index

Index the transcriptome for quasi-mapping. Add genome decoys for improved accuracy.

# Standard index (fast, sufficient for most analyses)
salmon index \
    -t gencode.v47.transcripts.fa \
    -i salmon_index/ \
    -p 8
echo "Standard index complete."

# Decoy-aware index (recommended for accuracy — uses full genome as decoy)
# Step 1: create decoy list from genome chromosome names
grep "^>" GRCh38.primary_assembly.genome.fa | cut -d " " -f 1 | sed 's/>//' > decoys.txt

# Step 2: concatenate transcriptome + genome
cat gencode.v47.transcripts.fa GRCh38.primary_assembly.genome.fa > gentrome.fa

# Step 3: build decoy-aware index
salmon index \
    -t gentrome.fa \
    -d decoys.txt \
    -i salmon_decoy_index/ \
    -p 8
echo "Decoy-aware index complete."

Step 3: Quantify Single-End Reads

Run Salmon on single-end FASTQ files.

# Single-end quantification
salmon quant \
    -i salmon_index/ \
    -l A \
    -r sample1.fastq.gz \
    -p 8 \
    --seqBias \
    --validateMappings \
    -o results/sample1/

echo "Mapping rate: $(grep 'Mapping rate' results/sample1/logs/salmon_quant.log | tail -1)"
echo "Output: results/sample1/quant.sf"

Step 4: Quantify Paired-End Reads with Bias Correction

Run Salmon on paired-end FASTQ files with recommended bias correction flags.

# Paired-end with GC bias + sequence bias correction
salmon quant \
    -i salmon_decoy_index/ \
    -l A \
    -1 sample1_R1.fastq.gz \
    -2 sample1_R2.fastq.gz \
    -p 8 \
    --gcBias \
    --seqBias \
    --validateMappings \
    --numBootstraps 100 \
    -o results/sample1/

# quant.sf columns: Name, Length, EffectiveLength, TPM, NumReads
head results/sample1/quant.sf

Step 5: Load and Summarize Quantification Output

Parse

quant.sf
to build a gene-level count matrix for differential expression.

import pandas as pd
from pathlib import Path

# Load single-sample output
quant = pd.read_csv("results/sample1/quant.sf", sep="\t")
print(f"Transcripts quantified: {len(quant)}")
print(f"Total estimated reads: {quant['NumReads'].sum():.0f}")
print(f"Transcripts with TPM > 1: {(quant['TPM'] > 1).sum()}")
print(quant.sort_values("TPM", ascending=False).head())

# Build a multi-sample TPM matrix
samples = ["ctrl_1", "ctrl_2", "treat_1", "treat_2"]
tpm_matrix = pd.DataFrame({
    s: pd.read_csv(f"results/{s}/quant.sf", sep="\t").set_index("Name")["TPM"]
    for s in samples
})
print(f"\nTPM matrix: {tpm_matrix.shape}")
tpm_matrix.to_csv("tpm_matrix.tsv", sep="\t")

Step 6: Aggregate to Gene Level and Run DESeq2

Summarize transcript-level estimates to gene level and perform differential expression.

import pandas as pd
import re
from pathlib import Path
from pydeseq2.dds import DeseqDataSet
from pydeseq2.default_inference import DefaultInference
from pydeseq2.ds import DeseqStats

# Aggregate transcript counts to gene level using Ensembl gene IDs
# quant.sf Name format: "ENST00000456328.2|ENSG00000223972.6|..."
def extract_gene_id(transcript_id):
    parts = transcript_id.split("|")
    return parts[1].split(".")[0] if len(parts) > 1 else transcript_id

samples = ["ctrl_1", "ctrl_2", "treat_1", "treat_2"]
count_frames = []
for s in samples:
    df = pd.read_csv(f"results/{s}/quant.sf", sep="\t")
    df["gene_id"] = df["Name"].apply(extract_gene_id)
    gene_counts = df.groupby("gene_id")["NumReads"].sum().round().astype(int)
    count_frames.append(gene_counts.rename(s))

count_matrix = pd.DataFrame(count_frames).fillna(0).astype(int)
metadata = pd.DataFrame({
    "condition": ["control", "control", "treated", "treated"]
}, index=samples)

# Run DESeq2
dds = DeseqDataSet(counts=count_matrix, metadata=metadata,
                   design_factors="condition",
                   inference=DefaultInference(n_cpus=4))
dds.deseq2()

stat_res = DeseqStats(dds, contrast=["condition", "treated", "control"],
                      inference=DefaultInference())
stat_res.summary()
results = stat_res.results_df
print(f"DE genes (padj < 0.05): {(results['padj'] < 0.05).sum()}")
print(results[results['padj'] < 0.05].sort_values('log2FoldChange').head())

Key Parameters

ParameterDefaultRange/OptionsEffect
-l / --libType
required
A
(auto),
SF
,
SR
,
IU
,
IS
,
MS
,
MR
Library strandedness;
A
auto-detects from first reads
-p / --threads
1
1–64CPU threads; 8–16 is typical
--gcBias
offflagCorrect for GC-content bias in fragment selection; recommended for most samples
--seqBias
offflagCorrect for sequence-specific bias at read starts; recommended
--validateMappings
offflagUse selective alignment for improved accuracy; slight speed cost
--numBootstraps
0
0–200Bootstrap replicates for uncertainty estimation; enables Sleuth/Swish
--dumpCsvCounts
offflagDump raw counts to CSV alongside quant.sf
-d / --decoys
fileDecoy sequence list for decoy-aware indexing
--rangeFactorizationBins
4
1–8Bins for range-factorization model; increases accuracy at small speed cost
--skipQuant
offflagBuild index and exit; useful for cluster pipelines

Common Recipes

Recipe 1: Batch Quantify All Samples

#!/bin/bash
# Quantify all paired-end samples with recommended settings
INDEX="salmon_decoy_index"
DATA="data"
OUT="results"
THREADS=12

SAMPLES=(ctrl_1 ctrl_2 treat_1 treat_2)
mkdir -p "$OUT"

for sample in "${SAMPLES[@]}"; do
    echo "Quantifying: $sample"
    salmon quant \
        -i "$INDEX" \
        -l A \
        -1 "$DATA/${sample}_R1.fastq.gz" \
        -2 "$DATA/${sample}_R2.fastq.gz" \
        -p "$THREADS" \
        --gcBias --seqBias --validateMappings \
        -o "$OUT/$sample/"
    echo "Done: $sample — mapping $(grep 'Mapping rate' $OUT/$sample/logs/salmon_quant.log | tail -1)"
done
echo "All samples quantified."

Recipe 2: Add Salmon to a Snakemake Pipeline

# Snakefile — Salmon quantification rule
configfile: "config.yaml"

SAMPLES = config["samples"]

rule all:
    input:
        expand("results/{sample}/quant.sf", sample=SAMPLES)

rule salmon_index:
    input:
        transcriptome = config["transcriptome_fa"]
    output:
        directory("salmon_index")
    threads: 8
    shell:
        "salmon index -t {input.transcriptome} -i {output} -p {threads}"

rule salmon_quant:
    input:
        index = "salmon_index",
        r1 = "data/{sample}_R1.fastq.gz",
        r2 = "data/{sample}_R2.fastq.gz"
    output:
        quant = "results/{sample}/quant.sf"
    params:
        outdir = "results/{sample}"
    threads: 8
    shell:
        """
        salmon quant -i {input.index} -l A \
            -1 {input.r1} -2 {input.r2} \
            -p {threads} --gcBias --seqBias --validateMappings \
            -o {params.outdir}
        """

Recipe 3: Filter Low-Expression Transcripts and Compute Fold Changes

import pandas as pd
import numpy as np

samples = {
    "ctrl_1": "results/ctrl_1/quant.sf",
    "ctrl_2": "results/ctrl_2/quant.sf",
    "treat_1": "results/treat_1/quant.sf",
    "treat_2": "results/treat_2/quant.sf",
}

# Build TPM matrix
tpm = pd.DataFrame({
    name: pd.read_csv(path, sep="\t").set_index("Name")["TPM"]
    for name, path in samples.items()
})

# Filter: keep transcripts with TPM > 1 in at least 2 samples
expressed = (tpm > 1).sum(axis=1) >= 2
tpm_filt = tpm[expressed]
print(f"Expressed transcripts: {expressed.sum()} / {len(tpm)}")

# Simple log2 fold change (treat vs ctrl)
ctrl_mean = tpm_filt[["ctrl_1", "ctrl_2"]].mean(axis=1)
treat_mean = tpm_filt[["treat_1", "treat_2"]].mean(axis=1)
lfc = np.log2(treat_mean + 0.5) - np.log2(ctrl_mean + 0.5)
top_up = lfc.sort_values(ascending=False).head(10)
print("Top upregulated transcripts:")
print(top_up)

Expected Outputs

OutputFormatDescription
quant.sf
TSVTranscript-level quantification: Name, Length, EffectiveLength, TPM, NumReads
quant.genes.sf
TSVGene-level quantification (when
--geneMap
provided)
logs/salmon_quant.log
TextDetailed log with mapping rate, processed reads, elapsed time
aux_info/meta_info.json
JSONRun metadata: library type detected, mapping rate, num processed reads
aux_info/fld.gz
BinaryFragment length distribution (paired-end)
bootstrap/
BinaryBootstrap count distributions (when
--numBootstraps > 0
)

Troubleshooting

ProblemCauseSolution
Mapping rate < 50%Wrong transcriptome species or assembly mismatchVerify transcriptome FASTA matches sample organism; use genome-decoy index
Library type detection wrongAmbiguous or mixed-strand librarySpecify explicitly:
-l SF
(stranded fwd) or
-l SR
(stranded rev)
quant.sf
all zeros
Index built from wrong referenceRebuild index with correct transcriptome FASTA
Out of memory during indexingTranscriptome + genome concatenation too largeUse standard index without genome decoy; or increase available RAM
Many low-mapping transcriptsNo GC/seq bias correctionAdd
--gcBias --seqBias --validateMappings
; helps with low-complexity regions
meta_info.json
mapping rate < 30%
Reads are from a different molecule (e.g., rRNA contamination)Check FastQC overrepresented sequences; verify library preparation
Gene-level output missing
--geneMap
or
-g
not provided
Re-run with
-g gencode.v47.gtf
to get
quant.genes.sf
Bootstrap takes too longHigh
--numBootstraps
on slow disk
Reduce to
--numBootstraps 30
for most DE tests; use SSD

References