BioSkills bio-rna-quantification-alignment-free-quant
Quantify transcript expression directly from FASTQ reads with Salmon (selective alignment) or kallisto (pseudo-alignment). Use when transcript-level abundances are needed without a genome alignment step.
git clone https://github.com/GPTomics/bioSkills
T=$(mktemp -d) && git clone --depth=1 https://github.com/GPTomics/bioSkills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/rna-quantification/alignment-free-quant" ~/.claude/skills/gptomics-bioskills-bio-rna-quantification-alignment-free-quant && rm -rf "$T"
rna-quantification/alignment-free-quant/SKILL.md
Version Compatibility
Reference examples tested with: Salmon 1.10+, fastp 0.23+, kallisto 0.50+, pandas 2.2+
Before using code patterns, verify installed versions match. If versions differ:
- CLI: `<tool> --version` then `<tool> --help` to confirm flags
If code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.
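The version check described above can be scripted. The following is a minimal, stdlib-only sketch: the helper names (`parse_version`, `meets_minimum`, `installed_version`) and the `MINIMUMS` table are hypothetical, the minimums simply mirror the versions listed above, and it assumes each tool prints a dotted version number somewhere in its `--version` output.

```python
import re
import shutil
import subprocess

# Minimum versions copied from the compatibility note above (hypothetical table).
MINIMUMS = {"salmon": (1, 10), "kallisto": (0, 50), "fastp": (0, 23)}


def parse_version(text):
    """Return the first dotted version number found in text as a tuple of ints."""
    match = re.search(r"(\d+(?:\.\d+)+)", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return tuple(int(part) for part in match.group(1).split("."))


def meets_minimum(version, minimum):
    """Tuple comparison: (1, 10, 1) satisfies a minimum of (1, 10)."""
    return version >= minimum


def installed_version(tool):
    """Run `<tool> --version`; return the parsed version, or None if not on PATH."""
    if shutil.which(tool) is None:
        return None
    result = subprocess.run([tool, "--version"], capture_output=True, text=True)
    # Assumption: some tools print the version on stderr, so search both streams.
    return parse_version(result.stdout + result.stderr)
```

A typical use would be to loop over `MINIMUMS`, call `installed_version(tool)` for each entry, and report any tool that is missing or fails `meets_minimum`.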
Alignment-Free Quantification
"Quantify gene expression without alignment" → Estimate transcript abundances directly from FASTQ reads using pseudo-alignment or selective alignment, bypassing genome mapping.
- CLI: `salmon quant -i index -l A -1 R1.fq.gz -2 R2.fq.gz -o quant/`, `kallisto quant -i index -o output R1.fq.gz R2.fq.gz`
Quantify transcript abundance directly from FASTQ reads using pseudo-alignment (kallisto) or selective alignment (Salmon).
Salmon Workflow
Build Index
# Download transcriptome FASTA
# Ensembl: Homo_sapiens.GRCh38.cdna.all.fa.gz

# Basic index (fast, less accurate)
salmon index -t transcripts.fa -i salmon_index

# Decoy-aware index (recommended for accuracy)
# First, create decoys from genome
grep "^>" genome.fa | cut -d " " -f 1 | sed 's/>//g' > decoys.txt
cat transcripts.fa genome.fa > gentrome.fa
salmon index -t gentrome.fa -d decoys.txt -i salmon_index -p 8
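The decoy list is just the sequence names from the genome FASTA headers. As a rough Python equivalent of the `grep | cut | sed` pipeline, the sketch below extracts those names. The function names are hypothetical, and unlike the `sed` call it strips only the leading `>` of each header. It also accepts a gzip-compressed genome, which is an added convenience rather than part of the original commands.

```python
import gzip
from pathlib import Path


def open_text(path):
    """Open a plain or gzip-compressed text file for reading."""
    path = Path(path)
    if path.suffix == ".gz":
        return gzip.open(path, "rt")
    return open(path, "r")


def decoy_names(genome_fasta):
    """Yield each sequence name: the header up to the first space, without '>'."""
    with open_text(genome_fasta) as handle:
        for line in handle:
            if line.startswith(">"):
                yield line[1:].rstrip("\n").split(" ")[0]


def write_decoys(genome_fasta, out_path="decoys.txt"):
    """Write one decoy name per line and return how many were written."""
    count = 0
    with open(out_path, "w") as out:
        for name in decoy_names(genome_fasta):
            out.write(name + "\n")
            count += 1
    return count
```

Calling `write_decoys("genome.fa")` would produce a `decoys.txt` suitable for the `-d` option shown above.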
Quantify Samples
# Paired-end reads
salmon quant -i salmon_index -l A \
    -1 sample_R1.fastq.gz -2 sample_R2.fastq.gz \
    -o sample_quant -p 8

# Single-end reads
salmon quant -i salmon_index -l A \
    -r sample.fastq.gz \
    -o sample_quant -p 8
Key flags:
- `-l A` - Automatically detect library type
- `-p` - Number of threads
- `--validateMappings` - More accurate (default in recent versions)
- `--gcBias` - Correct for GC bias
- `--seqBias` - Correct for sequence-specific bias
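When Salmon is driven from a script, it can help to assemble the argument list in one place. The sketch below builds a `salmon quant` command from the flags listed above. The function name and its parameters are hypothetical, and it only constructs the list; running it is left to the caller.

```python
def salmon_quant_command(index, out_dir, reads, mates=None, lib_type="A",
                         threads=8, gc_bias=False, seq_bias=False):
    """Build a `salmon quant` argument list for single-end or paired-end reads."""
    cmd = ["salmon", "quant", "-i", index, "-l", lib_type]
    if mates is None:
        cmd += ["-r", reads]               # single-end input
    else:
        cmd += ["-1", reads, "-2", mates]  # paired-end input
    cmd += ["-o", out_dir, "-p", str(threads)]
    if gc_bias:
        cmd.append("--gcBias")
    if seq_bias:
        cmd.append("--seqBias")
    return cmd
```

The returned list can be passed to `subprocess.run(cmd, check=True)`. Keeping it as a list rather than a shell string avoids quoting problems with unusual file names.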
Library Types
| Code | Description |
|---|---|
| `A` | Automatic detection (recommended) |
| `ISR` | Inward, stranded, read 1 from reverse |
| `ISF` | Inward, stranded, read 1 from forward |
| `IU` | Inward, unstranded |
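Salmon's library type codes are composed letter by letter: an optional orientation (`I`, `O`, `M`), then `U` for unstranded or `S` for stranded followed by `F` or `R`. The sketch below decodes a code into the wording used in the table. The function name is hypothetical, and it covers only the code shapes described here.

```python
ORIENTATION = {"I": "inward", "O": "outward", "M": "matching"}
STRAND = {"F": "read 1 from forward", "R": "read 1 from reverse"}


def describe_library_type(code):
    """Translate a library type code such as 'ISR' into a readable description."""
    code = code.upper()
    if not code:
        raise ValueError("empty library type code")
    if code == "A":
        return "automatic detection"
    parts = []
    rest = code
    if rest[0] in ORIENTATION:  # paired-end codes start with an orientation letter
        parts.append(ORIENTATION[rest[0]])
        rest = rest[1:]
    if rest == "U":
        parts.append("unstranded")
    elif len(rest) == 2 and rest[0] == "S" and rest[1] in STRAND:
        parts.append("stranded")
        parts.append(STRAND[rest[1]])
    else:
        raise ValueError(f"unrecognized library type code: {code!r}")
    return ", ".join(parts)
```

For example, `describe_library_type("ISR")` gives `"inward, stranded, read 1 from reverse"`. This is useful for making the `expected_format` value in `lib_format_counts.json` readable.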
Batch Processing
for sample in sample1 sample2 sample3; do
    salmon quant -i salmon_index -l A \
        -1 "${sample}_R1.fastq.gz" -2 "${sample}_R2.fastq.gz" \
        -o "${sample}_quant" -p 8
done
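Hard-coding sample names works for a few files. For larger batches, the names can be discovered from the FASTQ files themselves. The sketch below assumes the `_R1.fastq.gz` / `_R2.fastq.gz` naming used in the loop above; the function name and the suffix parameters are hypothetical.

```python
from pathlib import Path


def find_paired_samples(fastq_dir, r1_suffix="_R1.fastq.gz",
                        r2_suffix="_R2.fastq.gz"):
    """Map sample name -> (R1 path, R2 path) for every complete pair in fastq_dir."""
    pairs = {}
    for r1 in sorted(Path(fastq_dir).glob(f"*{r1_suffix}")):
        sample = r1.name[: -len(r1_suffix)]
        r2 = r1.with_name(sample + r2_suffix)
        if r2.exists():  # skip samples whose mate file is missing
            pairs[sample] = (r1, r2)
    return pairs
```

Samples with a missing mate are skipped silently in this sketch. A production script would more likely report them.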
Output Files
sample_quant/
├── quant.sf                 # Main quantification file
├── aux_info/                # Auxiliary information
├── cmd_info.json            # Command used
├── lib_format_counts.json   # Library format detection
└── logs/                    # Log files
quant.sf format:
Name                Length  EffectiveLength  TPM        NumReads
ENST00000456328.2   1657    1477.000         0.000000   0.000
ENST00000450305.2   632     452.000          12.345678  156.789
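The `TPM`, `NumReads`, and `EffectiveLength` columns are related by the standard TPM definition: each transcript's read count is divided by its effective length, and the resulting rates are scaled to sum to one million. The sketch below reads a tab-separated `quant.sf` and recomputes TPM that way as a sanity check. The function names are hypothetical, and small differences from Salmon's reported values are to be expected from rounding.

```python
import csv


def read_quant_sf(handle):
    """Parse quant.sf rows (tab-separated, with header) into dicts with numeric fields."""
    rows = []
    for row in csv.DictReader(handle, delimiter="\t"):
        rows.append({
            "Name": row["Name"],
            "Length": int(row["Length"]),
            "EffectiveLength": float(row["EffectiveLength"]),
            "TPM": float(row["TPM"]),
            "NumReads": float(row["NumReads"]),
        })
    return rows


def tpm_from_counts(rows):
    """Recompute TPM as 1e6 * (reads / effective length) / sum of all such rates."""
    rates = [row["NumReads"] / row["EffectiveLength"] for row in rows]
    total = sum(rates)
    if total == 0:
        return [0.0] * len(rows)
    return [1e6 * rate / total for rate in rates]
```

With `open("sample_quant/quant.sf") as fh`, calling `tpm_from_counts(read_quant_sf(fh))` returns one value per transcript, in file order.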
kallisto Workflow
Build Index
kallisto index -i kallisto_index transcripts.fa
Quantify Samples
# Paired-end
kallisto quant -i kallisto_index -o sample_quant \
    sample_R1.fastq.gz sample_R2.fastq.gz

# Single-end (must specify fragment length)
kallisto quant -i kallisto_index -o sample_quant \
    --single -l 200 -s 20 sample.fastq.gz

# With bootstraps (for sleuth)
kallisto quant -i kallisto_index -o sample_quant -b 100 \
    sample_R1.fastq.gz sample_R2.fastq.gz
Key flags:
- `-b` - Number of bootstrap samples
- `-t` - Number of threads
- `--single` - Single-end mode
- `-l` - Estimated fragment length (single-end)
- `-s` - Fragment length standard deviation
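Because single-end mode fails without `-l` and `-s`, a small wrapper can enforce that rule before kallisto is launched. The sketch below is one way to do it. The function name and parameters are hypothetical, and as a simplifying assumption it accepts only one file (single-end) or two files (one pair).

```python
def kallisto_quant_command(index, out_dir, fastqs, threads=1, bootstraps=0,
                           fragment_length=None, fragment_sd=None):
    """Build a `kallisto quant` argument list, checking single-end requirements."""
    fastqs = list(fastqs)
    cmd = ["kallisto", "quant", "-i", index, "-o", out_dir, "-t", str(threads)]
    if bootstraps:
        cmd += ["-b", str(bootstraps)]
    if len(fastqs) == 1:
        # Single-end reads carry no fragment length information of their own.
        if fragment_length is None or fragment_sd is None:
            raise ValueError("single-end mode needs fragment_length (-l) "
                             "and fragment_sd (-s)")
        cmd += ["--single", "-l", str(fragment_length), "-s", str(fragment_sd)]
    elif len(fastqs) != 2:
        raise ValueError("expected one FASTQ (single-end) or two (paired-end)")
    return cmd + fastqs
```

The resulting list can be handed to `subprocess.run(cmd, check=True)`.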
Output Files
sample_quant/
├── abundance.tsv   # Main quantification (text)
├── abundance.h5    # HDF5 format (for sleuth)
└── run_info.json   # Run information
abundance.tsv format:
target_id           length  eff_length  est_counts  tpm
ENST00000456328.2   1657    1477.00     0.00        0.000000
ENST00000450305.2   632     452.00      156.79      12.345678
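kallisto's columns carry the same information as Salmon's `quant.sf` under different names and in a different order. When downstream code expects Salmon-style names, a small reader can rename them on the way in. The sketch below does that with the standard library. The mapping table and function name are hypothetical, and it assumes the tab-separated header shown above.

```python
import csv

# kallisto column name -> Salmon quant.sf column name
KALLISTO_TO_SALMON = {
    "target_id": "Name",
    "length": "Length",
    "eff_length": "EffectiveLength",
    "est_counts": "NumReads",
    "tpm": "TPM",
}


def read_abundance_tsv(handle):
    """Read kallisto abundance.tsv and return rows keyed by Salmon-style names."""
    rows = []
    for row in csv.DictReader(handle, delimiter="\t"):
        renamed = {}
        for key, value in row.items():
            new_key = KALLISTO_TO_SALMON[key]
            # Keep the transcript ID as text; convert everything else to float.
            renamed[new_key] = value if new_key == "Name" else float(value)
        rows.append(renamed)
    return rows
```

After this step, code written against `Name`, `TPM`, and `NumReads` can process either tool's output.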
Salmon vs kallisto
| Feature | Salmon | kallisto |
|---|---|---|
| Speed | Fast | Fastest |
| Accuracy | Higher | Good |
| GC bias correction | Yes | No |
| Decoy sequences | Yes | No |
| Memory usage | Moderate | Low |
Recommendation: Use Salmon for production, kallisto for quick exploratory analysis.
Combining Results
# Salmon: use tximport in R
# kallisto: use tximport or sleuth

# Quick Python combination
python << 'EOF'
import pandas as pd
from pathlib import Path

samples = ['sample1', 'sample2', 'sample3']
tpm_data = {}
counts_data = {}

for sample in samples:
    quant_file = Path(f'{sample}_quant/quant.sf')  # Salmon
    # quant_file = Path(f'{sample}_quant/abundance.tsv')  # kallisto
    df = pd.read_csv(quant_file, sep='\t', index_col=0)
    tpm_data[sample] = df['TPM']  # column is 'tpm' for kallisto
    counts_data[sample] = df['NumReads']  # or est_counts for kallisto

# One row per transcript, one column per sample
tpm_matrix = pd.DataFrame(tpm_data)
counts_matrix = pd.DataFrame(counts_data)
tpm_matrix.to_csv('tpm_matrix.csv')
counts_matrix.to_csv('counts_matrix.csv')
EOF
Quality Checks
# Check mapping rate from Salmon logs
grep "Mapping rate" sample_quant/logs/salmon_quant.log

# Check library type detection
cat sample_quant/lib_format_counts.json
Good metrics:
- Mapping rate > 70%
- Consistent library type across samples
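Both checks can be applied across many samples at once. The sketch below flags samples whose mapping rate is not above the 70% guideline and reports mixed library types. It rests on two assumptions: the log contains a line of the form `Mapping rate = NN.N%`, and `lib_format_counts.json` has an `expected_format` field. The function names are hypothetical.

```python
import json
import re

MAPPING_RATE = re.compile(r"Mapping rate\s*=\s*([0-9.]+)%")


def mapping_rate(log_text):
    """Return the last mapping rate (percent) reported in a Salmon log, or None."""
    matches = MAPPING_RATE.findall(log_text)
    return float(matches[-1]) if matches else None


def library_type(lib_format_json):
    """Return the expected_format field from lib_format_counts.json content."""
    return json.loads(lib_format_json)["expected_format"]


def qc_warnings(samples, min_rate=70.0):
    """samples maps name -> (salmon_quant.log text, lib_format_counts.json text)."""
    warnings = []
    types = {}
    for name, (log_text, lib_json) in samples.items():
        rate = mapping_rate(log_text)
        if rate is None:
            warnings.append(f"{name}: no mapping rate found in log")
        elif rate <= min_rate:
            warnings.append(
                f"{name}: mapping rate {rate:.1f}% is not above {min_rate:.0f}%")
        types[name] = library_type(lib_json)
    if len(set(types.values())) > 1:
        warnings.append(f"inconsistent library types: {types}")
    return warnings
```

An empty list means every sample passed both checks. The file contents would be read from `sample_quant/logs/salmon_quant.log` and `sample_quant/lib_format_counts.json` for each sample.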
Common Issues
Low mapping rate:
- Wrong transcriptome version
- Contamination in samples
- Wrong library type
Inconsistent library types:
- Mixed library preparations
- Sample swap
Related Skills
- read-qc/fastp-workflow - Upstream preprocessing
- rna-quantification/tximport-workflow - Import results to R
- rna-quantification/count-matrix-qc - QC of quantification
- differential-expression/deseq2-basics - Downstream analysis