LLMs-Universal-Life-Science-and-Clinical-Skills- rnaseq-qc

<!--

install
source · Clone the upstream repo
git clone https://github.com/mdbabumiamssm/LLMs-Universal-Life-Science-and-Clinical-Skills-
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/mdbabumiamssm/LLMs-Universal-Life-Science-and-Clinical-Skills- "$T" && mkdir -p ~/.claude/skills && cp -r "$T/Skills/NGS_QC/read-qc/rnaseq-qc" ~/.claude/skills/mdbabumiamssm-llms-universal-life-science-and-clinical-skills-rnaseq-qc && rm -rf "$T"
manifest: Skills/NGS_QC/read-qc/rnaseq-qc/SKILL.md
source content
<!-- # COPYRIGHT NOTICE # This file is part of the "Universal Biomedical Skills" project. # Copyright (c) 2026 MD BABU MIA, PhD <md.babu.mia@mssm.edu> # All Rights Reserved. # # This code is proprietary and confidential. # Unauthorized copying of this file, via any medium is strictly prohibited. # # Provenance: Authenticated by MD BABU MIA -->

name: bio-rnaseq-qc description: RNA-seq specific quality control including rRNA contamination detection, strandedness verification, gene body coverage, and transcript integrity metrics. Use when validating RNA-seq libraries before differential expression analysis. tool_type: mixed primary_tool: RSeQC measurable_outcome: Execute skill workflow successfully with valid output within 15 minutes. allowed-tools:

  • read_file
  • run_shell_command

RNA-seq Quality Control

RNA-seq specific QC metrics beyond general read quality.

rRNA Contamination Detection

High rRNA content indicates failed rRNA depletion or polyA selection.

SortMeRNA

sortmerna \
    --ref rRNA_databases/smr_v4.3_default_db.fasta \
    --reads sample.fastq.gz \
    --aligned rRNA_reads \
    --other non_rRNA_reads \
    --fastx \
    --threads 8

rrna_count=$(grep -c "^@" rRNA_reads.fastq 2>/dev/null || echo 0)
total_count=$(zcat sample.fastq.gz | grep -c "^@")
rrna_pct=$(echo "scale=2; $rrna_count / $total_count * 100" | bc)
echo "rRNA: ${rrna_pct}%"

BLAST Against rRNA

seqkit sample -n 10000 sample.fastq.gz | seqkit fq2fa > sample_10k.fasta
blastn -query sample_10k.fasta -db rrna_db -outfmt 6 -evalue 1e-10 -max_target_seqs 1 | wc -l

Expected rRNA Levels

Library TypeExpected rRNA
PolyA selected< 5%
rRNA depleted< 10%
Total RNA50-80%

Strandedness Verification

RSeQC infer_experiment

infer_experiment.py -i aligned.bam -r genes.bed

Output Interpretation

Fraction of reads explained by "1++,1--,2+-,2-+": 0.9856  # Forward stranded
Fraction of reads explained by "1+-,1-+,2++,2--": 0.0144  # Reverse (should be low)

Strand Inference

Tool Setting1++,1--,2+-,2-+1+-,1-+,2++,2--
Forward (dUTP)~0~1
Reverse (Illumina)~1~0
Unstranded~0.5~0.5

Salmon Strandedness

salmon quant -i index -l A -r sample.fastq.gz -o quant/
grep "library_types" quant/lib_format_counts.json

Gene Body Coverage

Check for 3' or 5' bias indicating RNA degradation.

RSeQC geneBody_coverage

geneBody_coverage.py \
    -i aligned.bam \
    -r housekeeping_genes.bed \
    -o coverage

Interpretation

PatternIndicates
Even coverageGood quality
3' biasDegradation or polyA artifacts
5' biasIncomplete reverse transcription
Steep dropSevere degradation

Read Distribution

RSeQC read_distribution

read_distribution.py -i aligned.bam -r genes.bed > distribution.txt

Expected Distribution

RegionGood Library
CDS_Exons60-80%
UTRs10-20%
Introns5-20%
Intergenic< 10%

Transcript Integrity Number (TIN)

Measure of RNA degradation per transcript.

RSeQC tin

tin.py -i aligned.bam -r genes.bed > tin_scores.txt

TIN Interpretation

TIN ScoreQuality
> 70Good
50-70Moderate
< 50Poor

Duplication Rate

Picard MarkDuplicates

java -jar picard.jar MarkDuplicates \
    I=aligned.bam \
    O=marked.bam \
    M=dup_metrics.txt \
    REMOVE_DUPLICATES=false

grep -A 1 "LIBRARY" dup_metrics.txt | tail -1 | cut -f9

RNA-seq Expected Duplication

LibraryExpected
High complexity< 20%
Low input20-50%
Concerning> 50%

Insert Size (Paired-End)

Picard CollectInsertSizeMetrics

java -jar picard.jar CollectInsertSizeMetrics \
    I=aligned.bam \
    O=insert_metrics.txt \
    H=insert_histogram.pdf

Saturation Analysis

Subsampling Analysis

for frac in 0.1 0.25 0.5 0.75 1.0; do
    samtools view -bs $frac aligned.bam > sub_${frac}.bam
    featureCounts -a genes.gtf -o counts_${frac}.txt sub_${frac}.bam
    detected=$(awk '$7 > 0' counts_${frac}.txt | wc -l)
    echo "$frac: $detected genes"
done

Picard CollectRnaSeqMetrics

Comprehensive RNA-seq metrics from Picard.

java -jar picard.jar CollectRnaSeqMetrics \
    I=aligned.bam \
    O=rnaseq_metrics.txt \
    REF_FLAT=refFlat.txt \
    STRAND=SECOND_READ_TRANSCRIPTION_STRAND \
    RIBOSOMAL_INTERVALS=rRNA.interval_list

Key Metrics

MetricDescription
PCT_CODING_BASES% in coding regions
PCT_UTR_BASES% in UTRs
PCT_INTRONIC_BASES% in introns
PCT_INTERGENIC_BASES% intergenic
PCT_RIBOSOMAL_BASES% rRNA
MEDIAN_5PRIME_TO_3PRIME_BIAS3' bias

MultiQC Report

Aggregate all QC metrics.

multiqc fastqc/ star_output/ featurecounts/ -o multiqc_report/

Complete RNA-seq QC Pipeline

#!/bin/bash
SAMPLE=$1
BAM=$2
GENES_BED=$3
REF_FLAT=$4

echo "=== RNA-seq QC: $SAMPLE ===" > qc_report.txt

echo -e "\n--- Strandedness ---" >> qc_report.txt
infer_experiment.py -i $BAM -r $GENES_BED >> qc_report.txt

echo -e "\n--- Read Distribution ---" >> qc_report.txt
read_distribution.py -i $BAM -r $GENES_BED >> qc_report.txt

echo -e "\n--- Gene Body Coverage ---" >> qc_report.txt
geneBody_coverage.py -i $BAM -r $GENES_BED -o coverage

echo -e "\n--- TIN Scores ---" >> qc_report.txt
tin.py -i $BAM -r $GENES_BED > tin.txt
awk '{sum+=$3; count++} END {print "Mean TIN:", sum/count}' tin.txt >> qc_report.txt

echo -e "\n--- Duplication ---" >> qc_report.txt
java -jar picard.jar MarkDuplicates I=$BAM O=/dev/null M=dup.txt 2>/dev/null
grep -A 1 "LIBRARY" dup.txt | tail -1 | awk '{print "Duplication rate:", $9}' >> qc_report.txt

echo -e "\n--- RNA-seq Metrics ---" >> qc_report.txt
java -jar picard.jar CollectRnaSeqMetrics I=$BAM O=rnaseq.txt REF_FLAT=$REF_FLAT STRAND=SECOND_READ_TRANSCRIPTION_STRAND 2>/dev/null
grep -A 2 "## METRICS CLASS" rnaseq.txt >> qc_report.txt

cat qc_report.txt

Python QC Summary

import pysam
import numpy as np
from collections import Counter

def rnaseq_qc(bam_file, sample_size=100000):
    bam = pysam.AlignmentFile(bam_file, 'rb')

    strand_counts = Counter()
    insert_sizes = []

    for i, read in enumerate(bam.fetch()):
        if i >= sample_size:
            break
        if not read.is_unmapped:
            if read.is_read1:
                if read.is_reverse:
                    strand_counts['1-'] += 1
                else:
                    strand_counts['1+'] += 1
            if read.is_proper_pair and read.template_length > 0:
                insert_sizes.append(read.template_length)

    bam.close()

    total = sum(strand_counts.values())
    print(f'Read 1 forward: {strand_counts["1+"]/total:.2%}')
    print(f'Read 1 reverse: {strand_counts["1-"]/total:.2%}')

    if insert_sizes:
        print(f'Median insert: {np.median(insert_sizes):.0f}')

rnaseq_qc('aligned.bam')

QC Thresholds Summary

MetricGoodWarningFail
Mapping rate> 85%70-85%< 70%
rRNA %< 10%10-20%> 20%
Exonic %> 60%40-60%< 40%
Duplication< 20%20-40%> 40%
Mean TIN> 7050-70< 50
3' bias< 1.51.5-2> 2

Related Skills

  • quality-reports - General FastQC
  • fastp-workflow - Read trimming
  • alignment-files/alignment-validation - General BAM QC
  • rna-quantification/featurecounts-counting - Quantification after QC
<!-- AUTHOR_SIGNATURE: 9a7f3c2e-MD-BABU-MIA-2026-MSSM-SECURE -->