Awesome-Agent-Skills-for-Empirical-Research genomics-analysis-guide

Workflows for RNA-seq, GWAS, and variant calling in genomic research

install

source · Clone the upstream repo

git clone https://github.com/brycewang-stanford/Awesome-Agent-Skills-for-Empirical-Research

Claude Code · Install into ~/.claude/skills/

T=$(mktemp -d) && git clone --depth=1 https://github.com/brycewang-stanford/Awesome-Agent-Skills-for-Empirical-Research "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/43-wentorai-research-plugins/skills/domains/biomedical/genomics-analysis-guide" ~/.claude/skills/brycewang-stanford-awesome-agent-skills-for-empirical-research-genomics-analysis && rm -rf "$T"

manifest: skills/43-wentorai-research-plugins/skills/domains/biomedical/genomics-analysis-guide/SKILL.md

source content

Genomics Analysis Guide

Overview

Genomic data analysis is the computational backbone of modern molecular biology. From identifying disease-associated variants through Genome-Wide Association Studies (GWAS) to quantifying gene expression with RNA-seq, these workflows transform raw sequencing data into biological insights that drive discoveries in medicine, agriculture, and evolutionary biology.

This guide covers the three most common genomic analysis workflows: RNA-seq differential expression analysis, GWAS for variant-trait associations, and variant calling from whole-genome sequencing (WGS) data. Each workflow is described with tool recommendations, command-line examples, and downstream analysis steps in R and Python.

The emphasis is on reproducibility and best practices. Genomic analyses involve many sequential steps, and errors in early stages propagate through the entire pipeline. Following standardized workflows -- like those from the Broad Institute, ENCODE, and Bioconductor -- reduces the risk of methodological errors.

RNA-seq Analysis Pipeline

Workflow Overview

Raw FASTQ files
    |
    v
[Quality Control] --> FastQC, MultiQC
    |
    v
[Trimming] --> Trimmomatic, fastp
    |
    v
[Alignment] --> STAR, HISAT2
    |
    v
[Quantification] --> featureCounts, Salmon
    |
    v
[Differential Expression] --> DESeq2, edgeR
    |
    v
[Pathway Analysis] --> clusterProfiler, GSEA

Step 1: Quality Control

# Run FastQC on all FASTQ files
fastqc -t 8 -o qc_results/ raw_data/*.fastq.gz

# Aggregate QC reports
multiqc qc_results/ -o multiqc_report/

Step 2: Read Trimming

# fastp for quality trimming and adapter removal
fastp \
  --in1 sample_R1.fastq.gz \
  --in2 sample_R2.fastq.gz \
  --out1 trimmed_R1.fastq.gz \
  --out2 trimmed_R2.fastq.gz \
  --detect_adapter_for_pe \
  --thread 8 \
  --html fastp_report.html

Step 3: Alignment with STAR

# Build genome index (one time)
STAR --runMode genomeGenerate \
  --genomeDir star_index/ \
  --genomeFastaFiles genome.fa \
  --sjdbGTFfile annotations.gtf \
  --runThreadN 16

# Align reads
STAR --runMode alignReads \
  --genomeDir star_index/ \
  --readFilesIn trimmed_R1.fastq.gz trimmed_R2.fastq.gz \
  --readFilesCommand zcat \
  --outSAMtype BAM SortedByCoordinate \
  --quantMode GeneCounts \
  --outFileNamePrefix sample_ \
  --runThreadN 16

Step 4: Differential Expression with DESeq2

library(DESeq2)

# Load count matrix and sample info
counts <- read.csv("gene_counts.csv", row.names = 1)
coldata <- read.csv("sample_info.csv", row.names = 1)

# Create DESeq2 object
dds <- DESeqDataSetFromMatrix(
  countData = counts,
  colData = coldata,
  design = ~ condition
)

# Filter low-count genes
keep <- rowSums(counts(dds) >= 10) >= 3
dds <- dds[keep, ]

# Run differential expression
dds <- DESeq(dds)
res <- results(dds, contrast = c("condition", "treated", "control"),
               alpha = 0.05)

# Summary
summary(res)

# Export significant genes
sig_genes <- subset(as.data.frame(res), padj < 0.05 & abs(log2FoldChange) > 1)
write.csv(sig_genes, "significant_genes.csv")

GWAS Pipeline

Workflow Overview

Genotype Data (VCF/PLINK)
    |
    v
[Quality Control] --> Sample/variant filtering
    |
    v
[Population Stratification] --> PCA
    |
    v
[Association Testing] --> PLINK2, REGENIE
    |
    v
[Multiple Testing Correction] --> Bonferroni, FDR
    |
    v
[Visualization] --> Manhattan plot, QQ plot

QC with PLINK2

# Sample QC
plink2 \
  --bfile dataset \
  --mind 0.05 \          # Remove samples with >5% missing
  --geno 0.02 \          # Remove variants with >2% missing
  --maf 0.01 \           # Remove rare variants (MAF < 1%)
  --hwe 1e-6 \           # HWE filter
  --make-bed \
  --out dataset_qc

# LD pruning for PCA
plink2 \
  --bfile dataset_qc \
  --indep-pairwise 50 5 0.2 \
  --out pruned

# PCA for population stratification
plink2 \
  --bfile dataset_qc \
  --extract pruned.prune.in \
  --pca 10 \
  --out pca_results

Association Testing

# Linear/logistic regression with covariates
plink2 \
  --bfile dataset_qc \
  --glm \
  --pheno phenotypes.txt \
  --covar pca_results.eigenvec \
  --covar-col-nums 3-12 \
  --out gwas_results

Manhattan Plot in Python

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

def manhattan_plot(gwas_file, output='manhattan.pdf'):
    df = pd.read_csv(gwas_file, sep='\t')
    df['-log10p'] = -np.log10(df['P'])

    # Assign cumulative positions
    df = df.sort_values(['CHR', 'BP'])
    df['pos_cum'] = 0
    offset = 0
    for chrom in df['CHR'].unique():
        mask = df['CHR'] == chrom
        df.loc[mask, 'pos_cum'] = df.loc[mask, 'BP'] + offset
        offset = df.loc[mask, 'pos_cum'].max()

    fig, ax = plt.subplots(figsize=(16, 5))
    colors = ['#3B82F6', '#94A3B8']
    for i, chrom in enumerate(df['CHR'].unique()):
        subset = df[df['CHR'] == chrom]
        ax.scatter(subset['pos_cum'], subset['-log10p'],
                   s=2, color=colors[i % 2], alpha=0.7)

    ax.axhline(-np.log10(5e-8), color='red', linestyle='--', linewidth=0.8)
    ax.set_xlabel('Chromosome')
    ax.set_ylabel('-log10(p-value)')
    fig.savefig(output, dpi=300, bbox_inches='tight')

Variant Calling Pipeline

GATK Best Practices

# Mark duplicates
gatk MarkDuplicates \
  -I aligned.bam \
  -O dedup.bam \
  -M metrics.txt

# Base quality score recalibration
gatk BaseRecalibrator \
  -I dedup.bam \
  -R reference.fa \
  --known-sites dbsnp.vcf \
  -O recal_table.txt

gatk ApplyBQSR \
  -I dedup.bam \
  -R reference.fa \
  --bqsr-recal-file recal_table.txt \
  -O recal.bam

# Call variants
gatk HaplotypeCaller \
  -I recal.bam \
  -R reference.fa \
  -O variants.g.vcf \
  -ERC GVCF

Best Practices

Use containerized workflows. Nextflow + Docker/Singularity ensures reproducibility across environments.
Document every parameter. Small changes in alignment settings can significantly affect downstream results.
Apply appropriate multiple testing corrections. Genome-wide significance is p < 5e-8 for GWAS.
Validate findings in independent cohorts. Replication is essential before biological interpretation.
Archive raw data and analysis scripts. Deposit in GEO (expression) or dbGaP (genotypes) for reproducibility.
Use established pipelines (nf-core). Community-maintained Nextflow pipelines encode best practices.

References

DESeq2 Vignette -- Differential expression analysis
GATK Best Practices -- Variant calling
PLINK2 Documentation -- Genetic association analysis
nf-core Pipelines -- Community Nextflow workflows
RNA-seq Analysis Tutorial -- Griffith Lab comprehensive tutorial