BioSkills bio-workflows-causal-genomics-pipeline

End-to-end causal inference pipeline from GWAS summary statistics through Mendelian randomization, colocalization, fine-mapping, and mediation analysis. Use when performing post-GWAS causal inference to identify causal exposures, shared causal variants, and mediating mechanisms.

install
source · Clone the upstream repo
git clone https://github.com/GPTomics/bioSkills
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/GPTomics/bioSkills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/workflows/causal-genomics-pipeline" ~/.claude/skills/gptomics-bioskills-bio-workflows-causal-genomics-pipeline && rm -rf "$T"
manifest: workflows/causal-genomics-pipeline/SKILL.md
source content

Version Compatibility

Reference examples tested with: TwoSampleMR 0.5+, coloc 5.2+, susieR 0.12+

Before using code patterns, verify installed versions match. If versions differ:

  • R:
    packageVersion("<pkg>")
    then
    ?function_name
    to verify parameters

If code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.

Causal Genomics Pipeline

"Run post-GWAS causal inference from summary statistics" → Orchestrate instrument selection, Mendelian randomization, pleiotropy sensitivity analyses, colocalization, fine-mapping, and mediation analysis to identify causal exposures and mechanisms.

Complete post-GWAS causal inference workflow from summary statistics to causal mechanisms.

Pipeline Overview

GWAS Summary Statistics (exposure + outcome)
    |
    v
[1. Instrument Selection] ----> LD clumping, F-stat filtering
    |
    v
[2. Mendelian Randomization] -> IVW, MR-Egger, Weighted Median
    |
    +--> [3. Sensitivity Analysis] ---> MR-PRESSO, Egger intercept,
    |                                    leave-one-out, Steiger
    |
    v                           (can run in parallel with MR)
[4. Colocalization] ----------> coloc, eCAVIAR, HyPrColoc
    |
    v
[5. Fine-Mapping] ------------> SuSiE, FINEMAP (credible sets)
    |
    v
[6. Mediation Analysis] ------> Network MR, MVMR
    |
    v
Causal evidence summary (triangulation across methods)

Step 1: Instrument Selection

library(TwoSampleMR)

exposure_dat <- read_exposure_data(
    filename = 'exposure_gwas.tsv',
    sep = '\t',
    snp_col = 'SNP',
    beta_col = 'BETA',
    se_col = 'SE',
    effect_allele_col = 'A1',
    other_allele_col = 'A2',
    eaf_col = 'EAF',
    pval_col = 'P'
)

# Genome-wide significance threshold for instrument selection
# p < 5e-8: Standard GWAS threshold; use 5e-6 for underpowered exposures
exposure_dat <- subset(exposure_dat, pval.exposure < 5e-8)

# LD clumping to ensure independent instruments
# clump_r2 = 0.001: Strict independence; 0.01 for more instruments
# clump_kb = 10000: 10 Mb window, standard for MR
exposure_dat <- clump_data(exposure_dat, clump_r2 = 0.001, clump_kb = 10000)

QC Checkpoint: Instrument Strength

check_instrument_strength <- function(exposure_dat) {
    # F-statistic > 10 to avoid weak instrument bias
    # F = beta^2 / se^2 for single-SNP instruments
    exposure_dat$F_stat <- (exposure_dat$beta.exposure / exposure_dat$se.exposure)^2
    weak <- sum(exposure_dat$F_stat < 10)

    cat(sprintf('Instruments: %d SNPs\n', nrow(exposure_dat)))
    cat(sprintf('Mean F-statistic: %.1f\n', mean(exposure_dat$F_stat)))
    cat(sprintf('Weak instruments (F < 10): %d\n', weak))

    if (weak > 0) {
        cat('WARNING: Remove weak instruments before proceeding\n')
        exposure_dat <- subset(exposure_dat, F_stat >= 10)
    }

    if (nrow(exposure_dat) < 3) {
        cat('WARNING: < 3 instruments. Cannot run MR-Egger or weighted median.\n')
    }

    return(exposure_dat)
}

exposure_dat <- check_instrument_strength(exposure_dat)

Step 2: Mendelian Randomization

outcome_dat <- read_outcome_data(
    filename = 'outcome_gwas.tsv',
    sep = '\t',
    snp_col = 'SNP',
    beta_col = 'BETA',
    se_col = 'SE',
    effect_allele_col = 'A1',
    other_allele_col = 'A2',
    eaf_col = 'EAF',
    pval_col = 'P'
)

# Harmonize exposure and outcome data (align alleles)
dat <- harmonise_data(exposure_dat, outcome_dat)

# Run multiple MR methods for triangulation
mr_results <- mr(dat, method_list = c(
    'mr_ivw',              # Primary method (assumes no pleiotropy)
    'mr_egger_regression', # Allows directional pleiotropy (needs >= 3 SNPs)
    'mr_weighted_median',  # Robust to up to 50% invalid instruments
    'mr_weighted_mode'     # Most robust, least powerful
))

print(mr_results[, c('method', 'nsnp', 'b', 'se', 'pval')])

QC Checkpoint: MR Consistency

check_mr_consistency <- function(mr_results) {
    # All methods should agree on direction of effect
    directions <- sign(mr_results$b)
    consistent <- length(unique(directions)) == 1

    if (!consistent) {
        cat('WARNING: Inconsistent effect directions across methods.\n')
        cat('This suggests potential pleiotropy or weak instruments.\n')
    } else {
        cat('OK: All methods agree on direction of effect.\n')
    }

    # IVW should be significant (primary test)
    ivw <- mr_results[mr_results$method == 'Inverse variance weighted', ]
    cat(sprintf('IVW: beta = %.3f, p = %.2e\n', ivw$b, ivw$pval))

    return(consistent)
}

check_mr_consistency(mr_results)

Step 3: Sensitivity Analysis

# MR-PRESSO: detect and correct for outlier instruments
library(MRPRESSO)

presso_results <- mr_presso(
    BetaOutcome = 'beta.outcome',
    BetaExposure = 'beta.exposure',
    SdOutcome = 'se.outcome',
    SdExposure = 'se.exposure',
    OUTLIERtest = TRUE,
    DISTORTIONtest = TRUE,
    data = dat,
    NbDistribution = 3000,
    SignifThreshold = 0.05
)

# Global test: p > 0.05 means no significant outliers
cat(sprintf('MR-PRESSO global test p: %.4f\n', presso_results$`MR-PRESSO results`$`Global Test`$Pvalue))

# MR-Egger intercept test for directional pleiotropy
egger_intercept <- mr_pleiotropy_test(dat)
cat(sprintf('Egger intercept: %.4f, p: %.4f\n', egger_intercept$egger_intercept, egger_intercept$pval))
# p > 0.05: No evidence of directional pleiotropy (good)

# Cochran Q test for heterogeneity
het <- mr_heterogeneity(dat)
cat(sprintf('Cochran Q p: %.4f\n', het$Q_pval[het$method == 'Inverse variance weighted']))

# Leave-one-out analysis
loo <- mr_leaveoneout(dat)

# Steiger directionality test
steiger <- directionality_test(dat)
cat(sprintf('Steiger: correct direction = %s, p = %.2e\n', steiger$correct_causal_direction, steiger$steiger_pval))

QC Checkpoint: Sensitivity

check_sensitivity <- function(presso_pval, egger_intercept_pval, steiger_correct) {
    issues <- c()

    if (presso_pval < 0.05) {
        issues <- c(issues, 'MR-PRESSO detects outliers - use corrected estimate')
    }
    if (egger_intercept_pval < 0.05) {
        issues <- c(issues, 'Egger intercept significant - directional pleiotropy present')
    }
    if (!steiger_correct) {
        issues <- c(issues, 'Steiger test suggests reverse causation')
    }

    if (length(issues) == 0) {
        cat('OK: All sensitivity tests passed.\n')
    } else {
        cat('WARNINGS:\n')
        for (issue in issues) cat(sprintf('  - %s\n', issue))
    }

    return(length(issues) == 0)
}

Step 4: Colocalization

library(coloc)

# Prepare datasets (need full summary statistics for the locus)
dataset1 <- list(
    beta = exposure_locus$BETA,
    varbeta = exposure_locus$SE^2,
    snp = exposure_locus$SNP,
    position = exposure_locus$BP,
    type = 'quant',
    N = exposure_n,
    MAF = exposure_locus$EAF
)

dataset2 <- list(
    beta = outcome_locus$BETA,
    varbeta = outcome_locus$SE^2,
    snp = outcome_locus$SNP,
    position = outcome_locus$BP,
    type = 'cc',
    N = outcome_n,
    s = case_fraction,
    MAF = outcome_locus$EAF
)

# Run colocalization
# Priors: p1=1e-4, p2=1e-4 (standard), p12=1e-5 (conservative)
# Use p12=5e-6 for stricter colocalization
result <- coloc.abf(dataset1, dataset2, p1 = 1e-4, p2 = 1e-4, p12 = 1e-5)

cat(sprintf('PP.H0 (neither): %.3f\n', result$summary['PP.H0.abf']))
cat(sprintf('PP.H1 (exposure only): %.3f\n', result$summary['PP.H1.abf']))
cat(sprintf('PP.H2 (outcome only): %.3f\n', result$summary['PP.H2.abf']))
cat(sprintf('PP.H3 (both, different variants): %.3f\n', result$summary['PP.H3.abf']))
cat(sprintf('PP.H4 (shared causal variant): %.3f\n', result$summary['PP.H4.abf']))

# PP.H4 > 0.8: Strong evidence for shared causal variant
# PP.H4 > 0.5: Suggestive colocalization
# PP.H3 > 0.5: Different causal variants at same locus

Step 5: Fine-Mapping with SuSiE

library(susieR)

# Prepare LD matrix for the locus (from reference panel or in-sample)
# R: correlation matrix from genotype data
R <- as.matrix(read.csv('ld_matrix.csv', row.names = 1))

fitted <- susie_rss(
    bhat = locus_stats$BETA,
    shat = locus_stats$SE,
    R = R,
    n = sample_size,
    L = 10,  # Maximum number of causal variants (10 is standard)
    coverage = 0.95  # 95% credible set coverage
)

# Extract credible sets
cs <- fitted$sets$cs
for (i in seq_along(cs)) {
    snps_in_cs <- locus_stats$SNP[cs[[i]]]
    pip <- fitted$pip[cs[[i]]]
    cat(sprintf('Credible set %d: %d SNPs, min PIP = %.3f\n', i, length(snps_in_cs), min(pip)))
}

# SNPs with PIP > 0.5 are strong causal candidates
high_pip <- locus_stats$SNP[fitted$pip > 0.5]
cat(sprintf('High-confidence causal SNPs (PIP > 0.5): %s\n', paste(high_pip, collapse = ', ')))

Step 6: Mediation Analysis with MVMR

library(MendelianRandomization)

# Multivariable MR: test if mediator explains exposure-outcome relationship
# Example: BMI -> CRP -> CHD (is CRP a mediator?)

# Get instruments for all exposures
exposure_bmi <- mv_extract_exposures(c('ieu-a-2', 'ieu-a-1089'))  # BMI and CRP from IEU GWAS DB

outcome_chd <- mv_extract_outcome(exposure_bmi$SNP, 'ieu-a-7')

mvdat <- mv_harmonise_data(exposure_bmi, outcome_chd)
mvmr_result <- mv_multiple(mvdat)

cat('MVMR results (direct effects):\n')
print(mvmr_result$result[, c('exposure', 'b', 'se', 'pval')])

# If direct effect of exposure attenuates after adjusting for mediator,
# the mediator partially explains the causal path

Complete Pipeline Script

library(TwoSampleMR)
library(MRPRESSO)
library(coloc)
library(susieR)

run_causal_pipeline <- function(exposure_file, outcome_file, locus_chr, locus_start, locus_end,
                                 ld_matrix_file = NULL, exposure_n = NULL, outcome_n = NULL,
                                 case_fraction = NULL) {

    cat('=== Step 1: Instrument Selection ===\n')
    exposure_dat <- read_exposure_data(
        filename = exposure_file, sep = '\t',
        snp_col = 'SNP', beta_col = 'BETA', se_col = 'SE',
        effect_allele_col = 'A1', other_allele_col = 'A2',
        eaf_col = 'EAF', pval_col = 'P'
    )
    exposure_dat <- subset(exposure_dat, pval.exposure < 5e-8)
    exposure_dat <- clump_data(exposure_dat, clump_r2 = 0.001, clump_kb = 10000)
    exposure_dat$F_stat <- (exposure_dat$beta.exposure / exposure_dat$se.exposure)^2
    exposure_dat <- subset(exposure_dat, F_stat >= 10)
    cat(sprintf('Selected %d instruments (mean F = %.1f)\n', nrow(exposure_dat), mean(exposure_dat$F_stat)))

    cat('\n=== Step 2: Mendelian Randomization ===\n')
    outcome_dat <- read_outcome_data(
        filename = outcome_file, sep = '\t',
        snp_col = 'SNP', beta_col = 'BETA', se_col = 'SE',
        effect_allele_col = 'A1', other_allele_col = 'A2',
        eaf_col = 'EAF', pval_col = 'P'
    )
    dat <- harmonise_data(exposure_dat, outcome_dat)
    mr_results <- mr(dat, method_list = c('mr_ivw', 'mr_egger_regression',
                                           'mr_weighted_median', 'mr_weighted_mode'))
    print(mr_results[, c('method', 'nsnp', 'b', 'se', 'pval')])

    cat('\n=== Step 3: Sensitivity Analysis ===\n')
    presso <- mr_presso(
        BetaOutcome = 'beta.outcome', BetaExposure = 'beta.exposure',
        SdOutcome = 'se.outcome', SdExposure = 'se.exposure',
        OUTLIERtest = TRUE, DISTORTIONtest = TRUE,
        data = dat, NbDistribution = 3000, SignifThreshold = 0.05
    )
    egger_int <- mr_pleiotropy_test(dat)
    steiger <- directionality_test(dat)
    cat(sprintf('MR-PRESSO global p: %.4f\n', presso$`MR-PRESSO results`$`Global Test`$Pvalue))
    cat(sprintf('Egger intercept p: %.4f\n', egger_int$pval))
    cat(sprintf('Steiger correct direction: %s\n', steiger$correct_causal_direction))

    cat('\n=== Step 4: Colocalization ===\n')
    if (!is.null(exposure_n) && !is.null(outcome_n)) {
        exposure_locus <- read.delim(exposure_file)
        exposure_locus <- subset(exposure_locus, CHR == locus_chr & BP >= locus_start & BP <= locus_end)
        outcome_locus <- read.delim(outcome_file)
        outcome_locus <- subset(outcome_locus, CHR == locus_chr & BP >= locus_start & BP <= locus_end)

        common_snps <- intersect(exposure_locus$SNP, outcome_locus$SNP)
        exposure_locus <- exposure_locus[match(common_snps, exposure_locus$SNP), ]
        outcome_locus <- outcome_locus[match(common_snps, outcome_locus$SNP), ]

        d1 <- list(beta = exposure_locus$BETA, varbeta = exposure_locus$SE^2,
                    snp = exposure_locus$SNP, position = exposure_locus$BP,
                    type = 'quant', N = exposure_n, MAF = exposure_locus$EAF)
        d2 <- list(beta = outcome_locus$BETA, varbeta = outcome_locus$SE^2,
                    snp = outcome_locus$SNP, position = outcome_locus$BP,
                    type = 'cc', N = outcome_n, s = case_fraction, MAF = outcome_locus$EAF)

        coloc_result <- coloc.abf(d1, d2, p1 = 1e-4, p2 = 1e-4, p12 = 1e-5)
        cat(sprintf('PP.H4 (shared causal variant): %.3f\n', coloc_result$summary['PP.H4.abf']))
    }

    cat('\n=== Step 5: Fine-Mapping ===\n')
    if (!is.null(ld_matrix_file)) {
        R <- as.matrix(read.csv(ld_matrix_file, row.names = 1))
        fitted <- susie_rss(
            bhat = exposure_locus$BETA, shat = exposure_locus$SE,
            R = R, n = exposure_n, L = 10, coverage = 0.95
        )
        high_pip <- exposure_locus$SNP[fitted$pip > 0.5]
        cat(sprintf('Causal SNPs (PIP > 0.5): %s\n', paste(high_pip, collapse = ', ')))
    }

    cat('\n=== Pipeline Complete ===\n')
    return(list(mr = mr_results, harmonised = dat, sensitivity = list(
        presso = presso, egger_intercept = egger_int, steiger = steiger
    )))
}

Parameter Recommendations

StepParameterRecommendation
Instrumentsp-value threshold5e-8 (standard), 5e-6 for underpowered
Instrumentsclump_r20.001 (strict), 0.01 for more instruments
InstrumentsF-statistic> 10 required, > 30 preferred
Colocp12 prior1e-5 (standard), 5e-6 (conservative)
ColocPP.H4 threshold> 0.8 strong, > 0.5 suggestive
SuSiEL (max causal)10 (default), reduce for simple loci
SuSiEcoverage0.95 (95% credible set)
MR-PRESSONbDistribution3000 (standard), 10000 for publication

Troubleshooting

IssueLikely CauseSolution
No instrumentsUnderpowered GWASRelax p-value to 5e-6 (with caution)
Weak instruments (F < 10)Small effect SNPsRemove weak instruments; use more powered GWAS
Inconsistent MR methodsPleiotropyCheck MR-PRESSO outliers; use weighted median
Egger intercept p < 0.05Directional pleiotropyReport Egger estimate as primary
PP.H3 > PP.H4Different causal variantsLoci associate through different mechanisms
No credible setsLD matrix issuesVerify LD matrix matches summary stats population
Steiger reverse directionReverse causationConsider bidirectional MR

Related Skills

  • causal-genomics/mendelian-randomization - MR method details
  • causal-genomics/colocalization-analysis - Coloc implementation
  • causal-genomics/fine-mapping - SuSiE and FINEMAP details
  • causal-genomics/pleiotropy-detection - MR-PRESSO and sensitivity tests
  • causal-genomics/mediation-analysis - MVMR and network MR
  • population-genetics/association-testing - Upstream GWAS methods