BioSkills bio-population-genetics-linkage-disequilibrium
Calculate linkage disequilibrium statistics (r², D'), perform LD pruning for population structure analysis, identify haplotype blocks, and visualize LD patterns using PLINK, scikit-allel, and LDBlockShow. Use when calculating LD or pruning variants.
git clone https://github.com/GPTomics/bioSkills
T=$(mktemp -d) && git clone --depth=1 https://github.com/GPTomics/bioSkills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/population-genetics/linkage-disequilibrium" ~/.claude/skills/gptomics-bioskills-bio-population-genetics-linkage-disequilibrium && rm -rf "$T"
population-genetics/linkage-disequilibrium/SKILL.mdVersion Compatibility
Reference examples tested with: matplotlib 3.8+, numpy 1.26+, pandas 2.2+
Before using code patterns, verify installed versions match. If versions differ:
- Python:
thenpip show <package>
to check signatureshelp(module.function) - CLI:
then<tool> --version
to confirm flags<tool> --help
If code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.
Linkage Disequilibrium
"Calculate LD between my variants" → Compute pairwise LD statistics (r², D'), prune correlated variants for independent sets, and identify haplotype blocks from genotype data.
- CLI:
for LD calculation,plink2 --r2
for pruning--indep-pairwise - Python:
for windowed LD in scikit-allelallel.rogers_huff_r()
Calculate LD statistics, prune correlated variants, and identify haplotype blocks.
PLINK LD Calculations
Pairwise r²
# All pairs within window plink2 --bfile data --r2 --ld-window-kb 1000 --ld-window-r2 0.2 --out ld_results # With SNP names in output plink2 --bfile data --r2 inter-chr --ld-window-r2 0 --out all_pairs # Squared correlation matrix plink2 --bfile data --r2-phased square --out ld_matrix
Output Format
# ld_results.ld contains: CHR_A BP_A SNP_A CHR_B BP_B SNP_B R2
PLINK 1.9 Options
# r² with D' statistics plink --bfile data --r2 dprime --ld-window-kb 500 --out ld_with_dprime # Inter-chromosome LD plink --bfile data --r2 inter-chr --ld-snp-list target_snps.txt --out target_ld
LD Pruning
Standard Pruning
# Calculate pruning list plink2 --bfile data --indep-pairwise 50 10 0.1 --out prune # Output files: # prune.prune.in - Variants to keep # prune.prune.out - Variants to remove # Extract pruned set plink2 --bfile data --extract prune.prune.in --make-bed --out data_pruned
Pruning Parameters
| Parameter | Description | Common Values |
|---|---|---|
| Window (50) | Variants per window | 50-200 |
| Step (10) | Variants to shift | 5-50 |
| r² threshold (0.1) | Max LD allowed | 0.1-0.5 |
Use Cases
# Strict pruning for PCA/Admixture plink2 --bfile data --indep-pairwise 50 10 0.1 --out strict_prune # Moderate pruning for polygenic scores plink2 --bfile data --indep-pairwise 200 50 0.5 --out moderate_prune # Region-based pruning plink2 --bfile data --indep-pairwise 50 10 0.2 --chr 6 --from-mb 25 --to-mb 35 --out mhc_prune
scikit-allel LD
Pairwise r²
import allel import numpy as np callset = allel.read_vcf('data.vcf.gz') gt = allel.GenotypeArray(callset['calldata/GT']) pos = callset['variants/POS'] gn = gt.to_n_alt() r2 = allel.rogers_huff_r(gn[:100]) ** 2
LD Decay
Goal: Plot LD decay as a function of physical distance to characterize the extent of linkage in the population.
Approach: Compute pairwise r² values between nearby variants using scikit-allel, bin by physical distance, and plot mean r² per distance bin.
import allel import numpy as np import matplotlib.pyplot as plt gn = gt.to_n_alt() r2, dist = [], [] n_variants = min(1000, gn.shape[0]) for i in range(n_variants): for j in range(i + 1, min(i + 100, n_variants)): r = allel.rogers_huff_r(gn[[i, j]])[0, 1] ** 2 d = pos[j] - pos[i] r2.append(r) dist.append(d) r2 = np.array(r2) dist = np.array(dist) bins = np.arange(0, 100001, 1000) bin_means = [] for i in range(len(bins) - 1): mask = (dist >= bins[i]) & (dist < bins[i + 1]) if mask.sum() > 0: bin_means.append(np.mean(r2[mask])) else: bin_means.append(np.nan) plt.figure(figsize=(10, 6)) plt.plot(bins[:-1] / 1000, bin_means) plt.xlabel('Distance (kb)') plt.ylabel('Mean r²') plt.title('LD Decay') plt.savefig('ld_decay.png')
Haplotype Blocks
PLINK
# Identify haplotype blocks (Gabriel et al.) plink --bfile data --blocks no-pheno-req --out blocks # Output: blocks.blocks (block boundaries) # Output: blocks.blocks.det (block details)
Block Statistics
import pandas as pd blocks = pd.read_csv('blocks.blocks.det', sep='\s+') print(f'Number of blocks: {len(blocks)}') print(f'Mean block size: {blocks["KB"].mean():.1f} kb') print(f'Mean SNPs per block: {blocks["NSNPS"].mean():.1f}')
LD Matrix Visualization
import allel import numpy as np import matplotlib.pyplot as plt gn = gt.to_n_alt()[:200] r = allel.rogers_huff_r(gn) r2_matrix = r ** 2 plt.figure(figsize=(10, 10)) plt.imshow(r2_matrix, cmap='hot', vmin=0, vmax=1) plt.colorbar(label='r²') plt.xlabel('Variant index') plt.ylabel('Variant index') plt.title('LD Matrix') plt.savefig('ld_matrix.png', dpi=150)
LD-based Clumping (GWAS)
# Clump GWAS results by LD plink --bfile data \ --clump gwas_results.txt \ --clump-p1 5e-8 \ --clump-p2 1e-5 \ --clump-r2 0.1 \ --clump-kb 250 \ --out clumped # Output: clumped.clumped (independent signals)
Clump Parameters
| Parameter | Description |
|---|---|
| --clump-p1 | Index SNP p-value threshold |
| --clump-p2 | Clumped SNP p-value threshold |
| --clump-r2 | LD threshold for clumping |
| --clump-kb | Physical distance threshold |
vcftools LD
# Pairwise LD for region vcftools --vcf data.vcf --geno-r2 --ld-window-bp 100000 --out ld_results # Output: ld_results.geno.ld # Haplotype-based r² vcftools --vcf data.vcf --hap-r2 --ld-window-bp 100000 --out hap_ld
Complete Workflow
# 1. Calculate genome-wide LD plink2 --bfile data --r2 --ld-window-kb 500 --ld-window-r2 0.2 --out ld_genome # 2. Generate pruned set for PCA plink2 --bfile data --indep-pairwise 50 10 0.1 --out prune plink2 --bfile data --extract prune.prune.in --make-bed --out pruned # 3. Identify haplotype blocks plink --bfile data --blocks no-pheno-req --out blocks # 4. Visualize LD for specific region plink --bfile data --r2 dprime --chr 6 --from-mb 28 --to-mb 34 --out mhc_ld
Related Skills
- plink-basics - File format handling
- population-structure - Use pruned data for PCA
- association-testing - LD clumping for GWAS
- selection-statistics - LD affects EHH statistics