BioSkills bio-population-genetics-linkage-disequilibrium

Calculate linkage disequilibrium statistics (r², D'), perform LD pruning for population structure analysis, identify haplotype blocks, and visualize LD patterns using PLINK, scikit-allel, and LDBlockShow. Use when calculating LD or pruning variants.

install
source · Clone the upstream repo
git clone https://github.com/GPTomics/bioSkills
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/GPTomics/bioSkills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/population-genetics/linkage-disequilibrium" ~/.claude/skills/gptomics-bioskills-bio-population-genetics-linkage-disequilibrium && rm -rf "$T"
manifest: population-genetics/linkage-disequilibrium/SKILL.md
source content

Version Compatibility

Reference examples tested with: matplotlib 3.8+, numpy 1.26+, pandas 2.2+

Before using code patterns, verify installed versions match. If versions differ:

  • Python:
    pip show <package>
    then
    help(module.function)
    to check signatures
  • CLI:
    <tool> --version
    then
    <tool> --help
    to confirm flags

If code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.

Linkage Disequilibrium

"Calculate LD between my variants" → Compute pairwise LD statistics (r², D'), prune correlated variants for independent sets, and identify haplotype blocks from genotype data.

  • CLI:
    plink2 --r2
    for LD calculation,
    --indep-pairwise
    for pruning
  • Python:
    allel.rogers_huff_r()
    for windowed LD in scikit-allel

Calculate LD statistics, prune correlated variants, and identify haplotype blocks.

PLINK LD Calculations

Pairwise r²

# All pairs within window
plink2 --bfile data --r2 --ld-window-kb 1000 --ld-window-r2 0.2 --out ld_results

# With SNP names in output
plink2 --bfile data --r2 inter-chr --ld-window-r2 0 --out all_pairs

# Squared correlation matrix
plink2 --bfile data --r2-phased square --out ld_matrix

Output Format

# ld_results.ld contains:
CHR_A  BP_A  SNP_A  CHR_B  BP_B  SNP_B  R2

PLINK 1.9 Options

# r² with D' statistics
plink --bfile data --r2 dprime --ld-window-kb 500 --out ld_with_dprime

# Inter-chromosome LD
plink --bfile data --r2 inter-chr --ld-snp-list target_snps.txt --out target_ld

LD Pruning

Standard Pruning

# Calculate pruning list
plink2 --bfile data --indep-pairwise 50 10 0.1 --out prune

# Output files:
# prune.prune.in  - Variants to keep
# prune.prune.out - Variants to remove

# Extract pruned set
plink2 --bfile data --extract prune.prune.in --make-bed --out data_pruned

Pruning Parameters

ParameterDescriptionCommon Values
Window (50)Variants per window50-200
Step (10)Variants to shift5-50
r² threshold (0.1)Max LD allowed0.1-0.5

Use Cases

# Strict pruning for PCA/Admixture
plink2 --bfile data --indep-pairwise 50 10 0.1 --out strict_prune

# Moderate pruning for polygenic scores
plink2 --bfile data --indep-pairwise 200 50 0.5 --out moderate_prune

# Region-based pruning
plink2 --bfile data --indep-pairwise 50 10 0.2 --chr 6 --from-mb 25 --to-mb 35 --out mhc_prune

scikit-allel LD

Pairwise r²

import allel
import numpy as np

callset = allel.read_vcf('data.vcf.gz')
gt = allel.GenotypeArray(callset['calldata/GT'])
pos = callset['variants/POS']

gn = gt.to_n_alt()

r2 = allel.rogers_huff_r(gn[:100]) ** 2

LD Decay

Goal: Plot LD decay as a function of physical distance to characterize the extent of linkage in the population.

Approach: Compute pairwise r² values between nearby variants using scikit-allel, bin by physical distance, and plot mean r² per distance bin.

import allel
import numpy as np
import matplotlib.pyplot as plt

gn = gt.to_n_alt()

r2, dist = [], []
n_variants = min(1000, gn.shape[0])

for i in range(n_variants):
    for j in range(i + 1, min(i + 100, n_variants)):
        r = allel.rogers_huff_r(gn[[i, j]])[0, 1] ** 2
        d = pos[j] - pos[i]
        r2.append(r)
        dist.append(d)

r2 = np.array(r2)
dist = np.array(dist)

bins = np.arange(0, 100001, 1000)
bin_means = []
for i in range(len(bins) - 1):
    mask = (dist >= bins[i]) & (dist < bins[i + 1])
    if mask.sum() > 0:
        bin_means.append(np.mean(r2[mask]))
    else:
        bin_means.append(np.nan)

plt.figure(figsize=(10, 6))
plt.plot(bins[:-1] / 1000, bin_means)
plt.xlabel('Distance (kb)')
plt.ylabel('Mean r²')
plt.title('LD Decay')
plt.savefig('ld_decay.png')

Haplotype Blocks

PLINK

# Identify haplotype blocks (Gabriel et al.)
plink --bfile data --blocks no-pheno-req --out blocks

# Output: blocks.blocks (block boundaries)
# Output: blocks.blocks.det (block details)

Block Statistics

import pandas as pd

blocks = pd.read_csv('blocks.blocks.det', sep='\s+')

print(f'Number of blocks: {len(blocks)}')
print(f'Mean block size: {blocks["KB"].mean():.1f} kb')
print(f'Mean SNPs per block: {blocks["NSNPS"].mean():.1f}')

LD Matrix Visualization

import allel
import numpy as np
import matplotlib.pyplot as plt

gn = gt.to_n_alt()[:200]

r = allel.rogers_huff_r(gn)
r2_matrix = r ** 2

plt.figure(figsize=(10, 10))
plt.imshow(r2_matrix, cmap='hot', vmin=0, vmax=1)
plt.colorbar(label='r²')
plt.xlabel('Variant index')
plt.ylabel('Variant index')
plt.title('LD Matrix')
plt.savefig('ld_matrix.png', dpi=150)

LD-based Clumping (GWAS)

# Clump GWAS results by LD
plink --bfile data \
    --clump gwas_results.txt \
    --clump-p1 5e-8 \
    --clump-p2 1e-5 \
    --clump-r2 0.1 \
    --clump-kb 250 \
    --out clumped

# Output: clumped.clumped (independent signals)

Clump Parameters

ParameterDescription
--clump-p1Index SNP p-value threshold
--clump-p2Clumped SNP p-value threshold
--clump-r2LD threshold for clumping
--clump-kbPhysical distance threshold

vcftools LD

# Pairwise LD for region
vcftools --vcf data.vcf --geno-r2 --ld-window-bp 100000 --out ld_results

# Output: ld_results.geno.ld

# Haplotype-based r²
vcftools --vcf data.vcf --hap-r2 --ld-window-bp 100000 --out hap_ld

Complete Workflow

# 1. Calculate genome-wide LD
plink2 --bfile data --r2 --ld-window-kb 500 --ld-window-r2 0.2 --out ld_genome

# 2. Generate pruned set for PCA
plink2 --bfile data --indep-pairwise 50 10 0.1 --out prune
plink2 --bfile data --extract prune.prune.in --make-bed --out pruned

# 3. Identify haplotype blocks
plink --bfile data --blocks no-pheno-req --out blocks

# 4. Visualize LD for specific region
plink --bfile data --r2 dprime --chr 6 --from-mb 28 --to-mb 34 --out mhc_ld

Related Skills

  • plink-basics - File format handling
  • population-structure - Use pruned data for PCA
  • association-testing - LD clumping for GWAS
  • selection-statistics - LD affects EHH statistics