BioSkills bio-population-genetics-linkage-disequilibrium

Calculate linkage disequilibrium statistics (r², D'), perform LD pruning for population structure analysis, identify haplotype blocks, and visualize LD patterns using PLINK, scikit-allel, and LDBlockShow. Use when calculating LD or pruning variants.

install

source · Clone the upstream repo

git clone https://github.com/GPTomics/bioSkills

Claude Code · Install into ~/.claude/skills/

T=$(mktemp -d) && git clone --depth=1 https://github.com/GPTomics/bioSkills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/population-genetics/linkage-disequilibrium" ~/.claude/skills/gptomics-bioskills-bio-population-genetics-linkage-disequilibrium && rm -rf "$T"

manifest: population-genetics/linkage-disequilibrium/SKILL.md

source content

Version Compatibility

Reference examples tested with: matplotlib 3.8+, numpy 1.26+, pandas 2.2+

Before using code patterns, verify installed versions match. If versions differ:

Python:
```
pip show <package>
```
then
```
help(module.function)
```
to check signatures
CLI:
```
<tool> --version
```
then
```
<tool> --help
```
to confirm flags

If code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.

Linkage Disequilibrium

"Calculate LD between my variants" → Compute pairwise LD statistics (r², D'), prune correlated variants for independent sets, and identify haplotype blocks from genotype data.

CLI:
```
plink2 --r2
```
for LD calculation,
```
--indep-pairwise
```
for pruning
Python:
```
allel.rogers_huff_r()
```
for windowed LD in scikit-allel

Calculate LD statistics, prune correlated variants, and identify haplotype blocks.

PLINK LD Calculations

Pairwise r²

# All pairs within window
plink2 --bfile data --r2 --ld-window-kb 1000 --ld-window-r2 0.2 --out ld_results

# With SNP names in output
plink2 --bfile data --r2 inter-chr --ld-window-r2 0 --out all_pairs

# Squared correlation matrix
plink2 --bfile data --r2-phased square --out ld_matrix

Output Format

# ld_results.ld contains:
CHR_A  BP_A  SNP_A  CHR_B  BP_B  SNP_B  R2

PLINK 1.9 Options

# r² with D' statistics
plink --bfile data --r2 dprime --ld-window-kb 500 --out ld_with_dprime

# Inter-chromosome LD
plink --bfile data --r2 inter-chr --ld-snp-list target_snps.txt --out target_ld

LD Pruning

Standard Pruning

# Calculate pruning list
plink2 --bfile data --indep-pairwise 50 10 0.1 --out prune

# Output files:
# prune.prune.in  - Variants to keep
# prune.prune.out - Variants to remove

# Extract pruned set
plink2 --bfile data --extract prune.prune.in --make-bed --out data_pruned

Pruning Parameters

Parameter	Description	Common Values
Window (50)	Variants per window	50-200
Step (10)	Variants to shift	5-50
r² threshold (0.1)	Max LD allowed	0.1-0.5

Use Cases

# Strict pruning for PCA/Admixture
plink2 --bfile data --indep-pairwise 50 10 0.1 --out strict_prune

# Moderate pruning for polygenic scores
plink2 --bfile data --indep-pairwise 200 50 0.5 --out moderate_prune

# Region-based pruning
plink2 --bfile data --indep-pairwise 50 10 0.2 --chr 6 --from-mb 25 --to-mb 35 --out mhc_prune

scikit-allel LD

Pairwise r²

import allel
import numpy as np

callset = allel.read_vcf('data.vcf.gz')
gt = allel.GenotypeArray(callset['calldata/GT'])
pos = callset['variants/POS']

gn = gt.to_n_alt()

r2 = allel.rogers_huff_r(gn[:100]) ** 2

LD Decay

Goal: Plot LD decay as a function of physical distance to characterize the extent of linkage in the population.

Approach: Compute pairwise r² values between nearby variants using scikit-allel, bin by physical distance, and plot mean r² per distance bin.

import allel
import numpy as np
import matplotlib.pyplot as plt

gn = gt.to_n_alt()

r2, dist = [], []
n_variants = min(1000, gn.shape[0])

for i in range(n_variants):
    for j in range(i + 1, min(i + 100, n_variants)):
        r = allel.rogers_huff_r(gn[[i, j]])[0, 1] ** 2
        d = pos[j] - pos[i]
        r2.append(r)
        dist.append(d)

r2 = np.array(r2)
dist = np.array(dist)

bins = np.arange(0, 100001, 1000)
bin_means = []
for i in range(len(bins) - 1):
    mask = (dist >= bins[i]) & (dist < bins[i + 1])
    if mask.sum() > 0:
        bin_means.append(np.mean(r2[mask]))
    else:
        bin_means.append(np.nan)

plt.figure(figsize=(10, 6))
plt.plot(bins[:-1] / 1000, bin_means)
plt.xlabel('Distance (kb)')
plt.ylabel('Mean r²')
plt.title('LD Decay')
plt.savefig('ld_decay.png')

Haplotype Blocks

PLINK

# Identify haplotype blocks (Gabriel et al.)
plink --bfile data --blocks no-pheno-req --out blocks

# Output: blocks.blocks (block boundaries)
# Output: blocks.blocks.det (block details)

Block Statistics

import pandas as pd

blocks = pd.read_csv('blocks.blocks.det', sep='\s+')

print(f'Number of blocks: {len(blocks)}')
print(f'Mean block size: {blocks["KB"].mean():.1f} kb')
print(f'Mean SNPs per block: {blocks["NSNPS"].mean():.1f}')

LD Matrix Visualization

import allel
import numpy as np
import matplotlib.pyplot as plt

gn = gt.to_n_alt()[:200]

r = allel.rogers_huff_r(gn)
r2_matrix = r ** 2

plt.figure(figsize=(10, 10))
plt.imshow(r2_matrix, cmap='hot', vmin=0, vmax=1)
plt.colorbar(label='r²')
plt.xlabel('Variant index')
plt.ylabel('Variant index')
plt.title('LD Matrix')
plt.savefig('ld_matrix.png', dpi=150)

LD-based Clumping (GWAS)

# Clump GWAS results by LD
plink --bfile data \
    --clump gwas_results.txt \
    --clump-p1 5e-8 \
    --clump-p2 1e-5 \
    --clump-r2 0.1 \
    --clump-kb 250 \
    --out clumped

# Output: clumped.clumped (independent signals)

Clump Parameters

Parameter	Description
--clump-p1	Index SNP p-value threshold
--clump-p2	Clumped SNP p-value threshold
--clump-r2	LD threshold for clumping
--clump-kb	Physical distance threshold

vcftools LD

# Pairwise LD for region
vcftools --vcf data.vcf --geno-r2 --ld-window-bp 100000 --out ld_results

# Output: ld_results.geno.ld

# Haplotype-based r²
vcftools --vcf data.vcf --hap-r2 --ld-window-bp 100000 --out hap_ld

Complete Workflow

# 1. Calculate genome-wide LD
plink2 --bfile data --r2 --ld-window-kb 500 --ld-window-r2 0.2 --out ld_genome

# 2. Generate pruned set for PCA
plink2 --bfile data --indep-pairwise 50 10 0.1 --out prune
plink2 --bfile data --extract prune.prune.in --make-bed --out pruned

# 3. Identify haplotype blocks
plink --bfile data --blocks no-pheno-req --out blocks

# 4. Visualize LD for specific region
plink --bfile data --r2 dprime --chr 6 --from-mb 28 --to-mb 34 --out mhc_ld

Related Skills

plink-basics - File format handling
population-structure - Use pruned data for PCA
association-testing - LD clumping for GWAS
selection-statistics - LD affects EHH statistics