Claude-skill-registry bio-phasing-imputation-imputation-qc

Quality control of phasing and imputation results. Filter by INFO scores, assess accuracy, and prepare imputed data for downstream analysis. Use when filtering low-quality imputed variants or validating imputation accuracy before GWAS.

install
source · Clone the upstream repo
git clone https://github.com/majiayu000/claude-skill-registry
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/imputation-qc" ~/.claude/skills/majiayu000-claude-skill-registry-bio-phasing-imputation-imputation-qc && rm -rf "$T"
manifest: skills/data/imputation-qc/SKILL.md
source content

Imputation QC

Extract INFO Scores

# Beagle DR2 (dosage R-squared)
bcftools query -f '%CHROM\t%POS\t%ID\t%REF\t%ALT\t%INFO/DR2\t%INFO/AF\n' \
    imputed.vcf.gz > info_scores.txt

# Minimac R2
bcftools query -f '%CHROM\t%POS\t%ID\t%REF\t%ALT\t%INFO/R2\t%INFO/MAF\n' \
    imputed.vcf.gz > info_scores.txt

# IMPUTE info
bcftools query -f '%CHROM\t%POS\t%ID\t%INFO\n' imputed.vcf.gz > info_scores.txt

Filter by INFO Score

# Standard threshold for GWAS
bcftools view -i 'INFO/DR2 > 0.3' imputed.vcf.gz -Oz -o imputed_r2_03.vcf.gz

# Strict threshold for fine-mapping
bcftools view -i 'INFO/DR2 > 0.8' imputed.vcf.gz -Oz -o imputed_r2_08.vcf.gz

# Combined filtering
bcftools view -i 'INFO/DR2 > 0.3 && INFO/AF > 0.01 && INFO/AF < 0.99' \
    imputed.vcf.gz -Oz -o imputed_filtered.vcf.gz

INFO Score Distribution

import pandas as pd
import matplotlib.pyplot as plt

# Load INFO scores
info = pd.read_csv('info_scores.txt', sep='\t',
    names=['CHR', 'POS', 'ID', 'REF', 'ALT', 'R2', 'AF'])

# Distribution plot
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Overall distribution
axes[0].hist(info['R2'], bins=50, edgecolor='black')
axes[0].axvline(0.3, color='red', linestyle='--', label='Threshold 0.3')
axes[0].set_xlabel('INFO Score (R2)')
axes[0].set_ylabel('Count')
axes[0].set_title('INFO Score Distribution')
axes[0].legend()

# R2 by MAF
info['MAF'] = info['AF'].apply(lambda x: min(x, 1-x))
info['MAF_bin'] = pd.cut(info['MAF'], bins=[0, 0.01, 0.05, 0.1, 0.5])
info.boxplot(column='R2', by='MAF_bin', ax=axes[1])
axes[1].set_xlabel('MAF Bin')
axes[1].set_ylabel('INFO Score')
axes[1].set_title('INFO by MAF')

# Scatter
axes[2].scatter(info['MAF'], info['R2'], alpha=0.1, s=1)
axes[2].set_xlabel('Minor Allele Frequency')
axes[2].set_ylabel('INFO Score')
axes[2].set_title('INFO vs MAF')

plt.tight_layout()
plt.savefig('imputation_qc.png', dpi=150)

Summarize Imputation Quality

# Count variants by quality
bcftools query -f '%INFO/DR2\n' imputed.vcf.gz | \
    awk '{
        if ($1 >= 0.8) high++;
        else if ($1 >= 0.3) med++;
        else low++
    } END {
        print "High quality (R2>=0.8):", high
        print "Medium quality (0.3<=R2<0.8):", med
        print "Low quality (R2<0.3):", low
    }'

# Variants passing filter
echo "Total variants: $(bcftools view -H imputed.vcf.gz | wc -l)"
echo "Passing R2>0.3: $(bcftools view -i 'INFO/DR2>0.3' imputed.vcf.gz -H | wc -l)"

Check Concordance with Typed Variants

# Extract typed variants from imputed file
bcftools view -i 'INFO/TYPED=1' imputed.vcf.gz -Oz -o typed.vcf.gz

# Compare imputed vs original genotypes
bcftools gtcheck -g original.vcf.gz typed.vcf.gz > concordance.txt

# Parse concordance
grep "^CN" concordance.txt

Python: Comprehensive QC Report

import pandas as pd
import numpy as np

def imputation_qc_report(info_file, output_prefix):
    '''Generate comprehensive imputation QC report.'''
    info = pd.read_csv(info_file, sep='\t',
        names=['CHR', 'POS', 'ID', 'REF', 'ALT', 'R2', 'AF'])

    # Calculate MAF
    info['MAF'] = info['AF'].apply(lambda x: min(x, 1-x))

    # Basic statistics
    stats = {
        'total_variants': len(info),
        'mean_r2': info['R2'].mean(),
        'median_r2': info['R2'].median(),
        'variants_r2_03': (info['R2'] >= 0.3).sum(),
        'variants_r2_08': (info['R2'] >= 0.8).sum(),
        'pct_r2_03': 100 * (info['R2'] >= 0.3).mean(),
        'pct_r2_08': 100 * (info['R2'] >= 0.8).mean(),
    }

    # By MAF bin
    maf_bins = [(0, 0.001), (0.001, 0.01), (0.01, 0.05), (0.05, 0.5)]
    for low, high in maf_bins:
        mask = (info['MAF'] >= low) & (info['MAF'] < high)
        stats[f'mean_r2_maf_{low}_{high}'] = info.loc[mask, 'R2'].mean()
        stats[f'n_variants_maf_{low}_{high}'] = mask.sum()

    # By chromosome
    chr_stats = info.groupby('CHR').agg({
        'R2': ['mean', 'count'],
        'MAF': 'mean'
    }).round(3)

    # Write reports
    with open(f'{output_prefix}_summary.txt', 'w') as f:
        for k, v in stats.items():
            f.write(f'{k}: {v}\n')

    chr_stats.to_csv(f'{output_prefix}_by_chrom.txt', sep='\t')

    return stats, chr_stats

stats, chr_stats = imputation_qc_report('info_scores.txt', 'imputation_qc')

Compare Multiple Imputation Runs

def compare_imputations(vcf1, vcf2, output):
    '''Compare INFO scores between two imputation runs.'''
    import subprocess

    # Extract INFO from both
    cmd1 = f"bcftools query -f '%CHROM:%POS\t%INFO/DR2\n' {vcf1}"
    cmd2 = f"bcftools query -f '%CHROM:%POS\t%INFO/DR2\n' {vcf2}"

    info1 = pd.read_csv(subprocess.Popen(cmd1, shell=True, stdout=subprocess.PIPE).stdout,
        sep='\t', names=['ID', 'R2_1'])
    info2 = pd.read_csv(subprocess.Popen(cmd2, shell=True, stdout=subprocess.PIPE).stdout,
        sep='\t', names=['ID', 'R2_2'])

    merged = info1.merge(info2, on='ID')
    merged['R2_diff'] = merged['R2_1'] - merged['R2_2']

    # Correlation
    corr = merged['R2_1'].corr(merged['R2_2'])
    print(f'Correlation between R2 scores: {corr:.4f}')

    return merged

Hardy-Weinberg Filter

# Calculate HWE p-values (PLINK2)
plink2 --vcf imputed.vcf.gz \
    --hardy \
    --out hwe_check

# Filter extreme HWE deviations
plink2 --vcf imputed.vcf.gz \
    --hwe 1e-6 \
    --make-pgen \
    --out imputed_hwe_filtered

Final QC Pipeline

#!/bin/bash
# Complete post-imputation QC

INPUT=$1
OUTPUT=$2

# 1. Filter by INFO score
bcftools view -i 'INFO/DR2 > 0.3' $INPUT -Oz -o ${OUTPUT}_r2.vcf.gz

# 2. Filter by MAF
bcftools view -i 'INFO/AF > 0.01 && INFO/AF < 0.99' \
    ${OUTPUT}_r2.vcf.gz -Oz -o ${OUTPUT}_maf.vcf.gz

# 3. Remove duplicates
bcftools norm -d all ${OUTPUT}_maf.vcf.gz -Oz -o ${OUTPUT}_nodup.vcf.gz

# 4. Index
bcftools index ${OUTPUT}_nodup.vcf.gz

# 5. Final stats
echo "Input variants: $(bcftools view -H $INPUT | wc -l)"
echo "After R2 filter: $(bcftools view -H ${OUTPUT}_r2.vcf.gz | wc -l)"
echo "After MAF filter: $(bcftools view -H ${OUTPUT}_maf.vcf.gz | wc -l)"
echo "Final variants: $(bcftools view -H ${OUTPUT}_nodup.vcf.gz | wc -l)"

Quality Thresholds by Application

ApplicationR2 ThresholdMAF ThresholdNotes
GWAS discovery0.30.01Standard
GWAS replication0.50.01More stringent
Fine-mapping0.80.001High accuracy needed
Polygenic scores0.90.01Very high accuracy
Meta-analysis0.5Study-specificMatch across studies

Related Skills

  • phasing-imputation/genotype-imputation - Generate imputed data
  • variant-calling/filtering-best-practices - VCF filtering operations
  • population-genetics/association-testing - GWAS with imputed data