BioSkills bio-phasing-imputation-imputation-qc

Quality control of phasing and imputation results. Filter by INFO scores, assess accuracy, and prepare imputed data for downstream analysis. Use when filtering low-quality imputed variants or validating imputation accuracy before GWAS.

install

source · Clone the upstream repo

git clone https://github.com/GPTomics/bioSkills

Claude Code · Install into ~/.claude/skills/

T=$(mktemp -d) && git clone --depth=1 https://github.com/GPTomics/bioSkills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/phasing-imputation/imputation-qc" ~/.claude/skills/gptomics-bioskills-bio-phasing-imputation-imputation-qc && rm -rf "$T"

manifest: phasing-imputation/imputation-qc/SKILL.md

source content

Version Compatibility

Reference examples tested with: bcftools 1.19+, matplotlib 3.8+, numpy 1.26+, pandas 2.2+

Before using code patterns, verify installed versions match. If versions differ:

Python:
```
pip show <package>
```
then
```
help(module.function)
```
to check signatures
CLI:
```
<tool> --version
```
then
```
<tool> --help
```
to confirm flags

If code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.

Imputation QC

"Filter my imputed genotypes by quality" → Assess imputation accuracy using INFO/R2 scores, filter low-quality imputed variants, and validate against known genotypes before downstream GWAS.

CLI:

bcftools query -f '%CHROM %POS %INFO/R2\n'

to extract quality scores

Python:
```
pandas
```
for R2 distribution analysis and threshold selection

Extract INFO Scores

# Beagle DR2 (dosage R-squared)
bcftools query -f '%CHROM\t%POS\t%ID\t%REF\t%ALT\t%INFO/DR2\t%INFO/AF\n' \
    imputed.vcf.gz > info_scores.txt

# Minimac R2
bcftools query -f '%CHROM\t%POS\t%ID\t%REF\t%ALT\t%INFO/R2\t%INFO/MAF\n' \
    imputed.vcf.gz > info_scores.txt

# IMPUTE info
bcftools query -f '%CHROM\t%POS\t%ID\t%INFO\n' imputed.vcf.gz > info_scores.txt

Filter by INFO Score

# Standard threshold for GWAS
bcftools view -i 'INFO/DR2 > 0.3' imputed.vcf.gz -Oz -o imputed_r2_03.vcf.gz

# Strict threshold for fine-mapping
bcftools view -i 'INFO/DR2 > 0.8' imputed.vcf.gz -Oz -o imputed_r2_08.vcf.gz

# Combined filtering
bcftools view -i 'INFO/DR2 > 0.3 && INFO/AF > 0.01 && INFO/AF < 0.99' \
    imputed.vcf.gz -Oz -o imputed_filtered.vcf.gz

INFO Score Distribution

import pandas as pd
import matplotlib.pyplot as plt

# Load INFO scores
info = pd.read_csv('info_scores.txt', sep='\t',
    names=['CHR', 'POS', 'ID', 'REF', 'ALT', 'R2', 'AF'])

# Distribution plot
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Overall distribution
axes[0].hist(info['R2'], bins=50, edgecolor='black')
axes[0].axvline(0.3, color='red', linestyle='--', label='Threshold 0.3')
axes[0].set_xlabel('INFO Score (R2)')
axes[0].set_ylabel('Count')
axes[0].set_title('INFO Score Distribution')
axes[0].legend()

# R2 by MAF
info['MAF'] = info['AF'].apply(lambda x: min(x, 1-x))
info['MAF_bin'] = pd.cut(info['MAF'], bins=[0, 0.01, 0.05, 0.1, 0.5])
info.boxplot(column='R2', by='MAF_bin', ax=axes[1])
axes[1].set_xlabel('MAF Bin')
axes[1].set_ylabel('INFO Score')
axes[1].set_title('INFO by MAF')

# Scatter
axes[2].scatter(info['MAF'], info['R2'], alpha=0.1, s=1)
axes[2].set_xlabel('Minor Allele Frequency')
axes[2].set_ylabel('INFO Score')
axes[2].set_title('INFO vs MAF')

plt.tight_layout()
plt.savefig('imputation_qc.png', dpi=150)

Summarize Imputation Quality

# Count variants by quality
bcftools query -f '%INFO/DR2\n' imputed.vcf.gz | \
    awk '{
        if ($1 >= 0.8) high++;
        else if ($1 >= 0.3) med++;
        else low++
    } END {
        print "High quality (R2>=0.8):", high
        print "Medium quality (0.3<=R2<0.8):", med
        print "Low quality (R2<0.3):", low
    }'

# Variants passing filter
echo "Total variants: $(bcftools view -H imputed.vcf.gz | wc -l)"
echo "Passing R2>0.3: $(bcftools view -i 'INFO/DR2>0.3' imputed.vcf.gz -H | wc -l)"

Check Concordance with Typed Variants

# Extract typed variants from imputed file
bcftools view -i 'INFO/TYPED=1' imputed.vcf.gz -Oz -o typed.vcf.gz

# Compare imputed vs original genotypes
bcftools gtcheck -g original.vcf.gz typed.vcf.gz > concordance.txt

# Parse concordance
grep "^CN" concordance.txt

Python: Comprehensive QC Report

Goal: Generate a comprehensive imputation quality summary with overall statistics, MAF-stratified accuracy, and per-chromosome breakdowns.

Approach: Load INFO scores into a dataframe, compute aggregate R2 statistics, bin variants by minor allele frequency for quality stratification, and produce per-chromosome summaries.

import pandas as pd
import numpy as np

def imputation_qc_report(info_file, output_prefix):
    '''Generate comprehensive imputation QC report.'''
    info = pd.read_csv(info_file, sep='\t',
        names=['CHR', 'POS', 'ID', 'REF', 'ALT', 'R2', 'AF'])

    # Calculate MAF
    info['MAF'] = info['AF'].apply(lambda x: min(x, 1-x))

    # Basic statistics
    stats = {
        'total_variants': len(info),
        'mean_r2': info['R2'].mean(),
        'median_r2': info['R2'].median(),
        'variants_r2_03': (info['R2'] >= 0.3).sum(),
        'variants_r2_08': (info['R2'] >= 0.8).sum(),
        'pct_r2_03': 100 * (info['R2'] >= 0.3).mean(),
        'pct_r2_08': 100 * (info['R2'] >= 0.8).mean(),
    }

    # By MAF bin
    maf_bins = [(0, 0.001), (0.001, 0.01), (0.01, 0.05), (0.05, 0.5)]
    for low, high in maf_bins:
        mask = (info['MAF'] >= low) & (info['MAF'] < high)
        stats[f'mean_r2_maf_{low}_{high}'] = info.loc[mask, 'R2'].mean()
        stats[f'n_variants_maf_{low}_{high}'] = mask.sum()

    # By chromosome
    chr_stats = info.groupby('CHR').agg({
        'R2': ['mean', 'count'],
        'MAF': 'mean'
    }).round(3)

    # Write reports
    with open(f'{output_prefix}_summary.txt', 'w') as f:
        for k, v in stats.items():
            f.write(f'{k}: {v}\n')

    chr_stats.to_csv(f'{output_prefix}_by_chrom.txt', sep='\t')

    return stats, chr_stats

stats, chr_stats = imputation_qc_report('info_scores.txt', 'imputation_qc')

Compare Multiple Imputation Runs

def compare_imputations(vcf1, vcf2, output):
    '''Compare INFO scores between two imputation runs.'''
    import subprocess

    # Extract INFO from both
    cmd1 = f"bcftools query -f '%CHROM:%POS\t%INFO/DR2\n' {vcf1}"
    cmd2 = f"bcftools query -f '%CHROM:%POS\t%INFO/DR2\n' {vcf2}"

    info1 = pd.read_csv(subprocess.Popen(cmd1, shell=True, stdout=subprocess.PIPE).stdout,
        sep='\t', names=['ID', 'R2_1'])
    info2 = pd.read_csv(subprocess.Popen(cmd2, shell=True, stdout=subprocess.PIPE).stdout,
        sep='\t', names=['ID', 'R2_2'])

    merged = info1.merge(info2, on='ID')
    merged['R2_diff'] = merged['R2_1'] - merged['R2_2']

    # Correlation
    corr = merged['R2_1'].corr(merged['R2_2'])
    print(f'Correlation between R2 scores: {corr:.4f}')

    return merged

Hardy-Weinberg Filter

# Calculate HWE p-values (PLINK2)
plink2 --vcf imputed.vcf.gz \
    --hardy \
    --out hwe_check

# Filter extreme HWE deviations
plink2 --vcf imputed.vcf.gz \
    --hwe 1e-6 \
    --make-pgen \
    --out imputed_hwe_filtered

Final QC Pipeline

#!/bin/bash
# Complete post-imputation QC

INPUT=$1
OUTPUT=$2

# 1. Filter by INFO score
bcftools view -i 'INFO/DR2 > 0.3' $INPUT -Oz -o ${OUTPUT}_r2.vcf.gz

# 2. Filter by MAF
bcftools view -i 'INFO/AF > 0.01 && INFO/AF < 0.99' \
    ${OUTPUT}_r2.vcf.gz -Oz -o ${OUTPUT}_maf.vcf.gz

# 3. Remove duplicates
bcftools norm -d all ${OUTPUT}_maf.vcf.gz -Oz -o ${OUTPUT}_nodup.vcf.gz

# 4. Index
bcftools index ${OUTPUT}_nodup.vcf.gz

# 5. Final stats
echo "Input variants: $(bcftools view -H $INPUT | wc -l)"
echo "After R2 filter: $(bcftools view -H ${OUTPUT}_r2.vcf.gz | wc -l)"
echo "After MAF filter: $(bcftools view -H ${OUTPUT}_maf.vcf.gz | wc -l)"
echo "Final variants: $(bcftools view -H ${OUTPUT}_nodup.vcf.gz | wc -l)"

Quality Thresholds by Application

Application	R2 Threshold	MAF Threshold	Notes
GWAS discovery	0.3	0.01	Standard
GWAS replication	0.5	0.01	More stringent
Fine-mapping	0.8	0.001	High accuracy needed
Polygenic scores	0.9	0.01	Very high accuracy
Meta-analysis	0.5	Study-specific	Match across studies

Related Skills

phasing-imputation/genotype-imputation - Generate imputed data
variant-calling/filtering-best-practices - VCF filtering operations
population-genetics/association-testing - GWAS with imputed data