install
source · Clone the upstream repo
git clone https://github.com/FreedomIntelligence/OpenClaw-Medical-Skills
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/FreedomIntelligence/OpenClaw-Medical-Skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/variant-interpretation-acmg/bioSkills/joint-calling" ~/.claude/skills/freedomintelligence-openclaw-medical-skills-joint-calling && rm -rf "$T"
OpenClaw · Install into ~/.openclaw/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/FreedomIntelligence/OpenClaw-Medical-Skills "$T" && mkdir -p ~/.openclaw/skills && cp -r "$T/skills/variant-interpretation-acmg/bioSkills/joint-calling" ~/.openclaw/skills/freedomintelligence-openclaw-medical-skills-joint-calling && rm -rf "$T"
manifest:
skills/variant-interpretation-acmg/bioSkills/joint-calling/SKILL.md
<!--
# COPYRIGHT NOTICE
# This file is part of the "Universal Biomedical Skills" project.
# Copyright (c) 2026 MD BABU MIA, PhD <md.babu.mia@mssm.edu>
# All Rights Reserved.
#
# This code is proprietary and confidential.
# Unauthorized copying of this file, via any medium is strictly prohibited.
#
# Provenance: Authenticated by MD BABU MIA
-->
name: bio-variant-calling-joint-calling
description: Joint genotype calling across multiple samples using GATK CombineGVCFs and GenotypeGVCFs. Essential for cohort studies, population genetics, and leveraging VQSR. Use when performing joint genotyping across multiple samples.
tool_type: cli
primary_tool: GATK
measurable_outcome: Execute skill workflow successfully with valid output within 15 minutes.
allowed-tools:
- read_file
- run_shell_command
Joint Calling
Call variants jointly across multiple samples for improved accuracy and consistent genotyping.
Why Joint Calling?
- Improved sensitivity - Leverage information across samples
- Consistent genotyping - Same sites called across all samples
- VQSR eligible - Requires cohort for machine learning filtering
- Population analysis - Allele frequencies across cohort
Workflow Overview
```
Sample BAMs
│
├── HaplotypeCaller (per-sample, -ERC GVCF)
│   └── sample1.g.vcf.gz, sample2.g.vcf.gz, ...
│
├── CombineGVCFs or GenomicsDBImport
│   └── Combine into cohort database
│
├── GenotypeGVCFs
│   └── Joint genotyping
│
└── VQSR or Hard Filtering
    └── Final VCF
```
Step 1: Per-Sample gVCF Generation
```bash
# Generate gVCF for each sample
gatk HaplotypeCaller \
    -R reference.fa \
    -I sample1.bam \
    -O sample1.g.vcf.gz \
    -ERC GVCF

# With intervals (faster)
gatk HaplotypeCaller \
    -R reference.fa \
    -I sample1.bam \
    -O sample1.g.vcf.gz \
    -ERC GVCF \
    -L intervals.bed
```
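Before combining, it can help to sanity-check each gVCF. A minimal sketch using GATK ValidateVariants in gVCF mode (the `-gvcf` flag, available in recent GATK4 releases) plus a sample-name check with bcftools; file names follow the example above:

```bash
# Check the gVCF is structurally valid and covers the reference (gVCF mode)
gatk ValidateVariants \
    -R reference.fa \
    -V sample1.g.vcf.gz \
    -gvcf

# Confirm the sample name embedded in the gVCF matches expectations
bcftools query -l sample1.g.vcf.gz
```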
Batch Processing
```bash
# Process all samples
for bam in *.bam; do
    sample=$(basename $bam .bam)
    gatk HaplotypeCaller \
        -R reference.fa \
        -I $bam \
        -O ${sample}.g.vcf.gz \
        -ERC GVCF &
done
wait
```
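The loop above launches one HaplotypeCaller job per BAM with no cap, which can exhaust memory on large cohorts. A sketch of a bounded alternative using `xargs -P`; the concurrency of 4 is an arbitrary example, so tune it to your cores and per-job Java heap:

```bash
# Run at most 4 HaplotypeCaller jobs at a time
ls *.bam | xargs -P 4 -I{} bash -c '
    sample=$(basename "{}" .bam)
    gatk HaplotypeCaller \
        -R reference.fa \
        -I "{}" \
        -O "${sample}.g.vcf.gz" \
        -ERC GVCF
'
```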
Step 2a: CombineGVCFs (Small Cohorts)
For <100 samples:
```bash
gatk CombineGVCFs \
    -R reference.fa \
    -V sample1.g.vcf.gz \
    -V sample2.g.vcf.gz \
    -V sample3.g.vcf.gz \
    -O cohort.g.vcf.gz
```
From Sample Map
```bash
# Create sample map file
# sample1    /path/to/sample1.g.vcf.gz
# sample2    /path/to/sample2.g.vcf.gz
ls *.g.vcf.gz | while read f; do
    echo -e "$(basename $f .g.vcf.gz)\t$f"
done > sample_map.txt

# Combine with -V for each
gatk CombineGVCFs \
    -R reference.fa \
    $(cat sample_map.txt | cut -f2 | sed 's/^/-V /') \
    -O cohort.g.vcf.gz
```
Step 2b: GenomicsDBImport (Large Cohorts)
For >100 samples, use GenomicsDB:
```bash
# Create sample map
ls *.g.vcf.gz | while read f; do
    echo -e "$(basename $f .g.vcf.gz)\t$f"
done > sample_map.txt

# Import to GenomicsDB (per chromosome for parallelism)
gatk GenomicsDBImport \
    --sample-name-map sample_map.txt \
    --genomicsdb-workspace-path genomicsdb_chr1 \
    -L chr1 \
    --reader-threads 4

# Or all chromosomes
for chr in {1..22} X Y; do
    gatk GenomicsDBImport \
        --sample-name-map sample_map.txt \
        --genomicsdb-workspace-path genomicsdb_chr${chr} \
        -L chr${chr} &
done
wait
```
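To verify an import worked, you can pull a small region back out of the workspace with SelectVariants, which accepts the same `gendb://` input as GenotypeGVCFs. A minimal smoke-test sketch; the interval is an arbitrary example:

```bash
# Extract a small slice from the workspace as a smoke test
gatk SelectVariants \
    -R reference.fa \
    -V gendb://genomicsdb_chr1 \
    -L chr1:1-1000000 \
    -O genomicsdb_chr1_check.vcf.gz

# List the samples that made it into the workspace
bcftools query -l genomicsdb_chr1_check.vcf.gz
```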
Update GenomicsDB with New Samples
```bash
# When updating an existing workspace, intervals are read from the
# workspace itself, so -L should not be passed here.
gatk GenomicsDBImport \
    --genomicsdb-update-workspace-path genomicsdb_chr1 \
    --sample-name-map new_samples.txt
```
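Updating modifies the workspace in place, so if disk space allows it is prudent to snapshot it first. A minimal sketch:

```bash
# Keep a copy of the workspace in case the update is interrupted
cp -r genomicsdb_chr1 genomicsdb_chr1.bak
```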
Step 3: GenotypeGVCFs
From Combined gVCF
```bash
gatk GenotypeGVCFs \
    -R reference.fa \
    -V cohort.g.vcf.gz \
    -O cohort.vcf.gz
```
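A quick check that joint genotyping picked up every sample, comparing the cohort VCF's sample list against the inputs; this assumes the sample_map.txt created in Step 2:

```bash
# Number of samples in the joint VCF should equal the number of input gVCFs
bcftools query -l cohort.vcf.gz | wc -l
wc -l < sample_map.txt
```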
From GenomicsDB
```bash
gatk GenotypeGVCFs \
    -R reference.fa \
    -V gendb://genomicsdb_chr1 \
    -O chr1.vcf.gz

# All chromosomes
for chr in {1..22} X Y; do
    gatk GenotypeGVCFs \
        -R reference.fa \
        -V gendb://genomicsdb_chr${chr} \
        -O chr${chr}.vcf.gz &
done
wait

# Merge chromosomes
bcftools concat chr{1..22}.vcf.gz chrX.vcf.gz chrY.vcf.gz \
    -Oz -o cohort.vcf.gz
```
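bcftools concat does not index its output, so index the merged VCF before downstream tools that need random access (VQSR, region queries). A short sketch:

```bash
# Tabix-index the merged cohort VCF
bcftools index -t cohort.vcf.gz
```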
With Allele-Specific Annotations
```bash
gatk GenotypeGVCFs \
    -R reference.fa \
    -V gendb://genomicsdb \
    -O cohort.vcf.gz \
    -G StandardAnnotation \
    -G AS_StandardAnnotation
```
Step 4: Filtering
VQSR (Recommended for >30 Samples)
```bash
# SNPs
gatk VariantRecalibrator \
    -R reference.fa \
    -V cohort.vcf.gz \
    --resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap.vcf.gz \
    --resource:omni,known=false,training=true,truth=false,prior=12.0 omni.vcf.gz \
    --resource:1000G,known=false,training=true,truth=false,prior=10.0 1000G.vcf.gz \
    --resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp.vcf.gz \
    -an QD -an MQ -an MQRankSum -an ReadPosRankSum -an FS -an SOR \
    -mode SNP \
    -O snps.recal \
    --tranches-file snps.tranches

gatk ApplyVQSR \
    -R reference.fa \
    -V cohort.vcf.gz \
    --recal-file snps.recal \
    --tranches-file snps.tranches \
    -mode SNP \
    --truth-sensitivity-filter-level 99.5 \
    -O cohort.snps.vcf.gz

# Indels
gatk VariantRecalibrator \
    -R reference.fa \
    -V cohort.snps.vcf.gz \
    --resource:mills,known=false,training=true,truth=true,prior=12.0 mills.vcf.gz \
    --resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp.vcf.gz \
    -an QD -an MQRankSum -an ReadPosRankSum -an FS -an SOR \
    -mode INDEL \
    -O indels.recal \
    --tranches-file indels.tranches

gatk ApplyVQSR \
    -R reference.fa \
    -V cohort.snps.vcf.gz \
    --recal-file indels.recal \
    --tranches-file indels.tranches \
    -mode INDEL \
    --truth-sensitivity-filter-level 99.0 \
    -O cohort.filtered.vcf.gz
```
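After ApplyVQSR, a quick look at how many records passed the chosen tranche helps catch an over-aggressive filter level. A sketch using bcftools on the output above:

```bash
# Records passing VQSR vs. total records
bcftools view -H -f PASS cohort.filtered.vcf.gz | wc -l
bcftools view -H cohort.filtered.vcf.gz | wc -l
```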
Hard Filtering (Small Cohorts)
```bash
# See filtering-best-practices skill
gatk VariantFiltration \
    -R reference.fa \
    -V cohort.vcf.gz \
    --filter-expression "QD < 2.0" --filter-name "QD2" \
    --filter-expression "FS > 60.0" --filter-name "FS60" \
    --filter-expression "MQ < 40.0" --filter-name "MQ40" \
    -O cohort.filtered.vcf.gz
```
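To see how each hard-filter expression contributed, a per-FILTER tally of the annotated VCF:

```bash
# Count records per FILTER value (PASS, QD2, FS60, MQ40, or combinations)
bcftools query -f '%FILTER\n' cohort.filtered.vcf.gz | sort | uniq -c
```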
Complete Pipeline Script
```bash
#!/bin/bash
set -euo pipefail

REFERENCE=$1
OUTPUT_DIR=$2
THREADS=16

# Note: GenomicsDBImport creates its workspace itself and may refuse a
# pre-existing directory, so only the gVCF and VCF directories are created here.
mkdir -p $OUTPUT_DIR/{gvcfs,vcfs}

echo "=== Step 1: Generate gVCFs ==="
for bam in data/*.bam; do
    sample=$(basename $bam .bam)
    gatk HaplotypeCaller \
        -R $REFERENCE \
        -I $bam \
        -O $OUTPUT_DIR/gvcfs/${sample}.g.vcf.gz \
        -ERC GVCF &
    # Limit parallelism
    while [ $(jobs -r | wc -l) -ge $THREADS ]; do sleep 1; done
done
wait

echo "=== Step 2: Create sample map ==="
ls $OUTPUT_DIR/gvcfs/*.g.vcf.gz | while read f; do
    echo -e "$(basename $f .g.vcf.gz)\t$(realpath $f)"
done > $OUTPUT_DIR/sample_map.txt

echo "=== Step 3: GenomicsDBImport ==="
gatk GenomicsDBImport \
    --sample-name-map $OUTPUT_DIR/sample_map.txt \
    --genomicsdb-workspace-path $OUTPUT_DIR/genomicsdb \
    -L intervals.bed \
    --reader-threads 4

echo "=== Step 4: Joint genotyping ==="
gatk GenotypeGVCFs \
    -R $REFERENCE \
    -V gendb://$OUTPUT_DIR/genomicsdb \
    -O $OUTPUT_DIR/vcfs/cohort.vcf.gz

echo "=== Step 5: Index ==="
bcftools index -t $OUTPUT_DIR/vcfs/cohort.vcf.gz

echo "=== Statistics ==="
bcftools stats $OUTPUT_DIR/vcfs/cohort.vcf.gz > $OUTPUT_DIR/vcfs/cohort_stats.txt

echo "=== Complete ==="
echo "Joint VCF: $OUTPUT_DIR/vcfs/cohort.vcf.gz"
```
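A usage sketch, assuming the script is saved as joint_calling.sh (hypothetical name), BAMs live under data/ in the working directory, and an intervals.bed file is present as the script expects:

```bash
bash joint_calling.sh reference.fa results/
```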
Tips
Memory for Large Cohorts
```bash
# Increase Java heap
gatk --java-options "-Xmx64g" GenotypeGVCFs ...

# Batch size for GenomicsDBImport
gatk GenomicsDBImport --batch-size 50 ...
```
Incremental Updates
```bash
# Add new samples to existing database
gatk GenomicsDBImport \
    --genomicsdb-update-workspace-path existing_db \
    --sample-name-map new_samples.txt
```
Related Skills
- variant-calling/gatk-variant-calling - Single-sample calling
- variant-calling/filtering-best-practices - VQSR and hard filtering
- population-genetics/plink-basics - Population analysis of joint calls
- workflows/fastq-to-variants - End-to-end germline pipeline