OpenClaw-Medical-Skills bio-vcf-manipulation
Merge, concatenate, sort, intersect, and subset VCF files using bcftools. Use when combining variant files, comparing call sets, or restructuring VCF data.
git clone https://github.com/FreedomIntelligence/OpenClaw-Medical-Skills
T=$(mktemp -d) && git clone --depth=1 https://github.com/FreedomIntelligence/OpenClaw-Medical-Skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/bio-vcf-manipulation" ~/.claude/skills/freedomintelligence-openclaw-medical-skills-bio-vcf-manipulation && rm -rf "$T"
T=$(mktemp -d) && git clone --depth=1 https://github.com/FreedomIntelligence/OpenClaw-Medical-Skills "$T" && mkdir -p ~/.openclaw/skills && cp -r "$T/skills/bio-vcf-manipulation" ~/.openclaw/skills/freedomintelligence-openclaw-medical-skills-bio-vcf-manipulation && rm -rf "$T"
skills/bio-vcf-manipulation/SKILL.md
Version Compatibility
Reference examples tested with: GATK 4.5+, bcftools 1.19+
Before using code patterns, verify installed versions match. If versions differ:
- Python: pip show <package>, then help(module.function) to check signatures
- CLI: <tool> --version, then <tool> --help to confirm flags
If code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.
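The version check can also be scripted. The sketch below is a minimal example, assuming the first line printed by bcftools --version has the form "bcftools 1.19" (possibly followed by a suffix). The helper names parse_bcftools_version and meets_minimum are hypothetical, not part of any library.

```python
import re


def parse_bcftools_version(version_output):
    """Return (major, minor) parsed from the text printed by `bcftools --version`."""
    lines = version_output.splitlines()
    first_line = lines[0] if lines else ""
    match = re.match(r"bcftools\s+(\d+)\.(\d+)", first_line)
    if match is None:
        raise ValueError(f"unrecognized version line: {first_line!r}")
    return int(match.group(1)), int(match.group(2))


def meets_minimum(version_output, minimum=(1, 19)):
    """True if the reported version is at least `minimum` (compared as integer tuples)."""
    return parse_bcftools_version(version_output) >= minimum


# Typical use (requires bcftools on PATH, so it is left commented out):
# import subprocess
# out = subprocess.run(["bcftools", "--version"], capture_output=True, text=True).stdout
# print(meets_minimum(out))
```

Comparing integer tuples rather than strings avoids treating 1.9 as newer than 1.19.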
VCF Manipulation
Merge, concat, sort, and compare VCF files using bcftools.
Operations Overview
| Operation | Command | Use Case |
|---|---|---|
| Merge | bcftools merge | Combine samples from multiple VCFs |
| Concat | bcftools concat | Combine regions from multiple VCFs |
| Sort | bcftools sort | Sort unsorted VCF |
| Intersect | bcftools isec | Compare/intersect call sets |
| Subset | bcftools view | Extract samples or regions |
bcftools merge
Goal: Combine VCF files from different samples into a single multi-sample VCF.
Approach: Use bcftools merge to join files with different sample columns at shared genomic positions.
"Merge my per-sample VCFs into one file" → Combine variant records from multiple samples into a single multi-sample VCF.
Combine multiple VCF files with different samples at the same positions.
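To make the merge semantics concrete, here is a toy Python model. It is a sketch only: each call set is assumed to be a dictionary keyed by (chrom, pos, ref, alt), and merge_call_sets is a hypothetical helper. The real bcftools merge also reconciles headers, multiallelic sites, and INFO/FORMAT fields, which this ignores.

```python
def merge_call_sets(call_sets, missing_to_ref=False):
    """Toy model of merging per-sample call sets into one multi-sample table.

    call_sets: dict mapping sample name -> {(chrom, pos, ref, alt): genotype}.
    Returns (samples, rows), where each row is (site, [genotype per sample]).
    """
    samples = list(call_sets)
    # A sample with no record at a site gets ./. or, with missing_to_ref, 0/0.
    filler = "0/0" if missing_to_ref else "./."
    # Union of all sites; plain tuple sorting is used here, so chromosome
    # names are ordered lexicographically (a simplification).
    sites = sorted({site for calls in call_sets.values() for site in calls})
    rows = [(site, [call_sets[s].get(site, filler) for s in samples]) for site in sites]
    return samples, rows
```

The output has one genotype column per input sample, which is the defining difference from concatenation.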
Basic Merge
bcftools merge sample1.vcf.gz sample2.vcf.gz -Oz -o merged.vcf.gz
Merge Multiple Files
bcftools merge *.vcf.gz -Oz -o all_samples.vcf.gz
Merge from File List
# files.txt: one VCF path per line
bcftools merge -l files.txt -Oz -o merged.vcf.gz
Handle Missing Genotypes
# Output missing genotypes as ./. (default)
bcftools merge sample1.vcf.gz sample2.vcf.gz -Oz -o merged.vcf.gz

# Output missing as reference (0/0)
bcftools merge --missing-to-ref sample1.vcf.gz sample2.vcf.gz -Oz -o merged.vcf.gz
Force Sample Names
When sample names conflict:
bcftools merge --force-samples sample1.vcf.gz sample2.vcf.gz -Oz -o merged.vcf.gz
Merge Specific Regions
bcftools merge -r chr1:1000000-2000000 sample1.vcf.gz sample2.vcf.gz -Oz -o merged.vcf.gz
bcftools concat
Goal: Concatenate VCF files that cover different genomic regions for the same samples.
Approach: Use bcftools concat to join region-split files (e.g., per-chromosome VCFs) in order.
Combine VCF files with same samples from different regions.
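By contrast with merge, concatenation keeps the sample columns fixed and appends records. The toy sketch below assumes each file is modelled as a (samples, records) pair, and concat_chunks is a hypothetical helper. It only illustrates the same-samples requirement and the order-preserving append, not overlap handling (-a) or duplicate removal (-d).

```python
def concat_chunks(chunks):
    """Toy model of concatenating region-split files for the same samples.

    chunks: list of (samples, records) pairs, already in the desired output order.
    Returns (samples, combined_records).
    """
    if not chunks:
        return [], []
    samples = chunks[0][0]
    combined = []
    for chunk_samples, records in chunks:
        # Every file must carry the same sample columns in the same order.
        if chunk_samples != samples:
            raise ValueError("sample columns differ; use merge to combine samples")
        combined.extend(records)
    return samples, combined
```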
Concatenate Chromosomes
bcftools concat chr1.vcf.gz chr2.vcf.gz chr3.vcf.gz -Oz -o genome.vcf.gz
Concatenate All Chromosomes
bcftools concat chr*.vcf.gz -Oz -o genome.vcf.gz
From File List
# files.txt: one VCF path per line (in order)
bcftools concat -f files.txt -Oz -o concatenated.vcf.gz
Allow Overlapping Regions
bcftools concat -a chr1_part1.vcf.gz chr1_part2.vcf.gz -Oz -o chr1.vcf.gz
Remove Duplicates
bcftools concat -a -d all file1.vcf.gz file2.vcf.gz -Oz -o merged.vcf.gz
Options for -d:
- snps - Remove duplicate SNPs
- indels - Remove duplicate indels
- both - Remove duplicate SNPs and indels
- all - Remove all duplicates
- exact - Remove exact duplicates only
bcftools sort
Goal: Sort a VCF file by chromosome and position.
Approach: Use bcftools sort with optional temp directory and memory limits for large files.
Sort VCF by chromosome and position.
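The ordering can be sketched in Python. This example assumes that chromosomes follow the order of the contig lines in the header, with position as the secondary key. sort_records is a hypothetical helper operating on (chrom, pos) tuples rather than real VCF records.

```python
def sort_records(records, contig_order):
    """Sort (chrom, pos, ...) tuples by header contig order, then by position.

    contig_order: contig names in the order they appear in the VCF header
    (assumed here to define the chromosome ordering).
    """
    rank = {name: index for index, name in enumerate(contig_order)}
    return sorted(records, key=lambda record: (rank[record[0]], record[1]))
```

Using an explicit contig ranking avoids the lexicographic trap where chr10 sorts before chr2.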
Basic Sort
bcftools sort input.vcf -Oz -o sorted.vcf.gz
With Temporary Directory
For large files:
bcftools sort -T /tmp input.vcf.gz -Oz -o sorted.vcf.gz
Memory Limit
bcftools sort -m 4G input.vcf.gz -Oz -o sorted.vcf.gz
bcftools isec
Goal: Identify shared and private variants between two or more VCF files.
Approach: Use bcftools isec to partition variants into private-to-each-file and shared subsets.
"Find variants called by both GATK and bcftools" → Intersect two call sets to identify concordant and discordant variants.
Intersect and compare VCF files.
Find Shared Variants
bcftools isec -p output_dir sample1.vcf.gz sample2.vcf.gz
Creates:
- 0000.vcf - Private to sample1
- 0001.vcf - Private to sample2
- 0002.vcf - Shared (sample1 records)
- 0003.vcf - Shared (sample2 records)
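The partition into private and shared sets can be sketched with Python sets. This is a toy model that assumes a variant is identified by (chrom, pos, ref, alt). partition_sites is a hypothetical helper, and unlike the real output it returns the shared sites once instead of writing one copy per input file.

```python
def partition_sites(first, second):
    """Split two collections of (chrom, pos, ref, alt) sites into private and shared sets."""
    first, second = set(first), set(second)
    return {
        "private_first": first - second,   # analogous to 0000.vcf
        "private_second": second - first,  # analogous to 0001.vcf
        "shared": first & second,          # analogous to 0002.vcf / 0003.vcf
    }
```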
Output Compressed
bcftools isec -p output_dir -Oz sample1.vcf.gz sample2.vcf.gz
Intersection Only
bcftools isec -p output_dir -n=2 sample1.vcf.gz sample2.vcf.gz # Only outputs variants present in exactly 2 files
Comparison Options
| Flag | Description |
|---|---|
| -n=2 | Present in exactly 2 files |
| -n+2 | Present in 2 or more files |
| -n-1 | Present in fewer than 2 files (1 or fewer) |
| -n~11 | Boolean: file1 AND file2 |
| -n~10 | Boolean: file1 AND NOT file2 |
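The selection rules for -n can be modelled in a few lines of Python. The sketch follows the bcftools manual's description of the option (this many, this many or more, this many or fewer, or an exact presence bitmap). matches_nfiles and the presence-string representation are assumptions made for illustration.

```python
def matches_nfiles(presence, spec):
    """Decide whether a site passes an isec-style -n filter.

    presence: string with one character per input file, "1" if the site
              is present in that file and "0" otherwise (e.g. "10").
    spec:     "=N", "+N", "-N" or "~BITMAP" (the part that follows -n).
    """
    op, value = spec[0], spec[1:]
    if op == "~":
        return presence == value      # exact presence pattern
    count = presence.count("1")
    threshold = int(value)
    if op == "=":
        return count == threshold     # exactly this many files
    if op == "+":
        return count >= threshold     # this many or more
    if op == "-":
        return count <= threshold     # this many or fewer
    raise ValueError(f"unsupported -n specification: {spec!r}")
```

For two inputs, "~10" selects sites found in the first file only, which is the same set that "-1" would select after restricting output to the first file.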
Two-File Intersection
# Variants in both files
bcftools isec -n=2 -w1 sample1.vcf.gz sample2.vcf.gz -Oz -o shared.vcf.gz

# Variants only in sample1
bcftools isec -n~10 -w1 sample1.vcf.gz sample2.vcf.gz -Oz -o only_sample1.vcf.gz
Complement Mode
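# Variants in file1 not in file2
# -w1 writes the matching records from the first file as VCF;
# without -w or -p, isec prints a plain list of sites instead
bcftools isec -C -w1 sample1.vcf.gz sample2.vcf.gz -Oz -o unique.vcf.gz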
Subsetting VCF Files
Goal: Extract a subset of samples or regions from a multi-sample VCF.
Approach: Use bcftools view with -s (samples) or -r/-R (regions) flags to create targeted subsets.
Extract Samples
bcftools view -s sample1,sample2 input.vcf.gz -Oz -o subset.vcf.gz
Exclude Samples
bcftools view -s ^sample3 input.vcf.gz -Oz -o without_sample3.vcf.gz
From Sample List File
# samples.txt: one sample name per line
bcftools view -S samples.txt input.vcf.gz -Oz -o subset.vcf.gz
Extract Region
bcftools view -r chr1:1000000-2000000 input.vcf.gz -Oz -o region.vcf.gz
Extract Multiple Regions
bcftools view -R regions.bed input.vcf.gz -Oz -o targets.vcf.gz
Renaming Samples
Goal: Rename sample columns in a VCF header.
Approach: Use bcftools reheader with a mapping file of old-to-new sample names.
Single Sample
echo "old_name new_name" > rename.txt
bcftools reheader -s rename.txt input.vcf.gz -o renamed.vcf.gz
Multiple Samples
# rename.txt format: old_name new_name
cat > rename.txt << EOF
sample1 patient_001
sample2 patient_002
sample3 patient_003
EOF
bcftools reheader -s rename.txt input.vcf.gz -o renamed.vcf.gz
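When the mapping comes from a script rather than a hand-written file, the rename file and the command line can be generated. This is a minimal Python sketch: format_rename_map and reheader_command are hypothetical helpers, and only the -s and -o flags already shown in this section are used. The sketch rejects names containing whitespace instead of attempting to escape them.

```python
def format_rename_map(mapping):
    """Render {old_name: new_name} as the text of a rename file, one pair per line."""
    lines = []
    for old, new in mapping.items():
        # Whitespace separates the two columns, so names containing it are
        # rejected here rather than escaped (a simplification).
        if any(ch.isspace() for ch in old + new) or not old or not new:
            raise ValueError(f"unsupported sample name pair: {old!r} -> {new!r}")
        lines.append(f"{old} {new}")
    return "\n".join(lines) + "\n"


def reheader_command(map_path, input_vcf, output_vcf):
    """Build the argument list for the reheader call (not executed here)."""
    return ["bcftools", "reheader", "-s", map_path, "-o", output_vcf, input_vcf]
```

The returned list can be passed to subprocess.run once the rename text has been written to map_path.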
Splitting VCF Files
Goal: Split a multi-sample or multi-chromosome VCF into separate files.
Approach: Iterate over samples or chromosomes and extract each with bcftools view.
Split by Sample
for sample in $(bcftools query -l input.vcf.gz); do
    bcftools view -s "$sample" input.vcf.gz -Oz -o "${sample}.vcf.gz"
done
Split by Chromosome
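# [^,>]* also handles contig lines without a length field, e.g. ##contig=<ID=chrM>
# -r requires an indexed input (bcftools index input.vcf.gz)
for chr in $(bcftools view -h input.vcf.gz | grep "^##contig" | sed 's/.*ID=\([^,>]*\).*/\1/'); do
    bcftools view -r "$chr" input.vcf.gz -Oz -o "${chr}.vcf.gz"
done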
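The contig extraction step can also be done in Python, which makes the parsing rule explicit. The sketch assumes the header text has already been read into a string (for example from bcftools view -h). contig_ids is a hypothetical helper.

```python
import re


def contig_ids(header_text):
    """Return contig IDs from ##contig header lines, in header order."""
    ids = []
    for line in header_text.splitlines():
        if not line.startswith("##contig"):
            continue
        # Stop at "," or ">" so lines with and without extra fields both work.
        match = re.search(r"ID=([^,>]+)", line)
        if match:
            ids.append(match.group(1))
    return ids
```

Each returned ID can then be passed to bcftools view -r as in the loop above.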
Split Multiallelic Sites
bcftools norm -m-any input.vcf.gz -Oz -o split.vcf.gz
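At the site level, splitting a multiallelic record means emitting one record per ALT allele. The Python sketch below shows only that step. split_multiallelic_site is a hypothetical helper, and it deliberately ignores what bcftools norm additionally does to genotypes and per-allele INFO/FORMAT values.

```python
def split_multiallelic_site(chrom, pos, ref, alt_field):
    """Return one (chrom, pos, ref, alt) tuple per allele in a comma-separated ALT field."""
    return [(chrom, pos, ref, alt) for alt in alt_field.split(",")]
```

For example, a site with ALT "G,T" becomes two biallelic sites at the same position, one for G and one for T.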
Common Workflows
Goal: Execute typical multi-step VCF manipulation tasks.
Approach: Chain merge, concat, isec, and view operations for cohort assembly, caller comparison, and filtering.
Merge Cohort VCFs
# Create file list
ls *.vcf.gz > files.txt

# Merge all samples
bcftools merge -l files.txt -Oz -o cohort.vcf.gz
bcftools index cohort.vcf.gz
Combine Chromosome VCFs
# After parallel variant calling by chromosome
bcftools concat chr{1..22}.vcf.gz chrX.vcf.gz chrY.vcf.gz -Oz -o genome.vcf.gz
bcftools index genome.vcf.gz
Compare Two Callers
# Find variants called by both GATK and bcftools
bcftools isec -p comparison gatk.vcf.gz bcftools.vcf.gz

# Count records per output file (header lines excluded)
grep -vc '^#' comparison/*.vcf
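Header lines start with # and are not variant records, so a per-file count needs to skip them. The same count can be sketched in Python, assuming uncompressed VCF text has been read into a string. count_records is a hypothetical helper.

```python
def count_records(vcf_text):
    """Count variant records in uncompressed VCF text, skipping headers and blank lines."""
    return sum(
        1
        for line in vcf_text.splitlines()
        if line.strip() and not line.startswith("#")
    )


# Typical use (the path is an example taken from the isec output directory):
# with open("comparison/0002.vcf") as handle:
#     print(count_records(handle.read()))
```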
Extract Passing Variants
bcftools view -f PASS input.vcf.gz -Oz -o pass_only.vcf.gz
bcftools index pass_only.vcf.gz
cyvcf2 Python Operations
Goal: Perform VCF set operations programmatically in Python.
Approach: Use cyvcf2 for position-based comparisons and record concatenation; use bcftools merge for true multi-sample merging.
Note: True VCF merging (combining samples at matching positions) is complex. Use bcftools merge for production work. cyvcf2 is better for filtering/querying.
Concatenate Records (Not True Merge)
from cyvcf2 import VCF, Writer

# WARNING: This concatenates records, not a true merge
# For actual merging of samples, use bcftools merge
vcf1 = VCF('file1.vcf.gz')
writer = Writer('combined.vcf', vcf1)

for variant in vcf1:
    writer.write_record(variant)

writer.close()
vcf1.close()
Find Shared Positions
from cyvcf2 import VCF

# Load positions from first VCF
vcf1_positions = set()
for variant in VCF('sample1.vcf.gz'):
    vcf1_positions.add((variant.CHROM, variant.POS))

# Check second VCF
shared = 0
unique = 0
for variant in VCF('sample2.vcf.gz'):
    if (variant.CHROM, variant.POS) in vcf1_positions:
        shared += 1
    else:
        unique += 1

print(f'Shared: {shared}')
print(f'Unique to sample2: {unique}')
Quick Reference
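| Task | Command |
|---|---|
| Merge samples | bcftools merge sample1.vcf.gz sample2.vcf.gz -Oz -o merged.vcf.gz |
| Concat regions | bcftools concat chr1.vcf.gz chr2.vcf.gz -Oz -o genome.vcf.gz |
| Sort VCF | bcftools sort input.vcf.gz -Oz -o sorted.vcf.gz |
| Intersect | bcftools isec -p output_dir sample1.vcf.gz sample2.vcf.gz |
| Extract samples | bcftools view -s sample1,sample2 input.vcf.gz -Oz -o subset.vcf.gz |
| Rename samples | bcftools reheader -s rename.txt input.vcf.gz -o renamed.vcf.gz |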
Common Errors
| Error | Cause | Solution |
|---|---|---|
| | merge vs concat confusion | Use merge for samples, concat for regions |
| | Unsorted input to concat | Sort first or use the -a flag |
| | Duplicate sample names | Use --force-samples |
| | Missing index for merge/isec | Run bcftools index first |
Related Skills
- vcf-basics - View and query VCF files
- filtering-best-practices - Filter variants before manipulation
- variant-normalization - Normalize before comparing
- vcf-statistics - Compare statistics after manipulation