OpenClaw-Medical-Skills bio-variant-annotation
Comprehensive variant annotation using bcftools annotate/csq, VEP, SnpEff, and ANNOVAR. Add database annotations, predict functional consequences, and assess clinical significance. Use when annotating variants with functional and clinical information.
git clone https://github.com/FreedomIntelligence/OpenClaw-Medical-Skills
T=$(mktemp -d) && git clone --depth=1 https://github.com/FreedomIntelligence/OpenClaw-Medical-Skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/bio-variant-annotation" ~/.claude/skills/freedomintelligence-openclaw-medical-skills-bio-variant-annotation && rm -rf "$T"
T=$(mktemp -d) && git clone --depth=1 https://github.com/FreedomIntelligence/OpenClaw-Medical-Skills "$T" && mkdir -p ~/.openclaw/skills && cp -r "$T/skills/bio-variant-annotation" ~/.openclaw/skills/freedomintelligence-openclaw-medical-skills-bio-variant-annotation && rm -rf "$T"
skills/bio-variant-annotation/SKILL.mdVersion Compatibility
Reference examples tested with: bcftools 1.19+
Before using code patterns, verify installed versions match. If versions differ:
- Python:
thenpip show <package>
to check signatureshelp(module.function) - CLI:
then<tool> --version
to confirm flags<tool> --help
If code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.
Variant Annotation
Tool Comparison
| Tool | Best For | Speed | Output |
|---|---|---|---|
| bcftools csq | Simple consequence prediction | Fast | VCF |
| VEP | Comprehensive with plugins | Moderate | VCF/TXT |
| SnpEff | Fast batch annotation | Fast | VCF |
| ANNOVAR | Flexible databases | Moderate | TXT |
bcftools annotate
Goal: Add or remove INFO/ID annotations from external databases using bcftools.
Approach: Match variants by position and allele against annotation VCF/BED/TAB files, copying specified columns.
"Add rsIDs to my VCF from dbSNP" → Match variant positions against a database and copy identifiers or annotation fields into the VCF.
Add Annotations from Database
bcftools annotate -a dbsnp.vcf.gz -c ID input.vcf.gz -Oz -o annotated.vcf.gz
Annotation Columns (-c
)
-c| Option | Description |
|---|---|
| Copy ID column |
| Copy all INFO fields |
| Copy specific INFO field |
| Add to existing values |
Add rsIDs from dbSNP
bcftools annotate -a dbsnp.vcf.gz -c ID input.vcf.gz -Oz -o with_rsids.vcf.gz
Add Multiple Annotations
bcftools annotate -a database.vcf.gz -c ID,INFO/AF,INFO/CAF input.vcf.gz -Oz -o annotated.vcf.gz
Add from BED/TAB Files
# BED with 4th column as annotation bcftools annotate -a regions.bed.gz -c CHROM,FROM,TO,INFO/REGION \ -h <(echo '##INFO=<ID=REGION,Number=1,Type=String,Description="Region name">') \ input.vcf.gz -Oz -o annotated.vcf.gz # Tab file: CHROM POS VALUE bcftools annotate -a annotations.tab.gz -c CHROM,POS,INFO/SCORE \ -h <(echo '##INFO=<ID=SCORE,Number=1,Type=Float,Description="Custom score">') \ input.vcf.gz -Oz -o annotated.vcf.gz
Remove Annotations
bcftools annotate -x INFO/DP,INFO/MQ input.vcf.gz -Oz -o clean.vcf.gz bcftools annotate -x INFO input.vcf.gz -Oz -o minimal.vcf.gz # Remove all INFO
Set ID from Fields
bcftools annotate --set-id '%CHROM\_%POS\_%REF\_%ALT' input.vcf.gz -Oz -o with_ids.vcf.gz
bcftools csq
Goal: Predict functional consequences of variants using gene annotations.
Approach: Map variants to GFF3 gene models and classify as synonymous, missense, frameshift, etc.
Simple consequence prediction using GFF annotation.
bcftools csq -f reference.fa -g genes.gff3.gz input.vcf.gz -Oz -o consequences.vcf.gz
Consequence Types
| Consequence | Description |
|---|---|
| No amino acid change |
| Amino acid change |
| Introduces stop codon |
| Changes reading frame |
| Affects splicing |
Ensembl VEP
Goal: Annotate variants comprehensively with consequence, impact, pathogenicity scores, and population frequencies.
Approach: Run VEP with offline cache, enabling SIFT, PolyPhen, HGVS, frequency, and plugin-based predictions.
"Annotate my variants with functional consequences" → Predict coding effects, impact severity, and pathogenicity using Ensembl's Variant Effect Predictor.
Installation
conda install -c bioconda ensembl-vep vep_install -a cf -s homo_sapiens -y GRCh38 --CONVERT
Basic Annotation
vep -i input.vcf -o output.vcf --vcf --cache --offline
Comprehensive Annotation
vep -i input.vcf -o output.vcf \ --vcf \ --cache --offline \ --species homo_sapiens \ --assembly GRCh38 \ --everything \ --fork 4
--everything Enables
- SIFT predictions--sift b
- PolyPhen predictions--polyphen b
- HGVS nomenclature--hgvs
- Gene symbols--symbol
- Canonical transcript--canonical
- 1000 Genomes frequencies--af
- gnomAD frequencies--af_gnomade/g
- PubMed IDs--pubmed
Filter by Impact
vep -i input.vcf -o output.vcf --vcf \ --cache --offline \ --pick \ --filter "IMPACT in HIGH,MODERATE"
Plugins
# CADD scores vep -i input.vcf -o output.vcf --vcf \ --cache --offline \ --plugin CADD,whole_genome_SNVs.tsv.gz # dbNSFP (multiple predictors) vep -i input.vcf -o output.vcf --vcf \ --cache --offline \ --plugin dbNSFP,dbNSFP4.3a.gz,ALL # Multiple plugins vep -i input.vcf -o output.vcf --vcf \ --cache --offline \ --plugin CADD,cadd.tsv.gz \ --plugin dbNSFP,dbnsfp.gz,SIFT_score,Polyphen2_HDIV_score \ --plugin SpliceAI,spliceai.vcf.gz
VEP Output Fields
| Field | Description |
|---|---|
| Consequence | SO term (e.g., missense_variant) |
| IMPACT | HIGH, MODERATE, LOW, MODIFIER |
| SYMBOL | Gene symbol |
| HGVSc/HGVSp | HGVS coding/protein change |
| SIFT/PolyPhen | Pathogenicity predictions |
SnpEff
Goal: Annotate variants with gene effects and impact categories using SnpEff.
Approach: Run SnpEff ann against a genome database, then use SnpSift for database cross-referencing and filtering.
Installation
conda install -c bioconda snpeff snpEff download GRCh38.105
Basic Annotation
snpEff ann GRCh38.105 input.vcf > output.vcf
With Statistics
snpEff ann -v -stats stats.html -csvStats stats.csv GRCh38.105 input.vcf > output.vcf
Filter by Impact
snpEff ann GRCh38.105 input.vcf | \ SnpSift filter "(ANN[*].IMPACT = 'HIGH')" > high_impact.vcf
SnpEff Impact Categories
| Impact | Examples |
|---|---|
| HIGH | Stop gained, frameshift, splice donor/acceptor |
| MODERATE | Missense, inframe indel |
| LOW | Synonymous, splice region |
| MODIFIER | Intron, intergenic, UTR |
SnpSift Database Annotations
# dbSNP SnpSift annotate dbsnp.vcf.gz input.vcf > annotated.vcf # ClinVar SnpSift annotate clinvar.vcf.gz input.vcf > annotated.vcf # dbNSFP SnpSift dbnsfp -db dbNSFP4.3a.txt.gz input.vcf > annotated.vcf # Chain multiple snpEff ann GRCh38.105 input.vcf | \ SnpSift annotate dbsnp.vcf.gz | \ SnpSift annotate clinvar.vcf.gz > fully_annotated.vcf
SnpSift Filtering
SnpSift filter "(QUAL >= 30) & (DP >= 10)" input.vcf > filtered.vcf SnpSift filter "(exists CLNSIG) & (CLNSIG has 'Pathogenic')" input.vcf > pathogenic.vcf
ANNOVAR
Goal: Annotate variants with gene, frequency, and pathogenicity databases using ANNOVAR.
Approach: Run table_annovar.pl with multiple protocols (gene, filter, region) against downloaded annotation databases.
Installation
# Download from https://annovar.openbioinformatics.org/ (registration required) annotate_variation.pl -buildver hg38 -downdb -webfrom annovar refGene humandb/ annotate_variation.pl -buildver hg38 -downdb -webfrom annovar gnomad30_genome humandb/
Table Annotation
table_annovar.pl input.vcf humandb/ \ -buildver hg38 \ -out annotated \ -remove \ -protocol refGene,gnomad30_genome,clinvar_20230416,dbnsfp42a \ -operation g,f,f,f \ -nastring . \ -vcfinput
Python: Parse Annotated VCF
Goal: Extract and interpret annotation fields from VEP CSQ or SnpEff ANN strings in Python.
Approach: Parse pipe-delimited annotation strings against the header-defined field order, then filter by impact or consequence.
Parse VEP CSQ
from cyvcf2 import VCF def parse_vep_csq(csq_string, csq_header): fields = csq_header.split('|') values = csq_string.split('|') return dict(zip(fields, values)) vcf = VCF('vep_output.vcf') csq_header = None for h in vcf.header_iter(): if h['HeaderType'] == 'INFO' and h['ID'] == 'CSQ': csq_header = h['Description'].split('Format: ')[1].rstrip('"') break for variant in vcf: csq = variant.INFO.get('CSQ') if csq: for transcript in csq.split(','): parsed = parse_vep_csq(transcript, csq_header) if parsed.get('IMPACT') in ('HIGH', 'MODERATE'): print(f"{variant.CHROM}:{variant.POS} {parsed['SYMBOL']} {parsed['Consequence']}")
Parse SnpEff ANN
from cyvcf2 import VCF def parse_snpeff_ann(ann_string): fields = ['Allele', 'Annotation', 'Impact', 'Gene_Name', 'Gene_ID', 'Feature_Type', 'Feature_ID', 'Transcript_BioType', 'Rank', 'HGVS_c', 'HGVS_p', 'cDNA_pos', 'CDS_pos', 'Protein_pos', 'Distance'] values = ann_string.split('|') return dict(zip(fields, values[:len(fields)])) for variant in VCF('snpeff_output.vcf'): ann = variant.INFO.get('ANN') if ann: for transcript in ann.split(','): parsed = parse_snpeff_ann(transcript) if parsed['Impact'] == 'HIGH': print(f"{variant.CHROM}:{variant.POS} {parsed['Gene_Name']} {parsed['Annotation']}")
Complete Annotation Pipeline
Goal: Run a full annotation workflow from normalization through VEP annotation to impact filtering.
Approach: Normalize variants, annotate with VEP (--everything --pick), then filter for HIGH/MODERATE impact.
#!/bin/bash set -euo pipefail INPUT=$1 REFERENCE=$2 VEP_CACHE=$3 OUTPUT_PREFIX=$4 # Normalize variants bcftools norm -f $REFERENCE -m-any $INPUT -Oz -o ${OUTPUT_PREFIX}_norm.vcf.gz bcftools index ${OUTPUT_PREFIX}_norm.vcf.gz # VEP annotation vep -i ${OUTPUT_PREFIX}_norm.vcf.gz \ -o ${OUTPUT_PREFIX}_vep.vcf \ --vcf --cache --offline --dir_cache $VEP_CACHE \ --assembly GRCh38 --everything --pick --fork 4 bgzip ${OUTPUT_PREFIX}_vep.vcf bcftools index ${OUTPUT_PREFIX}_vep.vcf.gz # Filter high/moderate impact bcftools view -i 'INFO/CSQ~"HIGH" || INFO/CSQ~"MODERATE"' \ ${OUTPUT_PREFIX}_vep.vcf.gz -Oz -o ${OUTPUT_PREFIX}_filtered.vcf.gz
Pathogenicity Predictors
| Predictor | Deleterious | Benign |
|---|---|---|
| SIFT | < 0.05 | >= 0.05 |
| PolyPhen-2 (HDIV) | > 0.957 (probably), > 0.453 (possibly) | <= 0.453 |
| CADD | > 20 (top 1%), > 30 (top 0.1%) | < 10 |
| REVEL | > 0.5 | < 0.5 |
Clinical Significance (ClinVar)
| Code | Meaning |
|---|---|
| Pathogenic | Disease-causing |
| Likely_pathogenic | Probably disease-causing |
| Uncertain_significance | VUS |
| Likely_benign | Probably not disease-causing |
| Benign | Not disease-causing |
Quick Reference
| Task | Command |
|---|---|
| Add rsIDs | |
| VEP annotation | |
| SnpEff annotation | |
| Consequences only | |
Related Skills
- variant-calling/variant-normalization - Normalize before annotating
- variant-calling/filtering-best-practices - Filter by annotations
- variant-calling/vcf-basics - Query annotated fields
- database-access/entrez-fetch - Download annotation databases