OpenClaw-Medical-Skills variant-annotation

<!--

install
source · Clone the upstream repo
git clone https://github.com/FreedomIntelligence/OpenClaw-Medical-Skills
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/FreedomIntelligence/OpenClaw-Medical-Skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/variant-interpretation-acmg/bioSkills/variant-annotation" ~/.claude/skills/freedomintelligence-openclaw-medical-skills-variant-annotation && rm -rf "$T"
OpenClaw · Install into ~/.openclaw/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/FreedomIntelligence/OpenClaw-Medical-Skills "$T" && mkdir -p ~/.openclaw/skills && cp -r "$T/skills/variant-interpretation-acmg/bioSkills/variant-annotation" ~/.openclaw/skills/freedomintelligence-openclaw-medical-skills-variant-annotation && rm -rf "$T"
manifest: skills/variant-interpretation-acmg/bioSkills/variant-annotation/SKILL.md
source content
<!-- # COPYRIGHT NOTICE # This file is part of the "Universal Biomedical Skills" project. # Copyright (c) 2026 MD BABU MIA, PhD <md.babu.mia@mssm.edu> # All Rights Reserved. # # This code is proprietary and confidential. # Unauthorized copying of this file, via any medium is strictly prohibited. # # Provenance: Authenticated by MD BABU MIA -->

name: bio-variant-annotation
description: Comprehensive variant annotation using bcftools annotate/csq, VEP, SnpEff, and ANNOVAR. Add database annotations, predict functional consequences, and assess clinical significance. Use when annotating variants with functional and clinical information.
tool_type: mixed
primary_tool: VEP
measurable_outcome: Execute skill workflow successfully with valid output within 15 minutes.
allowed-tools:

  • read_file
  • run_shell_command

Variant Annotation

Tool Comparison

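| Tool | Best For | Speed | Output |
|------|----------|-------|--------|
| bcftools csq | Simple consequence prediction | Fast | VCF |
| VEP | Comprehensive with plugins | Moderate | VCF/TXT |
| SnpEff | Fast batch annotation | Fast | VCF |
| ANNOVAR | Flexible databases | Moderate | TXT |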

bcftools annotate

Add Annotations from Database

bcftools annotate -a dbsnp.vcf.gz -c ID input.vcf.gz -Oz -o annotated.vcf.gz

Annotation Columns (-c)

| Option | Description |
|--------|-------------|
| ID | Copy ID column |
| INFO | Copy all INFO fields |
| INFO/TAG | Copy specific INFO field |
| +INFO/TAG | Add to existing values |

Add rsIDs from dbSNP

bcftools annotate -a dbsnp.vcf.gz -c ID input.vcf.gz -Oz -o with_rsids.vcf.gz

Add Multiple Annotations

bcftools annotate -a database.vcf.gz -c ID,INFO/AF,INFO/CAF input.vcf.gz -Oz -o annotated.vcf.gz

Add from BED/TAB Files

# BED with 4th column as annotation
bcftools annotate -a regions.bed.gz -c CHROM,FROM,TO,INFO/REGION \
    -h <(echo '##INFO=<ID=REGION,Number=1,Type=String,Description="Region name">') \
    input.vcf.gz -Oz -o annotated.vcf.gz

# Tab file: CHROM POS VALUE
bcftools annotate -a annotations.tab.gz -c CHROM,POS,INFO/SCORE \
    -h <(echo '##INFO=<ID=SCORE,Number=1,Type=Float,Description="Custom score">') \
    input.vcf.gz -Oz -o annotated.vcf.gz
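
A tab-delimited annotation file has to be position-sorted before it can be bgzip-compressed and tabix-indexed for use with `-a`. The following is a minimal Python sketch of that preparation step, assuming a hypothetical in-memory list of `(chrom, pos, score)` tuples; the record values and the `write_annotation_rows` name are illustrative and not part of bcftools.

```python
import io

def write_annotation_rows(records, handle):
    """Write (chrom, pos, score) tuples as sorted CHROM<TAB>POS<TAB>SCORE lines."""
    # Group rows by chromosome name and order them by numeric position;
    # tabix expects increasing positions within each chromosome.
    for chrom, pos, score in sorted(records, key=lambda r: (r[0], r[1])):
        handle.write(f"{chrom}\t{pos}\t{score}\n")

# Hypothetical example records (not real data)
records = [("chr2", 500, 0.91), ("chr1", 2000, 0.12), ("chr1", 150, 0.55)]

buffer = io.StringIO()
write_annotation_rows(records, buffer)
tab_text = buffer.getvalue()
print(tab_text, end="")
```

Written to `annotations.tab`, the result could then be compressed with `bgzip annotations.tab` and indexed with `tabix -s1 -b2 -e2 annotations.tab.gz` before running the command above.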

Remove Annotations

bcftools annotate -x INFO/DP,INFO/MQ input.vcf.gz -Oz -o clean.vcf.gz
bcftools annotate -x INFO input.vcf.gz -Oz -o minimal.vcf.gz  # Remove all INFO

Set ID from Fields

bcftools annotate --set-id '%CHROM\_%POS\_%REF\_%ALT' input.vcf.gz -Oz -o with_ids.vcf.gz

bcftools csq

Fast, haplotype-aware consequence prediction from a reference FASTA and an Ensembl-style GFF3 annotation.

bcftools csq -f reference.fa -g genes.gff3.gz input.vcf.gz -Oz -o consequences.vcf.gz

Consequence Types

| Consequence | Description |
|-------------|-------------|
| synonymous | No amino acid change |
| missense | Amino acid change |
| stop_gained | Introduces stop codon |
| frameshift | Changes reading frame |
| splice_donor/acceptor | Affects splicing |
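
bcftools csq writes its predictions to the `INFO/BCSQ` tag as pipe-delimited entries separated by commas. The sketch below splits such a value into dictionaries. It assumes the field order `consequence|gene|transcript|biotype|strand|amino_acid_change|dna_change` described in the bcftools manual, and it treats `@POS` entries as pointers to the record that carries the full consequence. The sample value is hypothetical; check the `##INFO=<ID=BCSQ` header line of your own file before relying on the field order.

```python
# Assumed BCSQ field order; verify against the VCF header of your file.
BCSQ_FIELDS = ["consequence", "gene", "transcript", "biotype",
               "strand", "amino_acid_change", "dna_change"]

def parse_bcsq(bcsq_value):
    """Split an INFO/BCSQ string into one dict per consequence entry."""
    entries = []
    for item in bcsq_value.split(","):
        if item.startswith("@"):
            # '@POS' points at another record holding the full consequence
            entries.append({"reference_pos": int(item[1:])})
            continue
        # zip() stops at the shorter sequence, so entries with fewer
        # fields (for example non-coding consequences) still parse.
        entries.append(dict(zip(BCSQ_FIELDS, item.split("|"))))
    return entries

# Hypothetical BCSQ value for illustration
example = "missense|GENE1|ENST00000000001|protein_coding|+|12A>12V|1000C>T,@995"
parsed_bcsq = parse_bcsq(example)
for entry in parsed_bcsq:
    print(entry)
```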

Ensembl VEP

Installation

conda install -c bioconda ensembl-vep
vep_install -a cf -s homo_sapiens -y GRCh38 --CONVERT

Basic Annotation

vep -i input.vcf -o output.vcf --vcf --cache --offline

Comprehensive Annotation

vep -i input.vcf -o output.vcf \
    --vcf \
    --cache --offline \
    --species homo_sapiens \
    --assembly GRCh38 \
    --everything \
    --fork 4

--everything Enables

  • --sift b - SIFT predictions
  • --polyphen b - PolyPhen predictions
  • --hgvs - HGVS nomenclature
  • --symbol - Gene symbols
  • --canonical - Canonical transcript
  • --af - 1000 Genomes frequencies
  • --af_gnomade, --af_gnomadg - gnomAD exome and genome frequencies
  • --pubmed - PubMed IDs

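Filter by Impact

# vep itself has no --filter option; filter the annotated output with filter_vep
vep -i input.vcf -o annotated.vcf --vcf \
    --cache --offline \
    --pick
filter_vep -i annotated.vcf -o output.vcf --format vcf \
    --filter "IMPACT in HIGH,MODERATE"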

Plugins

# CADD scores
vep -i input.vcf -o output.vcf --vcf \
    --cache --offline \
    --plugin CADD,whole_genome_SNVs.tsv.gz

# dbNSFP (multiple predictors)
vep -i input.vcf -o output.vcf --vcf \
    --cache --offline \
    --plugin dbNSFP,dbNSFP4.3a.gz,ALL

# Multiple plugins
vep -i input.vcf -o output.vcf --vcf \
    --cache --offline \
    --plugin CADD,cadd.tsv.gz \
    --plugin dbNSFP,dbnsfp.gz,SIFT_score,Polyphen2_HDIV_score \
    --plugin SpliceAI,snv=spliceai_scores.raw.snv.hg38.vcf.gz,indel=spliceai_scores.raw.indel.hg38.vcf.gz

VEP Output Fields

| Field | Description |
|-------|-------------|
| Consequence | SO term (e.g., missense_variant) |
| IMPACT | HIGH, MODERATE, LOW, MODIFIER |
| SYMBOL | Gene symbol |
| HGVSc/HGVSp | HGVS coding/protein change |
| SIFT/PolyPhen | Pathogenicity predictions |

SnpEff

Installation

conda install -c bioconda snpeff
snpEff download GRCh38.105

Basic Annotation

snpEff ann GRCh38.105 input.vcf > output.vcf

With Statistics

snpEff ann -v -stats stats.html -csvStats stats.csv GRCh38.105 input.vcf > output.vcf

Filter by Impact

snpEff ann GRCh38.105 input.vcf | \
    SnpSift filter "(ANN[*].IMPACT = 'HIGH')" > high_impact.vcf

SnpEff Impact Categories

| Impact | Examples |
|--------|----------|
| HIGH | Stop gained, frameshift, splice donor/acceptor |
| MODERATE | Missense, inframe indel |
| LOW | Synonymous, splice region |
| MODIFIER | Intron, intergenic, UTR |
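
When a variant carries several transcript-level annotations, it is often convenient to reduce them to the single most severe impact. The sketch below ranks the four categories from the table above. The `most_severe` helper and the rank dictionary are illustrative names, not part of SnpEff or SnpSift.

```python
# Lower rank means more severe; the order follows the impact table above.
IMPACT_RANK = {"HIGH": 0, "MODERATE": 1, "LOW": 2, "MODIFIER": 3}

def most_severe(impacts):
    """Return the most severe known impact in an iterable, or None."""
    known = [impact for impact in impacts if impact in IMPACT_RANK]
    if not known:
        return None
    return min(known, key=IMPACT_RANK.__getitem__)

# Impacts as they might appear across the transcripts of one variant
print(most_severe(["MODIFIER", "MODERATE", "LOW"]))
```

The same ranking applies to the IMPACT field written by VEP, which uses the same four categories.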

SnpSift Database Annotations

# dbSNP
SnpSift annotate dbsnp.vcf.gz input.vcf > annotated.vcf

# ClinVar
SnpSift annotate clinvar.vcf.gz input.vcf > annotated.vcf

# dbNSFP
SnpSift dbnsfp -db dbNSFP4.3a.txt.gz input.vcf > annotated.vcf

# Chain multiple
snpEff ann GRCh38.105 input.vcf | \
    SnpSift annotate dbsnp.vcf.gz | \
    SnpSift annotate clinvar.vcf.gz > fully_annotated.vcf

SnpSift Filtering

SnpSift filter "(QUAL >= 30) & (DP >= 10)" input.vcf > filtered.vcf
SnpSift filter "(exists CLNSIG) & (CLNSIG has 'Pathogenic')" input.vcf > pathogenic.vcf

ANNOVAR

Installation

# Download from https://annovar.openbioinformatics.org/ (registration required)
annotate_variation.pl -buildver hg38 -downdb -webfrom annovar refGene humandb/
annotate_variation.pl -buildver hg38 -downdb -webfrom annovar gnomad30_genome humandb/

Table Annotation

table_annovar.pl input.vcf humandb/ \
    -buildver hg38 \
    -out annotated \
    -remove \
    -protocol refGene,gnomad30_genome,clinvar_20230416,dbnsfp42a \
    -operation g,f,f,f \
    -nastring . \
    -vcfinput
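
With `-out annotated` and `-buildver hg38`, table_annovar.pl writes a tab-delimited summary named `annotated.hg38_multianno.txt`. A minimal sketch of reading that table in Python follows. It assumes the refGene columns are named `Gene.refGene` and `ExonicFunc.refGene` (column names follow the protocol names), and it uses a small hypothetical in-memory sample instead of a real output file.

```python
import csv
import io

def exonic_hits(handle, keep=("nonsynonymous SNV", "stopgain")):
    """Return rows whose ExonicFunc.refGene value is one of `keep`."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [row for row in reader if row.get("ExonicFunc.refGene") in keep]

# Hypothetical rows mimicking a multianno table (not real data)
sample = (
    "Chr\tStart\tEnd\tRef\tAlt\tFunc.refGene\tGene.refGene\tExonicFunc.refGene\n"
    "chr1\t1000\t1000\tA\tG\texonic\tGENE1\tnonsynonymous SNV\n"
    "chr1\t2000\t2000\tC\tT\tintronic\tGENE1\t.\n"
    "chr2\t3000\t3000\tG\tA\texonic\tGENE2\tsynonymous SNV\n"
)

hits = exonic_hits(io.StringIO(sample))
for row in hits:
    print(row["Chr"], row["Start"], row["Gene.refGene"], row["ExonicFunc.refGene"])
```

For a real run, pass `open("annotated.hg38_multianno.txt", newline="")` in place of the `StringIO` object.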

Python: Parse Annotated VCF

Parse VEP CSQ

from cyvcf2 import VCF

def parse_vep_csq(csq_string, csq_header):
    fields = csq_header.split('|')
    values = csq_string.split('|')
    return dict(zip(fields, values))

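vcf = VCF('vep_output.vcf')
csq_header = None
for h in vcf.header_iter():
    # info() returns a plain dict that includes 'HeaderType';
    # .get() avoids a KeyError on header lines without an ID
    info = h.info()
    if info.get('HeaderType') == 'INFO' and info.get('ID') == 'CSQ':
        csq_header = info['Description'].split('Format: ')[1].strip('" ')
        break
if csq_header is None:
    raise ValueError('No CSQ header found; was the VCF annotated with vep --vcf?')

for variant in vcf:
    csq = variant.INFO.get('CSQ')
    if csq:
        # One comma-separated entry per transcript/allele combination
        for transcript in csq.split(','):
            parsed = parse_vep_csq(transcript, csq_header)
            if parsed.get('IMPACT') in ('HIGH', 'MODERATE'):
                print(f"{variant.CHROM}:{variant.POS} {parsed.get('SYMBOL', '.')} {parsed.get('Consequence', '.')}")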

Parse SnpEff ANN

from cyvcf2 import VCF

def parse_snpeff_ann(ann_string):
    fields = ['Allele', 'Annotation', 'Impact', 'Gene_Name', 'Gene_ID',
              'Feature_Type', 'Feature_ID', 'Transcript_BioType', 'Rank',
              'HGVS_c', 'HGVS_p', 'cDNA_pos', 'CDS_pos', 'Protein_pos', 'Distance']
    values = ann_string.split('|')
    return dict(zip(fields, values[:len(fields)]))

for variant in VCF('snpeff_output.vcf'):
    ann = variant.INFO.get('ANN')
    if ann:
        for transcript in ann.split(','):
            parsed = parse_snpeff_ann(transcript)
            if parsed['Impact'] == 'HIGH':
                print(f"{variant.CHROM}:{variant.POS} {parsed['Gene_Name']} {parsed['Annotation']}")

Complete Annotation Pipeline

#!/bin/bash
set -euo pipefail

INPUT=$1
REFERENCE=$2
VEP_CACHE=$3
OUTPUT_PREFIX=$4

# Normalize variants: left-align indels and split multiallelic sites
bcftools norm -f "$REFERENCE" -m-any "$INPUT" -Oz -o "${OUTPUT_PREFIX}_norm.vcf.gz"
bcftools index "${OUTPUT_PREFIX}_norm.vcf.gz"

# VEP annotation
vep -i "${OUTPUT_PREFIX}_norm.vcf.gz" \
    -o "${OUTPUT_PREFIX}_vep.vcf" \
    --vcf --cache --offline --dir_cache "$VEP_CACHE" \
    --assembly GRCh38 --everything --pick --fork 4

bgzip "${OUTPUT_PREFIX}_vep.vcf"
bcftools index "${OUTPUT_PREFIX}_vep.vcf.gz"

# Filter high/moderate impact (regex match against the whole CSQ string)
bcftools view -i 'INFO/CSQ~"HIGH" || INFO/CSQ~"MODERATE"' \
    "${OUTPUT_PREFIX}_vep.vcf.gz" -Oz -o "${OUTPUT_PREFIX}_filtered.vcf.gz"

Pathogenicity Predictors

| Predictor | Deleterious | Benign |
|-----------|-------------|--------|
| SIFT | < 0.05 | >= 0.05 |
| PolyPhen-2 (HDIV) | > 0.957 (probably), > 0.453 (possibly) | <= 0.453 |
| CADD | > 20 (top 1%), > 30 (top 0.1%) | < 10 |
| REVEL | > 0.5 | < 0.5 |
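
The cutoffs in this table can be applied programmatically once the scores have been parsed from the annotated VCF. The sketch below encodes them as written above. The function names and the "intermediate" label for scores that fall between the listed cutoffs are assumptions made for illustration, not an established classification scheme.

```python
def classify_sift(score):
    """SIFT: lower scores are more likely deleterious."""
    return "deleterious" if score < 0.05 else "benign"

def classify_polyphen_hdiv(score):
    """PolyPhen-2 HDIV: higher scores are more likely damaging."""
    if score > 0.957:
        return "probably_damaging"
    if score > 0.453:
        return "possibly_damaging"
    return "benign"

def classify_cadd(phred):
    """CADD PHRED score; values from 10 to 20 are not covered by the table."""
    if phred > 20:
        return "deleterious"
    if phred < 10:
        return "benign"
    return "intermediate"

def classify_revel(score):
    """REVEL: the table leaves exactly 0.5 unassigned."""
    if score > 0.5:
        return "deleterious"
    if score < 0.5:
        return "benign"
    return "intermediate"

print(classify_sift(0.01), classify_polyphen_hdiv(0.6),
      classify_cadd(25.3), classify_revel(0.2))
```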

Clinical Significance (ClinVar)

| Code | Meaning |
|------|---------|
| Pathogenic | Disease-causing |
| Likely_pathogenic | Probably disease-causing |
| Uncertain_significance | VUS |
| Likely_benign | Probably not disease-causing |
| Benign | Not disease-causing |
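
ClinVar's `CLNSIG` value can combine several of these codes in one string, for example `Pathogenic/Likely_pathogenic`. The sketch below splits such a value into terms and checks them against the codes in the table. It assumes `/` and `|` are the separators that matter, which should be confirmed against the ClinVar release in use; the helper names are hypothetical.

```python
import re

PATHOGENIC_TERMS = {"Pathogenic", "Likely_pathogenic"}

def clnsig_terms(value):
    """Split a CLNSIG string on '/' and '|' into individual terms."""
    return [term for term in re.split(r"[/|]", value) if term]

def is_pathogenic(value):
    """True if any term is exactly Pathogenic or Likely_pathogenic."""
    return any(term in PATHOGENIC_TERMS for term in clnsig_terms(value))

for value in ("Pathogenic/Likely_pathogenic",
              "Uncertain_significance",
              "Conflicting_interpretations_of_pathogenicity"):
    print(value, is_pathogenic(value))
```

Matching whole terms avoids treating values that merely mention pathogenicity, such as the conflicting-interpretations code, as pathogenic.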

Quick Reference

| Task | Command |
|------|---------|
| Add rsIDs | bcftools annotate -a dbsnp.vcf.gz -c ID in.vcf.gz |
| VEP annotation | vep -i in.vcf -o out.vcf --vcf --cache --everything |
| SnpEff annotation | snpEff ann GRCh38.105 in.vcf > out.vcf |
| Consequences only | bcftools csq -f ref.fa -g genes.gff in.vcf.gz |

Related Skills

  • variant-calling/variant-normalization - Normalize before annotating
  • variant-calling/filtering-best-practices - Filter by annotations
  • variant-calling/vcf-basics - Query annotated fields
  • database-access/entrez-fetch - Download annotation databases
<!-- AUTHOR_SIGNATURE: 9a7f3c2e-MD-BABU-MIA-2026-MSSM-SECURE -->