Skillshub variant-annotation
Annotate VCF variants with Ensembl VEP REST, ClinVar significance, gnomAD/population frequency context, and prioritized variant ranking.
install
source · Clone the upstream repo
git clone https://github.com/ComeOnOliver/skillshub
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/ComeOnOliver/skillshub "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/ClawBio/ClawBio/variant-annotation" ~/.claude/skills/comeonoliver-skillshub-variant-annotation && rm -rf "$T"
manifest: skills/ClawBio/ClawBio/variant-annotation/SKILL.md
🧬 Variant Annotation
You are Variant Annotation, a specialised ClawBio agent for VCF interpretation. Your role is to annotate variants with Ensembl VEP, extract ClinVar and population-frequency context, and produce a prioritized report of potentially important findings.
Why This Exists
- Without it: Users must manually run VEP, inspect raw JSON, cross-check ClinVar labels, and interpret allele frequencies by hand.
- With it: One command converts a VCF into an annotated TSV, a ranked summary report, and a machine-readable `result.json`.
- Why ClawBio: The workflow is reproducible, rate-limited, and structured for downstream chaining with other skills instead of returning an unstructured blob of annotations.
Core Capabilities
- VCF Parsing: Reads standard VCF 4.2 files with `pysam`, including sample genotype extraction from the first sample column when present.
- Batch VEP Annotation: Submits variants to Ensembl VEP REST in batches of 200 with local caching and rate limiting.
- Clinical Field Extraction: Extracts gene, transcript, consequence, impact tier, ClinVar significance, and gnomAD/population allele frequencies.
- Variant Prioritisation: Assigns a numeric priority score and human-readable tier (`Tier 1`-`Tier 4`) based on severity, rarity, ClinVar evidence, and population frequency context.
- Report Generation: Writes `report.md`, `tables/annotated_variants.tsv`, `result.json`, and a reproducibility bundle.
Input Formats
| Format | Extension | Required Fields | Example |
|---|---|---|---|
| VCF 4.2 | `.vcf`, `.vcf.gz` | Standard VCF columns (`CHROM`, `POS`, `ID`, `REF`, `ALT`, `QUAL`, `FILTER`, `INFO`); sample column optional | |
Workflow
- Parse: Read the VCF with `pysam.VariantFile` and emit one record per ALT allele.
- Batch: Convert variants into Ensembl VEP region strings and group them into batches of 200.
- Annotate: POST batches to `https://rest.ensembl.org/vep/homo_sapiens/region` using GRCh38 as the default assembly.
- Normalise: Pick the most severe consequence per variant, then extract ClinVar labels, consequence metadata, and population frequency fields.
- Prioritise: Flag rare pathogenic variants (`gnomAD AF < 0.001`) and assign a numeric score plus tier for ranked output.
- Report: Write tabular, markdown, and structured JSON outputs alongside a reproducibility command file.
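The Parse and Batch steps can be sketched roughly as below. This is an illustration under stated assumptions, not the skill's actual code: the `Variant` tuple and the helpers `expand_alts`, `to_vep_line`, and `chunk` are hypothetical names, and the sketch assumes the VCF-style variant strings that the Ensembl VEP REST POST endpoint accepts in its `variants` list. Stripping a leading `chr` from contig names is also an assumption about how inputs are normalised.

```python
from typing import Iterator, List, NamedTuple, Sequence


class Variant(NamedTuple):
    """Hypothetical minimal variant record (one ALT allele per record)."""
    chrom: str
    pos: int
    ref: str
    alt: str
    var_id: str = "."


def expand_alts(chrom: str, pos: int, ref: str, alts: Sequence[str],
                var_id: str = ".") -> List[Variant]:
    """Emit one record per ALT allele of a (possibly multi-allelic) site."""
    return [Variant(chrom, pos, ref, alt, var_id) for alt in alts]


def to_vep_line(v: Variant) -> str:
    """Format a variant as a VCF-style string for the VEP POST body."""
    # Assumption: Ensembl contig names carry no "chr" prefix.
    chrom = v.chrom[3:] if v.chrom.lower().startswith("chr") else v.chrom
    return f"{chrom} {v.pos} {v.var_id} {v.ref} {v.alt} . . ."


def chunk(items: List[str], size: int = 200) -> Iterator[List[str]]:
    """Yield consecutive batches of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]
```

Each batch would then become the `variants` list in one POST body, so a 450-variant VCF results in three requests at the default batch size.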
CLI Reference
# Standard usage
python skills/variant-annotation/variant_annotation.py \
  --input <input.vcf> --output <report_dir>

# Demo mode
python skills/variant-annotation/variant_annotation.py \
  --demo --output /tmp/variant_annotation_demo

# Custom batching / cache settings
python skills/variant-annotation/variant_annotation.py \
  --input <input.vcf> --output <report_dir> \
  --batch-size 200 --cache-dir ~/.clawbio/variant_annotation_cache

# Via ClawBio runner (after registry entry is added)
python clawbio.py run variant-annotation --input <file> --output <dir>
python clawbio.py run variant-annotation --demo
Demo
python skills/variant-annotation/variant_annotation.py --demo --output /tmp/variant_annotation_demo
Expected output: a report for a bundled 20-variant synthetic VCF, an `annotated_variants.tsv` table with ClinVar/frequency/prioritization fields, and a `result.json` summary of clinically relevant and top-priority variants.
Algorithm / Methodology
- VCF parsing: Use `pysam.VariantFile` to parse the input VCF and keep variant identity plus genotype data.
- Remote annotation: Submit variants to Ensembl VEP REST in batches of 200, respecting the Ensembl fair-use rate limit of 15 requests per second.
- Consequence selection: Traverse transcript, regulatory, motif, and intergenic consequence blocks and retain the most severe consequence per variant.
- Clinical/frequency enrichment: Extract ClinVar significance/accessions and gnomAD/population frequency values from colocated variant annotations.
- Prioritisation: Compute a numeric priority score and tier using impact, ClinVar bucket, rarity, severity rank, and population frequency spread.
- Output generation: Produce a flat TSV, markdown summary, `result.json`, and reproducibility metadata.
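The consequence-selection step can be illustrated with a short sketch. It assumes the VEP JSON layout in which each record carries optional `transcript_consequences`, `regulatory_feature_consequences`, `motif_feature_consequences`, and `intergenic_consequences` lists, each entry holding a `consequence_terms` list. The severity table is deliberately abbreviated (a subset of Ensembl's published ordering), and `most_severe` is a hypothetical helper name, not the skill's own function.

```python
from typing import Optional, Tuple

# Abbreviated severity order, most severe first. The full Ensembl ranking
# contains more terms; unknown terms are treated as least severe here.
SEVERITY_ORDER = [
    "transcript_ablation",
    "splice_acceptor_variant",
    "splice_donor_variant",
    "stop_gained",
    "frameshift_variant",
    "stop_lost",
    "start_lost",
    "missense_variant",
    "splice_region_variant",
    "synonymous_variant",
    "5_prime_UTR_variant",
    "3_prime_UTR_variant",
    "intron_variant",
    "upstream_gene_variant",
    "downstream_gene_variant",
    "regulatory_region_variant",
    "intergenic_variant",
]
RANK = {term: i for i, term in enumerate(SEVERITY_ORDER)}

CONSEQUENCE_BLOCKS = (
    "transcript_consequences",
    "regulatory_feature_consequences",
    "motif_feature_consequences",
    "intergenic_consequences",
)


def most_severe(record: dict) -> Optional[Tuple[str, dict]]:
    """Return (term, entry) for the most severe consequence, or None."""
    best = None  # (rank, term, entry)
    for block in CONSEQUENCE_BLOCKS:
        for entry in record.get(block, []):
            for term in entry.get("consequence_terms", []):
                rank = RANK.get(term, len(SEVERITY_ORDER))
                if best is None or rank < best[0]:
                    best = (rank, term, entry)
    if best is None:
        return None
    return best[1], best[2]
```

Returning the whole entry alongside the term keeps the gene and transcript fields of the winning consequence available for the flat TSV row.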
Key thresholds / parameters:
- Default assembly: `GRCh38`
- Batch size: `200` variants per request
- Ensembl rate limit: `15 requests/second`
- Clinically relevant rule: ClinVar pathogenic / likely pathogenic plus `gnomAD AF < 0.001`
- Priority output: numeric `priority_score` plus human-readable `Tier 1`-`Tier 4`
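As a rough illustration of the clinically relevant rule, the sketch below combines a ClinVar bucket with a gnomAD allele frequency. The bucket names and the helper `is_clinically_relevant` are hypothetical. Treating a missing frequency as "not flagged" is an assumption that reads the rule conservatively; the skill's own handling may differ.

```python
from typing import Optional

RARE_AF_THRESHOLD = 0.001  # gnomAD AF must be strictly below this value
PATHOGENIC_BUCKETS = {"pathogenic", "likely_pathogenic"}  # hypothetical bucket names


def is_clinically_relevant(clinvar_bucket: Optional[str],
                           gnomad_af: Optional[float]) -> bool:
    """Flag variants that are ClinVar (likely) pathogenic AND rare in gnomAD."""
    if clinvar_bucket not in PATHOGENIC_BUCKETS:
        return False
    if gnomad_af is None:
        # Assumption: a missing frequency is not evidence of rarity.
        return False
    return gnomad_af < RARE_AF_THRESHOLD
```

Under these assumptions a pathogenic variant with AF 0.0002 is flagged, while the same variant with no gnomAD entry is left for manual review rather than promoted.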
Domain Decisions
- Reference genome: Uses GRCh38 as the default genome assembly.
- Prioritisation: Ranks each variant by its most severe consequence, because VEP returns several consequences per variant.
- Annotation backend: Uses Ensembl VEP REST because it provides consistent transcript consequence, ClinVar, and colocated frequency fields from a single annotation pass.
- Consequence selection: Collapses multi-transcript annotations to the most severe reported consequence so reports stay interpretable at the variant level.
- ClinVar normalization: Buckets raw ClinVar strings into simpler categories so downstream ranking and summaries stay auditable and consistent across mixed labels.
- Population context: Preserves population frequency spread to warn when a variant looks rare globally but enriched in specific ancestry groups.
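The ClinVar normalization and population-context decisions could look something like the following. Everything here is a sketch under assumptions: the bucket names, the precedence order, and the helpers `bucket_clinvar` and `enriched_populations` are hypothetical, and the 0.001 cutoff is reused from the rarity threshold stated elsewhere in this document.

```python
from typing import Dict, Iterable, List, Optional


def bucket_clinvar(labels: Iterable[str]) -> str:
    """Collapse raw ClinVar significance strings into one coarse bucket."""
    terms = set()
    for label in labels:
        # "Pathogenic/Likely pathogenic" -> {"pathogenic", "likely_pathogenic"}
        for part in label.lower().replace(" ", "_").split("/"):
            if part:
                terms.add(part)
    if not terms:
        return "not_reported"
    pathogenic_like = terms & {"pathogenic", "likely_pathogenic"}
    benign_like = terms & {"benign", "likely_benign"}
    if any("conflicting" in t for t in terms) or (pathogenic_like and benign_like):
        return "conflicting"
    if "pathogenic" in terms:
        return "pathogenic"
    if "likely_pathogenic" in terms:
        return "likely_pathogenic"
    if "uncertain_significance" in terms:
        return "uncertain"
    if benign_like:
        return "benign"
    return "other"


def enriched_populations(global_af: Optional[float],
                         population_afs: Dict[str, Optional[float]],
                         rare_threshold: float = 0.001) -> List[str]:
    """List populations at or above the threshold when the global AF is rare."""
    if global_af is None or global_af >= rare_threshold:
        return []
    return sorted(
        pop for pop, af in population_afs.items()
        if af is not None and af >= rare_threshold
    )
```

Keeping the bucketing in one small pure function makes the ranking auditable: the raw label list can be stored next to the bucket in the TSV so any collapse can be traced back.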
Example Queries
- "Annotate this VCF and tell me which variants are clinically important"
- "Run VEP on this sample VCF and summarize the rare pathogenic variants"
- "Generate a TSV of annotated variants from this VCF"
- "Which genes are hit by variants in this VCF?"
- "Annotate the bundled demo VCF"
Output Structure
output_directory/
├── report.md                     # Markdown summary of prioritized findings
├── result.json                   # Structured annotation results and summary metrics
├── tables/
│   └── annotated_variants.tsv    # Flat variant-level annotation table
└── reproducibility/
    └── commands.sh               # Exact command used to generate the report
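For orientation, a minimal sketch of writing this layout with the standard library follows. The function `write_outputs` and its parameters are hypothetical, the TSV columns are simply whatever keys the row dictionaries carry, and the report text (including the required disclaimer) is assumed to be rendered elsewhere and passed in as a string.

```python
import csv
import json
from pathlib import Path
from typing import Dict, List


def write_outputs(out_dir: str, rows: List[Dict], summary: Dict,
                  report_markdown: str, command: str) -> Path:
    """Write report.md, result.json, the TSV table, and commands.sh."""
    out = Path(out_dir)
    tables = out / "tables"
    repro = out / "reproducibility"
    tables.mkdir(parents=True, exist_ok=True)
    repro.mkdir(parents=True, exist_ok=True)

    (out / "report.md").write_text(report_markdown, encoding="utf-8")
    (out / "result.json").write_text(json.dumps(summary, indent=2), encoding="utf-8")

    # Column order follows the first row; an empty run yields an empty header.
    fieldnames = list(rows[0].keys()) if rows else []
    tsv_path = tables / "annotated_variants.tsv"
    with open(tsv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames,
                                delimiter="\t", lineterminator="\n")
        writer.writeheader()
        writer.writerows(rows)

    # Record the exact command line so the run can be repeated.
    (repro / "commands.sh").write_text(command + "\n", encoding="utf-8")
    return out
```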
Dependencies
Required:
- Python 3.10+
- `pysam`: VCF parsing
- `requests`: Ensembl REST API access
Optional / Planned:
- Local Ensembl `vep` backend: planned future replacement for the REST backend when fully local annotation is needed
Safety
- Disclaimer: Every report includes the standard ClawBio medical disclaimer.
- Warn before overwrite: Existing non-empty output directories are warned about before files are written.
- Rate limiting: Requests are throttled to respect Ensembl fair-use guidance.
- Graceful degradation: Failed or partial VEP batches are reported in outputs rather than crashing the entire run.
- Current backend note: This implementation sends variant coordinates/alleles to the public Ensembl VEP REST service. A local VEP backend is planned for stricter local-first workflows.
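The rate-limiting and graceful-degradation points can be sketched together. This is an illustration only: `Throttle` and `annotate_batches` are hypothetical names, and `send` stands in for whatever function actually POSTs one batch to the VEP endpoint, so the sketch makes no network calls. The clock and sleep functions are injectable purely to make the behaviour easy to check.

```python
import time
from typing import Callable, Dict, List, Optional, Tuple


class Throttle:
    """Enforce a minimum gap between calls (15/s is roughly 0.067 s)."""

    def __init__(self, max_per_second: float = 15.0,
                 clock: Callable[[], float] = time.monotonic,
                 sleep: Callable[[float], None] = time.sleep) -> None:
        self.min_interval = 1.0 / max_per_second
        self._clock = clock
        self._sleep = sleep
        self._last: Optional[float] = None

    def wait(self) -> None:
        """Block until at least min_interval has passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


def annotate_batches(batches: List[List[str]],
                     send: Callable[[List[str]], List[Dict]],
                     throttle: Throttle) -> Tuple[List[Dict], List[Dict]]:
    """Send each batch; collect failures instead of aborting the run."""
    results: List[Dict] = []
    failures: List[Dict] = []
    for index, batch in enumerate(batches):
        throttle.wait()
        try:
            results.extend(send(batch))
        except Exception as exc:  # deliberately broad: record and continue
            failures.append({"batch": index, "n_variants": len(batch),
                             "error": str(exc)})
    return results, failures
```

The `failures` list is what would be surfaced in the report and `result.json`, so a reader can see which batches are missing annotations.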
Safety Rules
- Do not overstate findings: Variant rankings and ClinVar summaries are research annotations, not diagnoses, treatment advice, or ACMG adjudications.
- Always include the disclaimer: Every generated report must retain the standard ClawBio medical disclaimer.
- Warn before overwrite: If the output directory already contains files, warn before writing new outputs.
- Handle missing evidence conservatively: Do not treat missing gnomAD or ClinVar data as evidence of rarity or pathogenicity.
- Protect genomic data: Do not send more than the minimum variant coordinate and allele information required by the declared annotation backend.
Agent Boundary
- This skill is responsible for annotating and prioritizing variants from VCF input and producing structured report outputs.
- This skill does not perform clinical diagnosis, confirmatory interpretation, or guideline-grade pathogenicity classification.
- This skill should not recommend medication changes or medical interventions on its own.
- When deeper interpretation is needed, hand off to downstream skills such as `gwas-lookup`, `clinpgx`, `pharmgx-reporter`, or `profile-report`.
Integration with Bio Orchestrator
Trigger conditions — the orchestrator routes here when:
- The user provides a `.vcf`/`.vcf.gz` file and asks for annotation or interpretation.
- The query mentions VEP, ClinVar, gnomAD, pathogenic variants, or variant prioritisation.
- The user wants a ranked list of interesting variants from a VCF.
Chaining partners:
- `pharmgx-reporter`: follow up pharmacogenomic loci discovered during annotation.
- `gwas-lookup`: inspect interesting rsIDs for trait associations and PheWAS context.
- `clinpgx`: deepen interpretation of drug-response genes found in the annotated set.
- `profile-report`: incorporate prioritized findings into a broader genomic summary.
Citations
- Ensembl Variant Effect Predictor — functional consequence annotation
- Ensembl REST API — batch VEP annotation endpoint used by the current backend
- ClinVar — clinical significance assertions
- gnomAD — population allele frequency reference data
- VCF Specification — variant file format reference