OpenClaw-Medical-Skills bio-consensus-sequences
Generate consensus FASTA sequences by applying VCF variants to a reference using bcftools consensus. Use when creating sample-specific reference sequences or reconstructing haplotypes.
git clone https://github.com/FreedomIntelligence/OpenClaw-Medical-Skills
T=$(mktemp -d) && git clone --depth=1 https://github.com/FreedomIntelligence/OpenClaw-Medical-Skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/bio-consensus-sequences" ~/.claude/skills/freedomintelligence-openclaw-medical-skills-bio-consensus-sequences && rm -rf "$T"
T=$(mktemp -d) && git clone --depth=1 https://github.com/FreedomIntelligence/OpenClaw-Medical-Skills "$T" && mkdir -p ~/.openclaw/skills && cp -r "$T/skills/bio-consensus-sequences" ~/.openclaw/skills/freedomintelligence-openclaw-medical-skills-bio-consensus-sequences && rm -rf "$T"
skills/bio-consensus-sequences/SKILL.mdVersion Compatibility
Reference examples tested with: BioPython 1.83+, bcftools 1.19+, bedtools 2.31+, minimap2 2.26+, samtools 1.19+
Before using code patterns, verify installed versions match. If versions differ:
- Python:
thenpip show <package>
to check signatureshelp(module.function) - CLI:
then<tool> --version
to confirm flags<tool> --help
If code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.
Consensus Sequences
"Generate a consensus sequence from my VCF" → Apply called variants to a reference FASTA, producing a sample-specific genome with optional haplotype selection and low-coverage masking.
- CLI:
bcftools consensus -f reference.fa input.vcf.gz - Python:
+cyvcf2
for simple SNP-only casesBio.SeqIO
Basic Usage
Generate Consensus
bcftools consensus -f reference.fa input.vcf.gz > consensus.fa
Specify Sample
bcftools consensus -f reference.fa -s sample1 input.vcf.gz > sample1.fa
Output to File
bcftools consensus -f reference.fa -o consensus.fa input.vcf.gz
Haplotype Selection
First Haplotype Only
bcftools consensus -f reference.fa -H 1 input.vcf.gz > haplotype1.fa
Second Haplotype Only
bcftools consensus -f reference.fa -H 2 input.vcf.gz > haplotype2.fa
Haplotype Options
| Option | Description |
|---|---|
| First haplotype |
| Second haplotype |
| Apply all ALT alleles |
| Apply REF alleles where heterozygous |
| Apply IUPAC ambiguity codes (separate flag) |
IUPAC Codes for Heterozygous Sites
bcftools consensus -f reference.fa -I input.vcf.gz > consensus_iupac.fa
Heterozygous sites encoded with IUPAC ambiguity codes:
- A/G → R
- C/T → Y
- A/C → M
- G/T → K
- A/T → W
- C/G → S
Missing Data Handling
Mark Missing as N
bcftools consensus -f reference.fa -M N input.vcf.gz > consensus.fa
Mark Low Coverage as N
Using a mask BED file:
# Create mask from depth samtools depth input.bam | awk '$3<10 {print $1"\t"$2-1"\t"$2}' > low_coverage.bed # Apply mask bcftools consensus -f reference.fa -m low_coverage.bed input.vcf.gz > consensus.fa
Mask Options
| Option | Description |
|---|---|
| Mask regions in BED file with N |
| Character for masked regions (default N) |
Region Selection
Specific Region
bcftools consensus -f reference.fa -r chr1:1000-2000 input.vcf.gz > region.fa
Multiple Regions
Use with BED file to extract multiple regions.
Chain Files
Generate Chain File
bcftools consensus -f reference.fa -c chain.txt input.vcf.gz > consensus.fa
Chain files map coordinates between reference and consensus:
- Useful for liftover of annotations
- Required when indels change sequence length
Chain File Format
chain score ref_name ref_size ref_strand ref_start ref_end query_name query_size query_strand query_start query_end id
Sample-Specific Consensus
For Each Sample
for sample in $(bcftools query -l input.vcf.gz); do bcftools consensus -f reference.fa -s "$sample" input.vcf.gz > "${sample}.fa" done
Both Haplotypes
sample="sample1" bcftools consensus -f reference.fa -s "$sample" -H 1 input.vcf.gz > "${sample}_hap1.fa" bcftools consensus -f reference.fa -s "$sample" -H 2 input.vcf.gz > "${sample}_hap2.fa"
Filtering Before Consensus
PASS Variants Only
bcftools view -f PASS input.vcf.gz | \ bcftools consensus -f reference.fa > consensus.fa
High-Quality Variants Only
bcftools filter -i 'QUAL>=30 && INFO/DP>=10' input.vcf.gz | \ bcftools consensus -f reference.fa > consensus.fa
SNPs Only
bcftools view -v snps input.vcf.gz | \ bcftools consensus -f reference.fa > consensus_snps.fa
Sequence Naming
Default Naming
Output uses reference sequence names.
Custom Prefix
bcftools consensus -f reference.fa -p "sample1_" input.vcf.gz > consensus.fa
Sequences named:
sample1_chr1, sample1_chr2, etc.
Common Workflows
Goal: Generate consensus sequences for downstream analyses like phylogenetics, viral surveillance, or gene-level comparison.
Approach: Filter variants to high-quality calls, apply per-sample consensus generation, mask low-coverage regions with N, then combine for multi-sample workflows.
Phylogenetic Analysis Preparation
# For each sample, generate consensus mkdir -p consensus for sample in $(bcftools query -l cohort.vcf.gz); do bcftools view -s "$sample" cohort.vcf.gz | \ bcftools view -c 1 | \ bcftools consensus -f reference.fa > "consensus/${sample}.fa" done # Combine for alignment cat consensus/*.fa > all_samples.fa
Viral Genome Assembly
# Apply high-quality variants only bcftools filter -i 'QUAL>=30 && INFO/DP>=20' variants.vcf.gz | \ bcftools view -f PASS | \ bcftools consensus -f reference.fa -M N > consensus.fa
Gene-Specific Consensus
# Extract gene region bcftools consensus -f reference.fa -r chr1:1000000-1010000 \ -s sample1 variants.vcf.gz > gene.fa
Masked Low-Coverage Regions
# Create mask from coverage samtools depth -a input.bam | \ awk '$3<5 {print $1"\t"$2-1"\t"$2}' | \ bedtools merge > low_coverage.bed # Generate consensus with mask bcftools consensus -f reference.fa -m low_coverage.bed \ variants.vcf.gz > consensus.fa
Verify Consensus
Check Differences
# Align consensus to reference minimap2 -a reference.fa consensus.fa | samtools view -bS > alignment.bam # Or simple comparison diff <(grep -v "^>" reference.fa) <(grep -v "^>" consensus.fa) | head
Count Changes
# Number of differences bcftools view -H input.vcf.gz | wc -l
Handling Overlapping Variants
bcftools consensus handles overlapping variants automatically:
- Applies variants in order
- Warns about conflicts
Check for warnings:
bcftools consensus -f reference.fa input.vcf.gz 2>&1 | grep -i warn
cyvcf2 Consensus (Simple Cases)
Manual Consensus Generation
from cyvcf2 import VCF from Bio import SeqIO # Load reference ref_dict = {rec.id: str(rec.seq) for rec in SeqIO.parse('reference.fa', 'fasta')} # Apply variants (SNPs only, simplified) vcf = VCF('input.vcf.gz') changes = {} for variant in vcf: if variant.is_snp and len(variant.ALT) == 1: chrom = variant.CHROM pos = variant.POS - 1 # 0-based if chrom not in changes: changes[chrom] = {} changes[chrom][pos] = variant.ALT[0] # Apply changes for chrom, positions in changes.items(): seq = list(ref_dict[chrom]) for pos, alt in positions.items(): seq[pos] = alt ref_dict[chrom] = ''.join(seq) # Write output with open('consensus.fa', 'w') as f: for chrom, seq in ref_dict.items(): f.write(f'>{chrom}\n{seq}\n')
Note: Use
bcftools consensus for production - handles indels and edge cases properly.
Quick Reference
| Task | Command |
|---|---|
| Basic consensus | |
| Specific sample | |
| Haplotype 1 | |
| IUPAC codes | |
| With mask | |
| Generate chain | |
| Specific region | |
Common Errors
| Error | Cause | Solution |
|---|---|---|
| VCF not indexed | Run |
| Chromosome mismatch | Check chromosome names |
| Variants overlap | Usually OK, check warnings |
| Wrong reference | Use same reference as caller |
Related Skills
- variant-calling - Generate VCF for consensus
- filtering-best-practices - Filter variants before consensus
- variant-normalization - Normalize indels first
- alignment-files/reference-operations - Reference manipulation