install
source · Clone the upstream repo
git clone https://github.com/FreedomIntelligence/OpenClaw-Medical-Skills
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/FreedomIntelligence/OpenClaw-Medical-Skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/bio-alignment-validation" ~/.claude/skills/freedomintelligence-openclaw-medical-skills-bio-alignment-validation && rm -rf "$T"
OpenClaw · Install into ~/.openclaw/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/FreedomIntelligence/OpenClaw-Medical-Skills "$T" && mkdir -p ~/.openclaw/skills && cp -r "$T/skills/bio-alignment-validation" ~/.openclaw/skills/freedomintelligence-openclaw-medical-skills-bio-alignment-validation && rm -rf "$T"
manifest:
skills/bio-alignment-validation/SKILL.mdsource content
<!--
# COPYRIGHT NOTICE
# This file is part of the "Universal Biomedical Skills" project.
# Copyright (c) 2026 MD BABU MIA, PhD <md.babu.mia@mssm.edu>
# All Rights Reserved.
#
# This code is proprietary and confidential.
# Unauthorized copying of this file, via any medium is strictly prohibited.
#
# Provenance: Authenticated by MD BABU MIA
-->
name: bio-alignment-validation description: Validate alignment quality with insert size distribution, proper pairing rates, GC bias, strand balance, and other post-alignment metrics. Use when verifying alignment data quality before variant calling or quantification. tool_type: mixed primary_tool: samtools measurable_outcome: Execute skill workflow successfully with valid output within 15 minutes. allowed-tools:
- read_file
- run_shell_command
Alignment Validation
Post-alignment quality control to verify alignment quality and identify issues.
Insert Size Distribution
Insert size should match library preparation protocol.
samtools stats
samtools stats input.bam > stats.txt grep "^IS" stats.txt | cut -f2,3 > insert_sizes.txt
Picard CollectInsertSizeMetrics
java -jar picard.jar CollectInsertSizeMetrics \ I=input.bam \ O=insert_metrics.txt \ H=insert_histogram.pdf
Expected Insert Sizes
| Library Type | Expected Size |
|---|---|
| Standard WGS | 300-500 bp |
| PCR-free | 350-550 bp |
| RNA-seq | 150-300 bp |
| ChIP-seq | 150-300 bp |
| ATAC-seq | Multimodal |
Python Insert Size Analysis
import pysam import numpy as np import matplotlib.pyplot as plt def get_insert_sizes(bam_file, max_reads=100000): sizes = [] bam = pysam.AlignmentFile(bam_file, 'rb') for i, read in enumerate(bam.fetch()): if i >= max_reads: break if read.is_proper_pair and not read.is_secondary and read.template_length > 0: sizes.append(read.template_length) bam.close() return sizes sizes = get_insert_sizes('sample.bam') print(f'Median insert size: {np.median(sizes):.0f}') print(f'Mean insert size: {np.mean(sizes):.0f}') print(f'Std dev: {np.std(sizes):.0f}') plt.hist(sizes, bins=100, range=(0, 1000)) plt.xlabel('Insert Size') plt.ylabel('Count') plt.savefig('insert_size_dist.pdf')
Proper Pairing Rate
Percentage of reads correctly paired.
samtools flagstat
samtools flagstat input.bam samtools flagstat input.bam | grep "properly paired"
Calculate Pairing Rate
proper=$(samtools view -c -f 2 input.bam) mapped=$(samtools view -c -F 4 input.bam) rate=$(echo "scale=4; $proper / $mapped * 100" | bc) echo "Proper pairing rate: ${rate}%"
Expected Rates
| Metric | Good | Marginal | Poor |
|---|---|---|---|
| Proper pair | > 90% | 80-90% | < 80% |
| Mapped | > 95% | 90-95% | < 90% |
| Singletons | < 5% | 5-10% | > 10% |
GC Bias
GC content correlation with coverage.
Picard CollectGcBiasMetrics
java -jar picard.jar CollectGcBiasMetrics \ I=input.bam \ O=gc_bias_metrics.txt \ CHART=gc_bias_chart.pdf \ S=gc_summary.txt \ R=reference.fa
deepTools computeGCBias
computeGCBias \ -b input.bam \ --effectiveGenomeSize 2913022398 \ -g hg38.2bit \ -o gc_bias.txt \ --biasPlot gc_bias.pdf
Interpret GC Bias
| Issue | Symptom |
|---|---|
| Under-representation | Low GC coverage drops |
| Over-representation | High GC coverage elevated |
| PCR bias | Strong correlation |
Strand Balance
Forward and reverse strand should be balanced.
Calculate Strand Ratio
forward=$(samtools view -c -F 16 input.bam) reverse=$(samtools view -c -f 16 input.bam) echo "Forward: $forward" echo "Reverse: $reverse" ratio=$(echo "scale=4; $forward / $reverse" | bc) echo "F/R ratio: $ratio"
Check Strand Bias per Chromosome
for chr in chr1 chr2 chr3; do fwd=$(samtools view -c -F 16 input.bam $chr) rev=$(samtools view -c -f 16 input.bam $chr) echo "$chr: F=$fwd R=$rev ratio=$(echo "scale=2; $fwd/$rev" | bc)" done
Mapping Quality Distribution
Extract MAPQ Distribution
samtools view input.bam | cut -f5 | sort -n | uniq -c | sort -k2 -n
Calculate Mean MAPQ
samtools view input.bam | awk '{sum+=$5; count++} END {print "Mean MAPQ:", sum/count}'
MAPQ Thresholds
| MAPQ | Meaning |
|---|---|
| 0 | Multi-mapper |
| 1-10 | Low confidence |
| 20-30 | Moderate |
| 40+ | High confidence |
| 60 | Unique (BWA) |
Chromosome Coverage Balance
Calculate Per-Chromosome Coverage
samtools idxstats input.bam | awk '{print $1, $3/$2}' | head -25
Check for Aneuploidy/Contamination
samtools idxstats input.bam | awk '$3 > 0 { sum += $3 len[$1] = $2 reads[$1] = $3 } END { for (chr in reads) { expected = len[chr] / sum * reads[chr] ratio = reads[chr] / expected if (ratio < 0.8 || ratio > 1.2) print chr, ratio } }'
Mismatch Rate
Picard CollectAlignmentSummaryMetrics
java -jar picard.jar CollectAlignmentSummaryMetrics \ I=input.bam \ R=reference.fa \ O=alignment_summary.txt
Key Metrics
| Metric | Description | Good Value |
|---|---|---|
| PCT_PF_READS_ALIGNED | Mapped % | > 95% |
| PF_MISMATCH_RATE | Mismatches | < 1% |
| PF_INDEL_RATE | Indels | < 0.1% |
| STRAND_BALANCE | Strand ratio | ~0.5 |
Comprehensive Validation Script
#!/bin/bash BAM=$1 REF=$2 NAME=$(basename $BAM .bam) OUTDIR=${3:-qc} mkdir -p $OUTDIR echo "=== Alignment Validation: $NAME ===" | tee $OUTDIR/report.txt echo -e "\n--- Flagstat ---" | tee -a $OUTDIR/report.txt samtools flagstat $BAM | tee -a $OUTDIR/report.txt echo -e "\n--- Mapping Rate ---" | tee -a $OUTDIR/report.txt mapped=$(samtools view -c -F 4 $BAM) total=$(samtools view -c $BAM) rate=$(echo "scale=2; $mapped / $total * 100" | bc) echo "Mapping rate: ${rate}%" | tee -a $OUTDIR/report.txt echo -e "\n--- Proper Pairing ---" | tee -a $OUTDIR/report.txt proper=$(samtools view -c -f 2 $BAM) pair_rate=$(echo "scale=2; $proper / $mapped * 100" | bc) echo "Proper pairing: ${pair_rate}%" | tee -a $OUTDIR/report.txt echo -e "\n--- Insert Size ---" | tee -a $OUTDIR/report.txt samtools stats $BAM | grep "insert size average" | tee -a $OUTDIR/report.txt echo -e "\n--- Strand Balance ---" | tee -a $OUTDIR/report.txt fwd=$(samtools view -c -F 16 $BAM) rev=$(samtools view -c -f 16 $BAM) strand_ratio=$(echo "scale=3; $fwd / $rev" | bc) echo "Forward: $fwd, Reverse: $rev, Ratio: $strand_ratio" | tee -a $OUTDIR/report.txt echo -e "\n--- Chromosome Coverage ---" | tee -a $OUTDIR/report.txt samtools idxstats $BAM | head -25 | tee -a $OUTDIR/report.txt echo -e "\nReport: $OUTDIR/report.txt"
Python Validation Module
import pysam import numpy as np from collections import Counter class AlignmentValidator: def __init__(self, bam_file): self.bam = pysam.AlignmentFile(bam_file, 'rb') def mapping_rate(self): stats = self.bam.get_index_statistics() mapped = sum(s.mapped for s in stats) unmapped = sum(s.unmapped for s in stats) return mapped / (mapped + unmapped) * 100 def proper_pair_rate(self, sample_size=100000): proper = 0 paired = 0 for i, read in enumerate(self.bam.fetch()): if i >= sample_size: break if read.is_paired: paired += 1 if read.is_proper_pair: proper += 1 return proper / paired * 100 if paired > 0 else 0 def mapq_distribution(self, sample_size=100000): mapqs = [] for i, read in enumerate(self.bam.fetch()): if i >= sample_size: break if not read.is_unmapped: mapqs.append(read.mapping_quality) return Counter(mapqs) def strand_balance(self, sample_size=100000): forward = 0 reverse = 0 for i, read in enumerate(self.bam.fetch()): if i >= sample_size: break if not read.is_unmapped: if read.is_reverse: reverse += 1 else: forward += 1 return forward / (forward + reverse) if (forward + reverse) > 0 else 0.5 def report(self): print(f'Mapping rate: {self.mapping_rate():.1f}%') print(f'Proper pairing: {self.proper_pair_rate():.1f}%') print(f'Strand balance: {self.strand_balance():.3f}') mapq_dist = self.mapq_distribution() high_qual = sum(v for k, v in mapq_dist.items() if k >= 30) total = sum(mapq_dist.values()) print(f'High MAPQ (>=30): {high_qual/total*100:.1f}%') def close(self): self.bam.close() validator = AlignmentValidator('sample.bam') validator.report() validator.close()
Quality Thresholds Summary
| Metric | Good | Warning | Fail |
|---|---|---|---|
| Mapping rate | > 95% | 90-95% | < 90% |
| Proper pairing | > 90% | 80-90% | < 80% |
| Duplicate rate | < 10% | 10-20% | > 20% |
| Strand balance | 0.48-0.52 | 0.45-0.55 | Outside |
| Mean MAPQ | > 40 | 30-40 | < 30 |
| GC bias | < 1.2x | 1.2-1.5x | > 1.5x |
Related Skills
- bam-statistics - Basic flagstat and depth
- duplicate-handling - Mark/remove duplicates
- alignment-filtering - Filter low-quality reads
- chip-seq/chipseq-qc - ChIP-specific metrics