OpenClaw-Medical-Skills fastq-analysis-pipeline
Guide through omicverse's alignment module for SRA downloading, FASTQ quality control, STAR alignment, gene quantification, and single-cell kallisto/bustools pipelines covering both bulk and single-cell RNA-seq workflows.
install
source · Clone the upstream repo
git clone https://github.com/FreedomIntelligence/OpenClaw-Medical-Skills
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/FreedomIntelligence/OpenClaw-Medical-Skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/fastq-analysis" ~/.claude/skills/freedomintelligence-openclaw-medical-skills-fastq-analysis-pipeline && rm -rf "$T"
OpenClaw · Install into ~/.openclaw/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/FreedomIntelligence/OpenClaw-Medical-Skills "$T" && mkdir -p ~/.openclaw/skills && cp -r "$T/skills/fastq-analysis" ~/.openclaw/skills/freedomintelligence-openclaw-medical-skills-fastq-analysis-pipeline && rm -rf "$T"
manifest: skills/fastq-analysis/SKILL.md
safety · automated scan (low risk)
This is a pattern-based risk scan, not a security review. Our crawler flagged:
- pip install
Always read a skill's source content before installing. Patterns alone don't mean the skill is malicious — but they warrant attention.
source content
Overview
OmicVerse provides a complete FASTQ-to-count-matrix pipeline via the `ov.alignment` module. This skill covers:
- SRA data acquisition: `prefetch` and `fqdump` (fasterq-dump wrapper)
- Quality control: `fastp` for adapter trimming and QC reports
- RNA-seq alignment: `STAR` aligner with auto-index building
- Gene quantification: `featureCount` (subread featureCounts wrapper)
- Single-cell path: `ref` and `count` via kb-python (kallisto/bustools)
- Parallel SRA download: `parallel_fastq_dump`

All functions share a common CLI infrastructure (`_cli_utils.py`) that handles tool resolution, auto-installation via conda/mamba, parallel execution, and streaming output.
Instructions
- Environment setup
  - Bioinformatics tools are resolved automatically from PATH or the active conda environment.
  - If `auto_install=True` (default), missing tools are installed via mamba/conda on demand.
  - Supported tools: `prefetch`, `vdb-validate`, `fasterq-dump`, `fastp`, `STAR`, `samtools`, `featureCounts`, `pigz`, `gzip`.
  - For the single-cell path, ensure `kb-python` is installed: `pip install kb-python`.
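Before starting a long run, it can help to see which of the supported tools are already on PATH. The following is a minimal sketch using only the standard library; `find_missing_tools` is a hypothetical helper for illustration and is not part of `ov.alignment`, whose own resolution logic may differ.

```python
import shutil

# Tool names copied from the supported-tools list in this skill.
REQUIRED_TOOLS = [
    "prefetch", "vdb-validate", "fasterq-dump", "fastp", "STAR",
    "samtools", "featureCounts", "pigz", "gzip",
]


def find_missing_tools(tools=REQUIRED_TOOLS):
    """Return the tool names that shutil.which cannot locate on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


missing = find_missing_tools()
if missing:
    print("Not on PATH (auto_install would have to provide these):", ", ".join(missing))
else:
    print("All supported tools were found on PATH.")
```

Anything reported as missing is a candidate for on-demand installation when `auto_install=True`.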
- SRA data download (`ov.alignment.prefetch` + `ov.alignment.fqdump`)
  - Use `prefetch` first for reliable downloads with integrity validation (`vdb-validate`).
  - Then convert to FASTQ with `fqdump`. It auto-detects single-end vs paired-end.
  - `fqdump` can also work directly from SRR accessions without prefetch.
  - Both support retry with exponential backoff for network errors.

      import omicverse as ov

      # Step 1: Prefetch SRA files (optional but recommended)
      pre = ov.alignment.prefetch(['SRR1234567', 'SRR1234568'], output_dir='prefetch', jobs=4)

      # Step 2: Convert to FASTQ
      fq = ov.alignment.fqdump(
          ['SRR1234567', 'SRR1234568'],
          output_dir='fastq',
          sra_dir='prefetch',
          gzip=True,
          threads=8,
          jobs=4,
      )
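To illustrate what "retry with exponential backoff" means in practice, here is a generic sketch of the pattern. It is not OmicVerse's implementation: `retry_with_backoff`, its parameters, and the choice of `OSError` as the retryable error are assumptions made for this example.

```python
import time


def retry_with_backoff(func, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call func() until it succeeds, doubling the wait after each failure."""
    for attempt in range(max_attempts):
        try:
            return func()
        except OSError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)  # waits 1s, 2s, 4s, ...


# Demo with a simulated download that fails twice before succeeding.
# The delays are recorded instead of slept so the demo finishes instantly.
delays = []
attempts = {"n": 0}


def flaky_download():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise OSError("simulated network error")
    return "SRR1234567.sra"


result = retry_with_backoff(flaky_download, sleep=delays.append)
```

Doubling the delay gives a transient network problem progressively more time to clear without hammering the remote server.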
- FASTQ quality control (`ov.alignment.fastp`)
  - Runs fastp for adapter trimming, quality filtering, and QC reporting.
  - Supports single-end and paired-end reads.
  - Produces per-sample JSON and HTML QC reports.
  - Sample format: tuple of `(sample_name, fq1_path, fq2_path_or_None)`.

      samples = [
          ('S1', 'fastq/SRR1234567/SRR1234567_1.fastq.gz', 'fastq/SRR1234567/SRR1234567_2.fastq.gz'),
          ('S2', 'fastq/SRR1234568/SRR1234568_1.fastq.gz', 'fastq/SRR1234568/SRR1234568_2.fastq.gz'),
      ]
      clean = ov.alignment.fastp(samples, output_dir='fastp', threads=8, jobs=2)
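With many samples, writing the sample tuples by hand gets tedious. The sketch below groups a list of FASTQ paths into `(sample_name, fq1_path, fq2_path_or_None)` tuples. It assumes the `_1`/`_2` mate suffix seen in the example paths above; `build_samples` is a hypothetical helper, not an `ov.alignment` function.

```python
import re
from pathlib import Path

# Matches e.g. SRR1234567_1.fastq.gz, SRR1234567_2.fq, SRR1234567.fastq.gz
FASTQ_NAME = re.compile(r"^(?P<name>.+?)(?:_(?P<mate>[12]))?\.f(?:ast)?q(?:\.gz)?$")


def build_samples(fastq_paths):
    """Group FASTQ paths into (sample_name, fq1, fq2_or_None) tuples."""
    grouped = {}
    for path in sorted(str(p) for p in fastq_paths):
        match = FASTQ_NAME.match(Path(path).name)
        if match is None:
            continue  # skip anything that is not a FASTQ file
        mate = match.group("mate") or "1"  # no suffix: treat as single-end read 1
        grouped.setdefault(match.group("name"), {})[mate] = path
    # Keep only samples that have a read-1 file; read 2 is optional.
    return [
        (name, mates["1"], mates.get("2"))
        for name, mates in grouped.items()
        if "1" in mates
    ]


samples = build_samples([
    "fastq/SRR1234567/SRR1234567_2.fastq.gz",
    "fastq/SRR1234567/SRR1234567_1.fastq.gz",
    "fastq/SRR1234568/SRR1234568.fastq.gz",
])
```

The resulting list has the same shape as the hand-written `samples` list and could be passed to `ov.alignment.fastp` in its place.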
- STAR alignment (`ov.alignment.STAR`)
  - Aligns FASTQ reads using the STAR aligner.
  - Auto-index building: set `auto_index=True` (default) with `genome_fasta_files` and `gtf` to build the index automatically if it is missing.
  - Produces coordinate-sorted BAM files.
  - Handles gzip-compressed FASTQs automatically (uses pigz/gzip/zcat).
  - Use `strict=False` (default) for graceful error handling per sample.

      # Prepare samples from fastp output
      star_samples = [
          ('S1', 'fastp/S1/S1_clean_1.fastq.gz', 'fastp/S1/S1_clean_2.fastq.gz'),
          ('S2', 'fastp/S2/S2_clean_1.fastq.gz', 'fastp/S2/S2_clean_2.fastq.gz'),
      ]
      bams = ov.alignment.STAR(
          star_samples,
          genome_dir='star_index',
          output_dir='star_out',
          gtf='genes.gtf',
          genome_fasta_files=['genome.fa'],
          threads=8,
          memory='50G',
      )
- Gene quantification (`ov.alignment.featureCount`)
  - Counts aligned reads per gene using featureCounts (subread).
  - Auto-detects paired-end from BAM headers (via pysam or samtools).
  - `auto_fix=True` (default) retries with a corrected paired-end flag on error.
  - `gene_mapping=True` maps gene_id to gene_name from the GTF.
  - `merge_matrix=True` produces a combined count matrix across all samples.

      bam_items = [
          ('S1', 'star_out/S1/Aligned.sortedByCoord.out.bam'),
          ('S2', 'star_out/S2/Aligned.sortedByCoord.out.bam'),
      ]
      counts = ov.alignment.featureCount(
          bam_items,
          gtf='genes.gtf',
          output_dir='counts',
          gene_mapping=True,
          merge_matrix=True,
          threads=8,
      )
      # counts is a pandas DataFrame (gene_id x samples)
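Conceptually, a merged count matrix has one row per gene and one column per sample, with zeros where a gene was not counted in a sample. The plain-Python sketch below illustrates that shape only. It is not how `featureCount` builds its DataFrame, and `merge_counts` is a hypothetical name.

```python
def merge_counts(per_sample):
    """Combine {sample: {gene_id: count}} into (genes, samples, matrix).

    matrix[i][j] is the count of genes[i] in samples[j]; absent genes count as 0.
    """
    samples = list(per_sample)
    genes = sorted({gene for counts in per_sample.values() for gene in counts})
    matrix = [[per_sample[s].get(g, 0) for s in samples] for g in genes]
    return genes, samples, matrix


# Toy per-sample counts with made-up gene identifiers.
genes, samples, matrix = merge_counts({
    "S1": {"geneA": 5, "geneB": 0},
    "S2": {"geneB": 3, "geneC": 7},
})
```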
- Single-cell path (`ov.alignment.ref` + `ov.alignment.count`)
  - Uses kb-python (kallisto + bustools) for single-cell RNA-seq quantification.
  - `ref()` builds a kallisto index and transcript-to-gene mapping.
  - `count()` quantifies single-cell data with barcode/UMI handling.
  - Supports technologies: 10XV2, 10XV3, BULK, and custom.
  - Output formats: h5ad, loom, cellranger MTX.

      # Build reference index
      ref_result = ov.alignment.ref(
          index_path='kb_ref/index.idx',
          t2g_path='kb_ref/t2g.txt',
          fasta_paths=['genome.fa'],
          gtf_paths=['genes.gtf'],
          threads=8,
      )

      # Quantify 10x v3 data
      count_result = ov.alignment.count(
          index_path='kb_ref/index.idx',
          t2g_path='kb_ref/t2g.txt',
          technology='10XV3',
          fastq_paths=['sample_R1.fastq.gz', 'sample_R2.fastq.gz'],
          output_path='kb_out',
          h5ad=True,
          filter_barcodes=True,
          threads=8,
      )
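When a 10x sample is split across several lanes, the FASTQ list is usually supplied as R1/R2 pairs in lane order. The sketch below builds such a list. It assumes `fastq_paths` follows the same pair-by-pair ordering as the `kb count` command line; `interleave_pairs` is a hypothetical helper and the lane file names are invented.

```python
def interleave_pairs(r1_files, r2_files):
    """Return [R1, R2, R1, R2, ...] with lanes in sorted order."""
    if len(r1_files) != len(r2_files):
        raise ValueError("every R1 file needs a matching R2 file")
    ordered = []
    for r1, r2 in zip(sorted(r1_files), sorted(r2_files)):
        ordered.extend([r1, r2])
    return ordered


fastq_paths = interleave_pairs(
    ["sample_L002_R1.fastq.gz", "sample_L001_R1.fastq.gz"],
    ["sample_L001_R2.fastq.gz", "sample_L002_R2.fastq.gz"],
)
```

Sorting both lists independently only pairs files correctly when R1 and R2 names differ solely in the read tag, as they do here.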
- Wiring fastp output into STAR input
  - fastp output is a list of dicts with keys: `sample`, `clean1`, `clean2`, `json`, `html`.
  - Convert to STAR sample tuples:

      star_samples = [
          (r['sample'], r['clean1'], r['clean2'] if r['clean2'] else None)
          for r in (clean if isinstance(clean, list) else [clean])
      ]
- Wiring STAR output into featureCount input
  - STAR output is a list of dicts with keys: `sample`, `bam` (or `error`).
  - Convert to featureCount items:

      bam_items = [
          (r['sample'], r['bam'])
          for r in (bams if isinstance(bams, list) else [bams])
          if 'bam' in r
      ]
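The two wiring steps can be wrapped in small helpers that also report which samples failed alignment instead of dropping them silently. This is a sketch built on the dict keys listed above (`sample`, `clean1`, `clean2`, `bam`, `error`); the helper names are hypothetical and the demo dicts are mock data.

```python
def as_list(result):
    """A single-sample call returns one dict; normalise to a list of dicts."""
    return result if isinstance(result, list) else [result]


def fastp_to_star(clean):
    """Turn fastp result dicts into (sample, fq1, fq2_or_None) tuples."""
    return [(r["sample"], r["clean1"], r["clean2"] or None) for r in as_list(clean)]


def star_to_featurecount(bams):
    """Split STAR results into usable BAM items and (sample, error) failures."""
    results = as_list(bams)
    ok = [(r["sample"], r["bam"]) for r in results if "bam" in r]
    failed = [(r["sample"], r.get("error")) for r in results if "bam" not in r]
    return ok, failed


# Mock results shaped like the outputs described above.
star_samples = fastp_to_star(
    {"sample": "S1", "clean1": "fastp/S1/S1_clean_1.fastq.gz", "clean2": "",
     "json": "fastp/S1/S1.json", "html": "fastp/S1/S1.html"}
)
bam_items, failures = star_to_featurecount([
    {"sample": "S1", "bam": "star_out/S1/Aligned.sortedByCoord.out.bam"},
    {"sample": "S2", "error": "alignment failed"},
])
```

Keeping the failures in a separate list makes it easy to log or re-run just those samples, which fits the per-sample error handling of `strict=False`.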
- Skipping completed steps
  - All functions check for existing outputs and skip if `overwrite=False` (default).
  - Set `overwrite=True` to force re-execution.
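The skip behaviour can be pictured as a simple check on the expected output files. The sketch below shows one way to express it. `should_run` is a hypothetical helper, and treating empty files as incomplete is an assumption of this example rather than documented OmicVerse behaviour.

```python
import tempfile
from pathlib import Path


def should_run(outputs, overwrite=False):
    """Return True when a step has to execute.

    A step is skipped only if overwrite is False and every expected
    output already exists and is non-empty.
    """
    if overwrite or not outputs:
        return True
    return not all(Path(p).exists() and Path(p).stat().st_size > 0 for p in outputs)


# Demo in a temporary directory so nothing is left on disk.
with tempfile.TemporaryDirectory() as tmp:
    bam = Path(tmp) / "Aligned.sortedByCoord.out.bam"
    first = should_run([bam])                    # nothing on disk yet
    bam.write_bytes(b"placeholder")
    second = should_run([bam])                   # output now present
    forced = should_run([bam], overwrite=True)   # explicit re-run
```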
- Troubleshooting
  - If a tool is not found, check that `auto_install=True` is set and that conda/mamba is accessible.
  - For STAR index errors, ensure `genome_fasta_files` points to uncompressed or gzip-compressed FASTA files.
  - For featureCounts paired-end detection errors, `auto_fix=True` handles most cases automatically.
  - GTF files can be gzip-compressed; they are auto-decompressed as needed.
Critical API Reference
Sample Format Convention
All alignment functions use a consistent sample tuple format:
- FASTQ samples: `(sample_name, fq1_path, fq2_path_or_None)`
- BAM items: `(sample_name, bam_path)` or `(sample_name, bam_path, is_paired_bool)`
- Single samples can be passed as a single tuple; multiple as a list of tuples.
- When a single tuple is passed, the return value is a single dict; for a list, a list of dicts.
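The single-tuple versus list convention can be sketched as a normalise and unwrap pair. This is an illustration of the convention only; `normalize_samples` and `unwrap` are hypothetical names, not functions exposed by `ov.alignment`.

```python
def normalize_samples(samples):
    """Return (list_of_tuples, was_single) for either accepted input form."""
    # A lone sample is a tuple whose first element is the sample name (a str).
    if isinstance(samples, tuple) and samples and isinstance(samples[0], str):
        return [samples], True
    return list(samples), False


def unwrap(results, was_single):
    """Mirror the input form: one dict for one tuple, a list for a list."""
    return results[0] if was_single else results


single, was_single = normalize_samples(("S1", "S1_1.fastq.gz", None))
many, was_many_single = normalize_samples([("S1", "a.bam"), ("S2", "b.bam", True)])
```

Code that consumes results can apply the same idea in reverse, which is what the `isinstance(..., list)` checks in the wiring steps do.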
Auto-installation
    # All functions support these parameters:
    auto_install=True   # Auto-install missing tools via conda/mamba
    overwrite=False     # Skip if outputs already exist
    threads=8           # Per-tool thread count
    jobs=None           # Concurrent job count (auto-detected from CPU count)
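Because `threads` applies per tool and `jobs` controls how many samples run at once, the two multiply. If you prefer to set `jobs` explicitly rather than rely on auto-detection, one possible heuristic is sketched below. It is an assumption for illustration and not the rule OmicVerse applies when `jobs=None`; `pick_jobs` is a hypothetical helper.

```python
import os


def pick_jobs(threads, n_samples, cpu_count=None):
    """Choose a job count so that jobs * threads stays within the CPU count."""
    cpus = cpu_count or os.cpu_count() or 1
    by_cpu = cpus // max(1, threads)         # how many tools fit side by side
    return max(1, min(n_samples, by_cpu))    # never more jobs than samples


jobs_big_node = pick_jobs(threads=8, n_samples=10, cpu_count=32)
jobs_few_samples = pick_jobs(threads=8, n_samples=2, cpu_count=32)
jobs_laptop = pick_jobs(threads=8, n_samples=5, cpu_count=4)
```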
Examples
- Bulk RNA-seq from SRA: `prefetch` -> `fqdump` -> `fastp` -> `STAR` -> `featureCount` -> pandas DataFrame
- Single-cell 10x v3: `ref` -> `count` with `technology='10XV3'` -> h5ad AnnData
- Local FASTQ files: Skip download steps, start directly with `fastp` -> `STAR` -> `featureCount`
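The bulk RNA-seq chain can be stitched into one function using only the calls and keyword arguments shown in this skill. Treat it as a sketch under stated assumptions: `run_bulk_pipeline` is a hypothetical wrapper, the reads are assumed to be paired-end, and the FASTQ paths are assumed to follow the `fastq/<accession>/<accession>_1.fastq.gz` layout used in the earlier examples.

```python
def run_bulk_pipeline(accessions, gtf, genome_fasta, threads=8, jobs=2):
    """Download, clean, align, and count a list of SRR accessions."""
    import omicverse as ov  # imported lazily so the sketch loads without omicverse

    # Download and convert to FASTQ.
    ov.alignment.prefetch(accessions, output_dir="prefetch", jobs=jobs)
    ov.alignment.fqdump(accessions, output_dir="fastq", sra_dir="prefetch",
                        gzip=True, threads=threads, jobs=jobs)

    # Assumption: paired-end files named as in the examples above.
    samples = [
        (acc, f"fastq/{acc}/{acc}_1.fastq.gz", f"fastq/{acc}/{acc}_2.fastq.gz")
        for acc in accessions
    ]

    # Quality control, then wire the cleaned reads into STAR.
    clean = ov.alignment.fastp(samples, output_dir="fastp", threads=threads, jobs=jobs)
    clean = clean if isinstance(clean, list) else [clean]
    star_samples = [
        (r["sample"], r["clean1"], r["clean2"] if r["clean2"] else None)
        for r in clean
    ]
    bams = ov.alignment.STAR(star_samples, genome_dir="star_index",
                             output_dir="star_out", gtf=gtf,
                             genome_fasta_files=[genome_fasta],
                             threads=threads, memory="50G")

    # Keep only samples that produced a BAM, then count reads per gene.
    bams = bams if isinstance(bams, list) else [bams]
    bam_items = [(r["sample"], r["bam"]) for r in bams if "bam" in r]
    return ov.alignment.featureCount(bam_items, gtf=gtf, output_dir="counts",
                                     gene_mapping=True, merge_matrix=True,
                                     threads=threads)
```

A call such as `run_bulk_pipeline(['SRR1234567', 'SRR1234568'], 'genes.gtf', 'genome.fa')` would then return the merged count matrix. For local FASTQ files, the two download calls can be dropped and `samples` built directly.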
References
- See reference.md for copy-paste-ready code templates.