OpenClaw-Medical-Skills fastq-analysis-pipeline

Guide through omicverse's alignment module for SRA downloading, FASTQ quality control, STAR alignment, gene quantification, and single-cell kallisto/bustools pipelines covering both bulk and single-cell RNA-seq workflows.

install

source · Clone the upstream repo

git clone https://github.com/FreedomIntelligence/OpenClaw-Medical-Skills

Claude Code · Install into ~/.claude/skills/

T=$(mktemp -d) && git clone --depth=1 https://github.com/FreedomIntelligence/OpenClaw-Medical-Skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/fastq-analysis" ~/.claude/skills/freedomintelligence-openclaw-medical-skills-fastq-analysis-pipeline && rm -rf "$T"

OpenClaw · Install into ~/.openclaw/skills/

T=$(mktemp -d) && git clone --depth=1 https://github.com/FreedomIntelligence/OpenClaw-Medical-Skills "$T" && mkdir -p ~/.openclaw/skills && cp -r "$T/skills/fastq-analysis" ~/.openclaw/skills/freedomintelligence-openclaw-medical-skills-fastq-analysis-pipeline && rm -rf "$T"

manifest: skills/fastq-analysis/SKILL.md

Overview

OmicVerse provides a complete FASTQ-to-count-matrix pipeline via the

ov.alignment

module. This skill covers:

SRA data acquisition:
```
prefetch
```
and
```
fqdump
```
(fasterq-dump wrapper)
Quality control:
```
fastp
```
for adapter trimming and QC reports
RNA-seq alignment:
```
STAR
```
aligner with auto-index building
Gene quantification:
```
featureCount
```
(subread featureCounts wrapper)
Single-cell path:
```
ref
```
and
```
count
```
via kb-python (kallisto/bustools)
Parallel SRA download:
```
parallel_fastq_dump
```

All functions share a common CLI infrastructure (

_cli_utils.py

) that handles tool resolution, auto-installation via conda/mamba, parallel execution, and streaming output.

Instructions

Environment setup
- Bioinformatics tools are resolved automatically from PATH or the active conda environment.
- If
```
auto_install=True
```
  (default), missing tools are installed via mamba/conda on demand.
- Supported tools:
```
prefetch
```
  ,
```
vdb-validate
```
  ,
```
fasterq-dump
```
  ,
```
fastp
```
  ,
```
STAR
```
  ,
```
samtools
```
  ,
```
featureCounts
```
  ,
```
pigz
```
  ,
```
gzip
```
  .
- For the single-cell path, ensure
```
kb-python
```
  is installed:
```
pip install kb-python
```
  .

SRA data download (

ov.alignment.prefetch

ov.alignment.fqdump

)

Use
```
prefetch
```
first for reliable downloads with integrity validation (
```
vdb-validate
```
).
Then convert to FASTQ with
```
fqdump
```
. It auto-detects single-end vs paired-end.
```
fqdump
```
can also work directly from SRR accessions without prefetch.
Both support retry with exponential backoff for network errors.

import omicverse as ov

# Step 1: Prefetch SRA files (optional but recommended)
pre = ov.alignment.prefetch(['SRR1234567', 'SRR1234568'], output_dir='prefetch', jobs=4)

# Step 2: Convert to FASTQ
fq = ov.alignment.fqdump(['SRR1234567', 'SRR1234568'],
                          output_dir='fastq', sra_dir='prefetch',
                          gzip=True, threads=8, jobs=4)

FASTQ quality control (

ov.alignment.fastp

)

Runs fastp for adapter trimming, quality filtering, and QC reporting.
Supports single-end and paired-end reads.
Produces per-sample JSON and HTML QC reports.

Sample format: tuple of

(sample_name, fq1_path, fq2_path_or_None)

samples = [
    ('S1', 'fastq/SRR1234567/SRR1234567_1.fastq.gz', 'fastq/SRR1234567/SRR1234567_2.fastq.gz'),
    ('S2', 'fastq/SRR1234568/SRR1234568_1.fastq.gz', 'fastq/SRR1234568/SRR1234568_2.fastq.gz'),
]
clean = ov.alignment.fastp(samples, output_dir='fastp', threads=8, jobs=2)

STAR alignment (

ov.alignment.STAR

)

Aligns FASTQ reads using the STAR aligner.
Auto-index building: set
```
auto_index=True
```
(default) with
```
genome_fasta_files
```
and
```
gtf
```
to build index automatically if missing.
Produces coordinate-sorted BAM files.
Handles gzip-compressed FASTQs automatically (uses pigz/gzip/zcat).
Use
```
strict=False
```
(default) for graceful error handling per sample.

# Prepare samples from fastp output
star_samples = [
    ('S1', 'fastp/S1/S1_clean_1.fastq.gz', 'fastp/S1/S1_clean_2.fastq.gz'),
    ('S2', 'fastp/S2/S2_clean_1.fastq.gz', 'fastp/S2/S2_clean_2.fastq.gz'),
]
bams = ov.alignment.STAR(
    star_samples,
    genome_dir='star_index',
    output_dir='star_out',
    gtf='genes.gtf',
    genome_fasta_files=['genome.fa'],
    threads=8,
    memory='50G',
)

Gene quantification (

ov.alignment.featureCount

)

Counts aligned reads per gene using featureCounts (subread).
Auto-detects paired-end from BAM headers (via pysam or samtools).
```
auto_fix=True
```
(default) retries with corrected paired-end flag on error.
```
gene_mapping=True
```
maps gene_id to gene_name from the GTF.
```
merge_matrix=True
```
produces a combined count matrix across all samples.

bam_items = [
    ('S1', 'star_out/S1/Aligned.sortedByCoord.out.bam'),
    ('S2', 'star_out/S2/Aligned.sortedByCoord.out.bam'),
]
counts = ov.alignment.featureCount(
    bam_items,
    gtf='genes.gtf',
    output_dir='counts',
    gene_mapping=True,
    merge_matrix=True,
    threads=8,
)
# counts is a pandas DataFrame (gene_id x samples)

Single-cell path (

ov.alignment.ref

ov.alignment.count

)

Uses kb-python (kallisto + bustools) for single-cell RNA-seq quantification.
```
ref()
```
builds a kallisto index and transcript-to-gene mapping.
```
count()
```
quantifies single-cell data with barcode/UMI handling.
Supports technologies: 10XV2, 10XV3, BULK, and custom.
Output formats: h5ad, loom, cellranger MTX.

# Build reference index
ref_result = ov.alignment.ref(
    index_path='kb_ref/index.idx',
    t2g_path='kb_ref/t2g.txt',
    fasta_paths=['genome.fa'],
    gtf_paths=['genes.gtf'],
    threads=8,
)

# Quantify 10x v3 data
count_result = ov.alignment.count(
    index_path='kb_ref/index.idx',
    t2g_path='kb_ref/t2g.txt',
    technology='10XV3',
    fastq_paths=['sample_R1.fastq.gz', 'sample_R2.fastq.gz'],
    output_path='kb_out',
    h5ad=True,
    filter_barcodes=True,
    threads=8,
)

Wiring fastp output into STAR input

fastp output is a list of dicts with keys:
```
sample
```
,
```
clean1
```
,
```
clean2
```
,
```
json
```
,
```
html
```
.
Convert to STAR sample tuples:

star_samples = [
    (r['sample'], r['clean1'], r['clean2'] if r['clean2'] else None)
    for r in (clean if isinstance(clean, list) else [clean])
]

Wiring STAR output into featureCount input

STAR output is a list of dicts with keys:
```
sample
```
,
```
bam
```
(or
```
error
```
).
Convert to featureCount items:

bam_items = [
    (r['sample'], r['bam'])
    for r in (bams if isinstance(bams, list) else [bams])
    if 'bam' in r
]

Skipping completed steps
- All functions check for existing outputs and skip if
```
overwrite=False
```
  (default).
- Set
```
overwrite=True
```
  to force re-execution.
Troubleshooting
- If a tool is not found, check
```
auto_install=True
```
  and that conda/mamba is accessible.
- For STAR index errors, ensure
```
genome_fasta_files
```
  points to uncompressed or gzip FASTA files.
- For featureCounts paired-end detection errors,
```
auto_fix=True
```
  handles most cases automatically.
- GTF files can be gzip-compressed; they are auto-decompressed as needed.

Critical API Reference

Sample Format Convention

All alignment functions use a consistent sample tuple format:

FASTQ samples:

(sample_name, fq1_path, fq2_path_or_None)

BAM items:

(sample_name, bam_path)

(sample_name, bam_path, is_paired_bool)

Single samples can be passed as a single tuple; multiple as a list of tuples.
When a single tuple is passed, the return value is a single dict; for a list, a list of dicts.

Auto-installation

# All functions support these parameters:
auto_install=True   # Auto-install missing tools via conda/mamba
overwrite=False     # Skip if outputs already exist
threads=8           # Per-tool thread count
jobs=None           # Concurrent job count (auto-detected from CPU count)

Examples

Bulk RNA-seq from SRA:
```
prefetch
```
->
```
fqdump
```
->
```
fastp
```
->
```
STAR
```
->
```
featureCount
```
-> pandas DataFrame
Single-cell 10x v3:
```
ref
```
->
```
count
```
with
```
technology='10XV3'
```
-> h5ad AnnData
Local FASTQ files: Skip download steps, start directly with
```
fastp
```
->
```
STAR
```
->
```
featureCount
```

References

See reference.md for copy-paste-ready code templates.