LLMs-Universal-Life-Science-and-Clinical-Skills- metaphlan-profiling

name: bio-metagenomics-metaphlan
description: Marker gene-based taxonomic profiling using MetaPhlAn 4. Provides accurate species-level relative abundances using clade-specific markers. Use when accurate taxonomic profiling is needed and computational resources are limited, or for comparison with HMP/other MetaPhlAn studies.
tool_type: cli
primary_tool: metaphlan
measurable_outcome: Execute skill workflow successfully with valid output within 15 minutes.
allowed-tools:
- read_file
- run_shell_command

MetaPhlAn 4 Profiling

MetaPhlAn 4 uses ~5M clade-specific markers from 26,970 species-level genome bins (SGBs). It supports both short reads (mapped with bowtie2) and long reads (mapped with minimap2).

Basic Profiling

# Profile single sample
metaphlan sample.fastq.gz \
    --input_type fastq \
    --output_file profile.txt

Paired-End Reads

# MetaPhlAn does not use pairing information: pass both mate files as a
# comma-separated list (or as one concatenated file)
metaphlan reads_R1.fastq.gz,reads_R2.fastq.gz \
    --input_type fastq \
    --output_file profile.txt \
    --mapout sample.map.bz2

Save Mapping Output for Reuse (MetaPhlAn 4.2+)

# First run - save intermediate mapping
metaphlan sample.fastq.gz \
    --input_type fastq \
    --mapout sample.map.bz2 \
    --output_file profile.txt

# Rerun with different settings without realigning
metaphlan sample.map.bz2 \
    --input_type mapout \
    --output_file profile_v2.txt
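
The reuse pattern above can be scripted so that alignment happens only once per sample. The following is a minimal sketch in Python, not part of MetaPhlAn: `metaphlan_args` is a hypothetical helper name, and it uses only the flags shown in this section. It builds the argument list and does not run the tool.

```python
from pathlib import Path


def metaphlan_args(fastq, mapout, output):
    """Return a metaphlan argument list, reusing a saved mapping if present.

    Hypothetical helper: only flags shown in this section are used.
    """
    if Path(mapout).exists():
        # Mapping already saved: profile from it and skip realignment.
        return ["metaphlan", str(mapout), "--input_type", "mapout",
                "--output_file", str(output)]
    # First run: align the reads and save the mapping for later reruns.
    return ["metaphlan", str(fastq), "--input_type", "fastq",
            "--mapout", str(mapout), "--output_file", str(output)]
```

The returned list can be passed to `subprocess.run(args, check=True)` once MetaPhlAn is installed and on the PATH.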

Long-Read Support (MetaPhlAn 4+)

# Long reads are mapped with minimap2 instead of bowtie2.
# Some releases may need an explicit long-read option; confirm with `metaphlan --help`.
metaphlan long_reads.fastq.gz \
    --input_type fastq \
    --output_file profile.txt

Common Options

# --nproc       CPU threads
# --tax_lev     Taxonomic level (a, k, p, c, o, f, g, s, t)
# --min_cu_len  Min total marker nucleotide length for a clade
# --stat_q      Quantile for the robust average
metaphlan sample.fastq.gz \
    --input_type fastq \
    --nproc 8 \
    --tax_lev s \
    --min_cu_len 2000 \
    --stat_q 0.2 \
    --output_file profile.txt \
    --mapout sample.map.bz2
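
To give some intuition for what a quantile-based robust average means, here is a small Python sketch of a truncated (trimmed) mean: the lowest and highest `stat_q` fraction of values are dropped before averaging, so a few outlier markers cannot dominate the estimate. This illustrates the general idea only. It is an assumption that this resembles what `--stat_q` controls, and it is not MetaPhlAn's implementation; `truncated_average` is a hypothetical name.

```python
def truncated_average(values, stat_q=0.2):
    """Mean of the values after trimming a stat_q fraction from each tail.

    Illustrative only: not MetaPhlAn's actual abundance estimator.
    """
    if not values or not 0 <= stat_q < 0.5:
        raise ValueError("need at least one value and 0 <= stat_q < 0.5")
    ordered = sorted(values)
    k = int(stat_q * len(ordered))  # number of values trimmed from each tail
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)
```

With ten per-marker values where one is an extreme outlier, trimming at 0.2 removes the two lowest and two highest values and averages the six in the middle.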

Install Database

# Download database (done automatically on first run)
metaphlan --install

# Or specify database location (MetaPhlAn 4.2+)
metaphlan --install --db_dir /path/to/db

Analysis Types

# Relative abundances (default)
metaphlan sample.fastq.gz --input_type fastq -t rel_ab

# Relative abundances with read counts
metaphlan sample.fastq.gz --input_type fastq -t rel_ab_w_read_stats

# Marker presence/absence
metaphlan sample.fastq.gz --input_type fastq -t marker_pres_table

# Marker abundances
metaphlan sample.fastq.gz --input_type fastq -t marker_ab_table

Multiple Samples

# Process each sample
mkdir -p profiles mapout
for fq in samples/*.fastq.gz; do
    sample=$(basename "$fq" .fastq.gz)
    metaphlan "$fq" \
        --input_type fastq \
        --nproc 4 \
        --output_file "profiles/${sample}_profile.txt" \
        --mapout "mapout/${sample}.map.bz2"
done

# Merge profiles
merge_metaphlan_tables.py profiles/*_profile.txt > merged_abundance.txt
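
For cases where the merged table is needed inside a Python workflow, the merge step can be approximated with the standard library. This is a sketch under stated assumptions, not a reimplementation of `merge_metaphlan_tables.py`: the function names are hypothetical, and it assumes each profile has a `#clade_name` header line naming a `relative_abundance` column (falling back to the second column if the header is absent).

```python
def read_profile(text):
    """Return {clade_name: relative_abundance} from one profile's text."""
    column = 1  # assumed position when no '#clade_name' header is present
    abundances = {}
    for line in text.splitlines():
        fields = line.split("\t")
        if line.startswith("#clade_name"):
            # Locate the abundance column by name; extra columns may exist.
            column = fields.index("relative_abundance")
        elif line and not line.startswith("#"):
            abundances[fields[0]] = float(fields[column])
    return abundances


def merge_profiles(profiles):
    """Merge {sample: profile_text} into {clade: {sample: abundance}}.

    Clades missing from a sample are filled with 0.0.
    """
    samples = list(profiles)
    table = {}
    for sample, text in profiles.items():
        for clade, value in read_profile(text).items():
            table.setdefault(clade, dict.fromkeys(samples, 0.0))[sample] = value
    return table
```

Filling absent clades with 0.0 keeps every row the same width, which is convenient when the table is later written out or loaded into a data frame.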

Filter by Taxonomic Level

# Species only
metaphlan sample.fastq.gz --input_type fastq --tax_lev s -o species.txt

# Genus only
metaphlan sample.fastq.gz --input_type fastq --tax_lev g -o genus.txt

# All levels (default)
metaphlan sample.fastq.gz --input_type fastq --tax_lev a -o all_levels.txt

Output Format

#SampleID	sample
#clade_name	relative_abundance
k__Bacteria	100.0
k__Bacteria|p__Proteobacteria	65.23
k__Bacteria|p__Proteobacteria|c__Gammaproteobacteria	62.15
k__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Enterobacterales	58.42
k__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Enterobacterales|f__Enterobacteriaceae	55.21
k__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Enterobacterales|f__Enterobacteriaceae|g__Escherichia	52.33
k__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Enterobacterales|f__Enterobacteriaceae|g__Escherichia|s__Escherichia_coli	52.33
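
Each clade name is a `|`-separated lineage whose parts carry a one-letter rank prefix (`k__`, `p__`, and so on). A minimal Python sketch of splitting such a name into ranks follows; the helper names and the `RANKS` labels are illustrative choices, not part of MetaPhlAn.

```python
# Rank prefixes as they appear in clade names; labels are illustrative.
RANKS = {"k": "kingdom", "p": "phylum", "c": "class", "o": "order",
         "f": "family", "g": "genus", "s": "species", "t": "sgb"}


def parse_clade(clade_name):
    """Split 'k__A|p__B|...' into {'kingdom': 'A', 'phylum': 'B', ...}."""
    lineage = {}
    for part in clade_name.split("|"):
        prefix, sep, name = part.partition("__")
        if sep:  # skip parts without a rank prefix
            lineage[RANKS.get(prefix, prefix)] = name
    return lineage


def deepest_rank(clade_name):
    """Return the rank prefix of the last lineage element, e.g. 's'."""
    return clade_name.split("|")[-1].partition("__")[0]
```

Filtering rows on `deepest_rank(name) == "g"` gives genus-level rows from an all-levels profile without rerunning MetaPhlAn.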

Parse Output in Python

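import pandas as pd

# Take column names from the '#clade_name ...' header line, so this also works
# when extra columns (e.g. NCBI_tax_id, additional_species) are present.
with open('profile.txt') as fh:
    header = next(line for line in fh if line.startswith('#clade_name'))
names = header.lstrip('#').rstrip('\n').split('\t')

profile = pd.read_csv('profile.txt', sep='\t', comment='#', header=None, names=names)

# Keep rows whose deepest rank is species (this excludes t__ rows)
last_rank = profile['clade_name'].str.split('|').str[-1]
species = profile[last_rank.str.startswith('s__')].copy()
species['species'] = last_rank.str.replace('s__', '', regex=False)
print(species.sort_values('relative_abundance', ascending=False).head(20))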

Extract SGBs (t__ Level)

# Report only the t__ level, which holds the species-level genome bins (SGBs)
metaphlan sample.fastq.gz \
    --input_type fastq \
    --tax_lev t \
    --output_file profile_with_sgb.txt
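
Rows at the `t__` level end in an SGB identifier that follows the species name in the lineage. The sketch below groups SGB identifiers by species from a list of clade names. `species_to_sgbs` is a hypothetical helper, and the identifiers in the usage example are made up for illustration.

```python
def species_to_sgbs(clade_names):
    """Map species name -> list of SGB identifiers found in t__ rows."""
    mapping = {}
    for clade in clade_names:
        parts = clade.split("|")
        # Only lineages ending in ...|s__<species>|t__<sgb> are considered.
        if (len(parts) >= 2 and parts[-1].startswith("t__")
                and parts[-2].startswith("s__")):
            species = parts[-2][len("s__"):]
            mapping.setdefault(species, []).append(parts[-1][len("t__"):])
    return mapping


# Illustrative input; the SGB identifiers here are invented.
example = [
    "k__Bacteria|g__Escherichia|s__Escherichia_coli",
    "k__Bacteria|g__Escherichia|s__Escherichia_coli|t__SGB_example_1",
    "k__Bacteria|g__Escherichia|s__Escherichia_coli|t__SGB_example_2",
]
print(species_to_sgbs(example))
```

Rows that stop at the species level are ignored, so the same function can be applied to an all-levels profile.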

Sample Metadata in Output

# Add sample ID to output
metaphlan sample.fastq.gz \
    --input_type fastq \
    --sample_id sample_name \
    --output_file profile.txt

Key Parameters (MetaPhlAn 4.2+)

Parameter	Default	Description
--input_type	fastq	Input format (fastq, mapout)
--nproc	4	CPU threads
--tax_lev	a	Taxonomic level (a=all)
--stat_q	0.2	Quantile value
--min_cu_len	2000	Min clade length
-t	rel_ab	Analysis type
--mapout	none	Save mapping output
--db_dir	default	Database directory

Note: Unknown species estimation is now enabled by default in MetaPhlAn 4.2+

Analysis Types (-t)

Type	Description
rel_ab	Relative abundances (%)
rel_ab_w_read_stats	With read statistics
marker_pres_table	Marker presence/absence
marker_ab_table	Marker abundances
clade_specific_strain_tracker	Strain tracking

Related Skills

kraken-classification - Alternative k-mer based classification
abundance-estimation - Bracken for Kraken2 abundances
metagenome-visualization - Visualize profiles