install
source · Clone the upstream repo
git clone https://github.com/mdbabumiamssm/LLMs-Universal-Life-Science-and-Clinical-Skills-
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/mdbabumiamssm/LLMs-Universal-Life-Science-and-Clinical-Skills- "$T" && mkdir -p ~/.claude/skills && cp -r "$T/Skills/Genomics/Metagenomics/bioSkills/kraken-classification" ~/.claude/skills/mdbabumiamssm-llms-universal-life-science-and-clinical-skills-kraken-classificat && rm -rf "$T"
manifest:
Skills/Genomics/Metagenomics/bioSkills/kraken-classification/SKILL.md
<!--
# COPYRIGHT NOTICE
# This file is part of the "Universal Biomedical Skills" project.
# Copyright (c) 2026 MD BABU MIA, PhD <md.babu.mia@mssm.edu>
# All Rights Reserved.
#
# This code is proprietary and confidential.
# Unauthorized copying of this file, via any medium is strictly prohibited.
#
# Provenance: Authenticated by MD BABU MIA
-->
name: bio-metagenomics-kraken
description: Taxonomic classification of metagenomic reads using Kraken2. Fast k-mer based classification against RefSeq database. Use when performing initial taxonomic classification of shotgun metagenomic reads before abundance estimation with Bracken.
tool_type: cli
primary_tool: kraken2
measurable_outcome: Execute skill workflow successfully with valid output within 15 minutes.
allowed-tools:
- read_file
- run_shell_command
Kraken2 Classification
Basic Classification
    # Classify reads against standard database
    kraken2 --db /path/to/kraken2_db \
        --output output.kraken \
        --report report.txt \
        reads.fastq.gz
Paired-End Reads
    kraken2 --db /path/to/kraken2_db \
        --paired \
        --output output.kraken \
        --report report.txt \
        reads_R1.fastq.gz reads_R2.fastq.gz
Common Options
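    # --threads               CPU threads
    # --confidence            Confidence threshold
    # --minimum-base-quality  Quality filter
    # --use-names             Add taxon names to output
    # --gzip-compressed       Input is gzipped
    # (comments are kept above the command: a comment after a trailing
    # backslash would break the line continuation)
    kraken2 --db /path/to/kraken2_db \
        --threads 8 \
        --confidence 0.1 \
        --minimum-base-quality 20 \
        --output output.kraken \
        --report report.txt \
        --use-names \
        --gzip-compressed \
        reads.fastq.gz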
Memory-Efficient Mode
    # For systems with limited RAM:
    # --memory-mapping reads the database from disk instead of loading it into RAM
    kraken2 --db /path/to/kraken2_db \
        --memory-mapping \
        --output output.kraken \
        --report report.txt \
        reads.fastq.gz
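Whether `--memory-mapping` is needed depends on how the database size compares with available RAM. The sketch below is one hedged way to make that decision: it sums the `*.k2d` files in a built database directory and compares the total with a RAM figure the caller supplies. The function names and the 90% headroom factor are illustrative assumptions, not part of Kraken2.

```python
from pathlib import Path


def kraken2_db_size_bytes(db_dir):
    """Sum the sizes of the *.k2d files in a built Kraken2 database directory."""
    return sum(p.stat().st_size for p in Path(db_dir).glob("*.k2d"))


def should_use_memory_mapping(db_bytes, available_ram_bytes, headroom=0.9):
    """Return True when the database exceeds `headroom` of the available RAM.

    `headroom` is an arbitrary safety margin (assumption), leaving room for
    the operating system and the reads being processed.
    """
    return db_bytes > available_ram_bytes * headroom
```

The available-RAM figure is left to the caller because the way to query it differs between platforms.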
Report Only (No Per-Read Output)
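    # Save space by not writing per-read classifications.
    # Without --output, Kraken2 prints per-read results to stdout; "--output -" suppresses them.
    # --report-zero-counts includes taxa with 0 counts in the report.
    kraken2 --db /path/to/kraken2_db \
        --output - \
        --report report.txt \
        --report-zero-counts \
        reads.fastq.gz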
Classified/Unclassified Output
    # Separate classified and unclassified reads.
    # With --paired, the # in each filename is replaced by _1 and _2 for the two mates.
    kraken2 --db /path/to/kraken2_db \
        --classified-out classified#.fq \
        --unclassified-out unclassified#.fq \
        --output output.kraken \
        --report report.txt \
        --paired \
        reads_R1.fastq.gz reads_R2.fastq.gz
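Downstream steps usually need the actual filenames Kraken2 writes for paired-end output. The helper below is a minimal sketch of the `#` substitution, assuming exactly one `#` in the pattern; `paired_output_names` is a hypothetical name introduced here for illustration.

```python
def paired_output_names(pattern):
    """Return the two filenames expected for a paired-end output pattern.

    Assumes the pattern contains exactly one '#', which Kraken2 replaces
    with '_1' and '_2' when --paired is used.
    """
    if pattern.count("#") != 1:
        raise ValueError("pattern must contain exactly one '#'")
    return pattern.replace("#", "_1"), pattern.replace("#", "_2")
```

For example, `paired_output_names("classified#.fq")` gives the two names to look for after the run above completes.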
Build Custom Database
    # Download taxonomy
    kraken2-build --download-taxonomy --db custom_db

    # Download specific libraries
    kraken2-build --download-library bacteria --db custom_db
    kraken2-build --download-library archaea --db custom_db
    kraken2-build --download-library viral --db custom_db

    # Build database
    kraken2-build --build --db custom_db --threads 8

    # Clean up intermediate files
    kraken2-build --clean --db custom_db
Add Custom Sequences
    # Add FASTA sequences to library
    kraken2-build --add-to-library custom_genomes.fasta --db custom_db

    # Then build
    kraken2-build --build --db custom_db
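Sequences added with `--add-to-library` need a taxonomy assignment, either through an NCBI accession in the sequence ID or through an explicit `kraken:taxid|<id>` tag in the header. The sketch below shows one way to add that tag before running `kraken2-build`; `tag_fasta_with_taxid` is a hypothetical helper, and it assumes every sequence in the file belongs to the same taxon.

```python
def tag_fasta_with_taxid(lines, taxid):
    """Yield FASTA lines with '|kraken:taxid|<taxid>' appended to each sequence ID.

    Headers that already carry a kraken:taxid tag are left unchanged.
    Assumes all sequences in the input share one taxon (illustrative only).
    """
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # Split the sequence ID from the free-text description
            seq_id, _, description = line[1:].partition(" ")
            if "kraken:taxid|" not in seq_id:
                seq_id = f"{seq_id}|kraken:taxid|{taxid}"
            line = ">" + seq_id + (" " + description if description else "")
        yield line + "\n"
```

A typical use would read `custom_genomes.fasta` line by line and write the tagged lines to a new file that is then passed to `--add-to-library`.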
Inspect Database
    # View database contents
    kraken2-inspect --db /path/to/kraken2_db | head -50
Report Format
    17.45  1745  1745  U   0       unclassified
    82.55  8255  48    R   1       root
    82.07  8207  2     R1  131567  cellular organisms
    81.99  8199  132   D   2       Bacteria
    76.23  7623  178   P   1224    Proteobacteria
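Columns (tab-separated, left to right):
- Percentage of reads in the clade rooted at this taxon
- Number of reads in the clade rooted at this taxon
- Number of reads assigned directly to this taxon
- Rank code (U, R, D, K, P, C, O, F, G, S); a numeric suffix such as R1 marks a taxon between standard ranks, counted from the closest ancestor rank
- NCBI taxon ID
- Scientific name (indented by taxonomic depth)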
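The column definitions above imply a simple consistency check: the clade counts for `unclassified` (taxon ID 0) and `root` (taxon ID 1) should add up to the total number of reads, and each percentage should equal its clade count divided by that total. The following is a standard-library sketch under the assumption of the six-column report shown here (not the wider format produced with minimizer data); `ReportRow`, `parse_report`, and `total_reads` are names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class ReportRow:
    pct: float
    reads_clade: int
    reads_taxon: int
    rank: str
    taxid: int
    name: str


def parse_report(text):
    """Parse a six-column, tab-separated Kraken2 report into ReportRow objects."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue
        pct, clade, taxon, rank, taxid, name = line.split("\t", 5)
        rows.append(ReportRow(float(pct), int(clade), int(taxon),
                              rank.strip(), int(taxid), name.strip()))
    return rows


def total_reads(rows):
    """Total reads = clade count of 'unclassified' (taxid 0) + 'root' (taxid 1)."""
    return sum(r.reads_clade for r in rows if r.taxid in (0, 1))


# The example rows from the report above, tab-separated
SAMPLE = (
    " 17.45\t1745\t1745\tU\t0\tunclassified\n"
    " 82.55\t8255\t48\tR\t1\troot\n"
    " 82.07\t8207\t2\tR1\t131567\t  cellular organisms\n"
    " 81.99\t8199\t132\tD\t2\t    Bacteria\n"
    " 76.23\t7623\t178\tP\t1224\t      Proteobacteria\n"
)

rows = parse_report(SAMPLE)
total = total_reads(rows)  # 1745 + 8255
for row in rows:
    # Each percentage should match clade reads / total reads
    assert abs(row.pct - 100 * row.reads_clade / total) < 0.01
```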
Parse Kraken Output in Python
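    import pandas as pd

    # Kraken2 reports are tab-separated with no header row
    report = pd.read_csv('report.txt', sep='\t', header=None,
                         names=['pct', 'reads_clade', 'reads_taxon', 'rank', 'taxid', 'name'])
    # Names are indented by taxonomic depth; strip the padding
    report['name'] = report['name'].str.strip()

    # Keep species-level rows, most abundant first
    species = report[report['rank'] == 'S']
    species_sorted = species.sort_values('pct', ascending=False)
    print(species_sorted.head(20))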
Filter Report by Rank
    # Get only species-level classifications
    awk '$4 == "S"' report.txt > species_report.txt

    # Get genus level
    awk '$4 == "G"' report.txt > genus_report.txt
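The exact match `$4 == "S"` skips rows with sub-rank codes such as `S1`. The sketch below is a Python take on the same rank filter that can optionally include those sub-ranks and drop low-count taxa. `filter_report_lines` and its parameters are hypothetical names, and the six-column report layout is assumed.

```python
def filter_report_lines(lines, rank, include_subranks=False, min_clade_reads=0):
    """Return report lines whose rank code matches `rank`.

    include_subranks: also keep codes like 'S1', 'S2' when rank is 'S'.
    min_clade_reads:  drop taxa whose clade read count is below this value.
    """
    kept = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 6:
            continue  # skip blank or malformed lines
        code = fields[3].strip()
        if include_subranks:
            suffix = code[len(rank):]
            matches = code.startswith(rank) and (suffix == "" or suffix.isdigit())
        else:
            matches = code == rank
        if matches and int(fields[1]) >= min_clade_reads:
            kept.append(line)
    return kept
```

Calling it with `open("report.txt")` as `lines` and `rank="S"` reproduces the first `awk` command above; adding `include_subranks=True` also keeps the sub-rank rows.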
Key Parameters
| Parameter | Default | Description |
|---|---|---|
| --db | required | Database path |
| --threads | 1 | CPU threads |
| --confidence | 0.0 | Confidence threshold (0-1) |
| --minimum-base-quality | 0 | Phred quality threshold |
| --memory-mapping | false | Read the database from disk instead of loading it into RAM |
| --paired | false | Paired-end mode |
| --use-names | false | Include taxon names |
| --report-zero-counts | false | Include 0-count taxa |
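When Kraken2 is driven from a script, the parameters in this table can be assembled into an argument list programmatically. The sketch below is one hedged way to do that: `build_kraken2_command` is a hypothetical helper, its defaults mirror the table, and it only builds the command without running it.

```python
def build_kraken2_command(db, reads, report, output, threads=1, confidence=0.0,
                          minimum_base_quality=0, memory_mapping=False,
                          paired=False, use_names=False, report_zero_counts=False):
    """Build a kraken2 argument list from the parameters in the table above."""
    reads = [str(r) for r in reads]
    if paired and len(reads) != 2:
        raise ValueError("paired mode needs exactly two read files")
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")

    cmd = [
        "kraken2",
        "--db", str(db),
        "--threads", str(threads),
        "--confidence", str(confidence),
        "--minimum-base-quality", str(minimum_base_quality),
        "--output", str(output),  # pass "-" to suppress per-read output
        "--report", str(report),
    ]
    # Boolean switches are appended only when enabled
    for enabled, flag in [
        (memory_mapping, "--memory-mapping"),
        (paired, "--paired"),
        (use_names, "--use-names"),
        (report_zero_counts, "--report-zero-counts"),
    ]:
        if enabled:
            cmd.append(flag)
    return cmd + reads
```

The returned list can be passed to `subprocess.run(cmd, check=True)`; building a list rather than a shell string avoids quoting problems with paths that contain spaces.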
Database Libraries
| Library | Content |
|---|---|
| bacteria | RefSeq complete bacterial genomes |
| archaea | RefSeq complete archaeal genomes |
| viral | RefSeq complete viral genomes |
| plasmid | RefSeq plasmid nucleotide sequences |
| human | GRCh38 human genome |
| fungi | RefSeq fungi |
| protozoa | RefSeq protozoa |
| UniVec_Core | Common vector sequences |
Related Skills
- abundance-estimation - Estimate abundances with Bracken
- metaphlan-profiling - Alternative marker-based profiling
- metagenome-visualization - Visualize results