BioSkills bio-workflows-edna-pipeline

End-to-end eDNA metabarcoding from raw amplicons to community ecology. Covers QC, primer removal, denoising with OBITools3 or DADA2, contamination filtering, taxonomy assignment, Hill number diversity, and constrained ordination. Use when processing environmental DNA samples for biodiversity assessment or ecological surveys.

install
source · Clone the upstream repo
git clone https://github.com/GPTomics/bioSkills
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/GPTomics/bioSkills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/workflows/edna-pipeline" ~/.claude/skills/gptomics-bioskills-bio-workflows-edna-pipeline && rm -rf "$T"
manifest: workflows/edna-pipeline/SKILL.md
source content

Version Compatibility

Reference examples tested with: DADA2 1.30+, FastQC 0.12+, MultiQC 1.21+, cutadapt 4.4+, phyloseq 1.46+, vegan 2.6+

Before using code patterns, verify installed versions match. If versions differ:

  • R:
    packageVersion('<pkg>')
    then
    ?function_name
    to verify parameters
  • CLI:
    <tool> --version
    then
    <tool> --help
    to confirm flags

If code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.

eDNA Metabarcoding Pipeline

"Process my eDNA samples from raw reads to community ecology" → Orchestrate primer removal, denoising (OBITools3 or DADA2), contamination filtering, taxonomy assignment, Hill number diversity estimation, and constrained ordination for species-environment analysis.

Complete workflow from raw amplicon sequences to community ecology analysis, supporting both OBITools3 and DADA2 processing paths.

Pipeline Overview

Raw amplicon FASTQ (demultiplexed)
    |
    v
[1. QC] ------------------> FastQC / MultiQC quality assessment
    |
    v
[2. Primer Removal] ------> Cutadapt (remove forward + reverse primers)
    |                            |
    |                            +---> QC: reads per sample >1000
    |
    +--- Path A: OBITools3     +--- Path B: DADA2
    |       |                  |       |
    |       v                  |       v
    |   [3a. obi alignpairedend]   [3b. filterAndTrim]
    |       |                  |       |
    |       v                  |       v
    |   [4a. obi uniq]        |   [4b. learnErrors + dada]
    |       |                  |       |
    |       v                  |       v
    |   [5a. obi ecotag]      |   [5b. assignTaxonomy]
    |                          |
    +-------- Merge -----------+
                |
                v
[6. Contamination Filter] -> decontam / microDecon (negative control removal)
    |
    v
[7. Taxonomy Table] -------> Species x sample matrix
    |
    v
[8. Diversity Analysis] ---> iNEXT Hill numbers (q=0,1,2)
    |
    v
[9. Community Comparison] -> vegan CCA/RDA + indicspecies
    |
    v
Species table + diversity metrics + ordination plots

Step 1: Quality Assessment

fastqc -t 8 -o fastqc_output/ raw_reads/*.fastq.gz
multiqc fastqc_output/ -o multiqc_report/

Step 2: Primer Removal

# Adapter sequences are marker-specific; examples below for common eDNA markers
# --discard-untrimmed: remove reads without primers (likely off-target)
# --minimum-length 50: discard very short fragments after trimming

# COI (Leray primers mlCOIintF / jgHCO2198)
cutadapt -g GGWACWGGWTGAACWGTWTAYCCYCC -G TAIACYTCIGGRTGICCRAARAAYCA \
    --discard-untrimmed --minimum-length 50 -j 8 \
    -o trimmed/{sample}_R1.fastq.gz -p trimmed/{sample}_R2.fastq.gz \
    raw_reads/{sample}_R1.fastq.gz raw_reads/{sample}_R2.fastq.gz

Common primer sets by marker:

MarkerForward PrimerReverse PrimerTarget
COImlCOIintFjgHCO2198Metazoan invertebrates
12S MiFishMiFish-U-FMiFish-U-RFish
ITS2ITS3ITS4Fungi
rbcLrbcLa-FrbcLa-RPlants
18S V91389F1510REukaryotes

QC Checkpoint: Demultiplexing

# Gate: reads per sample >1000; negative controls <100 reads
for f in trimmed/*_R1.fastq.gz; do
    sample=$(basename "$f" _R1.fastq.gz)
    count=$(zcat "$f" | awk 'END{print NR/4}')
    echo "$sample: $count reads"
done

Step 3: Paired-End Merging and Denoising

Path A: OBITools3

# Import paired FASTQ into OBITools3 DMS
obi import --fastq-input trimmed/reads_R1.fastq.gz reads/reads1
obi import --fastq-input trimmed/reads_R2.fastq.gz reads/reads2

# Paired-end alignment
obi alignpairedend -R reads/reads2 reads/reads1 reads/aligned

# Filter by alignment score
# score >= 50: removes poorly overlapping pairs
obi grep -p 'sequence["score"] >= 50' reads/aligned reads/filtered

# Filter by merged length (marker-dependent range)
# 100-500 bp: typical COI amplicon range
obi grep -p 'len(sequence) >= 100 and len(sequence) <= 500' \
    reads/filtered reads/length_filtered

# Dereplicate
obi uniq reads/length_filtered reads/dereplicated

# Remove singletons
# count >=2: removes sequencing errors; increase to 5-10 for noisy datasets
obi grep -p 'sequence["count"] >= 2' reads/dereplicated reads/denoised

# Denoise (remove PCR/sequencing errors)
# ratio 0.05: sequences <5% abundance of a 1-mismatch parent are merged
obi clean -s merged_sample -r 0.05 -H reads/denoised reads/cleaned

Path B: DADA2 (R)

library(dada2)

path <- 'trimmed/'
fnFs <- sort(list.files(path, pattern = '_R1.fastq.gz', full.names = TRUE))
fnRs <- sort(list.files(path, pattern = '_R2.fastq.gz', full.names = TRUE))
sample_names <- gsub('_R1.fastq.gz', '', basename(fnFs))

# Filter and trim
# truncLen: set based on quality profiles; marker-dependent
# maxEE c(2,2): max expected errors; standard for eDNA
# minLen 50: minimum after trimming
filtFs <- file.path('filtered', paste0(sample_names, '_F_filt.fastq.gz'))
filtRs <- file.path('filtered', paste0(sample_names, '_R_filt.fastq.gz'))
out <- filterAndTrim(fnFs, filtFs, fnRs, filtRs,
                     truncLen = c(200, 180), maxEE = c(2, 2),
                     minLen = 50, truncQ = 2, rm.phix = TRUE,
                     multithread = TRUE)

# Learn error rates
errF <- learnErrors(filtFs, multithread = TRUE)
errR <- learnErrors(filtRs, multithread = TRUE)

# Denoise
dadaFs <- dada(filtFs, err = errF, multithread = TRUE)
dadaRs <- dada(filtRs, err = errR, multithread = TRUE)

# Merge paired reads
# minOverlap 20: standard; increase if amplicon has short overlap region
merged <- mergePairs(dadaFs, filtFs, dadaRs, filtRs, minOverlap = 20)

# Build ASV table
seqtab <- makeSequenceTable(merged)

# Remove chimeras
# method 'consensus': standard; 'pooled' for higher sensitivity
seqtab_nochim <- removeBimeraDenovo(seqtab, method = 'consensus', multithread = TRUE)
chimera_rate <- 1 - sum(seqtab_nochim) / sum(seqtab)
message(sprintf('Chimera rate: %.1f%%', chimera_rate * 100))

QC Checkpoint: Denoising

# Gate 1: Chimera rate <20%
if (chimera_rate > 0.20) message('WARNING: High chimera rate. Check primer removal and PCR conditions.')

# Gate 2: ASV count reasonable for marker
n_asvs <- ncol(seqtab_nochim)
message(sprintf('ASVs after denoising: %d', n_asvs))
# Typical ranges: COI 500-5000, 12S 50-500, ITS 200-3000

Step 4: Contamination Filtering

R (decontam)

library(decontam)
library(phyloseq)

ps <- phyloseq(otu_table(seqtab_nochim, taxa_are_rows = FALSE),
               sample_data(meta))

# Identify negative controls
sample_data(ps)$is_neg <- sample_data(ps)$sample_type == 'negative_control'

# prevalence method: standard for eDNA; threshold 0.5 identifies contaminants
# present more in negative controls than real samples
contam <- isContaminant(ps, method = 'prevalence', neg = 'is_neg', threshold = 0.5)
message(sprintf('Contaminant ASVs: %d', sum(contam$contaminant)))

ps_clean <- prune_taxa(!contam$contaminant, ps)

# Remove negative control samples
ps_clean <- subset_samples(ps_clean, sample_type != 'negative_control')

Tag-jumping removal

# Tag-jumping: cross-contamination from index hopping during sequencing
# Remove ASVs with <0.1% of max abundance in a sample (likely tag-jump artifacts)
otu <- as(otu_table(ps_clean), 'matrix')
max_per_asv <- apply(otu, 2, max)
otu_filtered <- otu
for (j in 1:ncol(otu)) {
    # 0.1% of max: standard tag-jump threshold
    threshold <- max_per_asv[j] * 0.001
    otu_filtered[otu[, j] < threshold, j] <- 0
}
otu_table(ps_clean) <- otu_table(otu_filtered, taxa_are_rows = FALSE)

QC Checkpoint: Decontamination

# Gate: verify contaminant ASVs were removed from real samples
n_before <- ntaxa(ps)
n_after <- ntaxa(ps_clean)
message(sprintf('ASVs removed as contaminants: %d (%.1f%%)',
                n_before - n_after, (n_before - n_after) / n_before * 100))
if ((n_before - n_after) / n_before > 0.5) {
    message('WARNING: >50% ASVs flagged as contaminants. Review decontam threshold.')
}

Step 5: Taxonomy Assignment

OBITools3 (ecotag)

# ecotag assigns taxonomy using LCA algorithm against reference database
# Reference databases: EMBL, BOLD, MIDORI2, UNITE (marker-dependent)
obi ecotag -R reads/refdb --taxonomy reads/taxonomy reads/cleaned reads/assigned

# Filter by assignment quality (species-level for COI)
obi grep -p 'sequence["best_identity"] >= 0.97' reads/assigned reads/filtered_assigned

obi export --tab-output reads/filtered_assigned > taxonomy_results.tsv

DADA2 (assignTaxonomy)

# SILVA for 16S/18S, UNITE for ITS, custom for COI/12S
# Reference databases must be formatted for DADA2
# minBoot 50: minimum bootstrap confidence; 80 for more conservative assignments
taxa <- assignTaxonomy(seqtab_nochim, 'reference_db.fa.gz', multithread = TRUE, minBoot = 50)
taxa <- addSpecies(taxa, 'species_db.fa.gz')

Marker-specific taxonomy databases

MarkerDatabaseTypical Assignment Rate
COIBOLD / Midori2>90% to phylum, 60-80% to species
12SMitoFish / 12S-seqdb>90% to family for fish
ITSUNITE>80% to genus for fungi
rbcLGenBank / NCBI nt>85% to family for plants
18SSILVA / PR2>90% to phylum

QC Checkpoint: Taxonomy

# Gate: assignment rate should meet marker expectations
assigned <- !is.na(taxa[, 'Phylum'])
assignment_rate <- sum(assigned) / length(assigned) * 100
message(sprintf('Taxonomy assignment rate (phylum level): %.1f%%', assignment_rate))
if (assignment_rate < 60) message('WARNING: Low assignment rate. Check reference database completeness.')

Step 6: Diversity Analysis

R (iNEXT)

library(iNEXT)

otu_matrix <- as(otu_table(ps_clean), 'matrix')

# Hill numbers: q=0 (richness), q=1 (Shannon diversity), q=2 (Simpson diversity)
# endpoint: extrapolation to 2x observed sample size
inext_out <- iNEXT(as.list(as.data.frame(t(otu_matrix))),
                   q = c(0, 1, 2), datatype = 'abundance',
                   endpoint = 2 * max(rowSums(otu_matrix)))

# Sample completeness: fraction of estimated diversity observed
completeness <- inext_out$DataInfo$SC
message(sprintf('Sample completeness range: %.1f%% - %.1f%%',
                min(completeness) * 100, max(completeness) * 100))

QC Checkpoint: Diversity

# Gate 1: rarefaction approaching asymptote
if (min(completeness) < 0.80) {
    message('WARNING: Some samples have low completeness (<80%). Deeper sequencing recommended.')
}

# Gate 2: reasonable richness estimates
q0_estimates <- inext_out$AsyEst[inext_out$AsyEst$Diversity == 'Species richness', ]
message(sprintf('Estimated richness range: %d - %d species',
                min(q0_estimates$Estimator), max(q0_estimates$Estimator)))

Step 7: Community Comparison

R (vegan + indicspecies)

library(vegan)
library(indicspecies)

otu_matrix <- as(otu_table(ps_clean), 'matrix')
env_data <- as(sample_data(ps_clean), 'data.frame')

# Hellinger transformation: standard for community composition data
# Reduces influence of dominant species
otu_hell <- decostand(otu_matrix, method = 'hellinger')

# DCA on untransformed data to determine gradient length
dca <- decorana(otu_matrix)
gradient_length <- diff(range(scores(dca, display = 'sites', choices = 1)))
message(sprintf('DCA gradient length: %.2f SD', gradient_length))

# RDA: linear response (<=3 SD), uses Hellinger-transformed data
# CCA: unimodal response (>3 SD), uses raw abundances (chi-squared distance)
if (gradient_length <= 3) {
    ord <- rda(otu_hell ~ temperature + depth + season, data = env_data)
    method_name <- 'RDA'
} else {
    ord <- cca(otu_matrix ~ temperature + depth + season, data = env_data)
    method_name <- 'CCA'
}

# Permutation test for significance
# permutations 999: standard; increase to 9999 for publication
anova_result <- anova.cca(ord, permutations = 999)
message(sprintf('%s significance: p = %.4f', method_name, anova_result$`Pr(>F)`[1]))

# Indicator species analysis
# Groups defined by environmental category (e.g., site, season, habitat)
indval <- multipatt(otu_matrix, env_data$site, control = how(nperm = 999))
summary(indval, alpha = 0.05)

Parameter Recommendations

StepParameterRecommendation
Cutadapt--discard-untrimmedAlways use; removes off-target reads
Cutadapt--minimum-length50 (general); adjust per expected amplicon size
DADA2truncLenSet from quality profiles; marker-dependent
DADA2maxEEc(2,2) standard; c(5,5) for degraded eDNA
DADA2minOverlap20 (standard); increase for short overlaps
OBITools3--min-count2 (removes singletons); 5-10 for noisy datasets
decontamthreshold0.5 (prevalence method); 0.1 for stringent
Tag-jumpingthreshold0.1% of max abundance per ASV
TaxonomyminBoot50 (sensitive); 80 (conservative)
ecotag--min-identity0.97 (COI species); 0.95 (genus); marker-dependent
iNEXTendpoint2x max observed sample size
veganpermutations999 (standard); 9999 (publication)

Troubleshooting

IssueLikely CauseSolution
Few reads after primer removalWrong primer sequences or orientationVerify primer sequences; try --revcomp
High chimera rate (>20%)Excessive PCR cycles or low-quality inputReduce PCR cycles; improve DNA extraction
Many unassigned ASVsIncomplete reference databaseUse marker-specific database; lower minBoot
Contamination in negativesTag-jumping or lab contaminationApply tag-jump filter; review extraction protocol
Low sample completenessInsufficient sequencing depthIncrease sequencing; pool fewer samples
Ordination axes not significantWeak environmental gradientsAdd more environmental variables; check sample size
Unexpected taxa (e.g., human)Sample contaminationFilter known contaminants; review field protocols
Very few ASVsOver-aggressive filteringRelax truncLen, maxEE, or min-count thresholds

Related Skills

  • ecological-genomics/edna-metabarcoding - Detailed eDNA processing
  • ecological-genomics/biodiversity-metrics - Diversity analysis details
  • ecological-genomics/community-ecology - Ordination and indicator species
  • read-qc/quality-reports - Raw read quality assessment
  • microbiome/amplicon-processing - 16S clinical alternative