SciAgent-Skills pyopenms-mass-spectrometry

Mass spectrometry data processing with PyOpenMS. Use for LC-MS/MS proteomics and metabolomics workflows — mzML/mzXML file I/O, signal processing (smoothing, peak picking, centroiding), feature detection and linking across samples, peptide/protein identification with FDR control, untargeted metabolomics pipelines. For simple spectral matching and metabolite ID, use matchms instead.

install
source · Clone the upstream repo
git clone https://github.com/jaechang-hits/SciAgent-Skills
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/jaechang-hits/SciAgent-Skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/proteomics-protein-engineering/pyopenms-mass-spectrometry" ~/.claude/skills/jaechang-hits-sciagent-skills-pyopenms-mass-spectrometry && rm -rf "$T"
manifest: skills/proteomics-protein-engineering/pyopenms-mass-spectrometry/SKILL.md
source content

PyOpenMS — Mass Spectrometry Analysis

Overview

PyOpenMS provides Python bindings to the OpenMS C++ library for computational mass spectrometry. It supports proteomics and metabolomics data processing including file I/O for 10+ MS formats, signal processing, feature detection, peptide/protein identification, and quantitative analysis across samples.

When to Use

  • Processing raw LC-MS/MS data (mzML, mzXML) for proteomics or metabolomics
  • Detecting chromatographic features and linking them across multiple samples
  • Identifying peptides and proteins from MS/MS search engine results with FDR control
  • Running untargeted metabolomics workflows (peak picking → feature detection → alignment → annotation)
  • Converting between mass spectrometry file formats (mzML, mzXML, featureXML, idXML)
  • Smoothing, filtering, and centroiding raw spectral data
  • For simple spectral library matching and metabolite identification, use matchms instead
  • For protein sequence analysis (not mass spec), use biopython instead

Prerequisites

uv pip install pyopenms numpy pandas matplotlib
  • Python 3.8+; NumPy for peak array operations
  • Input data: mzML files (standard MS format), FASTA databases (for identification)
  • All algorithms follow a consistent pattern:
    algo = Algorithm(); params = algo.getParameters(); params.setValue(...); algo.setParameters(params)

Quick Start

import pyopenms as ms

# Load mzML file
exp = ms.MSExperiment()
ms.MzMLFile().load("sample.mzML", exp)
print(f"Spectra: {exp.getNrSpectra()}, Chromatograms: {exp.getNrChromatograms()}")

# Examine first spectrum
spec = exp.getSpectrum(0)
mz, intensity = spec.get_peaks()
print(f"MS level: {spec.getMSLevel()}, RT: {spec.getRT():.2f}s, Peaks: {len(mz)}")

# Quick preprocessing: smooth + centroid
gauss = ms.GaussFilter()
p = gauss.getParameters(); p.setValue("gaussian_width", 0.1); gauss.setParameters(p)
gauss.filterExperiment(exp)

picker = ms.PeakPickerHiRes()
centroided = ms.MSExperiment()
picker.pickExperiment(exp, centroided)
print(f"Centroided spectra: {centroided.getNrSpectra()}")

Core API

Module 1: File I/O & Data Access

Read and write mass spectrometry data in multiple formats.

import pyopenms as ms

# Read mzML (standard MS format)
exp = ms.MSExperiment()
ms.MzMLFile().load("data.mzML", exp)

# Indexed access for large files (memory-efficient)
loader = ms.IndexedMzMLFileLoader()
indexed_file = ms.OnDiscMSExperiment()
loader.load("large_data.mzML", indexed_file)
spec = indexed_file.getSpectrum(0)  # Load single spectrum on demand
print(f"Total spectra: {indexed_file.getNrSpectra()}")

# Read identification results (idXML)
protein_ids, peptide_ids = [], []
ms.IdXMLFile().load("results.idXML", protein_ids, peptide_ids)
print(f"Peptide IDs: {len(peptide_ids)}, Protein IDs: {len(protein_ids)}")

# Read feature map
fm = ms.FeatureMap()
ms.FeatureXMLFile().load("features.featureXML", fm)
print(f"Features: {fm.size()}")
# Write mzML with compression
exp_out = ms.MSExperiment()
# ... populate experiment ...
ms.MzMLFile().store("output.mzML", exp_out)

# Read FASTA database
entries = []
ms.FASTAFile().load("database.fasta", entries)
print(f"Proteins in DB: {len(entries)}")
for e in entries[:3]:
    print(f"  {e.identifier}: {e.sequence[:30]}...")

Supported formats: mzML, mzXML, mzData (spectra); featureXML, consensusXML (features); idXML, mzIdentML, pepXML (identifications); TraML (transitions); mzTab (results); FASTA (sequences)

Module 2: Signal Processing & Peak Picking

Preprocess raw spectral data for downstream analysis.

import pyopenms as ms

exp = ms.MSExperiment()
ms.MzMLFile().load("raw.mzML", exp)

# Gaussian smoothing
gauss = ms.GaussFilter()
p = gauss.getParameters()
p.setValue("gaussian_width", 0.15)  # m/z width
gauss.setParameters(p)
gauss.filterExperiment(exp)

# Savitzky-Golay smoothing (alternative)
sg = ms.SavitzkyGolayFilter()
p = sg.getParameters()
p.setValue("frame_length", 15)  # Must be odd
sg.setParameters(p)
# sg.filterExperiment(exp)  # Use one smoother, not both

# Peak picking (centroiding) — required before feature detection
picker = ms.PeakPickerHiRes()
p = picker.getParameters()
p.setValue("signal_to_noise", 1.0)
picker.setParameters(p)
centroided = ms.MSExperiment()
picker.pickExperiment(exp, centroided)
print(f"Raw peaks in spec 0: {exp.getSpectrum(0).size()}")
print(f"Centroided peaks: {centroided.getSpectrum(0).size()}")
# Normalization
normalizer = ms.Normalizer()
p = normalizer.getParameters()
p.setValue("method", "to_one")  # "to_one" or "to_TIC"
normalizer.setParameters(p)
normalizer.filterPeakMap(centroided)

# Peak filtering — remove low-intensity noise
mower = ms.ThresholdMower()
p = mower.getParameters()
p.setValue("threshold", 100.0)  # Minimum intensity
mower.setParameters(p)
mower.filterPeakMap(centroided)

# Baseline reduction
morph = ms.MorphologicalFilter()
p = morph.getParameters()
p.setValue("struc_elem_length", 3.0)  # m/z window
morph.setParameters(p)
morph.filterExperiment(exp)

Module 3: Feature Detection & Linking

Detect chromatographic features and link them across samples.

import pyopenms as ms

# Load centroided data
exp = ms.MSExperiment()
ms.MzMLFile().load("centroided.mzML", exp)

# Feature detection (proteomics — centroided data)
ff = ms.FeatureFinder()
features = ms.FeatureMap()
seeds = ms.FeatureMap()
params = ms.FeatureFinder().getParameters("centroided")
ff.run("centroided", exp, features, params, seeds)
print(f"Detected {features.size()} features")

# Access feature properties
for f in features[:5]:
    print(f"  RT: {f.getRT():.1f}s, m/z: {f.getMZ():.4f}, "
          f"intensity: {f.getIntensity():.0f}, quality: {f.getOverallQuality():.3f}")
# Feature linking across samples — align retention times first
aligner = ms.MapAlignmentAlgorithmPoseClustering()
p = aligner.getParameters()
p.setValue("max_num_peaks_considered", 1000)
aligner.setParameters(p)

# Link features into consensus map
linker = ms.FeatureGroupingAlgorithmQT()
p = linker.getParameters()
p.setValue("distance_RT:max_difference", 60.0)  # seconds
p.setValue("distance_MZ:max_difference", 10.0)   # ppm
linker.setParameters(p)

consensus = ms.ConsensusMap()
linker.group([features_sample1, features_sample2, features_sample3], consensus)
print(f"Consensus features: {consensus.size()}")

# Export to pandas for downstream analysis
import pandas as pd
df = consensus.get_df()
print(f"Consensus table: {df.shape}")

Module 4: Peptide & Protein Identification

Process search engine results with FDR control and protein inference.

import pyopenms as ms

# Load search engine results
protein_ids, peptide_ids = [], []
ms.IdXMLFile().load("search_results.idXML", protein_ids, peptide_ids)

# Examine peptide hits
for pep_id in peptide_ids[:3]:
    print(f"Spectrum: RT={pep_id.getRT():.1f}, MZ={pep_id.getMZ():.4f}")
    for hit in pep_id.getHits():
        seq = hit.getSequence()
        print(f"  {seq} score={hit.getScore():.4f} charge={hit.getCharge()}")

# FDR filtering (target-decoy approach)
fdr = ms.FalseDiscoveryRate()
fdr.apply(peptide_ids)

# Filter at 1% FDR
filtered = []
for pep_id in peptide_ids:
    hits = [h for h in pep_id.getHits() if h.getScore() <= 0.01]
    if hits:
        pep_id.setHits(hits)
        filtered.append(pep_id)
print(f"Peptide IDs at 1% FDR: {len(filtered)}")
# Protein inference
inference = ms.BasicProteinInferenceAlgorithm()
inference.run(peptide_ids, protein_ids)

for prot_id in protein_ids:
    for hit in prot_id.getHits()[:5]:
        print(f"Protein: {hit.getAccession()}, score: {hit.getScore():.4f}")

# Peptide sequence handling
seq = ms.AASequence.fromString("PEPTIDER")
print(f"Molecular weight: {seq.getMonoWeight():.4f}")
print(f"Formula: {seq.getFormula()}")

# Modified sequence
mod_seq = ms.AASequence.fromString("PEPTM(Oxidation)DER")
print(f"Modified weight: {mod_seq.getMonoWeight():.4f}")

# Enzymatic digestion
digestor = ms.ProteaseDigestion()
digestor.setEnzyme("Trypsin")
digest = []
digestor.digest(ms.AASequence.fromString("MKWVTFISLLLLFSSAYSRGVFRR"), digest)
print(f"Tryptic peptides: {len(digest)}")

Module 5: Metabolomics Pipeline

Complete untargeted metabolomics workflow from raw data to feature table.

import pyopenms as ms

# Step 1: Load and centroid raw data
exp = ms.MSExperiment()
ms.MzMLFile().load("metabolomics_sample.mzML", exp)

picker = ms.PeakPickerHiRes()
centroided = ms.MSExperiment()
picker.pickExperiment(exp, centroided)

# Step 2: Feature detection for metabolomics (small molecules)
ff = ms.FeatureFinder()
features = ms.FeatureMap()
seeds = ms.FeatureMap()
params = ms.FeatureFinder().getParameters("centroided")
params.setValue("isotopic_pattern:charge_low", 1)
params.setValue("isotopic_pattern:charge_high", 3)
ff.run("centroided", centroided, features, params, seeds)
print(f"Detected features: {features.size()}")

# Step 3: Adduct detection (group related adducts)
decharger = ms.MetaboliteAdductDecharger()
p = decharger.getParameters()
p.setValue("potential_adducts", "H:+:0.6;Na:+:0.3;K:+:0.1")  # Positive mode
decharger.setParameters(p)
# decharger.compute(features, feature_map_out, consensus_map_out)
# Step 4: RT alignment across samples
import pyopenms as ms
import pandas as pd

# Assuming feature maps from multiple samples
sample_files = ["sample1.featureXML", "sample2.featureXML", "sample3.featureXML"]
feature_maps = []
for f in sample_files:
    fm = ms.FeatureMap()
    ms.FeatureXMLFile().load(f, fm)
    feature_maps.append(fm)

# Align retention times
aligner = ms.MapAlignmentAlgorithmPoseClustering()
aligner.setReference(0)  # Use first sample as reference

# Step 5: Link features to consensus map
linker = ms.FeatureGroupingAlgorithmQT()
p = linker.getParameters()
p.setValue("distance_RT:max_difference", 30.0)
p.setValue("distance_MZ:max_difference", 10.0)
linker.setParameters(p)

consensus = ms.ConsensusMap()
linker.group(feature_maps, consensus)

# Step 6: Export metabolite table
df = consensus.get_df()
print(f"Feature table: {df.shape[0]} features × {df.shape[1]} columns")
print(df[["RT", "mz", "intensity_0", "intensity_1", "intensity_2"]].head())

Key Concepts

Algorithm Parameter Pattern

All PyOpenMS algorithms follow the same 3-step pattern:

algo = ms.AlgorithmClass()           # 1. Instantiate
params = algo.getParameters()         # 2. Get parameters
params.setValue("param_name", value)   # 3. Set values
algo.setParameters(params)            # 4. Apply
# algo.process(input, output)         # 5. Execute

To discover available parameters:

for key in params.keys():
    print(f"{key}: {params.getValue(key)} (type: {params.getDescription(key)})")

Core Data Structures

ObjectDescriptionKey Methods
MSExperiment
Collection of spectra + chromatograms
getNrSpectra()
,
getSpectrum(i)
,
addSpectrum()
MSSpectrum
Single mass spectrum (m/z, intensity)
get_peaks()
,
getMSLevel()
,
getRT()
,
size()
MSChromatogram
Chromatographic trace
get_peaks()
,
getNativeID()
Feature
Detected chromatographic peak
getRT()
,
getMZ()
,
getIntensity()
,
getOverallQuality()
FeatureMap
Collection of features
size()
,
get_df()
, iteration
ConsensusMap
Features linked across samples
size()
,
get_df()
PeptideIdentification
Search results for one spectrum
getHits()
,
getRT()
,
getMZ()
AASequence
Amino acid sequence with modifications
fromString()
,
getMonoWeight()
,
getFormula()

Supported File Formats

FormatReader ClassContent
mzML
MzMLFile
Raw spectra (standard)
mzXML
MzXMLFile
Raw spectra (legacy)
featureXML
FeatureXMLFile
Detected features
consensusXML
ConsensusXMLFile
Linked features
idXML
IdXMLFile
Peptide/protein IDs
mzIdentML
MzIdentMLFile
Peptide/protein IDs (standard)
FASTA
FASTAFile
Protein sequences
TraML
TraMLFile
MRM transitions
mzTab
MzTabFile
Quantification results

Common Workflows

Workflow 1: Proteomics Feature Detection Pipeline

import pyopenms as ms

# Load → Smooth → Centroid → Detect → Export
exp = ms.MSExperiment()
ms.MzMLFile().load("proteomics.mzML", exp)

# Preprocessing
gauss = ms.GaussFilter()
p = gauss.getParameters(); p.setValue("gaussian_width", 0.1); gauss.setParameters(p)
gauss.filterExperiment(exp)

picker = ms.PeakPickerHiRes()
centroided = ms.MSExperiment()
picker.pickExperiment(exp, centroided)

# Feature detection
ff = ms.FeatureFinder()
features = ms.FeatureMap()
params = ff.getParameters("centroided")
ff.run("centroided", centroided, features, params, ms.FeatureMap())

# Export to pandas
df = features.get_df()
print(f"Features: {df.shape}")
df.to_csv("proteomics_features.csv", index=False)

Workflow 2: Multi-Sample Metabolomics Comparison

  1. Load mzML files for all samples (Core API Module 1)
  2. Centroid each sample (Core API Module 2 —
    PeakPickerHiRes
    )
  3. Detect features per sample (Core API Module 3 —
    FeatureFinder
    )
  4. Align retention times (Core API Module 3 —
    MapAlignmentAlgorithmPoseClustering
    )
  5. Link features to consensus map (Core API Module 3 —
    FeatureGroupingAlgorithmQT
    )
  6. Export consensus table to pandas and filter by CV across replicates

Workflow 3: Identification Results Analysis

import pyopenms as ms
import pandas as pd

# Load and filter identifications
protein_ids, peptide_ids = [], []
ms.IdXMLFile().load("search.idXML", protein_ids, peptide_ids)

# FDR filtering
fdr = ms.FalseDiscoveryRate()
fdr.apply(peptide_ids)

# Build results table
rows = []
for pep_id in peptide_ids:
    for hit in pep_id.getHits():
        if hit.getScore() <= 0.01:  # 1% FDR
            rows.append({
                "sequence": str(hit.getSequence()),
                "score": hit.getScore(),
                "charge": hit.getCharge(),
                "rt": pep_id.getRT(),
                "mz": pep_id.getMZ()
            })

df = pd.DataFrame(rows)
print(f"Identified peptides at 1% FDR: {len(df)}")
print(f"Unique sequences: {df['sequence'].nunique()}")
df.to_csv("peptide_identifications.csv", index=False)

Key Parameters

ParameterModuleDefaultRange/OptionsEffect
gaussian_width
GaussFilter0.20.05–1.0m/z smoothing window width
frame_length
SavitzkyGolayFilter115–31 (odd)Smoothing window points
signal_to_noise
PeakPickerHiRes1.00.5–10.0Minimum S/N for centroiding
isotopic_pattern:charge_low
FeatureFinder11–6Minimum charge state
isotopic_pattern:charge_high
FeatureFinder31–10Maximum charge state
distance_RT:max_difference
FeatureGroupingAlgorithmQT10010–300 sRT tolerance for feature linking
distance_MZ:max_difference
FeatureGroupingAlgorithmQT101–50 ppmm/z tolerance for feature linking
threshold
ThresholdMower0.00–10000Minimum peak intensity
method
Normalizer"to_one""to_one"/"to_TIC"Normalization method

Best Practices

  1. Always centroid before feature detection — FeatureFinder requires centroided data. Use
    PeakPickerHiRes
    first
  2. Use indexed loading for large files
    OnDiscMSExperiment
    loads spectra on demand, avoiding memory issues for files >1 GB
  3. Preserve original data — Store raw experiment before processing:
    orig = ms.MSExperiment(exp)
    . Processing is destructive
  4. Profile vs centroid awareness — Check data type:
    spec.getType()
    returns 1 (profile) or 2 (centroid). Some algorithms require specific types
  5. Parameter discovery — Print parameters before tuning:
    for k in params.keys(): print(k, params.getValue(k))
  6. Export to pandas early — Use
    features.get_df()
    or
    consensus.get_df()
    to leverage pandas/numpy for statistical analysis

Common Recipes

Recipe 1: Spectrum Statistics and Quality Check

import pyopenms as ms
import numpy as np

exp = ms.MSExperiment()
ms.MzMLFile().load("sample.mzML", exp)

ms1_count = sum(1 for s in exp if s.getMSLevel() == 1)
ms2_count = sum(1 for s in exp if s.getMSLevel() == 2)
rt_range = (exp.getSpectrum(0).getRT(), exp.getSpectrum(exp.getNrSpectra()-1).getRT())
peak_counts = [s.size() for s in exp]
print(f"MS1: {ms1_count}, MS2: {ms2_count}")
print(f"RT range: {rt_range[0]:.1f}–{rt_range[1]:.1f}s")
print(f"Peaks per spectrum: mean={np.mean(peak_counts):.0f}, "
      f"median={np.median(peak_counts):.0f}")

Recipe 2: Theoretical Spectrum Generation

import pyopenms as ms

# Generate theoretical fragment spectrum for a peptide
tsg = ms.TheoreticalSpectrumGenerator()
p = tsg.getParameters()
p.setValue("add_b_ions", "true")
p.setValue("add_y_ions", "true")
p.setValue("add_metainfo", "true")
tsg.setParameters(p)

spec = ms.MSSpectrum()
seq = ms.AASequence.fromString("PEPTIDER")
tsg.getSpectrum(spec, seq, 1, 2)  # charge 1 to 2
mz, intensity = spec.get_peaks()
print(f"Theoretical fragments for PEPTIDER: {len(mz)} ions")

Recipe 3: Format Conversion (mzXML → mzML)

import pyopenms as ms

exp = ms.MSExperiment()
ms.MzXMLFile().load("old_data.mzXML", exp)
ms.MzMLFile().store("converted.mzML", exp)
print(f"Converted {exp.getNrSpectra()} spectra to mzML format")

Troubleshooting

ProblemCauseSolution
FileNotFound
on load
Wrong path or formatVerify file exists; use correct reader class (MzMLFile for .mzML)
Empty feature map after detectionProfile data instead of centroidedRun
PeakPickerHiRes
before
FeatureFinder
Very few features detectedToo-strict S/N thresholdLower
signal_to_noise
to 0.5–1.0; adjust
isotopic_pattern
charge range
Memory error on large filesLoading entire file at onceUse
OnDiscMSExperiment
for indexed access
setValue
type error
Wrong value type for parameterCheck expected type:
params.getDescription(key)
. Use float for numeric, string for enum
RT alignment failsToo few common featuresIncrease
max_num_peaks_considered
; verify samples are from same experiment
FDR values all 1.0No decoy hits in search resultsEnsure search was run with target-decoy database; check score orientation
Feature linking too aggressiveLarge RT/m/z tolerancesReduce
distance_RT:max_difference
and
distance_MZ:max_difference
Slow processingProcessing all spectraFilter by MS level first:
[s for s in exp if s.getMSLevel() == 1]

Bundled Resources

  • references/data_structures_reference.md — Comprehensive documentation of PyOpenMS core objects (MSExperiment, MSSpectrum, Feature, FeatureMap, ConsensusMap, PeptideIdentification, AASequence, Param). Covers: object creation, attribute access, iteration patterns, memory management, type conversions.

    • Covers: all 13+ core data structure classes with creation/access/iteration examples
    • Relocated inline: Core Data Structures summary table in Key Concepts, AASequence usage in Core API Module 4
    • Omitted: exhaustive attribute listings duplicating PyOpenMS API docs
  • references/signal_feature_processing.md — Signal processing algorithms and feature detection/linking workflows consolidated from two original references.

    • Covers: smoothing (Gaussian, Savitzky-Golay), peak picking (HiRes, CWT), normalization, peak filtering (threshold, window, N-largest), baseline reduction, deconvolution, spectrum merging, RT alignment, mass calibration, feature finding (centroided, metabolomics), feature linking, adduct detection, quality control
    • Relocated inline: GaussFilter, PeakPickerHiRes, Normalizer, ThresholdMower in Core API Module 2; FeatureFinder, MapAlignment, FeatureGroupingAlgorithmQT in Core API Module 3
    • Omitted: CWT peak picker detailed parameters — rarely used vs HiRes; isotope wavelet transform — specialized use case
  • references/identification_metabolomics.md — Peptide/protein identification and untargeted metabolomics pipelines consolidated from two original references.

    • Covers: search engine integration, FDR calculation, protein inference, enzymatic digestion, spectral library search, theoretical spectrum generation, untargeted metabolomics pipeline (peak picking → feature detection → adduct detection → alignment → linking → gap filling → annotation → export), compound identification (mass-based, MS/MS-based), normalization (TIC), QC (CV filtering, blank filtering), MetaboAnalyst export
    • Relocated inline: FDR filtering, protein inference, AASequence handling in Core API Module 4; metabolomics pipeline in Core API Module 5
    • Omitted: detailed Comet/Mascot parameter configuration — search engine-specific; MetaboAnalyst export format details — external tool

Related Skills

  • matchms-spectral-matching — spectral library matching and metabolite identification from MS/MS
  • biopython-molecular-biology — protein sequence analysis (FASTA parsing, BLAST)

References