SciAgent-Skills pyopenms-mass-spectrometry
Mass spectrometry data processing with PyOpenMS. Use for LC-MS/MS proteomics and metabolomics workflows — mzML/mzXML file I/O, signal processing (smoothing, peak picking, centroiding), feature detection and linking across samples, peptide/protein identification with FDR control, untargeted metabolomics pipelines. For simple spectral matching and metabolite ID, use matchms instead.
git clone https://github.com/jaechang-hits/SciAgent-Skills
T=$(mktemp -d) && git clone --depth=1 https://github.com/jaechang-hits/SciAgent-Skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/proteomics-protein-engineering/pyopenms-mass-spectrometry" ~/.claude/skills/jaechang-hits-sciagent-skills-pyopenms-mass-spectrometry && rm -rf "$T"
skills/proteomics-protein-engineering/pyopenms-mass-spectrometry/SKILL.md
PyOpenMS — Mass Spectrometry Analysis
Overview
PyOpenMS provides Python bindings to the OpenMS C++ library for computational mass spectrometry. It supports proteomics and metabolomics data processing including file I/O for 10+ MS formats, signal processing, feature detection, peptide/protein identification, and quantitative analysis across samples.
When to Use
- Processing raw LC-MS/MS data (mzML, mzXML) for proteomics or metabolomics
- Detecting chromatographic features and linking them across multiple samples
- Identifying peptides and proteins from MS/MS search engine results with FDR control
- Running untargeted metabolomics workflows (peak picking → feature detection → alignment → annotation)
- Converting between mass spectrometry file formats (mzML, mzXML, featureXML, idXML)
- Smoothing, filtering, and centroiding raw spectral data
- For simple spectral library matching and metabolite identification, use matchms instead
- For protein sequence analysis (not mass spec), use biopython instead
Prerequisites
uv pip install pyopenms numpy pandas matplotlib
- Python 3.8+; NumPy for peak array operations
- Input data: mzML files (standard MS format), FASTA databases (for identification)
- All algorithms follow a consistent pattern:
algo = Algorithm(); params = algo.getParameters(); params.setValue(...); algo.setParameters(params)
Quick Start
```python
import pyopenms as ms

# Load mzML file
exp = ms.MSExperiment()
ms.MzMLFile().load("sample.mzML", exp)
print(f"Spectra: {exp.getNrSpectra()}, Chromatograms: {exp.getNrChromatograms()}")

# Examine first spectrum
spec = exp.getSpectrum(0)
mz, intensity = spec.get_peaks()
print(f"MS level: {spec.getMSLevel()}, RT: {spec.getRT():.2f}s, Peaks: {len(mz)}")

# Quick preprocessing: smooth + centroid
gauss = ms.GaussFilter()
p = gauss.getParameters(); p.setValue("gaussian_width", 0.1); gauss.setParameters(p)
gauss.filterExperiment(exp)

picker = ms.PeakPickerHiRes()
centroided = ms.MSExperiment()
picker.pickExperiment(exp, centroided)
print(f"Centroided spectra: {centroided.getNrSpectra()}")
```
Core API
Module 1: File I/O & Data Access
Read and write mass spectrometry data in multiple formats.
```python
import pyopenms as ms

# Read mzML (standard MS format)
exp = ms.MSExperiment()
ms.MzMLFile().load("data.mzML", exp)

# Indexed access for large files (memory-efficient)
loader = ms.IndexedMzMLFileLoader()
indexed_file = ms.OnDiscMSExperiment()
loader.load("large_data.mzML", indexed_file)
spec = indexed_file.getSpectrum(0)  # Load single spectrum on demand
print(f"Total spectra: {indexed_file.getNrSpectra()}")

# Read identification results (idXML)
protein_ids, peptide_ids = [], []
ms.IdXMLFile().load("results.idXML", protein_ids, peptide_ids)
print(f"Peptide IDs: {len(peptide_ids)}, Protein IDs: {len(protein_ids)}")

# Read feature map
fm = ms.FeatureMap()
ms.FeatureXMLFile().load("features.featureXML", fm)
print(f"Features: {fm.size()}")
```
```python
# Write mzML
exp_out = ms.MSExperiment()
# ... populate experiment ...
ms.MzMLFile().store("output.mzML", exp_out)

# Read FASTA database
entries = []
ms.FASTAFile().load("database.fasta", entries)
print(f"Proteins in DB: {len(entries)}")
for e in entries[:3]:
    print(f"  {e.identifier}: {e.sequence[:30]}...")
```
Supported formats: mzML, mzXML, mzData (spectra); featureXML, consensusXML (features); idXML, mzIdentML, pepXML (identifications); TraML (transitions); mzTab (results); FASTA (sequences)
Module 2: Signal Processing & Peak Picking
Preprocess raw spectral data for downstream analysis.
```python
import pyopenms as ms

exp = ms.MSExperiment()
ms.MzMLFile().load("raw.mzML", exp)

# Gaussian smoothing
gauss = ms.GaussFilter()
p = gauss.getParameters()
p.setValue("gaussian_width", 0.15)  # m/z width
gauss.setParameters(p)
gauss.filterExperiment(exp)

# Savitzky-Golay smoothing (alternative)
sg = ms.SavitzkyGolayFilter()
p = sg.getParameters()
p.setValue("frame_length", 15)  # Must be odd
sg.setParameters(p)
# sg.filterExperiment(exp)  # Use one smoother, not both

# Peak picking (centroiding) — required before feature detection
picker = ms.PeakPickerHiRes()
p = picker.getParameters()
p.setValue("signal_to_noise", 1.0)
picker.setParameters(p)
centroided = ms.MSExperiment()
picker.pickExperiment(exp, centroided)

print(f"Raw peaks in spec 0: {exp.getSpectrum(0).size()}")
print(f"Centroided peaks: {centroided.getSpectrum(0).size()}")
```
```python
# Normalization
normalizer = ms.Normalizer()
p = normalizer.getParameters()
p.setValue("method", "to_one")  # "to_one" or "to_TIC"
normalizer.setParameters(p)
normalizer.filterPeakMap(centroided)

# Peak filtering — remove low-intensity noise
mower = ms.ThresholdMower()
p = mower.getParameters()
p.setValue("threshold", 100.0)  # Minimum intensity
mower.setParameters(p)
mower.filterPeakMap(centroided)

# Baseline reduction
morph = ms.MorphologicalFilter()
p = morph.getParameters()
p.setValue("struc_elem_length", 3.0)  # m/z window
morph.setParameters(p)
morph.filterExperiment(exp)
```
Module 3: Feature Detection & Linking
Detect chromatographic features and link them across samples.
```python
import pyopenms as ms

# Load centroided data
exp = ms.MSExperiment()
ms.MzMLFile().load("centroided.mzML", exp)

# Feature detection (proteomics — centroided data)
ff = ms.FeatureFinder()
features = ms.FeatureMap()
seeds = ms.FeatureMap()
params = ms.FeatureFinder().getParameters("centroided")
ff.run("centroided", exp, features, params, seeds)
print(f"Detected {features.size()} features")

# Access feature properties (FeatureMap supports indexing and iteration, not slicing)
for f in list(features)[:5]:
    print(f"  RT: {f.getRT():.1f}s, m/z: {f.getMZ():.4f}, "
          f"intensity: {f.getIntensity():.0f}, quality: {f.getOverallQuality():.3f}")
```
```python
# Feature linking across samples — align retention times first
aligner = ms.MapAlignmentAlgorithmPoseClustering()
p = aligner.getParameters()
p.setValue("max_num_peaks_considered", 1000)
aligner.setParameters(p)

# Link features into consensus map
# (features_sample1..3 are per-sample FeatureMaps from the detection step above)
linker = ms.FeatureGroupingAlgorithmQT()
p = linker.getParameters()
p.setValue("distance_RT:max_difference", 60.0)  # seconds
p.setValue("distance_MZ:max_difference", 10.0)  # ppm
linker.setParameters(p)
consensus = ms.ConsensusMap()
linker.group([features_sample1, features_sample2, features_sample3], consensus)
print(f"Consensus features: {consensus.size()}")

# Export to pandas for downstream analysis
import pandas as pd
df = consensus.get_df()
print(f"Consensus table: {df.shape}")
```
Module 4: Peptide & Protein Identification
Process search engine results with FDR control and protein inference.
```python
import pyopenms as ms

# Load search engine results
protein_ids, peptide_ids = [], []
ms.IdXMLFile().load("search_results.idXML", protein_ids, peptide_ids)

# Examine peptide hits
for pep_id in peptide_ids[:3]:
    print(f"Spectrum: RT={pep_id.getRT():.1f}, MZ={pep_id.getMZ():.4f}")
    for hit in pep_id.getHits():
        seq = hit.getSequence()
        print(f"  {seq} score={hit.getScore():.4f} charge={hit.getCharge()}")

# FDR filtering (target-decoy approach)
fdr = ms.FalseDiscoveryRate()
fdr.apply(peptide_ids)

# Filter at 1% FDR
filtered = []
for pep_id in peptide_ids:
    hits = [h for h in pep_id.getHits() if h.getScore() <= 0.01]
    if hits:
        pep_id.setHits(hits)
        filtered.append(pep_id)
print(f"Peptide IDs at 1% FDR: {len(filtered)}")
```
```python
# Protein inference
inference = ms.BasicProteinInferenceAlgorithm()
inference.run(peptide_ids, protein_ids)
for prot_id in protein_ids:
    for hit in prot_id.getHits()[:5]:
        print(f"Protein: {hit.getAccession()}, score: {hit.getScore():.4f}")

# Peptide sequence handling
seq = ms.AASequence.fromString("PEPTIDER")
print(f"Molecular weight: {seq.getMonoWeight():.4f}")
print(f"Formula: {seq.getFormula()}")

# Modified sequence
mod_seq = ms.AASequence.fromString("PEPTM(Oxidation)DER")
print(f"Modified weight: {mod_seq.getMonoWeight():.4f}")

# Enzymatic digestion
digestor = ms.ProteaseDigestion()
digestor.setEnzyme("Trypsin")
digest = []
digestor.digest(ms.AASequence.fromString("MKWVTFISLLLLFSSAYSRGVFRR"), digest)
print(f"Tryptic peptides: {len(digest)}")
```
Module 5: Metabolomics Pipeline
Complete untargeted metabolomics workflow from raw data to feature table.
```python
import pyopenms as ms

# Step 1: Load and centroid raw data
exp = ms.MSExperiment()
ms.MzMLFile().load("metabolomics_sample.mzML", exp)
picker = ms.PeakPickerHiRes()
centroided = ms.MSExperiment()
picker.pickExperiment(exp, centroided)

# Step 2: Feature detection for metabolomics (small molecules)
ff = ms.FeatureFinder()
features = ms.FeatureMap()
seeds = ms.FeatureMap()
params = ms.FeatureFinder().getParameters("centroided")
params.setValue("isotopic_pattern:charge_low", 1)
params.setValue("isotopic_pattern:charge_high", 3)
ff.run("centroided", centroided, features, params, seeds)
print(f"Detected features: {features.size()}")

# Step 3: Adduct detection (group related adducts)
decharger = ms.MetaboliteAdductDecharger()
p = decharger.getParameters()
p.setValue("potential_adducts", "H:+:0.6;Na:+:0.3;K:+:0.1")  # Positive mode
decharger.setParameters(p)
# decharger.compute(features, feature_map_out, consensus_map_out)
```
```python
# Step 4: RT alignment across samples
import pyopenms as ms
import pandas as pd

# Assuming feature maps from multiple samples
sample_files = ["sample1.featureXML", "sample2.featureXML", "sample3.featureXML"]
feature_maps = []
for f in sample_files:
    fm = ms.FeatureMap()
    ms.FeatureXMLFile().load(f, fm)
    feature_maps.append(fm)

# Align retention times
aligner = ms.MapAlignmentAlgorithmPoseClustering()
aligner.setReference(feature_maps[0])  # Use first sample as reference

# Step 5: Link features to consensus map
linker = ms.FeatureGroupingAlgorithmQT()
p = linker.getParameters()
p.setValue("distance_RT:max_difference", 30.0)
p.setValue("distance_MZ:max_difference", 10.0)
linker.setParameters(p)
consensus = ms.ConsensusMap()
linker.group(feature_maps, consensus)

# Step 6: Export metabolite table
df = consensus.get_df()
print(f"Feature table: {df.shape[0]} features × {df.shape[1]} columns")
print(df[["RT", "mz", "intensity_0", "intensity_1", "intensity_2"]].head())
```
Key Concepts
Algorithm Parameter Pattern
All PyOpenMS algorithms follow the same configuration pattern:
```python
algo = ms.AlgorithmClass()             # 1. Instantiate
params = algo.getParameters()          # 2. Get parameters
params.setValue("param_name", value)   # 3. Set values
algo.setParameters(params)             # 4. Apply
# algo.process(input, output)          # 5. Execute
```
To discover available parameters:
```python
for key in params.keys():
    print(f"{key}: {params.getValue(key)} — {params.getDescription(key)}")
```
Core Data Structures
| Object | Description | Key Methods |
|---|---|---|
| MSExperiment | Collection of spectra + chromatograms | getNrSpectra(), getSpectrum(i), addSpectrum() |
| MSSpectrum | Single mass spectrum (m/z, intensity) | get_peaks(), getRT(), getMSLevel(), size() |
| MSChromatogram | Chromatographic trace | get_peaks(), getNativeID() |
| Feature | Detected chromatographic peak | getRT(), getMZ(), getIntensity(), getOverallQuality() |
| FeatureMap | Collection of features | size(), get_df(), iteration |
| ConsensusMap | Features linked across samples | size(), get_df() |
| PeptideIdentification | Search results for one spectrum | getHits(), getRT(), getMZ() |
| AASequence | Amino acid sequence with modifications | fromString(), getMonoWeight(), getFormula() |
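To make the table concrete, here is a minimal sketch that creates and traverses a few of these objects in memory; the peak values, feature coordinates, and peptide sequence are arbitrary illustration data.

```python
import pyopenms as ms

# Build a tiny experiment in memory
exp = ms.MSExperiment()
spec = ms.MSSpectrum()
spec.set_peaks(([100.0, 200.0, 300.0], [1000.0, 500.0, 250.0]))  # m/z, intensity
spec.setMSLevel(1)
spec.setRT(12.5)
exp.addSpectrum(spec)

# Iterate spectra and read peak arrays
for s in exp:
    mz, intensity = s.get_peaks()
    print(f"MS{s.getMSLevel()} at RT {s.getRT():.1f}s with {len(mz)} peaks")

# Features and feature maps
f = ms.Feature()
f.setRT(12.5)
f.setMZ(443.711)
f.setIntensity(1.2e6)
fm = ms.FeatureMap()
fm.push_back(f)
print(f"FeatureMap size: {fm.size()}")

# Peptide sequences with monoisotopic mass and formula
seq = ms.AASequence.fromString("PEPTIDER")
print(seq.getMonoWeight(), seq.getFormula().toString())
```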
Supported File Formats
| Format | Reader Class | Content |
|---|---|---|
| mzML | MzMLFile | Raw spectra (standard) |
| mzXML | MzXMLFile | Raw spectra (legacy) |
| featureXML | FeatureXMLFile | Detected features |
| consensusXML | ConsensusXMLFile | Linked features |
| idXML | IdXMLFile | Peptide/protein IDs |
| mzIdentML | MzIdentMLFile | Peptide/protein IDs (standard) |
| FASTA | FASTAFile | Protein sequences |
| TraML | TraMLFile | MRM transitions |
| mzTab | MzTabFile | Quantification results |
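Each reader class also provides a store() method for the write direction. A brief sketch (file names are placeholders; the objects are assumed to be populated by the modules above):

```python
import pyopenms as ms

# Objects assumed to be populated by the modules above
exp = ms.MSExperiment()
fm = ms.FeatureMap()
fm.setUniqueIds()  # assign IDs before writing featureXML
protein_ids, peptide_ids = [], []

ms.MzMLFile().store("out.mzML", exp)                         # spectra
ms.FeatureXMLFile().store("out.featureXML", fm)              # features
ms.IdXMLFile().store("out.idXML", protein_ids, peptide_ids)  # identifications
```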
Common Workflows
Workflow 1: Proteomics Feature Detection Pipeline
```python
import pyopenms as ms

# Load → Smooth → Centroid → Detect → Export
exp = ms.MSExperiment()
ms.MzMLFile().load("proteomics.mzML", exp)

# Preprocessing
gauss = ms.GaussFilter()
p = gauss.getParameters(); p.setValue("gaussian_width", 0.1); gauss.setParameters(p)
gauss.filterExperiment(exp)

picker = ms.PeakPickerHiRes()
centroided = ms.MSExperiment()
picker.pickExperiment(exp, centroided)

# Feature detection
ff = ms.FeatureFinder()
features = ms.FeatureMap()
params = ff.getParameters("centroided")
ff.run("centroided", centroided, features, params, ms.FeatureMap())

# Export to pandas
df = features.get_df()
print(f"Features: {df.shape}")
df.to_csv("proteomics_features.csv", index=False)
```
Workflow 2: Multi-Sample Metabolomics Comparison
- Load mzML files for all samples (Core API Module 1)
- Centroid each sample (Core API Module 2 — PeakPickerHiRes)
- Detect features per sample (Core API Module 3 — FeatureFinder)
- Align retention times (Core API Module 3 — MapAlignmentAlgorithmPoseClustering)
- Link features to consensus map (Core API Module 3 — FeatureGroupingAlgorithmQT)
- Export consensus table to pandas and filter by CV across replicates (see the sketch below)
Workflow 3: Identification Results Analysis
```python
import pyopenms as ms
import pandas as pd

# Load and filter identifications
protein_ids, peptide_ids = [], []
ms.IdXMLFile().load("search.idXML", protein_ids, peptide_ids)

# FDR filtering
fdr = ms.FalseDiscoveryRate()
fdr.apply(peptide_ids)

# Build results table
rows = []
for pep_id in peptide_ids:
    for hit in pep_id.getHits():
        if hit.getScore() <= 0.01:  # 1% FDR
            rows.append({
                "sequence": str(hit.getSequence()),
                "score": hit.getScore(),
                "charge": hit.getCharge(),
                "rt": pep_id.getRT(),
                "mz": pep_id.getMZ()
            })

df = pd.DataFrame(rows)
print(f"Identified peptides at 1% FDR: {len(df)}")
print(f"Unique sequences: {df['sequence'].nunique()}")
df.to_csv("peptide_identifications.csv", index=False)
```
Key Parameters
| Parameter | Module | Default | Range/Options | Effect |
|---|---|---|---|---|
| gaussian_width | GaussFilter | 0.2 | 0.05–1.0 | m/z smoothing window width |
| frame_length | SavitzkyGolayFilter | 11 | 5–31 (odd) | Smoothing window points |
| signal_to_noise | PeakPickerHiRes | 1.0 | 0.5–10.0 | Minimum S/N for centroiding |
| isotopic_pattern:charge_low | FeatureFinder | 1 | 1–6 | Minimum charge state |
| isotopic_pattern:charge_high | FeatureFinder | 3 | 1–10 | Maximum charge state |
| distance_RT:max_difference | FeatureGroupingAlgorithmQT | 100 | 10–300 s | RT tolerance for feature linking |
| distance_MZ:max_difference | FeatureGroupingAlgorithmQT | 10 | 1–50 ppm | m/z tolerance for feature linking |
| threshold | ThresholdMower | 0.0 | 0–10000 | Minimum peak intensity |
| method | Normalizer | "to_one" | "to_one"/"to_TIC" | Normalization method |
Best Practices
- Always centroid before feature detection — FeatureFinder requires centroided data. Use PeakPickerHiRes first
- Use indexed loading for large files — OnDiscMSExperiment loads spectra on demand, avoiding memory issues for files >1 GB
- Preserve original data — Store the raw experiment before processing: orig = ms.MSExperiment(exp). Processing is destructive
- Profile vs centroid awareness — Check the data type: spec.getType() returns 1 (centroid) or 2 (profile). Some algorithms require a specific type
- Parameter discovery — Print parameters before tuning: for k in params.keys(): print(k, params.getValue(k)) (see the sketch after this list)
- Export to pandas early — Use features.get_df() or consensus.get_df() to leverage pandas/numpy for statistical analysis
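A brief sketch tying a few of these practices together (the file name is a placeholder; the getType() value mapping should be verified against your PyOpenMS version):

```python
import pyopenms as ms

exp = ms.MSExperiment()
ms.MzMLFile().load("sample.mzML", exp)  # placeholder file name

# Keep an untouched copy, since filters modify the experiment in place
original = ms.MSExperiment(exp)

# Check the declared spectrum type of the first spectrum
# (SpectrumSettings enum: 1 = centroid, 2 = profile; verify for your version)
print("Spectrum type:", exp.getSpectrum(0).getType())

# Dump an algorithm's parameters before tuning it
gauss = ms.GaussFilter()
params = gauss.getParameters()
for key in params.keys():
    print(key, "=", params.getValue(key))
```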
Common Recipes
Recipe 1: Spectrum Statistics and Quality Check
```python
import pyopenms as ms
import numpy as np

exp = ms.MSExperiment()
ms.MzMLFile().load("sample.mzML", exp)

ms1_count = sum(1 for s in exp if s.getMSLevel() == 1)
ms2_count = sum(1 for s in exp if s.getMSLevel() == 2)
rt_range = (exp.getSpectrum(0).getRT(), exp.getSpectrum(exp.getNrSpectra() - 1).getRT())
peak_counts = [s.size() for s in exp]

print(f"MS1: {ms1_count}, MS2: {ms2_count}")
print(f"RT range: {rt_range[0]:.1f}–{rt_range[1]:.1f}s")
print(f"Peaks per spectrum: mean={np.mean(peak_counts):.0f}, "
      f"median={np.median(peak_counts):.0f}")
```
Recipe 2: Theoretical Spectrum Generation
```python
import pyopenms as ms

# Generate theoretical fragment spectrum for a peptide
tsg = ms.TheoreticalSpectrumGenerator()
p = tsg.getParameters()
p.setValue("add_b_ions", "true")
p.setValue("add_y_ions", "true")
p.setValue("add_metainfo", "true")
tsg.setParameters(p)

spec = ms.MSSpectrum()
seq = ms.AASequence.fromString("PEPTIDER")
tsg.getSpectrum(spec, seq, 1, 2)  # charge 1 to 2

mz, intensity = spec.get_peaks()
print(f"Theoretical fragments for PEPTIDER: {len(mz)} ions")
```
Recipe 3: Format Conversion (mzXML → mzML)
```python
import pyopenms as ms

exp = ms.MSExperiment()
ms.MzXMLFile().load("old_data.mzXML", exp)
ms.MzMLFile().store("converted.mzML", exp)
print(f"Converted {exp.getNrSpectra()} spectra to mzML format")
```
Troubleshooting
| Problem | Cause | Solution |
|---|---|---|
| Error on load | Wrong path or format | Verify the file exists; use the correct reader class (MzMLFile for .mzML) |
| Empty feature map after detection | Profile data instead of centroided | Run PeakPickerHiRes before FeatureFinder |
| Very few features detected | Too-strict S/N threshold | Lower signal_to_noise to 0.5–1.0; adjust the charge range |
| Memory error on large files | Loading entire file at once | Use OnDiscMSExperiment for indexed access |
| setValue type error | Wrong value type for parameter | Check the expected type with params.getValue(key). Use float for numeric, string for enum |
| RT alignment fails | Too few common features | Increase max_num_peaks_considered; verify samples are from the same experiment |
| FDR values all 1.0 | No decoy hits in search results | Ensure the search was run with a target-decoy database; check score orientation |
| Feature linking too aggressive | Large RT/m/z tolerances | Reduce distance_RT:max_difference and distance_MZ:max_difference |
| Slow processing | Processing all spectra | Filter by MS level first (keep only the MS levels you need) |
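For the last two rows, a minimal sketch of MS-level filtering and on-demand loading; file names are placeholders:

```python
import pyopenms as ms

# Keep only MS1 spectra to speed up downstream processing
exp = ms.MSExperiment()
ms.MzMLFile().load("sample.mzML", exp)  # placeholder file name
exp.setSpectra([s for s in exp.getSpectra() if s.getMSLevel() == 1])
print(f"Kept {exp.getNrSpectra()} MS1 spectra")

# For very large files, read spectra on demand instead of loading everything
od_exp = ms.OnDiscMSExperiment()
ms.IndexedMzMLFileLoader().load("large_data.mzML", od_exp)
print("MS level of first spectrum:", od_exp.getSpectrum(0).getMSLevel())
```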
Bundled Resources
- references/data_structures_reference.md — Comprehensive documentation of PyOpenMS core objects (MSExperiment, MSSpectrum, Feature, FeatureMap, ConsensusMap, PeptideIdentification, AASequence, Param): object creation, attribute access, iteration patterns, memory management, type conversions.
  - Covers: all 13+ core data structure classes with creation/access/iteration examples
  - Relocated inline: Core Data Structures summary table in Key Concepts; AASequence usage in Core API Module 4
  - Omitted: exhaustive attribute listings duplicating PyOpenMS API docs
- references/signal_feature_processing.md — Signal processing algorithms and feature detection/linking workflows consolidated from two original references.
  - Covers: smoothing (Gaussian, Savitzky-Golay), peak picking (HiRes, CWT), normalization, peak filtering (threshold, window, N-largest), baseline reduction, deconvolution, spectrum merging, RT alignment, mass calibration, feature finding (centroided, metabolomics), feature linking, adduct detection, quality control
  - Relocated inline: GaussFilter, PeakPickerHiRes, Normalizer, ThresholdMower in Core API Module 2; FeatureFinder, MapAlignment, FeatureGroupingAlgorithmQT in Core API Module 3
  - Omitted: CWT peak picker detailed parameters — rarely used vs HiRes; isotope wavelet transform — specialized use case
- references/identification_metabolomics.md — Peptide/protein identification and untargeted metabolomics pipelines consolidated from two original references.
  - Covers: search engine integration, FDR calculation, protein inference, enzymatic digestion, spectral library search, theoretical spectrum generation, untargeted metabolomics pipeline (peak picking → feature detection → adduct detection → alignment → linking → gap filling → annotation → export), compound identification (mass-based, MS/MS-based), normalization (TIC), QC (CV filtering, blank filtering), MetaboAnalyst export
  - Relocated inline: FDR filtering, protein inference, AASequence handling in Core API Module 4; metabolomics pipeline in Core API Module 5
  - Omitted: detailed Comet/Mascot parameter configuration — search engine-specific; MetaboAnalyst export format details — external tool
Related Skills
- matchms-spectral-matching — spectral library matching and metabolite identification from MS/MS
- biopython-molecular-biology — protein sequence analysis (FASTA parsing, BLAST)
References
- PyOpenMS documentation: https://pyopenms.readthedocs.io
- OpenMS project: https://www.openms.org
- Röst et al. (2016) OpenMS: a flexible open-source software platform for mass spectrometry data analysis. Nature Methods, DOI: 10.1038/nmeth.3959
- GitHub: https://github.com/OpenMS/OpenMS