SciAgent-Skills matchms-spectral-matching
Mass spectrometry spectral matching and metabolite identification with matchms. Use for importing spectra (mzML, MGF, MSP, JSON), filtering/normalizing peaks, computing spectral similarity (cosine, modified cosine, fingerprint), building reproducible processing pipelines, and identifying unknown metabolites from spectral libraries. For full LC-MS/MS proteomics pipelines, use pyopenms instead.
git clone https://github.com/jaechang-hits/SciAgent-Skills
T=$(mktemp -d) && git clone --depth=1 https://github.com/jaechang-hits/SciAgent-Skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/proteomics-protein-engineering/matchms-spectral-matching" ~/.claude/skills/jaechang-hits-sciagent-skills-matchms-spectral-matching && rm -rf "$T"
skills/proteomics-protein-engineering/matchms-spectral-matching/SKILL.mdMatchms — Spectral Matching & Metabolite Identification
Overview
Matchms is a Python library for mass spectrometry data processing focused on spectral similarity calculation and compound identification. It provides multi-format I/O, 50+ spectrum filters for metadata harmonization and peak processing, 8 similarity scoring functions, and a pipeline framework for reproducible analytical workflows.
When to Use
- Identifying unknown metabolites by matching MS/MS spectra against reference libraries
- Computing spectral similarity scores (cosine, modified cosine, fingerprint-based)
- Processing and standardizing mass spectral data from multiple formats (mzML, MGF, MSP, JSON)
- Building reproducible spectral processing pipelines for quality control
- Harmonizing metadata across spectral databases (compound names, SMILES, InChI, adducts)
- Large-scale spectral library comparisons and duplicate detection
- For full LC-MS/MS proteomics workflows (feature detection, protein ID), use pyopenms instead
- For chemical structure similarity without mass spectra, use rdkit fingerprint comparison
Prerequisites
uv pip install matchms numpy pandas # For chemical structure processing (SMILES, InChI, fingerprints): uv pip install matchms[chemistry]
- Python 3.8+; NumPy for peak array operations
- Input: spectral data in MGF, MSP, mzML, mzXML, JSON, or pickle format
- Reference library in any supported format for matching
Quick Start
from matchms.importing import load_from_mgf from matchms.filtering import default_filters, normalize_intensities from matchms.filtering import select_by_relative_intensity, require_minimum_number_of_peaks from matchms import calculate_scores from matchms.similarity import CosineGreedy # Load and process query spectra queries = list(load_from_mgf("queries.mgf")) queries = [default_filters(s) for s in queries] queries = [normalize_intensities(s) for s in queries if s is not None] queries = [require_minimum_number_of_peaks(s, n_required=5) for s in queries if s is not None] # Load reference library refs = list(load_from_mgf("library.mgf")) refs = [default_filters(s) for s in refs] refs = [normalize_intensities(s) for s in refs if s is not None] # Calculate similarity scores scores = calculate_scores(references=refs, queries=queries, similarity_function=CosineGreedy(tolerance=0.1)) # Get best matches for first query best = scores.scores_by_query(queries[0], sort=True)[:5] for match, score_tuple in best: print(f"Score: {score_tuple['score']:.3f}, Matches: {score_tuple['matches']}")
Core API
Module 1: Spectrum I/O
Import spectra from multiple file formats and export processed data.
from matchms.importing import (load_from_mgf, load_from_mzml, load_from_msp, load_from_json, load_from_mzxml, load_from_pickle, load_from_usi) from matchms.exporting import save_as_mgf, save_as_msp, save_as_json, save_as_pickle # Import from various formats (returns generators) spectra_mgf = list(load_from_mgf("library.mgf")) spectra_mzml = list(load_from_mzml("data.mzML")) spectra_msp = list(load_from_msp("nist_library.msp")) spectra_json = list(load_from_json("gnps_spectra.json")) print(f"Loaded: MGF={len(spectra_mgf)}, mzML={len(spectra_mzml)}") # Export processed spectra save_as_mgf(spectra_mgf, "processed.mgf") save_as_json(spectra_mgf, "processed.json") save_as_pickle(spectra_mgf, "spectra.pickle") # Fast for intermediate results # Pickle for large datasets (fastest I/O) from matchms.importing import load_from_pickle spectra = list(load_from_pickle("spectra.pickle"))
from matchms import Spectrum import numpy as np # Create spectrum manually mz = np.array([100.0, 150.0, 200.0, 250.0, 300.0]) intensities = np.array([0.1, 0.5, 0.9, 0.3, 0.7]) metadata = { "precursor_mz": 325.5, "ionmode": "positive", "compound_name": "Caffeine", "smiles": "CN1C=NC2=C1C(=O)N(C(=O)N2C)C" } spectrum = Spectrum(mz=mz, intensities=intensities, metadata=metadata) # Access spectrum data print(f"Peaks: {spectrum.peaks.mz}") print(f"Precursor: {spectrum.get('precursor_mz')}") print(f"Name: {spectrum.get('compound_name')}") # Visualize spectrum.plot()
Module 2: Spectrum Filtering & Processing
Apply metadata harmonization and peak processing filters. Matchms provides 50+ filters.
from matchms.filtering import ( default_filters, normalize_intensities, select_by_relative_intensity, select_by_mz, require_minimum_number_of_peaks, reduce_to_number_of_peaks, remove_peaks_around_precursor_mz, add_losses ) # default_filters applies: metadata cleanup, charge correction, adduct parsing spectrum = default_filters(spectrum) # Peak normalization (max intensity → 1.0) spectrum = normalize_intensities(spectrum) # Filter peaks by relative intensity (remove noise below 1%) spectrum = select_by_relative_intensity(spectrum, intensity_from=0.01, intensity_to=1.0) # Filter peaks by m/z range spectrum = select_by_mz(spectrum, mz_from=50.0, mz_to=500.0) # Keep top N peaks only spectrum = reduce_to_number_of_peaks(spectrum, n_max=50) # Remove peaks near precursor (common contaminants) spectrum = remove_peaks_around_precursor_mz(spectrum, mz_tolerance=17.0) # Require minimum peaks for matching spectrum = require_minimum_number_of_peaks(spectrum, n_required=5) # Add neutral losses (useful for NeutralLossesCosine) spectrum = add_losses(spectrum) if spectrum is not None: print(f"After filtering: {len(spectrum.peaks.mz)} peaks")
# Chemical annotation filters (require matchms[chemistry]) from matchms.filtering import ( derive_inchi_from_smiles, derive_inchikey_from_inchi, derive_smiles_from_inchi, add_fingerprint, repair_inchi_inchikey_smiles, require_valid_annotation ) # Derive chemical identifiers from SMILES spectrum = derive_inchi_from_smiles(spectrum) spectrum = derive_inchikey_from_inchi(spectrum) # Add molecular fingerprint for structural similarity spectrum = add_fingerprint(spectrum, fingerprint_type="morgan", nbits=2048) # Validate annotations spectrum = require_valid_annotation(spectrum) if spectrum is not None: print(f"InChIKey: {spectrum.get('inchikey')}")
Module 3: Similarity Scoring
Compare spectra using multiple similarity metrics.
from matchms import calculate_scores from matchms.similarity import ( CosineGreedy, CosineHungarian, ModifiedCosine, NeutralLossesCosine, FingerprintSimilarity, MetadataMatch, PrecursorMzMatch ) # CosineGreedy — fast peak matching (greedy algorithm) scores = calculate_scores(references=library, queries=unknowns, similarity_function=CosineGreedy(tolerance=0.1)) # ModifiedCosine — accounts for precursor mass differences (best for analog search) scores = calculate_scores(references=library, queries=unknowns, similarity_function=ModifiedCosine(tolerance=0.1)) # CosineHungarian — optimal peak matching (slower but more accurate) scores = calculate_scores(references=library, queries=unknowns, similarity_function=CosineHungarian(tolerance=0.1)) # NeutralLossesCosine — similarity based on neutral loss patterns scores = calculate_scores(references=library, queries=unknowns, similarity_function=NeutralLossesCosine(tolerance=0.1)) # Access results for i, query in enumerate(unknowns[:3]): best_matches = scores.scores_by_query(query, sort=True)[:3] print(f"\nQuery {i}: precursor_mz={query.get('precursor_mz')}") for ref, score_tuple in best_matches: print(f" {ref.get('compound_name', 'Unknown')}: " f"score={score_tuple['score']:.3f}, matches={score_tuple['matches']}")
# FingerprintSimilarity — structural similarity (requires fingerprints) from matchms.similarity import FingerprintSimilarity scores = calculate_scores(references=library, queries=unknowns, similarity_function=FingerprintSimilarity( similarity_measure="jaccard")) # PrecursorMzMatch — fast mass-based pre-filtering from matchms.similarity import PrecursorMzMatch scores = calculate_scores(references=library, queries=unknowns, similarity_function=PrecursorMzMatch(tolerance=0.1)) # Multi-metric scoring: combine peak + structural similarity cosine_scores = calculate_scores(references=library, queries=unknowns, similarity_function=CosineGreedy(tolerance=0.1)) fp_scores = calculate_scores(references=library, queries=unknowns, similarity_function=FingerprintSimilarity())
Module 4: Processing Pipelines
Build reusable, reproducible multi-step processing workflows.
from matchms import SpectrumProcessor from matchms.filtering import ( default_filters, normalize_intensities, select_by_relative_intensity, require_minimum_number_of_peaks, remove_peaks_around_precursor_mz, add_losses ) # Define reusable pipeline pipeline = SpectrumProcessor([ default_filters, normalize_intensities, lambda s: select_by_relative_intensity(s, intensity_from=0.01), lambda s: remove_peaks_around_precursor_mz(s, mz_tolerance=17.0), lambda s: require_minimum_number_of_peaks(s, n_required=5), add_losses ]) # Apply to all spectra (filters returning None remove the spectrum) processed = [pipeline(s) for s in raw_spectra] processed = [s for s in processed if s is not None] print(f"Processed: {len(processed)}/{len(raw_spectra)} spectra retained")
Key Concepts
Similarity Function Comparison
| Function | Speed | Accuracy | Best For |
|---|---|---|---|
| Fast | Good | General library matching |
| Slow | Best | Small comparisons, validation |
| Fast | Good | Analog search (different precursors) |
| Medium | Good | Structural class identification |
| Fast | Moderate | Structure-based pre-filtering |
| Fastest | N/A | Mass-based pre-filtering |
Filter Categories
| Category | Examples | Purpose |
|---|---|---|
| Metadata cleanup | , , | Standardize metadata fields |
| Chemical derivation | , | Compute chemical identifiers |
| Mass/charge | , , | Fix and validate mass info |
| Peak normalization | , | Scale and filter peaks |
| Peak reduction | , | Remove noise/artifacts |
| Quality control | , | Enforce minimum quality |
| Neutral losses | | Compute precursor-fragment losses |
Score Tuple Structure
All similarity functions return
(score, matches):
: float 0.0–1.0 (cosine similarity value)score
: int (number of matched peaks between query and reference)matches
Higher scores and more matched peaks indicate better matches. Typical thresholds: score > 0.7 and matches > 6 for confident identifications.
Common Workflows
Workflow 1: Library Matching for Metabolite Identification
from matchms.importing import load_from_mgf from matchms.filtering import default_filters, normalize_intensities from matchms.filtering import select_by_relative_intensity, require_minimum_number_of_peaks from matchms import calculate_scores from matchms.similarity import ModifiedCosine import pandas as pd # Load and process both queries and library identically def process_spectra(spectra): processed = [] for s in spectra: s = default_filters(s) if s is None: continue s = normalize_intensities(s) s = select_by_relative_intensity(s, intensity_from=0.01) s = require_minimum_number_of_peaks(s, n_required=5) if s is not None: processed.append(s) return processed queries = process_spectra(load_from_mgf("unknowns.mgf")) library = process_spectra(load_from_mgf("reference_library.mgf")) print(f"Queries: {len(queries)}, Library: {len(library)}") # Score all query-reference pairs scores = calculate_scores(references=library, queries=queries, similarity_function=ModifiedCosine(tolerance=0.1)) # Extract best matches results = [] for query in queries: best = scores.scores_by_query(query, sort=True)[:1] if best: ref, score_tuple = best[0] results.append({ "query_precursor_mz": query.get("precursor_mz"), "match_name": ref.get("compound_name", "Unknown"), "match_smiles": ref.get("smiles", ""), "score": score_tuple["score"], "matched_peaks": score_tuple["matches"] }) df = pd.DataFrame(results) confident = df[df["score"] > 0.7] print(f"Confident matches (score>0.7): {len(confident)}/{len(df)}") df.to_csv("identification_results.csv", index=False)
Workflow 2: Quality Control and Data Cleaning
from matchms.importing import load_from_msp from matchms.exporting import save_as_mgf from matchms import SpectrumProcessor from matchms.filtering import ( default_filters, normalize_intensities, select_by_relative_intensity, require_minimum_number_of_peaks, require_precursor_mz, add_parent_mass ) # Define QC pipeline qc_pipeline = SpectrumProcessor([ default_filters, require_precursor_mz, add_parent_mass, normalize_intensities, lambda s: select_by_relative_intensity(s, intensity_from=0.001), lambda s: require_minimum_number_of_peaks(s, n_required=3) ]) # Process and filter raw = list(load_from_msp("raw_library.msp")) cleaned = [qc_pipeline(s) for s in raw] cleaned = [s for s in cleaned if s is not None] print(f"Input: {len(raw)}, Output: {len(cleaned)} ({len(cleaned)/len(raw)*100:.1f}% retained)") save_as_mgf(cleaned, "cleaned_library.mgf")
Workflow 3: Format Conversion
- Load spectra from source format (Core API Module 1 —
)load_from_mzml - Apply
for metadata harmonization (Core API Module 2)default_filters - Export to target format (Core API Module 1 —
)save_as_mgf
Key Parameters
| Parameter | Function/Module | Default | Range/Options | Effect |
|---|---|---|---|---|
| CosineGreedy/ModifiedCosine | 0.1 | 0.005–0.5 Da | m/z tolerance for peak matching |
| CosineGreedy | 0.0 | 0.0–2.0 | Weight of m/z in scoring (0=ignore) |
| CosineGreedy | 1.0 | 0.0–2.0 | Weight of intensity in scoring |
| select_by_relative_intensity | 0.0 | 0.0–1.0 | Minimum relative intensity to keep |
| require_minimum_number_of_peaks | 10 | 1–100 | Minimum peaks to retain spectrum |
| reduce_to_number_of_peaks | 100 | 10–500 | Maximum peaks to retain |
| remove_peaks_around_precursor_mz | 17.0 | 0.5–50 Da | Window around precursor to remove |
| add_fingerprint | "daylight" | "daylight"/"morgan"/"maccs" | Molecular fingerprint type |
| add_fingerprint | 2048 | 256–4096 | Fingerprint bit vector length |
Best Practices
- Always process queries and references identically — apply the same filtering pipeline to both sets to avoid systematic bias in similarity scores
- Save intermediate results in pickle — pickle format is fastest for re-loading; use MGF/MSP for sharing with other tools
- Pre-filter by precursor mass — for large libraries, use
first to reduce the comparison space, then score withPrecursorMzMatchCosineGreedy - Combine multiple metrics — use both CosineGreedy (peak matching) and FingerprintSimilarity (structure) for more robust identification
- Check for None after filtering — filters return
when a spectrum fails quality requirements. Always filter:None[s for s in processed if s is not None] - Use ModifiedCosine for analog search — when querying against libraries that may not have the exact compound, ModifiedCosine handles precursor mass differences
Common Recipes
Recipe 1: Precursor-Filtered Library Search (Efficient)
from matchms import calculate_scores from matchms.similarity import PrecursorMzMatch, CosineGreedy # Step 1: Fast mass filter mass_scores = calculate_scores(references=library, queries=unknowns, similarity_function=PrecursorMzMatch(tolerance=0.5)) # Step 2: Detailed scoring only for mass-matched pairs cosine = CosineGreedy(tolerance=0.1) for query in unknowns: candidates = mass_scores.scores_by_query(query, sort=True) mass_matched = [ref for ref, score in candidates if score["score"] > 0] if mass_matched: detailed = calculate_scores(references=mass_matched, queries=[query], similarity_function=cosine) best = detailed.scores_by_query(query, sort=True)[:3] for ref, s in best: print(f"{ref.get('compound_name')}: {s['score']:.3f}")
Recipe 2: Ion Mode-Specific Processing
from matchms.importing import load_from_mgf from matchms.filtering import default_filters, normalize_intensities spectra = list(load_from_mgf("mixed_library.mgf")) spectra = [default_filters(s) for s in spectra] spectra = [s for s in spectra if s is not None] # Separate by ion mode positive = [s for s in spectra if s.get("ionmode") == "positive"] negative = [s for s in spectra if s.get("ionmode") == "negative"] print(f"Positive: {len(positive)}, Negative: {len(negative)}") # Process each mode with mode-specific filtering positive = [normalize_intensities(s) for s in positive] negative = [normalize_intensities(s) for s in negative]
Recipe 3: Metadata Enrichment Report
import pandas as pd from matchms.importing import load_from_mgf from matchms.filtering import default_filters spectra = [default_filters(s) for s in load_from_mgf("library.mgf")] spectra = [s for s in spectra if s is not None] # Extract metadata summary rows = [] for s in spectra: rows.append({ "compound_name": s.get("compound_name", ""), "precursor_mz": s.get("precursor_mz"), "ionmode": s.get("ionmode", ""), "smiles": s.get("smiles", ""), "inchikey": s.get("inchikey", ""), "num_peaks": len(s.peaks.mz) }) df = pd.DataFrame(rows) print(f"Library: {len(df)} spectra") print(f"Named: {(df.compound_name != '').sum()}") print(f"With SMILES: {(df.smiles != '').sum()}") print(f"Ion modes: {df.ionmode.value_counts().to_dict()}")
Troubleshooting
| Problem | Cause | Solution |
|---|---|---|
| All scores are 0.0 | No matching peaks within tolerance | Increase (try 0.2–0.5 Da); verify both spectra have peaks |
| Low scores despite same compound | Different fragmentation conditions | Use instead of ; check ion mode consistency |
| Many spectra filtered to None | Too strict quality filters | Lower in ; relax intensity thresholds |
on metadata field | Field name not harmonized | Apply first to harmonize metadata keys |
| Memory error with large library | All-vs-all comparison | Pre-filter by precursor mass () before detailed scoring |
fails | RDKit not installed | Install chemistry extras: |
| Import returns empty list | Wrong file format or path | Verify format matches loader (MGF for , MSP for ); check file is not empty |
| Inconsistent scores between runs | Different processing pipelines | Use to ensure identical processing for queries and references |
Bundled Resources
-
references/filtering_catalog.md — Complete catalog of 50+ matchms filter functions organized by category (metadata processing, chemical structure, mass/charge, peak processing, quality control).
- Covers: all filter function signatures with parameters and brief descriptions, common filter combinations
- Relocated inline: key filters (default_filters, normalize_intensities, select_by_relative_intensity, reduce_to_number_of_peaks, remove_peaks_around_precursor_mz, add_fingerprint) in Core API Module 2; filter category summary in Key Concepts
- Omitted: individual filter examples duplicating Core API patterns
-
references/workflows_similarity.md — Extended workflows and detailed similarity function documentation consolidated from two original references.
- Covers: all 8 similarity functions with parameter tables, large-scale comparison strategies, multi-metric scoring patterns, 7 extended workflows (library matching, QC, multi-metric, format conversion, metadata enrichment, large-scale comparison, automated identification report), performance considerations
- Relocated inline: CosineGreedy/ModifiedCosine/CosineHungarian/FingerprintSimilarity usage in Core API Module 3; similarity comparison table in Key Concepts; library matching + QC workflows in Common Workflows
- Omitted: network-based spectral clustering workflow — requires external tool (spec2vec); spectra visualization tutorial — covered by
in Core API Module 1spectrum.plot()
Disposition of original reference files:
- importing_exporting.md (417 lines) → fully consolidated into Core API Module 1 (I/O functions, format list, Spectrum creation, pickle usage). Retained: all 7 import functions, 4 export functions, Spectrum class creation, format selection guidance. Omitted: USI detailed examples — niche use case
- filtering.md (289 lines) → migrated as references/filtering_catalog.md with key filters relocated to Core API Module 2
- similarity.md (381 lines) → consolidated into references/workflows_similarity.md with core functions in Core API Module 3
- workflows.md (648 lines) → consolidated into references/workflows_similarity.md with top workflows in Common Workflows
Related Skills
- pyopenms-mass-spectrometry — full LC-MS/MS proteomics and metabolomics pipelines (feature detection, protein ID)
- rdkit-cheminformatics — molecular fingerprint generation and chemical structure similarity
References
- matchms documentation: https://matchms.readthedocs.io
- Huber et al. (2020) matchms — processing and similarity scoring of mass spectrometry data. Journal of Open Source Software, DOI: 10.21105/joss.02411
- GitHub: https://github.com/matchms/matchms