Medical-research-skills matchms

Process, clean, and compare mass spectrometry (MS/MS) spectra with Matchms; use when you need reproducible spectral filtering and similarity scoring for metabolomics workflows.

install
source · Clone the upstream repo
git clone https://github.com/aipoch/medical-research-skills
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/aipoch/medical-research-skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/scientific-skills/Data Analysis/matchms" ~/.claude/skills/aipoch-medical-research-skills-matchms && rm -rf "$T"
manifest: scientific-skills/Data Analysis/matchms/SKILL.md
source content

Source: https://github.com/aipoch/medical-research-skills

Matchms Skill

When to Use

  • Use this skill when you need process, clean, and compare mass spectrometry (ms/ms) spectra with matchms; use when you need reproducible spectral filtering and similarity scoring for metabolomics workflows in a reproducible workflow.
  • Use this skill when a data analytics task needs a packaged method instead of ad-hoc freeform output.
  • Use this skill when the user expects a concrete deliverable, validation step, or file-based result.
  • Use this skill when
    scripts/similarity_pipeline.py
    is the most direct path to complete the request.
  • Use this skill when you need the
    matchms
    package behavior rather than a generic answer.

Key Features

  • Scope-focused workflow aligned to: Process, clean, and compare mass spectrometry (MS/MS) spectra with Matchms; use when you need reproducible spectral filtering and similarity scoring for metabolomics workflows.
  • Packaged executable path(s):
    scripts/similarity_pipeline.py
    .
  • Reference material available in
    references/
    for task-specific guidance.
  • Structured execution path designed to keep outputs consistent and reviewable.

Dependencies

  • Python
    :
    3.10+
    . Repository baseline for current packaged skills.
  • Third-party packages
    :
    not explicitly version-pinned in this skill package
    . Add pinned versions if this skill needs stricter environment control.

Example Usage

cd "20260316/scientific-skills/Data Analytics/matchms"
python -m py_compile scripts/similarity_pipeline.py
python scripts/similarity_pipeline.py --help

Example run plan:

  1. Confirm the user input, output path, and any required config values.
  2. Edit the in-file
    CONFIG
    block or documented parameters if the script uses fixed settings.
  3. Run
    python scripts/similarity_pipeline.py
    with the validated inputs.
  4. Review the generated output and return the final artifact with any assumptions called out.

Implementation Details

  • Execution model: validate the request, choose the packaged workflow, and produce a bounded deliverable.
  • Input controls: confirm the source files, scope limits, output format, and acceptance criteria before running any script.
  • Primary implementation surface:
    scripts/similarity_pipeline.py
    .
  • Reference guidance:
    references/
    contains supporting rules, prompts, or checklists.
  • Parameters to clarify first: input path, output path, scope filters, thresholds, and any domain-specific constraints.
  • Output discipline: keep results reproducible, identify assumptions explicitly, and avoid undocumented side effects.

1. When to Use

Use this skill when you need to:

  • Import and harmonize MS/MS spectra from common community formats (e.g., MGF/MSP) before analysis.
  • Clean spectra (peak filtering, intensity normalization) to improve downstream similarity scoring and identification.
  • Compute spectral similarity (Cosine/Modified Cosine/Fingerprint-based) for library matching or clustering.
  • Build reproducible, configurable processing pipelines for metabolomics projects.
  • Compare many spectra efficiently (all-vs-all or query-vs-library) and store/inspect score outputs.

2. Key Features

  • Import/Export support: Read spectra from mzML, mzXML, MGF, MSP, and JSON (depending on installed readers).
  • Filtering & harmonization: Metadata standardization, peak cleaning, intensity normalization, and other reusable filters.
  • Similarity scoring:
    • Cosine similarity (Greedy/Hungarian variants)
    • Modified Cosine (accounts for precursor mass shifts)
    • Fingerprint-based similarities (when molecular fingerprints are available)
  • Pipeline composition: Chain filters and scoring steps into repeatable workflows.

Additional reference material (if present in the repository):

  • Filters:
    references/filtering.md
  • Similarity:
    references/similarity.md
  • Workflows:
    references/workflows.md

3. Dependencies

  • matchms
    (version depends on your environment; pin in your project, e.g.,
    matchms>=0.20,<1.0
    )
  • numpy
    (e.g.,
    numpy>=1.20
    )
  • scipy
    (e.g.,
    scipy>=1.7
    )
  • rdkit
    (optional; required for chemistry/fingerprint-related functionality, version varies by distribution)

4. Example Usage

A minimal, runnable example that loads spectra from an MGF file and computes pairwise cosine scores:

from matchms.importing import load_from_mgf
from matchms import calculate_scores
from matchms.similarity import CosineGreedy

def main():
    # Load spectra from an MGF file
    spectra = list(load_from_mgf("data.mgf"))

    # Compute similarity scores (all-vs-all)
    scores = calculate_scores(
        references=spectra,
        queries=spectra,
        similarity_function=CosineGreedy()
    )

    # Iterate over computed scores
    for (reference_idx, query_idx, score, n_matches) in scores:
        print(
            f"ref={reference_idx:>3} query={query_idx:>3} "
            f"cosine={score:.4f} matches={n_matches}"
        )

if __name__ == "__main__":
    main()

5. Implementation Details

  • Data model: Matchms operates on
    Spectrum
    objects containing peak m/z and intensity arrays plus metadata (e.g., precursor m/z, charge, compound name/identifier).
  • Filtering stage: Typical pipelines apply filters to:
    • standardize/repair metadata fields,
    • remove noise peaks (e.g., by intensity threshold or m/z window rules),
    • normalize intensities (commonly to a maximum of 1.0 or to unit norm). See
      references/filtering.md
      for filter patterns and recommended sequences.
  • Cosine similarity (Greedy/Hungarian):
    • Peaks are matched within an m/z tolerance (implementation-specific defaults; configure via the similarity class parameters).
    • Greedy matching selects best available peak matches iteratively.
    • Hungarian matching solves an assignment problem to maximize total match score under one-to-one constraints.
  • Modified Cosine:
    • Extends cosine matching by allowing peak alignment with a precursor mass shift, improving matching for related compounds/adducts.
    • Typically requires precursor m/z metadata to be present and consistent.
  • Fingerprint similarity (optional):
    • Requires molecular fingerprints (often derived via RDKit) and compares spectra/compounds using fingerprint similarity metrics.
    • Use when you have structure annotations or can compute fingerprints reliably.
  • Workflow reproducibility:
    • Prefer explicit, ordered filter chains and pinned dependency versions.
    • Store configuration (tolerances, normalization choices, filters used) alongside results for traceability. See
      references/workflows.md
      for pipeline organization guidance.