Medical-research-skills matchms
Process, clean, and compare mass spectrometry (MS/MS) spectra with Matchms; use when you need reproducible spectral filtering and similarity scoring for metabolomics workflows.
install
source · Clone the upstream repo
git clone https://github.com/aipoch/medical-research-skills
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/aipoch/medical-research-skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/scientific-skills/Data Analysis/matchms" ~/.claude/skills/aipoch-medical-research-skills-matchms && rm -rf "$T"
manifest: scientific-skills/Data Analysis/matchms/SKILL.md
Matchms Skill
When to Use
- Use this skill when you need to process, clean, and compare mass spectrometry (MS/MS) spectra with matchms, with reproducible spectral filtering and similarity scoring for metabolomics workflows.
- Use this skill when a data analytics task needs a packaged method instead of ad-hoc freeform output.
- Use this skill when the user expects a concrete deliverable, validation step, or file-based result.
- Use this skill when scripts/similarity_pipeline.py is the most direct path to complete the request.
- Use this skill when you need the matchms package behavior rather than a generic answer.
Key Features
- Scope-focused workflow aligned to: Process, clean, and compare mass spectrometry (MS/MS) spectra with Matchms; use when you need reproducible spectral filtering and similarity scoring for metabolomics workflows.
- Packaged executable path(s): scripts/similarity_pipeline.py.
- Reference material available in references/ for task-specific guidance.
- Structured execution path designed to keep outputs consistent and reviewable.
Dependencies
- Python: 3.10+. Repository baseline for current packaged skills.
- Third-party packages: not explicitly version-pinned in this skill package. Add pinned versions if this skill needs stricter environment control.
Example Usage
cd "20260316/scientific-skills/Data Analytics/matchms"
python -m py_compile scripts/similarity_pipeline.py
python scripts/similarity_pipeline.py --help
Example run plan:
- Confirm the user input, output path, and any required config values.
- Edit the in-file CONFIG block or documented parameters if the script uses fixed settings.
- Run python scripts/similarity_pipeline.py with the validated inputs.
- Review the generated output and return the final artifact with any assumptions called out.
Implementation Details
- Execution model: validate the request, choose the packaged workflow, and produce a bounded deliverable.
- Input controls: confirm the source files, scope limits, output format, and acceptance criteria before running any script.
- Primary implementation surface: scripts/similarity_pipeline.py.
- Reference guidance: references/ contains supporting rules, prompts, or checklists.
- Parameters to clarify first: input path, output path, scope filters, thresholds, and any domain-specific constraints.
- Output discipline: keep results reproducible, identify assumptions explicitly, and avoid undocumented side effects.
1. When to Use
Use this skill when you need to:
- Import and harmonize MS/MS spectra from common community formats (e.g., MGF/MSP) before analysis.
- Clean spectra (peak filtering, intensity normalization) to improve downstream similarity scoring and identification.
- Compute spectral similarity (Cosine/Modified Cosine/Fingerprint-based) for library matching or clustering.
- Build reproducible, configurable processing pipelines for metabolomics projects.
- Compare many spectra efficiently (all-vs-all or query-vs-library) and store/inspect score outputs.
2. Key Features
- Import/Export support: Read spectra from mzML, mzXML, MGF, MSP, and JSON (depending on installed readers).
- Filtering & harmonization: Metadata standardization, peak cleaning, intensity normalization, and other reusable filters.
- Similarity scoring:
  - Cosine similarity (Greedy/Hungarian variants)
  - Modified Cosine (accounts for precursor mass shifts)
  - Fingerprint-based similarities (when molecular fingerprints are available)
- Pipeline composition: Chain filters and scoring steps into repeatable workflows.
Additional reference material (if present in the repository):
- Filters: references/filtering.md
- Similarity: references/similarity.md
- Workflows: references/workflows.md
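The greedy cosine and modified cosine variants listed above can be illustrated without matchms. The following is a minimal pure-Python sketch, not the matchms implementation: the function name greedy_cosine, its parameters, the shift convention (precursor m/z of B minus precursor m/z of A), and the toy peak lists are all assumptions made for illustration. Real spectra would go through the library's similarity classes instead.

```python
import math


def greedy_cosine(peaks_a, peaks_b, tolerance=0.1, mz_shift=0.0):
    """Greedy cosine score for two peak lists of (mz, intensity) tuples.

    Hypothetical helper for illustration only (not part of matchms).
    A peak in A matches a peak in B when their m/z values agree within
    `tolerance`, either directly or after adding `mz_shift` to the A peak.
    With mz_shift=0.0 this is plain greedy cosine; passing the precursor
    m/z difference (B minus A) gives a modified-cosine-style score.
    Returns (score, number_of_matched_peak_pairs).
    """
    shifts = (0.0,) if mz_shift == 0.0 else (0.0, mz_shift)

    # Collect every candidate pair together with its intensity product.
    candidates = []
    for i, (mz_a, int_a) in enumerate(peaks_a):
        for j, (mz_b, int_b) in enumerate(peaks_b):
            if any(abs(mz_a + shift - mz_b) <= tolerance for shift in shifts):
                candidates.append((int_a * int_b, i, j))

    # Greedy step: take the largest products first, using each peak once.
    candidates.sort(reverse=True)
    used_a, used_b = set(), set()
    matched_sum = 0.0
    for product, i, j in candidates:
        if i not in used_a and j not in used_b:
            used_a.add(i)
            used_b.add(j)
            matched_sum += product

    norm_a = math.sqrt(sum(intensity ** 2 for _, intensity in peaks_a))
    norm_b = math.sqrt(sum(intensity ** 2 for _, intensity in peaks_b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0, 0
    return matched_sum / (norm_a * norm_b), len(used_a)


# Toy spectra: B looks like A with two fragments shifted by 14.0 m/z.
spectrum_a = [(50.0, 1.0), (100.0, 0.5), (150.0, 0.2)]  # precursor m/z 200.0
spectrum_b = [(50.0, 1.0), (114.0, 0.5), (164.0, 0.2)]  # precursor m/z 214.0

plain = greedy_cosine(spectrum_a, spectrum_b, tolerance=0.1)
modified = greedy_cosine(spectrum_a, spectrum_b, tolerance=0.1, mz_shift=214.0 - 200.0)
print(f"plain cosine:    score={plain[0]:.3f}, matches={plain[1]}")
print(f"modified cosine: score={modified[0]:.3f}, matches={modified[1]}")
```

In this toy case only the shared fragment at m/z 50 matches under plain cosine, while allowing the precursor shift lets all three peak pairs align. A Hungarian variant would replace the greedy loop with an optimal one-to-one assignment over the same candidate pairs.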
3. Dependencies
- matchms (version depends on your environment; pin in your project, e.g., matchms>=0.20,<1.0)
- numpy (e.g., numpy>=1.20)
- scipy (e.g., scipy>=1.7)
- rdkit (optional; required for chemistry/fingerprint-related functionality, version varies by distribution)
4. Example Usage
A minimal, runnable example that loads spectra from an MGF file and computes pairwise cosine scores:
from matchms import calculate_scores
from matchms.importing import load_from_mgf
from matchms.similarity import CosineGreedy


def main():
    # Load spectra from an MGF file
    spectra = list(load_from_mgf("data.mgf"))

    # Compute similarity scores (all-vs-all)
    scores = calculate_scores(
        references=spectra,
        queries=spectra,
        similarity_function=CosineGreedy(),
    )

    # Assumption: in recent matchms releases, iterating over the scores yields
    # (reference, query, score), where reference and query are Spectrum objects
    # and, for CosineGreedy, score holds the cosine score and the matched-peak
    # count. Adjust the unpacking if your installed version differs.
    for reference, query, score in scores:
        print(
            f"ref={reference.get('compound_name')} "
            f"query={query.get('compound_name')} "
            f"cosine={score[0]:.4f} matches={score[1]}"
        )


if __name__ == "__main__":
    main()
5. Implementation Details
- Data model: Matchms operates on Spectrum objects containing peak m/z and intensity arrays plus metadata (e.g., precursor m/z, charge, compound name/identifier).
- Filtering stage: Typical pipelines apply filters to:
  - standardize/repair metadata fields,
  - remove noise peaks (e.g., by intensity threshold or m/z window rules),
  - normalize intensities (commonly to a maximum of 1.0 or to unit norm).
See references/filtering.md for filter patterns and recommended sequences.
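The filtering stage described above can be sketched as an ordered chain of small functions over (m/z, intensity) pairs. This is a pure-Python illustration under stated assumptions, not matchms code: the helper names, the thresholds (50 to 1000 m/z, 5% relative intensity), and the sample peaks are hypothetical, and matchms provides its own filters that operate on Spectrum objects.

```python
from functools import partial


def select_mz_window(peaks, mz_from, mz_to):
    """Keep peaks whose m/z lies inside [mz_from, mz_to]."""
    return [(mz, intensity) for mz, intensity in peaks if mz_from <= mz <= mz_to]


def remove_low_intensity(peaks, min_relative_intensity):
    """Drop peaks below a fraction of the most intense (base) peak."""
    if not peaks:
        return []
    base = max(intensity for _, intensity in peaks)
    if base <= 0:
        return []
    return [(mz, i) for mz, i in peaks if i / base >= min_relative_intensity]


def normalize_to_max(peaks):
    """Scale intensities so that the base peak equals 1.0."""
    if not peaks:
        return []
    base = max(intensity for _, intensity in peaks)
    if base <= 0:
        return []
    return [(mz, i / base) for mz, i in peaks]


def apply_filters(peaks, filters):
    """Apply the filters in the given order; the order is part of the method."""
    for peak_filter in filters:
        peaks = peak_filter(peaks)
    return peaks


# Hypothetical filter chain: window first, then noise removal, then scaling.
FILTER_CHAIN = [
    partial(select_mz_window, mz_from=50.0, mz_to=1000.0),
    partial(remove_low_intensity, min_relative_intensity=0.05),
    normalize_to_max,
]

raw_peaks = [(45.0, 3.0), (101.1, 250.0), (120.0, 5.0), (150.2, 1000.0), (1200.5, 80.0)]
cleaned = apply_filters(raw_peaks, FILTER_CHAIN)
print(cleaned)
```

Keeping the chain as an explicit list makes the filter order visible and easy to record with the results, since swapping steps (for example, normalizing before windowing) can change which peaks survive.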
- Cosine similarity (Greedy/Hungarian):
  - Peaks are matched within an m/z tolerance (implementation-specific defaults; configure via the similarity class parameters).
  - Greedy matching selects best available peak matches iteratively.
  - Hungarian matching solves an assignment problem to maximize total match score under one-to-one constraints.
- Modified Cosine:
  - Extends cosine matching by allowing peak alignment with a precursor mass shift, improving matching for related compounds/adducts.
  - Typically requires precursor m/z metadata to be present and consistent.
- Fingerprint similarity (optional):
  - Requires molecular fingerprints (often derived via RDKit) and compares spectra/compounds using fingerprint similarity metrics.
  - Use when you have structure annotations or can compute fingerprints reliably.
- Workflow reproducibility:
  - Prefer explicit, ordered filter chains and pinned dependency versions.
  - Store configuration (tolerances, normalization choices, filters used) alongside results for traceability.
See references/workflows.md for pipeline organization guidance.
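One way to follow the reproducibility advice above is to write the run configuration next to the scores and tag the scores with a hash of that configuration. The sketch below uses only the standard library; the function names, file names (config.json, scores.json), and configuration keys are assumptions chosen for illustration and are not part of the packaged script.

```python
import hashlib
import json
import tempfile
from importlib import metadata
from pathlib import Path


def installed_version(package):
    """Return the installed version of a package, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def save_run(output_dir, config, scores):
    """Write config.json and scores.json; return a short config hash.

    Hypothetical layout: the hash is computed over the key-sorted JSON text,
    so the same settings always produce the same identifier.
    """
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)

    config_text = json.dumps(config, sort_keys=True, indent=2)
    config_hash = hashlib.sha256(config_text.encode("utf-8")).hexdigest()[:12]
    (output_dir / "config.json").write_text(config_text, encoding="utf-8")

    record = {"config_sha256_prefix": config_hash, "scores": scores}
    (output_dir / "scores.json").write_text(
        json.dumps(record, indent=2), encoding="utf-8"
    )
    return config_hash


# Example settings; every value here is a placeholder, not a recommendation.
example_config = {
    "similarity": "CosineGreedy",
    "tolerance": 0.1,
    "filters": ["select_mz_window", "remove_low_intensity", "normalize_to_max"],
    "versions": {name: installed_version(name) for name in ("matchms", "numpy", "scipy")},
}
example_scores = [{"reference": 0, "query": 1, "score": 0.775, "matches": 1}]

with tempfile.TemporaryDirectory() as tmp_dir:
    run_id = save_run(tmp_dir, example_config, example_scores)
    print(f"saved run {run_id} with files:", sorted(p.name for p in Path(tmp_dir).iterdir()))
```

Recording the installed package versions at run time complements pinned requirements: the pins state what was intended, while the stored record shows what was actually used.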