SciAgent-Skills metabolomics-workbench-database
Query Metabolomics Workbench REST API (4,200+ studies, NIH-funded) for metabolite identification, study discovery, RefMet name standardization, MS m/z precursor ion searches, MetStat study filtering, and gene/protein annotations. For local metabolite XML parsing use hmdb-database; for compound property lookups use pubchem-compound-search.
git clone https://github.com/jaechang-hits/SciAgent-Skills
T=$(mktemp -d) && git clone --depth=1 https://github.com/jaechang-hits/SciAgent-Skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/proteomics-protein-engineering/metabolomics-workbench-database" ~/.claude/skills/jaechang-hits-sciagent-skills-metabolomics-workbench-database && rm -rf "$T"
skills/proteomics-protein-engineering/metabolomics-workbench-database/SKILL.mdMetabolomics Workbench Database — REST API Access
Overview
Query the Metabolomics Workbench (MW) REST API to access 4,200+ metabolomics studies hosted at UCSD under NIH Common Fund sponsorship. The API provides six query contexts: compound/metabolite lookups, study metadata and experimental data retrieval, RefMet standardized nomenclature, MetStat study filtering by species/disease/analysis type, m/z precursor ion searches for compound identification, and gene/protein annotation from the Metabolomics Gene/Protein (MGP) database.
When to Use
- Searching for metabolite structures, identifiers, or chemical properties by PubChem CID, KEGG ID, InChI key, or formula
- Discovering metabolomics studies by species, disease, analysis type, or polarity
- Standardizing metabolite names to RefMet nomenclature for cross-study comparison
- Identifying unknown compounds from mass spectrometry m/z values with adduct type matching
- Retrieving experimental metabolomics data (concentrations, abundances) from published studies
- Querying gene or protein annotations linked to metabolomics pathways
- Downloading study data in mwTab format for local analysis
- For local metabolite database parsing (220K+ entries, NMR/MS spectra) use
insteadhmdb-database - For live compound property searches (110M+ compounds) use
insteadpubchem-compound-search
Prerequisites
- Python packages:
,requestspandas - No API key required: MW REST API is publicly accessible without authentication
- Rate limits: MW does not enforce strict rate limits for reasonable use. For bulk queries (100+), add 0.5-1s delays between requests
- Base URL:
https://www.metabolomicsworkbench.org/rest
pip install requests pandas
Quick Start
import requests import time BASE = "https://www.metabolomicsworkbench.org/rest" def mw_query(context, input_item, input_value, output_item="all", fmt="json"): """Query Metabolomics Workbench REST API. URL pattern: /context/input_item/input_value/output_item/output_format """ url = f"{BASE}/{context}/{input_item}/{input_value}/{output_item}/{fmt}" resp = requests.get(url, timeout=30) resp.raise_for_status() return resp.json() if fmt == "json" else resp.text # Example: look up glucose by name result = mw_query("compound", "name", "glucose") print(result) # {'regno': '...', 'formula': 'C6H12O6', 'exactmass': '180.063388...', ...}
Core API
Module 1: Compound Queries
Search metabolite records by various identifiers. Returns chemical properties, structure info, and cross-references.
# Search by PubChem CID compound = mw_query("compound", "pubchem_cid", "5793") print(compound.get("name"), compound.get("formula")) # Glucose C6H12O6 # Search by KEGG compound ID compound = mw_query("compound", "kegg_id", "C00031") print(compound.get("name"), compound.get("exactmass")) # D-Glucose 180.06338810 # Search by InChI key compound = mw_query("compound", "inchi_key", "WQZGKKKJIJFFOK-GASJEMHNSA-N") # Search by molecular formula matches = mw_query("compound", "formula", "C6H12O6") # Returns all compounds with this formula # Search by registry number (MW internal ID) compound = mw_query("compound", "regno", "11") # Available input_items: regno, formula, name, pubchem_cid, kegg_id, inchi_key, lm_id, hmdb_id, bmrb_id
Module 2: Study Access
Retrieve study metadata, experimental factors, and data from deposited metabolomics studies.
# Get study summary by study ID study = mw_query("study", "study_id", "ST000001", "summary") print(study.get("study_title"), study.get("institute")) # Get study metadata (species, analysis type, etc.) study_meta = mw_query("study", "study_id", "ST000001") print(study_meta.get("species"), study_meta.get("analysis_type")) # Get study factors (experimental conditions) factors = mw_query("study", "study_id", "ST000001", "factors") # Returns factor names and levels for the study # Get study data (metabolite measurements) data = mw_query("study", "study_id", "ST000001", "data") # Returns concentration/abundance values per sample # Get analysis details analysis = mw_query("study", "study_id", "ST000001", "analysis") print(analysis.get("analysis_type"), analysis.get("instrument_name")) # Download mwTab file (tab-delimited study format) mwtab_text = mw_query("study", "study_id", "ST000001", "mwtab", fmt="txt") # Returns full mwTab formatted text
Module 3: RefMet Nomenclature
Standardize metabolite names using RefMet (Reference Metabolomics) classification. RefMet provides a hierarchical nomenclature: super_class > main_class > sub_class.
# Standardize a metabolite name to RefMet refmet = mw_query("refmet", "name", "Palmitic acid") print(refmet.get("refmet_name"), refmet.get("super_class")) # Palmitic acid Fatty Acyls # Get all metabolites in a main_class fatty_acids = mw_query("refmet", "main_class", "Fatty acids") print(f"Found {len(fatty_acids) if isinstance(fatty_acids, list) else 1} entries") # Get all metabolites in a sub_class lg_fa = mw_query("refmet", "sub_class", "Long-chain fatty acids") # Search by exact mass (with tolerance) # Use match="name" for name matching refmet_match = mw_query("refmet", "match", "Palmitic acid") print(refmet_match.get("formula"), refmet_match.get("exactmass")) # C16H32O2 256.24023
Module 4: MetStat Filtering
Filter studies using semicolon-delimited filter strings. MetStat queries enable discovery of studies by analysis type, polarity, species, and disease.
# MetStat filter format: "analysis_type:value;polarity:value;species:value;disease:value" # Each field is optional; separate multiple filters with semicolons # Find all LC-MS studies in human results = mw_query("metstat", "filter", "analysis_type:LC-MS;species:Human") print(f"Found {len(results) if isinstance(results, list) else 1} studies") # Filter by disease diabetes_studies = mw_query("metstat", "filter", "disease:Diabetes;species:Human") # Filter by polarity pos_studies = mw_query("metstat", "filter", "analysis_type:LC-MS;polarity:positive") # Combined multi-field filter filtered = mw_query("metstat", "filter", "analysis_type:LC-MS;polarity:positive;species:Human;disease:Cancer") # Available filter fields and common values: # analysis_type: LC-MS, GC-MS, CE-MS, NMR # polarity: positive, negative # species: Human, Mouse, Rat, etc. # disease: Cancer, Diabetes, Obesity, etc.
Module 5: m/z Search (Moverz)
Search for metabolites by precursor ion m/z value. Essential for compound identification from mass spectrometry data.
# Search by m/z with adduct type and tolerance # Format: moverz/mz_value/tolerance/ion_type/... mz_results = mw_query("moverz", "mz", "180.063/0.005/M+H") # Returns candidate compounds matching the m/z within tolerance # Negative mode search mz_neg = mw_query("moverz", "mz", "179.056/0.005/M-H") # Sodium adduct search mz_na = mw_query("moverz", "mz", "203.053/0.01/M+Na") # Search with wider tolerance for low-resolution instruments mz_wide = mw_query("moverz", "mz", "180.063/0.5/M+H") print("Candidates:", mz_results) # Returns: compound name, formula, exact mass, delta (mass error)
Module 6: Gene Information
Query gene annotations from the Metabolomics Gene/Protein (MGP) database.
# Search by gene symbol gene = mw_query("gene", "gene_symbol", "HMGCR") print(gene.get("gene_name"), gene.get("taxonomy")) # 3-hydroxy-3-methylglutaryl-CoA reductase Homo sapiens # Search by gene ID gene_by_id = mw_query("gene", "gene_id", "3156") print(gene_by_id.get("gene_symbol")) # Search by taxonomy human_genes = mw_query("gene", "taxonomy", "Homo sapiens")
Module 7: Protein Data
Retrieve protein sequence and annotation data from the MGP database.
# Search by UniProt ID protein = mw_query("protein", "uniprot_id", "P04035") print(protein.get("protein_name"), protein.get("gene_symbol")) # Search by gene symbol for protein info protein_by_gene = mw_query("protein", "gene_symbol", "HMGCR") print(protein_by_gene.get("sequence")[:50] if protein_by_gene.get("sequence") else "No seq") # Search by MGP ID protein_mgp = mw_query("protein", "mgp_id", "MGP000001")
Key Concepts
API URL Structure
All MW REST API endpoints follow the same pattern:
https://www.metabolomicsworkbench.org/rest/{context}/{input_item}/{input_value}/{output_item}/{format}
| Component | Description | Example Values |
|---|---|---|
| Query domain | , , , , , , |
| Search field | , , , , |
| Search term | , , |
| Data to return | , , , , , |
| Response format | , |
RefMet Classification Hierarchy
RefMet standardizes metabolite naming with three classification levels:
| Super Class | Main Class (examples) | Sub Class (examples) |
|---|---|---|
| Fatty Acyls | Fatty acids, Eicosanoids | Short/Medium/Long/Very long-chain FA |
| Glycerolipids | Monoradylglycerols, Diradylglycerols | Monoacylglycerols, Diacylglycerols |
| Glycerophospholipids | Glycerophosphocholines, -ethanolamines | Lysophosphatidylcholines |
| Sphingolipids | Sphingoid bases, Ceramides | Ceramide phosphocholines |
| Steroids | Cholesterol esters, Bile acids | C18/C19/C21 steroids |
| Prenol Lipids | Isoprenoids, Quinones | Ubiquinones, Terpenes |
| Organic acids | Amino acids, Carboxylic acids | Alpha amino acids |
| Nucleosides | Purine nucleosides, Pyrimidine | Adenosine, Cytidine |
| Carbohydrates | Monosaccharides, Disaccharides | Hexoses, Pentoses |
Ion Adduct Types (Moverz)
Common adduct types for m/z searches (mass spectrometry):
| Adduct | Mode | Mass Shift | Use When |
|---|---|---|---|
| M+H | Positive | +1.0073 | Default positive mode |
| M+Na | Positive | +22.9892 | Sodium adducts (common in ESI) |
| M+K | Positive | +38.9632 | Potassium adducts |
| M+NH4 | Positive | +18.0338 | Ammonium adducts (lipids) |
| M-H | Negative | -1.0073 | Default negative mode |
| M-H-H2O | Negative | -19.0178 | Dehydrated anions |
| M+Cl | Negative | +34.9689 | Chloride adducts |
| M+FA-H | Negative | +44.9982 | Formate adducts (LC-MS) |
| M+2H | Positive | +1.0073 (z=2) | Doubly charged ions |
| M-2H | Negative | -1.0073 (z=2) | Doubly charged negative |
MetStat Filter Syntax
MetStat uses semicolon-delimited key:value pairs. All fields are optional:
analysis_type:{value};polarity:{value};species:{value};disease:{value}
- Omit any field to leave it unfiltered
- Values are case-sensitive (use exact values:
notHuman
)human - Combine as many fields as needed
Common Workflows
Workflow 1: Metabolite Identification Pipeline
Standardize a metabolite name, find related studies, and retrieve experimental data.
import pandas as pd # Step 1: Standardize name via RefMet refmet = mw_query("refmet", "name", "Palmitic acid") std_name = refmet.get("refmet_name", "Palmitic acid") formula = refmet.get("formula") print(f"Standardized: {std_name}, Formula: {formula}") # Step 2: Search compound database for cross-references compound = mw_query("compound", "name", std_name) print(f"PubChem CID: {compound.get('pubchem_cid')}, " f"KEGG: {compound.get('kegg_id')}, HMDB: {compound.get('hmdb_id')}") # Step 3: Find studies containing this metabolite via MetStat studies = mw_query("metstat", "filter", "species:Human") # Filter client-side for studies with the metabolite of interest if isinstance(studies, list): print(f"Found {len(studies)} human metabolomics studies") # Step 4: Get data from a specific study study_data = mw_query("study", "study_id", "ST000001", "data") if isinstance(study_data, list): df = pd.DataFrame(study_data) print(f"Data shape: {df.shape}") print(df.head())
Workflow 2: MS Compound Identification
Identify unknown compounds from mass spectrometry m/z values.
# Step 1: Search positive mode m/z target_mz = "256.240" tolerance = "0.01" candidates_pos = mw_query("moverz", "mz", f"{target_mz}/{tolerance}/M+H") # Step 2: Also check sodium adduct candidates_na = mw_query("moverz", "mz", f"{target_mz}/{tolerance}/M+Na") time.sleep(0.5) # Step 3: For each candidate, get full compound info if isinstance(candidates_pos, list): for candidate in candidates_pos[:5]: # Top 5 candidates name = candidate.get("name", "Unknown") delta = candidate.get("delta", "N/A") print(f"Candidate: {name}, Mass error: {delta}") # Get detailed compound info detail = mw_query("compound", "name", name) print(f" Formula: {detail.get('formula')}, " f"KEGG: {detail.get('kegg_id')}") time.sleep(0.5) # Step 4: Standardize top candidate via RefMet if candidates_pos: top_name = candidates_pos[0].get("name", "") refmet = mw_query("refmet", "name", top_name) print(f"RefMet class: {refmet.get('super_class')} > " f"{refmet.get('main_class')} > {refmet.get('sub_class')}")
Workflow 3: Disease Metabolomics Exploration
Discover metabolomics studies for a disease and extract experimental data.
import pandas as pd # Step 1: Filter studies by disease and analysis type diabetes_lc = mw_query("metstat", "filter", "disease:Diabetes;analysis_type:LC-MS;species:Human") print(f"Found {len(diabetes_lc) if isinstance(diabetes_lc, list) else 1} studies") # Step 2: Get study details for top results if isinstance(diabetes_lc, list): for study_entry in diabetes_lc[:3]: sid = study_entry.get("study_id", "") if sid: summary = mw_query("study", "study_id", sid, "summary") print(f"\n{sid}: {summary.get('study_title', 'N/A')}") print(f" Institute: {summary.get('institute', 'N/A')}") time.sleep(0.5) # Step 3: Get experimental factors and data from one study target_study = "ST000001" factors = mw_query("study", "study_id", target_study, "factors") print(f"\nFactors for {target_study}:", factors) data = mw_query("study", "study_id", target_study, "data") if isinstance(data, list): df = pd.DataFrame(data) print(f"Dataset: {df.shape[0]} rows x {df.shape[1]} columns") print(df.describe())
Key Parameters
| Function/Endpoint | Parameter | Description | Example Values |
|---|---|---|---|
| | Search field | , , , , , , , , |
| | Data to retrieve | , , , , , |
| | Classification level | , , , , |
| filter string | Semicolon-delimited | |
| value | m/z / tolerance / adduct | |
| | Gene search field | , , |
| | Protein search field | , , |
| All | | Response format | (default), |
Best Practices
- Use RefMet for standardization: Always standardize metabolite names through RefMet before cross-study comparisons. Different studies may use synonyms for the same compound
- Add delays for bulk queries: Insert
between requests when querying >100 endpoints to avoid overloading the servertime.sleep(0.5) - Check response types: The API may return a dict (single result) or list (multiple results). Always handle both:
results if isinstance(results, list) else [results] - Use specific output_items: Request
,summary
, orfactors
individually rather thandata
to reduce response size and parse timeall - Validate m/z tolerance: Use tight tolerance (0.005 Da) for high-resolution instruments (Orbitrap, TOF) and wider tolerance (0.5 Da) for low-resolution instruments
- MetStat values are case-sensitive: Use exact values (
notHuman
,human
notLC-MS
). Check available values via the MW web interface if unsurelc-ms - Cache compound lookups: Compound data changes infrequently. Cache results locally to avoid redundant API calls during iterative analysis
Common Recipes
Recipe: Batch Metabolite Standardization via RefMet
metabolite_names = ["palmitic acid", "oleic acid", "stearic acid", "linoleic acid", "arachidonic acid"] standardized = [] for name in metabolite_names: result = mw_query("refmet", "name", name) standardized.append({ "original": name, "refmet_name": result.get("refmet_name", name), "super_class": result.get("super_class", ""), "main_class": result.get("main_class", ""), "formula": result.get("formula", "") }) time.sleep(0.5) df_std = pd.DataFrame(standardized) print(df_std.to_string(index=False))
Recipe: Cross-Database ID Mapping
# Map a compound across PubChem, KEGG, HMDB compound = mw_query("compound", "name", "L-Alanine") id_map = { "MW_regno": compound.get("regno"), "PubChem_CID": compound.get("pubchem_cid"), "KEGG_ID": compound.get("kegg_id"), "HMDB_ID": compound.get("hmdb_id"), "LipidMaps_ID": compound.get("lm_id"), "Formula": compound.get("formula"), "Exact_Mass": compound.get("exactmass") } for db, val in id_map.items(): print(f" {db}: {val}")
Recipe: Export Study Data to DataFrame
import pandas as pd study_id = "ST000001" # Get study data and convert to DataFrame raw_data = mw_query("study", "study_id", study_id, "data") if isinstance(raw_data, list): df = pd.DataFrame(raw_data) elif isinstance(raw_data, dict): df = pd.DataFrame([raw_data]) else: df = pd.DataFrame() # Get study metadata for context meta = mw_query("study", "study_id", study_id, "summary") print(f"Study: {meta.get('study_title', study_id)}") print(f"Species: {meta.get('species')}, Analysis: {meta.get('analysis_type')}") print(f"Data shape: {df.shape}") print(df.head()) # Export to CSV df.to_csv(f"{study_id}_data.csv", index=False)
Troubleshooting
| Problem | Cause | Solution |
|---|---|---|
Empty JSON response | Invalid input_item or input_value | Verify the context/input_item combination is valid (see Key Parameters table). Check spelling and case |
or timeout | MW server temporarily unavailable | Retry after 30s. MW occasionally has maintenance windows. Add to requests |
| MetStat returns no results | Case-sensitive filter values | Use exact case: not , not . Check available values on MW website |
| m/z search returns too many hits | Tolerance too wide | Reduce tolerance from 0.5 to 0.01 or 0.005 Da for high-resolution instruments |
| m/z search returns no hits | Wrong adduct type or too-tight tolerance | Try alternative adducts (M+H, M+Na, M-H). Widen tolerance. Verify the m/z value is correct |
on response | Endpoint returns text, not JSON | Some endpoints (e.g., output) return plain text. Use instead of |
| Study data missing columns | Study uses different data format | Check output first to understand the study's data structure. Not all studies have uniform column names |
| RefMet name not found | Metabolite not in RefMet database | Try alternative names or synonyms. RefMet covers ~120K standardized names but some rare metabolites may be absent |
Bundled Resources
This entry is self-contained. The original
references/api_reference.md (494 lines) covering all 7 API contexts (Compound, Study, RefMet, MetStat, Moverz, Gene, Protein) has been fully consolidated inline:
- Compound endpoint: input_items and output fields consolidated into Core API Module 1 + Key Parameters table
- Study endpoint: output_items (summary, factors, data, analysis, mwtab) consolidated into Core API Module 2
- RefMet endpoint: classification hierarchy consolidated into Core API Module 3 + Key Concepts RefMet table
- MetStat endpoint: filter syntax consolidated into Core API Module 4 + Key Concepts MetStat section
- Moverz endpoint: adduct types consolidated into Core API Module 5 + Key Concepts Ion Adduct table
- Gene/Protein endpoints: consolidated into Core API Modules 6 and 7
- Omitted: raw curl examples (replaced with Python helper function), HTML output format examples (rarely used programmatically)
Related Skills
- hmdb-database -- local XML parsing for 220K+ metabolites with NMR/MS spectral data; use when MW does not have the metabolite or you need spectral peak lists
- pubchem-compound-search -- broader compound property lookups (110M+ compounds) via PubChemPy; use for general chemistry queries beyond metabolomics
- matchms-spectral-matching -- spectral similarity scoring for metabolite identification from MS/MS data; complementary to MW m/z searches
- pyopenms-mass-spectrometry -- full LC-MS/MS data processing pipeline; use for raw spectra processing before querying MW for identification
- kegg-database -- pathway and compound queries; use KEGG IDs from MW compound lookups for pathway context
References
- Metabolomics Workbench REST API: https://www.metabolomicsworkbench.org/tools/MWRestAPIv1.0.pdf
- MW REST Interactive URL Creator: https://www.metabolomicsworkbench.org/databases/metabolites/mw-rest.php
- Sud et al. "Metabolomics Workbench: An international repository for metabolomics data" Nucleic Acids Research (2016) https://doi.org/10.1093/nar/gkv1042
- RefMet nomenclature: https://www.metabolomicsworkbench.org/databases/refmet/index.php