Claude-skill-registry bio-proteomics-data-import
Load and parse mass spectrometry data formats including mzML, mzXML, and quantification tool outputs like MaxQuant proteinGroups.txt. Use when starting a proteomics analysis with raw or processed MS data. Handles contaminant filtering and missing value assessment.
install
source · Clone the upstream repo
git clone https://github.com/majiayu000/claude-skill-registry
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/data-import" ~/.claude/skills/majiayu000-claude-skill-registry-bio-proteomics-data-import && rm -rf "$T"
manifest: skills/data/data-import/SKILL.md
Mass Spectrometry Data Import
Loading mzML/mzXML Files with pyOpenMS
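    from pyopenms import MSExperiment, MzMLFile, MzXMLFile

    exp = MSExperiment()
    MzMLFile().load('sample.mzML', exp)  # for mzXML: MzXMLFile().load('sample.mzXML', exp)

    for spectrum in exp:
        if spectrum.getMSLevel() == 1:
            # MS1 survey scan: parallel arrays of m/z and intensity
            mz, intensity = spectrum.get_peaks()
        elif spectrum.getMSLevel() == 2:
            # MS2 fragment scan: read the precursor m/z if one is recorded
            precursors = spectrum.getPrecursors()
            if precursors:
                precursor_mz = precursors[0].getMZ()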
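pyOpenMS decodes peak arrays internally, so `get_peaks()` returns ready-to-use arrays. For context, the following is a minimal sketch of what that decoding involves, assuming the usual mzML encoding of a binary data array: base64 text wrapping little-endian 32- or 64-bit floats, optionally zlib-compressed. The helper name `decode_binary_array` and its parameters are hypothetical and not part of pyOpenMS, and other compression schemes (such as Numpress) are not handled.

```python
import base64
import struct
import zlib


def decode_binary_array(encoded, precision=64, compressed=False):
    """Decode one mzML-style binary data array into a list of floats.

    Hypothetical helper for illustration only; pyOpenMS does this for you.
    """
    raw = base64.b64decode(encoded)
    if compressed:
        raw = zlib.decompress(raw)
    width = precision // 8                   # 4 bytes for 32-bit, 8 for 64-bit
    code = 'd' if precision == 64 else 'f'
    count = len(raw) // width
    # '<' selects little-endian byte order
    return list(struct.unpack('<' + code * count, raw))


# Round-trip a small m/z array to show the encoding in both directions
mz_values = [100.5, 200.25, 300.125]
packed = struct.pack('<3d', *mz_values)
encoded = base64.b64encode(zlib.compress(packed)).decode('ascii')
decoded = decode_binary_array(encoded, precision=64, compressed=True)
```

In practice there is rarely a reason to decode arrays by hand; the sketch is mainly useful when debugging a file that a parser rejects.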
Loading MaxQuant Output
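    import pandas as pd

    protein_groups = pd.read_csv('proteinGroups.txt', sep='\t', low_memory=False)

    # Filter contaminants, reverse (decoy) hits, and site-only identifications.
    # Older MaxQuant versions name the contaminant column 'Contaminant'.
    contam_col = 'Potential contaminant' if 'Potential contaminant' in protein_groups.columns else 'Contaminant'
    keep = pd.Series(True, index=protein_groups.index)
    for col in (contam_col, 'Reverse', 'Only identified by site'):
        if col in protein_groups.columns:  # skip flag columns that are absent
            keep &= protein_groups[col] != '+'
    protein_groups = protein_groups[keep]

    # Extract intensity columns (LFQ or iBAQ)
    intensity_cols = [c for c in protein_groups.columns
                      if c.startswith('LFQ intensity') or c.startswith('iBAQ ')]
    if not intensity_cols:
        # Fall back to raw intensities, skipping 'Intensity L...' columns
        intensity_cols = [c for c in protein_groups.columns
                          if c.startswith('Intensity ') and 'Intensity L' not in c]

    intensities = protein_groups[['Protein IDs', 'Gene names'] + intensity_cols].copy()
    # MaxQuant writes 0 for unquantified values; convert to NaN so they count as missing
    intensities[intensity_cols] = intensities[intensity_cols].replace(0, float('nan'))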
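When the flag columns are missing from an export, the protein IDs themselves can serve as a fallback, because MaxQuant prefixes contaminant entries with `CON__` and reverse (decoy) entries with `REV__`. The following is a minimal sketch under that assumption. The function name `is_contaminant_or_decoy` is hypothetical, and treating a group as flagged only when every member ID carries a prefix is one possible policy, not the skill's own rule.

```python
def is_contaminant_or_decoy(protein_ids):
    """Return True when every ID in a 'Protein IDs' entry is CON__ or REV__.

    Hypothetical helper; 'Protein IDs' entries are semicolon-separated.
    """
    ids = [p for p in str(protein_ids).split(';') if p]
    return bool(ids) and all(p.startswith(('CON__', 'REV__')) for p in ids)


examples = {
    'CON__P02768;REV__Q9Y6K9': is_contaminant_or_decoy('CON__P02768;REV__Q9Y6K9'),
    'P12345;CON__P02768': is_contaminant_or_decoy('P12345;CON__P02768'),
    'P12345': is_contaminant_or_decoy('P12345'),
}
```

With a loaded table, this could be applied as `protein_groups[~protein_groups['Protein IDs'].map(is_contaminant_or_decoy)]`. Switching `all` to `any` gives a stricter filter that also drops mixed groups.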
Loading Spectronaut/DIA-NN Output
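    import pandas as pd

    # DIA-NN main report: long format, one row per precursor per run.
    # Spectronaut exports use different column names; adjust the pivot accordingly.
    diann_report = pd.read_csv('report.tsv', sep='\t')

    # Pivot to a protein-by-run matrix. PG.MaxLFQ is repeated on every precursor
    # row of a protein group within a run, so taking the first value is sufficient.
    protein_matrix = diann_report.pivot_table(
        index='Protein.Group',
        columns='Run',
        values='PG.MaxLFQ',
        aggfunc='first'
    )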
R: Loading with MSnbase
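    library(MSnbase)

    # 'onDisk' mode reads spectra lazily instead of loading them all into memory
    raw_data <- readMSData('sample.mzML', mode = 'onDisk')
    spectra_list <- spectra(raw_data)  # list of Spectrum objects
    header_info <- fData(raw_data)     # per-spectrum metadata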
Missing Value Assessment
    def assess_missing_values(df, intensity_cols):
        """Summarize missing (NaN) intensities per protein, per sample, and overall."""
        missing = df[intensity_cols].isna()
        missing_per_protein = missing.sum(axis=1)
        missing_per_sample = missing.sum(axis=0)
        total_missing = missing.sum().sum()
        total_values = missing.size
        # Guard against an empty table to avoid dividing by zero
        missing_pct = 100 * total_missing / total_values if total_values else 0.0
        return {
            'per_protein': missing_per_protein,
            'per_sample': missing_per_sample,
            'total_pct': missing_pct,
        }
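Once missingness has been assessed, a common follow-up is to keep only proteins quantified in enough samples. The following is a minimal sketch of that step on a toy table. The function name `filter_by_valid_values`, the `min_valid` threshold, and the sample names are assumptions for illustration and not part of the skill.

```python
import pandas as pd

NAN = float('nan')


def filter_by_valid_values(df, intensity_cols, min_valid):
    """Keep rows with at least `min_valid` non-missing intensities."""
    valid_counts = df[intensity_cols].notna().sum(axis=1)
    return df[valid_counts >= min_valid]


# Toy table: P1 has 3 valid values, P2 and P3 have 1 each
toy = pd.DataFrame({
    'Protein IDs': ['P1', 'P2', 'P3'],
    'LFQ intensity A': [1.0e6, NAN, 3.0e6],
    'LFQ intensity B': [2.0e6, NAN, NAN],
    'LFQ intensity C': [1.5e6, 4.0e5, NAN],
})
cols = [c for c in toy.columns if c.startswith('LFQ intensity')]
kept = filter_by_valid_values(toy, cols, min_valid=2)
```

The threshold is dataset-dependent. If the intensities come from a tool that writes 0 for unquantified values, those zeros need to be converted to NaN first, or they will be counted as valid.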
Related Skills
- quantification - Process imported data for quantification
- peptide-identification - Identify peptides from raw spectra
- expression-matrix/counts-ingest - Similar data loading patterns