OpenClaw-Medical-Skills tooluniverse-proteomics-analysis
Analyze mass spectrometry proteomics data including protein quantification, differential expression, post-translational modifications (PTMs), and protein-protein interactions. Processes MaxQuant, Spectronaut, DIA-NN, and other MS platform outputs. Performs normalization, statistical analysis, pathway enrichment, and integration with transcriptomics. Use when analyzing proteomics data, comparing protein abundance between conditions, identifying PTM changes, studying protein complexes, integrating protein and RNA data, discovering protein biomarkers, or conducting quantitative proteomics experiments.
git clone https://github.com/FreedomIntelligence/OpenClaw-Medical-Skills
T=$(mktemp -d) && git clone --depth=1 https://github.com/FreedomIntelligence/OpenClaw-Medical-Skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/tooluniverse-proteomics-analysis" ~/.claude/skills/freedomintelligence-openclaw-medical-skills-tooluniverse-proteomics-analysis && rm -rf "$T"
T=$(mktemp -d) && git clone --depth=1 https://github.com/FreedomIntelligence/OpenClaw-Medical-Skills "$T" && mkdir -p ~/.openclaw/skills && cp -r "$T/skills/tooluniverse-proteomics-analysis" ~/.openclaw/skills/freedomintelligence-openclaw-medical-skills-tooluniverse-proteomics-analysis && rm -rf "$T"
skills/tooluniverse-proteomics-analysis/SKILL.mdProteomics Analysis
Comprehensive analysis of mass spectrometry-based proteomics data from protein identification through quantification, differential expression, post-translational modifications, and systems-level interpretation.
When to Use This Skill
Triggers:
- User has proteomics data (MS output files)
- Questions about protein abundance or expression
- Differential protein expression analysis requests
- PTM analysis (phosphorylation, acetylation, ubiquitination)
- Protein-RNA correlation analysis
- Multi-omics integration involving proteomics
- Protein complex or interaction analysis
- Proteomics biomarker discovery
Example Questions This Skill Solves:
- "Analyze this MaxQuant output for differential protein expression"
- "Which proteins are significantly upregulated in disease vs control?"
- "Correlate protein abundance with mRNA expression"
- "What post-translational modifications change between conditions?"
- "Identify protein complexes in my co-IP MS data"
- "Which pathways are enriched in differentially expressed proteins?"
- "Find protein biomarkers for disease classification"
- "Compare protein and RNA levels to identify translation-regulated genes"
Core Capabilities
| Capability | Description |
|---|---|
| Data Import | MaxQuant, Spectronaut, DIA-NN, Proteome Discoverer, FragPipe outputs |
| Quality Control | Missing value analysis, intensity distributions, sample clustering |
| Normalization | Median, quantile, TMM, VSN normalization methods |
| Imputation | MinProb, KNN, QRILC for missing values |
| Differential Expression | Limma, DEP, MSstats for statistical testing |
| PTM Analysis | Phospho-site localization, PTM enrichment, kinase prediction |
| Protein-RNA Integration | Correlation analysis, translation efficiency |
| Pathway Enrichment | Over-representation and GSEA for protein sets |
| PPI Analysis | Protein complex detection, interaction networks via STRING/IntAct |
| Reporting | Comprehensive reports with volcano plots, heatmaps, pathway diagrams |
Workflow Overview
Input: MS Proteomics Data | v Phase 1: Data Import & QC |-- Load MaxQuant/Spectronaut/DIA-NN output |-- Parse protein groups, intensities, modifications |-- Quality control plots (missing values, intensity distributions) |-- Sample correlation and PCA | v Phase 2: Preprocessing |-- Filter low-confidence proteins |-- Handle missing values (imputation or filtering) |-- Log-transform intensities |-- Normalize across samples | v Phase 3: Differential Expression Analysis |-- Statistical testing (limma, t-test, ANOVA) |-- Multiple testing correction (BH, Bonferroni) |-- Fold change calculation |-- Significance thresholds (p < 0.05, |log2FC| > 1) | v Phase 4: PTM Analysis (if applicable) |-- Identify modified peptides |-- Localization probability filtering |-- PTM site quantification |-- Kinase-substrate prediction |-- PTM enrichment analysis | v Phase 5: Functional Enrichment |-- Gene Ontology enrichment |-- KEGG/Reactome pathway enrichment |-- Protein complex enrichment (CORUM) |-- Tissue-specific enrichment | v Phase 6: Protein-Protein Interactions |-- Query STRING for interaction networks |-- Identify protein complexes |-- Network clustering and modules |-- Hub protein identification | v Phase 7: Multi-Omics Integration (optional) |-- Correlate with RNA-seq data |-- Identify translation-regulated proteins |-- Compare with variant/CNV data |-- Integrate with metabolomics | v Phase 8: Generate Report |-- Summary statistics |-- Volcano plots and heatmaps |-- Pathway diagrams |-- Protein network visualizations |-- Multi-omics integration plots
Phase Details
Phase 1: Data Import & Quality Control
Objective: Load proteomics data and assess data quality.
Supported input formats:
MaxQuant (most common):
- Protein-level quantificationproteinGroups.txt
- Peptide-level dataevidence.txt
- Phosphorylation sitesPhospho (STY)Sites.txt
- Other PTMsmodificationSpecificPeptides.txt
Spectronaut:
- Protein/peptide quantification*_Report.tsv- DIA-based quantification
DIA-NN:
- Protein groupsreport.tsv
- Protein matrixreport.pr_matrix.tsv
Proteome Discoverer:
*_Proteins.txt*_PSMs.txt
Data loading:
def load_maxquant_proteins(protein_groups_file): """ Load MaxQuant proteinGroups.txt file. Returns: - DataFrame with proteins as rows, samples as columns - Metadata (protein names, gene names, sequence coverage) """ import pandas as pd # Read file df = pd.read_csv(protein_groups_file, sep='\t') # Extract intensity columns (LFQ or raw) intensity_cols = [col for col in df.columns if 'LFQ intensity' in col or 'Intensity ' in col] # Create intensity matrix intensity_matrix = df[intensity_cols].copy() intensity_matrix.columns = [col.replace('LFQ intensity ', '').replace('Intensity ', '') for col in intensity_cols] # Add protein metadata metadata = df[['Protein IDs', 'Gene names', 'Fasta headers', 'Peptides', 'Sequence coverage [%]']].copy() return intensity_matrix, metadata
Quality Control:
- Missing value assessment:
def assess_missing_values(intensity_matrix): """ Calculate percentage of missing values per protein and sample. """ # Per protein missing_per_protein = (intensity_matrix == 0).sum(axis=1) / intensity_matrix.shape[1] # Per sample missing_per_sample = (intensity_matrix == 0).sum(axis=0) / intensity_matrix.shape[0] # Visualize plot_missing_value_heatmap(intensity_matrix) return missing_per_protein, missing_per_sample
- Intensity distribution:
def plot_intensity_distributions(intensity_matrix): """ Plot log10 intensity distributions per sample. Check for consistent distributions across samples. """ import matplotlib.pyplot as plt import numpy as np log_intensities = np.log10(intensity_matrix.replace(0, np.nan)) # Boxplot per sample log_intensities.plot(kind='box') plt.ylabel('log10 Intensity') plt.title('Intensity Distribution per Sample') # Should see similar median and spread across samples
- Sample correlation:
def plot_sample_correlation(intensity_matrix): """ Calculate and visualize sample-sample correlation. Expect: High correlation within replicates, lower between conditions. """ # Log-transform and remove zeros log_data = np.log2(intensity_matrix.replace(0, np.nan)) # Correlation matrix corr_matrix = log_data.corr(method='pearson') # Heatmap import seaborn as sns sns.heatmap(corr_matrix, annot=True, cmap='RdYlBu_r', vmin=0.8, vmax=1.0)
- PCA:
def perform_pca(intensity_matrix, sample_groups): """ Principal component analysis for sample clustering. """ from sklearn.decomposition import PCA # Prepare data (log, impute, scale) log_data = np.log2(intensity_matrix.replace(0, np.nan)) # Simple imputation with minimum value imputed = log_data.fillna(log_data.min().min()) # PCA pca = PCA(n_components=2) pca_result = pca.fit_transform(imputed.T) # Plot with group colors plt.scatter(pca_result[:, 0], pca_result[:, 1], c=sample_groups) plt.xlabel(f'PC1 ({pca.explained_variance_ratio_[0]:.1%})') plt.ylabel(f'PC2 ({pca.explained_variance_ratio_[1]:.1%})')
Phase 2: Preprocessing & Normalization
Objective: Clean data and normalize across samples for fair comparison.
Filtering:
def filter_proteins(intensity_matrix, metadata, min_valid=3): """ Filter out low-confidence proteins. Criteria: - At least 2 unique peptides (from metadata) - At least min_valid samples with detected intensity - Remove contaminants and reverse sequences """ # Filter by peptide count valid_proteins = metadata['Peptides'] >= 2 # Filter by detection in samples n_detected = (intensity_matrix > 0).sum(axis=1) valid_detection = n_detected >= min_valid # Remove contaminants (from MaxQuant) is_contaminant = metadata['Protein IDs'].str.contains('CON__', na=False) is_reverse = metadata['Protein IDs'].str.contains('REV__', na=False) # Combined filter keep = valid_proteins & valid_detection & ~is_contaminant & ~is_reverse return intensity_matrix[keep], metadata[keep]
Missing value imputation:
def impute_missing_values(intensity_matrix, method='MinProb'): """ Impute missing protein intensities. Methods: - MinProb: Random from minimum observed + normal noise (for MNAR) - KNN: K-nearest neighbors imputation - QRILC: Quantile regression-based imputation """ if method == 'MinProb': # Assume missing = low abundance (MNAR assumption) min_val = intensity_matrix[intensity_matrix > 0].min().min() width = 0.3 # Standard deviation of noise shift = 1.8 # Downshift from minimum # Replace zeros with random low values imputed = intensity_matrix.copy() missing_mask = imputed == 0 n_missing = missing_mask.sum().sum() random_vals = np.random.normal( loc=min_val - shift, scale=width, size=n_missing ) imputed.values[missing_mask.values] = random_vals return imputed elif method == 'KNN': from sklearn.impute import KNNImputer imputer = KNNImputer(n_neighbors=5) imputed = pd.DataFrame( imputer.fit_transform(intensity_matrix.replace(0, np.nan)), index=intensity_matrix.index, columns=intensity_matrix.columns ) return imputed
Normalization:
def normalize_intensities(intensity_matrix, method='median'): """ Normalize protein intensities across samples. Methods: - median: Divide by median intensity per sample - quantile: Quantile normalization (same distribution) - TMM: Trimmed mean of M-values (from edgeR) - VSN: Variance-stabilizing normalization """ if method == 'median': # Median normalization medians = intensity_matrix.median(axis=0) global_median = medians.median() norm_factors = global_median / medians normalized = intensity_matrix * norm_factors return normalized elif method == 'quantile': # Quantile normalization from sklearn.preprocessing import quantile_transform normalized = pd.DataFrame( quantile_transform(intensity_matrix, axis=1), index=intensity_matrix.index, columns=intensity_matrix.columns ) return normalized
Phase 3: Differential Expression Analysis
Objective: Identify proteins with significant abundance changes between conditions.
Statistical testing with limma:
def differential_expression_limma(log2_intensities, group1_samples, group2_samples): """ Perform differential expression using limma-like approach. Returns: - log2 fold changes - p-values - adjusted p-values (BH) """ from scipy import stats results = [] for protein in log2_intensities.index: # Extract intensities for each group group1 = log2_intensities.loc[protein, group1_samples] group2 = log2_intensities.loc[protein, group2_samples] # Calculate statistics mean1 = group1.mean() mean2 = group2.mean() log2fc = mean2 - mean1 # t-test t_stat, p_value = stats.ttest_ind(group1, group2, equal_var=False) results.append({ 'protein': protein, 'log2FC': log2fc, 'mean_group1': mean1, 'mean_group2': mean2, 'p_value': p_value, 't_statistic': t_stat }) results_df = pd.DataFrame(results) # Multiple testing correction (Benjamini-Hochberg) from statsmodels.stats.multitest import multipletests results_df['adj_p_value'] = multipletests(results_df['p_value'], method='fdr_bh')[1] # Classify significance results_df['significant'] = ( (results_df['adj_p_value'] < 0.05) & (np.abs(results_df['log2FC']) > 1.0) ) return results_df
Volcano plot:
def plot_volcano(de_results, title='Volcano Plot'): """ Visualize differential expression results. """ import matplotlib.pyplot as plt plt.figure(figsize=(8, 6)) # Non-significant non_sig = de_results[~de_results['significant']] plt.scatter(non_sig['log2FC'], -np.log10(non_sig['p_value']), c='gray', alpha=0.5, s=10) # Significant sig = de_results[de_results['significant']] plt.scatter(sig['log2FC'], -np.log10(sig['p_value']), c='red', alpha=0.7, s=20) # Thresholds plt.axhline(-np.log10(0.05), color='blue', linestyle='--', label='p=0.05') plt.axvline(-1, color='blue', linestyle='--') plt.axvline(1, color='blue', linestyle='--', label='|log2FC|=1') plt.xlabel('log2 Fold Change') plt.ylabel('-log10(p-value)') plt.title(title) plt.legend()
Phase 4: PTM Analysis
Objective: Analyze post-translational modifications (phosphorylation, acetylation, etc.)
Phosphoproteomics workflow:
def analyze_phosphosites(phospho_sites_file, intensity_matrix): """ Analyze phosphorylation site changes. Input: MaxQuant Phospho (STY)Sites.txt Output: Differential phosphorylation per site """ # Load phospho data phospho = pd.read_csv(phospho_sites_file, sep='\t') # Filter by localization probability phospho_confident = phospho[phospho['Localization prob'] > 0.75] # Extract site information phospho_confident['site'] = ( phospho_confident['Gene names'] + '_' + phospho_confident['Amino acid'] + phospho_confident['Position'].astype(str) ) # Quantification (similar to protein-level analysis) # ... perform differential analysis ... return phospho_results
Kinase-substrate prediction:
def predict_kinases(phospho_sites): """ Predict upstream kinases for phosphorylation sites. Uses ToolUniverse PhosphoSitePlus or KEA3 tools. """ from tooluniverse import ToolUniverse tu = ToolUniverse() # For each significant phosphosite kinase_predictions = [] for site in phospho_sites: # Query kinase-substrate databases # (would use actual ToolUniverse tool here) result = tu.run_one_function({ "name": "phosphosite_plus_query", # hypothetical "arguments": {"site": site} }) kinase_predictions.append(result) return kinase_predictions
Phase 5: Functional Enrichment
Objective: Interpret biological meaning of protein changes via pathway analysis.
Gene Ontology enrichment:
def pathway_enrichment_proteins(de_proteins, organism='human'): """ Perform pathway enrichment for differentially expressed proteins. Uses ToolUniverse gene-enrichment skill. """ from tooluniverse import ToolUniverse tu = ToolUniverse() # Extract gene names for significant proteins sig_proteins = de_proteins[de_proteins['significant']] gene_list = sig_proteins['gene_name'].tolist() # Run enrichment via ToolUniverse enrichment = tu.run_one_function({ "name": "enrichr_enrich", "arguments": { "gene_list": ",".join(gene_list), "library": "KEGG_2021_Human" } }) return enrichment
Protein complex enrichment:
def protein_complex_enrichment(protein_list): """ Test for enrichment of known protein complexes (CORUM database). """ # Query CORUM or use ToolUniverse # Identify if proteins are part of known complexes pass
Phase 6: Protein-Protein Interactions
Objective: Identify interaction networks and protein complexes.
STRING network analysis:
def build_protein_network(protein_list, confidence=0.7): """ Build PPI network using STRING database. Uses ToolUniverse STRING tools. """ from tooluniverse import ToolUniverse tu = ToolUniverse() # Get interactions interactions = tu.run_one_function({ "name": "string_get_interactions", "arguments": { "proteins": ",".join(protein_list), "species": 9606, # human "score_threshold": int(confidence * 1000) } }) # Build network graph import networkx as nx G = nx.Graph() for interaction in interactions['data']: G.add_edge( interaction['protein1'], interaction['protein2'], score=interaction['score'] ) return G
Module detection:
def detect_protein_modules(network_graph): """ Identify tightly connected protein modules (complexes). """ from networkx.algorithms import community # Detect communities communities = community.greedy_modularity_communities(network_graph) # Annotate modules with enriched functions modules = [] for i, comm in enumerate(communities): module_proteins = list(comm) # Run enrichment for this module enrichment = pathway_enrichment_proteins(module_proteins) modules.append({ 'module_id': i, 'proteins': module_proteins, 'size': len(module_proteins), 'top_function': enrichment['top_terms'][0] }) return modules
Phase 7: Multi-Omics Integration
Objective: Integrate proteomics with transcriptomics and other omics.
Protein-RNA correlation:
def correlate_protein_rna(protein_data, rna_data, common_samples): """ Correlate protein and mRNA levels for each gene. Expected: r ~ 0.4-0.6 (moderate correlation) Discordance indicates post-transcriptional regulation """ from scipy.stats import spearmanr # Find common genes common_genes = set(protein_data.index) & set(rna_data.index) correlations = {} for gene in common_genes: protein = protein_data.loc[gene, common_samples] rna = rna_data.loc[gene, common_samples] r, p = spearmanr(protein, rna) correlations[gene] = { 'r': r, 'p': p, 'regulation': classify_regulation(r, protein.mean(), rna.mean()) } return correlations def classify_regulation(r, protein_level, rna_level): """ Classify regulatory mechanism based on correlation and levels. """ if r > 0.6 and protein_level > 0 and rna_level > 0: return 'transcriptional_upregulation' elif r > 0.6 and protein_level < 0 and rna_level < 0: return 'transcriptional_downregulation' elif r < 0.2 and protein_level > 0 and rna_level < 0: return 'translational_upregulation' elif r < 0.2 and protein_level < 0 and rna_level > 0: return 'protein_degradation' else: return 'mixed_regulation'
Integration with multi-omics skill:
def integrate_with_multiomics(protein_data, rna_data, methylation_data): """ Pass proteomics data to multi-omics integration skill. Enables comprehensive analysis across all molecular layers. """ # Prepare for multi-omics skill omics_data = { 'proteomics': protein_data, 'rnaseq': rna_data, 'methylation': methylation_data } # Invoke multi-omics integration skill from tooluniverse import ToolUniverse # (Would use Skill tool to invoke tooluniverse-multi-omics-integration) return integrated_analysis
Phase 8: Report Generation
Generate comprehensive proteomics report:
# Proteomics Analysis Report ## Dataset Summary - **Samples**: 20 (10 disease, 10 control) - **Proteins Identified**: 5,432 - **Proteins Quantified**: 4,987 (at least 3 samples) - **Platform**: Orbitrap Fusion Lumos, MaxQuant 2.0 ## Quality Control - **Missing Values**: 15% average per protein - **Sample Correlation**: 0.92-0.98 within groups - **PCA**: Clear separation between disease and control (PC1: 35% variance) ## Differential Expression - **Significant Proteins**: 432 (adj. p < 0.05, |log2FC| > 1) - Upregulated: 245 proteins - Downregulated: 187 proteins - **Top upregulated**: MYC (log2FC=3.2), EGFR (log2FC=2.8) - **Top downregulated**: TP53 (log2FC=-2.5), BRCA1 (log2FC=-2.1) ## Phosphoproteomics - **Phosphosites Quantified**: 8,543 - **Differentially Phosphorylated**: 234 sites (p < 0.05) - **Top Predicted Kinases**: CDK1, MAPK1, AKT1 ## Pathway Enrichment ### Top Pathways (Upregulated) 1. **Cell Cycle** (p=1e-15) - 45 proteins, including cyclins, CDKs 2. **DNA Replication** (p=1e-12) - 23 proteins 3. **Glycolysis** (p=1e-10) - 18 proteins ### Top Pathways (Downregulated) 1. **Apoptosis** (p=1e-14) - 32 proteins, including caspases 2. **DNA Repair** (p=1e-11) - 28 proteins 3. **Oxidative Phosphorylation** (p=1e-9) - 25 proteins ## Protein Network Analysis - **Network**: 432 nodes, 1,245 edges (STRING confidence > 0.7) - **Modules Detected**: 8 functional modules - Module 1: Cell cycle (85 proteins) - Module 2: Metabolism (62 proteins) - Module 3: Translation (48 proteins) ## Protein-RNA Correlation - **Overall Correlation**: r = 0.54 (moderate, expected) - **High Correlation**: 2,134 genes (r > 0.6) - transcriptional regulation - **Low Correlation**: 456 genes (r < 0.2) - post-transcriptional regulation - **Translation-Regulated**: 89 proteins (high protein, low RNA) ## Biological Interpretation Disease state shows increased proliferation (MYC, cyclins) with concurrent suppression of apoptosis and DNA repair (TP53, BRCA1). Metabolic shift toward glycolysis evident at protein level. Post-transcriptional upregulation of translation machinery suggests adaptation to proliferative demands. ## Potential Biomarkers Top 10 proteins for disease classification (Random Forest AUC=0.95): 1. MYC (protein) 2. EGFR (protein) 3. CDK1 (phospho-T161) 4. TP53 (protein) 5. BRCA1 (protein)
Integration with ToolUniverse
Skills Coordinated:
| Skill | Used For | Phase |
|---|---|---|
| Pathway enrichment | Phase 5 |
| PPI networks | Phase 6 |
| RNA-seq for integration | Phase 7 |
| Cross-omics analysis | Phase 7 |
| Protein annotation | Phase 8 |
Example Use Cases
Use Case 1: Cancer Proteomics
Question: "Analyze proteomics data from breast cancer vs normal tissue"
Workflow:
- Load MaxQuant proteinGroups.txt
- QC and filter (keep proteins with 2+ peptides, detected in 3+ samples)
- Impute missing, normalize by median
- Differential expression (limma): 432 significant proteins
- Pathway enrichment: Cell cycle, metabolism upregulated
- STRING network: Identify hub proteins (MYC, EGFR)
- Integrate with TCGA RNA-seq: Find translation-regulated genes
- Report: Comprehensive analysis with biomarkers
Use Case 2: Phosphoproteomics Signaling
Question: "What kinase signaling is activated in response to drug treatment?"
Workflow:
- Load Phospho (STY)Sites.txt from MaxQuant
- Filter by localization probability > 0.75
- Differential phosphorylation analysis
- Kinase prediction for significant sites
- Identify MAPK1, CDK1, AKT1 as top kinases
- Pathway enrichment: MAPK, PI3K/AKT pathways
- Report: Drug activates growth signaling
Use Case 3: Protein-RNA Integration
Question: "Which proteins are regulated post-transcriptionally?"
Workflow:
- Load proteomics (MaxQuant) and RNA-seq (DESeq2) data
- Match samples, extract common genes
- Correlate protein and RNA for each gene
- Identify low-correlation genes (r < 0.2)
- Classify: translation upregulation, protein degradation
- Enrichment: Find pathways enriched in post-transcriptional regulation
- Report: 89 translation-regulated proteins, RNA-binding proteins enriched
Quantified Minimums
| Component | Requirement |
|---|---|
| Proteins quantified | At least 500 proteins |
| Replicates | At least 3 per condition |
| Filtering | 2+ unique peptides per protein |
| Statistical test | limma or t-test with multiple testing correction |
| Pathway enrichment | At least one method (GO, KEGG, or Reactome) |
| Report | Summary, QC, DE results, pathways, visualizations |
Limitations
- Platform-specific: Optimized for MS-based proteomics (not Western blot quantification)
- Missing values: High missing rate (>50% per protein) limits statistical power
- PTM analysis: Requires enrichment protocols for comprehensive PTM profiling
- Absolute quantification: Relative abundance only (unless TMT/SILAC used)
- Protein isoforms: Typically collapsed to gene level
- Dynamic range: MS has limited dynamic range vs mRNA sequencing
References
Methods:
- MaxQuant: https://doi.org/10.1038/nbt.1511
- Limma for proteomics: https://doi.org/10.1093/nar/gkv007
- DEP workflow: https://doi.org/10.1038/nprot.2018.107
Databases:
- STRING: https://string-db.org
- PhosphoSitePlus: https://www.phosphosite.org
- CORUM: https://mips.helmholtz-muenchen.de/corum