Skillshub proteomics-de
Differential expression analysis for label-free quantitative (LFQ) intensity data from standard MaxQuant and DIA-NN output. The workflow includes preprocessing, imputation, and statistical testing.
install
source · Clone the upstream repo
git clone https://github.com/ComeOnOliver/skillshub
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/ComeOnOliver/skillshub "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/ClawBio/ClawBio/proteomics-de" ~/.claude/skills/comeonoliver-skillshub-proteomics-de && rm -rf "$T"
manifest: skills/ClawBio/ClawBio/proteomics-de/SKILL.md
🥚 Proteomics Differential Expression
This skill performs differential expression analysis on label-free quantitative (LFQ) intensity data from MaxQuant and DIA-NN outputs, including preprocessing, imputation, statistical testing, and visualization.
Domain Decisions
1. Multi-format Input Support
- Supports MaxQuant `proteinGroups.txt`
  - Automatic filtering of reverse hits, contaminants, and site-only identifications
- Supports DIA-NN output
  - Automatically extracts protein IDs and `.raw` intensity columns
2. Preprocessing Strategy
- MaxQuant:
  - Filters: `Reverse`, `Potential contaminant`/`Contaminant`, `Only identified by site`
- DIA-NN:
  - Extracts protein identifiers and intensity matrix directly
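The MaxQuant filter step can be sketched as follows. This is a minimal illustration, not the skill's actual code: it assumes the standard MaxQuant convention of marking flagged rows with `+`, and the function name `filter_protein_groups` and the demo table are hypothetical.

```python
import csv
import io

# Flag columns named in the filter list above; MaxQuant marks flagged rows with "+".
FLAG_COLUMNS = ("Reverse", "Potential contaminant", "Contaminant", "Only identified by site")


def filter_protein_groups(rows):
    """Keep rows in which none of the flag columns contains '+'."""
    kept = []
    for row in rows:
        flagged = any((row.get(col) or "").strip() == "+" for col in FLAG_COLUMNS)
        if not flagged:
            kept.append(row)
    return kept


# Tiny in-memory stand-in for a tab-separated proteinGroups.txt (made-up values).
demo = (
    "Protein IDs\tReverse\tPotential contaminant\tOnly identified by site\tLFQ intensity A\n"
    "P1\t\t\t\t1000\n"
    "REV__P2\t+\t\t\t2000\n"
    "CON__P3\t\t+\t\t3000\n"
    "P4\t\t\t+\t4000\n"
    "P5\t\t\t\t5000\n"
)
rows = list(csv.DictReader(io.StringIO(demo), delimiter="\t"))
kept_ids = [r["Protein IDs"] for r in filter_protein_groups(rows)]
print(kept_ids)  # ['P1', 'P5']
```

Checking both `Potential contaminant` and `Contaminant` lets the same function handle files that use either column name; columns absent from a file are simply ignored.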
3. Intensity Transformation
- LFQ intensities are log2-transformed
- Brings the intensity distribution closer to normal for downstream statistical testing
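A minimal sketch of the transformation, assuming that zero intensities (which MaxQuant writes for unquantified proteins) and NaN values are treated as missing rather than transformed. The helper name `log2_intensity` is hypothetical.

```python
import math


def log2_intensity(value):
    """Return log2(value), or None when the intensity is missing (None, NaN, zero, or negative)."""
    if value is None or math.isnan(value) or value <= 0:
        return None
    return math.log2(value)


raw = [1024.0, 0.0, 8.0, float("nan")]
logged = [log2_intensity(v) for v in raw]
print(logged)  # [10.0, None, 3.0, None]
```

Keeping missing values as `None` instead of a placeholder number leaves them available for the imputation step.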
4. Missing Value Imputation
- Uses down-shifted Gaussian imputation
  - Mean shifted by: `median - shift × std`
  - Default: `shift = 1.8`, `scale = 0.3`
- Assumption:
  - Missing values represent low-abundance proteins
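The imputation rule above can be sketched as below. Two points are assumptions rather than facts from this document: that `scale` multiplies the observed standard deviation to give the width of the imputation distribution, and that the statistics are computed over one list of log2 values at a time (for example one sample column). The function name and input values are made up for illustration.

```python
import random
import statistics


def impute_downshifted(values, shift=1.8, scale=0.3, seed=0):
    """Fill None entries with draws from N(median - shift*std, (scale*std)**2)."""
    observed = [v for v in values if v is not None]
    center = statistics.median(observed)
    std = statistics.stdev(observed)
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    mu, sigma = center - shift * std, scale * std
    return [rng.gauss(mu, sigma) if v is None else v for v in values]


log2_values = [24.1, 25.3, None, 26.0, 23.8, None, 25.1]
filled = impute_downshifted(log2_values)
print([round(v, 2) for v in filled])
```

Observed values pass through unchanged, while imputed values land in a narrow band below the bulk of the observed distribution, which matches the low-abundance assumption.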
5. Statistical Testing
- Two-sample t-test between treatment and control groups
- Default degrees of freedom: `df = 4` (for 3 vs 3 replicates)
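The default `df = 4` is what a pooled-variance Student's t-test gives for 3 vs 3 replicates (n1 + n2 - 2). The sketch below assumes that variant; whether the skill uses pooled or Welch variance is not stated here. It uses only the standard library and obtains the p-value by numerically integrating the t density. All function names and the example intensities are hypothetical.

```python
import math
import statistics


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)


def two_sided_p(t, df, steps=2000):
    """Two-sided p-value via Simpson integration of the t density on [0, |t|]."""
    b = abs(t)
    if b == 0:
        return 1.0
    h = b / steps  # steps must be even for Simpson's rule
    total = t_pdf(0, df) + t_pdf(b, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return max(0.0, 1 - 2 * total * h / 3)


def pooled_t_test(treated, control):
    """Student's two-sample t-test with pooled variance; df = n1 + n2 - 2."""
    n1, n2 = len(treated), len(control)
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * statistics.variance(treated)
                  + (n2 - 1) * statistics.variance(control)) / df
    se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    diff = statistics.mean(treated) - statistics.mean(control)
    t = diff / se
    return diff, se, t, df, two_sided_p(t, df)


# Made-up log2 intensities for one protein, 3 vs 3 replicates.
diff, se, t, df, p = pooled_t_test([26.1, 26.4, 25.9], [24.0, 24.3, 23.8])
print(f"log2FC = {diff:.2f}, t = {t:.2f}, df = {df}, p = {p:.5f}")
```

With only four degrees of freedom the t distribution has heavy tails, which is one reason the small-sample caution in the safety rules applies.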
6. s0-based FDR Correction
- Uses s0-based thresholding to stabilize variance
- Combines:
  - log2 fold change
  - p-value
- Based on:
  - Giai Gianetto et al. (2016)
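To illustrate why s0 combines fold change with the test statistic, here is a sketch of the SAM-style modified statistic, in which s0 is added to the standard error in the denominator. Treat this form as an assumption about how the skill applies s0; the function name and the two example proteins are hypothetical.

```python
def s0_statistic(diff, se, s0=0.1):
    """SAM-style statistic: the s0 constant is added to the standard error."""
    return diff / (se + s0)


# Two hypothetical proteins with the same ordinary t = diff / se = 10.
small_change = {"diff": 0.05, "se": 0.005}
large_change = {"diff": 2.0, "se": 0.2}

for name, protein in (("small", small_change), ("large", large_change)):
    t = protein["diff"] / protein["se"]
    d = s0_statistic(protein["diff"], protein["se"])
    print(f"{name}: t = {t:.2f}, d = {d:.2f}")
# small: t = 10.00, d = 0.48
# large: t = 10.00, d = 6.67
```

Both proteins have the same ordinary t-statistic, but the one whose log2 fold change is tiny is strongly penalized once s0 enters the denominator.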
7. Significance Thresholding
- Default: `FDR = 0.05`, `s0 = 0.1`
- Produces:
  - Adjusted significance boundary (used in volcano plot)
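The shape of the significance boundary can be derived under two assumptions that are not confirmed by this document: the modified statistic has the SAM form $d = x / (s + s_0)$, and the threshold is applied as a two-sided cutoff $t_\alpha$ on $d$. Here $x$ is the log2 fold change, $s$ its standard error, and $t = x / s$ the ordinary t-statistic.

$$
|d| \ge t_\alpha
\quad\Longleftrightarrow\quad
\frac{x}{x/t + s_0} \ge t_\alpha
\quad\Longleftrightarrow\quad
t \ge \frac{t_\alpha \, x}{x - t_\alpha s_0},
\qquad x > t_\alpha s_0 .
$$

The case $x < 0$ is symmetric. The volcano-plot curve is then $y(x) = -\log_{10} p\big(t(x)\big)$ evaluated on the boundary $t(x) = t_\alpha |x| / (|x| - t_\alpha s_0)$. It has a vertical asymptote at $|x| = t_\alpha s_0$, so no protein with a smaller absolute log2 fold change can be significant. With `df = 4` and a two-sided cutoff of 0.05, $t_\alpha \approx 2.776$, which places the asymptote near $|x| \approx 0.28$ for `s0 = 0.1`.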
8. Visualization Outputs
- PCA plot
- Volcano plot (with s0 curve)
- Imputation distribution comparison
Safety Rules
- Local-first
  - No data upload without explicit user consent
- Statistical caution
  - Statistical results should be interpreted with caution and not overinterpreted
  - Avoid drawing conclusions beyond what the data supports
- Missing data assumptions
  - Imputation assumes missing values correspond to low abundance
  - May not hold in all experimental designs
- Small sample limitations
  - t-test reliability depends on sufficient replicates
- Reproducibility
  - All parameters and commands are logged
- No hallucinated science
  - All methods are based on established proteomics workflows
Agent Boundary
This skill DOES:
- Perform differential expression analysis on LFQ proteomics data
- Handle MaxQuant and DIA-NN outputs
- Generate statistical results and visualizations
- Produce reproducible reports
This skill DOES NOT:
- Process raw mass spectrometry data (e.g. RAW files)
- Perform peptide identification or database search
- Conduct pathway or functional enrichment analysis
- Provide biological interpretation of results
Input Contract
Supported Input Formats
- MaxQuant `proteinGroups.txt`
- DIA-NN output (`.tsv`/`.txt`)
Metadata Requirements
- `.csv` or `.tsv`
- Must include: `sample_id`, `group`
Supports:
- raw names
- full paths (e.g. `/path/sample.raw`)
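A metadata file and one way to reconcile raw names with full paths might look like the sketch below. The normalization rule (strip directories and a trailing `.raw`) is an assumption, not the skill's documented behavior, and `sample_key`, the sample names, and the paths are hypothetical.

```python
import csv
import io
from pathlib import PureWindowsPath


def sample_key(name):
    """Reduce 'D:\\runs\\s1.raw', '/path/s1.raw', or 's1' to the bare name 's1'."""
    # PureWindowsPath splits on both '\\' and '/', so it handles either path style.
    base = PureWindowsPath(name.strip()).name
    return base[:-4] if base.lower().endswith(".raw") else base


# Example metadata with the two required columns (made-up samples).
metadata_csv = (
    "sample_id,group\n"
    "/path/s1.raw,treated\n"
    "s2.raw,treated\n"
    "s3,control\n"
)
groups = {sample_key(r["sample_id"]): r["group"]
          for r in csv.DictReader(io.StringIO(metadata_csv))}

# Intensity column names as they might appear in a quantification table.
columns = ["/path/s1.raw", "D:\\runs\\s2.raw", "/other/s3.raw"]
assigned = {c: groups.get(sample_key(c)) for c in columns}
print(assigned)
```

Normalizing both the metadata and the column names to the same key means a metadata row written as a bare name still matches a column that carries a full path.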
Output Structure
    proteomics_de_report/
    ├── report.md
    ├── figures/
    │   ├── imputation_distribution.png
    │   ├── pca.png
    │   └── volcano.png
    ├── tables/
    │   ├── imputed_proteinGroups.csv
    │   └── de_results.csv
    └── reproducibility/
        ├── commands.sh
        ├── environment.yml
        └── checksums.sha256
Usage
Demo
    python proteomics_de.py \
      --demo \
      --output report_dir
MaxQuant Input
    python proteomics_de.py \
      --input proteinGroups.txt \
      --input-type maxquant \
      --metadata metadata.csv \
      --contrast "treated,control" \
      --output report_dir
DIA-NN Input
    python proteomics_de.py \
      --input diann_output.tsv \
      --input-type diann \
      --metadata metadata.csv \
      --contrast "treated,control" \
      --output report_dir
Parameters
| Parameter | Description | Default |
|---|---|---|
| `--input` | Input file path | - |
| `--input-type` | `maxquant` or `diann` | maxquant |
| `--metadata` | Metadata file | - |
| `--contrast` | treatment,control | treated,control |
| | s0 parameter | 0.1 |
| | FDR threshold | 0.05 |
| | Degrees of freedom | 4 |
| | Imputation shift | 1.8 |
| | Imputation scale | 0.3 |
| `--output` | Output directory | - |
References
- test_proteinGroups.txt is from: Keilhauer EC, Hein MY, Mann M. Accurate protein complex retrieval by affinity enrichment mass spectrometry (AE-MS) rather than affinity purification mass spectrometry (AP-MS). Mol Cell Proteomics. 2015 Jan;14(1):120-35. doi: 10.1074/mcp.M114.041012. Epub 2014 Nov 2. PMID: 25363814; PMCID: PMC4288248.
- s0 correction algorithm is from: Giai Gianetto Q, Couté Y, Bruley C, Burger T. Uses and misuses of the fudge factor in quantitative discovery proteomics. Proteomics. 2016 Jul;16(14):1955-60. doi: 10.1002/pmic.201600132. PMID: 27272648.
- s0 correction algorithm is cited by: Michaelis AC, Brunner AD, Zwiebel M, Meier F, Strauss MT, Bludau I, Mann M. The social and structural architecture of the yeast protein interactome. Nature. 2023 Dec;624(7990):192-200. doi: 10.1038/s41586-023-06739-5. Epub 2023 Nov 15. PMID: 37968396; PMCID: PMC10700138.