MetabolicExprImputation Process Configuration
Purpose
Imputes missing/dropout values in scRNA-seq expression data to improve metabolic pathway analysis. This process handles sparsity common in single-cell RNA sequencing data by filling in zero values using advanced imputation methods (ALRA, scImpute, or MAGIC). The imputed data provides more accurate metabolic pathway activity calculations and feature selection in downstream analysis.
Note: This process is part of the ScrnaMetabolicLandscape process group. When using the full group, MetabolicExprImputation is automatically enabled (or skipped via the noimpute flag). Use this skill when configuring the imputation step individually or when customizing imputation parameters within the ScrnaMetabolicLandscape group.
When to Use
- Second step in metabolic analysis workflow: After MetabolicInput, before MetabolicPathwayActivity and MetabolicFeatures
- High sparsity data: When your scRNA-seq data has many zero/dropout values (>50% zeros)
- Metabolic pathway sensitivity: When pathway activity calculations require complete expression matrices
- Before feature selection: When downstream MetabolicFeatures analysis needs imputed gene expression
- When imputation improves results: For datasets where dropout artifacts obscure biological signals
When to skip imputation:
- Data already imputed or complete (low dropout rate)
- Extremely large datasets where imputation is computationally expensive
- When you prefer to analyze raw unimputed expression
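The ">50% zeros" guideline above can be checked before deciding whether to enable imputation. The following is a minimal Python sketch, assuming the expression matrix has been exported as a genes x cells list of rows; `zero_fraction` is an illustrative helper name and is not part of biopipen.

```python
def zero_fraction(matrix):
    """Return the fraction of entries equal to zero in a genes x cells matrix."""
    total = sum(len(row) for row in matrix)
    if total == 0:
        raise ValueError("matrix is empty")
    zeros = sum(1 for row in matrix for value in row if value == 0)
    return zeros / total


# Toy matrix: 3 genes x 4 cells (hypothetical values)
expr = [
    [0, 0, 3, 0],
    [1, 0, 0, 0],
    [0, 2, 0, 5],
]
sparsity = zero_fraction(expr)
print(f"zero fraction: {sparsity:.2f}")
print("consider imputation" if sparsity > 0.5 else "imputation may not be needed")
```

The 0.5 cutoff mirrors the guideline in the list above; adjust it to your own data.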
Configuration Structure
Process Enablement
As part of ScrnaMetabolicLandscape group (recommended):
    [ScrnaMetabolicLandscape]
    # Automatically includes MetabolicExprImputation
    [ScrnaMetabolicLandscape.envs]
    noimpute = false  # Set to true to skip imputation
Individual process configuration (advanced):
    [ScrnaMetabolicLandscape.MetabolicExprImputation]
    cache = true
Input Specification
MetabolicExprImputation automatically receives input from upstream processes:
- Input: Seurat object (RDS/qs format) from MetabolicInput
- Output: Imputed Seurat object with .imputed.qs suffix
Important: Input is automatically wired. No manual [ScrnaMetabolicLandscape.MetabolicExprImputation.in] specification needed.
Environment Variables
Core Settings
    [ScrnaMetabolicLandscape.MetabolicExprImputation.envs]
    # Imputation method selection (required)
    tool = "alra"  # Choice: "alra", "scimpute", "rmagic"
    # Tool-specific configurations
    alra_args = {}  # Type: json - Arguments for RunALRA()
    scimpute_args = {}  # Type: ns - Arguments for scImpute()
    rmagic_args = {}  # Type: ns - Arguments for magic()
ALRA Configuration (Default - Fastest)
    [ScrnaMetabolicLandscape.MetabolicExprImputation.envs]
    tool = "alra"
    alra_args = {}  # Default: empty dict (use RunALRA() defaults)
    # No additional parameters required
ALRA advantages:
- Fastest method (minutes vs hours)
- Low-rank approximation preserves global structure
- Zero-preserving imputation (no negative values)
- Built into Seurat (no external dependencies)
scImpute Configuration (Most Accurate)
    [ScrnaMetabolicLandscape.MetabolicExprImputation.envs]
    tool = "scimpute"
    [ScrnaMetabolicLandscape.MetabolicExprImputation.envs.scimpute_args]
    # Dropout threshold (genes with dropout rate > threshold will be imputed)
    drop_thre = 0.5  # Type: float, Default: 0.5 (50% dropout)
    # Number of clusters for imputation
    kcluster = null  # Type: int or null, Default: null (auto-detect)
    # Parallel cores
    ncores = 1  # Type: int, Default: 1
    # Reference gene file (optional)
    refgene = ""  # Type: path, Default: "" (use all genes)
scImpute advantages:
- Cell-specific imputation (more accurate)
- Handles cell-type heterogeneity well
- Flexible clustering control
- Reference gene filtering option
scImpute considerations:
- Slower than ALRA (hours for large datasets)
- Higher memory usage
- Requires installation of the scImpute R package
MAGIC Configuration (Diffusion-Based)
    [ScrnaMetabolicLandscape.MetabolicExprImputation.envs]
    tool = "rmagic"
    [ScrnaMetabolicLandscape.MetabolicExprImputation.envs.rmagic_args]
    # Python interpreter path for magic-impute
    python = "python"  # Type: path, Default: "python"
    # Dropout threshold for gene selection
    threshold = 0.5  # Type: float, Default: 0.5
    # Only genes with dropout rates > threshold are imputed
    # Dropout rate = (cells with non-zero expression) / (total cells)
MAGIC advantages:
- Diffusion-based approach preserves manifold structure
- Good for trajectory/continuum data
- Smooth expression patterns
- Handles complex data geometries
MAGIC considerations:
- Requires the Python package magic-impute
- Intermediate speed (between ALRA and scImpute)
- Python path must be accessible
- May over-smooth sharp boundaries
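Both scImpute (drop_thre) and MAGIC (threshold) restrict imputation to genes whose dropout rate exceeds a cutoff. The sketch below illustrates that selection step in Python. It assumes the dropout rate of a gene is the fraction of cells with zero expression; the rmagic_args comment in this document words the ratio in terms of non-zero cells, so confirm the definition against the tool before relying on it. `genes_to_impute` is a hypothetical helper, not a biopipen function.

```python
def genes_to_impute(expr_by_gene, threshold=0.5):
    """Return genes whose zero fraction across cells is strictly above threshold."""
    if not 0 <= threshold <= 1:
        raise ValueError("threshold must be between 0 and 1")
    selected = []
    for gene, values in expr_by_gene.items():
        # Assumption: dropout rate = cells with zero expression / total cells
        rate = sum(1 for v in values if v == 0) / len(values)
        if rate > threshold:
            selected.append(gene)
    return selected


# Hypothetical expression values for three genes across four cells
expr = {
    "HK2": [0, 0, 0, 4],   # 75% zeros
    "LDHA": [5, 3, 0, 2],  # 25% zeros
    "PKM": [0, 0, 1, 1],   # 50% zeros, not strictly above 0.5
}
print(genes_to_impute(expr, threshold=0.5))
```

Raising the threshold shrinks the selected set, which is why higher values are described as more conservative elsewhere in this document.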
Imputation Methods Comparison
| Method | Speed | Accuracy | Use Case | Dependencies |
|---|---|---|---|---|
| ALRA | ⚡ Fast | ⭐⭐⭐ Good | Large datasets, quick analysis | Seurat (built-in) |
| scImpute | 🐢 Slow | ⭐⭐⭐⭐ Best | Heterogeneous cell types, accuracy critical | R package |
| MAGIC | 🚶 Medium | ⭐⭐⭐⭐ Good | Trajectory data, manifold preservation | Python |
Method Selection Guide
Choose ALRA when:
- Dataset has >10,000 cells
- Computational speed is priority
- Data has clear cluster structure
- Memory resources are limited
Choose scImpute when:
- Dataset has <10,000 cells
- Accuracy is critical
- Cell-type heterogeneity is high
- You have sufficient compute resources
Choose MAGIC when:
- Data represents a continuum (e.g., differentiation)
- Manifold structure is important
- Trajectory analysis planned
- Python environment available
Configuration Examples
Minimal Configuration (Default ALRA)
    [ScrnaMetabolicLandscape]
    [ScrnaMetabolicLandscape.envs]
    gmtfile = "KEGG_2021_Human"
    group_by = "seurat_clusters"
    # ALRA imputation enabled by default (noimpute = false)
Skip Imputation (Use Raw Data)
    [ScrnaMetabolicLandscape]
    [ScrnaMetabolicLandscape.envs]
    gmtfile = "KEGG_2021_Human"
    group_by = "seurat_clusters"
    noimpute = true  # Skip MetabolicExprImputation entirely
scImpute with Custom Parameters
    [ScrnaMetabolicLandscape]
    [ScrnaMetabolicLandscape.envs]
    gmtfile = "Reactome_Pathways_2024"
    group_by = "cluster"
    [ScrnaMetabolicLandscape.MetabolicExprImputation.envs]
    tool = "scimpute"
    [ScrnaMetabolicLandscape.MetabolicExprImputation.envs.scimpute_args]
    drop_thre = 0.6  # Impute genes with >60% dropout
    kcluster = 10  # Use 10 clusters
    ncores = 4  # Parallelize with 4 cores
MAGIC for Trajectory Data
    [ScrnaMetabolicLandscape]
    [ScrnaMetabolicLandscape.envs]
    gmtfile = "KEGG_2021_Human"
    group_by = "pseudotime"
    [ScrnaMetabolicLandscape.MetabolicExprImputation.envs]
    tool = "rmagic"
    [ScrnaMetabolicLandscape.MetabolicExprImputation.envs.rmagic_args]
    python = "/opt/conda/envs/r-base/bin/python"
    threshold = 0.4  # Impute genes with >40% dropout
High-Performance Imputation (Large Dataset)
    [ScrnaMetabolicLandscape]
    [ScrnaMetabolicLandscape.envs]
    gmtfile = "KEGG_2021_Human"
    group_by = "seurat_clusters"
    ncores = 8
    [ScrnaMetabolicLandscape.MetabolicExprImputation.envs]
    tool = "alra"  # Fastest for large datasets
Conservative Imputation (Minimal Changes)
    [ScrnaMetabolicLandscape]
    [ScrnaMetabolicLandscape.envs]
    gmtfile = "KEGG_2021_Human"
    [ScrnaMetabolicLandscape.MetabolicExprImputation.envs]
    tool = "scimpute"
    [ScrnaMetabolicLandscape.MetabolicExprImputation.envs.scimpute_args]
    drop_thre = 0.8  # Only impute genes with >80% dropout (very sparse)
    kcluster = 5  # Conservative clustering
    refgene = "high_variance_genes.txt"  # Only impute specific genes
Common Patterns
Pattern 1: Standard Workflow (ALRA)
    # Default fast imputation
    [ScrnaMetabolicLandscape]
    [ScrnaMetabolicLandscape.envs]
    gmtfile = "KEGG_2021_Human"
    group_by = "seurat_clusters"
Pattern 2: Sparse Data Handling (scImpute)
    # For highly sparse data with cell-type heterogeneity
    [ScrnaMetabolicLandscape]
    [ScrnaMetabolicLandscape.envs]
    gmtfile = "Reactome_Pathways_2024"
    group_by = "cluster"
    [ScrnaMetabolicLandscape.MetabolicExprImputation.envs]
    tool = "scimpute"
    [ScrnaMetabolicLandscape.MetabolicExprImputation.envs.scimpute_args]
    drop_thre = 0.7  # High dropout threshold
    ncores = 8  # More cores for speed
Pattern 3: Trajectory Analysis (MAGIC)
    # For differentiation or developmental trajectories
    [ScrnaMetabolicLandscape]
    [ScrnaMetabolicLandscape.envs]
    gmtfile = "KEGG_2021_Human"
    group_by = "pseudotime"
    [ScrnaMetabolicLandscape.MetabolicExprImputation.envs]
    tool = "rmagic"
    [ScrnaMetabolicLandscape.MetabolicExprImputation.envs.rmagic_args]
    threshold = 0.3  # Lower threshold for smoother gradients
Pattern 4: Large Dataset Optimization
    # For >50k cells - prioritize speed
    [ScrnaMetabolicLandscape]
    [ScrnaMetabolicLandscape.envs]
    gmtfile = "KEGG_2021_Human"
    group_by = "seurat_clusters"
    ncores = 16  # Use all available cores
    [ScrnaMetabolicLandscape.MetabolicExprImputation.envs]
    tool = "alra"  # Only ALRA can handle this scale efficiently
Pattern 5: Benchmark Multiple Methods
    # Compare imputation methods via cases
    [ScrnaMetabolicLandscape]
    [ScrnaMetabolicLandscape.envs]
    gmtfile = "KEGG_2021_Human"
    [ScrnaMetabolicLandscape.MetabolicExprImputation.envs.cases.ALRA]
    tool = "alra"
    [ScrnaMetabolicLandscape.MetabolicExprImputation.envs.cases.scImpute]
    tool = "scimpute"
    [ScrnaMetabolicLandscape.MetabolicExprImputation.envs.cases.scImpute.scimpute_args]
    drop_thre = 0.5
    ncores = 4
Dependencies
Upstream Processes
- Required: MetabolicInput (Seurat object from CombinedInput)
- CombinedInput sources:
  - SeuratClustering (most common)
  - TESSA (TCR-specific)
  - SeuratMap2Ref (reference mapping)
  - CellTypeAnnotation (cell type labels)
Downstream Processes (In ScrnaMetabolicLandscape Group)
- MetabolicPathwayActivity: Uses imputed expression for pathway scoring
- MetabolicPathwayHeterogeneity: Uses imputed expression for heterogeneity analysis
- MetabolicFeatures: Uses imputed expression for enrichment analysis
Package Dependencies
| Method | R Package | Python Package | Installation |
|---|---|---|---|
| ALRA | (built-in to Seurat) | None | Included with Seurat |
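| scImpute | scImpute | None | install.packages("scImpute") |
| MAGIC | Rmagic | magic-impute | install.packages("Rmagic"); pip install magic-impute |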
Validation Rules
Expression Matrix Requirements
- Gene count: Minimum 1000 genes required for reliable imputation
- Cell count: scImpute requires >500 cells, ALRA/MAGIC work with any size
- Normalization: Seurat object must have normalized counts (typically LogNormalize or SCTransform)
- Assay availability: Default RNA assay must exist (imputed data stored here)
Method-Specific Validation
ALRA:
- No specific validation requirements
- Works with any Seurat object
scImpute:
- drop_thre: Must be between 0 and 1
- kcluster: If specified, must be > 0 and less than the number of cells
- refgene: If provided, file must exist and contain valid gene names
MAGIC:
- python: Must point to a valid Python interpreter with magic-impute installed
- threshold: Must be between 0 and 1
- Python environment must be accessible from the R execution context
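The method-specific rules above can be applied as a pre-flight check before launching the pipeline. This is a minimal Python sketch, assuming the envs section has already been loaded into a plain dict; `validate_envs` is a hypothetical helper, and whether the pipeline enforces the same checks itself is not stated in this document.

```python
def validate_envs(envs, n_cells):
    """Return a list of problems found in a MetabolicExprImputation envs dict."""
    problems = []
    tool = envs.get("tool", "alra")  # ALRA is described as the default
    if tool not in ("alra", "scimpute", "rmagic"):
        problems.append(f"unknown tool: {tool!r}")
    if tool == "scimpute":
        args = envs.get("scimpute_args", {})
        if not 0 <= args.get("drop_thre", 0.5) <= 1:
            problems.append("drop_thre must be between 0 and 1")
        kcluster = args.get("kcluster")  # None means auto-detect
        if kcluster is not None and not 0 < kcluster < n_cells:
            problems.append("kcluster must be > 0 and less than the number of cells")
    if tool == "rmagic":
        if not 0 <= envs.get("rmagic_args", {}).get("threshold", 0.5) <= 1:
            problems.append("threshold must be between 0 and 1")
    return problems


# Hypothetical configuration with an out-of-range drop_thre
config = {"tool": "scimpute", "scimpute_args": {"drop_thre": 1.2, "kcluster": 10}}
for problem in validate_envs(config, n_cells=5000):
    print(problem)
```

File-based checks (refgene exists, the Python interpreter is reachable) are left out here because they depend on the execution environment.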
Output Validation
- Imputed assay: New assay RNA created with imputed data
- Original assay: Original assay renamed to RAW (with ALRA and MAGIC)
- No negative values: All imputed values should be non-negative
- Preserved dimensions: Gene and cell counts unchanged
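Two of these checks, preserved dimensions and non-negative values, are easy to express in code. The sketch below assumes both matrices have been exported as genes x cells lists of rows; `check_imputed` is an illustrative name and not part of the pipeline.

```python
def check_imputed(original, imputed):
    """Check that dimensions are preserved and no imputed value is negative."""
    same_shape = len(original) == len(imputed) and all(
        len(a) == len(b) for a, b in zip(original, imputed)
    )
    non_negative = all(v >= 0 for row in imputed for v in row)
    return {"same_shape": same_shape, "non_negative": non_negative}


# Hypothetical 2 genes x 3 cells example
raw = [[0, 2, 0], [1, 0, 0]]
imp = [[0.4, 2.0, 0.3], [1.0, 0.2, 0.0]]
print(check_imputed(raw, imp))
```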
Troubleshooting
Common Imputation Issues
Issue: Process runs indefinitely or very slowly
Symptoms: MetabolicExprImputation runs for >4 hours, especially with scImpute
Solutions:
- Switch to ALRA (fastest method):
    [ScrnaMetabolicLandscape.MetabolicExprImputation.envs]
    tool = "alra"
- Reduce scImpute parameters:
    [ScrnaMetabolicLandscape.MetabolicExprImputation.envs.scimpute_args]
    kcluster = 3  # Fewer clusters
    ncores = 8  # More parallelization
    drop_thre = 0.8  # Impute fewer genes
- Skip imputation for large datasets:
    [ScrnaMetabolicLandscape.envs]
    noimpute = true
Issue: Memory errors during imputation
Symptoms: "Error: cannot allocate vector of size...", R session crashes
Solutions:
- Use ALRA (lowest memory footprint)
- Reduce ncores (less parallel memory):
    [ScrnaMetabolicLandscape.MetabolicExprImputation.envs.scimpute_args]
    ncores = 1
- Filter genes before imputation:
    [ScrnaMetabolicLandscape.MetabolicExprImputation.envs.scimpute_args]
    refgene = "highly_variable_genes.txt"
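A rough memory estimate helps judge whether a run is likely to hit allocation errors. The sketch below assumes the imputed matrix is held as a dense matrix of 8-byte doubles, which is how R stores numeric matrices; actual peak usage will be higher because of intermediate copies, so treat the result as a lower bound. `dense_matrix_gib` is an illustrative helper.

```python
def dense_matrix_gib(n_genes, n_cells, bytes_per_value=8):
    """Memory of a dense numeric matrix in GiB (8 bytes per double by default)."""
    return n_genes * n_cells * bytes_per_value / 2**30


# 20,000 genes x 50,000 cells as doubles
print(f"{dense_matrix_gib(20_000, 50_000):.2f} GiB")  # prints "7.45 GiB"
```

Restricting imputation to a gene subset via refgene reduces the first factor directly.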
Issue: Negative values in imputed data
Symptoms: Downstream analysis fails or produces negative pathway scores
Causes:
- scImpute may produce negative values (expected behavior)
- MAGIC over-smoothing can create artifacts
Solutions:
- Use ALRA (zero-preserving)
- Set negative values to zero in downstream processes
- Adjust threshold to be more conservative:
    [ScrnaMetabolicLandscape.MetabolicExprImputation.envs.rmagic_args]
    threshold = 0.6  # Only impute very sparse genes
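Setting negative values to zero, the second solution above, amounts to clamping each entry. A minimal Python sketch, assuming the imputed matrix is available as a list of rows; `clamp_negatives` is a hypothetical helper and not a pipeline option.

```python
def clamp_negatives(matrix):
    """Return a copy of the matrix with negative entries replaced by 0.0."""
    return [[max(value, 0.0) for value in row] for row in matrix]


# Hypothetical imputed values containing one negative entry
imputed = [[1.5, -0.2], [0.0, 3.0]]
print(clamp_negatives(imputed))
```

Clamping hides the symptom only; if many values are negative, revisiting the method or threshold is likely the better fix.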
Issue: "Python not found" error with MAGIC
Symptoms: Error message about python or magic-impute not being found
Solutions:
- Specify full Python path:
    [ScrnaMetabolicLandscape.MetabolicExprImputation.envs.rmagic_args]
    python = "/opt/anaconda3/bin/python"
- Install magic-impute in Python environment:
pip install magic-impute
- Verify Python works from R:
system("python --version")
- Switch to ALRA or scImpute if Python cannot be configured
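To confirm that a given interpreter can actually see magic-impute, a short script can be run with the same interpreter that is set in rmagic_args.python. This sketch relies on the fact that magic-impute installs an importable module named `magic`; the helper name `module_available` is illustrative.

```python
import importlib.util
import sys


def module_available(name):
    """Return True if the current interpreter can locate the named module."""
    return importlib.util.find_spec(name) is not None


if __name__ == "__main__":
    # magic-impute is imported as "magic"
    status = "found" if module_available("magic") else "missing"
    print(f"{sys.executable}: magic-impute {status}")
```

Save it to a file and run it with the exact interpreter path from the configuration, so the check reflects the environment the pipeline will use.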
Issue: Imputation doesn't improve results
Symptoms: Pathway activity scores similar before/after imputation
Solutions:
- Check dropout rate - if low (<20%), imputation may not help
- Try different method - ALRA vs scImpute vs MAGIC
- Adjust threshold:
    [ScrnaMetabolicLandscape.MetabolicExprImputation.envs.scimpute_args]
    drop_thre = 0.3  # Impute less sparse genes
- Verify data quality - ensure QC was performed properly
- Skip imputation if data quality is already good
Issue: Gene name mismatch after imputation
Symptoms: "Genes not found" errors in downstream processes
Solutions:
- Check gene name format (case-sensitive):
  - Human: UPPERCASE (e.g., CD3D)
  - Mouse: TitleCase (e.g., Cd3d)
- Verify GMT file matches Seurat object gene names
- Use refgene to filter to valid genes only:
    [ScrnaMetabolicLandscape.MetabolicExprImputation.envs.scimpute_args]
    refgene = "valid_genes.txt"
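A quick way to tell a genuine mismatch from a case-only mismatch is to compare pathway gene symbols against the object's gene names both exactly and case-insensitively. The following Python sketch assumes both lists have been exported as plain strings; `gene_overlap` is a hypothetical helper.

```python
def gene_overlap(pathway_genes, object_genes):
    """Summarize exact and case-only matches between two gene symbol lists."""
    object_set = set(object_genes)
    object_upper = {gene.upper() for gene in object_genes}
    exact = [g for g in pathway_genes if g in object_set]
    case_only = [
        g for g in pathway_genes
        if g not in object_set and g.upper() in object_upper
    ]
    return {"exact": exact, "case_only": case_only}


# Human-style pathway symbols against mouse-style object symbols
result = gene_overlap(["HK2", "LDHA", "PKM"], ["Hk2", "Ldha", "Actb"])
print(result)
```

Many entries under "case_only" and few under "exact" suggest a species or naming-convention mismatch between the GMT file and the Seurat object.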
Issue: Imputed assay not accessible
Symptoms: Default assay still shows RAW data, downstream processes use wrong data
Causes:
- ALRA/MAGIC should automatically set imputed RNA as default
- Manual intervention or incorrect ordering
Solutions:
- Verify default assay:
DefaultAssay(srtobj) # Should return "RNA", not "RAW"
- Manually set default assay (if needed):
DefaultAssay(srtobj) <- "RNA"
- Check output file - ensure .imputed.qs was created
Issue: scImpute clustering fails
Symptoms: Error about kcluster or clustering algorithm
Solutions:
- Set kcluster to null (auto-detect):
    [ScrnaMetabolicLandscape.MetabolicExprImputation.envs.scimpute_args]
    kcluster = null
- Reduce kcluster if manual:
    [ScrnaMetabolicLandscape.MetabolicExprImputation.envs.scimpute_args]
    kcluster = 5
- Switch to ALRA if clustering issues persist
Issue: Imputation creates artifacts
Symptoms: Unusual expression patterns, biologically implausible values
Causes:
- Over-imputation (threshold too low)
- Wrong imputation method for data type
- Poor quality input data
Solutions:
- Increase threshold (impute fewer genes):
    [ScrnaMetabolicLandscape.MetabolicExprImputation.envs.scimpute_args]
    drop_thre = 0.8  # Only impute extremely sparse genes
- Try different method:
  - Use ALRA for cluster-based data
  - Use MAGIC for trajectory data
  - Use scImpute for heterogeneous data
- Improve QC upstream before imputation
- Skip imputation if artifacts are severe
External References
Original Papers
ALRA:
- Linderman, G. C., Zhao, J., Roulis, M., Bielecki, P., Flavell, R. A., Nadler, B., & Kluger, Y. (2022). Zero-preserving imputation of single-cell RNA-seq data. Nature Communications, 13(1), 192. https://www.nature.com/articles/s41467-021-27729-z
scImpute:
- Li, W. V., & Li, J. J. (2018). An accurate and robust imputation method scImpute for single-cell RNA-seq data. Nature Communications, 9(1), 997. https://www.nature.com/articles/s41467-018-03405-7
MAGIC:
- Van Dijk, D., et al. (2018). Recovering Gene Interactions from Single-Cell Data Using MAGIC. Cell, 174(3), 716-729. https://www.cell.com/cell/fulltext/S0092-8674(18)30724-4
Metabolic Analysis Framework
- Original paper: Xiao, Z. et al. "Metabolic landscape of tumor microenvironment at single cell resolution." Nature Communications 10, 1-12 (2019) https://www.nature.com/articles/s41467-019-11738-0
- GitHub repository: https://github.com/LocasaleLab/Single-Cell-Metabolic-Landscape
- biopipen documentation: https://pwwang.github.io/biopipen/pipelines/scrna_metabolic/
Tool Documentation
- Seurat ALRA: https://satijalab.org/seurat/reference/runalra
- scImpute: https://github.com/vvnathan/scImpute
- MAGIC: https://github.com/KrishnaswamyLab/MAGIC
- magic-impute: https://pypi.org/project/magic-impute/
Package Installation
    # ALRA (Seurat built-in)
    install.packages("Seurat")
    # scImpute
    install.packages("scImpute")
    # Rmagic (for MAGIC)
    install.packages("Rmagic")

    # magic-impute (Python)
    pip install magic-impute
Decision Tree for Imputation Method Selection
    Start: MetabolicExprImputation
    │
    ├─ Dataset size > 50k cells?
    │   └─ YES → Use ALRA (tool = "alra")
    │
    ├─ Dataset size 10k-50k cells?
    │   ├─ Priority: Speed?
    │   │   └─ YES → Use ALRA
    │   └─ Priority: Accuracy?
    │       └─ YES → Use scImpute (tool = "scimpute")
    │
    ├─ Dataset size < 10k cells?
    │   ├─ Cell-type heterogeneity high?
    │   │   └─ YES → Use scImpute
    │   └─ Continuum/trajectory data?
    │       └─ YES → Use MAGIC (tool = "rmagic")
    │
    └─ Data quality concerns?
        ├─ Low dropout rate (<20%)?
        │   └─ Skip imputation (noimpute = true)
        └─ High dropout rate (>80%)?
            └─ Use scImpute with high threshold (drop_thre = 0.8)
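The size-based branches of the tree can be written as a small function. This is a sketch only: it covers the dataset-size branches, leaves out the data-quality branch, and falls back to "alra" when no branch under 10k cells applies, which is an assumption since the tree does not specify that case. `choose_tool` and its parameters are hypothetical names.

```python
def choose_tool(n_cells, priority="speed", heterogeneous=False, trajectory=False):
    """Suggest a value for `tool` following the size-based branches of the tree."""
    if n_cells > 50_000:
        return "alra"
    if n_cells >= 10_000:
        return "scimpute" if priority == "accuracy" else "alra"
    if heterogeneous:
        return "scimpute"
    if trajectory:
        return "rmagic"
    return "alra"  # assumption: default tool when no other branch applies


print(choose_tool(100_000))
print(choose_tool(20_000, priority="accuracy"))
print(choose_tool(5_000, trajectory=True))
```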
Performance Benchmarks
| Dataset Size | ALRA | scImpute (4 cores) | MAGIC |
|---|---|---|---|
| 1,000 cells | 2 min | 5 min | 3 min |
| 5,000 cells | 5 min | 15 min | 8 min |
| 10,000 cells | 10 min | 45 min | 15 min |
| 25,000 cells | 25 min | 2-3 hours | 45 min |
| 50,000 cells | 1 hour | 8-10 hours | 2 hours |
| 100,000 cells | 2 hours | Not recommended | 4 hours |
Note: Benchmarks on 16-core machine, 64GB RAM. scImpute scales quadratically with cell count.
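The scaling note can be used for rough planning. The sketch below extrapolates a runtime under an assumed power law t ~ n^k and also computes the exponent implied by two rows of the table. The table figures are approximate, so the implied exponent varies between rows and the quadratic estimate should be read as a rough guide. Both function names are illustrative.

```python
import math


def scaling_exponent(n1, t1, n2, t2):
    """Exponent k in t ~ n**k implied by two (cells, minutes) measurements."""
    return math.log(t2 / t1) / math.log(n2 / n1)


def extrapolate_minutes(n_ref, t_ref, n_new, exponent=2.0):
    """Runtime estimate assuming t scales as n**exponent."""
    return t_ref * (n_new / n_ref) ** exponent


# scImpute rows from the table: 5,000 cells -> 15 min, 10,000 cells -> 45 min
k = scaling_exponent(5_000, 15, 10_000, 45)
print(f"implied exponent: {k:.2f}")
estimate = extrapolate_minutes(10_000, 45, 20_000)
print(f"quadratic estimate for 20,000 cells: {estimate:.0f} min")
```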