git clone https://github.com/majiayu000/claude-skill-registry
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/genomics" ~/.claude/skills/majiayu000-claude-skill-registry-genomics && rm -rf "$T"
skills/data/genomics/SKILL.mdGenomics and Transcriptomics Analysis
When to Use This Skill
- When data contains gene expression measurements (RNA-seq, microarray)
- When analyzing differential gene expression
- When performing pathway or gene set enrichment analysis
- When interpreting genetic variants or mutations
Core Concepts
Gene Expression Data Types
RNA-seq counts:
- Raw read counts per gene
- Requires normalization (TPM, RPKM, DESeq2)
- Suitable for differential expression analysis
Microarray intensities:
- Probe fluorescence intensities
- Log-transformed, background-corrected
- Legacy platform, less common now
Single-cell RNA-seq:
- Expression per cell (not bulk tissue)
- High sparsity (many zeros)
- Specialized analysis methods
Gene Nomenclature
Human genes:
- Official symbols: HUGO Gene Nomenclature Committee (HGNC)
- Example: TP53 (tumor protein p53)
- Italicized in publications
Mouse genes:
- Similar to human but capitalization differs
- Example: Tp53 (first letter capital, rest lowercase)
Protein names:
- Not italicized
- Example: p53 protein
Always verify gene symbols - aliases and outdated names are common.
Differential Expression Analysis
Workflow
import pandas as pd import numpy as np from scipy.stats import ttest_ind from statsmodels.stats.multitest import multipletests # Load expression data (genes × samples) # Rows = genes, Columns = samples expr_data = pd.read_csv("expression_data.csv", index_col=0) # Define groups group1_samples = ["Sample1", "Sample2", "Sample3"] group2_samples = ["Sample4", "Sample5", "Sample6"] results = [] for gene in expr_data.index: group1_expr = expr_data.loc[gene, group1_samples] group2_expr = expr_data.loc[gene, group2_samples] # T-test t_stat, p_value = ttest_ind(group1_expr, group2_expr) # Fold change mean1 = group1_expr.mean() mean2 = group2_expr.mean() log2fc = np.log2(mean1 / mean2) if mean2 > 0 else np.nan results.append({ "gene": gene, "log2FC": log2fc, "p_value": p_value, "mean_group1": mean1, "mean_group2": mean2 }) results_df = pd.DataFrame(results) # Multiple testing correction results_df["p_adj"] = multipletests(results_df["p_value"], method="fdr_bh")[1] # Define significant genes significant = results_df[ (results_df["p_adj"] < 0.05) & (abs(results_df["log2FC"]) > 1) # 2-fold change ] print(f"Significant genes: {len(significant)}") print(f"Upregulated: {sum(significant['log2FC'] > 0)}") print(f"Downregulated: {sum(significant['log2FC'] < 0)}")
Volcano Plot
Visualize differential expression:
import matplotlib.pyplot as plt plt.figure(figsize=(8, 6)) plt.scatter( results_df["log2FC"], -np.log10(results_df["p_adj"]), alpha=0.5, s=10, c="gray" ) # Highlight significant genes sig_mask = (results_df["p_adj"] < 0.05) & (abs(results_df["log2FC"]) > 1) plt.scatter( results_df.loc[sig_mask, "log2FC"], -np.log10(results_df.loc[sig_mask, "p_adj"]), alpha=0.7, s=20, c="red", label="Significant" ) plt.xlabel("log2 Fold Change") plt.ylabel("-log10(adjusted p-value)") plt.axhline(-np.log10(0.05), linestyle="--", color="black", linewidth=0.5) plt.axvline(-1, linestyle="--", color="black", linewidth=0.5) plt.axvline(1, linestyle="--", color="black", linewidth=0.5) plt.title("Volcano Plot") plt.legend() plt.savefig("volcano_plot.png", dpi=300)
Gene Set Enrichment
Simple Pathway Enrichment
When: You have a list of significant genes and want to know which pathways are affected
# Define gene sets (pathways) gene_sets = { "Cell Cycle": ["TP53", "CDK1", "CCNB1", "CDC20", ...], "Apoptosis": ["TP53", "BAX", "BCL2", "CASP3", ...], "DNA Repair": ["TP53", "BRCA1", "BRCA2", "ATM", ...], # ... more pathways } # Fisher's exact test for enrichment from scipy.stats import fisher_exact all_genes = set(expr_data.index) sig_genes = set(significant["gene"]) enrichment_results = [] for pathway, pathway_genes in gene_sets.items(): pathway_genes = set(pathway_genes) & all_genes # Only genes in dataset # 2x2 contingency table a = len(sig_genes & pathway_genes) # Sig & in pathway b = len(sig_genes - pathway_genes) # Sig & not in pathway c = len(pathway_genes - sig_genes) # Not sig & in pathway d = len(all_genes - sig_genes - pathway_genes) # Not sig & not in pathway oddsratio, p_value = fisher_exact([[a, b], [c, d]], alternative='greater') enrichment_results.append({ "pathway": pathway, "overlap": a, "pathway_size": len(pathway_genes), "odds_ratio": oddsratio, "p_value": p_value }) enrich_df = pd.DataFrame(enrichment_results) enrich_df["p_adj"] = multipletests(enrich_df["p_value"], method="fdr_bh")[1] enrich_df = enrich_df.sort_values("p_adj") print(enrich_df.head(10))
Gene Ontology (GO) Terms
Common GO categories:
- Biological Process (BP): What the gene does (e.g., "cell cycle", "apoptosis")
- Molecular Function (MF): Biochemical activity (e.g., "kinase activity")
- Cellular Component (CC): Where it acts (e.g., "nucleus", "mitochondrion")
Resources:
- Gene Ontology: http://geneontology.org/
- Enrichr: https://maayanlab.cloud/Enrichr/ (web-based enrichment)
- DAVID: https://david.ncifcrf.gov/
KEGG Pathway Enrichment
KEGG = Kyoto Encyclopedia of Genes and Genomes
Provides curated pathway maps for:
- Metabolic pathways
- Signaling pathways
- Disease pathways
Example pathways:
- hsa04110: Cell cycle
- hsa04210: Apoptosis
- hsa04151: PI3K-Akt signaling
Common Analysis Patterns
Pattern 1: Transcription Factor Activity
Observation: Many genes upregulated
Hypothesis: Shared transcription factor (TF)
Test:
# Check if significant genes share TF binding motifs tf_targets = { "TP53": ["BAX", "CDKN1A", "MDM2", "GADD45A", ...], "MYC": ["CDK4", "CCND1", "E2F1", ...], # ... more TFs } # Test for enrichment (same as pathway enrichment)
Interpretation: Enrichment suggests TF is active/inactive in condition
Pattern 2: Pathway Coordination
Observation: Genes in same pathway all up/down together
Interpretation: Pathway-level regulation (not individual genes)
Example:
All glycolysis genes ↑↑ → Increased glycolysis All oxidative phosphorylation genes ↓↓ → Metabolic shift
Pattern 3: Compensatory Response
Observation: Opposite regulation of related pathways
Example:
De novo biosynthesis genes ↓ Salvage pathway genes ↑ → Metabolic switch to energy-efficient salvage
Correlation Analysis
Co-expression Networks
When: Identify genes that change together
from scipy.stats import pearsonr # Compute pairwise correlations genes = significant["gene"].tolist()[:50] # Top 50 for tractability corr_matrix = expr_data.loc[genes].T.corr() # Filter high correlations high_corr = [] for i in range(len(genes)): for j in range(i+1, len(genes)): if abs(corr_matrix.iloc[i, j]) > 0.8: high_corr.append({ "gene1": genes[i], "gene2": genes[j], "correlation": corr_matrix.iloc[i, j] }) print(f"High correlations (|r| > 0.8): {len(high_corr)}")
Interpretation:
- Positive correlation → co-regulated (same pathway, shared TF)
- Negative correlation → antagonistic regulation
Network Visualization
import networkx as nx # Build network G = nx.Graph() for item in high_corr: G.add_edge(item["gene1"], item["gene2"], weight=abs(item["correlation"])) # Find communities (clusters of co-expressed genes) from networkx.algorithms import community communities = community.greedy_modularity_communities(G) for i, comm in enumerate(communities): print(f"Community {i}: {list(comm)}")
Literature Search Strategies
Effective Queries
For gene function:
"[GENE] function" "[GENE] role in [PROCESS]" "[GENE] knockout phenotype"
For pathway context:
"[GENE] pathway" "[GENE] interacting proteins" "[GENE] regulation"
For disease relevance:
"[GENE] [DISEASE]" "[GENE] mutation [DISEASE]"
Key Databases
- NCBI Gene: Gene summaries and references
- UniProt: Protein function and domains
- STRING: Protein-protein interactions
- GeneCards: Comprehensive gene info
- PubMed: Literature search
Genomics-Specific Hypotheses
Template Hypotheses
H1: Transcriptional Regulation
"Condition X activates transcription factor [TF], upregulating target genes [G1, G2, G3] in pathway [P]"
H2: Pathway Activation
"Condition X activates [pathway], evidenced by coordinated upregulation of pathway genes and increased activity signature"
H3: Epigenetic Regulation
"Condition X alters chromatin state at [locus], changing expression of genes [G1, G2]"
H4: Post-transcriptional Regulation
"MicroRNA [miR] is upregulated, suppressing target genes [G1, G2], explaining decreased protein levels despite unchanged mRNA"
Quality Control
Before interpreting results:
- Check for batch effects (PCA colored by batch)
- Verify sample labels are correct
- Check for outlier samples (hierarchical clustering)
- Confirm expression distribution (should be roughly normal after log transform)
- Verify normalization (samples should have similar distributions)
Common Pitfalls
❌ Ignoring log transformation
- Expression data should be log-transformed for most analyses
- Fold changes are linear differences in log space
❌ Using nominal p-values for many genes
- Always correct for multiple testing (FDR)
- Use adjusted p-values for significance
❌ Overinterpreting small fold changes
- log2FC < 0.5 (1.4-fold) may not be biologically meaningful
- Use stricter thresholds for noisy data
❌ Confusing gene expression with protein activity
- mRNA ≠ protein levels
- Protein activity may require post-translational modifications
❌ Cherry-picking genes
- Don't select genes to fit a story
- Use unbiased pathway enrichment
Integration with Other Data Types
Transcriptomics + Metabolomics
Strategy:
1. Identify differentially expressed metabolic enzymes 2. Map to KEGG pathways 3. Check if corresponding metabolites are changed 4. Build integrated metabolic model
Example:
Gene: PHGDH (phosphoglycerate dehydrogenase) ↑ Metabolite: Serine ↑ → Integrated finding: Increased serine biosynthesis
Transcriptomics + Proteomics
Compare mRNA vs protein changes:
- Concordant (both up/down) → transcriptional regulation
- Discordant (mRNA ≠ protein) → post-transcriptional regulation
Key Principle
Gene expression is the messenger, not the message.
mRNA changes indicate potential for protein changes. Always consider:
- Post-transcriptional regulation (miRNA, RNA stability)
- Translational control
- Protein stability and degradation
- Post-translational modifications
Connect expression changes to phenotype through pathways and functional validation.