Claude-skill-registry genomics

Genomics and transcriptomics analysis strategies

install

source · Clone the upstream repo

git clone https://github.com/majiayu000/claude-skill-registry

Claude Code · Install into ~/.claude/skills/

T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/genomics" ~/.claude/skills/majiayu000-claude-skill-registry-genomics && rm -rf "$T"

manifest: skills/data/genomics/SKILL.md

Genomics and Transcriptomics Analysis

When to Use This Skill

When data contains gene expression measurements (RNA-seq, microarray)
When analyzing differential gene expression
When performing pathway or gene set enrichment analysis
When interpreting genetic variants or mutations

Core Concepts

Gene Expression Data Types

RNA-seq counts:

Raw read counts per gene
Requires normalization (TPM, RPKM, DESeq2)
Suitable for differential expression analysis

Microarray intensities:

Probe fluorescence intensities
Log-transformed, background-corrected
Legacy platform, less common now

Single-cell RNA-seq:

Expression per cell (not bulk tissue)
High sparsity (many zeros)
Specialized analysis methods

Gene Nomenclature

Human genes:

Official symbols: HUGO Gene Nomenclature Committee (HGNC)
Example: TP53 (tumor protein p53)
Italicized in publications

Mouse genes:

Similar to human but capitalization differs
Example: Tp53 (first letter capital, rest lowercase)

Protein names:

Not italicized
Example: p53 protein

Always verify gene symbols - aliases and outdated names are common.

Differential Expression Analysis

Workflow

import pandas as pd
import numpy as np
from scipy.stats import ttest_ind
from statsmodels.stats.multitest import multipletests

# Load expression data (genes × samples)
# Rows = genes, Columns = samples
expr_data = pd.read_csv("expression_data.csv", index_col=0)

# Define groups
group1_samples = ["Sample1", "Sample2", "Sample3"]
group2_samples = ["Sample4", "Sample5", "Sample6"]

results = []

for gene in expr_data.index:
    group1_expr = expr_data.loc[gene, group1_samples]
    group2_expr = expr_data.loc[gene, group2_samples]

    # T-test
    t_stat, p_value = ttest_ind(group1_expr, group2_expr)

    # Fold change
    mean1 = group1_expr.mean()
    mean2 = group2_expr.mean()
    log2fc = np.log2(mean1 / mean2) if mean2 > 0 else np.nan

    results.append({
        "gene": gene,
        "log2FC": log2fc,
        "p_value": p_value,
        "mean_group1": mean1,
        "mean_group2": mean2
    })

results_df = pd.DataFrame(results)

# Multiple testing correction
results_df["p_adj"] = multipletests(results_df["p_value"], method="fdr_bh")[1]

# Define significant genes
significant = results_df[
    (results_df["p_adj"] < 0.05) &
    (abs(results_df["log2FC"]) > 1)  # 2-fold change
]

print(f"Significant genes: {len(significant)}")
print(f"Upregulated: {sum(significant['log2FC'] > 0)}")
print(f"Downregulated: {sum(significant['log2FC'] < 0)}")

Volcano Plot

Visualize differential expression:

import matplotlib.pyplot as plt

plt.figure(figsize=(8, 6))
plt.scatter(
    results_df["log2FC"],
    -np.log10(results_df["p_adj"]),
    alpha=0.5, s=10, c="gray"
)

# Highlight significant genes
sig_mask = (results_df["p_adj"] < 0.05) & (abs(results_df["log2FC"]) > 1)
plt.scatter(
    results_df.loc[sig_mask, "log2FC"],
    -np.log10(results_df.loc[sig_mask, "p_adj"]),
    alpha=0.7, s=20, c="red", label="Significant"
)

plt.xlabel("log2 Fold Change")
plt.ylabel("-log10(adjusted p-value)")
plt.axhline(-np.log10(0.05), linestyle="--", color="black", linewidth=0.5)
plt.axvline(-1, linestyle="--", color="black", linewidth=0.5)
plt.axvline(1, linestyle="--", color="black", linewidth=0.5)
plt.title("Volcano Plot")
plt.legend()
plt.savefig("volcano_plot.png", dpi=300)

Gene Set Enrichment

Simple Pathway Enrichment

When: You have a list of significant genes and want to know which pathways are affected

# Define gene sets (pathways)
gene_sets = {
    "Cell Cycle": ["TP53", "CDK1", "CCNB1", "CDC20", ...],
    "Apoptosis": ["TP53", "BAX", "BCL2", "CASP3", ...],
    "DNA Repair": ["TP53", "BRCA1", "BRCA2", "ATM", ...],
    # ... more pathways
}

# Fisher's exact test for enrichment
from scipy.stats import fisher_exact

all_genes = set(expr_data.index)
sig_genes = set(significant["gene"])

enrichment_results = []

for pathway, pathway_genes in gene_sets.items():
    pathway_genes = set(pathway_genes) & all_genes  # Only genes in dataset

    # 2x2 contingency table
    a = len(sig_genes & pathway_genes)  # Sig & in pathway
    b = len(sig_genes - pathway_genes)  # Sig & not in pathway
    c = len(pathway_genes - sig_genes)  # Not sig & in pathway
    d = len(all_genes - sig_genes - pathway_genes)  # Not sig & not in pathway

    oddsratio, p_value = fisher_exact([[a, b], [c, d]], alternative='greater')

    enrichment_results.append({
        "pathway": pathway,
        "overlap": a,
        "pathway_size": len(pathway_genes),
        "odds_ratio": oddsratio,
        "p_value": p_value
    })

enrich_df = pd.DataFrame(enrichment_results)
enrich_df["p_adj"] = multipletests(enrich_df["p_value"], method="fdr_bh")[1]
enrich_df = enrich_df.sort_values("p_adj")

print(enrich_df.head(10))

Gene Ontology (GO) Terms

Common GO categories:

Biological Process (BP): What the gene does (e.g., "cell cycle", "apoptosis")
Molecular Function (MF): Biochemical activity (e.g., "kinase activity")
Cellular Component (CC): Where it acts (e.g., "nucleus", "mitochondrion")

Resources:

Gene Ontology: http://geneontology.org/
Enrichr: https://maayanlab.cloud/Enrichr/ (web-based enrichment)
DAVID: https://david.ncifcrf.gov/

KEGG Pathway Enrichment

KEGG = Kyoto Encyclopedia of Genes and Genomes

Provides curated pathway maps for:

Metabolic pathways
Signaling pathways
Disease pathways

Example pathways:

hsa04110: Cell cycle
hsa04210: Apoptosis
hsa04151: PI3K-Akt signaling

Common Analysis Patterns

Pattern 1: Transcription Factor Activity

Observation: Many genes upregulated

Hypothesis: Shared transcription factor (TF)

Test:

# Check if significant genes share TF binding motifs
tf_targets = {
    "TP53": ["BAX", "CDKN1A", "MDM2", "GADD45A", ...],
    "MYC": ["CDK4", "CCND1", "E2F1", ...],
    # ... more TFs
}

# Test for enrichment (same as pathway enrichment)

Interpretation: Enrichment suggests TF is active/inactive in condition

Pattern 2: Pathway Coordination

Observation: Genes in same pathway all up/down together

Interpretation: Pathway-level regulation (not individual genes)

Example:

All glycolysis genes ↑↑ → Increased glycolysis
All oxidative phosphorylation genes ↓↓ → Metabolic shift

Pattern 3: Compensatory Response

Observation: Opposite regulation of related pathways

Example:

De novo biosynthesis genes ↓
Salvage pathway genes ↑
→ Metabolic switch to energy-efficient salvage

Correlation Analysis

Co-expression Networks

When: Identify genes that change together

from scipy.stats import pearsonr

# Compute pairwise correlations
genes = significant["gene"].tolist()[:50]  # Top 50 for tractability
corr_matrix = expr_data.loc[genes].T.corr()

# Filter high correlations
high_corr = []
for i in range(len(genes)):
    for j in range(i+1, len(genes)):
        if abs(corr_matrix.iloc[i, j]) > 0.8:
            high_corr.append({
                "gene1": genes[i],
                "gene2": genes[j],
                "correlation": corr_matrix.iloc[i, j]
            })

print(f"High correlations (|r| > 0.8): {len(high_corr)}")

Interpretation:

Positive correlation → co-regulated (same pathway, shared TF)
Negative correlation → antagonistic regulation

Network Visualization

import networkx as nx

# Build network
G = nx.Graph()
for item in high_corr:
    G.add_edge(item["gene1"], item["gene2"], weight=abs(item["correlation"]))

# Find communities (clusters of co-expressed genes)
from networkx.algorithms import community
communities = community.greedy_modularity_communities(G)

for i, comm in enumerate(communities):
    print(f"Community {i}: {list(comm)}")

Literature Search Strategies

Effective Queries

For gene function:

"[GENE] function"
"[GENE] role in [PROCESS]"
"[GENE] knockout phenotype"

For pathway context:

"[GENE] pathway"
"[GENE] interacting proteins"
"[GENE] regulation"

For disease relevance:

"[GENE] [DISEASE]"
"[GENE] mutation [DISEASE]"

Key Databases

NCBI Gene: Gene summaries and references
UniProt: Protein function and domains
STRING: Protein-protein interactions
GeneCards: Comprehensive gene info
PubMed: Literature search

Genomics-Specific Hypotheses

Template Hypotheses

H1: Transcriptional Regulation

"Condition X activates transcription factor [TF], upregulating
target genes [G1, G2, G3] in pathway [P]"

H2: Pathway Activation

"Condition X activates [pathway], evidenced by coordinated
upregulation of pathway genes and increased activity signature"

H3: Epigenetic Regulation

"Condition X alters chromatin state at [locus], changing
expression of genes [G1, G2]"

H4: Post-transcriptional Regulation

"MicroRNA [miR] is upregulated, suppressing target genes [G1, G2],
explaining decreased protein levels despite unchanged mRNA"

Quality Control

Before interpreting results:

Check for batch effects (PCA colored by batch)
Verify sample labels are correct
Check for outlier samples (hierarchical clustering)
Confirm expression distribution (should be roughly normal after log transform)
Verify normalization (samples should have similar distributions)

Common Pitfalls

❌ Ignoring log transformation

Expression data should be log-transformed for most analyses
Fold changes are linear differences in log space

❌ Using nominal p-values for many genes

Always correct for multiple testing (FDR)
Use adjusted p-values for significance

❌ Overinterpreting small fold changes

log2FC < 0.5 (1.4-fold) may not be biologically meaningful
Use stricter thresholds for noisy data

❌ Confusing gene expression with protein activity

mRNA ≠ protein levels
Protein activity may require post-translational modifications

❌ Cherry-picking genes

Don't select genes to fit a story
Use unbiased pathway enrichment

Integration with Other Data Types

Transcriptomics + Metabolomics

Strategy:

1. Identify differentially expressed metabolic enzymes
2. Map to KEGG pathways
3. Check if corresponding metabolites are changed
4. Build integrated metabolic model

Example:

Gene: PHGDH (phosphoglycerate dehydrogenase) ↑
Metabolite: Serine ↑
→ Integrated finding: Increased serine biosynthesis

Transcriptomics + Proteomics

Compare mRNA vs protein changes:

Concordant (both up/down) → transcriptional regulation
Discordant (mRNA ≠ protein) → post-transcriptional regulation

Key Principle

Gene expression is the messenger, not the message.

mRNA changes indicate potential for protein changes. Always consider:

Post-transcriptional regulation (miRNA, RNA stability)
Translational control
Protein stability and degradation
Post-translational modifications

Connect expression changes to phenotype through pathways and functional validation.