LLMs-Universal-Life-Science-and-Clinical-Skills- mageck-analysis

<!--
# COPYRIGHT NOTICE
# This file is part of the "Universal Biomedical Skills" project.
# Copyright (c) 2026 MD BABU MIA, PhD <md.babu.mia@mssm.edu>
# All Rights Reserved.
#
# This code is proprietary and confidential.
# Unauthorized copying of this file, via any medium is strictly prohibited.
#
# Provenance: Authenticated by MD BABU MIA
-->

name: bio-crispr-screens-mageck-analysis
description: MAGeCK (Model-based Analysis of Genome-wide CRISPR-Cas9 Knockout) for pooled CRISPR screen analysis. Covers count normalization, gene ranking, and pathway analysis. Use when identifying essential genes, drug targets, or resistance mechanisms from dropout or enrichment screens.
tool_type: cli
primary_tool: mageck
measurable_outcome: Execute skill workflow successfully with valid output within 15 minutes.
allowed-tools:

  • read_file
  • run_shell_command

MAGeCK CRISPR Screen Analysis

Count sgRNAs from FASTQ

# Count reads mapping to sgRNA library
mageck count \
    -l library.csv \
    -n experiment \
    --sample-label Day0,Treated1,Treated2,Control1,Control2 \
    --fastq Day0.fastq.gz Treated1.fastq.gz Treated2.fastq.gz Control1.fastq.gz Control2.fastq.gz \
    --norm-method median

# Output files:
# experiment.count.txt - raw read counts per sgRNA
# experiment.count_normalized.txt - normalized counts
# experiment.countsummary.txt - QC summary
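
A quick check of the QC summary can catch failed samples before any testing. The following is a minimal sketch, assuming the summary table has columns named Label, Reads, Mapped, TotalsgRNAs, Zerocounts and GiniIndex (verify these against the header of your own file). The thresholds are illustrative rules of thumb, not values taken from this skill.

```python
import pandas as pd

# Illustrative thresholds (assumptions); tune them for your library and depth.
MIN_MAPPED_FRACTION = 0.60   # fraction of reads that map to the library
MAX_ZERO_FRACTION = 0.01     # fraction of sgRNAs with zero reads
MAX_GINI = 0.20              # evenness of the read distribution


def flag_low_quality_samples(summary: pd.DataFrame) -> pd.DataFrame:
    """Return one row per sample with boolean QC flags."""
    qc = pd.DataFrame({"Label": summary["Label"]})
    qc["low_mapping"] = summary["Mapped"] / summary["Reads"] < MIN_MAPPED_FRACTION
    qc["many_zero_counts"] = (
        summary["Zerocounts"] / summary["TotalsgRNAs"] > MAX_ZERO_FRACTION
    )
    qc["uneven_counts"] = summary["GiniIndex"] > MAX_GINI
    flag_columns = ["low_mapping", "many_zero_counts", "uneven_counts"]
    qc["any_flag"] = qc[flag_columns].any(axis=1)
    return qc


# Synthetic table standing in for
# pd.read_csv("experiment.countsummary.txt", sep="\t")
example = pd.DataFrame({
    "Label": ["Day0", "Treated1"],
    "Reads": [1_000_000, 1_000_000],
    "Mapped": [820_000, 310_000],
    "TotalsgRNAs": [60_000, 60_000],
    "Zerocounts": [120, 4_500],
    "GiniIndex": [0.07, 0.31],
})
print(flag_low_quality_samples(example))
```

Samples with any flag set are worth inspecting (or re-sequencing) before they are passed to mageck test or mageck mle.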

Library File Format

# library.csv (comma-separated; column order is sgRNA ID, sequence, gene)
sgRNA_ID,Sequence,Gene
BRCA1_1,ATGGATTTATCTGCTCTTCG,BRCA1
BRCA1_2,CAGCAGATACTTGATGCATC,BRCA1
TP53_1,CCATTGTTCAATATCGTCCG,TP53
...
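
Malformed libraries are a common source of low mapping rates, so it can help to sanity-check the guide list before counting. The sketch below assumes each record has already been parsed into an (sgRNA ID, sequence, gene) triple. The function name check_library and the specific checks are illustrative, not part of MAGeCK.

```python
import re
from collections import Counter

VALID_SEQUENCE = re.compile(r"^[ACGT]+$")


def check_library(records):
    """Return a list of human-readable problems found in the library.

    records: iterable of (sgrna_id, sequence, gene) triples.
    """
    records = list(records)
    problems = []
    for sgrna_id, sequence, gene in records:
        if not VALID_SEQUENCE.match(sequence.upper()):
            problems.append(f"{sgrna_id}: sequence has characters other than A/C/G/T")
        if not gene:
            problems.append(f"{sgrna_id}: missing gene")
    # Duplicate IDs or sequences make read assignment ambiguous.
    id_counts = Counter(r[0] for r in records)
    seq_counts = Counter(r[1].upper() for r in records)
    problems += [f"duplicate sgRNA ID: {i}" for i, n in id_counts.items() if n > 1]
    problems += [f"duplicate sequence: {s}" for s, n in seq_counts.items() if n > 1]
    return problems


guides = [
    ("BRCA1_1", "ATGGATTTATCTGCTCTTCG", "BRCA1"),
    ("BRCA1_2", "CAGCAGATACTTGATGCATC", "BRCA1"),
    ("TP53_1", "CCATTGTTCAATATCGTCCG", "TP53"),
]
print(check_library(guides))
```

An empty list means none of these particular checks failed. It does not guarantee that the library matches the reads.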

MAGeCK Test (RRA Algorithm)

# Compare treatment vs control
mageck test \
    -k experiment.count.txt \
    -t Treated1,Treated2 \
    -c Control1,Control2 \
    -n results \
    --norm-method median \
    --gene-test-fdr-threshold 0.25

# Output files:
# results.gene_summary.txt - gene-level results
# results.sgrna_summary.txt - sgRNA-level results
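
To make the effect of --norm-method median more concrete, here is a rough sketch of median-ratio normalization, the general idea behind that option. Each sample gets a size factor equal to the median ratio of its counts to the per-sgRNA geometric mean across samples. This is an illustration under simplifying assumptions (sgRNAs with a zero in any sample are ignored when estimating size factors). MAGeCK's own implementation may treat zeros and pseudocounts differently.

```python
import numpy as np


def median_ratio_size_factors(counts):
    """Estimate one size factor per sample from a (n_sgrnas, n_samples) array."""
    counts = np.asarray(counts, dtype=float)
    # Use only sgRNAs observed in every sample so the logarithm is defined.
    observed = (counts > 0).all(axis=1)
    log_counts = np.log(counts[observed])
    # Per-sgRNA geometric mean across samples, in log space.
    log_geo_mean = log_counts.mean(axis=1, keepdims=True)
    # Size factor = median ratio of a sample's counts to that reference.
    return np.exp(np.median(log_counts - log_geo_mean, axis=0))


def median_normalize(counts):
    """Divide each sample (column) by its size factor."""
    counts = np.asarray(counts, dtype=float)
    return counts / median_ratio_size_factors(counts)


# Toy example: the second sample was sequenced twice as deeply as the first.
raw = np.array([[10, 20], [100, 200], [50, 100]])
print(median_ratio_size_factors(raw))
print(median_normalize(raw))
```

After normalization the two toy samples have identical counts, which is the behavior expected when they differ only in sequencing depth.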

MAGeCK MLE (Maximum Likelihood)

# Create design matrix
# design.txt:
# Samples    baseline    treatment
# Day0       1           0
# Control1   1           0
# Control2   1           0
# Treated1   1           1
# Treated2   1           1

mageck mle \
    -k experiment.count.txt \
    -d design.txt \
    -n mle_results \
    --norm-method median

# Output: mle_results.gene_summary.txt with beta scores
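
Design matrices are easy to get wrong by hand, so generating them from a sample-to-condition mapping can help. This is a minimal sketch. The helper build_design_matrix is hypothetical, and it assumes every sample receives a 1 in the baseline column plus a 1 in at most one condition column, as in the layout above.

```python
import pandas as pd


def build_design_matrix(sample_conditions, conditions):
    """Build a design matrix with a baseline column and one column per condition.

    sample_conditions: dict mapping sample label -> condition name,
                       or None for baseline-only samples.
    conditions: ordered list of condition column names.
    """
    rows = []
    for sample, condition in sample_conditions.items():
        row = {"Samples": sample, "baseline": 1}
        for name in conditions:
            row[name] = int(condition == name)
        rows.append(row)
    return pd.DataFrame(rows, columns=["Samples", "baseline", *conditions])


design = build_design_matrix(
    {
        "Day0": None,
        "Control1": None,
        "Control2": None,
        "Treated1": "treatment",
        "Treated2": "treatment",
    },
    ["treatment"],
)
print(design.to_csv(sep="\t", index=False))
```

Writing the table with design.to_csv("design.txt", sep="\t", index=False) would produce a file that can be passed to -d. The same helper covers multi-condition layouts by listing more condition names.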

Interpret Results

import pandas as pd

# Load gene summary
genes = pd.read_csv('results.gene_summary.txt', sep='\t')

# Negative selection (dropout/essential)
essential = genes[(genes['neg|fdr'] < 0.05)].sort_values('neg|rank')
print(f'Essential genes (dropout): {len(essential)}')
print(essential[['id', 'neg|score', 'neg|fdr']].head(20))

# Positive selection (enrichment/resistance)
resistant = genes[(genes['pos|fdr'] < 0.05)].sort_values('pos|rank')
print(f'Resistance genes (enriched): {len(resistant)}')
print(resistant[['id', 'pos|score', 'pos|fdr']].head(20))
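
The MLE output can be filtered in a similar way using beta scores, where a negative beta suggests depletion and a positive beta suggests enrichment under a condition. The sketch below assumes the MLE gene summary has columns named Gene, treatment|beta and treatment|wald-fdr (the prefix follows the design-matrix column name; check the header of your own file). The 0.05 cutoff is an illustrative choice.

```python
import pandas as pd


def select_mle_hits(genes, condition="treatment", fdr=0.05):
    """Split significant genes into depleted (beta < 0) and enriched (beta > 0)."""
    beta_col = f"{condition}|beta"
    fdr_col = f"{condition}|wald-fdr"  # assumed column name; verify in your file
    significant = genes[fdr_col] < fdr
    depleted = genes[significant & (genes[beta_col] < 0)].sort_values(beta_col)
    enriched = genes[significant & (genes[beta_col] > 0)].sort_values(
        beta_col, ascending=False
    )
    return depleted, enriched


# Synthetic rows standing in for
# pd.read_csv("mle_results.gene_summary.txt", sep="\t")
example = pd.DataFrame({
    "Gene": ["GENE_A", "GENE_B", "GENE_C", "GENE_D"],
    "treatment|beta": [-1.2, 0.9, -0.1, 0.4],
    "treatment|wald-fdr": [0.001, 0.01, 0.8, 0.3],
})
depleted, enriched = select_mle_hits(example)
print(depleted["Gene"].tolist(), enriched["Gene"].tolist())
```

The gene names and values here are made up for illustration and are not results from any screen.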

Visualize Results

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

genes = pd.read_csv('results.gene_summary.txt', sep='\t')

# Volcano plot
fig, ax = plt.subplots(figsize=(10, 8))

x = genes['neg|lfc']
y = -np.log10(genes['neg|fdr'])

colors = ['red' if fdr < 0.05 else 'gray' for fdr in genes['neg|fdr']]
ax.scatter(x, y, c=colors, alpha=0.5, s=10)

# Label top hits
top_hits = genes[genes['neg|fdr'] < 0.01].nsmallest(10, 'neg|rank')
for _, row in top_hits.iterrows():
    ax.annotate(row['id'], (row['neg|lfc'], -np.log10(row['neg|fdr'])))

ax.axhline(-np.log10(0.05), linestyle='--', color='black', alpha=0.5)
ax.set_xlabel('Log2 Fold Change')
ax.set_ylabel('-log10(FDR)')
ax.set_title('MAGeCK Negative Selection')
plt.savefig('mageck_volcano.png', dpi=150)

MAGeCK Pathway Analysis

# Gene set enrichment on screen results
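# Flags below follow the long option names; confirm with `mageck pathway -h`
mageck pathway \
    --gene-ranking results.gene_summary.txt \
    --gmt-file go_biological_process.gmt \
    -n pathway_results

# Filter the resulting pathway summary by FDR (e.g. < 0.25) downstream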
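
The gene-set file uses the GMT format: one gene set per line, tab-separated, with the set name first, a description second, and the member genes after that. When a custom set is needed, a small helper along these lines can write and read it. The set names below are made-up examples, and format_gmt and parse_gmt are hypothetical helpers rather than MAGeCK functions.

```python
def format_gmt(gene_sets, description="na"):
    """Serialize {set_name: [genes]} to GMT text (name, description, genes...)."""
    lines = [
        "\t".join([name, description, *genes])
        for name, genes in gene_sets.items()
    ]
    return "\n".join(lines) + "\n"


def parse_gmt(text):
    """Parse GMT text back into {set_name: [genes]}, skipping malformed lines."""
    gene_sets = {}
    for line in text.splitlines():
        fields = line.split("\t")
        if len(fields) < 3:  # need a name, a description and at least one gene
            continue
        gene_sets[fields[0]] = fields[2:]
    return gene_sets


# Made-up gene sets for illustration only.
custom_sets = {
    "EXAMPLE_DNA_REPAIR": ["BRCA1", "TP53"],
    "EXAMPLE_SINGLE_GENE": ["TP53"],
}
gmt_text = format_gmt(custom_sets)
print(gmt_text)
```

Gene identifiers in the GMT file need to match the identifiers in the gene summary, otherwise no genes will overlap with any set.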

Time-Course Screens

# Compare multiple timepoints
mageck mle \
    -k timecourse.count.txt \
    -d timecourse_design.txt \
    -n timecourse_results

# Design matrix for time course:
# Samples    baseline    day7    day14
# Day0       1           0       0
# Day7_R1    1           1       0
# Day7_R2    1           1       0
# Day14_R1   1           0       1
# Day14_R2   1           0       1

CRISPR Activation (CRISPRa) Screens

# For CRISPRa, focus on positive selection
mageck test \
    -k crispra.count.txt \
    -t Activated1,Activated2 \
    -c Control1,Control2 \
    -n crispra_results

# Hits are genes where activation causes phenotype
# Use pos|fdr and pos|score columns

MAGeCK-VISPR (Visualization)

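# Initialize a workflow directory; this writes a config.yaml template to edit
mageck-vispr init vispr_workflow --reads Day0.fastq.gz Treated1.fastq.gz
cd vispr_workflow

# Run the Snakemake workflow, then open the interactive report
snakemake --cores 8
vispr server results/*.vispr.yaml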

# config.yaml example:
# experiment: screen_name
# assembly: hg38
# species: homo_sapiens
# targets: library.csv
# sgrnas: experiment.count.txt
# samples:
#   - Day0
#   - Treated1

Related Skills

  • screen-qc - Quality control before MAGeCK
  • hit-calling - Alternative hit calling methods
  • pathway-analysis/gsea - Downstream enrichment analysis
<!-- AUTHOR_SIGNATURE: 9a7f3c2e-MD-BABU-MIA-2026-MSSM-SECURE -->