SciAgent-Skills cellxgene-census
Query CELLxGENE Census (61M+ cells) programmatically. Search by cell type, tissue, disease, organism. Get expression matrices as AnnData, stream large queries out-of-core, train PyTorch models on single-cell data. For analyzing your own data use scanpy; for annotated data manipulation use anndata.
git clone https://github.com/jaechang-hits/SciAgent-Skills
T=$(mktemp -d) && git clone --depth=1 https://github.com/jaechang-hits/SciAgent-Skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/genomics-bioinformatics/cellxgene-census" ~/.claude/skills/jaechang-hits-sciagent-skills-cellxgene-census && rm -rf "$T"
skills/genomics-bioinformatics/cellxgene-census/SKILL.mdCZ CELLxGENE Census
Overview
CZ CELLxGENE Census provides programmatic access to 61+ million standardized single-cell RNA-seq observations from human and mouse. It enables population-scale queries by cell type, tissue, disease, and donor metadata, returning expression data as AnnData objects or PyTorch dataloaders for ML workflows.
When to Use
- Querying single-cell expression data across tissues, diseases, or cell types from a curated atlas
- Building reference datasets for cell type classification or marker gene discovery
- Training ML models on large-scale single-cell data (PyTorch integration)
- Comparing gene expression across conditions (e.g., COVID-19 vs healthy) at population scale
- Exploring what single-cell datasets are available for a tissue or disease of interest
- For analyzing your own scRNA-seq data, use scanpy instead
- For manipulating AnnData objects (subsetting, concatenation), use anndata instead
Prerequisites
pip install cellxgene-census # For ML workflows pip install cellxgene-census[experimental]
API Rate Limits: Census uses TileDB-SOMA cloud backend. No explicit rate limit, but large queries (>1M cells) should use out-of-core processing (Module 4) to avoid memory exhaustion. Always use context managers for proper resource cleanup.
Quick Start
import cellxgene_census with cellxgene_census.open_soma() as census: # Get B cells from lung adata = cellxgene_census.get_anndata( census=census, organism="Homo sapiens", obs_value_filter="cell_type == 'B cell' and tissue_general == 'lung' and is_primary_data == True", obs_column_names=["cell_type", "disease", "donor_id"], ) print(f"Retrieved {adata.n_obs} cells × {adata.n_vars} genes") # Retrieved ~15000 cells × 60664 genes
Core API
1. Opening and Exploring the Census
Connect to Census and discover available data.
import cellxgene_census # Open latest stable version (always use context manager) with cellxgene_census.open_soma() as census: # Summary statistics summary = census["census_info"]["summary"].read().concat().to_pandas() print(f"Total cells: {summary['total_cell_count'][0]:,}") # List all datasets datasets = census["census_info"]["datasets"].read().concat().to_pandas() print(f"Total datasets: {len(datasets)}") print(datasets[["dataset_title", "cell_count"]].head())
# Open specific version for reproducibility with cellxgene_census.open_soma(census_version="2023-07-25") as census: # Reproducible analysis code here pass
2. Cell Metadata Queries
Query cell-level metadata without downloading expression data.
import cellxgene_census with cellxgene_census.open_soma() as census: # Get unique cell types in brain cell_metadata = cellxgene_census.get_obs( census, "homo_sapiens", value_filter="tissue_general == 'brain' and is_primary_data == True", column_names=["cell_type", "disease", "assay"] ) print(f"Total brain cells: {len(cell_metadata):,}") print(cell_metadata["cell_type"].value_counts().head(10))
# Gene metadata query gene_metadata = cellxgene_census.get_var( census, "homo_sapiens", value_filter="feature_name in ['CD4', 'CD8A', 'FOXP3']", column_names=["feature_id", "feature_name", "feature_length"] ) print(gene_metadata) # Returns DataFrame with Ensembl IDs, gene symbols, and lengths
3. Expression Data Queries (Small-Medium Scale)
Retrieve expression matrices as AnnData objects for queries returning <100k cells.
import cellxgene_census with cellxgene_census.open_soma() as census: # Query by cell type + tissue + disease adata = cellxgene_census.get_anndata( census=census, organism="Homo sapiens", obs_value_filter="cell_type == 'T cell' and disease == 'COVID-19' and is_primary_data == True", var_value_filter="feature_name in ['CD4', 'CD8A', 'CD19', 'FOXP3']", obs_column_names=["cell_type", "tissue_general", "donor_id"], ) print(f"Shape: {adata.shape}") # (n_cells, 4) print(f"Metadata columns: {list(adata.obs.columns)}")
Filter syntax reference:
- Combine conditions:
,andor - Multiple values:
feature_name in ['CD4', 'CD8A'] - Comparison:
cell_count > 1000 - Always include
to avoid duplicate cellsis_primary_data == True
4. Large-Scale Out-of-Core Queries
Stream expression data in chunks for queries exceeding available RAM.
import cellxgene_census import tiledbsoma as soma with cellxgene_census.open_soma() as census: # Estimate query size first metadata = cellxgene_census.get_obs( census, "homo_sapiens", value_filter="tissue_general == 'brain' and is_primary_data == True", column_names=["soma_joinid"] ) n_cells = len(metadata) print(f"Query will return {n_cells:,} cells") # If >100k cells, use streaming query = census["census_data"]["homo_sapiens"].axis_query( measurement_name="RNA", obs_query=soma.AxisQuery( value_filter="tissue_general == 'brain' and is_primary_data == True" ), var_query=soma.AxisQuery( value_filter="feature_name in ['FOXP2', 'TBR1', 'SATB2']" ) ) # Incremental statistics n_obs, total = 0, 0.0 for batch in query.X("raw").tables(): values = batch["soma_data"].to_numpy() n_obs += len(values) total += values.sum() print(f"Processed {n_obs:,} non-zero entries, mean={total/n_obs:.4f}")
5. Dataset Presence Matrix
Check which datasets measured specific genes (not all genes are in all datasets).
import cellxgene_census with cellxgene_census.open_soma() as census: presence = cellxgene_census.get_presence_matrix( census, "homo_sapiens", var_value_filter="feature_name in ['CD4', 'CD8A', 'PTPRC']" ) print(f"Presence matrix shape: {presence.shape}") # (n_datasets, n_genes) — True if gene measured in dataset
6. PyTorch ML Integration
Train models directly on Census data using the experimental dataloader.
from cellxgene_census.experimental.ml import experiment_dataloader import cellxgene_census with cellxgene_census.open_soma() as census: dataloader = experiment_dataloader( census["census_data"]["homo_sapiens"], measurement_name="RNA", X_name="raw", obs_value_filter="tissue_general == 'liver' and is_primary_data == True", obs_column_names=["cell_type"], batch_size=128, shuffle=True, ) for batch in dataloader: X = batch["X"] # Gene expression tensor labels = batch["obs"] # Cell metadata print(f"Batch X shape: {X.shape}, labels: {list(labels.columns)}") break # Show first batch only
Key Concepts
Census Data Model
The Census is organized as a SOMA (Stack of Matrices, Annotated) collection:
census/ ├── census_info/ │ ├── summary # Total cell counts │ └── datasets # Dataset metadata └── census_data/ ├── homo_sapiens/ │ └── ms_RNA/ │ ├── obs # Cell metadata (61M+ rows) │ ├── var # Gene metadata (~60k rows) │ └── X/raw # Expression matrix (sparse) └── mus_musculus/ └── ...
Key Metadata Fields
| Field | Type | Description | Example Values |
|---|---|---|---|
| str | Cell Ontology label | "B cell", "neuron", "macrophage" |
| str | Coarse tissue grouping | "brain", "lung", "blood" |
| str | Specific tissue | "prefrontal cortex", "alveolar tissue" |
| str | Disease state | "normal", "COVID-19", "lung adenocarcinoma" |
| str | Sequencing assay | "10x 3' v3", "Smart-seq2" |
| bool | True = unique cell | Always filter |
| str | Donor identifier | Used for batch effects |
tissue_general
vs tissue
tissue_generaltissueUse
tissue_general for broad cross-tissue analyses and tissue for specific tissue queries:
# Broad: all immune system cells obs_value_filter = "tissue_general == 'immune system'" # Specific: only PBMCs obs_value_filter = "tissue == 'peripheral blood mononuclear cell'"
Common Workflows
Workflow 1: Cross-Tissue Cell Type Comparison
Goal: Compare macrophage gene expression across tissues.
import cellxgene_census import scanpy as sc with cellxgene_census.open_soma() as census: adata = cellxgene_census.get_anndata( census=census, organism="Homo sapiens", obs_value_filter=( "cell_type == 'macrophage' and " "tissue_general in ['lung', 'liver', 'brain'] and " "is_primary_data == True" ), obs_column_names=["cell_type", "tissue_general", "donor_id", "disease"], ) print(f"Macrophages: {adata.n_obs} cells from {adata.obs['tissue_general'].nunique()} tissues") # Standard scanpy analysis sc.pp.normalize_total(adata, target_sum=1e4) sc.pp.log1p(adata) sc.pp.highly_variable_genes(adata, n_top_genes=2000) sc.pp.pca(adata, n_comps=50) sc.pp.neighbors(adata) sc.tl.umap(adata) # Differential expression across tissues sc.tl.rank_genes_groups(adata, groupby="tissue_general") sc.pl.umap(adata, color=["tissue_general", "disease"])
Workflow 2: Disease-Associated Gene Expression
Goal: Compare marker gene expression between COVID-19 and healthy controls.
- Query metadata to identify available cell types in COVID-19 data (Core API module 2)
- Retrieve expression data for selected cell types and marker genes (Core API module 3)
- Compute mean expression per cell type per condition
- Visualize with scanpy
ordotplotmatrixplot
Key Parameters
| Parameter | Function/Endpoint | Default | Range / Options | Effect |
|---|---|---|---|---|
| , | — | , | Species selection |
| | latest stable | Date string | Pin to specific data release |
| , | None | SOMA filter expression | Cell-level filtering |
| , | None | SOMA filter expression | Gene-level filtering |
| , | all columns | list of field names | Reduces data transfer |
| | 128 | 32–512 | PyTorch batch size |
| | False | True/False | Randomize training order |
Best Practices
-
Always filter
: Without this filter, duplicate cells across datasets inflate counts and bias analyses.is_primary_data == True -
Estimate query size before loading: Call
withget_obs()
to count cells before downloading expression data. Use out-of-core processing for >100k cells.column_names=["soma_joinid"] -
Pin
for reproducibility: The default "latest stable" changes periodically. Always specify the version for published analyses.census_version -
Select only needed metadata columns: Passing
reduces data transfer and memory usage significantly for large queries.obs_column_names -
Use
for cross-tissue analyses: Thetissue_general
field has hundreds of specific values;tissue
provides ~30 coarse groupings suitable for comparative analyses.tissue_general -
Anti-pattern — querying all genes when you need a few: Specify
to retrieve only genes of interest. Downloading the full ~60k gene matrix for 3 marker genes wastes bandwidth and memory.var_value_filter
Common Recipes
Recipe: Multi-Tissue Dataset Summary
import cellxgene_census import pandas as pd with cellxgene_census.open_soma() as census: metadata = cellxgene_census.get_obs( census, "homo_sapiens", value_filter="is_primary_data == True", column_names=["tissue_general", "cell_type", "disease"] ) summary = metadata.groupby("tissue_general").agg( n_cells=("cell_type", "size"), n_cell_types=("cell_type", "nunique"), n_diseases=("disease", "nunique"), ).sort_values("n_cells", ascending=False) print(summary.head(10))
Recipe: Export Census Subset to h5ad
import cellxgene_census with cellxgene_census.open_soma() as census: adata = cellxgene_census.get_anndata( census=census, organism="Homo sapiens", obs_value_filter="tissue_general == 'heart' and is_primary_data == True", obs_column_names=["cell_type", "disease", "donor_id", "assay"], ) adata.write_h5ad("heart_cells.h5ad") print(f"Saved {adata.n_obs} cells to heart_cells.h5ad")
Troubleshooting
| Problem | Cause | Solution |
|---|---|---|
on | Query returns too many cells | Check count with first; use out-of-core for >100k cells |
| Duplicate cells in results | Missing filter | Add to all queries |
| Gene not found | Wrong gene name or gene not in Census | Check spelling (case-sensitive); try Ensembl ID via ; verify with |
/ timeout | Census backend temporarily unavailable | Retry after 1-2 minutes; pin a specific for reliability |
| Version inconsistencies | Using default "latest" across sessions | Always specify in production code |
| Slow query performance | Downloading all metadata columns | Specify only needed columns via |
| Package not installed | (note the hyphen) |
Related Skills
- scanpy-scrna-seq — downstream analysis of Census data (clustering, DEG, visualization)
- anndata-annotated-data — manipulating AnnData objects returned by Census queries
- esm-protein-language-model — protein embeddings from sequences; complementary to Census gene expression data
References
- CELLxGENE Census documentation — official API reference
- CELLxGENE Discover — web browser for Census data
- TileDB-SOMA — underlying data access layer
- CZI (2023) "CZ CELLxGENE Discover: A single-cell data platform for scalable exploration, analysis and modeling of aggregated data" — bioRxiv