SciAgent-Skills cellxgene-census

Query CELLxGENE Census (61M+ cells) programmatically. Search by cell type, tissue, disease, organism. Get expression matrices as AnnData, stream large queries out-of-core, train PyTorch models on single-cell data. For analyzing your own data use scanpy; for annotated data manipulation use anndata.

install

source · Clone the upstream repo

git clone https://github.com/jaechang-hits/SciAgent-Skills

Claude Code · Install into ~/.claude/skills/

T=$(mktemp -d) && git clone --depth=1 https://github.com/jaechang-hits/SciAgent-Skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/genomics-bioinformatics/cellxgene-census" ~/.claude/skills/jaechang-hits-sciagent-skills-cellxgene-census && rm -rf "$T"

manifest: skills/genomics-bioinformatics/cellxgene-census/SKILL.md

source content

CZ CELLxGENE Census

Overview

CZ CELLxGENE Census provides programmatic access to 61+ million standardized single-cell RNA-seq observations from human and mouse. It enables population-scale queries by cell type, tissue, disease, and donor metadata, returning expression data as AnnData objects or PyTorch dataloaders for ML workflows.

When to Use

Querying single-cell expression data across tissues, diseases, or cell types from a curated atlas
Building reference datasets for cell type classification or marker gene discovery
Training ML models on large-scale single-cell data (PyTorch integration)
Comparing gene expression across conditions (e.g., COVID-19 vs healthy) at population scale
Exploring what single-cell datasets are available for a tissue or disease of interest
For analyzing your own scRNA-seq data, use scanpy instead
For manipulating AnnData objects (subsetting, concatenation), use anndata instead

Prerequisites

pip install cellxgene-census
# For ML workflows
pip install "cellxgene-census[experimental]"  # quoted so shells such as zsh do not expand the brackets

API Rate Limits: Census is served through a TileDB-SOMA cloud backend. There is no explicit rate limit, but large queries (>100k cells) should use out-of-core processing (Core API module 4) to avoid memory exhaustion. Always open the Census with a context manager so resources are released properly.

Quick Start

import cellxgene_census

with cellxgene_census.open_soma() as census:
    # Get B cells from lung
    adata = cellxgene_census.get_anndata(
        census=census,
        organism="Homo sapiens",
        obs_value_filter="cell_type == 'B cell' and tissue_general == 'lung' and is_primary_data == True",
        obs_column_names=["cell_type", "disease", "donor_id"],
    )
    print(f"Retrieved {adata.n_obs} cells × {adata.n_vars} genes")
    # e.g. Retrieved ~15000 cells × 60664 genes (exact counts vary by Census release)

Core API

1. Opening and Exploring the Census

Connect to Census and discover available data.

import cellxgene_census

# Open latest stable version (always use context manager)
with cellxgene_census.open_soma() as census:
    # Summary statistics (the summary table stores label/value pairs)
    summary = census["census_info"]["summary"].read().concat().to_pandas()
    total_cells = summary.set_index("label").loc["total_cell_count", "value"]
    print(f"Total cells: {int(total_cells):,}")

    # List all datasets
    datasets = census["census_info"]["datasets"].read().concat().to_pandas()
    print(f"Total datasets: {len(datasets)}")
    print(datasets[["dataset_title", "dataset_total_cell_count"]].head())

# Open specific version for reproducibility
with cellxgene_census.open_soma(census_version="2023-07-25") as census:
    # Reproducible analysis code here
    pass

2. Cell Metadata Queries

Query cell-level metadata without downloading expression data.

import cellxgene_census

with cellxgene_census.open_soma() as census:
    # Get unique cell types in brain
    cell_metadata = cellxgene_census.get_obs(
        census,
        "homo_sapiens",
        value_filter="tissue_general == 'brain' and is_primary_data == True",
        column_names=["cell_type", "disease", "assay"]
    )
    print(f"Total brain cells: {len(cell_metadata):,}")
    print(cell_metadata["cell_type"].value_counts().head(10))

    # Gene metadata query
    gene_metadata = cellxgene_census.get_var(
        census,
        "homo_sapiens",
        value_filter="feature_name in ['CD4', 'CD8A', 'FOXP3']",
        column_names=["feature_id", "feature_name", "feature_length"]
    )
    print(gene_metadata)
    # Returns DataFrame with Ensembl IDs, gene symbols, and lengths

3. Expression Data Queries (Small-Medium Scale)

Retrieve expression matrices as AnnData objects for queries returning <100k cells.

import cellxgene_census

with cellxgene_census.open_soma() as census:
    # Query by cell type + tissue + disease
    adata = cellxgene_census.get_anndata(
        census=census,
        organism="Homo sapiens",
        obs_value_filter="cell_type == 'T cell' and disease == 'COVID-19' and is_primary_data == True",
        var_value_filter="feature_name in ['CD4', 'CD8A', 'CD19', 'FOXP3']",
        obs_column_names=["cell_type", "tissue_general", "donor_id"],
    )
    print(f"Shape: {adata.shape}")  # (n_cells, 4)
    print(f"Metadata columns: {list(adata.obs.columns)}")

Filter syntax reference:

Combine conditions: `and`, `or`
Multiple values: `feature_name in ['CD4', 'CD8A']`
Comparison (numeric columns only, e.g. in a gene filter): `feature_length > 1000`
Always include `is_primary_data == True` in cell filters to avoid duplicate cells
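
Filter strings are easy to get wrong when assembled by hand. The sketch below builds one from keyword arguments. `build_value_filter` is a hypothetical helper written for this guide, not part of `cellxgene_census`, and it assumes string values contain no single quotes.

```python
def build_value_filter(primary_only=True, **conditions):
    """Build a SOMA value_filter string from column=value pairs.

    Hypothetical helper (not part of cellxgene_census). A list or tuple
    becomes an `in [...]` test, any other value an `==` test.
    """
    def quote(value):
        # Booleans and numbers are written bare; strings are single-quoted.
        if isinstance(value, (bool, int, float)):
            return str(value)
        if "'" in value:
            raise ValueError(f"single quotes are not supported: {value!r}")
        return f"'{value}'"

    clauses = []
    for column, value in conditions.items():
        if isinstance(value, (list, tuple)):
            items = ", ".join(quote(v) for v in value)
            clauses.append(f"{column} in [{items}]")
        else:
            clauses.append(f"{column} == {quote(value)}")
    if primary_only:
        # Appended by default, following the rule stated above.
        clauses.append("is_primary_data == True")
    return " and ".join(clauses)


obs_filter = build_value_filter(cell_type="B cell", tissue_general=["lung", "blood"])
print(obs_filter)
# cell_type == 'B cell' and tissue_general in ['lung', 'blood'] and is_primary_data == True
```

The resulting string can be passed as `obs_value_filter` to `get_anndata`. For gene filters, call it with `primary_only=False`, since `is_primary_data` is a cell-level column.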

4. Large-Scale Out-of-Core Queries

Stream expression data in chunks for queries exceeding available RAM.

import cellxgene_census
import tiledbsoma as soma

with cellxgene_census.open_soma() as census:
    # Estimate query size first
    metadata = cellxgene_census.get_obs(
        census, "homo_sapiens",
        value_filter="tissue_general == 'brain' and is_primary_data == True",
        column_names=["soma_joinid"]
    )
    n_cells = len(metadata)
    print(f"Query will return {n_cells:,} cells")

    # If >100k cells, use streaming
    query = census["census_data"]["homo_sapiens"].axis_query(
        measurement_name="RNA",
        obs_query=soma.AxisQuery(
            value_filter="tissue_general == 'brain' and is_primary_data == True"
        ),
        var_query=soma.AxisQuery(
            value_filter="feature_name in ['FOXP2', 'TBR1', 'SATB2']"
        )
    )

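    # Incremental statistics over the stored (non-zero) entries only
    n_nonzero, total = 0, 0.0
    for batch in query.X("raw").tables():
        values = batch["soma_data"].to_numpy()
        n_nonzero += len(values)
        total += values.sum()

    # Guard against an empty result to avoid dividing by zero
    print(f"Processed {n_nonzero:,} non-zero entries, mean={total / max(n_nonzero, 1):.4f}")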
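
The loop above averages only the stored non-zero entries. A per-gene mean across all cells needs running sums keyed by gene, divided by the total cell count from the metadata query. The following is a dependency-free sketch of that accumulation: plain lists stand in for the `soma_dim_1` and `soma_data` columns of each chunk, and `per_gene_mean` and the sample chunks are illustrative, not Census API.

```python
from collections import defaultdict


def per_gene_mean(chunks, n_cells):
    """Accumulate per-gene sums chunk by chunk, then divide by the cell count.

    `chunks` yields (gene_ids, values) pairs of equal length, mimicking the
    `soma_dim_1` and `soma_data` columns of a sparse X chunk. Zeros are not
    stored, so dividing by `n_cells` (not the entry count) gives the true mean.
    """
    sums = defaultdict(float)
    for gene_ids, values in chunks:
        for gene_id, value in zip(gene_ids, values):
            sums[gene_id] += value
    return {gene_id: total / n_cells for gene_id, total in sums.items()}


# Two made-up chunks covering 4 cells and genes 10 and 11.
chunks = [
    ([10, 11, 10], [2.0, 4.0, 6.0]),
    ([11], [4.0]),
]
means = per_gene_mean(chunks, n_cells=4)
print(means)  # {10: 2.0, 11: 2.0}
```

Only the running sums are held in memory, so the footprint grows with the number of genes queried rather than the number of cells.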

5. Dataset Presence Matrix

Check which datasets measured specific genes (not all genes are in all datasets).

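import cellxgene_census

with cellxgene_census.open_soma() as census:
    # get_presence_matrix takes no gene filter; it returns the full sparse matrix
    presence = cellxgene_census.get_presence_matrix(
        census, organism="Homo sapiens", measurement_name="RNA"
    )
    # Columns are indexed by var soma_joinid, so look up the genes of interest
    genes = cellxgene_census.get_var(
        census,
        "homo_sapiens",
        value_filter="feature_name in ['CD4', 'CD8A', 'PTPRC']",
        column_names=["soma_joinid", "feature_name"],
    )
    subset = presence[:, genes["soma_joinid"].to_numpy()]
    print(f"Presence matrix shape: {subset.shape}")
    # (n_datasets, n_genes): nonzero where the gene was measured in the dataset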
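
One common use of the presence matrix is restricting a comparison to datasets that measured every gene of interest. The sketch below shows that selection with nested lists standing in for the sparse matrix; the helper name, dataset IDs, and values are made up for illustration.

```python
def datasets_measuring_all(presence_rows, dataset_ids):
    """Return the dataset IDs whose presence row is true for every gene column."""
    return [
        dataset_id
        for dataset_id, row in zip(dataset_ids, presence_rows)
        if all(row)
    ]


# Rows are datasets, columns are the genes of interest (CD4, CD8A, PTPRC).
presence_rows = [
    [1, 1, 1],
    [1, 0, 1],
    [1, 1, 1],
]
dataset_ids = ["dataset_a", "dataset_b", "dataset_c"]  # placeholder IDs
print(datasets_measuring_all(presence_rows, dataset_ids))
# ['dataset_a', 'dataset_c']
```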

6. PyTorch ML Integration

Train models directly on Census data using the experimental dataloader.

import cellxgene_census
import cellxgene_census.experimental.ml as census_ml
import tiledbsoma as soma

# Assumes the ExperimentDataPipe API of cellxgene-census 1.x; check your installed
# version, since later releases deprecate it in favor of the tiledbsoma-ml package.
with cellxgene_census.open_soma() as census:
    datapipe = census_ml.ExperimentDataPipe(
        census["census_data"]["homo_sapiens"],
        measurement_name="RNA",
        X_name="raw",
        obs_query=soma.AxisQuery(
            value_filter="tissue_general == 'liver' and is_primary_data == True"
        ),
        obs_column_names=["cell_type"],
        batch_size=128,
        shuffle=True,
    )
    dataloader = census_ml.experiment_dataloader(datapipe)

    for X_batch, obs_batch in dataloader:
        # X_batch: expression tensor; obs_batch: integer-encoded obs columns
        print(f"Batch X shape: {X_batch.shape}, obs shape: {obs_batch.shape}")
        break  # Show first batch only

Key Concepts

Census Data Model

The Census is organized as a SOMA (Stack of Matrices, Annotated) collection:

census/
├── census_info/
│   ├── summary          # Census-wide statistics (label/value pairs)
│   └── datasets         # Dataset metadata
└── census_data/
    ├── homo_sapiens/
    │   ├── obs          # Cell metadata (one row per cell)
    │   └── ms/
    │       └── RNA/
    │           ├── var      # Gene metadata (~60k rows)
    │           └── X/raw    # Expression matrix (sparse)
    └── mus_musculus/
        └── ...

Key Metadata Fields

Field	Type	Description	Example Values
`cell_type`	str	Cell Ontology label	"B cell", "neuron", "macrophage"
`tissue_general`	str	Coarse tissue grouping	"brain", "lung", "blood"
`tissue`	str	Specific tissue	"prefrontal cortex", "alveolar tissue"
`disease`	str	Disease state	"normal", "COVID-19", "lung adenocarcinoma"
`assay`	str	Sequencing assay	"10x 3' v3", "Smart-seq2"
`is_primary_data`	bool	True = unique cell	Always filter `True`
`donor_id`	str	Donor identifier	Used for batch effects

tissue_general vs. tissue

Use `tissue_general` for broad cross-tissue analyses and `tissue` for specific tissue queries:

# Broad: all immune system cells
obs_value_filter = "tissue_general == 'immune system'"
# Specific: only PBMCs
obs_value_filter = "tissue == 'peripheral blood mononuclear cell'"

Common Workflows

Workflow 1: Cross-Tissue Cell Type Comparison

Goal: Compare macrophage gene expression across tissues.

import cellxgene_census
import scanpy as sc

with cellxgene_census.open_soma() as census:
    adata = cellxgene_census.get_anndata(
        census=census,
        organism="Homo sapiens",
        obs_value_filter=(
            "cell_type == 'macrophage' and "
            "tissue_general in ['lung', 'liver', 'brain'] and "
            "is_primary_data == True"
        ),
        obs_column_names=["cell_type", "tissue_general", "donor_id", "disease"],
    )
    print(f"Macrophages: {adata.n_obs} cells from {adata.obs['tissue_general'].nunique()} tissues")

    # Standard scanpy analysis
    sc.pp.normalize_total(adata, target_sum=1e4)
    sc.pp.log1p(adata)
    sc.pp.highly_variable_genes(adata, n_top_genes=2000)
    sc.pp.pca(adata, n_comps=50)
    sc.pp.neighbors(adata)
    sc.tl.umap(adata)

    # Differential expression across tissues
    sc.tl.rank_genes_groups(adata, groupby="tissue_general")
    sc.pl.umap(adata, color=["tissue_general", "disease"])

Workflow 2: Disease-Associated Gene Expression

Goal: Compare marker gene expression between COVID-19 and healthy controls.

Query metadata to identify available cell types in COVID-19 data (Core API module 2)
Retrieve expression data for selected cell types and marker genes (Core API module 3)
Compute mean expression per cell type per condition
Visualize with scanpy `dotplot` or `matrixplot`
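
The aggregation step (mean expression per cell type per condition) can be sketched without any Census dependency. The records below are made-up `(cell_type, disease, expression)` tuples for a single gene, and `mean_by_group` is an illustrative helper. With a real AnnData, a pandas `groupby` on `adata.obs` would do the same job.

```python
from collections import defaultdict


def mean_by_group(records):
    """Average expression for each (cell_type, disease) pair."""
    totals = defaultdict(lambda: [0.0, 0])  # key -> [running sum, count]
    for cell_type, disease, expression in records:
        entry = totals[(cell_type, disease)]
        entry[0] += expression
        entry[1] += 1
    return {key: total / count for key, (total, count) in totals.items()}


# Made-up values for one marker gene, for illustration only.
records = [
    ("T cell", "COVID-19", 3.0),
    ("T cell", "COVID-19", 5.0),
    ("T cell", "normal", 1.0),
    ("B cell", "normal", 2.0),
]
for (cell_type, disease), mean in sorted(mean_by_group(records).items()):
    print(f"{cell_type:8s} {disease:10s} {mean:.2f}")
```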

Key Parameters

Parameter	Function/Endpoint	Default	Range / Options	Effect
`organism`	`get_anndata` , `get_obs`	—	`"Homo sapiens"` , `"Mus musculus"`	Species selection
`census_version`	`open_soma`	latest stable	Date string `"YYYY-MM-DD"`	Pin to specific data release
`obs_value_filter`	`get_anndata` (`value_filter` in `get_obs`)	None	SOMA filter expression	Cell-level filtering
`var_value_filter`	`get_anndata` (`value_filter` in `get_var`)	None	SOMA filter expression	Gene-level filtering
`obs_column_names`	`get_anndata` (`column_names` in `get_obs`)	all columns	list of field names	Reduces data transfer
`batch_size`	`experiment_dataloader`	128	32–512	PyTorch batch size
`shuffle`	`experiment_dataloader`	False	True/False	Randomize training order

Best Practices

Always filter `is_primary_data == True`: Without this filter, duplicate cells across datasets inflate counts and bias analyses.
Estimate query size before loading: Call `get_obs()` with `column_names=["soma_joinid"]` to count cells before downloading expression data. Use out-of-core processing for >100k cells.
Pin `census_version` for reproducibility: The default "latest stable" changes periodically. Always specify the version for published analyses.
Select only needed metadata columns: Passing `obs_column_names` reduces data transfer and memory usage significantly for large queries.
Use `tissue_general` for cross-tissue analyses: The `tissue` field has hundreds of specific values; `tissue_general` provides ~30 coarse groupings suitable for comparative analyses.
Anti-pattern, querying all genes when you need a few: Specify `var_value_filter` to retrieve only genes of interest. Downloading the full ~60k gene matrix for 3 marker genes wastes bandwidth and memory.
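
The last point can be made concrete with a back-of-the-envelope estimate. The sketch assumes 4-byte (float32) values and dense storage, which is a worst case: sparse storage is typically much smaller, but the ratio between the two queries still holds. `dense_matrix_gib` is an illustrative helper, not a Census function.

```python
def dense_matrix_gib(n_cells, n_genes, bytes_per_value=4):
    """Worst-case size of a dense cells x genes matrix in GiB (float32 by default)."""
    return n_cells * n_genes * bytes_per_value / 1024**3


full = dense_matrix_gib(100_000, 60_000)   # every gene
subset = dense_matrix_gib(100_000, 3)      # three marker genes
print(f"all genes: {full:.1f} GiB, 3 genes: {subset * 1024:.2f} MiB")
# all genes: 22.4 GiB, 3 genes: 1.14 MiB
```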

Common Recipes

Recipe: Multi-Tissue Dataset Summary

import cellxgene_census
import pandas as pd

with cellxgene_census.open_soma() as census:
    metadata = cellxgene_census.get_obs(
        census, "homo_sapiens",
        value_filter="is_primary_data == True",
        column_names=["tissue_general", "cell_type", "disease"]
    )
    summary = metadata.groupby("tissue_general").agg(
        n_cells=("cell_type", "size"),
        n_cell_types=("cell_type", "nunique"),
        n_diseases=("disease", "nunique"),
    ).sort_values("n_cells", ascending=False)
    print(summary.head(10))

Recipe: Export Census Subset to h5ad

import cellxgene_census

with cellxgene_census.open_soma() as census:
    adata = cellxgene_census.get_anndata(
        census=census,
        organism="Homo sapiens",
        obs_value_filter="tissue_general == 'heart' and is_primary_data == True",
        obs_column_names=["cell_type", "disease", "donor_id", "assay"],
    )
    adata.write_h5ad("heart_cells.h5ad")
    print(f"Saved {adata.n_obs} cells to heart_cells.h5ad")

Troubleshooting

Problem	Cause	Solution
`MemoryError` on `get_anndata()`	Query returns too many cells	Check count with `get_obs()` first; use out-of-core `axis_query()` for >100k cells
Duplicate cells in results	Missing `is_primary_data == True` filter	Add `is_primary_data == True` to all `obs_value_filter` queries
Gene not found	Wrong gene name or gene not in Census	Check spelling (case-sensitive); try Ensembl ID via `feature_id`; verify with `get_presence_matrix()`
`ConnectionError` / timeout	Census backend temporarily unavailable	Retry after 1-2 minutes; pin a specific `census_version` for reliability
Version inconsistencies	Using default "latest" across sessions	Always specify `census_version` in production code
Slow query performance	Downloading all metadata columns	Specify only needed columns via `obs_column_names`
`ImportError: cellxgene_census`	Package not installed	`pip install cellxgene-census` (note the hyphen)
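
The retry advice for connection errors can be wrapped in a small helper. This is a minimal sketch: `with_retries` is hypothetical, and the exception types a Census query actually raises depend on the installed versions, so the `retry_on` tuple is an assumption to adjust.

```python
import time


def with_retries(func, attempts=3, base_delay=1.0, retry_on=(ConnectionError, TimeoutError)):
    """Call func(), retrying with exponential backoff on the given exceptions."""
    for attempt in range(attempts):
        try:
            return func()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            time.sleep(base_delay * 2**attempt)


# Demo with a stand-in function that fails twice before succeeding.
calls = {"n": 0}


def flaky_query():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("backend unavailable")
    return "ok"


result = with_retries(flaky_query, base_delay=0.01)
print(result, calls["n"])  # ok 3
```

In real use, `func` would open the Census and run the query, with `base_delay` set to around 60 seconds to match the retry interval suggested in the table.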

Related Skills

scanpy-scrna-seq — downstream analysis of Census data (clustering, DEG, visualization)
anndata-annotated-data — manipulating AnnData objects returned by Census queries
esm-protein-language-model — protein embeddings from sequences; complementary to Census gene expression data

References

CELLxGENE Census documentation — official API reference
CELLxGENE Discover — web browser for Census data
TileDB-SOMA — underlying data access layer
CZI (2023) "CZ CELLxGENE Discover: A single-cell data platform for scalable exploration, analysis and modeling of aggregated data" — bioRxiv