SciAgent-Skills anndata-data-structure
Annotated data matrices for single-cell genomics. AnnData stores expression data (X) with observation metadata (obs), variable metadata (var), layers, embeddings (obsm/varm), graphs (obsp/varp), and unstructured data (uns). Use for .h5ad/.zarr file handling, dataset concatenation, and scverse ecosystem integration. For analysis workflows use scanpy; for probabilistic models use scvi-tools.
```bash
# Clone the full repository
git clone https://github.com/jaechang-hits/SciAgent-Skills

# Or install just this skill into ~/.claude/skills
T=$(mktemp -d) && git clone --depth=1 https://github.com/jaechang-hits/SciAgent-Skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/genomics-bioinformatics/anndata-data-structure" ~/.claude/skills/jaechang-hits-sciagent-skills-anndata-data-structure && rm -rf "$T"
```
skills/genomics-bioinformatics/anndata-data-structure/SKILL.md

AnnData — Annotated Data Matrices for Single-Cell Genomics
Overview
AnnData provides the standard data structure for single-cell genomics in the scverse ecosystem. It stores an observations-by-variables matrix (X) alongside cell metadata (obs), gene metadata (var), layers, embeddings (obsm/varm), graphs (obsp/varp), and unstructured metadata (uns). Supports sparse matrices, H5AD/Zarr storage, backed mode for large files, and integration with Scanpy, scvi-tools, and Muon.
When to Use
- Constructing annotated matrices from raw count data with cell/gene metadata
- Reading/writing `.h5ad` or `.zarr` files for single-cell experiments
- Subsetting cells by quality metrics, gene sets, or metadata conditions
- Concatenating multiple experimental batches with consistent metadata
- Storing multiple data layers (raw counts, normalized, scaled) in one object
- Working with large datasets exceeding RAM (backed mode, lazy concatenation)
- Preparing data for Scanpy or scvi-tools pipelines
- For single-cell analysis (clustering, DE, visualization), use `scanpy` instead
- For probabilistic models, use `scvi-tools` instead
Prerequisites
- Python packages: `anndata`, `numpy`, `scipy`, `pandas`
- Optional: `scanpy` (analysis), `zarr` (cloud storage), `h5py` (HDF5 backend)
- Data requirements: count matrices (dense or sparse), cell/gene metadata tables
```bash
pip install "anndata>=0.10"

# Full ecosystem
pip install anndata scanpy zarr
```
Quick Start
```python
import anndata as ad
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

counts = csr_matrix(np.random.poisson(0.5, (500, 2000)).astype(np.float32))
obs = pd.DataFrame({"cell_type": np.random.choice(["T", "B", "NK"], 500)},
                   index=[f"cell_{i}" for i in range(500)])
var = pd.DataFrame(index=[f"ENSG{i:05d}" for i in range(2000)])

adata = ad.AnnData(X=counts, obs=obs, var=var)
adata.layers["raw_counts"] = counts.copy()
adata.write_h5ad("example.h5ad", compression="gzip")
print(f"Created: {adata.n_obs} cells x {adata.n_vars} genes")
# Created: 500 cells x 2000 genes
```
Core API
1. Object Creation
Build AnnData objects from arrays, DataFrames, and sparse matrices.
```python
import anndata as ad
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

# Minimal: just a matrix
adata_min = ad.AnnData(X=np.random.rand(100, 50).astype(np.float32))
print(f"Minimal: {adata_min.shape}")  # (100, 50)

# Full: sparse matrix + obs/var metadata
n_obs, n_vars = 300, 1000
X = csr_matrix(np.random.poisson(1, (n_obs, n_vars)).astype(np.float32))
obs = pd.DataFrame({"cell_type": np.random.choice(["T", "B", "Mono"], n_obs),
                    "batch": np.repeat(["ctrl", "stim"], n_obs // 2)},
                   index=[f"cell_{i}" for i in range(n_obs)])
var = pd.DataFrame({"gene_symbol": [f"Gene_{i}" for i in range(n_vars)],
                    "mt": [i < 13 for i in range(n_vars)]},
                   index=[f"ENSG{i:05d}" for i in range(n_vars)])
adata = ad.AnnData(X=X, obs=obs, var=var)
print(f"Full: {adata.shape}, obs cols: {list(adata.obs.columns)}")
# Full: (300, 1000), obs cols: ['cell_type', 'batch']

# From a pandas DataFrame (rows=obs, columns=vars)
df = pd.DataFrame(np.random.rand(50, 20),
                  index=[f"sample_{i}" for i in range(50)],
                  columns=[f"feature_{i}" for i in range(20)])
adata_df = ad.AnnData(df)
print(f"From DataFrame: {adata_df.shape}")  # (50, 20)
```
2. I/O Operations
Read and write in multiple formats including backed mode for large files.
```python
import anndata as ad
import scanpy as sc

# H5AD (native format, recommended for most use cases)
adata = ad.read_h5ad("data.h5ad")
adata.write_h5ad("output.h5ad", compression="gzip")  # gzip: smaller files

# 10X Genomics formats (readers live in scanpy, not anndata)
adata_10x = sc.read_10x_h5("filtered_feature_bc_matrix.h5")
# adata_mtx = sc.read_10x_mtx("filtered_feature_bc_matrix/")

# Zarr format (cloud-friendly, parallel I/O)
adata.write_zarr("output.zarr")
adata_zarr = ad.read_zarr("output.zarr")

# Other formats
# adata = ad.read_csv("expression.csv")
# adata = ad.read_loom("data.loom")

print(f"Loaded: {adata.n_obs} obs x {adata.n_vars} vars")
```
```python
import anndata as ad

# Backed mode: lazy loading for files larger than RAM
adata_backed = ad.read_h5ad("large_data.h5ad", backed="r")  # read-only
print(f"Backed: {adata_backed.n_obs} obs, isbacked={adata_backed.isbacked}")

# Filter on metadata (no data loaded), then load subset into memory
subset = adata_backed[adata_backed.obs["tissue"] == "brain"].to_memory()
print(f"Loaded subset: {subset.n_obs} cells")

# Read-write backed mode:
adata_rw = ad.read_h5ad("data.h5ad", backed="r+")

# Format conversion:
ad.read_loom("data.loom").write_h5ad("out.h5ad", compression="gzip")
```
3. Subsetting and Views
Select cells and genes by indices, names, boolean masks, or metadata conditions.
```python
import anndata as ad

adata = ad.read_h5ad("data.h5ad")

# Boolean mask (most common)
t_cells = adata[adata.obs["cell_type"] == "T_cell"]
print(f"T cells: {t_cells.n_obs}, is_view: {t_cells.is_view}")  # is_view: True

# Integer index / name-based / combined axis
first_100 = adata[:100, :500]
selected = adata[["cell_0", "cell_1"], ["ENSG00000", "ENSG00001"]]

# Combined metadata conditions
high_quality = adata[
    (adata.obs["n_genes"] > 200) & (adata.obs["pct_mito"] < 0.2)
]
print(f"QC filter: {high_quality.n_obs} / {adata.n_obs} cells")

# Views vs copies: subsetting returns a view (lightweight, shares data)
# .copy() creates an independent object (REQUIRED before modification)
independent = adata[adata.obs["batch"] == "ctrl"].copy()
print(f"Is view: {independent.is_view}")  # False
```
4. Layers, Embeddings, and Graphs
Store multiple data representations, dimensionality reductions, and cell-cell graphs.
```python
import anndata as ad
import numpy as np
from scipy.sparse import csr_matrix

adata = ad.read_h5ad("data.h5ad")

# Layers: alternative representations of X (same shape as X)
adata.layers["raw_counts"] = adata.X.copy()
adata.layers["normalized"] = adata.X.copy()
print(f"Layers: {list(adata.layers.keys())}")  # ['raw_counts', 'normalized']

# Embeddings in obsm (n_obs x n_components)
adata.obsm["X_pca"] = np.random.randn(adata.n_obs, 50).astype(np.float32)
adata.obsm["X_umap"] = np.random.randn(adata.n_obs, 2).astype(np.float32)
print(f"obsm keys: {list(adata.obsm.keys())}")

# Variable loadings in varm (n_vars x n_components)
adata.varm["PCs"] = np.random.randn(adata.n_vars, 50).astype(np.float32)

# Pairwise graphs in obsp (n_obs x n_obs, sparse)
adata.obsp["connectivities"] = csr_matrix(
    np.random.rand(adata.n_obs, adata.n_obs) > 0.99)
adata.obsp["distances"] = adata.obsp["connectivities"].copy()

# Unstructured metadata in uns (arbitrary dict)
adata.uns["experiment"] = {"date": "2024-06-01", "protocol": "10x_v3"}
adata.uns["neighbors"] = {"params": {"n_neighbors": 15, "method": "umap"}}
adata.uns["cell_type_colors"] = ["#1f77b4", "#ff7f0e", "#2ca02c"]
print(f"uns keys: {list(adata.uns.keys())}")
```
5. Concatenation
Merge datasets along observations or variables with flexible join and merge strategies.
```python
import anndata as ad
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

# Create sample datasets
def make_adata(n, genes, batch_name):
    X = csr_matrix(np.random.poisson(1, (n, len(genes))).astype(np.float32))
    obs = pd.DataFrame({"sample": batch_name},
                       index=[f"{batch_name}_{i}" for i in range(n)])
    return ad.AnnData(X=X, obs=obs, var=pd.DataFrame(index=genes))

shared = [f"Gene_{i}" for i in range(100)]
adata1 = make_adata(200, shared + ["GeneA"], "batch1")
adata2 = make_adata(300, shared + ["GeneB"], "batch2")

# Along observations (axis=0): stack cells
combined = ad.concat(
    [adata1, adata2],
    axis=0,
    join="inner",
    label="batch",
    keys=["B1", "B2"],
    merge="same",
)
print(f"Inner join: {combined.n_obs} cells, {combined.n_vars} genes")
# Inner join: 500 cells, 100 genes

# Outer join: keeps all genes, fills missing with NaN/0
combined_outer = ad.concat([adata1, adata2], join="outer")
print(f"Outer join: {combined_outer.n_vars} genes")  # 102 genes

# Along variables (axis=1): multi-modal
n = 100
obs = pd.DataFrame(index=[f"cell_{i}" for i in range(n)])
rna = ad.AnnData(X=csr_matrix(np.random.poisson(1, (n, 500)).astype(np.float32)),
                 obs=obs, var=pd.DataFrame(index=[f"RNA_{i}" for i in range(500)]))
protein = ad.AnnData(X=csr_matrix(np.random.rand(n, 50).astype(np.float32)),
                     obs=obs, var=pd.DataFrame(index=[f"ADT_{i}" for i in range(50)]))
multimodal = ad.concat([rna, protein], axis=1)
print(f"Multimodal: {multimodal.shape}")  # (100, 550)
```
```python
# Lazy concatenation for very large datasets (no data copying)
from anndata.experimental import AnnCollection

collection = AnnCollection(
    {"batch1": adata1, "batch2": adata2},
    join_obs="inner",
)
print(f"Lazy collection: {collection.n_obs} total obs")

# On-disk concat (writes directly to disk without loading all into memory)
# ad.experimental.concat_on_disk({"b1": "batch1.h5ad", "b2": "batch2.h5ad"}, "combined.h5ad")
```
6. Data Manipulation
Type conversions, metadata management, renaming, and quality control filtering.
```python
import anndata as ad
import numpy as np
from scipy.sparse import csr_matrix, issparse

adata = ad.read_h5ad("data.h5ad")

# Type conversions
adata.strings_to_categoricals()  # string cols -> categorical (saves memory)
if not issparse(adata.X):
    adata.X = csr_matrix(adata.X)  # dense -> sparse
dense_X = adata.X.toarray() if issparse(adata.X) else adata.X  # sparse -> dense

# Adding/removing metadata columns
adata.obs["log_counts"] = np.log1p(np.array(adata.X.sum(axis=1)).flatten())
adata.var["mean_expr"] = np.array(adata.X.mean(axis=0)).flatten()
del adata.obs["unwanted_column"]  # remove

# Renaming observations/variables/categories
adata.obs_names_make_unique()  # add suffixes to duplicate names
adata.var_names_make_unique()
adata.obs["cell_type"] = adata.obs["cell_type"].cat.rename_categories(
    {"T": "T_cell", "B": "B_cell"})

# Quality control filtering (always .copy() after subsetting)
adata.obs["n_genes"] = np.array((adata.X > 0).sum(axis=1)).flatten()
mito_mask = adata.var_names.str.startswith("MT-")
adata.obs["pct_mito"] = (np.array(adata[:, mito_mask].X.sum(axis=1)).flatten()
                         / np.array(adata.X.sum(axis=1)).flatten())
adata_qc = adata[(adata.obs["n_genes"] > 200)
                 & (adata.obs["pct_mito"] < 0.2)].copy()
print(f"After QC: {adata_qc.n_obs} / {adata.n_obs} cells")
```
Key Concepts
AnnData Object Architecture
The AnnData object is an annotated matrix with the following slots:
| Slot | Type | Shape | Description | Common Keys |
|---|---|---|---|---|
| `X` | matrix (sparse/dense) | (n_obs, n_vars) | Primary data (expression counts) | -- |
| `obs` | DataFrame | (n_obs, _) | Cell/observation metadata | cell_type, sample, n_genes, batch |
| `var` | DataFrame | (n_vars, _) | Gene/variable metadata | gene_name, highly_variable, mt |
| `layers` | dict of matrices | same as X | Alternative representations | raw_counts, normalized, scaled |
| `obsm` | dict of arrays | (n_obs, _) | Embeddings per observation | X_pca, X_umap, X_tsne |
| `varm` | dict of arrays | (n_vars, _) | Loadings per variable | PCs |
| `obsp` | dict of sparse | (n_obs, n_obs) | Pairwise observation graphs | connectivities, distances |
| `varp` | dict of sparse | (n_vars, n_vars) | Pairwise variable relationships | -- |
| `uns` | dict | unstructured | Analysis parameters and metadata | neighbors, colors, experiment |
| `raw` | AnnData | original shape | Snapshot before gene filtering | -- |
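Printing an AnnData object lists which of these slots are populated. A minimal sketch (the file name and slot contents below are illustrative, not fixed by the API):

```python
import anndata as ad

adata = ad.read_h5ad("processed.h5ad")  # hypothetical file from Workflow 1
print(adata)
# AnnData object with n_obs × n_vars = 500 × 2000
#     obs: 'cell_type', 'n_genes', 'pct_mito'
#     var: 'gene_symbol', 'mt'
#     uns: 'experiment'
#     obsm: 'X_pca', 'X_umap'
#     layers: 'counts'
```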
Views vs Copies
Subsetting returns a view (a lightweight reference sharing data with the parent). Always call `.copy()` before modification to avoid `ImplicitModificationWarning`.
```python
view = adata[adata.obs["cell_type"] == "T_cell"]
print(f"is_view: {view.is_view}")  # True -- shares memory

independent = view.copy()
print(f"is_view: {independent.is_view}")  # False -- independent
```
Storage Formats
| Format | Extension | Best For | Backed Mode | Notes |
|---|---|---|---|---|
| H5AD | `.h5ad` | Default storage, random access | Yes (`backed="r"`, `backed="r+"`) | Based on HDF5; supports compression |
| Zarr | `.zarr` | Cloud storage, parallel I/O | No | Directory-based; good for S3/GCS |
| 10X H5 | `.h5` | 10X Genomics CellRanger output | No | Read-only via `scanpy.read_10x_h5` |
| Loom | `.loom` | Legacy format (HDF5-based) | No | Deprecated in favor of H5AD |
| CSV | `.csv` | Interoperability, small datasets | No | No sparse/metadata support |
Common Workflows
Workflow 1: Single-cell RNA-seq Data Preparation
Goal: Load raw data, QC filter, normalize, and save for downstream Scanpy/scvi-tools analysis.
```python
import anndata as ad
import numpy as np
from scipy.sparse import issparse

# 1. Load and QC filter (see Core API 6 for metric computation details)
adata = ad.read_h5ad("raw_counts.h5ad")
adata.obs["n_genes"] = np.array((adata.X > 0).sum(axis=1)).flatten()
adata.obs["total_counts"] = np.array(adata.X.sum(axis=1)).flatten()
mito = adata.var_names.str.startswith("MT-")
adata.obs["pct_mito"] = (np.array(adata[:, mito].X.sum(axis=1)).flatten()
                         / np.array(adata.X.sum(axis=1)).flatten())
adata = adata[(adata.obs["n_genes"].between(200, 5000))
              & (adata.obs["pct_mito"] < 0.2)].copy()
adata = adata[:, np.array((adata.X > 0).sum(axis=0)).flatten() >= 3].copy()

# 2. Store raw counts, then normalize (total-count + log1p)
adata.layers["counts"] = adata.X.copy()
totals = np.array(adata.X.sum(axis=1)).flatten()
if issparse(adata.X):
    adata.X = np.log1p(adata.X.multiply(1.0 / totals[:, None]).toarray() * 1e4)
else:
    adata.X = np.log1p(adata.X / totals[:, None] * 1e4)

# 3. Save
adata.strings_to_categoricals()
adata.write_h5ad("processed.h5ad", compression="gzip")
print(f"Saved: {adata.n_obs} cells x {adata.n_vars} genes, "
      f"layers: {list(adata.layers.keys())}")
```
Workflow 2: Multi-batch Integration
Goal: Load multiple batches, harmonize genes, concatenate with labels, and save.
```python
import anndata as ad
from pathlib import Path

# 1. Load all batches
batches = {}
for h5 in sorted(Path("batches/").glob("*.h5ad")):
    batches[h5.stem] = ad.read_h5ad(str(h5))
    print(f"  {h5.stem}: {batches[h5.stem].n_obs} cells")

# 2. Harmonize genes and concatenate
shared = set.intersection(*[set(a.var_names) for a in batches.values()])
batches = {k: v[:, list(shared)].copy() for k, v in batches.items()}
combined = ad.concat(batches, label="batch", join="inner", merge="same")

# 3. Clean up and save
combined.obs_names_make_unique()
combined.strings_to_categoricals()
combined.write_h5ad("combined_batches.h5ad", compression="gzip")
print(f"Combined: {combined.n_obs} cells x {combined.n_vars} genes, "
      f"{combined.obs['batch'].nunique()} batches")
```
Workflow 3: Large Dataset Processing (Backed Mode)
Goal: Process datasets too large for memory using lazy loading.
- Open the file in backed mode: `adata = ad.read_h5ad("huge.h5ad", backed="r")`
- Inspect metadata without loading data: check `adata.obs`, `adata.var`
- Filter on metadata conditions: `mask = adata.obs["tissue"] == "brain"`
- Load the filtered subset into memory: `subset = adata[mask].to_memory()`
- Process the in-memory subset normally (normalize, filter genes)
- For chunked processing: iterate `adata[i:i+chunk_size].to_memory()` (uses Core API modules 2 and 3; see the sketch below)
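A minimal sketch of the chunked-processing pattern under the assumptions above (`huge.h5ad` and the chunk size are illustrative):

```python
import anndata as ad
import numpy as np

adata = ad.read_h5ad("huge.h5ad", backed="r")  # hypothetical large file
chunk_size = 10_000

# Accumulate per-gene totals one chunk at a time; only chunk_size rows
# are ever resident in memory.
gene_totals = np.zeros(adata.n_vars)
for start in range(0, adata.n_obs, chunk_size):
    chunk = adata[start:start + chunk_size].to_memory()
    gene_totals += np.array(chunk.X.sum(axis=0)).flatten()

print(f"Top gene total: {gene_totals.max():.0f}")
```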
Key Parameters
| Parameter | Module | Default | Range / Options | Effect |
|---|---|---|---|---|
| `backed` | `read_h5ad` | `None` | `None`, `"r"`, `"r+"` | Lazy loading; read-only vs read-write |
| `compression` | `write_h5ad` | `None` | `None`, `"gzip"`, `"lzf"` | File compression; gzip=smaller, lzf=faster |
| `axis` | `concat` | `0` | `0`, `1` | 0=stack observations, 1=stack variables |
| `join` | `concat` | `"inner"` | `"inner"`, `"outer"` | inner=shared features, outer=union with fill |
| `merge` | `concat` | `None` | `"same"`, `"unique"`, `"first"`, `"only"` | Strategy for non-concatenated annotations |
| `label` | `concat` | `None` | Any string | Column name added to obs tracking source |
| `keys` | `concat` | `None` | list of strings | Labels for each dataset in the label column |
| `chunks` | `write_zarr` | `None` | Tuple of ints | Chunk dimensions for Zarr arrays |
| `as_sparse` | `read_h5ad` | `()` | Sequence of slot names | Convert dense arrays to sparse on read |
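To make the `merge` strategies concrete, here is a small illustration with toy objects (the gene names and columns are invented for the demo) showing how var columns are reconciled when concatenating along observations:

```python
import anndata as ad
import numpy as np
import pandas as pd

genes = ["g1", "g2", "g3"]
a = ad.AnnData(X=np.ones((2, 3), dtype=np.float32),
               obs=pd.DataFrame(index=["a0", "a1"]),
               var=pd.DataFrame({"units": ["counts"] * 3, "src": ["a"] * 3},
                                index=genes))
b = ad.AnnData(X=np.zeros((2, 3), dtype=np.float32),
               obs=pd.DataFrame(index=["b0", "b1"]),
               var=pd.DataFrame({"units": ["counts"] * 3, "src": ["b"] * 3},
                                index=genes))

# merge="same": keep only var columns identical across inputs ('units')
print(ad.concat([a, b], merge="same").var.columns.tolist())  # ['units']

# merge="first": take each column from the first object that has it
print(ad.concat([a, b], merge="first").var["src"].tolist())  # ['a', 'a', 'a']

# merge=None (default): drop all non-concatenated annotations
print(ad.concat([a, b]).var.columns.tolist())                # []
```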
Best Practices
- Use sparse matrices for count data: single-cell count matrices are typically 90%+ zeros. Use `scipy.sparse.csr_matrix` to reduce memory by ~10x: `from scipy.sparse import csr_matrix; adata.X = csr_matrix(adata.X)`
- Convert strings to categoricals before saving: repeated string columns (cell_type, batch, sample) waste memory. Call `adata.strings_to_categoricals()` before `.write_h5ad()`.
- Use backed mode for files larger than RAM: open with `backed="r"`, filter on obs/var metadata, then `.to_memory()` only the subset you need. Never try to load a 50 GB file directly.
- Always copy views before modifying: subsetting returns a view, and modifying one triggers `ImplicitModificationWarning`. Use `adata[mask].copy()` before any modification.
- Store raw counts in layers before normalization: `adata.layers["counts"] = adata.X.copy()` before any transformation -- raw counts cannot be recovered from normalized data.
- Use gzip compression for long-term storage: `adata.write_h5ad("f.h5ad", compression="gzip")` reduces size 2-5x. Use `lzf` for speed-critical workflows.
- Align external data on index: pandas index alignment silently inserts NaN. Always use `external_series.reindex(adata.obs_names).values` when assigning external data to obs/var (see the sketch below).
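A small sketch of the alignment pattern from the last practice (toy data; the column and index names are illustrative):

```python
import anndata as ad
import numpy as np
import pandas as pd

adata = ad.AnnData(X=np.zeros((3, 2), dtype=np.float32),
                   obs=pd.DataFrame(index=["c0", "c1", "c2"]))

# External annotation arriving in a different order (e.g., another pipeline)
external = pd.Series([0.1, 0.2, 0.3], index=["c2", "c0", "c1"])

# Safe: align explicitly on obs_names, then assign the raw values
adata.obs["score"] = external.reindex(adata.obs_names).values
print(adata.obs["score"].tolist())  # [0.2, 0.3, 0.1]
```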
Common Recipes
Recipe: PyTorch DataLoader Integration
When to use: Training deep learning models on single-cell data.
```python
import anndata as ad
from anndata.experimental.pytorch import AnnLoader

adata = ad.read_h5ad("data.h5ad")

# Create a PyTorch DataLoader directly from AnnData
dataloader = AnnLoader(adata, batch_size=128, shuffle=True)

for batch in dataloader:
    X_batch = batch.X      # torch.Tensor, shape (128, n_vars)
    obs_batch = batch.obs  # DataFrame with batch metadata
    print(f"Batch shape: {X_batch.shape}")
    break  # demo: process first batch only
```
Recipe: Pandas DataFrame Conversion
When to use: Interoperating with non-scverse tools that expect DataFrames.
```python
import anndata as ad
import numpy as np
import pandas as pd

adata = ad.read_h5ad("data.h5ad")

# AnnData to DataFrame (dense, uses var_names as columns)
df = adata.to_df()
print(f"DataFrame: {df.shape}")  # (n_obs, n_vars)

# Use a specific layer instead of X
df_raw = adata.to_df(layer="raw_counts")

# DataFrame back to AnnData
new_adata = ad.AnnData(df)
print(f"Back to AnnData: {new_adata.shape}")
```
Recipe: Optimized File Saving
When to use: Minimizing file size and save time for large datasets.
```python
import anndata as ad
from scipy.sparse import issparse, csr_matrix

adata = ad.read_h5ad("data.h5ad")

if not issparse(adata.X):
    adata.X = csr_matrix(adata.X)  # ensure sparse
adata.strings_to_categoricals()    # compress string columns
for key in ["temp_results"]:
    adata.uns.pop(key, None)       # remove bulky items

adata.write_h5ad("optimized.h5ad", compression="gzip")
print(f"Saved: {adata.n_obs} x {adata.n_vars}")
```
Troubleshooting
| Problem | Cause | Solution |
|---|---|---|
| `MemoryError` when reading H5AD | File too large for RAM | Use `backed="r"` for lazy loading |
| Slow `write_h5ad` | Large dense matrix | Convert to sparse: `adata.X = csr_matrix(adata.X)`; use `compression="lzf"` |
| `ValueError` on `concat` | Mismatched var indices | Use `join="inner"` for shared genes, or harmonize var_names before concat |
| NaN values after adding obs column | Pandas index misalignment | Use `.reindex(adata.obs_names)` when assigning external data |
| `ImplicitModificationWarning` | Modifying a view in-place | Call `.copy()` on the subset before modification |
| `TypeError` on save | Unsupported dtype in uns/obsm | Convert complex objects to strings/arrays; remove non-serializable items from `uns` |
| Duplicated obs_names after concat | Same barcodes across batches | Use `.obs_names_make_unique()` after concatenation |
| `KeyError` accessing layer/obsm | Key doesn't exist | Check available keys: `list(adata.layers.keys())`, `list(adata.obsm.keys())` |
Ecosystem Integration
```python
# Scanpy: preprocessing, clustering, visualization (operates on AnnData in-place)
import anndata as ad
import scanpy as sc

adata = ad.read_h5ad("data.h5ad")
sc.pp.normalize_total(adata)
sc.tl.pca(adata)
sc.pp.neighbors(adata)  # neighbor graph required before UMAP
sc.tl.umap(adata)
sc.pl.umap(adata, color="cell_type")

# Muon: multimodal data -- mu.MuData({"rna": adata_rna, "atac": adata_atac})
# scvi-tools: scvi.model.SCVI.setup_anndata(adata, layer="counts", batch_key="batch")
```
Bundled Resources
Two reference files consolidate the original 5 reference files:
- `references/data_structure_io.md` -- Consolidates data_structure.md + io_operations.md. Covers: detailed slot-by-slot API, all I/O format parameters, backed mode advanced patterns (chunked iteration, write-back). Relocated inline: core slot table (Key Concepts), basic I/O (Core API 2), format comparison (Key Concepts). Omitted: introductory prose redundant with Core API.
- `references/manipulation_concatenation.md` -- Consolidates manipulation.md + concatenation.md + best_practices.md. Covers: advanced merge behaviors (same/unique/first/only edge cases), on-disk concat, AnnCollection API, bulk renaming, memory optimization. Relocated inline: QC filtering (Core API 6), basic concat (Core API 5), best practices (Best Practices). Omitted: generic Python advice not AnnData-specific.
Related Skills
- scanpy-scrna-seq -- downstream analysis: preprocessing, clustering, DE testing, visualization using AnnData objects
- scvi-tools-single-cell -- probabilistic latent variable models (scVI, scANVI, TOTALVI) consuming AnnData
- cellxgene-census -- querying the CZ CELLxGENE Census database, returns AnnData objects
References
- AnnData documentation -- official API reference and tutorials
- scverse ecosystem -- coordinated single-cell analysis tools
- AnnData GitHub -- source code and issue tracker
- Virshup et al. (2024) "anndata: Access and store annotated data matrices" -- JOSS