SciAgent-Skills anndata-data-structure

Annotated data matrices for single-cell genomics. AnnData stores expression data (X) with observation metadata (obs), variable metadata (var), layers, embeddings (obsm/varm), graphs (obsp/varp), and unstructured data (uns). Use for .h5ad/.zarr file handling, dataset concatenation, and scverse ecosystem integration. For analysis workflows use scanpy; for probabilistic models use scvi-tools.

install
source · Clone the upstream repo
git clone https://github.com/jaechang-hits/SciAgent-Skills
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/jaechang-hits/SciAgent-Skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/genomics-bioinformatics/anndata-data-structure" ~/.claude/skills/jaechang-hits-sciagent-skills-anndata-data-structure && rm -rf "$T"
manifest: skills/genomics-bioinformatics/anndata-data-structure/SKILL.md
source content

AnnData — Annotated Data Matrices for Single-Cell Genomics

Overview

AnnData provides the standard data structure for single-cell genomics in the scverse ecosystem. It stores an observations-by-variables matrix (X) alongside cell metadata (obs), gene metadata (var), layers, embeddings (obsm/varm), graphs (obsp/varp), and unstructured metadata (uns). It supports sparse matrices, H5AD/Zarr storage, backed mode for large files, and integration with Scanpy, scvi-tools, and Muon.

When to Use

  • Constructing annotated matrices from raw count data with cell/gene metadata
  • Reading/writing .h5ad or .zarr files for single-cell experiments
  • Subsetting cells by quality metrics, gene sets, or metadata conditions
  • Concatenating multiple experimental batches with consistent metadata
  • Storing multiple data layers (raw counts, normalized, scaled) in one object
  • Working with large datasets exceeding RAM (backed mode, lazy concatenation)
  • Preparing data for Scanpy or scvi-tools pipelines
  • For single-cell analysis (clustering, DE, visualization), use scanpy instead
  • For probabilistic models, use scvi-tools instead

Prerequisites

  • Python packages: anndata, scipy, pandas, numpy
  • Optional: scanpy (analysis), zarr (cloud storage), h5py (HDF5 backend)
  • Data requirements: count matrices (dense or sparse), cell/gene metadata tables
# Minimal install
pip install "anndata>=0.10"
# Full ecosystem
pip install anndata scanpy zarr

Quick Start

import anndata as ad
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

counts = csr_matrix(np.random.poisson(0.5, (500, 2000)).astype(np.float32))
obs = pd.DataFrame({"cell_type": np.random.choice(["T", "B", "NK"], 500)},
                    index=[f"cell_{i}" for i in range(500)])
var = pd.DataFrame(index=[f"ENSG{i:05d}" for i in range(2000)])
adata = ad.AnnData(X=counts, obs=obs, var=var)
adata.layers["raw_counts"] = counts.copy()
adata.write_h5ad("example.h5ad", compression="gzip")
print(f"Created: {adata.n_obs} cells x {adata.n_vars} genes")
# Created: 500 cells x 2000 genes

Core API

1. Object Creation

Build AnnData objects from arrays, DataFrames, and sparse matrices.

import anndata as ad
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

# Minimal: just a matrix
adata_min = ad.AnnData(X=np.random.rand(100, 50).astype(np.float32))
print(f"Minimal: {adata_min.shape}")  # (100, 50)

# Full: sparse matrix + obs/var metadata
n_obs, n_vars = 300, 1000
X = csr_matrix(np.random.poisson(1, (n_obs, n_vars)).astype(np.float32))
obs = pd.DataFrame({"cell_type": np.random.choice(["T", "B", "Mono"], n_obs),
                     "batch": np.repeat(["ctrl", "stim"], n_obs // 2)},
                    index=[f"cell_{i}" for i in range(n_obs)])
var = pd.DataFrame({"gene_symbol": [f"Gene_{i}" for i in range(n_vars)],
                     "mt": [i < 13 for i in range(n_vars)]},
                    index=[f"ENSG{i:05d}" for i in range(n_vars)])
adata = ad.AnnData(X=X, obs=obs, var=var)
print(f"Full: {adata.shape}, obs cols: {list(adata.obs.columns)}")
# Full: (300, 1000), obs cols: ['cell_type', 'batch']

# From a pandas DataFrame (rows=obs, columns=vars)
df = pd.DataFrame(np.random.rand(50, 20),
                  index=[f"sample_{i}" for i in range(50)],
                  columns=[f"feature_{i}" for i in range(20)])
adata_df = ad.AnnData(df)
print(f"From DataFrame: {adata_df.shape}")  # (50, 20)

2. I/O Operations

Read and write in multiple formats including backed mode for large files.

import anndata as ad

# H5AD (native format, recommended for most use cases)
adata = ad.read_h5ad("data.h5ad")
adata.write_h5ad("output.h5ad", compression="gzip")  # gzip: smaller files

# 10X Genomics formats (these readers live in scanpy, not anndata)
# import scanpy as sc
# adata_10x = sc.read_10x_h5("filtered_feature_bc_matrix.h5")
# adata_mtx = sc.read_10x_mtx("filtered_feature_bc_matrix/")

# Zarr format (cloud-friendly, parallel I/O)
adata.write_zarr("output.zarr")
adata_zarr = ad.read_zarr("output.zarr")

# Other formats
# adata = ad.read_csv("expression.csv")
# adata = ad.read_loom("data.loom")

print(f"Loaded: {adata.n_obs} obs x {adata.n_vars} vars")
import anndata as ad

# Backed mode: lazy loading for files larger than RAM
adata_backed = ad.read_h5ad("large_data.h5ad", backed="r")  # read-only
print(f"Backed: {adata_backed.n_obs} obs, isbacked={adata_backed.isbacked}")

# Filter on metadata (no data loaded), then load subset into memory
subset = adata_backed[adata_backed.obs["tissue"] == "brain"].to_memory()
print(f"Loaded subset: {subset.n_obs} cells")

# Read-write backed mode: adata_rw = ad.read_h5ad("data.h5ad", backed="r+")
# Format conversion: ad.read_loom("data.loom").write_h5ad("out.h5ad", compression="gzip")

3. Subsetting and Views

Select cells and genes by indices, names, boolean masks, or metadata conditions.

import anndata as ad

adata = ad.read_h5ad("data.h5ad")

# Boolean mask (most common)
t_cells = adata[adata.obs["cell_type"] == "T_cell"]
print(f"T cells: {t_cells.n_obs}, is_view: {t_cells.is_view}")  # is_view: True

# Integer index / name-based / combined axis
first_100 = adata[:100, :500]
selected = adata[["cell_0", "cell_1"], ["ENSG00000", "ENSG00001"]]

# Combined metadata conditions
high_quality = adata[
    (adata.obs["n_genes"] > 200) & (adata.obs["pct_mito"] < 0.2)
]
print(f"QC filter: {high_quality.n_obs} / {adata.n_obs} cells")

# Views vs copies: subsetting returns a view (lightweight, shares data)
# .copy() creates an independent object (REQUIRED before modification)
independent = adata[adata.obs["batch"] == "ctrl"].copy()
print(f"Is view: {independent.is_view}")  # False

4. Layers, Embeddings, and Graphs

Store multiple data representations, dimensionality reductions, and cell-cell graphs.

import anndata as ad
import numpy as np
from scipy.sparse import csr_matrix

adata = ad.read_h5ad("data.h5ad")

# Layers: alternative representations of X (same shape as X)
adata.layers["raw_counts"] = adata.X.copy()
adata.layers["normalized"] = adata.X.copy()
print(f"Layers: {list(adata.layers.keys())}")
# Layers: ['raw_counts', 'normalized']

# Embeddings in obsm (n_obs x n_components)
adata.obsm["X_pca"] = np.random.randn(adata.n_obs, 50).astype(np.float32)
adata.obsm["X_umap"] = np.random.randn(adata.n_obs, 2).astype(np.float32)
print(f"obsm keys: {list(adata.obsm.keys())}")

# Variable loadings in varm (n_vars x n_components)
adata.varm["PCs"] = np.random.randn(adata.n_vars, 50).astype(np.float32)

# Pairwise graphs in obsp (n_obs x n_obs, sparse)
adata.obsp["connectivities"] = csr_matrix(
    np.random.rand(adata.n_obs, adata.n_obs) > 0.99)
adata.obsp["distances"] = adata.obsp["connectivities"].copy()

# Unstructured metadata in uns (arbitrary dict)
adata.uns["experiment"] = {"date": "2024-06-01", "protocol": "10x_v3"}
adata.uns["neighbors"] = {"params": {"n_neighbors": 15, "method": "umap"}}
adata.uns["cell_type_colors"] = ["#1f77b4", "#ff7f0e", "#2ca02c"]
print(f"uns keys: {list(adata.uns.keys())}")

5. Concatenation

Merge datasets along observations or variables with flexible join and merge strategies.

import anndata as ad
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

# Create sample datasets
def make_adata(n, genes, batch_name):
    X = csr_matrix(np.random.poisson(1, (n, len(genes))).astype(np.float32))
    obs = pd.DataFrame({"sample": batch_name}, index=[f"{batch_name}_{i}" for i in range(n)])
    return ad.AnnData(X=X, obs=obs, var=pd.DataFrame(index=genes))

shared = [f"Gene_{i}" for i in range(100)]
adata1 = make_adata(200, shared + ["GeneA"], "batch1")
adata2 = make_adata(300, shared + ["GeneB"], "batch2")

# Along observations (axis=0): stack cells
combined = ad.concat(
    [adata1, adata2], axis=0, join="inner",
    label="batch", keys=["B1", "B2"], merge="same",
)
print(f"Inner join: {combined.n_obs} cells, {combined.n_vars} genes")
# Inner join: 500 cells, 100 genes

# Outer join: keeps all genes, fills missing with NaN/0
combined_outer = ad.concat([adata1, adata2], join="outer")
print(f"Outer join: {combined_outer.n_vars} genes")  # 102 genes

# Along variables (axis=1): multi-modal
n = 100
obs = pd.DataFrame(index=[f"cell_{i}" for i in range(n)])
rna = ad.AnnData(X=csr_matrix(np.random.poisson(1, (n, 500)).astype(np.float32)),
                 obs=obs, var=pd.DataFrame(index=[f"RNA_{i}" for i in range(500)]))
protein = ad.AnnData(X=csr_matrix(np.random.rand(n, 50).astype(np.float32)),
                     obs=obs, var=pd.DataFrame(index=[f"ADT_{i}" for i in range(50)]))
multimodal = ad.concat([rna, protein], axis=1)
print(f"Multimodal: {multimodal.shape}")  # (100, 550)
# Lazy concatenation for very large datasets (no data copying)
from anndata.experimental import AnnCollection

collection = AnnCollection(
    {"batch1": adata1, "batch2": adata2},
    join_obs="inner",
)
print(f"Lazy collection: {collection.n_obs} total obs")
# On-disk concat (writes directly to disk without loading all into memory)
# ad.experimental.concat_on_disk({"b1": "batch1.h5ad", "b2": "batch2.h5ad"}, "combined.h5ad")

6. Data Manipulation

Type conversions, metadata management, renaming, and quality control filtering.

import anndata as ad
import numpy as np
from scipy.sparse import csr_matrix, issparse

adata = ad.read_h5ad("data.h5ad")

# Type conversions
adata.strings_to_categoricals()  # string cols -> categorical (saves memory)
if not issparse(adata.X):
    adata.X = csr_matrix(adata.X)  # dense -> sparse
dense_X = adata.X.toarray() if issparse(adata.X) else adata.X  # sparse -> dense

# Adding/removing metadata columns
adata.obs["log_counts"] = np.log1p(np.array(adata.X.sum(axis=1)).flatten())
adata.var["mean_expr"] = np.array(adata.X.mean(axis=0)).flatten()
del adata.obs["unwanted_column"]  # remove

# Renaming observations/variables/categories
adata.obs_names_make_unique()  # add suffixes to duplicate names
adata.var_names_make_unique()
adata.obs["cell_type"] = adata.obs["cell_type"].cat.rename_categories(
    {"T": "T_cell", "B": "B_cell"})

# Quality control filtering (always .copy() after subsetting)
adata.obs["n_genes"] = np.array((adata.X > 0).sum(axis=1)).flatten()
mito_mask = adata.var_names.str.startswith("MT-")
adata.obs["pct_mito"] = (np.array(adata[:, mito_mask].X.sum(axis=1)).flatten()
                          / np.array(adata.X.sum(axis=1)).flatten())
adata_qc = adata[(adata.obs["n_genes"] > 200) & (adata.obs["pct_mito"] < 0.2)].copy()
print(f"After QC: {adata_qc.n_obs} / {adata.n_obs} cells")

Key Concepts

AnnData Object Architecture

The AnnData object is an annotated matrix with the following slots:

| Slot | Type | Shape | Description | Common keys |
| --- | --- | --- | --- | --- |
| X | matrix (sparse/dense) | (n_obs, n_vars) | Primary data (expression counts) | -- |
| obs | DataFrame | (n_obs, _) | Cell/observation metadata | cell_type, sample, n_genes, batch |
| var | DataFrame | (n_vars, _) | Gene/variable metadata | gene_name, highly_variable, mt |
| layers | dict of matrices | same as X | Alternative representations | raw_counts, normalized, scaled |
| obsm | dict of arrays | (n_obs, _) | Embeddings per observation | X_pca, X_umap, X_tsne |
| varm | dict of arrays | (n_vars, _) | Loadings per variable | PCs |
| obsp | dict of sparse | (n_obs, n_obs) | Pairwise observation graphs | connectivities, distances |
| varp | dict of sparse | (n_vars, n_vars) | Pairwise variable relationships | -- |
| uns | dict | unstructured | Analysis parameters and metadata | neighbors, colors, experiment |
| raw | AnnData | original shape | Snapshot before gene filtering | -- |
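
The raw slot is the only one above not demonstrated in Core API; a minimal sketch of the usual pattern (snapshot before gene filtering; sizes are illustrative):

import anndata as ad
import numpy as np

adata = ad.AnnData(X=np.random.rand(100, 2000).astype(np.float32))

# Snapshot the full gene space before filtering
adata.raw = adata
adata = adata[:, :500].copy()  # keep only the first 500 genes

print(adata.n_vars)          # 500 -- filtered
print(adata.raw.n_vars)      # 2000 -- snapshot retains all genes
full = adata.raw.to_adata()  # recover an AnnData over the original genes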

Views vs Copies

Subsetting returns a view (a lightweight reference sharing data with its parent). Always .copy() before modification to avoid ImplicitModificationWarning.

view = adata[adata.obs["cell_type"] == "T_cell"]
print(f"is_view: {view.is_view}")        # True -- shares memory
independent = view.copy()
print(f"is_view: {independent.is_view}")  # False -- independent

Storage Formats

| Format | Extension | Best For | Backed Mode | Notes |
| --- | --- | --- | --- | --- |
| H5AD | .h5ad | Default storage, random access | Yes ("r", "r+") | Based on HDF5; supports compression |
| Zarr | .zarr | Cloud storage, parallel I/O | No | Directory-based; good for S3/GCS |
| 10X H5 | .h5 | 10X Genomics Cell Ranger output | No | Read-only via scanpy's read_10x_h5 |
| Loom | .loom | Legacy format (HDF5-based) | No | Deprecated in favor of H5AD |
| CSV | .csv | Interoperability, small datasets | No | No sparse/metadata support |
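
A small sketch contrasting the write options above; gzip and lzf are the two HDF5 compressors exposed here, and file names and chunk sizes are placeholders:

import anndata as ad
import numpy as np

# Dense X here so the Zarr chunks argument has a dense array to chunk
adata = ad.AnnData(X=np.random.poisson(0.5, (1000, 2000)).astype(np.float32))

adata.write_h5ad("archive.h5ad", compression="gzip")  # smallest files, slower writes
adata.write_h5ad("working.h5ad", compression="lzf")   # faster writes, larger files
adata.write_zarr("cloud.zarr", chunks=(256, 2000))    # chunked directory store for S3/GCS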

Common Workflows

Workflow 1: Single-cell RNA-seq Data Preparation

Goal: Load raw data, QC filter, normalize, and save for downstream Scanpy/scvi-tools analysis.

import anndata as ad
import numpy as np
from scipy.sparse import issparse

# 1. Load and QC filter (see Core API 6 for metric computation details)
adata = ad.read_h5ad("raw_counts.h5ad")
adata.obs["n_genes"] = np.array((adata.X > 0).sum(axis=1)).flatten()
adata.obs["total_counts"] = np.array(adata.X.sum(axis=1)).flatten()
mito = adata.var_names.str.startswith("MT-")
adata.obs["pct_mito"] = (np.array(adata[:, mito].X.sum(axis=1)).flatten()
                          / np.array(adata.X.sum(axis=1)).flatten())
adata = adata[(adata.obs["n_genes"].between(200, 5000)) &
              (adata.obs["pct_mito"] < 0.2)].copy()
adata = adata[:, np.array((adata.X > 0).sum(axis=0)).flatten() >= 3].copy()

# 2. Store raw counts, then normalize (total-count + log1p)
adata.layers["counts"] = adata.X.copy()
totals = np.array(adata.X.sum(axis=1)).flatten()
if issparse(adata.X):
    # multiply() broadcasts per-cell scale factors and keeps the matrix sparse
    adata.X = adata.X.multiply(1e4 / totals[:, None]).log1p().tocsr()
else:
    adata.X = np.log1p(adata.X / totals[:, None] * 1e4)

# 3. Save
adata.strings_to_categoricals()
adata.write_h5ad("processed.h5ad", compression="gzip")
print(f"Saved: {adata.n_obs} cells x {adata.n_vars} genes, layers: {list(adata.layers.keys())}")

Workflow 2: Multi-batch Integration

Goal: Load multiple batches, harmonize genes, concatenate with labels, and save.

import anndata as ad
from pathlib import Path

# 1. Load all batches
batches = {}
for h5 in sorted(Path("batches/").glob("*.h5ad")):
    batches[h5.stem] = ad.read_h5ad(str(h5))
    print(f"  {h5.stem}: {batches[h5.stem].n_obs} cells")

# 2. Harmonize genes and concatenate
shared = sorted(set.intersection(*[set(a.var_names) for a in batches.values()]))
batches = {k: v[:, shared].copy() for k, v in batches.items()}
combined = ad.concat(batches, label="batch", join="inner", merge="same")

# 3. Clean up and save
combined.obs_names_make_unique()
combined.strings_to_categoricals()
combined.write_h5ad("combined_batches.h5ad", compression="gzip")
print(f"Combined: {combined.n_obs} cells x {combined.n_vars} genes, "
      f"{combined.obs['batch'].nunique()} batches")

Workflow 3: Large Dataset Processing (Backed Mode)

Goal: Process datasets too large for memory using lazy loading.

  1. Open the file in backed mode: adata = ad.read_h5ad("huge.h5ad", backed="r")
  2. Inspect metadata without loading data: check adata.obs, adata.var
  3. Filter on metadata conditions: mask = adata.obs["tissue"] == "brain"
  4. Load the filtered subset into memory: subset = adata[mask].to_memory()
  5. Process the in-memory subset normally (normalize, filter genes)
  6. For chunked processing, iterate adata[i:i+chunk_size].to_memory() (uses Core API modules 2 and 3); see the sketch below
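
A minimal sketch of steps 1-4 plus the chunked pass from step 6, assuming a large .h5ad with a "tissue" column in obs (file name and column are placeholders):

import anndata as ad
import numpy as np

# Steps 1-4: open lazily, filter on metadata only, load just the subset
adata = ad.read_h5ad("huge.h5ad", backed="r")
mask = adata.obs["tissue"] == "brain"
subset = adata[mask].to_memory()
print(f"Loaded {subset.n_obs} of {adata.n_obs} cells")

# Step 6: chunked pass over the whole file, accumulating per-gene totals
chunk_size = 10_000
gene_sums = np.zeros(adata.n_vars)
for start in range(0, adata.n_obs, chunk_size):
    chunk = adata[start:start + chunk_size].to_memory()
    gene_sums += np.asarray(chunk.X.sum(axis=0)).flatten()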

Key Parameters

| Parameter | Function | Default | Options | Effect |
| --- | --- | --- | --- | --- |
| backed | read_h5ad | None | None, "r", "r+" | Lazy loading; "r" read-only, "r+" read-write |
| compression | write_h5ad | None | None, "gzip", "lzf" | File compression; gzip = smaller, lzf = faster |
| axis | concat | 0 | 0, 1 | 0 = stack observations, 1 = stack variables |
| join | concat | "inner" | "inner", "outer" | inner = shared features, outer = union with fill |
| merge | concat | None | "same", "unique", "first", "only" | Strategy for non-concatenated annotations |
| label | concat | None | any string | Column name added to obs tracking source |
| keys | concat | None | list of strings | Labels for each dataset in the label column |
| chunks | write_zarr | None | tuple of ints | Chunk dimensions for Zarr arrays |
| as_sparse | read_h5ad | () | sequence of dataset names, e.g. ["X"] | Convert dense on-disk arrays to sparse on read |
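
Core API 5 exercises only merge="same"; a small sketch on synthetic data contrasting it with the default merge=None, which drops element-wise annotations on an axis=0 concat:

import anndata as ad
import numpy as np
import pandas as pd

genes = [f"Gene_{i}" for i in range(5)]
var = pd.DataFrame({"gene_symbol": [g.lower() for g in genes]}, index=genes)
a = ad.AnnData(X=np.ones((3, 5), dtype=np.float32),
               obs=pd.DataFrame(index=[f"a{i}" for i in range(3)]), var=var.copy())
b = ad.AnnData(X=np.ones((4, 5), dtype=np.float32),
               obs=pd.DataFrame(index=[f"b{i}" for i in range(4)]), var=var.copy())

print(list(ad.concat([a, b]).var.columns))                # [] -- default drops var columns
print(list(ad.concat([a, b], merge="same").var.columns))  # ['gene_symbol'] -- values agree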

Best Practices

  1. Use sparse matrices for count data: single-cell count matrices are typically 90%+ zeros. Use scipy.sparse.csr_matrix to reduce memory by ~10x.

     from scipy.sparse import csr_matrix
     adata.X = csr_matrix(adata.X)

  2. Convert strings to categoricals before saving: repeated string columns (cell_type, batch, sample) waste memory. Call adata.strings_to_categoricals() before .write_h5ad().

  3. Use backed mode for files larger than RAM: open with backed="r", filter on obs/var metadata, then .to_memory() only the subset you need. Never try to load a 50 GB file directly.

  4. Always copy views before modifying: subsetting returns a view, and modifying it triggers ImplicitModificationWarning. Use adata[mask].copy() before any modification.

  5. Store raw counts in layers before normalization: set adata.layers["counts"] = adata.X.copy() before any transformation -- raw counts cannot be recovered from normalized data.

  6. Use gzip compression for long-term storage: adata.write_h5ad("f.h5ad", compression="gzip") reduces size 2-5x. Use lzf for speed-critical workflows.

  7. Align external data on index: pandas index alignment silently inserts NaN. Always use external_series.reindex(adata.obs_names).values when assigning external data to obs/var; see the sketch below.
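
A minimal sketch of practice 7; the trap row shows pandas aligning a default integer index against string obs_names (all names here are hypothetical):

import anndata as ad
import numpy as np
import pandas as pd

adata = ad.AnnData(X=np.random.rand(4, 10).astype(np.float32),
                   obs=pd.DataFrame(index=["c0", "c1", "c2", "c3"]))

# Trap: a Series with a default integer index aligns on index, not position
bad = pd.Series(["normal", "tumor", "tumor", "normal"])
adata.obs["state"] = bad
print(adata.obs["state"].isna().all())  # True -- silently all NaN

# Right: key the Series by cell names, then reindex to obs_names
good = pd.Series({"c2": "tumor", "c0": "normal", "c1": "tumor", "c3": "normal"})
adata.obs["state"] = good.reindex(adata.obs_names).values
print(adata.obs["state"].tolist())  # ['normal', 'tumor', 'tumor', 'normal']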

Common Recipes

Recipe: PyTorch DataLoader Integration

When to use: Training deep learning models on single-cell data.

import anndata as ad
from anndata.experimental.pytorch import AnnLoader

adata = ad.read_h5ad("data.h5ad")

# Create PyTorch DataLoader directly from AnnData
dataloader = AnnLoader(adata, batch_size=128, shuffle=True)

for batch in dataloader:
    X_batch = batch.X      # torch.Tensor, shape (batch_size, n_vars)
    obs_batch = batch.obs  # obs metadata view (columns converted to tensors by default)
    print(f"Batch shape: {X_batch.shape}")
    break  # demo: process first batch only

Recipe: Pandas DataFrame Conversion

When to use: Interoperating with non-scverse tools that expect DataFrames.

import anndata as ad
import pandas as pd
import numpy as np

adata = ad.read_h5ad("data.h5ad")

# AnnData to DataFrame (dense, uses var_names as columns)
df = adata.to_df()
print(f"DataFrame: {df.shape}")  # (n_obs, n_vars)

# Include a specific layer instead of X
df_raw = adata.to_df(layer="raw_counts")

# DataFrame back to AnnData
new_adata = ad.AnnData(df)
print(f"Back to AnnData: {new_adata.shape}")

Recipe: Optimized File Saving

When to use: Minimizing file size and save time for large datasets.

import anndata as ad
from scipy.sparse import issparse, csr_matrix

adata = ad.read_h5ad("data.h5ad")
if not issparse(adata.X):
    adata.X = csr_matrix(adata.X)  # ensure sparse
adata.strings_to_categoricals()     # compress string columns
for key in ["temp_results"]:
    adata.uns.pop(key, None)        # remove bulky items
adata.write_h5ad("optimized.h5ad", compression="gzip")
print(f"Saved: {adata.n_obs} x {adata.n_vars}")

Troubleshooting

| Problem | Cause | Solution |
| --- | --- | --- |
| MemoryError when reading H5AD | File too large for RAM | Use ad.read_h5ad(path, backed="r") for lazy loading |
| Slow .write_h5ad() | Large dense matrix | Convert to sparse: adata.X = csr_matrix(adata.X); use compression="gzip" |
| ValueError on ad.concat() | Mismatched var indices | Use join="inner" for shared genes, or harmonize var_names before concat |
| NaN values after adding obs column | Pandas index misalignment | Use .reindex(adata.obs_names).values when assigning external data |
| ImplicitModificationWarning | Modifying a view in place | Call .copy() on the subset before modification |
| IORegistryError on save | Unsupported dtype in uns/obsm | Convert complex objects to strings/arrays; remove non-serializable items from uns |
| Duplicated obs_names after concat | Same barcodes across batches | Use adata.obs_names_make_unique() after concatenation |
| KeyError accessing layer/obsm | Key doesn't exist | Check available keys: list(adata.layers.keys()), list(adata.obsm.keys()) |
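
For the IORegistryError row, a small sketch that scrubs uns of values h5py cannot serialize before saving; the type allowlist is an assumption, not an official anndata API:

import anndata as ad
import numpy as np

adata = ad.read_h5ad("data.h5ad")

# Keep only plainly serializable uns values; delete anything else
allowed = (str, int, float, bool, list, tuple, dict, np.ndarray, np.generic)
for key in list(adata.uns):
    if not isinstance(adata.uns[key], allowed):
        print(f"Dropping non-serializable uns[{key!r}]")
        del adata.uns[key]

adata.write_h5ad("clean.h5ad", compression="gzip")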

Ecosystem Integration

# Scanpy: preprocessing, clustering, visualization (operates on AnnData in-place)
import anndata as ad
import scanpy as sc
adata = ad.read_h5ad("data.h5ad")
sc.pp.normalize_total(adata); sc.tl.pca(adata)
sc.pp.neighbors(adata); sc.tl.umap(adata); sc.pl.umap(adata, color="cell_type")

# Muon: multimodal data -- mu.MuData({"rna": adata_rna, "atac": adata_atac})
# scvi-tools: scvi.model.SCVI.setup_anndata(adata, layer="counts", batch_key="batch")

Bundled Resources

Two reference files consolidate the original 5 reference files:

  1. references/data_structure_io.md -- Consolidates data_structure.md + io_operations.md. Covers: detailed slot-by-slot API, all I/O format parameters, backed mode advanced patterns (chunked iteration, write-back). Relocated inline: core slot table (Key Concepts), basic I/O (Core API 2), format comparison (Key Concepts). Omitted: introductory prose redundant with Core API.

  2. references/manipulation_concatenation.md -- Consolidates manipulation.md + concatenation.md + best_practices.md. Covers: advanced merge behaviors (same/unique/first/only edge cases), on-disk concat, AnnCollection API, bulk renaming, memory optimization. Relocated inline: QC filtering (Core API 6), basic concat (Core API 5), best practices (Best Practices). Omitted: generic Python advice not AnnData-specific.

Related Skills

  • scanpy-scrna-seq -- downstream analysis: preprocessing, clustering, DE testing, visualization using AnnData objects
  • scvi-tools-single-cell -- probabilistic latent variable models (scVI, scANVI, TOTALVI) consuming AnnData
  • cellxgene-census -- querying the CZ CELLxGENE Census database, returns AnnData objects
