Medical-research-skills anndata
Data structure for annotated matrices in single-cell analysis; use when reading/writing .h5ad (or zarr) and exchanging data with the scverse ecosystem.
Install
Source · Clone the upstream repo
git clone https://github.com/aipoch/medical-research-skills
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/aipoch/medical-research-skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/scientific-skills/Data Analysis/anndata" ~/.claude/skills/aipoch-medical-research-skills-anndata && rm -rf "$T"
manifest: scientific-skills/Data Analysis/anndata/SKILL.md
When to Use
Use AnnData when you need to:
- Load, inspect, or export annotated single-cell datasets stored as `.h5ad` (or `zarr`) for downstream tools.
- Keep a matrix (cells × features) tightly coupled with observation/feature metadata (e.g., cell types, batches, gene annotations).
- Work efficiently with large, sparse count matrices (e.g., scRNA-seq) and avoid loading everything into memory (backed mode).
- Combine multiple experiments/batches/modalities into a unified object while tracking provenance.
- Subset/filter/transform data while preserving alignment between the matrix and metadata.
Key Features
- Unified container: `X` (data matrix) plus aligned annotations: `obs`, `var`, `uns`, and multi-dimensional slots (`obsm`, `varm`, `obsp`, `varp`), plus `layers` and optional `raw`.
- Interoperable I/O: Native `.h5ad` and `zarr`, plus common genomics formats (e.g., 10x, loom, mtx, csv); see the I/O sketch after this list.
- Scalable workflows: Sparse matrices and backed mode (`backed="r"`) for large datasets.
- Safe subsetting: Slicing preserves alignment across matrix and annotations; supports views vs copies.
- Concatenation utilities: `ad.concat(...)` with join/merge strategies and batch labeling; experimental lazy collections.
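A minimal sketch of the native round-trip plus the readers bundled with anndata; the file names here are illustrative, not from the original:

```python
import numpy as np
import anndata as ad

# Round-trip the native HDF5-based format
adata = ad.AnnData(X=np.eye(3, dtype=np.float32))
adata.write_h5ad("demo.h5ad")
same = ad.read_h5ad("demo.h5ad")

# Other readers shipped with anndata (point them at real files):
# ad.read_zarr("demo.zarr")   # chunked Zarr store
# ad.read_mtx("matrix.mtx")   # Matrix Market sparse matrix (10x-style exports)
# ad.read_loom("data.loom")   # loom files
# ad.read_csv("counts.csv")   # dense text matrices
```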
Reference notes: the original material mentions additional guides under `references/` (e.g., `references/data_structure.md`, `references/io_operations.md`, `references/concatenation.md`, `references/manipulation.md`, `references/best_practices.md`) for deeper explanations of each topic.
Dependencies
- `anndata` (latest compatible with your environment; install via pip/uv)
- `numpy`
- `pandas`
- `scipy` (recommended for sparse matrices)
- Optional ecosystem tools (only if needed): `scanpy`, `muon`, `torch` (for deep learning) and `anndata` experimental loader utilities
Example Usage
A complete runnable example that creates an AnnData object, writes/reads `.h5ad`, subsets, concatenates batches, and demonstrates backed mode.

```python
import numpy as np
import pandas as pd
import anndata as ad
from scipy.sparse import csr_matrix

# ----------------------------
# 1) Create an AnnData object
# ----------------------------
rng = np.random.default_rng(0)
n_cells, n_genes = 100, 500
X = rng.poisson(1.0, size=(n_cells, n_genes)).astype(np.float32)
obs = pd.DataFrame(
    {
        "cell_type": (["T cell", "B cell"] * (n_cells // 2)),
        "sample": (["A", "B"] * (n_cells // 2)),
        "quality_score": rng.random(n_cells),
    },
    index=[f"cell_{i}" for i in range(n_cells)],
)
var = pd.DataFrame(
    {"gene_name": [f"Gene_{j}" for j in range(n_genes)]},
    index=[f"ENSG{j:05d}" for j in range(n_genes)],
)
adata = ad.AnnData(X=X, obs=obs, var=var)

# Use sparse storage for typical count-like matrices
adata.X = csr_matrix(adata.X)

# Convert string columns to categoricals to reduce memory and speed up ops
adata.strings_to_categoricals()

print(f"Created: {adata.n_obs} obs × {adata.n_vars} vars")

# ----------------------------
# 2) Write and read .h5ad
# ----------------------------
adata.write_h5ad("example.h5ad", compression="gzip")
adata2 = ad.read_h5ad("example.h5ad")
print(f"Reloaded: {adata2.n_obs} obs × {adata2.n_vars} vars")

# ----------------------------
# 3) Subset (keeps alignment)
# ----------------------------
t_cells = adata2[adata2.obs["cell_type"] == "T cell", :]
high_quality = adata2[adata2.obs["quality_score"] > 0.8, :]
print(f"T cells: {t_cells.n_obs}")
print(f"High quality: {high_quality.n_obs}")

# ----------------------------
# 4) Concatenate batches
# ----------------------------
adata_a = adata2[adata2.obs["sample"] == "A", :].copy()
adata_b = adata2[adata2.obs["sample"] == "B", :].copy()
combined = ad.concat(
    [adata_a, adata_b],
    axis=0,           # concatenate observations (cells)
    join="inner",     # keep shared variables
    label="batch",    # add a column in .obs
    keys=["A", "B"],  # batch labels
)
print(combined.obs["batch"].value_counts().to_dict())

# ----------------------------
# 5) Backed mode for large files
# ----------------------------
adata_backed = ad.read_h5ad("example.h5ad", backed="r")
# Slicing in backed mode is metadata-friendly; load to memory when needed:
subset_mem = adata_backed[:10, :50].to_memory()
print(f"Backed subset loaded: {subset_mem.shape}")
```
Implementation Details
Data model (core slots)
- `X`: primary data matrix (dense `numpy.ndarray` or sparse `scipy.sparse`), shape `(n_obs, n_vars)`.
- `obs`: per-observation metadata (`pandas.DataFrame`), indexed by `obs_names` (e.g., cell IDs).
- `var`: per-variable metadata (`pandas.DataFrame`), indexed by `var_names` (e.g., gene IDs).
- `layers`: named alternative matrices aligned to `X` (e.g., `"counts"`, `"log1p"`); these and the slots below are populated in the sketch after this list.
- `obsm` / `varm`: multi-dimensional embeddings aligned to obs/var (e.g., PCA, UMAP coordinates).
- `obsp` / `varp`: pairwise graphs/matrices (e.g., kNN graph in `obsp["connectivities"]`).
- `uns`: unstructured metadata (dict-like), often used for parameters and plotting configs.
- `raw` (optional): snapshot of unfiltered/untransformed data for reproducibility.
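A minimal sketch with toy shapes; the key names (`"X_pca"`, `"connectivities"`) follow scverse conventions but are not required by the API:

```python
import numpy as np
import anndata as ad
from scipy.sparse import csr_matrix, random as sparse_random

adata = ad.AnnData(X=csr_matrix(np.ones((5, 3), dtype=np.float32)))

# layers: same (n_obs, n_vars) shape as X
adata.layers["counts"] = adata.X.copy()
log1p = adata.X.copy()
log1p.data = np.log1p(log1p.data)
adata.layers["log1p"] = log1p

# obsm: per-observation arrays, e.g. a 2-D embedding (n_obs × k)
adata.obsm["X_pca"] = np.zeros((adata.n_obs, 2))

# obsp: pairwise obs × obs structures, e.g. a toy connectivity graph
adata.obsp["connectivities"] = sparse_random(
    adata.n_obs, adata.n_obs, density=0.2, format="csr"
)

# uns: free-form, dict-like metadata
adata.uns["pca"] = {"n_comps": 2}
print(adata)  # the summary lists the populated slots
```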
Views vs copies
- Slicing like `adata_subset = adata[mask, :]` typically returns a view (lightweight reference).
- Use `.copy()` when you need an independent object (e.g., before in-place modifications); both cases are sketched below.
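A small sketch of the view/copy distinction; the copy-on-write step at the end reflects how recent anndata versions behave (they emit an `ImplicitModificationWarning`):

```python
import numpy as np
import anndata as ad

adata = ad.AnnData(X=np.zeros((4, 3), dtype=np.float32))

view = adata[:2, :]         # lightweight view; no data copied yet
print(view.is_view)         # True

independent = view.copy()   # fully independent object
print(independent.is_view)  # False

# Modifying a view triggers copy-on-write: anndata warns, materializes
# the view as an actual AnnData, and leaves the parent untouched.
view.obs["flagged"] = True
print("flagged" in adata.obs.columns)  # False — parent unchanged
```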
Backed mode (large datasets)
- `ad.read_h5ad(path, backed="r")` keeps the matrix on disk and loads data lazily.
- Convert a slice to memory with `.to_memory()` when you need in-memory computation.
- Backed mode is best for filtering by metadata, chunked processing (sketched below), and avoiding OOM.
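A sketch of chunked processing in backed mode, reusing the `example.h5ad` written in the Example Usage section:

```python
import numpy as np
import anndata as ad

adata = ad.read_h5ad("example.h5ad", backed="r")  # matrix stays on disk

# Accumulate per-gene totals without ever loading all of X into RAM
chunk_size = 32
totals = np.zeros(adata.n_vars)
for start in range(0, adata.n_obs, chunk_size):
    chunk = adata[start : start + chunk_size].to_memory()  # only this slice is read
    totals += np.asarray(chunk.X.sum(axis=0)).ravel()
print(totals[:5])

adata.file.close()  # release the underlying HDF5 handle
```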
Concatenation behavior
- `ad.concat([...], axis=0)` stacks observations; `axis=1` stacks variables.
- `join="inner"` keeps the intersection of variables; `join="outer"` unions variables (may introduce missing values).
- `label` + `keys` records dataset/batch provenance in `.obs[label]`.
- Merge strategies control how conflicting `.uns` and annotation columns are handled (choose based on your data governance needs); a small example follows.
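A small sketch contrasting the join modes (the toy gene names are illustrative, not from the original):

```python
import numpy as np
import pandas as pd
import anndata as ad

a = ad.AnnData(np.ones((2, 3), dtype=np.float32), var=pd.DataFrame(index=["g1", "g2", "g3"]))
b = ad.AnnData(np.ones((2, 2), dtype=np.float32), var=pd.DataFrame(index=["g2", "g3"]))

inner = ad.concat([a, b], join="inner", label="batch", keys=["a", "b"], index_unique="-")
outer = ad.concat([a, b], join="outer", label="batch", keys=["a", "b"], index_unique="-")

print(inner.var_names.tolist())     # ['g2', 'g3'] — intersection only
print(outer.var_names.tolist())     # ['g1', 'g2', 'g3'] — union; b's missing g1 is filled
print(inner.obs["batch"].tolist())  # ['a', 'a', 'b', 'b'] — provenance column

# merge= decides what happens to annotations present in several inputs:
# "same" keeps values only if identical everywhere; "unique", "first", "only" are alternatives.
merged = ad.concat([a, b], join="inner", merge="same")
```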
Practical performance parameters
- Prefer sparse matrices (`csr_matrix`) for count-like data.
- Convert repeated strings to categoricals (`adata.strings_to_categoricals()`).
- Use compression when writing `.h5ad` (e.g., `compression="gzip"`) to reduce storage; consider `zarr` for chunked/cloud-friendly access. An example combining these follows.
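Putting those defaults together in one sketch; the file names are illustrative:

```python
import numpy as np
import pandas as pd
import anndata as ad
from scipy.sparse import csr_matrix

rng = np.random.default_rng(0)
n_cells = 1_000

adata = ad.AnnData(
    X=csr_matrix(rng.poisson(0.5, size=(n_cells, 200)).astype(np.float32)),  # sparse counts
    obs=pd.DataFrame(
        {"condition": rng.choice(["ctrl", "treated"], n_cells)},
        index=[f"cell_{i}" for i in range(n_cells)],
    ),
)
adata.strings_to_categoricals()  # object dtype -> category: smaller and faster

adata.write_h5ad("perf_demo.h5ad", compression="gzip")  # smaller file, slower write
adata.write_zarr("perf_demo.zarr")                      # chunked store, cloud-friendly
roundtrip = ad.read_zarr("perf_demo.zarr")
```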