SciAgent-Skills lamindb-data-management
Open-source data framework for biology: queryable, traceable, FAIR data management. Version artifacts (AnnData, DataFrame, Zarr), track computational lineage, validate with biological ontologies (Bionty), query across datasets. Integrates with Nextflow, Snakemake, W&B, scVI-tools. For single-cell analysis use scanpy; for ontology-only lookups use bionty directly.
git clone https://github.com/jaechang-hits/SciAgent-Skills
T=$(mktemp -d) && git clone --depth=1 https://github.com/jaechang-hits/SciAgent-Skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/systems-biology-multiomics/lamindb-data-management" ~/.claude/skills/jaechang-hits-sciagent-skills-lamindb-data-management && rm -rf "$T"
skills/systems-biology-multiomics/lamindb-data-management/SKILL.md

LaminDB — Biological Data Management
Overview
LaminDB is an open-source data framework for biology that makes data queryable, traceable, and FAIR (Findable, Accessible, Interoperable, Reusable). It combines data lakehouse architecture, lineage tracking, biological ontology validation, and a unified Python API for managing biological datasets from raw files to annotated, curated artifacts.
When to Use
- Managing and versioning biological datasets (scRNA-seq, spatial, flow cytometry, multi-modal)
- Tracking computational lineage (which code produced which data)
- Validating and curating data against biological ontologies (cell types, genes, tissues, diseases)
- Building queryable data lakehouses across multiple experiments
- Ensuring reproducibility with automatic environment and provenance capture
- Integrating with workflow managers (Nextflow, Snakemake) or MLOps (W&B, MLflow)
- Standardizing metadata with ontology-based annotation (Bionty)
- For single-cell analysis pipelines (clustering, DE), use scanpy instead
- For ontology lookups only without data management, use bionty directly
Prerequisites
```bash
pip install lamindb

# With extras for specific data types
pip install 'lamindb[bionty,zarr,fcs]'
```
Setup: Requires instance initialization before use:
```bash
lamin login
lamin init --storage ./my-data --name my-project

# Or with cloud storage:
# lamin init --storage s3://my-bucket --name my-project --db postgresql://...
```
Instance types: Local SQLite (development), Cloud + SQLite (small teams), Cloud + PostgreSQL (production).
Quick Start
```python
import lamindb as ln

ln.track()  # Start lineage tracking

# Save an artifact
import pandas as pd
df = pd.DataFrame({"gene": ["TP53", "BRCA1"], "score": [0.95, 0.87]})
artifact = ln.Artifact.from_df(df, key="results/gene_scores.parquet", description="Gene importance scores")
artifact.save()
print(f"Saved: {artifact.uid}, size: {artifact.size}")

# Query artifacts
results = ln.Artifact.filter(key__startswith="results/").df()
print(f"Found {len(results)} artifacts")

ln.finish()
```
Core API
1. Artifacts — Data Objects
Artifacts are versioned data objects (files, DataFrames, AnnData, arrays).
```python
import lamindb as ln
import pandas as pd
import anndata as ad

ln.track()

# From DataFrame
df = pd.DataFrame({"sample": ["A", "B"], "value": [1.5, 2.3]})
artifact = ln.Artifact.from_df(df, key="experiments/batch1.parquet").save()
print(f"ID: {artifact.uid}, Version: {artifact.version}")

# From AnnData
adata = ad.read_h5ad("counts.h5ad")
artifact = ln.Artifact.from_anndata(adata, key="scrna/batch1.h5ad", description="scRNA-seq batch 1").save()

# From file path
artifact = ln.Artifact("results/figure.png", key="figures/fig1.png").save()

# Load back
df_loaded = artifact.load()  # Returns DataFrame/AnnData/etc.
path = artifact.cache()      # Returns local file path
```
```python
# Versioning
artifact_v2 = ln.Artifact.from_df(df_updated, key="experiments/batch1.parquet", revises=artifact).save()
print(f"v1: {artifact.uid}, v2: {artifact_v2.uid}")
print(f"Latest version: {artifact_v2.is_latest}")

# Delete (archive first, then permanent)
artifact.delete(permanent=False)   # Archive
# artifact.delete(permanent=True)  # Permanent deletion
```
2. Lineage Tracking
Automatic provenance capture for reproducibility.
```python
import lamindb as ln

# Start tracking — captures notebook/script, environment, user
ln.track(params={"method": "PCA", "n_components": 50})

# All artifacts created within this block are linked to this run
input_data = ln.Artifact.get(key="raw/counts.h5ad")
adata = input_data.load()
# ... analysis code ...
output = ln.Artifact.from_anndata(adata, key="processed/pca.h5ad").save()

# View lineage graph
output.view_lineage()

ln.finish()  # Finalize tracking
```
3. Querying and Filtering
Search and filter artifacts by metadata, features, and annotations.
```python
import lamindb as ln

# Basic filtering
artifacts = ln.Artifact.filter(key__startswith="scrna/").df()
print(f"Found {len(artifacts)} scRNA-seq artifacts")

# Filter by metadata
recent = ln.Artifact.filter(
    created_at__gte="2026-01-01",
    size__gt=1000000
).df()

# Filter by annotated features
immune = ln.Artifact.filter(
    cell_types__name="T cell",
    tissues__name="PBMC"
).df()

# Single record retrieval
artifact = ln.Artifact.get(key="results/final.parquet")  # Exact match, raises if not found
artifact = ln.Artifact.filter(key="results/final.parquet").one_or_none()  # Returns None if missing

# Full-text search
results = ln.Artifact.search("gene expression PBMC")

# Streaming large files (without full load into memory)
artifact = ln.Artifact.get(key="large_dataset.h5ad")
backed = artifact.open()  # AnnData-backed mode
subset = backed[backed.obs["cell_type"] == "B cell"]
```
4. Annotation and Validation
Curate datasets against schemas and ontology terms.
```python
import lamindb as ln
import bionty as bt

# Annotate artifacts with features
artifact = ln.Artifact.get(key="scrna/batch1.h5ad")
artifact.features.add_values({
    "tissue": "PBMC",
    "condition": "treated",
    "organism": "human",
    "batch": 1
})

# Validate with schema
curator = ln.curators.AnnDataCurator(adata, schema)
try:
    curator.validate()
    artifact = curator.save_artifact(key="validated/batch1.h5ad")
    print("Validation passed")
except ln.errors.ValidationError as e:
    print(f"Validation failed: {e}")

# Standardize cell type names using ontology
adata.obs["cell_type"] = bt.CellType.standardize(adata.obs["cell_type"])
```
5. Biological Ontologies (Bionty)
Access standardized biological vocabularies for annotation.
```python
import bionty as bt

# Available ontologies:
# bt.Gene (Ensembl), bt.Protein (UniProt), bt.CellType (CL),
# bt.Tissue (Uberon), bt.Disease (Mondo), bt.Pathway (GO),
# bt.CellLine (CLO), bt.Phenotype (HPO), bt.Organism (NCBItaxon)

# Import and search ontology
bt.CellType.import_source()
results = bt.CellType.search("T helper")
print(results.head())

# Get specific term
t_cell = bt.CellType.get(name="T cell")
print(f"Ontology ID: {t_cell.ontology_id}")

# Explore hierarchy
children = t_cell.children.all()
parents = t_cell.parents.all()
print(f"Children: {[c.name for c in children]}")

# Validate a list of terms
validated = bt.CellType.validate(["T cell", "B cell", "Unknown_type"])
# Returns boolean array: [True, True, False]
```
6. Collections and Organization
Group related artifacts for batch operations.
```python
import lamindb as ln

# Create a collection
artifacts = ln.Artifact.filter(key__startswith="scrna/batch_").all()
collection = ln.Collection(artifacts, name="scRNA-seq batches Q1 2026").save()
print(f"Collection: {collection.name}, {collection.n_objects} artifacts")

# Query collection
for artifact in collection.artifacts.all():
    print(f"  {artifact.key}: {artifact.size} bytes")

# Organize with hierarchical keys
# Convention: project/experiment/datatype/file
# e.g., "immunology/exp42/scrna/counts.h5ad"
```
Key Concepts
Core Entity Model
| Entity | Purpose | Example |
|---|---|---|
| `Artifact` | Versioned data object | `counts.h5ad`, `gene_scores.parquet` |
| `Run` | Single code execution | Notebook run, script execution |
| `Transform` | Code definition (notebook, script, pipeline) | `analysis.ipynb` |
| `Feature` | Typed metadata field | `tissue`, `condition`, `batch` |
| `Collection` | Group of related artifacts | "Experiment batches" |
| `ULabel` | Universal label for custom categorization | "high_quality", "pilot" |
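The entities in the table link into a lineage chain: an `Artifact` points to the `Run` that produced it, which points to the `Transform` (the code) that was executed. A minimal pure-Python mock of those relationships (toy dataclasses for illustration, not LaminDB's actual classes):

```python
from dataclasses import dataclass, field

@dataclass
class Transform:
    """Code definition: a notebook, script, or pipeline."""
    name: str

@dataclass
class Run:
    """One execution of a Transform, with its parameters."""
    transform: Transform
    params: dict = field(default_factory=dict)

@dataclass
class Artifact:
    """A versioned data object, linked to the Run that created it."""
    key: str
    run: Run

tf = Transform(name="preprocess.ipynb")
run = Run(transform=tf, params={"n_components": 50})
art = Artifact(key="processed/pca.h5ad", run=run)

# Lineage traversal: artifact -> run -> transform
print(art.run.transform.name)  # preprocess.ipynb
```

In real LaminDB, this traversal is what `artifact.run` and `run.transform` expose (see the lineage recipe below).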
Data Types Supported
| Format | Method | Use Case |
|---|---|---|
| DataFrame | `ln.Artifact.from_df()` | Tabular data, metadata tables |
| AnnData | `ln.Artifact.from_anndata()` | Single-cell data |
| MuData | `ln.Artifact.from_mudata()` | Multi-modal data |
| Any file | `ln.Artifact(path)` | Images, FASTQ, custom formats |
| Zarr | Via `zarr` extra | Large array data |
| TileDB-SOMA | Via `tiledbsoma` extra | Scalable cell-level queries |
track() / finish() Pattern
Every analysis session should be wrapped:
```python
ln.track(params={"key": "value"})  # Start: captures code, environment, user
# ... analysis ...
ln.finish()  # End: finalizes lineage links
```
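Conceptually, the bracket behaves like a session object that records parameters and links everything saved inside it to one run. A toy pure-Python analogy (not how LaminDB is implemented):

```python
class Session:
    """Toy analogy for the track()/finish() bracket."""

    def __init__(self, params):
        self.params = params      # captured at track() time
        self.outputs = []         # artifacts saved in-session
        self.finished = False

    def save(self, key):
        # Every artifact saved while the session is open gets linked to it
        self.outputs.append(key)
        return key

    def finish(self):
        # finish() finalizes the lineage links for the session
        self.finished = True

run = Session(params={"method": "PCA"})
run.save("processed/pca.h5ad")
run.finish()
print(run.finished, run.outputs)  # True ['processed/pca.h5ad']
```

The point of the analogy: outputs created outside an open session have no run to link to, which is why untracked artifacts lack provenance.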
Common Workflows
Workflow: Multi-Experiment Data Lakehouse
```python
import lamindb as ln
import anndata as ad

ln.track()

# Register multiple experiments
data_files = ["batch1.h5ad", "batch2.h5ad", "batch3.h5ad"]
tissues = ["PBMC", "bone_marrow", "PBMC"]
conditions = ["control", "treated", "treated"]

for i, (file, tissue, condition) in enumerate(zip(data_files, tissues, conditions)):
    adata = ad.read_h5ad(file)
    artifact = ln.Artifact.from_anndata(
        adata,
        key=f"scrna/batch_{i}.h5ad",
        description=f"scRNA-seq batch {i}"
    ).save()
    artifact.features.add_values({
        "tissue": tissue,
        "condition": condition,
        "batch": i
    })
    print(f"Registered batch {i}: {artifact.uid}")

# Query across all experiments
treated_pbmc = ln.Artifact.filter(
    key__startswith="scrna/",
    features__tissue="PBMC",
    features__condition="treated"
).all()
print(f"Found {len(treated_pbmc)} matching datasets")

# Load and concatenate
adatas = [a.load() for a in treated_pbmc]
combined = ad.concat(adatas)
print(f"Combined: {combined.shape}")

ln.finish()
```
Workflow: Validated Data Curation
```python
import lamindb as ln
import bionty as bt
import anndata as ad

ln.track()

# 1. Import ontologies
bt.CellType.import_source()
bt.Gene.import_source(organism="human")

# 2. Load raw data
adata = ad.read_h5ad("raw_counts.h5ad")
print(f"Raw: {adata.shape}")

# 3. Validate and standardize cell types
validated = bt.CellType.validate(adata.obs["cell_type"].unique())
if not all(validated):
    adata.obs["cell_type"] = bt.CellType.standardize(adata.obs["cell_type"])

# 4. Validate gene names
gene_validated = bt.Gene.validate(adata.var_names)
print(f"Valid genes: {sum(gene_validated)}/{len(gene_validated)}")

# 5. Curate and save
curator = ln.curators.AnnDataCurator(adata, schema)
curator.validate()
artifact = curator.save_artifact(key="curated/validated_counts.h5ad")
print(f"Saved curated artifact: {artifact.uid}")

ln.finish()
```
Workflow: Nextflow Pipeline Integration
- In each Nextflow process, import lamindb and call `ln.track()`
- Load input artifacts with `ln.Artifact.get(key=...)`; cache to a local path
- Run analysis; save the output as a new artifact with `ln.Artifact(...).save()`
- Call `ln.finish()` — lineage automatically links inputs to outputs
Key Parameters
| Parameter | Function | Default | Options | Effect |
|---|---|---|---|---|
| `key` | `ln.Artifact(...)` | None | String path | Hierarchical storage key (e.g., "project/data.h5ad") |
| `description` | `ln.Artifact(...)` | None | String | Human-readable description |
| `revises` | `ln.Artifact(...)` | None | Artifact | Previous version to revise |
| `params` | `ln.track()` | None | Dict | Parameters for the current run |
| `organism` | `bt.Gene.import_source()` | None | "human", "mouse" | Organism for ontology |
| `permanent` | `artifact.delete()` | False | True/False | Permanent vs archive deletion |
| `key__startswith` | `ln.Artifact.filter()` | — | String | Key prefix filter |
| `__gte`, `__lte` | `ln.Artifact.filter()` | — | Value | Greater/less than or equal |
| `__contains` | `ln.Artifact.filter()` | — | String | Substring match |
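The double-underscore suffixes in the table follow Django-style lookup semantics: the text before `__` is the field, the text after is the comparison. A toy pure-Python matcher over mock records, illustrating the semantics only (not LaminDB's query engine):

```python
# Mock artifact records — hypothetical keys and sizes, not real data
records = [
    {"key": "scrna/batch1.h5ad", "size": 2_000_000},
    {"key": "results/final.parquet", "size": 500},
]

def matches(record, **lookups):
    """Apply Django-style lookups like key__startswith or size__gte."""
    for lookup, expected in lookups.items():
        fld, _, op = lookup.partition("__")
        value = record[fld]
        if op == "startswith" and not value.startswith(expected):
            return False
        if op == "gte" and not value >= expected:
            return False
        if op == "contains" and expected not in value:
            return False
        if op == "" and value != expected:   # bare field name = exact match
            return False
    return True

# Analogous to ln.Artifact.filter(key__startswith="scrna/", size__gte=1_000_000)
hits = [r for r in records if matches(r, key__startswith="scrna/", size__gte=1_000_000)]
print([r["key"] for r in hits])  # ['scrna/batch1.h5ad']
```

In LaminDB itself these lookups compile to SQL via the underlying Django ORM, so filtering happens in the database, not in Python.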
Best Practices
- **Always wrap analysis with `ln.track()` / `ln.finish()`**: This captures lineage automatically. Without it, artifacts have no provenance.
- **Use hierarchical keys**: Structure as `project/experiment/datatype/file.ext` (e.g., `immunology/exp42/scrna/counts.h5ad`). This enables prefix-based queries.
- **Anti-pattern — duplicating data instead of versioning**: Use the `revises=` parameter to create new versions, not new keys for the same dataset.
- **Validate early**: Run schema validation before analysis. Catching bad metadata early saves debugging time downstream.
- **Use ontologies for standardization**: Map free-text labels to ontology terms (e.g., "T helper cell" → CL:0000912). This enables cross-dataset queries.
- **Anti-pattern — loading large files without checking size**: Use `.filter().df()` to inspect metadata first, then `.load()` or `.open()` (backed mode) for large files.
- **Query metadata first, load data second**: Filter with `.filter()` to find relevant artifacts, then load only what you need.
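The hierarchical-key practice above is what makes prefix queries and grouping cheap. A plain-Python sketch with hypothetical keys, showing what the convention buys you:

```python
from collections import defaultdict

# Hypothetical keys following project/experiment/datatype/file
keys = [
    "immunology/exp42/scrna/counts.h5ad",
    "immunology/exp42/flow/panel1.fcs",
    "immunology/exp43/scrna/counts.h5ad",
]

# Prefix query — analogous to filter(key__startswith="immunology/exp42/")
exp42 = [k for k in keys if k.startswith("immunology/exp42/")]
print(len(exp42))  # 2

# Grouping by the datatype segment falls out of the same convention
by_type = defaultdict(list)
for k in keys:
    project, experiment, datatype, filename = k.split("/")
    by_type[datatype].append(k)
print(sorted(by_type))  # ['flow', 'scrna']
```

With ad hoc keys, neither operation is possible without a separate metadata lookup; with the convention, a single string prefix scopes an entire experiment.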
Common Recipes
Recipe: Bulk Dataset Registration
```python
import lamindb as ln
from pathlib import Path

ln.track()

data_dir = Path("raw_data/")
for fcs_file in data_dir.glob("*.fcs"):
    artifact = ln.Artifact(str(fcs_file), key=f"flow_cytometry/{fcs_file.name}").save()
    artifact.features.add_values({"assay": "flow_cytometry", "source": "batch_import"})
    print(f"Registered: {fcs_file.name} -> {artifact.uid}")

ln.finish()
```
Recipe: View and Export Lineage
```python
import lamindb as ln

artifact = ln.Artifact.get(key="results/final_analysis.h5ad")

# View lineage graph (opens in browser or notebook)
artifact.view_lineage()

# Programmatic lineage access
run = artifact.run
print(f"Created by: {run.transform.name}")
print(f"User: {run.created_by.name}")
print(f"Date: {run.created_at}")
print(f"Input artifacts: {[a.key for a in run.input_artifacts.all()]}")
```
Recipe: Ontology Hierarchy Exploration
```python
import bionty as bt

bt.CellType.import_source()
t_cell = bt.CellType.get(name="T cell")

# Explore hierarchy
print(f"Parents: {[p.name for p in t_cell.parents.all()]}")
print(f"Children: {[c.name for c in t_cell.children.all()]}")

# Walk two levels down the hierarchy
for child in t_cell.children.all():
    grandchildren = child.children.all()
    print(f"  {child.name}: {[gc.name for gc in grandchildren]}")
```
Troubleshooting
| Problem | Cause | Solution |
|---|---|---|
| Instance-not-found error on `import lamindb` | Instance not initialized | Run `lamin init --storage ...` first |
| `ln.track()` fails | No transform context | Run inside a notebook/script, not the REPL; or pass a transform explicitly |
| Artifact key conflict | Key already exists (not a version) | Use `revises=` for versioning, or choose a different key |
| `ValidationError` | Data doesn't match schema | Run `curator.validate()` to see specific failures; standardize terms |
| Slow queries on large instances | No index on filtered field | Use `.df()` for an overview first; add database indexes for frequently filtered fields |
| Ontology import fails | Network issue or wrong organism | Check internet connection; specify `organism=` explicitly |
| Error on `.cache()` | Cloud artifact not synced | Check storage connectivity; use `.load()` instead for in-memory access |
Related Skills
- anndata-annotated-data — AnnData format used as primary data container in LaminDB for single-cell data
- scanpy-scrna-seq — single-cell analysis pipeline; LaminDB manages data that scanpy analyzes
- scvi-tools-single-cell — deep learning models for single-cell; integrates with LaminDB for data/model tracking
References
- LaminDB documentation — official user guide and API reference
- LaminDB tutorial — step-by-step introduction
- Bionty documentation — biological ontology management
- LaminDB GitHub — source code and issues