SciAgent-Skills lamindb-data-management

Open-source data framework for biology: queryable, traceable, FAIR data management. Version artifacts (AnnData, DataFrame, Zarr), track computational lineage, validate with biological ontologies (Bionty), query across datasets. Integrates with Nextflow, Snakemake, W&B, scVI-tools. For single-cell analysis use scanpy; for ontology-only lookups use bionty directly.

install
source · Clone the upstream repo
git clone https://github.com/jaechang-hits/SciAgent-Skills
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/jaechang-hits/SciAgent-Skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/systems-biology-multiomics/lamindb-data-management" ~/.claude/skills/jaechang-hits-sciagent-skills-lamindb-data-management && rm -rf "$T"
manifest: skills/systems-biology-multiomics/lamindb-data-management/SKILL.md

LaminDB — Biological Data Management

Overview

LaminDB is an open-source data framework for biology that makes data queryable, traceable, and FAIR (Findable, Accessible, Interoperable, Reusable). It combines data lakehouse architecture, lineage tracking, biological ontology validation, and a unified Python API for managing biological datasets from raw files to annotated, curated artifacts.

When to Use

  • Managing and versioning biological datasets (scRNA-seq, spatial, flow cytometry, multi-modal)
  • Tracking computational lineage (which code produced which data)
  • Validating and curating data against biological ontologies (cell types, genes, tissues, diseases)
  • Building queryable data lakehouses across multiple experiments
  • Ensuring reproducibility with automatic environment and provenance capture
  • Integrating with workflow managers (Nextflow, Snakemake) or MLOps (W&B, MLflow)
  • Standardizing metadata with ontology-based annotation (Bionty)
  • For single-cell analysis pipelines (clustering, DE), use scanpy instead
  • For ontology lookups only without data management, use bionty directly

Prerequisites

pip install lamindb
# With extras for specific data types
pip install 'lamindb[bionty,zarr,fcs]'

Setup: Initialize a LaminDB instance before first use:

lamin login
lamin init --storage ./my-data --name my-project
# Or with cloud storage:
# lamin init --storage s3://my-bucket --name my-project --db postgresql://...

Instance types: Local SQLite (development), Cloud + SQLite (small teams), Cloud + PostgreSQL (production).

Quick Start

import lamindb as ln
import pandas as pd

ln.track()  # Start lineage tracking

# Save an artifact
df = pd.DataFrame({"gene": ["TP53", "BRCA1"], "score": [0.95, 0.87]})
artifact = ln.Artifact.from_df(df, key="results/gene_scores.parquet", description="Gene importance scores")
artifact.save()
print(f"Saved: {artifact.uid}, size: {artifact.size}")

# Query artifacts
results = ln.Artifact.filter(key__startswith="results/").df()
print(f"Found {len(results)} artifacts")

ln.finish()

Core API

1. Artifacts — Data Objects

Artifacts are versioned data objects (files, DataFrames, AnnData, arrays).

import lamindb as ln
import pandas as pd
import anndata as ad

ln.track()

# From DataFrame
df = pd.DataFrame({"sample": ["A", "B"], "value": [1.5, 2.3]})
df_artifact = ln.Artifact.from_df(df, key="experiments/batch1.parquet").save()
print(f"ID: {df_artifact.uid}, Version: {df_artifact.version}")

# From AnnData
adata = ad.read_h5ad("counts.h5ad")
adata_artifact = ln.Artifact.from_anndata(adata, key="scrna/batch1.h5ad", description="scRNA-seq batch 1").save()

# From file path
figure_artifact = ln.Artifact("results/figure.png", key="figures/fig1.png").save()

# Load back
df_loaded = df_artifact.load()   # Returns the in-memory object (here a DataFrame)
path = figure_artifact.cache()   # Returns a local file path

# Versioning: revise the DataFrame artifact under the same key
df_updated = df.assign(value=df["value"] * 2)  # Example change to the data
df_artifact_v2 = ln.Artifact.from_df(df_updated, key="experiments/batch1.parquet", revises=df_artifact).save()
print(f"v1: {df_artifact.uid}, v2: {df_artifact_v2.uid}")
print(f"Latest version: {df_artifact_v2.is_latest}")

# Delete (archive first, then permanent)
figure_artifact.delete(permanent=False)  # Archive
# figure_artifact.delete(permanent=True)  # Permanent deletion

ln.finish()

2. Lineage Tracking

Automatic provenance capture for reproducibility.

import lamindb as ln

# Start tracking — captures notebook/script, environment, user
ln.track(params={"method": "PCA", "n_components": 50})

# Artifacts loaded or saved between ln.track() and ln.finish() are linked to this run
input_data = ln.Artifact.get(key="raw/counts.h5ad")
adata = input_data.load()

# ... analysis code ...

output = ln.Artifact.from_anndata(adata, key="processed/pca.h5ad").save()

# View lineage graph
output.view_lineage()

ln.finish()  # Finalize tracking

3. Querying and Filtering

Search and filter artifacts by metadata, features, and annotations.

import lamindb as ln

# Basic filtering
artifacts = ln.Artifact.filter(key__startswith="scrna/").df()
print(f"Found {len(artifacts)} scRNA-seq artifacts")

# Filter by metadata
recent = ln.Artifact.filter(
    created_at__gte="2026-01-01",
    size__gt=1000000
).df()

# Filter by annotated features
immune = ln.Artifact.filter(
    cell_types__name="T cell",
    tissues__name="PBMC"
).df()

# Single record retrieval
artifact = ln.Artifact.get(key="results/final.parquet")  # Exact match, raises if not found
artifact = ln.Artifact.filter(key="results/final.parquet").one_or_none()  # Returns None if missing

# Full-text search
results = ln.Artifact.search("gene expression PBMC")

# Streaming large files (without full load into memory)
artifact = ln.Artifact.get(key="large_dataset.h5ad")
backed = artifact.open()  # AnnData-backed mode
subset = backed[backed.obs["cell_type"] == "B cell"]

4. Annotation and Validation

Curate datasets against schemas and ontology terms.

import lamindb as ln
import bionty as bt

# Annotate artifacts with features
artifact = ln.Artifact.get(key="scrna/batch1.h5ad")
artifact.features.add_values({
    "tissue": "PBMC",
    "condition": "treated",
    "organism": "human",
    "batch": 1
})

# Standardize cell type names against the ontology before validating
adata = artifact.load()
adata.obs["cell_type"] = bt.CellType.standardize(adata.obs["cell_type"])

# Validate with a schema (schema is an ln.Schema defined elsewhere)
curator = ln.curators.AnnDataCurator(adata, schema)
try:
    curator.validate()
    artifact = curator.save_artifact(key="validated/batch1.h5ad")
    print("Validation passed")
except ln.errors.ValidationError as e:
    print(f"Validation failed: {e}")

5. Biological Ontologies (Bionty)

Access standardized biological vocabularies for annotation.

import bionty as bt

# Available ontologies
# bt.Gene (Ensembl), bt.Protein (UniProt), bt.CellType (CL),
# bt.Tissue (Uberon), bt.Disease (Mondo), bt.Pathway (GO),
# bt.CellLine (CLO), bt.Phenotype (HPO), bt.Organism (NCBItaxon)

# Import and search ontology
bt.CellType.import_source()
results = bt.CellType.search("T helper")
print(results.head())

# Get specific term
t_cell = bt.CellType.get(name="T cell")
print(f"Ontology ID: {t_cell.ontology_id}")

# Explore hierarchy
children = t_cell.children.all()
parents = t_cell.parents.all()
print(f"Children: {[c.name for c in children]}")

# Validate a list of terms
validated = bt.CellType.validate(["T cell", "B cell", "Unknown_type"])
# Returns boolean array: [True, True, False]
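
The boolean mask lines up position by position with the input terms, so it can be used to separate terms that still need curation. A minimal sketch, assuming a hard-coded mask in place of a real bt.CellType.validate call; the helper name split_by_mask is hypothetical and not part of bionty:

```python
from typing import Iterable, List, Tuple


def split_by_mask(terms: Iterable[str], mask: Iterable[bool]) -> Tuple[List[str], List[str]]:
    """Split terms into (valid, invalid) using a positional boolean mask."""
    valid: List[str] = []
    invalid: List[str] = []
    for term, ok in zip(terms, mask):
        (valid if ok else invalid).append(term)
    return valid, invalid


terms = ["T cell", "B cell", "Unknown_type"]
# Stand-in mask (assumption); in practice it would come from bt.CellType.validate(terms).
mask = [True, True, False]

valid, invalid = split_by_mask(terms, mask)
print("valid:", valid)
print("needs curation:", invalid)
```

The invalid list is the natural input for a standardization pass or a manual review.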

6. Collections and Organization

Group related artifacts for batch operations.

import lamindb as ln

# Create a collection
artifacts = ln.Artifact.filter(key__startswith="scrna/batch_").all()
collection = ln.Collection(artifacts, name="scRNA-seq batches Q1 2026").save()
print(f"Collection: {collection.name}, {collection.artifacts.count()} artifacts")

# Query collection
for artifact in collection.artifacts.all():
    print(f"  {artifact.key}: {artifact.size} bytes")

# Organize with hierarchical keys
# Convention: project/experiment/datatype/file
# e.g., "immunology/exp42/scrna/counts.h5ad"
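
The key convention above is easier to keep consistent when keys are built and parsed by small helpers instead of ad hoc string formatting. A sketch using only the standard library; make_key and parse_key are hypothetical helpers, not LaminDB functions:

```python
from pathlib import PurePosixPath
from typing import Dict

PART_NAMES = ("project", "experiment", "datatype", "file")


def make_key(project: str, experiment: str, datatype: str, filename: str) -> str:
    """Join the four parts into a forward-slash key, rejecting empty or nested parts."""
    parts = [project, experiment, datatype, filename]
    for part in parts:
        if not part or "/" in part:
            raise ValueError(f"invalid key part: {part!r}")
    return str(PurePosixPath(*parts))


def parse_key(key: str) -> Dict[str, str]:
    """Split a key back into its named parts."""
    parts = PurePosixPath(key).parts
    if len(parts) != len(PART_NAMES):
        raise ValueError(f"expected {len(PART_NAMES)} parts, got {len(parts)}: {key!r}")
    return dict(zip(PART_NAMES, parts))


key = make_key("immunology", "exp42", "scrna", "counts.h5ad")
print(key)
print(parse_key(key))
```

The resulting string can be passed as the key argument when creating an artifact, and the fixed depth keeps prefix queries such as key__startswith="immunology/exp42/" predictable.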

Key Concepts

Core Entity Model

| Entity | Purpose | Example |
|---|---|---|
| Artifact | Versioned data object | counts.h5ad, results.parquet |
| Run | Single code execution | Notebook run, script execution |
| Transform | Code definition (notebook, script, pipeline) | analysis.ipynb |
| Feature | Typed metadata field | tissue, condition, batch |
| Collection | Group of related artifacts | "Experiment batches" |
| ULabel | Universal label for custom categorization | "high_quality", "pilot" |

Data Types Supported

| Format | Method | Use Case |
|---|---|---|
| DataFrame | Artifact.from_df() | Tabular data, metadata tables |
| AnnData | Artifact.from_anndata() | Single-cell data |
| MuData | Artifact.from_mudata() | Multi-modal data |
| Any file | Artifact("path") | Images, FASTQ, custom formats |
| Zarr | Via zarr extra | Large array data |
| TileDB-SOMA | Via tiledbsoma extra | Scalable cell-level queries |

track() / finish() Pattern

Every analysis session should be wrapped:

ln.track(params={"key": "value"})   # Start: captures code, environment, user
# ... analysis ...
ln.finish()                          # End: finalizes lineage links
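
One way to avoid a forgotten finish call is to wrap the pair in a context manager. The sketch below is generic: it takes the start and finish callables as arguments and is exercised here with stand-in functions, so it runs without a LaminDB instance. The name tracked_session is hypothetical, and whether a run should be finalized after an exception is a design choice that depends on the workflow.

```python
from contextlib import contextmanager
from typing import Callable, Iterator, Optional


@contextmanager
def tracked_session(
    start: Callable[..., object],
    finish: Callable[[], object],
    params: Optional[dict] = None,
) -> Iterator[None]:
    """Call start(params=...) on entry and finish() on exit, even if the body raises."""
    start(params=params)
    try:
        yield
    finally:
        finish()


# Stand-in callables (assumption) that record the call order.
calls = []


def fake_track(params=None):
    calls.append(("track", params))


def fake_finish():
    calls.append(("finish", None))


with tracked_session(fake_track, fake_finish, params={"method": "PCA"}):
    calls.append(("analysis", None))

print(calls)
```

With LaminDB, the same wrapper would presumably be used as tracked_session(ln.track, ln.finish, params={...}).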

Common Workflows

Workflow: Multi-Experiment Data Lakehouse

import lamindb as ln
import anndata as ad

ln.track()

# Register multiple experiments
data_files = ["batch1.h5ad", "batch2.h5ad", "batch3.h5ad"]
tissues = ["PBMC", "bone_marrow", "PBMC"]
conditions = ["control", "treated", "treated"]

for i, (file, tissue, condition) in enumerate(zip(data_files, tissues, conditions), start=1):
    adata = ad.read_h5ad(file)
    artifact = ln.Artifact.from_anndata(
        adata, key=f"scrna/batch_{i}.h5ad", description=f"scRNA-seq batch {i}"
    ).save()
    artifact.features.add_values({
        "tissue": tissue, "condition": condition, "batch": i
    })
    print(f"Registered batch {i}: {artifact.uid}")

# Query across all experiments
treated_pbmc = ln.Artifact.filter(
    key__startswith="scrna/",
    features__tissue="PBMC",
    features__condition="treated"
).all()
print(f"Found {len(treated_pbmc)} matching datasets")

# Load and concatenate
adatas = [a.load() for a in treated_pbmc]
combined = ad.concat(adatas)
print(f"Combined: {combined.shape}")

ln.finish()

Workflow: Validated Data Curation

import lamindb as ln
import bionty as bt
import anndata as ad

ln.track()

# 1. Import ontologies
bt.CellType.import_source()
bt.Gene.import_source(organism="human")

# 2. Load raw data
adata = ad.read_h5ad("raw_counts.h5ad")
print(f"Raw: {adata.shape}")

# 3. Validate and standardize cell types
validated = bt.CellType.validate(adata.obs["cell_type"].unique())
if not all(validated):
    adata.obs["cell_type"] = bt.CellType.standardize(adata.obs["cell_type"])

# 4. Validate gene names
gene_validated = bt.Gene.validate(adata.var_names)
print(f"Valid genes: {sum(gene_validated)}/{len(gene_validated)}")

# 5. Curate and save (schema is an ln.Schema defined elsewhere)
curator = ln.curators.AnnDataCurator(adata, schema)
curator.validate()
artifact = curator.save_artifact(key="curated/validated_counts.h5ad")
print(f"Saved curated artifact: {artifact.uid}")

ln.finish()

Workflow: Nextflow Pipeline Integration

  1. In each Nextflow process, import lamindb and call ln.track()
  2. Load input artifacts with ln.Artifact.get(key=...); cache to local path
  3. Run analysis; save output as new artifact with ln.Artifact(...).save()
  4. Call ln.finish(); lineage then links inputs to outputs automatically

Key Parameters

| Parameter | Function | Default | Options | Effect |
|---|---|---|---|---|
| key | Artifact() | None | String path | Hierarchical storage key (e.g., "project/data.h5ad") |
| description | Artifact() | None | String | Human-readable description |
| revises | Artifact() | None | Artifact | Previous version to revise |
| params | ln.track() | None | Dict | Parameters for the current run |
| organism | bt.Gene.import_source() | None | "human", "mouse" | Organism for ontology |
| permanent | .delete() | False | True/False | Permanent vs archive deletion |
| __startswith | .filter() | | String | Key prefix filter |
| __gte, __lte | .filter() | | Value | Greater/less than or equal |
| __contains | .filter() | | String | Substring match |
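
The double-underscore suffixes in .filter() follow the Django field-lookup convention: the part before __ names the field and the part after names the comparison. The toy matcher below illustrates those semantics over plain dictionaries. It is not how LaminDB evaluates queries (a database backend compiles lookups to SQL), it does not handle relation traversal such as cell_types__name, and matches, filter_records and the sample records are illustrative only.

```python
import operator
from typing import Any, Callable, Dict, List

# Lookup suffix -> comparison function taking (field_value, argument).
LOOKUPS: Dict[str, Callable[[Any, Any], bool]] = {
    "exact": operator.eq,
    "startswith": lambda value, arg: str(value).startswith(arg),
    "contains": lambda value, arg: arg in str(value),
    "gt": operator.gt,
    "gte": operator.ge,
    "lte": operator.le,
}


def matches(record: Dict[str, Any], **conditions: Any) -> bool:
    """Return True if the record satisfies every field__lookup=value condition."""
    for expr, arg in conditions.items():
        field, _, lookup = expr.partition("__")
        compare = LOOKUPS[lookup or "exact"]  # a bare field name means exact match
        if not compare(record[field], arg):
            return False
    return True


def filter_records(records: List[Dict[str, Any]], **conditions: Any) -> List[Dict[str, Any]]:
    """Keep the records for which all conditions hold (conditions are ANDed)."""
    return [r for r in records if matches(r, **conditions)]


records = [
    {"key": "scrna/batch_1.h5ad", "size": 2_500_000},
    {"key": "scrna/batch_2.h5ad", "size": 800_000},
    {"key": "figures/fig1.png", "size": 120_000},
]

hits = filter_records(records, key__startswith="scrna/", size__gt=1_000_000)
print([r["key"] for r in hits])
```

As in the real API, passing several keyword arguments narrows the result, because every condition must hold.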

Best Practices

  1. Always wrap analysis with ln.track() / ln.finish(): This captures lineage automatically. Without it, artifacts have no provenance.

  2. Use hierarchical keys: Structure as project/experiment/datatype/file.ext (e.g., immunology/exp42/scrna/counts.h5ad). This enables prefix-based queries.

  3. Anti-pattern: duplicating data instead of versioning. Use the revises= parameter to create new versions, not new keys for the same dataset.

  4. Validate early: Run schema validation before analysis. Catching bad metadata early saves debugging time downstream.

  5. Use ontologies for standardization: Map free-text labels to ontology terms (e.g., "T helper cell" → CL:0000912). This enables cross-dataset queries.

  6. Anti-pattern: loading large files without checking size. Use .filter().df() to inspect metadata first, then .load() or .open() (backed mode) for large files.

  7. Query metadata first, load data second: Filter with .filter() to find relevant artifacts, then load only what you need.
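
Practices 6 and 7 amount to deciding, from metadata alone, how each artifact should be accessed. A small sketch of that decision, assuming metadata rows that carry key and size fields; the 2 GiB cutoff, the plan_access helper and the sample rows are hypothetical and should be tuned to available memory:

```python
from typing import Dict, Iterable, List, Tuple

SIZE_LIMIT_BYTES = 2 * 1024**3  # hypothetical cutoff (2 GiB)


def plan_access(records: Iterable[Dict], limit: int = SIZE_LIMIT_BYTES) -> List[Tuple[str, str]]:
    """Pair each artifact key with 'load' (in memory) or 'open' (backed mode)."""
    return [(r["key"], "load" if r["size"] <= limit else "open") for r in records]


# Stand-in metadata rows (assumption), shaped like key/size columns of a query result.
rows = [
    {"key": "scrna/batch_1.h5ad", "size": 350 * 1024**2},
    {"key": "atlas/full.h5ad", "size": 40 * 1024**3},
]

plan = plan_access(rows)
print(plan)
```

Only after this planning step would .load() or .open() be called on the selected artifacts.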

Common Recipes

Recipe: Bulk Dataset Registration

import lamindb as ln
from pathlib import Path

ln.track()

data_dir = Path("raw_data/")
for fcs_file in data_dir.glob("*.fcs"):
    artifact = ln.Artifact(str(fcs_file), key=f"flow_cytometry/{fcs_file.name}").save()
    artifact.features.add_values({"assay": "flow_cytometry", "source": "batch_import"})
    print(f"Registered: {fcs_file.name} -> {artifact.uid}")

ln.finish()
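
Before registering a whole directory, it can help to do a dry run that lists which files would be picked up and under which keys. The sketch below uses only the standard library and a temporary directory with empty placeholder files; plan_keys is a hypothetical helper, and no artifacts are created.

```python
import tempfile
from pathlib import Path
from typing import Dict


def plan_keys(data_dir: Path, pattern: str, prefix: str) -> Dict[str, str]:
    """Map each matching file name to the key it would be registered under."""
    return {path.name: f"{prefix}/{path.name}" for path in sorted(data_dir.glob(pattern))}


# Demonstrate on a throwaway directory; notes.txt should be ignored by the pattern.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for name in ("s2.fcs", "s1.fcs", "notes.txt"):
        (root / name).touch()
    plan = plan_keys(root, "*.fcs", "flow_cytometry")

for name, key in plan.items():
    print(f"{name} -> {key}")
```

Once the plan looks right, the same glob pattern and prefix can be used in the registration loop.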

Recipe: View and Export Lineage

import lamindb as ln

artifact = ln.Artifact.get(key="results/final_analysis.h5ad")

# View lineage graph (opens in browser or notebook)
artifact.view_lineage()

# Programmatic lineage access
run = artifact.run
print(f"Created by: {run.transform.name}")
print(f"User: {run.created_by.name}")
print(f"Date: {run.created_at}")
print(f"Input artifacts: {[a.key for a in run.input_artifacts.all()]}")

Recipe: Ontology Hierarchy Exploration

import bionty as bt

bt.CellType.import_source()
t_cell = bt.CellType.get(name="T cell")

# Explore hierarchy
print(f"Parents: {[p.name for p in t_cell.parents.all()]}")
print(f"Children: {[c.name for c in t_cell.children.all()]}")

# Walk two levels down: children, then their children
children = t_cell.children.all()
for child in children:
    grandchildren = child.children.all()
    print(f"  {child.name}: {[gc.name for gc in grandchildren]}")
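
The loop above stops at grandchildren. Reaching every descendant needs a traversal that keeps going until no new terms appear and that tolerates terms with more than one parent. A generic sketch over a toy hierarchy; iter_descendants is a hypothetical helper and the dictionary stands in for ontology records:

```python
from typing import Callable, Dict, Hashable, Iterable, List, Set


def iter_descendants(
    root: Hashable,
    get_children: Callable[[Hashable], Iterable[Hashable]],
) -> List[Hashable]:
    """Depth-first walk that returns every descendant of root exactly once."""
    seen: Set[Hashable] = set()
    order: List[Hashable] = []
    stack = list(get_children(root))
    while stack:
        node = stack.pop()
        if node in seen:
            continue  # already visited via another parent
        seen.add(node)
        order.append(node)
        stack.extend(get_children(node))
    return order


# Toy hierarchy (illustrative names); "memory T cell" is reachable from two parents.
toy_children: Dict[str, List[str]] = {
    "T cell": ["alpha-beta T cell", "gamma-delta T cell", "memory T cell"],
    "alpha-beta T cell": ["CD4-positive T cell", "memory T cell"],
}

descendants = iter_descendants("T cell", lambda name: toy_children.get(name, []))
print(sorted(descendants))
```

With bionty records, get_children could plausibly be lambda record: record.children.all(), assuming the records are hashable; each level then costs one query, so deep hierarchies may be slow.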

Troubleshooting

| Problem | Cause | Solution |
|---|---|---|
| InstanceNotSetupError | Instance not initialized | Run lamin init --storage ./data --name my-project |
| ln.track() fails | No transform context | Run inside a notebook/script, not REPL; or pass transform explicitly |
| Artifact key conflict | Key already exists (not a version) | Use revises= for versioning, or choose a different key |
| ValidationError | Data doesn't match schema | Run curator.validate() to see specific failures; standardize terms |
| Slow queries on large instances | No index on filtered field | Use .df() for overview first; add database indexes for frequently filtered fields |
| Ontology import fails | Network issue or wrong organism | Check internet connection; specify organism="human" explicitly |
| FileNotFoundError on .cache() | Cloud artifact not synced | Check storage connectivity; use artifact.load() instead for in-memory access |

Related Skills

  • anndata-annotated-data — AnnData format used as primary data container in LaminDB for single-cell data
  • scanpy-scrna-seq — single-cell analysis pipeline; LaminDB manages data that scanpy analyzes
  • scvi-tools-single-cell — deep learning models for single-cell; integrates with LaminDB for data/model tracking

References