SciAgent-Skills lamindb-data-management
Open-source data framework for biology: queryable, traceable, FAIR data management. Version artifacts (AnnData, DataFrame, Zarr), track computational lineage, validate with biological ontologies (Bionty), query across datasets. Integrates with Nextflow, Snakemake, W&B, scVI-tools. For single-cell analysis use scanpy; for ontology-only lookups use bionty directly.
git clone https://github.com/jaechang-hits/SciAgent-Skills
T=$(mktemp -d) && git clone --depth=1 https://github.com/jaechang-hits/SciAgent-Skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/systems-biology-multiomics/lamindb-data-management" ~/.claude/skills/jaechang-hits-sciagent-skills-lamindb-data-management && rm -rf "$T"
skills/systems-biology-multiomics/lamindb-data-management/SKILL.md

LaminDB — Biological Data Management
Overview
LaminDB is an open-source data framework for biology that makes data queryable, traceable, and FAIR (Findable, Accessible, Interoperable, Reusable). It combines data lakehouse architecture, lineage tracking, biological ontology validation, and a unified Python API for managing biological datasets from raw files to annotated, curated artifacts.
When to Use
- Managing and versioning biological datasets (scRNA-seq, spatial, flow cytometry, multi-modal)
- Tracking computational lineage (which code produced which data)
- Validating and curating data against biological ontologies (cell types, genes, tissues, diseases)
- Building queryable data lakehouses across multiple experiments
- Ensuring reproducibility with automatic environment and provenance capture
- Integrating with workflow managers (Nextflow, Snakemake) or MLOps (W&B, MLflow)
- Standardizing metadata with ontology-based annotation (Bionty)
- For single-cell analysis pipelines (clustering, DE), use scanpy instead
- For ontology lookups only without data management, use bionty directly
Prerequisites
```bash
pip install lamindb

# With extras for specific data types
pip install 'lamindb[bionty,zarr,fcs]'
```
Setup: Requires instance initialization before use:
```bash
lamin login
lamin init --storage ./my-data --name my-project

# Or with cloud storage:
# lamin init --storage s3://my-bucket --name my-project --db postgresql://...
```
Instance types: Local SQLite (development), Cloud + SQLite (small teams), Cloud + PostgreSQL (production).
Quick Start
```python
import lamindb as ln

ln.track()  # Start lineage tracking

# Save an artifact
import pandas as pd
df = pd.DataFrame({"gene": ["TP53", "BRCA1"], "score": [0.95, 0.87]})
artifact = ln.Artifact.from_df(df, key="results/gene_scores.parquet", description="Gene importance scores")
artifact.save()
print(f"Saved: {artifact.uid}, size: {artifact.size}")

# Query artifacts
results = ln.Artifact.filter(key__startswith="results/").df()
print(f"Found {len(results)} artifacts")

ln.finish()
```
Core API
1. Artifacts — Data Objects
Artifacts are versioned data objects (files, DataFrames, AnnData, arrays).
```python
import lamindb as ln
import pandas as pd
import anndata as ad

ln.track()

# From DataFrame
df = pd.DataFrame({"sample": ["A", "B"], "value": [1.5, 2.3]})
artifact = ln.Artifact.from_df(df, key="experiments/batch1.parquet").save()
print(f"ID: {artifact.uid}, Version: {artifact.version}")

# From AnnData
adata = ad.read_h5ad("counts.h5ad")
artifact = ln.Artifact.from_anndata(adata, key="scrna/batch1.h5ad", description="scRNA-seq batch 1").save()

# From file path
artifact = ln.Artifact("results/figure.png", key="figures/fig1.png").save()

# Load back
df_loaded = artifact.load()  # Returns DataFrame/AnnData/etc.
path = artifact.cache()      # Returns local file path
```
```python
# Versioning
artifact_v2 = ln.Artifact.from_df(df_updated, key="experiments/batch1.parquet", revises=artifact).save()
print(f"v1: {artifact.uid}, v2: {artifact_v2.uid}")
print(f"Latest version: {artifact_v2.is_latest}")

# Delete (archive first, then permanent)
artifact.delete(permanent=False)   # Archive
# artifact.delete(permanent=True)  # Permanent deletion
```
2. Lineage Tracking
Automatic provenance capture for reproducibility.
```python
import lamindb as ln

# Start tracking — captures notebook/script, environment, user
ln.track(params={"method": "PCA", "n_components": 50})

# All artifacts created within this block are linked to this run
input_data = ln.Artifact.get(key="raw/counts.h5ad")
adata = input_data.load()
# ... analysis code ...
output = ln.Artifact.from_anndata(adata, key="processed/pca.h5ad").save()

# View lineage graph
output.view_lineage()

ln.finish()  # Finalize tracking
```
3. Querying and Filtering
Search and filter artifacts by metadata, features, and annotations.
```python
import lamindb as ln

# Basic filtering
artifacts = ln.Artifact.filter(key__startswith="scrna/").df()
print(f"Found {len(artifacts)} scRNA-seq artifacts")

# Filter by metadata
recent = ln.Artifact.filter(
    created_at__gte="2026-01-01",
    size__gt=1000000
).df()

# Filter by annotated features
immune = ln.Artifact.filter(
    cell_types__name="T cell",
    tissues__name="PBMC"
).df()

# Single record retrieval
artifact = ln.Artifact.get(key="results/final.parquet")  # Exact match, raises if not found
artifact = ln.Artifact.filter(key="results/final.parquet").one_or_none()  # Returns None if missing

# Full-text search
results = ln.Artifact.search("gene expression PBMC")

# Streaming large files (without full load into memory)
artifact = ln.Artifact.get(key="large_dataset.h5ad")
backed = artifact.open()  # AnnData-backed mode
subset = backed[backed.obs["cell_type"] == "B cell"]
```
4. Annotation and Validation
Curate datasets against schemas and ontology terms.
```python
import lamindb as ln
import bionty as bt

# Annotate artifacts with features
artifact = ln.Artifact.get(key="scrna/batch1.h5ad")
artifact.features.add_values({
    "tissue": "PBMC",
    "condition": "treated",
    "organism": "human",
    "batch": 1
})

# Validate with schema
curator = ln.curators.AnnDataCurator(adata, schema)
try:
    curator.validate()
    artifact = curator.save_artifact(key="validated/batch1.h5ad")
    print("Validation passed")
except ln.errors.ValidationError as e:
    print(f"Validation failed: {e}")

# Standardize cell type names using ontology
adata.obs["cell_type"] = bt.CellType.standardize(adata.obs["cell_type"])
```
5. Biological Ontologies (Bionty)
Access standardized biological vocabularies for annotation.
```python
import bionty as bt

# Available ontologies:
# bt.Gene (Ensembl), bt.Protein (UniProt), bt.CellType (CL),
# bt.Tissue (Uberon), bt.Disease (Mondo), bt.Pathway (GO),
# bt.CellLine (CLO), bt.Phenotype (HPO), bt.Organism (NCBItaxon)

# Import and search ontology
bt.CellType.import_source()
results = bt.CellType.search("T helper")
print(results.head())

# Get specific term
t_cell = bt.CellType.get(name="T cell")
print(f"Ontology ID: {t_cell.ontology_id}")

# Explore hierarchy
children = t_cell.children.all()
parents = t_cell.parents.all()
print(f"Children: {[c.name for c in children]}")

# Validate a list of terms
validated = bt.CellType.validate(["T cell", "B cell", "Unknown_type"])
# Returns boolean array: [True, True, False]
```
6. Collections and Organization
Group related artifacts for batch operations.
```python
import lamindb as ln

# Create a collection
artifacts = ln.Artifact.filter(key__startswith="scrna/batch_").all()
collection = ln.Collection(artifacts, name="scRNA-seq batches Q1 2026").save()
print(f"Collection: {collection.name}, {collection.n_objects} artifacts")

# Query collection
for artifact in collection.artifacts.all():
    print(f"  {artifact.key}: {artifact.size} bytes")

# Organize with hierarchical keys
# Convention: project/experiment/datatype/file
# e.g., "immunology/exp42/scrna/counts.h5ad"
```
Key Concepts
Core Entity Model
| Entity | Purpose | Example |
|---|---|---|
| `Artifact` | Versioned data object | `counts.h5ad`, `gene_scores.parquet` |
| `Run` | Single code execution | Notebook run, script execution |
| `Transform` | Code definition (notebook, script, pipeline) | `analysis.ipynb` |
| `Feature` | Typed metadata field | `tissue`, `condition`, `batch` |
| `Collection` | Group of related artifacts | "Experiment batches" |
| `ULabel` | Universal label for custom categorization | "high_quality", "pilot" |
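The entities in the table link into a lineage chain: an `Artifact` points to the `Run` that produced it, which points to the `Transform` (the code) that was executed. A minimal pure-Python mock of those relationships (toy dataclasses for illustration, not LaminDB's actual classes):

```python
from dataclasses import dataclass, field

@dataclass
class Transform:
    """Code definition: a notebook, script, or pipeline."""
    name: str

@dataclass
class Run:
    """One execution of a Transform, with its parameters."""
    transform: Transform
    params: dict = field(default_factory=dict)

@dataclass
class Artifact:
    """A versioned data object, linked to the Run that created it."""
    key: str
    run: Run

tf = Transform(name="preprocess.ipynb")
run = Run(transform=tf, params={"n_components": 50})
art = Artifact(key="processed/pca.h5ad", run=run)

# Lineage traversal: artifact -> run -> transform
print(art.run.transform.name)  # preprocess.ipynb
```

In real LaminDB, this traversal is what `artifact.run` and `run.transform` expose (see the lineage recipe below).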
Data Types Supported
| Format | Method | Use Case |
|---|---|---|
| DataFrame | `ln.Artifact.from_df()` | Tabular data, metadata tables |
| AnnData | `ln.Artifact.from_anndata()` | Single-cell data |
| MuData | `ln.Artifact.from_mudata()` | Multi-modal data |
| Any file | `ln.Artifact(path)` | Images, FASTQ, custom formats |
| Zarr | Via `zarr` extra | Large array data |
| TileDB-SOMA | Via `tiledbsoma` extra | Scalable cell-level queries |
track() / finish() Pattern
Every analysis session should be wrapped:
```python
ln.track(params={"key": "value"})  # Start: captures code, environment, user
# ... analysis ...
ln.finish()  # End: finalizes lineage links
```
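Conceptually, the bracket behaves like a session object that records parameters and links everything saved inside it to one run. A toy pure-Python analogy (not how LaminDB is implemented):

```python
class Session:
    """Toy analogy for the track()/finish() bracket."""

    def __init__(self, params):
        self.params = params      # captured at track() time
        self.outputs = []         # artifacts saved in-session
        self.finished = False

    def save(self, key):
        # Every artifact saved while the session is open gets linked to it
        self.outputs.append(key)
        return key

    def finish(self):
        # finish() finalizes the lineage links for the session
        self.finished = True

run = Session(params={"method": "PCA"})
run.save("processed/pca.h5ad")
run.finish()
print(run.finished, run.outputs)  # True ['processed/pca.h5ad']
```

The point of the analogy: outputs created outside an open session have no run to link to, which is why untracked artifacts lack provenance.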
Common Workflows
Workflow: Multi-Experiment Data Lakehouse
```python
import lamindb as ln
import anndata as ad

ln.track()

# Register multiple experiments
data_files = ["batch1.h5ad", "batch2.h5ad", "batch3.h5ad"]
tissues = ["PBMC", "bone_marrow", "PBMC"]
conditions = ["control", "treated", "treated"]

for i, (file, tissue, condition) in enumerate(zip(data_files, tissues, conditions)):
    adata = ad.read_h5ad(file)
    artifact = ln.Artifact.from_anndata(
        adata,
        key=f"scrna/batch_{i}.h5ad",
        description=f"scRNA-seq batch {i}"
    ).save()
    artifact.features.add_values({
        "tissue": tissue,
        "condition": condition,
        "batch": i
    })
    print(f"Registered batch {i}: {artifact.uid}")

# Query across all experiments
treated_pbmc = ln.Artifact.filter(
    key__startswith="scrna/",
    features__tissue="PBMC",
    features__condition="treated"
).all()
print(f"Found {len(treated_pbmc)} matching datasets")

# Load and concatenate
adatas = [a.load() for a in treated_pbmc]
combined = ad.concat(adatas)
print(f"Combined: {combined.shape}")

ln.finish()
```
Workflow: Validated Data Curation
```python
import lamindb as ln
import bionty as bt
import anndata as ad

ln.track()

# 1. Import ontologies
bt.CellType.import_source()
bt.Gene.import_source(organism="human")

# 2. Load raw data
adata = ad.read_h5ad("raw_counts.h5ad")
print(f"Raw: {adata.shape}")

# 3. Validate and standardize cell types
validated = bt.CellType.validate(adata.obs["cell_type"].unique())
if not all(validated):
    adata.obs["cell_type"] = bt.CellType.standardize(adata.obs["cell_type"])

# 4. Validate gene names
gene_validated = bt.Gene.validate(adata.var_names)
print(f"Valid genes: {sum(gene_validated)}/{len(gene_validated)}")

# 5. Curate and save
curator = ln.curators.AnnDataCurator(adata, schema)
curator.validate()
artifact = curator.save_artifact(key="curated/validated_counts.h5ad")
print(f"Saved curated artifact: {artifact.uid}")

ln.finish()
```
Workflow: Nextflow Pipeline Integration
- In each Nextflow process, import lamindb and call `ln.track()`
- Load input artifacts with `ln.Artifact.get(key=...)`; cache to a local path
- Run analysis; save the output as a new artifact with `ln.Artifact(...).save()`
- Call `ln.finish()` — lineage automatically links inputs to outputs
Key Parameters
| Parameter | Function | Default | Options | Effect |
|---|---|---|---|---|
| `key` | `ln.Artifact(...)` | None | String path | Hierarchical storage key (e.g., "project/data.h5ad") |
| `description` | `ln.Artifact(...)` | None | String | Human-readable description |
| `revises` | `ln.Artifact(...)` | None | Artifact | Previous version to revise |
| `params` | `ln.track()` | None | Dict | Parameters for the current run |
| `organism` | `bt.Gene.import_source()` | None | "human", "mouse" | Organism for ontology |
| `permanent` | `artifact.delete()` | False | True/False | Permanent vs archive deletion |
| `key__startswith` | `ln.Artifact.filter()` | — | String | Key prefix filter |
| `__gte`, `__lte` | `ln.Artifact.filter()` | — | Value | Greater/less than or equal |
| `__contains` | `ln.Artifact.filter()` | — | String | Substring match |
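The double-underscore suffixes in the table follow Django-style lookup semantics: the text before `__` is the field, the text after is the comparison. A toy pure-Python matcher over mock records, illustrating the semantics only (not LaminDB's query engine):

```python
# Mock artifact records — hypothetical keys and sizes, not real data
records = [
    {"key": "scrna/batch1.h5ad", "size": 2_000_000},
    {"key": "results/final.parquet", "size": 500},
]

def matches(record, **lookups):
    """Apply Django-style lookups like key__startswith or size__gte."""
    for lookup, expected in lookups.items():
        fld, _, op = lookup.partition("__")
        value = record[fld]
        if op == "startswith" and not value.startswith(expected):
            return False
        if op == "gte" and not value >= expected:
            return False
        if op == "contains" and expected not in value:
            return False
        if op == "" and value != expected:   # bare field name = exact match
            return False
    return True

# Analogous to ln.Artifact.filter(key__startswith="scrna/", size__gte=1_000_000)
hits = [r for r in records if matches(r, key__startswith="scrna/", size__gte=1_000_000)]
print([r["key"] for r in hits])  # ['scrna/batch1.h5ad']
```

In LaminDB itself these lookups compile to SQL via the underlying Django ORM, so filtering happens in the database, not in Python.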
Best Practices
- **Always wrap analysis with `ln.track()` / `ln.finish()`**: This captures lineage automatically. Without it, artifacts have no provenance.
- **Use hierarchical keys**: Structure as `project/experiment/datatype/file.ext` (e.g., `immunology/exp42/scrna/counts.h5ad`). This enables prefix-based queries.
- **Anti-pattern — duplicating data instead of versioning**: Use the `revises=` parameter to create new versions, not new keys for the same dataset.
- **Validate early**: Run schema validation before analysis. Catching bad metadata early saves debugging time downstream.
- **Use ontologies for standardization**: Map free-text labels to ontology terms (e.g., "T helper cell" → CL:0000912). This enables cross-dataset queries.
- **Anti-pattern — loading large files without checking size**: Use `.filter().df()` to inspect metadata first, then `.load()` or `.open()` (backed mode) for large files.
- **Query metadata first, load data second**: Filter with `.filter()` to find relevant artifacts, then load only what you need.
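The hierarchical-key practice above is what makes prefix queries and grouping cheap. A plain-Python sketch with hypothetical keys, showing what the convention buys you:

```python
from collections import defaultdict

# Hypothetical keys following project/experiment/datatype/file
keys = [
    "immunology/exp42/scrna/counts.h5ad",
    "immunology/exp42/flow/panel1.fcs",
    "immunology/exp43/scrna/counts.h5ad",
]

# Prefix query — analogous to filter(key__startswith="immunology/exp42/")
exp42 = [k for k in keys if k.startswith("immunology/exp42/")]
print(len(exp42))  # 2

# Grouping by the datatype segment falls out of the same convention
by_type = defaultdict(list)
for k in keys:
    project, experiment, datatype, filename = k.split("/")
    by_type[datatype].append(k)
print(sorted(by_type))  # ['flow', 'scrna']
```

With ad hoc keys, neither operation is possible without a separate metadata lookup; with the convention, a single string prefix scopes an entire experiment.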
Common Recipes
Recipe: Bulk Dataset Registration
```python
import lamindb as ln
from pathlib import Path

ln.track()

data_dir = Path("raw_data/")
for fcs_file in data_dir.glob("*.fcs"):
    artifact = ln.Artifact(str(fcs_file), key=f"flow_cytometry/{fcs_file.name}").save()
    artifact.features.add_values({"assay": "flow_cytometry", "source": "batch_import"})
    print(f"Registered: {fcs_file.name} -> {artifact.uid}")

ln.finish()
```
Recipe: View and Export Lineage
```python
import lamindb as ln

artifact = ln.Artifact.get(key="results/final_analysis.h5ad")

# View lineage graph (opens in browser or notebook)
artifact.view_lineage()

# Programmatic lineage access
run = artifact.run
print(f"Created by: {run.transform.name}")
print(f"User: {run.created_by.name}")
print(f"Date: {run.created_at}")
print(f"Input artifacts: {[a.key for a in run.input_artifacts.all()]}")
```
Recipe: Ontology Hierarchy Exploration
```python
import bionty as bt

bt.CellType.import_source()
t_cell = bt.CellType.get(name="T cell")

# Explore hierarchy
print(f"Parents: {[p.name for p in t_cell.parents.all()]}")
print(f"Children: {[c.name for c in t_cell.children.all()]}")

# Walk two levels down the hierarchy
for child in t_cell.children.all():
    grandchildren = child.children.all()
    print(f"  {child.name}: {[gc.name for gc in grandchildren]}")
```
Troubleshooting
| Problem | Cause | Solution |
|---|---|---|
| Instance-not-found error on `import lamindb` | Instance not initialized | Run `lamin init --storage ...` first |
| `ln.track()` fails | No transform context | Run inside a notebook/script, not the REPL; or pass a transform explicitly |
| Artifact key conflict | Key already exists (not a version) | Use `revises=` for versioning, or choose a different key |
| `ValidationError` | Data doesn't match schema | Run `curator.validate()` to see specific failures; standardize terms |
| Slow queries on large instances | No index on filtered field | Use `.df()` for an overview first; add database indexes for frequently filtered fields |
| Ontology import fails | Network issue or wrong organism | Check internet connection; specify `organism=` explicitly |
| Error on `.cache()` | Cloud artifact not synced | Check storage connectivity; use `.load()` instead for in-memory access |
Related Skills
- anndata-annotated-data — AnnData format used as primary data container in LaminDB for single-cell data
- scanpy-scrna-seq — single-cell analysis pipeline; LaminDB manages data that scanpy analyzes
- scvi-tools-single-cell — deep learning models for single-cell; integrates with LaminDB for data/model tracking
References
- LaminDB documentation — official user guide and API reference
- LaminDB tutorial — step-by-step introduction
- Bionty documentation — biological ontology management
- LaminDB GitHub — source code and issues