SciAgent-Skills umap-learn

install
source · Clone the upstream repo
git clone https://github.com/jaechang-hits/SciAgent-Skills
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/jaechang-hits/SciAgent-Skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/scientific-computing/umap-learn" ~/.claude/skills/jaechang-hits-sciagent-skills-umap-learn && rm -rf "$T"
manifest: skills/scientific-computing/umap-learn/SKILL.md
source content

UMAP-Learn

Overview

UMAP (Uniform Manifold Approximation and Projection) is a dimensionality reduction algorithm for visualization and general non-linear dimensionality reduction. It is faster than t-SNE, scales to larger datasets, preserves both local and global structure, and supports supervised learning and embedding of new data points.

When to Use

  • Reducing high-dimensional data to 2D/3D for visualization
  • Preprocessing for density-based clustering (HDBSCAN, DBSCAN)
  • Feature engineering in ML pipelines (transform new data into learned embedding)
  • Supervised/semi-supervised embedding with partial labels
  • Tracking embeddings across time points or batches (AlignedUMAP)
  • Density-preserving embeddings (DensMAP)
  • Neural network-based embedding with custom architectures (Parametric UMAP)
  • For linear dimensionality reduction use PCA (scikit-learn)
  • For neighborhood-graph construction without embedding use scikit-learn NearestNeighbors

Prerequisites

pip install umap-learn

# For Parametric UMAP (neural network variant; requires TensorFlow 2.x)
pip install "umap-learn[parametric_umap]"

Critical: Always standardize features before applying UMAP to ensure equal weighting across dimensions.

Quick Start

import umap
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_digits

# Load and scale data
X, y = load_digits(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

# Fit and transform
embedding = umap.UMAP(random_state=42).fit_transform(X_scaled)
print(f"Input: {X_scaled.shape}, Output: {embedding.shape}")
# Input: (1797, 64), Output: (1797, 2)

Core API

1. Standard UMAP

Basic dimensionality reduction following scikit-learn conventions.

import umap
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(data)

# Method 1: fit_transform (single step)
embedding = umap.UMAP(
    n_neighbors=15,     # local neighborhood size (2-200)
    min_dist=0.1,       # min distance between embedded points (0.0-0.99)
    n_components=2,     # output dimensions
    metric='euclidean', # distance metric
    random_state=42,    # reproducibility
).fit_transform(X_scaled)
print(f"Embedding shape: {embedding.shape}")

# Method 2: fit + access (for reuse)
reducer = umap.UMAP(random_state=42)
reducer.fit(X_scaled)
embedding = reducer.embedding_  # trained embedding
graph = reducer.graph_          # fuzzy simplicial set (sparse matrix)
# Visualization
import matplotlib.pyplot as plt

plt.figure(figsize=(8, 6))
plt.scatter(embedding[:, 0], embedding[:, 1], c=labels, cmap='Spectral', s=5)
plt.colorbar()
plt.title('UMAP Embedding')
plt.tight_layout()
plt.savefig('umap_embedding.png', dpi=150)

2. Supervised & Semi-Supervised UMAP

Incorporate label information to guide the embedding via the `y`

import umap

# Supervised — all labels known
embedding = umap.UMAP(random_state=42).fit_transform(X_scaled, y=labels)

# Semi-supervised — partial labels (mark unlabeled as -1)
semi_labels = labels.copy()
semi_labels[unlabeled_indices] = -1
embedding = umap.UMAP(random_state=42).fit_transform(X_scaled, y=semi_labels)

# Control label influence with target_weight (0.0=unsupervised, 1.0=fully supervised)
reducer = umap.UMAP(
    target_weight=0.7,               # emphasize labels
    target_metric='categorical',     # for classification; use distance metric for regression
    random_state=42
)
embedding = reducer.fit_transform(X_scaled, y=labels)
print(f"Supervised embedding: {embedding.shape}")

3. Transform New Data

Project unseen data into the trained embedding space.

import umap
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Fit on training data
reducer = umap.UMAP(n_components=10, random_state=42)
X_train_emb = reducer.fit_transform(X_train_scaled)

# Transform test data
X_test_emb = reducer.transform(X_test_scaled)
print(f"Train: {X_train_emb.shape}, Test: {X_test_emb.shape}")

# Works in sklearn Pipelines
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('umap', umap.UMAP(n_components=10, random_state=42)),
    ('classifier', SVC())
])
pipeline.fit(X_train, y_train)
accuracy = pipeline.score(X_test, y_test)
print(f"Pipeline accuracy: {accuracy:.3f}")

4. Parametric UMAP

Neural network-based embedding via TensorFlow/Keras. Enables efficient transform, reconstruction, and custom architectures.

from umap.parametric_umap import ParametricUMAP

# Default architecture (3-layer, 100-neuron FC network)
embedder = ParametricUMAP(n_components=2, random_state=42)
embedding = embedder.fit_transform(X_scaled)
new_emb = embedder.transform(new_data)  # fast neural network inference
print(f"Parametric embedding: {embedding.shape}")
import numpy as np
import tensorflow as tf
from umap.parametric_umap import ParametricUMAP

# Custom encoder/decoder for autoencoder mode
input_dim = X_scaled.shape[1]
encoder = tf.keras.Sequential([
    tf.keras.layers.InputLayer(input_shape=(input_dim,)),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(2),
])
decoder = tf.keras.Sequential([
    tf.keras.layers.InputLayer(input_shape=(2,)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(input_dim),
])

embedder = ParametricUMAP(
    encoder=encoder, decoder=decoder, dims=(input_dim,),
    parametric_reconstruction=True, autoencoder_loss=True,
    n_training_epochs=10, batch_size=128,
    n_neighbors=15, min_dist=0.1, random_state=42
)
embedding = embedder.fit_transform(X_scaled)
reconstructed = embedder.inverse_transform(embedding)
print(f"Reconstruction error: {np.mean((X_scaled - reconstructed)**2):.4f}")

5. DensMAP

Variant preserving local density information in the embedding.

import umap
import numpy as np

reducer = umap.UMAP(
    densmap=True,          # enable DensMAP
    dens_lambda=2.0,       # density preservation weight
    dens_frac=0.3,         # fraction for density estimation
    output_dens=True,      # output density estimates
    n_neighbors=15,
    min_dist=0.1,
    random_state=42
)
embedding = reducer.fit_transform(X_scaled)

# Access density estimates
original_density = reducer.rad_orig_  # density in original space
embedded_density = reducer.rad_emb_   # density in embedded space
print(f"DensMAP embedding: {embedding.shape}")
print(f"Density correlation: {np.corrcoef(original_density, embedded_density)[0,1]:.3f}")

6. AlignedUMAP

Align embeddings across multiple related datasets (time points, batches).

from umap import AlignedUMAP

# Multiple related datasets
datasets = [day1_data, day2_data, day3_data]

mapper = AlignedUMAP(
    n_neighbors=15,
    alignment_regularisation=1e-2,  # alignment strength
    alignment_window_size=2,        # align with N adjacent datasets
    n_components=2,
    random_state=42
)
mapper.fit(datasets)

aligned_embeddings = mapper.embeddings_  # list of aligned embedding arrays
print(f"Aligned {len(aligned_embeddings)} datasets")
for i, emb in enumerate(aligned_embeddings):
    print(f"  Dataset {i}: {emb.shape}")

Key Concepts

Parameter Tuning Guide

| Parameter | Low | Medium (default) | High | Effect |
|---|---|---|---|---|
| `n_neighbors` | 2-5 | 15 | 50-200 | Local detail vs global structure |
| `min_dist` | 0.0 | 0.1 | 0.5-0.99 | Tight clusters vs spread out |
| `n_components` | 2 | 2 | 5-50 | Visualization vs ML/clustering |
| `spread` | 0.5 | 1.0 | 2.0 | Embedding scale (with min_dist) |

Configuration by Use-Case

| Use-Case | n_neighbors | min_dist | n_components | metric |
|---|---|---|---|---|
| Visualization | 15 | 0.1 | 2 | euclidean |
| Clustering (HDBSCAN) | 30 | 0.0 | 5-10 | euclidean |
| Text/document embedding | 15 | 0.1 | 2 | cosine |
| Global structure | 100 | 0.5 | 2 | euclidean |
| ML feature engineering | 15-30 | 0.1 | 10-50 | euclidean |
| Binary/set data | 15 | 0.1 | 2 | hamming/jaccard |

Supported Metrics

Minkowski family: `euclidean`, `manhattan`, `chebyshev`, `minkowski`. Spatial: `canberra`, `braycurtis`, `haversine`. Correlation: `cosine`, `correlation`. Binary: `hamming`, `jaccard`, `dice`, `russellrao`, `rogerstanimoto`, `sokalmichener`, `sokalsneath`, `yule`. Special: `precomputed` (distance matrix), custom Numba-compiled callables.

Standard UMAP vs Parametric UMAP

| Feature | Standard | Parametric |
|---|---|---|
| Backend | Direct optimization | TensorFlow neural network |
| Transform speed | Moderate | Fast (neural net inference) |
| Inverse transform | Approximate, expensive | Decoder network, fast |
| Custom architecture | No | Yes (CNNs, RNNs, etc.) |
| Requirements | umap-learn | umap-learn + TensorFlow 2.x |
| Best for | Quick exploration | Production pipelines, reconstruction |

Common Workflows

Workflow 1: UMAP + HDBSCAN Clustering Pipeline

import umap
import hdbscan
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import adjusted_rand_score

# Step 1: Preprocess
X_scaled = StandardScaler().fit_transform(data)
print(f"Input shape: {X_scaled.shape}")

# Step 2: UMAP for clustering (NOT visualization parameters)
reducer = umap.UMAP(
    n_neighbors=30,     # more global structure for clustering
    min_dist=0.0,       # allow tight packing
    n_components=10,    # higher dims preserve density better than 2D
    metric='euclidean',
    random_state=42
)
embedding = reducer.fit_transform(X_scaled)

# Step 3: HDBSCAN clustering
clusterer = hdbscan.HDBSCAN(min_cluster_size=15, min_samples=5)
cluster_labels = clusterer.fit_predict(embedding)

n_clusters = len(set(cluster_labels)) - (1 if -1 in cluster_labels else 0)
noise = sum(cluster_labels == -1)
print(f"Clusters: {n_clusters}, Noise: {noise}")

# Step 4: Separate 2D embedding for visualization
vis_emb = umap.UMAP(n_neighbors=15, min_dist=0.1, random_state=42).fit_transform(X_scaled)
plt.scatter(vis_emb[:, 0], vis_emb[:, 1], c=cluster_labels, cmap='Spectral', s=5)
plt.colorbar()
plt.title(f'HDBSCAN Clusters (n={n_clusters})')
plt.tight_layout()
plt.savefig('umap_clusters.png', dpi=150)

Workflow 2: Supervised Embedding for Classification

import umap
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import classification_report

# Split and scale
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

# Supervised UMAP for feature engineering
reducer = umap.UMAP(n_components=10, random_state=42)
X_train_emb = reducer.fit_transform(X_train_s, y=y_train)
X_test_emb = reducer.transform(X_test_s)

# Downstream classifier
clf = SVC(kernel='rbf')
clf.fit(X_train_emb, y_train)
y_pred = clf.predict(X_test_emb)

print(classification_report(y_test, y_pred))

Workflow 3: Exploring Embedding Space with Inverse Transform

Text-only — combines Core API modules 1 and 3 (inverse_transform on standard UMAP):

  1. Fit standard UMAP on data (Core API: Standard UMAP)
  2. Create a grid of points spanning the embedding space
  3. Apply `reducer.inverse_transform(grid_points)` to reconstruct high-dimensional data
  4. Visualize reconstructed samples to understand embedding regions

Note: inverse transform is approximate; works poorly outside the convex hull of the training embedding.

Key Parameters

| Parameter | Module | Default | Range | Effect |
|---|---|---|---|---|
| `n_neighbors` | UMAP | 15 | 2-200 | Local vs global structure balance |
| `min_dist` | UMAP | 0.1 | 0.0-0.99 | Cluster tightness |
| `n_components` | UMAP | 2 | 2-100 | Output dimensionality |
| `metric` | UMAP | `'euclidean'` | See metrics list | Distance calculation method |
| `spread` | UMAP | 1.0 | >0 | Embedding scale (with min_dist) |
| `n_epochs` | UMAP | `None` (auto) | 50-500+ | Training iterations |
| `learning_rate` | UMAP | 1.0 | >0 | SGD step size |
| `init` | UMAP | `'spectral'` | spectral/random/pca | Embedding initialization |
| `random_state` | UMAP | `None` | int | Reproducibility seed |
| `target_weight` | UMAP | 0.5 | 0.0-1.0 | Label influence (supervised) |
| `densmap` | UMAP | `False` | bool | Enable DensMAP |
| `dens_lambda` | UMAP | 2.0 | >0 | DensMAP density weight |
| `low_memory` | UMAP | `True` | bool | Memory-efficient mode |
| `encoder` | ParametricUMAP | `None` | Keras model | Custom encoder network |
| `decoder` | ParametricUMAP | `None` | Keras model | Custom decoder network |
| `n_training_epochs` | ParametricUMAP | 1 | 1-100 | Neural network training epochs |
| `alignment_regularisation` | AlignedUMAP | 0.01 | >0 | Alignment strength |
| `alignment_window_size` | AlignedUMAP | 3 | 1-N | Adjacent datasets to align |

Best Practices

  1. Always standardize features: Use `StandardScaler` before UMAP — unscaled features with different ranges will dominate the embedding.

  2. Set `random_state` for reproducibility: UMAP uses stochastic optimization; results vary between runs without a fixed seed.

  3. Use different parameters for clustering vs visualization: Clustering needs `n_neighbors=30, min_dist=0.0, n_components=5-10`. Visualization needs `n_neighbors=15, min_dist=0.1, n_components=2`.

  4. Anti-pattern — interpreting distances literally: UMAP preserves topology, not precise distances. Cluster separations and point distances in the embedding are not proportional to original distances.

  5. Anti-pattern — using 2D embeddings for clustering: 2D projections lose density information. Use 5-10 components for HDBSCAN input.

  6. Consider PCA preprocessing for very high dimensions: For data with >1000 features, reducing to 50-100 PCA components first can speed up UMAP without losing quality.

  7. Use Parametric UMAP for production: When you need fast transform on new data or reconstruction capabilities, Parametric UMAP's neural network provides consistent, fast inference.

Common Recipes

Recipe: Custom Numba Distance Metric

import numpy as np
import umap
from numba import njit

@njit()
def weighted_euclidean(x, y):
    """Custom distance with feature weights."""
    result = 0.0
    for i in range(x.shape[0]):
        result += (x[i] - y[i]) ** 2 * (1.0 + i * 0.01)  # increasing weight
    return np.sqrt(result)

embedding = umap.UMAP(metric=weighted_euclidean, random_state=42).fit_transform(data)

Recipe: Precomputed Distance Matrix

import umap
from scipy.spatial.distance import pdist, squareform

# Compute custom distance matrix
dist_matrix = squareform(pdist(data, metric='correlation'))

# Use precomputed distances
embedding = umap.UMAP(
    metric='precomputed', random_state=42
).fit_transform(dist_matrix)
print(f"Embedding from precomputed: {embedding.shape}")

Recipe: Metric Learning Pipeline

import umap
from sklearn.svm import SVC

# Train supervised embedding on labeled data
mapper = umap.UMAP(n_components=10, random_state=42)
train_emb = mapper.fit_transform(X_train, y=y_train)

# Transform unlabeled test data using learned metric
test_emb = mapper.transform(X_test)

# Downstream classifier
clf = SVC().fit(train_emb, y_train)
predictions = clf.predict(test_emb)
print(f"Accuracy: {(predictions == y_test).mean():.3f}")

Troubleshooting

| Problem | Cause | Solution |
|---|---|---|
| Disconnected/fragmented clusters | `n_neighbors` too low | Increase `n_neighbors` (try 30-50) |
| Clusters too spread out | `min_dist` too high | Decrease `min_dist` (try 0.0-0.05) |
| All points collapsed | Bad preprocessing or `min_dist` too low | Check `StandardScaler`; increase `min_dist` |
| Poor clustering results | Using visualization parameters for clustering | Set `n_neighbors=30, min_dist=0.0, n_components=5-10` |
| Transform results differ from training | Distribution shift | Ensure test data matches training distribution; use Parametric UMAP |
| Slow on large datasets (>100k) | Default settings | Set `low_memory=True`; preprocess with PCA to 50-100 dims |
| First run very slow | Numba JIT compilation | Expected; subsequent runs are fast (compiled cache) |
| `ImportError: umap` | Name conflict with the `umap` package | `pip install umap-learn` (not `pip install umap`) |
| Parametric UMAP import error | Missing TensorFlow | `pip install "umap-learn[parametric_umap]"` |
| Non-reproducible results | Missing `random_state` | Always set `random_state=42` (or any int) |

Bundled Resources

references/api_reference.md

Complete UMAP constructor parameter reference (60+ parameters organized by category: core, training, advanced structural, supervised, transform, performance, DensMAP), all methods and attributes, the ParametricUMAP class with autoencoder parameters, the AlignedUMAP class, and utility functions (nearest_neighbors, fuzzy_simplicial_set). Core parameter tuning guidance has been relocated to SKILL.md (Key Concepts and Core API modules); usage examples duplicating SKILL.md workflows are omitted.

Related Skills

  • scikit-learn-machine-learning — ML classifiers, preprocessing, pipelines for downstream tasks
  • matplotlib-scientific-plotting — Visualization of UMAP embeddings
  • scikit-bio — Biological distance matrices that can feed into UMAP via
    metric='precomputed'

References

  • McInnes L, Healy J, Melville J. UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. arXiv:1802.03426
  • Sainburg T, McInnes L, Gentner TQ. Parametric UMAP Embeddings for Representation and Semisupervised Learning. Neural Computation (2021)
  • Narayan A, Berger B, Cho H. Assessing single-cell transcriptomic variability through density-preserving data visualization. Nature Biotechnology (2021) — DensMAP
  • Official docs: https://umap-learn.readthedocs.io/
  • GitHub: https://github.com/lmcinnes/umap