# SciAgent-Skills umap-learn

Clone the full repository:

```bash
git clone https://github.com/jaechang-hits/SciAgent-Skills
```

Or install only this skill into `~/.claude/skills`:

```bash
T=$(mktemp -d) && git clone --depth=1 https://github.com/jaechang-hits/SciAgent-Skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/scientific-computing/umap-learn" ~/.claude/skills/jaechang-hits-sciagent-skills-umap-learn && rm -rf "$T"
```

`skills/scientific-computing/umap-learn/SKILL.md`

# UMAP-Learn
## Overview
UMAP (Uniform Manifold Approximation and Projection) is a dimensionality reduction algorithm for visualization and general non-linear dimensionality reduction. It is faster than t-SNE, scales to larger datasets, preserves both local and global structure, and supports supervised learning and embedding of new data points.
## When to Use
- Reducing high-dimensional data to 2D/3D for visualization
- Preprocessing for density-based clustering (HDBSCAN, DBSCAN)
- Feature engineering in ML pipelines (transform new data into learned embedding)
- Supervised/semi-supervised embedding with partial labels
- Tracking embeddings across time points or batches (AlignedUMAP)
- Density-preserving embeddings (DensMAP)
- Neural network-based embedding with custom architectures (Parametric UMAP)
- **Not for** linear dimensionality reduction: use PCA (scikit-learn)
- **Not for** neighborhood-graph construction without embedding: use scikit-learn `NearestNeighbors`
## Prerequisites

```bash
pip install umap-learn

# For Parametric UMAP (neural network variant); requires TensorFlow 2.x
pip install "umap-learn[parametric_umap]"
```
Critical: Always standardize features before applying UMAP to ensure equal weighting across dimensions.
## Quick Start

```python
import umap
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_digits

# Load and scale data
X, y = load_digits(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

# Fit and transform
embedding = umap.UMAP(random_state=42).fit_transform(X_scaled)
print(f"Input: {X_scaled.shape}, Output: {embedding.shape}")
# Input: (1797, 64), Output: (1797, 2)
```
## Core API

### 1. Standard UMAP

Basic dimensionality reduction following scikit-learn conventions.

```python
import umap
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(data)

# Method 1: fit_transform (single step)
embedding = umap.UMAP(
    n_neighbors=15,      # local neighborhood size (2-200)
    min_dist=0.1,        # min distance between embedded points (0.0-0.99)
    n_components=2,      # output dimensions
    metric='euclidean',  # distance metric
    random_state=42,     # reproducibility
).fit_transform(X_scaled)
print(f"Embedding shape: {embedding.shape}")

# Method 2: fit + access (for reuse)
reducer = umap.UMAP(random_state=42)
reducer.fit(X_scaled)
embedding = reducer.embedding_  # trained embedding
graph = reducer.graph_          # fuzzy simplicial set (sparse matrix)
```

```python
# Visualization
import matplotlib.pyplot as plt

plt.figure(figsize=(8, 6))
plt.scatter(embedding[:, 0], embedding[:, 1], c=labels, cmap='Spectral', s=5)
plt.colorbar()
plt.title('UMAP Embedding')
plt.tight_layout()
plt.savefig('umap_embedding.png', dpi=150)
```
### 2. Supervised & Semi-Supervised UMAP

Incorporate label information to guide the embedding via the `y` parameter.

```python
import umap

# Supervised: all labels known
embedding = umap.UMAP(random_state=42).fit_transform(X_scaled, y=labels)

# Semi-supervised: partial labels (mark unlabeled as -1)
semi_labels = labels.copy()
semi_labels[unlabeled_indices] = -1
embedding = umap.UMAP(random_state=42).fit_transform(X_scaled, y=semi_labels)

# Control label influence with target_weight (0.0=unsupervised, 1.0=fully supervised)
reducer = umap.UMAP(
    target_weight=0.7,            # emphasize labels
    target_metric='categorical',  # for classification; use a distance metric for regression
    random_state=42
)
embedding = reducer.fit_transform(X_scaled, y=labels)
print(f"Supervised embedding: {embedding.shape}")
```
### 3. Transform New Data

Project unseen data into the trained embedding space.

```python
import umap
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Fit on training data
reducer = umap.UMAP(n_components=10, random_state=42)
X_train_emb = reducer.fit_transform(X_train_scaled)

# Transform test data
X_test_emb = reducer.transform(X_test_scaled)
print(f"Train: {X_train_emb.shape}, Test: {X_test_emb.shape}")

# Works in sklearn Pipelines
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('umap', umap.UMAP(n_components=10, random_state=42)),
    ('classifier', SVC())
])
pipeline.fit(X_train, y_train)
accuracy = pipeline.score(X_test, y_test)
print(f"Pipeline accuracy: {accuracy:.3f}")
```
### 4. Parametric UMAP

Neural network-based embedding via TensorFlow/Keras. Enables fast transforms, reconstruction, and custom architectures.

```python
from umap.parametric_umap import ParametricUMAP

# Default architecture (3-layer, 100-neuron fully connected network)
embedder = ParametricUMAP(n_components=2, random_state=42)
embedding = embedder.fit_transform(X_scaled)
new_emb = embedder.transform(new_data)  # fast neural network inference
print(f"Parametric embedding: {embedding.shape}")
```

```python
import numpy as np
import tensorflow as tf
from umap.parametric_umap import ParametricUMAP

# Custom encoder/decoder for autoencoder mode
input_dim = X_scaled.shape[1]
encoder = tf.keras.Sequential([
    tf.keras.layers.InputLayer(input_shape=(input_dim,)),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(2),
])
decoder = tf.keras.Sequential([
    tf.keras.layers.InputLayer(input_shape=(2,)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(input_dim),
])

embedder = ParametricUMAP(
    encoder=encoder,
    decoder=decoder,
    dims=(input_dim,),
    parametric_reconstruction=True,
    autoencoder_loss=True,
    n_training_epochs=10,
    batch_size=128,
    n_neighbors=15,
    min_dist=0.1,
    random_state=42
)
embedding = embedder.fit_transform(X_scaled)
reconstructed = embedder.inverse_transform(embedding)
print(f"Reconstruction error: {np.mean((X_scaled - reconstructed)**2):.4f}")
```
### 5. DensMAP

A variant that preserves local density information in the embedding.

```python
import numpy as np
import umap

reducer = umap.UMAP(
    densmap=True,      # enable DensMAP
    dens_lambda=2.0,   # density preservation weight
    dens_frac=0.3,     # fraction of optimization used for density estimation
    output_dens=True,  # store density estimates
    n_neighbors=15,
    min_dist=0.1,
    random_state=42
)
embedding = reducer.fit_transform(X_scaled)

# Access density estimates
original_density = reducer.rad_orig_  # density (radii) in original space
embedded_density = reducer.rad_emb_   # density (radii) in embedded space
print(f"DensMAP embedding: {embedding.shape}")
print(f"Density correlation: {np.corrcoef(original_density, embedded_density)[0, 1]:.3f}")
```
### 6. AlignedUMAP

Align embeddings across multiple related datasets (time points, batches).

```python
from umap import AlignedUMAP

# Multiple related datasets
datasets = [day1_data, day2_data, day3_data]

mapper = AlignedUMAP(
    n_neighbors=15,
    alignment_regularisation=1e-2,  # alignment strength
    alignment_window_size=2,        # align with N adjacent datasets
    n_components=2,
    random_state=42
)

# AlignedUMAP requires `relations`: one dict per consecutive pair of datasets,
# mapping row indices of dataset i to matching row indices of dataset i+1.
# Here we assume the same samples are tracked across all time points.
relations = [{j: j for j in range(len(day1_data))},
             {j: j for j in range(len(day2_data))}]
mapper.fit(datasets, relations=relations)

aligned_embeddings = mapper.embeddings_  # list of aligned embedding arrays
print(f"Aligned {len(aligned_embeddings)} datasets")
for i, emb in enumerate(aligned_embeddings):
    print(f"  Dataset {i}: {emb.shape}")
```
## Key Concepts

### Parameter Tuning Guide

| Parameter | Low | Medium (default) | High | Effect |
|---|---|---|---|---|
| `n_neighbors` | 2-5 | 15 | 50-200 | Local detail vs global structure |
| `min_dist` | 0.0 | 0.1 | 0.5-0.99 | Tight clusters vs spread out |
| `n_components` | 2 | 2 | 5-50 | Visualization vs ML/clustering |
| `spread` | 0.5 | 1.0 | 2.0 | Embedding scale (with min_dist) |
### Configuration by Use-Case
| Use-Case | n_neighbors | min_dist | n_components | metric |
|---|---|---|---|---|
| Visualization | 15 | 0.1 | 2 | euclidean |
| Clustering (HDBSCAN) | 30 | 0.0 | 5-10 | euclidean |
| Text/document embedding | 15 | 0.1 | 2 | cosine |
| Global structure | 100 | 0.5 | 2 | euclidean |
| ML feature engineering | 15-30 | 0.1 | 10-50 | euclidean |
| Binary/set data | 15 | 0.1 | 2 | hamming/jaccard |
### Supported Metrics

- **Minkowski family:** `euclidean`, `manhattan`, `chebyshev`, `minkowski`
- **Spatial:** `canberra`, `braycurtis`, `haversine`
- **Correlation:** `cosine`, `correlation`
- **Binary:** `hamming`, `jaccard`, `dice`, `russellrao`, `rogerstanimoto`, `sokalmichener`, `sokalsneath`, `yule`
- **Special:** `precomputed` (distance matrix), custom Numba-compiled callables
### Standard UMAP vs Parametric UMAP
| Feature | Standard | Parametric |
|---|---|---|
| Backend | Direct optimization | TensorFlow neural network |
| Transform speed | Moderate | Fast (neural net inference) |
| Inverse transform | Approximate, expensive | Decoder network, fast |
| Custom architecture | No | Yes (CNNs, RNNs, etc.) |
| Requirements | umap-learn | umap-learn + TensorFlow 2.x |
| Best for | Quick exploration | Production pipelines, reconstruction |
## Common Workflows

### Workflow 1: UMAP + HDBSCAN Clustering Pipeline

```python
import umap
import hdbscan
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler

# Step 1: Preprocess
X_scaled = StandardScaler().fit_transform(data)
print(f"Input shape: {X_scaled.shape}")

# Step 2: UMAP for clustering (NOT visualization parameters)
reducer = umap.UMAP(
    n_neighbors=30,   # more global structure for clustering
    min_dist=0.0,     # allow tight packing
    n_components=10,  # higher dims preserve density better than 2D
    metric='euclidean',
    random_state=42
)
embedding = reducer.fit_transform(X_scaled)

# Step 3: HDBSCAN clustering
clusterer = hdbscan.HDBSCAN(min_cluster_size=15, min_samples=5)
cluster_labels = clusterer.fit_predict(embedding)
n_clusters = len(set(cluster_labels)) - (1 if -1 in cluster_labels else 0)
noise = sum(cluster_labels == -1)
print(f"Clusters: {n_clusters}, Noise: {noise}")

# Step 4: Separate 2D embedding for visualization
vis_emb = umap.UMAP(n_neighbors=15, min_dist=0.1, random_state=42).fit_transform(X_scaled)
plt.scatter(vis_emb[:, 0], vis_emb[:, 1], c=cluster_labels, cmap='Spectral', s=5)
plt.colorbar()
plt.title(f'HDBSCAN Clusters (n={n_clusters})')
plt.tight_layout()
plt.savefig('umap_clusters.png', dpi=150)
```
### Workflow 2: Supervised Embedding for Classification

```python
import umap
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import classification_report

# Split and scale
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

# Supervised UMAP for feature engineering
reducer = umap.UMAP(n_components=10, random_state=42)
X_train_emb = reducer.fit_transform(X_train_s, y=y_train)
X_test_emb = reducer.transform(X_test_s)

# Downstream classifier
clf = SVC(kernel='rbf')
clf.fit(X_train_emb, y_train)
y_pred = clf.predict(X_test_emb)
print(classification_report(y_test, y_pred))
```
### Workflow 3: Exploring Embedding Space with Inverse Transform

Text-only: combines Core API modules 1 and 3 (inverse_transform on standard UMAP):

- Fit standard UMAP on data (Core API: Standard UMAP)
- Create a grid of points spanning the embedding space
- Apply `reducer.inverse_transform(grid_points)` to reconstruct high-dimensional data
- Visualize reconstructed samples to understand embedding regions

Note: inverse transform is approximate; it works poorly outside the convex hull of the training embedding.
### Key Parameters

| Parameter | Module | Default | Range | Effect |
|---|---|---|---|---|
| `n_neighbors` | UMAP | 15 | 2-200 | Local vs global structure balance |
| `min_dist` | UMAP | 0.1 | 0.0-0.99 | Cluster tightness |
| `n_components` | UMAP | 2 | 2-100 | Output dimensionality |
| `metric` | UMAP | 'euclidean' | See metrics list | Distance calculation method |
| `spread` | UMAP | 1.0 | >0 | Embedding scale (with min_dist) |
| `n_epochs` | UMAP | (auto) | 50-500+ | Training iterations |
| `learning_rate` | UMAP | 1.0 | >0 | SGD step size |
| `init` | UMAP | 'spectral' | spectral/random/pca | Embedding initialization |
| `random_state` | UMAP | None | int | Reproducibility seed |
| `target_weight` | UMAP | 0.5 | 0.0-1.0 | Label influence (supervised) |
| `densmap` | UMAP | False | bool | Enable DensMAP |
| `dens_lambda` | UMAP | 2.0 | >0 | DensMAP density weight |
| `low_memory` | UMAP | True | bool | Memory-efficient mode |
| `encoder` | ParametricUMAP | None | Keras model | Custom encoder network |
| `decoder` | ParametricUMAP | None | Keras model | Custom decoder network |
| `n_training_epochs` | ParametricUMAP | 1 | 1-100 | Neural network training epochs |
| `alignment_regularisation` | AlignedUMAP | 0.01 | >0 | Alignment strength |
| `alignment_window_size` | AlignedUMAP | 3 | 1-N | Adjacent datasets to align |
## Best Practices

- **Always standardize features**: Use `StandardScaler` before UMAP; unscaled features with different ranges will dominate the embedding.
- **Set `random_state` for reproducibility**: UMAP uses stochastic optimization; results vary between runs without a fixed seed.
- **Use different parameters for clustering vs visualization**: Clustering needs `n_neighbors=30, min_dist=0.0, n_components=5-10`. Visualization needs `n_neighbors=15, min_dist=0.1, n_components=2`.
- **Anti-pattern: interpreting distances literally**: UMAP preserves topology, not precise distances. Cluster separations and point distances in the embedding are not proportional to original distances.
- **Anti-pattern: using 2D embeddings for clustering**: 2D projections lose density information. Use 5-10 components for HDBSCAN input.
- **Consider PCA preprocessing for very high dimensions**: For data with >1000 features, reducing to 50-100 PCA components first can speed up UMAP without losing quality.
- **Use Parametric UMAP for production**: When you need fast transforms on new data or reconstruction capabilities, Parametric UMAP's neural network provides consistent, fast inference.
## Common Recipes

### Recipe: Custom Numba Distance Metric

```python
import numpy as np
import umap
from numba import njit

@njit()
def weighted_euclidean(x, y):
    """Custom distance with feature weights."""
    result = 0.0
    for i in range(x.shape[0]):
        result += (x[i] - y[i]) ** 2 * (1.0 + i * 0.01)  # increasing weight
    return np.sqrt(result)

embedding = umap.UMAP(metric=weighted_euclidean, random_state=42).fit_transform(data)
```
### Recipe: Precomputed Distance Matrix

```python
import umap
from scipy.spatial.distance import pdist, squareform

# Compute custom distance matrix
dist_matrix = squareform(pdist(data, metric='correlation'))

# Use precomputed distances
embedding = umap.UMAP(
    metric='precomputed',
    random_state=42
).fit_transform(dist_matrix)
print(f"Embedding from precomputed: {embedding.shape}")
```
### Recipe: Metric Learning Pipeline

```python
import umap
from sklearn.svm import SVC

# Train supervised embedding on labeled data
mapper = umap.UMAP(n_components=10, random_state=42)
train_emb = mapper.fit_transform(X_train, y=y_train)

# Transform unlabeled test data using the learned metric
test_emb = mapper.transform(X_test)

# Downstream classifier
clf = SVC().fit(train_emb, y_train)
predictions = clf.predict(test_emb)
print(f"Accuracy: {(predictions == y_test).mean():.3f}")
```
## Troubleshooting

| Problem | Cause | Solution |
|---|---|---|
| Disconnected/fragmented clusters | `n_neighbors` too low | Increase `n_neighbors` (try 30-50) |
| Clusters too spread out | `min_dist` too high | Decrease `min_dist` (try 0.0-0.05) |
| All points collapsed | Bad preprocessing or `spread` too low | Check `StandardScaler`; increase `spread` |
| Poor clustering results | Using visualization parameters for clustering | Set `n_neighbors=30, min_dist=0.0, n_components=10` |
| Transform results differ from training | Distribution shift | Ensure test data matches training distribution; use Parametric UMAP |
| Slow on large datasets (>100k) | Default settings | Set `low_memory=True`; preprocess with PCA to 50-100 dims |
| First run very slow | Numba JIT compilation | Expected; subsequent runs are fast (compiled cache) |
| `import umap` fails | Name conflict with the `umap` PyPI package | `pip install umap-learn` (not `pip install umap`) |
| Parametric UMAP import error | Missing TensorFlow | `pip install "umap-learn[parametric_umap]"` |
| Non-reproducible results | Missing `random_state` | Always set `random_state=42` (or any int) |
## Bundled Resources

### `references/api_reference.md`

Complete UMAP constructor parameter reference (60+ parameters organized by category: core, training, advanced structural, supervised, transform, performance, DensMAP), all methods and attributes, the ParametricUMAP class with autoencoder parameters, the AlignedUMAP class, and utility functions (`nearest_neighbors`, `fuzzy_simplicial_set`). Core parameter tuning guidance was relocated to SKILL.md Key Concepts and Core API modules; usage examples duplicating SKILL.md workflows are omitted.
## Related Skills

- `scikit-learn-machine-learning` — ML classifiers, preprocessing, pipelines for downstream tasks
- `matplotlib-scientific-plotting` — Visualization of UMAP embeddings
- `scikit-bio` — Biological distance matrices that can feed into UMAP via `metric='precomputed'`
## References
- McInnes L, Healy J, Melville J. UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. arXiv:1802.03426
- Sainburg T, McInnes L, Gentner TQ. Parametric UMAP Embeddings for Representation and Semisupervised Learning. Neural Computation (2021)
- Narayan A, Berger B, Cho H. Assessing single-cell transcriptomic variability through density-preserving data visualization. Nature Biotechnology (2021) — DensMAP
- Official docs: https://umap-learn.readthedocs.io/
- GitHub: https://github.com/lmcinnes/umap