# SciAgent-Skills vaex-dataframes

```bash
git clone https://github.com/jaechang-hits/SciAgent-Skills
```

One-line install into `~/.claude/skills`:

```bash
T=$(mktemp -d) && git clone --depth=1 https://github.com/jaechang-hits/SciAgent-Skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/scientific-computing/vaex-dataframes" ~/.claude/skills/jaechang-hits-sciagent-skills-vaex-dataframes && rm -rf "$T"
```

`skills/scientific-computing/vaex-dataframes/SKILL.md`

# Vaex DataFrames
## Overview
Vaex is a high-performance Python library for lazy, out-of-core DataFrame operations on datasets too large to fit in RAM. It processes over a billion rows per second using memory-mapped files and lazy evaluation, enabling interactive exploration and analysis without loading data into memory.
## When to Use
- Processing tabular datasets larger than available RAM (10 GB to terabytes)
- Fast statistical aggregations on massive datasets (mean, std, quantiles at billion-row scale)
- Creating visualizations (heatmaps, histograms) of large datasets without sampling
- Building ML preprocessing pipelines (scaling, encoding, PCA) on big data
- Converting between data formats (CSV to HDF5/Arrow for fast repeated access)
- Feature engineering with virtual columns that consume zero additional memory
- Working with astronomical catalogs, financial time series, or large scientific datasets
- For in-memory speed on data that fits in RAM, use polars instead
- For distributed multi-node computing, use dask instead
## Prerequisites

```bash
pip install vaex

# Optional extras:
pip install vaex-hdf5         # HDF5 support (recommended)
pip install vaex-arrow        # Apache Arrow support
pip install vaex-ml           # Machine learning transformers
pip install vaex-viz          # Visualization support
pip install vaex-jupyter      # Jupyter widget support
pip install s3fs gcsfs adlfs  # Cloud storage (S3, GCS, Azure)
```
Requires Python 3.7+. HDF5 and Arrow formats provide instant memory-mapped loading; CSV requires conversion for optimal performance.
## Quick Start

```python
import vaex
import numpy as np

df = vaex.from_arrays(
    x=np.random.normal(0, 1, 1_000_000),
    y=np.random.normal(0, 1, 1_000_000),
    category=np.random.choice(['A', 'B', 'C'], 1_000_000),
)
df['radius'] = (df.x**2 + df.y**2).sqrt()  # Virtual column, zero memory
df_inner = df[df.radius < 1.0]             # Filtered view
print(df_inner.radius.mean())              # ~0.48

result = df.groupby('category').agg({'radius': 'mean'})
print(result)  # shape: (3, 2)

df.export_hdf5('/tmp/sample.hdf5')   # Export to an efficient format
df2 = vaex.open('/tmp/sample.hdf5')  # Future loads are instant
print(f"Loaded {len(df2):,} rows instantly")
```
## Core API

### 1. DataFrame Creation and I/O
Create DataFrames from files, arrays, pandas, or Arrow tables. HDF5 and Arrow files are memory-mapped for instant loading.
```python
import vaex
import numpy as np
import pandas as pd
import pyarrow as pa

# From files (HDF5/Arrow are instant via memory mapping)
df = vaex.open('data.hdf5')     # Recommended: instant, memory-mapped
df = vaex.open('data.arrow')    # Also instant, memory-mapped
df = vaex.open('data.parquet')  # Fast, columnar, compressed
df = vaex.open('data_*.hdf5')   # Wildcards: multiple files as one DataFrame

# From CSV (slow for large files -- convert to HDF5)
df = vaex.from_csv('data.csv', convert='data.hdf5')  # Auto-converts

# From Python objects
df = vaex.from_arrays(x=np.arange(100), y=np.random.rand(100))
df = vaex.from_dict({'name': ['Alice', 'Bob'], 'age': [30, 25]})
df = vaex.from_pandas(pd.DataFrame({'a': [1, 2, 3]}), copy_index=False)

# From an Arrow table
df = vaex.from_arrow_table(pa.table({'x': [1, 2, 3]}))

# Inspect
print(df.shape)         # (rows, cols)
print(df.column_names)  # Column names
df.describe()           # Statistical summary

# Export
df.export_hdf5('out.hdf5')    # Recommended
df.export_arrow('out.arrow')  # Interoperability
df.export_parquet('out.parquet', compression='snappy')  # Compressed
df.export_parquet('s3://bucket/data.parquet')           # Cloud storage
```
### 2. Filtering and Selection
Filter rows with boolean expressions. Named selections allow computing statistics on multiple subsets without creating new DataFrames.
```python
import vaex
import numpy as np

df = vaex.from_arrays(
    age=np.array([22, 35, 45, 19, 60]),
    salary=np.array([30000, 70000, 90000, 25000, 120000]),
    dept=np.array(['Eng', 'Sales', 'Eng', 'Sales', 'Eng']),
)

# Boolean filtering (creates a view, no copy)
df_eng_high = df[(df.dept == 'Eng') & (df.salary > 50000)]
print(len(df_eng_high))  # 2

# isin, between, string/null checks
df_mid = df[df.age.between(25, 50)]
# df[df.name.str.contains('Ali')], df[df.salary.notna()]

# Named selections (more efficient for multiple aggregations)
df.select(df.age >= 30, name='senior')
df.select(df.dept == 'Eng', name='engineers')
mean_senior = df.salary.mean(selection='senior')
mean_eng = df.salary.mean(selection='engineers')
print(f"Senior avg: {mean_senior}, Eng avg: {mean_eng}")
# Senior avg: 93333.33, Eng avg: 80000.0
```
### 3. Virtual Columns and Expressions
Virtual columns are computed on-the-fly with zero memory overhead. They are the core of Vaex's efficiency.
```python
import vaex
import numpy as np

df = vaex.from_arrays(
    price=np.array([10.0, 20.0, 30.0, 40.0]),
    quantity=np.array([5, 3, 8, 2]),
    discount=np.array([0.0, 0.1, 0.0, 0.2]),
)

# Arithmetic (virtual columns -- no memory used)
df['revenue'] = df.price * df.quantity * (1 - df.discount)
df['log_price'] = df.price.log()

# Conditional logic
df['tier'] = (df.price >= 30).where('premium', 'standard')

# Math: .abs(), .sqrt(), .log(), .log10(), .exp(), .sin(), .cos(),
# .round(n), .floor(), .ceil(), .astype('float64')

# Check virtual vs materialized
print(df.get_column_names(virtual=False))  # Materialized only

# Materialize when needed (complex expression used repeatedly)
df['revenue_mat'] = df.revenue.values  # Now stored in memory
```
### 4. Aggregation and GroupBy

Efficient aggregations across billions of rows. Use `delay=True` to batch multiple operations into a single data pass.
```python
import vaex
import numpy as np

np.random.seed(42)
n = 100_000
df = vaex.from_arrays(
    sales=np.random.uniform(10, 500, n),
    quantity=np.random.randint(1, 20, n),
    region=np.random.choice(['East', 'West', 'North'], n),
)

# Single-column aggregations
print(f"Mean: {df.sales.mean():.2f}, Std: {df.sales.std():.2f}")
# Also: .min(), .max(), .minmax(), .sum(), .count(), .nunique(),
# .quantile(0.5), .median_approx(), .kurtosis(), .skew(),
# .correlation(df.x, df.y), .covar(df.x, df.y)

# Batch aggregations with delay=True (single pass through the data)
mean_s = df.sales.mean(delay=True)
std_s = df.sales.std(delay=True)
sum_q = df.quantity.sum(delay=True)
df.execute()  # Runs all delayed tasks in one pass
print(f"Mean: {mean_s.get():.2f}, Std: {std_s.get():.2f}, Total qty: {sum_q.get()}")

# GroupBy
grouped = df.groupby('region').agg({'sales': ['sum', 'mean'], 'quantity': 'sum'})
print(grouped)  # shape: (3, 4)

# Multi-dimensional binned aggregation (for heatmap data)
counts = df.count(binby=[df.sales, df.quantity],
                  limits=[[0, 500], [1, 20]], shape=(50, 19))
print(f"2D histogram shape: {counts.shape}")  # (50, 19)
```
### 5. String and DateTime Operations

String methods are available via the `.str` accessor; datetime methods via the `.dt` accessor.
```python
import vaex
import numpy as np

# String operations via the .str accessor
df = vaex.from_dict({
    'name': ['Alice Smith', 'Bob Jones', 'Charlie Brown'],
    'email': ['ALICE@test.com', 'bob@TEST.com', 'charlie@test.com'],
})
df['email_clean'] = df.email.str.lower().str.strip()
df['first_name'] = df.name.str.split(' ')[0]
df['has_test'] = df.email_clean.str.contains('test')
# Also: .upper(), .title(), .startswith(), .endswith(), .len(),
# .replace(), .pad(), .slice(start, end)

# DateTime operations via the .dt accessor
dates = np.array(['2024-01-15', '2024-06-20', '2024-12-01'], dtype='datetime64')
df2 = vaex.from_arrays(timestamp=dates, value=np.array([100, 200, 300]))
df2['year'] = df2.timestamp.dt.year
df2['month'] = df2.timestamp.dt.month
df2['weekday'] = df2.timestamp.dt.dayofweek  # 0=Monday
# Also: .day, .hour, .minute, .second
print(df2[['timestamp', 'year', 'month']].head(3))
```
### 6. Visualization
Vaex visualizes billion-row datasets through efficient binning, using all data without sampling.
```python
import vaex
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(0)
df = vaex.from_arrays(
    x=np.random.normal(0, 1, 500_000),
    y=np.random.normal(0, 1, 500_000),
    z=np.random.uniform(0, 10, 500_000),
)

# 1D histogram
df.plot1d(df.x, limits=[-4, 4], shape=80, figsize=(8, 4))
plt.title('X Distribution')
plt.savefig('hist1d.png', dpi=150, bbox_inches='tight')
plt.close()

# 2D density heatmap (core Vaex visualization)
df.plot(df.x, df.y, limits='99.7%', shape=(256, 256), f='log',
        colormap='viridis', figsize=(8, 8))
plt.savefig('heatmap2d.png', dpi=150, bbox_inches='tight')
plt.close()

# Aggregation on a grid (mean of z over the x-y plane)
df.plot(df.x, df.y, what='mean(z)', limits=[[-3, 3], [-3, 3]],
        shape=(100, 100))
plt.savefig('mean_grid.png', dpi=150, bbox_inches='tight')
plt.close()
print("Saved: hist1d.png, heatmap2d.png, mean_grid.png")
```
### 7. ML Integration

`vaex.ml` provides transformers for preprocessing, dimensionality reduction, clustering, and scikit-learn model wrapping. All transformers create virtual columns (zero memory overhead).
```python
import vaex
import vaex.ml
import vaex.ml.sklearn
import numpy as np
from sklearn.ensemble import RandomForestClassifier

np.random.seed(42)
n = 10_000
df = vaex.from_arrays(
    age=np.random.randint(18, 70, n).astype(float),
    income=np.random.uniform(20000, 150000, n),
    category=np.random.choice(['A', 'B', 'C'], n),
    target=np.random.randint(0, 2, n),
)

# Feature scaling (creates virtual columns: standard_scaled_age, ...)
scaler = vaex.ml.StandardScaler(features=['age', 'income'])
df = scaler.fit_transform(df)
# Also: MinMaxScaler, MaxAbsScaler, RobustScaler

# Categorical encoding
encoder = vaex.ml.LabelEncoder(features=['category'])
df = encoder.fit_transform(df)
# Also: OneHotEncoder, FrequencyEncoder, TargetEncoder, WeightOfEvidenceEncoder

# PCA
pca = vaex.ml.PCA(features=['standard_scaled_age', 'standard_scaled_income'],
                  n_components=2)
df = pca.fit_transform(df)
print(f"Explained variance: {pca.explained_variance_ratio_}")

# Scikit-learn bridge: wrap any sklearn model
features = ['standard_scaled_age', 'standard_scaled_income', 'label_encoded_category']
model = vaex.ml.sklearn.Predictor(
    features=features,
    target='target',
    model=RandomForestClassifier(n_estimators=50, random_state=42),
    prediction_name='rf_prediction',
)
train_df, test_df = df[:8000], df[8000:]
model.fit(train_df)
test_df = model.transform(test_df)  # Predictions as a virtual column
accuracy = (test_df.rf_prediction == test_df.target).mean()
print(f"Accuracy: {accuracy:.3f}")  # ~0.50 (random data)

# Save pipeline state (encoding + scaling + model)
train_df.state_write('pipeline_state.json')
# Deploy: prod_df.state_load('pipeline_state.json')
```
## Key Concepts

### Lazy Evaluation Model

Vaex operations build an expression graph without executing computation. Evaluation is triggered only when a result is accessed (printing a value, calling `.values`, or exporting).
```python
import vaex
import numpy as np

df = vaex.from_arrays(x=np.random.rand(1_000_000))

# NOT computed: virtual column, expression object, filtered view
df['x_sq'] = df.x ** 2  # Virtual column
expr = df.x_sq + 1      # Expression object, no evaluation yet
df_f = df[df.x > 0.5]   # Filtered view

# COMPUTED: accessing a value, .values, .to_pandas_df(), .export_hdf5()
print(f"Mean: {df.x.mean():.4f}")
```
### Virtual vs Materialized Columns
| Aspect | Virtual | Materialized |
|---|---|---|
| Memory | Zero overhead | Stores full array |
| Speed | Recomputed each use | Instant access |
| Creation | `df['c'] = df.a * df.b` | `df['c'] = (df.a * df.b).values` |
| Best for | Simple expressions, infrequent use | Complex expressions used repeatedly |
| Check | Not in `df.get_column_names(virtual=False)` | Returned by `df.get_column_names(virtual=False)` |
Rule of thumb: Keep columns virtual unless the same complex expression is used in 3+ aggregations.
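A minimal sketch of this tradeoff, using only the APIs shown above (timings are illustrative and machine-dependent):

```python
import time
import vaex
import numpy as np

df = vaex.from_arrays(a=np.random.rand(5_000_000), b=np.random.rand(5_000_000))

# Virtual: zero memory, but the expression is recomputed on each aggregation
df['score'] = (df.a.log() + df.b.sqrt()) ** 2

t0 = time.perf_counter()
for _ in range(3):
    df.score.mean()          # Expression re-evaluated every pass
virtual_t = time.perf_counter() - t0

# Materialized: pay the memory cost once, reuse the stored array
df['score_mat'] = df.score.values

t0 = time.perf_counter()
for _ in range(3):
    df.score_mat.mean()      # Reads the stored array directly
mat_t = time.perf_counter() - t0

print(f"virtual: {virtual_t:.3f}s, materialized: {mat_t:.3f}s")
```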
### Memory-Mapped File Architecture
HDF5 and Apache Arrow files are memory-mapped: the OS maps file pages to virtual memory on demand, so opening a 100 GB file is instant and uses minimal RAM. Data pages are read from disk only when accessed.
```python
import vaex

# Opens instantly regardless of file size
df = vaex.open('100gb_dataset.hdf5')  # ~0.001 s, minimal RAM
mean = df.column.mean()  # Streams through the data; RSS stays low
```
### Format Comparison
| Feature | HDF5 | Arrow/Feather | Parquet | CSV |
|---|---|---|---|---|
| Load speed | Instant | Instant | Fast | Slow |
| Memory-mapped | Yes | Yes | No | No |
| Compression | Optional (gzip, lzf, blosc) | No | Default (snappy, gzip, brotli) | No |
| Columnar | Yes | Yes | Yes | No |
| Portability | Good | Excellent | Excellent | Excellent |
| Best for | Local Vaex workflows | Cross-language interop | Distributed systems | Data exchange |
Recommendation: Convert CSV to HDF5 once (`vaex.from_csv('data.csv', convert='data.hdf5')`), then use HDF5 for all future loads.
## Common Workflows

### Workflow 1: Large CSV Exploration and Conversion
```python
import vaex
import matplotlib.pyplot as plt

# Convert CSV to HDF5 (one-time); future loads are instant
df = vaex.from_csv('large_data.csv', convert='large_data.hdf5')
print(f"Shape: {df.shape}, Columns: {df.column_names}")

# Feature engineering with virtual columns
df['log_value'] = df.value.log()
df['category_clean'] = df.category.str.lower().str.strip()
df['is_high'] = df.value > df.value.mean()

# Batch statistics (single pass)
mean_v = df.value.mean(delay=True)
std_v = df.value.std(delay=True)
p50 = df.value.quantile(0.5, delay=True)
p99 = df.value.quantile(0.99, delay=True)
df.execute()
print(f"Mean: {mean_v.get():.2f}, Std: {std_v.get():.2f}, "
      f"P50: {p50.get():.2f}, P99: {p99.get():.2f}")

# Visualize
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
df.plot1d(df.value, ax=axes[0], show=False)
axes[0].set_title('Value Distribution')
df.plot1d(df.log_value, ax=axes[1], show=False)
axes[1].set_title('Log Value Distribution')
plt.tight_layout()
plt.savefig('exploration.png', dpi=150, bbox_inches='tight')
plt.close()
```
### Workflow 2: ML Pipeline with State Deployment
```python
import vaex
import vaex.ml
import vaex.ml.sklearn
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

np.random.seed(42)
n = 50_000
df = vaex.from_arrays(
    age=np.random.randint(18, 70, n).astype(float),
    income=np.random.uniform(20000, 150000, n),
    region=np.random.choice(['East', 'West', 'Central'], n),
    target=np.random.randint(0, 2, n),
)
train, test = df[:40_000], df[40_000:]

# Preprocessing pipeline
train = vaex.ml.LabelEncoder(features=['region']).fit_transform(train)
train = vaex.ml.StandardScaler(features=['age', 'income']).fit_transform(train)

# Train model
features = ['standard_scaled_age', 'standard_scaled_income', 'label_encoded_region']
model = vaex.ml.sklearn.Predictor(
    features=features,
    target='target',
    model=GradientBoostingClassifier(n_estimators=50, random_state=42),
    prediction_name='prediction',
)
model.fit(train)

# Save pipeline state (encoding + scaling + model in one file)
train.state_write('ml_pipeline.json')

# Deploy: apply the saved state to new data
test.state_load('ml_pipeline.json')
accuracy = (test.prediction == test.target).mean()
print(f"Test accuracy: {accuracy:.3f}")  # ~0.50 (random data)

# Production: prod_df = vaex.open('new_batch.hdf5'); prod_df.state_load('ml_pipeline.json')
```
## Key Parameters
| Parameter | Module | Default | Range/Options | Effect |
|---|---|---|---|---|
| `shape` | `plot1d`, `plot` | 64 (1D), (256, 256) (2D) | 32-2048 | Histogram bin count / heatmap resolution |
| `limits` | `plot1d`, `plot` | None | `'minmax'`, `'99.7%'`, explicit `[[x0, x1], [y0, y1]]` | Axis ranges; percentile-based for outlier handling |
| `f` | `plot` | `'identity'` | `'identity'`, `'log'`, `'log10'` | Color scale transform for density plots |
| `delay` | aggregations | `False` | `True`, `False` | Batch multiple aggregations into a single pass |
| `convert` | `from_csv` | `False` | file path string | Auto-convert CSV to HDF5 during load |
| `chunk_size` | `from_csv` | 5,000,000 | 100K-50M | Rows per chunk for CSV processing |
| `n_components` | `PCA` | 2 | 1-n_features | Number of principal components |
| `n_clusters` | `KMeans` | 8 | 2-100+ | Number of clusters |
| `features` | all ML transformers | required | list of column names | Columns to transform |
| `compression` | `export_parquet`, `export_hdf5` | None | `'snappy'`, `'gzip'`, `'brotli'` | Trade file size for I/O speed |
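A compact sketch exercising several of these parameters together; it assumes matplotlib and the plotting extras are installed, and follows the delayed-aggregation pattern from Module 4:

```python
import vaex
import numpy as np
import matplotlib.pyplot as plt

df = vaex.from_arrays(x=np.random.randn(1_000_000), y=np.random.randn(1_000_000))

# shape, limits, and f control the binned heatmap resolution and color scale
df.plot(df.x, df.y, shape=(256, 256), limits='99.7%', f='log')
plt.savefig('params_demo.png', dpi=150, bbox_inches='tight')
plt.close()

# delay=True batches two aggregations into one pass over the data
mx = df.x.mean(delay=True)
sx = df.x.std(delay=True)
df.execute()
print(f"mean={mx.get():.3f}, std={sx.get():.3f}")
```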
## Best Practices

- Always convert CSV to HDF5 or Arrow for repeated access. One-time conversion pays for itself on the first reload: `vaex.from_csv('data.csv', convert='data.hdf5')`.
- Keep columns virtual until you must materialize. Virtual columns have zero memory cost. Materialize only when a complex expression is reused in 3+ aggregations.
- Batch aggregations with `delay=True`. Each separate aggregation call scans the entire dataset; batching delayed calls and resolving them with a single `df.execute()` reduces N passes to 1.
- Use selections instead of creating filtered DataFrames when computing statistics on multiple subsets. `df.select(df.age > 30, name='senior')` followed by `df.salary.mean(selection='senior')` is more efficient than creating `df_senior = df[df.age > 30]`.
- Avoid `.values` and `.to_pandas_df()` on large data. These load data into RAM, defeating Vaex's purpose. Use them only on small subsets or samples.
- Save pipeline state for reproducibility. `df.state_write('state.json')` captures virtual columns, selections, and ML transformers for deployment.
- Anti-pattern -- row iteration. Never iterate rows in Vaex; use vectorized expressions and aggregations instead, as in the sketch below.
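A minimal before/after sketch of the row-iteration anti-pattern referenced in the last bullet (the commented loop is what to avoid):

```python
import vaex
import numpy as np

df = vaex.from_arrays(price=np.random.uniform(1, 100, 1_000_000),
                      qty=np.random.randint(1, 10, 1_000_000))

# Anti-pattern: materializes both columns and loops in Python (slow, high RAM)
# total = sum(p * q for p, q in zip(df.price.values, df.qty.values))

# Vectorized: lazy virtual column plus a single aggregation pass
df['revenue'] = df.price * df.qty  # Virtual column, zero memory
total = df.revenue.sum()
print(f"Total revenue: {total:,.2f}")
```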
## Common Recipes

### Recipe: Multi-Source Data Consolidation
```python
import vaex

# Load from multiple sources and formats
df_csv = vaex.from_csv('data_2022.csv')
df_hdf = vaex.open('data_2023.hdf5')
df_pq = vaex.open('data_2024.parquet')

# Concatenate vertically
df_all = vaex.concat([df_csv, df_hdf, df_pq])
print(f"Combined: {len(df_all):,} rows")

# Export as unified HDF5
df_all.export_hdf5('unified_data.hdf5')
# Future: vaex.open('unified_data.hdf5')
```
### Recipe: Handling Missing Data
```python
import vaex
import numpy as np

df = vaex.from_arrays(
    x=np.array([1.0, np.nan, 3.0, np.nan, 5.0]),
    y=np.array([10, 20, 30, 40, 50]),
)

# Detect missing
missing_pct = df.x.isna().mean() * 100
print(f"Missing: {missing_pct:.0f}%")  # Missing: 40%

# Fill with a value or the column mean
df['x_filled'] = df.x.fillna(df.x.mean())

# Filter out missing
df_clean = df[df.x.notna()]
print(f"Clean rows: {len(df_clean)}")  # Clean rows: 3
```
### Recipe: Comparison Visualization with Selections
```python
import vaex
import numpy as np
import matplotlib.pyplot as plt

df = vaex.from_arrays(
    value=np.concatenate([
        np.random.normal(50, 10, 200_000),
        np.random.normal(70, 15, 200_000),
    ]),
    group=np.array(['Control'] * 200_000 + ['Treatment'] * 200_000),
)

# Named selections for efficient comparison
df.select(df.group == 'Control', name='ctrl')
df.select(df.group == 'Treatment', name='treat')

plt.figure(figsize=(10, 5))
df.plot1d(df.value, selection='ctrl', label='Control', show=False)
df.plot1d(df.value, selection='treat', label='Treatment', show=False)
plt.legend()
plt.title('Distribution Comparison')
plt.savefig('comparison.png', dpi=150, bbox_inches='tight')
plt.close()
print("Saved comparison.png")
```
## Troubleshooting
| Problem | Cause | Solution |
|---|---|---|
| CSV loading extremely slow | CSV is not memory-mappable; parsed row-by-row | Convert once: `vaex.from_csv('data.csv', convert='data.hdf5')`. Use HDF5 for all future loads |
| `MemoryError` on simple operations | Calling `.values` or `.to_pandas_df()` on the full dataset | Keep operations lazy. Use `df.head()` for inspection |
| Empty or all-white 2D plot | Axis limits don't match data range | Use `limits='minmax'` or `limits='99.7%'` instead of manual limits |
| Heatmap shows only one bright spot | Linear color scale overwhelmed by a high-density region | Use `f='log'` for logarithmic color scaling |
| Virtual column recomputes slowly in a loop | Complex expression recomputed on every access | Materialize: `df['col_mat'] = df.col.values` |
| `vaex.open` fails with cloud paths | Missing filesystem library | Install `s3fs` (S3), `gcsfs` (GCS), or `adlfs` (Azure) |
| Slow export with many virtual columns | Each virtual column recomputed during export | Materialize heavy virtual columns first, then export |
| Column shows as string when numeric expected | CSV auto-detection chose the wrong type | Cast: `df['x'] = df.x.astype('float64')` |
| `state_load` fails on new data | Column names in state don't match the new DataFrame | Ensure new data has identical column names as the training data |
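A short sketch combining two of the fixes above; the file and column names (`data.csv`, `amount`) are placeholders:

```python
import vaex

# One-time conversion fix for slow CSV loads
df = vaex.from_csv('data.csv', convert='data.hdf5')

# Fix a mis-detected column by casting to the intended dtype
df['amount'] = df.amount.astype('float64')

# Heavy virtual column: materialize before export so it is computed once
df['amount_sq'] = (df.amount ** 2).values
df.export_hdf5('data_clean.hdf5')
```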
## Bundled Resources

This entry includes two reference files in `references/`:
- `references/io_performance.md` -- Consolidated from the original `io_operations.md` (704 lines) and `performance.md` (572 lines). Covers: format-specific I/O details (HDF5 compression options, Parquet compression, Arrow integration, FITS), chunked I/O processing, cloud storage (S3/GCS/Azure with credentials), Vaex server (remote data), database integration (SQL read/write via the pandas bridge), state files (save/load pipeline state), memory management, parallel computation (multithreading, Dask integration), JIT compilation with Numba, async operations, and detailed profiling/benchmarking patterns. Relocated inline: format comparison table, CSV-to-HDF5 conversion pattern, `delay=True` batching, memory-mapped architecture explanation, basic export methods. Omitted: redundant "Best Practices" lists that duplicated content already in the main file; "Related Resources" cross-links (superseded by this Bundled Resources section).
- `references/ml_visualization.md` -- Consolidated from the original `machine_learning.md` (729 lines) and `visualization.md` (614 lines). Covers: the full ML transformer catalog (MinMaxScaler, MaxAbsScaler, RobustScaler, FrequencyEncoder, TargetEncoder, WeightOfEvidenceEncoder, CycleTransformer, Discretizer, RandomProjection), KMeans clustering, external library integration (XGBoost, LightGBM, CatBoost, Keras), cross-validation, feature selection, imbalanced data handling, and advanced visualization (contour plots, vector field overlays, interactive Jupyter widgets, faceted plots, batch plotting, Plotly/seaborn integration). Relocated inline: StandardScaler, LabelEncoder, OneHotEncoder, PCA, the scikit-learn Predictor wrapper, basic plot1d/plot/what patterns, selection-based visualization. Omitted: the model evaluation metrics section (standard sklearn metrics, not Vaex-specific -- use scikit-learn directly); "Related Resources" cross-links.
Original reference file disposition:
- `core_dataframes.md` (368 lines) -- fully consolidated into Core API Module 1 (DataFrame Creation and I/O) and Key Concepts. Combined coverage: ~65 lines inline covering all creation methods (open, from_csv, from_arrays, from_dict, from_pandas, from_arrow_table), inspection, export, and expression basics. Omitted: detailed row/column manipulation patterns (covered in Modules 2-4), copy/concat patterns (covered in Recipes).
- `data_processing.md` (556 lines) -- fully consolidated into Core API Modules 2-5. Filtering, virtual columns, expressions, aggregation, groupby, string ops, datetime ops, missing data, joining, and column management are all represented inline. Omitted: advanced binning (searchsorted patterns, statistical binning details) -- niche usage; consult the official docs.
## Related Skills
- polars-dataframes -- In-memory DataFrame library; use when data fits in RAM for 10-100x faster processing
- dask-parallel-computing -- Distributed computing for multi-node clusters and parallel pandas/NumPy
- pandas (planned) -- Standard Python DataFrame library; Vaex interoperates via `from_pandas`/`to_pandas_df` (see the sketch below)
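A minimal round-trip sketch of that pandas bridge, with toy data and `copy_index=False` as in Module 1:

```python
import pandas as pd
import vaex

pdf = pd.DataFrame({'a': [1, 2, 3], 'b': [4.0, 5.0, 6.0]})

df = vaex.from_pandas(pdf, copy_index=False)  # pandas -> Vaex
df['c'] = df.a + df.b                         # Lazy feature engineering

small = df.to_pandas_df()                     # Vaex -> pandas (loads into RAM)
print(small.head())
```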
## References
- Vaex documentation: https://vaex.io/docs/latest/
- Vaex GitHub repository: https://github.com/vaexio/vaex
- Vaex PyPI: https://pypi.org/project/vaex/
- Breddels & Veljanoski (2018), "Vaex: Big Data exploration in the era of Gaia", Astronomy & Astrophysics, https://doi.org/10.1051/0004-6361/201732493