Claude-scientific-skills vaex
Use this skill for processing and analyzing large tabular datasets (billions of rows) that exceed available RAM. Vaex excels at out-of-core DataFrame operations, lazy evaluation, fast aggregations, efficient visualization of big data, and machine learning on large datasets. Apply when users need to work with large CSV/HDF5/Arrow/Parquet files, perform fast statistics on massive datasets, create visualizations of big data, or build ML pipelines that do not fit in memory.
git clone https://github.com/K-Dense-AI/scientific-agent-skills
T=$(mktemp -d) && git clone --depth=1 https://github.com/K-Dense-AI/scientific-agent-skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/scientific-skills/vaex" ~/.claude/skills/k-dense-ai-claude-scientific-skills-vaex && rm -rf "$T"
scientific-skills/vaex/SKILL.md
Vaex
Overview
Vaex is a high-performance Python library that provides lazy, out-of-core DataFrames for processing and visualizing tabular datasets too large to fit into RAM. Its aggregations can run at up to about a billion rows per second, which makes interactive exploration and analysis of billion-row datasets practical.
When to Use This Skill
Use Vaex when:
- Processing tabular datasets larger than available RAM (gigabytes to terabytes)
- Performing fast statistical aggregations on massive datasets
- Creating visualizations and heatmaps of large datasets
- Building machine learning pipelines on big data
- Converting between data formats (CSV, HDF5, Arrow, Parquet)
- Needing lazy evaluation and virtual columns to avoid memory overhead
- Working with astronomical data, financial time series, or other large-scale scientific datasets
Core Capabilities
Vaex provides six primary capability areas, each documented in detail in the references directory:
1. DataFrames and Data Loading
Load and create Vaex DataFrames from various sources, including files (HDF5, CSV, Arrow, Parquet), pandas DataFrames, NumPy arrays, and dictionaries.
See references/core_dataframes.md for:
- Opening large files efficiently
- Converting from pandas/NumPy/Arrow
- Working with example datasets
- Understanding DataFrame structure
2. Data Processing and Manipulation
Filter rows, create virtual columns, use expressions, and aggregate data without loading everything into memory.
See references/data_processing.md for:
- Filtering and selections
- Virtual columns and expressions
- Groupby operations and aggregations
- String operations and datetime handling
- Working with missing data
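Aggregating without loading everything into memory comes down to combining partial results chunk by chunk. The following is a plain-Python sketch of that idea for a group-wise sum, under stated assumptions; `groupby_sum` is a hypothetical helper written for illustration and is not part of the Vaex API or a description of its internals.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Record = Tuple[str, float]  # (category, value)


def groupby_sum(chunks: Iterable[List[Record]]) -> Dict[str, float]:
    """Sum values per category while holding only one chunk at a time."""
    totals: Dict[str, float] = defaultdict(float)
    for chunk in chunks:
        # Aggregate inside the chunk first ...
        partial: Dict[str, float] = defaultdict(float)
        for category, value in chunk:
            partial[category] += value
        # ... then merge the small partial result into the running totals.
        for category, subtotal in partial.items():
            totals[category] += subtotal
    return dict(totals)


# Three small chunks stand in for a file that is too large to load at once.
chunks = [
    [("a", 1.0), ("b", 2.0)],
    [("a", 3.0)],
    [("c", 4.0), ("b", 0.5)],
]
totals = groupby_sum(chunks)
```

Memory use depends on the chunk size and the number of distinct categories, not on the total number of rows.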
3. Performance and Optimization
Take advantage of Vaex's lazy evaluation, caching strategies, and memory-efficient operations.
See references/performance.md for:
- Understanding lazy evaluation
- Using delay=True for batching operations
- Materializing columns when needed
- Caching strategies
- Asynchronous operations
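The idea behind lazy evaluation and batching can be pictured without Vaex: aggregations are registered first, and the data is scanned only once when execution is requested. The following is a plain-Python sketch of that idea; `DeferredScan`, `schedule`, and `execute` are hypothetical names chosen for illustration and do not reflect how Vaex is implemented.

```python
from typing import Any, Callable, Dict, Iterable, List

Chunk = List[float]
Update = Callable[[Any, Chunk], Any]


class DeferredScan:
    """Toy model of deferred aggregations: register first, scan once later."""

    def __init__(self, chunks: Iterable[Chunk]) -> None:
        self._chunks = chunks
        self._initial: Dict[str, Any] = {}
        self._updates: Dict[str, Update] = {}
        self.passes = 0  # number of times the data has been scanned

    def schedule(self, name: str, initial: Any, update: Update) -> None:
        """Register an aggregation; nothing is computed at this point."""
        self._initial[name] = initial
        self._updates[name] = update

    def execute(self) -> Dict[str, Any]:
        """Scan the data once, feeding each chunk to every registered task."""
        self.passes += 1
        state = dict(self._initial)
        for chunk in self._chunks:
            for name, update in self._updates.items():
                state[name] = update(state[name], chunk)
        return state


data = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0]]
scan = DeferredScan(data)
scan.schedule("total", 0.0, lambda acc, chunk: acc + sum(chunk))
scan.schedule("count", 0, lambda acc, chunk: acc + len(chunk))
scan.schedule("largest", float("-inf"), lambda acc, chunk: max(acc, max(chunk)))

results = scan.execute()  # one pass serves all three aggregations
mean = results["total"] / results["count"]
```

Three aggregations are answered by a single pass here, which is the saving that batching aims for when reading the data is the expensive step.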
4. Data Visualization
Create interactive visualizations of large datasets, including heatmaps, histograms, and scatter plots.
See references/visualization.md for:
- Creating 1D and 2D plots
- Heatmap visualizations
- Working with selections
- Customizing plots and subplots
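Plots of very large datasets stay cheap when the rows are first reduced to counts on a fixed grid, so that the plotting step only sees the grid and never the individual rows. A plain-Python sketch of that reduction follows; `bin_counts_2d` is a hypothetical helper written for illustration, not Vaex's implementation.

```python
from typing import Iterable, List, Tuple

Point = Tuple[float, float]


def bin_counts_2d(
    chunks: Iterable[List[Point]],
    x_limits: Tuple[float, float],
    y_limits: Tuple[float, float],
    shape: Tuple[int, int],
) -> List[List[int]]:
    """Count points per grid cell, processing one chunk at a time."""
    nx, ny = shape
    x_min, x_max = x_limits
    y_min, y_max = y_limits
    grid = [[0] * ny for _ in range(nx)]
    for chunk in chunks:
        for x, y in chunk:
            # Points outside the requested limits are ignored.
            if not (x_min <= x < x_max and y_min <= y < y_max):
                continue
            # Map the coordinate to a cell index; clamp to guard against
            # floating-point rounding at the upper edge.
            i = min(int((x - x_min) / (x_max - x_min) * nx), nx - 1)
            j = min(int((y - y_min) / (y_max - y_min) * ny), ny - 1)
            grid[i][j] += 1
    return grid


chunks = [
    [(0.1, 0.1), (0.9, 0.9)],
    [(0.1, 0.2), (1.5, 0.5)],  # the last point lies outside the limits
]
grid = bin_counts_2d(chunks, (0.0, 1.0), (0.0, 1.0), shape=(2, 2))
```

The resulting grid has a fixed size set by `shape`, regardless of how many rows were scanned, and is what a heatmap would render.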
5. Machine Learning Integration
Build ML pipelines with transformers and encoders, and integrate with scikit-learn, XGBoost, and other frameworks.
See references/machine_learning.md for:
- Feature scaling and encoding
- PCA and dimensionality reduction
- K-means clustering
- Integration with scikit-learn/XGBoost/CatBoost
- Model serialization and deployment
6. I/O Operations
Read and write data efficiently in a variety of formats.
See references/io_operations.md for:
- File format recommendations
- Export strategies
- Working with Apache Arrow
- CSV handling for large files
- Server and remote data access
Quick Start Pattern
For most Vaex tasks, follow this pattern:
    import vaex

    # 1. Open or create DataFrame
    df = vaex.open('large_file.hdf5')  # or .csv, .arrow, .parquet
    # OR
    df = vaex.from_pandas(pandas_df)

    # 2. Explore the data
    print(df)  # Shows first/last rows and column info
    df.describe()  # Statistical summary

    # 3. Create virtual columns (no memory overhead)
    df['new_column'] = df.x ** 2 + df.y

    # 4. Filter with selections
    df_filtered = df[df.age > 25]

    # 5. Compute statistics (fast, lazy evaluation)
    mean_val = df.x.mean()
    stats = df.groupby('category').agg({'value': 'sum'})

    # 6. Visualize
    df.plot1d(df.x, limits=[0, 100])
    df.plot(df.x, df.y, limits='99.7%')

    # 7. Export if needed
    df.export_hdf5('output.hdf5')
Working with References
The reference files contain detailed information about each capability area. Load references into context based on the specific task:
- Basic operations: Start with references/core_dataframes.md and references/data_processing.md
- Performance issues: Check references/performance.md
- Visualization tasks: Use references/visualization.md
- ML pipelines: Reference references/machine_learning.md
- File I/O: Consult references/io_operations.md
Best Practices
- Use HDF5 or Apache Arrow formats for optimal performance with large datasets
- Leverage virtual columns instead of materializing data to save memory
- Batch operations using delay=True when performing multiple calculations
- Export to efficient formats rather than keeping data in CSV
- Use expressions for complex calculations without intermediate storage
- Profile with df.stat() to understand memory usage and optimize operations
Common Patterns
Pattern: Converting Large CSV to HDF5
    import vaex

    # Read the CSV in chunks and convert it to HDF5 in one step.
    # Without convert, from_csv reads the whole file into memory via pandas.
    df = vaex.from_csv('large_file.csv', convert='large_file.hdf5', chunk_size=5_000_000)

    # Future loads memory-map the HDF5 file and are near-instant
    df = vaex.open('large_file.hdf5')
Pattern: Efficient Aggregations
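    # Use delay=True to batch multiple operations; each call returns a promise
    mean_x = df.x.mean(delay=True)
    std_y = df.y.std(delay=True)
    sum_z = df.z.sum(delay=True)

    # Execute all scheduled computations in a single pass over the data
    df.execute()

    # Retrieve the computed values from the promises
    results = [mean_x.get(), std_y.get(), sum_z.get()]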
Pattern: Virtual Columns for Feature Engineering
    # No memory overhead - computed on the fly
    df['age_squared'] = df.age ** 2
    df['full_name'] = df.first_name + ' ' + df.last_name
    df['is_adult'] = df.age >= 18
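A virtual column can be thought of as a stored expression that is evaluated only when its values are requested. The sketch below models that in plain Python; `VirtualColumnTable`, `add_virtual`, and `evaluate` are hypothetical names chosen for illustration and say nothing about how Vaex implements the feature.

```python
from typing import Any, Callable, Dict, Iterator, List

Row = Dict[str, Any]
Expression = Callable[[Row], Any]


class VirtualColumnTable:
    """Toy table whose virtual columns are expressions, not stored values."""

    def __init__(self, columns: Dict[str, List[Any]]) -> None:
        self.columns = columns                    # data actually held in memory
        self.virtual: Dict[str, Expression] = {}  # name -> expression only

    def add_virtual(self, name: str, expression: Expression) -> None:
        """Record the expression; no new values are computed or stored."""
        self.virtual[name] = expression

    def _rows(self) -> Iterator[Row]:
        names = list(self.columns)
        for values in zip(*(self.columns[name] for name in names)):
            yield dict(zip(names, values))

    def evaluate(self, name: str) -> Iterator[Any]:
        """Yield values one at a time, computing virtual columns on the fly."""
        if name in self.columns:
            yield from self.columns[name]
        else:
            expression = self.virtual[name]
            for row in self._rows():
                yield expression(row)


table = VirtualColumnTable({"age": [15, 30, 42]})
table.add_virtual("age_squared", lambda row: row["age"] ** 2)
table.add_virtual("is_adult", lambda row: row["age"] >= 18)

stored = sorted(table.columns)  # only the real column is stored
squares = list(table.evaluate("age_squared"))
adults = list(table.evaluate("is_adult"))
```

Adding the two virtual columns leaves the stored data unchanged; values exist only while `evaluate` is being consumed.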
Resources
This skill includes reference documentation in the references/ directory:
- core_dataframes.md: DataFrame creation, loading, and basic structure
- data_processing.md: Filtering, expressions, aggregations, and transformations
- performance.md: Optimization strategies and lazy evaluation
- visualization.md: Plotting and interactive visualizations
- machine_learning.md: ML pipelines and model integration
- io_operations.md: File formats and data import/export