AlterLab-Academic-Skills alterlab-polars
Part of the AlterLab Academic Skills suite. Fast in-memory DataFrame library for datasets that fit in RAM. Use when pandas is too slow but data still fits in memory. Lazy evaluation, parallel execution, Apache Arrow backend. Best for 1-100GB datasets, ETL pipelines, faster pandas replacement. For larger-than-RAM data use dask or vaex.
git clone https://github.com/AlterLab-IEU/AlterLab-Academic-Skills
T=$(mktemp -d) && git clone --depth=1 https://github.com/AlterLab-IEU/AlterLab-Academic-Skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data-science/alterlab-polars" ~/.claude/skills/alterlab-ieu-alterlab-academic-skills-alterlab-polars && rm -rf "$T"
skills/data-science/alterlab-polars/SKILL.md
Polars
Overview
Polars is a fast DataFrame library for Python and Rust built on Apache Arrow. This skill covers its expression-based API, lazy evaluation, and data manipulation operations for efficient data processing, pandas migration, and data pipeline optimization.
Quick Start
Installation and Basic Usage
Install Polars:
uv pip install polars
Basic DataFrame creation and operations:
    import polars as pl

    # Create DataFrame
    df = pl.DataFrame({
        "name": ["Alice", "Bob", "Charlie"],
        "age": [25, 30, 35],
        "city": ["NY", "LA", "SF"]
    })

    # Select columns
    df.select("name", "age")

    # Filter rows
    df.filter(pl.col("age") > 25)

    # Add computed columns
    df.with_columns(
        age_plus_10=pl.col("age") + 10
    )
Core Concepts
Expressions
Expressions are the fundamental building blocks of Polars operations. They describe transformations on data and can be composed, reused, and optimized.
Key principles:
- Use pl.col("column_name") to reference columns
- Chain methods to build complex transformations
- Expressions are lazy and only execute within contexts (select, with_columns, filter, group_by)
Example:
    # Expression-based computation
    df.select(
        pl.col("name"),
        (pl.col("age") * 12).alias("age_in_months")
    )
Lazy vs Eager Evaluation
Eager (DataFrame): Operations execute immediately
    df = pl.read_csv("file.csv")  # Reads immediately
    result = df.filter(pl.col("age") > 25)  # Executes immediately
Lazy (LazyFrame): Operations build a query plan, optimized before execution
    lf = pl.scan_csv("file.csv")  # Doesn't read yet
    result = lf.filter(pl.col("age") > 25).select("name", "age")
    df = result.collect()  # Now executes optimized query
When to use lazy:
- Working with large datasets
- Complex query pipelines
- When only some columns/rows are needed
- Performance is critical
Benefits of lazy evaluation:
- Automatic query optimization
- Predicate pushdown
- Projection pushdown
- Parallel execution
For detailed concepts, load references/core_concepts.md.
Common Operations
Select
Select and manipulate columns:
    # Select specific columns
    df.select("name", "age")

    # Select with expressions
    df.select(
        pl.col("name"),
        (pl.col("age") * 2).alias("double_age")
    )

    # Select all columns matching a pattern
    df.select(pl.col("^.*_id$"))
Filter
Filter rows by conditions:
    # Single condition
    df.filter(pl.col("age") > 25)

    # Multiple conditions (cleaner than using &)
    df.filter(
        pl.col("age") > 25,
        pl.col("city") == "NY"
    )

    # Complex conditions
    df.filter(
        (pl.col("age") > 25) | (pl.col("city") == "LA")
    )
With Columns
Add or modify columns while preserving existing ones:
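    # Add new columns
    df.with_columns(
        age_plus_10=pl.col("age") + 10,
        name_upper=pl.col("name").str.to_uppercase()
    )

    # Parallel computation (all columns computed in parallel)
    # Each expression needs a distinct output name; two outputs both named "value" raise a duplicate-column error
    df.with_columns(
        (pl.col("value") * 10).alias("value_x10"),
        (pl.col("value") * 100).alias("value_x100"),
    )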
Group By and Aggregations
Group data and compute aggregations:
    # Basic grouping
    df.group_by("city").agg(
        pl.col("age").mean().alias("avg_age"),
        pl.len().alias("count")
    )

    # Multiple group keys
    df.group_by("city", "department").agg(
        pl.col("salary").sum()
    )

    # Conditional aggregations
    df.group_by("city").agg(
        (pl.col("age") > 30).sum().alias("over_30")
    )
For detailed operation patterns, load references/operations.md.
Aggregations and Window Functions
Aggregation Functions
Common aggregations within group_by context:
- pl.len() - count rows
- pl.col("x").sum() - sum values
- pl.col("x").mean() - average
- pl.col("x").min() / pl.col("x").max() - extremes
- pl.first() / pl.last() - first/last values
Window Functions with over()
Apply aggregations while preserving row count:
    # Add group statistics to each row
    df.with_columns(
        avg_age_by_city=pl.col("age").mean().over("city"),
        rank_in_city=pl.col("salary").rank().over("city")
    )

    # Multiple grouping columns
    df.with_columns(
        group_avg=pl.col("value").mean().over("category", "region")
    )
Mapping strategies:
- group_to_rows (default): Preserves original row order
- explode: Faster but groups rows together
- join: Creates list columns
Data I/O
Supported Formats
Polars supports reading and writing:
- CSV, Parquet, JSON, Excel
- Databases (via connectors)
- Cloud storage (S3, Azure, GCS)
- Google BigQuery
- Multiple/partitioned files
Common I/O Operations
CSV:
    # Eager
    df = pl.read_csv("file.csv")
    df.write_csv("output.csv")

    # Lazy (preferred for large files)
    lf = pl.scan_csv("file.csv")
    result = lf.filter(...).select(...).collect()
Parquet (recommended for performance):
    df = pl.read_parquet("file.parquet")
    df.write_parquet("output.parquet")
JSON:
    df = pl.read_json("file.json")
    df.write_json("output.json")
For comprehensive I/O documentation, load references/io_guide.md.
Transformations
Joins
Combine DataFrames:
    # Inner join
    df1.join(df2, on="id", how="inner")

    # Left join
    df1.join(df2, on="id", how="left")

    # Join on different column names
    df1.join(df2, left_on="user_id", right_on="id")
Concatenation
Stack DataFrames:
    # Vertical (stack rows)
    pl.concat([df1, df2], how="vertical")

    # Horizontal (add columns)
    pl.concat([df1, df2], how="horizontal")

    # Diagonal (union with different schemas)
    pl.concat([df1, df2], how="diagonal")
Pivot and Unpivot
Reshape data:
    # Pivot (wide format); Polars 1.0 renamed the columns= argument to on=
    df.pivot(on="product", index="date", values="sales")

    # Unpivot (long format)
    df.unpivot(index="id", on=["col1", "col2"])
For detailed transformation examples, load references/transformations.md.
Pandas Migration
Polars offers significant performance improvements over pandas with a cleaner API. Key differences:
Conceptual Differences
- No index: Polars uses integer positions only
- Strict typing: No silent type conversions
- Lazy evaluation: Available via LazyFrame
- Parallel by default: Operations parallelized automatically
Common Operation Mappings
| Operation | Pandas | Polars |
|---|---|---|
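| Select column | df["col"] | df.select("col") |
| Filter | df[df["col"] > 10] | df.filter(pl.col("col") > 10) |
| Add column | df.assign(x=...) | df.with_columns(x=...) |
| Group by | df.groupby("col").agg(...) | df.group_by("col").agg(...) |
| Window | df.groupby("col")["x"].transform("mean") | df.with_columns(pl.col("x").mean().over("col")) |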
Key Syntax Patterns
Pandas sequential (slow):
    df.assign(
        col_a=lambda df_: df_.value * 10,
        col_b=lambda df_: df_.value * 100
    )
Polars parallel (fast):
    df.with_columns(
        col_a=pl.col("value") * 10,
        col_b=pl.col("value") * 100,
    )
For a comprehensive migration guide, load references/pandas_migration.md.
Best Practices
Performance Optimization
- Use lazy evaluation for large datasets:
      lf = pl.scan_csv("large.csv")  # Don't use read_csv
      result = lf.filter(...).select(...).collect()
- Avoid Python functions in hot paths:
  - Stay within expression API for parallelization
  - Use .map_elements() only when necessary
  - Prefer native Polars operations
- Use streaming for very large data:
      lf.collect(engine="streaming")  # older releases: lf.collect(streaming=True)
- Select only needed columns early:
      # Good: Select columns early
      lf.select("col1", "col2").filter(...)
      # Bad: Filter on all columns first
      lf.filter(...).select("col1", "col2")
- Use appropriate data types:
  - Categorical for low-cardinality strings
  - Appropriate integer sizes (Int32 vs Int64)
  - Date types for temporal data
Expression Patterns
Conditional operations:
pl.when(condition).then(value).otherwise(other_value)
Column operations across multiple columns:
df.select(pl.col("^.*_value$") * 2) # Regex pattern
Null handling:
    pl.col("x").fill_null(0)
    pl.col("x").is_null()
    pl.col("x").drop_nulls()
For additional best practices and patterns, load references/best_practices.md.
Resources
This skill includes comprehensive reference documentation:
references/
- core_concepts.md - Detailed explanations of expressions, lazy evaluation, and type system
- operations.md - Comprehensive guide to all common operations with examples
- pandas_migration.md - Complete migration guide from pandas to Polars
- io_guide.md - Data I/O operations for all supported formats
- transformations.md - Joins, concatenation, pivots, and reshaping operations
- best_practices.md - Performance optimization tips and common patterns
Load these references as needed when users require detailed information about specific topics.