Awesome-Agent-Skills-for-Empirical-Research — polars

Clone the full repository:

```shell
git clone https://github.com/brycewang-stanford/Awesome-Agent-Skills-for-Empirical-Research
```

Or copy only this skill into `~/.claude/skills`:

```shell
T=$(mktemp -d) && git clone --depth=1 https://github.com/brycewang-stanford/Awesome-Agent-Skills-for-Empirical-Research "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/17-DAAF-Contribution-Community-daaf/dot-claude/skills/polars" ~/.claude/skills/brycewang-stanford-awesome-agent-skills-for-empirical-research-polars && rm -rf "$T"
```

Source: `skills/17-DAAF-Contribution-Community-daaf/dot-claude/skills/polars/SKILL.md`

Polars Skill
Polars DataFrame library for high-performance data manipulation in Python. Covers lazy/eager execution, expressions, I/O (CSV, Parquet, JSON, database), aggregations, joins, string/datetime operations, pandas/NumPy interop, and performance optimization. Use when working with Polars DataFrames, migrating from pandas, reading Parquet files, or optimizing data pipeline performance.
Comprehensive skill for high-performance data manipulation with Polars. Use decision trees below to find the right guidance, then load detailed references.
What is Polars?
Polars is a fast DataFrame library for Python (and Rust):
- Fast: Written in Rust, optimized for modern CPUs with SIMD and parallelism
- Lazy Evaluation: Build query plans that get optimized before execution
- Expressive: Powerful expression API for complex transformations
- Memory Efficient: Columnar format, streaming for larger-than-memory data
- No Dependencies: Pure Rust core, no NumPy/Pandas required
Version Notes
This skill targets Polars 1.x (tested with 1.37.1). Key changes from 0.x:

- `apply` renamed to `map_elements` (0.19+)
- `groupby` renamed to `group_by` (0.19+)
- `melt` renamed to `unpivot` (1.0+)
- Streaming engine improvements in 1.x
- `pl.Utf8` is now `pl.String` (1.0+, `Utf8` still works as an alias)
How to Use This Skill
Reference File Structure
Each topic in `./references/` contains focused documentation:
| File | Purpose | When to Read |
|---|---|---|
| `quickstart.md` | Installation, concepts, first DataFrame | Starting with Polars |
| `dataframes-series.md` | Creation, selection, filtering, modification | Basic data manipulation |
| `io-data.md` | CSV, Parquet, JSON, database I/O | Loading/saving data |
| `expressions.md` | Expression system, contexts, chaining | Understanding Polars idioms |
| `aggregations-grouping.md` | GroupBy, window functions, statistics | Summarizing data |
| `joins-concat.md` | Joins, concatenation, pivot/unpivot | Combining DataFrames |
| `strings-datetime-categorical.md` | String ops, datetime, categoricals | Type-specific operations |
| `performance.md` | Lazy execution, optimization, anti-patterns | Making code faster |
| `interop.md` | Pandas, NumPy, PyArrow, DuckDB | Working with other tools |
| `gotchas.md` | Common errors, anti-patterns, migration | Debugging issues |
Reading Order
- New to Polars? Start with `quickstart.md`, then `expressions.md`
- Coming from Pandas? Read `quickstart.md`, `expressions.md`, then `interop.md`
- Performance issues? Check `performance.md` first
Quick Decision Trees
"I need to get started"
```
Getting started?
├─ Install Polars → ./references/quickstart.md
├─ Create first DataFrame → ./references/quickstart.md
├─ Understand lazy vs eager → ./references/quickstart.md
├─ Learn expression syntax → ./references/expressions.md
└─ Coming from Pandas → ./references/interop.md
```
"I need to load or save data"
```
Loading/saving data?
├─ Read CSV file → ./references/io-data.md
├─ Read Parquet (recommended) → ./references/io-data.md
├─ Read JSON/NDJSON → ./references/io-data.md
├─ Read from database → ./references/io-data.md
├─ Read multiple files (glob) → ./references/io-data.md
├─ Write to file → ./references/io-data.md
└─ Larger-than-memory data → ./references/performance.md
```
"I need to filter or select data"
```
Filtering/selecting?
├─ Select columns by name → ./references/dataframes-series.md
├─ Select by pattern/regex → ./references/dataframes-series.md
├─ Select by data type → ./references/dataframes-series.md
├─ Filter rows by condition → ./references/dataframes-series.md
├─ Filter with multiple conditions → ./references/dataframes-series.md
├─ Handle null values → ./references/dataframes-series.md
└─ Add/modify columns → ./references/dataframes-series.md
```
"I need to aggregate or group data"
```
Aggregating data?
├─ Basic statistics (sum, mean, etc.) → ./references/aggregations-grouping.md
├─ Group by columns → ./references/aggregations-grouping.md
├─ Multiple aggregations → ./references/aggregations-grouping.md
├─ Window functions (over) → ./references/aggregations-grouping.md
├─ Rolling/moving averages → ./references/aggregations-grouping.md
├─ Cumulative operations → ./references/aggregations-grouping.md
└─ Ranking within groups → ./references/aggregations-grouping.md
```
"I need to combine DataFrames"
```
Combining data?
├─ Join two DataFrames → ./references/joins-concat.md
├─ Left/right/outer join → ./references/joins-concat.md
├─ Anti-join (not in) → ./references/joins-concat.md
├─ Concatenate vertically → ./references/joins-concat.md
├─ Pivot (long to wide) → ./references/joins-concat.md
└─ Unpivot/melt (wide to long) → ./references/joins-concat.md
```
"I need better performance"
```
Performance issues?
├─ Use lazy evaluation → ./references/performance.md
├─ Avoid row iteration → ./references/performance.md
├─ Reduce memory usage → ./references/performance.md
├─ Process large files → ./references/performance.md
├─ Optimize query plan → ./references/performance.md
└─ Common anti-patterns → ./references/performance.md
```
"Something isn't working"
```
Having issues?
├─ Type errors → ./references/gotchas.md
├─ Null handling → ./references/gotchas.md
├─ Expression context errors → ./references/gotchas.md
├─ String operations → ./references/strings-datetime-categorical.md
├─ Date parsing issues → ./references/strings-datetime-categorical.md
├─ Performance problems → ./references/gotchas.md
├─ Pandas migration issues → ./references/gotchas.md
├─ Memory errors → ./references/gotchas.md
└─ General troubleshooting → ./references/gotchas.md
```
File-First Execution in Research Workflows
Important: In data research pipelines (see `CLAUDE.md`), Polars transformations are executed through script files, not interactively. This ensures auditability and reproducibility.
The pattern:
- Write transformation code to `scripts/stage{N}_{type}/{step}_{task-name}.py`
- Execute via Bash with the automatic output-capture wrapper script
- Validation results are automatically embedded in the script as comments
- If a step fails, create a versioned copy of the script for fixes
Read `agent_reference/SCRIPT_EXECUTION_REFERENCE.md` closely for the mandatory file-first execution protocol, covering complete code-file writing, output capture, and file-versioning rules.
See: `agent_reference/SCRIPT_EXECUTION_REFERENCE.md` — Script execution protocol and format with validation
The examples below show Polars syntax. In research workflows, wrap them in scripts following the file-first pattern.
Quick Reference
Essential Import
```python
import polars as pl
import polars.selectors as cs  # For column selection by type
```
Lazy vs Eager (One-Liner)
```python
# Eager: immediate execution
df = pl.read_csv("data.csv")

# Lazy: deferred, optimized execution (preferred for large data)
lf = pl.scan_csv("data.csv")
df = lf.collect()  # Execute when ready
```
Core Expression Patterns
```python
# Select columns
df.select("a", "b")
df.select(pl.col("a"), pl.col("b"))
df.select(pl.all().exclude("id"))

# Filter rows
df.filter(pl.col("a") > 10)
df.filter((pl.col("a") > 10) & (pl.col("b") == "x"))

# Add/modify columns
df.with_columns(
    (pl.col("a") * 2).alias("a_doubled"),
    pl.col("b").str.to_uppercase().alias("b_upper")
)

# Conditional column
df.with_columns(
    pl.when(pl.col("a") > 10)
    .then(pl.lit("high"))
    .otherwise(pl.lit("low"))
    .alias("category")
)

# Group and aggregate
df.group_by("category").agg(
    pl.col("value").sum().alias("total"),
    pl.col("value").mean().alias("average"),
    pl.len().alias("count")
)
```
Essential Functions
| Function | Purpose |
|---|---|
| `pl.col("name")` | Reference a column |
| `pl.lit(value)` | Literal value |
| `pl.all()` | All columns |
| `pl.all().exclude("col")` | All except specified |
| `pl.len()` | Row count |
| `pl.when().then().otherwise()` | Conditional logic |
| `.alias("name")` | Rename result |
| `.cast(dtype)` | Convert type |
Common Data Types
| Type | Description |
|---|---|
| `Int8`–`Int64`, `UInt8`–`UInt64` | Integers |
| `Float32`, `Float64` | Floats |
| `String` (or `Utf8`) | Strings |
| `Boolean` | True/False |
| `Date`, `Datetime` | Dates and timestamps |
| `Duration` | Time differences |
| `Categorical` | Categorical strings |
| `List` | List of values |
| `Struct` | Named fields |
Quick Cheatsheet
```python
# I/O
df = pl.read_csv("file")     # also: read_parquet, read_json
lf = pl.scan_csv("file")     # Lazy; also: scan_parquet, scan_ndjson
df.write_csv("file")         # also: write_parquet, write_json

# Selection
df.select("a", "b")
df.select(cs.numeric())      # By type

# Filtering
df.filter(pl.col("a") > 1)

# Aggregation
df.group_by("key").agg(pl.col("val").sum())

# Joining
df1.join(df2, on="key", how="left")

# Sorting
df.sort("col", descending=True)

# Lazy execution
lf.collect()   # Run query
lf.explain()   # Show plan
```
Topic Index
| Topic | Reference File |
|---|---|
| Installation | `quickstart.md` |
| DataFrame Creation | `dataframes-series.md` |
| Lazy vs Eager | `quickstart.md` |
| Column Selection | `dataframes-series.md` |
| Row Filtering | `dataframes-series.md` |
| Adding Columns | `dataframes-series.md` |
| CSV Files | `io-data.md` |
| Parquet Files | `io-data.md` |
| Database Connections | `io-data.md` |
| Expressions | `expressions.md` |
| Method Chaining | `expressions.md` |
| Contexts | `expressions.md` |
| GroupBy | `aggregations-grouping.md` |
| Window Functions | `aggregations-grouping.md` |
| Rolling Windows | `aggregations-grouping.md` |
| Joins | `joins-concat.md` |
| Concatenation | `joins-concat.md` |
| Pivot/Unpivot | `joins-concat.md` |
| String Operations | `strings-datetime-categorical.md` |
| Datetime Handling | `strings-datetime-categorical.md` |
| Categorical Data | `strings-datetime-categorical.md` |
| Query Optimization | `performance.md` |
| Memory Management | `performance.md` |
| Anti-Patterns | `performance.md` |
| Pandas Conversion | `interop.md` |
| NumPy Integration | `interop.md` |
| DuckDB Integration | `interop.md` |
| Type Errors | `gotchas.md` |
| qcut Label Gotcha | `gotchas.md` |
| Null Handling Issues | `gotchas.md` |
| Expression Context Errors | `gotchas.md` |
| Performance Anti-Patterns | `gotchas.md` |
| Migration from Pandas | `gotchas.md` |
| Memory Issues | `gotchas.md` |
Citation
When this library is used as a primary analytical tool, include in the report's Software & Tools references:
Vink, R. et al. Polars: Blazingly fast DataFrames [Computer software]. https://pola.rs/
Cite when: Polars is the core data-processing engine for the analysis (typically always true in DAAF pipelines). Do not cite when: Polars is only used for trivial file I/O in a script that primarily relies on another tool.