install
source · Clone the upstream repo
git clone https://github.com/ComeOnOliver/skillshub
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/ComeOnOliver/skillshub "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/TerminalSkills/skills/pandas" ~/.claude/skills/comeonoliver-skillshub-pandas && rm -rf "$T"
manifest: `skills/TerminalSkills/skills/pandas/SKILL.md`
Pandas
Overview
Pandas is a Python library for loading, cleaning, transforming, and analyzing tabular data. It provides DataFrames for structured data manipulation, supports CSV, Excel, SQL, JSON, and Parquet formats, and offers powerful groupby aggregation, merge/join operations, time series resampling, and method chaining for building analysis pipelines.
Instructions
- When loading data, use `pd.read_parquet()` for large datasets (faster, smaller, type-preserving), `pd.read_csv()` with explicit `dtype` for CSVs, and `pd.read_sql()` for database queries.
- When cleaning data, handle missing values with `fillna()` or `dropna()`, deduplicate with `drop_duplicates()`, use string methods (`.str.strip()`, `.str.lower()`) for text cleaning, and convert types explicitly with `astype()` and `pd.to_datetime()`.
- When transforming data, use `assign()` for computed columns, `pipe()` for method chaining, `melt()` and `pivot_table()` for reshaping, and `pd.cut()`/`pd.qcut()` for binning.
- When aggregating, use `groupby().agg()` with named aggregation for readable column names, `transform()` to broadcast results back to the original shape, and `resample()` for time-based grouping.
- When merging, use `pd.merge()` with explicit `how` and `validate` parameters to catch data quality issues at merge time, and `pd.concat()` for stacking DataFrames.
- When optimizing performance, use the `category` dtype for low-cardinality strings, prefer vectorized operations over `.apply()`, and store data as Parquet; for datasets over 10 GB, consider Polars or DuckDB.
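The `transform()` broadcast and `pd.cut()` binning mentioned above can be sketched as follows; the DataFrame and column names are made up for illustration and are not part of the skill itself:

```python
import pandas as pd

# Illustrative data: per-order amounts across two stores.
df = pd.DataFrame(
    {
        "store": ["a", "a", "b", "b"],
        "amount": [10.0, 30.0, 5.0, 15.0],
    }
)

# transform() broadcasts the group mean back to the original shape,
# so each row can be compared against its own group's aggregate.
df["store_mean"] = df.groupby("store")["amount"].transform("mean")
df["above_mean"] = df["amount"] > df["store_mean"]

# pd.cut() bins a continuous column into labeled ranges.
df["size"] = pd.cut(
    df["amount"], bins=[0, 10, 20, 40], labels=["small", "mid", "large"]
)
print(df)
```

Unlike `agg()`, which collapses each group to one row, `transform()` keeps the original index, which is what makes the row-level comparison possible.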
Examples
Example 1: Clean and analyze a sales dataset
User request: "Load a messy CSV of sales data, clean it, and generate monthly revenue summaries"
Actions:
- Load with `pd.read_csv()`, specifying `dtype` and `parse_dates` for key columns
- Clean missing values, deduplicate by order ID, and standardize text fields
- Add computed columns for revenue and profit margin using `assign()`
- Group by month with `resample("M").agg()` for revenue, order count, and average order value
Output: A clean DataFrame with monthly revenue summaries ready for visualization or reporting.
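These steps can be sketched roughly as below; a tiny in-memory DataFrame stands in for the messy CSV, and all column names are hypothetical (note `"MS"` is used here for month-start buckets):

```python
import pandas as pd

# Stand-in for a messy sales CSV; order A1 is duplicated on purpose.
raw = pd.DataFrame(
    {
        "order_id": ["A1", "A1", "A2", "A3"],
        "order_date": ["2024-01-05", "2024-01-05", "2024-01-20", "2024-02-02"],
        "region": [" North ", " North ", "south", "SOUTH"],
        "quantity": [2, 2, 1, 5],
        "unit_price": [10.0, 10.0, 99.0, 4.0],
    }
)

monthly = (
    raw.drop_duplicates(subset="order_id")          # dedupe by order ID
    .assign(
        order_date=lambda d: pd.to_datetime(d["order_date"]),
        region=lambda d: d["region"].str.strip().str.lower(),
        revenue=lambda d: d["quantity"] * d["unit_price"],
    )
    .set_index("order_date")
    .resample("MS")                                 # monthly buckets
    .agg(total_revenue=("revenue", "sum"), orders=("order_id", "count"))
)
print(monthly)
```

The named-aggregation form of `.agg()` keeps the output columns self-documenting, as the Guidelines below recommend.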
Example 2: Merge and enrich customer data from multiple sources
User request: "Join customer data from CRM, transactions, and support tickets into a single view"
Actions:
- Load each dataset and standardize key columns (email, customer ID)
- Merge CRM and transactions with `pd.merge(on="customer_id", how="left", validate="one_to_many")`
- Aggregate support tickets per customer and merge counts
- Export the enriched dataset to Parquet for downstream analysis
Output: A unified customer DataFrame with CRM info, transaction history, and support metrics.
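A minimal sketch of the merge-and-enrich flow, with made-up stand-ins for the three sources:

```python
import pandas as pd

# Hypothetical stand-ins for CRM, transactions, and support tickets.
crm = pd.DataFrame({"customer_id": [1, 2], "email": ["a@x.com", "b@x.com"]})
txns = pd.DataFrame({"customer_id": [1, 1, 2], "amount": [50.0, 25.0, 10.0]})
tickets = pd.DataFrame({"customer_id": [1, 3], "ticket_id": [100, 101]})

# Left-join transactions onto CRM; validate raises MergeError if
# customer_id is unexpectedly duplicated on the CRM (left) side.
enriched = pd.merge(crm, txns, on="customer_id", how="left",
                    validate="one_to_many")

# Aggregate tickets per customer, then merge the counts in;
# customers with no tickets get 0 instead of NaN.
ticket_counts = (
    tickets.groupby("customer_id").size().rename("n_tickets").reset_index()
)
enriched = (
    enriched.merge(ticket_counts, on="customer_id", how="left")
    .fillna({"n_tickets": 0})
)
print(enriched)
```

Customer 3 appears only in the tickets source and is dropped by the left joins, which is the intended behavior when CRM is the spine of the view.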
Guidelines
- Use Parquet (`pd.read_parquet()` / `to_parquet()`) for intermediate and output files since it is faster, smaller, and preserves types.
- Chain transformations with `.pipe()` for readable and testable code.
- Use named aggregation in `.agg()` for self-documenting column names.
- Set `dtype` explicitly on `read_csv()` for large files to avoid slow and error-prone type inference.
- Use the `category` dtype for columns with fewer than 1000 unique values for significant memory savings.
- Validate merges with `validate="one_to_many"` (or the appropriate cardinality) to catch data quality issues at merge time.
- Use `query()` for complex filters instead of chained boolean indexing for better readability.
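Two of these guidelines can be illustrated on made-up data; the column names and sizes here are arbitrary:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "region": ["north", "south", "north", "south"] * 250,
        "revenue": range(1000),
        "year": [2023, 2023, 2024, 2024] * 250,
    }
)

# query() reads much closer to the filter's intent than the equivalent
# chained boolean masks: df[(df.region == "north") & (df.year == 2024) ...]
recent_north = df.query("region == 'north' and year == 2024 and revenue > 500")

# category dtype: same values, a fraction of the object-dtype memory
# when cardinality is low (two distinct regions here).
obj_bytes = df["region"].memory_usage(deep=True)
cat_bytes = df["region"].astype("category").memory_usage(deep=True)
print(len(recent_north), obj_bytes, cat_bytes)
```

The memory saving comes from storing each distinct string once and replacing the column with small integer codes.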