## Install

**Source** · Clone the upstream repo:

```bash
git clone https://github.com/dathere/qsv
```

**Claude Code** · Install into `~/.claude/skills/`:

```bash
T=$(mktemp -d) && git clone --depth=1 https://github.com/dathere/qsv "$T" \
  && mkdir -p ~/.claude/skills \
  && cp -r "$T/.claude/skills/skills/csv-query" ~/.claude/skills/dathere-qsv-csv-query \
  && rm -rf "$T"
```

Manifest: `.claude/skills/skills/csv-query/SKILL.md`
# CSV Query

Query tabular data files using SQL via the Polars-powered `sqlp` command.

Cowork note: If relative paths don't resolve, call `mcp__qsv__qsv_set_working_dir` and `mcp__qsv__qsv_get_working_dir` to sync the working directory.
## Decision Tree

Is the query simple (single column filter, basic select)?
- Yes -> Consider `select` or `search` for simpler operations
- No -> Use `sqlp` for full SQL support

Does the query involve joins, GROUP BY, window functions, or complex expressions?
- Yes -> Use `sqlp` (Polars SQL engine)

Is the CSV file very large (> 10MB)?
- Yes -> Consider converting to Parquet with `mcp__qsv__qsv_to_parquet` for faster repeated queries. Note: `sqlp` can also query CSV files of any size directly (see the sketch after this list).
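A minimal sketch of these paths using the qsv CLI commands that the MCP tools wrap; `data.csv`, the `status`/`category` columns, and `data.parquet` are hypothetical, and the `SKIP_INPUT` sentinel is an assumption about the CLI invocation:

```bash
# Simple single-column filter: search/select, no SQL needed
qsv search --select status active data.csv

# Full SQL support: sqlp, table named after the filename stem
qsv sqlp data.csv "SELECT category, COUNT(*) AS cnt FROM data GROUP BY category"

# Large file already converted to Parquet: query it via read_parquet()
# (SKIP_INPUT signals that the SQL itself names the source; the MCP tool
# handles this detail for you)
qsv sqlp SKIP_INPUT "SELECT * FROM read_parquet('data.parquet') LIMIT 10"
```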
## Steps

1. Prepare the file: Run `mcp__qsv__qsv_index` and `mcp__qsv__qsv_stats` with `cardinality: true, stats_jsonl: true` to create the index and stats cache.
2. Read the stats cache: Read `<FILESTEM>.stats.csv` (e.g., `data.stats.csv` for `data.csv`) to understand column metadata before writing SQL. This is the most important step for writing efficient queries.
3. Run frequency on key columns: For columns you plan to GROUP BY, filter on, or join on, run `mcp__qsv__qsv_frequency` to see actual value distributions. This reveals the best filter values and whether a GROUP BY will produce a manageable result set.
4. Write and run SQL: Use `mcp__qsv__qsv_sqlp` with the SQL query informed by stats and frequency data. The table name in SQL is the filename stem (e.g., `data.csv` -> `SELECT * FROM data`). For Parquet files, use `read_parquet('data.parquet')` as the table source instead.
5. Refine if needed: Check results and adjust the query. (A CLI walkthrough of steps 1-4 is sketched below.)
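A sketch of that walkthrough via the qsv CLI (the MCP tools wrap these commands); the flag spellings `--cardinality`/`--stats-jsonl` are assumed to mirror the `cardinality: true, stats_jsonl: true` tool options, and `data.csv` is a placeholder:

```bash
# Step 1: create the index and the stats cache (data.stats.csv);
# stats also print to stdout, which we discard here
qsv index data.csv
qsv stats --cardinality --stats-jsonl data.csv > /dev/null

# Step 2: read column metadata before writing any SQL
head -n 5 data.stats.csv

# Step 3: value distribution for a planned GROUP BY / filter column
qsv frequency --select category --limit 20 data.csv

# Step 4: run the query; the table name is the filename stem
qsv sqlp data.csv \
  "SELECT category, COUNT(*) AS cnt FROM data GROUP BY category ORDER BY cnt DESC"
```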
## Using Stats to Write Better SQL

After reading the `.stats.csv` cache, use these columns to inform your SQL:

| Stats Column | How to Use in SQL |
|---|---|
| `type` | Use correct casts and comparisons — don't quote integers, use date functions for Date/DateTime columns |
| `min` / `max` | Write precise WHERE clauses using the actual data range (e.g., `BETWEEN <min> AND <max>` instead of arbitrary bounds) |
| `cardinality` | Estimate GROUP BY result size — low cardinality (< 100) is fast; high cardinality (> 10K) may need LIMIT or a different approach |
| `nullcount` | Only add COALESCE or IS NOT NULL where nullcount > 0 — skip null handling for columns with zero nulls |
| `sort_order` | Skip ORDER BY if data is already sorted on that column (sort_order = "Ascending"/"Descending") |
| `mean` / `stddev` | Write outlier filters such as `BETWEEN mean - 2*stddev AND mean + 2*stddev` |
| `q1` / `q2_median` / `q3` | For skewed data (when mean and median diverge), use quartile-based ranges (`BETWEEN q1 AND q3`) instead of mean ± stddev |
| `skewness` | If skewness > 1 or < -1, prefer median/quartile-based filters over mean-based ones |
| `cv` | High CV (> 100%) signals high relative variability — add LIMIT to GROUP BY queries and consider binning continuous values |
| `lower_inner_fence` / `upper_inner_fence` | If the share of values outside the fences exceeds 5%, consider excluding outliers before aggregation |
| `sparsity` | Columns with sparsity > 0.5 are mostly null — avoid using them as join keys or GROUP BY columns |
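For instance, a stats-informed query sketch, where the bounds 120 and 480 stand in for hypothetical `q1`/`q3` values read from `data.stats.csv`:

```bash
# Quartile-based range instead of mean ± stddev for a skewed column;
# 120/480 are the (hypothetical) q1/q3 from the stats cache
qsv sqlp data.csv \
  "SELECT category, AVG(amount) AS avg_amount
   FROM data
   WHERE amount BETWEEN 120 AND 480
   GROUP BY category"
```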
## Using Frequency for Filter Values

Run `mcp__qsv__qsv_frequency` with `select: "col", limit: 20` before writing WHERE clauses on categorical columns:

- Pick selective filters: If `frequency` shows "active" has 90% of rows, filtering on `WHERE status = 'active'` is wasteful — filter on the rare values instead
- Validate expected values: If you plan `WHERE category IN ('A','B','C')`, check frequency first to confirm those values exist and see if you're missing any
- Avoid GROUP BY on high-cardinality columns: If frequency shows thousands of unique values, GROUP BY will produce a huge result — add LIMIT or aggregate differently
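A short sketch of the pattern, assuming a hypothetical `status` column where "active" covers ~90% of rows:

```bash
# Check the distribution before writing the WHERE clause
qsv frequency --select status --limit 20 data.csv

# Filter on the rare values, not the dominant one
qsv sqlp data.csv "SELECT * FROM data WHERE status IN ('suspended', 'closed')"
```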
## SQL Syntax Guide

The `sqlp` command uses the Polars SQL dialect:

```sql
-- Basic select
SELECT col1, col2 FROM data WHERE col1 > 100

-- Aggregation
SELECT category, COUNT(*) as cnt, AVG(price) as avg_price
FROM data GROUP BY category ORDER BY cnt DESC

-- Window functions
SELECT *, ROW_NUMBER() OVER (PARTITION BY dept ORDER BY salary DESC) as rank
FROM employees

-- String operations
SELECT * FROM data WHERE col1 LIKE '%pattern%'

-- Date operations
SELECT *, EXTRACT(YEAR FROM date_col) as year FROM data

-- Multiple files (join)
SELECT a.*, b.name FROM file1 a JOIN file2 b ON a.id = b.id

-- CASE expressions
SELECT *, CASE WHEN amount > 1000 THEN 'high' ELSE 'low' END as tier FROM data
```
## Table Naming Convention

- File: `sales_2024.csv` -> Table: `sales_2024`
- File: `my-data.csv` -> Table: `"my-data"` (quote if it contains special chars)
- Multiple files: each file is a separate table
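A sketch of the multi-file case, with hypothetical `orders.csv` and `customers.csv`; each input becomes a table named after its stem:

```bash
# orders.csv -> table "orders", customers.csv -> table "customers"
qsv sqlp orders.csv customers.csv \
  "SELECT o.id, o.amount, c.name
   FROM orders o JOIN customers c ON o.customer_id = c.id"
```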
## Notes

- `sqlp` uses the Polars engine - some PostgreSQL-specific syntax may not be supported
- For very complex queries that fail, suggest DuckDB as an alternative
- The stats cache helps Polars choose optimal data types for columns
- Results go to stdout by default; use `--output file.csv` for large result sets
- Column names are case-sensitive in SQL queries
- Use `LIMIT` to preview large result sets before running full queries
- `sqlp` can query multiple CSV files in a single SQL statement (useful for joins)
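For example, the preview-then-export pattern from the notes above (`results.csv` is a placeholder):

```bash
# Preview a handful of rows before committing to the full query
qsv sqlp data.csv "SELECT * FROM data LIMIT 5"

# Then write the full result set to a file instead of stdout
qsv sqlp data.csv "SELECT * FROM data" --output results.csv
```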