qsv csv-query

Run SQL queries against CSV/TSV/Excel files using the Polars SQL engine

install
source · Clone the upstream repo
git clone https://github.com/dathere/qsv
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/dathere/qsv "$T" && mkdir -p ~/.claude/skills && cp -r "$T/.claude/skills/skills/csv-query" ~/.claude/skills/dathere-qsv-csv-query && rm -rf "$T"
manifest: .claude/skills/skills/csv-query/SKILL.md
source content

CSV Query

Query tabular data files using SQL via the Polars-powered sqlp command.

Cowork note: If relative paths don't resolve, call mcp__qsv__qsv_get_working_dir and mcp__qsv__qsv_set_working_dir to sync the working directory.
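
For example, with a hypothetical data.csv in the working directory, a minimal first query looks like:

-- data.csv is a hypothetical example file; its table name is the filename stem
SELECT * FROM data LIMIT 10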

Decision Tree

Is the query simple (single column filter, basic select)?

  • Yes -> Consider select + search for simpler operations
  • No -> Use sqlp for full SQL support

Does the query involve joins, GROUP BY, window functions, or complex expressions?

  • Yes -> Use sqlp (the Polars SQL engine)

Is the CSV file very large (> 10MB)?

  • Yes -> Consider converting to Parquet with mcp__qsv__qsv_to_parquet for faster repeated queries (see the sketch below). Note: sqlp can also query CSV files of any size directly.
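
A sketch of the Parquet path, assuming the file was already converted to a hypothetical data.parquet with a hypothetical category column:

-- read_parquet() takes the file path directly as the table source
SELECT category, COUNT(*) AS cnt
FROM read_parquet('data.parquet')
GROUP BY category
ORDER BY cnt DESC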

Steps

  1. Prepare the file: Run mcp__qsv__qsv_index and mcp__qsv__qsv_stats with cardinality: true, stats_jsonl: true to create the index and stats cache.

  2. Read the stats cache: Read <FILESTEM>.stats.csv (e.g., data.stats.csv for data.csv) to understand column metadata before writing SQL. This is the most important step for writing efficient queries.

  3. Run frequency on key columns: For columns you plan to GROUP BY, filter on, or join on, run mcp__qsv__qsv_frequency to see actual value distributions. This reveals the best filter values and whether a GROUP BY will produce a manageable result set.

  4. Write and run SQL: Use mcp__qsv__qsv_sqlp with a SQL query informed by the stats and frequency data. The table name in SQL is the filename stem (e.g., data.csv -> SELECT * FROM data). For Parquet files, use read_parquet('data.parquet') as the table source instead. A worked sketch follows these steps.

  5. Refine if needed: Check results and adjust the query.
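
A worked sketch of step 4, assuming hypothetical columns status and amount in data.csv whose stats and frequency output suggested a small GROUP BY result:

-- status and amount are hypothetical column names
SELECT status, COUNT(*) AS cnt, AVG(amount) AS avg_amount
FROM data
GROUP BY status
ORDER BY cnt DESC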

Using Stats to Write Better SQL

After reading the .stats.csv cache, use these columns to inform your SQL:

| Stats Column | How to Use in SQL |
| --- | --- |
| type | Use correct casts and comparisons: don't quote integers, and use date functions for Date/DateTime columns |
| min / max | Write precise WHERE clauses using the actual data range (e.g., WHERE price BETWEEN 10.5 AND 999.99 instead of arbitrary bounds) |
| cardinality | Estimate GROUP BY result size: low cardinality (< 100) is fast; high cardinality (> 10K) may need LIMIT or a different approach |
| nullcount | Only add COALESCE or IS NOT NULL where nullcount > 0; skip null handling for columns with zero nulls |
| sort_order | Skip ORDER BY if the data is already sorted on that column (sort_order = "Ascending"/"Descending") |
| mean / stddev | Write outlier filters: WHERE col BETWEEN mean - 3*stddev AND mean + 3*stddev |
| median / q1 / q3 | For skewed data (when mean and median diverge), use quartile-based ranges (WHERE col BETWEEN q1 AND q3) instead of mean ± stddev |
| skewness | If skewness > 1 or < -1, prefer median/quartile-based filters over mean-based ones |
| cv | High CV (> 100%) signals high relative variability; add LIMIT to GROUP BY queries and consider binning continuous values |
| outliers_percentage | If > 5%, consider excluding outliers before aggregation: WHERE col BETWEEN lower_inner_fence AND upper_inner_fence |
| sparsity | Columns with sparsity > 0.5 are mostly null; avoid using them as join keys or GROUP BY columns |
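
Putting a few of these rules together, a sketch for a skewed numeric column (the column name and quartile values are hypothetical, as if read from the stats cache):

-- price has skewness > 1, so use the q1..q3 range instead of mean ± stddev
-- 24.99 and 149.99 stand in for the q1 and q3 values from data.stats.csv
SELECT * FROM data
WHERE price BETWEEN 24.99 AND 149.99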

Using Frequency for Filter Values

Run mcp__qsv__qsv_frequency with select: "col", limit: 20 before writing WHERE clauses on categorical columns:

  • Pick selective filters: If frequency shows "active" has 90% of rows, filtering on WHERE status = 'active' is wasteful; filter on the rare values instead (sketched after this list)
  • Validate expected values: If you plan WHERE category IN ('A','B','C'), check frequency first to confirm those values exist and see if you're missing any
  • Avoid GROUP BY on high-cardinality columns: If frequency shows thousands of unique values, GROUP BY will produce a huge result; add LIMIT or aggregate differently
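
For instance, if frequency on a hypothetical status column showed 'active' covering ~90% of rows:

-- filter on the rare values rather than the dominant one
-- 'pending' and 'failed' are hypothetical values taken from the frequency output
SELECT * FROM data
WHERE status IN ('pending', 'failed')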

SQL Syntax Guide

The sqlp command uses the Polars SQL dialect:

-- Basic select
SELECT col1, col2 FROM data WHERE col1 > 100

-- Aggregation
SELECT category, COUNT(*) as cnt, AVG(price) as avg_price
FROM data GROUP BY category ORDER BY cnt DESC

-- Window functions
SELECT *, ROW_NUMBER() OVER (PARTITION BY dept ORDER BY salary DESC) as rank
FROM employees

-- String operations
SELECT * FROM data WHERE col1 LIKE '%pattern%'

-- Date operations
SELECT *, EXTRACT(YEAR FROM date_col) as year FROM data

-- Multiple files (join)
SELECT a.*, b.name FROM file1 a JOIN file2 b ON a.id = b.id

-- CASE expressions
SELECT *, CASE WHEN amount > 1000 THEN 'high' ELSE 'low' END as tier FROM data
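
Polars SQL also accepts common table expressions; a sketch reusing the employees example above (verify against your Polars version):

-- CTE: employees paid above their department's average salary
WITH dept_avg AS (
    SELECT dept, AVG(salary) AS avg_salary FROM employees GROUP BY dept
)
SELECT e.*, d.avg_salary
FROM employees e JOIN dept_avg d ON e.dept = d.dept
WHERE e.salary > d.avg_salary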

Table Naming Convention

  • File: sales_2024.csv -> Table: sales_2024
  • File: my-data.csv -> Table: "my-data" (quote table names containing special characters; see the example below)
  • Multiple files: each file is a separate table
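
For example, the hyphenated filename above requires a quoted table identifier:

-- my-data.csv loads as a table whose name must be double-quoted
SELECT * FROM "my-data" LIMIT 5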

Notes

  • sqlp uses the Polars engine - some PostgreSQL-specific syntax may not be supported
  • For very complex queries that fail, suggest DuckDB as an alternative
  • The stats cache helps Polars choose optimal data types for columns
  • Results go to stdout by default; use --output file.csv for large result sets
  • Column names are case-sensitive in SQL queries
  • Use LIMIT to preview large result sets before running full queries
  • sqlp can query multiple CSV files in a single SQL statement (useful for joins); see the sketch below
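
Combining the last two notes, a multi-file sketch with a preview LIMIT (file and column names are hypothetical):

-- file1.csv and file2.csv load as tables file1 and file2
SELECT a.id, b.name
FROM file1 a JOIN file2 b ON a.id = b.id
LIMIT 20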