qsv csv-wrangling
Standard workflow order, tool selection matrix, and composition patterns for qsv CSV data wrangling
install
source · Clone the upstream repo
git clone https://github.com/dathere/qsv
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/dathere/qsv "$T" && mkdir -p ~/.claude/skills && cp -r "$T/.claude/skills/skills/csv-wrangling" ~/.claude/skills/dathere-qsv-csv-wrangling && rm -rf "$T"
manifest:
.claude/skills/skills/csv-wrangling/SKILL.md
CSV Wrangling with qsv
Standard Workflow Order
Always follow this sequence when processing CSV data (a command sketch follows the list):
- Setup (Cowork) - If relative paths don't resolve, call `mcp__qsv__qsv_get_working_dir` and `mcp__qsv__qsv_set_working_dir` to sync
- Index - `index` (enables fast random access for subsequent commands)
- Discover - `sniff` (detect format, encoding, delimiter) -> `headers` -> `count`
- Profile - `stats --cardinality --stats-jsonl` (creates cache used by smart commands)
- Inspect - `slice --len 5` (preview rows), `frequency --frequency-jsonl` (value distributions with cache for reuse)
- Transform - `select`, `sort`, `dedup`, `rename`, `replace`, `search`, `sqlp`, etc.
- Validate - `validate` (against JSON Schema), `stats` (verify results)
- Export - `tojsonl`, `table`, `mcp__qsv__qsv_to_parquet`, `to` (xlsx/sqlite/postgres/ods/datapackage)
- Document - `describegpt --all` (AI-generated Data Dictionary, Description & Tags)
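A minimal end-to-end sketch of that sequence against a hypothetical `orders.csv` (the file name is a placeholder; flags follow the steps above):

```
qsv index orders.csv                                  # Index: fast random access
qsv sniff orders.csv                                  # Discover: format, encoding, delimiter
qsv headers orders.csv && qsv count orders.csv
qsv stats orders.csv --cardinality --stats-jsonl      # Profile: builds the stats cache
qsv slice --len 5 orders.csv                          # Inspect: preview rows
qsv frequency orders.csv --frequency-jsonl            # Inspect: value distributions
qsv dedup orders.csv --output orders_clean.csv        # Transform
qsv stats orders_clean.csv                            # Validate the result
qsv tojsonl orders_clean.csv --output orders.jsonl    # Export
```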
Tool Selection Matrix
| Task | Best Tool | Alternative | When to Use Alternative |
|---|---|---|---|
| Select columns | `select` | `sqlp` | Need computed columns |
| Filter rows | `search` | `sqlp` | Complex WHERE conditions |
| Sort data | `sort` | `sqlp` | Need ORDER BY with LIMIT |
| Remove duplicates | `dedup` | `sqlp` | Need GROUP BY dedup |
| Join two files | `joinp` | `join` | for memory-constrained |
| Aggregate/GROUP BY | `sqlp` | `frequency` | for simple counts; creates cache |
| Column stats | `stats` | `stats --everything` | for extended stats |
| Find/replace | `replace` | `sqlp` | for conditional replace |
| Reshape wide->long | | - | DuckDB UNPIVOT (external) for complex reshaping |
| Reshape long->wide | `pivotp` | | Complex pivots |
| Concatenate files | `cat rows` | `cat rowskey` | Different column orders |
| Sample rows | `sample` | `slice` | for positional ranges |
| Document dataset | `describegpt --all` | — | AI-generated Data Dictionary, Description & Tags |
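For instance, the same row filter expressed both ways (a sketch; `orders.csv`, `city`, and `total` are placeholder names):

```
# Streaming filter on one column
qsv search --select city '^Berlin$' orders.csv --output berlin.csv

# sqlp when the condition is more complex than a single regex
# (sqlp exposes orders.csv as table `orders`, named after the file stem)
qsv sqlp orders.csv "SELECT * FROM orders WHERE city = 'Berlin' AND total > 100" --output berlin_large.csv
```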
qsv Selection Syntax
Used by `select`, `search`, `sort`, `dedup`, `frequency`, and other commands:
| Syntax | Meaning | Example |
|---|---|---|
| `name` | Column by name | `select city` |
| `2` | Column by 1-based index | `select 2` |
| `a,b,c` | Multiple columns | `select name,age,city` |
| `a-d` | Range (inclusive) | `select 1-4` or `select name-city` |
| `!name` | Exclude column | `select '!notes'` |
| `!a-d` | Exclude range | `select '!1-3'` |
| `/regex/` | Match column names | `select /^date/` |
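A few of these selectors as concrete (hypothetical) invocations:

```
# First four columns only
qsv select 1-4 data.csv --output first4.csv

# Columns whose names start with "date" (regex selector)
qsv select /^date/ data.csv --output dates.csv

# Everything except a throwaway column (quote the "!" for the shell)
qsv select '!notes' data.csv --output trimmed.csv
```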
Common Pipeline Patterns
Clean and Deduplicate
sniff -> index -> safenames -> fixlengths -> sqlp (TRIM) -> dedup -> validate
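Sketched as commands, assuming an input `raw.csv` and hypothetical `name` and `email` columns:

```
qsv sniff raw.csv
qsv index raw.csv
qsv safenames raw.csv --output step1.csv       # sanitize header names
qsv fixlengths step1.csv --output step2.csv    # pad/truncate ragged records
# sqlp exposes step2.csv as table `step2` (file stem); columns are placeholders
qsv sqlp step2.csv "SELECT TRIM(name) AS name, email FROM step2" --output step3.csv
qsv dedup step3.csv --output clean.csv
qsv validate clean.csv                         # structural check when no schema is given
```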
Profile and Analyze
sniff -> index -> stats --cardinality --stats-jsonl -> read .stats.csv -> frequency (on key columns) -> sqlp (GROUP BY queries)
Before writing SQL: read `.stats.csv` to learn column types, cardinality, nullcount, min/max, sort order. Run `frequency` on columns you'll GROUP BY or filter on. Use this to write precise WHERE clauses, correct type casts, and avoid unnecessary COALESCE.
For repeated SQL queries on large CSV (> 10MB), consider converting to Parquet:
sniff -> index -> stats -> to_parquet -> sqlp (using read_parquet()). Note: sqlp can query CSV of any size directly.
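A sketch of that profiling loop on a hypothetical `orders.csv` with a `status` column to group by and a numeric `total`:

```
qsv sniff orders.csv
qsv index orders.csv
qsv stats orders.csv --cardinality --stats-jsonl --output orders.stats.csv
qsv slice --len 20 orders.stats.csv            # read the profile before writing SQL
qsv frequency --select status orders.csv       # distribution of the GROUP BY column
qsv sqlp orders.csv "SELECT status, COUNT(*) AS n, SUM(total) AS revenue FROM orders GROUP BY status ORDER BY n DESC" --output by_status.csv
```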
Join and Enrich
index (both files) -> stats (both) -> joinp -> select (keep needed columns) -> sort
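For example, enriching orders with customer attributes (files, key columns, and selected columns are placeholders):

```
qsv index orders.csv && qsv index customers.csv
qsv stats orders.csv --cardinality --stats-jsonl
qsv stats customers.csv --cardinality --stats-jsonl
qsv joinp customer_id orders.csv customer_id customers.csv --output joined.csv
qsv select customer_id,order_date,total,region joined.csv --output enriched.csv
qsv sort --select order_date enriched.csv --output final.csv
```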
Profile and Document
sniff -> index -> stats --cardinality --stats-jsonl -> describegpt --all
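In command form (a sketch; `describegpt` calls an LLM, so it also needs provider credentials configured for your environment):

```
qsv index orders.csv
qsv stats orders.csv --cardinality --stats-jsonl   # cache that describegpt draws on
qsv describegpt orders.csv --all > orders_docs.txt
```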
Convert and Export
excel (to CSV) -> index -> stats -> select -> tojsonl / qsv_to_parquet
Batch Convert to Multiple Formats
excel (to CSV) -> index -> stats -> to xlsx report.xlsx
excel (to CSV) -> index -> stats -> to sqlite report.db
excel (to CSV) -> index -> stats -> to parquet parquet_output_dir
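As commands, assuming a workbook `report.xlsx` whose default sheet converts cleanly (names are placeholders):

```
qsv excel report.xlsx --output report.csv      # workbook sheet -> CSV
qsv index report.csv && qsv stats report.csv
qsv tojsonl report.csv --output report.jsonl   # JSONL export
qsv to xlsx clean.xlsx report.csv              # re-export as xlsx
qsv to sqlite report.db report.csv             # load into a SQLite database
```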
File Integrity Verification
blake3 file.csv > checksums.b3 (before transfer) -> blake3 --check checksums.b3 (after transfer)
Delimiter Handling
- CSV (`,`): default, no flag needed
- TSV (`\t`): use `--delimiter '\t'` or file extension `.tsv`
- SSV (`;`): use `--delimiter ';'` or file extension `.ssv`
- Auto-detect: set `QSV_SNIFF_DELIMITER=1` environment variable (example below)
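For example, counting rows in a semicolon-separated export (hypothetical file names):

```
# Explicit delimiter flag
qsv count --delimiter ';' export.csv

# Or let qsv sniff the delimiter
QSV_SNIFF_DELIMITER=1 qsv count mystery_export.csv
```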
Important Notes
- Column indices are 1-based, not 0-based
- `--no-headers` flag changes behavior significantly - most commands assume headers exist
- Output goes to stdout by default; use `--output file.csv` to write to file
- Many commands auto-detect `.sz` (Snappy compressed) files transparently
- `cat rows` requires same column order; use `cat rowskey` for different schemas
- `dedup` loads all data into memory and sorts internally; use the `--sorted` flag if input is already sorted to enable streaming mode with constant memory (see the example after this list)
- `sort` loads entire file into memory; for huge files use `sqlp` with ORDER BY
- For repeated SQL queries on large CSV (> 10MB), consider converting to Parquet with `mcp__qsv__qsv_to_parquet` for faster performance. Parquet works ONLY with `sqlp` and DuckDB; all other qsv commands need CSV/TSV/SSV input
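Two of those notes in command form (hypothetical files and a hypothetical `created_at` sort column):

```
# Input already sorted on the dedup key: stream with constant memory
qsv dedup --sorted big_sorted.csv --output big_dedup.csv

# Push ORDER BY into sqlp instead of loading a huge file with `sort`
# (sqlp exposes big.csv as table `big`, named after the file stem)
qsv sqlp big.csv "SELECT * FROM big ORDER BY created_at" --output big_sorted.csv
```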
Tool Discovery
Use `mcp__qsv__qsv_search_tools` to discover commands beyond the initially loaded core tools. There are 53 qsv skill-based commands covering selection, filtering, transformation, aggregation, joining, validation, formatting, conversion, and more.
Operational Notes
- Timeout: Default operation timeout is 10 minutes (configurable via `QSV_MCP_OPERATION_TIMEOUT_MS`, max 30 min). Allow operations to run to completion.
- Memory: `dedup`, `sort`, `reverse`, `table`, `transpose`, `pragmastat`, and `stats` (with extended stats) load entire files into memory. For files >1GB, prefer `extdedup`/`extsort` via `mcp__qsv__qsv_command` (see the sketch after this list).
- Cowork path architecture: qsv runs on the HOST machine. File paths must be valid on the host. Always verify with `mcp__qsv__qsv_get_working_dir`.
- Sequential operations: Prefer sequential over parallel qsv calls to avoid queuing delays: index → stats → analysis.
- Large files (>5GB): Let `mcp__qsv__qsv_frequency` run to completion. Only fall back to `mcp__qsv__qsv_sqlp` with GROUP BY if the server timeout is exceeded.
- Context window: Save outputs to files rather than returning to chat. Use `mcp__qsv__qsv_slice` or `mcp__qsv__qsv_sqlp` with LIMIT to inspect subsets.
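A sketch of the external-memory fallback and of keeping results in files (file names are placeholders):

```
# External sort and dedup for files too large to hold in memory
qsv extsort huge.csv huge_sorted.csv
qsv extdedup huge_sorted.csv huge_deduped.csv

# Inspect only a small subset, writing it to a file instead of chat output
qsv slice --len 50 huge_deduped.csv --output preview.csv
```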