Qsv qsv-performance

Performance guide covering index files, stats cache, and frequency cache accelerators for qsv

install

source · Clone the upstream repo

git clone https://github.com/dathere/qsv

Claude Code · Install into ~/.claude/skills/

T=$(mktemp -d) && git clone --depth=1 https://github.com/dathere/qsv "$T" && mkdir -p ~/.claude/skills && cp -r "$T/.claude/skills/skills/qsv-performance" ~/.claude/skills/dathere-qsv-qsv-performance && rm -rf "$T"

manifest: .claude/skills/skills/qsv-performance/SKILL.md

source content

qsv Performance Guide

Three Accelerators

1. Index Files (

.csv.idx

)

Created by:

qsv index

Used by: count, slice, sample, split, stats, frequency, schema, and others marked with 📇

Benefit	Without Index	With Index
Row count	Scan entire file	Instant (stored in index)
Random access	Sequential scan	O(1) lookup
Multithreaded	Not possible	Enabled for many commands
Slicing	Read from start	Jump to position

Rule: Always run

index

first if you'll run 2+ commands on the same file.

Auto-indexing: The MCP server auto-indexes files > 10MB.

2. Stats Cache (

.stats.csv

.stats.csv.data.jsonl

)

Created by:

qsv stats --cardinality --stats-jsonl

Used by: frequency, schema, tojsonl, sqlp, joinp, pivotp, diff, sample (smart commands)

Smart Command	What It Uses from Cache
`frequency`	Cardinality to skip all-unique columns
`schema`	Data types for JSON Schema generation
`sqlp`	Column types for Polars optimization
`joinp`	Cardinality for optimal join order
`pivotp`	Cardinality to estimate output width
`diff`	Column types for comparison

Rule: Run

stats --cardinality --stats-jsonl

before using any smart command.

Auto-caching: The MCP server auto-adds

--stats-jsonl

to stats commands.

3. Polars Engine

Commands: sqlp, joinp, pivotp, count (with

--polars-len

), schema (with

--polars

)

Benefit	Standard (csv crate)	Polars Engine
Processing model	Row-by-row streaming	Vectorized columnar
Memory	Streaming (constant)	Columnar (efficient)
Parallelism	Single-threaded	Multi-threaded
Large files	Limited by memory	Larger-than-memory
SQL support	N/A	Full SQL dialect

Rule: Use Polars commands (sqlp, joinp, pivotp) for files > 100MB or complex queries.

Parquet Acceleration

For repeated SQL queries on large CSV (> 10MB), consider converting to Parquet with

mcp__qsv__qsv_to_parquet

. Parquet is a columnar format that speeds up repeated SQL queries in

mcp__qsv__qsv_sqlp

. Use

read_parquet('file.parquet')

as the table source. DuckDB is the preferred engine for Parquet queries;

mcp__qsv__qsv_sqlp

with

SKIP_INPUT

as the

input_file

value also works. Note:

mcp__qsv__qsv_sqlp

can query CSV of any size directly — Parquet is an optimization for repeated queries, not a requirement. Parquet works ONLY with

mcp__qsv__qsv_sqlp

and DuckDB — all other qsv commands require CSV/TSV/SSV input.

Memory-Aware Command Selection

Commands That Load Entire File into Memory (🤯)

dedup

reverse

sort

stats

(with extended stats),

table

transpose

Commands with Memory Proportional to Cardinality (😣)

frequency

join

schema

tojsonl

Streaming Commands (constant memory)

Everything else -

select

search

slice

replace

count

, etc.

Large File Decision Tree

File size?
├── < 10MB: Any command works fine
├── 10MB - 100MB:
│   ├── Always: index first
│   ├── Repeated SQL: consider Parquet with qsv_to_parquet
│   ├── Prefer: streaming commands
│   └── OK: memory-intensive if < available RAM
├── 100MB - 1GB:
│   ├── Always: index + stats cache first
│   ├── Repeated SQL: consider Parquet with qsv_to_parquet
│   ├── Prefer: Polars commands (sqlp, joinp, pivotp)
│   ├── Avoid: sort, reverse, table (load entire file)
│   └── Alternative: sqlp with ORDER BY LIMIT instead of sort
└── > 1GB:
    ├── Must: index + stats cache
    ├── Repeated SQL: convert to Parquet with qsv_to_parquet
    ├── Must: Polars commands only for joins/queries
    ├── Avoid: all 🤯 commands
    └── Consider: split into chunks, process, cat rows

Performance Tips

Tip	Why
Use `--output file.csv`	Avoids stdout buffering overhead
Use `count` before `stats`	Fast row count for progress bars
Use `select` early in pipeline	Reduce columns = faster processing
Use `--no-headers` only when needed	Header detection is cheap
Use `slice --len N` for previews	Don't read entire file to inspect
Prefer `joinp` over `join`	Polars engine is significantly faster
Use `frequency --limit N`	Don't compute all unique values
Use `stats --cardinality`	Enables smart optimizations downstream

Concurrent Operations

The MCP server limits concurrent qsv operations (default: 1). For multiple independent files, the agent can issue separate tool calls.

Timeout Handling

Default timeout: 10 minutes (
```
QSV_MCP_OPERATION_TIMEOUT_MS
```
)
Long operations (sort on huge files) may timeout
If timeout occurs: try Polars alternative or split the file
Exit code 124 indicates timeout

Qsv qsv-performance

qsv Performance Guide

Three Accelerators

1. Index Files (
`.csv.idx`
)

2. Stats Cache (
`.stats.csv`
+
`.stats.csv.data.jsonl`
)

3. Polars Engine

Parquet Acceleration

Memory-Aware Command Selection

Commands That Load Entire File into Memory (🤯)

Commands with Memory Proportional to Cardinality (😣)

Streaming Commands (constant memory)

Large File Decision Tree

Performance Tips

Concurrent Operations

Timeout Handling

Qsv qsv-performance

qsv Performance Guide

Three Accelerators

1. Index Files (.csv.idx)

2. Stats Cache (.stats.csv + .stats.csv.data.jsonl)

3. Polars Engine

Parquet Acceleration

Memory-Aware Command Selection

Commands That Load Entire File into Memory (🤯)

Commands with Memory Proportional to Cardinality (😣)

Streaming Commands (constant memory)

Large File Decision Tree

Performance Tips

Concurrent Operations

Timeout Handling

1. Index Files (
`.csv.idx`
)

2. Stats Cache (
`.stats.csv`
+
`.stats.csv.data.jsonl`
)