Qsv qsv-performance

Performance guide covering index files, stats cache, and frequency cache accelerators for qsv

install
source · Clone the upstream repo
git clone https://github.com/dathere/qsv
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/dathere/qsv "$T" && mkdir -p ~/.claude/skills && cp -r "$T/.claude/skills/skills/qsv-performance" ~/.claude/skills/dathere-qsv-qsv-performance && rm -rf "$T"
manifest: .claude/skills/skills/qsv-performance/SKILL.md

qsv Performance Guide

Three Accelerators

1. Index Files (.csv.idx)

Created by: qsv index

Used by: count, slice, sample, split, stats, frequency, schema, and others marked with 📇

| Benefit | Without Index | With Index |
| --- | --- | --- |
| Row count | Scan entire file | Instant (stored in index) |
| Random access | Sequential scan | O(1) lookup |
| Multithreaded | Not possible | Enabled for many commands |
| Slicing | Read from start | Jump to position |

Rule: Always run index first if you'll run 2+ commands on the same file.

Auto-indexing: The MCP server auto-indexes files > 10MB.
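The indexing rule above can be sketched as a small shell helper. This is a minimal illustration, not part of the upstream guide: index_then_inspect is a hypothetical function name, data.csv is a placeholder path, and the start offset of 1000 is arbitrary.

```shell
# Sketch: build the index once, then reuse it for follow-up commands.
# index_then_inspect is a hypothetical helper; pass it any CSV path.
index_then_inspect() {
  local csv="$1"
  # Writes the index next to the input (data.csv -> data.csv.idx).
  qsv index "$csv" || return 1
  # With an index present, the row count comes from the index.
  qsv count "$csv"
  # Slicing can jump to the requested rows instead of reading from the start.
  qsv slice --start 1000 --len 5 "$csv"
}
```

Usage: index_then_inspect data.csv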

2. Stats Cache (.stats.csv + .stats.csv.data.jsonl)

Created by: qsv stats --cardinality --stats-jsonl

Used by: frequency, schema, tojsonl, sqlp, joinp, pivotp, diff, sample (smart commands)

| Smart Command | What It Uses from Cache |
| --- | --- |
| frequency | Cardinality to skip all-unique columns |
| schema | Data types for JSON Schema generation |
| sqlp | Column types for Polars optimization |
| joinp | Cardinality for optimal join order |
| pivotp | Cardinality to estimate output width |
| diff | Column types for comparison |

Rule: Run stats --cardinality --stats-jsonl before using any smart command.

Auto-caching: The MCP server auto-adds --stats-jsonl to stats commands.
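When calling qsv directly rather than through the MCP server, the stats-cache rule might look like the sketch below. The function name warm_cache_then_frequency and the limit of 10 are illustrative assumptions; only the stats flags come from the guide.

```shell
# Sketch: warm the stats cache before running a smart command.
# warm_cache_then_frequency is a hypothetical helper name.
warm_cache_then_frequency() {
  local csv="$1"
  # An index lets stats use multiple threads (see the index section).
  qsv index "$csv" || return 1
  # Prints the stats table and writes the cache files next to the input.
  qsv stats --cardinality --stats-jsonl "$csv" || return 1
  # frequency is a smart command, so it can consult the cached cardinality.
  qsv frequency --limit 10 "$csv"
}
```

Usage: warm_cache_then_frequency data.csv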

3. Polars Engine

Commands: sqlp, joinp, pivotp, count (with --polars-len), schema (with --polars)

| Benefit | Standard (csv crate) | Polars Engine |
| --- | --- | --- |
| Processing model | Row-by-row streaming | Vectorized columnar |
| Memory | Streaming (constant) | Columnar (efficient) |
| Parallelism | Single-threaded | Multi-threaded |
| Large files | Limited by memory | Larger-than-memory |
| SQL support | N/A | Full SQL dialect |

Rule: Use Polars commands (sqlp, joinp, pivotp) for files > 100MB or complex queries.
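As one illustration of the rule, a top-N query can go through sqlp instead of a full sort. This is a sketch under assumptions: top_rows_by is a hypothetical helper, sales.csv and amount are placeholder names, and _t_1 is used as the alias for the first input file, which is worth confirming against the sqlp usage text for your qsv version.

```shell
# Sketch: let the Polars engine do ORDER BY ... LIMIT rather than sorting
# the whole file. top_rows_by is a hypothetical helper name.
top_rows_by() {
  local csv="$1" column="$2" limit="$3"
  # _t_1 is assumed to alias the first input file.
  qsv sqlp "$csv" "SELECT * FROM _t_1 ORDER BY \"$column\" DESC LIMIT $limit"
}
```

Usage: top_rows_by sales.csv amount 10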

Parquet Acceleration

For repeated SQL queries on large CSV files (> 10MB), consider converting to Parquet with mcp__qsv__qsv_to_parquet. Parquet is a columnar format that speeds up repeated SQL queries in mcp__qsv__qsv_sqlp. Use read_parquet('file.parquet') as the table source. DuckDB is the preferred engine for Parquet queries; mcp__qsv__qsv_sqlp with SKIP_INPUT as the input_file value also works.

Note: mcp__qsv__qsv_sqlp can query CSV of any size directly; Parquet is an optimization for repeated queries, not a requirement. Parquet works ONLY with mcp__qsv__qsv_sqlp and DuckDB; all other qsv commands require CSV/TSV/SSV input.
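The read_parquet table source and the SKIP_INPUT placeholder can be combined as in the sketch below. It assumes the qsv command line accepts SKIP_INPUT in the input position the same way the MCP tool accepts it as input_file. The helper names, sales.parquet, and the filter are illustrative only.

```shell
# Sketch: build a SQL string that reads a Parquet file directly.
# parquet_query and run_parquet_query are hypothetical helper names.
parquet_query() {
  local parquet="$1" condition="$2"
  printf "SELECT * FROM read_parquet('%s') WHERE %s" "$parquet" "$condition"
}

run_parquet_query() {
  # SKIP_INPUT stands in for the CSV input because the table source
  # is named inside the query itself (assumed CLI behavior).
  qsv sqlp SKIP_INPUT "$(parquet_query "$1" "$2")"
}
```

Usage: run_parquet_query sales.parquet 'amount > 100'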

Memory-Aware Command Selection

Commands That Load Entire File into Memory (🤯)

dedup, reverse, sort, stats (with extended stats), table, transpose

Commands with Memory Proportional to Cardinality (😣)

frequency, join, schema, tojsonl

Streaming Commands (constant memory)

Everything else: select, search, slice, replace, count, etc.
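The three groups above can be encoded as a lookup that a wrapper script consults before running a command on a large file. This is a minimal sketch: memory_class and its return labels are hypothetical, and the command lists are copied from this section only, so anything not listed falls through to streaming.

```shell
# Sketch: classify a qsv command by the memory behavior listed above.
# memory_class and its labels are hypothetical, not a qsv feature.
memory_class() {
  case "$1" in
    dedup|reverse|sort|table|transpose) echo "full-file" ;;
    # stats loads the whole file only when extended stats are requested.
    stats) echo "full-file-if-extended" ;;
    frequency|join|schema|tojsonl) echo "cardinality" ;;
    *) echo "streaming" ;;
  esac
}
```

Usage: memory_class sort prints full-file, and memory_class select prints streaming.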

Large File Decision Tree

File size?
├── < 10MB: Any command works fine
├── 10MB - 100MB:
│   ├── Always: index first
│   ├── Repeated SQL: consider Parquet with qsv_to_parquet
│   ├── Prefer: streaming commands
│   └── OK: memory-intensive if < available RAM
├── 100MB - 1GB:
│   ├── Always: index + stats cache first
│   ├── Repeated SQL: consider Parquet with qsv_to_parquet
│   ├── Prefer: Polars commands (sqlp, joinp, pivotp)
│   ├── Avoid: sort, reverse, table (load entire file)
│   └── Alternative: sqlp with ORDER BY LIMIT instead of sort
└── > 1GB:
    ├── Must: index + stats cache
    ├── Repeated SQL: convert to Parquet with qsv_to_parquet
    ├── Must: Polars commands only for joins/queries
    ├── Avoid: all 🤯 commands
    └── Consider: split into chunks, process, cat rows
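The size tiers in the tree can be turned into a small check that runs before choosing commands. In this sketch, size_tier and plan_for_file are hypothetical names, the tier labels are made up, 1MB is taken as 1024 * 1024 bytes, and a file of exactly 1GB is placed in the 100MB to 1GB tier. The tree itself does not specify these boundary details.

```shell
# Sketch: map a byte count onto the tiers of the decision tree above.
size_tier() {
  local bytes="$1"
  local mb=$((1024 * 1024))
  if [ "$bytes" -lt $((10 * mb)) ]; then
    echo "small"
  elif [ "$bytes" -lt $((100 * mb)) ]; then
    echo "medium"
  elif [ "$bytes" -le $((1024 * mb)) ]; then
    echo "large"
  else
    echo "huge"
  fi
}

# Print a one-line recommendation for a file, summarizing the tree.
plan_for_file() {
  local csv="$1" bytes
  bytes=$(wc -c < "$csv" | tr -d ' ') || return 1
  case "$(size_tier "$bytes")" in
    small)  echo "any command" ;;
    medium) echo "index first; prefer streaming commands" ;;
    large)  echo "index + stats cache; prefer sqlp, joinp, pivotp" ;;
    huge)   echo "index + stats cache; Polars commands only for joins/queries" ;;
  esac
}
```

Usage: plan_for_file data.csv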

Performance Tips

| Tip | Why |
| --- | --- |
| Use --output file.csv | Avoids stdout buffering overhead |
| Use count before stats | Fast row count for progress bars |
| Use select early in pipeline | Reduce columns = faster processing |
| Use --no-headers only when needed | Header detection is cheap |
| Use slice --len N for previews | Don't read entire file to inspect |
| Prefer joinp over join | Polars engine is significantly faster |
| Use frequency --limit N | Don't compute all unique values |
| Use stats --cardinality | Enables smart optimizations downstream |
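Several of these tips combine naturally in a pipeline. The sketch below is illustrative: preview_columns and top_values are hypothetical helper names, and the column names and output path in the usage line are placeholders.

```shell
# Sketch: select early, then preview only a few rows.
preview_columns() {
  local csv="$1" columns="$2" rows="$3"
  # Dropping unneeded columns first keeps the rest of the pipeline small.
  qsv select "$columns" "$csv" | qsv slice --len "$rows"
}

# Sketch: cap the frequency table and write it to a file via --output.
top_values() {
  local csv="$1" column="$2" limit="$3" out="$4"
  qsv frequency --select "$column" --limit "$limit" --output "$out" "$csv"
}
```

Usage: preview_columns data.csv id,name 5 and top_values data.csv city 10 top_cities.csv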

Concurrent Operations

The MCP server limits concurrent qsv operations (default: 1). For multiple independent files, the agent can issue separate tool calls.

Timeout Handling

  • Default timeout: 10 minutes (QSV_MCP_OPERATION_TIMEOUT_MS)
  • Long operations (sort on huge files) may time out
  • If a timeout occurs: try a Polars alternative or split the file
  • Exit code 124 indicates a timeout
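A caller that sees exit code 124 can retry with a cheaper alternative. The wrapper below is a sketch, not MCP server behavior: run_with_fallback is a hypothetical helper that takes two command lines as strings and runs the second only when the first exits with 124.

```shell
# Sketch: run a primary command; on exit code 124 (timeout), run a fallback.
# run_with_fallback is a hypothetical helper, not part of qsv.
run_with_fallback() {
  local primary="$1" fallback="$2" status
  sh -c "$primary"
  status=$?
  if [ "$status" -eq 124 ]; then
    echo "primary timed out (exit 124); trying fallback" >&2
    sh -c "$fallback"
    return $?
  fi
  # Any other exit code, including success, is passed through unchanged.
  return "$status"
}
```

For local experiments outside the MCP server, GNU timeout also exits with 124 when the limit is hit, so a call such as run_with_fallback "timeout 600 qsv sort big.csv --output sorted.csv" "qsv sqlp big.csv 'SELECT * FROM _t_1 ORDER BY 1' --output sorted.csv" exercises the same path. File names and the 600-second limit are placeholders, and the two commands may order rows differently (string order versus typed column order), so the fallback output is worth checking.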