install
source · Clone the upstream repo
git clone https://github.com/dathere/qsv
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/dathere/qsv "$T" && mkdir -p ~/.claude/skills && cp -r "$T/.claude/skills/skills/data-join" ~/.claude/skills/dathere-qsv-data-join && rm -rf "$T"
manifest:
.claude/skills/skills/data-join/SKILL.mdsource content
Data Join
Join two tabular data files on common columns.
Cowork note: If relative paths don't resolve, call
andmcp__qsv__qsv_get_working_dirto sync the working directory.mcp__qsv__qsv_set_working_dir
Strategy Selection
| Scenario | Best Tool | Why |
|---|---|---|
| Standard equi-join | | Polars engine, fastest |
| Non-equi join (>, <, BETWEEN) | | SQL supports complex conditions |
| Cross join / cartesian | | syntax |
| Memory-constrained | with | Streaming, lower memory |
| Fuzzy/approximate match | with | Nearest-match join |
Steps
-
Index both files: Run
on both files for fast random access.mcp__qsv__qsv_index -
Inspect both files: Run
on both files to identify column names. Determine which columns to join on.mcp__qsv__qsv_headers -
Profile join columns: Run
withmcp__qsv__qsv_stats
on both files. Check the cardinality of join columns to determine optimal table order.cardinality: true, stats_jsonl: true -
Choose strategy:
- If cardinality of join column in file1 > file2, put file1 on the left
- For
: smaller cardinality table should be on the right for best performancejoinp - If join condition is complex (non-equi), use
mcp__qsv__qsv_sqlp - If join involves date/time matching where exact dates won't align (e.g., quarterly to monthly, event dates to nearest reporting period), use
withmcp__qsv__qsv_joinpasof: true
-
Execute join: Use
for standard joins:mcp__qsv__qsv_joinpjoinp columns1: "id" input1: "file1.csv" columns2: "id" input2: "file2.csv" # Join type: omit for inner (default), or set one of: # left: true, full: true, cross: trueOr use
for complex joins:mcp__qsv__qsv_sqlpSELECT a.*, b.col1, b.col2 FROM file1 a JOIN file2 b ON a.id = b.id AND a.date BETWEEN b.start_date AND b.end_dateFor ASOF (nearest-match) joins, use
withmcp__qsv__qsv_joinp
:asof: truejoinp columns1: "date" input1: "events.csv" columns2: "date" input2: "reference.csv" asof: true strategy: "backward" allow_exact_matches: true
(default) — match to the last right row with key < left keystrategy: "backward"
— match to the first right row with key > left keystrategy: "forward"
— match to the numerically closest row (supportsstrategy: "nearest"
parameter)tolerance- Add
/left_by
parameters to restrict matching within subgroups (e.g., per jurisdiction)right_by - Add
to include equal keys (<=, >=); default is strict inequality (<, >)allow_exact_matches: true
-
Clean up result: Use
to remove duplicate join columns or unnecessary columns from the result.mcp__qsv__qsv_select -
Verify: Run
on the result. Compare with input counts to validate join behavior:mcp__qsv__qsv_count- Inner join: result <= min(left, right)
- Left join: result >= left count
- Full outer: result >= max(left, right)
- ASOF: result = left count (every left row gets a match or null, like a left join)
Join Column Validation Checklist
Before executing a join, read
.stats.csv for both files and validate:
| Check | Stats Column | Red Flag | Action |
|---|---|---|---|
| Type match | | Join columns have different types (e.g., Integer vs String) | Cast one column before joining: with |
| Null density | , | sparsity > 0.3 on join column | Nulls don't match — expect unmatched rows; consider filtering nulls first |
| Value overlap | , | Non-overlapping ranges across files | No rows will match — verify correct join column |
| Skew detection | , | One value dominates (mode_count > 50% of rows) | Join will be heavily skewed many-to-one; verify this is expected |
| Uniqueness | | Both files have uniqueness_ratio < 1.0 on join column | Many-to-many join risk — expect row explosion; verify with after |
| Outlier keys | | outliers_percentage > 5% on numeric join column | Outlier keys may not match across files; consider trimming first |
Join Types
| Type | Flag | SQL | Behavior |
|---|---|---|---|
| Inner | (default) | | Only matching rows |
| Left | | | All left + matching right |
| Full outer | | | All rows from both |
| Cross | | | Cartesian product |
| Left Anti | | / | Left rows without match |
| Left Semi | | | Left rows with match (no right cols) |
| ASOF | | (use joinp) | Nearest-key match (temporal/numeric) |
Notes
uses the Polars engine and is significantly faster thanjoinp
for large filesjoin- The stats cache helps
optimize join executionjoinp - For joining on multiple columns, separate column names with commas:
columns1: "col1,col2" - Column names must match exactly (case-sensitive)
- If join columns have different names, specify separately:
,columns1: "id"columns2: "customer_id" - For one-to-many joins, the result will have more rows than either input
handles null values in join columns (nulls don't match by default)joinp- ASOF joins implicitly enable
— no need to pass it explicitly--try-parsedates - For ASOF joins with subgroups, use
and--left_by
(e.g., match nearest date per jurisdiction)--right_by - The
option (nearest strategy only) limits how far the nearest match can be: use duration strings for dates (--tolerance
,1d
,30d
) or positive integers for numeric keys365d - ASOF joins require sorted join columns; both datasets are auto-sorted unless
is set--no-sort