# qsv data-clean

Clean a CSV/TSV/Excel file - fix headers, trim whitespace, remove duplicates, validate.

Repository: https://github.com/dathere/qsv

Install the skill:

```sh
T=$(mktemp -d) && git clone --depth=1 https://github.com/dathere/qsv "$T" && mkdir -p ~/.claude/skills && cp -r "$T/.claude/skills/skills/data-clean" ~/.claude/skills/dathere-qsv-data-clean && rm -rf "$T"
```

## Data Clean (`.claude/skills/skills/data-clean/SKILL.md`)
Clean the given tabular data file by fixing common data quality issues.
Cowork note: If relative paths don't resolve, call `mcp__qsv__qsv_set_working_dir` and `mcp__qsv__qsv_get_working_dir` to sync the working directory.
## Steps

1. **Index**: Run `mcp__qsv__qsv_index` on the file for fast random access in subsequent steps.

2. **Assess current state**: Run `mcp__qsv__qsv_sniff` and `mcp__qsv__qsv_count` to understand the file format and size.

3. **Profile for cleaning decisions**: Run `mcp__qsv__qsv_stats` with `cardinality: true, stats_jsonl: true`. Read `.stats.csv` to decide which cleaning steps are needed:

   | Stats Column | What It Reveals | Cleaning Action |
   |---|---|---|
   | `nullcount`, `sparsity` | Missing values per column | If sparsity > 0.5, decide: impute, drop column, or flag |
   | `cardinality` vs row count | Duplicate rows exist if any key column has cardinality < row count | Run `dedup` |
   | `min_length`, `max_length` | String length variation | Large gap suggests ragged data or embedded whitespace |
   | `sort_order` | Whether data is pre-sorted | Use `dedup --sorted` for streaming mode if sorted |
   | `mode`, `mode_count` | Dominant values | If mode_count > 80% of rows, investigate data entry defaults |
   | `type` | Inferred types | String columns that should be numeric indicate format issues |

4. **Check headers**: Run `mcp__qsv__qsv_headers` to inspect column names. If names contain spaces, special characters, or are duplicated, plan to use `safenames`.
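The profiling decisions in step 3 can be sketched in plain Python. This is an illustration of the decision logic only - in practice `mcp__qsv__qsv_stats` computes these metrics for you, and the column names and data here are made up:

```python
import csv
import io

# Toy data with a sparse column (notes) and a duplicated key (id).
data = """id,email,notes
1,a@x.com,
2,b@x.com,
2,b@x.com,hello
3,c@x.com,
"""

rows = list(csv.DictReader(io.StringIO(data)))
row_count = len(rows)

actions = []
for col in rows[0]:
    values = [r[col] for r in rows]
    nullcount = sum(1 for v in values if v == "")
    sparsity = nullcount / row_count   # qsv stats reports sparsity per column
    cardinality = len(set(values))     # requires cardinality: true
    if sparsity > 0.5:
        actions.append((col, "impute-drop-or-flag"))
    if col == "id" and cardinality < row_count:
        actions.append((col, "dedup"))

print(actions)
```

Here `notes` is 75% empty (flag or drop it) and the key column `id` has cardinality 3 against 4 rows, so `dedup` is warranted.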
5. **Build cleaning steps**: Apply these operations in order (skip any that aren't needed based on the assessment):

   a. `safenames` - Normalize column names to safe, ASCII-only identifiers (removes spaces and special characters, ensures uniqueness)

   b. `fixlengths` - Ensure all rows have the same number of fields (pads short rows, truncates long rows)

   c. `sqlp` - Remove leading/trailing whitespace from columns using `TRIM()`. Example: `SELECT TRIM(col1) AS col1, TRIM(col2) AS col2 FROM _t_1`

   d. `dedup` - Remove exact duplicate rows. Loads all data into memory and sorts internally; use `--sorted` if the input is already sorted to enable streaming mode with constant memory.

   e. `validate` - If a JSON Schema is available, validate against it and report violations.
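Operations a-d can be sketched in pure Python to show what each one does to the data. This is not the qsv implementation - the normalization rule and sample data are simplified stand-ins:

```python
import csv
import io
import re

# Messy sample: duplicate header, ragged rows, embedded whitespace, a duplicate row.
raw = """First Name, First Name,city
 Ann ,a1,NYC
Bob,b2
 Ann ,a1,NYC,extra
"""

rows = list(csv.reader(io.StringIO(raw)))
header, body = rows[0], rows[1:]

# a. safenames-style normalization: ASCII-safe identifiers, made unique.
def safename(name, seen):
    s = re.sub(r"[^A-Za-z0-9_]", "_", name.strip()) or "_"
    base, n = s, 1
    while s in seen:
        n += 1
        s = f"{base}_{n}"
    seen.add(s)
    return s

seen = set()
header = [safename(h, seen) for h in header]

# b. fixlengths: pad short rows, truncate long ones to the header width.
width = len(header)
body = [(r + [""] * width)[:width] for r in body]

# c. TRIM(): strip leading/trailing whitespace from every field.
body = [[f.strip() for f in r] for r in body]

# d. dedup: drop exact duplicate rows (in-memory, like qsv dedup).
deduped, seen_rows = [], set()
for r in body:
    key = tuple(r)
    if key not in seen_rows:
        seen_rows.add(key)
        deduped.append(r)

print(header, deduped)
```

After all four passes the duplicate `First Name` header is disambiguated, every row has three fields, whitespace is gone, and the repeated `Ann` row appears once.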
6. **Verify results**: Run `mcp__qsv__qsv_count` on the output to confirm the row count. Run `mcp__qsv__qsv_stats` with `cardinality: true` to verify improvements.

7. **Report changes**: Summarize what was cleaned:
   - Headers renamed (before -> after)
   - Rows with the wrong field count (fixed by `fixlengths`)
   - Duplicate rows removed
   - Whitespace trimmed
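One way to assemble the step-7 summary, assuming you kept the before/after headers and row counts from the earlier steps (all values here are illustrative):

```python
# Hypothetical bookkeeping from the cleaning run.
before_headers = ["First Name", "city"]
after_headers = ["First_Name", "city"]
rows_before, rows_after = 1000, 987

renamed = [(b, a) for b, a in zip(before_headers, after_headers) if b != a]
report = {
    "headers_renamed": renamed,                       # before -> after pairs
    "duplicate_rows_removed": rows_before - rows_after,
}
print(report)
```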
## Cleaning Steps

Call each tool sequentially, passing the output of one step as input to the next:

1. `mcp__qsv__qsv_command` with `command: "safenames"`, `input_file: "<file>"`, `output_file: "step1.csv"`
2. `mcp__qsv__qsv_command` with `command: "fixlengths"`, `input_file: "step1.csv"`, `output_file: "step2.csv"`
3. `mcp__qsv__qsv_sqlp` with `input_file: "step2.csv"`, `sql: "SELECT TRIM(col1) AS col1, TRIM(col2) AS col2, ... FROM _t_1"` (list all columns with `TRIM`), `output_file: "step3.csv"`
4. `mcp__qsv__qsv_command` with `command: "dedup"`, `input_file: "step3.csv"`, `output_file: "<output>"`
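The chaining pattern above can be sketched with a stand-in dispatcher. The real MCP tools are invoked by the agent; `call_tool` here is a hypothetical stub that only shows how each step's `output_file` becomes the next step's `input_file`:

```python
def call_tool(name, **args):
    # Stand-in: a real agent would invoke the named MCP tool here.
    return args["output_file"]

f = call_tool("mcp__qsv__qsv_command", command="safenames",
              input_file="input.csv", output_file="step1.csv")
f = call_tool("mcp__qsv__qsv_command", command="fixlengths",
              input_file=f, output_file="step2.csv")
f = call_tool("mcp__qsv__qsv_sqlp",
              sql="SELECT TRIM(col1) AS col1 FROM _t_1",  # list every column
              input_file=f, output_file="step3.csv")
f = call_tool("mcp__qsv__qsv_command", command="dedup",
              input_file=f, output_file="cleaned.csv")
print(f)
```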
## Notes

- Always preserve the original file - write output to a new file.
- For large files (> 100MB), `dedup` loads the entire file into memory to sort and deduplicate; consider using `sqlp` with `SELECT DISTINCT` instead.
- `safenames` uses `--mode conditional` by default (it only renames headers that need it).
- If the user specifies particular columns to clean, use column selection syntax instead of cleaning all columns.
- `dedup` loads all data into memory and sorts internally; if the input is already sorted, use `--sorted` for streaming mode.
- Use `mcp__qsv__qsv_search_tools` to find additional cleaning tools if needed (e.g., `replace` for regex substitution).
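The `--sorted` note works because sorted input lets you drop duplicates by comparing each row only to the previous one, in constant memory. A minimal sketch of that streaming behavior:

```python
def dedup_sorted(rows):
    """Yield rows, skipping exact consecutive duplicates (input must be sorted)."""
    prev = object()  # sentinel that compares unequal to any row
    for row in rows:
        if row != prev:
            yield row
            prev = row

out = list(dedup_sorted([("a",), ("a",), ("b",), ("b",), ("c",)]))
print(out)
```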