Qsv data-clean

Clean a CSV/TSV/Excel file - fix headers, trim whitespace, remove duplicates, validate

install

source · Clone the upstream repo

git clone https://github.com/dathere/qsv

Claude Code · Install into ~/.claude/skills/

T=$(mktemp -d) && git clone --depth=1 https://github.com/dathere/qsv "$T" && mkdir -p ~/.claude/skills && cp -r "$T/.claude/skills/skills/data-clean" ~/.claude/skills/dathere-qsv-data-clean && rm -rf "$T"

manifest: .claude/skills/skills/data-clean/SKILL.md

source content

Data Clean

Clean the given tabular data file by fixing common data quality issues.

Cowork note: If relative paths don't resolve, call
mcp__qsv__qsv_get_working_dir
and
mcp__qsv__qsv_set_working_dir
to sync the working directory.

Steps

Index: Run
```
mcp__qsv__qsv_index
```
on the file for fast random access in subsequent steps.
Assess current state: Run
```
mcp__qsv__qsv_sniff
```
and
```
mcp__qsv__qsv_count
```
to understand the file format and size.

Profile for cleaning decisions: Run

mcp__qsv__qsv_stats

with

cardinality: true, stats_jsonl: true

. Read

.stats.csv

to decide which cleaning steps are needed:

Stats Column	What It Reveals	Cleaning Action
`nullcount` , `sparsity`	Missing values per column	If sparsity > 0.5, decide: impute, drop column, or flag
`cardinality` vs row count	Duplicate rows exist if any key column has cardinality < row count	Run `dedup`
`min_length` , `max_length`	String length variation	Large gap suggests ragged data or embedded whitespace
`sort_order`	Whether data is pre-sorted	Use `dedup --sorted` for streaming mode if sorted
`mode` , `mode_count`	Dominant values	If mode_count > 80% of rows, investigate data entry defaults
`type`	Inferred types	String columns that should be numeric indicate format issues

Check headers: Run
```
mcp__qsv__qsv_headers
```
to inspect column names. If names contain spaces, special characters, or are duplicated, plan to use
```
safenames
```
.
Build cleaning steps: Apply these operations in order (skip any that aren't needed based on assessment):

a.
```
safenames
```
- Normalize column names to safe, ASCII-only identifiers (removes spaces, special chars, ensures uniqueness)

b.
```
fixlengths
```
- Ensure all rows have the same number of fields (pads short rows, truncates long rows)

c.
```
sqlp
```
- Remove leading/trailing whitespace from columns using
```
TRIM()
```
. Example:
```
SELECT TRIM(col1) AS col1, TRIM(col2) AS col2 FROM _t_1
```
.

d.
```
dedup
```
- Remove exact duplicate rows. Loads all data into memory and sorts internally. Use
```
--sorted
```
if input is already sorted to enable streaming mode with constant memory.

e.
```
validate
```
- If a JSON Schema is available, validate against it and report violations.
Verify results: Run
```
mcp__qsv__qsv_count
```
on the output to confirm row count. Run
```
mcp__qsv__qsv_stats
```
with
```
cardinality: true
```
to verify improvements.
Report changes: Summarize what was cleaned:
- Headers renamed (before -> after)
- Rows with wrong field count (fixed by fixlengths)
- Duplicate rows removed
- Whitespace trimmed

Cleaning Steps

Call each tool sequentially, passing the output of one step as input to the next:

mcp__qsv__qsv_command

with

command: "safenames"

input_file: "<file>"

output_file: "step1.csv"

mcp__qsv__qsv_command

with

command: "fixlengths"

input_file: "step1.csv"

output_file: "step2.csv"

mcp__qsv__qsv_sqlp

with

input_file: "step2.csv"

sql: "SELECT TRIM(col1) AS col1, TRIM(col2) AS col2, ... FROM _t_1"

output_file: "step3.csv"

(list all columns with TRIM)

mcp__qsv__qsv_command

with

command: "dedup"

input_file: "step3.csv"

output_file: "<output>"

Notes

Always preserve the original file - write output to a new file
For large files (> 100MB),
```
dedup
```
loads entire file into memory to sort and deduplicate; consider using
```
sqlp
```
with
```
SELECT DISTINCT
```
instead
```
safenames
```
uses
```
--mode conditional
```
by default (only renames if needed)
If the user specifies particular columns to clean, use column selection syntax instead of cleaning all columns
```
dedup
```
loads all data into memory and sorts internally; if input is already sorted, use
```
--sorted
```
for streaming mode
Use
```
mcp__qsv__qsv_search_tools
```
to find additional cleaning tools if needed (e.g.,
```
replace
```
for regex substitution)