# qsv data-quality

Quality dimensions quick reference and remediation decision tree for tabular data assessment.

## Install

Source: clone the upstream repo.

```sh
git clone https://github.com/dathere/qsv
```

Claude Code: install into `~/.claude/skills/`.

```sh
T=$(mktemp -d) && git clone --depth=1 https://github.com/dathere/qsv "$T" && mkdir -p ~/.claude/skills && cp -r "$T/.claude/skills/skills/data-quality" ~/.claude/skills/dathere-qsv-data-quality && rm -rf "$T"
```

Manifest: `.claude/skills/skills/data-quality/SKILL.md`

## Data Quality Assessment with qsv

For the full step-by-step profiling workflow, use the `/data-profile` command. This skill provides quick-reference guidance for quality assessment and remediation decisions.

## Quality Dimensions (Quick Reference)

| Dimension | Key Question | Primary Check | Red Flag |
|---|---|---|---|
| Completeness | Missing values? | `stats` `nullcount`, `sparsity` | Sparsity > 0.5 |
| Uniqueness | Unwanted duplicates? | `stats --cardinality` vs. row count | Key column cardinality < row count |
| Validity | Correct formats/types? | `stats` `type`; `validate schema.json` | String type on numeric column |
| Consistency | Uniform formats? | `frequency` — case variants; `sniff` — encoding | Same value in different cases |
| Accuracy | Plausible values? | `stats` — min/max/stddev | Values > 3 stddev from mean |
| Column Name Quality | Headers safe & descriptive? | `safenames --verify` | Spaces, special chars, or duplicates in headers |
| Conformity | Values follow standards? | `searchset` with domain regex | Non-standard codes (country, state, zip, phone) |
| Referential Integrity | Foreign keys valid? | `joinp --left-anti` | Orphaned references across related files |
| Injection Safety | Malicious payloads? | `searchset` with injection regex | Formula/SQL injection patterns in cells |
| Documentation | Dataset described? | `describegpt --all` | No Data Dictionary or Description |
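
The checks in the table map directly onto qsv invocations. Below is a minimal sketch of a first pass, assuming an input file named `data.csv` (a placeholder) and using the flags as given above:

```sh
# Completeness, uniqueness, validity, accuracy: one stats run
qsv stats data.csv --cardinality | qsv table

# Consistency: top values per column, to spot case variants
qsv frequency data.csv --limit 10 | qsv table

# Column-name quality: verify headers without rewriting them
qsv safenames --verify data.csv

# Encoding and CSV dialect detection
qsv sniff data.csv
```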

## Remediation Decision Tree

When a quality issue is found, choose the right fix:

| Problem | Severity | Fix Command | When to Skip |
|---|---|---|---|
| Ragged rows | High | `fixlengths` | Never — breaks downstream tools |
| Wrong encoding | High | `input` | File is already UTF-8 (check with `sniff`) |
| Unsafe column names | Medium | `safenames` | Headers already safe (no spaces/special chars) |
| Leading/trailing whitespace | Medium | `sqlp` with `TRIM(col)` | Stats show no difference between `min`/`max` lengths and trimmed values |
| Duplicate rows | Medium | `dedup` (or `extdedup` for >1 GB) | `stats --cardinality` on key columns shows all unique |
| Inconsistent case | Low | `sqlp` with `UPPER(col)` or `LOWER(col)` | `frequency` shows no case variants |
| Empty values | Low | `sqlp` with `COALESCE(NULLIF(col, ''), 'N/A')` | Nulls are semantically meaningful |
| Non-conforming values | Medium | `searchset` + `search --flag` | No domain standard applies |
| Orphaned foreign keys | Medium | `joinp --left-anti` | Single-file dataset with no references |
| Injection payloads | High | `searchset` with injection regex + sanitize | Data is internal-only and never opened in spreadsheets or loaded into databases |
| Invalid rows | Low | `validate schema.json` + filter | No schema available |
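
As a concrete example, the empty-values fix might look like the sketch below, assuming a column named `city` in `data.csv` (both placeholder names) and `sqlp`'s convention of exposing the first input file as table `_t_1`:

```sh
# Replace empty strings with an explicit 'N/A' sentinel.
# Skip this when nulls are semantically meaningful (see the table above).
# Only the one column is selected here, for brevity.
qsv sqlp data.csv \
  "SELECT COALESCE(NULLIF(city, ''), 'N/A') AS city FROM _t_1" \
  --output fixed.csv
```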

## Fix Ordering

Always apply fixes in this order to avoid cascading issues:

```
1. input          (encoding — must be UTF-8 before anything else)
2. safenames      (headers — fixes names before column references)
3. fixlengths     (structure — ensures consistent field counts)
4. sqlp with TRIM()    (whitespace — clean values before dedup)
5. dedup          (duplicates — remove after trimming so "foo " and "foo" match)
6. validate       (validation — check against schema last)
```
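
Put together, the ordering might look like the following sketch, with placeholder file names (`raw.csv`, `schema.json`) and a single illustrative text column `name`:

```sh
qsv input raw.csv --output 1_utf8.csv            # 1. normalize encoding first
qsv safenames 1_utf8.csv --output 2_named.csv    # 2. sanitize headers
qsv fixlengths 2_named.csv --output 3_fixed.csv  # 3. even out ragged rows
qsv sqlp 3_fixed.csv \
  "SELECT TRIM(name) AS name FROM _t_1" \
  --output 4_trimmed.csv                         # 4. trim (one column shown)
qsv dedup 4_trimmed.csv --output 5_deduped.csv   # 5. dedup after trimming
qsv validate 5_deduped.csv schema.json           # 6. validate against schema
```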

## Stats Cache as Quality Dashboard

After running `stats --cardinality --stats-jsonl` (basic moarstats auto-runs), read the `.stats.csv` cache to assess quality in one pass:

| Cache Column | Quality Signal |
|---|---|
| `nullcount` | Completeness — 0 is ideal |
| `sparsity` | Completeness — ratio of nulls (0.0–1.0) |
| `cardinality` | Uniqueness — compare to row count |
| `type` | Validity — check expected types |
| `min` / `max` | Accuracy — plausible range? |
| `mean` / `stddev` | Accuracy — outlier detection (>3σ) |
| `outliers_total_cnt` | Accuracy — outlier count per column (from moarstats) |
| `mode` | Consistency — dominant value expected? |
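
One way to skim this dashboard from the shell — a minimal sketch, assuming `data.csv` as input and that the cache is written beside it as `data.stats.csv` (the file names and cache location are assumptions, not verified behavior):

```sh
# Compute stats (also writing the cache), then view the signal columns.
qsv stats data.csv --cardinality --stats-jsonl > /dev/null
qsv select field,type,nullcount,sparsity,cardinality data.stats.csv | qsv table
```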

## Advanced Stats (via `moarstats --advanced`)

Run `moarstats --advanced` to enrich the cache with distribution-shape metrics:

| Cache Column | Quality Signal |
|---|---|
| `kurtosis` | >3 = heavy tails (outlier-prone); <3 = light tails; >10 = extreme outliers |
| `bimodality_coefficient` | >=0.555 suggests a bimodal distribution (possible mixed populations) |
| `jarque_bera_pvalue` | <0.05 = NOT normally distributed; flag analyses assuming normality |
| `gini_coefficient` | Near 1 = extreme concentration; near 0 = uniform |
| `shannon_entropy` | Low = concentrated values; high = diverse |
| `winsorized_mean` | Compare to `mean` — large difference signals outlier influence |
| `median_mean_ratio` | <0.8 or >1.2 = significantly skewed; mean may be misleading |
| `range_stddev_ratio` | Very high = extreme outliers relative to variability |
| `cv` | >100% = high relative variability; data is highly spread relative to the mean |
| `mad_stddev_ratio` | >0.8 = stddev is reliable; <<0.8 = outliers inflating stddev |
| `mode_zscore` | Far from 0 = mode is atypical; possible mixed populations |
| `trimean` | Robust central tendency: (Q1 + 2*median + Q3)/4; compare to mean for skew detection |
| `midhinge` | Midpoint of the middle 50%: (Q1 + Q3)/2; robust center measure |
| `robust_cv` | MAD/abs(median); outlier-resistant coefficient of variation |
| `theil_index` | Inequality measure (0 = equal); decomposable into within/between group; only for positive values |
| `mean_ad` | Average absolute distance from the mean; less sensitive to outliers than stddev |
| `simpsons_diversity_index` | Probability that two random values differ (0–1); more intuitive than entropy |
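
To turn these signals into a worklist, the enriched cache can be queried like any other CSV. A minimal sketch, assuming the cache file is `data.stats.csv` and contains the columns above (thresholds taken from the table):

```sh
# Columns with extreme-outlier kurtosis or an outlier-inflated stddev.
qsv sqlp data.stats.csv \
  "SELECT field, kurtosis, mad_stddev_ratio FROM _t_1
   WHERE kurtosis > 10 OR mad_stddev_ratio < 0.8"
```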