qsv data-quality
Quality dimensions quick reference and remediation decision tree for tabular data assessment
install
source · Clone the upstream repo
git clone https://github.com/dathere/qsv
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/dathere/qsv "$T" && mkdir -p ~/.claude/skills && cp -r "$T/.claude/skills/skills/data-quality" ~/.claude/skills/dathere-qsv-data-quality && rm -rf "$T"
manifest: `.claude/skills/skills/data-quality/SKILL.md`
Data Quality Assessment with qsv
For the full step-by-step profiling workflow, use the
/data-profile command. This skill provides quick-reference guidance for quality assessment and remediation decisions.
Quality Dimensions (Quick Reference)
| Dimension | Key Question | Primary Check | Red Flag |
|---|---|---|---|
| Completeness | Missing values? | `stats` nullcount, sparsity | Sparsity > 0.5 |
| Uniqueness | Unwanted duplicates? | `stats --cardinality` vs row count | Key column cardinality < row count |
| Validity | Correct formats/types? | `stats` type inference; `validate` | String type on numeric column |
| Consistency | Uniform formats? | `frequency` for case variants; `input` for encoding | Same value in different cases |
| Accuracy | Plausible values? | `stats` min/max/stddev | Values > 3 stddev from mean |
| Column Name Quality | Headers safe & descriptive? | `safenames` | Spaces, special chars, or duplicates in headers |
| Conformity | Values follow standards? | `searchset` with domain regexes | Non-standard codes (country, state, zip, phone) |
| Referential Integrity | Foreign keys valid? | `joinp` across related files | Orphaned references across related files |
| Injection Safety | Malicious payloads? | `searchset` with injection regexes | Formula/SQL injection patterns in cells |
| Documentation | Dataset described? | `describegpt` | No Data Dictionary or Description |
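The completeness and uniqueness checks above boil down to two counts per column. A minimal plain-Python sketch of that logic, on an invented three-row table (the data and thresholds are illustrative, not qsv's internals):

```python
import csv
import io

# Tiny invented dataset: "id" has a duplicate key, "city" has a null
# and a case variant (which the consistency dimension would also flag).
data = "id,city\n1,Boston\n2,\n2,boston\n"
rows = list(csv.DictReader(io.StringIO(data)))
n = len(rows)

for col in rows[0]:
    values = [r[col] for r in rows]
    sparsity = sum(v == "" for v in values) / n   # ratio of nulls, 0.0-1.0
    cardinality = len(set(values))                # count of distinct values
    dup_key_risk = cardinality < n                # True = possible duplicate key
    print(col, sparsity, dup_key_risk)
```

Running this flags `id` as a duplicate-key risk (cardinality 2 < 3 rows) and shows `city` with sparsity of one-third.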
Remediation Decision Tree
When a quality issue is found, choose the right fix:
| Problem | Severity | Fix Command | When to Skip |
|---|---|---|---|
| Ragged rows | High | `fixlengths` | Never; breaks downstream tools |
| Wrong encoding | High | `input` | File is already UTF-8 |
| Unsafe column names | Medium | `safenames` | Headers already safe (no spaces/special chars) |
| Leading/trailing whitespace | Medium | `sqlp` with `TRIM()` | Stats show no difference between raw and trimmed value lengths |
| Duplicate rows | Medium | `dedup` (or `extdedup` for >1GB) | Cardinality on key columns shows all unique |
| Inconsistent case | Low | `apply` with `upper` or `lower` | `frequency` shows no case variants |
| Empty values | Low | `apply emptyreplace` | Nulls are semantically meaningful |
| Non-conforming values | Medium | `searchset` + `replace` | No domain standard applies |
| Orphaned foreign keys | Medium | `joinp` | Single-file dataset with no references |
| Injection payloads | High | `searchset` with injection regex + sanitize | Data is internal-only and never opened in spreadsheets or loaded into databases |
| Invalid rows | Low | `validate` + filter | No schema available |
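For the injection-payloads row, the standard spreadsheet formula-injection heuristic flags cells that begin with `=`, `+`, `-`, or `@`. A sketch of that check; the regex is illustrative, not necessarily the pattern this skill ships:

```python
import re

# Cells starting with =, +, - or @ may execute as formulas when the CSV
# is opened in a spreadsheet application (CSV/formula injection).
FORMULA_INJECTION = re.compile(r"^[=+\-@]")

def is_suspicious(cell: str) -> bool:
    return bool(FORMULA_INJECTION.match(cell))

cells = ["=HYPERLINK(...)", "@SUM(A1:A9)", "safe value", "-42"]
print([c for c in cells if is_suspicious(c)])
```

Note that a bare leading `-` also matches negative numbers, so a real check would typically exempt cells that parse cleanly as numeric.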
Fix Ordering
Always apply fixes in this order to avoid cascading issues:
1. `input` (encoding: must be UTF-8 before anything else)
2. `safenames` (headers: fix names before column references)
3. `fixlengths` (structure: ensure consistent field counts)
4. `sqlp` with `TRIM()` (whitespace: clean values before dedup)
5. `dedup` (duplicates: remove after trimming so "foo " and "foo" match)
6. `validate` (validation: check against schema last)
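Why trimming must happen before deduplication can be shown in a few lines, with plain Python standing in for the `TRIM()` and dedup steps:

```python
rows = ["foo ", "foo", " bar", "bar"]

# Dedup before trimming: stray whitespace makes duplicates look distinct,
# so all 4 rows survive.
dedup_first = set(rows)

# Trim first, then dedup: the intended duplicates collapse to 2 rows.
trim_first = set(r.strip() for r in rows)

print(len(dedup_first), len(trim_first))  # → 4 2
```

Reversing the order would leave the whitespace-padded duplicates in place, which is exactly the cascading issue the ordering avoids.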
Stats Cache as Quality Dashboard
After running `stats --cardinality --stats-jsonl` (basic moarstats auto-runs), read the `.stats.csv` cache to assess quality in one pass:
| Cache Column | Quality Signal |
|---|---|
| `nullcount` | Completeness: 0 is ideal |
| `sparsity` | Completeness: ratio of nulls (0.0–1.0) |
| `cardinality` | Uniqueness: compare to row count |
| `type` | Validity: check expected types |
| `min`/`max` | Accuracy: plausible range? |
| `mean`/`stddev` | Accuracy: outlier detection (>3σ) |
| outlier count | Accuracy: from moarstats; outlier count per column |
| `mode` | Consistency: dominant value expected? |
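As an illustration of using the cache as a dashboard, here is a sketch that reads a simulated `.stats.csv` with the stdlib and flags columns against the red-flag thresholds above; the sample values, assumed row count, and cache layout are invented for the example:

```python
import csv
import io

# Simulated stats cache; column names mirror the table above.
cache = """field,type,nullcount,sparsity,cardinality,mean,stddev
id,Integer,0,0.0,1000,500.5,288.8
city,String,600,0.6,42,,
"""

ROW_COUNT = 1000  # assumed; in practice taken from the dataset itself

for row in csv.DictReader(io.StringIO(cache)):
    flags = []
    if float(row["sparsity"]) > 0.5:
        flags.append("completeness: sparsity > 0.5")
    if int(row["cardinality"]) < ROW_COUNT:
        flags.append("uniqueness: cardinality < row count")
    if flags:
        print(row["field"], flags)
```

With these numbers only `city` is flagged, on both completeness and uniqueness.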
Advanced Stats (via `moarstats --advanced`)
Run `moarstats --advanced` to enrich the cache with distribution shape metrics:
| Cache Column | Quality Signal |
|---|---|
| kurtosis | >3 heavy tails (outlier-prone), <3 light tails; >10 = extreme outliers |
| bimodality coefficient | >=0.555 suggests bimodal distribution (possible mixed populations) |
| normality test p-value | <0.05 = NOT normally distributed; flag analyses assuming normality |
| concentration index | Near 1 = extreme concentration; near 0 = uniform |
| entropy | Low = concentrated values; high = diverse |
| trimmed mean | Compare to mean; large difference signals outlier influence |
| mean/median ratio | <0.8 or >1.2 = significantly skewed; mean may be misleading |
| range/stddev ratio | Very high = extreme outliers relative to variability |
| cv | >100% = high relative variability; data is highly spread relative to mean |
| MAD/stddev ratio | >0.8 = stddev is reliable; <<0.8 = outliers inflating stddev |
| mode z-score | Far from 0 = mode is atypical; possible mixed populations |
| trimean | Robust central tendency: (Q1 + 2*median + Q3)/4; compare to mean for skew detection |
| midhinge | Midpoint of middle 50%: (Q1+Q3)/2; robust center measure |
| robust cv | MAD/abs(median); outlier-resistant coefficient of variation |
| Theil index | Inequality measure (0=equal); decomposable into within/between group; only for positive values |
| mean absolute deviation | Average absolute distance from mean; less sensitive to outliers than stddev |
| Gini-Simpson index | Probability two random values differ (0-1); more intuitive than entropy |
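The trimean, midhinge, and robust CV rows are simple formulas. A sketch with Python's `statistics` module on made-up data; note that `statistics.quantiles` defaults to the exclusive quartile method, which may differ from qsv's:

```python
import statistics as st

xs = [1, 2, 2, 3, 3, 3, 4, 50]  # one extreme outlier

q1, med, q3 = st.quantiles(xs, n=4)       # quartile cut points
trimean = (q1 + 2 * med + q3) / 4         # robust central tendency
midhinge = (q1 + q3) / 2                  # midpoint of the middle 50%
mad = st.median(abs(x - med) for x in xs) # median absolute deviation
robust_cv = mad / abs(med)                # outlier-resistant CV

# mean (8.5) is pulled far above the robust trimean by the single outlier,
# which is exactly the skew signal the table describes.
print(st.mean(xs), trimean, midhinge, robust_cv)
```

Comparing the mean against the trimean (here, 8.5 vs roughly 2.9) makes the outlier's influence obvious even without plotting the distribution.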