qsv data-validate

Validate data and analyses before sharing: methodology, accuracy, bias, and data quality checks.

Install

Source: clone the upstream repo
git clone https://github.com/dathere/qsv
Claude Code: install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/dathere/qsv "$T" && mkdir -p ~/.claude/skills && cp -r "$T/.claude/skills/skills/data-validate" ~/.claude/skills/dathere-qsv-data-validate && rm -rf "$T"
Manifest: .claude/skills/skills/data-validate/SKILL.md

Source content

Data Validate

Validate data files and analyses for accuracy, methodology, and potential biases before sharing with stakeholders. Generates a confidence assessment and improvement suggestions.

Cowork note: if relative paths don't resolve, call mcp__qsv__qsv_get_working_dir and mcp__qsv__qsv_set_working_dir to sync the working directory.

Usage

The input can be:

  • A CSV/TSV/Excel file to validate for data quality
  • An analysis, report, or document to review for methodology
  • SQL queries and their results to verify
  • A description of methodology and findings

Steps

1. Data Quality Validation

If a data file is provided, run these checks using qsv:

a. Index and profile: Run mcp__qsv__qsv_index, then mcp__qsv__qsv_stats with cardinality: true, stats_jsonl: true, and mcp__qsv__qsv_sniff to understand the data.
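
A minimal CLI sketch of the same profiling pass, assuming the file is data.csv (flag names follow current qsv releases; verify with qsv stats --help):

    # Build an index so stats, count, and frequency run fast on large files
    qsv index data.csv
    # Profile every column; --cardinality adds distinct counts and
    # --stats-jsonl caches the results for downstream commands
    qsv stats --cardinality --stats-jsonl data.csv
    # Detect format, delimiter, encoding, and header status
    qsv sniff data.csv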

b. Completeness: Read .stats.csv — check nullcount and sparsity for each column. Flag columns with sparsity > 0.5.
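
For example, the threshold check can run directly against the stats cache with sqlp; this sketch assumes sqlp's _t_1 alias for the first input file and the field/nullcount/sparsity columns of the stats output:

    # List columns whose sparsity exceeds the 0.5 threshold
    qsv sqlp data.stats.csv "SELECT field, nullcount, sparsity FROM _t_1 WHERE sparsity > 0.5"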

c. Uniqueness: Compare cardinality to the row count from mcp__qsv__qsv_count. Flag key columns (ID, email) where cardinality < row count. Run mcp__qsv__qsv_command with command: "dedup" and options: {"dupes-output": "dupes.csv"} to find exact duplicates.
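
The equivalent CLI sketch (data.csv is a placeholder):

    # Row count to compare against each key column's cardinality
    qsv count data.csv
    # Keep a deduped copy and write the exact duplicates to dupes.csv
    qsv dedup --dupes-output dupes.csv data.csv -o deduped.csv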

d. Validity: Check the type column in stats — flag String columns that should be numeric. Run mcp__qsv__qsv_command with command: "validate" against a JSON Schema if available.
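
If a schema exists, the CLI form is roughly (schema.json is a placeholder path):

    # Validate every row against a JSON Schema
    qsv validate data.csv schema.json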

e. Consistency: Run mcp__qsv__qsv_frequency with limit: 20 on categorical columns — look for case variants ("NYC" vs "nyc"), inconsistent formats, and unexpected values.
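
For instance (city is a hypothetical categorical column):

    # Top 20 values for the selected column; case variants show up side by side
    qsv frequency --limit 20 --select city data.csv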

f. Accuracy: Read .stats.csv for min, max, mean, stddev — flag implausible ranges (negative ages, latitude > 90, future dates). Run mcp__qsv__qsv_moarstats with advanced: true — check outliers_percentage > 5% and kurtosis > 10 (extreme outliers).

g. Distribution sanity: Read moarstats columns for deeper validation (see the sqlp cross-check after this list):

  • median_mean_ratio — if < 0.8 or > 1.2, the distribution is significantly skewed; verify the mean isn't misleading
  • winsorized_mean_25pct vs mean — large divergence (> 10%) confirms outliers are distorting the average
  • mad (median absolute deviation) — more robust than stddev for outlier detection; if mad_stddev_ratio > 0.8, stddev is reasonably reliable
  • jarque_bera_pvalue — if < 0.05, the data is NOT normally distributed; flag any analysis that assumes normality
  • mode_count — if the mode accounts for > 50% of values, investigate whether it reflects a data-entry default or missing-value masking
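
As a cross-check independent of moarstats, the skew ratio can be recomputed with sqlp (amount is a hypothetical numeric column; MEDIAN is supported by the Polars SQL engine behind sqlp, but confirm on your version):

    # A ratio near 1.0 suggests a roughly symmetric distribution
    qsv sqlp data.csv "SELECT MEDIAN(amount) / AVG(amount) AS median_mean_ratio FROM data"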

h. Join integrity (if multiple files): Run mcp__qsv__qsv_joinp with left_anti: true to find orphaned foreign keys.
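
A CLI sketch, assuming orders.csv carries a customer_id foreign key into customers.csv (all names are placeholders; check qsv joinp --help for the anti-join flag on your version):

    # Rows in orders.csv whose customer_id has no match in customers.csv
    qsv joinp --left-anti customer_id orders.csv customer_id customers.csv -o orphans.csv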

i. Injection screening: Run mcp__qsv__qsv_command with command: "searchset", regexset-file: "${CLAUDE_PLUGIN_ROOT}/resources/injection-regexes.txt", and flag: "injection_match" to scan for malicious payloads.
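
The direct CLI form is roughly as follows; --flag appends a column marking rows that matched any pattern:

    # Flag rows matching any injection pattern in an injection_match column
    qsv searchset --flag injection_match "${CLAUDE_PLUGIN_ROOT}/resources/injection-regexes.txt" data.csv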

2. Review Methodology and Assumptions

Examine the analysis for:

  • Question framing: Is the analysis answering the right question? Could it be interpreted differently?
  • Data selection: Are the right datasets being used? Is the time range appropriate?
  • Population definition: Is the analysis population correctly defined? Are there unintended exclusions?
  • Metric definitions: Are metrics defined clearly and consistently? Do they match how stakeholders understand them?
  • Baseline and comparison: Is the comparison fair? Are time periods, cohort sizes, and contexts comparable?

3. Check for Common Analytical Pitfalls

Systematically review against these pitfalls:

| Pitfall | How to Detect with qsv | Red Flag |
|---|---|---|
| Join explosion | mcp__qsv__qsv_count before and after the join | Row count increased after join |
| Survivorship bias | mcp__qsv__qsv_frequency on status/lifecycle columns | Missing churned/deleted/failed entities |
| Incomplete period | mcp__qsv__qsv_sqlp to check date ranges | Partial periods compared to full periods |
| Denominator shifting | mcp__qsv__qsv_sqlp to verify denominator consistency | Definition changed between periods |
| Average of averages | mcp__qsv__qsv_sqlp to recalculate from raw data | Pre-aggregated averages with unequal group sizes |
| Selection bias | mcp__qsv__qsv_frequency on segment definitions | Segments defined by the outcome being measured |

4. Verify Calculations and Aggregations

Spot-check using mcp__qsv__qsv_sqlp (worked invocations follow this list):

  • Recalculate key numbers independently
  • Verify subtotals sum to totals:
    SELECT SUM(subtotal) as check_total FROM data
  • Check percentages sum to ~100%:
    SELECT SUM(pct) FROM data
  • Validate YoY/MoM comparisons use correct base periods
  • Confirm filters are applied consistently across all metrics
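
Concrete invocations for the two sum checks above, assuming the input file is data.csv so sqlp exposes it as the table data (subtotal and pct are placeholder columns):

    # Subtotals should reproduce the reported total
    qsv sqlp data.csv "SELECT SUM(subtotal) AS check_total FROM data"
    # Percentage columns should sum to ~100 within rounding
    qsv sqlp data.csv "SELECT SUM(pct) AS pct_total FROM data"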

Magnitude Checks

| Metric Type | Sanity Check via qsv |
|---|---|
| Counts | mcp__qsv__qsv_count — does it match known figures? |
| Sums/averages | mcp__qsv__qsv_stats — are min/max/mean in a plausible range? |
| Rates | mcp__qsv__qsv_sqlp — are values between 0% and 100%? |
| Distributions | mcp__qsv__qsv_frequency — do segment percentages sum to ~100%? |
| Growth rates | mcp__qsv__qsv_sqlp — is 50%+ MoM growth realistic? |
| Outliers | mcp__qsv__qsv_moarstats — check outliers_percentage and kurtosis |

Red Flags That Warrant Investigation

  • Any metric that changed by more than 50% period-over-period without an obvious cause (see the sketch after this list)
  • Counts or sums that are exact round numbers (suggests a filter or default value issue)
  • Rates exactly at 0% or 100% (may indicate incomplete data)
  • Results that perfectly confirm the hypothesis (reality is usually messier)
  • Identical values across time periods or segments (suggests the query is ignoring a dimension)
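
A quick screen for the first red flag, as a sketch (month and revenue are hypothetical columns):

    # Compare adjacent periods by eye for >50% jumps without an obvious cause
    qsv sqlp data.csv "SELECT month, SUM(revenue) AS total FROM data GROUP BY month ORDER BY month"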

5. Assess Visualizations (if present)

If the analysis includes charts:

  • Do axes start at appropriate values (zero for bar charts)?
  • Are scales consistent across comparison charts?
  • Do chart titles accurately describe what's shown?
  • Could the visualization mislead a quick reader?
  • Are there truncated axes, inconsistent intervals, or 3D effects that distort perception?

6. Evaluate Narrative and Conclusions

Review whether:

  • Conclusions are supported by the data shown
  • Alternative explanations are acknowledged
  • Uncertainty is communicated appropriately
  • Recommendations follow logically from findings
  • The level of confidence matches the strength of evidence

7. Suggest Improvements

Provide specific, actionable suggestions:

  • Additional analyses that would strengthen the conclusions
  • Caveats or limitations that should be noted
  • Better visualizations or framings for key points
  • Missing context that stakeholders would want

8. Generate Confidence Assessment

Rate the analysis on a 3-level scale:

Ready to share — Analysis is methodologically sound, calculations verified, caveats noted. Minor suggestions for improvement but nothing blocking.

Share with noted caveats — Analysis is largely correct but has specific limitations or assumptions that must be communicated to stakeholders. List the required caveats.

Needs revision — Found specific errors, methodological issues, or missing analyses that should be addressed before sharing. List the required changes with priority order.

Pre-Delivery QA Checklist

Data Quality Checks

  • Source verification: Confirmed data sources. Run mcp__qsv__qsv_sniff to verify format and encoding.
  • Freshness: Data is current enough. Noted the "as of" date.
  • Completeness: No gaps in time series. Check nullcount/sparsity in .stats.csv.
  • Null handling: Nulls handled appropriately (excluded, imputed, or flagged).
  • Deduplication: No double-counting. Verify with mcp__qsv__qsv_count before/after joins and dedup --dupes-output.
  • Filter verification: All filters correct. No unintended exclusions.

Calculation Checks

  • Aggregation logic: GROUP BY includes all non-aggregated columns.
  • Denominator correctness: Rate calculations use the right denominator (non-zero).
  • Date alignment: Comparisons use same time period length. Partial periods excluded or noted.
  • Join correctness: JOIN types appropriate. Verify row counts with mcp__qsv__qsv_count after joins.
  • Metric definitions: Metrics match stakeholder definitions. Deviations noted.
  • Subtotals sum: Parts add up to the whole. Verify with mcp__qsv__qsv_sqlp.

Reasonableness Checks

  • Magnitude: Numbers in plausible range. Check min/max in .stats.csv.
  • Trend continuity: No unexplained jumps. Use mcp__qsv__qsv_sqlp to check period-over-period.
  • Cross-reference: Key numbers match other known sources.
  • Edge cases: Checked boundaries — empty segments, zero-activity periods, new entities.

Presentation Checks

  • Chart accuracy: Bar charts start at zero. Axes labeled. Scales consistent.
  • Number formatting: Appropriate precision and consistent formatting.
  • Title clarity: Titles state the insight, not just the metric. Date ranges specified.
  • Caveat transparency: Known limitations and assumptions stated explicitly.
  • Reproducibility: Someone else could recreate this analysis.

Common Analytical Pitfalls (Reference)

Join Explosion

A many-to-many join silently multiplies rows, inflating counts and sums. Detect: mcp__qsv__qsv_count before and after the join — if the count increased, investigate the join relationship. Prevent: use COUNT(DISTINCT id) instead of COUNT(*) when counting entities through joins.
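
A sketch of the before/after count plus the distinct-count guard (orders.csv, joined.csv, and order_id are placeholders; joined.csv is the join output, exposed to sqlp as the table joined):

    # Entity count before the join...
    qsv count orders.csv
    # ...and after: if rows outnumber distinct order_ids, the join multiplied
    qsv sqlp joined.csv "SELECT COUNT(*) AS row_count, COUNT(DISTINCT order_id) AS order_count FROM joined"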

Survivorship Bias

Analyzing only entities that exist today, ignoring churned/deleted/failed ones. Detect: mcp__qsv__qsv_frequency on status columns — are all lifecycle states represented? Prevent: ask "who is NOT in this dataset?" before drawing conclusions.

Incomplete Period Comparison

Comparing a partial period to a full period. Detect: mcp__qsv__qsv_sqlp to check min/max dates per period. Prevent: filter to complete periods or compare the same number of days.
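
For example (order_date and month are hypothetical columns):

    # Per-period coverage: a short final period is the usual partial-period tell
    qsv sqlp data.csv "SELECT month, MIN(order_date) AS first_day, MAX(order_date) AS last_day, COUNT(*) AS row_count FROM data GROUP BY month ORDER BY month"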

Denominator Shifting

The denominator changes between periods, making rates incomparable. Detect: mcp__qsv__qsv_sqlp to verify denominator definition consistency. Prevent: use consistent definitions across all compared periods.

Average of Averages

Averaging pre-computed averages gives wrong results when group sizes differ. Detect: compare the mcp__qsv__qsv_stats mean against a mcp__qsv__qsv_sqlp weighted average. Prevent: always aggregate from raw data.
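
A sketch of the recalculation, assuming a pre-aggregated summary.csv with hypothetical group_avg and group_size columns:

    # Naive average of averages (wrong when group sizes differ)
    qsv sqlp summary.csv "SELECT AVG(group_avg) AS naive FROM summary"
    # Size-weighted average (matches aggregating from the raw data)
    qsv sqlp summary.csv "SELECT SUM(group_avg * group_size) / SUM(group_size) AS weighted FROM summary"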

Simpson's Paradox

A trend reverses when data is aggregated vs. segmented. Detect: mcp__qsv__qsv_sqlp GROUP BY at different granularity levels — does the conclusion change? Prevent: always check results at the segment level before aggregating.
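
For instance (variant, segment, and converted are hypothetical columns, with converted as 0/1):

    # Aggregate view...
    qsv sqlp data.csv "SELECT variant, AVG(converted) AS rate FROM data GROUP BY variant"
    # ...vs segmented view: if the winner flips per segment, that's the paradox
    qsv sqlp data.csv "SELECT segment, variant, AVG(converted) AS rate FROM data GROUP BY segment, variant ORDER BY segment, variant"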

Report Format

## Validation Report

### Overall Assessment: [Ready to share | Share with caveats | Needs revision]

### Data Quality Summary
- File: [format, rows, columns, encoding]
- Completeness: [null rates, gaps found]
- Uniqueness: [duplicates found, cardinality issues]
- Validity: [type mismatches, schema violations]
- Accuracy: [outliers, implausible ranges]

### Methodology Review
[Findings about approach, data selection, definitions]

### Issues Found
1. [Severity: High/Medium/Low] [Issue description and impact]
2. ...

### Calculation Spot-Checks
- [Metric]: [Verified / Discrepancy found]
- ...

### Visualization Review
[Any issues with charts or visual presentation]

### Suggested Improvements
1. [Improvement and why it matters]
2. ...

### Required Caveats for Stakeholders
- [Caveat that must be communicated]
- ...

Notes

  • Run this command before any high-stakes presentation or decision
  • For pure data quality profiling without methodology review, use /data-profile instead
  • If validation finds data quality issues, use /data-clean to fix them and re-validate
  • Share the validation report alongside your analysis to build stakeholder confidence