Ai-analyst data-quality-check

Skill: Data Quality Check

install
source · Clone the upstream repo
git clone https://github.com/ai-analyst-lab/ai-analyst
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/ai-analyst-lab/ai-analyst "$T" && mkdir -p ~/.claude/skills && cp -r "$T/.claude/skills/data-quality-check" ~/.claude/skills/ai-analyst-lab-ai-analyst-data-quality-check && rm -rf "$T"
manifest: .claude/skills/data-quality-check/skill.md
source content

Skill: Data Quality Check

Purpose

Validate data completeness, consistency, and coverage before any analysis begins, flagging issues with severity ratings so the analyst knows what blocks analysis vs. what to note as a caveat.

When to Use

Apply this skill at the start of every new analysis, when connecting to a new data source, or when results look suspicious. Run quality checks BEFORE drawing conclusions from data.

Instructions

Check Sequence

Run these checks in order. Stop and report blockers immediately.

1. Completeness Checks

-- Null rate per column
SELECT
    column_name,
    COUNT(*) AS total_rows,
    COUNT(*) - COUNT(column_name) AS null_count,
    ROUND(100.0 * (COUNT(*) - COUNT(column_name)) / COUNT(*), 1) AS null_pct
FROM table_name
GROUP BY column_name;

-- Missing date ranges (for time-series data)
WITH date_spine AS (
    SELECT generate_series(MIN(date_col), MAX(date_col), INTERVAL '1 day') AS expected_date
    FROM table_name
)
SELECT expected_date
FROM date_spine
LEFT JOIN table_name ON date_col = expected_date
WHERE table_name.date_col IS NULL;

-- Unexpected zeros in numeric columns
SELECT column_name, COUNT(*) AS zero_count
FROM table_name
WHERE numeric_column = 0
GROUP BY column_name;

Severity rules:

  • BLOCKER: Primary key has nulls, >50% nulls in a critical analysis column, entire date ranges missing
  • WARNING: 5-50% nulls in an analysis column, scattered missing dates, unexpected zeros in revenue/count columns
  • INFO: <5% nulls in non-critical columns, weekend gaps in business-day data

2. Consistency Checks

-- Duplicate detection
SELECT id_column, COUNT(*) AS dupes
FROM table_name
GROUP BY id_column
HAVING COUNT(*) > 1;

-- Referential integrity
SELECT child.fk_column, COUNT(*)
FROM child_table child
LEFT JOIN parent_table parent ON child.fk_column = parent.pk_column
WHERE parent.pk_column IS NULL
GROUP BY child.fk_column;

-- Date format consistency
SELECT DISTINCT LENGTH(date_column), LEFT(date_column, 4)
FROM table_name
WHERE date_column IS NOT NULL;

Severity rules:

  • BLOCKER: Duplicate primary keys, broken referential integrity affecting >10% of rows
  • WARNING: Mixed date formats, inconsistent casing in categorical columns, orphan records <10%
  • INFO: Minor casing inconsistencies, trailing whitespace

3. Coverage Checks

Use

check_temporal_coverage()
for time-series gap detection and
check_value_domain()
for categorical completeness:

from helpers.sql_helpers import check_temporal_coverage, check_value_domain

# Temporal coverage — detect missing days/weeks/months
coverage = check_temporal_coverage(df, "order_date", freq="D")
if coverage["status"] == "FAIL":
    print(f"BLOCKER: {coverage['message']}")

# Value domain — verify expected categories exist
domain = check_value_domain(df["device_type"], ["desktop", "mobile", "tablet"])
if domain["status"] == "FAIL":
    print(f"WARNING: {domain['message']}")

SQL checks for segment coverage:

-- Expected segments present
SELECT segment_column, COUNT(*) AS row_count,
       MIN(date_col) AS earliest, MAX(date_col) AS latest
FROM table_name
GROUP BY segment_column
ORDER BY row_count DESC;

-- Missing cohorts
SELECT date_trunc('month', created_at) AS cohort_month, COUNT(DISTINCT user_id)
FROM users
GROUP BY 1
ORDER BY 1;

Severity rules:

  • BLOCKER: Key segments entirely missing, temporal coverage <80%
  • WARNING: Some segments have <10% of expected rows, coverage 80-95%, unexpected category values
  • INFO: Minor imbalances in segment sizes, coverage >95%

4. Statistical Sanity Checks

Use the helper functions for systematic outlier and null concentration checks:

from helpers.tieout_helpers import check_null_concentration, check_outliers

# Null concentration — flags columns with high null rates
null_results = check_null_concentration(df)
for r in null_results:
    if r["status"] == "FAIL":
        print(f"BLOCKER: {r['column']} — {r['detail']}")
    elif r["status"] == "WARN":
        print(f"WARNING: {r['column']} — {r['detail']}")

# Outlier detection — IQR method (default) or z-score
for col in numeric_columns:
    iqr_result = check_outliers(df[col], method="iqr")
    zscore_result = check_outliers(df[col], method="zscore")
    # Use IQR as primary, z-score as cross-check
    if iqr_result["status"] in ("WARN", "FAIL"):
        print(f"WARNING: {col} — {iqr_result['detail']}")

For domain-specific sanity checks (impossible values, suspicious distributions):

def sanity_check(df, column):
    """Run statistical sanity checks on a numeric column."""
    stats = {
        "mean": df[column].mean(),
        "median": df[column].median(),
        "std": df[column].std(),
        "min": df[column].min(),
        "max": df[column].max(),
        "p1": df[column].quantile(0.01),
        "p99": df[column].quantile(0.99),
        "skew": df[column].skew(),
    }

    issues = []
    if column in ["conversion_rate", "percentage"] and (stats["max"] > 1 or stats["min"] < 0):
        issues.append(("BLOCKER", f"{column} has values outside [0,1] range"))
    if abs(stats["skew"]) > 3:
        issues.append(("WARNING", f"{column} is highly skewed (skew={stats['skew']:.1f})"))

    return stats, issues

Severity rules:

  • BLOCKER: Impossible values (negative revenue, conversion rate >100%, future dates), >95% nulls
  • WARNING: Extreme outliers (>3 IQR), >50% nulls, highly skewed distributions
  • INFO: Moderate outliers, slight skew, <5% nulls

5. Time-Series Anomaly Scan

For each date-indexed metric column in the dataset:

import pandas as pd
import numpy as np

def anomaly_scan(df, date_col, metric_col, window=14, threshold=2.0):
    """Detect time-series anomalies using rolling mean +/- std bands.

    IMPORTANT: Aggregate to daily/weekly granularity FIRST.
    Do NOT run on raw event rows.

    Args:
        df: DataFrame with date and metric columns (pre-aggregated).
        date_col: Name of the date column.
        metric_col: Name of the metric column.
        window: Rolling window size in periods. Default: 14.
        threshold: Number of standard deviations for anomaly band. Default: 2.0.

    Returns:
        Dict with 'anomalies' (list of dicts) and 'summary' (str).
    """
    ts = df.sort_values(date_col).copy()
    ts["rolling_mean"] = ts[metric_col].rolling(window, min_periods=3).mean()
    ts["rolling_std"] = ts[metric_col].rolling(window, min_periods=3).std()
    ts["upper"] = ts["rolling_mean"] + threshold * ts["rolling_std"]
    ts["lower"] = ts["rolling_mean"] - threshold * ts["rolling_std"]

    anomalies = []
    for _, row in ts.iterrows():
        if pd.notna(row["upper"]) and row[metric_col] > row["upper"]:
            pct = ((row[metric_col] - row["rolling_mean"]) / row["rolling_mean"]) * 100
            anomalies.append({
                "date": row[date_col], "value": row[metric_col],
                "direction": "spike", "pct_above_normal": round(pct, 1)
            })
        elif pd.notna(row["lower"]) and row[metric_col] < row["lower"]:
            pct = ((row["rolling_mean"] - row[metric_col]) / row["rolling_mean"]) * 100
            anomalies.append({
                "date": row[date_col], "value": row[metric_col],
                "direction": "drop", "pct_below_normal": round(pct, 1)
            })
    return {"anomalies": anomalies, "summary": f"{len(anomalies)} anomalies in {metric_col}"}

Sequencing: Run AFTER basic data profiling in the Data Explorer step, not before. Requires aggregated data.

Severity rules:

  • WARNING: Any anomaly detected — present as starting point for investigation
  • INFO: No anomalies found — note that the metric appears stable

Output format:

Notable patterns detected:
  - [metric] spiked [X]% above normal on [date range]
  - [metric] dropped [X]% below normal on [date range]

These are observations, not conclusions — present as starting points for investigation.

6. Data Freshness Check

For each table with a date/timestamp column:

import pandas as pd
from datetime import datetime, timedelta

def freshness_check(df, date_col, current_date=None):
    """Check data freshness and infer data cadence.

    Args:
        df: DataFrame with a date column.
        date_col: Name of the date/timestamp column.
        current_date: Override for current date (for testing). Default: today.

    Returns:
        Dict with 'max_date', 'days_ago', 'cadence', 'status'.
    """
    current_date = current_date or datetime.now().date()
    dates = pd.to_datetime(df[date_col]).dt.date
    max_date = dates.max()
    days_ago = (current_date - max_date).days

    # Infer cadence from median gap between consecutive distinct dates
    distinct_dates = sorted(dates.dropna().unique())
    if len(distinct_dates) >= 2:
        gaps = [(distinct_dates[i+1] - distinct_dates[i]).days
                for i in range(len(distinct_dates) - 1)]
        median_gap = sorted(gaps)[len(gaps) // 2]

        if median_gap <= 1.5:
            cadence = "daily"
            stale_threshold = 2
        elif median_gap <= 8:
            cadence = "weekly"
            stale_threshold = 10
        else:
            cadence = "static/historical"
            stale_threshold = None
    else:
        cadence = "unknown"
        stale_threshold = None

    # Determine status
    if days_ago > 90:
        cadence = "static/historical"
        status = "OK"
        note = f"Historical dataset, date range ends {max_date}"
    elif stale_threshold and days_ago > stale_threshold:
        status = "WARNING"
        note = f"Data is {days_ago} days old (expected {cadence} refresh)"
    else:
        status = "OK"
        note = f"Data is {days_ago} days old"

    return {
        "max_date": str(max_date), "days_ago": days_ago,
        "cadence": cadence, "status": status, "note": note
    }

Output format:

Data freshness:
  - events: most recent = [date] ([N] days ago) [OK/WARNING]
  - orders: most recent = [date] ([N] days ago) [OK/WARNING]
  - users: most recent = [date] ([N] days ago) [OK/WARNING]

Severity rules:

  • WARNING: Data is stale relative to inferred cadence
  • INFO: Data is fresh or dataset is static/historical

Output Format

# Data Quality Report: [Dataset Name]
## Date: [YYYY-MM-DD]
## Analyst: AI Product Analyst

### Summary
| Severity | Count | Details |
|----------|-------|---------|
| BLOCKER  | X     | [Must fix before analysis] |
| WARNING  | X     | [Note as caveat in analysis] |
| INFO     | X     | [For awareness only] |

### BLOCKERS
[List each blocker with: what's wrong, which column/table, how many rows affected, suggested fix]

### WARNINGS
[List each warning with: what's wrong, potential impact on analysis, recommended handling]

### INFO
[List each info item briefly]

### Data Profile
| Table | Rows | Columns | Date Range | Key Columns |
|-------|------|---------|------------|-------------|
| ... | ... | ... | ... | ... |

### Recommendation
[Can analysis proceed? With what caveats?]
- PROCEED: No blockers, warnings noted
- PROCEED WITH CAUTION: No blockers, significant warnings — note in findings
- BLOCKED: Blockers found — fix data before analyzing

Examples

Example 1: Clean dataset

### Summary
| Severity | Count | Details |
|----------|-------|---------|
| BLOCKER  | 0     | — |
| WARNING  | 1     | 8% null in `referral_source` column |
| INFO     | 2     | Weekend gaps in daily data; minor casing inconsistency in `country` |

### Recommendation
PROCEED — the null referral_source values should be noted as "unknown" in any segmentation by acquisition channel. All other columns are complete and consistent.

Example 2: Problematic dataset

### Summary
| Severity | Count | Details |
|----------|-------|---------|
| BLOCKER  | 2     | Duplicate order IDs (1,247 rows); revenue column has negative values (-$45K total) |
| WARNING  | 3     | March 2025 data missing entirely; `device_type` has 12% nulls; conversion rates >1.0 for 89 rows |
| INFO     | 1     | `country` has mixed casing ("US" vs "us") |

### BLOCKERS
1. **Duplicate order_ids**: 1,247 rows have duplicate `order_id` values. This will inflate revenue calculations. Must deduplicate before analysis — keep earliest record per order_id.
2. **Negative revenue**: 342 rows have negative `revenue` values totaling -$45K. These may be refunds. Must classify and handle separately (exclude from revenue analysis or create separate refund analysis).

### Recommendation
BLOCKED — Fix duplicate order_ids and classify negative revenue before proceeding. Estimated fix time: 15 minutes with SQL dedup + refund classification.

Anti-Patterns

  1. Never skip quality checks because "the data looks fine" — surprises hide in the tails
  2. Never treat all nulls the same — 2% nulls in a non-critical column ≠ 50% nulls in a key metric
  3. Never fix data silently — always document what you changed and why in the quality report
  4. Never analyze data with known blockers — fix blockers first, or the entire analysis is unreliable
  5. Never assume dates are clean — check for future dates, time zone issues, and format inconsistencies
  6. Never ignore outliers — investigate whether they're real (whale users) or errors (test accounts, data bugs)