# qsv data-profile

Profile a CSV/TSV/Excel file: detect format, compute statistics, and show value distributions.

To install, clone the repository:

```bash
git clone https://github.com/dathere/qsv
```

or copy just this skill:

```bash
T=$(mktemp -d) && git clone --depth=1 https://github.com/dathere/qsv "$T" && mkdir -p ~/.claude/skills && cp -r "$T/.claude/skills/skills/data-profile" ~/.claude/skills/dathere-qsv-data-profile && rm -rf "$T"
```

Skill source: `.claude/skills/skills/data-profile/SKILL.md`

# Data Profile
Profile the given tabular data file to understand its structure, types, and distributions.
Cowork note: If relative paths don't resolve, call `mcp__qsv__qsv_get_working_dir` and `mcp__qsv__qsv_set_working_dir` to sync the working directory.
## Steps

1. Index: Run `mcp__qsv__qsv_index` on the file for fast random access in subsequent steps.
2. Detect format: Run `mcp__qsv__qsv_sniff` on the file to detect delimiter, encoding, preamble, and row count estimate.
3. Count rows: Run `mcp__qsv__qsv_count` to get the exact row count.
4. Get headers: Run `mcp__qsv__qsv_headers` to list all column names and positions.
5. Compute statistics: Run `mcp__qsv__qsv_stats` with `cardinality: true` and `stats_jsonl: true` to generate full column statistics and cache them. Include `--everything` for comprehensive stats (mean, median, mode, stddev, quartiles, etc.). Basic moarstats auto-runs to enrich the cache with ~18 additional columns.
6. Advanced statistics: Run `mcp__qsv__qsv_moarstats` with `advanced: true` (omit `output_file` — it updates the stats cache in-place by default). This enriches the stats cache with:
   - Distribution shape: kurtosis, bimodality coefficient, Jarque-Bera test (normality), skewness measures (pearson_skewness)
   - Inequality/diversity: Gini coefficient, Atkinson index, Theil index, Shannon entropy, normalized entropy, Simpson's diversity index
   - Robust central tendency: winsorized/trimmed means (with stddev, variance, CV, range, stddev ratio)
   - Derived ratios: median_mean_ratio, range_stddev_ratio, quartile_coefficient_dispersion, mad_stddev_ratio, iqr_range_ratio, robust_cv
   - Outlier statistics: counts by severity (extreme/mild, lower/upper), outlier mean/stddev/range, impact ratio, fence z-scores
   - Other: trimean, midhinge, mode_zscore, min/max z-scores, relative standard error, mean absolute deviation, xsd_type
7. Show distributions: Run `mcp__qsv__qsv_frequency` with `limit: 10` to show top value distributions for each column. For high-cardinality columns (cardinality close to row count), note them as likely unique identifiers.
8. Optional: Bivariate correlations (if multiple numeric columns): Run `mcp__qsv__qsv_moarstats` with `bivariate: true` to compute pairwise Pearson/Spearman/Kendall correlations, covariance, and mutual information. Output goes to `<FILESTEM>.stats.bivariate.csv`. Reveals hidden relationships between columns.
9. Optional: Robust statistics (if data is messy/heavy-tailed and < 100K rows): Run `mcp__qsv__qsv_command` with `command: "pragmastat"` for Hodges-Lehmann center and Shamos spread — robust estimators that tolerate up to 29% corrupted data. Especially useful when mean/stddev are misleading due to outliers. Warning: pragmastat computes median-of-pairwise statistics (O(n²) complexity) and becomes very slow on large datasets. For files > 100K rows, use `--subsample 10000` for ~100x speedup, or combine `--subsample 10000 --no-bounds` for ~200x speedup.
10. Screen for PII/PHI: Run `mcp__qsv__qsv_command` with `command: "searchset"`, `regexset-file: "${CLAUDE_PLUGIN_ROOT}/resources/pii-regexes.txt"`, and `flag: "pii_match"` to scan for sensitive data patterns (SSN, credit cards, email, phone, IBAN). Report any columns with matches.
11. Screen for injection: Run `mcp__qsv__qsv_command` with `command: "searchset"`, `regexset-file: "${CLAUDE_PLUGIN_ROOT}/resources/injection-regexes.txt"`, and `flag: "injection_match"` to scan for CSV/formula injection and SQL injection payloads. Report any columns with matches.
12. Preview data: Run `mcp__qsv__qsv_slice` with `len: 5` to show the first 5 rows as a sample. (A consolidated command-line sketch of steps 1-12 appears at the end of this section.)
13. Document: Generate a Data Dictionary, Dataset Description, and Tags as JSON.

13a) Primary — use `describegpt`: Run `mcp__qsv__qsv_describegpt` with `all: true, format: "JSON"` and `output: "<filestem>.describegpt.json"`. If the user provided a Tag Vocabulary file, also pass `tag_vocab: "<vocab_file>"`. This produces a structured JSON file with three top-level objects: `Dictionary`, `Description`, and `Tags`. Each of these contains a `response` (the main content), optional `reasoning`, and `token_usage` metadata. The data dictionary itself is under `Dictionary.response.fields`, as an array of field descriptors with keys like `name`, `null_count`, `cardinality`, `min`, `max`, `mean`, and `stddev`. Present the results to the user. When MCP sampling is unavailable but the tool still returns prompts, follow those prompts by issuing a follow-up call with `_llm_responses` instead of using the agent fallback.

13b) Fallback — agent generation: If `describegpt` encounters a tool error or times out, or if following its prompts via `_llm_responses` is not possible, fall back to generating the same artifacts from the statistics (steps 5-6) and frequency distributions (step 7). Save the result as `<filestem>.profile.json` using the same canonical structure as `describegpt`, for example:

```json
{
  "Dictionary": {
    "response": {
      "fields": [
        {
          "name": "column_name",
          "type": "Integer",
          "label": "Column Name",
          "description": "1-5 sentence description informed by type, stats, and frequency distribution",
          "null_count": 0,
          "cardinality": 100,
          "min": "0",
          "max": "999",
          "mean": "450.5",
          "stddev": "120.3"
        }
      ],
      "enum_threshold": 20,
      "num_examples": 5,
      "truncate_str": 80,
      "attribution": "agent_fallback"
    },
    "reasoning": "",
    "token_usage": { "prompt": 0, "completion": 0, "total": 0, "elapsed": 0 }
  },
  "Description": {
    "response": "3-10 sentences describing the dataset: what it represents, scope, key characteristics, quality issues, and potential use cases.",
    "reasoning": "",
    "token_usage": { "prompt": 0, "completion": 0, "total": 0, "elapsed": 0 }
  },
  "Tags": {
    "response": ["tag1", "tag2", "tag3"],
    "reasoning": "",
    "token_usage": { "prompt": 0, "completion": 0, "total": 0, "elapsed": 0 }
  }
}
```

For the fallback dictionary entries (under `Dictionary.response.fields`):
- `label`: Human-readable version of the field name (e.g., `customer_id` → `Customer ID`)
- `description`: 1-5 sentence description informed by type, statistics, and frequency distribution
- Include key stats fields (`null_count`, `cardinality`, `min`, `max`, `mean`, `sortiness`, `stddev`, `variance`, `cv`, `sparsity`) where applicable
For the fallback tags (under `Tags.response`): Infer 5-15 semantic tags based on column names, data types, value distributions, and domain characteristics. If a controlled Tag Vocabulary is provided, constrain choices to that vocabulary only.
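For reference, the core profiling steps map to plain `qsv` CLI invocations. A minimal sketch, assuming a local `qsv` binary and a hypothetical `data.csv` (the MCP tools above wrap the same commands):

```bash
qsv index data.csv                  # 1. index for fast random access
qsv sniff data.csv                  # 2. delimiter, encoding, preamble
qsv count data.csv                  # 3. exact row count (fast with the index)
qsv headers data.csv                # 4. column names and positions
qsv stats --everything data.csv     # 5. full column statistics
qsv frequency --limit 10 data.csv   # 7. top-10 values per column
qsv slice --len 5 data.csv          # 12. preview the first 5 rows
```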
## Quality Dimensions

When profiling, assess these quality dimensions:

### 1. Completeness

| Check | Command | What to Look For |
|---|---|---|
| Null counts | `stats` | `nullcount` column > 0 |
| Empty strings | `frequency` | Empty string in top values |
| Sparsity | `stats` | `sparsity` field (ratio of nulls) |

Red flag: Sparsity > 0.5 means more than half the values are null.
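As a quick command-line completeness pass, here is a sketch assuming a hypothetical `data.csv` and a qsv build whose `stats` output includes the `nullcount` and `sparsity` columns:

```bash
# Project only the completeness-related columns from the stats output
qsv stats --everything data.csv | qsv select field,nullcount,sparsity
```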
### 2. Uniqueness

| Check | Command | What to Look For |
|---|---|---|
| Duplicate rows | `dedup` with `--dupes-output` | Non-empty dupes file |
| Cardinality | `stats` | `cardinality` vs row count |
| Unique ratio | `stats` | If cardinality = row count, column is unique |

Red flag: Key columns (ID, email) with cardinality < row count.
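A sketch of the uniqueness checks for a hypothetical `data.csv`; comparing row counts before and after `dedup` is a simple way to detect duplicates without a separate dupes file:

```bash
qsv count data.csv                                               # total rows
qsv stats --cardinality data.csv | qsv select field,cardinality  # per-column cardinality
qsv dedup data.csv | qsv count                                   # fewer rows here means duplicates exist
```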
### 3. Validity

| Check | Command | What to Look For |
|---|---|---|
| Schema validation | `validate` | Validation error count |
| Data types | `stats` | `type` column (String, Integer, Float, Date, etc.) |
| Format patterns | `search` | Rows not matching expected regex |
| Value ranges | `stats` | `min`, `max` outside expected range |

Red flag: Type column shows "String" for what should be numeric data.
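A sketch of schema validation, assuming a hypothetical `data.csv` and a JSON Schema file named `schema.json`:

```bash
# Validate every record against the schema; failures are reported per row
qsv validate data.csv schema.json
```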
### 4. Consistency

| Check | Command | What to Look For |
|---|---|---|
| Date formats | `stats` with `--infer-dates` | Mixed date types in same column |
| Case consistency | `frequency` | "NYC" vs "nyc" vs "Nyc" as separate values |
| Encoding | `sniff` | Non-UTF-8 encoding detected |
| Delimiters | `sniff` | Unexpected delimiter or quoting |
| Row lengths | `fixlengths` | Pads short rows to match the longest row; compare counts before/after to detect ragged rows |

Red flag: Frequency shows the same value in different cases/formats.
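Two quick consistency probes, sketched for a hypothetical `data.csv` with an illustrative `city` column:

```bash
# Case variants ("NYC" vs "nyc" vs "Nyc") appear as separate frequency entries
qsv frequency --select city --limit 10 data.csv

# Pad ragged rows to a uniform length for side-by-side comparison with the original
qsv fixlengths data.csv > data.fixed.csv
```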
### 5. Accuracy

| Check | Command | What to Look For |
|---|---|---|
| Statistical outliers | `stats` | `mean`, `stddev`: values > 3 stddev from mean |
| Outlier counts | `moarstats --advanced` | Mild/extreme outlier counts > 5% of rows |
| Distribution shape | `moarstats --advanced` | kurtosis > 3 (heavy tails), bimodality coefficient >= 0.555 (bimodal) |
| Inequality | `moarstats --advanced` | Gini coefficient near 1 (extreme concentration) |
| Value distributions | `frequency` | Unexpected dominant values |
| Range checks | `stats` | `min`/`max` outside plausible range |
| Cross-field checks | `sqlp` | SQL WHERE clauses for business rules |

Red flag: Latitude > 90 or < -90, negative ages, future birth dates, kurtosis > 10 (extreme outliers).
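Cross-field business rules can be expressed directly in SQL via `sqlp`, which exposes the first input file as the table `_t_1`. A sketch assuming a hypothetical `latitude` column in `data.csv`:

```bash
# Count rows that violate a plausibility rule
qsv sqlp data.csv "SELECT COUNT(*) AS bad_rows FROM _t_1 WHERE latitude > 90 OR latitude < -90"
```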
### 6. Column Name Quality

| Check | Command | What to Look For |
|---|---|---|
| Unsafe names | `safenames --verify` | Spaces, special chars, reserved words |
| Duplicate headers | `headers` | Same name appearing twice |
| Naming consistency | `headers` | Mixed conventions (camelCase vs snake_case) |

Red flag: Column names with spaces or special characters break downstream tools and SQL queries.
### 7. Conformity

| Check | Command | What to Look For |
|---|---|---|
| Standard codes | `searchset` with domain regex file | Values not matching ISO country, state, zip patterns |
| Format adherence | `search` with expected pattern | Phone numbers, emails, URLs not matching standard format |
| Controlled vocabularies | `frequency` | Unexpected values outside known valid set |

Red flag: A "country" column with free-text entries instead of ISO 3166 codes, or a "state" column mixing abbreviations and full names.
### 8. Referential Integrity

| Check | Command | What to Look For |
|---|---|---|
| Orphaned foreign keys | `joinp --left-anti` | Rows in child file with no match in parent |
| Missing references | `joinp --left-anti` (reversed) | Parent records with no children (if expected) |
| Key overlap | `sqlp` | Cross-file key comparison via SQL |

Red flag: An orders file referencing customer IDs that don't exist in the customers file. Only applicable when profiling related files together.
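A sketch of the orphaned-key check with `joinp`, assuming hypothetical `orders.csv` (child) and `customers.csv` (parent) files keyed on `customer_id`:

```bash
# Emit order rows whose customer_id has no match in customers.csv
qsv joinp --left-anti customer_id orders.csv customer_id customers.csv
```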
### 9. PII/PHI Screening

Question: Does the data contain personally identifiable or protected health information?

Use `searchset` with a regex file to scan all columns for sensitive patterns:

```
qsv_command command: "searchset", input_file: "<file>", args: ["--flag", "pii_match", "${CLAUDE_PLUGIN_ROOT}/resources/pii-regexes.txt"]
```

The bundled `${CLAUDE_PLUGIN_ROOT}/resources/pii-regexes.txt` detects:
| Pattern | Example |
|---|---|
| SSN | `123-45-6789` |
| Mastercard | `5555555555554444` |
| Visa | `4111111111111111` |
| American Express | `378282246310005` |
| IBAN | `GB82WEST12345698765432` |
| Email | `user@example.com` |
| US Phone | `(555) 123-4567` |
For PHI screening, use the bundled `${CLAUDE_PLUGIN_ROOT}/resources/phi-regexes.txt`:

```
qsv_command command: "searchset", input_file: "<file>", args: ["--flag", "phi_match", "${CLAUDE_PLUGIN_ROOT}/resources/phi-regexes.txt"]
```

The bundled `${CLAUDE_PLUGIN_ROOT}/resources/phi-regexes.txt` detects:
| Pattern | Example |
|---|---|
| MRN (Medical Record Number) | `MRN: 12345678` |
| DEA Number | `AB1234563` |
| NPI (National Provider Identifier) | `1234567893` (broad — verify with Luhn check) |
| ICD-10-CM Diagnosis Code | `E11.9`, `J45.909` |
| NDC (National Drug Code) | `12345-678-90` |
For additional PHI patterns (e.g., MBI, state license numbers), create a custom regex file and pass it to `searchset` the same way.
Red flag: Any matches indicate PII/PHI exposure — flag columns for masking or removal before sharing.
### 10. Injection Safety

Question: Does the data contain CSV/formula injection or SQL injection payloads?

Malicious cell values can execute code when opened in spreadsheet applications (Excel, Google Sheets) or cause damage when loaded into databases without parameterized queries.

Use `searchset` with the bundled injection regex file to scan all columns:

```
qsv_command command: "searchset", input_file: "<file>", args: ["--flag", "injection_match", "${CLAUDE_PLUGIN_ROOT}/resources/injection-regexes.txt"]
```

The bundled `${CLAUDE_PLUGIN_ROOT}/resources/injection-regexes.txt` detects:
CSV/Formula Injection:

| Pattern | Example | Risk |
|---|---|---|
| Starts with `=` | `=cmd\|'/C calc'!A0` | Arbitrary command execution in Excel |
| Starts with `+` + function | `+HYPERLINK("http://attacker.example")` | Same as `=` in many spreadsheet apps (positive numbers/phone numbers excluded) |
| Starts with `-` + function | `-HYPERLINK("http://attacker.example")` | Formula execution (negative numbers excluded) |
| Starts with `@` | `@SUM(1+9)` | Excel function prefix |
| Starts with tab/CR | `<TAB>=1+1` | Bypasses naive prefix checks |
SQL Injection:

| Pattern | Example | Risk |
|---|---|---|
| SELECT...FROM | `SELECT * FROM users` | Data exfiltration |
| UNION SELECT | `' UNION SELECT user, pass FROM users--` | Query hijacking |
| DROP TABLE/DATABASE | `'; DROP TABLE users;--` | Data destruction |
| INSERT INTO | `'; INSERT INTO users VALUES ('x','y');--` | Data tampering |
| DELETE FROM | `'; DELETE FROM orders;--` | Data deletion |
| UPDATE SET | `'; UPDATE users SET role='admin';--` | Data modification |
| Tautology | `' OR 1=1--` | Authentication bypass |
| Stacked queries | `'; EXEC xp_cmdshell('dir');--` | Arbitrary SQL execution |
Red flag: Any matches indicate potential injection payloads — sanitize cells before sharing the file or loading into a database. For formula injection, prefix dangerous cells with a single quote (`'`) or strip leading `=+-@` characters.
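One way to strip leading formula-trigger characters is `replace`, sketched here for a hypothetical `notes` column (work on a copy first; legitimate leading `+`/`-` values such as signed numbers need a more careful pattern):

```bash
# Remove leading =, +, -, @ from the selected column
qsv replace --select notes '^[=+@-]+' '' data.csv > data.sanitized.csv
```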
## Report Format

Present a summary with:

- File info: format, delimiter, encoding, row count, column count
- Column overview: table with name, type, nulls, cardinality, min, max, mean (where applicable)
- Key observations: unique identifiers, high-null columns, type mismatches, notable distributions
- Data quality flags: any issues found (high sparsity, mixed types, ragged rows)
- Data Dictionary, Description & Tags: JSON documentation generated via `describegpt` (step 13a), or manually from the stats cache and frequency distributions as fallback (step 13b)
## Quality Report Checklist

- Row count and column count
- Null/empty counts per column (completeness)
- Cardinality per column (uniqueness assessment)
- Data types inferred per column (validity)
- Min/max/mean for numeric columns (range plausibility)
- Outlier counts and distribution shape (kurtosis, bimodality) from `moarstats --advanced`
- Top frequency values for categorical columns (distribution)
- Duplicate rows detected (uniqueness)
- Schema violations if schema provided (validity)
- Encoding and delimiter detected (consistency)
- Column names safe and consistent (`safenames --verify`)
- Conformity to domain standards checked where applicable (`searchset`)
- Referential integrity verified across related files if provided (`joinp --left-anti`)
- PII/PHI patterns detected via `searchset` (privacy)
- Injection payloads scanned for CSV/formula and SQL injection patterns (`searchset`)
- Data Dictionary with Label and Description per column, dataset Description, and Tags — via `describegpt --format JSON` (step 13a) or agent fallback (step 13b)
## Common Data Quality Fixes

| Problem | Fix Command |
|---|---|
| Inconsistent case | `apply operations` with `lower` or `upper` |
| Leading/trailing whitespace | `apply operations` with `trim` |
| Duplicate rows | `dedup` |
| Ragged rows | `fixlengths` |
| Unsafe column names | `safenames` |
| Non-conforming values | `searchset` + `--flag` to identify, `replace` to fix |
| Orphaned foreign keys | `joinp --left-anti` to find, then remove or fix references |
| Injection payloads | `searchset` to detect + `replace` to sanitize (prefix with `'` or strip leading `=+-@`) |
| Wrong encoding | `input` (normalizes to UTF-8) |
| Empty values | `fill` with `--default` |
| Invalid rows | `validate` + filter using its valid/invalid outputs |
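Several of these fixes chain naturally through pipes. A sketch assuming a hypothetical `data.csv` with a `state` column that needs trimming and uppercasing before deduplication:

```bash
# Normalize the state column, then drop duplicate rows
qsv apply operations trim,upper state data.csv | qsv dedup > data.clean.csv
```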
## Notes
- For Excel/JSONL files, the MCP server auto-converts to CSV first
- The stats cache created in step 5 accelerates subsequent commands (frequency, schema, sqlp, joinp)
- If the file has no headers, mention this and use column indices