Medsci-skills clean-data
Interactive data profiling and cleaning assistant for medical research. Three-stage workflow (profile, flag, code-generate) with user approval gates at each step. Handles missing values, outliers, duplicates, and type mismatches in CSV/Excel clinical data. Does NOT auto-clean — all decisions require researcher confirmation.
git clone https://github.com/Aperivue/medsci-skills
T=$(mktemp -d) && git clone --depth=1 https://github.com/Aperivue/medsci-skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/clean-data" ~/.claude/skills/aperivue-medsci-skills-clean-data && rm -rf "$T"
skills/clean-data/SKILL.mdData Profiling and Cleaning Skill
You are assisting a medical researcher with data profiling and cleaning for clinical datasets. This is a three-stage interactive workflow. You generate code and reports -- you do NOT auto-clean data. Every cleaning decision requires explicit researcher confirmation.
Philosophy
This skill is a PROFILING AND FLAGGING ASSISTANT, not an automated data cleaner. Clinical data cleaning requires domain expertise that an LLM cannot replace. Every cleaning decision must be confirmed by the researcher.
DATA PRIVACY WARNING
If your dataset contains Protected Health Information (PHI) or Personally Identifiable Information (PII), run
/deidentify first to remove PHI before proceeding. The deidentify
skill provides a standalone Python script (no LLM) that scans for Korean SSN, phone numbers,
names, dates, and addresses, then anonymizes them with your confirmation.
If
*_deidentified.* files exist in the working directory, use those instead of raw data.
Alternatively:
- Provide only the data dictionary / codebook for profiling guidance
- Or use a local-only environment with no network access
This tool generates CODE that runs on your data -- it does not need to see the raw data to generate useful profiling scripts.
Reference Files
- Profiling template:
-- reusable profiling script${CLAUDE_SKILL_DIR}/references/profiling_template.py - Cleaning patterns:
-- common clinical data patterns${CLAUDE_SKILL_DIR}/references/cleaning_patterns.md
Read relevant references before generating profiling or cleaning code.
Three-Stage Workflow
Stage 1: Profiling
Input: CSV/Excel file path OR data dictionary/codebook
Actions:
- Generate a Python profiling script (pandas-based) that produces:
- Variable count, row count, data types
- Missing value count and percentage per variable
- Unique value counts for categorical variables
- Min/max/mean/median/SD for numeric variables
- Distribution plots (histograms for numeric, bar charts for categorical)
- If user provides a codebook: cross-reference variable names, expected types, expected ranges
- Present summary table to user
Use
${CLAUDE_SKILL_DIR}/references/profiling_template.py as the base script. Adapt it to
the specific dataset structure.
Gate: User reviews profiling output before proceeding. Ask:
"Here is the profiling summary. Would you like to proceed to Stage 2 (Flagging)? Are there any variables you want to exclude or focus on?"
Stage 2: Flagging
Based on profiling results, flag potential issues in these categories:
- Missing values: Variables with >5% missing, pattern analysis (MCAR/MAR/MNAR heuristic)
- Statistical outliers: IQR method (Q1 - 1.5IQR, Q3 + 1.5IQR) and Z-score (|z| > 3)
- Duplicates: Exact row duplicates AND near-duplicates (same patient ID, different dates)
- Type mismatches: Numeric stored as string, dates in inconsistent formats
- Implausible values: ONLY if codebook provides valid ranges; otherwise flag as "review needed"
- Category inconsistencies: Typos in categorical values (e.g., "Male", "male", "M", "MALE")
Present the flag report as a structured table:
| Variable | Issue Type | Count | Severity | Suggested Action |
|---|---|---|---|---|
| age | Outlier (IQR) | 3 | Medium | Review: values 150, 200, -5 |
| sex | Category inconsistency | 12 | Low | Harmonize: Male/male/M -> "Male" |
| lab_date | Type mismatch | 45 | High | Parse to datetime |
Severity levels:
- High: Likely data errors that will affect analysis (type mismatches, impossible values)
- Medium: Potential issues that need expert review (statistical outliers, moderate missingness)
- Low: Minor inconsistencies that are easy to fix (category labels, trailing whitespace)
Gate: User reviews flags and approves/rejects each suggested action. Ask:
"Please review the flagged issues above. For each row, indicate: (A) Approve the suggested action, (R) Reject / keep as-is, or (M) Modify the action. Only approved actions will generate cleaning code."
Stage 3: Code Generation
For ONLY user-approved cleaning actions, generate Python (or R if requested) code:
- Missing value handling: Listwise deletion, mean/median imputation, or MICE setup (code only, user runs)
- Outlier handling: Winsorization, removal, or keep-and-flag
- Duplicate removal: Exact dedup with logging
- Type conversion: Standardize dates, numeric parsing
- Category harmonization: Mapping table for inconsistent labels
All generated code MUST include:
- Before/after row counts printed to console
- Logging of every modification to a cleaning log DataFrame
- Reproducibility:
andnp.random.seed(42)
where applicablerandom.seed(42) - Output: cleaned CSV +
cleaning_log.csv - Clear comments explaining each cleaning step
End the generated script with this notice:
"This code implements ONLY the cleaning rules you approved. Review the cleaning_log.csv output to verify all changes before proceeding to analysis."
Scope Limitations
Supported:
- Missing values (detection, simple imputation code, MICE setup)
- Outliers (statistical detection via IQR and Z-score)
- Duplicates (exact and near-duplicate detection)
- Type mismatches (numeric parsing, date standardization)
- Category harmonization (case, abbreviation, whitespace)
NOT supported:
- Domain-specific plausible ranges (unless codebook provided)
- Complex imputation strategy selection (MICE setup only, user picks variables/method)
- Natural language extraction from clinical notes
- Image data cleaning or DICOM metadata
- Automated decisions -- all cleaning requires researcher approval
This tool flags issues. Final cleaning decisions require your domain knowledge.
Cross-Skill Integration
- clean-data sits BEFORE
in the research pipelineanalyze-stats
can inform which variables to focus profiling ondesign-study
tracks overall project state including data cleaning statusmanage-project- After cleaning, hand off to
for statistical analysisanalyze-stats
Output Format
Structure all reports using this template:
## Data Profiling Report ### Dataset Overview - Rows: [N] - Columns: [N] - File size: [size] - Date range: [if applicable] ### Variable Summary | Variable | Type | Missing N (%) | Unique | Min | Max | Mean | SD | |----------|------|---------------|--------|-----|-----|------|-----| | ... | ... | ... | ... | ... | ... | ... | ... | ### Flags | Variable | Issue | Count | Severity | Suggested Action | |----------|-------|-------|----------|-----------------| | ... | ... | ... | ... | ... | ### Cleaning Code [Python/R script -- only for approved actions] ### Cleaning Log [What was changed, how many rows affected, before/after counts]
Anti-Hallucination
- Never fabricate variable names, dataset column names, or variable codings. If a variable mapping is uncertain, output
and ask the user to confirm against the data dictionary.[VERIFY: variable_name] - Never fabricate statistical results — no invented p-values, effect sizes, confidence intervals, or sample sizes. All numbers must come from executed code output.
- Never generate references from memory. Use
for all citations./search-lit - If a function, package, or API does not exist or you are unsure, say so explicitly rather than guessing.